Assignment 8: Linear Regression
Due: 2020-11-03
Part 1: Linear Regression
Find a dataset suitable for regression. We recommend a dataset from the UCI repository (see below).
Fit a linear regression model, measure the fit with two metrics, and make a plot that helps visualize the result.
Examine the coefficients or the residuals to try to interpret the results and explain what this regression model is doing.
Try fitting the model only on one feature. Justify your choice of feature based on the results above. Plot this result.
Accept the assignment to create your submission repository
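The regression steps above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: it assumes scikit-learn, uses synthetic data in place of a real UCI dataset, and picks R² and mean squared error as the two metrics. The variable names and the rule of choosing the feature with the largest absolute coefficient are assumptions made for the example, and that rule is only reasonable when the features are on comparable scales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption): three features on the
# same scale, where only the first two actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on all features and measure the fit with two metrics.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Residuals and coefficients are the raw material for interpretation.
residuals = y_test - y_pred
print("coefficients:", model.coef_)
print("R^2:", r2, "MSE:", mse)

# Refit on the single feature with the largest absolute coefficient.
best = int(np.argmax(np.abs(model.coef_)))
single = LinearRegression().fit(X_train[:, [best]], y_train)
r2_single = single.score(X_test[:, [best]], y_test)
print("best feature index:", best, "single-feature R^2:", r2_single)
```

Plotting y_pred against y_test, or residuals against y_pred, is one way to visualize the fit. With a single feature you can also plot the fitted line over a scatter of that feature against the target.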
Part 2: Test Train Splits
If you successfully completed this experiment in Assignment 7, you don’t need to repeat it, but it might be interesting to try.
Do an experiment to compare test set size vs performance:
Train the linear regression model on 20%, 30%, …, 80% of the data.
Save the results of both test and train accuracy for each training set size in a DataFrame with columns ['train_pct', 'n_train_samples', 'n_test_samples', 'train_acc', 'test_acc'].
Plot the accuracies vs training percentage.
Explain these results. What is the best test/train split? Why?
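One possible shape for this experiment is sketched below. It assumes scikit-learn and pandas and uses synthetic data. It also assumes that the "accuracy" columns hold the value returned by LinearRegression.score, which for regression is R² rather than a classification accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=500)

rows = []
for pct in range(20, 90, 10):  # 20, 30, ..., 80 percent used for training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=pct / 100, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    rows.append(
        {
            "train_pct": pct,
            "n_train_samples": len(X_train),
            "n_test_samples": len(X_test),
            # score() returns R^2 for regressors (used here as "accuracy")
            "train_acc": model.score(X_train, y_train),
            "test_acc": model.score(X_test, y_test),
        }
    )

results = pd.DataFrame(rows)
print(results)
```

If matplotlib is installed, results.plot(x="train_pct", y=["train_acc", "test_acc"]) draws both curves on one set of axes. Repeating the loop with several random_state values would show how much of any trend is noise from a particular split.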
Part 3: Other models
Try fitting LASSO and explain what it does. Does LASSO make better predictions on your data?
Do you think a model more complex than linear would be better? How could you tell?
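As a rough illustration of what LASSO does differently from ordinary least squares, the sketch below fits both on synthetic data where only two of ten features matter. It assumes scikit-learn. The value alpha=0.5 is an arbitrary choice for the example, not a recommended setting, and on real data you would normally tune it (for instance with LassoCV).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 10 features, only the first two are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_coef = np.array([5.0, -3.0] + [0.0] * 8)
y = X @ true_coef + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.5).fit(X_train, y_train)  # alpha chosen arbitrarily

# The L1 penalty pushes small coefficients to exactly zero.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("zero coefficients, OLS:  ", n_zero_ols)
print("zero coefficients, LASSO:", n_zero_lasso)

# Compare held-out R^2 for both models.
r2_ols = ols.score(X_test, y_test)
r2_lasso = lasso.score(X_test, y_test)
print("test R^2, OLS:  ", r2_ols)
print("test R^2, LASSO:", r2_lasso)
```

Whether LASSO predicts better depends on the data and on alpha. Its main visible effect is sparsity, and the penalty also shrinks the surviving coefficients toward zero, which can cost some accuracy when alpha is too large.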
Grading
Include a description of what you’re doing, why you’re doing it, and what the results mean at each step of your analysis so that we can tell that you understand.
For regression level 2, complete part 1.
For evaluate level 2, complete part 2.
If you’re curious, try a more complex regression model as in part 3. This will get you an early start on level 3 for regression.
If you don’t successfully complete the whole assignment, pseudocode or partial answers to the questions could earn you level 1 for either skill.
FAQ
How do I find a good dataset?
Look for a dataset with numerical features and a continuous (numerical) target variable.
If you look at the UCI website, you can filter datasets by task (Regression) and attribute type (numerical), then look through the options those filters give you. You might want to choose a dataset with fewer than about 20 attributes. To make the training fast, try to find a dataset with 1000 samples or fewer, or use read functions to load only a small chunk of the data. Remember to read about the dataset and note what you’re predicting.
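Loading only a chunk of a large file can be done with the nrows parameter of pandas.read_csv, as in the small sketch below. The CSV here is generated in memory purely for illustration. With a real dataset you would pass a file path or URL instead, and the column names x1, x2, and y are hypothetical.

```python
import io

import pandas as pd

# Hypothetical 5000-row CSV built in memory; replace with your file path or URL.
csv_text = "x1,x2,y\n" + "\n".join(f"{i},{2 * i},{3 * i}" for i in range(5000))

# nrows limits how many data rows are parsed, so only the first 1000 are loaded.
df_small = pd.read_csv(io.StringIO(csv_text), nrows=1000)
print(df_small.shape)
```

Taking the first rows is only representative if the file is not sorted in some meaningful way. If it is, loading the full file and calling df.sample(n=1000, random_state=0) may be a safer alternative.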