Assignment 8: Linear Regression

Due: 2020-11-03

Part 1: Linear Regression

Find a dataset suitable for regression. We recommend a dataset from the UCI repository (see the FAQ below).

Fit a linear regression model, measure the fit with two metrics, and make a plot that helps visualize the result.
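As a rough illustration of this step, the sketch below fits a model and computes two metrics (R² and mean squared error). It assumes scikit-learn and uses synthetic data from `make_regression` as a stand-in for a real dataset; the sample counts, noise level, and split size are arbitrary choices, not requirements of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration only)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Hold out 25% of the samples so the metrics describe unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Two common regression metrics: R^2 (unitless) and MSE (squared target units)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R2: {r2:.3f}  MSE: {mse:.1f}")
```

One possible plot for this result is a scatter of `y_test` against `y_pred`; points close to the diagonal indicate good predictions.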

Examine the coefficients or the residuals to try to interpret the results and explain what this regression model is doing.
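One way to look at coefficients and residuals is sketched below, again on synthetic data. The feature names `f0`, `f1`, `f2` are hypothetical placeholders. Note that coefficient magnitudes are only directly comparable when the features are on similar scales.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the feature names are hypothetical
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=1)
feature_names = ["f0", "f1", "f2"]

regr = LinearRegression().fit(X, y)

# Label each coefficient with its feature and order by absolute size
coefs = pd.Series(regr.coef_, index=feature_names)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)

# Residuals: what the model fails to explain for each sample
residuals = y - regr.predict(X)
print("residual mean:", residuals.mean(), "residual std:", residuals.std())
```

A residual plot (residuals against predictions) with no visible pattern suggests the linear form is reasonable; a curve or funnel shape suggests it is missing something.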

Try fitting the model only on one feature. Justify your choice of feature based on the results above. Plot this result.
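A minimal sketch of a single-feature fit follows. It picks the feature with the largest absolute coefficient, which is only one possible justification and assumes the features are on comparable scales; the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption for illustration only)
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=1)

# Fit on all features first, then pick the one with the largest |coefficient|
full = LinearRegression().fit(X, y)
best = int(np.argmax(np.abs(full.coef_)))

# Double brackets keep X_one two-dimensional, as scikit-learn expects
X_one = X[:, [best]]
single = LinearRegression().fit(X_one, y)

r2_full = full.score(X, y)
r2_single = single.score(X_one, y)
print(f"feature {best}: R2 single = {r2_single:.3f}, R2 full = {r2_full:.3f}")
```

With a single feature, the result can be plotted directly: a scatter of `X_one` against `y` with the fitted line drawn from `single.predict(X_one)`.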

Accept the assignment to create your submission repository.

Part 2: Test Train Splits

If you successfully completed this experiment in assignment 7, you don’t need to repeat it, but it might be interesting to try.

Do an experiment to compare test set size vs performance:

  1. Train the linear regression model on 20%, 30%, … , 80% of the data.

  2. Save the train and test accuracy for each training set size in a DataFrame with columns ['train_pct', 'n_train_samples', 'n_test_samples', 'train_acc', 'test_acc']. For a regression model, a score such as R² plays the role of accuracy.

  3. Plot the accuracies vs training percentage.

  4. Explain these results. What is the best test/train split? Why?
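The steps above could be sketched roughly as follows. This assumes scikit-learn and pandas, uses synthetic data, and stores the model's `score` (R² for `LinearRegression`) in the `train_acc` and `test_acc` columns.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for illustration only)
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

rows = []
for train_pct in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    # train_size sets the training fraction; the remainder becomes the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_pct, random_state=0
    )
    regr = LinearRegression().fit(X_train, y_train)
    rows.append(
        {
            "train_pct": train_pct,
            "n_train_samples": len(X_train),
            "n_test_samples": len(X_test),
            "train_acc": regr.score(X_train, y_train),  # R^2 on training data
            "test_acc": regr.score(X_test, y_test),  # R^2 on held-out data
        }
    )

results = pd.DataFrame(rows)
print(results)
```

If matplotlib is installed, `results.plot(x="train_pct", y=["train_acc", "test_acc"])` draws both curves on one set of axes.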

Part 3: Other models

Try fitting LASSO and explaining what it does. Does LASSO make better predictions on your data?
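As a starting point, the sketch below compares LASSO with ordinary linear regression on synthetic data where only 3 of 10 features are informative. The `alpha=1.0` value is an arbitrary assumption; in practice it would be tuned. LASSO adds an L1 penalty on the coefficients, which can shrink some of them to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data where only 3 of the 10 features carry signal (assumption)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)  # alpha sets the penalty strength

# Compare held-out R^2 and count coefficients LASSO set exactly to zero
ols_r2 = ols.score(X_test, y_test)
lasso_r2 = lasso.score(X_test, y_test)
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"OLS R2: {ols_r2:.3f}  LASSO R2: {lasso_r2:.3f}  zeroed coefs: {n_zero}")
```

Whether LASSO predicts better depends on the data, so compare the two scores on your own test set rather than relying on this synthetic case.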

Do you think a model more complex than linear would be better? How could you tell?
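One way to check, sketched here under the assumption that a polynomial model is the "more complex" candidate, is to compare held-out scores for a linear and a degree-2 fit. The data are generated with a deliberately quadratic relationship, so this only illustrates the comparison, not what your dataset will show.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.5, size=200)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain linear model vs. linear regression on degree-2 polynomial features
linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
poly_r2 = poly.score(X_test, y_test)
print(f"linear R2: {linear_r2:.3f}  degree-2 R2: {poly_r2:.3f}")
```

A clear pattern in the linear model's residuals, or a noticeably higher test score for the more flexible model, are both hints that a linear fit is too simple.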

Grading

Include description of what you’re doing, why you’re doing it, and what the results mean at each step of your analysis so that we can tell that you understand.

For regression level 2, complete part 1.

For evaluate level 2, complete part 2.

If you’re curious, try a more complex regression model as in part 3. This will get you an early start on level 3 for regression.

If you don’t successfully complete the whole assignment, pseudocode or partial answers to the questions could earn you level 1 for either skill.

FAQ

How do I find a good dataset?

Look for a dataset with numerical features and a continuous (numerical) target variable.

If you look at the UCI website, you can search for datasets for Regression with numerical attributes and look through the options those search filters give you. You might want to choose a dataset with fewer than about 20 attributes. To make the training fast, try to find a dataset with 1000 samples or fewer, or use read functions to load only a small chunk of the data. Remember to read about the dataset and note what you’re predicting.
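For loading only part of a large file, one option is the `nrows` parameter of `pandas.read_csv`. The sketch below builds a small in-memory CSV as a stand-in; with a real dataset you would pass a file path or URL instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# In-memory CSV with 5000 rows, standing in for a large file (assumption)
csv_text = "x1,x2,y\n" + "\n".join(f"{i},{2 * i},{3 * i}" for i in range(5000))

# nrows limits how many data rows are read from the start of the file
df = pd.read_csv(io.StringIO(csv_text), nrows=1000)
print(df.shape)
```

Reading the first rows is only representative if the file is not sorted by the target or by some other feature; `df.sample` on a fully loaded DataFrame is an alternative when memory allows.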