8. Assignment 8: Regression

8.1. Quick Facts

task	skill
fit a linear regression model	regression (2)
evaluate fit of linear regression	evaluate (2)
use multiple metrics to evaluate performance	evaluate (2)
interpret how decisions (test/train size, model parameters) impact model performance	evaluate (2)
interpret the model performance in the context of the dataset	process (2)
analyze the impact of model parameters on model performance	process (2)
use loops and lists effectively	python (2)
use EDA techniques to examine the experimental results	summarize (2), visualize (2)
create a dataset by combining data from multiple sources	construct (2)

Find a dataset suitable for regression. We recommend a dataset from the UCI repository. Complete the following in a single notebook.

Fit a linear regression model, measure the fit with two metrics, and make a plot that helps visualize the result.
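As a rough illustration of that workflow, here is a minimal sketch using scikit-learn. The data is a synthetic stand-in generated with `make_regression` (an assumption for illustration only; your notebook should load your own dataset), and the residual plot is just one of several plots that could help visualize the fit.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data; replace X and y with your own
# feature matrix and target.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Hold out 25% of the rows for testing, so 75% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Two metrics computed on the held-out test data.
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print(f"test R^2: {test_r2:.3f}, test MSE: {test_mse:.1f}")

# Inspect the fitted model: one coefficient per feature, plus the intercept.
print("coefficients:", regr.coef_)
print("intercept:", regr.intercept_)

# Residual plot: points scattered evenly around zero suggest that a
# linear model is a reasonable description of the data.
residuals = y_test - y_pred
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, color="gray", linestyle="--")
ax.set_xlabel("predicted value")
ax.set_ylabel("residual (actual - predicted)")
```

In a notebook the figure is displayed automatically at the end of the cell; the `random_state` values are arbitrary and only make the sketch reproducible.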

Include a basic description of the data (what the features are).
Write your own description of what the regression task is and why a linear model is a reasonable model to try for this data.
Fit a linear model with 75% training data
Test it on 25% held out test data and measure the fit with two metrics and one plot
Inspect the model to answer:
- Does this model make sense?
- What do the coefficients tell you?
- What do the residuals tell you?
Repeat the split, train, and test steps 5 times.
- Is the performance consistent enough you trust it?
Interpret the model and its performance in terms of the application. Some questions you might want to answer in order to do this include:

Try fitting the model only on one feature. Justify your choice of feature based on the results above. Plot this result.
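The repeated-split and single-feature steps can be sketched as follows. This again assumes synthetic stand-in data from `make_regression`, and it picks the feature with the largest average absolute coefficient, which is only one possible justification and is meaningful only when the features are on comparable scales.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data; replace with your own X and y.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Repeat the split, train, and test steps 5 times with different splits.
r2_scores, mse_scores, coefs = [], [], []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    regr = LinearRegression().fit(X_train, y_train)
    y_pred = regr.predict(X_test)
    r2_scores.append(r2_score(y_test, y_pred))
    mse_scores.append(mean_squared_error(y_test, y_pred))
    coefs.append(regr.coef_)

# A small spread across splits suggests the performance is consistent.
print(f"R^2: mean {np.mean(r2_scores):.3f}, std {np.std(r2_scores):.3f}")
print(f"MSE: mean {np.mean(mse_scores):.1f}, std {np.std(mse_scores):.1f}")

# Illustrative choice: the feature with the largest mean absolute coefficient.
mean_coef = np.mean(coefs, axis=0)
best_feature = int(np.argmax(np.abs(mean_coef)))
X_one = X[:, [best_feature]]  # keep 2D shape (n_samples, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X_one, y, test_size=0.25, random_state=0
)
regr_one = LinearRegression().fit(X_train, y_train)
y_pred_one = regr_one.predict(X_test)
one_r2 = r2_score(y_test, y_pred_one)
print(f"single-feature test R^2: {one_r2:.3f}")

# Plot the held-out data and the fitted line for the single feature.
order = np.argsort(X_test[:, 0])
fig, ax = plt.subplots()
ax.scatter(X_test[:, 0], y_test, label="held-out data")
ax.plot(X_test[order, 0], y_pred_one[order], color="black", label="fitted line")
ax.set_xlabel(f"feature {best_feature}")
ax.set_ylabel("target")
ax.legend()
```

Collecting the scores in lists makes it easy to summarize them afterwards; with a real dataset you would refer to features by column name rather than by index.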

Note

If you have the relevant level 2 achievements (evaluation, summarize, visualize) you can skip this part, but it might still be interesting.

Do an experiment to compare test set size vs performance:

Re-fit your regression model using 10%, 30%, … , 90% of the data for training. For each training-set size, save the train and test R2 and MSE in a DataFrame with columns ['train_pct', 'n_train_samples', 'n_test_samples', 'train_r2', 'test_r2', 'train_mse', 'test_mse'].
Plot the metrics vs training percentage in a line graph.
Interpret these results. How does training vs test size impact the model?
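One way this experiment could be organized is sketched below. It assumes the same synthetic stand-in data as a placeholder for your dataset, and it assumes the elided training percentages are 50% and 70%; adjust the list if you intend different steps.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data; replace with your own X and y.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Assumption: the steps between 30% and 90% are 50% and 70%.
train_pcts = [0.1, 0.3, 0.5, 0.7, 0.9]

rows = []
for train_pct in train_pcts:
    # The remaining rows (1 - train_pct) form the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_pct, random_state=0
    )
    regr = LinearRegression().fit(X_train, y_train)
    rows.append(
        {
            "train_pct": train_pct,
            "n_train_samples": len(y_train),
            "n_test_samples": len(y_test),
            # LinearRegression.score returns R^2.
            "train_r2": regr.score(X_train, y_train),
            "test_r2": regr.score(X_test, y_test),
            "train_mse": mean_squared_error(y_train, regr.predict(X_train)),
            "test_mse": mean_squared_error(y_test, regr.predict(X_test)),
        }
    )

results = pd.DataFrame(rows)
print(results)

# R^2 and MSE are on different scales, so each gets its own panel.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
results.plot(x="train_pct", y=["train_r2", "test_r2"], marker="o", ax=axes[0])
results.plot(x="train_pct", y=["train_mse", "test_mse"], marker="o", ax=axes[1])
axes[0].set_ylabel("R^2")
axes[1].set_ylabel("MSE")
```

Building a list of dictionaries and converting it to a DataFrame once at the end keeps the loop simple; a single split per size is noisy, so repeating each size with several random seeds may give a clearer picture.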

Thinking Ahead

Try these experiments with a different type of regression.
How do your evaluation experiment results compare in regression vs classification?