Setting up for regression¶
We’re going to predict tip from total bill. This is a regression problem because the target, tip, is:
available in the data (makes it supervised) and
a continuous value.
The problems we’ve seen so far were all classification: the species of iris and the character in the corners data were both categorical.
Using linear regression is also a good choice because it makes sense that the tip would be approximately linearly related to the total bill: most people pick some percentage of the total bill. If our prior knowledge was that people typically tipped with some more complicated function, this would not be a good model.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import pandas as pd
sns.set_theme(palette='colorblind')
def pretty_type(object):
    '''
    return minimal types for complex object
    '''
    full_type = str(type(object)).replace('class','').strip("'<>' ")
    full_type_list = full_type.split('.')
    return ' '.join([full_type_list[0], full_type_list[-1]])
We will load this data from seaborn.
tips_df = sns.load_dataset('tips')
tips_df.head()
We will, for now, use only the total_bill to predict the tip.
tips_X = tips_df[['total_bill']]
tips_y = tips_df['tip']
pretty_type(tips_df[['total_bill']])
'pandas DataFrame'
Next, we split the data.
tips_X_train,tips_X_test, tips_y_train, tips_y_test = train_test_split(tips_X, tips_y,
                                                    train_size=.8)
Fitting a regression model¶
We instantiate the object as we do with all other sklearn estimator objects.
regr = linear_model.LinearRegression()
we can inspect it to get a "before" view of its attributes
regr.__dict__
{'fit_intercept': True,
'copy_X': True,
'tol': 1e-06,
'n_jobs': None,
'positive': False}
and fit like the others too:
regr.fit(tips_X_train,tips_y_train)
and compare the "after":
regr.__dict__
{'fit_intercept': True,
'copy_X': True,
'tol': 1e-06,
'n_jobs': None,
'positive': False,
'feature_names_in_': array(['total_bill'], dtype=object),
'n_features_in_': 1,
'coef_': array([0.11077399]),
'rank_': 1,
'singular_': array([123.26936335]),
'intercept_': np.float64(0.8233560751381508)}
We can also save the predictions
tips_y_pred = regr.predict(tips_X_test)
and get a score for the model
regr.score(tips_X_test,tips_y_test)
0.35973565317431033
or the mean squared error:
mean_squared_error(tips_y_pred,tips_y_test)
0.9134820235550511
Regression Predictions¶
Linear regression makes the assumption that the target is a linear function of the features, so we can use the equation for a line (for scalar variables):

$$\hat{y}_i = m x_i + b$$

which becomes equivalent to the following code, assuming they were all vectors/matrices:
i = 1 # could be any index from 0 to len(target)-1
target[i] = regr.coef_*features[i] + regr.intercept_
You can match these up one for one:
target is $\hat{y}_i$ (this is why we usually put the _y_ in the variable name), regr.coef_ is the slope $m$, features are the $x_i$ (like for $y$, we label the variable that way), and
regr.intercept_ is the $b$, or the y intercept.
We will look at each of them and store them to variables here
slope = regr.coef_[0]
intercept = regr.intercept_
now we can pull out our first test sample $x_0$ and calculate a predicted value $\hat{y}_0$ manually
x0 = tips_X_test.iloc[0].values[0]
manual_pred0 = slope * x0 + intercept
model_pred0 = tips_y_pred[0]
manual_pred0 == model_pred0
np.True_
and see that the manual prediction (np.float64(5.5988226331831985)) matches exactly the prediction from the predict method (np.float64(5.5988226331831985))[2].
Visualizing Regression¶
Since we only have one feature, we can visualize what was learned here. We will use plain
matplotlib plots because we are plotting from numpy arrays not data frames.
plt.scatter(tips_X_test,tips_y_test, color='black')
plt.scatter(tips_X_test,tips_y_pred, color='blue')
This plots the predictions in blue vs the real data in black. The above uses scatter plots for both to make the comparison clear; below, like in class, I’ll use a normally inconvenient aspect of a line plot to show all of the predictions the model could make in this range.
plt.scatter(tips_X_test,tips_y_test, color='black')
plt.plot(tips_X_test,tips_y_pred, color='blue')
Evaluating Regression - Mean Squared Error¶
From the plot, we can see that there is some error for each point, so the accuracy we’ve been using won’t work. One idea is to look at how much error there is in each prediction. We can look at that visually first; these errors are called the residuals.
What happened to the plot in class?
We tried to plot the residuals in class:
plt.scatter(tips_X_test, tips_y_test, color='black')
plt.plot(tips_X_test, tips_y_pred, color='blue', linewidth=3)
# draw vertical lines from each data point to its predicted value
[plt.plot([x,x],[yp,yt], color='red', linewidth=3)
 for x, yp, yt in zip(tips_X_test, tips_y_pred,tips_y_test)];
but it didn’t look right.
I should have noticed from the total_bill label on the horizontal axis what was going on: it was plotting the y values vs the string 'total_bill', because zipping over a DataFrame iterates over its column names rather than its rows. We can confirm that is what happened by looking at what data was provided to the plot function:
[[[x,x],[yp,yt]] for x, yp, yt in zip(tips_X_test, tips_y_pred,tips_y_test)]
[[['total_bill', 'total_bill'], [np.float64(5.5988226331831985), 5.0]]]
then we can fix it by getting out the values instead, as I do below.
plt.plot(tips_X_test, tips_y_pred, color='blue', linewidth=3, label ='predictions')
# draw vertical lines from each data point to its predicted value
[plt.plot([x,x],[yp,yt], color='red', linewidth=3)
 for x, yp, yt in zip(tips_X_test.values, tips_y_pred,tips_y_test)];
plt.plot([x0, x0],[tips_y_pred[0],tips_y_test.iloc[0]], color='red', linewidth=3, label='residual')
# plot these last so they are visually on top
plt.scatter(tips_X_test, tips_y_test, color='black', label='data')
plt.legend(loc=2)
plt.xlabel('total bill ($)')
plt.ylabel('tip ($)')
In this code block:

I use zip, a builtin function in Python, to iterate over all of the test samples and predictions (they’re all the same length) and plot each red line in a list comprehension. This could have been a for loop, but the comprehension is slightly more compact visually. The zip allows us to have a Pythonic, easy to read loop that iterates over multiple variables (see the tiny illustration after this list).

To make a vertical line, I make a line plot with just two values. My plotted data is [x,x],[yp,yt], so the first point is (x,yp) and the second is (x,yt).

The ; at the end of the cell suppresses the text output. (try removing it to see what it does)
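As a tiny illustration of what zip does, here is a toy example (made-up values, not the tips data):

# zip yields one tuple per position, pairing up the values from each sequence
list(zip([1, 2, 3], ['a', 'b', 'c']))
# [(1, 'a'), (2, 'b'), (3, 'c')]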
We can use the average length of these red lines to capture the error. To get the length, we can take the difference between the prediction and the data for each point. Some would be positive and others negative, so we will square each one then take the average. This will be the mean squared error
mean_squared_error(tips_y_test,tips_y_pred)
0.9134820235550511
Which is equivalent to:
np.mean((tips_y_test-tips_y_pred)**2)
np.float64(0.9134820235550511)
To interpret this, we can take the square root to get it back into dollars; this root mean squared error is on the same scale as the tips themselves (it plays a similar role to, though it is not exactly the same as, the mean absolute value of the error).
rmse = np.sqrt(mean_squared_error(tips_y_test,tips_y_pred))
rmse
np.float64(0.9557625351283922)
Still, to know if this is a big or small error, we have to compare it to the values we were predicting.
avg_tip = tips_y_test.mean()
avg_tip
np.float64(3.0404081632653064)
The average error ($0.96) is not that big in absolute terms, but it is large relative to the average tip ($3.04); being wrong by about 31.44% is not great.
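The percentage quoted above can be computed directly; a minimal sketch using the rmse and avg_tip values from the cells above:

# express the typical error as a percentage of the average tip
round(float(100 * rmse / avg_tip), 2)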
Evaluating Regression - R2¶
We can also use the $R^2$ score, the coefficient of determination.
If we have the following:

$n$ = len(y_test)
$y$ = y_test, and $y_i$ = y_test[i]
$\hat{y}$ = y_pred
$\bar{y}$ = sum(y_test)/len(y_test)

then:

$$ R^2 = 1 - \frac{\sum_{i=0}^{n-1}(y_i - \hat{y}_i)^2}{\sum_{i=0}^{n-1}(y_i - \bar{y})^2} $$
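As a sanity check, here is a minimal sketch of that formula in numpy (using tips_y_test and tips_y_pred from above); it should match the r2_score call in the next cell:

# residual sum of squares: the error of our model's predictions
ss_res = np.sum((tips_y_test - tips_y_pred)**2)
# total sum of squares: the error of always predicting the mean
ss_tot = np.sum((tips_y_test - tips_y_test.mean())**2)
1 - ss_res/ss_tot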
r2_score(tips_y_test,tips_y_pred)
0.35973565317431033
This is a bit harder to interpret, but we can use some additional plots to visualize it. This code simulates data by randomly picking 20 points, spreading them out, and making the “predicted” y values with a fixed slope of 3. Then I simulated various levels of noise by sampling a noise vector, multiplying that same vector by different scales, and adding each version to a data frame, with the column name being the $R^2$ score we would get if that column of target values were the truth.
Then I added some columns of y values that were generated with different slopes and different functions of x. These all have a small amount of noise.
# decide a number of points
N = 20
# pick some random points
x = 10*np.random.random(N)
# make a line with a fixed slope
y_pred_demo = 3*x
# put it in a data frame
ex_df = pd.DataFrame(data = x,columns = ['x'])
ex_df['y_pred'] = y_pred_demo
# set some amounts of noise
n_levels = range(1,18,2)
# sample random values between 0 and 1, then make the spread between -1 and +1
noise = (np.random.random(N)-.5)*2
# add the noise and put it in the data frame with col title the r2
for n in n_levels:
    y_true = y_pred_demo + n*noise
    ex_df['r2 = '+ str(np.round(r2_score(y_true,y_pred_demo),3))] = y_true
# set some other functions
f_x_list = [2*x,3.5*x,.5*x**2, .03*x**3, 10*np.sin(x)+x*3,3*np.log(x**2)]
# add them to the noise and store with r2 col title
for fx in f_x_list:
    y_true = fx + noise
    ex_df['r2 = '+ str(np.round(r2_score(y_true,y_pred_demo),3))] = y_true
# make the data tidy for plotting
xy_df = ex_df.melt(id_vars=['x','y_pred'],var_name='rscore',value_name='y')
xy_df.head()
This creates a single tidy, or tall, dataframe with data from:
20 random points in [0,1], scaled to be between 0 and 10
a ‘predicted’ model that is $\hat{y} = 3x$, always
‘true’ data that has multiple levels of noise added, in the loop over n_levels
‘true’ data from the other functions in f_x_list
Now we can plot it all:
# make a custom grid by using facet grid directly
g = sns.FacetGrid(data = xy_df,col='rscore',col_wrap=3,aspect=1.5,height=3)
# plot the line
g.map(plt.plot, 'x','y_pred',color='k')
# plot the data
g.map(sns.scatterplot, "x", "y",)
<seaborn.axisgrid.FacetGrid at 0x7f145b32bcb0>
Facet Grids allow more customization than the figure level plotting functions (e.g. pairplot and relplot) we have used otherwise, but each of those combines a FacetGrid with a particular type of plot.
In these, you can see the varying levels of how much the data agrees with the prediction and the corresponding $R^2$.
Regression auto scoring¶
As with all sklearn estimators, it has a built-in score method.
regr.score(tips_X_test, tips_y_test)
0.35973565317431033
this matches the $R^2$ score:
r2_score(tips_y_test,tips_y_pred)
0.35973565317431033
and is different from the MSE:
mean_squared_error(tips_y_test,tips_y_pred)
0.9134820235550511
tips_df.head()
Multivariate Regression¶
Recall the equation for a line:

$$\hat{y} = mx + b$$

When we have multiple variables, instead of a scalar $x$ we can have a vector $\mathbf{x}$, and instead of a single slope $m$, we have a vector of coefficients $\beta$:

$$\hat{y} = \beta \cdot \mathbf{x} + \beta_0$$

where $\beta$ is the regr_diabetes.coef_ and $\beta_0$ is regr_diabetes.intercept_ and that’s a vector multiplication, and $\hat{y}$ is y_pred and $y$ is y_test.

In scalar form, the vector multiplication can be written like:

$$\hat{y} = \sum_{j=0}^{d-1} \beta_j x_j + \beta_0$$

where there are $d$ features, that is $d$ = len(X_test[k]), and $j$ indexes into them.
We can also load data from Scikit learn.
This dataset includes 10 features measured on a given date and a measure of diabetes disease progression measured one year later. The predictor we can train with this data might be something a doctor uses to calculate a patient’s risk.
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y = True)
This model predicts what lab measure a patient will have one year in the future based on lab measures from a given day. Since, as we will see, the $R^2$ is not very high, this is not a perfect predictor, but a doctor, who better understands the measurements, would have to help interpret the score.
diabetes_X[:5]
array([[ 0.03807591, 0.05068012, 0.06169621, 0.02187239, -0.0442235 ,
-0.03482076, -0.04340085, -0.00259226, 0.01990749, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, -0.02632753, -0.00844872,
-0.01916334, 0.07441156, -0.03949338, -0.06833155, -0.09220405],
[ 0.08529891, 0.05068012, 0.04445121, -0.00567042, -0.04559945,
-0.03419447, -0.03235593, -0.00259226, 0.00286131, -0.02593034],
[-0.08906294, -0.04464164, -0.01159501, -0.03665608, 0.01219057,
0.02499059, -0.03603757, 0.03430886, 0.02268774, -0.00936191],
[ 0.00538306, -0.04464164, -0.03638469, 0.02187239, 0.00393485,
0.01559614, 0.00814208, -0.00259226, -0.03198764, -0.04664087]])
X_train, X_test, y_train, y_test = train_test_split(diabetes_X, diabetes_y)
and fit:
regr_diabetes = linear_model.LinearRegression()
regr_diabetes.fit(X_train, y_train)
We can look at the estimator again and see what it learned. It describes the model like a line:

$$\hat{y} = mx + b$$
except in this case it’s multivariate, so we can write it like:

$$\hat{y} = \beta \cdot \mathbf{x} + \beta_0$$

where $\beta$ is the regr_diabetes.coef_ and $\beta_0$ is regr_diabetes.intercept_ and that’s a vector multiplication, and $\hat{y}$ is y_pred and $y$ is y_test.

In scalar form it can be written like:

$$\hat{y} = \sum_{j=0}^{d-1} \beta_j x_j + \beta_0$$

where there are $d$ features, that is $d$ = len(X_test[k]), and $j$ indexes into them. For example, in the below:
regr_diabetes.score(X_test, y_test)
0.39559100670095426
regr_diabetes.__dict__
{'fit_intercept': True,
'copy_X': True,
'tol': 1e-06,
'n_jobs': None,
'positive': False,
'n_features_in_': 10,
'coef_': array([ -22.50631216, -286.88191021, 500.77161988, 347.0257024 ,
-1148.38129358, 734.48689317, 189.13010628, 228.00061815,
865.34237424, 118.89437816]),
'rank_': 10,
'singular_': array([1.7369086 , 1.07640059, 0.99187017, 0.83867701, 0.73326035,
0.6779834 , 0.63145094, 0.56705339, 0.24837583, 0.08012279]),
'intercept_': np.float64(152.29417772737628)}
For comparison, the tips model:
regr.__dict__
{'fit_intercept': True,
'copy_X': True,
'tol': 1e-06,
'n_jobs': None,
'positive': False,
'feature_names_in_': array(['total_bill'], dtype=object),
'n_features_in_': 1,
'coef_': array([0.11077399]),
'rank_': 1,
'singular_': array([123.26936335]),
'intercept_': np.float64(0.8233560751381508)}
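Just like the single-feature manual prediction for the tips model above, we can check the vector form by hand. A minimal sketch using the first diabetes test sample (all names as defined above; manual_db_pred0 is just an illustrative name):

# dot product of the learned coefficients with the features, plus the intercept
manual_db_pred0 = np.dot(regr_diabetes.coef_, X_test[0]) + regr_diabetes.intercept_
# compare to the model's own prediction for the same sample
np.isclose(manual_db_pred0, regr_diabetes.predict(X_test[:1])[0])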
LASSO¶
To solve a regular linear regression problem and find the weights $w$ to make the line:

$$\hat{y} = Xw + b$$

The fit method finds the $w$ to minimize the following objective function, for $n$ samples:

$$\min_w \|Xw - y\|_2^2$$
Note that this is basically minimizing the mean squared error.
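To make that concrete, here is a minimal check (a sketch using the already-fit tips model regr and its training data from above): the sum of squared residuals that the least squares fit minimizes differs from the mean squared error only by the constant factor $1/n$, so minimizing one minimizes the other.

# residuals of the fit tips model on its own training data
train_resid = tips_y_train - regr.predict(tips_X_train)
# sum of squared errors: what the ordinary least squares fit minimizes
sse = np.sum(train_resid**2)
# dividing by the number of samples gives exactly the mean squared error
np.isclose(sse / len(tips_y_train), np.mean(train_resid**2))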
Lasso allows us to pick a subset of the features at the same time we learn the weights
The objective is to minimize:

$$\frac{1}{2 n_{samples}} \|y - Xw\|_2^2 + \alpha \|w\|_1$$

or in Python-like syntax:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
It is implemented as an estimator object like every other one we have seen.
lasso = linear_model.Lasso()
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)
0.36169518506096787
We see it learns a model with most of the coefficients set to 0.
coefs = lasso.coef_
n_coefs = len(coefs) - sum(coefs==0)
coefs
array([ 0. , -0. , 368.81149218, 34.15192845,
0. , 0. , -0. , 0. ,
321.23734932, 0. ])
We can increase the number of nonzero parameters allowed by making alpha smaller, because a smaller $\alpha$ allows $\|w\|_1$ to be bigger while keeping their product, which is what gets added to the total predictive error, about the same size.
lasso = linear_model.Lasso(alpha=.25)
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)
0.44273288001220656
coefs = lasso.coef_
n_coefs = len(coefs) - sum(coefs==0)
coefs
array([ -0. , -80.95822654, 483.2260629 , 241.62510048,
-0. , -0. , -203.81326993, 0. ,
433.23980188, 38.37984675])
lasso = linear_model.Lasso(alpha=.05)
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)
0.42167063026078067
we can compare this to the original score
coefs = lasso.coef_
n_coefs = len(coefs) - sum(coefs==0)
coefs
array([ -0. , -237.4919946 , 504.95936128, 314.72846588,
-147.64062613, -0. , -258.06532906, 20.35895618,
How could you inspect the residuals with more than one input feature?¶
Let’s get the predictions:
y_pred = regr_diabetes.predict(X_test)
One thing we can do is plot the predictions vs the real values; this would ideally be a perfect line.
plt.scatter(y_test, y_pred)
plt.plot(y_test,y_test,'k')
There is some noise as expected, and further, the errors are not uniformly distributed (the black line is the true values plotted against themselves): for lower values the model tends to overpredict (the predictions are higher than the real value) and for higher values it tends to underpredict (the predictions are mostly lower than the real value).
We could also plot the residuals similarly:
plt.scatter(y_test, y_test-y_pred)
We see the same pattern.
We could also plot the residuals vs the features. To do this, we will make a dataframe and use seaborn. First, we’ll get the feature names.
db_bunch = datasets.load_diabetes()
feature_names = db_bunch.feature_names
Now make a wide dataframe and melt it to be tall.
diabetes_residuals_wide = pd.DataFrame(data = X_test, columns =feature_names)
diabetes_residuals_wide['y_pred'] = y_pred
diabetes_residuals_wide['y_test'] = y_test
diabetes_residuals_wide['residual'] = y_test - y_pred
diabetes_residuals = diabetes_residuals_wide.melt(id_vars = ['residual','y_pred','y_test'],
value_vars = feature_names, value_name='feature_value', var_name='feature')
diabetes_residuals.head()
and now we can make a subplot for each feature and plot the residuals vs the feature values
sns.relplot(data = diabetes_residuals, x='feature_value',y = 'residual',
col='feature',col_wrap=3)
<seaborn.axisgrid.FacetGrid at 0x7f145b6223c0>
From this, it actually looks like none of the individual features have a lot more information left that we could find.
plt.scatter(tips_y_test, tips_y_test-tips_y_pred)
Questions¶
How does LASSO differ from regular linear regression?¶
LASSO is a regularized model: it adds the $\alpha \|w\|_1$ penalty to the least squares objective, which pushes some coefficients to exactly 0 and so also selects a subset of the features.
What is the range of $R^2$?¶
1 is ideal; it would be 0 if we set the model to always predict the mean of the target (y_test.mean()), and it can be negative, because the predictions can be arbitrarily bad.
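A quick check of those three cases (a sketch using tips_y_test from above; the constant 100 is just an arbitrarily bad guess):

# predicting the mean of the targets gives a score of (essentially) 0
print(r2_score(tips_y_test, np.full_like(tips_y_test, tips_y_test.mean())))
# a constant prediction far from the data gives a negative score
print(r2_score(tips_y_test, np.full_like(tips_y_test, 100)))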