Regression

16. Regression#

16.1. Setting up for regression#

We’re going to predict tip from total bill. This is a regression problem because the target, tip is:

available in the data (makes it supervised) and
a continuous value. The problems we’ve seen so far were all classification, species of iris and the character in that corners data were both categorical.

Using linear regression is also a good choice because it makes sense that the tip would be approximately linearly related to the total bill, most people pick some percentage of the total bill. If we our prior knowledge was that people typically tipped with some more complicated function, this would not be a good model.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import pandas as pd
sns.set_theme(font_scale=2,palette='colorblind')

We wil load this data from seaborn

tips_df = sns.load_dataset('tips')
tips_df.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

Since we will use only one variable, we have to do some extra work. When we use sklearn with a DataFrame, it picks out the values, like this:

tips_df['total_bill'].values

array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29,  8.77, 26.88, 15.04,
       14.78, 10.27, 35.26, 15.42, 18.43, 14.83, 21.58, 10.33, 16.29,
       16.97, 20.65, 17.92, 20.29, 15.77, 39.42, 19.82, 17.81, 13.37,
       12.69, 21.7 , 19.65,  9.55, 18.35, 15.06, 20.69, 17.78, 24.06,
       16.31, 16.93, 18.69, 31.27, 16.04, 17.46, 13.94,  9.68, 30.4 ,
       18.29, 22.23, 32.4 , 28.55, 18.04, 12.54, 10.29, 34.81,  9.94,
       25.56, 19.49, 38.01, 26.41, 11.24, 48.27, 20.29, 13.81, 11.02,
       18.29, 17.59, 20.08, 16.45,  3.07, 20.23, 15.01, 12.02, 17.07,
       26.86, 25.28, 14.73, 10.51, 17.92, 27.2 , 22.76, 17.29, 19.44,
       16.66, 10.07, 32.68, 15.98, 34.83, 13.03, 18.28, 24.71, 21.16,
       28.97, 22.49,  5.75, 16.32, 22.75, 40.17, 27.28, 12.03, 21.01,
       12.46, 11.35, 15.38, 44.3 , 22.42, 20.92, 15.36, 20.49, 25.21,
       18.24, 14.31, 14.  ,  7.25, 38.07, 23.95, 25.71, 17.31, 29.93,
       10.65, 12.43, 24.08, 11.69, 13.42, 14.26, 15.95, 12.48, 29.8 ,
        8.52, 14.52, 11.38, 22.82, 19.08, 20.27, 11.17, 12.26, 18.26,
        8.51, 10.33, 14.15, 16.  , 13.16, 17.47, 34.3 , 41.19, 27.05,
       16.43,  8.35, 18.64, 11.87,  9.78,  7.51, 14.07, 13.13, 17.26,
       24.55, 19.77, 29.85, 48.17, 25.  , 13.39, 16.49, 21.5 , 12.66,
       16.21, 13.81, 17.51, 24.52, 20.76, 31.71, 10.59, 10.63, 50.81,
       15.81,  7.25, 31.85, 16.82, 32.9 , 17.89, 14.48,  9.6 , 34.63,
       34.65, 23.33, 45.35, 23.17, 40.55, 20.69, 20.9 , 30.46, 18.15,
       23.1 , 15.69, 19.81, 28.44, 15.48, 16.58,  7.56, 10.34, 43.11,
       13.  , 13.51, 18.71, 12.74, 13.  , 16.4 , 20.53, 16.47, 26.59,
       38.73, 24.27, 12.76, 30.06, 25.89, 48.33, 13.27, 28.17, 12.9 ,
       28.15, 11.59,  7.74, 30.14, 12.16, 13.42,  8.58, 15.98, 13.42,
       16.27, 10.09, 20.45, 13.28, 22.12, 24.01, 15.69, 11.61, 10.77,
       15.53, 10.07, 12.6 , 32.83, 35.83, 29.03, 27.18, 22.67, 17.82,
       18.78])

This is because originally sklearn actualy only worked on numpy arrays and sot what they do now is pull it out of the DataFrame, it gives us a numpy array.

We can change this by adding an axis. It doesn’t change values, but changes the way it is represented enough that it has the properties we need.

tips_df['total_bill'].values[:,np.newaxis]

array([[16.99],
       [10.34],
       [21.01],
       [23.68],
       [24.59],
       [25.29],
       [ 8.77],
       [26.88],
       [15.04],
       [14.78],
       [10.27],
       [35.26],
       [15.42],
       [18.43],
       [14.83],
       [21.58],
       [10.33],
       [16.29],
       [16.97],
       [20.65],
       [17.92],
       [20.29],
       [15.77],
       [39.42],
       [19.82],
       [17.81],
       [13.37],
       [12.69],
       [21.7 ],
       [19.65],
       [ 9.55],
       [18.35],
       [15.06],
       [20.69],
       [17.78],
       [24.06],
       [16.31],
       [16.93],
       [18.69],
       [31.27],
       [16.04],
       [17.46],
       [13.94],
       [ 9.68],
       [30.4 ],
       [18.29],
       [22.23],
       [32.4 ],
       [28.55],
       [18.04],
       [12.54],
       [10.29],
       [34.81],
       [ 9.94],
       [25.56],
       [19.49],
       [38.01],
       [26.41],
       [11.24],
       [48.27],
       [20.29],
       [13.81],
       [11.02],
       [18.29],
       [17.59],
       [20.08],
       [16.45],
       [ 3.07],
       [20.23],
       [15.01],
       [12.02],
       [17.07],
       [26.86],
       [25.28],
       [14.73],
       [10.51],
       [17.92],
       [27.2 ],
       [22.76],
       [17.29],
       [19.44],
       [16.66],
       [10.07],
       [32.68],
       [15.98],
       [34.83],
       [13.03],
       [18.28],
       [24.71],
       [21.16],
       [28.97],
       [22.49],
       [ 5.75],
       [16.32],
       [22.75],
       [40.17],
       [27.28],
       [12.03],
       [21.01],
       [12.46],
       [11.35],
       [15.38],
       [44.3 ],
       [22.42],
       [20.92],
       [15.36],
       [20.49],
       [25.21],
       [18.24],
       [14.31],
       [14.  ],
       [ 7.25],
       [38.07],
       [23.95],
       [25.71],
       [17.31],
       [29.93],
       [10.65],
       [12.43],
       [24.08],
       [11.69],
       [13.42],
       [14.26],
       [15.95],
       [12.48],
       [29.8 ],
       [ 8.52],
       [14.52],
       [11.38],
       [22.82],
       [19.08],
       [20.27],
       [11.17],
       [12.26],
       [18.26],
       [ 8.51],
       [10.33],
       [14.15],
       [16.  ],
       [13.16],
       [17.47],
       [34.3 ],
       [41.19],
       [27.05],
       [16.43],
       [ 8.35],
       [18.64],
       [11.87],
       [ 9.78],
       [ 7.51],
       [14.07],
       [13.13],
       [17.26],
       [24.55],
       [19.77],
       [29.85],
       [48.17],
       [25.  ],
       [13.39],
       [16.49],
       [21.5 ],
       [12.66],
       [16.21],
       [13.81],
       [17.51],
       [24.52],
       [20.76],
       [31.71],
       [10.59],
       [10.63],
       [50.81],
       [15.81],
       [ 7.25],
       [31.85],
       [16.82],
       [32.9 ],
       [17.89],
       [14.48],
       [ 9.6 ],
       [34.63],
       [34.65],
       [23.33],
       [45.35],
       [23.17],
       [40.55],
       [20.69],
       [20.9 ],
       [30.46],
       [18.15],
       [23.1 ],
       [15.69],
       [19.81],
       [28.44],
       [15.48],
       [16.58],
       [ 7.56],
       [10.34],
       [43.11],
       [13.  ],
       [13.51],
       [18.71],
       [12.74],
       [13.  ],
       [16.4 ],
       [20.53],
       [16.47],
       [26.59],
       [38.73],
       [24.27],
       [12.76],
       [30.06],
       [25.89],
       [48.33],
       [13.27],
       [28.17],
       [12.9 ],
       [28.15],
       [11.59],
       [ 7.74],
       [30.14],
       [12.16],
       [13.42],
       [ 8.58],
       [15.98],
       [13.42],
       [16.27],
       [10.09],
       [20.45],
       [13.28],
       [22.12],
       [24.01],
       [15.69],
       [11.61],
       [10.77],
       [15.53],
       [10.07],
       [12.6 ],
       [32.83],
       [35.83],
       [29.03],
       [27.18],
       [22.67],
       [17.82],
       [18.78]])

All sklearn estimator objects are designed to work on multiple features, which mean they expect that the features have a 2D shape, but when we pick out a single column’s (pandas Series) values, we get a 1D shape but after we get 2d, though the second dimension the size is 1

tips_df['total_bill'].values.shape, tips_df['total_bill'].values[:,np.newaxis].shape

((244,), (244, 1))

The target is expected to be a single feature, so we do not need to do anythign special.

tips_X = tips_df['total_bill'].values[:,np.newaxis]
tips_y = tips_df['tip']

Next, we split the data

tips_X_train, tips_X_test, tips_y_train, tips_y_test = train_test_split(tips_X,
                                                                        tips_y)

16.2. Fitting a regression model#

We instantiate the object as we do with all other sklearn estimator objects.

regr = linear_model.LinearRegression()

we can inspect it to get a before

regr.__dict__

{'fit_intercept': True, 'copy_X': True, 'n_jobs': None, 'positive': False}

and fit liek the others too.

regr.fit(tips_X_train,tips_y_train)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

and look again:

regr.__dict__

{'fit_intercept': True,
 'copy_X': True,
 'n_jobs': None,
 'positive': False,
 'n_features_in_': 1,
 'coef_': array([0.09653355]),
 'rank_': 1,
 'singular_': array([118.14042644]),
 'intercept_': 1.0732389645412237}

we also can predict as before

tips_y_pred = regr.predict(tips_X_test)

16.3. Regression Predictions#

Linear regression is making the assumption that the target is a linear function of the features so that we can use the equation for a line(for scalar variables):

\[ y =mx + b\]

becomes equivalent to the following code assuming they were all vectors/matrices:

i = 1 # could be any number from 0 ... len(target)
target[i] = regr.coef_*features[i]+ regr.intercept_

You can match these up one for one.

target is $y$ (this is why we usually put the _y_ in the varaible name)
regr.coef_ is the slope, $m$
features are the $x$ (like for y, we label that way)
adn regr.intercept_ is the $b$ or the y intercept.

Important

I have tried to expand the detail here substantively. Use the link at the top of the page to make an issue if you have specific questions

We will look at each of them and store them to variables here

slope = regr.coef_
intercept = regr.intercept_

Important

This is what our model predicts the tip will be based on the past data. It is important to note that this is not what the tip should be by any sort of virtues. For example, a typical normative rule for tipping is to tip 15% or 20%. the model we learned, from this data, however is array([9.65335512])% + $1.0732389645412237.

now we can pull out our first $x_i$ and calculate a predicted value of $y$ manually

slope * tips_X_test[0] + intercept

array([2.1370387])

and see that this matches exactly the prediction from the predict method.

tips_y_pred[0]

2.137038698355399

type(tips_X_test)

numpy.ndarray

Important

Note that my random seed is not fixed, so the values there are going to change each time the website is updated, but that these values will always be true

16.4. Visualizing Regression#

Since we only have one feature, we can visualize what was learned here. We will use plain matplotlib plots because we are plotting from numpy arrays not data frames.

plt.scatter(tips_X_test, tips_y_test,color = 'black')
plt.scatter(tips_X_test, tips_y_pred, color='blue')

<matplotlib.collections.PathCollection at 0x7fd6240984c0>

../_images/7bcd2e826c7cbbd0f30bc306cd916235041e7033ed9a4a3bf0bdcd01a4c5c4ad.png

This plots the predictions in blue vs the real data in black. The above uses them both as scatter plots to make it more clear, below, like in class, I’ll use a normally inconvenient aspect of a line plot to show all the predictions the model could make in this range.

16.5. Evaluating Regression - Mean Squared Error#

From the plot, we can see that there is some error for each point, so accuracy that we’ve been using, won’t work. One idea is to look at how much error there is in each prediction, we can look at that visually first.

These red lines are the residuals, or errors in each prediction.

In this code block, I use zip a builtin function in Python to iterate over all of the test samples and predictions (they’re all the same length) and plot each red line in a list comprehension. This could have been a for loop, but the comprehension is slighly more compact visually. The zip allows us to have a Pythonic, easy to read loop that iterates over multiple variables.

To mape a vertical line, I make a line plot with just two values. My plotted data is: [x,x],[yp,yt] so the first point is (x,yp) and the second is x,yt.

The ; at the end of each line suppressed the text output.

# plot line prediction
plt.plot(tips_X_test, tips_y_pred, color='blue', linewidth=3);

# draw vertical lines from each data point to its predict value
[plt.plot([x,x],[yp,yt], color='red', linewidth=3, markevery=[0], marker ='^')
                 for x, yp, yt in zip(tips_X_test, tips_y_pred,tips_y_test)];

# plot these last so they are visually on top
plt.scatter(tips_X_test, tips_y_test,  color='black');

../_images/087da955e2ba3011e88e414c0b017145f9289fb4cb6d4ef4e026bbaa5de3d98a.png

We can see that theese are pretty small errors mostly, but now we can get some metrics.

To see mroe about how the code above works:

We can use the average length of these red lines to capture the error. To get the length, we can take the difference between the prediction and the data for each point. Some would be positive and others negative, so we will square each one then take the average. This will be the mean squared error

mean_squared_error(tips_y_test,tips_y_pred)

1.0142259644471954

Which is equivalent to:

np.mean((tips_y_test-tips_y_pred)**2)

1.0142259644471954

To interpret this, we can take the square root to get it back into dollars. This becomes equivalent to taking the mean absolute value of the error.

mean_abs_error = np.sqrt(mean_squared_error(tips_y_test,tips_y_pred))
mean_abs_error

1.0070878633203735

which is equivalent to:

np.mean(np.abs(tips_y_test-tips_y_pred))

0.7362179605368665

Still, to know if this is a big or small error, we have to compare it to the values we were predicting

tips_y_test.describe()

count    61.000000
mean      3.076557
std       1.539949
min       1.000000
25%       2.000000
50%       3.000000
75%       3.500000
max       9.000000
Name: tip, dtype: float64

avg_tip = tips_y_test.mean()
tip_var = tips_y_test.var()
avg_tip, tip_var

(3.07655737704918, 2.371442950819672)

So we can see that the average error is ${glue}`mae` which is relatively large compared to the average tip size of $ but is smaller than most of them.

16.6. Evaluating Regression - R2#

We can also use the $R^2$ score, the coefficient of determination.

If we have the following:

$n$ `=len(y_test)``
$y$ =y_test
$y_i$ =y_test[i]
$\hat{y}$ = y_pred
$\bar{y} = \frac{1}{n}\sum_{i=0}^n y_i$ = sum(y_test)/len(y_test)

\[ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^n (y_i - \hat{y}_i)^2}{\sum_{i=0}^n (y_i - \bar{y}_i)^2} \]

r2_score(tips_y_test,tips_y_pred)

0.5651888947339372

This is a bit harder to interpret, but we can use some additional plots to visualize. This code simulates data by randomly picking 20 points, spreading them out and makes the “predicted” y values by picking a slope of 3. Then I simulated various levels of noise, by sampling noise and multiplying the same noise vector by different scales and adding all of those to a data frame with the column name the r score for if that column of target values was the truth.

Then I added some columns of y values that were with different slopes and different functions of x. These all have the small amount of noise.

# decide a number of points
N = 20
# pick some random points
x = 10*np.random.random(N)
# make a line with a fixed slope
y_pred = 3*x
# put it in a data frame
ex_df = pd.DataFrame(data = x,columns = ['x'])
ex_df['y_pred'] = y_pred
# set some amounts of noise
n_levels = range(1,18,2)
# sample random values between 0 and 1, then make the spread between -1 and +1
noise = (np.random.random(N)-.5)*2
# add the noise and put it in the data frame with col title the r2
for n in n_levels:
    y_true = y_pred + n* noise
    ex_df['r2 = '+ str(np.round(r2_score(y_pred,y_true),3))] = y_true

# set some other functions
f_x_list = [2*x,3.5*x,.5*x**2, .03*x**3, 10*np.sin(x)+x*3,3*np.log(x**2)]
# add them to the noise and store with r2 col title
for fx in f_x_list:
    y_true = fx + noise
    ex_df['r2 = '+ str(np.round(r2_score(y_pred,y_true),3))] = y_true    

# make the data tidy for plotting
xy_df = ex_df.melt(id_vars=['x','y_pred'],var_name='rscore',value_name='y')
# make a custom grid by using facet grid directly
g = sns.FacetGrid(data = xy_df,col='rscore',col_wrap=3,aspect=1.5,height=3)
# plot the lin
g.map(plt.plot, 'x','y_pred',color='k')
# plot the dta
g.map(sns.scatterplot, "x", "y",)

<seaborn.axisgrid.FacetGrid at 0x7fd6582ee100>

../_images/cc6e94e972dd23a46c024e28b59a94dacaf06df3527c69549553143304902399.png

16.7. Regression auto scoring#

as all sklearn objects, it has a built in score method

regr.score(tips_X_test, tips_y_test)

0.5651888947339372

this matches the $R^2$ score:

r2_score(tips_y_test,tips_y_pred)

0.5651888947339372

and is different from mse

mean_squared_error(tips_y_test,tips_y_pred)

1.0142259644471954

tips_df.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

16.8. Mutlivariate Regression#

Recall the equation for a line:

\[ \hat{y} = mx+b\]

When we have multiple variables instead of a scalar $x$ we can have a vector $\mathbf{x}$ and instead of a single slope, we have a vector of coefficients $\beta$

\[ \hat{y} = \beta^T\mathbf{x} + \beta_0 \]

where $\beta$ is the regr_db.coef_ and $\beta_0$ is regr_db.intercept_ and that’s a vector multiplication and $\hat{y}$ is y_pred and $y$ is y_test.

In scalar form, a vector multiplication can be written like

\[ \hat{y} = \sum_{k=0}^d(x_k*\beta_k) + \beta_0\]

where there are $d$ features, that is $d$= len(X_test[k]) and $k$ indexed into it.

We are going to use 2 variables so it would be like:

\[ \hat{y} = \beta_1*x_1 + \beta_2*x_2 + \beta_0\]

tips_X_size = tips_df[['total_bill','size']]

We do not need the newaxis once we have more than one feature

tips_X_size_train, tips_X_size_test, tips_y_size_train, tips_y_size_test = train_test_split(tips_X_size, tips_y)

regr2 = linear_model.LinearRegression()

regr2.fit(tips_X_size_train,tips_y_size_train).score(tips_X_size_test,tips_y_size_test)

0.23743547190274172

this does a little bit worse:

tips_y_size_pred = regr2.predict(tips_X_size_test)

we can also check in terms of the absolut values

np.sqrt(mean_squared_error(tips_y_size_test,tips_y_size_pred))

1.0587744531972543

We can view the coefficients

regr2.coef_

array([0.1056387 , 0.16285664])

It still computes the same way:

tips_X_size_test.values[0][0]*regr.coef_[0] + tips_X_size_test.values[0][1]*regr.coef_[1] +regr.intercept_

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[41], line 1
----> 1 tips_X_size_test.values[0][0]*regr.coef_[0] + tips_X_size_test.values[0][1]*regr.coef_[1] +regr.intercept_

IndexError: index 1 is out of bounds for axis 0 with size 1

16.9. Questions after class#

16.9.1. What key points in data we should look for that can help us to chose the right model to work with?#

Honestly after choosing the right task, (eg classification, regression, or clustering) a good path is to start with a simple model and then evaluate how it does or does not fit, then you can go more complex within a task.

Also, you want to check the assumptions of the model, like from the documentation or the math underlying it and then check if your data seems to fit those assumptions. That said, sometimes they are hard to check. For example, it is hard to view data and see if it is linear in a lot of dimensions, but you can check residuals after with respect to each feature.

16.9.2. Im confused on how residuals are being factored into these models#

The residuals are the point by point errors in the model. The errors

We can plot the residuals, or errors directly versus the input.

plt.scatter(tips_X_test,tips_y_test-tips_y_pred,)

<matplotlib.collections.PathCollection at 0x7fd61d1b3d60>

../_images/e3bed17db4e2a33a81ea926b403957d20c42d7fe91f219f0fb01b5072bb9e837.png

If our model was predicting uniformly well, these would be evenly distributed left to right.

16.9.3. How do you know if a regression model is good or works for a dataset?#

Let’s break this down. Regression is supervised learning, so there needs to be variables that you can use as features and one to use as the target. To test if supervised learning makes sense, try filling column names in place of features and target in the following sentence, that describes a prediction task: we want to predict target from features.

Regression, specifically means that the target variable needs to be continuous, not discrete. For example, in today’s class, we predicted the tip in dollars, which is continous in this sense. This is not stictly continuous in the mathematical sense, but anything that is most useful to consider as continuous.

logisitic regression is actually a classification model despite the name

16.9.4. So is regression is just used with numeric values in terms of predicting?#

Yes linear regression requires numeric values for the features as well as the target.

Other types of regression can use other types of features.

All regression is for a continusous target.

16.9.5. What are the strengths of a regression model as opposed to the previous models we’ve seen?#

Using regression, classification, or clustering is based on what type of problem you are solving.

Our simple flowchart for what task you are doing is:

flowchart TD labels{Do you have labels <br>in the dataset? } labels --> |yes| sup[Supervised Learning] labels --> |no| unsup[Unsupervised Learning] sup --> ltype{What type of variable <br> are the labels?} ltype --> |continuous| reg[Regression] ltype --> |discrete| clas[Classification] unsup --> groups{Do you think there are <br> groups within the data?} groups --> |yes | clus[Clutering]

the sklearn docs include a more detailed guide

16.9.6. Are linear models exclusively used for regression problems?#

No, there are also linear classification models.

16.9.7. What does a negative R2 score mean?#

It’s a bad fit.

16.9.8. What is the difference between a good and bad rscore?#

1 is perfect, low is bad. What is “good enough” is context dependent.

16.9.9. why were your values coming out negative?#

Mine was not a very good fit. We’ll see why on Thursday. Yours may have been better

Regression

Contents

16. Regression#

16.1. Setting up for regression#

16.2. Fitting a regression model#

16.3. Regression Predictions#

16.4. Visualizing Regression#

16.5. Evaluating Regression - Mean Squared Error#

16.6. Evaluating Regression - R2#

16.7. Regression auto scoring#

16.8. Mutlivariate Regression#

16.9. Questions after class#

16.9.1. What key points in data we should look for that can help us to chose the right model to work with?#

16.9.2. Im confused on how residuals are being factored into these models#

16.9.3. How do you know if a regression model is good or works for a dataset?#

16.9.4. So is regression is just used with numeric values in terms of predicting?#

16.9.5. What are the strengths of a regression model as opposed to the previous models we’ve seen?#

16.9.6. Are linear models exclusively used for regression problems?#

16.9.7. What does a negative R2 score mean?#

16.9.8. What is the difference between a good and bad rscore?#

16.9.9. why were your values coming out negative?#