Class 22: More Regression, More Evaluation and LASSO

  1. Log in to Prismia chat

  2. Say hello in the Zoom chat

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

Questions after class Monday

Some good questions were asked on the form Monday. Check the notes for insight on r2_score and mean_squared_error.
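For reference, these are the quantities that mean_squared_error and r2_score compute. With predictions \(\hat{y}_i\), true values \(y_i\), and mean of the true values \(\bar{y}\):

\[ MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

MSE is in the squared units of the target, while \(R^2\) compares the model's errors to those of always predicting the mean (1 is perfect, 0 is no better than predicting the mean).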

Review of Test Train splits

X_train,X_test, y_train,y_test = train_test_split(diabetes_X, diabetes_y ,
                                                  test_size=20,random_state=0)
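Since test_size here is an integer, it is interpreted as an absolute number of test samples (20); a float between 0 and 1 would be interpreted as a fraction of the dataset instead. A minimal sketch of the fractional version (the _frac variable names are just for illustration):

# test_size as a fraction: hold out 10% of the samples instead of exactly 20
X_train_frac, X_test_frac, y_train_frac, y_test_frac = train_test_split(
    diabetes_X, diabetes_y, test_size=.1, random_state=0)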

What metric does the score method of LinearRegression use?

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
LinearRegression()
regr.score(X_test,y_test)
0.5195208400616668
y_pred = regr.predict(X_test)
r2_score(y_test,y_pred)
0.5195208400616668
mean_squared_error(y_test,y_pred)
2850.3176917525775
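Since the mean squared error is in the squared units of the target, one common way to make it easier to interpret is to take its square root, which puts it back on the scale of the target (roughly 53 here); a minimal sketch:

# root mean squared error: the typical size of a prediction error, in target units
np.sqrt(mean_squared_error(y_test, y_pred))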

Digging in deeper to the Linear Regression model

Linear regression fitting involves learning the coefficients

regr.coef_
array([ -32.3074285 , -257.44432972,  513.31945939,  338.46656647,
       -766.86983748,  455.85416891,   92.55795582,  184.75163454,
        734.92318647,   82.7231425 ])

and an intercept

 regr.intercept_
152.39189054201842

The linear regression model is

\[ y = wx + b\]

where \(w\) is a vector with one coefficient per feature, stored in the coef_ attribute, and the scalar \(b\) is stored in the intercept_ attribute.

We can check how this works by multiplying one sample by the coefficients, to get a vector

X_test[0]*regr.coef_
array([ -0.64334474, -13.0473092 ,  53.80033982,  23.71721361,
        27.58260658, -12.16168908,  -2.31326921,  -0.47892464,
         2.72784249,   3.33733048])

then taking the sum and adding the intercept

np.sum(X_test[0]*regr.coef_) + regr.intercept_
234.9126866563482

and then comparing to the predictions.

y_pred[0]
234.91268665634823

These may not be exactly the same due to floating point rounding errors, but they're very close.
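One way to make this "equal up to floating point error" check explicit is np.isclose, which compares two numbers within a small tolerance; a minimal sketch:

# True when the manual computation matches the prediction within floating point tolerance
np.isclose(np.sum(X_test[0]*regr.coef_) + regr.intercept_, y_pred[0])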

We can also check for the whole test set

errors = np.sum(X_test*regr.coef_,axis=1) + regr.intercept_ -y_pred
errors
array([-2.84217094e-14,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        1.42108547e-14,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -1.42108547e-14, -2.84217094e-14,  0.00000000e+00])

and confirm these are very small

np.max(errors)
1.4210854715202004e-14

It’s doing linear regression as we expected.

Changing the complexity in Linear regression

It can also be important to know what the model is really doing and to see which features are used and which are not. Maybe, for example, all of the features are expensive to measure; we measured a lot of them for this experiment, but we would like to need fewer of them going forward. We might want to simultaneously learn which features we actually need and learn the linear model.

LASSO can do that for us. Its objective is like linear regression's, but it adds an extra term that forces some of the learned coefficients to be exactly 0. Multiplying a feature by 0 gives 0, so that is like throwing that feature away. The extra term is called the 1-norm of the coefficients; the details of the math are not important here, just the idea that it can reduce the number of features used.

\[ ||y-wx||_2^2 + \alpha ||w||_1 \]

lasso = linear_model.Lasso()
lasso.fit(X_train, y_train)
lasso.score(X_test,y_test)
0.4224899448032938

It uses many fewer features; most of the coefficients are now 0.

lasso.coef_
array([  0.        ,  -0.        , 358.27498703,   9.7141813 ,
         0.        ,   0.        ,  -0.        ,   0.        ,
       309.50796119,   0.        ])
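One quick way to count how many features the model actually kept is to count the nonzero coefficients; a minimal sketch:

# number of nonzero coefficients (3 for the coefficients printed above)
np.sum(lasso.coef_ != 0)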

It has a parameter alpha (which must be > 0) that we can use to control how many features it uses. When alpha = 0 the objective is the same as ordinary linear regression, but in that case the regular LinearRegression estimator should be used instead, since its fit method uses a more numerically stable algorithm.

lasso2 = linear_model.Lasso(alpha=.5)
lasso2.fit(X_train, y_train)
lasso2.score(X_test,y_test)
0.5272242529276977

We see that fewer coefficients are 0 now.

lasso2.coef_
array([  0.        ,  -0.        , 466.09039819, 140.20195776,
        -0.        ,  -0.        , -61.96474668,   0.        ,
       405.95829094,   0.        ])
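To see how alpha controls how many features are used, we could sweep over a few values and count the nonzero coefficients for each fit. A small sketch (the alpha values here are arbitrary choices, and the exact counts depend on the data):

# fit a Lasso for several alpha values and see how many coefficients survive
for alpha in [.1, .5, 1, 2]:
    lasso_a = linear_model.Lasso(alpha=alpha)
    lasso_a.fit(X_train, y_train)
    print(alpha, np.sum(lasso_a.coef_ != 0))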
We can also compare how the two models score on the training data:

regr.score(X_train, y_train)
0.5170933599105026
lasso2.score(X_train, y_train)
0.4516267981295532

and compare them with cross validation, which fits and scores each model on several different train/test splits:

lasso_cv = cross_val_score(lasso2, diabetes_X, diabetes_y)
lasso_cv
array([0.3712605 , 0.45675648, 0.45066569, 0.42134051, 0.47736463])
regr_cv = cross_val_score(regr,diabetes_X, diabetes_y )
regr_cv
array([0.42955643, 0.52259828, 0.4826784 , 0.42650827, 0.55024923])
np.mean(lasso_cv), np.mean(regr_cv)
(0.43547756049731523, 0.48231812211149394)

Passing cv=10 uses 10 folds instead of the default 5:

cross_val_score(regr,diabetes_X, diabetes_y ,cv=10)
array([0.55614411, 0.23056092, 0.35357777, 0.62190498, 0.26587602,
       0.61819338, 0.41815916, 0.43515232, 0.43436983, 0.68568514])

We can do cross validation for all three of these models and look at both the mean and the standard deviation. The mean tells us how well on average each model does on different subsets of the data. The standard deviation is a measure of spread of the scores. For intuition, another measure of spread is max(cv_scores) - min(cv_scores).

regr_cv = cross_val_score(regr,diabetes_X, diabetes_y ,cv=10)
np.mean(regr_cv), np.std(regr_cv)
(0.461962361958337, 0.14698789185375888)
lasso_cv = cross_val_score(lasso,diabetes_X, diabetes_y ,cv=10)
np.mean(lasso_cv), np.std(lasso_cv)
(0.3211351084864853, 0.09122225780232662)
lasso2_cv = cross_val_score(lasso2,diabetes_X, diabetes_y ,cv=10)
np.mean(lasso2_cv), np.std(lasso2_cv)
(0.41808972445133297, 0.11927596370400496)
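The max-minus-min range mentioned above can be computed from the same score arrays; a minimal sketch using regr_cv from above:

# spread of the cross validation scores as a simple range
np.max(regr_cv) - np.min(regr_cv)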

Questions after class

What does the cv parameter in cross_val_score do?

First, let’s look at the help.

help(cross_val_score)
Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like of shape (n_samples, n_features)
        The data to fit. Can be for example a list, or an array.
    
    y : array-like of shape (n_samples,) or (n_samples, n_outputs),             default=None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like of shape (n_samples,), default=None
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).
    
    scoring : str or callable, default=None
        A str (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)`` which should return only
        a single value.
    
        Similar to :func:`cross_validate`
        but only a single metric is permitted.
    
        If None, the estimator's default scorer (if available) is used.
    
    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
    
        - None, to use the default 5-fold cross validation,
        - int, to specify the number of folds in a `(Stratified)KFold`,
        - :term:`CV splitter`,
        - An iterable yielding (train, test) splits as arrays of indices.
    
        For int/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used. These splitters are instantiated
        with `shuffle=False` so the splits will be the same across calls.
    
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.
    
        .. versionchanged:: 0.22
            ``cv`` default value if None changed from 3-fold to 5-fold.
    
    n_jobs : int, default=None
        Number of jobs to run in parallel. Training the estimator and computing
        the score are parallelized over the cross-validation splits.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    
    verbose : int, default=0
        The verbosity level.
    
    fit_params : dict, default=None
        Parameters to pass to the fit method of the estimator.
    
    pre_dispatch : int or str, default='2*n_jobs'
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:
    
            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs
    
            - An int, giving the exact number of total jobs that are
              spawned
    
            - A str, giving an expression as a function of n_jobs,
              as in '2*n_jobs'
    
    error_score : 'raise' or numeric, default=np.nan
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised.
        If a numeric value is given, FitFailedWarning is raised.
    
        .. versionadded:: 0.20
    
    Returns
    -------
    scores : ndarray of float of shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross validation.
    
    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]
    
    See Also
    ---------
    cross_validate : To run cross-validation on multiple metrics and also to
        return train scores, fit times and score times.
    
    cross_val_predict : Get predictions from each split of cross-validation for
        diagnostic purposes.
    
    sklearn.metrics.make_scorer : Make a scorer from a performance metric or
        loss function.

It does different things depending on what type of value we pass it. We passed it an int (10), so it split the data into 10 folds, trained on 9 of them and tested on the remaining one, and repeated that so each fold was used as the test set once.
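The cv parameter can also take a splitter object directly instead of an integer. For a regressor, passing KFold(n_splits=10) is equivalent to cv=10; a minimal sketch:

from sklearn.model_selection import KFold

# equivalent to cv=10 for a regression estimator (KFold with no shuffling)
cross_val_score(regr, diabetes_X, diabetes_y, cv=KFold(n_splits=10))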

For classification, it can take into consideration the target value (classes) and additionally specified groups in the data. A good visualization of what it does is shown in the sklearn docs.

It uses StratifiedKFold for classification, but since we're doing regression it will use KFold. train_test_split uses ShuffleSplit by default, so let's load that too to see what it does.

Warning

The key in the following is to get the concepts, not all of the details of how I evaluate and visualize. I could have made figures separately to explain the concept, but I like to show that Python is self-contained.

from sklearn.model_selection import KFold, ShuffleSplit
kf = KFold(n_splits = 10)

When we use the split method it gives us a generator.

kf.split(diabetes_X, diabetes_y)
<generator object _BaseKFold.split at 0x7fd98e8c5550>
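Since it is a generator, we could peek at just the first split with next; a minimal sketch:

# indices for the first fold only: 397 training and 45 test samples,
# matching the per-split counts shown further below
train_idx, test_idx = next(kf.split(diabetes_X, diabetes_y))
len(train_idx), len(test_idx)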

We can use this in a loop to get the list of indices that will be used to get the test and train data for each fold. To visualize what this is doing, see below.

N_samples = len(diabetes_y)
kf_tt_df = pd.DataFrame(index=list(range(N_samples)))
i = 1
for train_idx, test_idx in kf.split(diabetes_X, diabetes_y):
    kf_tt_df['split ' + str(i)] = ['unused']*N_samples
    kf_tt_df.loc[train_idx, 'split ' + str(i)] = 'Train'
    kf_tt_df.loc[test_idx, 'split ' + str(i)] = 'Test'
    i +=1

We can count how many times ‘Test’ and ‘Train’ appear

count_test = lambda part: len([v for v in part if v=='Test'])
count_train = lambda part: len([v for v in part if v=='Train'])

When we apply these along axis=1, we can check how many times each sample is used. Each sample should appear in exactly 1 test set

sum(kf_tt_df.apply(count_test,axis = 1) ==1)
442

and in exactly 9 training sets

sum(kf_tt_df.apply(count_train,axis = 1) ==9)
442

Calling describe on these counts would be another way to confirm that all of the values are exactly what we expect.

We can also visualize:

cmap = sns.color_palette("tab10",10)
g = sns.heatmap(kf_tt_df.replace({'Test':1,'Train':0}),cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])
(Figure: heatmap with one column per KFold split and one row per sample, colored by whether the sample is in the training or test set.)

Note that unlike train_test_split, KFold does not shuffle the data before splitting it by default.
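If we did want shuffled folds, KFold accepts shuffle and random_state parameters; a minimal sketch (kf_shuffled is a new name used only here):

# shuffle the samples once before carving them into 10 folds
kf_shuffled = KFold(n_splits=10, shuffle=True, random_state=0)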

If we apply those lambda functions along axis=0, we can see the size of each test set

kf_tt_df.apply(count_test,axis = 0)
split 1     45
split 2     45
split 3     44
split 4     44
split 5     44
split 6     44
split 7     44
split 8     44
split 9     44
split 10    44
dtype: int64

and training set:

kf_tt_df.apply(count_train,axis = 0)
split 1     397
split 2     397
split 3     398
split 4     398
split 5     398
split 6     398
split 7     398
split 8     398
split 9     398
split 10    398
dtype: int64

We can verify that these splits are about the same size as what train_test_split gives with the right settings. 10-fold cross validation splits the data into 10 parts and tests on 1, which corresponds to a test size of 1/10 = .1, so we can use train_test_split with test_size=.1 and check the lengths.

X_train2,X_test2, y_train2,y_test2 = train_test_split(diabetes_X, diabetes_y ,
                                                  test_size=.1,random_state=0)

[len(split) for split in [X_train2,X_test2,]]

Under the hood, train_test_split uses ShuffleSplit. We can do a similar experiment as above to see what ShuffleSplit does.

skf = ShuffleSplit(10)
N_samples = len(diabetes_y)
ss_tt_df = pd.DataFrame(index=list(range(N_samples)))
i = 1
for train_idx, test_idx in skf.split(diabetes_X, diabetes_y):
    ss_tt_df['split ' + str(i)] = ['unused']*N_samples
    ss_tt_df.loc[train_idx, 'split ' + str(i)] = 'Train'
    ss_tt_df.loc[test_idx, 'split ' + str(i)] = 'Test'
    i +=1

ss_tt_df
split 1 split 2 split 3 split 4 split 5 split 6 split 7 split 8 split 9 split 10
0 Train Train Train Train Train Train Train Train Train Train
1 Train Train Train Test Train Train Train Train Train Train
2 Train Train Train Train Train Train Train Train Train Train
3 Train Train Train Train Train Train Train Train Train Train
4 Train Train Train Train Train Train Train Train Train Train
... ... ... ... ... ... ... ... ... ... ...
437 Train Train Train Train Test Train Train Train Test Train
438 Test Train Train Train Train Train Train Train Train Train
439 Train Train Train Train Train Train Train Train Train Train
440 Train Train Train Train Train Train Train Train Train Train
441 Train Train Train Train Test Train Train Train Train Test

442 rows × 10 columns

And plot

cmap = sns.color_palette("tab10",10)
g = sns.heatmap(ss_tt_df.replace({'Test':1,'Train':0}),cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])
(Figure: heatmap with one column per ShuffleSplit split and one row per sample, colored by whether the sample is in the training or test set.)

Now, we see the samples in each training set (gray along the columns) are a random subset of all of the samples.

And check the usage of each sample. Unlike with KFold, not every sample appears in exactly one test set:

sum(ss_tt_df.apply(count_test,axis = 1) ==1)
172

and, since every split uses each sample for either training or testing, those same 172 samples are the ones that appear in exactly 9 training sets

sum(ss_tt_df.apply(count_train,axis = 1) ==9)
172
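To see the whole distribution of how many test sets each sample lands in, we could tabulate the counts; a small sketch (the exact distribution depends on the random splits):

# how many samples were tested 0 times, 1 time, 2 times, ...
ss_tt_df.apply(count_test, axis=1).value_counts()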

And the size of the splits

ss_tt_df.apply(count_test,axis = 0)
split 1     45
split 2     45
split 3     45
split 4     45
split 5     45
split 6     45
split 7     45
split 8     45
split 9     45
split 10    45
dtype: int64

and training set:

ss_tt_df.apply(count_train,axis = 0)
split 1     397
split 2     397
split 3     397
split 4     397
split 5     397
split 6     397
split 7     397
split 8     397
split 9     397
split 10    397
dtype: int64

Again, the sizes are essentially the same as with KFold: 45 test samples and 397 training samples in every split.