31. ML Task Review Cross Validation#

31.1. Relationship between Tasks#

We learned classification first because it shares similarities with both regression and clustering, while regression and clustering have less in common with each other.

Classification is supervised learning for a categorical target.
Regression is supervised learning for a continuous target. Clustering is unsupervised learning for a categorical target.

Sklearn provides a nice flow chart for thinking through this.

[Figure: sklearn estimator flow chart]

Predicting a category is another way of saying categorical target. Predicting a quantity is another way of saying continuous target. Having labels or not is the difference between supervised and unsupervised learning.

The flowchart assumes you know what you want to do with data, and that is the ideal scenario: you have a dataset and you have a goal. To give you practice with a variety of things, in this course we ask you to start with a task and then find a dataset. Assignment 9 is the last time that’s true, however. Starting with Assignment 10 and the last portfolios, you can choose a specific application domain to focus on and then choose the right task from there.

However, you can also use this information to move between tasks for a given dataset. For example, you can use the same data for clustering as you did for classification. Switching the task changes the questions, though: classification evaluation tells us how separable the classes are given that classifier’s decision rule. Clustering can find other subgroups or the same ones, so the evaluation we choose allows us to explore this in more ways.

Regression requires a continuous target, so we need a dataset that is suitable for that; we can’t transform a classification dataset into a regression one.
However, we can go the other way, and that’s how some classification datasets are created.

The UCI Adult Dataset is a popular ML dataset that was derived from census data. The goal is to use a variety of features to predict whether a person makes more than $50k per year. While income is a continuous value, the dataset’s creators applied a threshold ($50k) to it to make a binary variable. The dataset does not include income in dollars, only the binary indicator.
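As a minimal sketch of going from a continuous target to a categorical one, the thresholding step can be written in one line of pandas. The income values here are made up for illustration and are not taken from the Adult data.

```python
import pandas as pd

# hypothetical incomes in dollars (illustrative values only)
income = pd.Series([32_000, 51_500, 50_000, 78_000], name='income')
threshold = 50_000

# a continuous regression target becomes a binary classification target
over_50k = income > threshold
print(over_50k.tolist())  # [False, True, False, True]
```

Once only the binary column is kept, the original dollar amounts cannot be recovered, which is why the transformation only works in this direction.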

Further Reading

Recent work reconstructed the dataset with the continuous-valued income. Their repository contains the data as well as links to their paper and a video of their talk on it.

31.2. Cross Validation#

This week our goal is to learn how to optimize models. The first step in that is to get a good estimate of a model’s performance.

We have seen that the test train splits, which are random, influence the performance.

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn import metrics

We’ll use the Iris data with a decision tree.

iris_df = sns.load_dataset('iris')

iris_X = iris_df.drop(columns=['species'])
iris_y = iris_df['species']
dt = tree.DecisionTreeClassifier()

We can split the data, fit the model, then compute a score, but since the splitting is a randomized step, the score is a random variable.

For example, suppose we have a coin and we want to see if it’s fair or not. We would flip it to test. One flip doesn’t tell us much, but if we flip it a few times, we can estimate the probability it is heads by counting how many of the flips are heads and dividing by the total number of flips.

We can do something similar with our model performance. We can split the data a bunch of times and compute the score each time.
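A minimal sketch of that idea, written by hand: we repeat the random split with different seeds and collect the scores. To keep the block self-contained it loads Iris from sklearn rather than seaborn, and the choice of 20 repetitions and the fixed random_state on the tree are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(20):
    # a different seed gives a different random train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
# the mean is our estimate; the standard deviation shows how much one split can vary
print(scores.mean(), scores.std())
```

Like the coin flips, any single score is noisy, but the average over many splits is a more stable estimate.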

cross_val_score does this all for us.

It takes an estimator object and the data.

By default it uses 5-fold cross validation. It splits the data into 5 sections, then uses 4 of them to train and one to test. It then iterates through so that each section gets used for testing.

cross_val_score(dt,iris_X,iris_y)
array([0.96666667, 0.96666667, 0.9       , 0.93333333, 1.        ])

We get back a score for each section or “fold” of the data. We can average those to get a single estimate.

np.mean(cross_val_score(dt,iris_X,iris_y))
0.9533333333333334

We can use more folds.

np.mean(cross_val_score(dt,iris_X,iris_y,cv=10))
0.9533333333333334

We can peek inside what this actually does to see it more clearly.

31.3. What Does Cross validation really do?#

cross_val_score uses StratifiedKFold for classification, but since the example below is a regression problem it will use KFold. train_test_split uses ShuffleSplit by default, so let’s load that too to see what it does.
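For the classification case, here is a sketch of the loop that cross_val_score runs for us. It assumes the default 5 folds and a decision tree with a fixed random_state so that the manual and automatic scores can be compared exactly; the data is loaded from sklearn to keep the block self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # fit a fresh copy of the estimator on the training folds
    model = clone(dt)
    model.fit(X[train_idx], y[train_idx])
    # score it on the one held-out fold
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(dt, X, y, cv=5)
print(np.allclose(manual_scores, auto_scores))
```

Each fold trains a separate copy of the model, so the estimator object we pass in is never fit itself.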

Warning

The key in the following is to get the concepts not all of the details in how I evaluate and visualize. I could have made figures separately to explain the concept, but I like to show that Python is self contained.

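from sklearn import datasets
from sklearn.model_selection import KFold, ShuffleSplit
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)  # regression data used below
kf = KFold(n_splits=10)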

When we use the split method it gives us a generator.

kf.split(diabetes_X, diabetes_y)

We can use this in a loop to get the list of indices that will be used to get the test and train data for each fold. To visualize what this is doing, see below.

N_samples = len(diabetes_y)
kf_tt_df = pd.DataFrame(index=list(range(N_samples)))
# label each sample as 'Train' or 'Test' for every split
for i, (train_idx, test_idx) in enumerate(kf.split(diabetes_X, diabetes_y), start=1):
    col = 'split ' + str(i)
    kf_tt_df[col] = 'unused'
    kf_tt_df.loc[train_idx, col] = 'Train'
    kf_tt_df.loc[test_idx, col] = 'Test'

We can count how many times ‘Test’ and ‘Train’ appear:

count_test = lambda part: len([v for v in part if v=='Test'])
count_train = lambda part: len([v for v in part if v=='Train'])

When we apply this along axis=1, we can see how many times each sample is used and check that each sample is in exactly 1 test set:

sum(kf_tt_df.apply(count_test,axis = 1) ==1)

and in exactly 9 training sets:

sum(kf_tt_df.apply(count_train,axis = 1) ==9)

If both sums equal N_samples, then every sample is in exactly 1 test set and exactly 9 training sets.

We can also visualize:

cmap = sns.color_palette("tab10",10)
# 1 marks a test sample and 0 a training sample in each split
g = sns.heatmap((kf_tt_df == 'Test').astype(int),cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])

Note that unlike train_test_split, KFold does not shuffle the data before splitting by default.
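A small sketch of the difference, using 10 toy samples so that the indices are easy to read. The shuffle and random_state arguments are real KFold parameters; the toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# default: consecutive blocks of the data become the test folds
ordered_tests = [test.tolist() for _, test in KFold(n_splits=5).split(X)]
print(ordered_tests)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True: the indices are permuted before being cut into folds
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_tests = [test.tolist() for _, test in shuffled.split(X)]
print(shuffled_tests)
```

Either way every sample lands in exactly one test fold; shuffling matters most when the rows are sorted, for example by class.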

If we apply those lambda functions along axis=0, we can see the size of each test set

kf_tt_df.apply(count_test,axis = 0)

and training set:

kf_tt_df.apply(count_train,axis = 0)

We can verify that these splits are the same size as what train_test_split produces with the right settings. 10-fold splits the data into 10 parts and tests on 1, which makes a test size of 1/10 = 0.1, so we can call train_test_split with test_size=.1 and check the lengths.

X_train2,X_test2, y_train2,y_test2 = train_test_split(diabetes_X, diabetes_y ,
                                                  test_size=.1,random_state=0)

[len(split) for split in [X_train2,X_test2,]]

Under the hood, train_test_split uses ShuffleSplit. We can do a similar experiment as above to see what ShuffleSplit does.

ss = ShuffleSplit(n_splits=10)
N_samples = len(diabetes_y)
ss_tt_df = pd.DataFrame(index=list(range(N_samples)))
# label each sample as 'Train' or 'Test' for every split
for i, (train_idx, test_idx) in enumerate(ss.split(diabetes_X, diabetes_y), start=1):
    col = 'split ' + str(i)
    ss_tt_df[col] = 'unused'
    ss_tt_df.loc[train_idx, col] = 'Train'
    ss_tt_df.loc[test_idx, col] = 'Test'

ss_tt_df

And plot

cmap = sns.color_palette("tab10",10)
# 1 marks a test sample and 0 a training sample in each split
g = sns.heatmap((ss_tt_df == 'Test').astype(int),cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])

31.4. Cross validation with clustering#

We can use any estimator object here.

km = KMeans(n_clusters=3)
cross_val_score(km,iris_X)
array([ -9.062     , -14.93195873, -18.93234207, -23.70894258,
       -19.55457726])
km.fit(iris_X).score(iris_X)

The KMeans score method needs data to evaluate (calling km.score() with no arguments raises a TypeError). It returns the opposite of the K-means objective on that data, so values closer to 0 indicate tighter clusters.
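The negative numbers returned for K-means above can be reproduced by hand. This sketch assumes a single random train/test split and fixes n_init and random_state for repeatability; it checks that the score is the negative sum of squared distances from each sample to its nearest cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# on the training data the score matches the negative of the fitted inertia
train_score = km.score(X_train)
print(np.isclose(train_score, -km.inertia_))

# on held-out data: transform gives distances to every center, keep the nearest
held_out_score = km.score(X_test)
manual_score = -np.sum(km.transform(X_test).min(axis=1) ** 2)
print(np.isclose(held_out_score, manual_score))
```

Because the score is a sum over samples, its size depends on how many samples are in the fold as well as on how tight the clusters are.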

31.5. Grid Search Optimization#

We can also optimize a model by comparing different parameter settings.

A simple way to do this is to fit the model for different parameters and score for each and compare.
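A sketch of that simple approach, done by hand with cross_val_score. The list of depths and the fixed random_state are assumptions for illustration, and the data is loaded from sklearn so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

mean_scores = {}
for depth in [2, 3, 4, 5]:
    # one model per parameter setting, each scored with 5-fold cross validation
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    mean_scores[depth] = cross_val_score(dt, X, y).mean()

# compare the settings and keep the one with the highest mean score
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_depth)
```

This works, but we have to track the scores and refit the winning model ourselves.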


from sklearn.model_selection import GridSearchCV

The GridSearchCV object is constructed first and requires an estimator object and a dictionary that describes the parameter grid to search over. The dictionary has the parameter names as the keys and the values are the values for that parameter to test.

The fit method on the Grid Search object fits all of the separate models.

In this case, we will optimize the depth of this Decision Tree.

param_grid = {'max_depth':[2,3,4,5]}
dt_opt = GridSearchCV(dt,param_grid)
dt_opt.fit(iris_X,iris_y)
GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [2, 3, 4, 5]})

Then we can look at the output.

dt_opt.cv_results_
{'mean_fit_time': array([0.00185766, 0.001791  , 0.00178885, 0.00179324]),
 'std_fit_time': array([1.59000176e-04, 2.61296321e-05, 1.73072840e-05, 2.34260540e-05]),
 'mean_score_time': array([0.00122561, 0.0011929 , 0.00116849, 0.00119224]),
 'std_score_time': array([6.78506939e-05, 2.81352511e-05, 1.85231443e-05, 3.18513659e-05]),
 'param_max_depth': masked_array(data=[2, 3, 4, 5],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 2},
  {'max_depth': 3},
  {'max_depth': 4},
  {'max_depth': 5}],
 'split0_test_score': array([0.93333333, 0.96666667, 0.96666667, 0.96666667]),
 'split1_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.96666667]),
 'split2_test_score': array([0.9       , 0.93333333, 0.9       , 0.9       ]),
 'split3_test_score': array([0.86666667, 0.93333333, 0.93333333, 0.96666667]),
 'split4_test_score': array([1., 1., 1., 1.]),
 'mean_test_score': array([0.93333333, 0.96      , 0.95333333, 0.96      ]),
 'std_test_score': array([0.04714045, 0.02494438, 0.03399346, 0.03265986]),
 'rank_test_score': array([4, 2, 3, 1], dtype=int32)}

We note that this is a dictionary, so to make it more readable, we can make it a DataFrame.

pd.DataFrame(dt_opt.cv_results_)
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.001858 0.000159 0.001226 0.000068 2 {'max_depth': 2} 0.933333 0.966667 0.900000 0.866667 1.0 0.933333 0.047140 4
1 0.001791 0.000026 0.001193 0.000028 3 {'max_depth': 3} 0.966667 0.966667 0.933333 0.933333 1.0 0.960000 0.024944 2
2 0.001789 0.000017 0.001168 0.000019 4 {'max_depth': 4} 0.966667 0.966667 0.900000 0.933333 1.0 0.953333 0.033993 3
3 0.001793 0.000023 0.001192 0.000032 5 {'max_depth': 5} 0.966667 0.966667 0.900000 0.966667 1.0 0.960000 0.032660 1
dt
DecisionTreeClassifier()
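The full results table is wide, so it can help to keep only a few columns and sort by rank. This is a sketch that refits the same search on Iris loaded from sklearn; the fixed random_state and the choice of columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt_opt = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 3, 4, 5]})
dt_opt.fit(X, y)

results = pd.DataFrame(dt_opt.cv_results_)
columns = ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')
print(summary)

# the winning setting and its mean cross-validated score
print(dt_opt.best_params_, dt_opt.best_score_)
```

Looking at std_test_score next to mean_test_score shows whether the difference between settings is large compared to the fold-to-fold variation.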

31.6. Questions After Class#

31.6.1. Do we have to do anything to pick the highest ranking model from the GridSearchCV function?#

No, we can use it directly. For example:

import matplotlib.pyplot as plt

plt.figure(figsize=(15,20))
tree.plot_tree(dt_opt.best_estimator_, rounded=True,
      class_names=list(dt_opt.best_estimator_.classes_),  # the three iris species
      proportion=True, filled=True, impurity=False, fontsize=10);
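Using it directly also applies to predicting and scoring. With the default refit=True, the search object refits the best setting on all of the data passed to fit, and its predict and score methods use that refit model. A sketch, assuming a held-out test set made with train_test_split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dt_opt = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 3, 4, 5]})
dt_opt.fit(X_train, y_train)

# predicting through the search object is the same as using best_estimator_
same_predictions = (dt_opt.predict(X_test)
                    == dt_opt.best_estimator_.predict(X_test)).all()
test_accuracy = dt_opt.score(X_test, y_test)
print(same_predictions, test_accuracy)
```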

31.6.2. Is there anything similar to a gridsearch object for different estimators, where it can try different methods of estimation and rank them?#

No, you would use multiple grid searches (or a similar model optimizer with a different search strategy), one for each model. Each model class / estimator object has its own parameters, so each one needs its own parameter grid.
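One way to organize that is a loop over estimators, each paired with its own grid. This is only a sketch: the candidates dictionary, the choice of Gaussian Naive Bayes as the second model, and the grid values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# each estimator gets a grid written in terms of its own parameter names
candidates = {
    'decision tree': (DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 3, 4, 5]}),
    'gaussian naive bayes': (GaussianNB(),
                             {'var_smoothing': [1e-9, 1e-6, 1e-3]}),
}

best_scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid)
    search.fit(X, y)
    best_scores[name] = search.best_score_

print(best_scores)
```

Since every search uses the same default 5-fold cross validation on the same data, the best scores can be compared across models.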

31.6.3. I would like to learn how to apply cross validation and especially program optimization to unsupervised clustering models.#

It would look a lot like what we did with the decision tree, but we use the right parameter name, for example:

km = KMeans()
param_grid = {'n_clusters': list(range(2,8))}
km_opt = GridSearchCV(km,param_grid)
km_opt.fit(iris_X)
GridSearchCV(estimator=KMeans(), param_grid={'n_clusters': [2, 3, 4, 5, 6, 7]})
pd.DataFrame(km_opt.cv_results_)
mean_fit_time std_fit_time mean_score_time std_score_time param_n_clusters params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.007247 0.002357 0.001162 0.000065 2 {'n_clusters': 2} -13.417292 -19.334754 -58.409800 -56.347656 -54.802470 -40.462394 19.788398 6
1 0.008210 0.000141 0.001154 0.000030 3 {'n_clusters': 3} -9.062000 -14.931959 -18.932342 -23.708943 -19.554577 -17.237964 4.945204 5
2 0.009823 0.000376 0.001202 0.000036 4 {'n_clusters': 4} -9.062000 -11.602296 -13.714817 -17.596427 -15.673617 -13.529831 2.994800 4
3 0.011454 0.000734 0.001220 0.000014 5 {'n_clusters': 5} -9.062000 -10.723786 -10.996277 -17.596427 -15.673617 -12.810421 3.249602 3
4 0.012644 0.000161 0.001230 0.000007 6 {'n_clusters': 6} -9.062000 -8.024337 -10.996277 -12.698270 -11.476818 -10.451540 1.686290 2
5 0.013671 0.000281 0.001206 0.000021 7 {'n_clusters': 7} -9.062000 -7.158558 -9.066973 -12.183027 -11.476818 -9.789475 1.819295 1
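In the results above the best rank goes to the largest n_clusters. That is expected with the default score, because the K-means objective generally improves as more clusters are added. One alternative, sketched here under some assumptions, is to pass a different scoring function. The silhouette_scorer helper is hypothetical (written for this example), and the folds are shuffled because Iris is sorted by species, so an unshuffled test fold could fall into a single cluster, where the silhouette score is undefined.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV, KFold

X, _ = load_iris(return_X_y=True)

def silhouette_scorer(estimator, X, y=None):
    """Silhouette score of the estimator's cluster assignments on X."""
    return silhouette_score(X, estimator.predict(X))

km = KMeans(n_init=10, random_state=0)
param_grid = {'n_clusters': list(range(2, 8))}
folds = KFold(n_splits=5, shuffle=True, random_state=0)

km_sil = GridSearchCV(km, param_grid, scoring=silhouette_scorer, cv=folds)
km_sil.fit(X)
print(km_sil.best_params_, km_sil.best_score_)
```

The silhouette score does not automatically reward adding clusters, so it can be a more informative criterion for choosing n_clusters than the default score.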

31.6.4. Is it better to split the data in more folds when using the cross-validation?#

This is a tricky question; we’ll revisit it in class on Wednesday.

31.6.5. What is this “model” we are training? What are the scores, scoring?#

In this example, the model was a decision tree at the beginning and later K-means. The score describes how well the fit model works on the held-out data: accuracy for the decision tree and, for K-means, a general fit statistic (the opposite of the K-means objective).

The model vs algorithm section in the introduction of the Model Based ML book (free) is a good thing to read to clarify these relationships.

sklearn provides a flowchart for choosing among its different estimator objects. In sklearn, each model is implemented as an estimator object; more specifically, there is a BaseEstimator class that the other estimators inherit from. For example, the decision tree source shows that DecisionTreeClassifier inherits from ClassifierMixin and BaseDecisionTree, which in turn inherits from BaseEstimator.
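We can check these relationships from Python without reading the source. A short sketch; the particular estimators and parameter values are just examples.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, ClusterMixin
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3)
km = KMeans(n_clusters=3)

# both are estimators; the mixin marks which task each one is for
print(isinstance(dt, BaseEstimator), isinstance(dt, ClassifierMixin))  # True True
print(isinstance(km, BaseEstimator), isinstance(km, ClusterMixin))     # True True

# BaseEstimator provides get_params, which is how GridSearchCV finds parameter names
print(dt.get_params()['max_depth'])  # 3
```

The shared base class is the reason cross_val_score and GridSearchCV can accept any estimator object.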

| Term | Definition | Example |
|---|---|---|
| task | the type of algorithm that we will use machine learning to write | classification, regression, clustering |
| model | the specific form and set of assumptions that will be used in the algorithm | decision tree (classification), Gaussian Naive Bayes (classification), linear regression, sparse regression/LASSO, K-means, spectral clustering, etc. |
| score | a measure of how well the model completes the task | accuracy (for classification), mean squared error (for regression), silhouette score (for clustering) |

Also review the intro to models in the machine learning class notes.