ML Task Review and Cross Validation

20. ML Task Review and Cross Validation

20.1. Relationship between Tasks

We learned classification first because it shares similarities with both regression and clustering, while regression and clustering have less in common with each other.

Classification is supervised learning for a categorical target.
Regression is supervised learning for a continuous target. Clustering is unsupervised learning for a categorical target.

Sklearn provides a nice flow chart for thinking through this.

Figure: scikit-learn estimator flow chart

Predicting a category is another way of saying categorical target. Predicting a quantity is another way of saying continuous target. Having labels or not is the difference between supervised and unsupervised learning.

The flowchart assumes you know what you want to do with data, and that is the ideal scenario: you have a dataset and you have a goal. To give you practice with a variety of things, in this course we ask you to start with a task and then find a dataset. Assignment 9 is the last time that is true, however. Starting with Assignment 10 and the last portfolios, you can choose a specific application domain to focus on and then choose the right task from there.

You can, however, use this information to move between tasks for a given dataset. For example, you can use the same data for clustering as you did for classification. Switching the task changes the questions, though: classification evaluation tells us how separable the classes are given that classifier's decision rule. Clustering can find other subgroups or the same ones, so the evaluation we choose allows us to explore this in more ways.
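As a minimal sketch of moving between tasks on the same data, the example below clusters the iris features without using the labels and only afterwards compares the clusters to the species. It assumes the copy of iris bundled with scikit-learn (`datasets.load_iris`) rather than the seaborn copy used later, and the choice of `adjusted_rand_score` as the comparison is just one reasonable option.

```python
from sklearn import cluster, datasets, metrics

# load the features and the species labels (bundled copy of iris)
iris_X, iris_y = datasets.load_iris(return_X_y=True)

# clustering ignores the labels when fitting
km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km.fit_predict(iris_X)

# compare the discovered groups to the species labels after the fact
ari = metrics.adjusted_rand_score(iris_y, cluster_labels)
print(ari)
```

A value near 1 would mean the clusters match the species closely; a value near 0 would mean the grouping is about as good as random.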

Regression requires a continuous target, so we need a dataset that is suitable for that; we can't transform a classification dataset into a regression one.
However, we can go the other way, and that's how some classification datasets are created.

The UCI Adult Dataset is a popular ML dataset that was derived from census data. The goal is to use a variety of features to predict whether a person makes more than $50k per year or not. While income is a continuous value, they applied a threshold ($50k) to it to make a binary variable. The dataset does not include income in dollars, only the binary indicator.
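The thresholding step can be sketched in a couple of lines. The income values below are made up for illustration; they are not taken from the Adult dataset, which does not include raw income.

```python
import pandas as pd

# hypothetical incomes in dollars (made-up values for illustration only)
income = pd.Series([23_000, 48_500, 50_000, 75_250, 120_000], name='income')

# apply the threshold to turn the continuous value into a binary target
over_50k = (income > 50_000).astype(int)
print(over_50k.tolist())
```

Once only `over_50k` is kept, the original dollar amounts cannot be recovered, which is why the transformation only goes in one direction.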

20.2. Cross Validation

This week our goal is to learn how to optimize models. The first step in that is to get a good estimate of a model's performance.

We have seen that the train test splits, which are random, influence the performance.

# basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# models classes
from sklearn import tree
from sklearn import cluster
from sklearn import svm
# datasets
from sklearn import datasets
# model selection tools
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

We’ll use the Iris data with a decision tree.

iris_df = sns.load_dataset('iris')

iris_X = iris_df.drop(columns=['species'])
iris_y = iris_df['species']

dt =tree.DecisionTreeClassifier()

We can split the data, fit the model, then compute a score, but since the splitting is a randomized step, the score is a random variable.

iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X,iris_y)
dt.fit(iris_X_train,iris_y_train)
dt.score(iris_X_test,iris_y_test)

0.8947368421052632

Since it is random, if we repeat this, we will generally get a different value

iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X,iris_y)
dt.fit(iris_X_train,iris_y_train)
dt.score(iris_X_test,iris_y_test)

0.9473684210526315
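To see the spread rather than just two values, a sketch like the following repeats the split with several seeds. It assumes the bundled scikit-learn copy of iris and fixes the tree's own `random_state` so that the only thing changing is the split.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris_X, iris_y = datasets.load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # a different seed gives a different random split
    X_train, X_test, y_train, y_test = train_test_split(
        iris_X, iris_y, random_state=seed)
    dt = tree.DecisionTreeClassifier(random_state=0)
    dt.fit(X_train, y_train)
    scores.append(dt.score(X_test, y_test))

scores = np.array(scores)
print(scores.min(), scores.max(), scores.mean())
```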

This is like checking whether a coin is fair. We would flip it to test. One flip doesn't tell us much, but if we flip it repeatedly, we can estimate the probability that it lands heads by counting how many of the flips are heads and dividing by the number of flips.

We can do something similar with our model performance. We can split the data a bunch of times and compute the score each time.

cross_val_score does this all for us.

It takes an estimator object and the data.

By default it uses 5-fold cross validation. It splits the data into 5 sections, then uses 4 of them to train and one to test. It then iterates through so that each section gets used for testing.
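The description above can be written out by hand. This is a sketch of the idea, not scikit-learn's actual implementation: it assumes the bundled iris data and uses `StratifiedKFold`, which is the splitter `cross_val_score` picks for a classifier when `cv` is an integer or left at its default.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris_X, iris_y = datasets.load_iris(return_X_y=True)
dt = tree.DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(iris_X, iris_y):
    fold_model = clone(dt)  # fresh, unfitted copy for every fold
    fold_model.fit(iris_X[train_idx], iris_y[train_idx])
    manual_scores.append(fold_model.score(iris_X[test_idx], iris_y[test_idx]))

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(dt, iris_X, iris_y, cv=5)
print(manual_scores)
print(auto_scores)
```

With the tree's `random_state` fixed, the two arrays should agree, which is a useful check that the loop really does what the prose says.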

cross_val_score(dt, iris_X_train,iris_y_train)

array([0.95652174, 0.91304348, 0.95454545, 1.        , 0.95454545])

We will still use the test train split to keep our test data separate from the data that we use to find our preferred parameters.

We get back a score for each section or “fold” of the data. We can average those to get a single estimate.

cross_val_score(dt, iris_X_train,iris_y_train).mean()

0.9557312252964426

We can change it to 10-fold.

cross_val_score(dt, iris_X_train,iris_y_train,cv=10)

array([0.91666667, 0.91666667, 0.81818182, 1.        , 1.        ,
       0.90909091, 1.        , 1.        , 1.        , 0.90909091])

cross_val_score(dt, iris_X_train,iris_y_train,cv=10).mean()

0.9469696969696969

20.3. What Does Cross validation really do?

Important

This is extra detail that was not presented in class.

cross_val_score uses StratifiedKFold for classification. The example in this section uses a regression dataset (diabetes), for which it would use KFold. train_test_split uses ShuffleSplit by default, so let's load that too to see what it does.
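A small sketch of why the splitter matters for classification: with labels that are sorted by class, plain `KFold` (which does not shuffle by default) can produce test folds that contain a single class, while `StratifiedKFold` keeps every class in every fold. The toy labels here are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# toy labels sorted by class: 30 samples, 3 classes of 10 each (made-up data)
y = np.repeat([0, 1, 2], 10)
X = np.zeros((len(y), 1))  # the features do not matter for splitting

# which classes appear in each test fold?
kf_classes = [np.unique(y[test]).tolist()
              for _, test in KFold(n_splits=3).split(X, y)]
skf_classes = [np.unique(y[test]).tolist()
               for _, test in StratifiedKFold(n_splits=3).split(X, y)]
print(kf_classes)
print(skf_classes)
```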

Warning

The key in the following is to get the concepts not all of the details in how I evaluate and visualize. I could have made figures separately to explain the concept, but I like to show that Python is self contained.

from sklearn.model_selection import KFold, ShuffleSplit

# assumption: diabetes_X and diabetes_y are sklearn's diabetes regression data,
# as the variable names suggest; they were not defined earlier in this notebook
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

kf = KFold(n_splits = 10)

When we use the split method it gives us a generator.

kf.split(diabetes_X, diabetes_y)

We can use this in a loop to get the list of indices that will be used to get the test and train data for each fold. To visualize what this is doing, see below.

N_samples = len(diabetes_y)
kf_tt_df = pd.DataFrame(index=list(range(N_samples)))
i = 1
for train_idx, test_idx in kf.split(diabetes_X, diabetes_y):
    col = 'split ' + str(i)
    kf_tt_df[col] = 'unused'
    # .loc avoids chained assignment; labels equal positions for this index
    kf_tt_df.loc[train_idx, col] = 'Train'
    kf_tt_df.loc[test_idx, col] = 'Test'
    i +=1

We can count how many times ‘Test’ and ‘Train’ appear

count_test = lambda part: len([v for v in part if v=='Test'])
count_train = lambda part: len([v for v in part if v=='Train'])

When we apply this along axis=1, we can count the samples that are used in exactly 1 test set; this should equal the total number of samples.

sum(kf_tt_df.apply(count_test,axis = 1) ==1)

and exactly 9 training sets

sum(kf_tt_df.apply(count_train,axis = 1) ==9)

the describe helps ensure that all fo the values are exa

We can also visualize:

cmap = sns.color_palette("tab10",10)
g = sns.heatmap(kf_tt_df.replace({'Test':1,'Train':0}),cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])

Note that, unlike train_test_split, KFold does not shuffle the data before splitting unless we ask it to.

If we apply those lambda functions along axis=0, we can see the size of each test set

kf_tt_df.apply(count_test,axis = 0)

and training set:

kf_tt_df.apply(count_train,axis = 0)

We can verify that these splits are the same size as what train_test_split produces with the right settings. 10-fold cross validation splits the data into 10 parts and tests on 1, which makes a test size of 1/10 = .1, so we can call train_test_split with test_size=.1 and check the lengths.

X_train2,X_test2, y_train2,y_test2 = train_test_split(diabetes_X, diabetes_y ,
                                                  test_size=.1,random_state=0)

[len(split) for split in [X_train2,X_test2,]]
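A self-contained version of this size check, as a sketch: it loads the diabetes data from `sklearn.datasets` (an assumption about where `diabetes_X` and `diabetes_y` come from) and compares the test-fold sizes from `KFold` with the test-set size from one 90/10 split.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, train_test_split

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# test-fold sizes from 10-fold cross validation
fold_sizes = [len(test_idx)
              for _, test_idx in KFold(n_splits=10).split(diabetes_X)]

# test-set size from a single 90/10 split
X_train, X_test, y_train, y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=.1, random_state=0)

print(fold_sizes)
print(len(X_train), len(X_test))
```

Because the number of samples is not a multiple of 10, the folds cannot all be the same size, so the sizes differ by at most one.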

Under the hood, train_test_split uses ShuffleSplit. We can do a similar experiment as above to see what ShuffleSplit does.

ss = ShuffleSplit(10)
N_samples = len(diabetes_y)
ss_tt_df = pd.DataFrame(index=list(range(N_samples)))
i = 1
for train_idx, test_idx in ss.split(diabetes_X, diabetes_y):
    col = 'split ' + str(i)
    ss_tt_df[col] = 'unused'
    # .loc avoids chained assignment; labels equal positions for this index
    ss_tt_df.loc[train_idx, col] = 'Train'
    ss_tt_df.loc[test_idx, col] = 'Test'
    i +=1

ss_tt_df

And plot

cmap = sns.color_palette("tab10",10)
g = sns.heatmap(ss_tt_df.replace({'Test':1,'Train':0}),cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])

20.4. Cross validation with clustering

We can use any estimator object here.

km = cluster.KMeans(n_clusters=3)

cross_val_score(km, iris_X_train,)

array([-11.46929911, -12.40230398, -14.43061587,  -8.73932888,
        -9.72014715])

The score method needs data to evaluate, and the estimator has to be fit first:

km.fit(iris_X_train)
km.score(iris_X_test)
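The negative values make more sense once we look at what `KMeans.score` computes. As a sketch (assuming the bundled iris data and fixed `n_init` and `random_state` for repeatability): on the training data, the score is the opposite of the fitted model's `inertia_`, so values closer to zero mean tighter clusters.

```python
import numpy as np
from sklearn import cluster, datasets

iris_X, _ = datasets.load_iris(return_X_y=True)

km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_X)

# inertia_: sum of squared distances from each training sample to its closest center
# score(X): the opposite of that objective, evaluated on whatever X is passed in
train_score = km.score(iris_X)
print(train_score, -km.inertia_)
```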

20.5. Grid Search Optimization

We can also optimize, however, to determine the best parameter settings.

A simple way to do this is to fit the model for different parameters and score for each and compare.
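Done by hand, that comparison might look like the following sketch. It assumes the bundled iris data, and it fixes `n_init` and `random_state` so the run is repeatable; neither setting is required.

```python
import numpy as np
from sklearn import cluster, datasets
from sklearn.model_selection import cross_val_score

iris_X, _ = datasets.load_iris(return_X_y=True)

mean_scores = {}
for k in [2, 3, 4, 5, 6]:
    km_k = cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
    # default 5-fold cross validation with the estimator's own score method
    mean_scores[k] = cross_val_score(km_k, iris_X).mean()

# keep the setting with the highest mean score
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```

Writing the loop out makes clear that "best" here only means "highest mean cross-validated score" under whichever score is used.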

param_grid = {'n_clusters':[2,3,4,5,6]}
km_opt = GridSearchCV(km, param_grid,metrics.silhouette_score)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[25], line 2
      1 param_grid = {'n_clusters':[2,3,4,5,6]}
----> 2 km_opt = GridSearchCV(km, param_grid,metrics.silhouette_score)

TypeError: __init__() takes 3 positional arguments but 4 were given

The GridSearchCV object is constructed first and requires an estimator object and a dictionary that describes the parameter grid to search over. The dictionary has the parameter names as the keys and the values are the values for that parameter to test.

The fit method on the Grid Search object fits all of the separate models.

In this case we optimize over a one-dimensional “grid”: just a set of values for one parameter, the number of clusters.

param_grid = {'n_clusters':[2,3,4,5,6]}
km_opt = GridSearchCV(km, param_grid)

iris_X_train.shape

(112, 4)

km_opt.fit(iris_X_train)

GridSearchCV(estimator=KMeans(n_clusters=3),
             param_grid={'n_clusters': [2, 3, 4, 5, 6]})

Important

I still need to explore this question. A volunteer who wants to investigate it for a portfolio section is welcome to do so.

Why does passing scoring=metrics.silhouette_score not work?
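One likely explanation, offered as a starting point rather than a full answer: the `scoring` parameter expects either the name of a built-in scorer or a callable with the signature `scorer(estimator, X, y)`, while `metrics.silhouette_score` takes `(X, labels)`. The sketch below wraps it in a hypothetical helper, `silhouette_scorer`, that adapts one to the other. It assumes the bundled iris data, and the shuffled `KFold` and the fallback value of -1.0 are design choices made for this example.

```python
from sklearn import cluster, datasets, metrics
from sklearn.model_selection import GridSearchCV, KFold

iris_X, _ = datasets.load_iris(return_X_y=True)

def silhouette_scorer(estimator, X, y=None):
    """Hypothetical helper: adapt silhouette_score to the scorer signature."""
    labels = estimator.predict(X)
    if len(set(labels)) < 2:
        # silhouette is undefined for a single cluster; return the worst value
        return -1.0
    return metrics.silhouette_score(X, labels)

km = cluster.KMeans(n_init=10, random_state=0)
param_grid = {'n_clusters': [2, 3, 4, 5, 6]}
# shuffle the folds because the bundled iris data is sorted by species
cv = KFold(n_splits=5, shuffle=True, random_state=0)
km_sil = GridSearchCV(km, param_grid, scoring=silhouette_scorer, cv=cv)
km_sil.fit(iris_X)
print(km_sil.best_params_, km_sil.best_score_)
```

Unlike the default KMeans score, the silhouette score does not automatically improve as clusters are added, so the two can prefer different values of n_clusters.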

km_opt.best_params_

{'n_clusters': 6}

type(km_opt.best_estimator_)

sklearn.cluster._kmeans.KMeans

The cv_results_ attribute is a dictionary, so to make it more readable, we can make it a DataFrame.

pd.DataFrame(km_opt.cv_results_)

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_n_clusters	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.009852	0.005060	0.001296	0.000082	2	{'n_clusters': 2}	-23.650760	-19.564274	-30.086417	-16.560840	-21.666266	-22.305711	4.544809	5
1	0.009224	0.000611	0.001255	0.000045	3	{'n_clusters': 3}	-11.469299	-12.402304	-14.430616	-8.739329	-9.720147	-11.352339	2.004184	4
2	0.011083	0.000371	0.001230	0.000037	4	{'n_clusters': 4}	-7.320516	-9.156875	-10.272976	-9.532946	-9.259791	-9.108621	0.975528	3
3	0.013082	0.000244	0.001696	0.000824	5	{'n_clusters': 5}	-5.877640	-6.203186	-7.884785	-7.632313	-7.042492	-6.928083	0.781436	2
4	0.013854	0.000251	0.001230	0.000024	6	{'n_clusters': 6}	-6.008734	-5.310848	-6.181361	-5.676005	-6.349516	-5.905293	0.371533	1

20.6. Optimizing a Decision Tree

Today we will optimize a decision tree over three parameters. One is the criterion, which is how the tree decides where to create thresholds on the features. Gini is the default, and it measures how concentrated the classes are at that node. Another option is entropy, which is, generally, how random something is. Intuitively these do similar things, which makes sense because they are two ways to make the same choice, but they have slightly different calculations.
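To make "slightly different calculations" concrete, here is a sketch of the two textbook formulas applied to the class proportions at a node. The helper names `gini` and `entropy` are for illustration only and are not scikit-learn functions.

```python
import numpy as np

def gini(p):
    """Gini impurity of class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits of class proportions p, treating 0*log(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

# a pure node, a mostly pure node, and an evenly mixed node
for proportions in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(proportions, gini(proportions), entropy(proportions))
```

Both measures are zero for a pure node and largest for an evenly mixed one; they differ in how quickly they grow in between.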

We have seen the other two parameters before. Max depth limits the height of the tree, and min samples per leaf keeps the leaves from getting too small.

dt = tree.DecisionTreeClassifier()
params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],
             'min_samples_leaf':list(range(2,20,2))}

What parameters give the highest accuracy? And is the most accurate one also the fastest one?

dt_opt = GridSearchCV(dt,params_dt)
dt_opt.fit(iris_X_train,iris_y_train)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 3, 4],
                         'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})

We fit it with the default CV settings, and we can see the best parameters:

dt_opt.best_params_

{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2}

and we can use it to get predictions

y_pred = dt_opt.predict(iris_X_test)

dt_df = pd.DataFrame(dt_opt.cv_results_)
dt_df.shape

(54, 16)
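The 54 rows correspond to one row per combination of parameter values. A quick way to check that arithmetic is `ParameterGrid`, sketched here with the same grid as above.

```python
from sklearn.model_selection import ParameterGrid

params_dt = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4],
             'min_samples_leaf': list(range(2, 20, 2))}

# 2 criteria * 3 depths * 9 leaf sizes
n_candidates = len(ParameterGrid(params_dt))
n_folds = 5  # the default cv used above
print(n_candidates, n_candidates * n_folds)
```

So the search fits 54 settings on 5 folds each, plus a final refit of the best setting on all of the training data when `refit` is left at its default.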

dt_df.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_criterion', 'param_max_depth', 'param_min_samples_leaf',
       'params', 'split0_test_score', 'split1_test_score', 'split2_test_score',
       'split3_test_score', 'split4_test_score', 'mean_test_score',
       'std_test_score', 'rank_test_score'],
      dtype='object')

dt_df['mean_score_time'].idxmin() == dt_df['mean_test_score'].idxmax()

False

dt_df['mean_test_score'].idxmax(), dt_df['mean_score_time'].idxmin()

(45, 52)
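One caveat when reading these indices: `idxmax` returns only the first row with the maximum value, and several settings can tie for the best mean accuracy. The sketch below rebuilds a comparable search on the bundled iris data (an assumption, so its numbers will not match the outputs above) and picks the fastest-scoring setting among the ties. Timings vary from run to run, so treat the tie-break as illustrative.

```python
import pandas as pd
from sklearn import datasets, tree
from sklearn.model_selection import GridSearchCV

iris_X, iris_y = datasets.load_iris(return_X_y=True)
params_dt = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4],
             'min_samples_leaf': list(range(2, 20, 2))}

dt_opt = GridSearchCV(tree.DecisionTreeClassifier(random_state=0), params_dt)
dt_opt.fit(iris_X, iris_y)
dt_df = pd.DataFrame(dt_opt.cv_results_)

# all settings tied for the best mean accuracy share rank 1
top = dt_df[dt_df['rank_test_score'] == 1]
# among the ties, prefer the one that was fastest to score
fastest_top = top.sort_values('mean_score_time').iloc[0]
print(len(top), fastest_top['params'])
```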

Important

Remember that best is context dependent and relative. The best accuracy might not be the best overall. Automatic optimization can only find the best thing in terms of a single score.
