20. ML Task Review and Cross Validation#
20.1. Relationship between Tasks#
We learned classification first because it shares similarities with both regression and clustering, while regression and clustering have less in common with each other.
Classification is supervised learning for a categorical target.
Regression is supervised learning for a continuous target.
Clustering is unsupervised learning for a categorical target.
Sklearn provides a nice flow chart for thinking through this.
Predicting a category is another way of saying categorical target. Predicting a quantity is another way of saying continuous target. Having labels or not is the difference between supervised and unsupervised learning.
The flowchart assumes you know what you want to do with data, and that is the ideal scenario: you have a dataset and you have a goal. For the purpose of getting to practice with a variety of things, in this course we ask you to start with a task and then find a dataset. Assignment 9 is the last time that’s true, however. Starting with Assignment 10 and the last portfolios, you can choose and focus on a specific application domain and then choose the right task from there.
Thinking about this, however, you can use this information to move between the tasks within a given type of data. For example, you can use the same data for clustering as you did for classification. Switching the task changes the questions though: classification evaluation tells us how separable the classes are given that classifier’s decision rule. Clustering can find other subgroups or the same ones, so the evaluation we choose allows us to explore this in more ways.
Regression requires a continuous target, so we need a dataset that is suitable for
that; we can’t transform a classification dataset into a regression one.
However, we can go the other way, and that’s how some classification datasets are
created.
The UCI Adult Dataset is a popular ML dataset that was derived from census data. The goal is to use a variety of features to predict if a person makes more than $50k per year or not. While income is a continuous value, they applied a threshold ($50k) to it to make a binary variable. The dataset does not include income in dollars, only the binary indicator.
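To illustrate how thresholding turns a continuous value into a classification target, here is a minimal sketch. The income values and column names below are made up for illustration; they are not taken from the Adult dataset.

```python
import pandas as pd

# hypothetical incomes in dollars, made up for illustration (not real data)
income_df = pd.DataFrame({'income': [23000, 48000, 50000, 51000, 120000]})

# apply a threshold to turn the continuous value into a binary target
threshold = 50000
income_df['over_50k'] = income_df['income'] > threshold

# a regression task would use 'income'; a classification task would use 'over_50k'
print(income_df)
```

Once only the `over_50k` column is kept, the original dollar amounts cannot be recovered, which is why the transformation only goes one way.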
Further Reading
Recent work reconstructed the dataset with the continuous-valued income. Their repository contains the data as well as links to their paper and a video of their talk on it.
20.2. Cross Validation#
This week our goal is to learn how to optimize models. The first step in that is to get a good estimate of a model's performance.
We have seen that the test train splits, which are random, influence the performance.
# basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# models classes
from sklearn import tree
from sklearn import cluster
from sklearn import svm
# datasets
from sklearn import datasets
# model selection tools
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import metrics
We’ll use the Iris data with a decision tree.
iris_df = sns.load_dataset('iris')
iris_X = iris_df.drop(columns=['species'])
iris_y = iris_df['species']
dt =tree.DecisionTreeClassifier()
We can split the data, fit the model, then compute a score, but since the splitting is a randomized step, the score is a random variable.
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X,iris_y)
dt.fit(iris_X_train,iris_y_train)
dt.score(iris_X_test,iris_y_test)
0.8947368421052632
Since it is random, if we repeat this, we will generally get a different value.
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X,iris_y)
dt.fit(iris_X_train,iris_y_train)
dt.score(iris_X_test,iris_y_test)
0.9473684210526315
For example, suppose we have a coin and we want to see if it’s fair or not. We would flip it to test. One flip doesn’t tell us, but if we flip it a few times, we can estimate the probability it is heads by counting how many of the flips are heads and dividing by how many flips we made.
We can do something similar with our model performance. We can split the data a bunch of times and compute the score each time.
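As a rough sketch of that idea, the loop below repeats the split, fit, and score steps with different random seeds. It uses sklearn's built-in copy of the iris data so that it is self contained; the number of repetitions (20) and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

# sklearn's copy of the iris data, so the example needs no download
X, y = datasets.load_iris(return_X_y=True)

scores = []
for seed in range(20):
    # a different random split each time, controlled by the seed
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    dt_rep = tree.DecisionTreeClassifier(random_state=0)
    dt_rep.fit(X_train, y_train)
    scores.append(dt_rep.score(X_test, y_test))

scores = np.array(scores)
# the mean is our estimate; the standard deviation shows how much a single split can vary
print(scores.mean(), scores.std())
```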
cross_val_score does this all for us.
It takes an estimator object and the data.
By default it uses 5-fold cross validation. It splits the data into 5 sections, then uses 4 of them to train and one to test. It then iterates through so that each section gets used for testing.
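The description above can be sketched as an explicit loop. This is only an illustration of the idea, not how sklearn implements it internally. It assumes sklearn's built-in iris data and relies on the fact that, for a classifier, an integer `cv` means stratified folds; `dt_demo` and the other names are made up for this example.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
# fixing random_state makes the tree deterministic so the two results can be compared
dt_demo = tree.DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # a fresh, unfitted copy of the model for each fold
    fold_model = clone(dt_demo)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(dt_demo, X, y, cv=5)
print(np.array(manual_scores))
print(auto_scores)
```

Each of the 5 sections is used as the test set once, and the other 4 are used for training in that round.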
cross_val_score(dt, iris_X_train,iris_y_train)
array([0.95652174, 0.91304348, 0.95454545, 1. , 0.95454545])
We will still use the test train split to keep our test data separate from the data that we use to find our preferred parameters.
We get back a score for each section or “fold” of the data. We can average those to get a single estimate.
cross_val_score(dt, iris_X_train,iris_y_train).mean()
0.9557312252964426
We can change it to 10-fold.
cross_val_score(dt, iris_X_train,iris_y_train,cv=10)
array([0.91666667, 0.91666667, 0.81818182, 1. , 1. ,
0.90909091, 1. , 1. , 1. , 0.90909091])
cross_val_score(dt, iris_X_train,iris_y_train,cv=10).mean()
0.9469696969696969
20.3. What Does Cross validation really do?#
Important
This is extra detail that was not presented in class.
It uses StratifiedKFold for classification, but since we’re using regression here it will use KFold. train_test_split uses ShuffleSplit by default, so let’s load that too to see what it does.
Warning
The key in the following is to get the concepts not all of the details in how I evaluate and visualize. I could have made figures separately to explain the concept, but I like to show that Python is self contained.
from sklearn.model_selection import KFold, ShuffleSplit
# load the diabetes regression data that the rest of this section uses
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
kf = KFold(n_splits = 10)
When we use the split method it gives us a generator.
kf.split(diabetes_X, diabetes_y)
We can use this in a loop to get the list of indices that will be used to get the test and train data for each fold. To visualize what this is doing, see below.
N_samples = len(diabetes_y)
kf_tt_df = pd.DataFrame(index=list(range(N_samples)))
i = 1
for train_idx, test_idx in kf.split(diabetes_X, diabetes_y):
    col = 'split ' + str(i)
    kf_tt_df[col] = ['unused']*N_samples
    # use .loc so we assign into the DataFrame itself rather than a copy
    kf_tt_df.loc[train_idx, col] = 'Train'
    kf_tt_df.loc[test_idx, col] = 'Test'
    i +=1
We can count how many times ‘Test’ and ‘Train’ appear
count_test = lambda part: len([v for v in part if v=='Test'])
count_train = lambda part: len([v for v in part if v=='Train'])
When we apply this along axis=1, we can check how many times each sample is used. Each sample should be used in exactly 1 test set, so this counts the samples where that holds:
sum(kf_tt_df.apply(count_test,axis = 1) ==1)
and in exactly 9 training sets
sum(kf_tt_df.apply(count_train,axis = 1) ==9)
The describe helps ensure that all of the values are exa
We can also visualize:
cmap = sns.color_palette("tab10",10)
# convert the labels to integers so the heatmap gets numeric data
kf_tt_numeric = kf_tt_df.replace({'Test':1,'Train':0}).astype(int)
g = sns.heatmap(kf_tt_numeric,cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])
Note that unlike train_test_split, KFold does not shuffle the data before splitting unless we ask it to.
If we apply those lambda functions along axis=0, we can see the size of each test set
kf_tt_df.apply(count_test,axis = 0)
and training set:
kf_tt_df.apply(count_train,axis = 0)
We can verify that these splits are the same size as what train_test_split does using the right settings. 10-fold splits the data into 10 parts and tests on 1, so that makes a test size of 1/10=.1, so we can use train_test_split with that test size and check the lengths.
X_train2,X_test2, y_train2,y_test2 = train_test_split(diabetes_X, diabetes_y ,
test_size=.1,random_state=0)
[len(split) for split in [X_train2,X_test2,]]
Under the hood, train_test_split uses ShuffleSplit. We can do a similar experiment as above to see what ShuffleSplit does.
skf = ShuffleSplit(10)
N_samples = len(diabetes_y)
ss_tt_df = pd.DataFrame(index=list(range(N_samples)))
i = 1
for train_idx, test_idx in skf.split(diabetes_X, diabetes_y):
    col = 'split ' + str(i)
    ss_tt_df[col] = ['unused']*N_samples
    # use .loc so we assign into the DataFrame itself rather than a copy
    ss_tt_df.loc[train_idx, col] = 'Train'
    ss_tt_df.loc[test_idx, col] = 'Test'
    i +=1
ss_tt_df
And plot
cmap = sns.color_palette("tab10",10)
# convert the labels to integers so the heatmap gets numeric data
ss_tt_numeric = ss_tt_df.replace({'Test':1,'Train':0}).astype(int)
g = sns.heatmap(ss_tt_numeric,cmap=cmap[7:9],cbar_kws={'ticks':[.25,.75]},linewidths=0,
    linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])
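The difference between the two splitters can also be checked numerically. The sketch below counts how many test sets each sample lands in; it assumes the diabetes data is sklearn's built-in `load_diabetes` dataset, and the helper name `count_test_membership` and the `random_state` value are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, ShuffleSplit

X, y = datasets.load_diabetes(return_X_y=True)
n_samples = len(y)

def count_test_membership(splitter):
    # count how many test sets each sample index lands in
    counts = np.zeros(n_samples, dtype=int)
    for _, test_idx in splitter.split(X, y):
        counts[test_idx] += 1
    return counts

kf_counts = count_test_membership(KFold(n_splits=10))
ss_counts = count_test_membership(ShuffleSplit(n_splits=10, random_state=0))

# KFold: every sample is tested exactly once
print(np.unique(kf_counts))
# ShuffleSplit: each split is drawn independently, so counts can differ per sample
print(np.unique(ss_counts))
```

With KFold the test sets partition the data, while ShuffleSplit draws each test set independently, so some samples can be tested more than once and others not at all.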
20.4. Cross validation with clustering#
We can use any estimator object here.
km = cluster.KMeans(n_clusters=3)
cross_val_score(km, iris_X_train,)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
array([-11.46929911, -12.40230398, -14.43061587, -8.73932888,
-9.72014715])
km.score()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[24], line 1
----> 1 km.score()
TypeError: score() missing 1 required positional argument: 'X'
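The error above says that the score method needs data to be passed in. As a minimal sketch of what it computes, assuming sklearn's built-in iris data and a model we fit ourselves (`km_demo` is a name made up for this example), the score of a fitted KMeans is the negative of the K-means objective, so on the training data it matches the negative of `inertia_`.

```python
import numpy as np
from sklearn import cluster, datasets
from sklearn.model_selection import train_test_split

X, _ = datasets.load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

# n_init is set explicitly to avoid the FutureWarning about its default
km_demo = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
km_demo.fit(X_train)

# score needs data: it returns the negative of the K-means objective on X
train_score = km_demo.score(X_train)
test_score = km_demo.score(X_test)
print(train_score, -km_demo.inertia_)
print(test_score)
```

This is why the cross validation scores for clustering above are negative: values closer to 0 mean the points are closer to their cluster centers.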
20.5. Grid Search Optimization#
We can optimize, however, to determine the best parameter settings.
A simple way to do this is to fit the model for different parameters and score for each and compare.
param_grid = {'n_clusters':[2,3,4,5,6]}
km_opt = GridSearchCV(km, param_grid,metrics.silhouette_score)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[25], line 2
1 param_grid = {'n_clusters':[2,3,4,5,6]}
----> 2 km_opt = GridSearchCV(km, param_grid,metrics.silhouette_score)
TypeError: __init__() takes 3 positional arguments but 4 were given
The GridSearchCV object is constructed first and requires an estimator object and a dictionary that describes the parameter grid to search over.
The dictionary has the parameter names as the keys and the values are the values for that parameter to test.
The fit method on the Grid Search object fits all of the separate models.
In this case we optimize over a one-dimensional “grid”: just a set of values for one parameter, the number of clusters.
param_grid = {'n_clusters':[2,3,4,5,6]}
km_opt = GridSearchCV(km, param_grid)
iris_X_train.shape
(112, 4)
km_opt.fit(iris_X_train)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
GridSearchCV(estimator=KMeans(n_clusters=3), param_grid={'n_clusters': [2, 3, 4, 5, 6]})
Important
I still need to explore this question. A volunteer who wants to do this for a portfolio section can do that as well.
Why does scoring=metrics.silhouette_score not work?
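One possible starting point for exploring this, offered as a sketch rather than a definitive answer: when `scoring` is a callable, sklearn calls it as `scorer(estimator, X, y)`, while `metrics.silhouette_score` expects `(X, labels)`. The wrapper below adapts one signature to the other. The name `silhouette_scorer`, the fallback value of -1, the smaller parameter grid, and the shuffled `KFold` are all assumptions made for this example.

```python
from sklearn import cluster, datasets, metrics
from sklearn.model_selection import GridSearchCV, KFold

X, _ = datasets.load_iris(return_X_y=True)

def silhouette_scorer(estimator, X, y=None):
    # hypothetical helper: GridSearchCV calls scorers as scorer(estimator, X, y)
    labels = estimator.predict(X)
    if len(set(labels)) < 2:
        # silhouette is undefined for a single cluster; -1 is an arbitrary fallback
        return -1.0
    return metrics.silhouette_score(X, labels)

km_sil = cluster.KMeans(n_init=10, random_state=0)
param_grid_sil = {'n_clusters': [2, 3, 4]}
# shuffle because the rows of sklearn's iris data are sorted by species
cv = KFold(n_splits=5, shuffle=True, random_state=0)
km_sil_opt = GridSearchCV(km_sil, param_grid_sil, scoring=silhouette_scorer, cv=cv)
km_sil_opt.fit(X)
print(km_sil_opt.best_params_, km_sil_opt.best_score_)
```

Unlike the default KMeans score, the silhouette score does not automatically improve as clusters are added, so it can prefer a smaller number of clusters.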
km_opt.best_params_
{'n_clusters': 6}
type(km_opt.best_estimator_)
sklearn.cluster._kmeans.KMeans
The cv_results_ attribute is a dictionary, so to make it more readable, we can make it a DataFrame.
pd.DataFrame(km_opt.cv_results_)
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_clusters | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009852 | 0.005060 | 0.001296 | 0.000082 | 2 | {'n_clusters': 2} | -23.650760 | -19.564274 | -30.086417 | -16.560840 | -21.666266 | -22.305711 | 4.544809 | 5 |
| 1 | 0.009224 | 0.000611 | 0.001255 | 0.000045 | 3 | {'n_clusters': 3} | -11.469299 | -12.402304 | -14.430616 | -8.739329 | -9.720147 | -11.352339 | 2.004184 | 4 |
| 2 | 0.011083 | 0.000371 | 0.001230 | 0.000037 | 4 | {'n_clusters': 4} | -7.320516 | -9.156875 | -10.272976 | -9.532946 | -9.259791 | -9.108621 | 0.975528 | 3 |
| 3 | 0.013082 | 0.000244 | 0.001696 | 0.000824 | 5 | {'n_clusters': 5} | -5.877640 | -6.203186 | -7.884785 | -7.632313 | -7.042492 | -6.928083 | 0.781436 | 2 |
| 4 | 0.013854 | 0.000251 | 0.001230 | 0.000024 | 6 | {'n_clusters': 6} | -6.008734 | -5.310848 | -6.181361 | -5.676005 | -6.349516 | -5.905293 | 0.371533 | 1 |
20.6. Optimizing a Decision Tree#
Today we will optimize a decision tree over three parameters. One is the criterion, which is how the tree decides where to create thresholds on the features. Gini is the default; it measures how concentrated each class is at that node. Another option is entropy, which is, generally, how random something is. Intuitively these do similar things, which makes sense because they are two ways to make the same choice, but they have slightly different calculations.
We have seen the other two parameters before. Max depth limits the height of the tree, and min samples per leaf keeps the leaves from getting too small.
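To make the two criteria concrete, here is a small sketch of the standard Gini impurity and entropy formulas applied to the class proportions at a node. The function names and example nodes are made up for illustration; this is not sklearn's internal implementation.

```python
import numpy as np

def gini(proportions):
    # Gini impurity: 1 minus the sum of squared class proportions
    p = np.asarray(proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(proportions):
    # Shannon entropy in bits; classes with zero proportion contribute nothing
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return np.sum(p * np.log2(1.0 / p))

pure_node = [1.0, 0.0]    # every sample at the node is in one class
mixed_node = [0.5, 0.5]   # the node is an even mix of two classes
print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both measures are 0 for a pure node and largest for an even mix, which is why they usually lead to similar splits.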
dt = tree.DecisionTreeClassifier()
params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],
'min_samples_leaf':list(range(2,20,2))}
What parameters give the highest accuracy? And is the most accurate one also the fastest one?
dt_opt = GridSearchCV(dt,params_dt)
dt_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4], 'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})
We will fit it with default CV settings. And we can see the best parameters
dt_opt.best_params_
{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2}
and we can use it to get predictions
y_pred = dt_opt.predict(iris_X_test)
dt_df = pd.DataFrame(dt_opt.cv_results_)
dt_df.shape
(54, 16)
dt_df.columns
Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
'param_criterion', 'param_max_depth', 'param_min_samples_leaf',
'params', 'split0_test_score', 'split1_test_score', 'split2_test_score',
'split3_test_score', 'split4_test_score', 'mean_test_score',
'std_test_score', 'rank_test_score'],
dtype='object')
dt_df['mean_score_time'].idxmin() == dt_df['mean_test_score'].idxmax()
False
dt_df['mean_test_score'].idxmax(), dt_df['mean_score_time'].idxmin()
(45, 52)
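To compare the most accurate and the fastest settings side by side, we can look up both rows in the results. This is a self-contained sketch that refits the same grid on sklearn's built-in iris data, so the exact row numbers and times may differ from the ones above; the variable names are made up for this example.

```python
import pandas as pd
from sklearn import datasets, tree
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4],
          'min_samples_leaf': list(range(2, 20, 2))}
grid = GridSearchCV(tree.DecisionTreeClassifier(random_state=0), params)
grid.fit(X_train, y_train)

results = pd.DataFrame(grid.cv_results_)
most_accurate = results['mean_test_score'].idxmax()
fastest = results['mean_score_time'].idxmin()

# compare the two rows on both accuracy and timing
cols = ['params', 'mean_test_score', 'std_test_score', 'mean_score_time']
print(results.loc[[most_accurate, fastest], cols])
```

Looking at the standard deviation alongside the mean helps judge whether the difference in accuracy between the two rows is large compared to the fold-to-fold variation.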
Important
Remember that best is context dependent and relative. The best accuracy might not be the best overall. Automatic optimization can only find the best thing in terms of a single score.