28. Model Selection#

We’ll pick up from where we left off on Monday.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import svm
from sklearn import tree
# import the whole model selection module
from sklearn import model_selection
sns.set_theme(palette='colorblind')

# load and split the data
iris_X, iris_y = datasets.load_iris(return_X_y=True)
iris_X_train, iris_X_test, iris_y_train, iris_y_test = model_selection.train_test_split(
  iris_X,iris_y, test_size =.2)


# create dt, set param grid & create optimizer
dt = tree.DecisionTreeClassifier()

params_dt = {'criterion':['gini','entropy'],
       'max_depth':[2,3,4,5,6],
    'min_samples_leaf':list(range(2,20,2))}

dt_opt = model_selection.GridSearchCV(dt,params_dt,cv=10)


# fit the model and optimize
dt_opt.fit(iris_X_train,iris_y_train)

# store the results in a dataframe
dt_df = pd.DataFrame(dt_opt.cv_results_)


# create svm, its parameter grid and optimizer
svm_clf = svm.SVC()
param_grid = {'kernel':['linear','rbf'], 'C':[.5, .75,1,2,5,7, 10]}
svm_opt = model_selection.GridSearchCV(svm_clf,param_grid,cv=10)

# fit the model and put the CV results in a dataframe
svm_opt.fit(iris_X_train,iris_y_train)
sv_df = pd.DataFrame(svm_opt.cv_results_)

We can compare how well the best model from each class performed during cross validation:

svm_opt.best_score_, dt_opt.best_score_
(0.975, 0.9416666666666667)

We see that the SVM does a little bit better.

We can also apply the best estimator from each model class (that is, the best parameter settings for the SVM and for the Decision Tree) to the test data to get our final scores.

svm_opt.best_estimator_.score(iris_X_test,iris_y_test)
1.0
dt_opt.best_estimator_.score(iris_X_test,iris_y_test)
0.9333333333333333
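
To see which parameter settings each search actually selected, we could also inspect the best_params_ attribute of each fitted search (a quick check, not in the original notes; the exact values depend on the random train/test split):

svm_opt.best_params_, dt_opt.best_params_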

We can also examine the results to think through this choice more clearly.

sv_df.head(2)
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000493 | 0.000035 | 0.000267 | 0.000011 | 0.5 | linear | {'C': 0.5, 'kernel': 'linear'} | 1.0 | 1.0 | 1.0 | 0.833333 | 1.0 | 1.0 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.975 | 0.053359 | 1 |
| 1 | 0.000603 | 0.000013 | 0.000296 | 0.000013 | 0.5 | rbf | {'C': 0.5, 'kernel': 'rbf'} | 1.0 | 1.0 | 1.0 | 0.833333 | 1.0 | 1.0 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.975 | 0.053359 | 1 |
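
Beyond the first two rows, one option (a sketch, not run in class) is to sort the results by rank_test_score so the best-performing settings appear first:

# show the top-ranked parameter settings with their validation scores
sv_df.sort_values('rank_test_score')[['params','mean_test_score','std_test_score','rank_test_score']].head()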

We can use EDA to understand how the score varied across all of the parameter settings we tried.

sv_df['mean_test_score'].describe()
count    14.000000
mean      0.967262
std       0.009509
min       0.941667
25%       0.966667
50%       0.966667
75%       0.975000
max       0.975000
Name: mean_test_score, dtype: float64
dt_df['mean_test_score'].describe()
count    90.000000
mean      0.933981
std       0.003117
min       0.925000
25%       0.933333
50%       0.933333
75%       0.933333
max       0.941667
Name: mean_test_score, dtype: float64

From this we see that in both cases the standard deviation (std) is really low. This tells us that the parameter changes didn’t impact the performance much. Combined with the overall high accuracy this tells us that the data is probably really easy to classify. If the performance had been uniformly bad, it might have instead told us that we did not try a wide enough range of parameters.
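
One way to see this variation directly (a sketch, not part of the original notes) is to plot the mean validation score for each parameter setting, for example for the SVM grid:

# mean validation accuracy for each value of C, one line per kernel
sns.catplot(data=sv_df, x='param_C', y='mean_test_score',
            hue='param_kernel', kind='point')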

To confirm how many parameter settings we tried, we can check a couple of different ways. First, it appears as the count in the describe output above.

We can also calculate it directly from the parameter grids before we even do the fit.

# multiply together the number of values for each parameter
n_combinations = 1
for param, vals in param_grid.items():
    n_combinations *= len(vals)
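
Equivalently, sklearn's ParameterGrid can enumerate the combinations for us; a small sketch checking both grids defined above:

from sklearn.model_selection import ParameterGrid

# 2 kernels x 7 values of C = 14 settings for the SVM;
# 2 criteria x 5 depths x 9 leaf sizes = 90 for the decision tree
len(ParameterGrid(param_grid)), len(ParameterGrid(params_dt))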

28.1. When do differences matter?#

We can calculate a confidence interval to determine more precisely whether the performance of two models is meaningfully different.

This function calculates the 95% confidence interval: the range within which we are 95% confident that the true value of the quantity we estimated lies. When we have more samples in the test set used to calculate the score, we are more confident in the estimate, so the interval is narrower.
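
The function below implements the normal-approximation interval for a proportion:

$$ acc \pm 1.96\sqrt{\frac{acc\,(1-acc)}{n}} $$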

def classification_confint(acc, n):
    '''
    Compute the 95% confidence interval for a classification problem.
    acc -- classification accuracy
    n  -- number of observations used to compute the accuracy
    Returns a tuple (lb,ub)
    '''
    interval = 1.96*np.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb,ub)
svm_opt.best_score_, dt_opt.best_score_
(0.975, 0.9416666666666667)

We can calculate the number of observations used to compute the accuracy using the size of the training data and the fact that we set it to 10-fold cross validation. That means that 10% (100%/10) of the training data was used for each validation fold.

len(iris_X_train)*.1
12.0
split0_score = dt_df['split0_test_score'][dt_opt.best_index_]
classification_confint(split0_score ,len(iris_X_train)*.1)
(1.0, 1.0)

Correction

Since the best score is cross validated, it actually uses the whole training data (by averaging the 10 fold scores together).

classification_confint( dt_opt.best_score_,len(iris_X_train))
(0.8997320725326443, 0.983601260800689)
len(iris_y_test)
30
classification_confint(dt_opt.best_estimator_.score(iris_X_test,iris_y_test),len(iris_y_test))
(0.8440710066609742, 1.0)

If we hold the accuracy value fixed, we can see how having more samples narrows the interval.

classification_confint(.95,12)
(0.8266860375572443, 1.0)
classification_confint(.95,30)
(0.8720094022760863, 1.0)

We can explore this further by plotting the interval width, computed directly with the calculation from inside the function.

acc = .95  # hold the accuracy fixed, as in the examples above
n_list = list(range(5,200))
# interval half-width for each number of samples
intervals = [1.96*np.sqrt(acc*(1-acc)/n) for n in n_list]
plt.plot(n_list, intervals)

28.2. Questions After Class#

28.2.1. When would I choose to use an svm?#

SVMs are often very accurate, and while in some cases a decision tree may be considered more interpretable, an SVM is not necessarily hard to interpret either.

If accuracy is important, the SVM is often a really good choice.

28.2.2. How can we tell how many parameter settings we tried?#

We can calculate it in advance from the parameter grid (by multiplying the number of values for each parameter) or, after fitting, check the length of the cv_results_ DataFrame (or of the lists in the cv_results_ dictionary).
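
After fitting, a quick sketch of that check using the decision tree search from above (both values should match the 90 settings we saw in the describe output):

# one entry in cv_results_['params'] and one row in the results dataframe per setting
len(dt_opt.cv_results_['params']), len(dt_df)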