20. Model Comparison#
To compare models, we will first optimize the parameters of two different models and look at how the different parameter settings impact the model comparison. Later, we'll see how to compare across models of different classes.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import svm
from sklearn import tree
# import the whole model selection module
from sklearn import model_selection
sns.set_theme(palette='colorblind')
We’ll use the iris data again.
Remember, we need to split the data into training and test sets. The cross validation step will help us optimize the parameters, but we don't want data leakage where the model has seen the test data multiple times. So, we split the data here into train and test, and the cross validation splits the training data into train and "test" again, but that "test" is better termed validation.
# load and split the data
iris_X, iris_y = datasets.load_iris(return_X_y=True)
iris_X_train, iris_X_test, iris_y_train, iris_y_test = model_selection.train_test_split(
iris_X,iris_y, test_size =.2)
Then we can make the estimator object, the parameter grid dictionary, and the grid search object. We split these into separate cells so that we can use the built-in help to see more detail.
# create the decision tree estimator
dt = tree.DecisionTreeClassifier()
# set param grid
params_dt = {'criterion':['gini','entropy'],
'max_depth':[2,3,4,5,6],
'min_samples_leaf':list(range(2,20,2))}
# create optimizer
dt_opt = model_selection.GridSearchCV(dt,params_dt,cv=10)
# optimize the dt parameters
dt_opt.fit(iris_X_train,iris_y_train)
# store the results in a dataframe
dt_df = pd.DataFrame(dt_opt.cv_results_)
# create svm, its parameter grid and optimizer
svm_clf = svm.SVC()
param_grid = {'kernel':['linear','rbf'], 'C':[.5, .75,1,2,5,7, 10]}
svm_opt = model_selection.GridSearchCV(svm_clf,param_grid,cv=10)
# optimize the svm and put the CV results in a dataframe
svm_opt.fit(iris_X_train,iris_y_train)
sv_df = pd.DataFrame(svm_opt.cv_results_)
Then we fit the grid search using the training data. Remember, this actually sets the parameters to each combination in the grid and then cross validates, once per combination.
We fit both a decision tree and an SVM for this case.
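Once fit, the grid search also exposes the winning configuration directly. A quick sketch of inspecting it (these are standard GridSearchCV attributes; the exact values depend on the random split):
# look at the best parameter values and the best mean cross validation score
dt_opt.best_params_, dt_opt.best_score_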
20.1. What does the grid search do?#
Important
The notes from the last class include more detail than was covered in class on November 7, to make up for the lost time. Please see those notes; they also mention that there are other types of searches that might be better in some applications.
It fits and scores the model with each set of parameters on each fold of the data.
It saves results for each set of parameters and each fold of cross validation.
dt_df.shape
(90, 21)
We can confirm this matches what our parameter grid said: here we pick out the lists of values from the parameter dictionary, take the length of each (via a list comprehension), and then multiply all of those together:
np.prod([len(param_list) for param_list in params_dt.values()])
90
This matches the number of rows above.
We can see how much it stores for each of these parameter settings:
dt_df.head(2)
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_min_samples_leaf | params | split0_test_score | split1_test_score | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.000531 | 0.000062 | 0.000441 | 0.000045 | gini | 2 | 2 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 1.0 | ... | 0.75 | 0.916667 | 0.916667 | 1.0 | 1.0 | 1.0 | 1.0 | 0.95 | 0.076376 | 1 |
1 | 0.000503 | 0.000017 | 0.000430 | 0.000019 | gini | 2 | 4 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 1.0 | ... | 0.75 | 0.916667 | 0.916667 | 1.0 | 1.0 | 1.0 | 1.0 | 0.95 | 0.076376 | 1 |
2 rows × 21 columns
The `GridSearchCV` object also exposes the model with the best cross validation score, which is what gets used when we call `score` and `predict`:
dt_opt.score(iris_X_test,iris_y_test), svm_opt.score(iris_X_test,iris_y_test)
(0.8666666666666667, 0.9333333333333333)
21. Comparing performance#
We can compare how much time each model takes to score (make predictions).
dt_df['mean_score_time'].describe()
count 90.000000
mean 0.000417
std 0.000012
min 0.000408
25% 0.000411
50% 0.000415
75% 0.000418
max 0.000504
Name: mean_score_time, dtype: float64
sv_df['mean_score_time'].describe()
count 14.000000
mean 0.000461
std 0.000023
min 0.000440
25% 0.000444
50% 0.000456
75% 0.000468
max 0.000529
Name: mean_score_time, dtype: float64
In these, we can look at the means but also the standard deviation; it is important to understand how sensitive each classifier is to its parameters. If a model is very sensitive to the parameters, it might not transfer as well, or at least warrants further optimization.
Which model’s performance is more sensitive to the parameters?
dt_df['mean_test_score'].describe()
count 90.000000
mean 0.949444
std 0.002090
min 0.941667
25% 0.950000
50% 0.950000
75% 0.950000
max 0.950000
Name: mean_test_score, dtype: float64
sv_df['mean_test_score'].describe()
count 14.000000
mean 0.987500
std 0.005420
min 0.983333
25% 0.983333
50% 0.983333
75% 0.991667
max 1.000000
Name: mean_test_score, dtype: float64
Again here, the better fit model was similar, but this is something we can examine with the broader statistics.
21.1. How does the best model do?#
We can get the row index of the best ranked model
dt_df['rank_test_score'].idxmin()
0
and look at its performance to compare the best parameter values of each model
dt_df.loc[dt_df['rank_test_score'].idxmin()]
mean_fit_time 0.000531
std_fit_time 0.000062
mean_score_time 0.000441
std_score_time 0.000045
param_criterion gini
param_max_depth 2
param_min_samples_leaf 2
params {'criterion': 'gini', 'max_depth': 2, 'min_sam...
split0_test_score 0.916667
split1_test_score 1.0
split2_test_score 1.0
split3_test_score 0.75
split4_test_score 0.916667
split5_test_score 0.916667
split6_test_score 1.0
split7_test_score 1.0
split8_test_score 1.0
split9_test_score 1.0
mean_test_score 0.95
std_test_score 0.076376
rank_test_score 1
Name: 0, dtype: object
sv_df.loc[sv_df['rank_test_score'].idxmin()]
mean_fit_time 0.00065
std_fit_time 0.000017
mean_score_time 0.00044
std_score_time 0.000003
param_C 1
param_kernel linear
params {'C': 1, 'kernel': 'linear'}
split0_test_score 1.0
split1_test_score 1.0
split2_test_score 1.0
split3_test_score 1.0
split4_test_score 1.0
split5_test_score 1.0
split6_test_score 1.0
split7_test_score 1.0
split8_test_score 1.0
split9_test_score 1.0
mean_test_score 1.0
std_test_score 0.0
rank_test_score 1
Name: 4, dtype: object
21.2. What about training time?#
In some contexts it might be important to also check how long it takes to train, for example if you will train and retrain a lot on changing data.
dt_df['mean_fit_time'].describe()
count 90.000000
mean 0.000502
std 0.000013
min 0.000484
25% 0.000495
50% 0.000500
75% 0.000506
max 0.000581
Name: mean_fit_time, dtype: float64
sv_df['mean_fit_time'].describe()
count 14.000000
mean 0.000700
std 0.000032
min 0.000650
25% 0.000680
50% 0.000691
75% 0.000723
max 0.000764
Name: mean_fit_time, dtype: float64
21.3. How sensitive to the parameter values is each model?#
We can look at a lot of different things to check this. First, we can look at the `std` rows in the describe tables above; that is how much the parameter values changed the different measurements we took.
Another thing we can look at is the ranks. If most of the models are ranked the same, that means they all got the same score, so the parameter values did not change the accuracy; that is not sensitive. This gives us more detail about sensitivity when the standard deviations are already small.
If there are no ties, that shows some sensitivity, and then it is important to look at the std to see whether it is small or not.
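For example, we can count how many parameter settings share each rank in the decision tree results; many ties at rank 1 would mean many settings scored identically (a small sketch on the DataFrame we already built; the counts will vary with the split):
# count how many parameter settings received each rank; lots of ties at
# rank 1 means the dt was not sensitive to these parameter values
dt_df['rank_test_score'].value_counts().sort_index()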
Important
The results below may differ from run to run due to the random train/test split.
21.4. More complete comparison#
A complete comparison will compare across multiple metrics and consider the relative importance of those metrics in the particular context.
We can look at the performance; here the score is the accuracy, and we could also look at other performance metrics.
We can compare them on time: the training time or the scoring time (often more important).
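As a sketch of a more complete comparison, we can score both tuned models on the held-out test set with several metrics using `sklearn.metrics.classification_report` (the exact numbers depend on the random split):
from sklearn import metrics

# compare the two tuned models on precision, recall, and f1, not only accuracy
for name, opt in [('decision tree', dt_opt), ('svm', svm_opt)]:
    print(name)
    print(metrics.classification_report(iris_y_test, opt.predict(iris_X_test)))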
21.5. Interpretability#
We can also compare models on their interpretability: it is easy to explain how a decision tree makes a decision.
An SVM is somewhat harder to interpret, and a neural network, which we will see after Thanksgiving, is even more complex.
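For example, scikit-learn can print a fitted tree as if/else rules with `tree.export_text`, which makes its decision process easy to read (a small sketch on the tuned tree; the rules depend on the fit):
# print the rules the tuned tree learned; best_estimator_ is the refit tree
print(tree.export_text(dt_opt.best_estimator_))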
If the model is going into a webapp, it may not matter whether it is interpretable, but if it is being used with experts of some sort, they will likely care about its interpretability.
For example, in healthcare, interpretability is often valued. I shared a story about a model that performed well on one metric, but looking at how it worked raised concerns. That link leads to the person who trained that model telling the story.
21.6. Model type#
For classifiers, we can compare the type of classifier, or what it learns: whether it is generative (describes the data, could generate synthetic data) or discriminative (describes only the decision rule). A generative model might be preferred, even with lower accuracy, by a scientist who wants to understand the data; it can also give ideas about what you might do to improve the model. If you just need the most accuracy, like if you are placing ads, you would typically use a discriminative model, because you just want accuracy and do not need to understand the data.
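To make the generative idea concrete, here is a minimal sketch: fit a `GaussianNB` (a generative classifier) and draw synthetic samples from its learned per-class Gaussians. The `theta_` (per-class means) and `var_` (per-class variances) attributes are `GaussianNB` attributes in recent scikit-learn versions; the sampling itself is hand-rolled here for illustration, not a built-in API.
from sklearn import naive_bayes

# GaussianNB learns a Gaussian per class per feature, so we can sample from it
gnb = naive_bayes.GaussianNB()
gnb.fit(iris_X_train, iris_y_train)
rng = np.random.default_rng()
# draw 3 synthetic samples of class 0 from the learned means and variances
rng.normal(gnb.theta_[0], np.sqrt(gnb.var_[0]), size=(3, 4))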
dt_opt.score(iris_X_test,iris_y_test), svm_opt.score(iris_X_test,iris_y_test)
(0.8666666666666667, 0.9333333333333333)
len(iris_y_test)
30
21.7. Questions#
21.7.1. Is there a sample size restriction to compare models?#
No, but you should compare them on the same data (for training and testing).
21.7.2. Does GNB also have these mean_score and mean_time values like DT and SVC do?#
These values come from the `GridSearchCV` object, not the `DecisionTreeClassifier` or the `SVC`. Therefore, they exist for any `GridSearchCV` no matter what estimator object is used to instantiate it. So, yes, for a `GridSearchCV` over a `GaussianNB` these values would exist.
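As a minimal sketch (assuming `GaussianNB`'s `var_smoothing` parameter as the thing to tune):
from sklearn import naive_bayes

# a grid search over GaussianNB produces the same cv_results_ columns
gnb_params = {'var_smoothing': [1e-9, 1e-7, 1e-5]}
gnb_opt = model_selection.GridSearchCV(naive_bayes.GaussianNB(), gnb_params, cv=10)
gnb_opt.fit(iris_X_train, iris_y_train)
pd.DataFrame(gnb_opt.cv_results_)[['mean_fit_time', 'mean_score_time', 'mean_test_score']]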
21.7.3. How does the size of the parameter grid impact the requirements of the grid search?#
It impacts how long it will take to run.
If this answer is not enough for you, please write to clarify your question and I will expand.
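To make it concrete, the total number of fits is the number of parameter combinations times the number of folds (plus one final refit of the best setting); for the SVM grid above:
# 2 kernels * 7 C values * 10 folds = 140 fits (plus 1 refit)
np.prod([len(v) for v in param_grid.values()]) * 10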
21.7.4. Are they still using the AI to assess patient priority in hospitals?#
The model I told the story about never got deployed. ML is used to varying degrees in hospitals. There is a lot of research on machine learning for healthcare; see, for example, the list of 2023 papers.
This is an example of a well designed, successfully deployed model. You can read it through URI; let me know if you do not know how to use the library to get access.
21.7.5. Why does it matter how long it takes to fit data when the time is almost instantaneous?#
When it is this low, we likely would not care. But the minor difference in score time can add up: even if it is .0001s, if it is called 3 million times per day, then the time (and the associated compute time, energy costs, etc.) adds up.
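A quick back-of-the-envelope version of that, with the hypothetical numbers from the answer:
# .0001 seconds per prediction call, 3 million calls per day
seconds_per_day = .0001 * 3e6
seconds_per_day, seconds_per_day / 60  # 300 seconds, about 5 minutes per day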
21.7.6. When models are trained at an enterprise level / for enterprise applications, is pandas & scikit-learn still used, or is a lower-level language/package used (like for Netflix)?#
What data scientists do typically still uses pandas and scikit-learn. For time-critical or low-memory deployment locations, the fit model may be re-implemented in a faster language.