19. Model Comparison#

To compare models, we will first optimize the parameters of two different models and look at how the different parameter settings impact the model comparison. Later, we’ll see how to compare across models of different classes.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import svm
from sklearn import tree
# import the whole model selection module
from sklearn import model_selection
sns.set_theme(palette='colorblind')

We’ll use the iris data again.

Remember, we need to split the data into training and test sets. The cross validation step will help us optimize the parameters, but we don’t want data leakage where the model has seen the test data multiple times. So, we split the data here into train and test, and the cross validation then splits the training data into train and “test” again; that inner test set is better termed validation.

# load and split the data
iris_X, iris_y = datasets.load_iris(return_X_y=True)
iris_X_train, iris_X_test, iris_y_train, iris_y_test = model_selection.train_test_split(
  iris_X,iris_y, test_size =.2)

Then we can make the estimator object, the parameter grid dictionary, and the grid search object. We split these into separate cells so that we can use the built-in help to see more detail.
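
For example, one quick way to see more detail is the built in help function; a small sketch for the grid search class:

# view the documentation for the grid search class
help(model_selection.GridSearchCV)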

# create dt,
dt = tree.DecisionTreeClassifier()

# set param grid 
params_dt = {'criterion':['gini','entropy'],
             'max_depth':[2,3,4,5,6],
             'min_samples_leaf':list(range(2,20,2))}
# create optimizer
dt_opt = model_selection.GridSearchCV(dt,params_dt,cv=10)


# optimize the dt parameters
dt_opt.fit(iris_X_train,iris_y_train)

# store the results in a dataframe
dt_df = pd.DataFrame(dt_opt.cv_results_)


# create svm, its parameter grid and optimizer
svm_clf = svm.SVC()
param_grid = {'kernel':['linear','rbf'], 'C':[.5, .75,1,2,5,7, 10]}
svm_opt = model_selection.GridSearchCV(svm_clf,param_grid,cv=10)

# optimize the svm and put the CV results in a dataframe
svm_opt.fit(iris_X_train,iris_y_train)
sv_df = pd.DataFrame(svm_opt.cv_results_)

Then we fit the grid search using the training data; remember that this sets the parameters to each combination in the grid and cross validates each one, so the model is actually fit many times.

We fit both a decision tree and an SVM for this case.
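
Once fit, each GridSearchCV object stores the best parameter values it found and their mean cross validation score, so we can check what it chose, for example:

# best parameter values and their mean cross validation accuracy
print(dt_opt.best_params_, dt_opt.best_score_)
print(svm_opt.best_params_, svm_opt.best_score_)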

19.1. Comparing performance#

dt_df['mean_test_score'].describe()
count    90.000000
mean      0.922315
std       0.009902
min       0.916667
25%       0.916667
50%       0.916667
75%       0.925000
max       0.950000
Name: mean_test_score, dtype: float64
sv_df['mean_test_score'].describe()
count    14.000000
mean      0.979762
std       0.007097
min       0.958333
25%       0.977083
50%       0.983333
75%       0.983333
max       0.983333
Name: mean_test_score, dtype: float64
dt_df.head(2)
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_depth param_min_samples_leaf params split0_test_score split1_test_score ... split3_test_score split4_test_score split5_test_score split6_test_score split7_test_score split8_test_score split9_test_score mean_test_score std_test_score rank_test_score
0 0.000643 0.000296 0.000444 0.000041 gini 2 2 {'criterion': 'gini', 'max_depth': 2, 'min_sam... 0.833333 1.0 ... 0.916667 0.916667 0.916667 0.916667 0.833333 1.0 0.916667 0.916667 0.052705 25
1 0.000491 0.000010 0.000429 0.000053 gini 2 4 {'criterion': 'gini', 'max_depth': 2, 'min_sam... 0.833333 1.0 ... 0.916667 0.916667 0.916667 0.916667 0.833333 1.0 0.916667 0.916667 0.052705 25

2 rows × 21 columns

# how many values are tried for each svm parameter
[len(v) for k,v in param_grid.items()]
[2, 7]
# total parameter combinations tried for the svm; matches the count in sv_df
np.prod([len(v) for k,v in param_grid.items()])
14
# total parameter combinations tried for the decision tree; matches the count in dt_df
np.prod([len(v) for k,v in params_dt.items()])
90
# compare the two optimized models on the held out test set
dt_opt.score(iris_X_test,iris_y_test), svm_opt.score(iris_X_test,iris_y_test)
(0.9333333333333333, 0.9333333333333333)

19.2. How sensitive to the parameter values is each model?#

We can look at a lot of different things to check this. First, we can look back at the std in the describe tables above; that tells us how much the score varied across the different parameter settings we tried.

Another thing we can look at is the ranks. If most of the models are ranked the same, that means they all got the same score, so the parameter values did not change the accuracy; that is not sensitive. This gives us a bit more detail on sensitivity when the std values are already small.

If there are no ties, that shows some sensitivity, and then it would be important to look at the std to see if it is small or not.

Important

The values below may differ from run to run because of the random splits.

dt_df['rank_test_score'].value_counts()
rank_test_score
25    66
2     13
15     8
23     2
1      1
Name: count, dtype: int64
sv_df['rank_test_score'].value_counts()
rank_test_score
5     6
1     4
11    3
14    1
Name: count, dtype: int64

19.3. More Complete comparison#

A complete comparison will compare across multiple metrics and consider the relative importance of those metrics in the particular context.

We can look at the performance; here the score is accuracy, and we could also look at other performance metrics.

We can compare them on time: the training time or the prediction time (often more important).
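
For example, GridSearchCV can track more than one metric if we pass a list to scoring and say which one to refit on; here is a minimal sketch (f1_macro is just one possible second metric, not something we optimized above):

# a sketch of optimizing while tracking accuracy and macro f1 together
svm_multi = model_selection.GridSearchCV(svm.SVC(), param_grid,
                                         scoring=['accuracy', 'f1_macro'],
                                         refit='accuracy', cv=10)
svm_multi.fit(iris_X_train, iris_y_train)
# cv_results_ now has a mean_test_ column per metric
pd.DataFrame(svm_multi.cv_results_)[['mean_test_accuracy', 'mean_test_f1_macro']].head()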

19.4. Interpretability#

We can also compare models on their interpretability: with a decision tree, it is easy to explain how it makes a decision.

An SVM is somewhat harder to explain, and a neural network (which we will see after Thanksgiving) is even more complex.
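
For example, we can print out the rules the best decision tree learned; a quick sketch using sklearn's export_text:

# print the if/then rules of the best decision tree found above
print(tree.export_text(dt_opt.best_estimator_))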

If the model is going into a webapp, it may not matter whether it is interpretable, but if it is being used with experts of some sort, they will likely care about its interpretability.

For example, in healthcare, interpretability is often valued. I shared a story about a model that performed well on one metric, but looking at how it worked raised concerns. That link leads to the person who trained that model telling the story.

19.5. Model type#

For classifiers, we can compare the type of classifier, or what it learns. We can compare whether it is generative (describes the data, could generate synthetic data) or discriminative (describes only the decision rule). A generative model might be preferred even with lower accuracy by a scientist who wants to understand the data; it can also give ideas about what you might do to improve the model. If you just need the most accuracy, like if you are placing ads, you would typically use a discriminative model: you just want accuracy, and you do not need to understand the data.
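
As a small sketch of the difference, a Gaussian Naive Bayes classifier is generative: after fitting, it stores a per-class description of the data (feature means and variances) that could be used to generate synthetic samples, while the SVM only keeps what it needs for its decision rule.

from sklearn.naive_bayes import GaussianNB

# a generative model learns a description of the data for each class
gnb = GaussianNB()
gnb.fit(iris_X_train, iris_y_train)
# per-class feature means; together with the variances, these could generate synthetic samples
gnb.theta_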

19.6. Which is faster?#

We can also compare models in terms of the time to train:

dt_df['mean_fit_time'].describe()
count    90.000000
mean      0.000494
std       0.000019
min       0.000477
25%       0.000486
50%       0.000491
75%       0.000499
max       0.000643
Name: mean_fit_time, dtype: float64
sv_df['mean_fit_time'].describe()
count    14.000000
mean      0.000669
std       0.000034
min       0.000624
25%       0.000647
50%       0.000665
75%       0.000681
max       0.000736
Name: mean_fit_time, dtype: float64

or to predict.

dt_df['mean_score_time'].describe()
count    90.000000
mean      0.000408
std       0.000010
min       0.000395
25%       0.000401
50%       0.000406
75%       0.000412
max       0.000458
Name: mean_score_time, dtype: float64
sv_df['mean_score_time'].describe()
count    14.000000
mean      0.000433
std       0.000010
min       0.000420
25%       0.000424
50%       0.000432
75%       0.000442
max       0.000448
Name: mean_score_time, dtype: float64

In general, we will typically predict many more times than we fit, so the score time often matters more.
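
If we want something closer to the time one user would experience, we could time a single prediction directly; a rough sketch (the exact numbers will vary from run to run):

import time

# time one prediction, instead of scoring the whole test set at once
start = time.perf_counter()
svm_opt.predict(iris_X_test[:1])
print(time.perf_counter() - start)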

19.7. Questions after class#

19.7.1. Are “AI” tools like ChatGPT just complicated machine learning algorithms?#

Yes, modern, successful AI is all ML. Strictly speaking there are AI techniques that are not machine learning, but machine learning is currently dominant in practice, enough so that non-ML AI techniques are not even taught very much.

Specifically, ChatGPT, Google Bard, DALL-E, and similar are generative deep learning models. ChatGPT and Bard are both what are called large language models. Next week, we will learn the very first step of natural language processing, and after Thanksgiving, we will look at a small language model.

19.7.2. How did you get that score is run more times than fit? I thought mean_fit_time and mean_score_time only measured the time, not the number of occurrences#

Your understanding of GridSearchCV is correct. In that part I was talking about which time matters more and why. In a real world example, as a data scientist, you would use the fit method each time you need a new version of the model. In contrast, the predict method would be called each time a user makes a request. Strictly speaking, score calls predict once on the whole test set, so the time it measures covers len(X_test) predictions rather than just one. That makes it some multiple of the time a single customer/user would experience, but it is the same number of test samples for each model tested, so it is a fair comparison.

So if it was at a bank deciding whether or not to give loans, you might update and retrain it once a month (fit), but customers might apply hundreds of times per day (predict).

If it was the Etsy example of unsupervised learning for styles, they fit it once per day, so even though customers would load the site pages thousands of times per day, they might care a little about the fit time too. So if a model was a lot faster to fit, but only a little slower to predict, it could be worth it. article

19.7.3. Does GNB also have these mean_score, mean_time parameters like DT and SVC do?#

These values belong to the GridSearchCV object, not to the DecisionTreeClassifier or the SVC. Therefore, they exist for any GridSearchCV no matter what estimator object is used to instantiate it. So, yes, for a GridSearchCV over a GaussianNB these values would exist.
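
For example, a grid search over GaussianNB (here using its var_smoothing parameter with an illustrative grid) has the same timing columns:

from sklearn.naive_bayes import GaussianNB

# an illustrative grid over GaussianNB's var_smoothing parameter
params_gnb = {'var_smoothing': [1e-9, 1e-8, 1e-7]}
gnb_opt = model_selection.GridSearchCV(GaussianNB(), params_gnb, cv=10)
gnb_opt.fit(iris_X_train, iris_y_train)
# the same columns exist no matter which estimator was searched
pd.DataFrame(gnb_opt.cv_results_)[['mean_fit_time', 'mean_score_time']]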

19.7.5. Are they still using the AI to see patient priority in hospitals?#

The model I told the story about never got deployed. ML is used to varying degrees in hospitals. There is a lot of research on machine learning for healthcare; see, for example, the list of 2023 papers.

This is an example of a well designed, successfully deployed model. You can read it through URI; let me know if you do not know how to use the library to get access.

19.7.6. Why does it matter how long it takes to fit data when the time is almost instantaneous?#

When it is this low, we likely would not care much about the fit time. The minor difference in score time could still add up: even if it is only .0001s, if it is called 3 million times per day, then the time (and the associated compute and energy costs, etc.) adds up.
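
A quick back-of-the-envelope calculation with those numbers:

# .0001 extra seconds per call, 3 million calls per day
.0001 * 3_000_000  # 300 seconds, about 5 minutes of extra compute every day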

19.7.7. When models are trained at an enterprise level / for enterprise applications, are pandas & scikit-learn still used, or is a lower-level language/package used (like for Netflix)?#

What data scientists do typically still uses pandas and scikit-learn. For a model deployed in a time-critical or low-memory setting, the fitted model may be re-implemented in a faster language.