33. Model Comparison#
To compare models, we will first optimize the parameters of two different models and look at how the different parameter settings impact the model comparison. Later, we'll see how to compare across models of different classes.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import svm
from sklearn import tree
from sklearn import model_selection
We could import modules however we want, for example:
from sklearn import model_selection as ms
200*.8/5
32.0
We’ll use the iris data again.
Remember, we need to split the data into training and test sets. The cross validation step will help us optimize the parameters, but we don't want data leakage where the model has seen the test data multiple times. So, we split the data here into train and test, and the cross validation splits the training data into train and "test" again, but this second test set is better termed a validation set.
iris_df = sns.load_dataset('iris')
iris_X = iris_df.drop(columns='species')
iris_y = iris_df['species']
iris_X_train, iris_X_test, iris_y_train, iris_y_test = model_selection.train_test_split(iris_X,iris_y, test_size =.2)
Then we can make the estimator object, the parameter grid dictionary, and the GridSearchCV object. We split these into separate cells so that we can use the built-in help to see more detail.
dt = tree.DecisionTreeClassifier()
params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],
'min_samples_leaf':list(range(2,20,2))}
dt_opt = model_selection.GridSearchCV(dt, params_dt)
dt_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4], 'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})
Then we fit the grid search using the training data; remember, this actually resets the estimator's parameters to each combination in the grid and cross validates each one, so many models are fit.
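To make that more concrete, here is a rough sketch of what GridSearchCV is doing internally (an illustrative approximation, not the actual library code): it loops over every combination in the parameter grid, resets the estimator to those values, and cross validates each one.

from sklearn.model_selection import ParameterGrid, cross_val_score

# rough sketch: try every parameter combination, cross validate each one,
# and keep the combination with the best mean validation score
best_score, best_params = -1, None
for params in ParameterGrid(params_dt):
    dt.set_params(**params)                     # reset the estimator's parameters
    scores = cross_val_score(dt, iris_X_train, iris_y_train, cv=5)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params

best_params, best_score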
We can reformat the cross validation results (cv_results_) into a DataFrame for further analysis.
dt_cv_df = pd.DataFrame(dt_opt.cv_results_)
dt_cv_df.head()
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_min_samples_leaf | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002271 | 0.000882 | 0.001283 | 0.000139 | gini | 2 | 2 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 1 | 0.001738 | 0.000011 | 0.001199 | 0.000028 | gini | 2 | 4 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 2 | 0.001755 | 0.000024 | 0.001199 | 0.000021 | gini | 2 | 6 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 3 | 0.001751 | 0.000014 | 0.001199 | 0.000037 | gini | 2 | 8 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 4 | 0.001762 | 0.000019 | 0.001192 | 0.000020 | gini | 2 | 10 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
type(dt_opt.best_estimator_)
sklearn.tree._classes.DecisionTreeClassifier
dt_opt.best_params_
{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2}
I changed the markers and the color of the error bars for readability.
plt.errorbar(x=dt_cv_df['mean_fit_time'],y=dt_cv_df['mean_score_time'],
             xerr=dt_cv_df['std_fit_time'],yerr=dt_cv_df['std_score_time'],
             marker='s',ecolor='r')
plt.xlabel('fit time')
plt.ylabel('score time')
# save the limits so we can reuse them
xmin, xmax, ymin, ymax = plt.axis()
The “points” are at the mean fit and score times. The lines are the “standard deviation,” or how much we expect that number to vary, since the means are estimates. Because the data shows an upward trend, this plot tells us that, mostly, the models that are slower to fit are also slower to apply. This makes sense for decision trees: deeper trees take longer to learn and longer to traverse when predicting. Because the error bars mostly overlap the other points, the variation in time is mostly not a reliable difference; if we re-ran the GridSearch, we could get them in different orders.
To interpret the error bar plot, let’s look at a line plot of just the means, with the same limits so that it’s easier to compare to the plot above.
plt.plot(dt_cv_df['mean_fit_time'],
         dt_cv_df['mean_score_time'], marker='s')
plt.xlabel('fit time')
plt.ylabel('score time')
# match the axis limits to above
plt.ylim(ymin, ymax)
plt.xlim(xmin,xmax)
This plot shows the mean times, without the error bars.
33.1. Fitting an SVM#
Now let's compare with a different model; we'll use the parameter-optimized version of that model as well.
svmclf = svm.SVC()
param_grid_svm = {'kernel':['linear','rbf'], 'C':[.5, 1, 10]}
svm_opt = model_selection.GridSearchCV(svmclf,param_grid_svm)
svm_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=SVC(), param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})
svm_opt.best_params_
{'C': 0.5, 'kernel': 'linear'}
svm_opt.score(iris_X_test,iris_y_test), dt_opt.score(iris_X_test,iris_y_test)
(0.9333333333333333, 0.9666666666666667)
33.2. Other ways to compare models#
We can look at the performance; here the score is the accuracy, but we could also look at other performance metrics.
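For example, we could look at per-class precision, recall, and F1. A minimal sketch using sklearn's classification_report with the two optimized models from above:

from sklearn import metrics

# per-class precision, recall, and F1 for each optimized model on the held-out test set
print(metrics.classification_report(iris_y_test, dt_opt.predict(iris_X_test)))
print(metrics.classification_report(iris_y_test, svm_opt.predict(iris_X_test)))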
We can compare them on time: the training time or the test (prediction) time, which is usually the more important of the two.
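Both grid searches already recorded timing in cv_results_, so one way to compare the two model classes on time (just a sketch) is:

# average fit and score (prediction) times across all parameter settings tried
svm_cv_df = pd.DataFrame(svm_opt.cv_results_)
print('DT  fit:', dt_cv_df['mean_fit_time'].mean(), 'score:', dt_cv_df['mean_score_time'].mean())
print('SVM fit:', svm_cv_df['mean_fit_time'].mean(), 'score:', svm_cv_df['mean_score_time'].mean())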
We can also compare models on their interpretability: for a decision tree, it is easy to explain how it makes a decision.
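For example, sklearn can print a fitted decision tree as a set of nested if/else rules; a small sketch using the best tree from the grid search above:

# print the fitted tree as if/else rules on the feature values
print(tree.export_text(dt_opt.best_estimator_,
                       feature_names=list(iris_X_train.columns)))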
We can compare whether a model is generative (it describes the data and could generate synthetic data) or discriminative (it only describes the decision rule). A generative model might be preferred, even with lower accuracy, by a scientist who wants to understand the data; it can also give ideas about what you might do to improve the model. If you just need the best accuracy, for example if you are placing ads, you would typically use a discriminative model: it can be more complex, and you only need the accurate prediction, not an understanding of the data.
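To make the generative part concrete, here is a sketch of generating synthetic data from a Gaussian Naive Bayes model, which models each class as a Gaussian over the features (theta_ and var_ are sklearn's attribute names for the learned means and variances; var_ was called sigma_ in older versions):

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(iris_X_train, iris_y_train)

# a generative model describes the data, so we can sample synthetic points
# from the per-class Gaussians it learned (theta_ = means, var_ = variances)
synthetic_X = np.concatenate([
    np.random.normal(loc=gnb.theta_[i], scale=np.sqrt(gnb.var_[i]), size=(10, 4))
    for i in range(len(gnb.classes_))])
synthetic_X[:5]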
33.3. Questions After Class#
33.3.1. How does the svm algorithm work with many classes? Does it basically draw a grid? Does it work like a decision tree with a single line as each branch decider?#
Multiclass SVM can be done multiple ways. In sklearn it does what is called one-vs-one. Basically it trains multiple models and votes.
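For the 3 iris classes, one-vs-one means training one binary SVM per pair of classes (3 pairs here) and predicting the class that wins the most pairwise votes. SVC does this internally; the sketch below just makes it explicit with sklearn's OneVsOneClassifier wrapper:

from sklearn.multiclass import OneVsOneClassifier

# one binary SVM per pair of classes; prediction is by voting
ovo = OneVsOneClassifier(svm.SVC(kernel='linear', C=0.5))
ovo.fit(iris_X_train, iris_y_train)
len(ovo.estimators_)   # 3 classes -> 3*(3-1)/2 = 3 pairwise classifiers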
33.3.2. Is best_params_ a way of finding the best options when optimizing a model?#
It is a way to look up the parameter values that had the best score, once you have already run the optimization (fit the GridSearchCV object).
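A small sketch of the relationship: best_params_ is just the params entry from the top-ranked row of cv_results_, the row GridSearchCV points to with best_index_.

# the best parameters come from the cross validation row with the best mean test score
dt_cv_df.loc[dt_opt.best_index_, 'params'] == dt_opt.best_params_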
33.3.3. I would like to know which of the three (svm/gnb/dt) is used the most in the world of data science, in your opinion.#
Gaussian Naive Bayes is definitely the least commonly used. It is very simple, which makes it useful for learning the concepts of classification, but that same simplicity means it is often not very reliable.
SVMs were very popular for a while, but they have some underlying theoretical similarity to neural nets, and now that modern computers make neural nets practical, SVMs are used less. SVMs share similar disadvantages with neural nets but are generally less accurate.
Decision trees are still pretty popular, even with deep learning, because they are interpretable. Right now, of these three, I would say they are probably the most common.
33.3.4. Is there a limit to data size when it comes to models?#
Not to the model, but possibly to your computer. We also have batch training that can help with larger datasets.
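For example, some sklearn estimators support partial_fit, which trains on one batch at a time so the whole dataset never has to be in memory at once; a sketch with SGDClassifier (the batch size here is arbitrary):

from sklearn.linear_model import SGDClassifier

# incremental (batch) training: call partial_fit once per chunk of data
batch_clf = SGDClassifier()
classes = iris_y_train.unique()   # partial_fit needs the full list of classes up front
for start in range(0, len(iris_X_train), 30):
    batch_clf.partial_fit(iris_X_train.iloc[start:start+30],
                          iris_y_train.iloc[start:start+30],
                          classes=classes)
batch_clf.score(iris_X_test, iris_y_test)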