21. SVM and Model Optimization#

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import tree

If we have 500 samples and use train_size=.8 in train_test_split and the default values for GridSearchCV, how many samples are in each validation set?

N = 500
train_size = .8
cv = 5  # GridSearchCV uses 5-fold cross validation by default
# each validation fold is 1/cv of the training portion of the data
train_size*N/cv
80.0
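
We can check this directly. Below is a small sketch that uses KFold on fake data of the right size; GridSearchCV actually defaults to StratifiedKFold for classifiers, but the fold sizes come out the same.

from sklearn.model_selection import KFold

# 80% of the 500 samples go to training, and those are split into 5 folds
fake_train = np.zeros((int(train_size * N), 1))
[len(val_idx) for _, val_idx in KFold(n_splits=cv).split(fake_train)]
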
iris_X, iris_y = datasets.load_iris(return_X_y=True)
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X, iris_y)
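
As a quick check of the split we just made (just a sketch; the numbers follow from the default test_size of 0.25), we can print the shapes:

# by default about a quarter of the samples (here 38 of 150) are held out for the test set
print(iris_X_train.shape, iris_X_test.shape)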

21.1. Fitting an SVM#

Now let’s compare with a different model; we’ll use the parameter-optimized version of that model.

The sklearn docs have a good description of SVC and its parameters.

svm_clf = svm.SVC()
# search over both the kernel and the regularization parameter C
param_grid = {'kernel':['linear','rbf'], 'C':[.5, 1, 10]}
svm_opt = GridSearchCV(svm_clf, param_grid)
svm_opt.fit(iris_X_train, iris_y_train)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})
svm_df = pd.DataFrame(svm_opt.cv_results_)
svm_df
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000882 | 0.000178 | 0.000551 | 0.000059 | 0.5 | linear | {'C': 0.5, 'kernel': 'linear'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.973123 | 0.021957 | 1 |
| 1 | 0.000853 | 0.000012 | 0.000597 | 0.000052 | 0.5 | rbf | {'C': 0.5, 'kernel': 'rbf'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.909091 | 0.964032 | 0.033918 | 5 |
| 2 | 0.000749 | 0.000028 | 0.000499 | 0.000017 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.973123 | 0.021957 | 1 |
| 3 | 0.000877 | 0.000153 | 0.000545 | 0.000008 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.909091 | 0.964032 | 0.033918 | 5 |
| 4 | 0.000764 | 0.000042 | 0.000502 | 0.000011 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 0.913043 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.964427 | 0.032761 | 4 |
| 5 | 0.000760 | 0.000017 | 0.000524 | 0.000015 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.973123 | 0.021957 | 1 |
svm_opt.best_params_
{'C': 0.5, 'kernel': 'linear'}
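
Since GridSearchCV refits the best estimator on the whole training set by default, the fitted svm_opt object can score the held-out test data directly (a quick sketch; the exact accuracy depends on the random train/test split):

# accuracy of the refit best estimator on the test set
svm_opt.score(iris_X_test, iris_y_test)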

21.2. Other ways to compare models#

We can look at the performance; here the score is accuracy, but we could also look at other performance metrics.
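
For example, here is a small sketch using scikit-learn's classification_report (not used elsewhere in this section) to look at per-class precision and recall in addition to accuracy:

from sklearn.metrics import classification_report

# per-class precision, recall, and F1 on the held-out test set
iris_y_pred = svm_opt.predict(iris_X_test)
print(classification_report(iris_y_test, iris_y_pred))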

We can compare them on time: the training time or the test time (the latter is usually more important).
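
GridSearchCV already records both kinds of time; as a sketch, we can pull the timing columns out of the results DataFrame from above:

# average fit and score times (in seconds) for each parameter setting
svm_df[['param_kernel', 'param_C', 'mean_fit_time', 'mean_score_time']]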

We can also compare models on their interpretability: with a decision tree, for example, it is easy to explain how it arrives at each decision.
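
For instance (a minimal sketch, fitting a fresh decision tree on the same training data with the tree module imported above), the learned rules can be printed as text:

# fit a shallow decision tree and print its if/then rules
dt_clf = tree.DecisionTreeClassifier(max_depth=2)
dt_clf.fit(iris_X_train, iris_y_train)
print(tree.export_text(dt_clf))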

We can also compare whether a model is generative (it describes the data and could generate synthetic data) or discriminative (it only describes the decision rule). A scientist who wants to understand the data might prefer a generative model even if it has lower accuracy, since it can also suggest ideas for how to improve the model. If you just need the highest accuracy, for example when placing ads, you would typically use a discriminative model, even a complex one, because you only need the accuracy and do not need to understand the decisions.
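
To make the distinction concrete, here is a hedged sketch using GaussianNB as a generative classifier (assuming a recent scikit-learn where the fitted per-class means and variances are exposed as theta_ and var_): after fitting, we can sample synthetic iris-like measurements, which a discriminative SVC cannot do.

from sklearn.naive_bayes import GaussianNB

# GaussianNB is generative: it models each class as a Gaussian over the features
gnb = GaussianNB()
gnb.fit(iris_X_train, iris_y_train)

# draw 3 synthetic samples from the fitted distribution for class 0
# (theta_ holds per-class feature means, var_ holds per-class feature variances)
rng = np.random.default_rng(0)
rng.normal(gnb.theta_[0], np.sqrt(gnb.var_[0]), size=(3, iris_X_train.shape[1]))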