21. SVM and Model Optimization#

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import tree

If we have 500 samples and use train_size=.8 in train_test_split and the default values for GridSearchCV, how many samples are in each validation set?

N = 500
train_size = .8
cv = 5  # GridSearchCV uses 5-fold cross validation by default
# each validation fold is 1/cv of the training portion of the data
train_size*N/cv
80.0
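
We can check this directly. Below is a small sketch that uses KFold on fake data of the right size; GridSearchCV actually defaults to StratifiedKFold for classifiers, but the fold sizes come out the same.

from sklearn.model_selection import KFold

# 80% of the 500 samples go to training, and those are split into 5 folds
fake_train = np.zeros((int(train_size * N), 1))
[len(val_idx) for _, val_idx in KFold(n_splits=cv).split(fake_train)]
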
iris_X, iris_y = datasets.load_iris(return_X_y=True)
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X, iris_y)
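
As a quick check of the split we just made (just a sketch; the numbers follow from the default test_size of 0.25), we can print the shapes:

# by default about a quarter of the samples (here 38 of 150) are held out for the test set
print(iris_X_train.shape, iris_X_test.shape)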

21.1. Fitting an SVM#

Now let’s compare with a different model; we’ll use the parameter-optimized version of that model.

The sklearn docs have a good description of SVC and its parameters.

svm_clf = svm.SVC()
# search over both the kernel and the regularization parameter C
param_grid = {'kernel':['linear','rbf'], 'C':[.5, 1, 10]}
svm_opt = GridSearchCV(svm_clf, param_grid)
svm_opt.fit(iris_X_train, iris_y_train)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})
svm_df = pd.DataFrame(svm_opt.cv_results_)
svm_df
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000882 | 0.000178 | 0.000551 | 0.000059 | 0.5 | linear | {'C': 0.5, 'kernel': 'linear'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.973123 | 0.021957 | 1 |
| 1 | 0.000853 | 0.000012 | 0.000597 | 0.000052 | 0.5 | rbf | {'C': 0.5, 'kernel': 'rbf'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.909091 | 0.964032 | 0.033918 | 5 |
| 2 | 0.000749 | 0.000028 | 0.000499 | 0.000017 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.973123 | 0.021957 | 1 |
| 3 | 0.000877 | 0.000153 | 0.000545 | 0.000008 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.909091 | 0.964032 | 0.033918 | 5 |
| 4 | 0.000764 | 0.000042 | 0.000502 | 0.000011 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 0.913043 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.964427 | 0.032761 | 4 |
| 5 | 0.000760 | 0.000017 | 0.000524 | 0.000015 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.956522 | 1.0 | 0.954545 | 1.0 | 0.954545 | 0.973123 | 0.021957 | 1 |
svm_opt.best_params_
{'C': 0.5, 'kernel': 'linear'}
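
Since GridSearchCV refits the best estimator on the whole training set by default, the fitted svm_opt object can score the held-out test data directly (a quick sketch; the exact accuracy depends on the random train/test split):

# accuracy of the refit best estimator on the test set
svm_opt.score(iris_X_test, iris_y_test)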

21.2. Other ways to compare models#

We can look at the performance; here the score is accuracy, but we could also look at other performance metrics.
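
For example, here is a small sketch using scikit-learn's classification_report (not used elsewhere in this section) to look at per-class precision and recall in addition to accuracy:

from sklearn.metrics import classification_report

# per-class precision, recall, and F1 on the held-out test set
iris_y_pred = svm_opt.predict(iris_X_test)
print(classification_report(iris_y_test, iris_y_pred))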

We can compare them on time: the training time or the test time (the latter is usually more important).
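
GridSearchCV already records both kinds of time; as a sketch, we can pull the timing columns out of the results DataFrame from above:

# average fit and score times (in seconds) for each parameter setting
svm_df[['param_kernel', 'param_C', 'mean_fit_time', 'mean_score_time']]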

We can also compare models on their interpretability: with a decision tree, for example, it is easy to explain how it arrives at each decision.
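
For instance (a minimal sketch, fitting a fresh decision tree on the same training data with the tree module imported above), the learned rules can be printed as text:

# fit a shallow decision tree and print its if/then rules
dt_clf = tree.DecisionTreeClassifier(max_depth=2)
dt_clf.fit(iris_X_train, iris_y_train)
print(tree.export_text(dt_clf))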

We can also compare whether a model is generative (it describes the data and could generate synthetic data) or discriminative (it only describes the decision rule). A scientist who wants to understand the data might prefer a generative model even if it has lower accuracy, since it can also suggest ideas for how to improve the model. If you just need the highest accuracy, for example when placing ads, you would typically use a discriminative model, even a complex one, because you only need the accuracy and do not need to understand the decisions.
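
To make the distinction concrete, here is a hedged sketch using GaussianNB as a generative classifier (assuming a recent scikit-learn where the fitted per-class means and variances are exposed as theta_ and var_): after fitting, we can sample synthetic iris-like measurements, which a discriminative SVC cannot do.

from sklearn.naive_bayes import GaussianNB

# GaussianNB is generative: it models each class as a Gaussian over the features
gnb = GaussianNB()
gnb.fit(iris_X_train, iris_y_train)

# draw 3 synthetic samples from the fitted distribution for class 0
# (theta_ holds per-class feature means, var_ holds per-class feature variances)
rng = np.random.default_rng(0)
rng.normal(gnb.theta_[0], np.sqrt(gnb.var_[0]), size=(3, iris_X_train.shape[1]))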