26. SVM and Parameter Optimization#

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import cluster
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import tree

26.1. Support Vectors#

26.1.1. Basic Idea#

Imagine we have data that is like this:

very separable data

We might want to choose a decision boundary to separate it. We could choose any one of these three gray lines and get 100% training accuracy.

very separable data with 3 decision boundaries

We could say that the best one is the solid one because it best separates the data.

very separable data with a decision boundary

SVM does this: it finds the ‘support vectors’, which are the points of each class closest to the other class, and then finds the decision boundary that has the maximum margin, where the margin is the space between the boundary and each class.

very separable data with a decision boundary and margin highlighted

When SVM is looking only for straight lines, it’s called linear SVM, but SVM can look for different types of boundaries. We do this by changing the kernel function. A popular one is called the radial basis function, or rbf; it allows smooth, curvy lines.

This lets the SVM work on data like this:

very separable data with a decision boundary

It can also handle data that is not perfectly separable, like the following, by minimizing the number of errors while maximizing the margin.

very separable data with a decision boundary
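In scikit-learn, both of these choices are parameters of the classifier. A minimal sketch (the specific values here are only illustrative, not the ones used below):

# linear SVM: only straight-line boundaries
linear_svm = svm.SVC(kernel='linear')

# rbf kernel SVM: allows smooth, curved boundaries;
# C controls the trade-off between a wide margin and misclassified training points
rbf_svm = svm.SVC(kernel='rbf', C=1.0)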

26.1.2. SVM in Sklearn#

First we’ll load the data and separate the features and target (\(X\) and \(y\)).

iris_df = sns.load_dataset('iris')
iris_X = iris_df.drop(columns='species')
iris_y = iris_df['species']

Next, we will split the data into test and train.

iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X,iris_y)
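By default this shuffles the data and holds out 25% for the test set. If you want the same split on every run, one option (not what was done above) is to fix the seed:

# illustrative only: fix the test fraction and the random seed for a reproducible split
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=0.25, random_state=0)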

Fitting the model is just like other models we have seen:

  1. instantiate the object

  2. fit the model

  3. score the model on the test data

svm_clf = svm.SVC()
svm_clf.fit(iris_X_train, iris_y_train)
svm_clf.score(iris_X_test, iris_y_test)
0.9473684210526315

We see that this fits pretty well with the default parameters.

26.2. Grid Search Optimization#

We can, however, try to do better by optimizing the parameter settings.

A simple way to do this is to fit the model with different parameter settings, score each one, and compare.

We’ll focus on the kernel, which controls the type of line, and \(C\) which controls the regularization.
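Before reaching for any special tooling, that comparison could be sketched by hand with a small loop (a rough sketch; the values are just examples):

# manually fit and score one model per parameter setting
for kernel in ['linear', 'rbf']:
    for C in [.5, 1, 10]:
        clf = svm.SVC(kernel=kernel, C=C)
        clf.fit(iris_X_train, iris_y_train)
        print(kernel, C, clf.score(iris_X_test, iris_y_test))

GridSearchCV, used below, automates this kind of search, but scores each setting with cross validation on the training data instead of reusing the test set.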

param_grid = {'kernel':['linear','rbf'], 'C':[.5, 1, 10]}
svm_opt = GridSearchCV(svm_clf, param_grid)

The GridSearchCV object is constructed first and requires an estimator object and a dictionary that describes the parameter grid to search over. The keys of the dictionary are the parameter names, and the values are lists of the settings to try for that parameter.

The fit method on the GridSearchCV object fits all of the separate models: each of the 6 parameter combinations is fit and scored on each of the default 5 cross validation folds.

svm_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})

Then we can look at the output.

svm_opt.cv_results_
{'mean_fit_time': array([0.00163226, 0.00164022, 0.00154209, 0.00159893, 0.00152874,
        0.00155783]),
 'std_fit_time': array([1.39948448e-04, 2.34362431e-05, 8.18578171e-06, 1.62563012e-05,
        2.04490324e-05, 1.54284640e-05]),
 'mean_score_time': array([0.00117993, 0.00117402, 0.00112467, 0.0011734 , 0.00112076,
        0.00113878]),
 'std_score_time': array([8.47353181e-05, 7.42611683e-06, 1.67697602e-05, 3.05814373e-05,
        7.17223956e-06, 1.47908707e-05]),
 'param_C': masked_array(data=[0.5, 0.5, 1, 1, 10, 10],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['linear', 'rbf', 'linear', 'rbf', 'linear', 'rbf'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.5, 'kernel': 'linear'},
  {'C': 0.5, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 1, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'}],
 'split0_test_score': array([1.        , 1.        , 1.        , 1.        , 0.95652174,
        0.95652174]),
 'split1_test_score': array([1.        , 1.        , 1.        , 1.        , 0.95652174,
        0.95652174]),
 'split2_test_score': array([1.        , 0.90909091, 1.        , 0.90909091, 0.95454545,
        0.95454545]),
 'split3_test_score': array([1.        , 0.95454545, 1.        , 1.        , 0.95454545,
        1.        ]),
 'split4_test_score': array([0.95454545, 0.90909091, 0.95454545, 0.95454545, 1.        ,
        0.95454545]),
 'mean_test_score': array([0.99090909, 0.95454545, 0.99090909, 0.97272727, 0.96442688,
        0.96442688]),
 'std_test_score': array([0.01818182, 0.04065578, 0.01818182, 0.03636364, 0.01780851,
        0.01780851]),
 'rank_test_score': array([1, 6, 1, 3, 4, 4], dtype=int32)}

We note that this is a dictionary, so to make it more readable, we can make it a DataFrame.

pd.DataFrame(svm_opt.cv_results_)
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_kernel params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.001632 0.000140 0.001180 0.000085 0.5 linear {'C': 0.5, 'kernel': 'linear'} 1.000000 1.000000 1.000000 1.000000 0.954545 0.990909 0.018182 1
1 0.001640 0.000023 0.001174 0.000007 0.5 rbf {'C': 0.5, 'kernel': 'rbf'} 1.000000 1.000000 0.909091 0.954545 0.909091 0.954545 0.040656 6
2 0.001542 0.000008 0.001125 0.000017 1 linear {'C': 1, 'kernel': 'linear'} 1.000000 1.000000 1.000000 1.000000 0.954545 0.990909 0.018182 1
3 0.001599 0.000016 0.001173 0.000031 1 rbf {'C': 1, 'kernel': 'rbf'} 1.000000 1.000000 0.909091 1.000000 0.954545 0.972727 0.036364 3
4 0.001529 0.000020 0.001121 0.000007 10 linear {'C': 10, 'kernel': 'linear'} 0.956522 0.956522 0.954545 0.954545 1.000000 0.964427 0.017809 4
5 0.001558 0.000015 0.001139 0.000015 10 rbf {'C': 10, 'kernel': 'rbf'} 0.956522 0.956522 0.954545 1.000000 0.954545 0.964427 0.017809 4
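One optional way to read this more easily (not a step we did above) is to sort by the ranking column and keep just the parameter and score columns:

# sort the results by rank and show the most informative columns
results_df = pd.DataFrame(svm_opt.cv_results_)
results_df.sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score']]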

It also has a best_estimator_ attribute, which is an estimator object.

type(svm_opt.best_estimator_)
sklearn.svm._classes.SVC

This is the model that had the best cross validated score among all of the parameter settings tested.

svm_opt.best_estimator_.score(iris_X_test,iris_y_test)
0.9736842105263158

We can then use this best model like any other fitted classifier, for example to make predictions on new data.
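Because refit is True by default, the fitted GridSearchCV object itself delegates scoring and prediction to best_estimator_, so (assuming the fit above) the following is equivalent:

# the GridSearchCV object delegates to best_estimator_ after fitting
svm_opt.score(iris_X_test, iris_y_test)

# it can also predict on new samples directly
svm_opt.predict(iris_X_test[:5])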

Try it Yourself

Find the best criterion, max depth, and minimum number of samples per leaf

dt = tree.DecisionTreeClassifier()
params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],
       'min_samples_leaf':list(range(2,20,2))}

To do this, we do just as we did above: instantiate the GridSearchCV object and fit it.

dt_opt = GridSearchCV(dt,params_dt)
dt_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 3, 4],
                         'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})

Then we can use the best_params_ attribute to see the best parameter settings.

dt_opt.best_params_
{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2}
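We can also check how the tuned tree does on the held-out test data, mirroring what we did with the SVM (the exact score will vary with the split):

# score the best decision tree found by the grid search on the test data
dt_opt.best_estimator_.score(iris_X_test, iris_y_test)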

26.3. Questions after class#

26.3.1. Can this be used on more types of machine learning than just decision trees and svm?#

Yes, this can be used with any estimator in scikit-learn. It can even be used with other models that adhere to the scikit-learn estimator API.
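For example, the same pattern works with a completely different classifier; a quick sketch with k nearest neighbors (the parameter values are arbitrary examples):

from sklearn.neighbors import KNeighborsClassifier

# the same GridSearchCV pattern with a different estimator
knn_params = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
knn_opt = GridSearchCV(KNeighborsClassifier(), knn_params)
knn_opt.fit(iris_X_train, iris_y_train)
knn_opt.best_params_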

GridSearchCV repeatedly:

  • sets the parameter values from param_grid

  • runs cross_val_score on the data (roughly as in the sketch below)
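Conceptually, that loop could be sketched like this (a rough equivalent for intuition, not the actual implementation):

from sklearn.base import clone
from sklearn.model_selection import ParameterGrid, cross_val_score

# rough sketch of what GridSearchCV does internally
scores = {}
for params in ParameterGrid(param_grid):
    model = clone(svm_clf).set_params(**params)
    scores[tuple(params.items())] = cross_val_score(model, iris_X_train, iris_y_train).mean()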