26. SVM and Parameter Optimizing
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import cluster
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import tree
26.1. Support Vectors
26.1.1. Basic Idea
Imagine we have data that looks like this:
We might want to choose a decision boundary to separate it. We could choose any one of these three gray lines and get 100% training accuracy.
We could say that the best one is the solid line because it best separates the data.
SVM does this: it finds the ‘support vectors’, which are the points of each class closest to the other class, and then finds the decision boundary that has the maximum margin, where the margin is the space between the boundary and the nearest points of each class.
When SVM looks only for straight lines, it’s called linear SVM, but SVM can look for different types of boundaries. We do this by changing the kernel function. A popular one is the radial basis function, or rbf, which allows smooth, curvy boundaries, so that the SVM can work on data like this:
It can also handle data that is not perfectly separable, like the following, by trading off minimizing the number of errors against maximizing the margin.
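The idea of support vectors and a maximum margin can be sketched on a tiny data set. The points below are hypothetical toy values chosen only so the two classes are easy to separate; they are not the data from the figures. For a linear kernel, the margin width can be computed from the fitted weights as 2 / ||w||.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: two well-separated clusters in 2D
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for a straight-line boundary
linear_clf = svm.SVC(kernel='linear')
linear_clf.fit(X, y)

# The support vectors are the training points closest to the boundary
print(linear_clf.support_vectors_)

# n_support_ counts the support vectors in each class
print(linear_clf.n_support_)

# For a linear kernel, the margin width is 2 / ||w||
margin_width = 2 / np.linalg.norm(linear_clf.coef_)
print(margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; moving any other point slightly (without crossing the margin) would leave the fitted line unchanged.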
26.1.2. SVM in Sklearn
First we’ll load the data and separate the features and target (\(X\) and \(y\)).
iris_df = sns.load_dataset('iris')
iris_X = iris_df.drop(columns='species')
iris_y = iris_df['species']
Next, we will split the data into test and train.
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris_X,iris_y)
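As a side note on the split, here is a minimal sketch of what `train_test_split` does with its defaults. It uses `load_iris` from scikit-learn instead of `sns.load_dataset` so that it runs offline; the `random_state` and `stratify` arguments are optional additions (not used above) that make the split repeatable and keep the class proportions similar in both parts.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris is swapped in for sns.load_dataset so the sketch runs offline
X, y = load_iris(return_X_y=True)

# By default, 25% of the samples are held out for testing.
# random_state fixes the shuffle; stratify=y balances classes across the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
# (112, 4) (38, 4)
```

With 150 iris samples, the default split leaves 38 samples for testing, which is why test scores on this data move in steps of 1/38.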
Fitting the model is just like other models we have seen:
instantiate the object
fit the model
score the model on the test data
svm_clf = svm.SVC()
svm_clf.fit(iris_X_train, iris_y_train)
svm_clf.score(iris_X_test, iris_y_test)
0.9473684210526315
We see that this fits pretty well with the default parameters.
26.2. Grid Search Optimization
We can, however, try to improve on this by searching over different parameter settings.
A simple way to do this is to fit the model for each combination of parameter values, score each fit, and compare the scores.
We’ll focus on the kernel, which controls the type of decision boundary, and \(C\), which controls the regularization (in scikit-learn, a smaller \(C\) means stronger regularization).
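To get a feel for these two parameters before searching over them, the following sketch fits one SVC per combination and reports how many support vectors each fit keeps. It assumes `load_iris` as an offline stand-in for the seaborn data and a fixed `random_state`; the variable name `support_vector_counts` is just for illustration. A smaller \(C\) tolerates more margin violations, which typically (not always) leaves more training points as support vectors.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Offline stand-in for the iris data used in these notes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

support_vector_counts = {}
for kernel in ['linear', 'rbf']:
    for C in [.5, 1, 10]:
        clf = svm.SVC(kernel=kernel, C=C).fit(X_train, y_train)
        # total number of support vectors across all classes
        support_vector_counts[(kernel, C)] = int(clf.n_support_.sum())
        print(kernel, C, support_vector_counts[(kernel, C)],
              clf.score(X_test, y_test))
```

Scoring each setting on the test set like this is only for exploration; choosing parameters should be done with cross validation on the training data, which is what the grid search does.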
param_grid = {'kernel': ['linear', 'rbf'], 'C': [.5, 1, 10]}
svm_opt = GridSearchCV(svm_clf, param_grid)
The GridSearchCV object is constructed first and requires an estimator object and a dictionary that describes the parameter grid to search over.
The dictionary has the parameter names as its keys, and each value is the list of values to test for that parameter.
The fit method on the grid search object fits all of the separate models, using cross validation on the training data.
svm_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=SVC(), param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})
Then we can look at the output.
svm_opt.cv_results_
{'mean_fit_time': array([0.00163226, 0.00164022, 0.00154209, 0.00159893, 0.00152874,
0.00155783]),
'std_fit_time': array([1.39948448e-04, 2.34362431e-05, 8.18578171e-06, 1.62563012e-05,
2.04490324e-05, 1.54284640e-05]),
'mean_score_time': array([0.00117993, 0.00117402, 0.00112467, 0.0011734 , 0.00112076,
0.00113878]),
'std_score_time': array([8.47353181e-05, 7.42611683e-06, 1.67697602e-05, 3.05814373e-05,
7.17223956e-06, 1.47908707e-05]),
'param_C': masked_array(data=[0.5, 0.5, 1, 1, 10, 10],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'param_kernel': masked_array(data=['linear', 'rbf', 'linear', 'rbf', 'linear', 'rbf'],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'C': 0.5, 'kernel': 'linear'},
{'C': 0.5, 'kernel': 'rbf'},
{'C': 1, 'kernel': 'linear'},
{'C': 1, 'kernel': 'rbf'},
{'C': 10, 'kernel': 'linear'},
{'C': 10, 'kernel': 'rbf'}],
'split0_test_score': array([1. , 1. , 1. , 1. , 0.95652174,
0.95652174]),
'split1_test_score': array([1. , 1. , 1. , 1. , 0.95652174,
0.95652174]),
'split2_test_score': array([1. , 0.90909091, 1. , 0.90909091, 0.95454545,
0.95454545]),
'split3_test_score': array([1. , 0.95454545, 1. , 1. , 0.95454545,
1. ]),
'split4_test_score': array([0.95454545, 0.90909091, 0.95454545, 0.95454545, 1. ,
0.95454545]),
'mean_test_score': array([0.99090909, 0.95454545, 0.99090909, 0.97272727, 0.96442688,
0.96442688]),
'std_test_score': array([0.01818182, 0.04065578, 0.01818182, 0.03636364, 0.01780851,
0.01780851]),
'rank_test_score': array([1, 6, 1, 3, 4, 4], dtype=int32)}
We note that this is a dictionary, so to make it more readable, we can make it a DataFrame.
pd.DataFrame(svm_opt.cv_results_)
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001632 | 0.000140 | 0.001180 | 0.000085 | 0.5 | linear | {'C': 0.5, 'kernel': 'linear'} | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.954545 | 0.990909 | 0.018182 | 1 |
| 1 | 0.001640 | 0.000023 | 0.001174 | 0.000007 | 0.5 | rbf | {'C': 0.5, 'kernel': 'rbf'} | 1.000000 | 1.000000 | 0.909091 | 0.954545 | 0.909091 | 0.954545 | 0.040656 | 6 |
| 2 | 0.001542 | 0.000008 | 0.001125 | 0.000017 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.954545 | 0.990909 | 0.018182 | 1 |
| 3 | 0.001599 | 0.000016 | 0.001173 | 0.000031 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 1.000000 | 1.000000 | 0.909091 | 1.000000 | 0.954545 | 0.972727 | 0.036364 | 3 |
| 4 | 0.001529 | 0.000020 | 0.001121 | 0.000007 | 10 | linear | {'C': 10, 'kernel': 'linear'} | 0.956522 | 0.956522 | 0.954545 | 0.954545 | 1.000000 | 0.964427 | 0.017809 | 4 |
| 5 | 0.001558 | 0.000015 | 0.001139 | 0.000015 | 10 | rbf | {'C': 10, 'kernel': 'rbf'} | 0.956522 | 0.956522 | 0.954545 | 1.000000 | 0.954545 | 0.964427 | 0.017809 | 4 |
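The full table is wide, so it can help to keep only the columns that describe each candidate and its score, sorted by rank. The sketch below rebuilds the search on `load_iris` (an offline stand-in for the seaborn data, with an assumed `random_state`), so its numbers will not match the table above exactly; `summary_df` is an illustrative name.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split

# Offline stand-in for the iris data used in these notes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'kernel': ['linear', 'rbf'], 'C': [.5, 1, 10]}
svm_opt = GridSearchCV(svm.SVC(), param_grid)
svm_opt.fit(X_train, y_train)

results_df = pd.DataFrame(svm_opt.cv_results_)

# Keep the parameter and score columns, best-ranked candidates first
summary_cols = ['param_C', 'param_kernel', 'mean_test_score',
                'std_test_score', 'rank_test_score']
summary_df = results_df[summary_cols].sort_values('rank_test_score')
print(summary_df)
```

Looking at `std_test_score` next to `mean_test_score` shows whether the difference between two candidates is larger than the fold-to-fold variation.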
The fitted grid search object also has a best_estimator_ attribute, which is an estimator object.
type(svm_opt.best_estimator_)
sklearn.svm._classes.SVC
This is the model that had the best cross-validated score among all of the parameter settings tested. We can then use this model on the test data.
svm_opt.best_estimator_.score(iris_X_test,iris_y_test)
0.9736842105263158
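As a small sketch of how the pieces relate: with the default settings, the fitted grid search object refits the best setting on the whole training set, so scoring through the search object gives the same result as scoring through `best_estimator_`. The `best_score_` attribute, by contrast, is the mean cross-validated score on the training data. This example again assumes `load_iris` and a fixed `random_state` rather than the exact split used above.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split

# Offline stand-in for the iris data used in these notes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'kernel': ['linear', 'rbf'], 'C': [.5, 1, 10]}
svm_opt = GridSearchCV(svm.SVC(), param_grid)
svm_opt.fit(X_train, y_train)

# Mean cross-validated score of the best setting (training data only)
cv_score = svm_opt.best_score_

# Two equivalent ways to score the refit best model on held-out data
test_score_via_estimator = svm_opt.best_estimator_.score(X_test, y_test)
test_score_via_search = svm_opt.score(X_test, y_test)

print(cv_score, test_score_via_estimator, test_score_via_search)
```

The cross-validated score and the test score measure different things, so they do not need to match; the test score is the one that was not used to pick the parameters.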
Try it Yourself
Find the best criterion, max depth, and minimum number of samples per leaf for a decision tree.
dt = tree.DecisionTreeClassifier()
params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],
'min_samples_leaf':list(range(2,20,2))}
To do this, we do just as we did above: instantiate the grid search object and fit it.
dt_opt = GridSearchCV(dt,params_dt)
dt_opt.fit(iris_X_train,iris_y_train)
GridSearchCV(estimator=DecisionTreeClassifier(), param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4], 'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})
Then we can use the best_params_ attribute to see the best parameter settings.
dt_opt.best_params_
{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2}
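To see how much work this search involved, we can count the parameter combinations in the grid. This sketch uses `ParameterGrid` from scikit-learn to enumerate them and assumes 5-fold cross validation, which is what the `split0` through `split4` columns in the SVM results indicate.

```python
from sklearn.model_selection import ParameterGrid

params_dt = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4],
             'min_samples_leaf': list(range(2, 20, 2))}

# 2 criteria * 3 depths * 9 leaf sizes
n_candidates = len(ParameterGrid(params_dt))

# Assumed number of cross validation folds (the GridSearchCV default)
n_folds = 5
n_cv_fits = n_candidates * n_folds

print(n_candidates, n_cv_fits)
# 54 270
```

The number of fits grows multiplicatively with each parameter added to the grid, so large grids can get expensive quickly.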
26.3. Questions after class
26.3.1. Can this be used on more types of machine learning than just decision trees and SVM?
Yes, this can be used on any estimator in scikit-learn. It can even be used on other models that adhere to the required API.
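For instance, here is a minimal sketch with a k-nearest neighbors classifier. This estimator and the parameter values are chosen only for illustration and were not used in class; the pattern is the same as for the SVM and the decision tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Offline stand-in for the iris data used in these notes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative grid: the keys must be parameter names of the estimator
knn_params = {'n_neighbors': [1, 3, 5, 7], 'weights': ['uniform', 'distance']}
knn_opt = GridSearchCV(KNeighborsClassifier(), knn_params)
knn_opt.fit(X_train, y_train)

print(knn_opt.best_params_)
print(knn_opt.score(X_test, y_test))
```

The only things that change between estimators are the estimator object and the parameter names in the grid.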
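Conceptually, GridSearchCV repeatedly:
sets the parameter values from param_grid
runs cross validation on the data (as cross_val_score does)
By default, it then refits the best parameter setting on all of the data passed to fit; that refit model is best_estimator_.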
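The loop can be written out by hand as a rough sketch. This is not how GridSearchCV is implemented internally, only an equivalent of the two steps listed here; it assumes `load_iris` as an offline stand-in for the data, and `manual_means` and `best_manual_params` are illustrative names.

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, ParameterGrid,
                                     cross_val_score, train_test_split)

# Offline stand-in for the iris data used in these notes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'kernel': ['linear', 'rbf'], 'C': [.5, 1, 10]}
base_clf = svm.SVC()

manual_means = []
for params in ParameterGrid(param_grid):
    # step 1: set the parameter values on a fresh copy of the estimator
    candidate = clone(base_clf).set_params(**params)
    # step 2: cross validate that candidate on the training data
    fold_scores = cross_val_score(candidate, X_train, y_train, cv=5)
    manual_means.append(fold_scores.mean())

# pick the combination with the highest mean cross-validated score
best_manual_params = list(ParameterGrid(param_grid))[int(np.argmax(manual_means))]

# the same search done by GridSearchCV, for comparison
svm_opt = GridSearchCV(base_clf, param_grid, cv=5).fit(X_train, y_train)
print(best_manual_params)
print(svm_opt.best_params_)
```

The mean scores from the manual loop should agree with `svm_opt.cv_results_['mean_test_score']`, since both use the same folds; GridSearchCV adds the bookkeeping (timing, ranks, refitting) on top.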