35. Learning Curves#
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import cluster
from sklearn import naive_bayes
from sklearn import svm
from sklearn import tree
# import the whole model selection module
from sklearn import model_selection
sns.set_theme(palette='colorblind')
35.1. Digits Dataset#
Today, we’ll load a new dataset and use the default sklearn data structure for datasets. We get back the default data structure when we call a load_ function without any parameters at all.
digits = datasets.load_digits()
This shows us that the type is defined by sklearn, which calls it a Bunch:
type(digits)
sklearn.utils._bunch.Bunch
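A Bunch behaves like a dictionary whose keys can also be read as attributes. The following is a minimal sketch of both access styles; the variable names keys and same_object are only for illustration.

```python
from sklearn import datasets

digits = datasets.load_digits()

# A Bunch supports dictionary-style access to its keys...
keys = sorted(digits.keys())
print(keys)

# ...and attribute-style access to the very same objects.
same_object = digits['data'] is digits.data
print(same_object)
```

Either style works in the rest of these notes; attribute access is simply shorter to type.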
We can print it out to begin exploring it.
digits
{'data': array([[ 0., 0., 5., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.],
[ 0., 0., 0., ..., 16., 9., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 2., ..., 12., 0., 0.],
[ 0., 0., 10., ..., 12., 1., 0.]]),
'target': array([0, 1, 2, ..., 8, 9, 8]),
'frame': None,
'feature_names': ['pixel_0_0',
'pixel_0_1',
'pixel_0_2',
'pixel_0_3',
'pixel_0_4',
'pixel_0_5',
'pixel_0_6',
'pixel_0_7',
'pixel_1_0',
'pixel_1_1',
'pixel_1_2',
'pixel_1_3',
'pixel_1_4',
'pixel_1_5',
'pixel_1_6',
'pixel_1_7',
'pixel_2_0',
'pixel_2_1',
'pixel_2_2',
'pixel_2_3',
'pixel_2_4',
'pixel_2_5',
'pixel_2_6',
'pixel_2_7',
'pixel_3_0',
'pixel_3_1',
'pixel_3_2',
'pixel_3_3',
'pixel_3_4',
'pixel_3_5',
'pixel_3_6',
'pixel_3_7',
'pixel_4_0',
'pixel_4_1',
'pixel_4_2',
'pixel_4_3',
'pixel_4_4',
'pixel_4_5',
'pixel_4_6',
'pixel_4_7',
'pixel_5_0',
'pixel_5_1',
'pixel_5_2',
'pixel_5_3',
'pixel_5_4',
'pixel_5_5',
'pixel_5_6',
'pixel_5_7',
'pixel_6_0',
'pixel_6_1',
'pixel_6_2',
'pixel_6_3',
'pixel_6_4',
'pixel_6_5',
'pixel_6_6',
'pixel_6_7',
'pixel_7_0',
'pixel_7_1',
'pixel_7_2',
'pixel_7_3',
'pixel_7_4',
'pixel_7_5',
'pixel_7_6',
'pixel_7_7'],
'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
'images': array([[[ 0., 0., 5., ..., 1., 0., 0.],
[ 0., 0., 13., ..., 15., 5., 0.],
[ 0., 3., 15., ..., 11., 8., 0.],
...,
[ 0., 4., 11., ..., 12., 7., 0.],
[ 0., 2., 14., ..., 12., 0., 0.],
[ 0., 0., 6., ..., 0., 0., 0.]],
[[ 0., 0., 0., ..., 5., 0., 0.],
[ 0., 0., 0., ..., 9., 0., 0.],
[ 0., 0., 3., ..., 6., 0., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.]],
[[ 0., 0., 0., ..., 12., 0., 0.],
[ 0., 0., 3., ..., 14., 0., 0.],
[ 0., 0., 8., ..., 16., 0., 0.],
...,
[ 0., 9., 16., ..., 0., 0., 0.],
[ 0., 3., 13., ..., 11., 5., 0.],
[ 0., 0., 0., ..., 16., 9., 0.]],
...,
[[ 0., 0., 1., ..., 1., 0., 0.],
[ 0., 0., 13., ..., 2., 1., 0.],
[ 0., 0., 16., ..., 16., 5., 0.],
...,
[ 0., 0., 16., ..., 15., 0., 0.],
[ 0., 0., 15., ..., 16., 0., 0.],
[ 0., 0., 2., ..., 6., 0., 0.]],
[[ 0., 0., 2., ..., 0., 0., 0.],
[ 0., 0., 14., ..., 15., 1., 0.],
[ 0., 4., 16., ..., 16., 7., 0.],
...,
[ 0., 0., 0., ..., 16., 2., 0.],
[ 0., 0., 4., ..., 16., 2., 0.],
[ 0., 0., 5., ..., 12., 0., 0.]],
[[ 0., 0., 10., ..., 1., 0., 0.],
[ 0., 2., 16., ..., 1., 0., 0.],
[ 0., 0., 15., ..., 15., 0., 0.],
...,
[ 0., 4., 16., ..., 16., 6., 0.],
[ 0., 8., 16., ..., 16., 8., 0.],
[ 0., 1., 8., ..., 12., 1., 0.]]]),
'DESCR': ".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n :Number of Instances: 1797\n :Number of Attributes: 64\n :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n :Missing Attribute Values: None\n :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\n.. topic:: References\n\n - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n Graduate Studies in Science and Engineering, Bogazici University.\n - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n Linear dimensionalityreduction using relevance weighted LDA. School of\n Electrical and Electronic Engineering Nanyang Technological University.\n 2005.\n - Claudio Gentile. 
A New Approximate Maximal Margin Classification\n Algorithm. NIPS. 2000.\n"}
We note that it has key-value pairs, and that the last one is called DESCR and is text that describes the data. If we send that to the print function, it will be formatted more readably.
print(digits['DESCR'])
.. _digits_dataset:
Optical recognition of handwritten digits dataset
--------------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 1797
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.
Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.
For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.
.. topic:: References
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.
This tells us that we are going to be predicting what digit (0,1,2,3,4,5,6,7,8, or 9) is in the image.
To get an idea of what the images look like, we can use matshow, which is short for matrix show. It takes a 2D matrix and plots it as an image. To display it in grayscale, we first call matplotlib's plt.gray(), which sets the default colormap to gray.
plt.gray()
plt.matshow(digits.images[9])
<matplotlib.image.AxesImage at 0x7efdd63370a0>
<Figure size 640x480 with 0 Axes>
35.2. Setting up the Problem#
digits_X = digits.data
digits_y = digits.target
Bunch objects are designed for machine learning, so they have the features as “data” and the target explicitly identified.
digits_X.shape, digits_y.shape
((1797, 64), (1797,))
This has one row for each sample and has reshaped the 8x8 image into a 64 length vector. So we have one ‘feature’ for each pixel in the images.
The size of .images is the total number of pixel values: 1797 images of 8x8 pixels each.
1797*8*8
115008
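The relationship between the image array and the feature matrix is just a reshape. Here is a small sketch with a stand-in array; fake_images and its random contents are assumptions for illustration, not the real dataset.

```python
import numpy as np

# Stand-in for digits.images: 5 hypothetical 8x8 images with values 0..16
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 17, size=(5, 8, 8))

# Flatten each 8x8 image into a 64-length row, one row per sample
fake_data = fake_images.reshape(len(fake_images), -1)

print(fake_data.shape)   # (5, 64)
print(fake_images.size)  # 5 * 8 * 8 = 320 pixel values in total
```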
35.3. Learning Curves#
We are going to do some model comparison, so we will instantiate estimator objects for two different classifiers.
svm_clf = svm.SVC(gamma=0.001)
gnb_clf = naive_bayes.GaussianNB()
We’re going to use a ShuffleSplit object to do cross validation with 100 iterations, each time with 20% of the data randomly selected as a validation set. Averaging over that many iterations gives smoother mean test and train score curves.
Further Reading
You can see visualization of different cross validation types in the sklearn documentation.
cv = model_selection.ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
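To see what a ShuffleSplit object produces, here is a small sketch on ten toy samples with only three iterations. The names toy_X, toy_cv and split_sizes are hypothetical; the settings otherwise mirror the ones used for cv.

```python
import numpy as np
from sklearn import model_selection

# Ten toy samples, so the split sizes are easy to check by eye
toy_X = np.arange(10).reshape(-1, 1)

toy_cv = model_selection.ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, test_idx in toy_cv.split(toy_X):
    # Each iteration draws a fresh random 80/20 partition of the indices
    split_sizes.append((len(train_idx), len(test_idx)))

print(split_sizes)  # [(8, 2), (8, 2), (8, 2)]
```

Unlike K-fold, the test sets from different iterations are drawn independently, so they may overlap.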
Note
The ShuffleSplit object has a random_state parameter. The GridSearchCV that we were using didn’t have a way to control the random state directly, but its cv parameter accepts not only integers but also cross validation objects. The KFold cross validation object also has a random_state parameter (it only takes effect when shuffle=True), so we could repeat what we did in previous classes by creating a KFold object with a fixed random state.
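As a sketch of that idea, two KFold objects created with the same random state produce identical folds. The toy data and the names kf_a, kf_b, folds_a and folds_b are assumptions for illustration.

```python
import numpy as np
from sklearn import model_selection

toy_X = np.arange(12).reshape(-1, 1)

# random_state only has an effect when shuffle=True
kf_a = model_selection.KFold(n_splits=3, shuffle=True, random_state=0)
kf_b = model_selection.KFold(n_splits=3, shuffle=True, random_state=0)

folds_a = [test_idx.tolist() for _, test_idx in kf_a.split(toy_X)]
folds_b = [test_idx.tolist() for _, test_idx in kf_b.split(toy_X)]

print(folds_a == folds_b)  # True
```

An object like kf_a could then be passed as the cv argument of GridSearchCV in place of an integer.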
We’ll also create a linearly spaced list of training percentages.
Important
You could speed this up by running the fits in parallel with the n_jobs parameter of learning_curve (for example, n_jobs=-1 uses all available processors).
Now we can create the learning curve.
train_sizes = np.linspace(.05,1,10)
train_sizes_svm, train_scores_svm, test_scores_svm, fit_times_svm, score_times_svm = model_selection.learning_curve(
svm_clf,
digits_X,
digits_y,
cv=cv,
train_sizes=train_sizes,
return_times=True,)
The first return value is the array of sample counts for each training size (we input fractions and it returns counts).
train_sizes_svm
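To see roughly where those counts come from: with 1797 samples and test_size=0.2, each split leaves 1437 samples for training, and each fraction is scaled by that number. The sketch below reproduces the arithmetic with NumPy alone. The truncation to an integer is my reading of how sklearn converts fractions, so treat the exact values as approximate and check train_sizes_svm for the real ones.

```python
import numpy as np

n_samples = 1797
n_test = int(np.ceil(0.2 * n_samples))  # samples held out per split
n_train_max = n_samples - n_test         # largest possible training set

train_fractions = np.linspace(.05, 1, 10)

# Assumption: fractions are scaled by n_train_max and truncated to integers
approx_counts = (train_fractions * n_train_max).astype(int)

print(n_train_max)
print(approx_counts)
```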
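Each of the other return values is a 2D array with one row per training size and one column per cross validation iteration. Each row is 100 long because our cross validation had 100 iterations, so with our 10 training sizes the shape is (10, 100).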
fit_times_svm.shape
We can save it in a DataFrame after averaging over the 100 trials.
svm_learning_df = pd.DataFrame(data = train_sizes_svm, columns = ['train_size'])
# svm_learning_df['train_size'] = train_sizes_svm
svm_learning_df['train_score'] = np.mean(train_scores_svm,axis=1)
svm_learning_df['test_score'] = np.mean(test_scores_svm,axis=1)
svm_learning_df['fit_time'] = np.mean(fit_times_svm,axis=1)
svm_learning_df['score_times'] = np.mean(score_times_svm,axis=1)
svm_learning_df.head()
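The axis=1 argument in the averaging above is what collapses the 100 cross validation iterations into one number per training size. A minimal sketch with a stand-in array (fake_scores holds random values, not real scores):

```python
import numpy as np

# Stand-in for an array such as train_scores_svm:
# 10 training sizes (rows) x 100 cross validation iterations (columns)
rng = np.random.default_rng(0)
fake_scores = rng.uniform(0.8, 1.0, size=(10, 100))

# axis=1 averages across the columns, leaving one mean per training size
mean_per_size = np.mean(fake_scores, axis=1)

print(fake_scores.shape, mean_per_size.shape)  # (10, 100) (10,)
```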
We can use our skills in transforming data to make it easier to examine just a subset of the scores.
svm_learning_df_scores = svm_learning_df.melt(id_vars=['train_size'],
value_vars=['train_score','test_score'])
svm_learning_df_scores.head(2)
This new DataFrame allows us to make convenient plots.
sns.lineplot(data=svm_learning_df_scores,x='train_size',y='value',hue='variable')
35.3.1. Gaussian Naive Bayes#
We can do the same thing with Gaussian Naive Bayes (GNB).
train_sizes_gnb, train_scores_gnb, test_scores_gnb, fit_times_gnb, score_times_gnb = model_selection.learning_curve(
gnb_clf,
digits_X,
digits_y,
cv=cv,
train_sizes=train_sizes,
return_times=True,)
gnb_learning_df = pd.DataFrame(data = train_sizes_gnb, columns = ['train_size'])
# gnb_learning_df['train_size'] = train_sizes_gnb
gnb_learning_df['train_score'] = np.mean(train_scores_gnb,axis=1)
gnb_learning_df['test_score'] = np.mean(test_scores_gnb,axis=1)
gnb_learning_df['fit_time'] = np.mean(fit_times_gnb,axis=1)
gnb_learning_df['score_times'] = np.mean(score_times_gnb,axis=1)
gnb_learning_scores = gnb_learning_df.melt(id_vars=['train_size'],value_vars=['train_score','test_score'])
sns.lineplot(data = gnb_learning_scores, x ='train_size', y='value',hue='variable')
Notice in this case that the training accuracy starts high while the test accuracy starts low. This big gap means that the model was overfitting to something about the training set that differed from the test set.
35.4. Questions After Class#
35.4.1. how do I run the code to pull issues?#
This uses the GitHub CLI
gh issue list --state all -L 45 --json title,url,state > grade-tracker.json
35.4.2. Is fit time as important as accuracy? I would think generally for real life application we would want results over time.#
This question gets at a really important point. You are correct that fit time is generally not as important as accuracy when deploying a model. Some of the metrics that we have for machine learning algorithms are for evaluating the learning algorithm itself: if someone develops a new learning algorithm that performs as well as old ones but fits faster, that’s really helpful.
That said, the score time can be really important in a deployed model.
35.4.3. Why in the SVC model did we use gamma=0.001 and not other values? What does that parameter represent in the model?#
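The gamma (\(\gamma\)) parameter of the default rbf kernel controls how far the influence of a single training example reaches, which in practice determines how wavy the decision boundary is: larger values give a more wiggly boundary, smaller values a smoother one. I set it to a value that is known to work well for this dataset because, for time reasons, I did not want to also do a grid search.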
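For reference, the RBF kernel is \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\), so \(\gamma\) sets how quickly similarity falls off with distance. Here is a small NumPy sketch of that formula; rbf_similarity is a hypothetical helper written for illustration, not an sklearn function, and the two points are arbitrary.

```python
import numpy as np

def rbf_similarity(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two vectors."""
    squared_distance = np.sum((np.asarray(x) - np.asarray(z)) ** 2)
    return np.exp(-gamma * squared_distance)

point_a = np.array([0.0, 0.0])
point_b = np.array([3.0, 4.0])  # Euclidean distance 5 from point_a

# Small gamma: distant points still look similar (smoother boundary)
small_gamma_sim = rbf_similarity(point_a, point_b, gamma=0.001)
# Large gamma: similarity drops off quickly (wigglier boundary)
large_gamma_sim = rbf_similarity(point_a, point_b, gamma=1.0)

print(small_gamma_sim, large_gamma_sim)
```

With the small value, the two points remain almost fully similar; with the large value, their similarity is close to zero, so each training point only influences its immediate neighborhood.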
35.4.4. I’m sure it will be in the notes but a better understanding of how learning curve works#
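Conceptually, a learning curve repeats one experiment at several training set sizes: fit on the first n training samples, then score on both those samples and a held-out set. The sketch below does this by hand for a single split. It is a simplified stand-in for what learning_curve automates, not its actual implementation, and the sizes list is an arbitrary choice.

```python
from sklearn import datasets, model_selection, naive_bayes

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

sizes = [100, 400, 800, len(X_train)]
curve = []
for n in sizes:
    clf = naive_bayes.GaussianNB()
    # Train on only the first n training samples
    clf.fit(X_train[:n], y_train[:n])
    curve.append({
        'train_size': n,
        # Score on the samples the model saw during training...
        'train_score': clf.score(X_train[:n], y_train[:n]),
        # ...and on held-out samples it never saw
        'test_score': clf.score(X_test, y_test),
    })

for row in curve:
    print(row)
```

learning_curve does the same thing for every split in the cross validation object and returns the scores as arrays, which is why we average over the splits afterwards.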
35.4.5. Can you go over the melt function again?#
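As a refresher, melt reshapes a wide DataFrame into a long one: the id_vars columns are repeated, and each of the value_vars columns becomes rows with the column name in variable and the entry in value. A minimal sketch with made-up scores (the numbers in wide_df are hypothetical, chosen only to show the reshaping):

```python
import pandas as pd

# Hypothetical scores, only to show the reshaping
wide_df = pd.DataFrame({
    'train_size': [100, 200],
    'train_score': [0.99, 0.98],
    'test_score': [0.80, 0.85],
})

long_df = wide_df.melt(id_vars=['train_size'],
                       value_vars=['train_score', 'test_score'])

print(long_df)
```

The result has four rows and the columns train_size, variable and value, which is the shape seaborn needs in order to use the score type as the hue.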