from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn import svm
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set_theme(palette='colorblind')

We will load an image dataset of handwritten digits.
digits = datasets.load_digits()
digits_X = digits.data
digits_y = digits.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(digits_X, digits_y)

digits.images[0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

digits_X[:1]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

digits_X.shape

(1797, 64)

digits_y[:5]

array([0, 1, 2, 3, 4])

We can use a neural network here, even a small one.
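Each row of digits_X is one image, with the 8×8 grid of pixel values flattened into 64 features. As a quick check, we can display one image along with its label; this is a minimal sketch using the matplotlib import from above.

plt.imshow(digits.images[0], cmap='gray_r')
plt.title(f'label: {digits.target[0]}')
plt.show()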
mlp = MLPClassifier(
    hidden_layer_sizes=(16,),
    max_iter=300,
    solver="lbfgs",
    verbose=10,
    random_state=1,
    learning_rate_init=0.1,
)

mlp.fit(X_train, y_train).score(X_test, y_test)

0.8977777777777778

mlp = MLPClassifier(
    hidden_layer_sizes=(16,),
    max_iter=500,
    solver="lbfgs",
    verbose=10,
    random_state=1,
    learning_rate_init=0.1,
)

mlp.fit(X_train, y_train).score(X_test, y_test)

0.8977777777777778

mlp = MLPClassifier(
    hidden_layer_sizes=(16,),
    max_iter=1000,
    solver="lbfgs",
    verbose=10,
    random_state=1,
    learning_rate_init=0.1,
)

mlp.fit(X_train, y_train).score(X_test, y_test)

0.8977777777777778

Letting the optimizer converge fully can give a better fit, but it can also risk overfitting. Stopping it via a maximum number of iterations is called early stopping and is a very common strategy.
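Note that sklearn's MLPClassifier also has a built-in early_stopping option, which holds out part of the training data as a validation set and stops when the validation score stops improving; it only applies to the 'sgd' and 'adam' solvers, not 'lbfgs'. A minimal sketch on the same data (the exact score will depend on the split):

mlp_es = MLPClassifier(
    hidden_layer_sizes=(16,),
    max_iter=1000,
    solver="adam",
    early_stopping=True,    # stop when the held-out validation score stops improving
    random_state=1,
    learning_rate_init=0.1,
)
mlp_es.fit(X_train, y_train).score(X_test, y_test)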
20/28

0.7142857142857143

More on NN
Numerical Optimization algorithms are at the core of many of the fit methods.
One way we can optimize a function is to take the derivative, set it equal to zero, and solve for the parameter. If we know the function is convex (like a bowl or valley shape), then the place where the derivative (slope) is 0 is the bottom, or lowest point, of the valley.
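For a tiny concrete example, take f(w) = (w - 3)**2: the derivative is 2*(w - 3), and setting it to zero gives w = 3, the bottom of the bowl. Here is a sketch of checking that symbolically, assuming the sympy package is available.

import sympy

w = sympy.symbols('w')
f = (w - 3)**2
sympy.solve(sympy.diff(f, w), w)   # [3], the value of w at the minimum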
Numerical optimization is for when we can’t analytically solve that problem once we set it equal to zero. Optimization algorithms are sort of like search algorithms, but they can work in high dimensions and use strategies based on calculus.
The basic idea in many numerical optimization algorithms is to start at a point (an initial setting of the coefficients, in this case), compute the value of the function, then change the coefficients a little and compute it again. We can use those two points to see whether the direction we “moved” (the way we changed the parameters) made things better or worse. If it got better, we keep changing them in the same direction (if we made both smaller, we make them both smaller again); if it got worse, we change them in a different direction.
You can think of this like trying to find the bottom of a valley without being able to see, only able to check your altitude. You take a step left, right, forward, or back, and then see if your altitude went up or down.
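Here is a deliberately naive sketch of that "step and check your altitude" loop on a one-parameter function; it is only meant to illustrate the idea, not to be a real optimizer.

def f(w):
    return (w - 3)**2          # a convex valley with its bottom at w = 3

w = 0.0                        # initial guess
step = 0.5
for i in range(100):
    if f(w + step) < f(w):     # does stepping right lower our altitude?
        w = w + step
    elif f(w - step) < f(w):   # does stepping left lower it?
        w = w - step
    else:
        step = step / 2        # neither direction helps, so take smaller steps
w                              # ends up very close to 3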
LBFGS actually uses the derivative, so it’s like you can see the direction of the hill you’re on, but you still have to keep taking steps, and when you reach a point where you can’t go down anymore, you know you are done. When the algorithm finds it can’t get any better, that’s called convergence.
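As an illustration of using the derivative, here is a sketch of plain gradient descent with a simple convergence check on the same f(w) = (w - 3)**2; lbfgs is a more sophisticated method, but the stopping idea is similar.

w = 0.0
learning_rate = 0.1
for i in range(1000):
    gradient = 2 * (w - 3)                # the slope tells us which way is downhill
    w_new = w - learning_rate * gradient  # step in the downhill direction
    if abs(w_new - w) < 1e-8:             # the steps stopped helping: converged
        break
    w = w_new
w                                          # very close to 3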
Stochastic gradient descent works in high dimensions where it’s too expensive to compute the full derivative, but you can move in randomly chosen directions (or take the partial derivative in a small number of dimensions). Adam is a special version of that with a better strategy.
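A minimal sketch of the stochastic idea on a tiny least-squares problem: each step uses the gradient from just one randomly chosen sample instead of the whole dataset (this is plain stochastic gradient descent, not Adam).

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 2.5 * x + 0.1 * rng.standard_normal(200)   # data with true slope 2.5

w = 0.0                                        # initial guess at the slope
lr = 0.01
for step in range(5000):
    i = rng.integers(len(x))                   # one random sample per step
    gradient = 2 * (w * x[i] - y[i]) * x[i]    # gradient of (w*x_i - y_i)**2
    w = w - lr * gradient
w                                              # close to 2.5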
Numerical optimization is a whole research area. In graduate school, I took a whole semester-long course just learning different algorithms for this.
Deep Learning Ecosystem
While sklearn can do basic neural networks, it doesn’t have many types of neurons or fancy layers.
What makes deep learning work is its ability to transform data in new ways. We will see them in more detail later.
Two popular deep learning libraries are PyTorch and TensorFlow; both are open source. These are complex, high-performance libraries with optimized code. We have used Python in class because it is a user-friendly programming language, but it is not optimal performance-wise, so much of the code in these libraries is written in C++.
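For reference, here is a minimal sketch of a similar small network written in PyTorch, assuming the torch package is installed; it is illustrative and not an exact equivalent of the sklearn model above.

import torch
from torch import nn

torch_X = torch.tensor(X_train, dtype=torch.float32)
torch_y = torch.tensor(y_train, dtype=torch.long)

model = nn.Sequential(
    nn.Linear(64, 16),   # 64 pixel inputs -> 16 hidden units
    nn.ReLU(),
    nn.Linear(16, 10),   # 16 hidden units -> 10 digit classes
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    optimizer.zero_grad()                      # reset gradients from the last step
    loss = loss_fn(model(torch_X), torch_y)    # forward pass and loss
    loss.backward()                            # compute gradients
    optimizer.step()                           # update the weights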