38. Neural Networks#

We started thinking about machine learning with the idea that our target variable (\(y_i\)) is related to the features \(\mathbf{x}_i\) by some function (for sample \(i\)):

\[ y_i = f(\mathbf{x}_i) \]

But we don’t know that function exactly, so we assume a type of function (a decision tree, a boundary for SVM, a probability distribution) that has some parameters \(\theta\), and then use a machine learning algorithm \(\mathcal{A}\) to estimate the parameters of \(f\). In the decision tree the parameters are the thresholds to compare to, in GaussianNB the parameters are the mean and variance of each class, and in SVM they are the support vectors that define the margin.

\[\theta = \mathcal{A}(X,y) \]

We can then use the fitted function to predict on our test data:

\[ \hat{y}_i = f(\mathbf{x}_i;\theta) \]

A neural net allows us to avoid assuming a specific form for \(f\) first: it does universal function approximation. For one hidden layer and a binary classification problem:

\[ f(\mathbf{x}) = W_2 \, g(W_1^T \mathbf{x} + b_1) + b_2 \]

where the function \(g\) is called the activation function. So we approximate some unknown, complicated function \(f\) by taking a weighted sum of all of the inputs and passing it through another, known function.
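To make that concrete, here is a minimal sketch of that forward pass in plain numpy (not sklearn’s implementation; the sizes and the choice of the logistic sigmoid for \(g\) are illustrative):

import numpy as np

def sigmoid(z):
    # logistic sigmoid: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # hidden layer: weighted sums of the inputs, passed through g
    h = sigmoid(W1.T @ x + b1)
    # output layer: weighted sum of the hidden activations
    return W2 @ h + b2

rng = np.random.default_rng(1)
x = rng.normal(size=4)        # 4 input features
W1 = rng.normal(size=(4, 3))  # 4 inputs -> 3 hidden units
b1 = rng.normal(size=3)
W2 = rng.normal(size=(1, 3))  # 3 hidden units -> 1 output
b2 = rng.normal(size=1)
forward(x, W1, b1, W2, b2)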

from sklearn.neural_network import MLPClassifier
from sklearn import svm
import pandas as pd
import sklearn

from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn import model_selection

We’re going to use the digits dataset again.

digits = datasets.load_digits()
digits_X = digits.data
digits_y = digits.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(digits_X,digits_y)
digits.images[0]
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
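Since matplotlib is already imported, we can also display that array as an image to see which digit it encodes (the colormap choice here is just for legibility):

plt.imshow(digits.images[0], cmap="gray_r")
plt.title(f"label: {digits.target[0]}")
plt.show()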

Sklearn provides an estimator for the Multi-Layer Perceptron (MLP). We can start with one hidden layer.

mlp = MLPClassifier(
  hidden_layer_sizes=(16,),
  max_iter=100,
  alpha=1e-4,
  solver="lbfgs",
  verbose=10,
  random_state=1,
  learning_rate_init=0.1,
)
mlp.fit(X_train,y_train).score(X_test,y_test)
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         1210     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.38465D+00    |proj g|=  7.11984D+00

...

At iterate  100    f=  1.61334D-02    |proj g|=  2.85155D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1210    100    105      1     0     0   2.852D-02   1.613D-02
  F =   1.6133397523403377E-002

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
 This problem is unconstrained.
/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:536: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
0.9444444444444444
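The warning suggests increasing max_iter or scaling the data. A sketch of the scaling route, using sklearn’s StandardScaler in a pipeline with the same (untuned) hyperparameters as above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=100, alpha=1e-4,
                  solver="lbfgs", random_state=1, learning_rate_init=0.1),
)
scaled_mlp.fit(X_train, y_train).score(X_test, y_test)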

We can compare it to SVM:

svm_clf = svm.SVC(gamma=0.001)
svm_clf.fit(X_train, y_train)
svm_clf.score(X_test,y_test)
0.9911111111111112

We saw that the SVM performed a bit better, but this is a simple problem. We can also compare the two models by how much they have to store; the number of parameters is related to the model’s complexity.

import numpy as np
# values stored by the SVM: one 64-pixel vector per support vector
np.prod(list(svm_clf.support_vectors_.shape))
43840
# values stored by the MLP: entries of its weight matrices
np.sum([np.prod(list(c.shape)) for c in mlp.coefs_])
1184
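This sum counts only the weight matrices; each layer also has bias terms, stored in mlp.intercepts_. Counting both should reproduce the N = 1210 reported in the solver log above:

np.sum([c.size for c in mlp.coefs_]) + np.sum([b.size for b in mlp.intercepts_])  # 1184 weights + 26 biases = 1210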
mlp.coefs_
[array([[-0.04544804,  0.12067436, -0.27379334, ...,  0.20709945,
         -0.25885548,  0.0933671 ],
        ...,
        [-0.25377729,  0.10587514, -0.14924823, ...,  0.05460927,
         -0.04542482, -0.41864772]]),
 array([[-0.41943486, -0.00737221, -0.25013001,  0.439859  ,  0.37306531,
          0.11786679,  0.45738627, -0.14901288,  0.38219732, -0.05069864],
        ...,
        [ 0.18276065, -0.03845522, -0.93979366, -0.29424286,  0.2612766 ,
         -0.04896757, -0.16762708, -0.20991491,  0.4357143 ,  1.33815728]])]
mlp64 = MLPClassifier(
  hidden_layer_sizes=(64,),
  max_iter=100,
  alpha=1e-4,
  solver="lbfgs",
  verbose=10,
  random_state=1,
  learning_rate_init=0.1,
)
mlp64.fit(X_train,y_train).score(X_test,y_test)
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         4810     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.04302D+01    |proj g|=  8.27241D+00

...

At iterate   53    f=  1.84513D-05    |proj g|=  4.20248D-05

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 4810     53     54      1     0     0   4.202D-05   1.845D-05
  F =   1.8451253494679939E-005

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
 This problem is unconstrained.
0.9777777777777777
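Each column of the first weight matrix has one entry per input pixel, so we can reshape a few hidden units’ weights back to 8×8 and plot them to get a rough sense of what those units respond to (a common inspection trick; showing 8 units is an arbitrary choice):

fig, axes = plt.subplots(2, 4, figsize=(8, 4))
for ax, unit_weights in zip(axes.ravel(), mlp64.coefs_[0].T):
    # one 64-entry weight vector per hidden unit, reshaped to the image grid
    ax.imshow(unit_weights.reshape(8, 8), cmap="gray")
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()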

38.1. Questions After Class#

38.1.1. Roughly, how does the model know to use certain functions as the fitting becomes more complex (e.g. sin(x), ln(x), e^x)?#

It does not learn an analytical form; it approximates the function’s values numerically through the weighted sums and activations, so it never “knows” it is fitting something like \(\sin(x)\).
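As a small illustration (a sketch with arbitrary hyperparameters), an MLPRegressor can fit samples of \(\sin(x)\) closely without ever being told the analytical form:

import numpy as np
from sklearn.neural_network import MLPRegressor

x_grid = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
reg = MLPRegressor(hidden_layer_sizes=(32,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=1)
reg.fit(x_grid, np.sin(x_grid).ravel())
reg.score(x_grid, np.sin(x_grid).ravel())  # R^2 near 1, with no sin() anywhere in the model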

38.1.2. when doing the .score on the mlp does the limit vary or does it have a set limit on its own?#

38.1.3. What is tensorflow used for that scikit can’t do?#

Tensorflow can do more types of networks and has more options for training. Most importantly, it has code optimizations so that you can use more specialized hardware (like GPUs) directly.

38.1.4. when you say weight, what does that mean?#

Weights are coefficients: each weight multiplies one input (a feature, or the output of a previous layer) before the values are summed.

38.1.5. what is an artificial neuron?#

An artificial neuron is one “unit” of calculation. A neuron takes a weighted sum of all of its inputs (including a bias term) and passes it through an “activation function” that squashes the output into a fixed range (for example, [0,1] for the logistic sigmoid).

38.1.6. what real life problems require tensorflow?#

Most modern ML applications are built with TensorFlow, PyTorch, or similar frameworks.

38.1.7. What do the hidden layers of the neural network represent?#

We do not specify exactly what they represent up front; we can use model explanation techniques and visualization tools to examine them after the fact and try to interpret them if needed.

38.1.8. What is the best way to optimize a neural net? would it be just adding more layers?#

You could specify a grid of parameter values and use GridSearchCV as well. There are also different types of layers; we will see that later.
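For example (a sketch; this parameter grid is arbitrary), the architecture can be searched over like any other hyperparameter:

param_grid = {"hidden_layer_sizes": [(16,), (32,), (16, 16)],
              "alpha": [1e-4, 1e-2]}
mlp_gs = model_selection.GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=500, random_state=1),
    param_grid)
mlp_gs.fit(X_train, y_train)
mlp_gs.best_params_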

38.1.9. Are the weights given to the hidden layers initially random?#

Typically yes: they are initialized randomly and then learned during training.

38.1.10. I’ve heard that cleaning data generally is a majority of a data scientists work is this generally true?#

38.1.11. What does it mean to “translate a jupyter notebook into python scripts”? what exactly are scripts?#

A script is a file that can be run non-interactively; that is, it can be run straight through without relying on any user input.
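For example, the MLP fit above could be collected into a minimal script (hypothetical file name digits_mlp.py) and run with python digits_mlp.py:

# digits_mlp.py -- runs straight through with no user interaction
from sklearn import datasets, model_selection
from sklearn.neural_network import MLPClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    digits.data, digits.target)
mlp = MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs", max_iter=500)
print(mlp.fit(X_train, y_train).score(X_test, y_test))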

38.1.12. does jupyter notebook have to be used for data science or can we use other types of languages?#

You can use other languages, and you can also use Python in a script or interactively in another IDE.

38.1.13. How are issues of privacy handled for people like Cass, some of the models they spoke about required a lot of personal data?#

They do not release the data to just anyone, but they do use a lot of personal data. Mostly, they release anonymized, aggregated data so that it is not possible to identify an individual. There are privacy and security procedures to protect the linked data and limit who has access to it.