33. Predicting with Neural Networks

from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

from sklearn import svm
import pandas as pd
import numpy as np
import sklearn

from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn import model_selection
# from sklearn.model_selection import train_test_split

import seaborn as sns

Today, we’re going to use very simple data in order to examine how a neural network works.

X, y = make_classification(n_samples=100, random_state=1,n_features=2,n_redundant=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, stratify=y)

First, we’ll train and score a tiny neural net with 1 hidden layer of 1 neuron.

clf = MLPClassifier(
  hidden_layer_sizes=(1,), # 1 hidden layer, 1 artificial neuron
  max_iter=100, # maximum 100 iterations in optimization
  alpha=1e-4, # regularization
  solver="lbfgs", # optimization algorithm
  verbose=10, # how much detail to print
  activation='identity' # how to transform the hidden layer before passing it to the next layer
)
clf.fit(X_train, y_train)

clf.score(X_test, y_test)

           * * *

Now we can see that the network actually has another activation that we didn’t change: the output layer still has a logistic activation, which we want. Without it, the output couldn’t be interpreted as a probability, because a probability always needs to be between 0 and 1.
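As a quick check, here is a minimal sketch of how to look up both activations on a fitted classifier. It refits a small network on the same kind of toy data; the `_demo` names and `random_state=1` are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# toy data like the data above; the _demo names are only for this sketch
X_demo, y_demo = make_classification(n_samples=100, random_state=1,
                                     n_features=2, n_redundant=0)
clf_demo = MLPClassifier(hidden_layer_sizes=(1,), activation='identity',
                         solver='lbfgs', max_iter=100, random_state=1)
clf_demo.fit(X_demo, y_demo)

# the hidden layer uses the activation we asked for
print(clf_demo.activation)       # identity
# the output layer's activation is set by scikit-learn during fit
print(clf_demo.out_activation_)  # logistic
```

For a two-class problem the output activation is logistic no matter what we choose for the hidden layer.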


The logistic function looks like this:

x_logistic = np.linspace(-10,10,100)
y_logistic = expit(x_logistic)
plt.plot(x_logistic,y_logistic)
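If the plot is not available, a few numbers tell the same story. This is a small sketch (the variable names are made up for illustration) that checks `expit` against the formula 1 / (1 + exp(-x)) and confirms that its values stay strictly between 0 and 1 on this range.

```python
import numpy as np
from scipy.special import expit

x_check = np.linspace(-10, 10, 5)  # [-10, -5, 0, 5, 10]

# expit is the logistic function 1 / (1 + exp(-x))
manual = 1 / (1 + np.exp(-x_check))
print(np.allclose(expit(x_check), manual))  # True

# the middle of the curve is at 0.5
print(expit(0.0))  # 0.5

# every value on this range is strictly between 0 and 1
values = expit(x_check)
print(values.min() > 0, values.max() < 1)  # True True
```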

The fit method learned the following weights:

        [ 0.19720034]]),

and biases

[array([0.02652099]), array([5.74570119])]

These are called coefficients and intercepts because the weights are multiplied by the inputs, and you can interpret the biases geometrically as shifting things, like the intercept of a line (recall y=mx+b).
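It can help to look at the shapes of these arrays. The sketch below refits a one-neuron network on the same kind of toy data (the `_demo` names and `random_state=1` are assumptions for this example) and prints one shape per layer.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=100, random_state=1,
                                     n_features=2, n_redundant=0)
clf_demo = MLPClassifier(hidden_layer_sizes=(1,), activation='identity',
                         solver='lbfgs', max_iter=100, random_state=1)
clf_demo.fit(X_demo, y_demo)

# one weight array and one bias array per layer: hidden first, then output
weight_shapes = [w.shape for w in clf_demo.coefs_]
bias_shapes = [b.shape for b in clf_demo.intercepts_]
print(weight_shapes)  # [(2, 1), (1, 1)]
print(bias_shapes)    # [(1,), (1,)]
```

The hidden layer has 2 weights (one per input feature) and 1 bias; the output layer has 1 weight (one per hidden neuron) and 1 bias.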

33.1. Reconstructing the Predict method

We’ll use a point that is actually new; we can make one up.

We want a numpy array, so we will cast it.

pt = np.array([[-1,2]])

numpy’s matmul does matrix multiplication (it multiplies each row of the first array by each column of the second, element-wise, and sums).
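Here is a tiny example of that with made-up weights (`w_demo` is not from the fitted model), so the arithmetic is easy to follow by hand.

```python
import numpy as np

pt_demo = np.array([[-1, 2]])   # 1 sample with 2 features: shape (1, 2)
w_demo = np.array([[3], [4]])   # made-up weights, 2 inputs to 1 neuron: shape (2, 1)

product = np.matmul(pt_demo, w_demo)
print(product)        # [[5]] because (-1)(3) + (2)(4) = 5
print(product.shape)  # (1, 1)
```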

\[f(x) = W_2g(W_1^T x +b_1) + b_2 \]

Here \(g\) is the activation function, which we set to the identity \(g(x) = x\), so the hidden layer needs no extra transformation.

(np.matmul(pt,clf.coefs_[0]) + clf.intercepts_[0])*clf.coefs_[1] + clf.intercepts_[1]

But we’re not quite done: the output layer still transforms its input using the logistic function, which is also known as expit and which we imported from scipy.

expit((np.matmul(pt,clf.coefs_[0]) + clf.intercepts_[0])*clf.coefs_[1] + clf.intercepts_[1])

We can compare this to the classifier’s output. It outputs a probability for each class; we only computed the probability of class 1.

clf.predict_proba(pt)
array([[9.99726525e-01, 2.73474984e-04]])

We can also see how it predicts on that point.

clf.predict(pt)


A single artificial neuron is like the function below. It has parameters that have to be determined before we can use it on an input vector.

def aritificial_neuron_template(activation,weights,bias,inputs):
    '''
    simple artificial neuron

    Parameters
    ----------
    activation : function
        activation function of the neuron
    weights : numpy array
        weights for summing inputs
    bias : numpy array
        bias term added to the weighted sum
    inputs : numpy array
        input to the neuron
    '''
    return activation(np.matmul(inputs,weights) +bias)

# two common activation functions
identity_activation = lambda x: x
logistic_activation = lambda x: expit(x)

When we instantiate the multilayer perceptron object, MLPClassifier, we pick the activation function, and when we give data to the fit method, we get the weights and biases.

A neural network passes the data to the hidden layer, and the output of the hidden layer to the output layer. In our neural network, we have just one neuron at each layer.

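So the predict_proba method is the same as the following:

aritificial_neuron_template(expit,clf.coefs_[1],clf.intercepts_[1],
     aritificial_neuron_template(identity_activation,clf.coefs_[0],clf.intercepts_[0],pt))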


To make this easier to read, we can make the intermediate neurons their own lambda functions.

hidden_neuron = lambda x: aritificial_neuron_template(identity_activation,clf.coefs_[0],clf.intercepts_[0],x)
output_neuron = lambda x: aritificial_neuron_template(expit,clf.coefs_[1],clf.intercepts_[1],x)


We can confirm that this gives the same result as the predict_proba method:

clf.predict_proba(pt)
array([[9.99726525e-01, 2.73474984e-04]])
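To tie this section together, here is a self-contained sketch of the whole comparison. It refits a one-neuron network (the `_demo` names and `random_state=1` are assumptions for this example, so the numbers will not match the ones above) and checks that the hand-built forward pass agrees with the second column of predict_proba.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=100, random_state=1,
                                     n_features=2, n_redundant=0)
clf_demo = MLPClassifier(hidden_layer_sizes=(1,), activation='identity',
                         solver='lbfgs', max_iter=100, random_state=1)
clf_demo.fit(X_demo, y_demo)

pt_demo = np.array([[-1, 2]])

# hidden layer: weighted sum plus bias, identity activation
hidden_out = np.matmul(pt_demo, clf_demo.coefs_[0]) + clf_demo.intercepts_[0]
# output layer: weighted sum plus bias, then the logistic function
manual_prob = expit(np.matmul(hidden_out, clf_demo.coefs_[1]) + clf_demo.intercepts_[1])

# predict_proba has one column per class; column 1 is the probability of class 1
sklearn_prob = clf_demo.predict_proba(pt_demo)
print(np.allclose(manual_prob[:, 0], sklearn_prob[:, 1]))  # True
```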

33.2. More Features and More Hidden Neurons

First, we’ll sample more features and then train a new classifier

X, y = make_classification(n_samples=100, random_state=1,n_features=4,n_redundant=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, stratify=y)
pt_4d =np.asarray([[-1,-2,2,-1],[1.5,0,.5,1]])
clf_4d = MLPClassifier(
    activation= 'identity'  # any other arguments are not shown in the source
)

clf_4d.fit(X_train, y_train)

clf_4d.score(X_test, y_test)

To try improving it, we will add more hidden neurons and a different activation function:

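clf_4d_4h = MLPClassifier(
    hidden_layer_sizes=(4,),  # 4 hidden neurons in one layer, as described below; other arguments are not shown in the source
    activation='logistic',  # matches the logistic hidden neurons rebuilt below
)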

clf_4d_4h.fit(X_train, y_train)

clf_4d_4h.score(X_test, y_test)

At iterate   53    f=  2.16786D-02    |proj g|=  2.92993D-04

At iterate   54    f=  2.15285D-02    |proj g|=  2.69982D-04

At iterate   55    f=  2.15015D-02    |proj g|=  5.21127D-04

At iterate   56    f=  2.12361D-02    |proj g|=  4.07072D-04

At iterate   57    f=  2.12268D-02    |proj g|=  4.39523D-04

At iterate   58    f=  1.88186D-02    |proj g|=  2.90136D-03

At iterate   59    f=  1.88058D-02    |proj g|=  3.18867D-03

At iterate   60    f=  1.77612D-02    |proj g|=  3.47177D-03

At iterate   61    f=  1.54082D-02    |proj g|=  1.11205D-02

At iterate   62    f=  1.05133D-02    |proj g|=  2.26634D-03

At iterate   63    f=  9.48293D-03    |proj g|=  8.16960D-04

At iterate   64    f=  8.91869D-03    |proj g|=  1.20060D-03

At iterate   65    f=  8.61158D-03    |proj g|=  3.48433D-03

At iterate   66    f=  8.29355D-03    |proj g|=  5.15635D-04

At iterate   67    f=  8.23867D-03    |proj g|=  1.06216D-03

At iterate   68    f=  8.11866D-03    |proj g|=  1.65491D-03

At iterate   69    f=  7.94600D-03    |proj g|=  1.61213D-03

At iterate   70    f=  7.06812D-03    |proj g|=  7.92228D-04

At iterate   71    f=  6.51501D-03    |proj g|=  6.50439D-04

We see some improvement.

This network is more complicated. It has 5 total neurons:

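# each hidden neuron uses one column of the first weight matrix and one entry of the first bias vector
hidden_neuron_4d_h0 = lambda x: aritificial_neuron_template(logistic_activation,
                             clf_4d_4h.coefs_[0][:,0],clf_4d_4h.intercepts_[0][0],x)
hidden_neuron_4d_h1 = lambda x: aritificial_neuron_template(logistic_activation,
                             clf_4d_4h.coefs_[0][:,1],clf_4d_4h.intercepts_[0][1],x)
hidden_neuron_4d_h2 = lambda x: aritificial_neuron_template(logistic_activation,
                             clf_4d_4h.coefs_[0][:,2],clf_4d_4h.intercepts_[0][2],x)
hidden_neuron_4d_h3 = lambda x: aritificial_neuron_template(logistic_activation,
                             clf_4d_4h.coefs_[0][:,3],clf_4d_4h.intercepts_[0][3],x)
output_neuron_4d_4h = lambda x: aritificial_neuron_template(logistic_activation,
                             clf_4d_4h.coefs_[1],clf_4d_4h.intercepts_[1],x)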

And we have to pass the outputs of all 4 hidden neurons into the output neuron together, because they are in a single layer, not in sequence.
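Here is a self-contained sketch of that idea. It assumes a network with 4 hidden neurons and logistic activation; the names `clf4` and `pts_demo`, and the settings `solver='lbfgs'` and `random_state=1`, are choices made for this example only. Each hidden neuron is computed on its own, the four outputs are put side by side, and the result goes into the output neuron.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X4, y4 = make_classification(n_samples=100, random_state=1,
                             n_features=4, n_redundant=0)
clf4 = MLPClassifier(hidden_layer_sizes=(4,), activation='logistic',
                     solver='lbfgs', max_iter=100, random_state=1)
clf4.fit(X4, y4)

pts_demo = np.asarray([[-1, -2, 2, -1], [1.5, 0, .5, 1]])

# one hidden neuron at a time: column i of the weights, entry i of the biases
hidden_cols = [expit(np.matmul(pts_demo, clf4.coefs_[0][:, i]) + clf4.intercepts_[0][i])
               for i in range(4)]
hidden_out = np.column_stack(hidden_cols)  # shape (2, 4): 2 points, 4 hidden outputs

# the same layer computed all at once with the full weight matrix
hidden_out_all = expit(np.matmul(pts_demo, clf4.coefs_[0]) + clf4.intercepts_[0])
print(np.allclose(hidden_out, hidden_out_all))  # True

# the output neuron takes all 4 hidden outputs together
manual_prob = expit(np.matmul(hidden_out, clf4.coefs_[1]) + clf4.intercepts_[1])
print(np.allclose(manual_prob[:, 0], clf4.predict_proba(pts_demo)[:, 1]))  # True
```

Computing the layer neuron by neuron or with one matrix multiplication gives the same numbers; the matrix form is just more compact.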

33.3. (optional) What is a numerical optimization algorithm?

Numerical optimization algorithms are at the core of many of the fit methods.

One way we can optimize a function is to take the derivative, set it equal to zero, and solve for the parameter. If we know the function is convex (like a bowl or valley shape), then the place where the derivative (slope) is 0 is the bottom, or lowest point, of the valley.
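As a small worked example (the function is made up for illustration), take a bowl-shaped function of a single parameter \(w\) and its derivative:

\[f(w) = (w - 3)^2 + 1, \qquad f'(w) = 2(w - 3) \]

Setting \(f'(w) = 0\) gives \(w = 3\), and the lowest value is \(f(3) = 1\).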

Numerical optimization is for when we can’t analytically solve that problem once we set the derivative equal to zero. Optimization algorithms are sort of like search algorithms, but they can work in high dimensions and use strategies based on calculus.

The basic idea in many numerical optimization algorithms is to start at a point (an initial setting of the coefficients, in this case), compute the value of the function, then change the coefficients a little and compute it again. We can use those two points to see if the direction we “moved”, that is, the way we changed the parameters, made it better or worse. If it got better, we change them more in the same direction (if we made both smaller, then we make them both smaller again); if it got worse, we change in a different direction.

You can think of this like trying to find the bottom of a valley without being able to see, when all you can do is check your altitude. You take a step left, right, forward, or back and then see if your altitude went up or down.
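Here is a minimal sketch of that "take a step and check your altitude" idea for a single parameter. The function, starting point, and step size are all made up for illustration, and this is not the algorithm that MLPClassifier uses.

```python
def loss(w):
    # a made-up bowl-shaped function whose lowest point is at w = 3
    return (w - 3.0) ** 2

w = 0.0     # starting guess
step = 1.0  # how far to move on each try

for _ in range(200):
    if step < 1e-6:
        break  # the steps are tiny now, so we stop
    if loss(w + step) < loss(w):
        w = w + step      # stepping up in w made it better
    elif loss(w - step) < loss(w):
        w = w - step      # stepping down in w made it better
    else:
        step = step / 2   # neither direction helps: try smaller steps

print(w)  # 3.0
```

It never uses a derivative; it only compares the function value before and after each move.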

L-BFGS actually uses the derivative, so it’s like you can see the direction of the hill you’re on, but you still have to keep taking steps, and if you reach a point where you can’t go down anymore, you know you are done. When the algorithm finds it can’t get any better, that’s called convergence.
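To see a derivative-based optimizer converge, here is a short sketch using scipy's general-purpose `minimize` function with its `'L-BFGS-B'` method on a made-up bowl-shaped function. The function, its gradient, and the starting point are assumptions for this example; fitting a neural network is the same idea with many more parameters.

```python
import numpy as np
from scipy.optimize import minimize

target = np.array([1.0, -2.0])  # made-up location of the bottom of the valley

def loss(w):
    # bowl-shaped function of two parameters
    return np.sum((w - target) ** 2)

def gradient(w):
    # the derivative of the loss with respect to each parameter
    return 2 * (w - target)

result = minimize(loss, x0=np.zeros(2), jac=gradient, method='L-BFGS-B')

print(result.x)        # close to [ 1. -2.]
print(result.success)  # whether the optimizer reports that it converged
print(result.nit)      # how many iterations it took
```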

Stochastic gradient descent is for problems where computing the full derivative is too expensive, so at each step it estimates the downhill direction from a small random sample (for example, a few of the training examples) instead. Adam is a special version of that with a better strategy for choosing the step sizes.
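The sketch below shows the random-sample idea on a made-up problem: learning the slope of a line. The data, batch size, and learning rate are all assumptions for illustration, and this is a bare-bones version, not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: y is about 2 times x, plus a little noise
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)

w = 0.0              # starting guess for the slope
learning_rate = 0.1  # how big a step to take each time

for _ in range(500):
    # pick 10 random examples instead of using all 200
    batch = rng.choice(200, size=10, replace=False)
    # derivative of the mean squared error on just this small batch
    grad = np.mean(2 * (w * x[batch] - y[batch]) * x[batch])
    # step downhill according to that noisy estimate
    w -= learning_rate * grad

print(w)  # ends up near the true slope of 2
```

Each step is cheap because it only looks at 10 examples, and the noise in the estimates mostly averages out over many steps.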

Numerical optimization is a whole research area. In graduate school, I took a semester-long course just learning different algorithms for this.