16. Modeling and Naive Bayes#

Important

Remember to give feedback on the course so far. This feedback helps me make adjustments to the course if needed. Even if things are good for you right now, letting me know helps me keep emphasis on the things that are helping.

The form requires you to be logged into your URI Google account, but does not share your e-mail with me in the results. It is required so that people not at URI do not complete the form.

We’re going to approach machine learning from the perspective of modeling for a few reasons:

  • model based machine learning streamlines understanding the big picture

  • the model way of interpreting it aligns well with using sklearn

  • thinking in terms of models aligns with incorporating domain expertise, as in our data science definition

This paper by Christopher M. Bishop, a senior ML researcher who also wrote one of the widely preferred graduate-level ML textbooks, details the advantages of a model based perspective and gives a more mathematical version of a model based approach to machine learning. He is also a co-author of an introductory textbook, Model Based ML.

In CSC461: Machine Learning, you can encounter an algorithm focused approach to machine learning, but I think having the model based perspective first helps you avoid common pitfalls.

Remember our overview of ML:

ML overview: training data goes into the learning algorithm, which outputs the prediction algorithm. The prediction algorithm takes a sample and outputs a prediction.

16.1. What is a Model?#

A model is a simplified representation of some part of the world. A famous quote about models is:

All models are wrong, but some are useful –George Box[^wiki]

In machine learning, we use models that are generally statistical models.

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. (Wikipedia)

Read more in the Model Based Machine Learning Book.

16.2. Models in Machine Learning#

Starting from a dataset, we first make an additional designation about how we will use the different variables (columns). We will call most of them the features, which we denote mathematically with \(\mathbf{X}\) and we’ll choose one to be the target or labels, denoted by \(\mathbf{y}\).

The core assumption for just about all machine learning is that there exists some function \(f\) so that for the \(i\)th sample

\[ y_i = f(\mathbf{x}_i) \]

Here \(i\) plays the role of the index (row label) of a DataFrame.

Example models are (informally):

  • we can describe the data as a set of blobs

  • we can describe the rule to separate classes as a flow chart

  • we can describe the rule to separate as a curved line

16.3. Types of Machine Learning#

Then with different additional assumptions we get different types of machine learning.

16.4. Supervised Learning#

We’ll focus on supervised learning first. We can take that same core assumption and use it with additional information about our target variable to determine the learning task we are working on.

\[ y_i = f(\mathbf{x}_i) \]
  • if \(y_i\) are discrete (e.g., flower species) we are doing classification

  • if \(y_i\) are continuous (e.g., height) we are doing regression

flowchart for above definitions

Further Reading

sklearn provides a popular flowchart for choosing a specific model

16.5. Machine Learning Pipeline#

To do machine learning we start with training data which we put as input to the learning algorithm. A learning algorithm might be a generic optimization procedure or a specialized procedure for a specific model. The learning algorithm outputs a trained model or the parameters of the model. When we deploy a model we pair the fit model with a prediction algorithm or decision algorithm to evaluate a new sample in the world.

In experimenting and design, we need testing data to evaluate how well our learning algorithm understood the world. We need to use previously unseen data, because otherwise we can’t tell whether the prediction algorithm is using a rule that the learning algorithm produced or just looking up the result in a lookup table. This can be thought of like the difference between memorization and understanding.

When the model does well on the training data, but not on test data, we say that it does not generalize well.
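The difference between memorization and understanding can be made concrete with a toy sketch. Everything here is made up for illustration: the handful of (petal length, petal width) samples, the 2.5 threshold, and the function names. It is not how sklearn works internally, just a way to see why we score on unseen data.

```python
# Toy training data (made up): (petal_length, petal_width) -> species
train = {
    (1.4, 0.2): "setosa",
    (1.3, 0.2): "setosa",
    (4.7, 1.4): "versicolor",
    (4.5, 1.5): "versicolor",
}
# Toy test data (made up): samples that never appear in training
test = [((1.5, 0.2), "setosa"), ((4.4, 1.4), "versicolor")]


def lookup_predict(x):
    """Memorization: only knows the exact samples seen in training."""
    return train.get(x, "unknown")


def rule_predict(x):
    """A simple rule (threshold assumed for illustration): short petals are setosa."""
    return "setosa" if x[0] < 2.5 else "versicolor"


def accuracy(predict, samples):
    """Fraction of (features, label) pairs that predict gets right."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


train_samples = list(train.items())
lookup_train = accuracy(lookup_predict, train_samples)
lookup_test = accuracy(lookup_predict, test)
rule_train = accuracy(rule_predict, train_samples)
rule_test = accuracy(rule_predict, test)

print("lookup table: train", lookup_train, "test", lookup_test)
print("rule:         train", rule_train, "test", rule_test)
```

Both predictors look perfect on the training samples. Only the test samples reveal that the lookup table has not generalized.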

data splits in ML; features, target, training and test

16.6. Iris Dataset#

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
iris_df = sns.load_dataset('iris')

We’re trying to build an automatic flower classifier that, given measurements of a new flower, returns the predicted species. To do this, we have a DataFrame with columns for species, petal width, petal length, sepal length, and sepal width. The species is what type of flower it is; the petal and sepal are parts of the flower.

iris_df.head(1)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa

We can look at this data using a pair plot. It plots each pair of numerical variables in a grid of scatterplots and on the diagonal (where it would be a variable with itself) shows the distribution of that variable.

sns.pairplot(data=iris_df,hue='species')
../_images/2022-10-17_5_1.png

This data is reasonably separable because the different species (indicated with colors in the plot) do not overlap much. We see that the features are distributed sort of like a normal, or Gaussian, distribution. In 2D a Gaussian distribution is like a hill, so we expect to see more points near the center and fewer on the edge of circle-ish blobs. These blobs are slightly like ovals, but not too skewed.

16.7. Creating test and train#

To do machine learning, we split the data both sample-wise (rows if tidy) and variable-wise (columns if tidy). First, we’ll designate the columns to use as features and as the target.

The features are the input that we wish to use to predict the target.

feature_vars = ['sepal_length', 'sepal_width','petal_length', 'petal_width',]
target_var = 'species'

Next, we’ll use a sklearn function to split the data randomly into test and train portions.

X_train, X_test, y_train, y_test = train_test_split(iris_df[feature_vars],iris_df[target_var],random_state=0)

We can see how many samples it puts in each set by default (a quarter go to the test set, rounded up):

X_train.shape
(112, 4)
X_test.shape
(38, 4)

We can also see that it picks a random subset by the index:

X_train.head()
sepal_length sepal_width petal_length petal_width
61 5.9 3.0 4.2 1.5
92 5.8 2.6 4.0 1.2
112 6.8 3.0 5.5 2.1
2 4.7 3.2 1.3 0.2
141 6.9 3.1 5.1 2.3
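To make the sizes and the shuffled index less mysterious, here is a rough sketch of what a random split does, using only numpy. It assumes the default test fraction of 0.25 and uses an arbitrary seed; it will not reproduce the exact rows that train_test_split picked above, and the variable names are made up for this example.

```python
import math

import numpy as np

n_samples = 150        # number of rows in the iris data
test_fraction = 0.25   # the default used when no size is passed

# the test set size is rounded up, the rest is training
n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test

# shuffle the row positions, then cut the shuffled list in two
rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
shuffled = rng.permutation(n_samples)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(n_train, n_test)
print(train_idx[:5])
```

The two counts match the shapes above, and the first few training positions are out of order, just like the index of X_train.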

16.8. Instantiating our Model Object#

This is the model. In sklearn they call these objects estimators. All estimators have a similar usage. First we instantiate the object and set any hyperparameters.

Instantiating the object says we are assuming a particular type of model, in this case Gaussian Naive Bayes. This sets several assumptions in one form:

  • we assume the features are Gaussian (normally) distributed within each class

  • the features are independent of one another given the class (Naive)

  • the best way to predict is to choose the class with the highest probability given the features (Bayes)

This is one example of a Bayes Estimator.
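Written in symbols, the three assumptions combine into one decision rule. The notation here is an assumption for this sketch rather than something fixed by these notes: \(c\) indexes the classes, \(j\) indexes the features, \(P(c)\) is the prior probability of class \(c\), and \(\theta_{cj}\) and \(\sigma^2_{cj}\) are the mean and variance of feature \(j\) within class \(c\).

\[ \hat{y} = \arg\max_c \; P(c) \prod_{j} \mathcal{N}\left(x_j \mid \theta_{cj}, \sigma^2_{cj}\right) \]

where each factor is a one-dimensional Gaussian density:

\[ \mathcal{N}\left(x \mid \theta, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right) \]

The product over \(j\) is the Naive part, the Gaussian density is the Gaussian part, and picking the class with the largest value is the Bayes part.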

gnb = GaussianNB()

At this point the object is not very interesting

gnb.__dict__
{'priors': None, 'var_smoothing': 1e-09}

The fit method uses the data to learn the model’s parameters. A Gaussian distribution is characterized by a mean and a variance, so the GNB classifier is characterized by one mean and one variance per feature for each class (four of each per class, since our data is 4D).
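As a rough sketch of what fitting estimates, we can compute per-class means, variances, and class proportions with numpy on a tiny made-up dataset. The function name fit_gaussian_nb and the data are hypothetical, and this sketch skips details of the real estimator such as the var_smoothing adjustment.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate per-class feature means, variances, and class proportions."""
    classes = np.unique(y)
    # one row per class, one column per feature
    theta = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0) for c in classes])
    # fraction of training samples in each class
    prior = np.array([(y == c).mean() for c in classes])
    return classes, theta, var, prior


# made-up data: 4 samples, 2 features, 2 classes
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]])
y_toy = np.array(["a", "a", "b", "b"])

classes, theta, var, prior = fit_gaussian_nb(X_toy, y_toy)
print(theta)
print(var)
print(prior)
```

With 3 classes and 4 features, as in the iris data, the same calculation would give 3x4 arrays of means and variances.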

gnb.fit(X_train,y_train)
GaussianNB()

The attributes of the estimator object (gnb) describe the data (e.g., the class list) and the model’s parameters. The theta_ (\(\theta\)) attribute holds the means and the var_ (\(\sigma^2\)) attribute holds the variances of the distributions, with one row per class and one column per feature.

gnb.__dict__
{'priors': None,
 'var_smoothing': 1e-09,
 'classes_': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'feature_names_in_': array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
       dtype=object),
 'n_features_in_': 4,
 'epsilon_': 3.2135586734693885e-09,
 'theta_': array([[4.9972973 , 3.38918919, 1.45405405, 0.24054054],
        [5.91764706, 2.75882353, 4.19117647, 1.30882353],
        [6.66341463, 2.9902439 , 5.58292683, 2.03902439]]),
 'var_': array([[0.12242513, 0.14474799, 0.01978087, 0.01159971],
        [0.2649827 , 0.11124568, 0.22139274, 0.0408045 ],
        [0.4071981 , 0.11453897, 0.30483046, 0.06579417]]),
 'class_count_': array([37., 34., 41.]),
 'class_prior_': array([0.33035714, 0.30357143, 0.36607143])}
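To see how these parameters turn into a prediction, here is a minimal sketch that applies the Gaussian Naive Bayes rule by hand with numpy. The arrays are copied from the output above; the function name predict_one is made up for this example, and sklearn's own implementation may differ in its details.

```python
import numpy as np

# fitted parameters copied from gnb.__dict__ above
classes = np.array(["setosa", "versicolor", "virginica"])
theta = np.array([[4.9972973, 3.38918919, 1.45405405, 0.24054054],
                  [5.91764706, 2.75882353, 4.19117647, 1.30882353],
                  [6.66341463, 2.9902439, 5.58292683, 2.03902439]])
var = np.array([[0.12242513, 0.14474799, 0.01978087, 0.01159971],
                [0.2649827, 0.11124568, 0.22139274, 0.0408045],
                [0.4071981, 0.11453897, 0.30483046, 0.06579417]])
prior = np.array([0.33035714, 0.30357143, 0.36607143])


def predict_one(x):
    """Return the class with the highest log posterior for one sample."""
    x = np.asarray(x, dtype=float)
    # sum of per-feature Gaussian log densities, one total per class
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (x - theta) ** 2 / var, axis=1
    )
    log_posterior = np.log(prior) + log_likelihood
    return classes[np.argmax(log_posterior)]


first_flower = [5.1, 3.5, 1.4, 0.2]   # row 0 of iris_df, a setosa
big_flower = [6.9, 3.1, 5.1, 2.3]     # row 141 of iris_df
print(predict_one(first_flower))
print(predict_one(big_flower))
```

Working with logs turns the product of probabilities into a sum, which avoids multiplying many very small numbers together.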

Once we fit, we can predict

y_pred = gnb.predict(X_test)
y_pred
array(['virginica', 'versicolor', 'setosa', 'virginica', 'setosa',
       'virginica', 'setosa', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'setosa', 'versicolor', 'versicolor', 'setosa',
       'setosa', 'virginica', 'versicolor', 'setosa', 'setosa',
       'virginica', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'setosa', 'virginica', 'versicolor', 'setosa', 'virginica',
       'virginica', 'versicolor', 'setosa', 'versicolor'], dtype='<U10')

We get one prediction for each sample.

Estimator objects also have a score method. If the estimator is a classifier, that score is accuracy. We will see that other types of estimators use different metrics for their score.

gnb.score(X_test,y_test)
1.0

We can also compute a confusion matrix as we did last week (see those notes). Now it is 3x3 though because we have 3 species instead of two outcomes like the COMPAS predictions.

confusion_matrix(y_test,y_pred)
array([[13,  0,  0],
       [ 0, 16,  0],
       [ 0,  0,  9]])
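The confusion matrix and the accuracy score are linked: the correct predictions sit on the diagonal. As a small sketch, with the matrix copied from the output above and variable names made up for this example, we can recover the accuracy and the per-class recall with numpy.

```python
import numpy as np

# confusion matrix copied from above: rows are true species, columns are predicted
cm = np.array([[13, 0, 0],
               [0, 16, 0],
               [0, 0, 9]])

# correct predictions are on the diagonal
accuracy = np.trace(cm) / cm.sum()

# for each true class, the fraction of its samples predicted correctly
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

Any nonzero entry off the diagonal would be a mistake, and its row and column tell us which species was confused for which.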

16.9. Questions After Class#

16.9.1. Is there any extra material where we can learn a better understanding of how to do the predictions and models?#

The Scikit Learn User Guide is a good place, as is the Model Based ML book.

16.9.2. Are there any good introductions to ScikitLearn that you are aware of?#

Scikit Learn User Guide is the best one and they have a large example gallery.

16.9.3. What is a confusion matrix?#

The notes from last Wednesday go over that.

16.9.4. Can you use machine learning for any type of data?#

Yes, the features could be, for example, an image instead of four numbers. They could also be text. The basic ideas are the same for more complex data, so we are going to spend a lot of time building your understanding of what ML is on simple data. Past students have successfully applied ML to more complex data after this course, because once you have a good understanding of the core ideas, applying them to other forms of data is easier to learn on your own.

16.9.5. Can we check how well the model did using the y_test df?#

We could compare them directly, or use the score method, which does that comparison for us.

y_pred == y_test
114    True
62     True
33     True
107    True
7      True
100    True
40     True
86     True
76     True
71     True
134    True
51     True
73     True
54     True
63     True
37     True
78     True
90     True
45     True
16     True
121    True
66     True
24     True
8      True
126    True
22     True
44     True
97     True
93     True
26     True
137    True
84     True
27     True
127    True
132    True
59     True
18     True
83     True
Name: species, dtype: bool
sum(y_pred == y_test)/len(y_test)
1.0
gnb.score(X_test,y_test)
1.0

We can also use any of the other metrics we saw. We’ll practice more on Wednesday.

16.9.6. I want to know more about the train_test_split() function#

The docs are a good place to start.

16.9.7. Could we use the comparison stuff we’ve been learning to test the ML algorithm?#

Yes!!

16.9.8. How should we set the random set we had for x_test to a specific group?#

The random_state parameter will fix it: passing the same integer gives the same split every time.
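As a quick check of that behavior, here is a small sketch on made-up data (the arrays and variable names are just for illustration). Splitting twice with the same random_state should give identical test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# two separate calls with the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, random_state=0)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)
print(len(y_train_a), len(y_test_a))
```

Leaving random_state out means each call can shuffle differently, so results may change from run to run.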

Note

If I misunderstood this question, post an issue

16.9.9. Not a question but we’re using the iris dataset in my programming with R class too!#

The iris dataset is a very popular, easy to use dataset. It is good for this case because it fits the criteria of a simple classifier. It comes from the UC Irvine Machine Learning Repository, a large collection of good datasets for machine learning; you can learn more about it on its page there.