Skip to article frontmatterSkip to article content

Classification with Naive Bayes

Today, we will see our first machine learning model: Gaussian Naive bayes.

This is a generative classification model

First let’s load the modules and data that we will use today.

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score 
import matplotlib.pyplot as plt

iris_df = sns.load_dataset('iris')

sns.set_theme(palette='colorblind')

Introduction to the Iris data

We’re trying to build an automatic flower classifier that, for measurements of a new flower returns the predicted species(different types of iris). To do this, we have a DataFrame with columns for species, petal width, petal length, sepal length, and sepal width. The species is what type of flower it is the petal and sepal are parts of the flower.

image of an iris with the petal width, petal length, sepal length and sepal width annotated

Figure 1:The petal and sepal are different parts of a flower, the popular iris data contains 150 examples of irises from 3 species and 4 measurements of each: petal width, petal length, sepal length and sepal width

iris_df.head()
Loading...

The species will be the target and the 4 measurements will be the features.

For the iris data, our goal is to predict the species from the measurements. To make our code really explicit about how we are using each feature we will make variables for the features and target.

feature_vars = ['sepal_length', 'sepal_width','petal_length', 'petal_width',]
target_var = 'species'

We can look at a few things first:

classes = list(pd.unique(iris_df[target_var]))
iris_df[target_var].value_counts()
species setosa 50 versicolor 50 virginica 50 Name: count, dtype: int64

We see that it has 3 classes: ['setosa', 'versicolor', 'virginica'] and that they all have the same number of samples.

We refer to this as a balanced dataset.

Separating Training and Test Data

To do machine learning, we split the data both sample wise (rows if tidy) and variable-wise (columns if tidy). First, we’ll designate the columns to use as features and as the target.

The features are the input that we wish to use to predict the target.

Next, we’ll use a sklearn function to split the data randomly into test and train portions.

The train_test_split function returns multiple values, the docs say that it returns twice as many as it is passed.

random_state_choice = 5
X_train, X_test, y_train, y_test = train_test_split(iris_df[feature_vars],
                                                    iris_df[target_var],
                                                    random_state=random_state_choice)

We passed two separate things, the features and the labels separated, so we get train and test each for each:

We can look at the training data’s head:

X_train.head()
Loading...

and the test data

X_test.head()
Loading...

we see that they have different heads but the same columns. Let’s look at their sizes:

full_r, _ = iris_df.shape
train_r, train_c = X_train.shape
test_r, test_c = X_test.shape
full_r, train_r, test_r
(150, 112, 38)

We see the total data has 150 rows and then 112 will be used for training and 38 for test

these add up:

full_r == train_r + test_r
True

and by default the percentage of training data

train_pct = train_r/full_r
train_pct
0.7466666666666667

We see by default it is 74.66666666666667%

What does Gaussian Naive Bayes do?

Gaussian Naive Bayes is a classification model that assumes:

More resources:

We can look at this data using a pair plot. It plots each pair of numerical variables in a grid of scatterplots and on the diagonal (where it would be a variable with itself) shows the distribution of that variable (with a kdeplot).

sns.pairplot(iris_df,hue='species')
<seaborn.axisgrid.PairGrid at 0x7f6e79291e20>
<Figure size 1130x1000 with 20 Axes>

This data is reasonably separable beacause the different species (indicated with colors in the plot) do not overlap much. All classifiers require separable classes.

We see that the features are distributed sort of like a normal, or Gaussian, distribution. In 1D this is the familiar bell curve. In 2D a Gaussian distribution is like a hill, so we expect to see more points near the center and fewer on the edge of circle-ish blobs. These blobs are slightly like ovals, but not too skew(diagonal).

This means that the assumptions of the Gaussian Naive Bayes model are met well enough we can expect the classifier to do well.

Instantiating our Model Object

Next we will instantiate the object for our model. In sklearn they call these objects estimator. All estimators have a similar usage. First we instantiate the object and set any hyperparameters.

Instantiating the object says we are assuming a particular type of model. In this case Gaussian Naive Bayes.

GaussianNB sets several assumptions in one form:

this is one example of a Bayes Estimator

gnb = GaussianNB()

At this point the object is not very interesting, but we can still inspect it so we can see a before and after

gnb.__dict__
{'priors': None, 'var_smoothing': 1e-09}

Fitting the model to the data

The fit method uses the data to learn the model’s parameters, it implements the learning algorithm. In this case, a Gaussian distribution is characterized by a mean and variance; so the GNB classifier is characterized by one mean and one variance for each class (in 4d, like our data).

gnb.fit(X_train, y_train)
Loading...

The attributes of the estimator object (gbn) describe the data (eg the class list) and the model’s parameters. The theta_ (often in math as θ\theta or μ\mu) represents the mean and the var_ (σ\sigma) represents the variance of the distributions.

gnb.__dict__
{'priors': None, 'var_smoothing': 1e-09, 'classes_': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'feature_names_in_': array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype=object), 'n_features_in_': 4, 'epsilon_': np.float64(3.1373206313775512e-09), 'theta_': array([[5.00526316, 3.45263158, 1.45 , 0.23421053], [5.96666667, 2.75277778, 4.26666667, 1.31944444], [6.55789474, 2.98684211, 5.52105263, 2.02631579]]), 'var_': array([[0.14207757, 0.16196676, 0.03460527, 0.00698754], [0.26611111, 0.10471451, 0.21722223, 0.03767747], [0.40770083, 0.1090374 , 0.32745153, 0.06193906]]), 'class_count_': array([38., 36., 38.]), 'class_prior_': array([0.33928571, 0.32142857, 0.33928571])}

We can see it learned a bunch about our data.

The theta_ is the means of the distribution and the var_ is the variance.

Scoring a model

Estimator objects also have a score method. If the estimator is a classifier, that score is accuracy. We will see that for other types of estimators it is different types.

gnb.score(X_test,y_test)
0.9210526315789473

Making model predictions

we can predict for each sample as well:

y_pred = gnb.predict(X_test)
y_pred
array(['versicolor', 'versicolor', 'virginica', 'setosa', 'virginica', 'versicolor', 'setosa', 'virginica', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'setosa', 'setosa', 'virginica', 'virginica', 'setosa', 'setosa', 'versicolor', 'virginica', 'setosa', 'versicolor', 'versicolor', 'virginica', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'setosa', 'versicolor', 'versicolor', 'setosa', 'versicolor', 'setosa', 'setosa', 'virginica'], dtype='<U10')

We can also do one single sample, the iloc attrbiute lets us pick out rows by integer index even if that is not the actual index of the DataFrame

X_test.iloc[0]
sepal_length 5.8 sepal_width 2.7 petal_length 3.9 petal_width 1.2 Name: 82, dtype: float64

but if we pick one row, it returns a series, which is incompatible with the predict method.

gnb.predict(X_test.iloc[0])
/home/runner/.local/lib/python3.12/site-packages/sklearn/utils/validation.py:2749: UserWarning: X does not have valid feature names, but GaussianNB was fitted with feature names
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 gnb.predict(X_test.iloc[0])

File ~/.local/lib/python3.12/site-packages/sklearn/naive_bayes.py:105, in _BaseNB.predict(self, X)
     91 """
     92 Perform classification on an array of test vectors X.
     93 
   (...)    102     Predicted target values for X.
    103 """
    104 check_is_fitted(self)
--> 105 X = self._check_X(X)
    106 jll = self._joint_log_likelihood(X)
    107 return self.classes_[np.argmax(jll, axis=1)]

File ~/.local/lib/python3.12/site-packages/sklearn/naive_bayes.py:272, in GaussianNB._check_X(self, X)
    270 def _check_X(self, X):
    271     """Validate X, used only in predict* methods."""
--> 272     return validate_data(self, X, reset=False)

File ~/.local/lib/python3.12/site-packages/sklearn/utils/validation.py:2954, in validate_data(_estimator, X, y, reset, validate_separately, skip_check_array, **check_params)
   2952         out = X, y
   2953 elif not no_val_X and no_val_y:
-> 2954     out = check_array(X, input_name="X", **check_params)
   2955 elif no_val_X and not no_val_y:
   2956     out = _check_y(y, **check_params)

File ~/.local/lib/python3.12/site-packages/sklearn/utils/validation.py:1091, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_all_finite, ensure_non_negative, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
   1084         else:
   1085             msg = (
   1086                 f"Expected 2D array, got 1D array instead:\narray={array}.\n"
   1087                 "Reshape your data either using array.reshape(-1, 1) if "
   1088                 "your data has a single feature or array.reshape(1, -1) "
   1089                 "if it contains a single sample."
   1090             )
-> 1091         raise ValueError(msg)
   1093 if dtype_numeric and hasattr(array.dtype, "kind") and array.dtype.kind in "USV":
   1094     raise ValueError(
   1095         "dtype='numeric' is not compatible with arrays of bytes/strings."
   1096         "Convert your data to numeric values explicitly instead."
   1097     )

ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.

If we select with a range, that only includes 1, it still returns a DataFrame

X_test.iloc[0:1]
Loading...

which we can get a prediction for:

gnb.predict(X_test.iloc[0:1])
array(['versicolor'], dtype='<U10')

We could also transform with to_frame and then transpose with T or (transpose)

gnb.predict(X_test.iloc[0].to_frame().T)
array(['versicolor'], dtype='<U10')

We can also pass a 2D array (list of lists) with values in it (here I typed in values similar to the mean for setosa above, so it should predict setosa)

gnb.predict([[5.1, 3.6, 1.5, 0.25]])
/home/runner/.local/lib/python3.12/site-packages/sklearn/utils/validation.py:2749: UserWarning: X does not have valid feature names, but GaussianNB was fitted with feature names
  warnings.warn(
array(['setosa'], dtype='<U10')

This way it warns us that the feature names are missing, but it still gives a prediction.

More evaluation

Like we saw last week, we can also produce a confusion matrix for this problem. This time however it will be 3x3 since we have 3 classes: ['setosa', 'versicolor', 'virginica']

confusion_matrix(y_test,y_pred)
array([[12, 0, 0], [ 0, 13, 1], [ 0, 2, 10]])

This is a little harder to read than the 2D version but we can make it a dataframe to read it better.

n_classes = len(gnb.classes_)
prediction_labels = [['predicted class']*n_classes, gnb.classes_]
actual_labels = [['true class']*n_classes, gnb.classes_]
conf_mat = confusion_matrix(y_test,y_pred)
conf_df = pd.DataFrame(data = conf_mat, index=actual_labels, columns=prediction_labels)
conf_df
Loading...

We see that the setosa is never mistaken for other classes but he other two are mixed up a few times.

A summary “report” is also available:

print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       0.87      0.93      0.90        14
   virginica       0.91      0.83      0.87        12

    accuracy                           0.92        38
   macro avg       0.93      0.92      0.92        38
weighted avg       0.92      0.92      0.92        38

We can also get a report with a few metrics.

We see we have perfect recall and precision for setosa, as above, but we have lower for the other two because there were mistakes where versicolor and virginica were mixed up.

How does GNB make predictions?

Recall the original assumptions. We can see the third one (predicting most probable class) by looking at, for each test sample the probability it belongs to each class.

For two sample we get a probability for each class:

gnb.predict_proba(X_test.iloc[0:2])
array([[4.98983163e-068, 9.99976886e-001, 2.31140535e-005], [2.38605399e-151, 6.05788555e-001, 3.94211445e-001]])

We can visualize these by making a dataframe and plotting it.

First we make the dataframe with the probabilities, the true values and the predictions.

Notebook Cell
# make the probabilities into a dataframe labeled with classes & make the index a separate column
prob_df = pd.DataFrame(data = gnb.predict_proba(X_test), columns = gnb.classes_ ).reset_index()
# add the predictions as a new column
prob_df['predicted_species'] = y_pred
# add the true species as a new column
prob_df['true_species'] = y_test.values
prob_df.head()
Loading...

Then add some additional columns

Notebook Cell
# for plotting, make a column that combines the index & prediction
pred_text = lambda r: str( r['index']) + ',' + r['predicted_species']
prob_df['i,pred'] = prob_df.apply(pred_text,axis=1)
# same for ground truth
true_text = lambda r: str( r['index']) + ',' + r['true_species']
prob_df['i,true'] = prob_df.apply(true_text,axis=1)
# a dd a column for which are correct
prob_df['correct'] = prob_df['predicted_species'] == prob_df['true_species']
prob_df.head()
Loading...

Finally we melt the data to use it for plotting

Notebook Cell
prob_df_melted = prob_df.melt(id_vars =[ 'index', 'predicted_species','true_species','i,pred','i,true','correct'],value_vars = gnb.classes_,
                             var_name = target_var, value_name = 'probability')
prob_df_melted.head()
Loading...
sns.set_theme(font_scale=2)
sns.catplot(data =prob_df_melted, x = 'species', y='probability' ,col ='i,true',
            col_wrap=5,kind='bar', hue='species')
<seaborn.axisgrid.FacetGrid at 0x7f6e73d0cec0>
<Figure size 2500x4000 with 38 Axes>

Here we see for each point in the test set a bar for the model’s probability that it belongs to each of the 3 classes(['setosa', 'versicolor', 'virginica']).

Most points are nearly probability 1 in the correvt class and near 0 in the other classes.

For example:

prob_select_idx = prob_df_melted['index'].isin([0,2,3])
sns.catplot(data =prob_df_melted[prob_select_idx], x = 'species', 
    y='probability' ,col ='i,true',
            col_wrap=5,kind='bar', hue='species')
<seaborn.axisgrid.FacetGrid at 0x7f6e6cc4a750>
<Figure size 2500x500 with 3 Axes>

Some, however, are less clear cut, for exmple, sample 1

split = [1]
prob_select_idx = prob_df_melted['index'].isin(split)

sns.catplot(data =prob_df_melted[prob_select_idx], x = 'species', 
    y='probability' ,col ='i,true',
            col_wrap=5,kind='bar', hue='species')
<seaborn.axisgrid.FacetGrid at 0x7f6e6c954ec0>
<Figure size 2500x500 with 1 Axes>

Sample 1’s true class is virginica, but it was predicted as np.str_('versicolor'). We see in the plot, the model was actually split on this sample, 2 bars are meaningfully above 0.

We could even pick out all of the ones where the predictionw as wrong:

error_idx = prob_df_melted['correct'] ==False

sns.catplot(data =prob_df_melted[error_idx], x = 'species', 
    y='probability' ,col ='i,true',
            col_wrap=5,kind='bar', hue='species')
<seaborn.axisgrid.FacetGrid at 0x7f6e6f0cb260>
<Figure size 2500x500 with 3 Axes>

All of these have a somewhat split decision, so that is good, the model was not confidently wrong.

What does it mean to be a generative model?

Gaussian Naive Bayes is a very simple model, but it is a generative model (in constrast to a discriminative model) so we can use it to generate synthetic data that looks like the real data, based on what the model learned.

sns.set_theme(font_scale=1)
N = 50
gnb_df = pd.DataFrame(np.concatenate([np.random.multivariate_normal(th, sig*np.eye(4),N)
                 for th, sig in zip(gnb.theta_,gnb.var_)]),
                 columns = gnb.feature_names_in_)
gnb_df['species'] = [ci for cl in [[c]*N for c in gnb.classes_] for ci in cl]
gnb_df.head()
Loading...

To break this code down:

Now that we have simulated data (also known as synthetic) that represents our what our model knows, we can plot it the same way we plotted the actual iris data.

sns.pairplot(data =gnb_df, hue='species')
<seaborn.axisgrid.PairGrid at 0x7f6e6f10e690>
<Figure size 1130x1000 with 20 Axes>

These look pretty similar.

To help compare visually here they are in tabs so you can toggle back and forth.

Original
Simulated
sns.pairplot(iris_df,hue='species')
<seaborn.axisgrid.PairGrid at 0x7f6e79291e20>
<Figure size 1130x1000 with 20 Axes>

Think Ahead

Does this data meet the assumptions of Gaussian Naive Bayes?

corner_data = 'https://raw.githubusercontent.com/rhodyprog4ds/06-naive-bayes/f425ba121cc0c4dd8bcaa7ebb2ff0b40b0b03bff/data/dataset6.csv'
df6= pd.read_csv(corner_data,usecols=[1,2,3])
sns.pairplot(data=df6, hue='char',hue_order=['A','B'])
<seaborn.axisgrid.PairGrid at 0x7f6e6e171610>
<Figure size 565.5x500 with 6 Axes>

Questions

Why some of our numbers were different than yours?

The train_test_split function is random, so each time it is run it will give different results, unless the random_state is set so that it uses a particular sequence of psuedo-random numbers.

is species as in what type of flower it is?

yes

what is X_train and y_train doing?

The original data had 5 columns

iris_df.head()
Loading...

Since we will do classification which is supervised learning, the training data has two parts: the features and targets

X is only the feature columns, here ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

X_train.head()
Loading...

y is only the target column, here species:

y_train.head()
40 setosa 115 virginica 142 virginica 69 versicolor 17 setosa Name: species, dtype: object

We can see that these have the same rows selected.

We can also confirm that these are different rows from the test data

np.sum([xti in X_test.index for xti in X_train.index])
np.int64(0)

Since the above came out to 0, we know tht for every element of X_train.index it does not appear in X_test.index

I would like to know more about making model predictions

We will be doing more of this over the next few classes!

What are the different names for parameters within GNB?

The Gausian Naive Bayes classifier has two main model parameters the mean and variance as a Gaussian Distribution does. There are a few others, you can read about in the docs, but are not required.

What is np.eye() what does that mean and why use 4?

np.eye(D) creates an Identity matrix of dimension D. We used 4 because there are 4 features (['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) in the data.

np.eye(4)
array([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]])

do we only open 1 pull request for assignment 3?

Yes

How much can classes overlap before it becomes an issue?

The classes have to be separable to get a perfect classification, if they’re not perfectly separable, then the classifier will make errors. The answer to this question depends on the cost of an error.

For the current split (with random_state 5) we got a good, but not perfect score

The test set here must overlap a little:

test_df = X_test
test_df['species'] = y_test
sns.pairplot(test_df,hue='species')
<seaborn.axisgrid.PairGrid at 0x7f6e6e117770>
<Figure size 1130x1000 with 20 Axes>

If you look in the petal_width vs petal_length sub plots there are some points that almost completly overlap.

Remember, when we had random_state=0, we got a perfect score. Let’s do that again:

X_train0, X_test0, y_train0, y_test0 = train_test_split(iris_df[feature_vars],
                                                    iris_df[target_var],
                                                    random_state=0)
gnb.fit(X_train0,y_train0).score(X_test0,y_test0)
1.0

This then has no overlap.

test_df = X_test0
test_df['species'] = y_test0
sns.pairplot(test_df,hue='species')
<seaborn.axisgrid.PairGrid at 0x7f6e6d7285c0>
<Figure size 1130x1000 with 20 Axes>