{ "cells": [ { "cell_type": "markdown", "id": "6ba152ff", "metadata": {}, "source": [ "# Class 16: Naive Bayes Classification" ] }, { "cell_type": "markdown", "id": "c86c4bc9", "metadata": {}, "source": [ "To learn a classifier, we need labeled data (features and target).\n", "\n", "We split our data twice:\n", "- sample-wise: test and train\n", "- variable-wise: features and target\n", "\n", "## Naive Bayes with Scikit-Learn\n", "\n", "We will use a new package today, [Scikit-Learn](https://scikit-learn.org/stable/index.html). Its package name for importing is `sklearn`, but in general we don't import it with an alias. It's a large module and we most often import just the parts we need. \n", "\n", "````{margin}\n", "```{tip}\n", "Recall that when a word turns green and bold in a notebook, it's a Python keyword, or reserved word.\n", "```\n", "````\n", "\n", "To do that, we use a new Python keyword, `from`. We can name a package and import a submodule from it, or name a package and submodule joined with `.` and import specific functions or classes from it." ] }, { "cell_type": "code", "execution_count": 1, "id": "9e59ac2f", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.naive_bayes import GaussianNB" ] }, { "cell_type": "markdown", "id": "2720d2a9", "metadata": {}, "source": [ "We can tell from this code that `train_test_split` is probably a function because it's in lowercase and `sklearn` follows [PEP 8](https://www.python.org/dev/peps/pep-0008/), the Python style guide, pretty strictly. 
We can also check with `type`." ] }, { "cell_type": "code", "execution_count": 2, "id": "9c4efa49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "function" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(train_test_split)" ] }, { "cell_type": "markdown", "id": "48de00b0", "metadata": {}, "source": [ "We can tell [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) is probably a class because it's in [CapWords](https://www.python.org/dev/peps/pep-0008/#class-names), also known as [camel case](https://en.wikipedia.org/wiki/Camel_case).\n", "\n", "Again, we can check." ] }, { "cell_type": "code", "execution_count": 3, "id": "3264cfc4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "abc.ABCMeta" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(GaussianNB)" ] }, { "cell_type": "markdown", "id": "f435e87c", "metadata": {}, "source": [ "`abc.ABCMeta` is the metaclass Python uses for abstract base classes, so this confirms that `GaussianNB` is a class.\n", "\n", "Today we'll work with the [iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset, which has been used for demonstrating statistical analyses since 1936. It contains 4 measurements of flowers from 3 different species." ] }, { "cell_type": "code", "execution_count": 4, "id": "5830b3cd", "metadata": {}, "outputs": [], "source": [ "iris_df = sns.load_dataset('iris')" ] }, { "cell_type": "markdown", "id": "b25e9ae0", "metadata": {}, "source": [ "As usual, we look at the structure." ] }, { "cell_type": "code", "execution_count": 5, "id": "ccc73d47", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | sepal_length | sepal_width | petal_length | petal_width | species |\n", "
|---|---|---|---|---|---|\n", "
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |\n", "
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |\n", "
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |\n", "
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |\n", "
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |\n", "