{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "486ee836",
   "metadata": {},
   "source": [
    "# Class 20: Decision Trees and Cross Validation\n",
    "\n",
    "\n",
    "1. Share your favorite beverage (or say hi) in the Zoom chat\n",
    "1. Log onto Prismia\n",
    "1. Accept assignment 7\n",
    "\n",
    "\n",
    "<!-- annotate: Assignment 7  -->\n",
    "## Assignment 7\n",
    "\n",
    "Make a plan with a group:\n",
    "- What methods do you need to use in part 1?\n",
    "- Try to outline with pseudocode what you'll do for parts 2 and 3\n",
    "\n",
    "Share any questions you have.\n",
    "\n",
    "Follow-up:\n",
    "1. The assignment was clarified to require 3 values for the parameter in part 2\n",
    "1. More tips on finding data sets were added to the assignment text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcff09f9",
   "metadata": {},
   "source": [
    "<!-- annotate: Complexity of Decision Trees -->\n",
    "## Complexity of Decision Trees"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d3bb4217",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load http://drsmb.co/310\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "from sklearn import tree\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "d6_url = 'https://raw.githubusercontent.com/rhodyprog4ds/06-naive-bayes/main/data/dataset6.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "46bb711d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x0</th>\n",
       "      <th>x1</th>\n",
       "      <th>char</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6.14</td>\n",
       "      <td>2.10</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.22</td>\n",
       "      <td>2.39</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.27</td>\n",
       "      <td>5.44</td>\n",
       "      <td>B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.03</td>\n",
       "      <td>3.19</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.25</td>\n",
       "      <td>1.71</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     x0    x1 char\n",
       "0  6.14  2.10    B\n",
       "1  2.22  2.39    A\n",
       "2  2.27  5.44    B\n",
       "3  1.03  3.19    A\n",
       "4  2.25  1.71    A"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# load only columns 1-3 (x0, x1, char), skipping column 0\n",
    "df6 = pd.read_csv(d6_url, usecols=[1, 2, 3])\n",
    "df6.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e474021",
   "metadata": {},
   "source": [
    "````{margin}\n",
    "```{note}\n",
    "`df6.values` is a numpy array, which is a good data structure for storing matrices of data.  We can index into numpy arrays using `[rows, columns]`.  In `df6.values[:,:2]` we take all the rows (`:`) and the columns up to, but not including, index 2 (`:2`) for the features (X).  In `df6.values[:,2]` we take all the rows and the column at index 2 for the target (y).\n",
    "```\n",
    "````"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6f2eda9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train,  y_test = train_test_split(df6.values[:,:2],df6.values[:,2],\n",
    "                                                     train_size=.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f68b1cc9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(min_samples_leaf=10)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# require at least 10 training samples in every leaf\n",
    "dt = tree.DecisionTreeClassifier(min_samples_leaf=10)\n",
    "dt.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6d2a2d54",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|--- feature_0 <= 5.88\n",
      "|   |--- feature_1 <= 3.98\n",
      "|   |   |--- feature_0 <= 4.07\n",
      "|   |   |   |--- class: A\n",
      "|   |   |--- feature_0 >  4.07\n",
      "|   |   |   |--- class: B\n",
      "|   |--- feature_1 >  3.98\n",
      "|   |   |--- feature_0 <= 4.09\n",
      "|   |   |   |--- class: B\n",
      "|   |   |--- feature_0 >  4.09\n",
      "|   |   |   |--- class: A\n",
      "|--- feature_0 >  5.88\n",
      "|   |--- feature_1 <= 3.89\n",
      "|   |   |--- class: B\n",
      "|   |--- feature_1 >  3.89\n",
      "|   |   |--- class: A\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(tree.export_text(dt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "85e97f08",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(min_samples_leaf=50)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# a larger minimum leaf size allows fewer splits, so the tree is simpler\n",
    "dt2 = tree.DecisionTreeClassifier(min_samples_leaf=50)\n",
    "dt2.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2f49a37e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|--- feature_0 <= 5.88\n",
      "|   |--- feature_1 <= 3.98\n",
      "|   |   |--- class: A\n",
      "|   |--- feature_1 >  3.98\n",
      "|   |   |--- class: B\n",
      "|--- feature_0 >  5.88\n",
      "|   |--- class: B\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(tree.export_text(dt2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8788a4ac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt2.score(X_test,y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d343b138",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt.score(X_test,y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "6862e1bd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(200, 3)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df6.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bec07b14",
   "metadata": {},
   "source": [
    "<!-- annotate: Training, Test set size and Cross Validation -->\n",
    "## Training, Test set size and Cross Validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "adef8bfc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier()"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# fit an unconstrained tree on every sample except the last row\n",
    "dt3 = tree.DecisionTreeClassifier()\n",
    "dt3.fit(df6.values[:-1,:2], df6.values[:-1,2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "92d41d9a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "|--- feature_0 <= 5.88\n",
      "|   |--- feature_1 <= 5.33\n",
      "|   |   |--- feature_0 <= 4.07\n",
      "|   |   |   |--- feature_1 <= 4.00\n",
      "|   |   |   |   |--- class: A\n",
      "|   |   |   |--- feature_1 >  4.00\n",
      "|   |   |   |   |--- class: B\n",
      "|   |   |--- feature_0 >  4.07\n",
      "|   |   |   |--- feature_1 <= 3.91\n",
      "|   |   |   |   |--- class: B\n",
      "|   |   |   |--- feature_1 >  3.91\n",
      "|   |   |   |   |--- class: A\n",
      "|   |--- feature_1 >  5.33\n",
      "|   |   |--- feature_0 <= 4.09\n",
      "|   |   |   |--- class: B\n",
      "|   |   |--- feature_0 >  4.09\n",
      "|   |   |   |--- class: A\n",
      "|--- feature_0 >  5.88\n",
      "|   |--- feature_1 <= 3.89\n",
      "|   |   |--- class: B\n",
      "|   |--- feature_1 >  3.89\n",
      "|   |   |--- class: A\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(tree.export_text(dt3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "93be35c8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/model_selection/_split.py:668: UserWarning: The least populated class in y has only 99 members, which is less than n_splits=100.\n",
      "  % (min_groups, self.n_splits)), UserWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([1. , 1. , 0.5, 0.5, 1. , 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 0.5,\n",
       "       1. , 1. , 0.5, 1. , 1. , 0.5, 1. , 1. , 0.5, 0.5, 1. , 0. , 1. ,\n",
       "       1. , 1. , 0.5, 1. , 0.5, 0.5, 0.5, 0.5, 1. , 0.5, 1. , 1. , 0.5,\n",
       "       0.5, 1. , 0.5, 0.5, 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 1. , 1. ,\n",
       "       1. , 1. , 1. , 0.5, 1. , 1. , 1. , 1. , 1. , 0. , 1. , 0.5, 0.5,\n",
       "       1. , 0. , 1. , 0.5, 1. , 0.5, 0. , 1. , 1. , 1. , 1. , 0.5, 0.5,\n",
       "       0.5, 1. , 1. , 1. , 1. , 0. , 1. , 1. , 1. , 1. , 1. , 0.5, 1. ,\n",
       "       1. , 1. , 0.5, 1. , 1. , 1. , 0.5, 0. , 0.5])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# 100-fold cross validation: with 200 samples, each fold is scored on 2 held-out samples\n",
    "dt4 = tree.DecisionTreeClassifier(max_depth=2)\n",
    "cv_scores = cross_val_score(dt4, df6.values[:,:2], df6.values[:,2], cv=100)\n",
    "cv_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "672f2095",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.755"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(cv_scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4778e6e2",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "text_representation": {
    "extension": ".md",
    "format_name": "myst",
    "format_version": 0.12,
    "jupytext_version": "1.6.0"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  },
  "source_map": [
   12,
   35,
   40,
   51,
   54,
   63,
   68,
   73,
   77,
   82,
   86,
   90,
   94,
   96,
   101,
   106,
   110,
   116,
   120
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}