{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cfc6d97e",
   "metadata": {},
   "source": [
    "# Model Comparison"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63992adb",
   "metadata": {},
   "source": [
    "To compare models, we will first optimize the parameters of two diffrent models and look at how the different parameters settings impact the model comparison.  Later, we'll see how to compare across models of different classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1d86e0c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "from sklearn import datasets\n",
    "from sklearn import cluster\n",
    "from sklearn import svm\n",
    "from sklearn import tree\n",
    "from sklearn import model_selection"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b9be022",
   "metadata": {},
   "source": [
    "We could import modules however we want, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "37136c2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import model_selection as ms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a5828dce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "32.0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "200*.8/5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "873d8d77",
   "metadata": {},
   "source": [
    "We'll use the iris data again.\n",
    "\n",
    "Remember, we need to split the data into training and test.  The cross validation step will hep us optimize the parameters, but we don't want *data leakage* where the model has seen the test data multiple times. So, we split the data here for train and test annd the cross validation splits the training data into train and \"test\" again, but this test is better termed validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fdb76ee9",
   "metadata": {},
   "outputs": [],
   "source": [
    "iris_df = sns.load_dataset('iris')\n",
    "iris_X = iris_df.drop(columns='species')\n",
    "iris_y = iris_df['species']\n",
    "\n",
    "iris_X_train, iris_X_test, iris_y_train, iris_y_test = model_selection.train_test_split(iris_X,iris_y, test_size =.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efe7ab0e",
   "metadata": {},
   "source": [
    "Then we can make the object, the parameter grid dictionary and the Grid Search object.  We split these into separate cells, so that we can use the built in help to see more detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "985f4858",
   "metadata": {},
   "outputs": [],
   "source": [
    "dt = tree.DecisionTreeClassifier()\n",
    "params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],\n",
    "       'min_samples_leaf':list(range(2,20,2))}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e0bb3996",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>GridSearchCV(estimator=DecisionTreeClassifier(),\n",
       "             param_grid={&#x27;criterion&#x27;: [&#x27;gini&#x27;, &#x27;entropy&#x27;],\n",
       "                         &#x27;max_depth&#x27;: [2, 3, 4],\n",
       "                         &#x27;min_samples_leaf&#x27;: [2, 4, 6, 8, 10, 12, 14, 16, 18]})</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">GridSearchCV</label><div class=\"sk-toggleable__content\"><pre>GridSearchCV(estimator=DecisionTreeClassifier(),\n",
       "             param_grid={&#x27;criterion&#x27;: [&#x27;gini&#x27;, &#x27;entropy&#x27;],\n",
       "                         &#x27;max_depth&#x27;: [2, 3, 4],\n",
       "                         &#x27;min_samples_leaf&#x27;: [2, 4, 6, 8, 10, 12, 14, 16, 18]})</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">estimator: DecisionTreeClassifier</label><div class=\"sk-toggleable__content\"><pre>DecisionTreeClassifier()</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">DecisionTreeClassifier</label><div class=\"sk-toggleable__content\"><pre>DecisionTreeClassifier()</pre></div></div></div></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "GridSearchCV(estimator=DecisionTreeClassifier(),\n",
       "             param_grid={'criterion': ['gini', 'entropy'],\n",
       "                         'max_depth': [2, 3, 4],\n",
       "                         'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt_opt = model_selection.GridSearchCV(dt, params_dt)\n",
    "dt_opt.fit(iris_X_train,iris_y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6097bd2a",
   "metadata": {},
   "source": [
    "Then we fit the Grid search using the training data, and remember this actually resets the parameters and then cross validates multiple times.\n",
    "\n",
    "\n",
    "We can reformat it into a dataframe for further analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "106d5b3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "dt_cv_df = pd.DataFrame(dt_opt.cv_results_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "196b08a0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_fit_time</th>\n",
       "      <th>std_fit_time</th>\n",
       "      <th>mean_score_time</th>\n",
       "      <th>std_score_time</th>\n",
       "      <th>param_criterion</th>\n",
       "      <th>param_max_depth</th>\n",
       "      <th>param_min_samples_leaf</th>\n",
       "      <th>params</th>\n",
       "      <th>split0_test_score</th>\n",
       "      <th>split1_test_score</th>\n",
       "      <th>split2_test_score</th>\n",
       "      <th>split3_test_score</th>\n",
       "      <th>split4_test_score</th>\n",
       "      <th>mean_test_score</th>\n",
       "      <th>std_test_score</th>\n",
       "      <th>rank_test_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.002271</td>\n",
       "      <td>0.000882</td>\n",
       "      <td>0.001283</td>\n",
       "      <td>0.000139</td>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>{'criterion': 'gini', 'max_depth': 2, 'min_sam...</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.875</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.908333</td>\n",
       "      <td>0.016667</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.001738</td>\n",
       "      <td>0.000011</td>\n",
       "      <td>0.001199</td>\n",
       "      <td>0.000028</td>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>{'criterion': 'gini', 'max_depth': 2, 'min_sam...</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.875</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.908333</td>\n",
       "      <td>0.016667</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.001755</td>\n",
       "      <td>0.000024</td>\n",
       "      <td>0.001199</td>\n",
       "      <td>0.000021</td>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>{'criterion': 'gini', 'max_depth': 2, 'min_sam...</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.875</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.908333</td>\n",
       "      <td>0.016667</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.001751</td>\n",
       "      <td>0.000014</td>\n",
       "      <td>0.001199</td>\n",
       "      <td>0.000037</td>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "      <td>{'criterion': 'gini', 'max_depth': 2, 'min_sam...</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.875</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.908333</td>\n",
       "      <td>0.016667</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.001762</td>\n",
       "      <td>0.000019</td>\n",
       "      <td>0.001192</td>\n",
       "      <td>0.000020</td>\n",
       "      <td>gini</td>\n",
       "      <td>2</td>\n",
       "      <td>10</td>\n",
       "      <td>{'criterion': 'gini', 'max_depth': 2, 'min_sam...</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.875</td>\n",
       "      <td>0.916667</td>\n",
       "      <td>0.908333</td>\n",
       "      <td>0.016667</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \\\n",
       "0       0.002271      0.000882         0.001283        0.000139   \n",
       "1       0.001738      0.000011         0.001199        0.000028   \n",
       "2       0.001755      0.000024         0.001199        0.000021   \n",
       "3       0.001751      0.000014         0.001199        0.000037   \n",
       "4       0.001762      0.000019         0.001192        0.000020   \n",
       "\n",
       "  param_criterion param_max_depth param_min_samples_leaf  \\\n",
       "0            gini               2                      2   \n",
       "1            gini               2                      4   \n",
       "2            gini               2                      6   \n",
       "3            gini               2                      8   \n",
       "4            gini               2                     10   \n",
       "\n",
       "                                              params  split0_test_score  \\\n",
       "0  {'criterion': 'gini', 'max_depth': 2, 'min_sam...           0.916667   \n",
       "1  {'criterion': 'gini', 'max_depth': 2, 'min_sam...           0.916667   \n",
       "2  {'criterion': 'gini', 'max_depth': 2, 'min_sam...           0.916667   \n",
       "3  {'criterion': 'gini', 'max_depth': 2, 'min_sam...           0.916667   \n",
       "4  {'criterion': 'gini', 'max_depth': 2, 'min_sam...           0.916667   \n",
       "\n",
       "   split1_test_score  split2_test_score  split3_test_score  split4_test_score  \\\n",
       "0           0.916667           0.916667              0.875           0.916667   \n",
       "1           0.916667           0.916667              0.875           0.916667   \n",
       "2           0.916667           0.916667              0.875           0.916667   \n",
       "3           0.916667           0.916667              0.875           0.916667   \n",
       "4           0.916667           0.916667              0.875           0.916667   \n",
       "\n",
       "   mean_test_score  std_test_score  rank_test_score  \n",
       "0         0.908333        0.016667                9  \n",
       "1         0.908333        0.016667                9  \n",
       "2         0.908333        0.016667                9  \n",
       "3         0.908333        0.016667                9  \n",
       "4         0.908333        0.016667                9  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt_cv_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e3bef485",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sklearn.tree._classes.DecisionTreeClassifier"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(dt_opt.best_estimator_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "5d0ac456",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 2}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt_opt.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cc36924",
   "metadata": {},
   "source": [
    "I changed the markers and the color of the error bars for readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "ec1a26c3",
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'dt_df' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m plt\u001b[38;5;241m.\u001b[39merrorbar(x\u001b[38;5;241m=\u001b[39m\u001b[43mdt_df\u001b[49m[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmean_fit_time\u001b[39m\u001b[38;5;124m'\u001b[39m],y\u001b[38;5;241m=\u001b[39mdt_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmean_score_time\u001b[39m\u001b[38;5;124m'\u001b[39m],\n\u001b[1;32m      2\u001b[0m      xerr\u001b[38;5;241m=\u001b[39mdt_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mstd_fit_time\u001b[39m\u001b[38;5;124m'\u001b[39m],yerr\u001b[38;5;241m=\u001b[39mdt_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mstd_score_time\u001b[39m\u001b[38;5;124m'\u001b[39m],\n\u001b[1;32m      3\u001b[0m              marker\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124ms\u001b[39m\u001b[38;5;124m'\u001b[39m,ecolor\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mr\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m      4\u001b[0m plt\u001b[38;5;241m.\u001b[39mxlabel(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mfit time\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m      5\u001b[0m plt\u001b[38;5;241m.\u001b[39mylabel(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mscore time\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
      "\u001b[0;31mNameError\u001b[0m: name 'dt_df' is not defined"
     ]
    }
   ],
   "source": [
    "plt.errorbar(x=dt_df['mean_fit_time'],y=dt_df['mean_score_time'],\n",
    "     xerr=dt_df['std_fit_time'],yerr=dt_df['std_score_time'],\n",
    "             marker='s',ecolor='r')\n",
    "plt.xlabel('fit time')\n",
    "plt.ylabel('score time')\n",
    "# save the limits so we can reuse them\n",
    "xmin, xmax, ymin, ymax = plt.axis()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffad5ad4",
   "metadata": {},
   "source": [
    "The \"points\" are at the mean fit and score times. The lines are the \"standard deviation\" or how much we expect that number to vary, since means are an estimate.\n",
    "Because the data shows an upward trend, this plot tells us that mostly, the models that are slower to fit are also slower to apply. This makes sense for decision trees, deeper trees take longer to learn and longer to traverse when predicting.\n",
    "Because the error bars mostly overlap the other points, this tells us that mostly the variation in time is not a reliable difference. If we re-ran the GridSearch, we could get them in different orders.\n",
    "\n",
    "To interpret the error bar plot, let's look at a line plot of just the means, with the same limits so that it's easier to compare to the plot above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "eb9b3471",
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'dt_df' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m plt\u001b[38;5;241m.\u001b[39mplot(\u001b[43mdt_df\u001b[49m[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmean_fit_time\u001b[39m\u001b[38;5;124m'\u001b[39m],\n\u001b[1;32m      2\u001b[0m             dt_df[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmean_score_time\u001b[39m\u001b[38;5;124m'\u001b[39m], marker\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124ms\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m      3\u001b[0m plt\u001b[38;5;241m.\u001b[39mxlabel(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mfit time\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m      4\u001b[0m plt\u001b[38;5;241m.\u001b[39mylabel(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mscore time\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
      "\u001b[0;31mNameError\u001b[0m: name 'dt_df' is not defined"
     ]
    }
   ],
   "source": [
    "plt.plot(dt_df['mean_fit_time'],\n",
    "            dt_df['mean_score_time'], marker='s')\n",
    "plt.xlabel('fit time')\n",
    "plt.ylabel('score time')\n",
    "# match the axis limits to above\n",
    "plt.ylim(ymin, ymax)\n",
    "plt.xlim(xmin,xmax)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58b5b892",
   "metadata": {},
   "source": [
    "this plot shows the mean times, without the error bars.\n",
    "\n",
    "## Fitting an SVM\n",
    "\n",
    "Now let's compare with a different model, we'll use the parameter optimized version for that model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "b18f7a7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "svmclf = svm.SVC()\n",
    "param_grid_svm = {'kernel':['linear','rbf'], 'C':[.5, 1, 10]}\n",
    "svm_opt = model_selection.GridSearchCV(svmclf,param_grid_svm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "215702cc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>GridSearchCV(estimator=SVC(),\n",
       "             param_grid={&#x27;C&#x27;: [0.5, 1, 10], &#x27;kernel&#x27;: [&#x27;linear&#x27;, &#x27;rbf&#x27;]})</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">GridSearchCV</label><div class=\"sk-toggleable__content\"><pre>GridSearchCV(estimator=SVC(),\n",
       "             param_grid={&#x27;C&#x27;: [0.5, 1, 10], &#x27;kernel&#x27;: [&#x27;linear&#x27;, &#x27;rbf&#x27;]})</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" ><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">estimator: SVC</label><div class=\"sk-toggleable__content\"><pre>SVC()</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-6\" type=\"checkbox\" ><label for=\"sk-estimator-id-6\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SVC</label><div class=\"sk-toggleable__content\"><pre>SVC()</pre></div></div></div></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "GridSearchCV(estimator=SVC(),\n",
       "             param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svm_opt.fit(iris_X_train,iris_y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "ec2c7b66",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'C': 0.5, 'kernel': 'linear'}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svm_opt.best_params_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e02147cc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.9333333333333333, 0.9666666666666667)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svm_opt.score(iris_X_test,iris_y_test), dt_opt.score(iris_X_test,iris_y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bdfbb464",
   "metadata": {},
   "source": [
    "## Other ways to compare models\n",
    "\n",
    "We can look at the performance, here the score is the accuracy and we could also look at other performance metrics.\n",
    "\n",
    "We can compare them on time: the training time or the test time (more important).\n",
    "\n",
    "We can also compare models on their interpretability: a decision tree is easy to explain how it makes a decision.  \n",
    "\n",
    "\n",
    "We can compare if it is generative (describes the data, could generate synthetic data) or discriminative (describes the decision rule). A generative model might be preferred even with lower accuracy by a scientist who wants to understand the data.  It can also help to give ideas about what you might do to improve the model.  If you just need the most accuracy, like if you are placing ads, you would typically use a discriminative model because it is complex, and you just want accuracy you don't need to understand.\n",
    "\n",
    "\n",
    "\n",
    "## Questions After Class\n",
    "\n",
    "\n",
    "### How does the svm algorithm work with many classes? Does it basically draw a grid? Does it work like a decision tree with a single line as each branch decider?\n",
    "\n",
    "Multiclass SVM can be done multiple ways. In [sklearn it does what is called one-vs-one](https://scikit-learn.org/stable/modules/svm.html#multi-class-classification). Basically it trains multiple models and votes.\n",
    "\n",
    "\n",
    "\n",
    "### is best_params_ a way of finding the best options when optimizing a model?\n",
    "\n",
    "That is to find the parameter values that had the best score when you have already optimized it.  \n",
    "\n",
    "### I would like to know how which of the three (svm/gnb/dt) is used the most in the world of data science, in your opinion.\n",
    "\n",
    "Gausian Naive Bayes are definitely the least commonly used.  They're very simple so they are useful for getting the *concepts* of classification down, but they are very simple so they are not very reliable.\n",
    "\n",
    "SVMs were very popular for while, but they have some underlying theoretical similarity to neural nets, and with modern computers where neural nets are popular, SVMs are less.  SVMS share similar disadvantages to NNs but are less accurate in general.  \n",
    "\n",
    "Decision trees are still pretty popular, even with deep learning because they are interpretable.  Right now, I would say of these three they are probably the most common.\n",
    "\n",
    "### is there a limit to data size when it comes to models\n",
    "\n",
    "Not to the model, but possibly to your computer.  We also have batch training that can help with larger datasets."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "text_representation": {
    "extension": ".md",
    "format_name": "myst",
    "format_version": 0.13,
    "jupytext_version": "1.14.1"
   }
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "source_map": [
   12,
   17,
   22,
   32,
   36,
   41,
   43,
   50,
   56,
   61,
   67,
   70,
   77,
   81,
   85,
   89,
   91,
   95,
   103,
   113,
   121,
   129,
   135,
   139,
   143,
   145
  ]
 },
 "nbformat": 4,
 "nbformat_minor": 5
}