{ "cells": [ { "cell_type": "markdown", "id": "cfc6d97e", "metadata": {}, "source": [ "# Model Comparison" ] }, { "cell_type": "markdown", "id": "63992adb", "metadata": {}, "source": [ "To compare models, we will first optimize the parameters of two different models and look at how the different parameter settings impact the model comparison. Later, we'll see how to compare across models of different classes." ] }, { "cell_type": "code", "execution_count": 1, "id": "1d86e0c7", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn as sns\n", "import pandas as pd\n", "from sklearn import datasets\n", "from sklearn import cluster\n", "from sklearn import svm\n", "from sklearn import tree\n", "from sklearn import model_selection" ] }, { "cell_type": "markdown", "id": "7b9be022", "metadata": {}, "source": [ "We could import modules however we want, for example:" ] }, { "cell_type": "code", "execution_count": 2, "id": "37136c2a", "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection as ms" ] }, { "cell_type": "code", "execution_count": 3, "id": "a5828dce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "200*.8/5" ] }, { "cell_type": "markdown", "id": "873d8d77", "metadata": {}, "source": [ "We'll use the iris data again.\n", "\n", "Remember, we need to split the data into training and test sets. The cross validation step will help us optimize the parameters, but we don't want *data leakage*, where the model has seen the test data multiple times. So, we split the data here into train and test, and the cross validation splits the training data into train and \"test\" again, but this second test set is better termed validation."
] }, { "cell_type": "code", "execution_count": 4, "id": "fdb76ee9", "metadata": {}, "outputs": [], "source": [ "iris_df = sns.load_dataset('iris')\n", "iris_X = iris_df.drop(columns='species')\n", "iris_y = iris_df['species']\n", "\n", "iris_X_train, iris_X_test, iris_y_train, iris_y_test = model_selection.train_test_split(iris_X, iris_y, test_size=.2)" ] }, { "cell_type": "markdown", "id": "efe7ab0e", "metadata": {}, "source": [ "Then we can make the estimator object, the parameter grid dictionary, and the grid search object. We split these into separate cells so that we can use the built-in help to see more detail on each." ] }, { "cell_type": "code", "execution_count": 5, "id": "985f4858", "metadata": {}, "outputs": [], "source": [ "dt = tree.DecisionTreeClassifier()\n", "params_dt = {'criterion':['gini','entropy'],'max_depth':[2,3,4],\n", " 'min_samples_leaf':list(range(2,20,2))}" ] }, { "cell_type": "code", "execution_count": 6, "id": "e0bb3996", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GridSearchCV(estimator=DecisionTreeClassifier(),\n", " param_grid={'criterion': ['gini', 'entropy'],\n", " 'max_depth': [2, 3, 4],\n", " 'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14, 16, 18]})
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_min_samples_leaf | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002271 | 0.000882 | 0.001283 | 0.000139 | gini | 2 | 2 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 1 | 0.001738 | 0.000011 | 0.001199 | 0.000028 | gini | 2 | 4 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 2 | 0.001755 | 0.000024 | 0.001199 | 0.000021 | gini | 2 | 6 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 3 | 0.001751 | 0.000014 | 0.001199 | 0.000037 | gini | 2 | 8 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
| 4 | 0.001762 | 0.000019 | 0.001192 | 0.000020 | gini | 2 | 10 | {'criterion': 'gini', 'max_depth': 2, 'min_sam... | 0.916667 | 0.916667 | 0.916667 | 0.875 | 0.916667 | 0.908333 | 0.016667 | 9 |
GridSearchCV(estimator=SVC(),\n", " param_grid={'C': [0.5, 1, 10], 'kernel': ['linear', 'rbf']})