{ "cells": [ { "cell_type": "markdown", "id": "486ee836", "metadata": {}, "source": [ "# Class 20: Decision Trees and Cross Validation\n", "\n", "\n", "1. Share your favorite beverage (or say hi) in the Zoom chat\n", "1. Log onto Prismia\n", "1. Accept assignment 7\n", "\n", "\n", "\n", "## Assignment 7\n", "\n", "Make a plan with a group:\n", "- What methods do you need to use in part 1?\n", "- Try to outline in pseudocode what you'll do for parts 2 and 3\n", "\n", "Share any questions you have.\n", "\n", "Follow-up:\n", "1. Assignment clarified to require 3 values for the parameter in part 2\n", "1. More tips on finding data sets added to the assignment text" ] }, { "cell_type": "markdown", "id": "fcff09f9", "metadata": {}, "source": [ "\n", "## Complexity of Decision Trees" ] }, { "cell_type": "code", "execution_count": 1, "id": "d3bb4217", "metadata": {}, "outputs": [], "source": [ "# %load http://drsmb.co/310\n", "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "from sklearn import tree\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.model_selection import train_test_split\n", "d6_url = 'https://raw.githubusercontent.com/rhodyprog4ds/06-naive-bayes/main/data/dataset6.csv'" ] }, { "cell_type": "code", "execution_count": 2, "id": "46bb711d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>x0</th>\n", "      <th>x1</th>\n", "      <th>char</th>\n", "    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>6.14</td>\n", "      <td>2.10</td>\n", "      <td>B</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>2.22</td>\n", "      <td>2.39</td>\n", "      <td>A</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>2.27</td>\n", "      <td>5.44</td>\n", "      <td>B</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>1.03</td>\n", "      <td>3.19</td>\n", "      <td>A</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>2.25</td>\n", "      <td>1.71</td>\n", "      <td>A</td>\n", "    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " x0 x1 char\n", "0 6.14 2.10 B\n", "1 2.22 2.39 A\n", "2 2.27 5.44 B\n", "3 1.03 3.19 A\n", "4 2.25 1.71 A" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df6 = pd.read_csv(d6_url, usecols=[1, 2, 3])\n", "df6.head()" ] }, { "cell_type": "markdown", "id": "9e474021", "metadata": {}, "source": [ "````{margin}\n", "```{note}\n", "`df6.values` is a NumPy array, which is a good data structure for storing matrices of data. We can index into NumPy arrays using `[rows, columns]`. Here, `df6.values[:,:2]` takes all the rows (`:`) and the columns up to, but not including, index 2 (`:2`) for the features (X), and `df6.values[:,2]` takes the column at index 2 for the target (y).\n", "```\n", "````" ] }, { "cell_type": "code", "execution_count": 3, "id": "6f2eda9d", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df6.values[:,:2],df6.values[:,2],\n", " train_size=.8)" ] }, { "cell_type": "code", "execution_count": 4, "id": "f68b1cc9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(min_samples_leaf=10)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt = tree.DecisionTreeClassifier(min_samples_leaf=10)\n", "dt.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 5, "id": "6d2a2d54", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|--- feature_0 <= 5.88\n", "| |--- feature_1 <= 3.98\n", "| | |--- feature_0 <= 4.07\n", "| | | |--- class: A\n", "| | |--- feature_0 > 4.07\n", "| | | |--- class: B\n", "| |--- feature_1 > 3.98\n", "| | |--- feature_0 <= 4.09\n", "| | | |--- class: B\n", "| | |--- feature_0 > 4.09\n", "| | | |--- class: A\n", "|--- feature_0 > 5.88\n", "| |--- feature_1 <= 3.89\n", "| | |--- class: B\n", "| |--- feature_1 > 3.89\n", "| | |--- class: A\n", "\n" ] } ], "source": [ "print(tree.export_text(dt))" ] }, { "cell_type": "code", 
"execution_count": 6, "id": "85e97f08", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(min_samples_leaf=50)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt2 = tree.DecisionTreeClassifier(min_samples_leaf = 50)\n", "dt2.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 7, "id": "2f49a37e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|--- feature_0 <= 5.88\n", "| |--- feature_1 <= 3.98\n", "| | |--- class: A\n", "| |--- feature_1 > 3.98\n", "| | |--- class: B\n", "|--- feature_0 > 5.88\n", "| |--- class: B\n", "\n" ] } ], "source": [ "print(tree.export_text(dt2))" ] }, { "cell_type": "code", "execution_count": 8, "id": "8788a4ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt2.score(X_test,y_test)" ] }, { "cell_type": "code", "execution_count": 9, "id": "d343b138", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt.score(X_test,y_test)" ] }, { "cell_type": "code", "execution_count": 10, "id": "6862e1bd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(200, 3)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df6.shape" ] }, { "cell_type": "markdown", "id": "bec07b14", "metadata": {}, "source": [ "\n", "## Training, Test set size and Cross Validation" ] }, { "cell_type": "code", "execution_count": 11, "id": "adef8bfc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier()" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt3 = tree.DecisionTreeClassifier()\n", "dt3.fit(df6.values[:-1,:2],df6.values[:-1,2],)" ] }, { "cell_type": "code", "execution_count": 12, "id": "92d41d9a", 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|--- feature_0 <= 5.88\n", "| |--- feature_1 <= 5.33\n", "| | |--- feature_0 <= 4.07\n", "| | | |--- feature_1 <= 4.00\n", "| | | | |--- class: A\n", "| | | |--- feature_1 > 4.00\n", "| | | | |--- class: B\n", "| | |--- feature_0 > 4.07\n", "| | | |--- feature_1 <= 3.91\n", "| | | | |--- class: B\n", "| | | |--- feature_1 > 3.91\n", "| | | | |--- class: A\n", "| |--- feature_1 > 5.33\n", "| | |--- feature_0 <= 4.09\n", "| | | |--- class: B\n", "| | |--- feature_0 > 4.09\n", "| | | |--- class: A\n", "|--- feature_0 > 5.88\n", "| |--- feature_1 <= 3.89\n", "| | |--- class: B\n", "| |--- feature_1 > 3.89\n", "| | |--- class: A\n", "\n" ] } ], "source": [ "print(tree.export_text(dt3))" ] }, { "cell_type": "code", "execution_count": 13, "id": "93be35c8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/model_selection/_split.py:668: UserWarning: The least populated class in y has only 99 members, which is less than n_splits=100.\n", " % (min_groups, self.n_splits)), UserWarning)\n" ] }, { "data": { "text/plain": [ "array([1. , 1. , 0.5, 0.5, 1. , 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 0.5,\n", " 1. , 1. , 0.5, 1. , 1. , 0.5, 1. , 1. , 0.5, 0.5, 1. , 0. , 1. ,\n", " 1. , 1. , 0.5, 1. , 0.5, 0.5, 0.5, 0.5, 1. , 0.5, 1. , 1. , 0.5,\n", " 0.5, 1. , 0.5, 0.5, 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 1. , 1. ,\n", " 1. , 1. , 1. , 0.5, 1. , 1. , 1. , 1. , 1. , 0. , 1. , 0.5, 0.5,\n", " 1. , 0. , 1. , 0.5, 1. , 0.5, 0. , 1. , 1. , 1. , 1. , 0.5, 0.5,\n", " 0.5, 1. , 1. , 1. , 1. , 0. , 1. , 1. , 1. , 1. , 1. , 0.5, 1. ,\n", " 1. , 1. , 0.5, 1. , 1. , 1. , 0.5, 0. 
, 0.5])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt4 = tree.DecisionTreeClassifier(max_depth=2)\n", "cv_scores = cross_val_score(dt4,df6.values[:,:2],df6.values[:,2],cv=100 )\n", "cv_scores" ] }, { "cell_type": "code", "execution_count": 14, "id": "672f2095", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.755" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(cv_scores)" ] }, { "cell_type": "code", "execution_count": null, "id": "4778e6e2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.12, "jupytext_version": "1.6.0" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "source_map": [ 12, 35, 40, 51, 54, 63, 68, 73, 77, 82, 86, 90, 94, 96, 101, 106, 110, 116, 120 ] }, "nbformat": 4, "nbformat_minor": 5 }