Class 20: Decision Trees and Cross Validation

  1. Share your favorite beverage (or say hi) in the Zoom chat

  2. Log onto Prismia

  3. Accept assignment 7

Assignment 7

Make a plan with a group:

  • what methods do you need to use in part 1?

  • try to outline with pseudocode what you’ll do for parts 2 & 3

Share any questions you have.

Followup:

  1. the assignment was clarified to require 3 values for the parameter in part 2

  2. more tips on finding datasets were added to the assignment text

Complexity of Decision Trees

# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
d6_url = 'https://raw.githubusercontent.com/rhodyprog4ds/06-naive-bayes/main/data/dataset6.csv'
df6 = pd.read_csv(d6_url, usecols=[1, 2, 3])
df6.head()
x0 x1 char
0 6.14 2.10 B
1 2.22 2.39 A
2 2.27 5.44 B
3 1.03 3.19 A
4 2.25 1.71 A
X_train, X_test, y_train, y_test = train_test_split(df6.values[:,:2], df6.values[:,2],
                                                    train_size=.8)
dt = tree.DecisionTreeClassifier(min_samples_leaf=10)
dt.fit(X_train,y_train)
DecisionTreeClassifier(min_samples_leaf=10)
print(tree.export_text(dt))
|--- feature_0 <= 5.88
|   |--- feature_1 <= 3.98
|   |   |--- feature_0 <= 4.07
|   |   |   |--- class: A
|   |   |--- feature_0 >  4.07
|   |   |   |--- class: B
|   |--- feature_1 >  3.98
|   |   |--- feature_0 <= 4.09
|   |   |   |--- class: B
|   |   |--- feature_0 >  4.09
|   |   |   |--- class: A
|--- feature_0 >  5.88
|   |--- feature_1 <= 3.89
|   |   |--- class: B
|   |--- feature_1 >  3.89
|   |   |--- class: A
dt2 = tree.DecisionTreeClassifier(min_samples_leaf=50)
dt2.fit(X_train,y_train)
DecisionTreeClassifier(min_samples_leaf=50)
print(tree.export_text(dt2))
|--- feature_0 <= 5.88
|   |--- feature_1 <= 3.98
|   |   |--- class: A
|   |--- feature_1 >  3.98
|   |   |--- class: B
|--- feature_0 >  5.88
|   |--- class: B
dt2.score(X_test,y_test)
0.6
dt.score(X_test,y_test)
1.0
df6.shape
(200, 3)
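The gap between the two scores above comes from complexity: the tree with `min_samples_leaf=10` was allowed enough splits to fit this test split perfectly, while `min_samples_leaf=50` forced a much coarser model. A minimal sketch of how `min_samples_leaf` caps tree size, on synthetic stand-in data (the data-generating rule here is an assumption for illustration, not the course dataset):

```python
import numpy as np
from sklearn import tree

# synthetic stand-in for dataset6: 200 points, 2 features, a simple label rule
rng = np.random.default_rng(310)
X = rng.uniform(0, 7, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 7, 'B', 'A')

# a larger min_samples_leaf forbids any split that would leave fewer than
# that many samples in a leaf, so the tree stays smaller and shallower
dt_small_leaves = tree.DecisionTreeClassifier(min_samples_leaf=10).fit(X, y)
dt_big_leaves = tree.DecisionTreeClassifier(min_samples_leaf=50).fit(X, y)

print(dt_small_leaves.get_n_leaves(), dt_small_leaves.get_depth())
print(dt_big_leaves.get_n_leaves(), dt_big_leaves.get_depth())
```

With `min_samples_leaf=50` and 200 samples, the tree can have at most 4 leaves no matter what the data look like, which is exactly why the printed `dt2` above is so short.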

Training/Test Set Size and Cross Validation

dt3 = tree.DecisionTreeClassifier()
dt3.fit(df6.values[:-1,:2], df6.values[:-1,2])
DecisionTreeClassifier()
print(tree.export_text(dt3))
|--- feature_0 <= 5.88
|   |--- feature_1 <= 5.33
|   |   |--- feature_0 <= 4.07
|   |   |   |--- feature_1 <= 4.00
|   |   |   |   |--- class: A
|   |   |   |--- feature_1 >  4.00
|   |   |   |   |--- class: B
|   |   |--- feature_0 >  4.07
|   |   |   |--- feature_1 <= 3.91
|   |   |   |   |--- class: B
|   |   |   |--- feature_1 >  3.91
|   |   |   |   |--- class: A
|   |--- feature_1 >  5.33
|   |   |--- feature_0 <= 4.09
|   |   |   |--- class: B
|   |   |--- feature_0 >  4.09
|   |   |   |--- class: A
|--- feature_0 >  5.88
|   |--- feature_1 <= 3.89
|   |   |--- class: B
|   |--- feature_1 >  3.89
|   |   |--- class: A
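Trained on all but one sample with no complexity limit, `dt3` keeps splitting until every leaf is pure, so it will score perfectly on the data it was trained on even when that means memorizing noise. A small sketch of this on hypothetical noisy data (the generator and 20% label-flip rate are made up for illustration):

```python
import numpy as np
from sklearn import tree

# hypothetical noisy data: 20% of labels flipped, so the classes overlap
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2
y[flip] = 1 - y[flip]

# an unrestricted tree splits until every leaf is pure, memorizing the noise
dt_full = tree.DecisionTreeClassifier().fit(X, y)
print(dt_full.score(X, y))  # 1.0: perfect on the data it was trained on
print(dt_full.get_depth())  # much deeper than the true one-split rule needs
```

The perfect training score is exactly why we never evaluate on training data: a held-out test set (or cross validation, below) is needed to estimate how the model does on new samples.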
dt4 = tree.DecisionTreeClassifier(max_depth=2)
cv_scores = cross_val_score(dt4, df6.values[:,:2], df6.values[:,2], cv=100)
cv_scores
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/model_selection/_split.py:668: UserWarning: The least populated class in y has only 99 members, which is less than n_splits=100.
  % (min_groups, self.n_splits)), UserWarning)
array([1. , 1. , 0.5, 0.5, 1. , 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 0.5,
       1. , 1. , 0.5, 1. , 1. , 0.5, 1. , 1. , 0.5, 0.5, 1. , 0. , 1. ,
       1. , 1. , 0.5, 1. , 0.5, 0.5, 0.5, 0.5, 1. , 0.5, 1. , 1. , 0.5,
       0.5, 1. , 0.5, 0.5, 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 1. , 1. ,
       1. , 1. , 1. , 0.5, 1. , 1. , 1. , 1. , 1. , 0. , 1. , 0.5, 0.5,
       1. , 0. , 1. , 0.5, 1. , 0.5, 0. , 1. , 1. , 1. , 1. , 0.5, 0.5,
       0.5, 1. , 1. , 1. , 1. , 0. , 1. , 1. , 1. , 1. , 1. , 0.5, 1. ,
       1. , 1. , 0.5, 1. , 1. , 1. , 0.5, 0. , 0.5])
np.mean(cv_scores)
0.755