Class 20: Decision Trees and Cross Validation¶
Share your favorite beverage (or say hi) in the zoom chat
log onto prismia
Accept assignment 7
Assignment 7¶
Make a plan with a group:
what methods do you need to use in part 1?
try to outline with pseudocode what you'll do for parts 2 & 3
Share any questions you have.
Follow-up:
the assignment was clarified to require 3 values for the parameter in part 2
more tips on finding datasets were added to the assignment text
Complexity of Decision Trees¶
# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
d6_url = 'https://raw.githubusercontent.com/rhodyprog4ds/06-naive-bayes/main/data/dataset6.csv'
df6 = pd.read_csv(d6_url, usecols=[1,2,3])
df6.head()
|   | x0 | x1 | char |
|---|---|---|---|
| 0 | 6.14 | 2.10 | B |
| 1 | 2.22 | 2.39 | A |
| 2 | 2.27 | 5.44 | B |
| 3 | 1.03 | 3.19 | A |
| 4 | 2.25 | 1.71 | A |
X_train, X_test, y_train, y_test = train_test_split(df6.values[:,:2],df6.values[:,2],
train_size=.8)
dt = tree.DecisionTreeClassifier(min_samples_leaf = 10)
dt.fit(X_train,y_train)
DecisionTreeClassifier(min_samples_leaf=10)
print(tree.export_text(dt))
|--- feature_0 <= 5.88
| |--- feature_1 <= 3.98
| | |--- feature_0 <= 4.07
| | | |--- class: A
| | |--- feature_0 > 4.07
| | | |--- class: B
| |--- feature_1 > 3.98
| | |--- feature_0 <= 4.09
| | | |--- class: B
| | |--- feature_0 > 4.09
| | | |--- class: A
|--- feature_0 > 5.88
| |--- feature_1 <= 3.89
| | |--- class: B
| |--- feature_1 > 3.89
| | |--- class: A
dt2 = tree.DecisionTreeClassifier(min_samples_leaf = 50)
dt2.fit(X_train,y_train)
DecisionTreeClassifier(min_samples_leaf=50)
print(tree.export_text(dt2))
|--- feature_0 <= 5.88
| |--- feature_1 <= 3.98
| | |--- class: A
| |--- feature_1 > 3.98
| | |--- class: B
|--- feature_0 > 5.88
| |--- class: B
dt2.score(X_test,y_test)
0.6
dt.score(X_test,y_test)
1.0
df6.shape
(200, 3)
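The two trees above differ only in `min_samples_leaf`, and the larger value forces a much shallower tree. A minimal sketch of that complexity trade-off, using `make_classification` as a synthetic stand-in for `df6` (the real data needs a network fetch), might look like:

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification

# synthetic stand-in for df6: 200 samples, 2 features, 2 classes
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

trees = {}
for leaf in [1, 10, 50]:
    dt = tree.DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    dt.fit(X, y)
    trees[leaf] = dt
    # each leaf must hold at least `leaf` samples, so larger values
    # cap how many leaves (and how much depth) the tree can have
    print(f"min_samples_leaf={leaf}: depth={dt.get_depth()}, "
          f"leaves={dt.get_n_leaves()}")
```

With 200 samples, `min_samples_leaf=50` allows at most 4 leaves, while `min_samples_leaf=1` lets the tree grow until it fits the training data almost perfectly.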
Training, Test set size and Cross Validation¶
dt3 = tree.DecisionTreeClassifier()
dt3.fit(df6.values[:-1,:2], df6.values[:-1,2])
DecisionTreeClassifier()
print(tree.export_text(dt3))
|--- feature_0 <= 5.88
| |--- feature_1 <= 5.33
| | |--- feature_0 <= 4.07
| | | |--- feature_1 <= 4.00
| | | | |--- class: A
| | | |--- feature_1 > 4.00
| | | | |--- class: B
| | |--- feature_0 > 4.07
| | | |--- feature_1 <= 3.91
| | | | |--- class: B
| | | |--- feature_1 > 3.91
| | | | |--- class: A
| |--- feature_1 > 5.33
| | |--- feature_0 <= 4.09
| | | |--- class: B
| | |--- feature_0 > 4.09
| | | |--- class: A
|--- feature_0 > 5.88
| |--- feature_1 <= 3.89
| | |--- class: B
| |--- feature_1 > 3.89
| | |--- class: A
dt4 = tree.DecisionTreeClassifier(max_depth=2)
cv_scores = cross_val_score(dt4,df6.values[:,:2],df6.values[:,2],cv=100 )
cv_scores
UserWarning: The least populated class in y has only 99 members, which is less than n_splits=100.
array([1. , 1. , 0.5, 0.5, 1. , 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 0.5,
1. , 1. , 0.5, 1. , 1. , 0.5, 1. , 1. , 0.5, 0.5, 1. , 0. , 1. ,
1. , 1. , 0.5, 1. , 0.5, 0.5, 0.5, 0.5, 1. , 0.5, 1. , 1. , 0.5,
0.5, 1. , 0.5, 0.5, 0.5, 1. , 0.5, 0.5, 1. , 1. , 0.5, 1. , 1. ,
1. , 1. , 1. , 0.5, 1. , 1. , 1. , 1. , 1. , 0. , 1. , 0.5, 0.5,
1. , 0. , 1. , 0.5, 1. , 0.5, 0. , 1. , 1. , 1. , 1. , 0.5, 0.5,
0.5, 1. , 1. , 1. , 1. , 0. , 1. , 1. , 1. , 1. , 1. , 0.5, 1. ,
1. , 1. , 0.5, 1. , 1. , 1. , 0.5, 0. , 0.5])
np.mean(cv_scores)
0.755
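The warning above appears because `cv=100` asks for more folds than the smallest class has members (99), so some folds cannot contain both classes, and each 2-sample test fold can only score 0, 0.5, or 1. A sketch with a more common choice like `cv=10` (synthetic stand-in data, since the course CSV needs a network fetch):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the course data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

dt4 = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
# cv=10 keeps about 20 samples per test fold, so individual fold
# scores are less noisy and stratification keeps both classes in every fold
scores = cross_val_score(dt4, X, y, cv=10)
print(scores.mean(), scores.std())
```

Averaging the fold scores (as `np.mean(cv_scores)` does above) gives one overall estimate; the standard deviation across folds indicates how sensitive that estimate is to the particular split.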