7. Assignment 7: Decision Trees

7.1. Quick Facts

7.3. Assessment

Table 7.1 fit a decision tree#

task | skill
fit a decision tree | classification (2)
apply a decision tree to get predictions | classification (2)
interpret the model assumed by a decision tree | classification (2)
use multiple metrics to evaluate performance | evaluate (2)
interpret how decisions (test/train size, model parameters) impact model performance | evaluate (2)
interpret the classifier performance in the context of the dataset | process (2)
analyze the impact of model parameters on model performance | process (2)
use loops and lists effectively | python (2)
use EDA techniques to examine the experimental results | summarize (2), visualize (2)
create a dataset by combining data from multiple sources | construct (2)

7.4. Instructions

Choose a dataset that is well suited for classification and that has only numerical features.

Tip

A file can be a “comma separated file” and be read in with pd.read_csv even if the file name does not end in “.csv”. The part after the ‘.’ in a file name is called the file extension, and it’s a sort of metadata built into the file name. CSV is a specification for how to write data to a file, that is, a file format. It’s best practice to make the file extension match the file format, but it’s very much not required. For many files on the UCI repository, especially the older ones, the extension is something else (e.g. dat, data, or names), but the actual contents of the files are comma separated and compatible with read_csv.
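As a minimal sketch of the tip above, the snippet below writes a tiny comma-separated file with a “.data” extension and reads it back with pd.read_csv. The file name, the rows, and the column names are all made up for illustration; a real UCI file will have its own columns, which are usually listed in the accompanying documentation.

```python
import os
import tempfile

import pandas as pd

# Hypothetical contents: comma-separated values with no header row,
# saved with a ".data" extension (made-up rows, for illustration only).
raw_text = "5.1,3.5,1.4,0.2,setosa\n7.0,3.2,4.7,1.4,versicolor\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.data")
    with open(path, "w") as f:
        f.write(raw_text)

    # read_csv parses the contents; the extension does not matter.
    # header=None because this file has no header row, so we supply
    # (hypothetical) column names ourselves.
    df = pd.read_csv(path, header=None, names=["a", "b", "c", "d", "label"])

print(df)
```

If the file has no header row and you leave out header=None, pandas will treat the first data row as column names, so it is worth checking df.head() after loading.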

In the following exercises, practice using decision trees and explore how classification works and what evaluations mean.

Hint

The Wisconsin Breast Cancer data from UCI is a good option, as is the wine data set, which has the red and white wines separated, so you can earn prepare with it. You could also use the NBA data and try to predict who will make the NBA 75 based on their game stats; this would require some manipulation and so would be a way to earn construct.

7.4.1. Part 1: DT Basics

  1. Include a basic description of the data (what the features are).

  2. Write your own description of what the classification task is and why a decision tree is a reasonable model to try for this data.

  3. Fit a decision tree with the default parameters on 50% of the data.

  4. Test it on the 50% of the data that was held out and generate a classification report.

  5. Inspect the model to answer:

    • Does this model make sense?

    • Are there any leaves that are very small?

    • Is this an interpretable number of levels?

  6. Repeat the split, train, and test steps 5 times.

    • Is the performance consistent enough you trust it?

  7. Interpret the model and its performance in terms of the application. Some questions you might want to answer in order to do this include:

    • Do you think this model is good enough to use for real?

    • Is this a model you would trust?

    • Do you think that a more complex model should be used?

    • Do you think that maybe this task cannot be done with machine learning?
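One way the fit, test, inspect, and repeat steps of Part 1 could look is sketched below. It assumes the copy of the Wisconsin Breast Cancer data that ships with scikit-learn (load_breast_cancer) rather than a file downloaded from UCI, and the random seeds are arbitrary choices, so treat it as a starting point and not as the expected solution.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: use scikit-learn's bundled breast cancer data as the example.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Fit on 50% of the data, hold out the other 50% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(classification_report(y_test, y_pred))

# Inspect the fitted tree: depth, number of leaves, and leaf sizes.
depth = dt.get_depth()
is_leaf = dt.tree_.children_left == -1  # leaves have no children
leaf_sizes = dt.tree_.n_node_samples[is_leaf]
print("depth:", depth)
print("number of leaves:", dt.get_n_leaves())
print("smallest leaf (training samples):", leaf_sizes.min())

# Repeat split, train, test 5 times with different random splits.
test_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))
print("test accuracy per repeat:", test_scores)
```

The spread of the values in test_scores is one piece of evidence for the consistency question; the interpretation in terms of the application still has to come from you.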

7.4.2. Part 2: Exploring Evaluation

Do an experiment to compare test set size vs performance:

  1. Train a decision tree with max depth 2 less than the depth it found above on 10%, 30%, … , 90% of the data. Save both the test accuracy and the training accuracy for each training set size in a DataFrame with columns ['train_pct', 'n_train_samples', 'n_test_samples', 'train_acc', 'test_acc'].

  2. Plot the accuracies vs training percentage in a line graph.

  3. Interpret these results. How does training vs test size impact the model?

Hint

Use a loop for this part, and possibly also a function.
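A loop-plus-function version of this experiment might be sketched as follows. The helper name run_split is hypothetical, the data is again assumed to be scikit-learn's bundled breast cancer set, and the reference depth is taken from a default tree fit on all of the data as a stand-in for the depth you found in Part 1.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Stand-in for "the depth it found above": a default tree on all the data.
full_depth = DecisionTreeClassifier(random_state=0).fit(X, y).get_depth()
max_depth = max(1, full_depth - 2)  # never go below depth 1


def run_split(train_pct, X, y, max_depth, seed=0):
    """Train on train_pct of the data and return one row of results."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_pct, random_state=seed
    )
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
    dt.fit(X_train, y_train)
    return {
        "train_pct": train_pct,
        "n_train_samples": len(X_train),
        "n_test_samples": len(X_test),
        "train_acc": dt.score(X_train, y_train),
        "test_acc": dt.score(X_test, y_test),
    }


train_pcts = [0.1, 0.3, 0.5, 0.7, 0.9]
results_df = pd.DataFrame([run_split(p, X, y, max_depth) for p in train_pcts])
print(results_df)
```

With matplotlib installed, results_df.plot(x="train_pct", y=["train_acc", "test_acc"]) gives the line graph asked for in step 2.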

7.4.3. Part 3: DT parameters

Experiment with DT Parameters:

  1. Choose one parameter to change in the training that you think might improve the model and say why, then train a second decision tree

  2. Check the performance of the new decision tree with at least two performance metrics

  3. Did changing the parameter do what you expected?

  4. Choose a second parameter to change in the training that you think might improve the model and say why, then train a third decision tree

  5. Validate your third decision tree with at least two performance metrics.

  6. Did changing the parameter do what you expected?
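As an illustration of the pattern in Part 3, the sketch below compares a default tree to one with a single changed parameter, using two metrics. The choice of min_samples_leaf=5 is only an example (the reasoning would be that it rules out very small leaves); the data set and seeds are the same assumptions as in the earlier sketches, and which parameter is worth changing depends on what you saw in your own tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Example parameter change: require at least 5 training samples per leaf.
models = {
    "default": DecisionTreeClassifier(random_state=0),
    "min_samples_leaf=5": DecisionTreeClassifier(
        min_samples_leaf=5, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Two performance metrics computed on the same held-out data.
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Keeping the same train/test split for both trees means any difference in the metrics comes from the parameter and not from a different random split.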

Thinking Ahead

Repeat your experiment from Part 2 with cross validation and plot with error bars.

  • What is the tradeoff to be made in choosing a test/train size?

  • What is the best test/train size for this dataset?
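One possible way to set this up is with repeated random splits (scikit-learn's ShuffleSplit) passed to cross_val_score, since that lets the training fraction vary freely; standard K-fold cross validation ties the test size to the number of folds instead. The sketch below assumes the bundled breast cancer data, a fixed max_depth of 3, and 10 splits per size, all of which are placeholder choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

rows = []
for train_pct in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # 10 random splits; the test set is the complement of the training set.
    cv = ShuffleSplit(n_splits=10, train_size=train_pct, random_state=0)
    dt = DecisionTreeClassifier(max_depth=3, random_state=0)  # placeholder depth
    test_scores = cross_val_score(dt, X, y, cv=cv)
    rows.append(
        {
            "train_pct": train_pct,
            "mean_test_acc": test_scores.mean(),
            "std_test_acc": test_scores.std(),
        }
    )

cv_df = pd.DataFrame(rows)
print(cv_df)
```

With matplotlib installed, plt.errorbar(cv_df["train_pct"], cv_df["mean_test_acc"], yerr=cv_df["std_test_acc"]) draws the means with the standard deviations as error bars.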

Repeat the experiment in part 2 with variations:

  • allowing it to figure out the model depth for each training size, and recording the depth in the loop as well.

  • repeating each size 10 times, then using summary statistics on that data
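Both variations can be combined in one nested loop, sketched here under the same assumptions as before (bundled breast cancer data, arbitrary seeds). The tree is left at its default depth so that it chooses its own depth for each training size, the depth is recorded with get_depth(), and groupby produces the summary statistics.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

trials = []
for train_pct in [0.1, 0.3, 0.5, 0.7, 0.9]:
    for repeat in range(10):  # 10 repeats per training size
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=train_pct, random_state=repeat
        )
        # No max_depth: the tree grows to whatever depth the data leads to.
        dt = DecisionTreeClassifier(random_state=repeat).fit(X_train, y_train)
        trials.append(
            {
                "train_pct": train_pct,
                "repeat": repeat,
                "depth": dt.get_depth(),
                "train_acc": dt.score(X_train, y_train),
                "test_acc": dt.score(X_test, y_test),
            }
        )

trials_df = pd.DataFrame(trials)

# Mean and standard deviation of depth and accuracy for each training size.
summary = trials_df.groupby("train_pct")[["depth", "train_acc", "test_acc"]].agg(
    ["mean", "std"]
)
print(summary)
```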

Use the extensions above to experiment further with other model parameters.

We’ll learn how to automate some of this in a few weeks, but getting the ideas by doing it yourself can help.