7. Assignment 7: Classification#

Accept the assignment

Due: 2020-10-26

Eligible skills: (links to checklists)

  • first chance classification 1 and 2

  • evaluate 1 and 2

  • summarize 1 and 2

  • visualize 1 and 2

  • python 1 and 2

7.1. Dataset and EDA#

Choose a dataset that is well suited for classification and that has all numerical features. If you want to use a dataset with nonnumerical features, you will have to convert the categorical features to a one-hot encoding.
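For example, a minimal sketch of one-hot encoding with `pd.get_dummies` (the DataFrame and its columns here are hypothetical):

```python
import pandas as pd

# hypothetical data with one categorical feature
df = pd.DataFrame({'height': [1.2, 3.4, 2.2],
                   'color': ['red', 'blue', 'red']})

# replace the categorical column with one binary (0/1) column per category value
df_encoded = pd.get_dummies(df, columns=['color'])
print(df_encoded)
```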

Hint

Use the UCI ML repository

  1. Include a basic description of the data (what the features are).

  2. Write your own description of what the classification task is.

  3. Use EDA to determine whether you expect the classification to achieve high accuracy or not (see the sketch after this list).

  4. Explain why Gaussian Naive Bayes and decision trees are, or are not, reasonable models to try for this data.

  5. Hypothesize which will do better and why you think that.
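A minimal EDA sketch, assuming pandas and seaborn; the iris data and its `species` label column are stand-ins for your dataset:

```python
import seaborn as sns

# stand-in data; replace with your chosen dataset and label column
df = sns.load_dataset('iris')
label_col = 'species'

# per-class summary statistics help gauge how separable the classes are
print(df.groupby(label_col).describe())

# check for class imbalance
print(df[label_col].value_counts())

# pairwise scatter plots colored by class; well-separated clusters
# suggest a classifier can achieve high accuracy
sns.pairplot(df, hue=label_col)
```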

7.2. Basic Classification#

  1. Fit your chosen classifier with the default parameters on 50% of the data (see the sketch after this list).

  2. Test it on the 50% held-out data and generate a classification report.

  3. Inspect the model to answer the questions appropriate to your model.

    • Does this model make sense?

    • (DT) Are there any leaves that are very small?

    • (DT) Is this an interpretable number of levels?

    • (GNB) Do the parameters fit the data well?

    • (GNB) Do the parameters generate similar synthetic data?

  4. Repeat the split, train, and test steps 5 times to use 5 different random splits of the data, saving the scores into a DataFrame. Compute the mean and standard deviation of the scores.

    • Is the performance consistent enough that you trust it?

  5. Interpret the model and its performance in terms of the application. Example questions to consider in your response include:

    • Do you think this model is good enough to use for real?

    • Is this a model you would trust?

    • Do you think that a more complex model should be used?

    • Do you think that maybe this task cannot be done with machine learning?
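A minimal sketch of steps 1, 2, and 4, assuming scikit-learn and Gaussian Naive Bayes; the built-in iris data is a stand-in for your chosen dataset:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# stand-in data; replace with your chosen dataset
X, y = load_iris(return_X_y=True)

# steps 1-2: fit on 50% of the data, report on the held-out 50%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
gnb = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, gnb.predict(X_test)))

# for step 3: gnb.theta_ holds the fitted per-class feature means

# step 4: repeat split/train/test 5 times and summarize the scores
scores = []
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=i)
    scores.append(GaussianNB().fit(X_train, y_train).score(X_test, y_test))

score_df = pd.DataFrame({'accuracy': scores})
print(score_df.mean(), score_df.std())
```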

7.3. Exploring Problem Setups#

Important

Understanding the impact of test/train size is a part of classification. This exercise is also a chance at python level 2.

Do an experiment to compare test set size vs performance:

  1. Train a model (if using a decision tree, set the max depth to 2 less than the depth it found above) on 10%, 30%, …, 90% of the data. Compute the training accuracy and test accuracy for each training size and store the results in a DataFrame with columns ['train_pct', 'n_train_samples', 'n_test_samples', 'train_acc', 'test_acc'].

  2. Plot the accuracies vs training percentage in a line graph.

  3. Interpret these results. How does training vs test size impact the model?

Use a loop for this part, and possibly also a function, as in the sketch below.
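A minimal sketch of the experiment, wrapping the split/train/test logic in a function and looping over training sizes; the data and the choice of a decision tree are stand-ins:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in data; replace with your chosen dataset
X, y = load_iris(return_X_y=True)


def eval_train_size(X, y, train_pct, max_depth=None):
    """Train on train_pct of the data and return one row of results."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_pct)
    # pass max_depth = (depth found in 7.2) - 2 if using a decision tree
    dt = DecisionTreeClassifier(max_depth=max_depth).fit(X_train, y_train)
    return {'train_pct': train_pct,
            'n_train_samples': len(y_train),
            'n_test_samples': len(y_test),
            'train_acc': dt.score(X_train, y_train),
            'test_acc': dt.score(X_test, y_test)}


results = pd.DataFrame([eval_train_size(X, y, p / 100)
                        for p in range(10, 100, 20)])

# line graph of train and test accuracy vs training percentage
results.plot(x='train_pct', y=['train_acc', 'test_acc'])
plt.show()
```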

Thinking Ahead

Ideas for level 3:

Repeat the problem setup experiment with cross-validation and plot with error bars (a sketch follows the questions below).

  • What is the tradeoff to be made in choosing a test/train size?

  • What is the best test/train size for this dataset?
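A minimal sketch of the cross-validated version, assuming scikit-learn's `ShuffleSplit` to repeat each training size 10 times; the data is again a stand-in:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# stand-in data; replace with your chosen dataset
X, y = load_iris(return_X_y=True)

pcts = [p / 100 for p in range(10, 100, 20)]
means, stds = [], []
for pct in pcts:
    # ShuffleSplit draws 10 independent random splits at this train size
    cv = ShuffleSplit(n_splits=10, train_size=pct)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
    means.append(scores.mean())
    stds.append(scores.std())

# error bars show the spread of test accuracy over the 10 splits
plt.errorbar(pcts, means, yerr=stds)
plt.xlabel('training fraction')
plt.ylabel('test accuracy')
plt.show()
```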

Or try variations:

  • allowing the model to determine its own depth for each training size, and recording the depth in the loop as well.

  • repeating each size 10 times, then using summary statistics on that data.

Use the extensions above to experiment further with other model parameters.

We’ll learn how to automate some of this in a few weeks, but working through the ideas by doing it yourself first can help.