8. Assignment 8: Clustering

Accept the assignment. Due: 2023-11-01

8.1. Evaluation

Eligible skills: (links to checklists)

- (first chance) clustering 1 and 2
- evaluate 1 and 2
- python 1 and 2
- summarize 1 and 2
- visualize 1 and 2

For some of these skills, you will need to add analysis that is not described in the instructions below but that relates to both this assignment and the skill in question.

8.3. Instructions

Use the same dataset you used for assignment 7, unless there was a problem. If you skipped assignment 7, choose a dataset well suited for classification. See A7 for tips.

Describe what question you would be asking in applying clustering to this dataset. What does it mean if clustering does not work well?
How does this task compare to the classification task on this dataset?
Apply K-means using the known, correct number of clusters, \(K\).
Evaluate how well clustering worked on the data:
- using a true clustering metric (one that does not use the ground truth labels) and
- using visualization and
- using a clustering metric that uses the ground truth labels
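The K-means and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not a required approach: it assumes scikit-learn and matplotlib are available, uses the bundled iris data as a stand-in for your own dataset, and picks the silhouette score and adjusted mutual information as example metrics (other metrics of each kind would also work).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

# Stand-in data (an assumption): replace with the features and labels
# from your own dataset.
X, y = load_iris(return_X_y=True)
n_classes = len(set(y))  # the known, correct K

# Fit K-means with the known K; fixing random_state makes the run repeatable.
km = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)

# Metric that uses only the features and the cluster assignments.
sil = silhouette_score(X, cluster_labels)

# Metric that compares the cluster assignments to the ground truth labels.
ami = adjusted_mutual_info_score(y, cluster_labels)

print(f"silhouette: {sil:.3f}, adjusted mutual information: {ami:.3f}")

# Visual check: project to 2D with PCA, then color the same points
# by cluster assignment and by true label, side by side.
X_2d = PCA(n_components=2).fit_transform(X)
fig, (ax_km, ax_true) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
ax_km.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels)
ax_km.set_title("K-means clusters")
ax_true.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
ax_true.set_title("Ground truth labels")
```

The cluster numbers K-means assigns are arbitrary, so the colors in the two panels need not match even when the groupings agree; this is also why the label-based metric should be one that is invariant to relabeling.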
Include a discussion of your results that addresses the following:
- What does the clustering mean?
- What do the metrics show?
- Does this clustering work better or worse than expected based on the classification performance? (If you didn’t complete assignment 7, also apply a classifier.)
Repeat your analysis using two different numbers of clusters (one higher and one lower than the original):
- Can you interpret the new clusters?
- How do they relate to the original clusters? Are they completely different, or did one split?
- Is there a reasonable explanation for more clusters than there are classes in this dataset?
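One way to see how clusterings with different numbers of clusters relate is to cross-tabulate the assignments. The sketch below is only an illustration under stated assumptions: it uses scikit-learn, the iris data as a stand-in dataset, and the silhouette score as the example metric. Note that the silhouette score needs at least two clusters, so the "one lower" run only works this way when the original K is at least 3.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.metrics.cluster import contingency_matrix

# Stand-in data (an assumption): replace with your own dataset.
X, y = load_iris(return_X_y=True)
k_true = len(set(y))

# Fit K-means for one fewer, the known, and one more cluster.
assignments = {}
for k in (k_true - 1, k_true, k_true + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    assignments[k] = km.fit_predict(X)
    print(f"K={k}: silhouette = {silhouette_score(X, assignments[k]):.3f}")

# Cross-tabulate the original clustering against each new one.
# Rows are clusters from the original K; columns are clusters from the new K.
for k in (k_true - 1, k_true + 1):
    table = contingency_matrix(assignments[k_true], assignments[k])
    print(f"\nOriginal K={k_true} (rows) vs. K={k} (columns):")
    print(table)
```

A row with most of its count in a single column suggests that cluster carried over largely intact; a row spread across two columns suggests the cluster split, and a column fed by two rows suggests a merge.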

8.4. For classification

Note

Do this only if you did not already earn classification level 2

Fit your chosen classifier with the default parameters on 50% of the data.
Test it on the held-out 50% of the data and generate a classification report.
Inspect the model to answer the questions appropriate to your model.
- Does this model make sense?
- (if DT) Are there any leaves that are very small?
- (if DT) Is this an interpretable number of levels?
- (if GNB) Do the parameters fit the data well?
- (if GNB) Do the parameters generate similar synthetic data?
Interpret the model and its performance in terms of the application. Example questions to consider in your response include:
- Do you think this model is good enough to use for real?
- Is this a model you would trust?
- Do you think that a more complex model should be used?
- Do you think that maybe this task cannot be done with machine learning?