# ML Task Review and Cross Validation


## Relationship between Tasks

We learned classification first because it shares similarities with both
regression and clustering, while regression and clustering have less in common
with each other.

Classification is supervised learning for a categorical target.  
Regression is supervised learning for a continuous target.
Clustering is unsupervised learning for a categorical target.


Sklearn provides a nice flow chart for thinking through this.  

![estimator flow chart](https://scikit-learn.org/stable/_static/ml_map.png)

Predicting a category is another way of saying categorical target. Predicting a
quantity is another way of saying continuous target. Having labels or not is
the difference between supervised and unsupervised learning.


The flowchart assumes you know what you want to do with data, and that is the
ideal scenario: you have a dataset and you have a goal.
So that you get to practice with a variety of things, in this course
we ask you to start with a task and then find a dataset. Assignment 9 is the
last time that's true, however. Starting with Assignment 10 and the last
portfolios, you can choose and focus on a specific application domain and then
choose the right task from there.  

With this in mind, you can use this information to move between the tasks
within a given type of data.
For example, you can use the same data for clustering as you did for classification.
Switching the task changes the questions though: classification evaluation tells
us how separable the classes are given that classifier's decision rule. Clustering
can find other subgroups or the same ones, so the evaluation we choose allows us
to explore this in more ways.
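
As a rough sketch of moving between tasks on the same data, the block below clusters the Iris features without the labels and then compares the clusters to the species. It assumes `load_iris` from `sklearn.datasets` as a stand-in for the data and `adjusted_rand_score` as one possible way to compare; the names `X`, `y`, and `km_demo` are illustrative and not from class.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# the same data we would use for classification
X, y = load_iris(return_X_y=True)

# clustering ignores y when fitting
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km_demo.fit_predict(X)

# compare the discovered groups to the known classes (1.0 means identical groupings)
ari = adjusted_rand_score(y, cluster_labels)
print(ari)
```

A value well below 1 would suggest the clusters are picking up structure that differs from the species labels.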

Regression requires a continuous target, so we need a dataset that is suitable for
that; we can't transform a classification dataset into a regression one.  
However, we can go the other way, and that's how some classification datasets are
created.

The UCI [adult](https://archive.ics.uci.edu/ml/datasets/adult) dataset is a popular ML dataset that was derived from census
data. The goal is to use a variety of features to predict if a person makes
more than $50k per year or not. While income is a continuous value, they applied
a threshold ($50k) to it to make a binary variable. The dataset does not include
income in dollars, only the binary indicator.  
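
A minimal sketch of that thresholding step, using made-up income values (not the real census data) and a hypothetical column name:

```python
import pandas as pd

# hypothetical incomes in dollars; these values are invented for illustration
income = pd.Series([23_000, 51_500, 49_999, 120_000, 50_000], name='income')

# applying a threshold turns the continuous value into a binary target
threshold = 50_000
over_50k = income > threshold
print(over_50k.tolist())
```

Going this direction throws information away, which is why we cannot recover dollars from the binary indicator.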


```{admonition} Further Reading
Recent work reconstructed the dataset with the continuous valued income.
Their [repository](https://github.com/zykls/folktables) contains the data as well
as links to their paper and a video of their talk on it.
```


## Cross Validation

This week our goal is to learn how to optimize models. The first step in that is
to get a good estimate of a model's performance.  

We have seen that the test train splits, which are random, influence the
performance.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn import metrics

We'll use the Iris data with a decision tree.

In [2]:
iris_df = sns.load_dataset('iris')

iris_X = iris_df.drop(columns=['species'])
iris_y = iris_df['species']

In [3]:
dt = tree.DecisionTreeClassifier()

We can split the data, fit the model, then compute a score, but since the
splitting is a randomized step, the score is a random variable.

For example, if we have a coin and want to see whether it's fair, we would
flip it to test.  One flip doesn't tell us, but if we flip it a few times, we
can estimate the probability it is heads by counting how many of the flips are
heads and dividing by how many flips.  

We can do something similar with our model performance. We can split the data
a bunch of times and compute the score each time.
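
One way to sketch this by hand is to repeat the split with different seeds and collect the scores. This assumes `load_iris` from `sklearn.datasets` as a stand-in for the data; the seeds and the names `scores` and `dt_seed` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # a different random split each time
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    dt_seed = tree.DecisionTreeClassifier(random_state=0)
    dt_seed.fit(X_train, y_train)
    scores.append(dt_seed.score(X_test, y_test))

scores = np.array(scores)
# the spread shows how much the score depends on the split
print(scores.mean(), scores.std())
```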

`cross_val_score` does this all for us.

It takes an estimator object and the data.

By default it uses 5-fold cross validation. It splits the data into 5 sections,
then uses 4 of them to train and one to test. It then iterates through so that
each section gets used for testing.

In [4]:
cross_val_score(dt,iris_X,iris_y)

array([0.96666667, 0.96666667, 0.9       , 0.93333333, 1.        ])

We get back a score for each section or "fold" of the data. We can average those
to get a single estimate.

In [5]:
np.mean(cross_val_score(dt,iris_X,iris_y))

0.9533333333333334
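
To make the fold loop concrete, here is a sketch that does the five folds by hand and compares them to `cross_val_score`. It assumes `load_iris` as the data source, fixes `random_state` on the tree so both versions are deterministic, and uses `StratifiedKFold`, which is the splitter `cross_val_score` picks for a classifier when `cv` is an integer.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
dt_demo = tree.DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # train on 4 sections, test on the remaining one
    dt_demo.fit(X[train_idx], y[train_idx])
    manual_scores.append(dt_demo.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(dt_demo, X, y, cv=5)
print(np.array(manual_scores))
print(auto_scores)
```

The indexing `X[train_idx]` works here because `X` is a NumPy array; with a DataFrame like `iris_X` you would use `iris_X.iloc[train_idx]` instead.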

We can use more folds.

In [6]:
np.mean(cross_val_score(dt,iris_X,iris_y,cv=10))

0.9533333333333334

We can peek inside what this actually does to see it more clearly.

## What Does Cross Validation Really Do?


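For a classifier like our decision tree, `cross_val_score` uses `StratifiedKFold`; for other estimators, such as regressors, it uses `KFold`. We'll look at `KFold` here because it is the simpler of the two. `train_test_split` uses `ShuffleSplit` by default, so let's load that too to see what it does.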
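
As a small sketch of why stratification matters for classification, the block below counts the classes in the first test fold from each splitter. It assumes `load_iris`, where the rows are sorted by species; the dictionary `counts` is just an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

counts = {}
for name, splitter in [('KFold', KFold(n_splits=5)),
                       ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    # look only at the first fold's test indices
    _, test_idx = next(iter(splitter.split(X, y)))
    counts[name] = np.bincount(y[test_idx], minlength=3)

print(counts)
```

Without shuffling, plain `KFold` puts a single species in the first test fold, while `StratifiedKFold` keeps the class proportions the same in every fold.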

```{warning}
The key in the following is to get the _concepts_ not all of the details in how I evaluate and visualize.  I could have made figures separately to explain the concept, but I like to show that Python is self contained.
```

In [7]:
from sklearn.model_selection import KFold, ShuffleSplit

In [8]:
kf = KFold(n_splits = 10)

When we use the `split` method it gives us a generator.

In [9]:
kf.split(iris_X, iris_y)

We can use this in a loop to get the list of indices that will be used to get the test and train data for each fold. To visualize what this is doing, see below.

In [10]:
N_samples = len(iris_y)
kf_tt_df = pd.DataFrame(index=list(range(N_samples)))
# label every sample as Train or Test in each split
for i, (train_idx, test_idx) in enumerate(kf.split(iris_X, iris_y), start=1):
    col = 'split ' + str(i)
    kf_tt_df[col] = 'unused'
    kf_tt_df.loc[train_idx, col] = 'Train'
    kf_tt_df.loc[test_idx, col] = 'Test'

```{margin}
How would you use those indices to get the actual test and train data?
```

We can count how many times 'Test' and 'Train' appear:

In [11]:
count_test = lambda part: len([v for v in part if v=='Test'])
count_train = lambda part: len([v for v in part if v=='Train'])

When we apply this along `axis=1`, we can check how many times each sample is used. Each sample should be used in exactly 1 test set

In [12]:
sum(kf_tt_df.apply(count_test,axis = 1) ==1)

and exactly 9 training sets

In [13]:
sum(kf_tt_df.apply(count_train,axis = 1) ==9)

The `describe` method helps confirm that all of the values are exactly as expected.

We can also visualize:
````{margin}
```{tip}
`sns.heatmap` doesn't work on strings, so we can replace them for the plotting
```
````

In [14]:
cmap = sns.color_palette("tab10",10)
# map the labels to integers so that the heatmap can plot them
g = sns.heatmap(kf_tt_df.replace({'Test':1,'Train':0}).astype(int),cmap=cmap[7:9],
    cbar_kws={'ticks':[.25,.75]},linewidths=0,linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])

Note that unlike [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), `KFold` does not shuffle the data before splitting by default.
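
A short sketch of the difference, using ten made-up samples so the indices are easy to read. Passing `shuffle=True` (with an arbitrary `random_state` chosen here for repeatability) mixes the order before the folds are cut.

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(20).reshape(10, 2)  # 10 invented samples, 2 features each

plain = KFold(n_splits=5)
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

# collect the test indices of every fold
plain_tests = [test_idx.tolist() for _, test_idx in plain.split(X_small)]
shuffled_tests = [test_idx.tolist() for _, test_idx in shuffled.split(X_small)]

print(plain_tests)     # consecutive blocks of rows
print(shuffled_tests)  # the same rows overall, but in mixed-up folds
```

Either way, every sample still lands in exactly one test fold.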

If we apply those `lambda` functions along `axis=0`, we can see the size of each test set

In [15]:
kf_tt_df.apply(count_test,axis = 0)

and training set:

In [16]:
kf_tt_df.apply(count_train,axis = 0)

We can verify that these splits are the same size as what `train_test_split` produces with the right settings. 10-fold cross validation splits the data into 10 parts and tests on 1, which makes a test size of 1/10=.1, so we can call `train_test_split` with that test size and check the lengths.

```
X_train2,X_test2, y_train2,y_test2 = train_test_split(iris_X, iris_y,
                                                  test_size=.1,random_state=0)

[len(split) for split in [X_train2,X_test2]]
```

Under the hood, `train_test_split` uses `ShuffleSplit`.
We can do a similar experiment as above to see what `ShuffleSplit` does.

In [17]:
ss = ShuffleSplit(10)
N_samples = len(iris_y)
ss_tt_df = pd.DataFrame(index=list(range(N_samples)))
# label every sample as Train or Test in each split
for i, (train_idx, test_idx) in enumerate(ss.split(iris_X, iris_y), start=1):
    col = 'split ' + str(i)
    ss_tt_df[col] = 'unused'
    ss_tt_df.loc[train_idx, col] = 'Train'
    ss_tt_df.loc[test_idx, col] = 'Test'

ss_tt_df

And plot

In [18]:
cmap = sns.color_palette("tab10",10)
# map the labels to integers so that the heatmap can plot them
g = sns.heatmap(ss_tt_df.replace({'Test':1,'Train':0}).astype(int),cmap=cmap[7:9],
    cbar_kws={'ticks':[.25,.75]},linewidths=0,linecolor='gray')
colorbar = g.collections[0].colorbar
colorbar.set_ticklabels(['Train','Test'])
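
The difference between the two splitters can also be summarized with numbers instead of a plot. This sketch counts how many times each of 150 rows appears in a test set; the helper `count_test_uses` and the placeholder array `X_idx` are hypothetical names, and `test_size=15` is chosen to match the 10-fold test size.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_samples = 150
X_idx = np.zeros((n_samples, 1))  # placeholder features; only the number of rows matters

def count_test_uses(splitter):
    """Count how many times each row lands in a test set."""
    counts = np.zeros(n_samples, dtype=int)
    for _, test_idx in splitter.split(X_idx):
        counts[test_idx] += 1
    return counts

kf_counts = count_test_uses(KFold(n_splits=10))
ss_counts = count_test_uses(ShuffleSplit(n_splits=10, test_size=15, random_state=0))

print(kf_counts.min(), kf_counts.max())
print(ss_counts.min(), ss_counts.max())
```

`KFold` uses every row for testing exactly once. `ShuffleSplit` draws each test set independently, so some rows can be tested several times and others not at all.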

## Cross validation with clustering
We can use *any* estimator object here.

In [19]:
km = KMeans(n_clusters=3)

In [20]:
cross_val_score(km,iris_X)



array([ -9.062     , -14.93195873, -18.93234207, -23.70894258,
       -19.55457726])

In [21]:
# `score` needs data, and the estimator has to be fit first
km.fit(iris_X).score(iris_X)
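
For K-means, the value that `score` returns is the negative of the K-means objective (the sum of squared distances to the nearest cluster center), which is why the cross validation scores above are negative. A sketch, assuming `load_iris` as the data and comparing against the fitted `inertia_` attribute; the two should agree up to floating point error when scoring the training data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# fit first, then score on data
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
score_on_X = km_demo.score(X)

# inertia_ is the within-cluster sum of squares on the training data
print(score_on_X, -km_demo.inertia_)
```

Values closer to zero mean the points sit closer to their cluster centers.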

## Grid Search Optimization

We can also optimize a model by determining the best parameter settings.

A simple way to do this is to fit the model for different parameters and score for each and compare.
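
A sketch of that simple approach, written as a plain loop. It assumes `load_iris` as the data and a fixed `random_state` on the tree; `results` and `best_depth` are illustrative names.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

results = {}
for depth in [2, 3, 4, 5]:
    # one model per parameter value, each scored with cross validation
    dt_depth = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    results[depth] = cross_val_score(dt_depth, X, y).mean()

# pick the depth with the highest mean score
best_depth = max(results, key=results.get)
print(results)
print(best_depth)
```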


In [22]:
from sklearn.model_selection import GridSearchCV

The `GridSearchCV` object is constructed first and requires an estimator object and a dictionary that describes the parameter grid to search over.
The dictionary has the parameter names as the keys and the values are the values for that parameter to test.

The `fit` method on the Grid Search object fits all of the separate models.

In this case, we will optimize the depth of this Decision Tree.

In [23]:
param_grid = {'max_depth':[2,3,4,5]}
dt_opt = GridSearchCV(dt,param_grid)

In [24]:
dt_opt.fit(iris_X,iris_y)

Then we can look at the output.

In [25]:
dt_opt.cv_results_

{'mean_fit_time': array([0.00185766, 0.001791  , 0.00178885, 0.00179324]),
 'std_fit_time': array([1.59000176e-04, 2.61296321e-05, 1.73072840e-05, 2.34260540e-05]),
 'mean_score_time': array([0.00122561, 0.0011929 , 0.00116849, 0.00119224]),
 'std_score_time': array([6.78506939e-05, 2.81352511e-05, 1.85231443e-05, 3.18513659e-05]),
 'param_max_depth': masked_array(data=[2, 3, 4, 5],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 2},
  {'max_depth': 3},
  {'max_depth': 4},
  {'max_depth': 5}],
 'split0_test_score': array([0.93333333, 0.96666667, 0.96666667, 0.96666667]),
 'split1_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.96666667]),
 'split2_test_score': array([0.9       , 0.93333333, 0.9       , 0.9       ]),
 'split3_test_score': array([0.86666667, 0.93333333, 0.93333333, 0.96666667]),
 'split4_test_score': array([1., 1., 1., 1.]),
 'mean_test_score': array([0.93333333, 0.96      , 0.953333

We note that this is a dictionary, so to make it more readable, we can make it a DataFrame.

In [26]:
pd.DataFrame(dt_opt.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001858 | 0.000159 | 0.001226 | 6.8e-05 | 2 | {'max_depth': 2} | 0.933333 | 0.966667 | 0.9 | 0.866667 | 1.0 | 0.933333 | 0.04714 | 4 |
| 1 | 0.001791 | 2.6e-05 | 0.001193 | 2.8e-05 | 3 | {'max_depth': 3} | 0.966667 | 0.966667 | 0.933333 | 0.933333 | 1.0 | 0.96 | 0.024944 | 2 |
| 2 | 0.001789 | 1.7e-05 | 0.001168 | 1.9e-05 | 4 | {'max_depth': 4} | 0.966667 | 0.966667 | 0.9 | 0.933333 | 1.0 | 0.953333 | 0.033993 | 3 |
| 3 | 0.001793 | 2.3e-05 | 0.001192 | 3.2e-05 | 5 | {'max_depth': 5} | 0.966667 | 0.966667 | 0.9 | 0.966667 | 1.0 | 0.96 | 0.03266 | 1 |


In [27]:
dt

## Questions After Class

### Do we have to do anything to pick the highest ranking model from the GridSearchCV function?

No, we can use it directly. For example:

In [28]:
import matplotlib.pyplot as plt

plt.figure(figsize=(15,20))
# use the fitted classes as labels so every class in the tree has a name
tree.plot_tree(dt_opt.best_estimator_, rounded =True,
      class_names = list(dt_opt.best_estimator_.classes_),
      proportion=True, filled =True, impurity=False,fontsize=10);
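
Beyond plotting, the fitted search object also reports what it chose and can predict directly. A minimal sketch, assuming `load_iris` as the data and the same `max_depth` grid as above; `search` is an illustrative name.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 3, 4, 5]})
search.fit(X, y)

# the chosen settings and their mean cross validation score
print(search.best_params_, search.best_score_)

# predict uses the best estimator, refit on all of the data
predictions = search.predict(X[:5])
print(predictions)
```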

### Is there anything similar to a gridsearch object for different estimators, where it can try different methods of estimation and rank them?

No, you would use multiple grid searches (or a similar model optimizer with a different search strategy), one for each model. Each model class (estimator object) has its own parameters, so each one needs its own parameter grid.
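
One way this could look is a loop over a dictionary of estimators and their grids. This is a sketch under assumptions: `load_iris` as the data, Gaussian Naive Bayes as the second model, and arbitrary grid values for its `var_smoothing` parameter; `candidates` and `best_scores` are illustrative names.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# each estimator gets its own parameter grid
candidates = {
    'decision tree': (tree.DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 3, 4, 5]}),
    'gaussian naive bayes': (GaussianNB(),
                             {'var_smoothing': [1e-9, 1e-6, 1e-3]}),
}

best_scores = {}
for name, (estimator, param_grid) in candidates.items():
    search = GridSearchCV(estimator, param_grid)
    search.fit(X, y)
    best_scores[name] = search.best_score_

# compare the best cross validation score from each search
print(best_scores)
```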

### I would like to learn how to apply cross validation and especially program optimization to unsupervised clustering models.

It would look a lot like what we did with the decision tree, but we use the right parameter name, for example:

In [29]:
km = KMeans()
param_grid = {'n_clusters': list(range(2,8))}
km_opt = GridSearchCV(km,param_grid)

In [30]:
km_opt.fit(iris_X)





In [31]:
pd.DataFrame(km_opt.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_clusters | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007247 | 0.002357 | 0.001162 | 6.5e-05 | 2 | {'n_clusters': 2} | -13.417292 | -19.334754 | -58.4098 | -56.347656 | -54.80247 | -40.462394 | 19.788398 | 6 |
| 1 | 0.00821 | 0.000141 | 0.001154 | 3e-05 | 3 | {'n_clusters': 3} | -9.062 | -14.931959 | -18.932342 | -23.708943 | -19.554577 | -17.237964 | 4.945204 | 5 |
| 2 | 0.009823 | 0.000376 | 0.001202 | 3.6e-05 | 4 | {'n_clusters': 4} | -9.062 | -11.602296 | -13.714817 | -17.596427 | -15.673617 | -13.529831 | 2.9948 | 4 |
| 3 | 0.011454 | 0.000734 | 0.00122 | 1.4e-05 | 5 | {'n_clusters': 5} | -9.062 | -10.723786 | -10.996277 | -17.596427 | -15.673617 | -12.810421 | 3.249602 | 3 |
| 4 | 0.012644 | 0.000161 | 0.00123 | 7e-06 | 6 | {'n_clusters': 6} | -9.062 | -8.024337 | -10.996277 | -12.69827 | -11.476818 | -10.45154 | 1.68629 | 2 |
| 5 | 0.013671 | 0.000281 | 0.001206 | 2.1e-05 | 7 | {'n_clusters': 7} | -9.062 | -7.158558 | -9.066973 | -12.183027 | -11.476818 | -9.789475 | 1.819295 | 1 |
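
In the results above, the default K-means score keeps improving as `n_clusters` grows, so the largest value ranks first. A different score can give a different ranking. The sketch below is one possible alternative, not what we did in class: it passes a hypothetical custom scorer, `silhouette_scorer`, that computes the silhouette score on the held-out fold, and it uses a shuffled `KFold` (an assumption, so that each test fold mixes the species).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV, KFold

X, _ = load_iris(return_X_y=True)

def silhouette_scorer(estimator, X, y=None):
    """Score a fitted clustering model by the silhouette of its predictions."""
    labels = estimator.predict(X)
    if len(np.unique(labels)) < 2:
        # silhouette is undefined for a single cluster; treat it as the worst case
        return -1.0
    return silhouette_score(X, labels)

km_sil = GridSearchCV(KMeans(n_init=10, random_state=0),
                      {'n_clusters': list(range(2, 8))},
                      scoring=silhouette_scorer,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
km_sil.fit(X)

print(km_sil.best_params_, km_sil.best_score_)
```

Because the silhouette score does not automatically reward more clusters, it can be a more useful ranking for choosing `n_clusters`.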


### Is it better to split the data in more folds when using the cross-validation?

This is a tricky question; we'll revisit it in class on Wednesday.

### What is this "model" we are training? What are the scores, scoring?

In this example, the model was a decision tree at the beginning and later K-means.  The score describes how well the fit model works on the held out data; accuracy or a general fit statistic.

The [model vs algorithm](https://www.mbmlbook.com/Introduction.html#:~:text=Models%20versus%20algorithms) section in the introduction of the Model Based ML book (free) is a good thing to read to clarify these relationships.  

[sklearn](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) provides a flowchart for choosing their different estimator objects. In sklearn, they implement each model as an estimator object; more specifically, they have a [Base Estimator class](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) that the other estimators inherit.  For example the [decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) [source](https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/tree/_classes.py#L674) shows that it inherits the `ClassifierMixin` and [`BaseDecisionTree`](https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/tree/_classes.py#L93) which inherits `BaseEstimator`
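
The inheritance can also be checked from Python with `issubclass`. This is a small sketch; the variable names are illustrative.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, ClusterMixin
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# both models are estimators
tree_is_estimator = issubclass(DecisionTreeClassifier, BaseEstimator)
kmeans_is_estimator = issubclass(KMeans, BaseEstimator)

# the mixins mark which task each model is for
tree_is_classifier = issubclass(DecisionTreeClassifier, ClassifierMixin)
kmeans_is_clusterer = issubclass(KMeans, ClusterMixin)

print(tree_is_estimator, kmeans_is_estimator, tree_is_classifier, kmeans_is_clusterer)
```

Sharing `BaseEstimator` is part of why `cross_val_score` and `GridSearchCV` can work with either model.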

```{list-table}

* - Term
  - Definition
  - Example
* - task
  - the type of algorithm that we will use machine learning to write
  - classification, regression, clustering
* - model
  - the specific form and set of assumptions that will be used in the algorithm
  - decision tree (classification), Gaussian Naive Bayes (classification), linear regression, sparse regression/LASSO, K-means, spectral clustering, etc.
* - score
  - a measure of how well the model completes the task
  - accuracy (for classification), mean squared error (for regression), silhouette score (for clustering)
```

[also review the intro to models in machine learning class notes](https://rhodyprog4ds.github.io/BrownFall22/notes/2022-10-17.html#modeling-and-naive-bayes)