11. Evaluating ML Algorithms#

This week we are going to start learning about machine learning.

We are going to do this by looking at how to tell if machine learning has worked.

This is because:

  • you have to check if one worked after you build one

  • if you do not check carefully, it might only sometimes work

  • it gives you a chance to learn evaluation on its own, instead of evaluation + an ML task at the same time

11.1. What is a Machine Learning Algorithm?#

First, what is an Algorithm?

An algorithm is a set of ordered steps to complete a task.

Note that when people outside of CS talk about algorithms that impact people’s lives, these are often not written directly by people anymore. They are often the result of machine learning.

In machine learning, people write an algorithm for how to write an algorithm based on data. This often comes in the form of a statistical model of some sort.

ML overview: training data goes into the learning algorithm, which outputs the prediction algorithm. The prediction algorithm takes a sample and outputs a prediction.
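As a toy illustration of this pipeline (not a method used later in these notes), the sketch below treats a learning algorithm as a function that takes training data and returns a prediction algorithm, here a simple threshold rule. The names `learn_threshold` and `predict` and the training data are made up for this example.

```python
def learn_threshold(samples, labels):
    """Learning algorithm: pick the midpoint between the two class means."""
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict(sample):
        """Prediction algorithm: produced from data, not written by hand."""
        return 1 if sample >= threshold else 0

    return predict


# hypothetical training data: one numeric feature per sample, with a 0/1 label
train_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 0, 1, 1, 1]

# the learning algorithm outputs the prediction algorithm
predict = learn_threshold(train_x, train_y)

# the prediction algorithm takes a sample and outputs a prediction
print(predict(2.5), predict(7.5))
```

The person wrote `learn_threshold`; the rule inside `predict` (the specific threshold) came from the data.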

When we do machine learning, this can also be called:

  • data mining

  • pattern recognition

  • modeling

because we are looking for patterns in the data and typically then planning to use those patterns to make predictions or automate a task.

Each of these terms does have a slightly different meaning and usage, but sometimes they’re used almost interchangeably.

11.2. How can we tell if ML is working?#

We measure the performance of the prediction algorithm, to determine if the learning algorithm worked.

11.3. Replicating the COMPAS Audit#

We are going to replicate the audit from ProPublica’s Machine Bias article.

11.3.1. Why COMPAS?#

ProPublica started the COMPAS debate with the article Machine Bias. With their article, they also released details of their methodology, their data, and their code. This provides a real data set that can be used for research on how data is used in a criminal justice setting without researchers having to perform their own requests for information, so it has been used and reused many times.

11.3.2. Propublica COMPAS Data#

The dataset consists of COMPAS scores assigned to defendants over two years (2013-2014) in Broward County, Florida. It was released by ProPublica in a GitHub repository. These scores are determined by a proprietary algorithm designed to evaluate a person’s recidivism risk: the likelihood that they will reoffend. Risk scoring algorithms are widely used by judges to inform their sentencing and bail decisions in the criminal justice system in the United States.

The journalists collected, for each person arrested in 2013 and 2014:

  • basic demographics

  • details about what they were charged with and priors

  • the COMPAS score assigned to them

  • if they had actually been re-arrested within 2 years of their arrest

This means that we have what the COMPAS algorithm predicted (in the form of a score from 1-10) and what actually happened (re-arrested or not). We can then measure how well the algorithm worked, in practice, in the real world.

import pandas as pd
from sklearn import metrics
import seaborn as sns

We’re going to work with a cleaned copy of the data released by ProPublica that also has a minimal subset of features.

  • age: defendant’s age

  • c_charge_degree: degree charged (Misdemeanor or Felony)

  • race: defendant’s race

  • age_cat: defendant’s age quantized in “less than 25”, “25-45”, or “over 45”

  • score_text: COMPAS score: ‘low’ (1 to 4), ‘medium’ (5 to 7), and ‘high’ (8 to 10).

  • sex: defendant’s gender

  • priors_count: number of prior charges

  • days_b_screening_arrest: number of days between charge date and arrest where defendant was screened for COMPAS score

  • decile_score: COMPAS score from 1 to 10 (low risk to high risk)

  • is_recid: if the defendant recidivized

  • two_year_recid: if the defendant recidivized within two years

  • c_jail_in: date defendant was imprisoned

  • c_jail_out: date defendant was released from jail

  • length_of_stay: length of jail stay

compas_clean_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
compas_df = pd.read_csv(compas_clean_url)
compas_df.head()
id age c_charge_degree race age_cat score_text sex priors_count days_b_screening_arrest decile_score is_recid two_year_recid c_jail_in c_jail_out length_of_stay
0 3 34 F African-American 25 - 45 Low Male 0 -1.0 3 1 1 2013-01-26 03:45:27 2013-02-05 05:36:53 10
1 4 24 F African-American Less than 25 Low Male 4 -1.0 4 1 1 2013-04-13 04:58:34 2013-04-14 07:02:04 1
2 8 41 F Caucasian 25 - 45 Medium Male 14 -1.0 6 1 1 2014-02-18 05:08:24 2014-02-24 12:18:30 6
3 10 39 M Caucasian 25 - 45 Low Female 0 -1.0 1 0 0 2014-03-15 05:35:34 2014-03-18 04:28:46 2
4 14 27 F Caucasian 25 - 45 Low Male 0 -1.0 4 0 0 2013-11-25 06:31:06 2013-11-26 08:26:57 1

11.4. One-hot Encoding#

We will audit first to see how good the algorithm is by treating the predictions as either high or not high. One way we can get to that point is to transform the score_text column from one column with three values, to 3 binary columns.

pd.get_dummies(compas_df['score_text'])
High Low Medium
0 False True False
1 False True False
2 False False True
3 False True False
4 False True False
... ... ... ...
5273 False True False
5274 True False False
5275 False False True
5276 False True False
5277 False True False

5278 rows × 3 columns

compas_onehot = pd.concat([compas_df,pd.get_dummies(compas_df['score_text'])],axis=1)

We could have done the above line in one neater step, but in class I forgot this was an option.

compas_df_onehot = pd.get_dummies(compas_df,columns=['score_text'])
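The two approaches do not name their columns the same way, which matters when selecting columns later. A small sketch on a made-up three-row DataFrame (`toy_df`, `toy_concat`, and `toy_direct` are hypothetical names, not part of the COMPAS data) shows the difference:

```python
import pandas as pd

# made-up data with the same kind of three-valued text column
toy_df = pd.DataFrame({'id': [1, 2, 3],
                       'score_text': ['Low', 'High', 'Medium']})

# approach 1: one-hot encode the column alone, then concatenate it back on;
# the original column is kept and the new columns are named by value
toy_concat = pd.concat([toy_df, pd.get_dummies(toy_df['score_text'])], axis=1)

# approach 2: let get_dummies handle the whole DataFrame;
# the original column is dropped and the new columns get a 'score_text_' prefix
toy_direct = pd.get_dummies(toy_df, columns=['score_text'])

print(list(toy_concat.columns))
print(list(toy_direct.columns))
```

With the first approach the columns are `High`, `Low`, and `Medium`; with the second they are `score_text_High`, `score_text_Low`, and `score_text_Medium`.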

Next, let’s look at the thresholds that were used, so that we know what the labels mean.

compas_onehot.groupby('score_text')['decile_score'].agg(['min','max'])
min max
score_text
High 8 10
Low 1 4
Medium 5 7

We will also audit with respect to a second threshold: a score of medium or high.

compas_onehot['MedHigh'] = compas_onehot['High'] + compas_onehot['Medium']
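To make the thresholds concrete, here is a plain-Python sketch of the mapping implied by the min/max table, using made-up decile scores. The function name `score_text` and the sample scores are assumptions for illustration; the real labels come from the dataset, not from this function.

```python
def score_text(decile_score):
    """Map a 1-10 decile score to the label implied by the min/max table."""
    if decile_score >= 8:
        return 'High'
    if decile_score >= 5:
        return 'Medium'
    return 'Low'


decile_scores = [3, 4, 6, 1, 9]  # made-up scores
labels = [score_text(s) for s in decile_scores]

# first threshold: high vs. not high
high = [lab == 'High' for lab in labels]

# second threshold: medium or high vs. low
med_high = [lab in ('Medium', 'High') for lab in labels]

# the second threshold is the same as checking the decile score directly
assert med_high == [s >= 5 for s in decile_scores]
print(labels)
```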

11.5. Sklearn Performance metrics#

The first thing we usually check is the accuracy: the fraction of all samples that are predicted correctly.

metrics.accuracy_score(compas_onehot['two_year_recid'],compas_onehot['High'])
0.6288366805608185

However this does not tell us anything about what types of mistakes the algorithm made. The type of mistake often matters in terms of how we trust or deploy an algorithm. We use a confusion matrix to describe the performance in more detail.

A confusion matrix counts the number of samples of each true category that were predicted to be in each category. In this case we have a binary prediction problem: people either are re-arrested or not (truth) and were given a high score or not (prediction). In binary problems we adopt a common language of labeling one outcome/predicted value positive and the other negative. We do this not based on the social value of the outcome, but on the numerical encoding.

In this data, being re-arrested is indicated by a 1 in the two_year_recid column, so this is the positive class, and not being re-arrested is 0, so that is the negative class. Similarly, a high score is 1, so that’s the positive prediction, and not high is 0, so that is the negative prediction.

metrics.accuracy_score(compas_onehot['two_year_recid'],compas_onehot['MedHigh'])
0.6582038651004168


metrics.confusion_matrix(compas_onehot['two_year_recid'],compas_onehot['MedHigh'])
array([[1872,  923],
       [ 881, 1602]])

Note

The terms positive and negative (and true/false positives and negatives) can be used in any sort of detection problem, whether machine learning is used or not.

sklearn.metrics provides a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function that we can use.

Since this is a binary problem, we have 4 possible outcomes:

  • true negatives (\(C_{0,0}\)): did not get a high score and were not re-arrested

  • false negatives (\(C_{1,0}\)): did not get a high score and were re-arrested

  • false positives (\(C_{0,1}\)): got a high score and were not re-arrested

  • true positives (\(C_{1,1}\)): got a high score and were re-arrested
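The four outcomes above can be counted by hand. This is a minimal plain-Python sketch with made-up labels, laid out the same way as the matrix (rows are the truth, columns are the prediction); `confusion_counts` is a hypothetical helper for illustration, not the sklearn function.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows: truth, columns: prediction)."""
    counts = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        counts[truth][pred] += 1
    return counts


# made-up truth and predictions for six people
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

C = confusion_counts(y_true, y_pred)
tn, fp = C[0]
fn, tp = C[1]
print(C)
```

Indexing as `counts[truth][pred]` is what makes \(C_{1,0}\) the false negatives and \(C_{0,1}\) the false positives.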

With these we can revisit accuracy:

\[ A = \frac{C_{0,0} + C_{1,1}}{C_{0,0}+ C_{1,0} + C_{0,1} + C_{1,1}} \]

and we can define new scores. Two common ones in CS are recall and precision.

Recall is:

\[ R = \frac{C_{1,1}}{C_{1,0} + C_{1,1}} \]

# the one-hot columns live in compas_onehot; compas_df has no such column
metrics.recall_score(compas_onehot['two_year_recid'],compas_onehot['High'])

That is, among the truly positive class how many were correctly predicted? In COMPAS, it’s the percentage of the re-arrested people who got a high score.

Precision is:

\[ P = \frac{C_{1,1}}{C_{0,1} + C_{1,1}} \]

metrics.recall_score(compas_onehot['two_year_recid'],compas_onehot['MedHigh'])
0.6451872734595248
metrics.precision_score(compas_onehot['two_year_recid'],compas_onehot['MedHigh'])
0.6344554455445545
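As a check on the formulas, the same three scores can be recomputed by hand from the counts in the confusion matrix output for the MedHigh threshold. This sketch only copies those four numbers; the variable names are chosen here for illustration.

```python
# counts copied from the confusion matrix output (rows: truth, columns: prediction)
tn, fp = 1872, 923
fn, tp = 881, 1602

# accuracy: correct predictions over all samples
accuracy = (tn + tp) / (tn + fp + fn + tp)

# recall: true positives over everyone who was actually re-arrested
recall = tp / (fn + tp)

# precision: true positives over everyone who got a medium or high score
precision = tp / (fp + tp)

print(round(accuracy, 4), round(recall, 4), round(precision, 4))
```

These agree with the values returned by `metrics.accuracy_score`, `metrics.recall_score`, and `metrics.precision_score` for the same columns.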

11.6. Per Group Scores#

To group by race and then compute the score for each group, we can use a lambda again, with apply.

acc_fx  = lambda d: metrics.accuracy_score(d['two_year_recid'],d['MedHigh'])
compas_onehot.groupby('race').apply(acc_fx)
race
African-American    0.649134
Caucasian           0.671897
dtype: float64

That lambda + apply is equivalent to:

race_acc = []
# iterating over a groupby yields (group name, sub-DataFrame) pairs
for race, rdf in compas_onehot.groupby('race'):
    acc = metrics.accuracy_score(rdf['two_year_recid'],
             rdf['MedHigh'])
    race_acc.append([race,acc])

pd.DataFrame(race_acc, columns =['race','accuracy'])

recall_fx  = lambda d: metrics.recall_score(d['two_year_recid'],d['MedHigh'])
compas_onehot.groupby('race').apply(recall_fx)
race
African-American    0.715232
Caucasian           0.503650
dtype: float64
precision_fx  = lambda d: metrics.precision_score(d['two_year_recid'],d['MedHigh'])
compas_onehot.groupby('race').apply(precision_fx)
race
African-American    0.649535
Caucasian           0.594828
dtype: float64

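The recall differs a lot by group (about 0.72 for African-American defendants versus 0.50 for Caucasian defendants), which tells us that the model has a very different impact on the two groups. On the other hand, the precision is similar (about 0.65 versus 0.59), which tells us the scores mean about the same thing for Black and white people.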
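To see how recall can differ between groups while precision stays the same, here is a small sketch on made-up data. The DataFrame `toy`, its column names, and the groups `a` and `b` are all hypothetical, and the scores are computed from boolean masks rather than with sklearn.

```python
import pandas as pd

# made-up outcomes (actual) and predictions (pred) for two groups
toy = pd.DataFrame({
    'group':  ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
    'actual': [1, 1, 0, 0, 1, 1, 1, 1, 0, 0],
    'pred':   [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],
})

rows = []
for name, gdf in toy.groupby('group'):
    tp = ((gdf['actual'] == 1) & (gdf['pred'] == 1)).sum()
    fn = ((gdf['actual'] == 1) & (gdf['pred'] == 0)).sum()
    fp = ((gdf['actual'] == 0) & (gdf['pred'] == 1)).sum()
    rows.append({'group': name,
                 'recall': tp / (tp + fn),
                 'precision': tp / (tp + fp)})

per_group = pd.DataFrame(rows).set_index('group')
print(per_group)
```

In this toy data a positive prediction is right two times out of three in both groups, but group `b` has half of its true positives missed while group `a` has none missed.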

Researchers have proven that, in general, these two notions of fairness are mutually exclusive. We cannot have both, so it is very important to think about what the performance metrics mean and how your algorithm will be used in order to choose how to prepare a model. We will train models starting next week, but knowing these goals in advance is essential.

Importantly, this is not a statistical, computational choice that data can answer for us. This is about human values (and to some extent the law; certain domains have legal protections that require a specific condition).

The Fair Machine Learning book’s classification chapter has a section on relationships between criteria, with the proofs.

Important

We used ProPublica’s COMPAS dataset to replicate (parts of, with different tools) their analysis. That is, they collected the dataset in order to audit the COMPAS algorithm and we used it for the same purpose (and to learn model evaluation). This dataset is not designed for training models, even though it has been used as such many times. This is not the best way to use this dataset and for future assignments I do not recommend using this dataset.

11.7. Prepare for Next Class#

Install aif360.

11.8. Portfolio#

The audience is not me, but a generally knowledgeable person. For example:

  • a student deciding if they want to take this course or not. They know how to code, but not data science.

  • a person familiar with the domain your data is from (e.g., a sports fan if sports data)

  • a future employer who wants to know about your skills

Any of these people know the big ideas, but not exactly what happened in class. You can specify which audience you’re targeting in the introduction (which is the one piece that I’m the audience for).

The goal is to show what you understand, not only what you can do, because you can do a lot of simple things by finding answers online. We want you to understand enough that when you start seeing new, real problems, you’re able to do these things on your own.

  • level 1: you can follow a conversation

  • level 2: you can do it if someone gives you a rough plan

  • level 3: you can do it, given only an end goal

Think of this more like a report with code as the figures than a coding assignment. To see what you’ve learned we should be able to read through one piece of text, not compare two files, so it could be like: