Evaluating ML Algorithms

11. Evaluating ML Algorithms#

This week we are going to start learning about machine learning.

We are going to do this by looking at how to tell if machine learning has worked.

This is because:

you have to check if one worked after you build one
if you do not check carefully, it might only sometimes work
gives you a chance to learn only evaluation instead of evaluation + an ML task

11.1. What is a Machine Learning Algorithm?#

First, what is an Algorithm?

An algorithm is a set of ordered steps to complete a task.

Note that when people outside of CS talk about algorithms that impact people’s lives these are often not written directly by people anymore. They are often the result of machine learning.

In machine learning, people write an algorithm for how to write an algorithm based on data. This often comes in the form of a statitistical model of some sort.

Ml overview: training data goes into the learning algorithm, which outputs the prediction algorithm. the prediciton algorithm takes a sampleand outputs a prediction

When we do machine learning, this can also be called:

data mining
pattern recognition
modeling

because we are looking for patterns in the data and typically then planning to use those patterns to make predictions or automate a task.

Each of these terms does have slightly different meanings and usage, but sometimes they’re used close to exchangeably.

11.2. How can we tell if ML is working?#

We measure the performance of the prediction algorithm, to determine if the learning algorithm worked.

11.3. Replicating the COMPAS Audit#

We are going to replicate the audit from ProPublica Machine Bias

11.3.1. Why COMPAS?#

Propublica started the COMPAS Debate with the article Machine Bias. With their article, they also released details of their methodology and their data and code. This presents a real data set that can be used for research on how data is used in a criminal justice setting without researchers having to perform their own requests for information, so it has been used and reused a lot of times.

11.3.2. Propublica COMPAS Data#

The dataset consists of COMPAS scores assigned to defendants over two years 2013-2014 in Broward County, Florida, it was released by Propublica in a GitHub Repository. These scores are determined by a proprietary algorithm designed to evaluate a persons recidivism risk - the likelihood that they will reoffend. Risk scoring algorithms are widely used by judges to inform their sentencing and bail decisions in the criminal justice system in the United States.

The journalists collected, for each person arreste din 2013 and 2014:

basic demographics
details about what they were charged with and priors
the COMPAS score assigned to them
if they had actually been re-arrested within 2 years of their arrest

This means that we have what the COMPAS algorithm predicted (in the form of a score from 1-10) and what actually happened (re-arrested or not). We can then measure how well the algorithm worked, in practice, in the real world.

Now we will start our notebook

import pandas as pd
from sklearn import metrics
import seaborn as sns

We’re going to work with a cleaned copy of the data released by Propublica that also has a minimal subset of features.

age: defendant’s age
c_charge_degree: degree charged (Misdemeanor of Felony)
race: defendant’s race
age_cat: defendant’s age quantized in “less than 25”, “25-45”, or “over 45”
score_text: COMPAS score: ‘low’(1 to 5), ‘medium’ (5 to 7), and ‘high’ (8 to 10).
sex: defendant’s gender
priors_count: number of prior charges
days_b_screening_arrest: number of days between charge date and arrest where defendant was screened for compas score
decile_score: COMPAS score from 1 to 10 (low risk to high risk)
is_recid: if the defendant recidivized
two_year_recid: if the defendant within two years
c_jail_in: date defendant was imprisoned
c_jail_out: date defendant was released from jail
length_of_stay: length of jail stay

compas_clean_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
compas_df = pd.read_csv(compas_clean_url)
compas_df.head()

	id	age	c_charge_degree	race	age_cat	score_text	sex	priors_count	days_b_screening_arrest	decile_score	is_recid	two_year_recid	c_jail_in	c_jail_out	length_of_stay
0	3	34	F	African-American	25 - 45	Low	Male	0	-1.0	3	1	1	2013-01-26 03:45:27	2013-02-05 05:36:53	10
1	4	24	F	African-American	Less than 25	Low	Male	4	-1.0	4	1	1	2013-04-13 04:58:34	2013-04-14 07:02:04	1
2	8	41	F	Caucasian	25 - 45	Medium	Male	14	-1.0	6	1	1	2014-02-18 05:08:24	2014-02-24 12:18:30	6
3	10	39	M	Caucasian	25 - 45	Low	Female	0	-1.0	1	0	0	2014-03-15 05:35:34	2014-03-18 04:28:46	2
4	14	27	F	Caucasian	25 - 45	Low	Male	0	-1.0	4	0	0	2013-11-25 06:31:06	2013-11-26 08:26:57	1

11.4. One-hot Encoding#

We will audit first to see how good the algorithm is by treating the predictions as either high or not high. One way we can get to that point is to transform the score_text column from one column with three values, to 3 binary columns.

First lets understand the score_text

compas_df.groupby('score_text')['decile_score'].agg(['min','max'])

	min	max
score_text
High	8	10
Low	1	4
Medium	5	7

agg is short for aggregate. It allows us to apply one or more functions to a dataframe (or groupby) along a single axis (rows or columns). It is similar to apply but allows us to pass a list of functions or function names and similar to describe but allows us to make custom summary tables.

You can learn more about agg from its docs or with visuals on pandas tutor

To see what one hot encoding looks like, we will apply it first to just the one column.

pd.get_dummies(compas_df['score_text'])

	High	Low	Medium
0	False	True	False
1	False	True	False
2	False	False	True
3	False	True	False
4	False	True	False
...	...	...	...
5273	False	True	False
5274	True	False	False
5275	False	False	True
5276	False	True	False
5277	False	True	False

5278 rows × 3 columns

Since we atually want all of the columns with that one expanded, we will apply it a different way to save it to a variable.

compas_onehot = pd.get_dummies(compas_df,columns=['score_text'])

We will also audit with respect to a second threshold.

compas_onehot['MedHigh'] = compas_onehot['High'] + compas_onehot['Medium']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'High'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 compas_onehot['MedHigh'] = compas_onehot['High'] + compas_onehot['Medium']

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'High'

11.5. Sklearn Performance metrics#

The first thing we usually check is the accuracy: the percentage of all samples that are correct.

metrics.accuracy_score(compas_onehot['two_year_recid'],
                       compas_onehot['MedHigh'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[7], line 2
      1 metrics.accuracy_score(compas_onehot['two_year_recid'],
----> 2                        compas_onehot['MedHigh'])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

Important

all sklearn.metrics functions have the same first two parameters:

y_true the real world outcome our model tries to predict
y_pred the *model’s predictions

metrics.accuracy_score(compas_onehot['two_year_recid'],
                       compas_onehot['High'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'High'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[8], line 2
      1 metrics.accuracy_score(compas_onehot['two_year_recid'],
----> 2                        compas_onehot['High'])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'High'

the accuracy is not great either way

However this does not tell us anything about what types of mistakes the algorithm made. The type of mistake often matters in terms of how we trust or deploy an algorithm. We use a confusion matrix to describe the performance in more detail.

A confusion matrix counts the number of samples of each true category that wre predicted to be in each category. In this case we have a binary prediction problem: people either are re-arrested (truth) or not and were given a high score or not(prediction). In binary problems we adopt a common language of labeling one outcome/predicted value positive and the other negative. We do this not based on the social value of the outcome, but on the numerical encoding.

In this data, being re-arrested is indicated by a 1 in the two_year_recid column, so this is the positive class and not being re-arrested is 0, so the negative class. Similarly a high score is 1, so that’s the positive prediction and not high is 0, so that is the a negative prediction.

docs

metrics.confusion_matrix(compas_onehot['two_year_recid'],
                       compas_onehot['MedHigh'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[9], line 2
      1 metrics.confusion_matrix(compas_onehot['two_year_recid'],
----> 2                        compas_onehot['MedHigh'])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

Using the help above, we can label the axes and put this in a dataFrame to make it easier to read.

cf = metrics.confusion_matrix(compas_onehot['two_year_recid'],
                       compas_onehot['MedHigh'])
pd.DataFrame(data =cf,columns = ['low score','medium or high score'],
             index = ['not rearrested','rearrested'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[10], line 2
      1 cf = metrics.confusion_matrix(compas_onehot['two_year_recid'],
----> 2                        compas_onehot['MedHigh'])
      3 pd.DataFrame(data =cf,columns = ['low score','medium or high score'],
      4              index = ['not rearrested','rearrested'])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

Note

these terms can be used in any sort of detection problem, whether machine learning is used or not

sklearn.metrics provides a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function that we can use.

Since this is binary problem we have 4 possible outcomes:

true negatives($C_{0,0}$): did not get a high score and were not re-arrested
false negatives($C_{1,0}$):: did not get a high score and were re-arrested
false positives($C_{0,1}$):: got a high score and were not re-arrested
true positives($C_{1,1}$):: got a high score and were re-arrested

With these we can revisit accuracy:

\[ A = \frac{C_{0,0} + C_{1,1}}{C_{0,0}+ C_{1,0} + C_{0,1} + C_{1,1}} \]

and we can define new scores.

11.6. Precision and Recall#

Two common ones in CS are recall and precision.

Recall is:

\[ R = \frac{C_{1,1}}{C_{1,0} + C_{1,1}} \]

metrics.recall_score(compas_onehot['two_year_recid'],
                       compas_onehot['MedHigh'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[11], line 2
      1 metrics.recall_score(compas_onehot['two_year_recid'],
----> 2                        compas_onehot['MedHigh'])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

That is, among the truly positive class how many were correctly predicted? In COMPAS, it’s the percentage of the re-arrested people who got a high score.

Precision is $$ P = \frac{C_{1,1}}{C_{0,1} + C_{1,1}} $$

metrics.precision_score(compas_onehot['two_year_recid'],
                       compas_onehot['MedHigh'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[12], line 2
      1 metrics.precision_score(compas_onehot['two_year_recid'],
----> 2                        compas_onehot['MedHigh'])

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

11.7. Per Group Scores#

To groupby and then do the score, we can use a lambda, with apply

acc_fx  = lambda d: metrics.accuracy_score(d['two_year_recid'],d['MedHigh'])
compas_onehot.groupby('race').apply(acc_fx)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[13], line 2
      1 acc_fx  = lambda d: metrics.accuracy_score(d['two_year_recid'],d['MedHigh'])
----> 2 compas_onehot.groupby('race').apply(acc_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1353, in GroupBy.apply(self, func, *args, **kwargs)
   1351 with option_context("mode.chained_assignment", None):
   1352     try:
-> 1353         result = self._python_apply_general(f, self._selected_obj)
   1354     except TypeError:
   1355         # gh-20949
   1356         # try again, with .apply acting as a filtering
   (...)
   1360         # fails on *some* columns, e.g. a numeric operation
   1361         # on a string grouper column
   1363         return self._python_apply_general(f, self._obj_with_exclusions)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1402, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1367 @final
   1368 def _python_apply_general(
   1369     self,
   (...)
   1374     is_agg: bool = False,
   1375 ) -> NDFrameT:
   1376     """
   1377     Apply function f in python space
   1378 
   (...)
   1400         data after applying f
   1401     """
-> 1402     values, mutated = self.grouper.apply(f, data, self.axis)
   1403     if not_indexed_same is None:
   1404         not_indexed_same = mutated

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/ops.py:767, in BaseGrouper.apply(self, f, data, axis)
    765 # group might be modified
    766 group_axes = group.axes
--> 767 res = f(group)
    768 if not mutated and not _is_indexed_like(res, group_axes, axis):
    769     mutated = True

Cell In[13], line 1, in <lambda>(d)
----> 1 acc_fx  = lambda d: metrics.accuracy_score(d['two_year_recid'],d['MedHigh'])
      2 compas_onehot.groupby('race').apply(acc_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

That lambda + apply is equivalent to:

race_acc = []
for race, rdf in compas_race:
    acc = skmetrics.accuracy_score(rdf['two_year_recid'],
             rdf['score_text_MedHigh'])
    race_acc.append([race,acc])

pd.DataFrame(race_acc, columns =['race','accuracy'])

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 race_acc = []
----> 2 for race, rdf in compas_race:
      3     acc = skmetrics.accuracy_score(rdf['two_year_recid'],
      4              rdf['score_text_MedHigh'])
      5     race_acc.append([race,acc])

NameError: name 'compas_race' is not defined

then we can do the same thing for recall and precision.

recall_fx  = lambda d: metrics.recall_score(d['two_year_recid'],d['MedHigh'])
compas_onehot.groupby('race').apply(recall_fx)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[15], line 2
      1 recall_fx  = lambda d: metrics.recall_score(d['two_year_recid'],d['MedHigh'])
----> 2 compas_onehot.groupby('race').apply(recall_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1353, in GroupBy.apply(self, func, *args, **kwargs)
   1351 with option_context("mode.chained_assignment", None):
   1352     try:
-> 1353         result = self._python_apply_general(f, self._selected_obj)
   1354     except TypeError:
   1355         # gh-20949
   1356         # try again, with .apply acting as a filtering
   (...)
   1360         # fails on *some* columns, e.g. a numeric operation
   1361         # on a string grouper column
   1363         return self._python_apply_general(f, self._obj_with_exclusions)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1402, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1367 @final
   1368 def _python_apply_general(
   1369     self,
   (...)
   1374     is_agg: bool = False,
   1375 ) -> NDFrameT:
   1376     """
   1377     Apply function f in python space
   1378 
   (...)
   1400         data after applying f
   1401     """
-> 1402     values, mutated = self.grouper.apply(f, data, self.axis)
   1403     if not_indexed_same is None:
   1404         not_indexed_same = mutated

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/ops.py:767, in BaseGrouper.apply(self, f, data, axis)
    765 # group might be modified
    766 group_axes = group.axes
--> 767 res = f(group)
    768 if not mutated and not _is_indexed_like(res, group_axes, axis):
    769     mutated = True

Cell In[15], line 1, in <lambda>(d)
----> 1 recall_fx  = lambda d: metrics.recall_score(d['two_year_recid'],d['MedHigh'])
      2 compas_onehot.groupby('race').apply(recall_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

The recall tells us that the model has very different impact on people. On the other hand the precision tells us the scores mean about the same thing for Black and White people.

Researchers established that these are mutually exclusive, provably. We cannot have both, so it is very important to think about what the performance metrics mean and how your algorithm will be used in order to choose how to prepare a model. We will train models starting next week, but knowing these goals in advance is essential.

Importantly, this is not a statistical, computational choice that data can answer for us. This is about human values (and to some extent the law; certain domains have legal protections that require a specific condition).

The Fair Machine Learning book’s classificaiton Chapter has a section on relationships between criteria with the proofs.

Looking at this gives a larger seeming difference to make it more clear, the error rate is almost twice as high.

1-compas_onehot.groupby('race').apply(recall_fx)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[16], line 1
----> 1 1-compas_onehot.groupby('race').apply(recall_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1353, in GroupBy.apply(self, func, *args, **kwargs)
   1351 with option_context("mode.chained_assignment", None):
   1352     try:
-> 1353         result = self._python_apply_general(f, self._selected_obj)
   1354     except TypeError:
   1355         # gh-20949
   1356         # try again, with .apply acting as a filtering
   (...)
   1360         # fails on *some* columns, e.g. a numeric operation
   1361         # on a string grouper column
   1363         return self._python_apply_general(f, self._obj_with_exclusions)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1402, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1367 @final
   1368 def _python_apply_general(
   1369     self,
   (...)
   1374     is_agg: bool = False,
   1375 ) -> NDFrameT:
   1376     """
   1377     Apply function f in python space
   1378 
   (...)
   1400         data after applying f
   1401     """
-> 1402     values, mutated = self.grouper.apply(f, data, self.axis)
   1403     if not_indexed_same is None:
   1404         not_indexed_same = mutated

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/ops.py:767, in BaseGrouper.apply(self, f, data, axis)
    765 # group might be modified
    766 group_axes = group.axes
--> 767 res = f(group)
    768 if not mutated and not _is_indexed_like(res, group_axes, axis):
    769     mutated = True

Cell In[15], line 1, in <lambda>(d)
----> 1 recall_fx  = lambda d: metrics.recall_score(d['two_year_recid'],d['MedHigh'])
      2 compas_onehot.groupby('race').apply(recall_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

Note this is almost double the error rate

precision_fx  = lambda d: metrics.precision_score(d['two_year_recid'],d['MedHigh'])
compas_onehot.groupby('race').apply(precision_fx)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'MedHigh'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[17], line 2
      1 precision_fx  = lambda d: metrics.precision_score(d['two_year_recid'],d['MedHigh'])
----> 2 compas_onehot.groupby('race').apply(precision_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1353, in GroupBy.apply(self, func, *args, **kwargs)
   1351 with option_context("mode.chained_assignment", None):
   1352     try:
-> 1353         result = self._python_apply_general(f, self._selected_obj)
   1354     except TypeError:
   1355         # gh-20949
   1356         # try again, with .apply acting as a filtering
   (...)
   1360         # fails on *some* columns, e.g. a numeric operation
   1361         # on a string grouper column
   1363         return self._python_apply_general(f, self._obj_with_exclusions)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1402, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1367 @final
   1368 def _python_apply_general(
   1369     self,
   (...)
   1374     is_agg: bool = False,
   1375 ) -> NDFrameT:
   1376     """
   1377     Apply function f in python space
   1378 
   (...)
   1400         data after applying f
   1401     """
-> 1402     values, mutated = self.grouper.apply(f, data, self.axis)
   1403     if not_indexed_same is None:
   1404         not_indexed_same = mutated

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/groupby/ops.py:767, in BaseGrouper.apply(self, f, data, axis)
    765 # group might be modified
    766 group_axes = group.axes
--> 767 res = f(group)
    768 if not mutated and not _is_indexed_like(res, group_axes, axis):
    769     mutated = True

Cell In[17], line 1, in <lambda>(d)
----> 1 precision_fx  = lambda d: metrics.precision_score(d['two_year_recid'],d['MedHigh'])
      2 compas_onehot.groupby('race').apply(precision_fx)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'MedHigh'

Important

We used ProPublica’s COMPAS dataset to replicate (parts of, with different tools) their analysis. That is, they collected the dataset in order to audit the COMPAS algorithm and we used it for the same purpose (and to learn model evaluation). This dataset is not designed for training models, even though it has been used as such many times. This is not the best way to use this dataset and for future assignments I do not recommend using this dataset.

11.8. Questions#

11.8.1. Does ChatGPT have a dataset similar to Compas? How accurate are the AI’s that students are using instead of learning?#

We do not know exactly what the training data for ChatGPT i, it uses OpenAI’s modelGPT3.5 and beyond. I also want to clarify that the dataset we used today was some journalists’ data to audit COMPAS. It was trained on different data.

However for GPT3, OpenAI published a paper in section 2.2 they state that their training data includes (links are mine; they had citations):

filtered version of CommonCrawl1

an expanded version of the WebText dataset

two sets of books

English Language Wikipedia

11.8.2. There was a lot in this lecture today. What generally should be the take away from today’s class?#

The performance metrics and what ML is roughly, though we will come back to that.

11.8.3. What’s the difference between precision score and recall score?#

Mathematically, the denominator.

Conceptually, the precision is a perfect score (1) if there are no false positives but the precision score is not impacted by there being false negatives. Precision is a measure of how many of the high scores were actually re-arrested, it only cares about when the algorithm predicts the positive class.

In contrast, a perfect recall only requires no false negatives and is not impacted by false positives at all. Recall only cares about the actually positive class, here the people who acutally got re-arrested and how well the model performed on those people. Recall is how many of the the thing we want were found in a radar context.

11.8.4. how to combat biased data#

This is still a really open area of research and it is context dependent. The COMPAS audit showed that there are different ways that we can write euqations to try to represent fairness and researchers proved that these cannot all be true at the same time.

In this case, I would actually not say that the “data is biased” because in statistics, biased data refers to a type of not correct measurement. This data reflects the way the actual world is biased.

11.8.5. I guess when should and shouldn’t you use one-hot?#

We use one hot encoding when we want to numerically represent categorical data and if we want to do something that is like binary for one of the values for a variable with more than 2 values.

11.8.6. What are some good intro resourses you suggest for looking more into some of the math that make up these machine learning models that we will use in this class?#

For the purposes of this course, the sklearn documentation is a good starting point. It has the specific equations that are implemented in the package, which is relevant for some models that have multiple ways they can be implemented.