15. Performance Metrics continued#

15.1. Logistics#

  • A6 will be posted ASAP! (I promise it will be straightforward in terms of the actual code; you do need to interpret it carefully, though.)

  • A5 grading is behind; if it’s not done by early tomorrow (e.g. 10am), I’ll extend the portfolio deadline accordingly

  • Mid Semester Feedback

15.2. Completing the COMPAS Audit#

Let’s start where we left off, plus some additional imports:

import pandas as pd
from sklearn import metrics as skmetrics
from aif360 import metrics as fairmetrics
from aif360.datasets import BinaryLabelDataset
import seaborn as sns

# load the cleaned COMPAS data, using the id column as the index
compas_clean_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
compas_df = pd.read_csv(compas_clean_url, index_col='id')

# one-hot encode score_text into score_text_High, score_text_Low, score_text_Medium
compas_df = pd.get_dummies(compas_df, columns=['score_text'])
WARNING:root:No module named 'tempeh': LawSchoolGPADataset will be unavailable. To install, run:
pip install 'aif360[LawSchoolGPA]'
WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:
pip install 'aif360[AdversarialDebiasing]'
WARNING:root:No module named 'fairlearn': ExponentiatedGradientReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'
WARNING:root:No module named 'fairlearn': GridSearchReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'

Warning

We’ll get these warnings, which are okay; they refer to optional parts of aif360 that we are not using, and if you run the cell again they will go away.

To review:

compas_df.head()
age c_charge_degree race age_cat sex priors_count days_b_screening_arrest decile_score is_recid two_year_recid c_jail_in c_jail_out length_of_stay score_text_High score_text_Low score_text_Medium
id
3 34 F African-American 25 - 45 Male 0 -1.0 3 1 1 2013-01-26 03:45:27 2013-02-05 05:36:53 10 0 1 0
4 24 F African-American Less than 25 Male 4 -1.0 4 1 1 2013-04-13 04:58:34 2013-04-14 07:02:04 1 0 1 0
8 41 F Caucasian 25 - 45 Male 14 -1.0 6 1 1 2014-02-18 05:08:24 2014-02-24 12:18:30 6 0 0 1
10 39 M Caucasian 25 - 45 Female 0 -1.0 1 0 0 2014-03-15 05:35:34 2014-03-18 04:28:46 2 0 1 0
14 27 F Caucasian 25 - 45 Male 0 -1.0 4 0 0 2013-11-25 06:31:06 2013-11-26 08:26:57 1 0 1 0

Notice today we imported the sklearn.metrics module with an alias.

skmetrics.accuracy_score(compas_df['two_year_recid'],compas_df['score_text_High'])
0.6288366805608185

More common is to use medium or high to check accuracy (or, equivalently, not low). We can calculate this either by summing the two columns or by inverting the low column; we’ll do it as not low for now, to review using apply.

Try it Yourself

A good exercise to review data manipulation is to try creating the score_text_MedHigh column by adding the other two together (because medium or high, when they’re booleans, is the same as medium + high when they’re ints). One possible solution is sketched just below.
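One possible solution (my sketch, not from the original notes; astype(int) is a precaution in case get_dummies produced boolean columns):

# medium OR high is the same as medium + high when the columns are 0/1
compas_df['score_text_MedHigh'] = (compas_df['score_text_Medium'].astype(int)
                                   + compas_df['score_text_High'].astype(int))

In the notes, though, we create the column by inverting the low column instead: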

int_not = lambda a:int(not(a))
compas_df['score_text_MedHigh'] = compas_df['score_text_Low'].apply(int_not)
skmetrics.accuracy_score(compas_df['two_year_recid'],
             compas_df['score_text_MedHigh'])
0.6582038651004168

We can see this gives us a slightly higher score, but still not that great.

The int_not lambda is a function:

type(int_not)
function

It is equivalent to the following, but in a more compact notation.

def int_not_f(a):
    return int(not(a))

It flips a 0 to a 1

int_not(0)
1

and the other way:

int_not(1)
0
We can also check how the score categories line up with the decile scores; for example, the minimum decile score with and without the Medium label:

compas_df.groupby('score_text_Medium')['decile_score'].min()
score_text_Medium
0    1
1    5
Name: decile_score, dtype: int64

For the next section, we also group the data by race:

compas_race = compas_df.groupby('race')

15.3. Per Group scores with groupby#

To group by and then compute the score, we can use a lambda again, with apply:

acc_fx = lambda d: skmetrics.accuracy_score(d['two_year_recid'],
             d['score_text_MedHigh'])

compas_race.apply(acc_fx,)
race
African-American    0.649134
Caucasian           0.671897
dtype: float64

In this case it gives a series, but with reset_index we can make it a DataFrame and then rename the column to label it as accuracy.

compas_race.apply(acc_fx,).reset_index().rename(columns={0:'accuracy'})
race accuracy
0 African-American 0.649134
1 Caucasian 0.671897

That lambda + apply is equivalent to:

race_acc = []
for race, rdf in compas_race:
    acc = skmetrics.accuracy_score(rdf['two_year_recid'],
             rdf['score_text_MedHigh'])
    race_acc.append([race,acc])

pd.DataFrame(race_acc, columns =['race','accuracy'])
race accuracy
0 African-American 0.649134
1 Caucasian 0.671897

15.4. Using AIF360#

The AIF360 package implements fairness metrics, some of which are derived from metrics we have seen and some of which are new. The documentation has the full list in a summary table with plain-English explanations, and the detailed documentation includes most of the equations.

However, it has a few requirements:

  • the metric’s constructor (ClassificationMetric) takes two BinaryLabelDataset objects

  • these objects must be the same except for the label column

  • the constructor for BinaryLabelDataset only accepts all numerical DataFrames

So, we have some preparation to do.

First, we’ll make a numerical copy of the compas_df columns that we need. The only nonnumerical column that we need is race, so we’ll make a dict to replace its values.

race_num_map = {r: i for i, r in enumerate(compas_df['race'].value_counts().index)}
race_num_map
{'African-American': 0, 'Caucasian': 1}

And here we select the columns and replace the values:

required_cols = ['race','two_year_recid','score_text_MedHigh']
num_compas = compas_df[required_cols].replace(race_num_map)
num_compas.head(2)
race two_year_recid score_text_MedHigh
id
3 0 1 0
4 0 1 0
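To confirm the requirement that everything is numerical (a quick check of my own, not in the original notes), we can inspect the dtypes; all three columns should now have an integer type:

num_compas.dtypes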

Next we will make two versions, one with race & the ground truth and the other with race & the predictions. It’s easiest to drop the column we don’t want.

num_compas_true = num_compas.drop(columns=['score_text_MedHigh'])
num_compas_pred = num_compas.drop(columns=['two_year_recid'])

Now we make the BinaryLabelDataset objects; this type comes from AIF360 too. Basically, it is a DataFrame with extra attributes, some specific to it and some inherited from StructuredDataset.

# the favorable outcome here is the actual outcome of not being rearrested,
# so favorable_label=0 and unfavorable_label=1
broward_true = BinaryLabelDataset(favorable_label=0, unfavorable_label=1,
          df=num_compas_true,
          label_names=['two_year_recid'],
          protected_attribute_names=['race'])
compas_predictions = BinaryLabelDataset(favorable_label=0, unfavorable_label=1,
          df=num_compas_pred,
          label_names=['score_text_MedHigh'],
          protected_attribute_names=['race'])

Try it Yourself

Remember, you can inspect any object using the __dict__ attribute.
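For example (an illustrative call of my own, not from the original notes), listing the keys shows which attributes the object stores:

broward_true.__dict__.keys()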

This type also has an ignore_fields attribute that is used when comparisons are made. The requirement is that only the content of the label column differs between the two datasets, but in our case the label names differ too, so we have to tell it that that’s okay.

compas_predictions.ignore_fields.add('label_names')
broward_true.ignore_fields.add('label_names')

Now, we can instantiate our metric object:

compas_fair_scorer = fairmetrics.ClassificationMetric(broward_true,
                           compas_predictions,
                 unprivileged_groups=[{'race':0}],
                privileged_groups = [{'race':1}])

And finally we can compute! First, we can verify that we get the same accuracy as before

compas_fair_scorer.accuracy()
0.6582038651004168

The aif360 metrics each have one parameter, privileged, with a default value of None. When it is None, the metric is computed over the whole dataset; when it is True, it is computed only for the privileged group.

compas_fair_scorer.accuracy(True)
0.6718972895863052

Here that is Caucasian people; note that it matches the Caucasian accuracy we computed with groupby above.

When it is False, it’s the unprivileged group, here African-American people.

compas_fair_scorer.accuracy(False)
0.6491338582677165

We can also compute other scores. Many fairness metrics are ratios of the unprivileged group’s score to the privileged group’s score.

In Disparate Impact, the ratio is of the positive (favorable) outcome, independent of the predictor. So this is the ratio of the % of Black people not rearrested to the % of white people not rearrested.

compas_fair_scorer.disparate_impact()
0.6336457196581771

The courts use an “80% rule”, saying that if this ratio is above .8 for things like employment, it’s close enough. We can also compare the two groups’ overall error rates with the error rate ratio:

compas_fair_scorer.error_rate_ratio()
1.0693789798014377
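As a sanity check (my addition, not in the original notebook): the error rate is 1 minus the accuracy, so this ratio should match what we get from the two group accuracies above:

# error rate = 1 - accuracy; the ratio is unprivileged / privileged
(1 - compas_fair_scorer.accuracy(False)) / (1 - compas_fair_scorer.accuracy(True))

which gives the same value, about 1.069.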

We can also take ratios of the error rates broken down by type of error. This is where the journalists found bias.

compas_fair_scorer.false_positive_rate_ratio()
0.5737241916634204

Black people were given a low score and then re-arrested only a little more than half as often as white people. (White people were given a low score and then rearrested almost twice as often.)

compas_fair_scorer.false_negative_rate_ratio()
1.9232342111919953

Black people were given a high score and not rearrested almost twice as often as white people.

So while the accuracy was similar for Black and white people (see the error rate ratio above), the algorithm makes opposite types of errors for the two groups.

After the journalists published the piece, the people who made COMPAS countered with a technical report, arguing that the journalists had measured fairness incorrectly.

The journalists’ two measures, false positive rate and false negative rate, use the true outcomes as the denominator.

The COMPAS creators argued that the model should be evaluated in terms of whether a given score means the same thing across races, using the prediction as the denominator.
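To make the two framings concrete, here are the standard definitions in terms of the confusion matrix counts (TP, FP, TN, FN); note what appears in each denominator:

$$
\text{FPR} = \frac{FP}{FP+TN} \qquad \text{FNR} = \frac{FN}{FN+TP}
$$

$$
\text{FOR} = \frac{FN}{FN+TN} \qquad \text{FDR} = \frac{FP}{FP+TP}
$$

The false positive and false negative rates condition on the true outcome, while the false omission and false discovery rates condition on the prediction.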

compas_fair_scorer.false_omission_rate_ratio()
0.8649767923408909
compas_fair_scorer.false_discovery_rate_ratio()
1.2118532033913119

On these two metrics, the ratio is closer to 1 and much less disparate.

The creators thought it was important for the score to mean the same thing for every person assigned a score. The journalists thought it was more important for the algorithm to have the same impact on different groups of people.
Ideally, we would like the score to both mean the same thing for different people and to have the same impact.

Researchers established that these are mutually exclusive, provably. We cannot have both, so it is very important to think about what the performance metrics mean and how your algorithm will be used in order to choose how to prepare a model. We will train models starting next week, but knowing these goals in advance is essential.

Importantly, this is not a statistical or computational choice that data can answer for us. This is about human values (and to some extent the law; certain domains have legal protections that require a specific condition).

The Fair Machine Learning book’s classification chapter has a section on relationships between criteria, with the proofs.

Important

We used ProPublica’s COMPAS dataset to replicate (parts of, with different tools) their analysis. That is, they collected the dataset in order to audit the COMPAS algorithm, and we used it for the same purpose (and to learn model evaluation). This dataset is not designed for training models, even though it has been used that way many times. That is not the best use of this dataset, and I do not recommend using it for future assignments.

15.5. Portfolio Reminder#

If you do not need level 3s to be happy with your grade for the course (e.g. you want a B) and you have all the achievements so far, you can skip the portfolio submission. If you do not need level 3 but you are not on track, you should submit to get caught up. This can be (and is advised to be) reflective revisions of past assignment(s).

If you need level 3 achievements for your desired grade, then you can pick a subset of the eligible skills (or all of them) and add new work that shows you have learned those skills according to the level 3 checklists. The ideas page has example formats for that new work.

15.6. Questions After Class#

Today’s questions were only clarifying, so hopefully re-reading the notes is enough. If not, post a question as an issue!