12. ML Models and Auditing with AIF360#

Let’s start where we left off, plus some additional imports

import pandas as pd
from sklearn import metrics as skmetrics
from aif360 import metrics as fairmetrics
from aif360.datasets import BinaryLabelDataset
import seaborn as sns

compas_clean_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
compas_df = pd.read_csv(compas_clean_url,index_col = 'id')

compas_df = pd.get_dummies(compas_df,columns=['score_text'],)
WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:
pip install 'aif360[AdversarialDebiasing]'
WARNING:root:No module named 'fairlearn': ExponentiatedGradientReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'
WARNING:root:No module named 'fairlearn': GridSearchReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'
WARNING:root:No module named 'inFairness': SenSeI and SenSR will be unavailable. To install, run:
pip install 'aif360[inFairness]'

We may get some warnings, which is okay. If you run the cell again they will go away.

We are going to continue with the ProPublica COMPAS audit data. Remember it contains:

  • age: defendant’s age

  • c_charge_degree: degree charged (Misdemeanor or Felony)

  • race: defendant’s race

  • age_cat: defendant’s age quantized in “less than 25”, “25-45”, or “over 45”

  • score_text: COMPAS score: ‘low’(1 to 5), ‘medium’ (5 to 7), and ‘high’ (8 to 10).

  • sex: defendant’s gender

  • priors_count: number of prior charges

  • days_b_screening_arrest: number of days between the arrest (charge date) and when the defendant was screened for the COMPAS score

  • decile_score: COMPAS score from 1 to 10 (low risk to high risk)

  • is_recid: if the defendant recidivated

  • two_year_recid: if the defendant recidivated within two years

  • c_jail_in: date defendant was imprisoned

  • c_jail_out: date defendant was released from jail

  • length_of_stay: length of jail stay

compas_df.head()
| id | age | c_charge_degree | race | age_cat | sex | priors_count | days_b_screening_arrest | decile_score | is_recid | two_year_recid | c_jail_in | c_jail_out | length_of_stay | score_text_High | score_text_Low | score_text_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 34 | F | African-American | 25 - 45 | Male | 0 | -1.0 | 3 | 1 | 1 | 2013-01-26 03:45:27 | 2013-02-05 05:36:53 | 10 | False | True | False |
| 4 | 24 | F | African-American | Less than 25 | Male | 4 | -1.0 | 4 | 1 | 1 | 2013-04-13 04:58:34 | 2013-04-14 07:02:04 | 1 | False | True | False |
| 8 | 41 | F | Caucasian | 25 - 45 | Male | 14 | -1.0 | 6 | 1 | 1 | 2014-02-18 05:08:24 | 2014-02-24 12:18:30 | 6 | False | False | True |
| 10 | 39 | M | Caucasian | 25 - 45 | Female | 0 | -1.0 | 1 | 0 | 0 | 2014-03-15 05:35:34 | 2014-03-18 04:28:46 | 2 | False | True | False |
| 14 | 27 | F | Caucasian | 25 - 45 | Male | 0 | -1.0 | 4 | 0 | 0 | 2013-11-25 06:31:06 | 2013-11-26 08:26:57 | 1 | False | True | False |

Notice the last three columns. When we use pd.get_dummies with its columns parameter, the new columns are appended all at once and each new column name is the original column name prepended to the value.

The most common choice is to treat medium or high (that is, not low) as the positive prediction when checking accuracy. We can calculate this column either by summing the medium and high columns or by inverting the low column.

Let’s do it by inverting here:

# invert the boolean low column: 1 means a medium or high score
int_not = lambda a:int(not(a))
compas_df['score_text_MedHigh'] = compas_df['score_text_Low'].apply(int_not)
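For comparison, the summing approach mentioned above could look like the following (a sketch; the score_text_MedHigh_v2 name is just for illustration, and .astype(int) converts the boolean dummy columns to 0/1 before adding):

compas_df['score_text_MedHigh_v2'] = (compas_df['score_text_Medium'].astype(int)
                                      + compas_df['score_text_High'].astype(int))
# the two versions should agree everywhere, since medium and high are mutually exclusive
(compas_df['score_text_MedHigh_v2'] == compas_df['score_text_MedHigh']).all()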


We use the two_year_recid column as the ground truth because that is what actually happened in the real world: whether or not the person was re-arrested within two years.

skmetrics.accuracy_score(compas_df['two_year_recid'],
                         compas_df['score_text_MedHigh'])
0.6582038651004168
skmetrics.accuracy_score(compas_df['two_year_recid'],
                         compas_df['score_text_High'])
0.6288366805608185
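Accuracy alone hides what kinds of mistakes are made. One way to preview the error types we examine below is a confusion matrix (a sketch; in sklearn’s convention the rows are the true values and the columns are the predictions):

# counts of each (true value, prediction) combination
skmetrics.confusion_matrix(compas_df['two_year_recid'],
                           compas_df['score_text_MedHigh'])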

12.2. Using AIF360#

The AIF360 package implements fairness metrics, some of which are derived from metrics we have already seen and some of which are new. The aif360 classification metrics documentation has the full list in a summary table with plain-English explanations, and the detailed entries include most of the equations.

However, the ClassificationMetric class has a few requirements:

  • its constructor takes two BinaryLabelDataset objects

  • these objects must be the same except for the label column

  • the constructor for BinaryLabelDataset only accepts all-numerical DataFrames

So, we have some preparation to do.

First, we’ll make a numerical copy of the compas_df columns that we need. The only non-numerical column that we need is race, so we’ll make a dict to replace its values.

We need numerical values for the protected attribute, so let’s make a mapping from each race string to a number.

One way to get the values for race is:

compas_df['race'].value_counts().index
Index(['African-American', 'Caucasian'], dtype='object', name='race')

Then we want to number them so that we can replace each string with a number. Python has enumerate for that (see the enumerate doc), and we can combine it with a dictionary comprehension.

enumerate is a generator, which is a special sort of function. From the docs, the following is equivalent:

def enumerate(iterable, start=0):
    n = start
    for elem in iterable:
        yield n, elem
        n += 1

yield works sort of like return, but it doesn’t end the function; it gives one value back and then waits until the next value is requested. The core Python docs for yield expressions explain this in detail.
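For example, a small illustration of what enumerate produces:

list(enumerate(['low', 'medium', 'high']))
# [(0, 'low'), (1, 'medium'), (2, 'high')]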

race_num_map = {r: i for i, r in enumerate(compas_df['race'].value_counts().index)}
race_num_map
{'African-American': 0, 'Caucasian': 1}

So we have a mapper that works with replace:

compas_df['race'].replace(race_num_map)
id
3        0
4        0
8        1
10       1
14       1
        ..
10994    0
10995    0
10996    0
10997    0
11000    0
Name: race, Length: 5278, dtype: int64
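As a side note, Series.map with the same dictionary is an equivalent way to do this replacement (values missing from the dict would become NaN with map, which is not an issue here because both race values are in the mapper):

compas_df['race'].map(race_num_map)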

Now we will make our smaller dataframe.

We will also only use a few of the variables.

required_cols = ['race','two_year_recid','score_text_MedHigh']
num_compas = compas_df[required_cols].replace(race_num_map)
num_compas.head(2)
| id | race | two_year_recid | score_text_MedHigh |
|---|---|---|---|
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |

The scoring object requires that we have special data structures that wrap a DataFrame.

We need one aif360 binary labeled dataset for the true values and one for the predictions.

Next we will make two versions: one with race and the ground truth, and the other with race and the predictions. It’s easiest to drop the column we don’t want.

The difference between the two datasets needs to be only the label column, so we drop the other label variable from each small dataframe that we create.

num_compas_true = num_compas.drop(columns=['score_text_MedHigh'])
num_compas_pred = num_compas.drop(columns=['two_year_recid'])
num_compas_true.head()
| id | race | two_year_recid |
|---|---|---|
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 8 | 1 | 1 |
| 10 | 1 | 0 |
| 14 | 1 | 0 |

Now we make the BinaryLabelDataset objects; this type comes from AIF360 too. Basically, it is a DataFrame with extra attributes, some specific to BinaryLabelDataset and some inherited from StructuredDataset.

broward_true = BinaryLabelDataset(favorable_label=0, unfavorable_label=1,
                                  df=num_compas_true,
                                  label_names=['two_year_recid'],
                                  protected_attribute_names=['race'])
compas_predictions = BinaryLabelDataset(favorable_label=0, unfavorable_label=1,
                                        df=num_compas_pred,
                                        label_names=['score_text_MedHigh'],
                                        protected_attribute_names=['race'])
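If you want to peek inside these objects, they expose the underlying data as attributes inherited from StructuredDataset (a sketch; see the aif360 docs for the full list of attributes):

broward_true.labels[:5]                # the label column as a numpy array
broward_true.protected_attributes[:5]  # the race column
broward_true.convert_to_dataframe()[0].head()  # back to a DataFrame (returns a (df, metadata) tuple)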

This type also has an ignore_fields attribute that is used when comparisons are made. The requirement is that only the content of the label column is different, but in our case the label names are also different, so we have to tell it that that’s okay.

# because our label columns are named differently, we have to ignore that field
compas_predictions.ignore_fields.add('label_names')
broward_true.ignore_fields.add('label_names')
compas_fair_scorer = fairmetrics.ClassificationMetric(broward_true,
                                                      compas_predictions,
                                                      unprivileged_groups=[{'race': 0}],
                                                      privileged_groups=[{'race': 1}])

Now we can use the scorer to compute scores.

compas_fair_scorer.accuracy()
0.6582038651004168

By default, we get the overall accuracy. This calculation matches what we got using sklearn.

The aif360 metrics each have one parameter, privileged, with a default value of None. When it is None, the metric is computed over the whole dataset; when it is True, it is computed only for the privileged group.

compas_fair_scorer.accuracy(True)
0.6718972895863052

Here that is Caucasian people.

When it is False, it is the unprivileged group, here African-American people:

compas_fair_scorer.accuracy(False)
0.6491338582677165

These again match what we calculated before: the advantaged group (White) for True and the disadvantaged group (Black) for False.
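As a manual check of these per-group numbers, we could group the small dataframe by race and apply the sklearn accuracy score within each group (a sketch; 0 is the unprivileged group and 1 is the privileged group in our mapping):

# per-group accuracy; this should reproduce the aif360 values above
num_compas.groupby('race').apply(
    lambda g: skmetrics.accuracy_score(g['two_year_recid'], g['score_text_MedHigh']))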

compas_fair_scorer.error_rate_difference()
0.02276343131858871

The error rate alone does not tell the whole story, because there are two types of errors. Plus, there are even more ways we can think about whether something is fair or not.

Important

This is extra detail

12.2.1. Disparate Impact#

One way we might want to be fair is if the same percentage of each group of people (Black, \(A=0\), and White, \(A=1\)) gets the favorable outcome (a low score).

In disparate impact, the ratio is of the rates of the favorable prediction, independent of the true outcome. Here, that is the ratio of the percentage of Black people given a low score to the percentage of White people given a low score, where \(\hat{Y}=1\) denotes the favorable prediction:

\[D = \frac{\Pr(\hat{Y} = 1|A=0)}{\Pr(\hat{Y} = 1|A=1)}\]

A ratio of 1 would mean that the favorable prediction is unrelated to race.

This type of fairness is often the kind that most people think of intuitively.

compas_fair_scorer.disparate_impact()
0.6336457196581771
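To connect this number back to the definition, we can compute it manually (a sketch): the favorable outcome is a low score, which in our encoding is score_text_MedHigh == 0, so we take the rate of low scores in each group and divide.

low_rate = (num_compas['score_text_MedHigh'] == 0).groupby(num_compas['race']).mean()
low_rate.loc[0] / low_rate.loc[1]   # unprivileged rate over privileged rate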

US court doctrine (the "four-fifths rule") says that this quantity has to be above 0.8 for employment decisions.

12.3. Equalized Odds Fairness#

The journalists were concerned with the types of errors. They accepted that it is not the COMPAS creators’ fault that Black people get arrested at higher rates (even though actual crime rates are equal across groups; Black neighborhoods tend to be over-policed). They wanted to take what actually happened as given and then see how COMPAS did within each group.

compas_fair_scorer.false_positive_rate(True)
0.49635036496350365
compas_fair_scorer.false_positive_rate(False)
0.2847682119205298

Here, a false positive is someone who incorrectly got a low score: they were predicted not to be re-arrested, but they were.

This is different from how the problem was set up when we used sklearn, because sklearn assumes that 0 is the negative class and 1 is the “positive” class, but AIF360 lets us declare the favorable outcome (positive class) and unfavorable outcome (negative class).

White people were given a low score and then re-arrested almost twice as often as Black people.
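A manual check of where these rates come from (a sketch): with the favorable outcome being a low score, a false positive is someone whose true outcome was unfavorable (two_year_recid == 1) but who was predicted favorable (score_text_MedHigh == 0), and the denominator is everyone whose true outcome was unfavorable.

rearrested = num_compas[num_compas['two_year_recid'] == 1]
# rate of low scores among people who were re-arrested, per group
(rearrested['score_text_MedHigh'] == 0).groupby(rearrested['race']).mean()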

To make a single metric, we might take a ratio. This is where the journalists found bias.

compas_fair_scorer.false_positive_rate_ratio()
0.5737241916634204

We can look at the other type of error

compas_fair_scorer.false_negative_rate(True)
0.22014051522248243
compas_fair_scorer.false_negative_rate(False)
0.4233817701453104
compas_fair_scorer.false_negative_rate_ratio()
1.9232342111919953

Black people were given a medium or high score and then not re-arrested almost twice as often as White people.

So while the accuracy was similar for Black and White people (see the error rate difference above), the algorithm makes opposite types of errors for the two groups.

12.3.1. Average Odds Difference#

This combines the two error rates we looked at separately into a single metric.

\[ \tfrac{1}{2}\left[(FPR_{A = \text{unprivileged}} - FPR_{A = \text{privileged}}) + (TPR_{A = \text{unprivileged}} - TPR_{A = \text{privileged}})\right]\]
compas_fair_scorer.average_odds_difference()
-0.2074117039829009
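Plugging in the rates from above (with \(TPR = 1 - FNR\)) confirms this value:

\[ \tfrac{1}{2}\left[(0.285 - 0.496) + \left((1 - 0.423) - (1 - 0.220)\right)\right] \approx \tfrac{1}{2}\left[-0.211 - 0.203\right] \approx -0.207\]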

After the journalists published the piece, the people who made COMPAS countered with a technical report, arguing that the journalists had measured fairness incorrectly.

The journalists’ two measures, the false positive rate and the false negative rate, use the true outcomes as the denominator.

The COMPAS creators argued that the model should be evaluated in terms of whether a given score means the same thing across races, using the prediction as the denominator.

We can look at their preferred metrics too

compas_fair_scorer.false_omission_rate(True)
0.4051724137931034
compas_fair_scorer.false_omission_rate(False)
0.35046473482777474
compas_fair_scorer.false_omission_rate_ratio()
0.8649767923408909
compas_fair_scorer.false_discovery_rate_ratio()
1.2118532033913119
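These can also be checked by hand, which makes the “prediction as the denominator” idea concrete (a sketch): the false omission rate is the fraction of people predicted unfavorable (a medium or high score) whose true outcome was actually favorable (not re-arrested).

medhigh = num_compas[num_compas['score_text_MedHigh'] == 1]
# rate of no re-arrest among people given a medium or high score, per group
(medhigh['two_year_recid'] == 0).groupby(medhigh['race']).mean()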

On these two metrics, the ratios are closer to 1 and much less disparate.

The creators thought it was important for the score to mean the same thing for every person assigned a score. The journalists thought it was more important for the algorithm to have the same impact on different groups of people.
Ideally, we would like the score to both mean the same thing for different people and to have the same impact.

Researchers established that these are mutually exclusive, provably. We cannot have both, so it is very important to think about what the performance metrics mean and how your algorithm will be used in order to choose how to prepare a model. We will train models starting next week, but knowing these goals in advance is essential.

Importantly, this is not a statistical, computational choice that data can answer for us. This is about human values (and to some extent the law; certain domains have legal protections that require a specific condition).

The Fair Machine Learning book’s classification Chapter has a section on relationships between criteria with the proofs.

12.4. ML Modeling#

We’re going to approach machine learning from the perspective of modeling for a few reasons:

  • model based machine learning streamlines understanding the big picture

  • the model way of interpreting it aligns well with using sklearn

  • thinking in terms of models aligns with incorporating domain expertise, as in our data science definition

This paper by Christopher M. Bishop, a pioneering ML researcher who also wrote one of the widely used graduate-level ML textbooks, details the advantages of a model-based perspective and a more mathematical version of a model-based approach to machine learning. He is also a co-author of an introductory model-based ML book.

In CSC461: Machine Learning, you may encounter an algorithm-focused approach to machine learning, but I think having the model-based perspective first helps you avoid common pitfalls.

12.5. What is a Model?#

A model is a simplified representation of some part of the world. A famous quote about models is:

All models are wrong, but some are useful –George Box[^wiki]

In machine learning, we use models that are generally statistical models.

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process wikipedia

Read more in the Model Based Machine Learning Book.

12.6. Models in Machine Learning#

Starting from a dataset, we first make an additional designation about how we will use the different variables (columns). We will call most of them the features, which we denote mathematically with \(\mathbf{X}\) and we’ll choose one to be the target or labels, denoted by \(\mathbf{y}\).

The core assumption for just about all machine learning is that there exists some function \(f\) so that for the \(i\)th sample

\[ y_i = f(\mathbf{x}_i) \]

\(i\) would be the index of a DataFrame
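For example, with the COMPAS data this split might look like the following (a sketch; the choice of feature columns here is just for illustration):

# features (X) and target (y); the column choice is illustrative only
X = compas_df[['age', 'priors_count', 'decile_score']]
y = compas_df['two_year_recid']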

12.7. Types of Machine Learning#

Then with different additional assumptions we get different types of machine learning:

12.8. Supervised Learning#

We’ll focus on supervised learning first. We can take that same core assumption and use it with additional information about our target variable to determine which learning task we are working on.

\[ y_i = f(\mathbf{x}_i) \]
  • if \(y_i\) are discrete (e.g., flower species) we are doing classification

  • if \(y_i\) are continuous (e.g., height) we are doing regression

flowchart TD
    obs{Do you have a <br> target variable?}
    obs --> |yes| sup[supervised]
    obs --> |no| unsup[unsupervised]
    sup --> ttyp{What type of variable is the target variable?}
    ttyp --> |categorical| cl[classification]
    ttyp --> |continuous| rg[regression]

12.9. Machine Learning Pipeline#

To do machine learning we start with training data which we put as input to the learning algorithm. A learning algorithm might be a generic optimization procedure or a specialized procedure for a specific model. The learning algorithm outputs a trained model or the parameters of the model. When we deploy a model we pair the fit model with a prediction algorithm or decision algorithm to evaluate a new sample in the world.

In experimenting and design, we need testing data to evaluate how well our learning algorithm understood the world. We need to use previously unseen data, because otherwise we can’t tell whether the prediction algorithm is applying a rule that the learning algorithm produced or just looking the result up in a lookup table. This can be thought of like the difference between memorization and understanding.

When the model does well on the training data, but not on test data, we say that it does not generalize well.
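In sklearn, the split into training and testing data is usually done with train_test_split (a sketch, anticipating the modeling we start next; X and y as sketched above):

from sklearn.model_selection import train_test_split

# hold out 20% of the samples as previously unseen test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)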