15. Performance Metrics continued#
15.1. Logistics#
A6 will be posted ASAP! (I promise it will be straightforward in terms of the actual code; you do need to interpret the results carefully, though.)
A5 grading is behind; if it's not done by early tomorrow (e.g. 10am), I'll extend the portfolio deadline accordingly.
15.2. Completing the COMPAS Audit#
Let’s start where we left off, plus some additional imports
import pandas as pd
from sklearn import metrics as skmetrics
from aif360 import metrics as fairmetrics
from aif360.datasets import BinaryLabelDataset
import seaborn as sns
compas_clean_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
compas_df = pd.read_csv(compas_clean_url,index_col = 'id')
compas_df = pd.get_dummies(compas_df,columns=['score_text'],)
WARNING:root:No module named 'tempeh': LawSchoolGPADataset will be unavailable. To install, run:
pip install 'aif360[LawSchoolGPA]'
WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:
pip install 'aif360[AdversarialDebiasing]'
WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:
pip install 'aif360[AdversarialDebiasing]'
WARNING:root:No module named 'fairlearn': ExponentiatedGradientReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'
WARNING:root:No module named 'fairlearn': GridSearchReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'
WARNING:root:No module named 'fairlearn': GridSearchReduction will be unavailable. To install, run:
pip install 'aif360[Reductions]'
Warning
We'll get some warnings on import; they are okay, and if you run the cell again they will go away.
To review:
compas_df.head()
| id | age | c_charge_degree | race | age_cat | sex | priors_count | days_b_screening_arrest | decile_score | is_recid | two_year_recid | c_jail_in | c_jail_out | length_of_stay | score_text_High | score_text_Low | score_text_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 34 | F | African-American | 25 - 45 | Male | 0 | -1.0 | 3 | 1 | 1 | 2013-01-26 03:45:27 | 2013-02-05 05:36:53 | 10 | 0 | 1 | 0 |
| 4 | 24 | F | African-American | Less than 25 | Male | 4 | -1.0 | 4 | 1 | 1 | 2013-04-13 04:58:34 | 2013-04-14 07:02:04 | 1 | 0 | 1 | 0 |
| 8 | 41 | F | Caucasian | 25 - 45 | Male | 14 | -1.0 | 6 | 1 | 1 | 2014-02-18 05:08:24 | 2014-02-24 12:18:30 | 6 | 0 | 0 | 1 |
| 10 | 39 | M | Caucasian | 25 - 45 | Female | 0 | -1.0 | 1 | 0 | 0 | 2014-03-15 05:35:34 | 2014-03-18 04:28:46 | 2 | 0 | 1 | 0 |
| 14 | 27 | F | Caucasian | 25 - 45 | Male | 0 | -1.0 | 4 | 0 | 0 | 2013-11-25 06:31:06 | 2013-11-26 08:26:57 | 1 | 0 | 1 | 0 |
Notice today we imported the sklearn.metrics module with an alias.
skmetrics.accuracy_score(compas_df['two_year_recid'],compas_df['score_text_High'])
0.6288366805608185
It is more common to use medium or high (i.e. not low) to check accuracy. We can calculate this by either summing the two columns or inverting the low column. We'll do it as not low for now, to review using apply.
Try it Yourself
A good exercise to review data manipulation is to try creating the score_text_MedHigh column by adding the other two columns together (because "medium or high" on booleans is the same as medium + high on 0/1 ints); one possible sketch is shown just below.
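If you want to check that exercise, here is a minimal sketch of the summing approach. The column name score_text_MedHigh_sum is hypothetical (so it doesn't clash with the column created next), and the astype(int) calls are just in case get_dummies produced boolean columns.
# sketch: medium OR high is the same as medium + high when the dummies are 0/1
compas_df['score_text_MedHigh_sum'] = (compas_df['score_text_Medium'].astype(int)
                                       + compas_df['score_text_High'].astype(int))
Below, the notes instead build the column by inverting the low column with apply.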
int_not = lambda a:int(not(a))
compas_df['score_text_MedHigh'] = compas_df['score_text_Low'].apply(int_not)
skmetrics.accuracy_score(compas_df['two_year_recid'],
compas_df['score_text_MedHigh'])
0.6582038651004168
We can see this gives us a slightly higher score, but still not that great.
The int_not lambda is a function:
type(int_not)
function
It is equivalent to the following, but in a more compact notation.
def int_not_f(a):
return int(not(a))
It flips a 0 to a 1:
int_not(0)
1
and the other way:
int_not(1)
0
As a quick check on how the text scores relate to the decile scores, we can look at the minimum decile score for each value of score_text_Medium:
compas_df.groupby('score_text_Medium')['decile_score'].min()
score_text_Medium
0 1
1 5
Name: decile_score, dtype: int64
compas_race = compas_df.groupby('race')
15.3. Per Group scores with groupby#
To group by race and then compute the score for each group, we can use a lambda again, with apply:
acc_fx = lambda d: skmetrics.accuracy_score(d['two_year_recid'],
d['score_text_MedHigh'])
compas_race.apply(acc_fx,)
race
African-American 0.649134
Caucasian 0.671897
dtype: float64
In this case it gives a Series, but with reset_index we can make it a DataFrame and then rename the column to label it as accuracy.
compas_race.apply(acc_fx,).reset_index().rename(columns={0:'accuracy'})
|   | race | accuracy |
|---|---|---|
| 0 | African-American | 0.649134 |
| 1 | Caucasian | 0.671897 |
That lambda + apply is equivalent to:
race_acc = []
for race, rdf in compas_race:
acc = skmetrics.accuracy_score(rdf['two_year_recid'],
rdf['score_text_MedHigh'])
race_acc.append([race,acc])
pd.DataFrame(race_acc, columns =['race','accuracy'])
|   | race | accuracy |
|---|---|---|
| 0 | African-American | 0.649134 |
| 1 | Caucasian | 0.671897 |
15.4. Using AIF360#
The AIF360 package implements fairness metrics, some of which are derived from metrics we have seen and some of which are new. The documentation has the full list in a summary table with plain-English explanations, and the details include most of the equations.
However, it has a few requirements:
- its constructor takes two BinaryLabelDataset objects
- these objects must be the same except for the label column
- the constructor for BinaryLabelDataset only accepts all-numerical DataFrames
So, we have some preparation to do.
First, we'll make a numerical copy of the compas_df columns that we need. The only non-numerical column that we need is race, so we'll make a dict to replace those values.
race_num_map = {r:i for i,r, in enumerate(compas_df['race'].value_counts().index)}
race_num_map
{'African-American': 0, 'Caucasian': 1}
And here we select the columns and replace the values:
required_cols = ['race','two_year_recid','score_text_MedHigh']
num_compas = compas_df[required_cols].replace(race_num_map)
num_compas.head(2)
| id | race | two_year_recid | score_text_MedHigh |
|---|---|---|---|
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
Next we will make two versions: one with race & the ground truth and the other with race & the predictions. It's easiest to drop the column we don't want.
num_compas_true = num_compas.drop(columns=['score_text_MedHigh'])
num_compas_pred = num_compas.drop(columns=['two_year_recid'])
Now we make the BinaryLabelDataset objects; this type comes from AIF360 too. Basically, it is a DataFrame with extra attributes, some specific to it and some inherited from StructuredDataset.
# 0 is the favorable label (not rearrested / low score) and 1 is unfavorable;
# here we want the actual (true) outcome as the label
broward_true = BinaryLabelDataset(0, 1, df=num_compas_true,
                                  label_names=['two_year_recid'],
                                  protected_attribute_names=['race'])
# same favorable/unfavorable coding, but labeled with the predictions
compas_predictions = BinaryLabelDataset(0, 1, df=num_compas_pred,
                                        label_names=['score_text_MedHigh'],
                                        protected_attribute_names=['race'])
Try it Yourself
Remember, you can inspect any object using the __dict__ attribute (a small example follows).
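For instance, a quick sketch of how to peek at the attribute names (the exact fields depend on the aif360 version):
# list the attribute names stored on the BinaryLabelDataset object
compas_predictions.__dict__.keys()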
This type also has an ignore_fields attribute that is used when comparisons are made. The requirement is that only the content of the label column is different, but in our case the label names are also different, so we have to tell it that that's okay.
compas_predictions.ignore_fields.add('label_names')
broward_true.ignore_fields.add('label_names')
Now, we can instantiate our metric object:
compas_fair_scorer = fairmetrics.ClassificationMetric(broward_true,
compas_predictions,
unprivileged_groups=[{'race':0}],
privileged_groups = [{'race':1}])
And finally we can compute! First, we can verify that we get the same accuracy as before
compas_fair_scorer.accuracy()
0.6582038651004168
The aif360 metrics have one parameter, privileged, with a default value of None. When it is None, the metric is computed over the whole dataset. When it is True, the metric is computed for only the privileged group.
compas_fair_scorer.accuracy(True)
0.6718972895863052
Here that is Caucasian people.
When False, it's the unprivileged group, here African-American people.
compas_fair_scorer.accuracy(False)
0.6491338582677165
We can also compute other scores. Many fairness scores are ratios of the unprivileged group's score to the privileged group's score.
For disparate impact, the ratio is of the rate of the favorable ("positive") prediction, which here is a low score, independent of the true outcome. So this is the ratio of the percentage of Black people given a low score to the percentage of white people given a low score.
compas_fair_scorer.disparate_impact()
0.6336457196581771
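As a rough sanity check (a sketch, assuming disparate impact here is the ratio of the rates of the favorable prediction, i.e. a low score, as described above), we can compute a similar ratio by hand with the groupby object from earlier:
# rate of the favorable prediction (low score, coded 0 in score_text_MedHigh) per race
low_rate = compas_race['score_text_MedHigh'].apply(lambda s: (s == 0).mean())
# ratio of the unprivileged group's rate to the privileged group's rate
low_rate['African-American'] / low_rate['Caucasian']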
The courts use an "80% rule," saying that if this ratio is above .8 for things like employment decisions, it is close enough. We can also compare overall error rates with the error rate ratio (the unprivileged group's error rate divided by the privileged group's):
compas_fair_scorer.error_rate_ratio()
1.0693789798014377
We can also take ratios of the other per-group rates. This is where the journalists found bias.
compas_fair_scorer.false_positive_rate_ratio()
0.5737241916634204
Black people who were rearrested had been given a low score only a little more than half as often as white people who were rearrested. (Put the other way, white people who went on to be rearrested were given a low score almost twice as often.)
compas_fair_scorer.false_negative_rate_ratio()
1.9232342111919953
Black people who were not rearrested had been given a medium or high score almost twice as often as white people who were not rearrested.
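As a cross-check on these two ratios, here is a rough sketch that computes the same rates by hand from the DataFrame. Keep in mind that in this setup the favorable ("positive") label is 0: a low score for the predictions and no rearrest for the true outcome.
# per-group false positive and false negative rates, with the favorable label coded as 0
def fp_fn_rates(d):
    rearrested = d['two_year_recid'] == 1
    low_score = d['score_text_MedHigh'] == 0
    return pd.Series({
        # false positive rate: predicted favorable (low score) among those who were rearrested
        'fpr': (low_score & rearrested).sum() / rearrested.sum(),
        # false negative rate: predicted unfavorable (med/high score) among those not rearrested
        'fnr': (~low_score & ~rearrested).sum() / (~rearrested).sum()})

group_rates = compas_race.apply(fp_fn_rates)
# ratios of the unprivileged group's rates to the privileged group's rates
group_rates.loc['African-American'] / group_rates.loc['Caucasian']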
So while the accuracy was similar for Black and white people (see the error rate ratio above), the algorithm makes the opposite types of errors for the two groups.
After the journalists published the piece, the people who made COMPAS countered with a technical report, arguing that the journalists had measured fairness incorrectly.
The journalists' two measures, the false positive rate and the false negative rate, use the true outcomes as the denominator.
The COMPAS creators argued that the model should be evaluated in terms of whether a given score means the same thing across races, which means using the prediction as the denominator. The false omission rate and the false discovery rate do exactly that: they condition on the prediction instead of the true outcome.
compas_fair_scorer.false_omission_rate_ratio()
0.8649767923408909
compas_fair_scorer.false_discovery_rate_ratio()
1.2118532033913119
On these two metrics, the ratio is closer to 1 and much less disparate.
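To make the change of denominator concrete, here is one more sketch in the same style as before: the false discovery rate per group, i.e. among people who were given a low (favorable) score, how often was each group actually rearrested? (This assumes the same coding as above.)
# false discovery rate per group: actually rearrested among those predicted favorable (low score)
fdr_fx = lambda d: (d.loc[d['score_text_MedHigh'] == 0, 'two_year_recid'] == 1).mean()
group_fdr = compas_race.apply(fdr_fx)
group_fdr['African-American'] / group_fdr['Caucasian']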
The creators thought it was important for the score to mean the same thing for every person assigned a score. The journalists thought it was more important for the algorithm to have the same impact on different groups of people.
Ideally, we would like the score to both mean the same thing for different people and to have the same impact.
Researchers have proven that, when the groups have different base rates, these criteria are mutually exclusive. We cannot have both, so it is very important to think about what the performance metrics mean and how your algorithm will be used in order to choose how to prepare and evaluate a model. We will train models starting next week, but knowing these goals in advance is essential.
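One way to see the tension (a sketch of the standard argument, not a full proof): for a binary classifier, the false positive rate, the false negative rate, the positive predictive value, and the base rate p of the positive outcome in a group are linked by
$$ FPR = \frac{p}{1-p} \cdot \frac{1-PPV}{PPV} \cdot (1-FNR) $$
If two groups have different base rates p, then the error rates (FPR, FNR) and the predictive value (PPV) cannot all be equal across the groups at the same time, except in degenerate cases such as a perfect predictor. The chapter referenced below has the precise statements and proofs.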
Importantly, this is not a statistical, computational choice that data can answer for us. This is about human values (and to some extent the law; certain domains have legal protections that require a specific condition).
The Fair Machine Learning book's classification chapter has a section on the relationships between the criteria, with the proofs.
Important
We used ProPublica's COMPAS dataset to replicate (parts of, with different tools) their analysis. That is, they collected the dataset in order to audit the COMPAS algorithm, and we used it for the same purpose (and to learn about model evaluation). This dataset is not designed for training models, even though it has been used that way many times; that is not the best use of it, and I do not recommend using this dataset for future assignments.
15.5. Portfolio Reminder#
If you do not need level 3s to be happy with your grade for the course (e.g. you want a B) and you have all of the achievements so far, you can skip the portfolio submission. If you do not need level 3 but you are not on track, you should submit to get caught up; this can be (and is advised to be) reflective revisions of past assignment(s).
If you need level 3 achievements for your desired grade, then you can pick a subset of the eligible skills (or all of them) and add new work that shows you have learned those skills according to the level 3 checklists. The ideas page has example formats for that new work.
15.6. Questions After Class#
Today’s questions were only clarifying, so hopefully re-reading the notes is enough. If not, post a question as an issue!