14. Intro to Machine Learning: Evaluation#
This week we are going to start learning about machine learning.
We are going to do this by looking at how to tell whether machine learning has worked.
This is because:
after you build a model, you have to check whether it worked
if you do not check carefully, it might only work some of the time
it gives you a chance to learn evaluation on its own, instead of evaluation plus a new ML task at the same time
We are going to do this by auditing an algorithm that was built with machine learning.
14.1. What is ML?#
First, what is an Algorithm?
An algorithm is a set of ordered steps to complete a task.
Note that when people outside of CS talk about algorithms that impact people’s lives, these are often not written directly by people anymore. They are often the result of machine learning.
In machine learning, people write an algorithm for how to write an algorithm based on data. This often comes in the form of a statistical model of some sort.
When we do machine learning, this can also be called:
data mining
pattern recognition
modeling
because we are looking for patterns in the data and typically then planning to use those patterns to make predictions or automate a task.
Each of these terms does have slightly different meanings and usage, but sometimes they are used almost interchangeably.
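To make "writing an algorithm for how to write an algorithm" concrete, here is a minimal sketch using toy, made-up data (hours studied vs. passing an exam, not part of this chapter's dataset): we hand labeled examples to a learning algorithm, and it produces a statistical model we can then apply to new inputs.

```python
# A minimal sketch of machine learning: the learning algorithm
# (logistic regression here) writes the decision rule from data.
from sklearn.linear_model import LogisticRegression

# toy, hypothetical data: hours studied -> passed the exam (1) or not (0)
X = [[1], [2], [3], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)  # "write the algorithm" from the data

# apply the learned rule to new inputs
print(model.predict([[2], [9]]))  # [0 1]
```

The model itself is the "algorithm" that gets deployed; evaluating how well such a learned rule works on real outcomes is exactly what the rest of this chapter does with COMPAS.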
14.2. Evaluating Algorithms: Propublica’s COMPAS Audit#
We are going to replicate the audit from ProPublica Machine Bias
ProPublica started the COMPAS debate with the article Machine Bias. Alongside the article, they also released details of their methodology, along with their data and code. This provides a real dataset that can be used for research on how data is used in a criminal justice setting, without researchers having to file their own requests for information, so it has been used and reused many times.
14.3. Propublica COMPAS Data#
The dataset consists of COMPAS scores assigned to defendants over two years (2013-2014) in Broward County, Florida; it was released by ProPublica in a GitHub Repository. These scores are determined by a proprietary algorithm designed to evaluate a person's recidivism risk, the likelihood that they will reoffend. Risk scoring algorithms are widely used by judges to inform their sentencing and bail decisions in the criminal justice system in the United States.
The journalists collected, for each person arrested in 2013 and 2014:
basic demographics
details about what they were charged with and priors
the COMPAS score assigned to them
if they had actually been re-arrested within 2 years of their arrest
This means that we have what the COMPAS algorithm predicted (in the form of a score from 1-10) and what actually happened (re-arrested or not). We can then measure how well the algorithm worked, in practice, in the real world.
import pandas as pd
from sklearn import metrics
import seaborn as sns
We’re going to work with a cleaned copy of the data released by Propublica that also has a minimal subset of features.
compas_clean_url = 'https://raw.githubusercontent.com/ml4sts/outreach-compas/main/data/compas_c.csv'
compas_df = pd.read_csv(compas_clean_url,index_col = 'id')
compas_df.head()
| id | age | c_charge_degree | race | age_cat | score_text | sex | priors_count | days_b_screening_arrest | decile_score | is_recid | two_year_recid | c_jail_in | c_jail_out | length_of_stay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 34 | F | African-American | 25 - 45 | Low | Male | 0 | -1.0 | 3 | 1 | 1 | 2013-01-26 03:45:27 | 2013-02-05 05:36:53 | 10 |
| 4 | 24 | F | African-American | Less than 25 | Low | Male | 4 | -1.0 | 4 | 1 | 1 | 2013-04-13 04:58:34 | 2013-04-14 07:02:04 | 1 |
| 8 | 41 | F | Caucasian | 25 - 45 | Medium | Male | 14 | -1.0 | 6 | 1 | 1 | 2014-02-18 05:08:24 | 2014-02-24 12:18:30 | 6 |
| 10 | 39 | M | Caucasian | 25 - 45 | Low | Female | 0 | -1.0 | 1 | 0 | 0 | 2014-03-15 05:35:34 | 2014-03-18 04:28:46 | 2 |
| 14 | 27 | F | Caucasian | 25 - 45 | Low | Male | 0 | -1.0 | 4 | 0 | 0 | 2013-11-25 06:31:06 | 2013-11-26 08:26:57 | 1 |
Here is an explanation of these features:
`age`: defendant’s age
`c_charge_degree`: degree charged (Misdemeanor or Felony)
`race`: defendant’s race
`age_cat`: defendant’s age quantized in “less than 25”, “25-45”, or “over 45”
`score_text`: COMPAS score category: ‘low’ (1 to 5), ‘medium’ (5 to 7), and ‘high’ (8 to 10)
`sex`: defendant’s gender
`priors_count`: number of prior charges
`days_b_screening_arrest`: number of days between the charge date and the arrest where the defendant was screened for a COMPAS score
`decile_score`: COMPAS score from 1 to 10 (low risk to high risk)
`is_recid`: whether the defendant recidivized
`two_year_recid`: whether the defendant recidivized within two years
`c_jail_in`: date the defendant was jailed
`c_jail_out`: date the defendant was released from jail
`length_of_stay`: length of jail stay
14.4. One-hot Encoding#
We will first audit how good the algorithm is by treating each prediction as either high or not high. One way to get there is to transform the `score_text` column from one column with three values into 3 binary columns.
compas_df = pd.get_dummies(compas_df,columns=['score_text'])
compas_df.head()
| id | age | c_charge_degree | race | age_cat | sex | priors_count | days_b_screening_arrest | decile_score | is_recid | two_year_recid | c_jail_in | c_jail_out | length_of_stay | score_text_High | score_text_Low | score_text_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 34 | F | African-American | 25 - 45 | Male | 0 | -1.0 | 3 | 1 | 1 | 2013-01-26 03:45:27 | 2013-02-05 05:36:53 | 10 | 0 | 1 | 0 |
| 4 | 24 | F | African-American | Less than 25 | Male | 4 | -1.0 | 4 | 1 | 1 | 2013-04-13 04:58:34 | 2013-04-14 07:02:04 | 1 | 0 | 1 | 0 |
| 8 | 41 | F | Caucasian | 25 - 45 | Male | 14 | -1.0 | 6 | 1 | 1 | 2014-02-18 05:08:24 | 2014-02-24 12:18:30 | 6 | 0 | 0 | 1 |
| 10 | 39 | M | Caucasian | 25 - 45 | Female | 0 | -1.0 | 1 | 0 | 0 | 2014-03-15 05:35:34 | 2014-03-18 04:28:46 | 2 | 0 | 1 | 0 |
| 14 | 27 | F | Caucasian | 25 - 45 | Male | 0 | -1.0 | 4 | 0 | 0 | 2013-11-25 06:31:06 | 2013-11-26 08:26:57 | 1 | 0 | 1 | 0 |
Note the last 3 columns: one binary indicator column per score category.
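If the full dataset is not handy, the same transformation can be seen on a tiny toy column (made up here for illustration, not part of the COMPAS data): one categorical column with three values becomes three binary indicator columns, named by prefixing the original column name.

```python
import pandas as pd

# toy example of one-hot encoding: one column, three categories
toy = pd.DataFrame({'score_text': ['Low', 'High', 'Medium', 'Low']})
toy_encoded = pd.get_dummies(toy, columns=['score_text'])

# columns are created in sorted category order:
# score_text_High, score_text_Low, score_text_Medium
print(toy_encoded)
```

Each row has exactly one 1 across the three new columns, which is why picking out `score_text_High` alone gives us a binary "high or not high" prediction.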
14.5. Performance Metrics in sklearn#
The first thing we usually check is the accuracy: the percentage of all samples that are correct.
metrics.accuracy_score(compas_df['two_year_recid'],compas_df['score_text_High'])
0.6288366805608185
However this does not tell us anything about what types of mistakes the algorithm made. The type of mistake often matters in terms of how we trust or deploy an algorithm. We use a confusion matrix to describe the performance in more detail.
A confusion matrix counts the number of samples of each true category that were predicted to be in each category. In this case we have a binary prediction problem: people either are re-arrested (truth) or not, and were given a high score (prediction) or not. In binary problems we adopt a common language of labeling one outcome/predicted value positive and the other negative. We do this not based on the social value of the outcome, but on the numerical encoding.
In this data, being re-arrested is indicated by a 1 in the two_year_recid
column, so this is the positive class and not being re-arrested is 0, so the negative class. Similarly a high score is 1, so that is the positive prediction, and not high is 0, so that is a negative prediction.
Note
these terms can be used in any sort of detection problem, whether machine learning is used or not
`sklearn.metrics` provides a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function that we can use.
metrics.confusion_matrix(compas_df['two_year_recid'],compas_df['score_text_High'])
array([[2523, 272],
[1687, 796]])
Since this is a binary problem, we have 4 possible outcomes:
true negatives (\(C_{0,0}\)): did not get a high score and were not re-arrested
false negatives (\(C_{1,0}\)): did not get a high score and were re-arrested
false positives (\(C_{0,1}\)): got a high score and were not re-arrested
true positives (\(C_{1,1}\)): got a high score and were re-arrested
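A quick check on toy labels (made up here, not the COMPAS data) shows how the four outcomes map onto sklearn's confusion matrix layout, and how accuracy is just the diagonal cells over the total:

```python
from sklearn import metrics

# toy labels: rows of the confusion matrix are the true class,
# columns are the predicted class, so
# C[0,0]=TN, C[0,1]=FP, C[1,0]=FN, C[1,1]=TP
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2

# accuracy = correct cells (the diagonal) / total samples
print((tn + tp) / len(y_true))
print(metrics.accuracy_score(y_true, y_pred))  # same value
```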
With these, we can revisit accuracy: $\( A = \frac{C_{0,0} + C_{1,1}}{C_{0,0} + C_{0,1} + C_{1,0} + C_{1,1}} \)$
and we can define new scores. Two common ones in CS are recall and precision.
Recall is $\( R = \frac{C_{1,1}}{C_{1,0} + C_{1,1}} \)$
metrics.recall_score(compas_df['two_year_recid'],compas_df['score_text_High'])
0.3205799436165928
That is, among the truly positive class how many were correctly predicted? In COMPAS, it’s the percentage of the re-arrested people who got a high score.
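On the same kind of toy labels (made up here, not the COMPAS data), recall can be checked by hand against `metrics.recall_score`:

```python
from sklearn import metrics

# toy labels: 4 truly positive samples, of which 2 were predicted positive
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# recall = TP / (FN + TP): the fraction of true positives that were found
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tp / (fn + tp))                        # 0.5
print(metrics.recall_score(y_true, y_pred))  # same value
```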
Precision is $\( P = \frac{C_{1,1}}{C_{0,1} + C_{1,1}} \)$
metrics.precision_score(compas_df['two_year_recid'],compas_df['score_text_High'])
0.7453183520599251
That is, among the positive predictions, what percentage was correct? In COMPAS, that is: among the people who got a high score, what percentage were re-arrested?
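Precision can be checked by hand the same way on toy labels (made up here, not the COMPAS data):

```python
from sklearn import metrics

# toy labels: 3 positive predictions, of which 2 were correct
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# precision = TP / (FP + TP): the fraction of positive predictions
# that were correct
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tp / (fp + tp))                           # 2/3
print(metrics.precision_score(y_true, y_pred))  # same value
```

Note the trade-off visible even in this tiny example: the same predictions have precision 2/3 but recall 1/2, because precision and recall divide the same true positives by different totals.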
Important
Install aif360 before class Friday.
14.6. Questions after class#
All were clarifying details that I expanded above.