11. Missing Data and Inconsistent Coding
import pandas as pd
import seaborn as sns
import numpy as np

# make plots look nicer and increase font size
sns.set_theme(font_scale=2, palette='colorblind')

# small toy DataFrame with missing values, for experimenting
na_toy_df = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, np.nan]])

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url)

rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning. Today, we’ll deal with how to fix the values within the data. To see the types of things that can go wrong, look at:

- the Stanford Policy Lab Open Policing Project data readme
- the ProPublica Machine Bias “How we acquired data” section
11.1. Missing Values#
Dealing with missing data is a whole research area; there isn’t one solution. In 2020 there was a workshop on it. There are also many classic approaches, both when training and when applying models; see, for example, an application in breast cancer detection.
In pandas, even the representation of missing values is under experimentation. Currently it uses `numpy.nan`, but `pd.NA` is the experimental alternative. Missing values can even cause a column’s datatype to change.
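For example, here is a minimal demonstration on a toy Series (rather than the course data):

```python
import pandas as pd
import numpy as np

# an all-integer column is stored as int64
print(pd.Series([1, 2, 3]).dtype)       # int64

# adding a NaN forces the whole column to float64, because np.nan is a float
print(pd.Series([1, 2, np.nan]).dtype)  # float64
```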
Pandas gives a few basic tools:

- drop rows or columns with missing values (`dropna`)
- fill in missing values (`fillna`)
coffee_df.head()
| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
| 1 | 2 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 1 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
| 2 | 3 | Arabica | grounds for health admin | Guatemala | san marcos barrancas "san cristobal cuch | NaN | NaN | NaN | NaN | 1600 - 1800 m | ... | NaN | 0 | May 31st, 2011 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 1600.0 | 1800.0 | 1700.0 |
| 3 | 4 | Arabica | yidnekachew dabessa | Ethiopia | yidnekachew dabessa coffee plantation | NaN | wolensu | NaN | yidnekachew debessa coffee plantation | 1800-2200 | ... | Green | 2 | March 25th, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1800.0 | 2200.0 | 2000.0 |
| 4 | 5 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 2 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |

5 rows × 44 columns
The `Lot.Number` column has a lot of NaN values; how can we explore it?

We can look at the type:
coffee_df['Lot.Number'].dtype
dtype('O')
The `'O'` stands for object, the dtype pandas uses for strings or mixed types. We can also look at the value counts:
coffee_df['Lot.Number'].value_counts()
1 18
020/17 6
019/17 5
2 3
102 3
..
11/23/0696 1
3-59-2318 1
8885 1
5055 1
017-053-0211/ 017-053-0212 1
Name: Lot.Number, Length: 221, dtype: int64
Filling can be a good choice if you know how to fill reasonably and don’t have data to spare by dropping. For example (sketched in code below):

- you can approximate with another column
- you can approximate with that column’s values from other rows
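Here is a minimal sketch of both strategies using the altitude columns from `coffee_df`; these calls illustrate the idea, and neither result is saved:

```python
# approximate with another column: borrow the mean altitude where the low is missing
coffee_df['altitude_low_meters'].fillna(coffee_df['altitude_mean_meters'])

# approximate from other rows: fill with the column's own mean
coffee_df['altitude_low_meters'].fillna(coffee_df['altitude_low_meters'].mean())
```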
We see that a lot of the values are ‘1’. Maybe we know that, when the data was collected, if the farm only had one lot some people recorded ‘1’ and others left it blank. So we could fill in with 1:
coffee_df['Lot.Number'].fillna(1).head()
0 1
1 1
2 1
3 1
4 1
Name: Lot.Number, dtype: object
coffee_df['Lot.Number'].head()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: Lot.Number, dtype: object
Tip

Note that even after we called `fillna`, when we display the column again the original data is unchanged; `fillna` returns a new Series rather than modifying `coffee_df`.
To save the filled-in column we have a few choices (sketched below):

- use the `inplace` parameter. This doesn’t offer performance advantages: it still copies the object, then reassigns the pointer. It is also under discussion to be deprecated.
- write to a new DataFrame
- add a column
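A sketch of the three options, for illustration only (actually running the first would change data that later cells rely on):

```python
# option 1: fill in place, using a dict to target just the one column
coffee_df.fillna({'Lot.Number': 1}, inplace=True)

# option 2: write the result to a new DataFrame
filled_df = coffee_df.fillna({'Lot.Number': 1})

# option 3: add a new column (what we do next)
coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna(1)
```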
We’ll use adding a column:
coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna(1)
Question in Class

When I use `value_counts` it treats the filled ones as different. Why?
coffee_df['Lot.Number'].value_counts()
1 18
020/17 6
019/17 5
2 3
102 3
..
11/23/0696 1
3-59-2318 1
8885 1
5055 1
017-053-0211/ 017-053-0212 1
Name: Lot.Number, Length: 221, dtype: int64
coffee_df['lot_number_clean'].value_counts()
1 1041
1 18
020/17 6
019/17 5
102 3
...
3-59-2318 1
8885 1
5055 1
MCCFWXA15/16 1
017-053-0211/ 017-053-0212 1
Name: lot_number_clean, Length: 222, dtype: int64
The original values in the column are strings, so the integer 1 we filled with is counted separately from the string ‘1’. If we switch to filling with ‘1’ as a string, then all of the one values are treated as the same thing.
coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna('1')
coffee_df['lot_number_clean'].value_counts()
1 1059
020/17 6
019/17 5
102 3
103 3
...
3-59-2318 1
8885 1
5055 1
MCCFWXA15/16 1
017-053-0211/ 017-053-0212 1
Name: lot_number_clean, Length: 221, dtype: int64
This was our goal, so in this case, it’s the right thing to do to overwrite the value.
Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.
Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.
To illustrate how `dropna` works, we’ll use the `shape` attribute:
coffee_df.shape
(1311, 45)
By default, it drops any row with one or more NaN
values.
coffee_df.dropna().shape
(130, 45)
We could instead tell it to only drop rows with NaN
in a subset of the columns.
coffee_df.dropna(subset=['altitude_low_meters']).shape
(1084, 45)
Whatever you do, document it.
Try it Yourself

Use the `na_toy_df` DataFrame defined in the first cell to experiment with the `subset` and `axis` parameters and understand them better.
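One way to start (a sketch, not the only experiments worth trying; recall that row 1 of `na_toy_df` is padded with NaN in columns 2 and 3):

```python
na_toy_df.dropna()               # drops row 1, the only row containing NaN
na_toy_df.dropna(axis=1)         # drops columns 2 and 3, which contain NaN
na_toy_df.dropna(subset=[0, 1])  # only checks columns 0 and 1, so keeps both rows
```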
In the Open Policing Project Data Summary we saw that they presented a summary showing which variables had at least 70% non-missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the `thresh` parameter and `axis=1` to drop along columns.
n_rows, n_cols = coffee_df.shape
coffee_df.dropna(thresh=0.7*n_rows, axis=1).shape
(1311, 44)
This dataset is actually in pretty good shape, but if we use a more stringent threshold it drops more columns.
coffee_df.dropna(thresh=0.85*n_rows, axis=1).shape
(1311, 34)
Important

Everything after this is new material that we did not have time for in class, but it is important and helpful for your assignment (and for your portfolio).
11.2. Inconsistent Values
This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use `value_counts` and sort alphabetically by the values from the original data, so that similar entries will be consecutive in the list. In the `value_counts()` Series, the values from `coffee_df` become the index, so we use `sort_index`.

Let’s look at the `In.Country.Partner` column:
coffee_df['In.Country.Partner'].value_counts().sort_index()
AMECAFE 205
Africa Fine Coffee Association 49
Almacafé 178
Asociacion Nacional Del Café 155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Asociación de Cafés Especiales de Nicaragua 8
Blossom Valley International 58
Blossom Valley International\n 1
Brazil Specialty Coffee Association 67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Centro Agroecológico del Café A.C. 8
Coffee Quality Institute 7
Ethiopia Commodity Exchange 18
Instituto Hondureño del Café 60
Kenya Coffee Traders Association 22
METAD Agricultural Development plc 15
NUCOFFEE 36
Salvadoran Coffee Council 11
Specialty Coffee Ass 1
Specialty Coffee Association 295
Specialty Coffee Association of Costa Rica 42
Specialty Coffee Association of Indonesia 10
Specialty Coffee Institute of Asia 16
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Uganda Coffee Development Authority 22
Yunnan Coffee Exchange 12
Name: In.Country.Partner, dtype: int64
We can see there’s only one `Blossom Valley International\n` but 58 `Blossom Valley International`; the former is likely a typo, especially since `\n` is a special character for a newline. The same goes for ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.

This is another job for dictionaries: we make one with the value to replace as the key and the value to insert as the value.
partner_corrections = {'Blossom Valley International\n': 'Blossom Valley International',
                       'Specialty Coffee Ass': 'Specialty Coffee Association'}

coffee_df['in_country_partner_clean'] = coffee_df['In.Country.Partner'].replace(
    to_replace=partner_corrections)
coffee_df['in_country_partner_clean'].value_counts().sort_index()
AMECAFE 205
Africa Fine Coffee Association 49
Almacafé 178
Asociacion Nacional Del Café 155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Asociación de Cafés Especiales de Nicaragua 8
Blossom Valley International 59
Brazil Specialty Coffee Association 67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Centro Agroecológico del Café A.C. 8
Coffee Quality Institute 7
Ethiopia Commodity Exchange 18
Instituto Hondureño del Café 60
Kenya Coffee Traders Association 22
METAD Agricultural Development plc 15
NUCOFFEE 36
Salvadoran Coffee Council 11
Specialty Coffee Association 296
Specialty Coffee Association of Costa Rica 42
Specialty Coffee Association of Indonesia 10
Specialty Coffee Institute of Asia 16
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Uganda Coffee Development Authority 22
Yunnan Coffee Exchange 12
Name: in_country_partner_clean, dtype: int64
And now we see the corrected values. If the errors follow a systematic pattern, we can also apply a function instead of enumerating every correction.
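For example, since the only difference in the first pair was a trailing newline, a more systematic (hypothetical) fix would strip whitespace from every value:

```python
# strip leading/trailing whitespace (including '\n') from every partner name;
# the .str accessor skips over any NaN values safely
coffee_df['in_country_partner_clean'] = coffee_df['In.Country.Partner'].str.strip()
```

Note that this would not catch the ‘Specialty Coffee Ass’ abbreviation, so pattern-based fixes and correction dictionaries often get combined.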
11.3. Fixing data at load time
Explore some of the different parameters in `read_csv`. How can we read in data that looks like this?

pd.read_csv('fancy_formatting.xlsx', header=list(range(4)))

Many problems can be repaired with parameters in `read_csv`.
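As an illustration, here is a sketch of common `read_csv` repairs; the file name and the specific parameter values are hypothetical:

```python
clean_df = pd.read_csv(
    'messy_data.csv',       # hypothetical file
    index_col=0,            # use the first column as the index
    header=0,               # which row holds the column names
    skiprows=2,             # skip metadata rows above the header
    na_values=['-', '?'],   # extra strings to treat as missing
    decimal=',',            # parse European-style decimals like 3,14
)
```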
11.4. A Cleaning Data Recipe
Not everything possible, but good enough for this course:

1. Can you use parameters to read the data in better?
2. Fix the index and column headers (making these easier to use makes the rest easier).
3. Is the data structured well?
4. Are there missing values?
5. Do the datatypes match what you expect by looking at the head or a sample?
6. Are categorical variables represented in a usable way?
7. Does your analysis require filtering or augmenting the data?
Things to keep in mind (illustrated in the sketch below):

- always save new copies of data when you mutate it
- add new columns rather than overwriting columns
- long variable names are better than ambiguous names
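A hypothetical skeleton that follows this recipe; the file name and column names are invented for illustration:

```python
# steps 1-2: read with parameters that fix the index and header up front
raw_df = pd.read_csv('survey.csv', index_col=0, header=0, na_values=['?'])

# tidy the column names in a new DataFrame, without losing the original
tidy_df = raw_df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))

# fix a datatype by adding a clearly named column rather than overwriting
tidy_df['age_clean'] = pd.to_numeric(tidy_df['age'], errors='coerce')
```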
11.5. Your observations from Monday
I promised we’d come back to your observations on what problems could occur in data. Here they are, organized by rough categories of when/how to fix them.
We can fix while reading in data:

- decimal indicated with ‘,’ instead of ‘.’, so pandas saw the value as a string rather than a float
- missing header
- reading the index as a column
- large datasets might be too slow or not fit in memory
- missing data represented with a value or special character
We can fix by reshaping data:

- data can get read into tables in bizarre ways depending on how the data was entered originally
- every value in one column, instead of separated
We can repair by changing values or filtering (a few of these are sketched in code below):

- information represented inconsistently, e.g. “Value” and “ Value ”, or twenty-two instead of 22
- blank rows, blank columns, or data that is N/A
- date/time information can be represented lots of different ways
- categorical variables represented with numbers that are ambiguous
- spaces or other symbols in column names
- some numbers as strings, others as ints, within a column
- symbols being misinterpreted
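Hedged sketches of a few of those repairs, on toy Series rather than the course data:

```python
# normalize inconsistent whitespace and case
messy = pd.Series([' Value ', 'Value', 'value'])
messy.str.strip().str.lower()

# make a column with mixed strings and ints uniformly numeric
pd.to_numeric(pd.Series(['1', 2, '3']))

# parse date strings into real datetime values
pd.to_datetime(pd.Series(['2021-10-06', '2021-10-07']))
```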
Real problems, but beyond our scope:

- corrupt data files
11.6. More Practice
Instead of more practice with these manipulations, below are more examples of cleaning data, to see how these types of manipulations get used. Your goal here is not to memorize every possible thing, but to build a general idea of what good data looks like and good habits for cleaning data and keeping it reproducible.
Also here are some tips on general data management and organization.
This article is a comprehensive discussion of data cleaning.