10. Cleaning Data: fixing values#
So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning.
Today, we’ll deal with how to fix the values within the data.
Examples of how values can be represented poorly:
Stanford Policy Lab Open Policing Project data readme
ProPublica Machine Bias, the “How we acquired data” section
Hint
You can treat this one project as multiple cleaned datasets and study it more carefully for A4.
10.1. Admin#
Watch this repo to get announcements.
Grading will be updated over the weekend!
Important
Use the Submit workflow on your Actions tab for each assignment
import pandas as pd
import seaborn as sns
import numpy as np

# toy DataFrame with a missing value, for demonstrating NaN handling
na_toy_df = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, np.nan, 4]])

# make plots look nicer and increase font size
sns.set_theme(font_scale=2)

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url)

rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
10.2. Missing Values#
Dealing with missing data is a whole research area; there isn’t one solution.
In 2020 there was a whole workshop on missing values.
One organizer is the main developer of scikit-learn, the ML package we will use soon. In a 2020 invited talk he listed more automatic handling as an active area of research and a development goal for sklearn.
There are also many classic approaches, both when training and when applying models.
example application in breast cancer detection
Even in pandas, how to represent missing values symbolically is still under experimentation.
Missing values can even cause a column’s datatype to change.
That said, pandas gives a few basic tools:
dropna
fillna
Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.
Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.
Filling can be good if you know how to fill reasonably, but don’t have data to spare by dropping. For example:
you can approximate with another column
you can approximate with that column from other rows
whatever you do, document it
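The two fill strategies above can be sketched with a made-up toy DataFrame (the column names and values here are hypothetical, not from the coffee data):

```python
import pandas as pd
import numpy as np

# hypothetical scores: score_b is sometimes missing
scores = pd.DataFrame({'score_a': [8.0, 7.5, 9.0, 8.5],
                       'score_b': [8.1, np.nan, 9.2, np.nan]})

# approximate with another column: fillna accepts a Series and
# aligns on the index, so score_a fills the gaps in score_b
filled_across = scores['score_b'].fillna(scores['score_a'])

# approximate with that column from other rows: here, the column mean
filled_down = scores['score_b'].fillna(scores['score_b'].mean())
```

Either way, a comment or a note in your report documenting which strategy you used is part of the cleaning.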
10.2.1. Finding Missing Values#
Our skill in summarizing and getting overviews of data helps us know when we have a problem.
coffee_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1311 entries, 0 to 1310
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1311 non-null int64
1 Species 1311 non-null object
2 Owner 1304 non-null object
3 Country.of.Origin 1310 non-null object
4 Farm.Name 955 non-null object
5 Lot.Number 270 non-null object
6 Mill 1001 non-null object
7 ICO.Number 1165 non-null object
8 Company 1102 non-null object
9 Altitude 1088 non-null object
10 Region 1254 non-null object
11 Producer 1081 non-null object
12 Number.of.Bags 1311 non-null int64
13 Bag.Weight 1311 non-null object
14 In.Country.Partner 1311 non-null object
15 Harvest.Year 1264 non-null object
16 Grading.Date 1311 non-null object
17 Owner.1 1304 non-null object
18 Variety 1110 non-null object
19 Processing.Method 1159 non-null object
20 Aroma 1311 non-null float64
21 Flavor 1311 non-null float64
22 Aftertaste 1311 non-null float64
23 Acidity 1311 non-null float64
24 Body 1311 non-null float64
25 Balance 1311 non-null float64
26 Uniformity 1311 non-null float64
27 Clean.Cup 1311 non-null float64
28 Sweetness 1311 non-null float64
29 Cupper.Points 1311 non-null float64
30 Total.Cup.Points 1311 non-null float64
31 Moisture 1311 non-null float64
32 Category.One.Defects 1311 non-null int64
33 Quakers 1310 non-null float64
34 Color 1095 non-null object
35 Category.Two.Defects 1311 non-null int64
36 Expiration 1311 non-null object
37 Certification.Body 1311 non-null object
38 Certification.Address 1311 non-null object
39 Certification.Contact 1311 non-null object
40 unit_of_measurement 1311 non-null object
41 altitude_low_meters 1084 non-null float64
42 altitude_high_meters 1084 non-null float64
43 altitude_mean_meters 1084 non-null float64
dtypes: float64(16), int64(4), object(24)
memory usage: 450.8+ KB
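Besides info(), the isna method gives a per-cell mask of missing values that we can aggregate. A minimal sketch on a made-up frame (the columns here are hypothetical):

```python
import pandas as pd
import numpy as np

# small made-up frame with some missing values
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': ['x', None, 'z']})

# isna() marks each missing cell True; summing counts them per column
missing_per_column = df.isna().sum()

# the mean gives the fraction missing, which is often easier to interpret
missing_fraction = df.isna().mean()
```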
10.2.2. Example Filling#
The ‘Lot.Number’ column has a lot of NaN values; how can we explore it?
We can look at the type:
coffee_df['Lot.Number'].dtype
dtype('O')
And we can look at the value counts.
coffee_df['Lot.Number'].value_counts()
1 18
020/17 6
019/17 5
2 3
102 3
..
11/23/0696 1
3-59-2318 1
8885 1
5055 1
017-053-0211/ 017-053-0212 1
Name: Lot.Number, Length: 221, dtype: int64
We see that a lot of the values are ‘1’. Maybe we know that when the data was collected, if the farm only had one lot, some people recorded ‘1’ and others left it blank. So we could fill in with 1:
coffee_df['Lot.Number'].fillna(1).head(10)
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
Name: Lot.Number, dtype: object
Note that even after we called fillna, when we display the column again the original data is unchanged:
coffee_df['Lot.Number'].head(3)
0 NaN
1 NaN
2 NaN
Name: Lot.Number, dtype: object
To save the filled-in column we have a few choices:
use the inplace parameter. This doesn’t offer performance advantages: it still copies the object and then reassigns the pointer, and it is under discussion to deprecate it
write to a new DataFrame
add a column
We’ll use adding a column:
coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna('1')
coffee_df['lot_number_clean'].value_counts()
1 1059
020/17 6
019/17 5
102 3
103 3
...
3-59-2318 1
8885 1
5055 1
MCCFWXA15/16 1
017-053-0211/ 017-053-0212 1
Name: lot_number_clean, Length: 221, dtype: int64
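Notice that here we filled with the string '1' rather than the integer 1 we used above. Since the column’s dtype is object (strings), the type of the fill value matters; a toy sketch of the difference:

```python
import pandas as pd
import numpy as np

# hypothetical lot numbers stored as strings, with one missing
lots = pd.Series(['020/17', np.nan, '1'], dtype=object)

# filling with the string '1' keeps every value a string
as_str = lots.fillna('1')

# filling with the integer 1 mixes ints and strings in one object column
as_int = lots.fillna(1)

types_str = set(type(v) for v in as_str)
types_int = set(type(v) for v in as_int)
```

Mixed types in one column make later comparisons and value_counts behave surprisingly, so matching the column’s existing type is the safer choice.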
10.2.3. Example Dropping#
To illustrate how dropna
works, we’ll check the shape
attribute after each call.
dropna is going to drop either rows or columns, so we start with the original shape:
coffee_df.shape
(1311, 45)
By default, it drops any row with one or more NaN
values.
coffee_df.dropna().shape
(130, 45)
We could instead tell it to only drop rows with NaN
in a subset of the columns.
We might do this if we were interested in studying the impact of altitude on the ratings.
coffee_df.dropna(subset=['altitude_low_meters']).shape
(1084, 45)
In the Open Policing Project Data Summary we saw a summary that showed which variables had at least 70% non-missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the thresh
parameter and axis=1
to drop along columns.
The thresh
parameter requires a count (an int
), not a fraction of the rows, so we can
save the shape values to variables and then use those:
nrows, ncols = coffee_df.shape
coffee_df.dropna(thresh=.7*nrows, axis=1).shape
(1311, 44)
This dataset is actually in pretty good shape, but if we use a more stringent threshold it drops more columns.
nrows, ncols = coffee_df.shape
coffee_df.dropna(thresh=.85*nrows, axis=1).shape
(1311, 34)
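Since thresh expects a count, converting the fraction explicitly with int() makes the intent clearer. A minimal sketch on a made-up frame (the column names are hypothetical):

```python
import pandas as pd
import numpy as np

# made-up frame: column 'c' is mostly missing
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1.0, np.nan, 3.0, 4.0],
                   'c': [np.nan, np.nan, np.nan, 4.0]})

nrows = len(df)

# keep columns that are at least 70% non-missing;
# int() truncates, so a column exactly at the boundary is kept
kept = df.dropna(thresh=int(.7 * nrows), axis=1)
```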
10.3. Inconsistent values#
This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use value_counts
and sort the results alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the value_counts()
Series, the values from the coffee_df
column become the index, so we use sort_index
.
Let’s look at the In.Country.Partner
column:
coffee_df['In.Country.Partner'].value_counts().sort_index()
We can see there’s only one ‘Blossom Valley International\n’
but 58 ‘Blossom Valley International’
; the former is likely a typo, especially since \n
is a special character for a newline. Similarly with ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.
This is another job for dictionaries: we make one with the value to replace as the key and the value to insert as the value.
partner_corrections = {'Blossom Valley International\n':'Blossom Valley International',
'Specialty Coffee Ass':'Specialty Coffee Association'}
coffee_df['in_country_partner_clean'] = coffee_df['In.Country.Partner'].replace(
to_replace=partner_corrections)
coffee_df['in_country_partner_clean'].value_counts().sort_index()
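A correction dictionary is explicit and documents each fix, but when the problem is only stray whitespace, the str.strip method (an alternative approach, not the one used above) can handle every such case at once:

```python
import pandas as pd

# hypothetical sample of partner names, one with a trailing newline
partners = pd.Series(['Blossom Valley International\n',
                      'Blossom Valley International',
                      'Specialty Coffee Association'])

# strip leading/trailing whitespace, including the stray newline
stripped = partners.str.strip()
```

This would not fix truncations like ‘Specialty Coffee Ass’, so the two approaches complement each other.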
10.5. Fixing data at load time#
Some of the parameters in read_csv
can also fix how data is read in.
For example, header
can make a file whose column names are not on the first line
read in correctly.
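A minimal sketch of the header parameter, using a made-up file (io.StringIO stands in for a file on disk):

```python
import io
import pandas as pd

# hypothetical export with two junk lines above the real column names
raw = """exported by tool v2
2021-10-04
name,score
aa,8.5
bb,9.0
"""

# header=2 tells read_csv the column names are on the third line (0-indexed)
df = pd.read_csv(io.StringIO(raw), header=2)
```

The related skiprows parameter can do the same job by skipping the junk lines instead.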
10.6. A Cleaning Data Recipe#
not everything possible, but good enough for this course
Can you use parameters to read the data in better?
Fix the index and column headers (making these easier to use makes the rest easier)
Is the data structured well?
Are there missing values?
Do the datatypes match what you expect by looking at the head or a sample?
Are categorical variables represented in usable way?
Does your analysis require filtering or augmenting the data?
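The recipe above might look like this in code, on a made-up file (every column name and value here is hypothetical, and each step would be documented in a real cleaning script):

```python
import io
import pandas as pd

# stand-in for a messy file on disk
raw = "id,Score Total,color\n1,8.5,Green\n2,,green\n"

# read the data in (no special parameters needed for this toy file)
df = pd.read_csv(io.StringIO(raw))

# fix the column headers so the rest is easier to type
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

# missing values: document the choice; here we drop incomplete rows
df = df.dropna(subset=['score_total'])

# represent categorical variables consistently
df['color'] = df['color'].str.lower()
```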
10.7. Further reading#
Instead of more practice with these manipulations, below are more
examples of cleaning data to see how these types of manipulations get used.
Your goal here is not to memorize every possible thing, but to build a general
idea of what good data looks like and good habits for cleaning data and keeping
it reproducible.
Also here are some tips on general data management and organization.
This article is a comprehensive discussion of data cleaning.
10.8. Questions#
Important
I will add these later. I have a deadline tomorrow.