6. Cleaning Data - Structure#

6.1. Intro#

This week, we’ll be cleaning data.

Cleaning data is labor intensive and requires making subjective choices.
We’ll focus on, and assess you on, manipulating data correctly, making reasonable choices, and documenting the choices you make carefully.

We’ll focus on the programming tools that get used in cleaning data in class this week:

  • reshaping data

  • handling missing or incorrect values

  • renaming columns

import pandas as pd
import seaborn as sns

# make plots look nicer and increase font size
sns.set_theme(font_scale=2, palette='colorblind')

Note here we set the theme and pass the palette and font size as paramters. To learn more about this and the various options, see the seaborn aesthetics tutorial

6.2. What is Tidy Data?#

Read in the three csv files described below and store them in a dictionary

url_base = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'

data_filename_list = ['study_a.csv','study_b.csv','study_c.csv']

Here we can use a dictionary comprehension

example_data = {datafile.split('.')[0]:pd.read_csv(url_base+datafile) for datafile in data_filename_list}
  • this is a dictionary comprehension which is like a list comprehension, but builds a dictionary instead.

  • split is a string method that splits the string at the given character and returns a list

  • [0] then takes the first item out

  • : makes the part to the left a key and rigt the value

  • we can add strings to combine them

6.2.1. Breakign down that comprehension#

Tip

This section also illustrates a way you can work to understand any piece of code: take small sections and run them one at a time

We can do smaller examples of each step.

First a tiny dictionary comprehension

{char:ord(char) for char in 'abcdef'}
{'a': 97, 'b': 98, 'c': 99, 'd': 100, 'e': 101, 'f': 102}

this gives us the ascii number for each character in that string with the char as key and ord(char)’s output as the value. Note that it runs the code assigned in the value because of the () doing the function call.

Next the split method, we will use this sentence.

sentence = 'Next the split method, we will use this sentence.'
type(sentence), len(sentence)
(str, 49)

this is type str so we can use the string class methods in Python, such as split. As a str the len function tells us how many characters because a string is iterable over characters. We used this fact in the tiny dictionary above.

By default it splits at spaces

sentence.split()
['Next', 'the', 'split', 'method,', 'we', 'will', 'use', 'this', 'sentence.']

we can instead tell it on what string to split

clauses = sentence.split(',')
type(clauses)
list

split always returns a list, so we can pick the first item with [0]

clauses[0]
'Next the split method'

We can concatenate str objects with +

title = 'Dr.'
last = 'Brown'
title + last
'Dr.Brown'

6.2.2. the same data 3 ways#

We can pick a single on out this way

example_data['study_a']
name treatmenta treatmentb
0 John Smith - 2
1 Jane Doe 16 11
2 Mary Johnson 3 1

or with an additional import,

from IPython.display import display

THen we can iterate over the values in the dictionary ( the three dataframes) and see them all:

[display(df) for df in example_data.values()];
name treatmenta treatmentb
0 John Smith - 2
1 Jane Doe 16 11
2 Mary Johnson 3 1
intervention John Smith Jane Doe Mary Johnson
0 treatmenta - 16 3
1 treatmentb 2 11 1
person treatment result
0 John Smith a -
1 Jane Doe a 16
2 Mary Johnson a 3
3 John Smith b 2
4 Jane Doe b 11
5 Mary Johnson b 1

These three all show the same data, but let’s say we have two goals:

  • find the average effect per person across treatments

  • find the average effect per treatment across people

This works differently for these three versions.

For study_a we can easilty get the average per treatment, but to get the average per person, we have to go across rows, which we can do here, but doesn’t work as well with plotting

we can work across rows with the axis parameter if needed

For B, we get the average per person, but what about per treatment? again we have to go across rows instead.

For the third one, however, we can use groupby, because this one is tidy.

6.3. Encoding Missing values#

We can see the impact of a bad choice to represent a missing value with -

example_data['study_c'].dtypes
person       object
treatment    object
result       object
dtype: object

one non numerical value changed the whole column!

example_data['study_c'].describe()
person treatment result
count 6 6 6
unique 3 2 6
top John Smith a -
freq 2 3 1

so describe treats it all like categorical data

We can use the na_values parameter to fix this. Pandas recognizes a lot of different values to mean missing and store as NaN but his is not one. Find the full list in the pd.read_csv documentation of the na_values parameter

example_data = {datafile.split('.')[0]:pd.read_csv(url_base+datafile,na_values='-') 
                for datafile in data_filename_list}

now we can checka gain

example_data['study_c'].dtypes
person        object
treatment     object
result       float64
dtype: object

and we see that it is float this is because NaN is a float.

example_data['study_c']
person treatment result
0 John Smith a NaN
1 Jane Doe a 16.0
2 Mary Johnson a 3.0
3 John Smith b 2.0
4 Jane Doe b 11.0
5 Mary Johnson b 1.0

Now it shows as Nan here

6.4. Computing on Tidy Data#

So we can see how to compute on this data to compute both ways.

example_data['study_c'].groupby('person')['result'].mean()
person
Jane Doe        13.5
John Smith       2.0
Mary Johnson     2.0
Name: result, dtype: float64
example_data['study_c'].groupby('treatment')['result'].mean()
treatment
a    9.500000
b    4.666667
Name: result, dtype: float64

The original Tidy Data paper is worth reading to build a deeper understanding of these ideas.

6.5. Tidying Data#

Let’s reshape the first one to match the tidy one. First, we will save it to a DataFrame, this makes things easier to read and enables us to use the built in help in jupyter, because it can’t check types too many levels into a data structure.

Before

example_data['study_a']
name treatmenta treatmentb
0 John Smith NaN 2
1 Jane Doe 16.0 11
2 Mary Johnson 3.0 1

After

df_a = example_data['study_a']

When we melt a dataset:

  • the id_vars stay as columns

  • the data from the value_vars columns become the values in the value column

  • the column names from the value_vars become the values in the variable column

  • we can rename the value and the variable columns.

tall_a = df_a.melt(id_vars='name',var_name='treatment',value_name='result')
tall_a
name treatment result
0 John Smith treatmenta NaN
1 Jane Doe treatmenta 16.0
2 Mary Johnson treatmenta 3.0
3 John Smith treatmentb 2.0
4 Jane Doe treatmentb 11.0
5 Mary Johnson treatmentb 1.0

see visualized on pandas tutor

7. Transforming the Coffee data#

First we see another way to filter data. We can use statistics of the data to filter and remove some. .

For exmaple, lets only keep the data from the top 10 countries in terms of number of bags.

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
# load the data
coffee_df = pd.read_csv(arabica_data_url)
coffee_df.head()
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
1 2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
2 3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.0 1800.0 1700.0
3 4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.0 2200.0 2000.0
4 5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

5 rows × 44 columns

We have separate countries for the country and number of bags

So we make a table the totals how many bags per country

bags_per_country = coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()
bags_per_country.head(15) 
Country.of.Origin
Brazil           30534
Burundi            520
China               55
Colombia         41204
Costa Rica       10354
Cote d?Ivoire        2
Ecuador              1
El Salvador       4449
Ethiopia         11761
Guatemala        36868
Haiti              390
Honduras         13167
India               20
Indonesia         1658
Japan               20
Name: Number.of.Bags, dtype: int64

Then we can take only the highest 10, and keep only the index

# sort descending, keep only the top 10 and pick out only the country names
top_bags_country_list = bags_per_country.sort_values(ascending=False)[:10].index

# filter the original data for only the countries in the top list
top_coffee_df = coffee_df[coffee_df['Country.of.Origin'].isin(top_bags_country_list)]

we can see which countries are in the list

top_bags_country_list
Index(['Colombia', 'Guatemala', 'Brazil', 'Mexico', 'Honduras', 'Ethiopia',
       'Costa Rica', 'Nicaragua', 'El Salvador', 'Kenya'],
      dtype='object', name='Country.of.Origin')

and confirm that is all that is in our new dataframe

top_coffee_df['Country.of.Origin'].value_counts()
Country.of.Origin
Mexico         236
Colombia       183
Guatemala      181
Brazil         132
Honduras        53
Costa Rica      51
Ethiopia        44
Nicaragua       26
Kenya           25
El Salvador     21
Name: count, dtype: int64

compared to the original, it is far fewer

coffee_df['Country.of.Origin'].value_counts()
Country.of.Origin
Mexico                          236
Colombia                        183
Guatemala                       181
Brazil                          132
Taiwan                           75
United States (Hawaii)           73
Honduras                         53
Costa Rica                       51
Ethiopia                         44
Tanzania, United Republic Of     40
Thailand                         32
Uganda                           26
Nicaragua                        26
Kenya                            25
El Salvador                      21
Indonesia                        20
China                            16
Malawi                           11
Peru                             10
United States                     8
Myanmar                           8
Vietnam                           7
Haiti                             6
Philippines                       5
Panama                            4
United States (Puerto Rico)       4
Laos                              3
Burundi                           2
Ecuador                           1
Rwanda                            1
Japan                             1
Zambia                            1
Papua New Guinea                  1
Mauritius                         1
Cote d?Ivoire                     1
India                             1
Name: count, dtype: int64

Warning

This section is changed slightly from in class to be easier to read.

This limits the rows, but keeps all of the columns

top_coffee_df.columns
Index(['Unnamed: 0', 'Species', 'Owner', 'Country.of.Origin', 'Farm.Name',
       'Lot.Number', 'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region',
       'Producer', 'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner',
       'Harvest.Year', 'Grading.Date', 'Owner.1', 'Variety',
       'Processing.Method', 'Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body',
       'Balance', 'Uniformity', 'Clean.Cup', 'Sweetness', 'Cupper.Points',
       'Total.Cup.Points', 'Moisture', 'Category.One.Defects', 'Quakers',
       'Color', 'Category.Two.Defects', 'Expiration', 'Certification.Body',
       'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')

Since this has a lot of variables, we will make lists of how we want to use different columns in variables before using.

value_vars = ['Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body',
       'Balance', 'Uniformity', 'Clean.Cup', 'Sweetness', ]
id_vars = ['Species', 'Owner', 'Country.of.Origin']
coffee_tall = top_coffee_df.melt(id_vars=id_vars, value_vars=value_vars,
              var_name='rating')
coffee_tall.head(1)
Species Owner Country.of.Origin rating value
0 Arabica metad plc Ethiopia Aroma 8.67

This data is “tidy” all of the different measurements (ratings) are different rows and we have a few columsn that identify each sample

This version plays much better with plots where we want to compare the ratings.

sns.catplot(data = coffee_tall, col='rating', y='value',x='Country.of.Origin',
            hue='Country.of.Origin',aspect=3, col_wrap =3, kind='violin') 
<seaborn.axisgrid.FacetGrid at 0x7fed404ae2e0>
../_images/676006e745ab971e76123035435acc4fbdf1dc410f310127d29f6d6b7e4efb05.png

This now we can use the rating as a variable to break the data by in our subplots. I did a few new things in this:

  • a violin plot allows us to compare how different data is distributed.

  • I used both col and col_wrap, col alone would have made nine columns, and without row all in one row. col_wrap says to line wrap after a certain number of columns (here 3)

  • aspect is an attribute of the figure level plots that controls the ratio of width to height (the aspect ratio). so aspect=3 makes it 3 times as wide as tall in each subplot, this spreads out the x tick labels so can read them

7.1. Questions After Class#

Most of the questions today were about assignment 3 or asking to explain specific things that are above. Feel free to post an issue if a question comes up later.

7.1.1. Can you change the locations of the names next to a plot?#

First, see the annotated graph and learn the technical name of the element you want to move.

annotated graph

the above figure come from matplotlib’s Anatomy of a Figure page which includes the code to generate that figure

You can control each of these but you may need to use matplotlib.

If this means the legend, seaborn can control that location. If it refers to fixing overlappign tick labels, I demonstrated that above, with aspect.