

Fixing Values

So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning.

Today, we’ll deal with how to fix the values within the data.

Cleaning Data review

Instead of more practice with these manipulations, below are more examples of cleaning data to see how these types of manipulations get used.
Your goal here is not to memorize every possible thing, but to build a general idea of what good data looks like and good habits for cleaning data and keeping it reproducible.

This article is a comprehensive discussion of data cleaning.

A Cleaning Data Recipe

This is not everything that is possible, but it is good enough for this course.

  1. Can you use parameters to read the data in better? (A small sketch of this appears after the list.)

  2. Fix the index and column headers (making these easier to use makes the rest easier)

  3. Is the data structured well?

  4. Are there missing values?

  5. Do the datatypes match what you expect by looking at the head or a sample?

  6. Are categorical variables represented in a usable way?

  7. Does your analysis require filtering or augmenting the data?
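
As a small illustration of step 1, read_csv has many parameters that can fix issues at load time. This is only a sketch: the parameter names are real pandas options, but the file name and column names here are hypothetical.

import pandas as pd

# a hypothetical example: several read_csv parameters that fix issues at load time
df = pd.read_csv('data.csv',             # hypothetical file name
                 index_col=0,             # use the first column as the index
                 header=1,                # the real header is on the second row
                 na_values=['?', 'n/a'],  # treat these strings as missing
                 parse_dates=['date'])    # hypothetical date column to parse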

What is clean enough?

This is a great question, without an easy answer.

It depends on what you want to do. This is why it’s important to have potential questions in mind if you are cleaning data for others, and why we often have to do a little bit more preparation after a dataset has been “cleaned”.

Whatever you do, document it.

import pandas as pd
import seaborn as sns
import numpy as np
# a small toy DataFrame with missing values for the examples below
na_toy_df = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, np.nan, np.nan], [np.nan]*4])

# make plots look nicer and increase font size
sns.set_theme(font_scale=2, palette='colorblind')
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

coffee_df = pd.read_csv(arabica_data_url)


rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)

Missing Data

Dealing with missing data is a whole research area. There isn’t one solution.

Even in pandas, how to represent missing values symbolically is still under experimentation.

Missing values cause the datatypes to change.

We tend to store missing values as NaN or use the constants:

pd.NA, np.nan
(<NA>, nan)

Pandas makes pd.NA a special typed object, but np.nan is just a float, so a column containing it is converted to float.

NumPy uses the float NaN value defined by the IEEE floating point standard.
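
A minimal sketch of that dtype behavior (not from the lecture): an integer Series becomes float64 as soon as it contains np.nan, while the nullable Int64 extension dtype can hold pd.NA without switching to float.

import numpy as np
import pandas as pd

pd.Series([1, 2, 3]).dtype                     # int64
pd.Series([1, 2, np.nan]).dtype                # float64, because np.nan is a float
pd.Series([1, 2, pd.NA], dtype='Int64').dtype  # Int64, the nullable extension dtype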

Pandas gives a few basic tools for dealing with missing values. First, we can check the types of the two constants:

type(pd.NA),type(np.nan)
(pandas._libs.missing.NAType, float)

We can see a few in this toy dataset

na_toy_df

Dropping Missing

Let’s try the default behavior of dropna

na_toy_df.dropna()

By default, it drops all of the rows where any of the elements are missing (one or more).

This is equivalent to:

na_toy_df.dropna(how='any', subset=na_toy_df.columns)

We can change how to its other mode:

na_toy_df.dropna(how='all')

In 'all' mode, it only drops rows where all of the values are missing.

We can also change it to work along columns (axis=1) instead:

na_toy_df.dropna(how='all',axis=1)

None of the columns are all missing, so nothing is dropped.

If only some of the columns matter, we can say to drop only the rows where any of those values are missing:

na_toy_df.dropna(subset=[0,1])

This means no rows are dropped, but if we change which columns are important, the result changes:

na_toy_df.dropna(subset=[0,2],how='any')

Filling

Filling can be good if you know how to fill reasonably but don’t have data to spare by dropping. For example, we might fill a missing value with a summary statistic like the column mean, as sketched below.
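
A minimal sketch with the toy data, assuming the column mean is a reasonable fill value for that column:

# fill the missing values in one toy column with that column's mean;
# whether this is reasonable depends on what the analysis needs
na_toy_df[2].fillna(na_toy_df[2].mean())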

A special case: what if we’re filling a summary table? In a table of counts, for example, a missing value often really means zero, so filling with 0 can make sense:

toy_df_filled = na_toy_df.fillna(0)
toy_df_filled

Filling missing values in the coffee data

Let’s look at a real dataset now

coffee_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1311 entries, 0 to 1310
Data columns (total 44 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             1311 non-null   int64  
 1   Species                1311 non-null   object 
 2   Owner                  1304 non-null   object 
 3   Country.of.Origin      1310 non-null   object 
 4   Farm.Name              955 non-null    object 
 5   Lot.Number             270 non-null    object 
 6   Mill                   1001 non-null   object 
 7   ICO.Number             1163 non-null   object 
 8   Company                1102 non-null   object 
 9   Altitude               1088 non-null   object 
 10  Region                 1254 non-null   object 
 11  Producer               1081 non-null   object 
 12  Number.of.Bags         1311 non-null   int64  
 13  Bag.Weight             1311 non-null   object 
 14  In.Country.Partner     1311 non-null   object 
 15  Harvest.Year           1264 non-null   object 
 16  Grading.Date           1311 non-null   object 
 17  Owner.1                1304 non-null   object 
 18  Variety                1110 non-null   object 
 19  Processing.Method      1159 non-null   object 
 20  Aroma                  1311 non-null   float64
 21  Flavor                 1311 non-null   float64
 22  Aftertaste             1311 non-null   float64
 23  Acidity                1311 non-null   float64
 24  Body                   1311 non-null   float64
 25  Balance                1311 non-null   float64
 26  Uniformity             1311 non-null   float64
 27  Clean.Cup              1311 non-null   float64
 28  Sweetness              1311 non-null   float64
 29  Cupper.Points          1311 non-null   float64
 30  Total.Cup.Points       1311 non-null   float64
 31  Moisture               1311 non-null   float64
 32  Category.One.Defects   1311 non-null   int64  
 33  Quakers                1310 non-null   float64
 34  Color                  1044 non-null   object 
 35  Category.Two.Defects   1311 non-null   int64  
 36  Expiration             1311 non-null   object 
 37  Certification.Body     1311 non-null   object 
 38  Certification.Address  1311 non-null   object 
 39  Certification.Contact  1311 non-null   object 
 40  unit_of_measurement    1311 non-null   object 
 41  altitude_low_meters    1084 non-null   float64
 42  altitude_high_meters   1084 non-null   float64
 43  altitude_mean_meters   1084 non-null   float64
dtypes: float64(16), int64(4), object(24)
memory usage: 450.8+ KB

The ‘Lot.Number’ column has a lot of NaN values; how can we explore it?

We can look at the type:

coffee_df['Lot.Number'].dtype
dtype('O')

And we can look at the value counts.

coffee_df['Lot.Number'].value_counts()
Lot.Number
1                                                            18
020/17                                                        6
019/17                                                        5
2016 Tainan Coffee Cupping Event Micro Lot 臺南市咖啡評鑑批次          3
102                                                           3
                                                             ..
017/105/16039                                                 1
14/7/2015/172                                                 1
2017/2018-Lot01                                               1
2017/2018 - Lot 2                                             1
017-053-0211/ 017-053-0212                                    1
Name: count, Length: 221, dtype: int64

We see that a lot of the values are ‘1’. Maybe we know that when the data was collected, if the farm only had one lot, some people recorded ‘1’ and others left it blank. So we could fill in the missing values with ‘1’:

coffee_df['Lot.Number'].fillna('1').head()
0    1
1    1
2    1
3    1
4    1
Name: Lot.Number, dtype: object
coffee_df['Lot.Number'].head()
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Lot.Number, dtype: object

Note that even after we called fillna, when we display the column again the original data is unchanged; fillna returns a new Series rather than changing the original in place. To save the filled-in column, we technically have a few choices: overwrite the original column, save the result as a separate variable, or add a new column.

Here, we will add a new column:

coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna('1')
coffee_df.head(1)
coffee_df.shape
(1311, 45)

Dropping

Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.

Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.
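
One way to check for that kind of pattern (a sketch, not something from the lecture): compute the fraction of missing values per country and see whether it varies a lot.

# fraction of missing altitude_mean_meters values within each country of origin;
# large differences across countries would suggest the data is not missing at random
(coffee_df['altitude_mean_meters'].isna()
    .groupby(coffee_df['Country.of.Origin'])
    .mean()
    .sort_values(ascending=False)
    .head())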

Here we will focus on how this impacts how much data we have:

coffee_df.dropna().shape
(130, 45)

We lose a lot of data this way.

We could instead tell it to only drop rows with NaN in a subset of the columns.

coffee_df.dropna(subset=['altitude_low_meters']).shape
(1084, 45)

Now, it drops any row with one or more NaN values in that column.

In the Open Policing Project Data Summary, we saw that they provided summary information showing which variables had at least 70% non-missing values. We can similarly choose to keep only the variables that have more than a specific threshold of data, using the thresh parameter and axis=1 to drop along columns.

n_rows, _ = coffee_df.shape
# keep only the columns that have at least 70% non-missing values
coffee_df.dropna(thresh=.7*n_rows, axis=1).shape
(1311, 44)

Inconsistent values

This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use value_counts and sort the results alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the value_counts() Series, the values from the coffee_df become the index, so we use sort_index.

Let’s look at the In.Country.Partner column

coffee_df['In.Country.Partner'].value_counts().sort_index()
In.Country.Partner
AMECAFE                                                                                  205
Africa Fine Coffee Association                                                            49
Almacafé                                                                                 178
Asociacion Nacional Del Café                                                             155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C.                             6
Asociación de Cafés Especiales de Nicaragua                                                8
Blossom Valley International                                                              58
Blossom Valley International\n                                                             1
Brazil Specialty Coffee Association                                                       67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao      1
Centro Agroecológico del Café A.C.                                                         8
Coffee Quality Institute                                                                   7
Ethiopia Commodity Exchange                                                               18
Instituto Hondureño del Café                                                              60
Kenya Coffee Traders Association                                                          22
METAD Agricultural Development plc                                                        15
NUCOFFEE                                                                                  36
Salvadoran Coffee Council                                                                 11
Specialty Coffee Ass                                                                       1
Specialty Coffee Association                                                             295
Specialty Coffee Association of Costa Rica                                                42
Specialty Coffee Association of Indonesia                                                 10
Specialty Coffee Institute of Asia                                                        16
Tanzanian Coffee Board                                                                     6
Torch Coffee Lab Yunnan                                                                    2
Uganda Coffee Development Authority                                                       22
Yunnan Coffee Exchange                                                                    12
Name: count, dtype: int64

We can see there is only one ‘Blossom Valley International\n’ but 58 of ‘Blossom Valley International’; the former is likely a typo, especially since \n is the special character for a newline. Similarly with ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.

partner_corrections = {'Blossom Valley International\n':'Blossom Valley International',
  'Specialty Coffee Ass':'Specialty Coffee Association'}
coffee_df['in_country_partner_clean'] = coffee_df['In.Country.Partner'].replace(
  to_replace=partner_corrections)
coffee_df['in_country_partner_clean'].value_counts().sort_index()
in_country_partner_clean
AMECAFE                                                                                  205
Africa Fine Coffee Association                                                            49
Almacafé                                                                                 178
Asociacion Nacional Del Café                                                             155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C.                             6
Asociación de Cafés Especiales de Nicaragua                                                8
Blossom Valley International                                                              59
Brazil Specialty Coffee Association                                                       67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao      1
Centro Agroecológico del Café A.C.                                                         8
Coffee Quality Institute                                                                   7
Ethiopia Commodity Exchange                                                               18
Instituto Hondureño del Café                                                              60
Kenya Coffee Traders Association                                                          22
METAD Agricultural Development plc                                                        15
NUCOFFEE                                                                                  36
Salvadoran Coffee Council                                                                 11
Specialty Coffee Association                                                             296
Specialty Coffee Association of Costa Rica                                                42
Specialty Coffee Association of Indonesia                                                 10
Specialty Coffee Institute of Asia                                                        16
Tanzanian Coffee Board                                                                     6
Torch Coffee Lab Yunnan                                                                    2
Uganda Coffee Development Authority                                                       22
Yunnan Coffee Exchange                                                                    12
Name: count, dtype: int64

Multiple values in a single column

Let’s look at the column about the bag weights

coffee_df['Bag.Weight'].head()
0    60 kg
1    60 kg
2        1
3    60 kg
4    60 kg
Name: Bag.Weight, dtype: object

It has both the value and the units in a single column, which is not what we want.

This would be better in two separate columns

bag_df = coffee_df['Bag.Weight'].str.split(' ').apply(pd.Series).rename({0:'bag_weight_clean',
                                                                1:'bag_weight_unit'},
                                                              axis=1)
bag_df.head()

This single line does several things at once. The following subsections break down the casting and string methods in more detail.

String methods

Python has a powerful string class. There is also an even more powerful string module

We only need the base str methods most of the time.

example_str = 'kjksfjds sklfjsdl'
type(example_str)
str

Some helpful ones:

example_str.split()
['kjksfjds', 'sklfjsdl']

This gives a list.

You can also change the separator:

'phrases-with-hyphens'.split('-')
['phrases', 'with', 'hyphens']

There are also methods for changing the case and other similar things. Use these instead of implementing your own string operations!

example_str.upper(), example_str.capitalize()
('KJKSFJDS SKLFJSDL', 'Kjksfjds sklfjsdl')
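
pandas exposes most of these string methods through the .str accessor, so they apply element-wise to a whole column. As a sketch, str.strip would remove trailing whitespace, including the '\n' we saw earlier in 'Blossom Valley International\n':

# the .str accessor applies string methods element-wise across a column;
# strip() removes leading/trailing whitespace, including newlines
coffee_df['In.Country.Partner'].str.strip().value_counts().sort_index().head()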

Combining parts of dataframes

bag_df.head()

We can pass pd.concat an iterable of pandas objects (here, a list of DataFrames) and it will, by default, stack them vertically; with axis=1 it stacks them horizontally.

pd.concat([coffee_df,bag_df],axis=1)
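
For contrast, a minimal sketch of the default behavior (axis=0), which stacks the frames vertically:

# stacking the toy frame on top of itself doubles the number of rows
pd.concat([na_toy_df, na_toy_df]).shape  # (6, 4)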

Casting Review

If we have a variable that is not the type we want like this:

a='5'

We can check the type:

type(a)
str

and we can use the name of the type we want as a function to cast it to the new type:

int(a)
5

and check the type:

type(int(a))
int
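
The same idea applies to whole pandas columns via astype or pd.to_numeric. As a sketch, we could cast the bag_weight_clean strings produced by the split above into numbers; errors='coerce' is a cautious choice that turns anything non-numeric into NaN instead of raising an error.

# cast the string weights to numbers; any non-numeric entry becomes NaN
bag_weight_numeric = pd.to_numeric(bag_df['bag_weight_clean'], errors='coerce')
bag_weight_numeric.head()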

Unpacking jsons

We can read JSON data from an API using read_json, as we did in the first cell today.

course_gh_df.head(1)

However, JSON data is often nested.

If we look at one of those rows, we can see it looks sort of like a dictionary, but we want it to be a pd.Series

course_gh_df['actor'].head(3)
0    {'id': 10656079, 'login': 'brownsarahm', 'disp...
1    {'id': 153571386, 'login': 'loganmccue17', 'di...
2    {'id': 10656079, 'login': 'brownsarahm', 'disp...
Name: actor, dtype: object
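
As a quick sketch of what we want, converting just one cell’s dictionary to a Series shows the shape that apply(pd.Series) will produce for every row:

# one cell is a dict; pd.Series turns its keys into an index
pd.Series(course_gh_df['actor'].iloc[0])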

We can change the type of each cell, using apply to cast each individual dictionary to a Series.

course_gh_df['actor'].apply(pd.Series).head(3)

This particular dataset has several of these:

course_gh_df.head(1)

We can iterate over them all and then combine them back together:

json_cols = ['actor','repo','payload','org']

pd.concat([ course_gh_df[col].apply(pd.Series)
            for col in json_cols],axis=1).head(3)

In the cell above, working outside in: pd.concat combines a list of DataFrames along axis=1; the list comprehension builds that list by iterating over the nested columns; and apply(pd.Series) expands each column of dictionaries into its own DataFrame.

We can then do this while renaming the columns to have better names:

course_gh_df_clean = pd.concat([ course_gh_df[col].apply(pd.Series).rename(lambda in_col: f'{col}_{in_col}',axis=1)
            for col in json_cols],axis=1)

Here we use the fact that rename can take a function, and we format the names as sourcecol_subcol, where sourcecol is the name of the column in the original dataset and subcol is the name of the field inside the dictionary value.

To combine with the rest of the columns, we can filter out the other column names:

reg_cols = [col for col in course_gh_df.columns if not(col in json_cols)]
reg_cols
['id', 'type', 'public', 'created_at']

and then combine that with the rest into a single dataframe:

course_gh_df_clean = pd.concat([course_gh_df[reg_cols]]+
    [ course_gh_df[col].apply(pd.Series).rename(lambda in_col: f'{col}_{in_col}',axis=1)
            for col in json_cols],axis=1)
course_gh_df_clean.head(3)

Questions

Today’s questions were all about the assignment or json example.