9. Reshaping Data#

Today, we’ll begin reshaping data. We’ll cover:

  • filtering

  • applying a function to all rows

  • what tidy data is

  • reshaping data into tidy data

First some setup:

import pandas as pd
import seaborn as sns

sns.set_theme(font_scale=2, palette='colorblind')
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

9.1. Cleaning Data#

This week, we’ll be cleaning data.

Cleaning data is labor intensive and requires making subjective choices.
We’ll focus on, and assess you on, manipulating data correctly, making reasonable choices, and documenting the choices you make carefully.

In class this week, we’ll focus on the programming tools used in cleaning data:

  • reshaping data

  • handling missing or incorrect values

  • changing the representation of information

9.2. Tidy Data#

Read in the three csv files described below and store them in a list of DataFrames.

url_base = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'

datasets = ['study_a.csv','study_b.csv','study_c.csv']
df_list = [pd.read_csv(url_base + file, na_values='') for file in datasets]
df_list[0]
           name treatmenta treatmentb
0    John Smith          -          2
1      Jane Doe         16         11
2  Mary Johnson          3          1
df_list[1]
  intervention John Smith Jane Doe Mary Johnson
0   treatmenta          -       16            3
1   treatmentb          2       11            1
df_list[2]
         person treatment result
0    John Smith         a      -
1      Jane Doe         a     16
2  Mary Johnson         a      3
3    John Smith         b      2
4      Jane Doe         b     11
5  Mary Johnson         b      1
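Aside: the '-' entries are read in as text, which leaves those columns as strings rather than numbers; this is part of why some of the averages below will look strange. If we wanted pandas to treat '-' as missing instead (a sketch, not what the cell above does; df_list_na is just an illustrative name), we could pass na_values when reading:

# hypothetical re-read: treat '-' as a missing value so the treatment columns stay numeric
df_list_na = [pd.read_csv(url_base + file, na_values='-') for file in datasets]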

These three all show the same data, but let’s say we have two goals:

  • find the average effect per person across treatments

  • find the average effect per treatment across people

This works differently for these three versions.

df_list[0].mean()
/tmp/ipykernel_2499/2274987639.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_list[0].mean()
treatmenta   -54.333333
treatmentb     4.666667
dtype: float64

We get the average per treatment, but to get the average per person we have to compute across the rows instead. We can do that here, but it does not work as well with plotting:

df_list[0].mean(axis=1)
/tmp/ipykernel_2499/1371725361.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_list[0].mean(axis=1)
0     2.0
1    11.0
2     1.0
dtype: float64

Also, this result is not well labeled: the index is the row number rather than the person’s name.
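One workaround for the labeling (a sketch, not from the original notes) is to make the name column the index and coerce the text values to numbers before averaging:

# set the names as the index so the per-person averages are labeled,
# and coerce the '-' placeholder to NaN so both columns are numeric
df_list[0].set_index('name').apply(pd.to_numeric, errors='coerce').mean(axis=1)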

Let’s try the next one.

df_list[1].mean()
/tmp/ipykernel_2499/1105758803.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_list[1].mean()
John Smith      -1.0
Jane Doe        13.5
Mary Johnson     2.0
dtype: float64

Now we get the average per person, but what about per treatment? Again, we have to go across the rows instead.

df_list[1].mean(axis=1)
/tmp/ipykernel_2499/1112167831.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_list[1].mean(axis=1)
0    9.5
1    6.0
dtype: float64

For the third one, however, we can use groupby to answer both questions:

df_list[2].groupby('person').mean()
/tmp/ipykernel_2499/906930014.py:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
  df_list[2].groupby('person').mean()
              result
person
Jane Doe       805.5
John Smith      -1.0
Mary Johnson    15.5
df_list[2].groupby('treatment').mean()
/tmp/ipykernel_2499/2480310521.py:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
  df_list[2].groupby('treatment').mean()
               result
treatment
a          -54.333333
b          703.666667

The original Tidy Data paper is worth reading to build a deeper understanding of these ideas.

9.3. Tidying Data#

Let’s reshape the first one to match the tidy one. First, we will save it to its own variable; this makes things easier to read and lets us use the built-in help in Jupyter, which can’t check types too many levels deep into a data structure.

treat_df = df_list[0]

Let’s look at it again so we can see what we’re working with:

treat_df.head()
           name treatmenta treatmentb
0    John Smith          -          2
1      Jane Doe         16         11
2  Mary Johnson          3          1

Correction

I fixed the three data files so that the extra spaces are removed. You will need to re-run the cell that reads in the data to get the corrected files.

treat_df.melt(value_vars=['treatmenta', 'treatmentb'],
              id_vars=['name'],
              value_name='result', var_name='treatment')
           name   treatment result
0    John Smith  treatmenta      -
1      Jane Doe  treatmenta     16
2  Mary Johnson  treatmenta      3
3    John Smith  treatmentb      2
4      Jane Doe  treatmentb     11
5  Mary Johnson  treatmentb      1
tidy_treat_df = treat_df.melt(value_vars=['treatmenta', 'treatmentb'],
                              id_vars=['name'],
                              value_name='result', var_name='treatment')
tidy_treat_df.groupby('name').mean()
/tmp/ipykernel_2499/3450628511.py:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
  tidy_treat_df.groupby('name').mean()
name
Jane Doe
John Smith
Mary Johnson
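We only get the index back because the result column still holds text (the '-' placeholder makes the whole column strings), so pandas drops it from the mean. A sketch of one way to fix this, assuming we treat '-' as missing:

# make the result column numeric, treating '-' as missing
tidy_treat_df['result'] = pd.to_numeric(tidy_treat_df['result'], errors='coerce')
# now both of our questions are one groupby away
tidy_treat_df.groupby('name')['result'].mean()
tidy_treat_df.groupby('treatment')['result'].mean()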

9.4. Filtering Data by a column#

Let’s go back to the coffee dataset

coffee_df = pd.read_csv(arabica_data_url, index_col = 0)
coffee_df.head()
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m NaN ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.0 1800.0 1700.0
4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 oromia ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.0 2200.0 2000.0
5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

5 rows × 43 columns

Recall that on Friday we computed the total number of bags per country.

# compute total bags per country
bag_total_df = coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()

We can subset this to get only the countries with over 15,000 total bags using a boolean mask:

bag_total_df[bag_total_df>15000]
Country.of.Origin
Brazil       30534
Colombia     41204
Guatemala    36868
Mexico       24140
Name: Number.of.Bags, dtype: int64

Whatever we put inside the [] has to be the same length as the Series and each element has to be a boolean:

len(bag_total_df>15000)
36
mask = bag_total_df>15000
type(mask[0])
numpy.bool_
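As an aside (not covered in class), boolean masks built from conditions can be combined element-wise with & and |, as long as each condition is wrapped in parentheses:

# countries with totals between two illustrative thresholds
bag_total_df[(bag_total_df > 15000) & (bag_total_df < 40000)]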

9.5. Augmenting a dataset#

We want the names of the countries as a list, so we extract the index of that series and then cast it to a list.

high_prod_countries =  list(bag_total_df[bag_total_df>15000].index)
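Based on the filtered totals printed above, this list contains the four countries over the threshold:

high_prod_countries
['Brazil', 'Colombia', 'Guatemala', 'Mexico']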

Next, we want to be able to check whether a country is in this list, so we’ll make a lambda function that does that:

high_prod = lambda c: c in high_prod_countries

Recall that the lambda keyword makes a function:

type(high_prod)
function

We can test it

high_prod('Mexico'), high_prod('Ethiopia')
(True, False)

Now we can apply that lambda function to each country in the whole coffee DataFrame and save the result as a new column.

coffee_df['high_production'] = coffee_df['Country.of.Origin'].apply(high_prod)
coffee_df.head()
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters high_production
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0 False
2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0 False
3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m NaN ... 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.0 1800.0 1700.0 True
4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 oromia ... 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.0 2200.0 2000.0 False
5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0 False

5 rows × 44 columns
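As an aside, pandas also has a built-in membership test that produces the same column without writing a lambda; a minimal equivalent sketch:

# Series.isin tests membership in a list element-wise, like our lambda + apply
coffee_df['high_production'] = coffee_df['Country.of.Origin'].isin(high_prod_countries)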

Finally, we can filter the whole data frame using that new column.

high_prod_coffee_df = coffee_df[coffee_df['high_production']]

Question from class

How can we get the ones not on that list?

low_prod_coffee_df = coffee_df[coffee_df['high_production']==False]
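Equivalently (an aside, not shown in class), the ~ operator negates a boolean column:

# ~ flips each boolean, keeping the rows where high_production is False
low_prod_coffee_df = coffee_df[~coffee_df['high_production']]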

Try it Yourself

Replace the FIXMEs in the excerpt below to reshape the data so that it has a Value column holding the value of each score and a Score column indicating which score is in that row. Keep the color and country as identifying columns for each row.

scores_of_interest = ['Balance','Aroma','Body','Aftertaste']
attrs_of_interest = ['Country.of.Origin','Color']
high_prod_coffee_df_melted = high_prod_coffee_df.melt(
        id_vars = FIXME,
        value_vars = FIXME,
        value_name = 'Value',
        var_name = 'Score')

so that it looks like the following

high_prod_coffee_df_melted = high_prod_coffee_df.melt(
        id_vars = ['Country.of.Origin','Color'],
        value_vars = ['Balance','Aroma','Body','Aftertaste'],
        value_name = 'Value',
        var_name = 'Score')

high_prod_coffee_df_melted.head()
  Country.of.Origin         Color    Score  Value
0         Guatemala           NaN  Balance   8.42
1            Brazil  Bluish-Green  Balance   8.33
2            Mexico         Green  Balance   8.17
3            Brazil         Green  Balance   8.00
4            Brazil         Green  Balance   8.00

Try it Yourself

Plot the distribution of each score on a separate subplot and use a different color for each country. Use a kde for the distributions.

sns.displot(data=high_prod_coffee_df_melted, x='Value', hue='Country.of.Origin',
            col='Score', col_wrap=2, aspect=1.5, kind='kde')
<seaborn.axisgrid.FacetGrid at 0x7f3ed8c7bd30>
(Figure: KDE of the Value distribution for each Score on a separate subplot, colored by Country.of.Origin.)