7. More EDA#

import pandas as pd
import seaborn as sns
sns.set_theme(palette='colorblind')

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
robusta_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'

coffee_df = pd.read_csv(arabica_data_url,index_col=0)

7.1. Which of the following is distributed most similarly to Sweetness?#

scores_of_interest = ['Flavor','Balance','Aroma','Body',
                      'Uniformity','Aftertaste','Sweetness']

We can answer this statistically, technically, but it will be easiest to do this looking ad a plot.

First step is to subset the data:

coffee_df[scores_of_interest].head(2)

	Flavor	Balance	Aroma	Body	Uniformity	Aftertaste	Sweetness
1	8.83	8.42	8.67	8.50	10.0	8.67	10.0
2	8.67	8.42	8.75	8.42	10.0	8.50	10.0

Then we produce a kde plot

coffee_df[scores_of_interest].plot(kind='kde')

<AxesSubplot: ylabel='Density'>

We could also do it with seaborn

sns.displot(data=coffee_df[scores_of_interest])

<seaborn.axisgrid.FacetGrid at 0x7f192e391370>

the default is not as helpful as using

sns.displot(data=coffee_df[scores_of_interest],kind='kde')

<seaborn.axisgrid.FacetGrid at 0x7f193276d1c0>

If we forget the parameter kind, we get its default value.

print('\n'.join(sns.displot.__doc__.split('\n')[5:10]))

``kind`` parameter selects the approach to use:

- :func:`histplot` (with ``kind="hist"``; the default)
- :func:`kdeplot` (with ``kind="kde"``)
- :func:`ecdfplot` (with ``kind="ecdf"``; univariate-only)

7.2. Summarizing with multiple variables#

So, we can summarize data now, but the summaries we have done so far have treated each variable one at a time. The most interesting patterns are in often in how multiple variables interact. We’ll do some modeling that looks at multivariate functions of data in a few weeks, but for now, we do a little more with summary statistics.

On Monday, we saw how to see how many reviews there were per country, using

total_per_country = coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()

What just happened? split-apply-combine image showing one data table, it split into 3 part, the sum applied to each part, and the sums combined back into one table

Groupby splits the whole dataframe into parts where each part has the same value for Country.of.Origin and then after that, we extracted the Number.of.Bags column, took the sum (within each separate group) and then put it all back together in one table (in this case, a Series becuase we picked one variable out)

7.2.1. How does Groupby Work?#

Important

This is more details with code examples on how the groupby works. If you want to run this code for yourself, use the download icon at the top right to download these notes as a notebook.

We can view this by saving the groupby object as a variable and exploring it.

country_grouped = coffee_df.groupby('Country.of.Origin')

country_grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1932661910>

Trying to look at it without applying additional functions, just tells us the type. But, it’s iterable, so we can loop over.

for country,df in country_grouped:
    print(type(country), type(df))

<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>
<class 'str'> <class 'pandas.core.frame.DataFrame'>

We could manually compute things using the data structure, if needed, though using pandas functionality will usually do what we want. For example:

bag_total_dict = {}

for country,df in country_grouped:
    tot_bags =  df['Number.of.Bags'].sum()
    bag_total_dict[country] = tot_bags

pd.DataFrame.from_dict(bag_total_dict, orient='index',
                           columns = ['Number.of.Bags.Sum'])

	Number.of.Bags.Sum
Brazil	30534
Burundi	520
China	55
Colombia	41204
Costa Rica	10354
Cote d?Ivoire	2
Ecuador	1
El Salvador	4449
Ethiopia	11761
Guatemala	36868
Haiti	390
Honduras	13167
India	20
Indonesia	1658
Japan	20
Kenya	3971
Laos	81
Malawi	557
Mauritius	1
Mexico	24140
Myanmar	10
Nicaragua	6406
Panama	537
Papua New Guinea	7
Peru	2336
Philippines	259
Rwanda	150
Taiwan	1914
Tanzania, United Republic Of	3760
Thailand	1310
Uganda	3868
United States	361
United States (Hawaii)	833
United States (Puerto Rico)	71
Vietnam	10
Zambia	13

is the same as what we did before

7.3. Sorting DataFrames#

We saved the totals, so we can check what type that is.

type(total_per_country)

pandas.core.series.Series

It’s a pandas series, so we can use the sort_values method.

total_per_country.sort_values(ascending=False)

Country.of.Origin
Colombia                        41204
Guatemala                       36868
Brazil                          30534
Mexico                          24140
Honduras                        13167
Ethiopia                        11761
Costa Rica                      10354
Nicaragua                        6406
El Salvador                      4449
Kenya                            3971
Uganda                           3868
Tanzania, United Republic Of     3760
Peru                             2336
Taiwan                           1914
Indonesia                        1658
Thailand                         1310
United States (Hawaii)            833
Malawi                            557
Panama                            537
Burundi                           520
Haiti                             390
United States                     361
Philippines                       259
Rwanda                            150
Laos                               81
United States (Puerto Rico)        71
China                              55
India                              20
Japan                              20
Zambia                             13
Myanmar                            10
Vietnam                            10
Papua New Guinea                    7
Cote d?Ivoire                       2
Ecuador                             1
Mauritius                           1
Name: Number.of.Bags, dtype: int64

7.4. More plot types#

We can also look at the distributions

sns.catplot(data=coffee_df,x='Country.of.Origin',y='Number.of.Bags')

<seaborn.axisgrid.FacetGrid at 0x7f192e37b7f0>

sns.catplot(data=coffee_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='violin')

<seaborn.axisgrid.FacetGrid at 0x7f192e389130>

sns.catplot(data=coffee_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='bar')

<seaborn.axisgrid.FacetGrid at 0x7f192ddef520>

7.5. How can we pick the top 10 countries out?#

total_per_country.sort_values(ascending=False)[:10]

Country.of.Origin
Colombia       41204
Guatemala      36868
Brazil         30534
Mexico         24140
Honduras       13167
Ethiopia       11761
Costa Rica     10354
Nicaragua       6406
El Salvador     4449
Kenya           3971
Name: Number.of.Bags, dtype: int64

type(total_per_country.sort_values(ascending=False)[:10])

pandas.core.series.Series

this is a series, but we can change it to a DataFrame with reset_index() this will also add a new index column but a side effect is that it becomes a DataFrame again

top_total_df = total_per_country.sort_values(ascending=False)[:10].reset_index()
top_total_df

	Country.of.Origin	Number.of.Bags
0	Colombia	41204
1	Guatemala	36868
2	Brazil	30534
3	Mexico	24140
4	Honduras	13167
5	Ethiopia	11761
6	Costa Rica	10354
7	Nicaragua	6406
8	El Salvador	4449
9	Kenya	3971

sns.catplot(data=top_total_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='bar',aspect=2)

<seaborn.axisgrid.FacetGrid at 0x7f192dcf8670>

this got the number of countries fewer, bu they were still hard to read because they were small and still overlapping, so I added aspect=2 to make it 2x as wide as tall and give it more space. We can also use a similar strategy to pull out just these country names and get all the data for those.

top_countries = total_per_country.sort_values(ascending=False)[:10].index

coffee_df[coffee_df['Country.of.Origin'].isin(top_countries)].head(2)

	Species	Owner	Country.of.Origin	Farm.Name	Lot.Number	Mill	ICO.Number	Company	Altitude	Region	...	Color	Category.Two.Defects	Expiration	Certification.Body	Certification.Address	Certification.Contact	unit_of_measurement	altitude_low_meters	altitude_high_meters	altitude_mean_meters
1	Arabica	metad plc	Ethiopia	metad plc	NaN	metad plc	2014/2015	metad agricultural developmet plc	1950-2200	guji-hambela	...	Green	0	April 3rd, 2016	METAD Agricultural Development plc	309fcf77415a3661ae83e027f7e5f05dad786e44	19fef5a731de2db57d16da10287413f5f99bc2dd	m	1950.0	2200.0	2075.0
2	Arabica	metad plc	Ethiopia	metad plc	NaN	metad plc	2014/2015	metad agricultural developmet plc	1950-2200	guji-hambela	...	Green	1	April 3rd, 2016	METAD Agricultural Development plc	309fcf77415a3661ae83e027f7e5f05dad786e44	19fef5a731de2db57d16da10287413f5f99bc2dd	m	1950.0	2200.0	2075.0

2 rows × 43 columns

I forgot the isin method in class and had to look that up, I more often need to filter by numerical columns or selecting a single value.

top_coffee_df = coffee_df[coffee_df['Country.of.Origin'].isin(top_countries)]

sns.catplot(data=top_coffee_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='bar',aspect=2)

<seaborn.axisgrid.FacetGrid at 0x7f192db6ebb0>

top_coffee_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 952 entries, 1 to 1312
Data columns (total 43 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 Species                952 non-null    object 
 Owner                  945 non-null    object 
 Country.of.Origin      952 non-null    object 
 Farm.Name              694 non-null    object 
 Lot.Number             205 non-null    object 
 Mill                   760 non-null    object 
 ICO.Number             908 non-null    object 
 Company                789 non-null    object 
 Altitude               833 non-null    object 
 Region                 919 non-null    object 
Producer               812 non-null    object 
Number.of.Bags         952 non-null    int64  
Bag.Weight             952 non-null    object 
In.Country.Partner     952 non-null    object 
Harvest.Year           932 non-null    object 
Grading.Date           952 non-null    object 
Owner.1                945 non-null    object 
Variety                824 non-null    object 
Processing.Method      850 non-null    object 
Aroma                  952 non-null    float64
Flavor                 952 non-null    float64
Aftertaste             952 non-null    float64
Acidity                952 non-null    float64
Body                   952 non-null    float64
Balance                952 non-null    float64
Uniformity             952 non-null    float64
Clean.Cup              952 non-null    float64
Sweetness              952 non-null    float64
Cupper.Points          952 non-null    float64
Total.Cup.Points       952 non-null    float64
Moisture               952 non-null    float64
Category.One.Defects   952 non-null    int64  
Quakers                951 non-null    float64
Color                  808 non-null    object 
Category.Two.Defects   952 non-null    int64  
Expiration             952 non-null    object 
Certification.Body     952 non-null    object 
Certification.Address  952 non-null    object 
Certification.Contact  952 non-null    object 
unit_of_measurement    952 non-null    object 
altitude_low_meters    830 non-null    float64
altitude_high_meters   830 non-null    float64
altitude_mean_meters   830 non-null    float64
dtypes: float64(16), int64(3), object(24)
memory usage: 327.2+ KB

sns.displot(top_coffee_df, x='Sweetness', kind='kde', col='Country.of.Origin',col_wrap=3)

<seaborn.axisgrid.FacetGrid at 0x7f192dfa65b0>

We’ll see next week how to manipulate the dataset so that we can plot multiple scores this way.

Variable types and data types Related but not the same.

7.6. General Plotting Ideas#

There are lots of type of plots, we saw the basic patterns of how to use them and we’ve used a few types, but we cannot (and do not need to) go through every single type. There are general patterns that you can use that will help you think about what type of plot you might want and help you understand them to be able to customize plots.

[Seaborn’s main goal is opinionated defaults and flexible customization](https://seaborn.pydata.org/tutorial/introduction.html#opinionated-defaults-and-flexible-customization

7.6.1. Anatomy of a figure#

First is the matplotlib structure of a figure. BOth pandas and seaborn and other plotting libraries use matplotlib. Matplotlib was used in visualizing the first Black hole.

annotated graph

This is a lot of information, but these are good to know things. THe most important is the figure and the axes.

Try it Yourself

Make sure you can explain what is a figure and what are axes in your own words and why that distinction matters. Discuss in office hours if you are unsure.

that image was drawn with code and that page explains more.

7.6.2. Plotting Function types in Seaborn#

Seaborn has two levels or groups of plotting functions. Figure and axes. Figure level fucntions can plot with subplots.

summary of plot types

This is from thie overivew section of the official seaborn tutorial. It also includes a comparison of figure vs axes plotting.

The official introduction is also a good read.

7.6.3. More#

The seaborn gallery and matplotlib gallery are nice to look at too.

7.7. Styling Plots#

sns.set_theme()

sns.catplot(data=top_coffee_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='bar',aspect=2)

<seaborn.axisgrid.FacetGrid at 0x7f192dac67f0>

This by default styles the plots to be more visually appealing

sns.set_theme(palette='colorblind')
sns.catplot(data=top_coffee_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='bar',aspect=2)

<seaborn.axisgrid.FacetGrid at 0x7f192bb16400>

the colorblind palette is more distinguishable under a variety fo colorblindness types. for more

sns.set_theme(palette='bright')
sns.catplot(data=top_coffee_df,x='Country.of.Origin',y='Number.of.Bags',
           kind='bar',aspect=2)

<seaborn.axisgrid.FacetGrid at 0x7f1928adfee0>

Colorblind is a good default, but you can choose others that yo like more too.

7.8. More Practice#

Make a table thats total number of bags and mean and count of scored for each of the variables in the scores_of_interest list.
Make a bar chart of the mean score for each variable scores_of_interest grouped by country.

7.9. Questions#

Note

Submit questions as Issues

Programming for Data Science at URI Fall 2022

More EDA

Contents

7. More EDA#

7.1. Which of the following is distributed most similarly to Sweetness?#

7.2. Summarizing with multiple variables#

7.2.1. How does Groupby Work?#

7.3. Sorting DataFrames#

7.4. More plot types#

7.5. How can we pick the top 10 countries out?#

7.6. General Plotting Ideas#

7.6.1. Anatomy of a figure#

7.6.2. Plotting Function types in Seaborn#

7.6.3. More#

7.7. Styling Plots#

7.8. More Practice#

7.9. Questions#

7.10. Questions After Class#