4. Exploratory Data Analysis#

Now we get to start actual data science!

4.1. First, a note on assignments#

4.1.1. the goal#

  • I am not looking for “an answer”

  • You should not be either

  • I am looking for evidence that you understand the material (thus far including prereqs)

  • You should be trying to understand material

This means that in office hours, I am going to:

  • ask you questions to help you think about the problem and the material

  • help direct your attention to the right part of the error message to figure out what is wrong

4.1.2. Getting help#

Sending me a screenshot is almost guaranteed not to get you help. Not because I do not want to help, but because I literally do not have the information needed to find the answer.

Typically, when someone does not know how to fix something from the error message, it is because they are reading the wrong part of the error message or looking at the wrong part of the code while trying to find the problem.

This means they end up screenshotting that wrong thing, so I literally cannot tell what is wrong from the screenshot.

I am not being stubborn; I literally cannot tell what is wrong because I do not have enough information. Debugging code requires context, and if you deprive me of that, then I cannot help.

To get asynchronous help:

  • upload your whole notebook, errors and all

  • create an issue on that repo

4.2. This week: Exploratory Data Analysis#

  • How to summarize data

  • Interpreting summaries

  • Visualizing data

  • Interpreting visualizations

4.2.1. Summarizing and Visualizing Data are very important#

  • People cannot interpret high dimensional data or large samples quickly

  • Important in EDA to help you make decisions about the rest of your analysis

  • Important in how you report your results

  • Summaries are calculations similar to the performance metrics we will see later

  • visualizations are often essential in debugging models

THEREFORE

  • You have a lot of chances to earn “summarize” and “visualize”

  • we will be picky when we assess if you earned them or not

import pandas as pd

# robusta ratings from the coffee quality database repo
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
# use the first column of the csv as the DataFrame index
coffee_df = pd.read_csv(coffee_data_url, index_col=0)
coffee_df.head(1)
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Robusta ankole coffee producers coop Uganda kyangundu cooperative society NaN ankole coffee producers 0 ankole coffee producers coop 1488 sheema south western ... Green 2 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1488.0 1488.0 1488.0

1 rows × 43 columns

4.3. Describing a Dataset#

So far, we’ve loaded data in a few different ways, and we’ve examined DataFrames as a data structure: what attributes they have, what some of their methods are, and how to get data into them.

We can also get more structural information with the info method:

coffee_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28 entries, 1 to 28
Data columns (total 43 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Species                28 non-null     object 
 1   Owner                  28 non-null     object 
 2   Country.of.Origin      28 non-null     object 
 3   Farm.Name              25 non-null     object 
 4   Lot.Number             6 non-null      object 
 5   Mill                   20 non-null     object 
 6   ICO.Number             17 non-null     object 
 7   Company                28 non-null     object 
 8   Altitude               25 non-null     object 
 9   Region                 26 non-null     object 
 10  Producer               26 non-null     object 
 11  Number.of.Bags         28 non-null     int64  
 12  Bag.Weight             28 non-null     object 
 13  In.Country.Partner     28 non-null     object 
 14  Harvest.Year           28 non-null     int64  
 15  Grading.Date           28 non-null     object 
 16  Owner.1                28 non-null     object 
 17  Variety                3 non-null      object 
 18  Processing.Method      10 non-null     object 
 19  Fragrance...Aroma      28 non-null     float64
 20  Flavor                 28 non-null     float64
 21  Aftertaste             28 non-null     float64
 22  Salt...Acid            28 non-null     float64
 23  Bitter...Sweet         28 non-null     float64
 24  Mouthfeel              28 non-null     float64
 25  Uniform.Cup            28 non-null     float64
 26  Clean.Cup              28 non-null     float64
 27  Balance                28 non-null     float64
 28  Cupper.Points          28 non-null     float64
 29  Total.Cup.Points       28 non-null     float64
 30  Moisture               28 non-null     float64
 31  Category.One.Defects   28 non-null     int64  
 32  Quakers                28 non-null     int64  
 33  Color                  25 non-null     object 
 34  Category.Two.Defects   28 non-null     int64  
 35  Expiration             28 non-null     object 
 36  Certification.Body     28 non-null     object 
 37  Certification.Address  28 non-null     object 
 38  Certification.Contact  28 non-null     object 
 39  unit_of_measurement    28 non-null     object 
 40  altitude_low_meters    25 non-null     float64
 41  altitude_high_meters   25 non-null     float64
 42  altitude_mean_meters   25 non-null     float64
dtypes: float64(15), int64(5), object(23)
memory usage: 9.6+ KB

Now, we can actually start to analyze the data itself.

The describe method provides us with a set of summary statistics that broadly describe the distribution of each numerical column.

coffee_df.describe()
Number.of.Bags Harvest.Year Fragrance...Aroma Flavor Aftertaste Salt...Acid Bitter...Sweet Mouthfeel Uniform.Cup Clean.Cup Balance Cupper.Points Total.Cup.Points Moisture Category.One.Defects Quakers Category.Two.Defects altitude_low_meters altitude_high_meters altitude_mean_meters
count 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.0 28.000000 25.00000 25.000000 25.000000
mean 168.000000 2013.964286 7.702500 7.630714 7.559643 7.657143 7.675714 7.506786 9.904286 9.928214 7.541786 7.761429 80.868929 0.065714 2.964286 0.0 1.892857 1367.60000 1387.600000 1377.600000
std 143.226317 1.346660 0.296156 0.303656 0.342469 0.261773 0.317063 0.725152 0.238753 0.211030 0.526076 0.330507 2.441233 0.058464 12.357280 0.0 2.601129 838.06205 831.884207 833.980216
min 1.000000 2012.000000 6.750000 6.670000 6.500000 6.830000 6.670000 5.080000 9.330000 9.330000 5.250000 6.920000 73.750000 0.000000 0.000000 0.0 0.000000 40.00000 40.000000 40.000000
25% 1.000000 2013.000000 7.580000 7.560000 7.397500 7.560000 7.580000 7.500000 10.000000 10.000000 7.500000 7.580000 80.170000 0.000000 0.000000 0.0 0.000000 795.00000 795.000000 795.000000
50% 170.000000 2014.000000 7.670000 7.710000 7.670000 7.710000 7.750000 7.670000 10.000000 10.000000 7.670000 7.830000 81.500000 0.100000 0.000000 0.0 1.000000 1095.00000 1200.000000 1100.000000
75% 320.000000 2015.000000 7.920000 7.830000 7.770000 7.830000 7.830000 7.830000 10.000000 10.000000 7.830000 7.920000 82.520000 0.120000 0.000000 0.0 2.000000 1488.00000 1488.000000 1488.000000
max 320.000000 2017.000000 8.330000 8.080000 7.920000 8.000000 8.420000 8.250000 10.000000 10.000000 8.000000 8.580000 83.750000 0.130000 63.000000 0.0 9.000000 3170.00000 3170.000000 3170.000000

From this, we can draw several conclusions. For example, straightforward ones like:

  • the smallest number of bags rated is 1, and at least 25% of the coffees rated only had 1 bag

  • the earliest harvest year included is 2012 and the latest is 2017 (min & max)

  • the mean Mouthfeel was 7.5

  • Category One defects are not very common (the 75th percentile is 0)

Or more nuanced ones that compare across variables like

  • the raters scored coffee higher on Uniform.Cup and Clean.Cup than on the other scores (comparing mean scores; the other scores seem to have a scale that tops out around 8 out of 10)

  • the coffee varied more in Mouthfeel and Balance than in most other scores (comparing the std; again only among the scores that seem to top out around 8 out of 10)

  • there are 3 ratings with no altitude (count of other variables is 28; alt is 25)

And these all give us a sense of the values and the distribution, or spread, of the data in each column.

4.3.1. Understanding Quantiles#

The 50% row has another, more common name: the median. It is the value with 50% of the data below it (and 50% above).
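We can check that equivalence directly; a quick sketch (my addition, using the Balance column from this dataset):

# the 50% quantile and the median are the same number, so this evaluates to True
coffee_df['Balance'].quantile(0.5) == coffee_df['Balance'].median()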

4.3.2. Individual variable#

We can use the descriptive statistics on individual columns as well.

coffee_df['Balance'].descirbe()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2089/957071290.py in ?()
----> 1 coffee_df['Balance'].descirbe()

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
   5985             and name not in self._accessors
   5986             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987         ):
   5988             return self[name]
-> 5989         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'descirbe'

This is an AttributeError because there is no method or attribute of a pd.Series named 'descirbe'; that tells me it is a typo.

coffee_df['Balance'].describe()
count    28.000000
mean      7.541786
std       0.526076
min       5.250000
25%       7.500000
50%       7.670000
75%       7.830000
max       8.000000
Name: Balance, dtype: float64

4.4. Individual statistics#

We can also extract each of the statistics that the describe method calculates individually, by name.

coffee_df.mean(numeric_only=True)
Number.of.Bags           168.000000
Harvest.Year            2013.964286
Fragrance...Aroma          7.702500
Flavor                     7.630714
Aftertaste                 7.559643
Salt...Acid                7.657143
Bitter...Sweet             7.675714
Mouthfeel                  7.506786
Uniform.Cup                9.904286
Clean.Cup                  9.928214
Balance                    7.541786
Cupper.Points              7.761429
Total.Cup.Points          80.868929
Moisture                   0.065714
Category.One.Defects       2.964286
Quakers                    0.000000
Category.Two.Defects       1.892857
altitude_low_meters     1367.600000
altitude_high_meters    1387.600000
altitude_mean_meters    1377.600000
dtype: float64
coffee_df.min(numeric_only=True)
Number.of.Bags             1.00
Harvest.Year            2012.00
Fragrance...Aroma          6.75
Flavor                     6.67
Aftertaste                 6.50
Salt...Acid                6.83
Bitter...Sweet             6.67
Mouthfeel                  5.08
Uniform.Cup                9.33
Clean.Cup                  9.33
Balance                    5.25
Cupper.Points              6.92
Total.Cup.Points          73.75
Moisture                   0.00
Category.One.Defects       0.00
Quakers                    0.00
Category.Two.Defects       0.00
altitude_low_meters       40.00
altitude_high_meters      40.00
altitude_mean_meters      40.00
dtype: float64

The quantiles are tricky: we cannot just call a .25%() method to get the 25th percentile; we have to use the quantile method and pass it a value between 0 and 1.

coffee_df['Flavor'].quantile(.8)
7.83
coffee_df['Aftertaste'].mean()
7.559642857142856

4.5. Working with categorical data#

There are different columns in the describe output than in the whole dataset:

coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
       'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
       'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
       'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
       'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
       'Certification.Body', 'Certification.Address', 'Certification.Contact',
       'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
       'altitude_mean_meters'],
      dtype='object')

So far, the stats above are only for the numerical features.

For a categorical column, we can get the prevalence of each value with value_counts:

coffee_df['Color'].value_counts()
Color
Green           20
Blue-Green       3
Bluish-Green     2
Name: count, dtype: int64

Try it Yourself

Note that value_counts does not count the NaN values, count counts all of the non-missing values, and the shape of the DataFrame gives the total number of rows. How can you get the number of missing Colors?

Again, describe only operates on the numerical columns, but we might want to know about the others; value_counts tells us how many rows take each value.

What is the most common country of origin?

coffee_df['Country.of.Origin'].value_counts()
Country.of.Origin
India            13
Uganda           10
United States     2
Ecuador           2
Vietnam           1
Name: count, dtype: int64

value_counts returns a pandas Series that has two parts: values and index.

coffee_df['Country.of.Origin'].value_counts().values
array([13, 10,  2,  2,  1])
coffee_df['Country.of.Origin'].value_counts().index
Index(['India', 'Uganda', 'United States', 'Ecuador', 'Vietnam'], dtype='object', name='Country.of.Origin')

The max method takes the max of the values.

coffee_df['Country.of.Origin'].value_counts().max()
13

We can get the name of the most common country out of this Series using idxmax.

coffee_df['Country.of.Origin'].value_counts().idxmax()
'India'

4.6. Which country scores highest on flavor?#

Let’s try to answer this with plots first

coffee_df['Flavor'].plot(kind='bar')
<Axes: >
[figure: bar plot with one bar per row showing its Flavor score]
coffee_df['Flavor'].plot(kind='hist')
<Axes: ylabel='Frequency'>
[figure: histogram of the frequency of Flavor scores]

Seaborn gives us opinionated defaults:

import seaborn as sns

seaborn’s alias is sns as an inside joke among the developers: it refers to the character Samuel Norman Seaborn from The West Wing, after whom they named the library, per their FAQ.

So we can choose a type of plot, give it the whole DataFrame, and pass the variables to different parameters so that they are used in different ways.

sns.catplot(data=coffee_df, x='Flavor', kind='count',
            col='Country.of.Origin', col_wrap=3)
<seaborn.axisgrid.FacetGrid at 0x7fce587e5e20>
[figure: faceted count plots of Flavor, one panel per Country.of.Origin]

4.6.1. Are any coffees more than one standard deviation above the average in Flavor?#

coffee_df['Flavor'].mean() + coffee_df['Flavor'].std(), coffee_df['Flavor'].max()
(7.93437014076549, 8.08)

Yes: the max is above that threshold, so at least one coffee is more than one standard deviation above the mean.
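To see exactly how many, a small sketch (my addition, not from class):

# count the coffees whose Flavor is more than one std above the mean
flavor_cutoff = coffee_df['Flavor'].mean() + coffee_df['Flavor'].std()
(coffee_df['Flavor'] > flavor_cutoff).sum()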

4.7. Questions after class#

4.7.1. can pycharm access urls?#

Probably; URLs are a language feature, so it should not block them, but I do not use PyCharm, so I cannot say for sure.

4.7.2. can we do what we did today in pycharm?#

In theory, but PyCharm probably will not handle the plots gracefully.

For this class, you need to submit notebook files, so PyCharm will not work.

PyCharm is not typically used in data science.

4.7.3. Are the bars on the Seaborn plot random colors, or do the colors have some meaning?#

Here, each different value gets a different color. We will control the colors more on Thursday.

4.7.4. what types of jobs use what we learned today#

Any data science role would use these skills daily.

4.7.5. Is there an equivalent for the index of the minimum value, such as idxmin()?#
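Yes, idxmin gives the index label of the minimum value in the same way; for example (a sketch, not run in class):

# index label of the least common country in the ratings
coffee_df['Country.of.Origin'].value_counts().idxmin()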

4.7.6. What is the std#

std is the standard deviation; see the pandas docs on the calculation. The Wikipedia article on standard deviation is a good source on this.
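By default, pandas computes the sample standard deviation, normalized by $n-1$:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$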

4.7.7. Why can we not use matplotlib?#

You can, and you may for some customization, but seaborn usually makes things faster to code and easier to read. I will not teach much matplotlib, but if you know it well (current matplotlib, not outdated patterns), you may use it in some cases.

4.7.8. One question I have is what exactly the series object can be used for#

A Series is one column or one row of a DataFrame.
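For example, both of these are Series (a sketch; recall this dataset’s index runs from 1 to 28):

flavor_col = coffee_df['Flavor']  # one column is a Series
first_row = coffee_df.loc[1]      # one row, selected by index label, is also a Series
type(flavor_col), type(first_row)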

4.7.9. how these lines of code can be used more efficiently in larger data sets?#

For extremely large datasets, if it is too much to plot, you might, for example, sample the dataset to only plot a subset.

Otherwise this will still work.
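A minimal sketch of that idea (my addition; the 1000-row cutoff is an arbitrary choice for illustration):

# plot a random subset only if the dataset is too large to plot directly
max_rows = 1000
plot_df = coffee_df if len(coffee_df) <= max_rows else coffee_df.sample(n=max_rows, random_state=0)
plot_df['Flavor'].plot(kind='hist')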

4.7.10. Can you change what the plot graphs rather than an index vs frequency?#

Yes, we can change what the plot graphs to whatever we want.
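For example, a sketch (my addition) plotting one score against another instead of index vs. frequency:

# scatter plot of two score columns against each other
coffee_df.plot(x='Balance', y='Flavor', kind='scatter')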

4.7.11. Actually, I would like to know, with our country/flavor example, can they be put in the same plot, with different colors for the different countries side by side for a given score?#

Yes.
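A sketch of one way (my addition; hue gives each country its own color within a single plot):

# one count plot, with countries distinguished by color instead of by facet
sns.catplot(data=coffee_df, x='Flavor', kind='count', hue='Country.of.Origin')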

4.7.12. Can assignment 3 be released so we can pre-read it?#

Yes, it is posted now.

4.8. Questions we’ll answer in class on Thursday#

Important

These are great questions; I’m just not going to answer them here and then again in class on Thursday.

  • What other types of graphs could be used?

  • Is there a way to use both x and y in sns without creating multiple charts?

  • We never got to use matplotlib in class, and I want to know the difference between it and seaborn

  • How to change the colors of the plots.

  • What do we usually tend to look for in terms of patterns when trying to formulate questions about the data, or several pieces of data?