6. Exploratory Data Analysis#

import pandas as pd

6.1. Staying Organized#

  • See the new File structure section.

  • Be sure to accept assignments and close the feedback PR if you will not work on them

6.2. Data Frame from lists of Lists#

On Friday, we collectively made a list of strings

# %load http://drsmb.co/310read

# share a sentence or a few words about how class is going or
# one takeaway you have from class so far.
#  and attribute with a name affter a hyphen(-)
# You can remain anonymous (this page & the notes will be fully public)
# by attributing it to a celebrity or psuedonym, but include *some* sort of attribution
sentence_list = [
'Programming is a Practice. - Dr. Sarah Brown',
"So far it's going pretty good. - Matt",
'Pretty good pace, Github is still a little confusing is all - A',
"Class is going well, I'm really excited for the new skills that I will attain through this course. - Diondra",
'Good straight forward - Adam',
'Good pace and engaging - Jacob',
'I like cheese - Anon',
'Going well, I enjoy python - Aiden',
"Class is going very well, I'm excited to learn more about manipulating data sets - Greg",
'Really enjoying the class so far, Clear instructions and outcomes - Michael',
'So far this class is going well, very engaging lectures. - Anon',
'Great pace and notes have been really useful - Brandon',
'Good pace and easy to follow - Muhammad',
'Class is going well and engaging, but also a little difficult getting into the swing of things - Isaiah',
'Well paced, informative, and helpful - Vinnie',
'Spectacular -Michael Jackson',
'Very interesting! I am enjoying it a lot so far. Getting to experience pandas as well as using github/jupyter has been cool. One thing I would change though is slowing down the pace a bit. - Max'
]

We can check that by using type, first on the whole thing

type(sentence_list)
list

And then we can index into the list. Lists can be indexed by integers:

sentence_list[4]
'Good straight forward - Adam'

is one item from the list and then check its type:

type(sentence_list[4])
str

First, we’ll convert our list of strings, to a list of lists with a list comprehension. This will iterate over each string in that list and apply the string method split. Since we passed the parameter '-', it will split at that character.

[sent_attr.split('-') for sent_attr in sentence_list]
[['Programming is a Practice. ', ' Dr. Sarah Brown'],
 ["So far it's going pretty good. ", ' Matt'],
 ['Pretty good pace, Github is still a little confusing is all ', ' A'],
 ["Class is going well, I'm really excited for the new skills that I will attain through this course. ",
  ' Diondra'],
 ['Good straight forward ', ' Adam'],
 ['Good pace and engaging ', ' Jacob'],
 ['I like cheese ', ' Anon'],
 ['Going well, I enjoy python ', ' Aiden'],
 ["Class is going very well, I'm excited to learn more about manipulating data sets ",
  ' Greg'],
 ['Really enjoying the class so far, Clear instructions and outcomes ',
  ' Michael'],
 ['So far this class is going well, very engaging lectures. ', ' Anon'],
 ['Great pace and notes have been really useful ', ' Brandon'],
 ['Good pace and easy to follow ', ' Muhammad'],
 ['Class is going well and engaging, but also a little difficult getting into the swing of things ',
  ' Isaiah'],
 ['Well paced, informative, and helpful ', ' Vinnie'],
 ['Spectacular ', 'Michael Jackson'],
 ['Very interesting! I am enjoying it a lot so far. Getting to experience pandas as well as using github/jupyter has been cool. One thing I would change though is slowing down the pace a bit. ',
  ' Max']]

If we save it to a variable, we can analyze it better, for example indexing it

list_of_lists = [sent_attr.split('-') for sent_attr in sentence_list]
list_of_lists[0]
['Programming is a Practice. ', ' Dr. Sarah Brown']

This is a list, which we can check with:

type(list_of_lists[0])
list

If we take one item from that, it’s a string

list_of_lists[0][1]
' Dr. Sarah Brown'

The list of lists is the same length as our original sentence_list, because it was made from that.

len(sentence_list)
17
len(list_of_lists)
17

Then we can pass our list of lists to the DataFrame constructor and set the column headings with the columns parameter, which accepts a list.

pd.DataFrame([sent_attr.split('-') for sent_attr in sentence_list],
            columns=['sentence','attribution'])
sentence attribution
0 Programming is a Practice. Dr. Sarah Brown
1 So far it's going pretty good. Matt
2 Pretty good pace, Github is still a little con... A
3 Class is going well, I'm really excited for th... Diondra
4 Good straight forward Adam
5 Good pace and engaging Jacob
6 I like cheese Anon
7 Going well, I enjoy python Aiden
8 Class is going very well, I'm excited to learn... Greg
9 Really enjoying the class so far, Clear instru... Michael
10 So far this class is going well, very engaging... Anon
11 Great pace and notes have been really useful Brandon
12 Good pace and easy to follow Muhammad
13 Class is going well and engaging, but also a l... Isaiah
14 Well paced, informative, and helpful Vinnie
15 Spectacular Michael Jackson
16 Very interesting! I am enjoying it a lot so fa... Max

Hint

We built the list of lists here with a list comprehension because it’s only a few items, but for a list of longer lists, you might use a for loop and append

6.3. Summarizing Data#

coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url)

So far, we’ve loaded data in a few different ways and then we’ve examined DataFrames as a data structure, looking at what different attributes they have and what some of the methods are, and how to get data into them.

coffee_df.head()
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Robusta ankole coffee producers coop Uganda kyangundu cooperative society NaN ankole coffee producers 0 ankole coffee producers coop 1488 ... Green 2 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1488.0 1488.0 1488.0
1 2 Robusta nishant gurjer India sethuraman estate kaapi royale 25 sethuraman estate 14/1148/2017/21 kaapi royale 3170 ... NaN 2 October 31st, 2018 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 3170.0 3170.0 3170.0
2 3 Robusta andrew hetzel India sethuraman estate NaN NaN 0000 sethuraman estate 1000m ... Green 0 April 29th, 2016 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 1000.0 1000.0 1000.0
3 4 Robusta ugacof Uganda ugacof project area NaN ugacof 0 ugacof ltd 1212 ... Green 7 July 14th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1212.0 1212.0 1212.0
4 5 Robusta katuka development trust ltd Uganda katikamu capca farmers association NaN katuka development trust 0 katuka development trust ltd 1200-1300 ... Green 3 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1200.0 1300.0 1250.0

5 rows × 44 columns

Now, we can actually start to analyze the data itself.

The describe method provides us with a set of summary statistics that broadly describe the data overall.

coffee_df.describe()
Unnamed: 0 Number.of.Bags Harvest.Year Fragrance...Aroma Flavor Aftertaste Salt...Acid Bitter...Sweet Mouthfeel Uniform.Cup ... Balance Cupper.Points Total.Cup.Points Moisture Category.One.Defects Quakers Category.Two.Defects altitude_low_meters altitude_high_meters altitude_mean_meters
count 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 ... 28.000000 28.000000 28.000000 28.000000 28.000000 28.0 28.000000 25.00000 25.000000 25.000000
mean 14.500000 168.000000 2013.964286 7.702500 7.630714 7.559643 7.657143 7.675714 7.506786 9.904286 ... 7.541786 7.761429 80.868929 0.065714 2.964286 0.0 1.892857 1367.60000 1387.600000 1377.600000
std 8.225975 143.226317 1.346660 0.296156 0.303656 0.342469 0.261773 0.317063 0.725152 0.238753 ... 0.526076 0.330507 2.441233 0.058464 12.357280 0.0 2.601129 838.06205 831.884207 833.980216
min 1.000000 1.000000 2012.000000 6.750000 6.670000 6.500000 6.830000 6.670000 5.080000 9.330000 ... 5.250000 6.920000 73.750000 0.000000 0.000000 0.0 0.000000 40.00000 40.000000 40.000000
25% 7.750000 1.000000 2013.000000 7.580000 7.560000 7.397500 7.560000 7.580000 7.500000 10.000000 ... 7.500000 7.580000 80.170000 0.000000 0.000000 0.0 0.000000 795.00000 795.000000 795.000000
50% 14.500000 170.000000 2014.000000 7.670000 7.710000 7.670000 7.710000 7.750000 7.670000 10.000000 ... 7.670000 7.830000 81.500000 0.100000 0.000000 0.0 1.000000 1095.00000 1200.000000 1100.000000
75% 21.250000 320.000000 2015.000000 7.920000 7.830000 7.770000 7.830000 7.830000 7.830000 10.000000 ... 7.830000 7.920000 82.520000 0.120000 0.000000 0.0 2.000000 1488.00000 1488.000000 1488.000000
max 28.000000 320.000000 2017.000000 8.330000 8.080000 7.920000 8.000000 8.420000 8.250000 10.000000 ... 8.000000 8.580000 83.750000 0.130000 63.000000 0.0 9.000000 3170.00000 3170.000000 3170.000000

8 rows × 21 columns

We can also select one variable at a time

coffee_df['Balance'].describe()
count    28.000000
mean      7.541786
std       0.526076
min       5.250000
25%       7.500000
50%       7.670000
75%       7.830000
max       8.000000
Name: Balance, dtype: float64

To dig in on what the quantiles really mean, we can compute one manually.

First, we sort the data, then for the 25%, we select the point in index 6 because becaues there are 28 values.

balance_sorted = coffee_df['Balance'].sort_values().values
balance_sorted[6]
7.5

We can also extract each of the statistics that the describe method calculates individually, by name. The quantiles are tricky, we can’t ust .25%() to get the 25% percentile, we have to use the quantile method and pass it a value between 0 and 1.

coffee_df['Flavor'].quantile(.25)
7.5600000000000005

We can also pass other values

coffee_df['Flavor'].quantile(.8)
7.83

Calculate the mean of the Aftertaste column:

coffee_df['Aftertaste'].mean()
7.559642857142856

6.4. What about the nonnumerical variables?#

For example the color

coffee_df['Color'].head()
0    Green
1      NaN
2    Green
3    Green
4    Green
Name: Color, dtype: object

We can get the prevalence of each one with value_counts

coffee_df['Color'].value_counts()
Green           20
Blue-Green       3
Bluish-Green     2
None             1
Name: Color, dtype: int64

Try it Yourself

Note value_counts does not count the NaN values, but count counts all of the not missing values and the shape of the DataFrame is the total number of rows. How can you get the number of missing Colors?

What country is most prevalent in this dataset?

coffee_df['Country.of.Origin'].value_counts()
India            13
Uganda           10
United States     2
Ecuador           2
Vietnam           1
Name: Country.of.Origin, dtype: int64

We can get the name of the most common country out of this Series using idmax

coffee_df['Country.of.Origin'].value_counts().idxmax()
'India'

Question From Class

Q: Can we calculate the mode to find the most prevalent? A: Yes. We can also use the mode function, which works on both numerical or nonnumerical values.

coffee_df['Country.of.Origin'].mode()
0    India
Name: Country.of.Origin, dtype: object

6.5. Questions After Class#

6.5.1. General Questions#

6.5.1.1. How to know what functions are compatible with other functions?#

This is something that builds up over time, but one thing to look for is to check the types. That’s why we’ve been using type() so often in class. Having a goal for your analysis will also help.

6.5.1.2. Best place to find all the individual functions based on the .describe() function#

Each one goes by the same name, mostly, but that also the Pandas documentation is the bes way to find more information on every pandas method. On any function, see the See also section too.

6.5.1.3. Are there panda functions to read files other than CSV?#

Yes! See the hint on assignment 2 for more information on this

6.5.2. Clarifying#

6.5.2.1. Is sent_attr in the DataFrame loop its own variable that we assign, or is it a base Python function?#

That’s a variable that we created, it’s equivalent to a loop variable.

6.5.2.3. When I ran [sent_attr.split(‘-’) for sent_attr in sentence_list] in Jupyter, I got a nameerror#

I also did at first, because I hadn’t yet run the cell above so sentence_list was not defined.

Why do we use the value quantile instead of using quartile since we can use mean to find mean?

It’s just the name of the method.

6.5.3. Coure Admin and Assignment Questions#

6.5.3.1. when are the achievements we earn in class going to be inputted in brightspace?#

Assignment achievements will be posted after grading is complete. Level 1s from class will be posted most weeks on Friday or over the weekend. Right now we’re a little behind as we get everything set up.

6.5.3.2. Is there anything we have to do for the assignment after commiting the changes to#

the files?

Yes, after committing, push. Once you can see your files on the GitHub website interface, we can see them for grading

6.5.3.3. When we push the homework are we pushing it to the portfolio?#

No. when you accept the assignment it will create a new repository for that assignment. That’s why it’s importtant to accept each assignment.

6.5.3.4. Still a little unsure about what the num_numerical specifically is, is it just the number of numerical columns?#

Yes the number of numerical columns.

6.6. More Practice#

  1. Produce a table with the mean for each score.

  2. Which variables have the most missing data?

  3. What’s the total number of bags of coffee produced?

  4. Which ratings have similar ranges? (max, min)

  5. Which rating are most consistent across coffees?

  6. What score cutoff could you apply in order to select the top 10 best for Balance?