6. Exploratory Data Analysis#
import pandas as pd
6.1. Staying Organized#
See the new File structure section.
Be sure to accept assignments, and close the feedback PR if you will not work on them.
6.2. Data Frame from lists of Lists#
On Friday, we collectively made a list of strings
# %load http://drsmb.co/310read
# share a sentence or a few words about how class is going or
# one takeaway you have from class so far.
# and attribute it with a name after a hyphen (-)
# You can remain anonymous (this page & the notes will be fully public)
# by attributing it to a celebrity or pseudonym, but include *some* sort of attribution
sentence_list = [
'Programming is a Practice. - Dr. Sarah Brown',
"So far it's going pretty good. - Matt",
'Pretty good pace, Github is still a little confusing is all - A',
"Class is going well, I'm really excited for the new skills that I will attain through this course. - Diondra",
'Good straight forward - Adam',
'Good pace and engaging - Jacob',
'I like cheese - Anon',
'Going well, I enjoy python - Aiden',
"Class is going very well, I'm excited to learn more about manipulating data sets - Greg",
'Really enjoying the class so far, Clear instructions and outcomes - Michael',
'So far this class is going well, very engaging lectures. - Anon',
'Great pace and notes have been really useful - Brandon',
'Good pace and easy to follow - Muhammad',
'Class is going well and engaging, but also a little difficult getting into the swing of things - Isaiah',
'Well paced, informative, and helpful - Vinnie',
'Spectacular -Michael Jackson',
'Very interesting! I am enjoying it a lot so far. Getting to experience pandas as well as using github/jupyter has been cool. One thing I would change though is slowing down the pace a bit. - Max'
]
We can check that by using type, first on the whole thing
type(sentence_list)
list
And then we can index into the list. Lists can be indexed by integers:
sentence_list[4]
'Good straight forward - Adam'
This is one item from the list; we can then check its type:
type(sentence_list[4])
str
First, we’ll convert our list of strings to a list of lists with a list comprehension. This will iterate over each string in the list and apply the string method split. Since we pass it the character '-', it will split at that character.
[sent_attr.split('-') for sent_attr in sentence_list]
[['Programming is a Practice. ', ' Dr. Sarah Brown'],
["So far it's going pretty good. ", ' Matt'],
['Pretty good pace, Github is still a little confusing is all ', ' A'],
["Class is going well, I'm really excited for the new skills that I will attain through this course. ",
' Diondra'],
['Good straight forward ', ' Adam'],
['Good pace and engaging ', ' Jacob'],
['I like cheese ', ' Anon'],
['Going well, I enjoy python ', ' Aiden'],
["Class is going very well, I'm excited to learn more about manipulating data sets ",
' Greg'],
['Really enjoying the class so far, Clear instructions and outcomes ',
' Michael'],
['So far this class is going well, very engaging lectures. ', ' Anon'],
['Great pace and notes have been really useful ', ' Brandon'],
['Good pace and easy to follow ', ' Muhammad'],
['Class is going well and engaging, but also a little difficult getting into the swing of things ',
' Isaiah'],
['Well paced, informative, and helpful ', ' Vinnie'],
['Spectacular ', 'Michael Jackson'],
['Very interesting! I am enjoying it a lot so far. Getting to experience pandas as well as using github/jupyter has been cool. One thing I would change though is slowing down the pace a bit. ',
' Max']]
If we save it to a variable, we can analyze it better, for example by indexing into it:
list_of_lists = [sent_attr.split('-') for sent_attr in sentence_list]
list_of_lists[0]
['Programming is a Practice. ', ' Dr. Sarah Brown']
This is a list, which we can check with:
type(list_of_lists[0])
list
If we take one item from that, it’s a string
list_of_lists[0][1]
' Dr. Sarah Brown'
The list of lists is the same length as our original sentence_list, because it was made from that.
len(sentence_list)
17
len(list_of_lists)
17
Then we can pass our list of lists to the DataFrame constructor and set the column headings with the columns parameter, which accepts a list.
pd.DataFrame([sent_attr.split('-') for sent_attr in sentence_list],
columns=['sentence','attribution'])
| | sentence | attribution |
|---|---|---|
| 0 | Programming is a Practice. | Dr. Sarah Brown |
| 1 | So far it's going pretty good. | Matt |
| 2 | Pretty good pace, Github is still a little con... | A |
| 3 | Class is going well, I'm really excited for th... | Diondra |
| 4 | Good straight forward | Adam |
| 5 | Good pace and engaging | Jacob |
| 6 | I like cheese | Anon |
| 7 | Going well, I enjoy python | Aiden |
| 8 | Class is going very well, I'm excited to learn... | Greg |
| 9 | Really enjoying the class so far, Clear instru... | Michael |
| 10 | So far this class is going well, very engaging... | Anon |
| 11 | Great pace and notes have been really useful | Brandon |
| 12 | Good pace and easy to follow | Muhammad |
| 13 | Class is going well and engaging, but also a l... | Isaiah |
| 14 | Well paced, informative, and helpful | Vinnie |
| 15 | Spectacular | Michael Jackson |
| 16 | Very interesting! I am enjoying it a lot so fa... | Max |
Hint
We built the list of lists here with a list comprehension because there are only a few items, but for a longer list, you might use a for loop and append.
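The for-loop-and-append approach could look like this; a sketch using a shortened, hypothetical subset of the list from class:

```python
# a short excerpt of the list used in class (hypothetical subset)
sentence_list = [
    'Programming is a Practice. - Dr. Sarah Brown',
    'Good pace and engaging - Jacob',
]

# build the list of lists one item at a time instead of with a comprehension
list_of_lists = []
for sent_attr in sentence_list:
    list_of_lists.append(sent_attr.split('-'))

# list_of_lists[0] is ['Programming is a Practice. ', ' Dr. Sarah Brown']
```

Both versions produce the same result; the comprehension is just more compact.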
6.3. Summarizing Data#
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url)
So far, we’ve loaded data in a few different ways and then we’ve examined DataFrames as a data structure, looking at what different attributes they have and what some of the methods are, and how to get data into them.
coffee_df.head()
| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 | 2 | Robusta | nishant gurjer | India | sethuraman estate kaapi royale | 25 | sethuraman estate | 14/1148/2017/21 | kaapi royale | 3170 | ... | NaN | 2 | October 31st, 2018 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3170.0 | 3170.0 | 3170.0 |
2 | 3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
3 | 4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
4 | 5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
5 rows × 44 columns
Now, we can actually start to analyze the data itself. The describe method provides us with a set of summary statistics that broadly describe the data overall.
coffee_df.describe()
| | Unnamed: 0 | Number.of.Bags | Harvest.Year | Fragrance...Aroma | Flavor | Aftertaste | Salt...Acid | Bitter...Sweet | Mouthfeel | Uniform.Cup | ... | Balance | Cupper.Points | Total.Cup.Points | Moisture | Category.One.Defects | Quakers | Category.Two.Defects | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | ... | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.0 | 28.000000 | 25.00000 | 25.000000 | 25.000000 |
mean | 14.500000 | 168.000000 | 2013.964286 | 7.702500 | 7.630714 | 7.559643 | 7.657143 | 7.675714 | 7.506786 | 9.904286 | ... | 7.541786 | 7.761429 | 80.868929 | 0.065714 | 2.964286 | 0.0 | 1.892857 | 1367.60000 | 1387.600000 | 1377.600000 |
std | 8.225975 | 143.226317 | 1.346660 | 0.296156 | 0.303656 | 0.342469 | 0.261773 | 0.317063 | 0.725152 | 0.238753 | ... | 0.526076 | 0.330507 | 2.441233 | 0.058464 | 12.357280 | 0.0 | 2.601129 | 838.06205 | 831.884207 | 833.980216 |
min | 1.000000 | 1.000000 | 2012.000000 | 6.750000 | 6.670000 | 6.500000 | 6.830000 | 6.670000 | 5.080000 | 9.330000 | ... | 5.250000 | 6.920000 | 73.750000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 40.00000 | 40.000000 | 40.000000 |
25% | 7.750000 | 1.000000 | 2013.000000 | 7.580000 | 7.560000 | 7.397500 | 7.560000 | 7.580000 | 7.500000 | 10.000000 | ... | 7.500000 | 7.580000 | 80.170000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 795.00000 | 795.000000 | 795.000000 |
50% | 14.500000 | 170.000000 | 2014.000000 | 7.670000 | 7.710000 | 7.670000 | 7.710000 | 7.750000 | 7.670000 | 10.000000 | ... | 7.670000 | 7.830000 | 81.500000 | 0.100000 | 0.000000 | 0.0 | 1.000000 | 1095.00000 | 1200.000000 | 1100.000000 |
75% | 21.250000 | 320.000000 | 2015.000000 | 7.920000 | 7.830000 | 7.770000 | 7.830000 | 7.830000 | 7.830000 | 10.000000 | ... | 7.830000 | 7.920000 | 82.520000 | 0.120000 | 0.000000 | 0.0 | 2.000000 | 1488.00000 | 1488.000000 | 1488.000000 |
max | 28.000000 | 320.000000 | 2017.000000 | 8.330000 | 8.080000 | 7.920000 | 8.000000 | 8.420000 | 8.250000 | 10.000000 | ... | 8.000000 | 8.580000 | 83.750000 | 0.130000 | 63.000000 | 0.0 | 9.000000 | 3170.00000 | 3170.000000 | 3170.000000 |
8 rows × 21 columns
We can also select one variable at a time
coffee_df['Balance'].describe()
count 28.000000
mean 7.541786
std 0.526076
min 5.250000
25% 7.500000
50% 7.670000
75% 7.830000
max 8.000000
Name: Balance, dtype: float64
To dig in on what the quantiles really mean, we can compute one manually.
First, we sort the data; then, for the 25% quantile, we select the value at index 6, because there are 28 values and 28 × 0.25 = 7, so we take the 7th sorted value (which is at index 6).
balance_sorted = coffee_df['Balance'].sort_values().values
balance_sorted[6]
7.5
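The same positional logic can be checked against the quantile method on a small made-up series (an illustration, not the coffee data):

```python
import pandas as pd

toy = pd.Series([5, 1, 3, 2, 4])
toy_sorted = toy.sort_values().values   # array([1, 2, 3, 4, 5])

# with 5 values, pandas' default (linear interpolation) 25% quantile
# lands at position 0.25 * (5 - 1) = 1 in the sorted data
manual = toy_sorted[1]
manual == toy.quantile(.25)   # True
```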
We can also extract each of the statistics that the describe method calculates individually, by name. The quantiles are tricky: we can't just call a .25%() method to get the 25th percentile; we have to use the quantile method and pass it a value between 0 and 1.
coffee_df['Flavor'].quantile(.25)
7.5600000000000005
We can also pass other values
coffee_df['Flavor'].quantile(.8)
7.83
Calculate the mean of the Aftertaste column:
coffee_df['Aftertaste'].mean()
7.559642857142856
6.4. What about the nonnumerical variables?#
For example, the Color column:
coffee_df['Color'].head()
0 Green
1 NaN
2 Green
3 Green
4 Green
Name: Color, dtype: object
We can get the prevalence of each one with value_counts
coffee_df['Color'].value_counts()
Green 20
Blue-Green 3
Bluish-Green 2
None 1
Name: Color, dtype: int64
Try it Yourself
Note that value_counts does not count the NaN values, but count counts all of the non-missing values, and the shape of the DataFrame gives the total number of rows.
How can you get the number of missing Colors?
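The distinction between these three can be seen on a small made-up Series with one missing value (a sketch, not the answer to the exercise):

```python
import pandas as pd
import numpy as np

colors = pd.Series(['Green', 'Green', np.nan, 'Blue-Green'])

colors.value_counts()   # Green 2, Blue-Green 1 -- the NaN is skipped
colors.count()          # 3, the number of non-missing values
colors.shape[0]         # 4, the total number of rows
```

Comparing the last two gives a way into the exercise.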
What country is most prevalent in this dataset?
coffee_df['Country.of.Origin'].value_counts()
India 13
Uganda 10
United States 2
Ecuador 2
Vietnam 1
Name: Country.of.Origin, dtype: int64
We can get the name of the most common country out of this Series using idxmax:
coffee_df['Country.of.Origin'].value_counts().idxmax()
'India'
Question From Class
Q: Can we calculate the mode to find the most prevalent? A: Yes. We can also use the mode method, which works on both numerical and nonnumerical values.
coffee_df['Country.of.Origin'].mode()
0 India
Name: Country.of.Origin, dtype: object
6.5. Questions After Class#
6.5.1. General Questions#
6.5.1.1. How to know what functions are compatible with other functions?#
This is something that builds up over time, but one thing to look for is to check the types. That’s why we’ve been using type()
so often in class.
Having a goal for your analysis will also help.
6.5.1.2. Best place to find all the individual functions based on the .describe() function#
Each one mostly goes by the same name as its method, and the pandas documentation is the best way to find more information on every pandas method. On any function's page, see the See also section too.
6.5.1.3. Are there panda functions to read files other than CSV?#
Yes! See the hint on assignment 2 for more information on this.
6.5.2. Clarifying#
6.5.2.1. Is sent_attr in the DataFrame loop its own variable that we assign, or is it a base Python function?#
That’s a variable that we created; it’s equivalent to a loop variable.
6.5.2.2. Difference between %load and how we’ve been importing links to datasets in previous classes?#
The pandas functions read data in and put it specifically into a DataFrame.
The %load magic reads a file and basically pastes its contents into the Jupyter cell; it doesn't actually put the data into memory in Python.
6.5.2.3. When I ran [sent_attr.split('-') for sent_attr in sentence_list] in Jupyter, I got a NameError#
I also did at first, because I hadn't yet run the cell above, so sentence_list was not defined.
6.5.2.4. Why do we use quantile instead of quartile, since we can use mean to find the mean?#
It's just the name of the method. Quantile is the more general term; the quartiles are the specific quantiles at 0.25, 0.5, and 0.75.
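Related to the naming: quantile also accepts a list, so all three quartiles can be requested in one call (a sketch on a made-up series):

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5])

# ask for the three quartiles at once; returns a Series
# indexed by 0.25, 0.50, and 0.75
quartiles = toy.quantile([.25, .5, .75])
```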
6.5.3. Course Admin and Assignment Questions#
6.5.3.1. when are the achievements we earn in class going to be inputted in brightspace?#
Assignment achievements will be posted after grading is complete. Level 1s from class will be posted most weeks on Friday or over the weekend. Right now we’re a little behind as we get everything set up.
6.5.3.2. Is there anything we have to do for the assignment after committing the changes to the files?#
Yes, after committing, push. Once you can see your files on the GitHub website interface, we can see them for grading.
6.5.3.3. When we push the homework are we pushing it to the portfolio?#
No. When you accept the assignment, it will create a new repository for that assignment. That's why it's important to accept each assignment.
6.5.3.4. Still a little unsure about what the num_numerical specifically is, is it just the number of numerical columns?#
Yes, it is the number of numerical columns.
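One way to compute that count is with select_dtypes; this is a sketch on a small made-up DataFrame (the assignment may define the count differently):

```python
import pandas as pd

# hypothetical DataFrame with two numerical columns and one text column
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [0.5, 1.5]})

# keep only the numeric columns, then count them
num_numerical = df.select_dtypes(include='number').shape[1]
# num_numerical is 2 here ('a' and 'c')
```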
6.6. More Practice#
Produce a table with the mean for each score.
Which variables have the most missing data?
What’s the total number of bags of coffee produced?
Which ratings have similar ranges? (max, min)
Which ratings are most consistent across coffees?
What score cutoff could you apply in order to select the top 10 best for Balance?