6. Exploratory Data Analysis#
import pandas as pd
6.1. Staying Organized#
See the new File structure section.
Be sure to accept assignments, and close the feedback PR if you will not work on them.
6.2. Data Frame from lists of Lists#
On Friday, we collectively made a list of strings
# %load http://drsmb.co/310read
# share a sentence or a few words about how class is going or
# one takeaway you have from class so far.
# and attribute it with a name after a hyphen (-)
# You can remain anonymous (this page & the notes will be fully public)
# by attributing it to a celebrity or pseudonym, but include *some* sort of attribution
sentence_list = [
'Programming is a Practice. - Dr. Sarah Brown',
"So far it's going pretty good. - Matt",
'Pretty good pace, Github is still a little confusing is all - A',
"Class is going well, I'm really excited for the new skills that I will attain through this course. - Diondra",
'Good straight forward - Adam',
'Good pace and engaging - Jacob',
'I like cheese - Anon',
'Going well, I enjoy python - Aiden',
"Class is going very well, I'm excited to learn more about manipulating data sets - Greg",
'Really enjoying the class so far, Clear instructions and outcomes - Michael',
'So far this class is going well, very engaging lectures. - Anon',
'Great pace and notes have been really useful - Brandon',
'Good pace and easy to follow - Muhammad',
'Class is going well and engaging, but also a little difficult getting into the swing of things - Isaiah',
'Well paced, informative, and helpful - Vinnie',
'Spectacular -Michael Jackson',
'Very interesting! I am enjoying it a lot so far. Getting to experience pandas as well as using github/jupyter has been cool. One thing I would change though is slowing down the pace a bit. - Max'
]
We can check that by using type, first on the whole thing
type(sentence_list)
list
And then we can index into the list. Lists can be indexed by integers:
sentence_list[4]
'Good straight forward - Adam'
This is one item from the list; we can then check its type:
type(sentence_list[4])
str
First, we’ll convert our list of strings to a list of lists with a list comprehension. This will iterate over each string in the list and apply the string method split. Since we pass it the character '-', it will split at that character.
[sent_attr.split('-') for sent_attr in sentence_list]
[['Programming is a Practice. ', ' Dr. Sarah Brown'],
["So far it's going pretty good. ", ' Matt'],
['Pretty good pace, Github is still a little confusing is all ', ' A'],
["Class is going well, I'm really excited for the new skills that I will attain through this course. ",
' Diondra'],
['Good straight forward ', ' Adam'],
['Good pace and engaging ', ' Jacob'],
['I like cheese ', ' Anon'],
['Going well, I enjoy python ', ' Aiden'],
["Class is going very well, I'm excited to learn more about manipulating data sets ",
' Greg'],
['Really enjoying the class so far, Clear instructions and outcomes ',
' Michael'],
['So far this class is going well, very engaging lectures. ', ' Anon'],
['Great pace and notes have been really useful ', ' Brandon'],
['Good pace and easy to follow ', ' Muhammad'],
['Class is going well and engaging, but also a little difficult getting into the swing of things ',
' Isaiah'],
['Well paced, informative, and helpful ', ' Vinnie'],
['Spectacular ', 'Michael Jackson'],
['Very interesting! I am enjoying it a lot so far. Getting to experience pandas as well as using github/jupyter has been cool. One thing I would change though is slowing down the pace a bit. ',
' Max']]
If we save it to a variable, we can analyze it better, for example by indexing into it:
list_of_lists = [sent_attr.split('-') for sent_attr in sentence_list]
list_of_lists[0]
['Programming is a Practice. ', ' Dr. Sarah Brown']
This is a list, which we can check with:
type(list_of_lists[0])
list
If we take one item from that, it’s a string
list_of_lists[0][1]
' Dr. Sarah Brown'
The list of lists is the same length as our original sentence_list, because it was made from that.
len(sentence_list)
17
len(list_of_lists)
17
Then we can pass our list of lists to the DataFrame constructor and set the column headings with the columns parameter, which accepts a list.
pd.DataFrame([sent_attr.split('-') for sent_attr in sentence_list],
columns=['sentence','attribution'])
| | sentence | attribution |
|---|---|---|
| 0 | Programming is a Practice. | Dr. Sarah Brown |
| 1 | So far it's going pretty good. | Matt |
| 2 | Pretty good pace, Github is still a little con... | A |
| 3 | Class is going well, I'm really excited for th... | Diondra |
| 4 | Good straight forward | Adam |
| 5 | Good pace and engaging | Jacob |
| 6 | I like cheese | Anon |
| 7 | Going well, I enjoy python | Aiden |
| 8 | Class is going very well, I'm excited to learn... | Greg |
| 9 | Really enjoying the class so far, Clear instru... | Michael |
| 10 | So far this class is going well, very engaging... | Anon |
| 11 | Great pace and notes have been really useful | Brandon |
| 12 | Good pace and easy to follow | Muhammad |
| 13 | Class is going well and engaging, but also a l... | Isaiah |
| 14 | Well paced, informative, and helpful | Vinnie |
| 15 | Spectacular | Michael Jackson |
| 16 | Very interesting! I am enjoying it a lot so fa... | Max |
Hint
We built the list of lists here with a list comprehension because there are only a few items, but for a longer list, you might use a for loop and append.
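The for-loop-and-append approach could look like this; a sketch using a shortened, hypothetical subset of the list from class:

```python
# a short excerpt of the list used in class (hypothetical subset)
sentence_list = [
    'Programming is a Practice. - Dr. Sarah Brown',
    'Good pace and engaging - Jacob',
]

# build the list of lists one item at a time instead of with a comprehension
list_of_lists = []
for sent_attr in sentence_list:
    list_of_lists.append(sent_attr.split('-'))

# list_of_lists[0] is ['Programming is a Practice. ', ' Dr. Sarah Brown']
```

Both versions produce the same result; the comprehension is just more compact.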
6.3. Summarizing Data#
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url)
So far, we’ve loaded data in a few different ways and then we’ve examined DataFrames as a data structure, looking at what different attributes they have and what some of the methods are, and how to get data into them.
coffee_df.head()
| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 | 2 | Robusta | nishant gurjer | India | sethuraman estate kaapi royale | 25 | sethuraman estate | 14/1148/2017/21 | kaapi royale | 3170 | ... | NaN | 2 | October 31st, 2018 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3170.0 | 3170.0 | 3170.0 |
2 | 3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
3 | 4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
4 | 5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
5 rows × 44 columns
Now, we can actually start to analyze the data itself. The describe method provides us with a set of summary statistics that broadly describe the data overall.
coffee_df.describe()
| | Unnamed: 0 | Number.of.Bags | Harvest.Year | Fragrance...Aroma | Flavor | Aftertaste | Salt...Acid | Bitter...Sweet | Mouthfeel | Uniform.Cup | ... | Balance | Cupper.Points | Total.Cup.Points | Moisture | Category.One.Defects | Quakers | Category.Two.Defects | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | ... | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.000000 | 28.0 | 28.000000 | 25.00000 | 25.000000 | 25.000000 |
mean | 14.500000 | 168.000000 | 2013.964286 | 7.702500 | 7.630714 | 7.559643 | 7.657143 | 7.675714 | 7.506786 | 9.904286 | ... | 7.541786 | 7.761429 | 80.868929 | 0.065714 | 2.964286 | 0.0 | 1.892857 | 1367.60000 | 1387.600000 | 1377.600000 |
std | 8.225975 | 143.226317 | 1.346660 | 0.296156 | 0.303656 | 0.342469 | 0.261773 | 0.317063 | 0.725152 | 0.238753 | ... | 0.526076 | 0.330507 | 2.441233 | 0.058464 | 12.357280 | 0.0 | 2.601129 | 838.06205 | 831.884207 | 833.980216 |
min | 1.000000 | 1.000000 | 2012.000000 | 6.750000 | 6.670000 | 6.500000 | 6.830000 | 6.670000 | 5.080000 | 9.330000 | ... | 5.250000 | 6.920000 | 73.750000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 40.00000 | 40.000000 | 40.000000 |
25% | 7.750000 | 1.000000 | 2013.000000 | 7.580000 | 7.560000 | 7.397500 | 7.560000 | 7.580000 | 7.500000 | 10.000000 | ... | 7.500000 | 7.580000 | 80.170000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 795.00000 | 795.000000 | 795.000000 |
50% | 14.500000 | 170.000000 | 2014.000000 | 7.670000 | 7.710000 | 7.670000 | 7.710000 | 7.750000 | 7.670000 | 10.000000 | ... | 7.670000 | 7.830000 | 81.500000 | 0.100000 | 0.000000 | 0.0 | 1.000000 | 1095.00000 | 1200.000000 | 1100.000000 |
75% | 21.250000 | 320.000000 | 2015.000000 | 7.920000 | 7.830000 | 7.770000 | 7.830000 | 7.830000 | 7.830000 | 10.000000 | ... | 7.830000 | 7.920000 | 82.520000 | 0.120000 | 0.000000 | 0.0 | 2.000000 | 1488.00000 | 1488.000000 | 1488.000000 |
max | 28.000000 | 320.000000 | 2017.000000 | 8.330000 | 8.080000 | 7.920000 | 8.000000 | 8.420000 | 8.250000 | 10.000000 | ... | 8.000000 | 8.580000 | 83.750000 | 0.130000 | 63.000000 | 0.0 | 9.000000 | 3170.00000 | 3170.000000 | 3170.000000 |
8 rows × 21 columns
We can also select one variable at a time
coffee_df['Balance'].describe()
count 28.000000
mean 7.541786
std 0.526076
min 5.250000
25% 7.500000
50% 7.670000
75% 7.830000
max 8.000000
Name: Balance, dtype: float64
To dig in on what the quantiles really mean, we can compute one manually.
First, we sort the data; then, for the 25% quantile, we select the value at index 6, because there are 28 values and 28 × 0.25 = 7, so we take the 7th sorted value (which is at index 6).
balance_sorted = coffee_df['Balance'].sort_values().values
balance_sorted[6]
7.5
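The same positional logic can be checked against the quantile method on a small made-up series (an illustration, not the coffee data):

```python
import pandas as pd

toy = pd.Series([5, 1, 3, 2, 4])
toy_sorted = toy.sort_values().values   # array([1, 2, 3, 4, 5])

# with 5 values, pandas' default (linear interpolation) 25% quantile
# lands at position 0.25 * (5 - 1) = 1 in the sorted data
manual = toy_sorted[1]
manual == toy.quantile(.25)   # True
```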
We can also extract each of the statistics that the describe method calculates individually, by name. The quantiles are tricky: we can't just call a .25%() method to get the 25th percentile; we have to use the quantile method and pass it a value between 0 and 1.
coffee_df['Flavor'].quantile(.25)
7.5600000000000005
We can also pass other values
coffee_df['Flavor'].quantile(.8)
7.83
Calculate the mean of the Aftertaste column:
coffee_df['Aftertaste'].mean()
7.559642857142856
6.4. What about the nonnumerical variables?#
For example, the Color column:
coffee_df['Color'].head()
0 Green
1 NaN
2 Green
3 Green
4 Green
Name: Color, dtype: object
We can get the prevalence of each one with value_counts
coffee_df['Color'].value_counts()
Green 20
Blue-Green 3
Bluish-Green 2
None 1
Name: Color, dtype: int64
Try it Yourself
Note that value_counts does not count the NaN values, but count counts all of the non-missing values, and the shape of the DataFrame gives the total number of rows.
How can you get the number of missing Colors?
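The distinction between these three can be seen on a small made-up Series with one missing value (a sketch, not the answer to the exercise):

```python
import pandas as pd
import numpy as np

colors = pd.Series(['Green', 'Green', np.nan, 'Blue-Green'])

colors.value_counts()   # Green 2, Blue-Green 1 -- the NaN is skipped
colors.count()          # 3, the number of non-missing values
colors.shape[0]         # 4, the total number of rows
```

Comparing the last two gives a way into the exercise.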
What country is most prevalent in this dataset?
coffee_df['Country.of.Origin'].value_counts()
India 13
Uganda 10
United States 2
Ecuador 2
Vietnam 1
Name: Country.of.Origin, dtype: int64
We can get the name of the most common country out of this Series using idxmax:
coffee_df['Country.of.Origin'].value_counts().idxmax()
'India'
Question From Class
Q: Can we calculate the mode to find the most prevalent? A: Yes. We can also use the mode method, which works on both numerical and nonnumerical values.
coffee_df['Country.of.Origin'].mode()
0 India
Name: Country.of.Origin, dtype: object
6.5. Questions After Class#
6.5.1. General Questions#
6.5.1.1. How to know what functions are compatible with other functions?#
This is something that builds up over time, but one thing to look for is to check the types. That’s why we’ve been using type()
so often in class.
Having a goal for your analysis will also help.
6.5.1.2. Best place to find all the individual functions based on the .describe() function#
Each one mostly goes by the same name as its method, and the pandas documentation is the best way to find more information on every pandas method. On any function's page, see the See also section too.
6.5.1.3. Are there panda functions to read files other than CSV?#
Yes! See the hint on assignment 2 for more information on this.
6.5.2. Clarifying#
6.5.2.1. Is sent_attr in the DataFrame loop its own variable that we assign, or is it a base Python function?#
That’s a variable that we created; it’s equivalent to a loop variable.
6.5.2.2. Difference between %load and how we’ve been importing links to datasets in previous classes?#
The pandas functions read data in and put it specifically into a DataFrame.
The %load magic reads a file and basically pastes its contents into the Jupyter cell; it doesn't actually put the data into memory in Python.
6.5.2.3. When I ran [sent_attr.split('-') for sent_attr in sentence_list] in Jupyter, I got a NameError#
I also did at first, because I hadn't yet run the cell above, so sentence_list was not defined.
6.5.2.4. Why do we use quantile instead of quartile, since we can use mean to find the mean?#
It's just the name of the method. Quantile is the more general term; the quartiles are the specific quantiles at 0.25, 0.5, and 0.75.
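Related to the naming: quantile also accepts a list, so all three quartiles can be requested in one call (a sketch on a made-up series):

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5])

# ask for the three quartiles at once; returns a Series
# indexed by 0.25, 0.50, and 0.75
quartiles = toy.quantile([.25, .5, .75])
```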
6.5.3. Course Admin and Assignment Questions#
6.5.3.1. when are the achievements we earn in class going to be inputted in brightspace?#
Assignment achievements will be posted after grading is complete. Level 1s from class will be posted most weeks on Friday or over the weekend. Right now we’re a little behind as we get everything set up.
6.5.3.2. Is there anything we have to do for the assignment after committing the changes to the files?#
Yes, after committing, push. Once you can see your files on the GitHub website interface, we can see them for grading.
6.5.3.3. When we push the homework are we pushing it to the portfolio?#
No. When you accept the assignment, it will create a new repository for that assignment. That's why it's important to accept each assignment.
6.5.3.4. Still a little unsure about what the num_numerical specifically is, is it just the number of numerical columns?#
Yes, it is the number of numerical columns.
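One way to compute that count is with select_dtypes; this is a sketch on a small made-up DataFrame (the assignment may define the count differently):

```python
import pandas as pd

# hypothetical DataFrame with two numerical columns and one text column
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [0.5, 1.5]})

# keep only the numeric columns, then count them
num_numerical = df.select_dtypes(include='number').shape[1]
# num_numerical is 2 here ('a' and 'c')
```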
6.6. More Practice#
Produce a table with the mean for each score.
Which variables have the most missing data?
What’s the total number of bags of coffee produced?
Which ratings have similar ranges? (max, min)
Which ratings are most consistent across coffees?
What score cutoff could you apply in order to select the top 10 best for Balance?