3. Pandas Data Frames and More Iterable Types#
Note
I was able to fix my certificate problem by running this script from the python install.
We will import pandas
import pandas as pd
We want to import the coffee data, and we will keep working with it over several classes. Sometimes we want to put functions we have written, or constants like these, in a module of our own that we can import. For today we will put the following in a file called coffee.py:
robusta_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
robusta_data_file = 'robusta_data_cleaned.csv'
Now we can import it
import coffee
and use the variable with a ., as an attribute of the module.
coffee.robusta_data_url
'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
We could also import the variables directly so that we can use them without the .
from coffee import robusta_data_url, robusta_data_file
For example, if we misspell the name:
ro
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[5], line 1
----> 1 ro
NameError: name 'ro' is not defined
This is a NameError because the variable does not exist.
When we spell it correctly, it works as desired.
robusta_data_url
'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
Then we can use the variable.
coffee_df = pd.read_csv(robusta_data_url)
3.1. DataFrame parts#
first we will check the type
type(coffee_df)
pandas.core.frame.DataFrame
we saw that we can check the first 5 rows with head
coffee_df.head()
Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 | 2 | Robusta | nishant gurjer | India | sethuraman estate kaapi royale | 25 | sethuraman estate | 14/1148/2017/21 | kaapi royale | 3170 | ... | NaN | 2 | October 31st, 2018 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3170.0 | 3170.0 | 3170.0 |
2 | 3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
3 | 4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
4 | 5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
5 rows × 44 columns
and tail to see the last ones.
coffee_df.tail(3)
Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25 | 26 | Robusta | james moore | United States | fazenda cazengo | NaN | cafe cazengo | NaN | global opportunity fund | 795 meters | ... | NaN | 6 | December 23rd, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 795.0 | 795.0 | 795.0 |
26 | 27 | Robusta | cafe politico | India | NaN | NaN | NaN | 14-1118-2014-0087 | cafe politico | NaN | ... | Green | 1 | August 25th, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | NaN | NaN | NaN |
27 | 28 | Robusta | cafe politico | Vietnam | NaN | NaN | NaN | NaN | cafe politico | NaN | ... | NaN | 9 | August 25th, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | NaN | NaN | NaN |
3 rows × 44 columns
We note that one of the columns is named Unnamed: 0 because that column does not have a name in the file. We can use shift + tab to see the help, where we see that the index_col parameter can be used to treat that column as the index.
coffee_df = pd.read_csv(robusta_data_url,index_col=0)
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 rows × 43 columns
3.1.1. Data Frame parts#
Now it looks neater and that column is the index.
We saw before the columns
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
'Certification.Body', 'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
'altitude_mean_meters'],
dtype='object')
The rows are named by the index attribute.
coffee_df.index
Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
dtype='int64')
Both of these are pandas Index objects
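We can confirm that with type. Here is a minimal sketch using a small stand-in DataFrame (the values are made up for illustration, but coffee_df behaves the same way):

```python
import pandas as pd

# a tiny stand-in DataFrame (made-up values)
demo_df = pd.DataFrame({'Flavor': [8.08, 7.75], 'Color': ['Green', 'Green']},
                       index=[1, 2])

# both the column labels and the row labels are pandas Index objects
print(isinstance(demo_df.columns, pd.Index))  # True
print(isinstance(demo_df.index, pd.Index))    # True
```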
We can also take out only the values without the index or the column names.
coffee_df.values
array([['Robusta', 'ankole coffee producers coop', 'Uganda', ..., 1488.0,
1488.0, 1488.0],
['Robusta', 'nishant gurjer', 'India', ..., 3170.0, 3170.0,
3170.0],
['Robusta', 'andrew hetzel', 'India', ..., 1000.0, 1000.0, 1000.0],
...,
['Robusta', 'james moore', 'United States', ..., 795.0, 795.0,
795.0],
['Robusta', 'cafe politico', 'India', ..., nan, nan, nan],
['Robusta', 'cafe politico', 'Vietnam', ..., nan, nan, nan]],
dtype=object)
this is a numpy array; we can check the type
type(coffee_df.values)
numpy.ndarray
3.1.2. Segmenting rows or columns#
We can pick out a column using its name and []
coffee_df['Flavor']
1 8.08
2 7.75
3 7.83
4 7.92
5 7.83
6 7.92
7 7.75
8 7.75
9 7.75
10 7.83
11 7.92
12 7.83
13 7.58
14 7.75
15 7.67
16 7.75
17 7.50
18 7.58
19 7.58
20 7.42
21 7.67
22 7.42
23 7.50
24 7.58
25 7.67
26 7.33
27 6.83
28 6.67
Name: Flavor, dtype: float64
this is a pandas Series
type(coffee_df['Flavor'])
pandas.core.series.Series
We can pick a row out with loc, the property that grabs a row by its name, a value from the index.
coffee_df.loc[7]
Species Robusta
Owner andrew hetzel
Country.of.Origin India
Farm.Name sethuraman estates
Lot.Number NaN
Mill NaN
ICO.Number NaN
Company cafemakers
Altitude 750m
Region chikmagalur
Producer Nishant Gurjer
Number.of.Bags 320
Bag.Weight 2 kg
In.Country.Partner Specialty Coffee Association
Harvest.Year 2014
Grading.Date May 15th, 2014
Owner.1 Andrew Hetzel
Variety NaN
Processing.Method NaN
Fragrance...Aroma 7.67
Flavor 7.75
Aftertaste 7.83
Salt...Acid 7.83
Bitter...Sweet 8.0
Mouthfeel 7.92
Uniform.Cup 10.0
Clean.Cup 10.0
Balance 7.75
Cupper.Points 7.83
Total.Cup.Points 82.58
Moisture 0.0
Category.One.Defects 0
Quakers 0
Color Green
Category.Two.Defects 0
Expiration May 15th, 2015
Certification.Body Specialty Coffee Association
Certification.Address ff7c18ad303d4b603ac3f8cff7e611ffc735e720
Certification.Contact 352d0cf7f3e9be14dad7df644ad65efc27605ae2
unit_of_measurement m
altitude_low_meters 750.0
altitude_high_meters 750.0
altitude_mean_meters 750.0
Name: 7, dtype: object
this is also a Series.
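We can check that the same way. A minimal sketch on a small stand-in DataFrame (made-up values) shows that a row pulled out with loc is a Series whose index is the column names:

```python
import pandas as pd

demo_df = pd.DataFrame({'Flavor': [8.08, 7.75, 7.83]}, index=[1, 2, 3])

row = demo_df.loc[2]    # grab the row named 2
print(type(row))        # a pandas Series
print(list(row.index))  # the column names become the Series index
```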
3.1.3. Masking to Select Subsets#
We can use a boolean mask or Series with boolean values, to pick out a subset of rows.
To build up to that, we will first evaluate a boolean expression on a column.
coffee_df['Flavor']>7.8
1 True
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
Name: Flavor, dtype: bool
Now, we can save it to a variable, and then use it to mask, by treating it as an index.
high_flavor = coffee_df['Flavor']>7.8
coffee_df[high_flavor]
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | chikmagalur | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | central | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | luwero central region | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
6 | Robusta | andrew hetzel | India | NaN | NaN | (self) | NaN | cafemakers, llc | 3000' | chikmagalur | ... | Green | 0 | February 28th, 2013 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3000.0 | 3000.0 | 3000.0 |
10 | Robusta | ugacof | Uganda | ishaka | NaN | nsubuga umar | 0 | ugacof ltd | 900-1300 | western | ... | Green | 6 | August 5th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 900.0 | 1300.0 | 1100.0 |
11 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1095 | iganga namadrope eastern | ... | Green | 1 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1095.0 | 1095.0 | 1095.0 |
12 | Robusta | nishant gurjer | India | sethuraman estate kaapi royale | RC AB | sethuraman estate | 14/1148/2016/12 | kaapi royale | 1000 | chikmagalur karnataka | ... | Green | 0 | August 23rd, 2017 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
8 rows × 43 columns
this segments out only the selected rows.
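The mask does not have to be saved in a variable first; the comparison can go directly inside the square brackets. A minimal sketch with made-up values:

```python
import pandas as pd

demo_df = pd.DataFrame({'Flavor': [8.08, 7.75, 7.83, 7.92]}, index=[1, 2, 3, 4])

# build the boolean mask and use it in one step
high = demo_df[demo_df['Flavor'] > 7.8]
print(high)  # only rows 1, 3, and 4 remain
```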
The values are a numpy array which has the same shape as the dataframe, but it does not include the index or the column headers (which do not count as rows or columns anyway).
coffee_vals = coffee_df.values
We can see this with the shape attribute.
coffee_vals.shape
(28, 43)
coffee_df.shape
(28, 43)
3.2. Reading from a website#
We saw in the help that pandas has a lot of read_ functions. You can also see a list on the IO page of the docs.
achievements_url = 'https://rhodyprog4ds.github.io/BrownFall24/syllabus/achievements.html'
We can read from html files (websites)
pd.read_html(achievements_url,)
[ Unnamed: 0_level_0 topics \
week Unnamed: 1_level_1
0 1 [admin, python review]
1 2 Loading data, Python review
2 3 Exploratory Data Analysis
3 4 Data Cleaning
4 5 Databases, Merging DataFrames
5 6 Modeling, classification performance metrics, ...
6 7 Naive Bayes, decision trees
7 8 Regression
8 9 Clustering
9 10 SVM, parameter tuning
10 11 KNN, Model comparison
11 12 Text Analysis
12 13 Images Analysis
13 14 Deep Learning
skills
Unnamed: 2_level_1
0 process
1 [access, prepare, summarize]
2 [summarize, visualize]
3 [prepare, summarize, visualize]
4 [access, construct, summarize]
5 [evaluate]
6 [classification, evaluate]
7 [regression, evaluate]
8 [clustering, evaluate]
9 [optimize, tools]
10 [compare, tools]
11 [unstructured]
12 [unstructured, tools]
13 [tools, compare] ,
Unnamed: 0_level_0 skill \
keyword Unnamed: 1_level_1
0 python pythonic code writing
1 process describe data science as a process
2 access access data in multiple formats
3 construct construct datasets from multiple sources
4 summarize Summarize and describe data
5 visualize Visualize data
6 prepare prepare data for analysis
7 evaluate Evaluate model performance
8 classification Apply classification
9 regression Apply Regression
10 clustering Clustering
11 optimize Optimize model parameters
12 compare compare models
13 representation Choose representations and transform data
14 workflow use industry standard data science tools and w...
Level 1 \
Unnamed: 2_level_1
0 python code that mostly runs, occasional pep8 ...
1 Identify basic components of data science
2 load data from at least one format; identify t...
3 identify what should happen to merge datasets ...
4 Describe the shape and structure of a dataset ...
5 identify plot types, generate basic plots from...
6 identify if data is or is not ready for analys...
7 Explain and compute basic performance metrics ...
8 identify and describe what classification is, ...
9 identify what data that can be used for regres...
10 describe what clustering is
11 Identify when model parameters need to be opti...
12 Qualitatively compare model classes
13 Identify options for representing text and cat...
14 Solve well strucutred fully specified problems...
Level 2 \
Unnamed: 3_level_1
0 python code that reliably runs, frequent pep8 ...
1 Describe and define each stage of the data sci...
2 Load data for processing from the most common ...
3 apply basic merges
4 compute summary statndard statistics of a whol...
5 generate multiple plot types with complete lab...
6 apply data reshaping, cleaning, and filtering ...
7 Apply and interpret basic model evaluation met...
8 fit, apply, and interpret preselected classifi...
9 fit and interpret linear regression models
10 apply basic clustering
11 Optimize basic model parameters such as model ...
12 Compare model classes in specific terms and fi...
13 Apply at least one representation to transform...
14 Solve well-strucutred, open-ended problems, ap...
Level 3
Unnamed: 4_level_1
0 reliable, efficient, pythonic code that consis...
1 Compare different ways that data science can f...
2 access data from both common and uncommon form...
3 merge data that is not automatically aligned
4 Compute and interpret various summary statisti...
5 generate complex plots with pandas and plottin...
6 apply data reshaping, cleaning, and filtering ...
7 Evaluate a model with multiple metrics and cro...
8 fit and apply classification models and select...
9 fit and explain regrularized or nonlinear regr...
10 apply multiple clustering techniques, and inte...
11 Select optimal parameters based of mutiple qua...
12 Evaluate tradeoffs between different model com...
13 apply transformations in different contexts OR...
14 Independently scope and solve realistic data s... ,
Unnamed: 0_level_0 A1 A2 \
keyword Unnamed: 1_level_1 Unnamed: 2_level_1
0 python 1 1
1 process 1 0
2 access 0 1
3 construct 0 0
4 summarize 0 0
5 visualize 0 0
6 prepare 0 0
7 evaluate 0 0
8 classification 0 0
9 regression 0 0
10 clustering 0 0
11 optimize 0 0
12 compare 0 0
13 representation 0 0
14 workflow 0 0
A3 A4 A5 \
Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1
0 0 1 1
1 0 0 0
2 1 1 1
3 0 0 1
4 1 1 1
5 1 1 0
6 0 1 1
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A6 A7 A8 \
Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1
0 0 0 0
1 1 1 1
2 0 0 0
3 0 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 1 1
8 0 1 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A9 A10 A11 \
Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
0 0 0 0
1 1 1 1
2 0 0 0
3 0 0 0
4 1 1 1
5 1 1 1
6 0 0 0
7 0 1 1
8 0 1 0
9 0 0 1
10 1 0 1
11 0 1 1
12 0 0 1
13 0 0 0
14 0 1 1
A12 A13 # Assignments
Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1
0 0 0 4
1 0 0 7
2 0 0 4
3 0 0 3
4 1 1 11
5 1 1 10
6 0 0 2
7 0 0 5
8 0 0 2
9 0 0 2
10 0 0 2
11 0 0 2
12 0 1 2
13 1 1 2
14 1 1 4 ]
This does not look like the others; it is a list of DataFrames, one for each table on the page. Even if there is only one table, the result is still a list.
Now that we know it is a list, we will save it in a variable and then, pick out one DataFrame.
ach_df_list = pd.read_html(achievements_url,)
ach_defn = ach_df_list[2]
we can pick one out and check that it is indeed a DataFrame
type(ach_defn)
pandas.core.frame.DataFrame
3.3. Slicing#
We will use a small list to practice slicing, but it works the same in a DataFrame.
toy_list = [3,4,5,6]
we can get the last item with -1
toy_list[-1]
6
We can use : to go up to, but not including, a particular point.
toy_list[:2]
[3, 4]
Note that using 2 picks out the value in position 2, but the slice above stopped before that.
toy_list[2]
5
we can slice from the right too; the starting point of a slice is included, so this gives the last two items
toy_list[-2:]
[5, 6]
we can put numbers on both sides of the : to take a segment out
toy_list[1:3]
[4, 5]
we can use a second : to specify an interval (step) at which to pick values.
toy_list[::2]
[3, 5]
or we can state the starting point with it
toy_list[1::2]
[4, 6]
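The same slicing syntax carries over to a DataFrame. A minimal sketch on a small stand-in DataFrame (made-up values): square brackets slice rows by position, and the positional indexer iloc accepts the same slices.

```python
import pandas as pd

demo_df = pd.DataFrame({'Flavor': [8.08, 7.75, 7.83, 7.92]})

print(demo_df[:2])        # first two rows, like toy_list[:2]
print(demo_df[-2:])       # last two rows, like toy_list[-2:]
print(demo_df.iloc[::2])  # every other row, like toy_list[::2]
```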
3.4. Built in iterable types#
These are four different iterable constructions:
a = [char for char in 'abcde']
b = {char:i for i, char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split(' ')
We can see their types
type(a), type(b), type(c), type(d)
(list, dict, tuple, list)
Dictionaries are really useful because they consist of key, value pairs. This is really powerful and we will use it a lot to pass complex structures into functions.
we can make another one manually that is much like the above.
sample_dict = {'a':1,'b':2,'c':3}
sample_dict
{'a': 1, 'b': 2, 'c': 3}
we access values of a dictionary using the key
sample_dict['a']
1
we can iterate over the key, value pairs using the items method to get each pair separately.
for char,num in sample_dict.items():
print('char', char, 'is linked to number',num)
char a is linked to number 1
char b is linked to number 2
char c is linked to number 3
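One way dictionaries get used to pass complex structures into functions is the ** operator, which unpacks the key, value pairs into keyword arguments. A minimal sketch (describe_coffee is a made-up function, just for illustration):

```python
# describe_coffee is a made-up function for this example
def describe_coffee(species, country, altitude):
    return species + ' from ' + country + ' grown at ' + str(altitude) + 'm'

coffee_info = {'species': 'Robusta', 'country': 'Uganda', 'altitude': 1488}

# ** unpacks the dictionary; the keys must match the parameter names
print(describe_coffee(**coffee_info))  # Robusta from Uganda grown at 1488m
```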
3.5. Saving an excerpt#
We can find only the rows with a small Number.of.Bags, say <100:
small_batch_mask = coffee_df['Number.of.Bags']<100
small_batch = coffee_df[small_batch_mask]
After we make the mask and then subset the data, we see it's a DataFrame.
type(small_batch)
pandas.core.frame.DataFrame
and then we can save it to csv.
small_batch.to_csv('robusta_small_batch.csv')
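One way to check the save worked is to read the file back in; since to_csv writes the index as the first column, passing index_col=0 to read_csv recovers the original structure. A minimal sketch with a small stand-in DataFrame (the file name here is made up):

```python
import pandas as pd

demo_batch = pd.DataFrame({'Flavor': [7.75, 7.83]}, index=[1, 2])
demo_batch.to_csv('demo_small_batch.csv')

# reading back with index_col=0 restores the saved index
reloaded = pd.read_csv('demo_small_batch.csv', index_col=0)
print(reloaded.shape == demo_batch.shape)  # True
```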