Environment notes¶
See the new, more detailed python instructions. Even if you had things working the updated class environment file: rhodyds
Then use:
conda activate rhodyds
to have everything available.
If you had a certificate problem, this script should help from the python install.
Tabular Data¶
Structured data is easier to work with than other data.
We’re going to focus on tabular data for now. At the end of the course, we’ll examine images, which are structured, but more complex and text, which is much less structured.
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/refs/heads/master/data/robusta_data_cleaned.csv'
Pandas¶
We will use data with a library called pandas. By convention, we import it like:
import pandas as pd
the as
keywored creates an alias for the library name.
We can use read_csv
to load the data
pd.read_csv(coffee_data_url)
We note that one of the columns is named Unnamed 0:
because it does not have a name in the file, for that column.
We can use shift + tab to see help. We see that the index_col
parameter to treat that column as the index.
coffee_df = pd.read_csv(coffee_data_url,index_col=0)
DataFrame parts¶
first we will check the type
type(coffee_df)
pandas.core.frame.DataFrame
we saw that we can check the first 5 rows with head
coffee_df.head()
Now it looks neater and tht columns is the index.
coffee_df.head(2)
coffee_df.tail(1)
We can look at all of the column names:
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
'Certification.Body', 'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
'altitude_mean_meters'],
dtype='object')
The rows are named by the index
attribute
coffee_df.index
Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
dtype='int64')
Both of these are pandas Index objects
We can also take out only the values without the index or the column names.
coffee_df.values
array([['Robusta', 'ankole coffee producers coop', 'Uganda', ..., 1488.0,
1488.0, 1488.0],
['Robusta', 'nishant gurjer', 'India', ..., 3170.0, 3170.0,
3170.0],
['Robusta', 'andrew hetzel', 'India', ..., 1000.0, 1000.0, 1000.0],
...,
['Robusta', 'james moore', 'United States', ..., 795.0, 795.0,
795.0],
['Robusta', 'cafe politico', 'India', ..., nan, nan, nan],
['Robusta', 'cafe politico', 'Vietnam', ..., nan, nan, nan]],
shape=(28, 43), dtype=object)
We can also look at the shape
coffee_df.shape
(28, 43)
coffee_df.head(1)
Segmenting rows or columns¶
We can pick out a column using it’s name and []
coffee_df['Species']
1 Robusta
2 Robusta
3 Robusta
4 Robusta
5 Robusta
6 Robusta
7 Robusta
8 Robusta
9 Robusta
10 Robusta
11 Robusta
12 Robusta
13 Robusta
14 Robusta
15 Robusta
16 Robusta
17 Robusta
18 Robusta
19 Robusta
20 Robusta
21 Robusta
22 Robusta
23 Robusta
24 Robusta
25 Robusta
26 Robusta
27 Robusta
28 Robusta
Name: Species, dtype: object
this is a pandas Series
type(coffee_df['Species'])
pandas.core.series.Series
Or we can pick a row by label with loc
coffee_df.loc[1]
Species Robusta
Owner ankole coffee producers coop
Country.of.Origin Uganda
Farm.Name kyangundu cooperative society
Lot.Number NaN
Mill ankole coffee producers
ICO.Number 0
Company ankole coffee producers coop
Altitude 1488
Region sheema south western
Producer Ankole coffee producers coop
Number.of.Bags 300
Bag.Weight 60 kg
In.Country.Partner Uganda Coffee Development Authority
Harvest.Year 2013
Grading.Date June 26th, 2014
Owner.1 Ankole coffee producers coop
Variety NaN
Processing.Method NaN
Fragrance...Aroma 7.83
Flavor 8.08
Aftertaste 7.75
Salt...Acid 7.92
Bitter...Sweet 8.0
Mouthfeel 8.25
Uniform.Cup 10.0
Clean.Cup 10.0
Balance 7.92
Cupper.Points 8.0
Total.Cup.Points 83.75
Moisture 0.12
Category.One.Defects 0
Quakers 0
Color Green
Category.Two.Defects 2
Expiration June 26th, 2015
Certification.Body Uganda Coffee Development Authority
Certification.Address e36d0270932c3b657e96b7b0278dfd85dc0fe743
Certification.Contact 03077a1c6bac60e6f514691634a7f6eb5c85aae8
unit_of_measurement m
altitude_low_meters 1488.0
altitude_high_meters 1488.0
altitude_mean_meters 1488.0
Name: 1, dtype: object
or by number with iloc
coffee_df.iloc[1:5:2]
Masking to Select Subsets¶
We can use a boolean mask or Series with boolean values, to pick out a subset of rows.
To build up to that, we will first evaluate a boolean expression on a column.
See a visualization of this on a smaller dataset
coffee_df['Flavor']>7.8
1 True
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
Name: Flavor, dtype: bool
Now, we can save it to a variable, and then use it to mask, by treating it as an index.
high_flavor = coffee_df['Flavor']>7.8
coffee_df[high_flavor]
this segments out only the selected rows.
General df information¶
We can get an over view
coffee_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28 entries, 1 to 28
Data columns (total 43 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Species 28 non-null object
1 Owner 28 non-null object
2 Country.of.Origin 28 non-null object
3 Farm.Name 25 non-null object
4 Lot.Number 6 non-null object
5 Mill 20 non-null object
6 ICO.Number 17 non-null object
7 Company 28 non-null object
8 Altitude 25 non-null object
9 Region 26 non-null object
10 Producer 26 non-null object
11 Number.of.Bags 28 non-null int64
12 Bag.Weight 28 non-null object
13 In.Country.Partner 28 non-null object
14 Harvest.Year 28 non-null int64
15 Grading.Date 28 non-null object
16 Owner.1 28 non-null object
17 Variety 3 non-null object
18 Processing.Method 10 non-null object
19 Fragrance...Aroma 28 non-null float64
20 Flavor 28 non-null float64
21 Aftertaste 28 non-null float64
22 Salt...Acid 28 non-null float64
23 Bitter...Sweet 28 non-null float64
24 Mouthfeel 28 non-null float64
25 Uniform.Cup 28 non-null float64
26 Clean.Cup 28 non-null float64
27 Balance 28 non-null float64
28 Cupper.Points 28 non-null float64
29 Total.Cup.Points 28 non-null float64
30 Moisture 28 non-null float64
31 Category.One.Defects 28 non-null int64
32 Quakers 28 non-null int64
33 Color 25 non-null object
34 Category.Two.Defects 28 non-null int64
35 Expiration 28 non-null object
36 Certification.Body 28 non-null object
37 Certification.Address 28 non-null object
38 Certification.Contact 28 non-null object
39 unit_of_measurement 28 non-null object
40 altitude_low_meters 25 non-null float64
41 altitude_high_meters 25 non-null float64
42 altitude_mean_meters 25 non-null float64
dtypes: float64(15), int64(5), object(23)
memory usage: 10.7+ KB
We can view the the pandas dtypes
. This is close to the regular data type of the content, but with some differences.
coffee_df.dtypes
Species object
Owner object
Country.of.Origin object
Farm.Name object
Lot.Number object
Mill object
ICO.Number object
Company object
Altitude object
Region object
Producer object
Number.of.Bags int64
Bag.Weight object
In.Country.Partner object
Harvest.Year int64
Grading.Date object
Owner.1 object
Variety object
Processing.Method object
Fragrance...Aroma float64
Flavor float64
Aftertaste float64
Salt...Acid float64
Bitter...Sweet float64
Mouthfeel float64
Uniform.Cup float64
Clean.Cup float64
Balance float64
Cupper.Points float64
Total.Cup.Points float64
Moisture float64
Category.One.Defects int64
Quakers int64
Color object
Category.Two.Defects int64
Expiration object
Certification.Body object
Certification.Address object
Certification.Contact object
unit_of_measurement object
altitude_low_meters float64
altitude_high_meters float64
altitude_mean_meters float64
dtype: object
coffee_df.head().tail(1)
print(coffee_df)
Species Owner Country.of.Origin \
1 Robusta ankole coffee producers coop Uganda
2 Robusta nishant gurjer India
3 Robusta andrew hetzel India
4 Robusta ugacof Uganda
5 Robusta katuka development trust ltd Uganda
6 Robusta andrew hetzel India
7 Robusta andrew hetzel India
8 Robusta nishant gurjer India
9 Robusta nishant gurjer India
10 Robusta ugacof Uganda
11 Robusta ugacof Uganda
12 Robusta nishant gurjer India
13 Robusta andrew hetzel India
14 Robusta kasozi coffee farmers association Uganda
15 Robusta ankole coffee producers coop Uganda
16 Robusta andrew hetzel India
17 Robusta andrew hetzel India
18 Robusta kawacom uganda ltd Uganda
19 Robusta nitubaasa ltd Uganda
20 Robusta mannya coffee project Uganda
21 Robusta andrew hetzel India
22 Robusta andrew hetzel India
23 Robusta andrew hetzel United States
24 Robusta luis robles Ecuador
25 Robusta luis robles Ecuador
26 Robusta james moore United States
27 Robusta cafe politico India
28 Robusta cafe politico Vietnam
Farm.Name Lot.Number \
1 kyangundu cooperative society NaN
2 sethuraman estate kaapi royale 25
3 sethuraman estate NaN
4 ugacof project area NaN
5 katikamu capca farmers association NaN
6 NaN NaN
7 sethuraman estates NaN
8 sethuraman estate kaapi royale 7
9 sethuraman estate RKR
10 ishaka NaN
11 ugacof project area NaN
12 sethuraman estate kaapi royale RC AB
13 sethuraman estates NaN
14 kasozi coffee farmers NaN
15 kyangundu coop society NaN
16 sethuraman estate NaN
17 sethuraman estates NaN
18 bushenyi NaN
19 kigezi coffee farmers association NaN
20 mannya coffee project NaN
21 sethuraman estates NaN
22 sethuraman estates NaN
23 sethuraman estates NaN
24 robustasa Lavado 1
25 robustasa Lavado 3
26 fazenda cazengo NaN
27 NaN NaN
28 NaN NaN
Mill ICO.Number \
1 ankole coffee producers 0
2 sethuraman estate 14/1148/2017/21
3 NaN 0000
4 ugacof 0
5 katuka development trust 0
6 (self) NaN
7 NaN NaN
8 sethuraman estate 14/1148/2017/18
9 sethuraman estate 14/1148/2016/17
10 nsubuga umar 0
11 ugacof 0
12 sethuraman estate 14/1148/2016/12
13 NaN NaN
14 NaN 0
15 ankole coffee producers coop union ltd 0
16 NaN 0000
17 sethuraman estates NaN
18 kawacom 0
19 nitubaasa 0
20 mannya coffee project 0
21 NaN NaN
22 sethuraman estates NaN
23 sethuraman estates NaN
24 our own lab NaN
25 own laboratory NaN
26 cafe cazengo NaN
27 NaN 14-1118-2014-0087
28 NaN NaN
Company Altitude \
1 ankole coffee producers coop 1488
2 kaapi royale 3170
3 sethuraman estate 1000m
4 ugacof ltd 1212
5 katuka development trust ltd 1200-1300
6 cafemakers, llc 3000'
7 cafemakers 750m
8 kaapi royale 3140
9 kaapi royale 1000
10 ugacof ltd 900-1300
11 ugacof ltd 1095
12 kaapi royale 1000
13 cafemakers 750m
14 kasozi coffee farmers association 1367
15 ankole coffee producers coop 1488
16 sethuraman estate 1000m
17 cafemakers, llc 750m
18 kawacom uganda ltd 1600
19 nitubaasa ltd 1745
20 mannya coffee project 1200
21 cafemakers 750m
22 cafemakers, llc 750m
23 cafemakers, llc 3000'
24 robustasa NaN
25 robustasa 40
26 global opportunity fund 795 meters
27 cafe politico NaN
28 cafe politico NaN
Region ... Color Category.Two.Defects \
1 sheema south western ... Green 2
2 chikmagalur karnataka indua ... NaN 2
3 chikmagalur ... Green 0
4 central ... Green 7
5 luwero central region ... Green 3
6 chikmagalur ... Green 0
7 chikmagalur ... Green 0
8 chikmagalur karnataka india ... Bluish-Green 0
9 chikmagalur karnataka ... Green 0
10 western ... Green 6
11 iganga namadrope eastern ... Green 1
12 chikmagalur karnataka ... Green 0
13 chikmagalur ... Green 1
14 eastern ... Green 7
15 south western ... Green 2
16 chikmagalur ... Green 0
17 chikmagalur ... Blue-Green 0
18 western ... Green 1
19 western ... Green 2
20 southern ... Green 1
21 chikmagalur ... Bluish-Green 1
22 chikmagalur ... Green 0
23 chikmagalur ... Green 0
24 san juan, playas ... Blue-Green 1
25 san juan, playas ... Blue-Green 0
26 kwanza norte province, angola ... NaN 6
27 NaN ... Green 1
28 NaN ... NaN 9
Expiration Certification.Body \
1 June 26th, 2015 Uganda Coffee Development Authority
2 October 31st, 2018 Specialty Coffee Association
3 April 29th, 2016 Specialty Coffee Association
4 July 14th, 2015 Uganda Coffee Development Authority
5 June 26th, 2015 Uganda Coffee Development Authority
6 February 28th, 2013 Specialty Coffee Association
7 May 15th, 2015 Specialty Coffee Association
8 October 25th, 2018 Specialty Coffee Association
9 August 17th, 2017 Specialty Coffee Association
10 August 5th, 2015 Uganda Coffee Development Authority
11 June 26th, 2015 Uganda Coffee Development Authority
12 August 23rd, 2017 Specialty Coffee Association
13 May 19th, 2015 Specialty Coffee Association
14 July 14th, 2015 Uganda Coffee Development Authority
15 July 14th, 2015 Uganda Coffee Development Authority
16 April 29th, 2016 Specialty Coffee Association
17 June 3rd, 2014 Specialty Coffee Association
18 June 27th, 2015 Uganda Coffee Development Authority
19 June 27th, 2015 Uganda Coffee Development Authority
20 June 27th, 2015 Uganda Coffee Development Authority
21 May 19th, 2015 Specialty Coffee Association
22 June 20th, 2014 Specialty Coffee Association
23 February 28th, 2013 Specialty Coffee Association
24 January 18th, 2017 Specialty Coffee Association
25 January 18th, 2017 Specialty Coffee Association
26 December 23rd, 2015 Specialty Coffee Association
27 August 25th, 2015 Specialty Coffee Association
28 August 25th, 2015 Specialty Coffee Association
Certification.Address \
1 e36d0270932c3b657e96b7b0278dfd85dc0fe743
2 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
3 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
4 e36d0270932c3b657e96b7b0278dfd85dc0fe743
5 e36d0270932c3b657e96b7b0278dfd85dc0fe743
6 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
7 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
8 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
9 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
10 e36d0270932c3b657e96b7b0278dfd85dc0fe743
11 e36d0270932c3b657e96b7b0278dfd85dc0fe743
12 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
13 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
14 e36d0270932c3b657e96b7b0278dfd85dc0fe743
15 e36d0270932c3b657e96b7b0278dfd85dc0fe743
16 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
17 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
18 e36d0270932c3b657e96b7b0278dfd85dc0fe743
19 e36d0270932c3b657e96b7b0278dfd85dc0fe743
20 e36d0270932c3b657e96b7b0278dfd85dc0fe743
21 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
22 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
23 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
24 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
25 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
26 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
27 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
28 ff7c18ad303d4b603ac3f8cff7e611ffc735e720
Certification.Contact unit_of_measurement \
1 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
2 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
3 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
4 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
5 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
6 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
7 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
8 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
9 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
10 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
11 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
12 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
13 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
14 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
15 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
16 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
17 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
18 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
19 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
20 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m
21 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
22 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
23 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
24 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
25 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
26 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
27 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
28 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m
altitude_low_meters altitude_high_meters altitude_mean_meters
1 1488.0 1488.0 1488.0
2 3170.0 3170.0 3170.0
3 1000.0 1000.0 1000.0
4 1212.0 1212.0 1212.0
5 1200.0 1300.0 1250.0
6 3000.0 3000.0 3000.0
7 750.0 750.0 750.0
8 3140.0 3140.0 3140.0
9 1000.0 1000.0 1000.0
10 900.0 1300.0 1100.0
11 1095.0 1095.0 1095.0
12 1000.0 1000.0 1000.0
13 750.0 750.0 750.0
14 1367.0 1367.0 1367.0
15 1488.0 1488.0 1488.0
16 1000.0 1000.0 1000.0
17 750.0 750.0 750.0
18 1600.0 1600.0 1600.0
19 1745.0 1745.0 1745.0
20 1200.0 1200.0 1200.0
21 750.0 750.0 750.0
22 750.0 750.0 750.0
23 3000.0 3000.0 3000.0
24 NaN NaN NaN
25 40.0 40.0 40.0
26 795.0 795.0 795.0
27 NaN NaN NaN
28 NaN NaN NaN
[28 rows x 43 columns]
Creating a Data Frame¶
We can use the constructor to make a DataFrame from many different input formats.
For example with a list of lists
tiny_data_list = [[1,3,4],[2,6,1]]
pd.DataFrame(data =tiny_data_list, columns=['a','b','c'])
or a dictionary of lists
tiny_data_dict = {'a':[1,2],'b':[3,6],'c':[4,1]}
pd.DataFrame(tiny_data_dict)
Reading from a website¶
We saw in the help that pandas has a lot of read_
methods. You can also see a list on the IO page of the docs
grading_page = 'https://rhodyprog4ds.github.io/BrownFall25/grading/'
We can read from html files too! (websites)
This has a long output so I have hidden it by default, but you can view it.
pd.read_html(grading_page)
Output
[ outcome A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
0 process 🟨 🟨 🟨 🟨 ✅ ✅ ✅ ✅ ✅ ✅
1 communication ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅
2 data ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅
3 modeling NaN NaN NaN NaN NaN NaN ✅ ✅ ✅ ✅
4 exploratory 🟨 ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅,
outcome level A A- B+ B B- C+ C C- D+ D F
0 experience active 15 15 15 15 10 10 10 5 5 5 0
1 experience present 22 22 22 22 22 22 22 22 22 22 0
2 process innovative 5 4 3 2 1 0 0 0 0 0 0
3 process complete 5 5 5 5 4 4 3 2 2 1 0
4 process partial 5 5 5 5 5 5 5 5 5 3 0
5 communication innovative 3 2 1 0 0 0 0 0 0 0 0
6 communication complete 8 8 8 8 0 0 0 0 0 0 0
7 communication partial 5 5 5 5 5 5 5 5 5 3 0
8 data innovative 3 2 1 0 0 0 0 0 0 0 0
9 data complete 8 8 8 8 7 6 5 4 3 2 0
10 modeling innovative 3 2 1 0 0 0 0 0 0 0 0
11 modeling complete 4 4 4 4 3 3 2 2 1 1 0
12 exploratory innovative 3 2 1 0 0 0 0 0 0 0 0
13 exploratory complete 6 6 6 6 5 4 3 2 1 1 0]
this does not look like the others, it is a list of the DataFrames for all of the tables on the page. Even if there is only one, it will still be in a list.
Now that we know it is a list, we will save it in a variable and then, pick out one DataFrame.
If you go to the page you can see that there are two tables there. Let’s take the second one.
grading_tables_list = pd.read_html(grading_page)
min_earnings_df = grading_tables_list[1]
min_earnings_df
This table is from this website so in the notes we can also view it as a table to see that it is the same.
We can then verify that it is a DataFrame
type(min_earnings_df)
pandas.core.frame.DataFrame