3. Pandas Data Frames and More Iterable Types#

Note

I was able to fix my certificate problem by running this script from the Python install.

We will import pandas

import pandas as pd

We want to import the coffee data, and we will keep working with it over several classes. Sometimes we want to put functions we have written, or constants like these, in a module of our own that we can import. For today, we will put the following in a file called coffee.py:

robusta_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
robusta_data_file = 'robusta_data_cleaned.csv'

Now we can import it

import coffee

and use the variable with a ., as an attribute of the module.

coffee.robusta_data_url
'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'

We could also import the variables directly so that we can use them without the .

from coffee import robusta_data_url, robusta_data_file

like so. If we mistype the name, though, we get an error:

ro
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 ro

NameError: name 'ro' is not defined

This is a NameError because the variable does not exist.

When we spell it correctly, it works as desired.

robusta_data_url
'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'

Then we can use the variable.

coffee_df = pd.read_csv(robusta_data_url)
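The module also defined robusta_data_file; if the CSV has already been downloaded, we could read a local copy instead. A quick sketch, assuming the file is in the current working directory:

# read from a local copy of the data instead of the URL
coffee_df_local = pd.read_csv(robusta_data_file)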

3.1. DataFrame parts#

first we will check the type

type(coffee_df)
pandas.core.frame.DataFrame

we saw that we can check the first 5 rows with head

coffee_df.head()
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Robusta ankole coffee producers coop Uganda kyangundu cooperative society NaN ankole coffee producers 0 ankole coffee producers coop 1488 ... Green 2 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1488.0 1488.0 1488.0
1 2 Robusta nishant gurjer India sethuraman estate kaapi royale 25 sethuraman estate 14/1148/2017/21 kaapi royale 3170 ... NaN 2 October 31st, 2018 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 3170.0 3170.0 3170.0
2 3 Robusta andrew hetzel India sethuraman estate NaN NaN 0000 sethuraman estate 1000m ... Green 0 April 29th, 2016 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 1000.0 1000.0 1000.0
3 4 Robusta ugacof Uganda ugacof project area NaN ugacof 0 ugacof ltd 1212 ... Green 7 July 14th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1212.0 1212.0 1212.0
4 5 Robusta katuka development trust ltd Uganda katikamu capca farmers association NaN katuka development trust 0 katuka development trust ltd 1200-1300 ... Green 3 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1200.0 1300.0 1250.0

5 rows × 44 columns

and tail to see the last ones.

coffee_df.tail(3)
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
25 26 Robusta james moore United States fazenda cazengo NaN cafe cazengo NaN global opportunity fund 795 meters ... NaN 6 December 23rd, 2015 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 795.0 795.0 795.0
26 27 Robusta cafe politico India NaN NaN NaN 14-1118-2014-0087 cafe politico NaN ... Green 1 August 25th, 2015 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m NaN NaN NaN
27 28 Robusta cafe politico Vietnam NaN NaN NaN NaN cafe politico NaN ... NaN 9 August 25th, 2015 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m NaN NaN NaN

3 rows × 44 columns

We note that one of the columns is named Unnamed: 0 because that column does not have a name in the file.

We can use shift + tab to see the help. There we see that we can use the index_col parameter to treat that column as the index.

coffee_df = pd.read_csv(robusta_data_url,index_col=0)
coffee_df.head(1)
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Robusta ankole coffee producers coop Uganda kyangundu cooperative society NaN ankole coffee producers 0 ankole coffee producers coop 1488 sheema south western ... Green 2 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1488.0 1488.0 1488.0

1 rows × 43 columns

3.1.1. Data Frame parts#

Now it looks neater and that column is the index.

We saw the columns attribute before:

coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
       'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
       'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
       'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
       'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
       'Certification.Body', 'Certification.Address', 'Certification.Contact',
       'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
       'altitude_mean_meters'],
      dtype='object')

The rows are named by the index attribute

coffee_df.index
Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
      dtype='int64')

Both of these are pandas Index objects
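We can confirm that with type, for example:

# both the columns and the index are pandas Index objects
type(coffee_df.columns), type(coffee_df.index)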

We can also take out only the values without the index or the column names.

coffee_df.values
array([['Robusta', 'ankole coffee producers coop', 'Uganda', ..., 1488.0,
        1488.0, 1488.0],
       ['Robusta', 'nishant gurjer', 'India', ..., 3170.0, 3170.0,
        3170.0],
       ['Robusta', 'andrew hetzel', 'India', ..., 1000.0, 1000.0, 1000.0],
       ...,
       ['Robusta', 'james moore', 'United States', ..., 795.0, 795.0,
        795.0],
       ['Robusta', 'cafe politico', 'India', ..., nan, nan, nan],
       ['Robusta', 'cafe politico', 'Vietnam', ..., nan, nan, nan]],
      dtype=object)

this is a numpy.ndarray type

type(coffee_df.values)
numpy.ndarray
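Since the array has no index, we refer to its rows by position rather than by name. For example, position 0 here corresponds to the row labeled 1 in the DataFrame:

# first row of the array; coffee_df.loc[1] holds the same values
coffee_df.values[0]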

3.1.2. Segmenting rows or columns#

We can pick out a column using its name and []

coffee_df['Flavor']
1     8.08
2     7.75
3     7.83
4     7.92
5     7.83
6     7.92
7     7.75
8     7.75
9     7.75
10    7.83
11    7.92
12    7.83
13    7.58
14    7.75
15    7.67
16    7.75
17    7.50
18    7.58
19    7.58
20    7.42
21    7.67
22    7.42
23    7.50
24    7.58
25    7.67
26    7.33
27    6.83
28    6.67
Name: Flavor, dtype: float64

this is a pandas Series

type(coffee_df['Flavor'])
pandas.core.series.Series
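We can also pick out more than one column at a time by passing a list of names inside the []; for example:

# a list of column names selects multiple columns, giving back a DataFrame
coffee_df[['Flavor','Balance']].head(2)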

We can pick out a row with loc, the property that selects a row by its name, which is a value from the index.

coffee_df.loc[7]
Species                                                   Robusta
Owner                                               andrew hetzel
Country.of.Origin                                           India
Farm.Name                                      sethuraman estates
Lot.Number                                                    NaN
Mill                                                          NaN
ICO.Number                                                    NaN
Company                                                cafemakers
Altitude                                                     750m
Region                                                chikmagalur
Producer                                           Nishant Gurjer
Number.of.Bags                                                320
Bag.Weight                                                   2 kg
In.Country.Partner                   Specialty Coffee Association
Harvest.Year                                                 2014
Grading.Date                                       May 15th, 2014
Owner.1                                             Andrew Hetzel
Variety                                                       NaN
Processing.Method                                             NaN
Fragrance...Aroma                                            7.67
Flavor                                                       7.75
Aftertaste                                                   7.83
Salt...Acid                                                  7.83
Bitter...Sweet                                                8.0
Mouthfeel                                                    7.92
Uniform.Cup                                                  10.0
Clean.Cup                                                    10.0
Balance                                                      7.75
Cupper.Points                                                7.83
Total.Cup.Points                                            82.58
Moisture                                                      0.0
Category.One.Defects                                            0
Quakers                                                         0
Color                                                       Green
Category.Two.Defects                                            0
Expiration                                         May 15th, 2015
Certification.Body                   Specialty Coffee Association
Certification.Address    ff7c18ad303d4b603ac3f8cff7e611ffc735e720
Certification.Contact    352d0cf7f3e9be14dad7df644ad65efc27605ae2
unit_of_measurement                                             m
altitude_low_meters                                         750.0
altitude_high_meters                                        750.0
altitude_mean_meters                                        750.0
Name: 7, dtype: object

this is also a Series.
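We can check it the same way:

type(coffee_df.loc[7])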

3.1.3. Masking to Select Subsets#

We can use a boolean mask, a Series of boolean values, to pick out a subset of rows.

To build up to that, we will first evaluate a boolean expression on a column.

coffee_df['Flavor']>7.8
1      True
2     False
3      True
4      True
5      True
6      True
7     False
8     False
9     False
10     True
11     True
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
Name: Flavor, dtype: bool

Now we can save it to a variable and then use it as a mask, by treating it as an index.

high_flavor = coffee_df['Flavor']>7.8
coffee_df[high_flavor]
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Robusta ankole coffee producers coop Uganda kyangundu cooperative society NaN ankole coffee producers 0 ankole coffee producers coop 1488 sheema south western ... Green 2 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1488.0 1488.0 1488.0
3 Robusta andrew hetzel India sethuraman estate NaN NaN 0000 sethuraman estate 1000m chikmagalur ... Green 0 April 29th, 2016 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 1000.0 1000.0 1000.0
4 Robusta ugacof Uganda ugacof project area NaN ugacof 0 ugacof ltd 1212 central ... Green 7 July 14th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1212.0 1212.0 1212.0
5 Robusta katuka development trust ltd Uganda katikamu capca farmers association NaN katuka development trust 0 katuka development trust ltd 1200-1300 luwero central region ... Green 3 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1200.0 1300.0 1250.0
6 Robusta andrew hetzel India NaN NaN (self) NaN cafemakers, llc 3000' chikmagalur ... Green 0 February 28th, 2013 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 3000.0 3000.0 3000.0
10 Robusta ugacof Uganda ishaka NaN nsubuga umar 0 ugacof ltd 900-1300 western ... Green 6 August 5th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 900.0 1300.0 1100.0
11 Robusta ugacof Uganda ugacof project area NaN ugacof 0 ugacof ltd 1095 iganga namadrope eastern ... Green 1 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1095.0 1095.0 1095.0
12 Robusta nishant gurjer India sethuraman estate kaapi royale RC AB sethuraman estate 14/1148/2016/12 kaapi royale 1000 chikmagalur karnataka ... Green 0 August 23rd, 2017 Specialty Coffee Association ff7c18ad303d4b603ac3f8cff7e611ffc735e720 352d0cf7f3e9be14dad7df644ad65efc27605ae2 m 1000.0 1000.0 1000.0

8 rows × 43 columns

this segments out only the selected rows.
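We could also write the comparison inline without saving the mask to a variable first; for example:

# equivalent to the two steps above, with the mask written inline
coffee_df[coffee_df['Flavor']>7.8]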

The values are a numpy array, which has the same shape as the DataFrame, but it does not include the index or the column headers (which do not count as rows or columns anyway).

coffee_vals = coffee_df.values

We can see this with the shape attribute

coffee_vals.shape
(28, 43)
coffee_df.shape
(28, 43)

3.2. Reading from a website#

We saw in the help that pandas has a lot of read_ methods. You can also see a list on the IO page of the docs

achievements_url = 'https://rhodyprog4ds.github.io/BrownFall24/syllabus/achievements.html'

We can read from html files (websites)

pd.read_html(achievements_url,)
[   Unnamed: 0_level_0                                             topics  \
                  week                                 Unnamed: 1_level_1   
 0                   1                             [admin, python review]   
 1                   2                        Loading data, Python review   
 2                   3                          Exploratory Data Analysis   
 3                   4                                      Data Cleaning   
 4                   5                      Databases, Merging DataFrames   
 5                   6  Modeling, classification performance metrics, ...   
 6                   7                        Naive Bayes, decision trees   
 7                   8                                         Regression   
 8                   9                                         Clustering   
 9                  10                              SVM, parameter tuning   
 10                 11                              KNN, Model comparison   
 11                 12                                      Text Analysis   
 12                 13                                    Images Analysis   
 13                 14                                      Deep Learning   
 
                              skills  
                  Unnamed: 2_level_1  
 0                           process  
 1      [access, prepare, summarize]  
 2            [summarize, visualize]  
 3   [prepare, summarize, visualize]  
 4    [access, construct, summarize]  
 5                        [evaluate]  
 6        [classification, evaluate]  
 7            [regression, evaluate]  
 8            [clustering, evaluate]  
 9                 [optimize, tools]  
 10                 [compare, tools]  
 11                   [unstructured]  
 12            [unstructured, tools]  
 13                 [tools, compare]  ,
    Unnamed: 0_level_0                                              skill  \
               keyword                                 Unnamed: 1_level_1   
 0              python                              pythonic code writing   
 1             process                 describe data science as a process   
 2              access                    access data in multiple formats   
 3           construct           construct datasets from multiple sources   
 4           summarize                        Summarize and describe data   
 5           visualize                                     Visualize data   
 6             prepare                          prepare data for analysis   
 7            evaluate                         Evaluate model performance   
 8      classification                               Apply classification   
 9          regression                                   Apply Regression   
 10         clustering                                         Clustering   
 11           optimize                          Optimize model parameters   
 12            compare                                     compare models   
 13     representation          Choose representations and transform data   
 14           workflow  use industry standard data science tools and w...   
 
                                               Level 1  \
                                    Unnamed: 2_level_1   
 0   python code that mostly runs, occasional pep8 ...   
 1           Identify basic components of data science   
 2   load data from at least one format; identify t...   
 3   identify what should happen to merge datasets ...   
 4   Describe the shape and structure of a dataset ...   
 5   identify plot types, generate basic plots from...   
 6   identify if data is or is not ready for analys...   
 7   Explain and compute basic performance metrics ...   
 8   identify and describe what classification is, ...   
 9   identify what data that can be used for regres...   
 10                        describe what clustering is   
 11  Identify when model parameters need to be opti...   
 12                Qualitatively compare model classes   
 13  Identify options for representing text and cat...   
 14  Solve well strucutred fully specified problems...   
 
                                               Level 2  \
                                    Unnamed: 3_level_1   
 0   python code that reliably runs, frequent pep8 ...   
 1   Describe and define each stage of the data sci...   
 2   Load data for processing from the most common ...   
 3                                  apply basic merges   
 4   compute summary statndard statistics of a whol...   
 5   generate multiple plot types with complete lab...   
 6   apply data reshaping, cleaning, and filtering ...   
 7   Apply and interpret basic model evaluation met...   
 8   fit, apply, and interpret preselected classifi...   
 9          fit and interpret linear regression models   
 10                             apply basic clustering   
 11  Optimize basic model parameters such as model ...   
 12  Compare model classes in specific terms and fi...   
 13  Apply at least one representation to transform...   
 14  Solve well-strucutred, open-ended problems, ap...   
 
                                               Level 3  
                                    Unnamed: 4_level_1  
 0   reliable, efficient, pythonic code that consis...  
 1   Compare different ways that data science can f...  
 2   access data from both common and uncommon form...  
 3        merge data that is not automatically aligned  
 4   Compute and interpret various summary statisti...  
 5   generate complex plots with pandas and plottin...  
 6   apply data reshaping, cleaning, and filtering ...  
 7   Evaluate a model with multiple metrics and cro...  
 8   fit and apply classification models and select...  
 9   fit and explain regrularized or nonlinear regr...  
 10  apply multiple clustering techniques, and inte...  
 11  Select optimal parameters based of mutiple qua...  
 12  Evaluate tradeoffs between different model com...  
 13  apply transformations in different contexts OR...  
 14  Independently scope and solve realistic data s...  ,
    Unnamed: 0_level_0                 A1                 A2  \
               keyword Unnamed: 1_level_1 Unnamed: 2_level_1   
 0              python                  1                  1   
 1             process                  1                  0   
 2              access                  0                  1   
 3           construct                  0                  0   
 4           summarize                  0                  0   
 5           visualize                  0                  0   
 6             prepare                  0                  0   
 7            evaluate                  0                  0   
 8      classification                  0                  0   
 9          regression                  0                  0   
 10         clustering                  0                  0   
 11           optimize                  0                  0   
 12            compare                  0                  0   
 13     representation                  0                  0   
 14           workflow                  0                  0   
 
                    A3                 A4                 A5  \
    Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1   
 0                   0                  1                  1   
 1                   0                  0                  0   
 2                   1                  1                  1   
 3                   0                  0                  1   
 4                   1                  1                  1   
 5                   1                  1                  0   
 6                   0                  1                  1   
 7                   0                  0                  0   
 8                   0                  0                  0   
 9                   0                  0                  0   
 10                  0                  0                  0   
 11                  0                  0                  0   
 12                  0                  0                  0   
 13                  0                  0                  0   
 14                  0                  0                  0   
 
                    A6                 A7                 A8  \
    Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1   
 0                   0                  0                  0   
 1                   1                  1                  1   
 2                   0                  0                  0   
 3                   0                  1                  1   
 4                   1                  1                  1   
 5                   1                  1                  1   
 6                   0                  0                  0   
 7                   1                  1                  1   
 8                   0                  1                  0   
 9                   0                  0                  1   
 10                  0                  0                  0   
 11                  0                  0                  0   
 12                  0                  0                  0   
 13                  0                  0                  0   
 14                  0                  0                  0   
 
                    A9                 A10                 A11  \
    Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1   
 0                   0                   0                   0   
 1                   1                   1                   1   
 2                   0                   0                   0   
 3                   0                   0                   0   
 4                   1                   1                   1   
 5                   1                   1                   1   
 6                   0                   0                   0   
 7                   0                   1                   1   
 8                   0                   1                   0   
 9                   0                   0                   1   
 10                  1                   0                   1   
 11                  0                   1                   1   
 12                  0                   0                   1   
 13                  0                   0                   0   
 14                  0                   1                   1   
 
                    A12                 A13       # Assignments  
    Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1  
 0                    0                   0                   4  
 1                    0                   0                   7  
 2                    0                   0                   4  
 3                    0                   0                   3  
 4                    1                   1                  11  
 5                    1                   1                  10  
 6                    0                   0                   2  
 7                    0                   0                   5  
 8                    0                   0                   2  
 9                    0                   0                   2  
 10                   0                   0                   2  
 11                   0                   0                   2  
 12                   0                   1                   2  
 13                   1                   1                   2  
 14                   1                   1                   4  ]

This does not look like the others; it is a list of DataFrames, one for each table on the page. Even if there is only one table, the result will still be a list.

Now that we know it is a list, we will save it in a variable and then pick out one DataFrame.

ach_df_list = pd.read_html(achievements_url,)
ach_defn = ach_df_list[2]

we can pick out that one and check that it is indeed a DataFrame

type(ach_defn)
pandas.core.frame.DataFrame
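We could also check how many tables read_html found on the page with len; for example:

# number of DataFrames in the list, one per table on the page
len(ach_df_list)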

3.3. Slicing#

We will use a small list to practice slicing, but it works the same in a DataFrame.

toy_list = [3,4,5,6]

we can get the last item with -1

toy_list[-1]
6

We can use : to go up to, but not including, a particular position.

toy_list[:2]
[3, 4]

Note that using 2 picks out the value in position 2, but the above stopped before that.

toy_list[2]
5

we can also slice from the right; here the starting position -2 is included

toy_list[-2:]
[5, 6]

we can put numbers on both sides of the : to take a segment out

toy_list[1:3]
[4, 5]

we can use two colons to specify a step, the interval at which to pick values.

toy_list[::2]
[3, 5]

or we can state the starting point along with the step

toy_list[1::2]
[4, 6]

3.4. Built in iterable types#

These are four different iterable constructions:

a = [char for char in 'abcde']
b =  {char:i for i, char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split(' ')

We can see their types

type(a), type(b), type(c), type(d)
(list, dict, tuple, list)

Dictionaries are really useful because they consist of key-value pairs. This is really powerful, and we will use it a lot to pass complex structures into functions.

we can make another one manually that is much like the one above.

sample_dict = {'a':1,'b':2,'c':3}
sample_dict
{'a': 1, 'b': 2, 'c': 3}

we access values of a dictionary using the key

sample_dict['a']
1

we can iterate over them using the items method to get each key and value separately.

for char,num in sample_dict.items():
    print('char', char, 'is linked to number',num)
char a is linked to number 1
char b is linked to number 2
char c is linked to number 3
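One way we will use this for passing things into functions is to store keyword arguments in a dictionary and unpack it with ** when calling the function. A minimal sketch, reusing the read_csv parameters we have already seen:

# keyword arguments stored in a dictionary, unpacked with ** in the call
read_opts = {'index_col': 0}
coffee_df = pd.read_csv(robusta_data_url, **read_opts)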

3.5. Saving an excerpt#

We can find only the rows with a small Number.of.Bags, say <100:

small_batch_mask = coffee_df['Number.of.Bags']<100
small_batch = coffee_df[small_batch_mask]

After we make the mask and then subset the data, we see it's a DataFrame

type(small_batch)
pandas.core.frame.DataFrame

and then we can save it to a CSV file.

small_batch.to_csv('robusta_small_batch.csv')
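If we want to check the saved file, we could read it back in, using index_col=0 again because to_csv writes the index as the first column. A quick sketch:

# read the saved excerpt back; index_col=0 restores the saved index
small_batch_check = pd.read_csv('robusta_small_batch.csv', index_col=0)
small_batch_check.head(2)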