3. Pandas Data Frames and More Iterable Types#
Note
I was able to fix my certificate problem by running this script from the python install.
We will import pandas
import pandas as pd
We want to import the coffee data, and we will keep working with it over several classes. Sometimes we want to put functions we have written, or constants like these, in a module of our own that we can import. For today we will put the following in a file called coffee.py:
robusta_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
robusta_data_file = 'robusta_data_cleaned.csv'
Now we can import it
import coffee
and use the variable with a ., as an attribute of the module.
coffee.robusta_data_url
'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
We could also import the variables directly so that we can use them without the .
from coffee import robusta_data_url, robusta_data_file
For example, if we misspell the name:
ro
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[5], line 1
----> 1 ro
NameError: name 'ro' is not defined
This is a NameError because the variable does not exist.
When we spell it correctly, it works as desired.
robusta_data_url
'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
Then we can use the variable.
coffee_df = pd.read_csv(robusta_data_url)
3.1. DataFrame parts#
first we will check the type
type(coffee_df)
pandas.core.frame.DataFrame
we saw that we can check the first 5 rows with head
coffee_df.head()
Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 | 2 | Robusta | nishant gurjer | India | sethuraman estate kaapi royale | 25 | sethuraman estate | 14/1148/2017/21 | kaapi royale | 3170 | ... | NaN | 2 | October 31st, 2018 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3170.0 | 3170.0 | 3170.0 |
2 | 3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
3 | 4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
4 | 5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
5 rows × 44 columns
and tail to see the last ones.
coffee_df.tail(3)
Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25 | 26 | Robusta | james moore | United States | fazenda cazengo | NaN | cafe cazengo | NaN | global opportunity fund | 795 meters | ... | NaN | 6 | December 23rd, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 795.0 | 795.0 | 795.0 |
26 | 27 | Robusta | cafe politico | India | NaN | NaN | NaN | 14-1118-2014-0087 | cafe politico | NaN | ... | Green | 1 | August 25th, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | NaN | NaN | NaN |
27 | 28 | Robusta | cafe politico | Vietnam | NaN | NaN | NaN | NaN | cafe politico | NaN | ... | NaN | 9 | August 25th, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | NaN | NaN | NaN |
3 rows × 44 columns
We note that one of the columns is named Unnamed: 0 because that column does not have a name in the file. We can use shift + tab to see the help, where we see that the index_col parameter can be used to treat that column as the index.
coffee_df = pd.read_csv(robusta_data_url,index_col=0)
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 rows × 43 columns
3.1.1. Data Frame parts#
Now it looks neater and that column is the index.
We saw before the columns
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
'Certification.Body', 'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
'altitude_mean_meters'],
dtype='object')
The rows are named by the index attribute.
coffee_df.index
Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
dtype='int64')
Both of these are pandas Index objects
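We can confirm that with type. Here is a minimal sketch using a small stand-in DataFrame (the values are made up for illustration, but coffee_df behaves the same way):

```python
import pandas as pd

# a tiny stand-in DataFrame (made-up values)
demo_df = pd.DataFrame({'Flavor': [8.08, 7.75], 'Color': ['Green', 'Green']},
                       index=[1, 2])

# both the column labels and the row labels are pandas Index objects
print(isinstance(demo_df.columns, pd.Index))  # True
print(isinstance(demo_df.index, pd.Index))    # True
```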
We can also take out only the values without the index or the column names.
coffee_df.values
array([['Robusta', 'ankole coffee producers coop', 'Uganda', ..., 1488.0,
1488.0, 1488.0],
['Robusta', 'nishant gurjer', 'India', ..., 3170.0, 3170.0,
3170.0],
['Robusta', 'andrew hetzel', 'India', ..., 1000.0, 1000.0, 1000.0],
...,
['Robusta', 'james moore', 'United States', ..., 795.0, 795.0,
795.0],
['Robusta', 'cafe politico', 'India', ..., nan, nan, nan],
['Robusta', 'cafe politico', 'Vietnam', ..., nan, nan, nan]],
dtype=object)
this is a numpy array; we can check the type
type(coffee_df.values)
numpy.ndarray
3.1.2. Segmenting rows or columns#
We can pick out a column using its name and []
coffee_df['Flavor']
1 8.08
2 7.75
3 7.83
4 7.92
5 7.83
6 7.92
7 7.75
8 7.75
9 7.75
10 7.83
11 7.92
12 7.83
13 7.58
14 7.75
15 7.67
16 7.75
17 7.50
18 7.58
19 7.58
20 7.42
21 7.67
22 7.42
23 7.50
24 7.58
25 7.67
26 7.33
27 6.83
28 6.67
Name: Flavor, dtype: float64
this is a pandas Series
type(coffee_df['Flavor'])
pandas.core.series.Series
We can pick a row out with loc, the property that grabs a row by its name, a value from the index.
coffee_df.loc[7]
Species Robusta
Owner andrew hetzel
Country.of.Origin India
Farm.Name sethuraman estates
Lot.Number NaN
Mill NaN
ICO.Number NaN
Company cafemakers
Altitude 750m
Region chikmagalur
Producer Nishant Gurjer
Number.of.Bags 320
Bag.Weight 2 kg
In.Country.Partner Specialty Coffee Association
Harvest.Year 2014
Grading.Date May 15th, 2014
Owner.1 Andrew Hetzel
Variety NaN
Processing.Method NaN
Fragrance...Aroma 7.67
Flavor 7.75
Aftertaste 7.83
Salt...Acid 7.83
Bitter...Sweet 8.0
Mouthfeel 7.92
Uniform.Cup 10.0
Clean.Cup 10.0
Balance 7.75
Cupper.Points 7.83
Total.Cup.Points 82.58
Moisture 0.0
Category.One.Defects 0
Quakers 0
Color Green
Category.Two.Defects 0
Expiration May 15th, 2015
Certification.Body Specialty Coffee Association
Certification.Address ff7c18ad303d4b603ac3f8cff7e611ffc735e720
Certification.Contact 352d0cf7f3e9be14dad7df644ad65efc27605ae2
unit_of_measurement m
altitude_low_meters 750.0
altitude_high_meters 750.0
altitude_mean_meters 750.0
Name: 7, dtype: object
this is also a Series.
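We can check that the same way. A minimal sketch on a small stand-in DataFrame (made-up values) shows that a row pulled out with loc is a Series whose index is the column names:

```python
import pandas as pd

demo_df = pd.DataFrame({'Flavor': [8.08, 7.75, 7.83]}, index=[1, 2, 3])

row = demo_df.loc[2]    # grab the row named 2
print(type(row))        # a pandas Series
print(list(row.index))  # the column names become the Series index
```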
3.1.3. Masking to Select Subsets#
We can use a boolean mask or Series with boolean values, to pick out a subset of rows.
To build up to that, we will first evaluate a boolean expression on a column.
coffee_df['Flavor']>7.8
1 True
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
Name: Flavor, dtype: bool
Now, we can save it to a variable, and then use it to mask, by treating it as an index.
high_flavor = coffee_df['Flavor']>7.8
coffee_df[high_flavor]
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | chikmagalur | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | central | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | luwero central region | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
6 | Robusta | andrew hetzel | India | NaN | NaN | (self) | NaN | cafemakers, llc | 3000' | chikmagalur | ... | Green | 0 | February 28th, 2013 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3000.0 | 3000.0 | 3000.0 |
10 | Robusta | ugacof | Uganda | ishaka | NaN | nsubuga umar | 0 | ugacof ltd | 900-1300 | western | ... | Green | 6 | August 5th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 900.0 | 1300.0 | 1100.0 |
11 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1095 | iganga namadrope eastern | ... | Green | 1 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1095.0 | 1095.0 | 1095.0 |
12 | Robusta | nishant gurjer | India | sethuraman estate kaapi royale | RC AB | sethuraman estate | 14/1148/2016/12 | kaapi royale | 1000 | chikmagalur karnataka | ... | Green | 0 | August 23rd, 2017 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
8 rows × 43 columns
this segments out only the selected rows.
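The mask does not have to be saved in a variable first; the comparison can go directly inside the square brackets. A minimal sketch with made-up values:

```python
import pandas as pd

demo_df = pd.DataFrame({'Flavor': [8.08, 7.75, 7.83, 7.92]}, index=[1, 2, 3, 4])

# build the boolean mask and use it in one step
high = demo_df[demo_df['Flavor'] > 7.8]
print(high)  # only rows 1, 3, and 4 remain
```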
The values are a numpy array which has the same shape as the dataframe, but it does not include the index or the column headers (which do not count as rows or columns anyway).
coffee_vals = coffee_df.values
We can see this with the shape attribute.
coffee_vals.shape
(28, 43)
coffee_df.shape
(28, 43)
3.2. Reading from a website#
We saw in the help that pandas has a lot of read_ functions. You can also see a list on the IO page of the docs.
achievements_url = 'https://rhodyprog4ds.github.io/BrownFall24/syllabus/achievements.html'
We can read from html files (websites)
pd.read_html(achievements_url,)
[ Unnamed: 0_level_0 topics \
week Unnamed: 1_level_1
0 1 [admin, python review]
1 2 Loading data, Python review
2 3 Exploratory Data Analysis
3 4 Data Cleaning
4 5 Databases, Merging DataFrames
5 6 Modeling, classification performance metrics, ...
6 7 Naive Bayes, decision trees
7 8 Regression
8 9 Clustering
9 10 SVM, parameter tuning
10 11 KNN, Model comparison
11 12 Text Analysis
12 13 Images Analysis
13 14 Deep Learning
skills
Unnamed: 2_level_1
0 process
1 [access, prepare, summarize]
2 [summarize, visualize]
3 [prepare, summarize, visualize]
4 [access, construct, summarize]
5 [evaluate]
6 [classification, evaluate]
7 [regression, evaluate]
8 [clustering, evaluate]
9 [optimize, tools]
10 [compare, tools]
11 [unstructured]
12 [unstructured, tools]
13 [tools, compare] ,
Unnamed: 0_level_0 skill \
keyword Unnamed: 1_level_1
0 python pythonic code writing
1 process describe data science as a process
2 access access data in multiple formats
3 construct construct datasets from multiple sources
4 summarize Summarize and describe data
5 visualize Visualize data
6 prepare prepare data for analysis
7 evaluate Evaluate model performance
8 classification Apply classification
9 regression Apply Regression
10 clustering Clustering
11 optimize Optimize model parameters
12 compare compare models
13 representation Choose representations and transform data
14 workflow use industry standard data science tools and w...
Level 1 \
Unnamed: 2_level_1
0 python code that mostly runs, occasional pep8 ...
1 Identify basic components of data science
2 load data from at least one format; identify t...
3 identify what should happen to merge datasets ...
4 Describe the shape and structure of a dataset ...
5 identify plot types, generate basic plots from...
6 identify if data is or is not ready for analys...
7 Explain and compute basic performance metrics ...
8 identify and describe what classification is, ...
9 identify what data that can be used for regres...
10 describe what clustering is
11 Identify when model parameters need to be opti...
12 Qualitatively compare model classes
13 Identify options for representing text and cat...
14 Solve well strucutred fully specified problems...
Level 2 \
Unnamed: 3_level_1
0 python code that reliably runs, frequent pep8 ...
1 Describe and define each stage of the data sci...
2 Load data for processing from the most common ...
3 apply basic merges
4 compute summary statndard statistics of a whol...
5 generate multiple plot types with complete lab...
6 apply data reshaping, cleaning, and filtering ...
7 Apply and interpret basic model evaluation met...
8 fit, apply, and interpret preselected classifi...
9 fit and interpret linear regression models
10 apply basic clustering
11 Optimize basic model parameters such as model ...
12 Compare model classes in specific terms and fi...
13 Apply at least one representation to transform...
14 Solve well-strucutred, open-ended problems, ap...
Level 3
Unnamed: 4_level_1
0 reliable, efficient, pythonic code that consis...
1 Compare different ways that data science can f...
2 access data from both common and uncommon form...
3 merge data that is not automatically aligned
4 Compute and interpret various summary statisti...
5 generate complex plots with pandas and plottin...
6 apply data reshaping, cleaning, and filtering ...
7 Evaluate a model with multiple metrics and cro...
8 fit and apply classification models and select...
9 fit and explain regrularized or nonlinear regr...
10 apply multiple clustering techniques, and inte...
11 Select optimal parameters based of mutiple qua...
12 Evaluate tradeoffs between different model com...
13 apply transformations in different contexts OR...
14 Independently scope and solve realistic data s... ,
Unnamed: 0_level_0 A1 A2 \
keyword Unnamed: 1_level_1 Unnamed: 2_level_1
0 python 1 1
1 process 1 0
2 access 0 1
3 construct 0 0
4 summarize 0 0
5 visualize 0 0
6 prepare 0 0
7 evaluate 0 0
8 classification 0 0
9 regression 0 0
10 clustering 0 0
11 optimize 0 0
12 compare 0 0
13 representation 0 0
14 workflow 0 0
A3 A4 A5 \
Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1
0 0 1 1
1 0 0 0
2 1 1 1
3 0 0 1
4 1 1 1
5 1 1 0
6 0 1 1
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A6 A7 A8 \
Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1
0 0 0 0
1 1 1 1
2 0 0 0
3 0 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 1 1
8 0 1 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A9 A10 A11 \
Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
0 0 0 0
1 1 1 1
2 0 0 0
3 0 0 0
4 1 1 1
5 1 1 1
6 0 0 0
7 0 1 1
8 0 1 0
9 0 0 1
10 1 0 1
11 0 1 1
12 0 0 1
13 0 0 0
14 0 1 1
A12 A13 # Assignments
Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1
0 0 0 4
1 0 0 7
2 0 0 4
3 0 0 3
4 1 1 11
5 1 1 10
6 0 0 2
7 0 0 5
8 0 0 2
9 0 0 2
10 0 0 2
11 0 0 2
12 0 1 2
13 1 1 2
14 1 1 4 ]
This does not look like the others; it is a list of DataFrames, one for each table on the page. Even if there is only one table, the result is still a list.
Now that we know it is a list, we will save it in a variable and then, pick out one DataFrame.
ach_df_list = pd.read_html(achievements_url,)
ach_defn = ach_df_list[2]
we can pick one out and check that it is indeed a DataFrame
type(ach_defn)
pandas.core.frame.DataFrame
3.3. Slicing#
We will use a small list to practice slicing, but it works the same in a DataFrame.
toy_list = [3,4,5,6]
we can get the last item with -1
toy_list[-1]
6
We can use : to go up to, but not including, a particular point.
toy_list[:2]
[3, 4]
Note that using 2 picks out the value in position 2, but the slice above stopped before that.
toy_list[2]
5
we can slice from the right too; the starting point of a slice is included, so this gives the last two items
toy_list[-2:]
[5, 6]
we can put numbers on both sides of the : to take a segment out
toy_list[1:3]
[4, 5]
we can use a second : to specify an interval (step) at which to pick values.
toy_list[::2]
[3, 5]
or we can state the starting point with it
toy_list[1::2]
[4, 6]
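The same slicing syntax carries over to a DataFrame. A minimal sketch on a small stand-in DataFrame (made-up values): square brackets slice rows by position, and the positional indexer iloc accepts the same slices.

```python
import pandas as pd

demo_df = pd.DataFrame({'Flavor': [8.08, 7.75, 7.83, 7.92]})

print(demo_df[:2])        # first two rows, like toy_list[:2]
print(demo_df[-2:])       # last two rows, like toy_list[-2:]
print(demo_df.iloc[::2])  # every other row, like toy_list[::2]
```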
3.4. Built in iterable types#
These are four different iterable constructions:
a = [char for char in 'abcde']
b = {char:i for i, char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split(' ')
We can see their types
type(a), type(b), type(c), type(d)
(list, dict, tuple, list)
Dictionaries are really useful because they consist of key, value pairs. This is really powerful and we will use it a lot to pass complex structures into functions.
we can make another one manually that is much like the above.
sample_dict = {'a':1,'b':2,'c':3}
sample_dict
{'a': 1, 'b': 2, 'c': 3}
we access values of a dictionary using the key
sample_dict['a']
1
we can iterate over the key, value pairs using the items method to get each pair separately.
for char,num in sample_dict.items():
print('char', char, 'is linked to number',num)
char a is linked to number 1
char b is linked to number 2
char c is linked to number 3
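One way dictionaries get used to pass complex structures into functions is the ** operator, which unpacks the key, value pairs into keyword arguments. A minimal sketch (describe_coffee is a made-up function, just for illustration):

```python
# describe_coffee is a made-up function for this example
def describe_coffee(species, country, altitude):
    return species + ' from ' + country + ' grown at ' + str(altitude) + 'm'

coffee_info = {'species': 'Robusta', 'country': 'Uganda', 'altitude': 1488}

# ** unpacks the dictionary; the keys must match the parameter names
print(describe_coffee(**coffee_info))  # Robusta from Uganda grown at 1488m
```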
3.5. Saving an excerpt#
We can find only the rows with a small Number.of.Bags, say <100:
small_batch_mask = coffee_df['Number.of.Bags']<100
small_batch = coffee_df[small_batch_mask]
After we make the mask and then subset the data, we see it's a DataFrame.
type(small_batch)
pandas.core.frame.DataFrame
and then we can save it to csv.
small_batch.to_csv('robusta_small_batch.csv')
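One way to check the save worked is to read the file back in; since to_csv writes the index as the first column, passing index_col=0 to read_csv recovers the original structure. A minimal sketch with a small stand-in DataFrame (the file name here is made up):

```python
import pandas as pd

demo_batch = pd.DataFrame({'Flavor': [7.75, 7.83]}, index=[1, 2])
demo_batch.to_csv('demo_small_batch.csv')

# reading back with index_col=0 restores the saved index
reloaded = pd.read_csv('demo_small_batch.csv', index_col=0)
print(reloaded.shape == demo_batch.shape)  # True
```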