3. DataFrames from other sources#
Today we will:
continue examining the dataframe object
see more ways to load data
make sure you are set up for assignment 2
3.1. Indexing review#
topics = ['what is data science', 'jupyter', 'conditional','functions', 'lists', 'dictionaries','pandas' ]
topics[-1]
'pandas'
negative numbers count from the right
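For example, with the same topics list as above, -2 counts two in from the right end:
topics[-2]   # would give 'dictionaries'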
3.2. Reserved words#
These are words you do not want to use for variable names.
Python reserved words (and built-in names like print) turn green:
print
<function print>
def
Cell In[3], line 1
def
^
SyntaxError: invalid syntax
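If you want to see the full list of reserved words, the keyword module in the standard library can show it (a quick sketch, not something we used in class):
import keyword
keyword.kwlist   # the full list of Python reserved words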
3.3. Built in iterable types#
These are four different iterable constructions:
a = [char for char in 'abcde']
b = {char:i for i, char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split(' ')
We can see their types
type(a), type(b), type(c), type(d)
(list, dict, tuple, list)
Dictionaries are really useful because they consist of key-value pairs. This is really powerful and we will use it a lot to pass complex structures into functions (there is a small sketch of that at the end of this section).
b
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
a
['a', 'b', 'c', 'd', 'e']
Where we index lists with numbers
a[0]
'a'
we can access the items (the values) in a dictionary using square brackets and the keys
b['b']
1
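Here is a small sketch of passing a dictionary into a function (the describe function is a made-up example, not something we defined in class): a dictionary can be unpacked into a function's keyword arguments with **.
def describe(letter, number):
    # made-up helper just to show ** unpacking
    return letter + ' is at position ' + str(number)

describe(**{'letter': 'a', 'number': 0})   # 'a is at position 0'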
3.4. Building iterables quickly with Comprehensions#
list comprehensions are super handy: we can make a list using a loop all in one line
The constructions above for a and b are called list and dictionary comprehensions. They are equivalent to using a loop, but are a more concise way to build a list (or dictionary).
a_long = []
for char in 'abcde':
    a_long.append(char)
Notice that even in this for loop the loop variable is a conceptually meaningful variable and we iterate over the items in an iterable type object. This is in contrast to creating a loop variable that is an integer. This loop style is considered good pythonic strategy.
For more detail, see the Python docs section on looping strategies
a_long
['a', 'b', 'c', 'd', 'e']
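For comparison, here is a sketch of the loop version of the dictionary comprehension that built b (using a new name, b_long, so it does not overwrite b):
b_long = {}
for i, char in enumerate('abcde'):
    b_long[char] = i
b_long   # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}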
Hint
Programming is a practice: the goal is not to memorize everything, but to be exposed to enough that you remember what you can look up later.
enumerate is a built-in function that gives you both a count and an item to use in a loop or comprehension. You can read the help below or the technical details in the official Python Docs
help(enumerate)
Help on class enumerate in module builtins:
class enumerate(object)
| enumerate(iterable, start=0)
|
| Return an enumerate object.
|
| iterable
| an object supporting iteration
|
| The enumerate object yields pairs containing a count (from start, which
| defaults to zero) and a value yielded by the iterable argument.
|
| enumerate is useful for obtaining an indexed list:
| (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
|
| Methods defined here:
|
| __getattribute__(self, name, /)
| Return getattr(self, name).
|
| __iter__(self, /)
| Implement iter(self).
|
| __next__(self, /)
| Implement next(self).
|
| __reduce__(...)
| Return state information for pickling.
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
3.5. Read DataFrames from HTML#
let’s use read_html on the course communications page and then inspect what we get to figure it out
course_comms_url = 'https://rhodyprog4ds.github.io/BrownFall23/syllabus/communication.html'
we will first need our library
import pandas as pd
then we will read it in without saving it and look at the output to see what it looks like.
pd.read_html(course_comms_url)
[ Day Time Location Host
0 Monday 12pm-2pm Zoom Mark
1 Monday 4-5pm Zoom Dr. Brown
2 Friday 4-5pm 134 Tyler Dr. Brown,
usage platform \
0 in class prismia
1 any time prismia
2 private questions to your assignment github
3 for general questions that can help others github
4 to share resources or ask general questions in... github
5 matters that don't fit into another category e-mail
area \
0 chat
1 download transcript
2 issue on assignment repo
3 issue on course website
4 discussion on community repo
5 to brownsarahm@uri.edu
note
0 outside of class time this is not monitored cl...
1 use after class to get preliminary notes eg if...
2 eg bugs in your code"
3 eg what the instructions of an assignment mean...
4 include links in your portfolio
5 remember to include `[CSC310]` or `[DSP310]` (... ]
now we will save it to a variable for future use.
comm_df_list = pd.read_html(course_comms_url)
we can check the type; it is a list, as we noted from looking at the output.
type(comm_df_list)
list
and each item in the list is a DataFrame
type(comm_df_list[0])
pandas.core.frame.DataFrame
DataFrames also have a shape attribute, to tell us the number of rows and columns.
comm_df_list[0].shape
(3, 4)
achievements_url = 'https://rhodyprog4ds.github.io/BrownFall23/syllabus/achievements.html'
Make a list of the shapes of all of the tables on the syllabus Achievements page.
This is a good job for a list comprehension.
shape_list_comp =[df.shape for df in pd.read_html(achievements_url)]
shape_list_comp
[(14, 3), (15, 5), (15, 15), (15, 6)]
Again, we can write this out as a for loop with append, but the comprehension is more concise.
shape_list = []
for df in pd.read_html(achievements_url):
    shape_list.append(df.shape)
in the comprehension structure, the [] are what make it a list; square brackets are what make anything a list
type([1,2,3])
list
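If we only want one of the tables, we have to pick it out of the list ourselves. One possible sketch (this selection rule is just an illustration, not a requirement) is to take the table with the most rows:
ach_df_list = pd.read_html(achievements_url)
# shape[0] is the number of rows; max picks the longest table
longest_table = max(ach_df_list, key=lambda df: df.shape[0])
longest_table.shape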
3.6. More DataFrame Indexing#
we’ll go back to our coffee data
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url,index_col=0)
See again our shape
coffee_df.shape
(28, 43)
and the first few rows
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 rows × 43 columns
we can also see a random sample, not only the head and tail
coffee_df.sample(3)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | Robusta | cafe politico | India | NaN | NaN | NaN | 14-1118-2014-0087 | cafe politico | NaN | NaN | ... | Green | 1 | August 25th, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | NaN | NaN | NaN |
5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | luwero central region | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
23 | Robusta | andrew hetzel | United States | sethuraman estates | NaN | sethuraman estates | NaN | cafemakers, llc | 3000' | chikmagalur | ... | Green | 0 | February 28th, 2013 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3000.0 | 3000.0 | 3000.0 |
3 rows × 43 columns
Hint
printing out the list of columns is a helpful way to get names you can copy-paste for later selection, to ensure no typos. In a polished notebook you could then delete a cell like the one below, but it is really helpful while you are working
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
'Certification.Body', 'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
'altitude_mean_meters'],
dtype='object')
We can subset columns by passing a list of multiple columns to use for indexing
columns_of_interest = ['Owner', 'Country.of.Origin']
coffee_df[columns_of_interest].head(1)
Owner | Country.of.Origin | |
---|---|---|
1 | ankole coffee producers coop | Uganda |
it has to be a list though; if we put them in one set of square brackets, pandas sees the key as a tuple and we get a KeyError, because it looks for a single column named ('Owner', 'Country.of.Origin')
coffee_df['Owner', 'Country.of.Origin']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
3652 try:
-> 3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('Owner', 'Country.of.Origin')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[31], line 1
----> 1 coffee_df['Owner', 'Country.of.Origin']
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
-> 3655 raise KeyError(key) from err
3656 except TypeError:
3657 # If we have a listlike key, _check_indexing_error will raise
3658 # InvalidIndexError. Otherwise we fall through and re-raise
3659 # the TypeError.
3660 self._check_indexing_error(key)
KeyError: ('Owner', 'Country.of.Origin')
instead we can use two sets of square brackets if we do not want a separate variable
coffee_df[['Owner', 'Country.of.Origin']].head(1)
Owner | Country.of.Origin | |
---|---|---|
1 | ankole coffee producers coop | Uganda |
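A related detail worth knowing (a quick sketch, not something we needed in class): selecting with a single column name gives a Series, while selecting with a list of names, even a list of one, gives a DataFrame.
type(coffee_df['Owner']), type(coffee_df[['Owner']])
# (pandas.core.series.Series, pandas.core.frame.DataFrame)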
3.7. Subsetting by values#
We can use boolean operators on a pandas.Series and it will apply them automatically to every element
is_green = coffee_df['Color'] == 'Green'
is_green
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 False
18 True
19 True
20 True
21 False
22 True
23 True
24 False
25 False
26 False
27 True
28 False
Name: Color, dtype: bool
then we can look at the shape and see that it is the same shape as the column we selected.
is_green.shape, coffee_df['Color'].shape
((28,), (28,))
now we can use that to subset the rows
green_coffee_df = coffee_df[is_green]
green_coffee_df.head()
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | chikmagalur | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | central | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | luwero central region | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
6 | Robusta | andrew hetzel | India | NaN | NaN | (self) | NaN | cafemakers, llc | 3000' | chikmagalur | ... | Green | 0 | February 28th, 2013 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3000.0 | 3000.0 | 3000.0 |
5 rows × 43 columns
and look at the shape to see how many rows were kept
green_coffee_df.shape
(20, 43)
is_green.sum()
20
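The same pattern works for any condition; for example, a sketch that keeps only the rows from Uganda (the exact count will depend on the data):
is_uganda = coffee_df['Country.of.Origin'] == 'Uganda'
coffee_df[is_uganda].shape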
3.8. Python has no switch#
we use dictionaries in those kinds of cases
score_text = {False:'low',
True:'high'}
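Looking up a single True/False value in this dictionary is what makes it act like a switch (the scores here are made up just to show the lookup):
score_text[7.5 >= 7]   # 'high'
score_text[6.5 >= 7]   # 'low'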
Here we can switch from True/False values to 'high'/'low'.
this gives True/False values for whether the flavor score is at least 7
coffee_df['Flavor']>=7
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 True
22 True
23 True
24 True
25 True
26 True
27 False
28 False
Name: Flavor, dtype: bool
and this is high/low instead
[score_text[flavor_comp] for flavor_comp in coffee_df['Flavor']>=7]
['high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'low',
'low']
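A more pandas-native alternative (a sketch we did not use in class, though Series.map does exist for exactly this) passes the dictionary to map so the result stays a Series:
(coffee_df['Flavor'] >= 7).map(score_text)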
3.9. Keeping a clean notebook#
we can put code in a python file and include it in our notebooks to use it
this can be useful if:
you have a long hard to read thing that distracts from your other analysis
you have a function you want to reuse a lot
(unlikely in class) you need to make your own library!
I created a separate file called example.py and defined a variable in it like:
name ='sarah'
now I can import that and use it.
from example import name
name
'sarah'
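The same works for functions; for example, if example.py also contained a small function (the greet function below is hypothetical, it is not in the file I made in class), we could import it the same way:
# assuming example.py also contains:
# def greet(who):
#     return 'hello ' + who

from example import greet
greet(name)   # would return 'hello sarah'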
3.10. Additional hints#
pandas objects have
dtype
infousing any pandas method is okay, including some we have not seen if it is a single method, for example the
select_dtypes
method docsYour task is partially to learn other IO methods, so the pandas docs IO page is a good resource
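For example, a sketch of select_dtypes, which keeps only the columns of the requested type:
# keep only the numeric columns of the coffee data
coffee_df.select_dtypes(include='number').shape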
3.11. Questions After Class#
Important
some questions are not answered below because they are explained in the notes above or they are too vague. You can come to office hours if you have a question that is not here, or post a more detailed question on this repo or your assignment
3.11.1. what is pandas?#
it is a Python library. Read more at the user guide
3.11.2. My question is how is the data frame being accessed from the url and how can I understand it more clearly?#
pd.read_ functions can make web requests, so they can read data online and load it directly into memory. To understand in greater detail, I recommend the docs, and then follow the links there to the level of depth that you want.
3.11.3. what does the shape of a dataframe do?#
It does not do anything; it is just information (the number of rows and columns) that we can use.
3.11.4. Why do we need dictionaries to create new rows in the dataframes rather than operators?#
We did not use the dictionary to create new rows; we used it to map values to other values. We will see this pattern throughout the course.
3.11.5. how to figure out which dataframes from html are useful#
we have to look at them.
3.11.6. How to download datasets#
For your assignment, you can load directly with a URL
3.11.7. Is the sum() method only counting true values, and if so, is it simply treating them as 1?#
It is adding the values after casting them to integers.
int(True), int(False)
(1, 0)
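A tiny sketch of that:
sum([True, True, False])   # 2, because each True counts as 1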
3.11.8. Why does the thing that happens right before a for in loop apply to all of the values? I think I know but just to be sure#
in a list comprehension, the part before the for is like the loop body; see above where I defined a_long and compare it to the definition of a
3.11.9. I would like to learn more about dictionaries#
I recommend starting in the Python language docs section on dictionaries. They are a very powerful structure; the text there is technical, but there are plenty of links. It is really good practice to get good at parsing through technical docs like this.