3. DataFrames from other sources#
Today we will:
continue examining the dataframe object
see more ways to load data
make sure you are set up for assignment 2
3.1. Indexing review#
topics = ['what is data science', 'jupyter', 'conditional','functions', 'lists', 'dictionaries','pandas' ]
topics[-1]
'pandas'
negative numbers count from the right
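For example, with the same topics list as above, -2 counts two in from the right end:
topics[-2]   # would give 'dictionaries'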
3.2. Reserved words#
These are words you do not want to use for variable names.
Python reserved words (and built-in names like print) turn green:
print
<function print>
def
Cell In[3], line 1
def
^
SyntaxError: invalid syntax
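If you want to see the full list of reserved words, the keyword module in the standard library can show it (a quick sketch, not something we used in class):
import keyword
keyword.kwlist   # the full list of Python reserved words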
3.3. Built in iterable types#
These are four different iterable constructions:
a = [char for char in 'abcde']
b = {char:i for i, char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split(' ')
We can see their types
type(a), type(b), type(c), type(d)
(list, dict, tuple, list)
Dictionaries are really useful because they consist of key-value pairs. This is really powerful and we will use it a lot to pass complex structures into functions (there is a small sketch of that at the end of this section).
b
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
a
['a', 'b', 'c', 'd', 'e']
Where we index lists with numbers
a[0]
'a'
we can access the items (the values) in a dictionary using square brackets and the keys
b['b']
1
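Here is a small sketch of passing a dictionary into a function (the describe function is a made-up example, not something we defined in class): a dictionary can be unpacked into a function's keyword arguments with **.
def describe(letter, number):
    # made-up helper just to show ** unpacking
    return letter + ' is at position ' + str(number)

describe(**{'letter': 'a', 'number': 0})   # 'a is at position 0'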
3.4. Building iterables quickly with Comprehensions#
list comprehensions are super handy: we can make a list using a loop all in one line
The constructions above for a and b are called list and dictionary comprehensions. They are equivalent to using a loop, but are a more concise way to build a list (or dictionary).
a_long = []
for char in 'abcde':
    a_long.append(char)
Notice that even in this for loop the loop variable is a conceptually meaningful variable and we iterate over the items in an iterable type object. This is in contrast to creating a loop variable that is an integer. This loop style is considered good pythonic strategy.
For more detail, see the Python docs section on looping strategies
a_long
['a', 'b', 'c', 'd', 'e']
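For comparison, here is a sketch of the loop version of the dictionary comprehension that built b (using a new name, b_long, so it does not overwrite b):
b_long = {}
for i, char in enumerate('abcde'):
    b_long[char] = i
b_long   # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}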
Hint
Programming is a practice: the goal is not to memorize everything, but to be exposed to enough that you remember what you can look up later.
enumerate is a built-in function that gives you both a count and an item to use in a loop or comprehension. You can read the help below or the technical details in the official Python Docs
help(enumerate)
Help on class enumerate in module builtins:
class enumerate(object)
| enumerate(iterable, start=0)
|
| Return an enumerate object.
|
| iterable
| an object supporting iteration
|
| The enumerate object yields pairs containing a count (from start, which
| defaults to zero) and a value yielded by the iterable argument.
|
| enumerate is useful for obtaining an indexed list:
| (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
|
| Methods defined here:
|
| __getattribute__(self, name, /)
| Return getattr(self, name).
|
| __iter__(self, /)
| Implement iter(self).
|
| __next__(self, /)
| Implement next(self).
|
| __reduce__(...)
| Return state information for pickling.
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
3.5. Read DataFrames from HTML#
let’s use read_html on the course communications page and then inspect what we get to figure it out
course_comms_url = 'https://rhodyprog4ds.github.io/BrownFall23/syllabus/communication.html'
we will first need our library
import pandas as pd
then we will read it in without saving it and look at the output to see what it looks like.
pd.read_html(course_comms_url)
[ Day Time Location Host
0 Monday 12pm-2pm Zoom Mark
1 Monday 4-5pm Zoom Dr. Brown
2 Friday 4-5pm 134 Tyler Dr. Brown,
usage platform \
0 in class prismia
1 any time prismia
2 private questions to your assignment github
3 for general questions that can help others github
4 to share resources or ask general questions in... github
5 matters that don't fit into another category e-mail
area \
0 chat
1 download transcript
2 issue on assignment repo
3 issue on course website
4 discussion on community repo
5 to brownsarahm@uri.edu
note
0 outside of class time this is not monitored cl...
1 use after class to get preliminary notes eg if...
2 eg bugs in your code"
3 eg what the instructions of an assignment mean...
4 include links in your portfolio
5 remember to include `[CSC310]` or `[DSP310]` (... ]
now we will save it to a variable for future use.
comm_df_list = pd.read_html(course_comms_url)
we can check the type; it is a list, as we noted from looking at the output.
type(comm_df_list)
list
and each item in the list is a DataFrame
type(comm_df_list[0])
pandas.core.frame.DataFrame
DataFrames also have a shape attribute, to tell us the number of rows and columns.
comm_df_list[0].shape
(3, 4)
achievements_url = 'https://rhodyprog4ds.github.io/BrownFall23/syllabus/achievements.html'
Make a list of the shapes of all of the tables on the syllabus Achievements page.
This is a good job for a list comprehension.
shape_list_comp =[df.shape for df in pd.read_html(achievements_url)]
shape_list_comp
[(14, 3), (15, 5), (15, 15), (15, 6)]
Again, we can write this out as a for loop with append, but the comprehension is more concise.
shape_list = []
for df in pd.read_html(achievements_url):
    shape_list.append(df.shape)
in the comprehension structure, the [] are what make it a list; square brackets are what make anything a list
type([1,2,3])
list
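If we only want one of the tables, we have to pick it out of the list ourselves. One possible sketch (this selection rule is just an illustration, not a requirement) is to take the table with the most rows:
ach_df_list = pd.read_html(achievements_url)
# shape[0] is the number of rows; max picks the longest table
longest_table = max(ach_df_list, key=lambda df: df.shape[0])
longest_table.shape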
3.6. More DataFrame Indexing#
we’ll go back to our coffee data
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url,index_col=0)
See again our shape
coffee_df.shape
(28, 43)
and the first few rows
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 rows × 43 columns
we can also see a random sample, not only the head and tail
coffee_df.sample(3)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | Robusta | cafe politico | India | NaN | NaN | NaN | 14-1118-2014-0087 | cafe politico | NaN | NaN | ... | Green | 1 | August 25th, 2015 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | NaN | NaN | NaN |
5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | luwero central region | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
23 | Robusta | andrew hetzel | United States | sethuraman estates | NaN | sethuraman estates | NaN | cafemakers, llc | 3000' | chikmagalur | ... | Green | 0 | February 28th, 2013 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3000.0 | 3000.0 | 3000.0 |
3 rows × 43 columns
Hint
printing out the list of columns is a helpful way to get names you can copy-paste for later selection, to ensure no typos. In a polished notebook you could then delete a cell like the one below, but it is really helpful while you are working
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
'Certification.Body', 'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
'altitude_mean_meters'],
dtype='object')
We can subset columns by passing a list of multiple columns to use for indexing
columns_of_interest = ['Owner', 'Country.of.Origin']
coffee_df[columns_of_interest].head(1)
Owner | Country.of.Origin | |
---|---|---|
1 | ankole coffee producers coop | Uganda |
it has to be a list though; if we put them in one set of square brackets, pandas sees the key as a tuple and we get a KeyError, because it looks for a single column named ('Owner', 'Country.of.Origin')
coffee_df['Owner', 'Country.of.Origin']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
3652 try:
-> 3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('Owner', 'Country.of.Origin')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[31], line 1
----> 1 coffee_df['Owner', 'Country.of.Origin']
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
-> 3655 raise KeyError(key) from err
3656 except TypeError:
3657 # If we have a listlike key, _check_indexing_error will raise
3658 # InvalidIndexError. Otherwise we fall through and re-raise
3659 # the TypeError.
3660 self._check_indexing_error(key)
KeyError: ('Owner', 'Country.of.Origin')
instead we can use two sets of square brackets if we do not want a separate variable
coffee_df[['Owner', 'Country.of.Origin']].head(1)
Owner | Country.of.Origin | |
---|---|---|
1 | ankole coffee producers coop | Uganda |
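A related detail worth knowing (a quick sketch, not something we needed in class): selecting with a single column name gives a Series, while selecting with a list of names, even a list of one, gives a DataFrame.
type(coffee_df['Owner']), type(coffee_df[['Owner']])
# (pandas.core.series.Series, pandas.core.frame.DataFrame)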
3.7. Subsetting by values#
We can use boolean operators on a pandas.Series and it will apply them automatically to every element
is_green = coffee_df['Color'] == 'Green'
is_green
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 False
18 True
19 True
20 True
21 False
22 True
23 True
24 False
25 False
26 False
27 True
28 False
Name: Color, dtype: bool
then we can look at the shape and see that it is the same shape as the column we selected.
is_green.shape, coffee_df['Color'].shape
((28,), (28,))
now we can use that to subset the rows
green_coffee_df = coffee_df[is_green]
green_coffee_df.head()
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
3 | Robusta | andrew hetzel | India | sethuraman estate | NaN | NaN | 0000 | sethuraman estate | 1000m | chikmagalur | ... | Green | 0 | April 29th, 2016 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 1000.0 | 1000.0 | 1000.0 |
4 | Robusta | ugacof | Uganda | ugacof project area | NaN | ugacof | 0 | ugacof ltd | 1212 | central | ... | Green | 7 | July 14th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1212.0 | 1212.0 | 1212.0 |
5 | Robusta | katuka development trust ltd | Uganda | katikamu capca farmers association | NaN | katuka development trust | 0 | katuka development trust ltd | 1200-1300 | luwero central region | ... | Green | 3 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1200.0 | 1300.0 | 1250.0 |
6 | Robusta | andrew hetzel | India | NaN | NaN | (self) | NaN | cafemakers, llc | 3000' | chikmagalur | ... | Green | 0 | February 28th, 2013 | Specialty Coffee Association | ff7c18ad303d4b603ac3f8cff7e611ffc735e720 | 352d0cf7f3e9be14dad7df644ad65efc27605ae2 | m | 3000.0 | 3000.0 | 3000.0 |
5 rows × 43 columns
and look at the shape to see how many rows were kept
green_coffee_df.shape
(20, 43)
is_green.sum()
20
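The same pattern works for any condition; for example, a sketch that keeps only the rows from Uganda (the exact count will depend on the data):
is_uganda = coffee_df['Country.of.Origin'] == 'Uganda'
coffee_df[is_uganda].shape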
3.8. Python has no switch#
we use dictionaries in those kinds of cases
score_text = {False:'low',
True:'high'}
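Looking up a single True/False value in this dictionary is what makes it act like a switch (the scores here are made up just to show the lookup):
score_text[7.5 >= 7]   # 'high'
score_text[6.5 >= 7]   # 'low'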
Here we can switch from True/False values to 'high'/'low'.
this gives True/False values for whether the flavor score is at least 7
coffee_df['Flavor']>=7
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 True
22 True
23 True
24 True
25 True
26 True
27 False
28 False
Name: Flavor, dtype: bool
and this is high/low instead
[score_text[flavor_comp] for flavor_comp in coffee_df['Flavor']>=7]
['high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'high',
'low',
'low']
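A more pandas-native alternative (a sketch we did not use in class, though Series.map does exist for exactly this) passes the dictionary to map so the result stays a Series:
(coffee_df['Flavor'] >= 7).map(score_text)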
3.9. Keeping a clean notebook#
we can put code in a python file and include it in our notebooks to use it
this can be useful if:
you have a long hard to read thing that distracts from your other analysis
you have a function you want to reuse a lot
(unlikely in class) you need to make your own library!
I created a separate file called example.py and defined a variable in it like:
name ='sarah'
now I can import that and use it.
from example import name
name
'sarah'
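The same works for functions; for example, if example.py also contained a small function (the greet function below is hypothetical, it is not in the file I made in class), we could import it the same way:
# assuming example.py also contains:
# def greet(who):
#     return 'hello ' + who

from example import greet
greet(name)   # would return 'hello sarah'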
3.10. Additional hints#
pandas objects have
dtype
infousing any pandas method is okay, including some we have not seen if it is a single method, for example the
select_dtypes
method docsYour task is partially to learn other IO methods, so the pandas docs IO page is a good resource
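For example, a sketch of select_dtypes, which keeps only the columns of the requested type:
# keep only the numeric columns of the coffee data
coffee_df.select_dtypes(include='number').shape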
3.11. Questions After Class#
Important
some questions are not answered below because they are explained in the notes above or they are too vague. You can come to office hours if you have a question that is not here, or post a more detailed question on this repo or your assignment
3.11.1. what is pandas?#
it is a Python library. Read more at the user guide
3.11.2. My question is how is the data frame being accessed from the url and how can I understand it more clearly?#
pd.read_ functions can make web requests, so they can read data online and load it directly into memory. To understand in greater detail, I recommend the docs, and then follow the links there to the level of depth that you want.
3.11.3. what does the shape of a dataframe do?#
It does not do anything; it is just information (the number of rows and columns) that we can use.
3.11.4. Why do we need dictionaries to create new rows in the dataframes rather than operators?#
We did not use the dictionary to create new rows; we used it to map values to other values. We will see this pattern throughout the course.
3.11.5. how to figure out which dataframes from html are useful#
we have to look at them.
3.11.6. How to download datasets#
For your assignment, you can load directly with a URL
3.11.7. Is the sum() method only counting true values, and if so, is it simply treating them as 1?#
It is adding the values after casting them to integers.
int(True), int(False)
(1, 0)
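A tiny sketch of that:
sum([True, True, False])   # 2, because each True counts as 1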
3.11.8. Why does the thing that happens right before a for in loop apply to all of the values? I think I know but just to be sure#
in a list comprehension, the part before the for is like the loop body; see above where I defined a_long and compare it to the definition of a
3.11.9. I would like to learn more about dictionaries#
I recommend starting in the Python language docs section on dictionaries. They are a very powerful structure; the text there is technical, but there are plenty of links. It is really good practice to get good at parsing through technical docs like this.