4. Pandas and Indexing#
4.1. Iterable types#
a = [char for char in 'abcde']
b = {char:i for i,char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split()
a
['a', 'b', 'c', 'd', 'e']
type(a)
list
b
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
type(b)
dict
c
('a', 'b', 'c', 'd', 'e')
type(c)
tuple
d
['a', 'b', 'c', 'd', 'e']
type(d)
list
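All four of these objects are iterable, so each can be looped over directly. A minimal sketch reusing the same four variables; note that looping over a dictionary visits its keys, while `.items()` gives key-value pairs.

```python
a = [char for char in 'abcde']                    # list
b = {char: i for i, char in enumerate('abcde')}   # dict
c = ('a', 'b', 'c', 'd', 'e')                     # tuple
d = 'a b c d e'.split()                           # list

# looping over the list, the dict, or the tuple visits the same letters here
from_list = [item for item in a]
from_dict = [key for key in b]      # iterating a dict yields its keys
from_tuple = [item for item in c]

# .items() yields (key, value) pairs instead
pairs = list(b.items())
```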
4.2. Reading data other ways#
import pandas as pd
course_comms_url = 'https://rhodyprog4ds.github.io/BrownSpring23/syllabus/communication.html'
This reads the tables directly from the HTML page.
pd.read_html(course_comms_url)
[ Day Time Location Host
0 Mon 11am-1pm Tyler 139 and zoom Kyle
1 Wed 7-8:30pm Zoom Dr. Brown
2 Fri 3-6pm Zoom Kyle]
html_list = pd.read_html(course_comms_url)
type(html_list)
list
type(html_list[0])
pandas.core.frame.DataFrame
[type(h) for h in html_list]
[pandas.core.frame.DataFrame]
achievements_url = 'https://rhodyprog4ds.github.io/BrownSpring23/syllabus/achievements.html'
Get the tables:
achievements_df_list = pd.read_html(achievements_url)
To make a list of the shapes, use a list comprehension:
[ach.shape for ach in achievements_df_list]
[(14, 3), (15, 5), (15, 15), (15, 6)]
achievements_df_list.shape
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[20], line 1
----> 1 achievements_df_list.shape
AttributeError: 'list' object has no attribute 'shape'
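The error above occurs because `pd.read_html` returns a plain Python list, and `shape` is an attribute of each DataFrame inside the list, not of the list itself. A small sketch with two DataFrames built by hand (the values are made up for illustration and are not the course tables):

```python
import pandas as pd

# two small, hypothetical tables standing in for the list read_html returns
df_list = [
    pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}),
    pd.DataFrame({'z': [7, 8]}),
]

# a list has a length, but no shape attribute
num_tables = len(df_list)
list_has_shape = hasattr(df_list, 'shape')

# each DataFrame in the list has its own shape
shapes = [df.shape for df in df_list]
```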
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url,index_col=0)
coffee_df.head(1)
| | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0 |
1 rows × 43 columns
coffee_df['Species'].head()
1 Robusta
2 Robusta
3 Robusta
4 Robusta
5 Robusta
Name: Species, dtype: object
type(coffee_df['Species'])
pandas.core.series.Series
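Selecting one column with a single label returns a Series, as above. As a side note beyond what the cells here show, passing a list of column names returns a DataFrame instead. A sketch with a small made-up table (illustrative values, not the coffee data):

```python
import pandas as pd

toy_df = pd.DataFrame({'Species': ['Robusta', 'Robusta'],
                       'Number.of.Bags': [300, 320]})

one_col = toy_df['Species']       # single label -> Series
as_frame = toy_df[['Species']]    # list of labels -> DataFrame with one column
```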
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
'Certification.Body', 'Certification.Address', 'Certification.Contact',
'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
'altitude_mean_meters'],
dtype='object')
coffee_df['Number.of.Bags']
1 300
2 320
3 300
4 320
5 1
6 200
7 320
8 320
9 320
10 320
11 320
12 320
13 100
14 1
15 320
16 300
17 140
18 1
19 20
20 6
21 100
22 250
23 100
24 1
25 1
26 1
27 1
28 1
Name: Number.of.Bags, dtype: int64
new_values = {0:'<100',1:'100-199',2:'200-299',3:'300+'}
[new_values[int(num/100)] for num in coffee_df['Number.of.Bags']]
['300+',
'300+',
'300+',
'300+',
'<100',
'200-299',
'300+',
'300+',
'300+',
'300+',
'300+',
'300+',
'100-199',
'<100',
'300+',
'300+',
'100-199',
'<100',
'<100',
'<100',
'100-199',
'200-299',
'100-199',
'<100',
'<100',
'<100',
'<100',
'<100']
bags_bin = lambda num: int(num/100)
[new_values[bags_bin(num)] for num in coffee_df['Number.of.Bags']]
['300+',
'300+',
'300+',
'300+',
'<100',
'200-299',
'300+',
'300+',
'300+',
'300+',
'300+',
'300+',
'100-199',
'<100',
'300+',
'300+',
'100-199',
'<100',
'<100',
'<100',
'100-199',
'200-299',
'100-199',
'<100',
'<100',
'<100',
'<100',
'<100']
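The dictionary lookup works for this dataset because no lot shown above has 400 or more bags; a value such as 450 would produce the key 4, which is not in new_values, and raise a KeyError. One possible alternative, not the approach used in class, is `pd.cut`, which bins values by explicit edges. A sketch with a few made-up bag counts:

```python
import pandas as pd

bags = pd.Series([300, 320, 1, 200, 100, 450])  # illustrative values only

# right=False makes each bin include its left edge: [0, 100), [100, 200), ...
bag_bins = pd.cut(bags,
                  bins=[0, 100, 200, 300, float('inf')],
                  right=False,
                  labels=['<100', '100-199', '200-299', '300+'])

bag_labels = bag_bins.tolist()
```

Because the last bin is open-ended, 450 falls into '300+' rather than causing an error.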
type(pd.read_csv)
function
type(bags_bin)
function
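A lambda is an ordinary function object, which is why both types above match. For comparison, a sketch of the same binning function written with def (the name bags_bin_def is made up for this example):

```python
bags_bin = lambda num: int(num/100)

def bags_bin_def(num):
    '''Return how many whole hundreds are in num, e.g. 307 -> 3.'''
    return int(num/100)

# both are plain Python functions and give the same answer
same_type = type(bags_bin) == type(bags_bin_def)
same_result = bags_bin(307) == bags_bin_def(307)
```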
4.3. Importing locally#
If I make a file called example.py in the same folder as my notebook and then put
%%bash
cat example.py
name = 'sarah'
in the file, we can use that file like:
from example import name
name
'sarah'
import example
example.name
'sarah'
4.4. Questions After Class#
4.4.1. why does casting the int over the (num/100) give you the right number? Is it because of floor division?#
First, let’s look at an intermediate value. Let’s pick a value for num:
num = 307
Then do the calculation without casting to int
num/100
3.07
Remember that the int type is an integer, or whole number, with no fractional part. So casting to int drops the decimal part.
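To compare with floor division directly: for positive numbers, casting to int and the // operator agree, but they differ for negative numbers, because int truncates toward zero while // rounds down. A short sketch:

```python
num = 307

cast_result = int(num/100)    # 3.07 -> 3, the decimal part is dropped
floor_result = num//100       # floor division gives 3 directly

# with a negative number the two approaches differ
neg_cast = int(-307/100)      # -3.07 truncates toward zero -> -3
neg_floor = -307//100         # floor division rounds down -> -4
```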
4.4.2. How would adding 2 DataFrames together of separate types affect the type command?#
It depends on what “add” means. If it means addition with +, it might raise an error, but if it works, the result is still a DataFrame. If it means stacking with pd.concat, the result is also a DataFrame.
If you put them into a list, then the result is a list.
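The three readings of “add” can be checked with type. A minimal sketch, assuming two small made-up DataFrames whose columns hold different numeric types:

```python
import pandas as pd

df_int = pd.DataFrame({'val': [1, 2]})          # integer column
df_float = pd.DataFrame({'val': [0.5, 1.5]})    # float column

added = df_int + df_float                 # element-wise addition
stacked = pd.concat([df_int, df_float])   # stack the rows of both
in_list = [df_int, df_float]              # a plain Python list of DataFrames

result_types = [type(added), type(stacked), type(in_list)]
```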
4.4.3. what keys to use in the dictionaries?#
In the assignment the instructions say
4.4.4. how to save as a local csv file?#
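One common way to do this is the `DataFrame.to_csv` method. A minimal sketch, assuming a small made-up DataFrame and a hypothetical file name; it writes into a temporary folder so nothing is left behind:

```python
import os
import tempfile
import pandas as pd

toy_df = pd.DataFrame({'name': ['a', 'b'], 'count': [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_data.csv')  # hypothetical file name
    toy_df.to_csv(csv_path, index=False)   # index=False leaves out the row labels
    reloaded = pd.read_csv(csv_path)       # read it back to check what was saved
```

In a notebook, passing just a file name such as 'toy_data.csv' would save the file in the current working directory, which is usually the notebook's folder.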
4.4.5. how to create a Dataframe?#
Use the constructor
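For example, the `pd.DataFrame` constructor accepts a dictionary of columns or a list of rows. A sketch with made-up values:

```python
import pandas as pd

# from a dictionary: keys become column names, values become the columns
from_dict = pd.DataFrame({'letter': ['a', 'b', 'c'], 'position': [0, 1, 2]})

# from a list of rows, naming the columns explicitly
from_rows = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]],
                         columns=['letter', 'position'])
```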
4.4.6. how to read using relative path?#
A relative path can be passed to a reader such as pd.read_csv just like a URL. Read about them here.
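A relative path is interpreted starting from the current working directory. A minimal sketch, assuming a hypothetical folder data_example and file toy_data.csv that the code creates and then removes:

```python
import os
import pandas as pd

# hypothetical folder and file name, relative to the working directory
rel_path = os.path.join('data_example', 'toy_data.csv')
os.makedirs('data_example', exist_ok=True)
pd.DataFrame({'count': [1, 2]}).to_csv(rel_path, index=False)

# the path does not start from the filesystem root, so it is relative
is_relative = not os.path.isabs(rel_path)
from_relative = pd.read_csv(rel_path)

# clean up the example file and folder
os.remove(rel_path)
os.rmdir('data_example')
```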
4.4.7. I would like to know about other common forms of data files.#
The pandas documentation’s I/O page is where I recommend starting.
4.5. What other libraries do we end up using?#
Next week we will use seaborn for plotting. Later in the semester we will use sklearn for machine learning. We will use a few other libraries for a few features, but these three (pandas, seaborn, and sklearn) are the main ones.