4. Pandas and Indexing

4.1. Iterable types

a = [char for char in 'abcde']
b = {char:i for i,char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split()
a
['a', 'b', 'c', 'd', 'e']
type(a)
list
b
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
type(b)
dict
c
('a', 'b', 'c', 'd', 'e')
type(c)
tuple
d
['a', 'b', 'c', 'd', 'e']
type(d)
list
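
As a small sketch of what "iterable" means in practice, the four objects above can each drive a loop or a comprehension. The result names (from_list, dict_values, and so on) are only for illustration and are not from the class notebook.

```python
# The same four objects defined above
a = [char for char in 'abcde']
b = {char: i for i, char in enumerate('abcde')}
c = ('a', 'b', 'c', 'd', 'e')
d = 'a b c d e'.split()

# Every one of them can drive a for loop or a comprehension
from_list = [item for item in a]
from_dict = [key for key in b]      # iterating over a dict yields its keys
from_tuple = [item for item in c]
from_split = [item for item in d]

# A dict's values and (key, value) pairs are available explicitly
dict_values = list(b.values())
dict_pairs = list(b.items())
```

Iterating over the dictionary directly gives only the keys; use .values() or .items() when the numbers are needed too.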

4.2. Reading data other ways

import pandas as pd
course_comms_url = 'https://rhodyprog4ds.github.io/BrownSpring23/syllabus/communication.html'

This reads in the tables from the HTML directly.

pd.read_html(course_comms_url)
[   Day      Time            Location       Host
 0  Mon  11am-1pm  Tyler 139 and zoom       Kyle
 1  Wed  7-8:30pm                Zoom  Dr. Brown
 2  Fri     3-6pm                Zoom       Kyle]
html_list = pd.read_html(course_comms_url)
type(html_list)
list
type(html_list[0])
pandas.core.frame.DataFrame
[type(h) for h in html_list]
[pandas.core.frame.DataFrame]
achievements_url = 'https://rhodyprog4ds.github.io/BrownSpring23/syllabus/achievements.html'

Get the tables:

achievements_df_list = pd.read_html(achievements_url)

When asked to make a list, use a list comprehension:

[ach.shape for ach in achievements_df_list]
[(14, 3), (15, 5), (15, 15), (15, 6)]
achievements_df_list.shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[20], line 1
----> 1 achievements_df_list.shape

AttributeError: 'list' object has no attribute 'shape'
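
The error above happens because the list itself is a plain Python list; only the DataFrames inside it have a shape. A minimal sketch of the difference, using two made-up DataFrames (df_list is a hypothetical stand-in for what pd.read_html returns):

```python
import pandas as pd

# Hypothetical stand-ins for the tables that pd.read_html returns
df_list = [
    pd.DataFrame({'x': [1, 2, 3]}),
    pd.DataFrame({'x': [1, 2], 'y': [3, 4]}),
]

num_tables = len(df_list)               # a list has a length, not a shape
shapes = [df.shape for df in df_list]   # each DataFrame has its own shape
has_shape = hasattr(df_list, 'shape')   # False for a plain list
```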
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
coffee_df = pd.read_csv(coffee_data_url,index_col=0)
coffee_df.head(1)
  | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters
1 | Robusta | ankole coffee producers coop | Uganda | kyangundu cooperative society | NaN | ankole coffee producers | 0 | ankole coffee producers coop | 1488 | sheema south western | ... | Green | 2 | June 26th, 2015 | Uganda Coffee Development Authority | e36d0270932c3b657e96b7b0278dfd85dc0fe743 | 03077a1c6bac60e6f514691634a7f6eb5c85aae8 | m | 1488.0 | 1488.0 | 1488.0

1 rows × 43 columns

coffee_df['Species'].head()
1    Robusta
2    Robusta
3    Robusta
4    Robusta
5    Robusta
Name: Species, dtype: object
type(coffee_df['Species'])
pandas.core.series.Series
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
       'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
       'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
       'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
       'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
       'Certification.Body', 'Certification.Address', 'Certification.Contact',
       'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
       'altitude_mean_meters'],
      dtype='object')
coffee_df['Number.of.Bags']
1     300
2     320
3     300
4     320
5       1
6     200
7     320
8     320
9     320
10    320
11    320
12    320
13    100
14      1
15    320
16    300
17    140
18      1
19     20
20      6
21    100
22    250
23    100
24      1
25      1
26      1
27      1
28      1
Name: Number.of.Bags, dtype: int64
new_values = {0:'<100',1:'100-199',2:'200-299',3:'300+'}
[new_values[int(num/100)] for num in coffee_df['Number.of.Bags']]
['300+',
 '300+',
 '300+',
 '300+',
 '<100',
 '200-299',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '100-199',
 '<100',
 '300+',
 '300+',
 '100-199',
 '<100',
 '<100',
 '<100',
 '100-199',
 '200-299',
 '100-199',
 '<100',
 '<100',
 '<100',
 '<100',
 '<100']
bags_bin = lambda num: int(num/100)
[new_values[bags_bin(num)] for num in coffee_df['Number.of.Bags']]
['300+',
 '300+',
 '300+',
 '300+',
 '<100',
 '200-299',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '100-199',
 '<100',
 '300+',
 '300+',
 '100-199',
 '<100',
 '<100',
 '<100',
 '100-199',
 '200-299',
 '100-199',
 '<100',
 '<100',
 '<100',
 '<100',
 '<100']
type(pd.read_csv)
function
type(bags_bin)
function
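
One caveat with this binning: new_values only has keys 0 through 3, so a bag count of 400 or more would raise a KeyError (the coffee data tops out at 320, so it does not come up here). The sketch below is one possible way to guard against that. The function bags_bin_safe and the sample values are hypothetical, and pd.cut is shown as an alternative rather than what was used in class.

```python
import pandas as pd

new_values = {0: '<100', 1: '100-199', 2: '200-299', 3: '300+'}


def bags_bin_safe(num):
    """Hypothetical variant of bags_bin that caps the bin index at 3."""
    return min(int(num / 100), 3)


# Made-up sample values, including one above 399
sample_bags = pd.Series([1, 140, 250, 320, 450])

labels_loop = [new_values[bags_bin_safe(num)] for num in sample_bags]

# pd.cut can do the same binning; right=False makes each bin include its left edge
labels_cut = pd.cut(
    sample_bags,
    bins=[0, 100, 200, 300, float('inf')],
    right=False,
    labels=['<100', '100-199', '200-299', '300+'],
).tolist()
```

Both approaches give the same labels for these values; the def version is also a function, just like the lambda.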

4.3. Importing locally

If I make a file called example.py in the same folder as my notebook and put the following line in it (shown here by printing the file with cat):

%%bash
cat example.py
name = 'sarah'

then we can use that file like:

from example import name
name
'sarah'
import example
example.name
'sarah'
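
For a version that can be run anywhere, here is a rough sketch that creates the file from Python first. The module name demo_module and the temporary folder are assumptions made for this illustration; in class the file was simply created by hand next to the notebook, whose working folder is normally already on Python's search path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway folder and write a one-line module into it.
# demo_module is a hypothetical name, chosen to avoid clashing with a real example.py.
module_dir = tempfile.mkdtemp()
Path(module_dir, 'demo_module.py').write_text("name = 'sarah'\n")

# Python looks for modules in the folders listed in sys.path,
# so add the throwaway folder to the front of that list
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

# Both import styles from above work on the new file
import demo_module
from demo_module import name
```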

4.4. Questions After Class

4.4.1. Why does casting (num/100) to int give you the right number? Is it because of floor division?

First let’s look at an interim value. Let’s pick a value for num:

num = 307

Then do the calculation without casting to int

num/100
3.07

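Remember that the int type is an integer, or whole number, with no fractional part. So casting to int drops the decimal part. This is not floor division (the // operator), although for non-negative numbers the two give the same result.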
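
To make the comparison with floor division concrete, here is a short sketch. The variable names are only for illustration.

```python
num = 307

truncated = int(num / 100)        # 3.07 becomes 3
floored = num // 100              # floor division, also 3 here

# The two differ for negative numbers: int() truncates toward zero,
# while // rounds down toward negative infinity
neg_truncated = int(-307 / 100)   # -3
neg_floored = -307 // 100         # -4
```

Since bag counts are never negative, either approach would give the same bins for the coffee data.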

4.4.2. How would adding 2 DataFrames together of separate types affect the type command?

It depends on what “add” means. If it means addition with +, it might raise an error, but if it worked, the result would still be a DataFrame. If it means stacking with pd.concat, the result would also be a DataFrame.

If you put them into a list, then the result would be a list.
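
A quick way to check is to try each meaning of "add" on small examples and look at the types. The two DataFrames below are made up for this sketch; with other column types the + step could fail.

```python
import pandas as pd

# Two small made-up DataFrames with different column types (int and float)
df_num = pd.DataFrame({'x': [1, 2]})
df_float = pd.DataFrame({'x': [0.5, 1.5]})

added = df_num + df_float                 # element-wise addition
stacked = pd.concat([df_num, df_float])   # stack the rows
as_list = [df_num, df_float]              # just a list holding both

types = [type(added), type(stacked), type(as_list)]
```

Here types holds DataFrame, DataFrame, and list, in that order.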

4.4.3. What keys to use in the dictionaries?

In the assignment, the instructions say

4.4.4. How to save as a local csv file?

Use the pandas.DataFrame.to_csv method.
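
A minimal sketch of saving and reading back, assuming a small made-up DataFrame and a hypothetical file name my_data.csv:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for a real DataFrame
df = pd.DataFrame({'name': ['a', 'b'], 'count': [1, 2]})

# Write into a throwaway folder; in a notebook, a plain file name such as
# 'my_data.csv' would save into the current working folder instead
csv_path = Path(tempfile.mkdtemp()) / 'my_data.csv'
df.to_csv(csv_path)

# index_col=0 reads the saved index back in, as with the coffee data above
df_back = pd.read_csv(csv_path, index_col=0)
```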

4.4.5. How to create a DataFrame?

Use the constructor, pd.DataFrame.
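
Two common ways to call the constructor, sketched with made-up data (the column names letter and position are only for illustration):

```python
import pandas as pd

# From a dictionary: keys become column names, values become the columns
df_from_dict = pd.DataFrame({'letter': ['a', 'b', 'c'], 'position': [0, 1, 2]})

# From a list of rows, naming the columns explicitly
df_from_rows = pd.DataFrame(
    [['a', 0], ['b', 1], ['c', 2]],
    columns=['letter', 'position'],
)
```

Both calls build the same 3 row by 2 column table.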

4.4.6. How to read using a relative path?

A relative path can work just like a URL. Read about them here.
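
As a rough illustration of what "relative" means, the sketch below builds a relative path and shows the full location it points to. The folder and file names (data, my_data.csv) are hypothetical; a path like this could be passed to pd.read_csv in place of the URL, provided the file exists.

```python
from pathlib import Path

# A relative path is interpreted starting from the current working folder,
# which for a notebook is usually the folder the notebook is in
relative = Path('data') / 'my_data.csv'

# resolve() shows the full (absolute) location the relative path points to
absolute = relative.resolve()

is_relative = not relative.is_absolute()
```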

4.4.7. I would like to know about other common forms of data files.

The pandas documentation’s I/O page is where I recommend starting.

4.5. What other libraries do we end up using?

Next week we will use seaborn for plotting. Later in the semester we will use sklearn for machine learning. We will use a few other libraries for a few features, but pandas, seaborn, and sklearn are the three main ones.