4. Pandas and Indexing

4.1. Iterable types

a = [char for char in 'abcde']
b = {char:i for i,char in enumerate('abcde')}
c = ('a','b','c','d','e')
d = 'a b c d e'.split()

['a', 'b', 'c', 'd', 'e']

type(a)

list

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

type(b)

dict

('a', 'b', 'c', 'd', 'e')

type(c)

tuple

['a', 'b', 'c', 'd', 'e']

type(d)

list
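All four objects are iterable, so each one can drive a for loop or a comprehension. A minimal sketch, rebuilding the same variables as above; the comparison variables (from_list, pairs, and so on) are names made up for this illustration:

```python
a = [char for char in 'abcde']
b = {char: i for i, char in enumerate('abcde')}
c = ('a', 'b', 'c', 'd', 'e')
d = 'a b c d e'.split()

# iterating over a list or tuple visits the items in order;
# iterating over a dict visits its keys
from_list = [item for item in a]
from_dict_keys = [key for key in b]
from_tuple = [item for item in c]

# .items() gives (key, value) pairs when both are needed
pairs = [(key, value) for key, value in b.items()]

print(from_list == from_dict_keys == from_tuple == d)
print(pairs[:2])
```

Iterating over the dictionary directly gives only the letters; the numbers come back only through b.items() or b.values().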

4.2. Reading data other ways

import pandas as pd

course_comms_url = 'https://rhodyprog4ds.github.io/BrownSpring23/syllabus/communication.html'

This reads the tables in directly from the HTML page.

pd.read_html(course_comms_url)

[   Day      Time            Location       Host
Mon  11am-1pm  Tyler 139 and zoom       Kyle
Wed  7-8:30pm                Zoom  Dr. Brown
Fri     3-6pm                Zoom       Kyle]

html_list = pd.read_html(course_comms_url)

type(html_list)

list

type(html_list[0])

pandas.core.frame.DataFrame

[type(h) for h in html_list]

[pandas.core.frame.DataFrame]

achievements_url = 'https://rhodyprog4ds.github.io/BrownSpring23/syllabus/achievements.html'

First, get the tables:

achievements_df_list = pd.read_html(achievements_url)

When the goal is to make a list, we can use a list comprehension:

[ach.shape for ach in achievements_df_list]

[(14, 3), (15, 5), (15, 15), (15, 6)]

achievements_df_list.shape

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[20], line 1
----> 1 achievements_df_list.shape

AttributeError: 'list' object has no attribute 'shape'
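The error is raised because shape belongs to each DataFrame, not to the list that holds them. A small sketch of what can be asked of the list instead; df_list here is a hypothetical stand-in built by hand so the example runs without downloading the page:

```python
import pandas as pd

# hypothetical stand-ins for the tables returned by pd.read_html
df_list = [
    pd.DataFrame({'x': [1, 2, 3]}),
    pd.DataFrame({'x': [1, 2], 'y': [3, 4]}),
]

n_tables = len(df_list)                # a list has a length, not a shape
shapes = [df.shape for df in df_list]  # each DataFrame has its own shape
first_shape = df_list[0].shape         # or index one item, then ask for shape

print(n_tables, shapes, first_shape)
```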

coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'

coffee_df = pd.read_csv(coffee_data_url,index_col=0)

coffee_df.head(1)

	Species	Owner	Country.of.Origin	Farm.Name	Lot.Number	Mill	ICO.Number	Company	Altitude	Region	...	Color	Category.Two.Defects	Expiration	Certification.Body	Certification.Address	Certification.Contact	unit_of_measurement	altitude_low_meters	altitude_high_meters	altitude_mean_meters
1	Robusta	ankole coffee producers coop	Uganda	kyangundu cooperative society	NaN	ankole coffee producers	0	ankole coffee producers coop	1488	sheema south western	...	Green	2	June 26th, 2015	Uganda Coffee Development Authority	e36d0270932c3b657e96b7b0278dfd85dc0fe743	03077a1c6bac60e6f514691634a7f6eb5c85aae8	m	1488.0	1488.0	1488.0

1 rows × 43 columns

coffee_df['Species'].head()

  Robusta
  Robusta
  Robusta
  Robusta
  Robusta
Name: Species, dtype: object

type(coffee_df['Species'])

pandas.core.series.Series

coffee_df.columns

Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
       'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
       'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
       'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
       'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
       'Certification.Body', 'Certification.Address', 'Certification.Contact',
       'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
       'altitude_mean_meters'],
      dtype='object')

coffee_df['Number.of.Bags']

   300
   320
   300
   320
     1
   200
   320
   320
   320
  320
  320
  320
  100
    1
  320
  300
  140
    1
   20
    6
  100
  250
  100
    1
    1
    1
    1
    1
Name: Number.of.Bags, dtype: int64

new_values = {0:'<100',1:'100-199',2:'200-299',3:'300+'}

[new_values[int(num/100)] for num in coffee_df['Number.of.Bags']]

['300+',
 '300+',
 '300+',
 '300+',
 '<100',
 '200-299',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '100-199',
 '<100',
 '300+',
 '300+',
 '100-199',
 '<100',
 '<100',
 '<100',
 '100-199',
 '200-299',
 '100-199',
 '<100',
 '<100',
 '<100',
 '<100',
 '<100']

bags_bin = lambda num: int(num/100)
[new_values[bags_bin(num)] for num in coffee_df['Number.of.Bags']]

['300+',
 '300+',
 '300+',
 '300+',
 '<100',
 '200-299',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '300+',
 '100-199',
 '<100',
 '300+',
 '300+',
 '100-199',
 '<100',
 '<100',
 '<100',
 '100-199',
 '200-299',
 '100-199',
 '<100',
 '<100',
 '<100',
 '<100',
 '<100']
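The same binning can be written as a named function and applied with the Series apply method. This is a sketch rather than what we did in class: the function name bags_label, the sample values, and the min(..., 3) clamp are additions for illustration. The clamp matters because int(num/100) is 4 or more for counts of 400 and up, which is not a key in new_values (the coffee data tops out at 320, so the versions above do not hit this).

```python
import pandas as pd

new_values = {0: '<100', 1: '100-199', 2: '200-299', 3: '300+'}


def bags_label(num):
    """Map a bag count to its bin label; counts of 400 or more stay in '300+'."""
    bin_index = min(int(num / 100), 3)  # clamp so 400+ does not raise a KeyError
    return new_values[bin_index]


# hypothetical sample of bag counts, not the full coffee data
bags = pd.Series([300, 320, 1, 200, 140, 450], name='Number.of.Bags')
labels = bags.apply(bags_label)

print(labels.tolist())
```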

type(pd.read_csv)

function

type(bags_bin)

function

4.3. Importing locally

If I make a file in the same folder as my notebook called example.py and then put

name = 'sarah'

in the file, we can use that file like:

from example import name

name

'sarah'

import example

example.name

'sarah'
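To try this without creating the file by hand, the sketch below writes a one-line module into a temporary folder and imports it. The module name example_sketch and the temporary folder are assumptions for the demo; in class the file simply sits next to the notebook, and that folder is normally already on Python's search path.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# create a throwaway folder holding a one-line module (hypothetical name)
folder = Path(tempfile.mkdtemp())
(folder / 'example_sketch.py').write_text("name = 'sarah'\n")

# Python looks for modules in the folders listed on sys.path
sys.path.insert(0, str(folder))
importlib.invalidate_caches()
example_sketch = importlib.import_module('example_sketch')

print(example_sketch.name)
```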

4.4. Questions After Class

4.4.1. Why does casting (num/100) to int give the right number? Is it because of floor division?

First, let's look at an intermediate value. Let's pick a value for num:

num = 307

Then do the calculation without casting to int

num/100

3.07

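Remember that the int type holds whole numbers with no fractional part, so casting a float to int drops the decimal part. That is truncation rather than floor division, although for non-negative numbers such as these bag counts the two give the same result.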
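A short sketch comparing the cast with Python's floor division operator (//). The negative value is an extra case added for illustration; it is the only place where the two differ:

```python
num = 307

as_float = num / 100        # true division keeps the fraction
truncated = int(num / 100)  # int() drops the fractional part
floored = num // 100        # floor division rounds down

# for negative values the two are no longer the same
neg_truncated = int(-307 / 100)  # truncates toward zero
neg_floored = -307 // 100        # rounds toward negative infinity

print(as_float, truncated, floored)
print(neg_truncated, neg_floored)
```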

4.4.2. How would adding 2 DataFrames together of separate types affect the type command?

It depends on what “add” means. Element-wise addition with + might raise an error if the types are incompatible, but if it works, the result is still a DataFrame. Stacking with pd.concat also gives a DataFrame.

If you put them into a list, then the result is a list.
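A minimal sketch of the three cases, using two small made-up DataFrames (df_a and df_b are hypothetical, and both hold integers so the addition succeeds):

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [10, 20]})

added = df_a + df_b                # element-wise addition
stacked = pd.concat([df_a, df_b])  # stack the rows of one under the other
in_list = [df_a, df_b]             # a plain Python list holding both

print(type(added).__name__, type(stacked).__name__, type(in_list).__name__)
```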

4.4.3. What keys to use in the dictionaries?

In the assignment the instructions say

4.4.4. How to save as a local csv file?

Use the pandas.DataFrame.to_csv method.
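As a sketch, the example below saves a tiny made-up DataFrame and reads it back. The file name coffee_sample.csv and the temporary folder are assumptions for the demo; in a notebook a plain file name would save next to the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# hypothetical two-row sample
df = pd.DataFrame({'species': ['Robusta', 'Robusta'], 'bags': [300, 320]})

out_path = Path(tempfile.mkdtemp()) / 'coffee_sample.csv'
df.to_csv(out_path, index=False)  # index=False leaves the row labels out of the file

round_trip = pd.read_csv(out_path)
print(round_trip)
```

Without index=False the row labels are written as an extra first column, which is why the coffee data above was read with index_col=0.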

4.4.5. How to create a DataFrame?

Use the constructor, pd.DataFrame.
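Two common ways to call the constructor, sketched with made-up values: from a dictionary of columns, and from a list of rows with explicit column names.

```python
import pandas as pd

# dictionary keys become column names; each list becomes a column
from_dict = pd.DataFrame({'letter': ['a', 'b', 'c'], 'position': [0, 1, 2]})

# a list of rows, with the column names given separately
from_rows = pd.DataFrame(
    [['a', 0], ['b', 1], ['c', 2]],
    columns=['letter', 'position'],
)

print(from_dict)
print(from_rows)
```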

4.4.6. How to read using a relative path?

A relative path can be passed to the pandas read functions just like a URL. Read about them here
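A relative path is interpreted starting from the current working directory, which for a notebook is usually the notebook's folder. The sketch below fakes that setup in a temporary folder; the data/bags.csv layout is a hypothetical example, not a file from the course.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

start_dir = Path.cwd()
work_dir = Path(tempfile.mkdtemp())

# build a hypothetical project layout: <work_dir>/data/bags.csv
(work_dir / 'data').mkdir()
pd.DataFrame({'bags': [300, 320]}).to_csv(work_dir / 'data' / 'bags.csv', index=False)

os.chdir(work_dir)  # pretend this is the notebook's folder
try:
    relative_path = Path('data') / 'bags.csv'  # no leading slash or drive
    bags_df = pd.read_csv(relative_path)
finally:
    os.chdir(start_dir)  # restore the original working directory

print(relative_path.is_absolute(), bags_df.shape)
```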

4.4.7. I would like to know about other common forms of data files.

The pandas documentation's I/O page is where I recommend starting.
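As one example beyond CSV and HTML, pandas can also read JSON with pd.read_json. A minimal sketch using a made-up JSON string wrapped in StringIO so that nothing needs to be downloaded:

```python
from io import StringIO

import pandas as pd

# hypothetical JSON text: a list of records, one per row
json_text = '[{"letter": "a", "position": 0}, {"letter": "b", "position": 1}]'

json_df = pd.read_json(StringIO(json_text))
print(json_df)
```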

4.5. What other libraries do we end up using?

Next week we will use seaborn for plotting. Later in the semester we will use sklearn for machine learning. We will use a few other libraries for a few features, but these three (pandas, seaborn, and sklearn) are the main ones.