# Data Frames and other iterables

Today, we're going to explore {term}`DataFrame`s in greater detail. We'll continue using
that same coffee dataset.

In [1]:
import pandas as pd

In [2]:
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'

In [3]:
coffee_df = pd.read_csv(coffee_data_url)

```{important}
One reason to use Jupyter is that it formats output to be more readable. Compare the view of the DataFrame with Jupyter and without.

Jupyter uses the object's HTML representation if it exists (for a DataFrame, the same kind of table that `to_html` produces), whereas the `print` function casts the object to a string.
```
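To see the two views side by side outside of a notebook, here is a minimal sketch. It assumes a tiny made-up DataFrame (`toy_df`, not the coffee data) and relies on the `_repr_html_` hook that IPython's rich display looks for on objects; treat it as an illustration rather than a description of everything Jupyter does.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration
toy_df = pd.DataFrame({'species': ['Robusta', 'Arabica'], 'bags': [10, 25]})

# What print() works with: the plain-text string form
text_view = str(toy_df)

# What a notebook renders for the last expression in a cell: an HTML table
html_view = toy_df._repr_html_()

print(text_view)
print('<table' in html_view)
```

The string form is aligned plain text, while the HTML form is markup that the notebook renders as a table.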

In [4]:
coffee_df

Unnamed: 0.1,Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
0,1,Robusta,ankole coffee producers coop,Uganda,kyangundu cooperative society,,ankole coffee producers,0,ankole coffee producers coop,1488,...,Green,2,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1488.0,1488.0,1488.0
1,2,Robusta,nishant gurjer,India,sethuraman estate kaapi royale,25,sethuraman estate,14/1148/2017/21,kaapi royale,3170,...,,2,"October 31st, 2018",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3170.0,3170.0,3170.0
2,3,Robusta,andrew hetzel,India,sethuraman estate,,,0000,sethuraman estate,1000m,...,Green,0,"April 29th, 2016",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,1000.0,1000.0,1000.0
3,4,Robusta,ugacof,Uganda,ugacof project area,,ugacof,0,ugacof ltd,1212,...,Green,7,"July 14th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1212.0,1212.0,1212.0
4,5,Robusta,katuka development trust ltd,Uganda,katikamu capca farmers association,,katuka development trust,0,katuka development trust ltd,1200-1300,...,Green,3,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1200.0,1300.0,1250.0
5,6,Robusta,andrew hetzel,India,,,(self),,"cafemakers, llc",3000',...,Green,0,"February 28th, 2013",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3000.0,3000.0,3000.0
6,7,Robusta,andrew hetzel,India,sethuraman estates,,,,cafemakers,750m,...,Green,0,"May 15th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,750.0,750.0,750.0
7,8,Robusta,nishant gurjer,India,sethuraman estate kaapi royale,7,sethuraman estate,14/1148/2017/18,kaapi royale,3140,...,Bluish-Green,0,"October 25th, 2018",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3140.0,3140.0,3140.0
8,9,Robusta,nishant gurjer,India,sethuraman estate,RKR,sethuraman estate,14/1148/2016/17,kaapi royale,1000,...,Green,0,"August 17th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,1000.0,1000.0,1000.0
9,10,Robusta,ugacof,Uganda,ishaka,,nsubuga umar,0,ugacof ltd,900-1300,...,Green,6,"August 5th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,900.0,1300.0,1100.0


In [5]:
print(coffee_df)

    Unnamed: 0  Species                              Owner Country.of.Origin  \
0            1  Robusta       ankole coffee producers coop            Uganda   
1            2  Robusta                     nishant gurjer             India   
2            3  Robusta                      andrew hetzel             India   
3            4  Robusta                             ugacof            Uganda   
4            5  Robusta       katuka development trust ltd            Uganda   
5            6  Robusta                      andrew hetzel             India   
6            7  Robusta                      andrew hetzel             India   
7            8  Robusta                     nishant gurjer             India   
8            9  Robusta                     nishant gurjer             India   
9           10  Robusta                             ugacof            Uganda   
10          11  Robusta                             ugacof            Uganda   
11          12  Robusta                 

## Examining the Structure of a Data Frame

I told you this was a DataFrame, but we can check with `type`.

In [6]:
type(coffee_df)

pandas.core.frame.DataFrame

The output also shows that the DataFrame type comes from the `pandas` library; without the library loaded, this type does not exist.


We can also examine its parts. It consists of several; first, the column headings:

In [7]:
coffee_df.columns

Index(['Unnamed: 0', 'Species', 'Owner', 'Country.of.Origin', 'Farm.Name',
       'Lot.Number', 'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region',
       'Producer', 'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner',
       'Harvest.Year', 'Grading.Date', 'Owner.1', 'Variety',
       'Processing.Method', 'Fragrance...Aroma', 'Flavor', 'Aftertaste',
       'Salt...Acid', 'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup',
       'Clean.Cup', 'Balance', 'Cupper.Points', 'Total.Cup.Points', 'Moisture',
       'Category.One.Defects', 'Quakers', 'Color', 'Category.Two.Defects',
       'Expiration', 'Certification.Body', 'Certification.Address',
       'Certification.Contact', 'unit_of_measurement', 'altitude_low_meters',
       'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')

These are stored in a special type called `Index`.

In [8]:
type(coffee_df.columns)

pandas.core.indexes.base.Index

An `Index` is still iterable, much like a Python list.
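As a small sketch of what "iterable" means here, we can loop over the column headings directly or turn them into a plain list. The DataFrame below (`toy_df`) is made up for illustration and only borrows a few column names from the coffee data.

```python
import pandas as pd

# Made-up one-row DataFrame; only the column names matter here
toy_df = pd.DataFrame({'Species': ['Robusta'], 'Owner': ['ugacof'], 'Flavor': [7.5]})

# Loop over the Index exactly as we would over a list
for column_name in toy_df.columns:
    print(column_name)

# Convert the Index to a base Python list
column_list = list(toy_df.columns)
print(type(column_list))
```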


The DataFrame also stores the data itself, which we can see with the `values` attribute.

In [9]:
coffee_df.values

array([[1, 'Robusta', 'ankole coffee producers coop', ..., 1488.0,
        1488.0, 1488.0],
       [2, 'Robusta', 'nishant gurjer', ..., 3170.0, 3170.0, 3170.0],
       [3, 'Robusta', 'andrew hetzel', ..., 1000.0, 1000.0, 1000.0],
       ...,
       [26, 'Robusta', 'james moore', ..., 795.0, 795.0, 795.0],
       [27, 'Robusta', 'cafe politico', ..., nan, nan, nan],
       [28, 'Robusta', 'cafe politico', ..., nan, nan, nan]], dtype=object)

The DataFrame also has an index (visually, the first column). It is special because it is how we look up rows in the data.

In [10]:
coffee_df.index

RangeIndex(start=0, stop=28, step=1)

Right now this is an autogenerated index, but we can also use the `index_col` parameter to set that up front.

In [11]:
coffee_df = pd.read_csv(coffee_data_url, index_col=0)
coffee_df

Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,Region,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
1,Robusta,ankole coffee producers coop,Uganda,kyangundu cooperative society,,ankole coffee producers,0,ankole coffee producers coop,1488,sheema south western,...,Green,2,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1488.0,1488.0,1488.0
2,Robusta,nishant gurjer,India,sethuraman estate kaapi royale,25,sethuraman estate,14/1148/2017/21,kaapi royale,3170,chikmagalur karnataka indua,...,,2,"October 31st, 2018",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3170.0,3170.0,3170.0
3,Robusta,andrew hetzel,India,sethuraman estate,,,0000,sethuraman estate,1000m,chikmagalur,...,Green,0,"April 29th, 2016",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,1000.0,1000.0,1000.0
4,Robusta,ugacof,Uganda,ugacof project area,,ugacof,0,ugacof ltd,1212,central,...,Green,7,"July 14th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1212.0,1212.0,1212.0
5,Robusta,katuka development trust ltd,Uganda,katikamu capca farmers association,,katuka development trust,0,katuka development trust ltd,1200-1300,luwero central region,...,Green,3,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1200.0,1300.0,1250.0
6,Robusta,andrew hetzel,India,,,(self),,"cafemakers, llc",3000',chikmagalur,...,Green,0,"February 28th, 2013",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3000.0,3000.0,3000.0
7,Robusta,andrew hetzel,India,sethuraman estates,,,,cafemakers,750m,chikmagalur,...,Green,0,"May 15th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,750.0,750.0,750.0
8,Robusta,nishant gurjer,India,sethuraman estate kaapi royale,7,sethuraman estate,14/1148/2017/18,kaapi royale,3140,chikmagalur karnataka india,...,Bluish-Green,0,"October 25th, 2018",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3140.0,3140.0,3140.0
9,Robusta,nishant gurjer,India,sethuraman estate,RKR,sethuraman estate,14/1148/2016/17,kaapi royale,1000,chikmagalur karnataka,...,Green,0,"August 17th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,1000.0,1000.0,1000.0
10,Robusta,ugacof,Uganda,ishaka,,nsubuga umar,0,ugacof ltd,900-1300,western,...,Green,6,"August 5th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,900.0,1300.0,1100.0


In [12]:
coffee_df.index

Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
           dtype='int64')

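Now it's neater: the row numbers from the file serve as the index instead of appearing as an extra `Unnamed: 0` column.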
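To see the effect of `index_col` without downloading anything, here is a minimal sketch that reads a short CSV from a string. The CSV text is hypothetical; it mimics the coffee file only in that its first column has no header and holds row numbers.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file; the first column has no name
csv_text = """,Species,Owner
1,Robusta,ugacof
2,Robusta,andrew hetzel
"""

# Default: pandas generates a 0-based index and keeps the unnamed column as data
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0: the first column becomes the index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
print(indexed_df.index.tolist())
```

With the default settings the unnamed column shows up as `Unnamed: 0`; with `index_col=0` it disappears from the columns and its values label the rows.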


## Extracting Parts of Data Frames

We can look at the first 5 rows with `head`.

In [13]:
coffee_df.head()

Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,Region,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
1,Robusta,ankole coffee producers coop,Uganda,kyangundu cooperative society,,ankole coffee producers,0,ankole coffee producers coop,1488,sheema south western,...,Green,2,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1488.0,1488.0,1488.0
2,Robusta,nishant gurjer,India,sethuraman estate kaapi royale,25.0,sethuraman estate,14/1148/2017/21,kaapi royale,3170,chikmagalur karnataka indua,...,,2,"October 31st, 2018",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,3170.0,3170.0,3170.0
3,Robusta,andrew hetzel,India,sethuraman estate,,,0000,sethuraman estate,1000m,chikmagalur,...,Green,0,"April 29th, 2016",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,1000.0,1000.0,1000.0
4,Robusta,ugacof,Uganda,ugacof project area,,ugacof,0,ugacof ltd,1212,central,...,Green,7,"July 14th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1212.0,1212.0,1212.0
5,Robusta,katuka development trust ltd,Uganda,katikamu capca farmers association,,katuka development trust,0,katuka development trust ltd,1200-1300,luwero central region,...,Green,3,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1200.0,1300.0,1250.0


````{margin}
```{admonition} Try it yourself
How can you look at the first 3 or last 2 rows?
```
````

We can look at the last 5 rows with `tail`.

In [14]:
coffee_df.tail()

Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,Region,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
24,Robusta,luis robles,Ecuador,robustasa,Lavado 1,our own lab,,robustasa,,"san juan, playas",...,Blue-Green,1,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,
25,Robusta,luis robles,Ecuador,robustasa,Lavado 3,own laboratory,,robustasa,40,"san juan, playas",...,Blue-Green,0,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,40.0,40.0,40.0
26,Robusta,james moore,United States,fazenda cazengo,,cafe cazengo,,global opportunity fund,795 meters,"kwanza norte province, angola",...,,6,"December 23rd, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,795.0,795.0,795.0
27,Robusta,cafe politico,India,,,,14-1118-2014-0087,cafe politico,,,...,Green,1,"August 25th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,
28,Robusta,cafe politico,Vietnam,,,,,cafe politico,,,...,,9,"August 25th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,


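The shape of a DataFrame is an attribute, so we access it without parentheses. It is a tuple of the number of rows and the number of columns.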

In [15]:
coffee_df.shape

(28, 43)

In [16]:
len(coffee_df)

28
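The two results above are related: `len` gives the number of rows, which is the first entry of `shape`. A minimal sketch with a made-up DataFrame (`toy_df` is only for illustration):

```python
import pandas as pd

# Made-up DataFrame with 3 rows and 2 columns
toy_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = toy_df.shape

print(n_rows, n_cols)
print(len(toy_df) == n_rows)
```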

We can pick out columns by name.

In [17]:
coffee_df['Species']

1     Robusta
2     Robusta
3     Robusta
4     Robusta
5     Robusta
6     Robusta
7     Robusta
8     Robusta
9     Robusta
10    Robusta
11    Robusta
12    Robusta
13    Robusta
14    Robusta
15    Robusta
16    Robusta
17    Robusta
18    Robusta
19    Robusta
20    Robusta
21    Robusta
22    Robusta
23    Robusta
24    Robusta
25    Robusta
26    Robusta
27    Robusta
28    Robusta
Name: Species, dtype: object

```{important}
We did not do this step in class
```

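We can pick out rows with `loc`, which selects by index label rather than by position. Because we set `index_col=0`, the labels now run from 1 to 28, so asking for label 0 raises a `KeyError`.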

In [18]:
coffee_df.loc[0]

KeyError: 0
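The error comes from the difference between labels and positions. The sketch below uses a made-up DataFrame whose index starts at 1, like the coffee data after `index_col=0`. It also uses `iloc`, the pandas accessor for position-based selection, which these notes have not introduced yet.

```python
import pandas as pd

# Made-up DataFrame whose index labels start at 1
toy_df = pd.DataFrame(
    {'Species': ['Robusta', 'Robusta', 'Robusta'],
     'Owner': ['ugacof', 'andrew hetzel', 'nishant gurjer']},
    index=[1, 2, 3],
)

# loc looks up by label: the row labeled 1
first_by_label = toy_df.loc[1]

# iloc looks up by position: the row at position 0
first_by_position = toy_df.iloc[0]

# There is no row labeled 0, so loc[0] raises a KeyError
try:
    toy_df.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_label['Owner'], first_by_position['Owner'], label_zero_found)
```

Here `loc[1]` and `iloc[0]` return the same row, while `loc[0]` fails because no row carries that label.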

## Reading data from websites

We'll first read from the course website.
```{note}
This is our first bit of web scraping!
We will do more, but for very structured data it can be this easy.
```

In [19]:
comm_url = 'https://rhodyprog4ds.github.io/BrownFall22/syllabus/communication.html#'

So far, we've read data in from a .csv file with `pd.read_csv` and created a DataFrame with the constructor `pd.DataFrame` using a dictionary. Pandas provides many interfaces for reading in data.  They're described on the [Pandas IO page](https://pandas.pydata.org/docs/reference/io.html).

````{margin}
```{note}
Using the documentation for a library (and the base language) is a
totally expected and normal part of programming.  That's what you
should use as your primary source for questions in this class.  Other
sources can become outdated pretty quickly as the language changes, but
most of the libraries we'll use have processes in place to ensure that
their own documentation gets updated at the same time the code does.
```

```{warning}
If you use other sources and they point you to deprecated solutions, you may not earn achievements for that work.
```
````

We can use the `read_html` function to read from this page. We know that the page has multiple tables, and from the help, we know that `read_html` returns a list of DataFrames.

In [20]:
df_list = pd.read_html(comm_url)

We can also verify what it returns.

In [21]:
type(df_list)

list

We can index with `[]` to pick one item from the list and verify that it is a DataFrame.

In [22]:
type(df_list[0])

pandas.core.frame.DataFrame

## Pythonic Loops

In Python, `for` loops do not require a counter variable. Instead, a loop has an iterable object and a loop variable.

```Python
for loop_variable in iterable_object:
    # loop body
```

The `loop_variable` takes on the value of each item in the `iterable_object`,
in order, one item per pass through the loop. Writing loops this way makes them more
compact and more readable, because they read more like English. For example:

In [23]:
name = 'sarah'
for letter in name:
    print(letter.upper())

S
A
R
A
H
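For comparison, here is a sketch of the same loop written in the counter style common in other languages, next to the Pythonic version. The list names are made up for illustration; both loops collect the same letters.

```python
name = 'sarah'

# Counter style: track a position, then look up the item at that position
counter_letters = []
for i in range(len(name)):
    counter_letters.append(name[i].upper())

# Pythonic style: the loop variable is the item itself
direct_letters = []
for letter in name:
    direct_letters.append(letter.upper())

print(counter_letters == direct_letters)
```

The Pythonic version needs no `range`, no `len`, and no indexing, which leaves fewer places for off-by-one mistakes.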


It is best to name variables so that the loop variable makes sense as an item from the iterable. For example, names have letters in them, and an item in `df_list` makes sense as `df`.

In [24]:
for df in df_list:
    print(df.shape)

(6, 4)
(6, 4)
(1, 3)
(3, 3)
(2, 3)


## Types Solution


```{warning}
I am using bad variable names here (`a`, `b`, ...) because these are only options for a question and we will not use them again.
```

````{margin}
This is called a list comprehension.  It allows you to build a list using a for loop all in one step.
````

In [25]:
a = [char for char in 'abcde']
a

['a', 'b', 'c', 'd', 'e']

In [26]:
type(a)

list

````{margin}
This is called a dictionary comprehension.  It allows you to build a dictionary using a for loop all in one step.
````

In [27]:
b = {char:i for i, char in enumerate('abcde')}
b

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

In [28]:
type(b)

dict
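To connect the two comprehensions above to the loops from the previous section, here is a sketch of the same list and dictionary built with ordinary `for` loops. The names `a_loop` and `b_loop` are made up so they do not overwrite `a` and `b`.

```python
# Equivalent of: a = [char for char in 'abcde']
a_loop = []
for char in 'abcde':
    a_loop.append(char)

# Equivalent of: b = {char: i for i, char in enumerate('abcde')}
b_loop = {}
for i, char in enumerate('abcde'):
    b_loop[char] = i

print(a_loop)
print(b_loop)
```

The comprehension form folds the empty container, the loop, and the append or assignment into one expression.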

In [29]:
c = ('a','b','c','d','e')
c

('a', 'b', 'c', 'd', 'e')

In [30]:
type(c)

tuple

In [31]:
d = 'a b c d e'.split(' ')
d

['a', 'b', 'c', 'd', 'e']

In [32]:
type(d)

list

## Questions After Class

### What is a dictionary in Python?

A dictionary is a data type from base Python that stores key-value pairs.

For example:

In [33]:
prof_info = {'first':'Sarah', 'last':'Brown', 'title':'Dr.'}
prof_info

{'first': 'Sarah', 'last': 'Brown', 'title': 'Dr.'}

We can use the keys to index in and get the values out.

In [34]:
prof_info['title']

'Dr.'

Even though we will mostly use DataFrames, dictionaries and other base Python types are important. Dictionaries are very powerful: they can even hold whole functions as values. For example, before version 3.10 added the `match` statement, the Python language did not have a switch case (which can be used for handling many if/else cases), and dictionaries are commonly used for that instead.
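Here is a minimal sketch of using a dictionary in place of a switch case. The two greeting functions and the `greeting_styles` dictionary are hypothetical names made up for this example; the idea is that the key selects which function to call.

```python
def greet_formal(info):
    # Build a formal greeting from the title and last name
    return 'Hello, ' + info['title'] + ' ' + info['last']


def greet_casual(info):
    # Build a casual greeting from the first name
    return 'Hi, ' + info['first']


# The values are the functions themselves (no parentheses), not their results
greeting_styles = {'formal': greet_formal, 'casual': greet_casual}

prof_info = {'first': 'Sarah', 'last': 'Brown', 'title': 'Dr.'}

# Look up the function by key, then call it
style = 'formal'
message = greeting_styles[style](prof_info)
print(message)
```

Adding a new case means adding one key to the dictionary rather than another `elif` branch.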

```{admonition} Further Reading
You can read more about the [details of data types](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes) in Pandas in the documentation
```

### How to see unique values in a column

We will get to this soon! We have the first part, picking out a single column to look at. We will see the method for finding its unique values probably on Monday, but maybe on Friday.