Reading Docstrings, Object Inspection, and Loading Data
2. Reading Docstrings, Object Inspection, and Loading Data#
2.1. Programming is a Practice#
Python has a print function, and we can use the help in Jupyter to learn how to use it in different ways.
Given this code excerpt, how could you print out “Sarah_Brown”?
first = 'Sarah'
last = 'Brown'
We can print the docstring out as a whole, instead of using shift + tab to view it.
help(print)
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
The first line says that it can take multiple values, because it says value, ..., sep.
It also has a keyword argument (one that must be passed like argument=value and that has a default) described as sep=' '.
This means that by default it puts a space between the values.
print(first,last)
Sarah Brown
print(first,last,sep='_')
Sarah_Brown
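To see several of these keyword arguments working together, here is a small sketch that reuses first and last from above. Sending the output to an io.StringIO buffer through file= is a choice made only for illustration (it is not something we did in class); it lets us look at exactly which characters print wrote, including the end string.

```python
import io

first = 'Sarah'
last = 'Brown'

# an in-memory text stream standing in for sys.stdout (illustrative choice)
buffer = io.StringIO()

# sep goes between the values, end goes after the last value
print(first, last, sep='_', end='!\n', file=buffer)

# everything print wrote, shown with repr so the newline is visible
printed = buffer.getvalue()
print(repr(printed))
```

Leaving out file= sends the same characters to the notebook output instead.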
type(first)
str
def compute_grade(num_level1, num_level2, num_level3):
    '''
    Computes a grade for CSC/DSP310 from numbers of achievements at each level

    Parameters:
    ------------
    num_level1 : int
        number of level 1 achievements earned
    num_level2 : int
        number of level 2 achievements earned
    num_level3 : int
        number of level 3 achievements earned

    Returns:
    --------
    letter_grade : string
        letter grade with modifier (+/-)
    '''
    if num_level1 == 15:
        if num_level2 == 15:
            if num_level3 == 15:
                grade = 'A'
            elif num_level3 >= 10:
                grade = 'A-'
            elif num_level3 >= 5:
                grade = 'B+'
            else:
                grade = 'B'
        elif num_level2 >= 10:
            grade = 'B-'
        elif num_level2 >= 5:
            grade = 'C+'
        else:
            grade = 'C'
    elif num_level1 >= 10:
        grade = 'C-'
    elif num_level1 >= 5:
        grade = 'D+'
    elif num_level1 >= 3:
        grade = 'D'
    else:
        grade = 'F'
    return grade
type(compute_grade)
function
help(compute_grade())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 help(compute_grade())
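TypeError: compute_grade() missing 3 required positional arguments: 'num_level1', 'num_level2', and 'num_level3'
The parentheses in compute_grade() call the function with no arguments before help ever runs, which is what raises this error. To see the docstring, pass the function itself: help(compute_grade).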
2.1.1. Why inspection in code?#
Some IDEs give you GUI based tools to inspect objects. We are going to do it programmatically inline with our analyses for two reasons.
(minor, logistical) it helps make for good notes
(most importantly) it helps build habits of data science
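As a rough sketch of what inspecting an object programmatically can look like, the standard library's inspect module exposes the same information a GUI panel would show. The greet function here is a made-up example, not something from class.

```python
import inspect


def greet(name, punctuation='!'):
    """Return a short greeting for name."""
    return 'Hello, ' + name + punctuation


# what kind of object is this?
kind = type(greet).__name__
# which parameters does it take, and with what defaults?
signature = inspect.signature(greet)
# what does its docstring say?
doc = inspect.getdoc(greet)

print(kind)
print(signature)
print(doc)
```

Because these results are ordinary Python values, they stay in the notebook next to the analysis that used them.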
2.1.2. Investigating how doc strings work#
We can see how the docstring impacts help and how exactly it has to be formatted to become a docstring
def ex_1(a):
    print(a)
help(ex_1())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[10], line 1
----> 1 help(ex_1())
TypeError: ex_1() missing 1 required positional argument: 'a'
def ex_2(a):
    # is this a docstring?
    print(a)
help(ex_2())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 help(ex_2())
TypeError: ex_2() missing 1 required positional argument: 'a'
def ex_3(a):
    ''' this is a docstring'''
    # is this a docstring?
    print(a)
help(ex_3())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 help(ex_3())
TypeError: ex_3() missing 1 required positional argument: 'a'
def ex_4(a):
    """this is a docstring"""
    # is this a docstring?
    print(a)
help(ex_4())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[16], line 1
----> 1 help(ex_4())
TypeError: ex_4() missing 1 required positional argument: 'a'
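Every call above fails before help runs, because writing ex_1() with parentheses calls the function without its required argument. A minimal sketch of the comparison we were after passes the function objects themselves and reads each one's __doc__ attribute, which is where Python stores a docstring. The loop and the docs dictionary are additions for illustration.

```python
def ex_1(a):
    print(a)


def ex_2(a):
    # is this a docstring?
    print(a)


def ex_3(a):
    ''' this is a docstring'''
    print(a)


def ex_4(a):
    """this is a docstring"""
    print(a)


# collect each function's docstring; no parentheses after the names,
# so the functions are inspected rather than called
docs = {}
for func in (ex_1, ex_2, ex_3, ex_4):
    docs[func.__name__] = func.__doc__
    print(func.__name__, repr(func.__doc__))

# help takes the function object too
help(ex_4)
```

A comment is not a docstring, so ex_1 and ex_2 both have None for __doc__. A string literal as the first statement in the body is one, with either single or double triple quotes, and the leading space in ex_3 is kept exactly as typed.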
2.2. Coffee Data#
Structured data is easier to work with than other data.
We’re going to focus on tabular data for now. At the end of the course, we’ll examine images, which are structured but more complex, and text, which is much less structured.
We’re going to use a dataset about coffee quality today.
How was this dataset collected?
reviews added to DB
then scraped
Where did it come from?
Coffee Quality Institute’s trained reviewers.
What format is it provided in?
CSV (Comma Separated Values)
What other information is in this repository?
the code to scrape and clean the data
the data before cleaning
To get the raw URL for the dataset, click the Raw button on the CSV page, then copy the URL.
We’ll save that url as a variable to work with it.
coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
2.3. Loading in Data#
We will use a library called pandas.
import pandas as pd
the import keyword is used for loading packages
pandas is the name of the package that is installed
the as keyword allows us to assign an alias (nickname)
pd is the typical alias for pandas
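As a quick check on what the alias does, this sketch (which assumes pandas is installed) shows that pd is only a second name for the same module object. The lookup in sys.modules is an extra step added for illustration.

```python
import sys

import pandas as pd

# the module still reports its real name, not the alias
module_name = pd.__name__
print(module_name)

# the alias points at the one pandas module that Python loaded
same_object = pd is sys.modules['pandas']
print(same_object)
```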
We can read data in using the read_csv function.
pd.read_csv(coffee_data_url)
| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
1 | 2 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 1 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
2 | 3 | Arabica | grounds for health admin | Guatemala | san marcos barrancas "san cristobal cuch | NaN | NaN | NaN | NaN | 1600 - 1800 m | ... | NaN | 0 | May 31st, 2011 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 1600.00 | 1800.00 | 1700.00 |
3 | 4 | Arabica | yidnekachew dabessa | Ethiopia | yidnekachew dabessa coffee plantation | NaN | wolensu | NaN | yidnekachew debessa coffee plantation | 1800-2200 | ... | Green | 2 | March 25th, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1800.00 | 2200.00 | 2000.00 |
4 | 5 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 2 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1306 | 1307 | Arabica | juan carlos garcia lopez | Mexico | el centenario | NaN | la esperanza, municipio juchique de ferrer, ve... | 1104328663 | terra mia | 900 | ... | None | 20 | September 17th, 2013 | AMECAFE | 59e396ad6e22a1c22b248f958e1da2bd8af85272 | 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 | m | 900.00 | 900.00 | 900.00 |
1307 | 1308 | Arabica | myriam kaplan-pasternak | Haiti | 200 farms | NaN | coeb koperativ ekselsyo basen (350 members) | NaN | haiti coffee | ~350m | ... | Blue-Green | 16 | May 24th, 2013 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 350.00 | 350.00 | 350.00 |
1308 | 1309 | Arabica | exportadora atlantic, s.a. | Nicaragua | finca las marías | 017-053-0211/ 017-053-0212 | beneficio atlantic condega | 017-053-0211/ 017-053-0212 | exportadora atlantic s.a | 1100 | ... | Green | 5 | June 6th, 2018 | Instituto Hondureño del Café | b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 | 7f521ca403540f81ec99daec7da19c2788393880 | m | 1100.00 | 1100.00 | 1100.00 |
1309 | 1310 | Arabica | juan luis alvarado romero | Guatemala | finca el limon | NaN | beneficio serben | 11/853/165 | unicafe | 4650 | ... | Green | 4 | May 24th, 2013 | Asociacion Nacional Del Café | b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53 | 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 | ft | 1417.32 | 1417.32 | 1417.32 |
1310 | 1312 | Arabica | bismarck castro | Honduras | los hicaques | 103 | cigrah s.a de c.v. | 13-111-053 | cigrah s.a de c.v | 1400 | ... | Green | 2 | April 28th, 2018 | Instituto Hondureño del Café | b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 | 7f521ca403540f81ec99daec7da19c2788393880 | m | 1400.00 | 1400.00 | 1400.00 |
1311 rows × 44 columns
This reads in the data and displays it because it is the last line in the cell. If we do something else afterward, it will still read the data in, but not display it.
pd.read_csv(coffee_data_url)
print(first)
Sarah
In order to use it, we save the output to a variable.
coffee_data = pd.read_csv(coffee_data_url)
Then we can check the type.
type(coffee_data)
pandas.core.frame.DataFrame
This is a new type that is provided by the pandas library. Notice this uses the full library name, not the alias, because this comes from the code for the library itself, not our current code where pandas has a nickname.
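A small sketch of the same check that runs without a network connection: the two-row CSV text is a hand-typed stand-in for the coffee file, and io.StringIO lets read_csv treat that string like a file. The variable names are made up for this example.

```python
import io

import pandas as pd

# hand-typed stand-in for the real CSV (two columns, two rows)
csv_text = "Species,Country.of.Origin\nArabica,Ethiopia\nArabica,Guatemala\n"
small_df = pd.read_csv(io.StringIO(csv_text))

df_type = type(small_df)

# the class name comes from the library, however we imported it
print(df_type.__name__)

# reaching the class through our alias gives the very same class
print(df_type is pd.DataFrame)
```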
coffee_df = pd.read_csv(coffee_data_url)
coffee_df
| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
1 | 2 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 1 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
2 | 3 | Arabica | grounds for health admin | Guatemala | san marcos barrancas "san cristobal cuch | NaN | NaN | NaN | NaN | 1600 - 1800 m | ... | NaN | 0 | May 31st, 2011 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 1600.00 | 1800.00 | 1700.00 |
3 | 4 | Arabica | yidnekachew dabessa | Ethiopia | yidnekachew dabessa coffee plantation | NaN | wolensu | NaN | yidnekachew debessa coffee plantation | 1800-2200 | ... | Green | 2 | March 25th, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1800.00 | 2200.00 | 2000.00 |
4 | 5 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 2 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1306 | 1307 | Arabica | juan carlos garcia lopez | Mexico | el centenario | NaN | la esperanza, municipio juchique de ferrer, ve... | 1104328663 | terra mia | 900 | ... | None | 20 | September 17th, 2013 | AMECAFE | 59e396ad6e22a1c22b248f958e1da2bd8af85272 | 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 | m | 900.00 | 900.00 | 900.00 |
1307 | 1308 | Arabica | myriam kaplan-pasternak | Haiti | 200 farms | NaN | coeb koperativ ekselsyo basen (350 members) | NaN | haiti coffee | ~350m | ... | Blue-Green | 16 | May 24th, 2013 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 350.00 | 350.00 | 350.00 |
1308 | 1309 | Arabica | exportadora atlantic, s.a. | Nicaragua | finca las marías | 017-053-0211/ 017-053-0212 | beneficio atlantic condega | 017-053-0211/ 017-053-0212 | exportadora atlantic s.a | 1100 | ... | Green | 5 | June 6th, 2018 | Instituto Hondureño del Café | b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 | 7f521ca403540f81ec99daec7da19c2788393880 | m | 1100.00 | 1100.00 | 1100.00 |
1309 | 1310 | Arabica | juan luis alvarado romero | Guatemala | finca el limon | NaN | beneficio serben | 11/853/165 | unicafe | 4650 | ... | Green | 4 | May 24th, 2013 | Asociacion Nacional Del Café | b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53 | 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 | ft | 1417.32 | 1417.32 | 1417.32 |
1310 | 1312 | Arabica | bismarck castro | Honduras | los hicaques | 103 | cigrah s.a de c.v. | 13-111-053 | cigrah s.a de c.v | 1400 | ... | Green | 2 | April 28th, 2018 | Instituto Hondureño del Café | b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 | 7f521ca403540f81ec99daec7da19c2788393880 | m | 1400.00 | 1400.00 | 1400.00 |
1311 rows × 44 columns
If you’re curious about something, try it out and see what happens. We’re going to use a lot of code inspection tools during class. These are helpful for understanding what’s going on. A different IDE would give you its own inspection tools, but the advantage of knowing how to get this information programmatically is that it helps you treat your code as data.
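In that spirit, here is a sketch of a few DataFrame attributes worth trying. To keep it runnable offline, it reads a three-row excerpt typed by hand from the table above (rows 0, 2, and 3, three columns only) instead of the full file; sample_df and first_rows are names made up for this example.

```python
import io

import pandas as pd

# hand-copied excerpt of three columns from the table above
csv_text = (
    "Species,Country.of.Origin,altitude_mean_meters\n"
    "Arabica,Ethiopia,2075.0\n"
    "Arabica,Guatemala,1700.0\n"
    "Arabica,Ethiopia,2000.0\n"
)
sample_df = pd.read_csv(io.StringIO(csv_text))

# (number of rows, number of columns)
print(sample_df.shape)

# the column names
print(list(sample_df.columns))

# the type pandas chose for each column
print(sample_df.dtypes)

# only the first two rows
first_rows = sample_df.head(2)
print(first_rows)
```

The same attributes work on coffee_df, where shape matches the "1311 rows × 44 columns" shown under the table.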
2.4. Good Code is always relative#
Important
I added this section to the notes; it was not covered in class today. I said similar things last week, but this version includes more references and context.
In programming for data science, we are often trying to tell a story.
Try it yourself
How might this goal change your code for this class relative to other code you have written or could imagine writing?
Python is a fully open source project and as such is governed by community standards and conventions.
Try it yourself
Find PEP8 (note that following it is part of earning python achievements)
The documentation for the full language is online too.
Guido van Rossum was the first main developer and wrote essays about Python too.
It’s pretty popular.
2.5. Questions After Class#
2.5.1. About the Course#
2.5.1.1. Will we further go over how to achieve level 3 achievements with more specificity?#
Right now, you are still only able to earn level 1s, and then with assignment 2 you can start earning level 2s. After that, it will make more sense to talk about portfolios.
2.5.1.2. How do portfolio checks work?#
At a logistical level, you add files to your portfolio repository by a specific check date and then I grade them.
2.5.1.3. How much stats will we do in this class?#
Only a little bit. We will do some modeling of data and compute basic statistics, but we will not cover the underlying concepts of statistics much. However, if you know some statistics, you will be able to extend what we cover to use them.
2.5.1.4. Is there a rate at which we need to complete skill checks if we fall behind?#
Follow the table on the Achievements page to see which achievements are eligible in which assignments and portfolio checks. In some assignments you can earn only 1-2 achievements; in others you can earn 4-5 level 2s. Skipping early chances just because there are future chances is not recommended, though. However, if you already have an achievement, you can sometimes skip a section of an assignment.
2.5.2. About Jupyter#
2.5.2.1. Can you interact with those data tables that we put on jupyter today in real time?#
Yes, we can manipulate the data, but we read a copy in. We are not manipulating the version that was on GitHub.
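A small sketch of what "we read a copy in" means. The source_text string is a hand-typed stand-in for the file on GitHub, so nothing here touches the real dataset, and the variable names are made up for this example.

```python
import io

import pandas as pd

# stand-in for the file stored on GitHub
source_text = "Species,Country.of.Origin\nArabica,Ethiopia\nArabica,Guatemala\n"

# reading creates a DataFrame in our notebook's memory
local_df = pd.read_csv(io.StringIO(source_text))

# change our in-memory copy only
local_df.loc[0, 'Country.of.Origin'] = 'Changed locally'

# reading the source again gives the original values back
reloaded_df = pd.read_csv(io.StringIO(source_text))

changed_value = local_df.loc[0, 'Country.of.Origin']
original_value = reloaded_df.loc[0, 'Country.of.Origin']
print(changed_value)
print(original_value)
```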
2.5.2.2. Will we eventually learn how to filter data in order to separate different names of data, or perform mathematical operations on the datasets?#
Yes, all!
2.5.2.3. Can you show how to launch a book another day after we saved it and come back to it?#
When you launch your server on the “home” tab you can click on the file name of a previously saved notebook to work on it again.
2.5.2.4. Can you just put anything in the docstring?#
Yes, technically, but you should follow good code style guidelines.
2.5.2.5. What would happen if we just call ‘pd.read_csv(coffee_data_url)’ instead of storing it in a variable and then calling the variable?#
It would print it out, but then you don’t have a variable, so you would have to read it in again to be able to manipulate it.
2.5.2.6. What is considered scraped data?#
Data that is pulled from websites automatically. We will do some web scraping starting this week.
2.5.3. Questions we will answer in the rest of this week#
How can we get a more detailed view of the data, instead of only the first and last few columns?
Could we retrieve data in all formats with one function?
Can we access data like a 2d array?
What is the function to check the unique values in a column?