2. Reading Docstrings, Object Inspection, and Loading Data#

2.1. Programming is a Practice#

Python has a print function, and we can use the help in Jupyter to learn how to use it in different ways.

Given this code excerpt, how could you print out “Sarah_Brown”?

first = 'Sarah'
last = 'Brown'

We can print out the whole docstring instead of using Shift + Tab to view it.

help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.

The signature line shows that print can take multiple values, because it lists value, ... before sep.

It also has a keyword argument, sep=' '. A keyword argument is passed as argument=value and has a default, so unless we say otherwise, print inserts a space between the values.

print(first,last)
Sarah Brown
print(first,last,sep='_')
Sarah_Brown
type(first)
str
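The other keyword arguments listed in the docstring work the same way as sep. As a minimal sketch (the buffer name and the '!' ending are illustrative choices, not from class), the following sets sep and end together and uses file to send the output to an in-memory stream instead of the screen:

```python
import io

first = 'Sarah'
last = 'Brown'

# file accepts any file-like object; StringIO keeps the text in memory
buffer = io.StringIO()

# sep goes between the values, end replaces the default newline
print(first, last, sep='_', end='!\n', file=buffer)

# getvalue() returns everything that was written to the buffer
captured = buffer.getvalue()
print(repr(captured))
```

Printing with repr makes the trailing newline visible, which is handy when checking what end actually did.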
def compute_grade(num_level1,num_level2,num_level3):
    '''
    Computes a grade for CSC/DSP310 from numbers of achievements at each level

    Parameters:
    ------------
    num_level1 : int
      number of level 1 achievements earned
    num_level2 : int
      number of level 2 achievements earned
    num_level3 : int
      number of level 3 achievements earned

    Returns:
    --------
    letter_grade : string
      letter grade with modifier (+/-)
    '''
    if num_level1 == 15:
        if num_level2 == 15:
            if num_level3 == 15:
                grade = 'A'
            elif num_level3 >= 10:
                grade = 'A-'
            elif num_level3 >=5:
                grade = 'B+'
            else:
                grade = 'B'
        elif num_level2 >=10:
            grade = 'B-'
        elif num_level2 >=5:
            grade = 'C+'
        else:
            grade = 'C'
    elif num_level1 >= 10:
        grade = 'C-'
    elif num_level1 >= 5:
        grade = 'D+'
    elif num_level1 >=3:
        grade = 'D'
    else:
        grade = 'F'


    return grade
type(compute_grade)
function
help(compute_grade)

Passing the name without parentheses hands help the function itself; help(compute_grade()) would instead call the function with no arguments and raise a TypeError.
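The difference between a function object and a call to that function can be checked with type. A small sketch with a hypothetical function double (not part of the course code):

```python
def double(x):
    '''Return twice the input.'''
    return 2 * x

# no parentheses: this is the function object itself
object_type = type(double).__name__

# with parentheses and an argument: this is the value the function returns
result_type = type(double(4)).__name__

print(object_type)
print(result_type)
```

Inspection tools such as type and help describe whatever they are handed, so handing them a call describes the returned value rather than the function.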

2.1.1. Why inspection in code?#

Some IDEs give you GUI based tools to inspect objects. We are going to do it programmatically inline with our analyses for two reasons.

  • (minor, logistical) it helps make for good notes

  • (most importantly) it helps build habits of data science

2.1.2. Investigating how doc strings work#

We can see how the docstring affects the output of help and exactly how it has to be formatted to count as a docstring.

def ex_1(a):
    print(a)
help(ex_1)
Help on function ex_1 in module __main__:

ex_1(a)

def ex_2(a):
    # is this a docstring?
    print(a)
help(ex_2)
Help on function ex_2 in module __main__:

ex_2(a)

A comment does not show up in the help output, so it is not a docstring.

def ex_3(a):
    ''' this is a docstring'''
    # is this a docstring?
    print(a)
help(ex_3)
Help on function ex_3 in module __main__:

ex_3(a)
    this is a docstring

def ex_4(a):
    """this is a docstring"""
    # is this a docstring?
    print(a)
help(ex_4)
Help on function ex_4 in module __main__:

ex_4(a)
    this is a docstring

Both triple single quotes and triple double quotes work, as long as the string is the first statement in the function.
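Another way to check what Python stored as the docstring is the function's __doc__ attribute. A minimal sketch, using hypothetical function names rather than the ones above:

```python
def with_comment(a):
    # a comment is ignored by Python, so it is not a docstring
    return a

def with_string(a):
    '''a string literal as the first statement becomes the docstring'''
    return a

def with_late_string(a):
    result = a
    '''a string that is not the first statement is not stored'''
    return result

# None means no docstring was stored for that function
print(with_comment.__doc__)
print(with_string.__doc__)
print(with_late_string.__doc__)
```

Only the second function has a docstring: the string has to be the first statement in the function body.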

Tip

In Python, PEP 257 describes how to write a docstring, but it is very broad.

In Data Science, numpydoc style docstrings are popular.
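As a sketch of how a short numpydoc-style docstring can be read programmatically, the hypothetical function below (full_name is not from class) uses the same kind of layout as compute_grade. inspect.getdoc returns the docstring with the indentation cleaned up, and inspect.signature lists the parameters:

```python
import inspect

def full_name(first, last, sep=' '):
    '''
    Join a first and last name.

    Parameters
    ----------
    first : str
        first name
    last : str
        last name
    sep : str
        text placed between the names, default a space

    Returns
    -------
    name : str
        the joined name
    '''
    return first + sep + last

# getdoc strips the leading blank line and the common indentation
doc = inspect.getdoc(full_name)
summary = doc.splitlines()[0]

# the parameter names, in the order they appear in the signature
params = list(inspect.signature(full_name).parameters)

print(summary)
print(params)
```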

2.2. Coffee Data#

Structured data is easier to work with than other data.

We’re going to focus on tabular data for now. At the end of the course, we’ll examine images, which are structured but more complex, and text, which is much less structured.

We’re going to use a dataset about coffee quality today.

How was this dataset collected?

  • reviews were added to a database

  • the database was then scraped

Where did it come from?

  • the Coffee Quality Institute’s trained reviewers

What format is it provided in?

  • CSV (Comma Separated Values)

What other information is in this repository?

  • the code to scrape and clean the data

  • the data before cleaning

To get the raw URL for the dataset, click the Raw button on the CSV page on GitHub, then copy the URL.

[Figure: a screenshot from GitHub of the data file page with the Raw button circled in pink]

We’ll save that URL as a variable to work with it.

coffee_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

2.3. Loading in Data#

We will use a library called pandas.

import pandas as pd
  • the import keyword is used for loading packages

  • pandas is the name of the package that is installed

  • as keyword allows us to assign an alias (nickname)

  • pd is the typical alias for pandas
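The same import ... as pattern works for any module. A minimal sketch with the standard library's statistics module (chosen only because it needs no installation; the alias st is an illustrative choice):

```python
import statistics as st

# the alias is just another name for the module object
module_name = st.__name__

# functions are reached through the alias with a dot
average = st.mean([1, 2, 3, 4])

print(module_name)
print(average)
```

The module keeps its real name internally; the alias only changes what we type in our own code.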

We can read data in using the read_csv function.

pd.read_csv(coffee_data_url)
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
1 2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
2 3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.00 1800.00 1700.00
3 4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.00 2200.00 2000.00
4 5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1306 1307 Arabica juan carlos garcia lopez Mexico el centenario NaN la esperanza, municipio juchique de ferrer, ve... 1104328663 terra mia 900 ... None 20 September 17th, 2013 AMECAFE 59e396ad6e22a1c22b248f958e1da2bd8af85272 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 m 900.00 900.00 900.00
1307 1308 Arabica myriam kaplan-pasternak Haiti 200 farms NaN coeb koperativ ekselsyo basen (350 members) NaN haiti coffee ~350m ... Blue-Green 16 May 24th, 2013 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 350.00 350.00 350.00
1308 1309 Arabica exportadora atlantic, s.a. Nicaragua finca las marías 017-053-0211/ 017-053-0212 beneficio atlantic condega 017-053-0211/ 017-053-0212 exportadora atlantic s.a 1100 ... Green 5 June 6th, 2018 Instituto Hondureño del Café b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 7f521ca403540f81ec99daec7da19c2788393880 m 1100.00 1100.00 1100.00
1309 1310 Arabica juan luis alvarado romero Guatemala finca el limon NaN beneficio serben 11/853/165 unicafe 4650 ... Green 4 May 24th, 2013 Asociacion Nacional Del Café b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 ft 1417.32 1417.32 1417.32
1310 1312 Arabica bismarck castro Honduras los hicaques 103 cigrah s.a de c.v. 13-111-053 cigrah s.a de c.v 1400 ... Green 2 April 28th, 2018 Instituto Hondureño del Café b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 7f521ca403540f81ec99daec7da19c2788393880 m 1400.00 1400.00 1400.00

1311 rows × 44 columns

This reads in the data and displays it because it is the last line in the cell. If we do something else afterward, it will still read the data in, but not display it.

pd.read_csv(coffee_data_url)

print(first)
Sarah

In order to use it, we save the output to a variable.

coffee_data = pd.read_csv(coffee_data_url)
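To try the same pattern without a network connection, read_csv also accepts a file-like object. The sketch below uses a tiny made-up table (the column names and values are invented for illustration and are not from the coffee dataset):

```python
import io
import pandas as pd

# a small made-up CSV, held in memory as text
csv_text = 'name,score\nalpha,82.5\nbeta,79.0\ngamma,85.25\n'

# saving the result to a variable lets us keep working with it
example_df = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
print(example_df.shape)
print(list(example_df.columns))
```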

Then we can check the type.

type(coffee_data)
pandas.core.frame.DataFrame

This is a new type that is provided by the pandas library. Notice that it uses the full library name, not the alias, because this comes from the code for the library itself, not our current code, where pandas has a nickname.
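The alias and the full name refer to the same class, which can be checked directly. A minimal sketch with a made-up table (small_df and its values are illustrative only):

```python
import pandas as pd

# a tiny DataFrame built from a dictionary (values are made up)
small_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# pd.DataFrame is the same class that type() reports
same_class = type(small_df) is pd.DataFrame

# the module object keeps its real name, whatever alias we use
library_name = pd.__name__

print(same_class)
print(library_name)
```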

coffee_df = pd.read_csv(coffee_data_url)
coffee_df
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
1 2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
2 3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.00 1800.00 1700.00
3 4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.00 2200.00 2000.00
4 5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1306 1307 Arabica juan carlos garcia lopez Mexico el centenario NaN la esperanza, municipio juchique de ferrer, ve... 1104328663 terra mia 900 ... None 20 September 17th, 2013 AMECAFE 59e396ad6e22a1c22b248f958e1da2bd8af85272 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 m 900.00 900.00 900.00
1307 1308 Arabica myriam kaplan-pasternak Haiti 200 farms NaN coeb koperativ ekselsyo basen (350 members) NaN haiti coffee ~350m ... Blue-Green 16 May 24th, 2013 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 350.00 350.00 350.00
1308 1309 Arabica exportadora atlantic, s.a. Nicaragua finca las marías 017-053-0211/ 017-053-0212 beneficio atlantic condega 017-053-0211/ 017-053-0212 exportadora atlantic s.a 1100 ... Green 5 June 6th, 2018 Instituto Hondureño del Café b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 7f521ca403540f81ec99daec7da19c2788393880 m 1100.00 1100.00 1100.00
1309 1310 Arabica juan luis alvarado romero Guatemala finca el limon NaN beneficio serben 11/853/165 unicafe 4650 ... Green 4 May 24th, 2013 Asociacion Nacional Del Café b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 ft 1417.32 1417.32 1417.32
1310 1312 Arabica bismarck castro Honduras los hicaques 103 cigrah s.a de c.v. 13-111-053 cigrah s.a de c.v 1400 ... Green 2 April 28th, 2018 Instituto Hondureño del Café b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 7f521ca403540f81ec99daec7da19c2788393880 m 1400.00 1400.00 1400.00

1311 rows × 44 columns

If you’re curious about something, try it out and see what happens. We’re going to use a lot of code inspection tools during class. These are helpful for understanding what’s going on. A different IDE might give you graphical inspection tools, but knowing how to get this information programmatically helps you treat your code as data.

2.4. Good Code is always relative#

Important

I added this section to the notes; it was not in class today. I said similar things last week, but this includes more references and context.

In programming for data science, we are often trying to tell a story.

Try it yourself

How might this goal change your code for this class relative to other code you have written or could imagine writing?

Python is a fully open source project and as such is governed by community standards and conventions.

Try it yourself

Find PEP 8 (note that following it is part of earning Python achievements).

The documentation for the full language is online too.

Guido van Rossum was the first main developer and wrote essays about Python too.

Python is also pretty popular.

2.5. Questions After Class#

2.5.1. About the Course#

2.5.1.1. Will we further go over how to achieve level 3 achievements with more specificity?#

Right now, you are only able to earn level 1s; with assignment 2 you can start earning level 2s. After that, it will make more sense to talk about portfolios.

2.5.1.2. How do portfolio checks work?#

At a logistical level, you add files to your portfolio repository by a specific check date and then I grade them.

2.5.1.3. How much stats will we do in this class?#

Only a little bit. We will do some modeling of data and compute basic statistics, but we will not cover much of the underlying concepts of statistics. However, if you know some statistics, you will be able to extend what we cover to use them.

2.5.1.4. Is there a rate at which we need to complete skill checks if we fall behind?#

Follow the table on the Achievements page to see which achievements are eligible in each assignment and portfolio check. In some assignments you can earn only 1-2 achievements; in others you can earn 4-5 level 2s. Skipping early chances just because there are future chances is not recommended. However, if you already have an achievement, you can sometimes skip a section of an assignment.

2.5.2. About Jupyter#

2.5.2.1. Can you interact with those data tables that we put on jupyter today in real time?#

Yes, we can manipulate the data, but we read a copy in. We are not manipulating the version that was on GitHub.
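A sketch of why the original stays untouched: read_csv builds a new object in memory, so changes to the DataFrame do not flow back to the source. Here an in-memory string with invented values stands in for the file on GitHub:

```python
import io
import pandas as pd

# stand-in for the remote file (values are made up)
source_text = 'country,points\nEthiopia,90\nGuatemala,89\n'

local_df = pd.read_csv(io.StringIO(source_text))

# change the copy that lives in memory
local_df['points'] = local_df['points'] + 1

# the DataFrame changed, the source text did not
print(local_df['points'].tolist())
print(source_text)
```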

2.5.2.2. Will we eventually learn how to filter data in order to separate different names of data, or perform mathematical operations on the datasets?#

Yes, all!

2.5.2.3. Can you show how to launch a book another day after we saved it and come back to it?#

When you launch your server on the “home” tab you can click on the file name of a previously saved notebook to work on it again.

2.5.2.4. Can you just put anything in the docstring?#

Yes, technically, but you should follow good code style guidelines.

2.5.2.5. What would happen if we just call ‘pd.read_csv(coffee_data_url)’ instead of storing it in a variable and then calling the variable?#

It would print it out, but then you don’t have a variable, so you would have to read it in again to be able to manipulate it.

2.5.2.6. What is considered scraped data?#

Data that is pulled from websites automatically. We will do some web scraping starting this week.

2.5.3. Questions we will answer in the rest of this week#

  • How to view more of the data, instead of only the first and last few columns

  • Could we retrieve data in all formats with one function?

  • Can we access data like a 2d array?

  • What is the function to check the unique values in a column?