Class 3: Welcome to Week 2

This week we will:

  • clarify how the grading really works

  • learn about accessing data

  • use accessing data as motivation to review more python

Grading and Assignment 1

  • Solution function posted.

  • note: not a sum

  • read the rubric

  • Brightspace will show grades as they’re earned

  • In class, respond on Prismia

  • Portfolio

    • will start posting prompts

The docstring functions like a property of the function object, so it has to be inside the function body.
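To see why the docstring has to sit inside the function, here is a minimal sketch. The functions greet and greet_commented are hypothetical examples, not the posted Assignment 1 solution. Python stores a string placed as the first statement of the body on the function object as __doc__, while a comment leaves __doc__ empty.

```python
def greet(name):
    """Return a friendly greeting for the given name."""
    return 'hello ' + name


def greet_commented(name):
    # A comment is not a docstring, so nothing is stored on the function
    return 'hello ' + name


# The docstring is available as an attribute of the function object
print(greet.__doc__)
# With only a comment in the body, __doc__ is None
print(greet_commented.__doc__)
```

In a notebook, help(greet) or greet? displays the same text.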

Iterables

Python has a general category of objects that are designed to facilitate repetition of some sort; they’re called iterables.

We’ve already seen one: strings are iterables.

name = 'sarah'

Strings are also sequences, which means we can index them:

name[3]
'a'

Indexing with a negative number counts from the end

name[-1]
'h'
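As a small sketch of how negative indices relate to positive ones (the variable names other than name are chosen here for illustration), index -1 refers to the same character as index len(name) - 1:

```python
name = 'sarah'

# -1 is the last character, the same position as len(name) - 1
last = name[-1]
same_last = name[len(name) - 1]

# -2 is the second-to-last character
second_to_last = name[-2]

print(last, same_last, second_to_last)
```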

Loops in Python have syntax similar to the if statements and functions we saw last week:

for char in name:
    print(char*3)
sss
aaa
rrr
aaa
hhh

some notes:

  • char is called the loop variable

  • name is called the collection; this can be any iterable object in Python

  • print(char*3) is called the loop body

  • Python lets us use some mathematical operators on strings; here * repeats the string
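The notes above can be tried out with a short sketch. The variable names tripled, joined, and laughs are made up for this example. It shows the two string operators (* repeats, + joins) and a loop whose collection is a range of integers instead of a string.

```python
name = 'sarah'

# * repeats a string and + joins two strings
tripled = name[0] * 3
joined = name + '!'
print(tripled, joined)

# The collection can be any iterable, for example a range of integers
laughs = ''
for count in range(1, 4):
    # count is the loop variable; these two indented lines are the loop body
    laughs = laughs + 'ha' * count + ' '
    print('ha' * count)
```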

Lists and List Comprehensions

We make a list with square brackets

names = ['sarah', 'Jose', 'Cam', 'Bri']

We can also build lists by folding a loop into the list construction:

['hello' + n for n in names]
['hellosarah', 'helloJose', 'helloCam', 'helloBri']

This is called a list comprehension.

greetings = ['hello ' + n for n in names]
greetings[0]
'hello sarah'
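To make the "folded loop" idea concrete, here is a sketch that builds the same list twice, once with an explicit for loop and once with a comprehension. The names greetings_loop, greetings_comp, and short_names are illustrative only. The last line shows the optional if clause a comprehension can carry.

```python
names = ['sarah', 'Jose', 'Cam', 'Bri']

# Explicit loop: start with an empty list and append one item per pass
greetings_loop = []
for n in names:
    greetings_loop.append('hello ' + n)

# List comprehension: the same loop written inside the brackets
greetings_comp = ['hello ' + n for n in names]

print(greetings_loop == greetings_comp)

# A comprehension can also filter with an if clause
short_names = [n for n in names if len(n) == 3]
print(short_names)
```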

Dictionaries

Dictionaries are a useful datatype in Python. A dictionary is denoted by {} and contains key: value pairs separated by commas.

gh_names = {'brownsarahm':'Sarah Brown',
            'briannakathrynm1' : 'Brianna MacDonald',
            'jdion62':'Jacob Dion'}
gh_names
{'brownsarahm': 'Sarah Brown',
 'briannakathrynm1': 'Brianna MacDonald',
 'jdion62': 'Jacob Dion'}

You can think of it like a list of the values with a named index.

gh_names['jdion62']
'Jacob Dion'
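Unlike a list index, a key may be absent from the dictionary. The sketch below uses a made-up key, 'not_a_user', to show two common ways to handle that: testing membership with in, and looking up with the get method, which returns None (or a default you supply) instead of raising a KeyError.

```python
gh_names = {'brownsarahm': 'Sarah Brown',
            'briannakathrynm1': 'Brianna MacDonald',
            'jdion62': 'Jacob Dion'}

# 'in' checks the keys, not the values
has_jacob = 'jdion62' in gh_names
has_missing = 'not_a_user' in gh_names   # hypothetical key, not in the dictionary
print(has_jacob, has_missing)

# get returns None for a missing key, or the default passed as second argument
missing = gh_names.get('not_a_user')
fallback = gh_names.get('not_a_user', 'unknown')
print(missing, fallback)
```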

We can iterate over both the keys and the values by using the items method on a dictionary. That makes another iterable object that can be used as a loop collection. Each element is a (key, value) pair, so we get two loop variables:

for key, value in gh_names.items():
    print(value, "'s username is ", key)
Sarah Brown 's username is  brownsarahm
Brianna MacDonald 's username is  briannakathrynm1
Jacob Dion 's username is  jdion62

If we iterate over the dictionary without that method, we get the keys.

for val in gh_names:
    print(val)
brownsarahm
briannakathrynm1
jdion62
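The three ways of iterating over a dictionary can be compared side by side. This is a sketch with illustrative variable names; each view is converted to a list only so it can be printed in full.

```python
gh_names = {'brownsarahm': 'Sarah Brown',
            'briannakathrynm1': 'Brianna MacDonald',
            'jdion62': 'Jacob Dion'}

# Iterating over the dictionary itself gives the keys, same as .keys()
usernames = list(gh_names)
usernames_explicit = list(gh_names.keys())

# .values() gives only the values
full_names = list(gh_names.values())

# .items() gives (key, value) tuples, which a loop can unpack
pairs = list(gh_names.items())

print(usernames)
print(full_names)
print(pairs[0])
```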

Libraries

To use libraries in Python, we import them.

We will use pandas a lot in this class. It’s the Python Data Analysis Library.

import pandas

Once we import we can use the functions, datatypes, and values a library provides by using a . after the name. In a notebook, pressing tab will show you the options.

pandas.read_csv()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-374a1a6f9f7e> in <module>
----> 1 pandas.read_csv()

TypeError: read_csv() missing 1 required positional argument: 'filepath_or_buffer'

We can also use an alias to give a library a nickname to make it easier to use. pd is the standard alias for pandas.

import pandas as pd
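The import and alias mechanics are the same for any module. The sketch below uses the standard library's math module instead of pandas, only so that it runs without installing anything, and the alias m is an arbitrary choice for illustration. The alias is simply a second name for the same module object.

```python
import math        # use the full name after the import
import math as m   # or give the module a shorter alias

# Both names reach the same function through the dot
full_result = math.sqrt(16)
alias_result = m.sqrt(16)
print(full_result, alias_result)

# The alias refers to the very same module object
print(m is math)
```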

We can read in data from a local path or a URL. Let’s read in the course map page of our course website.

pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html')
[   Unnamed: 0_level_0                                             topics  \
                  week                                 Unnamed: 1_level_1   
 0                   1                             [admin, python review]   
 1                   2                        Loading data, Python review   
 2                   3                          Exploratory Data Analysis   
 3                   4                                      Data Cleaning   
 4                   5                      Databases, Merging DataFrames   
 5                   6  Modeling, Naive Bayes, classification performa...   
 6                   7                   decision trees, cross validation   
 7                   8                                         Regression   
 8                   9                                         Clustering   
 9                  10                              SVM, parameter tuning   
 10                 11                              KNN, Model comparison   
 11                 12                                      Text Analysis   
 12                 13                                     Topic Modeling   
 13                 14                                      Deep Learning   
 
                              skills  
                  Unnamed: 2_level_1  
 0                           process  
 1      [access, prepare, summarize]  
 2            [summarize, visualize]  
 3   [prepare, summarize, visualize]  
 4    [access, construct, summarize]  
 5        [classification, evaluate]  
 6        [classification, evaluate]  
 7            [regression, evaluate]  
 8            [clustering, evaluate]  
 9                 [optimize, tools]  
 10                 [compare, tools]  
 11                   [unstructured]  
 12            [unstructured, tools]  
 13                 [tools, compare]  ,
    Unnamed: 0_level_0                                              skill  \
               keyword                                 Unnamed: 1_level_1   
 0              python                              pythonic code writing   
 1             process                 describe data science as a process   
 2              access                    access data in multiple formats   
 3           construct           construct datasets from multiple sources   
 4           summarize                        Summarize and describe data   
 5           visualize                                     Visualize data   
 6             prepare                          prepare data for analysis   
 7      classification                               Apply classification   
 8          regression                                   Apply Regression   
 9          clustering                                         Clustering   
 10           evaluate                         Evaluate model performance   
 11           optimize                          Optimize model parameters   
 12            compare                                     compare models   
 13       unstructured                            model unstructured data   
 14           workflow  use industry standard data science tools and w...   
 
                                               Level 1  \
                                    Unnamed: 2_level_1   
 0   python code that mostly runs, occasional pep8 ...   
 1           Identify basic components of data science   
 2   load data from at least one format; identify t...   
 3   identify what should happen to merge datasets ...   
 4   Describe the shape and structure of a dataset ...   
 5   identify plot types, generate basic plots from...   
 6   identify if data is or is not ready for analys...   
 7   identify and describe what classification is, ...   
 8   identify what data that can be used for regres...   
 9                         describe what clustering is   
 10  Explain basic performance metrics for differen...   
 11  Identify when model parameters need to be opti...   
 12                Qualitatively compare model classes   
 13  Identify options for representing text data an...   
 14  Solve well strucutred problems with a single t...   
 
                                               Level 2  \
                                    Unnamed: 3_level_1   
 0   python code that reliably runs, frequent pep8 ...   
 1   Describe and define each stage of the data sci...   
 2   Load data for processing from the most common ...   
 3                                  apply basic merges   
 4   compute summary statndard statistics of a whol...   
 5   generate multiple plot types with complete lab...   
 6   apply data reshaping, cleaning, and filtering ...   
 7   fit preselected classification model to a dataset   
 8                    can fit linear regression models   
 9                              apply basic clustering   
 10  Apply basic model evaluation metrics to a held...   
 11  Manually optimize basic model parameters such ...   
 12  Compare model classes in specific terms and fi...   
 13  Apply at least one representation to transform...   
 14  Solve semi-strucutred, completely specified pr...   
 
                                               Level 3  
                                    Unnamed: 4_level_1  
 0   reliable, efficient, pythonic code that consis...  
 1   Compare different ways that data science can f...  
 2   access data from both common and uncommon form...  
 3        merge data that is not automatically aligned  
 4   Compute and interpret various summary statisti...  
 5   generate complex plots with pandas and plottin...  
 6   apply data reshaping, cleaning, and filtering ...  
 7   fit and apply classification models and select...  
 8   can fit and explain regrularized or nonlinear ...  
 9   apply multiple clustering techniques, and inte...  
 10  Evaluate a model with multiple metrics and cro...  
 11  Select optimal parameters based of mutiple qua...  
 12  Evaluate tradeoffs between different model com...  
 13  apply multiple representations and compare and...  
 14  Scope, choose an appropriate tool pipeline and...  ,
    Unnamed: 0_level_0                 A1                 A2  \
               keyword Unnamed: 1_level_1 Unnamed: 2_level_1   
 0              python                  1                  1   
 1             process                  1                  1   
 2              access                  0                  1   
 3           construct                  0                  0   
 4           summarize                  0                  0   
 5           visualize                  0                  0   
 6             prepare                  0                  0   
 7      classification                  0                  0   
 8          regression                  0                  0   
 9          clustering                  0                  0   
 10           evaluate                  0                  0   
 11           optimize                  0                  0   
 12            compare                  0                  0   
 13       unstructured                  0                  0   
 14           workflow                  0                  0   
 
                    A3                 A4                 A5  \
    Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1   
 0                   1                  1                  0   
 1                   0                  0                  0   
 2                   1                  1                  0   
 3                   0                  0                  1   
 4                   1                  1                  1   
 5                   1                  1                  0   
 6                   0                  1                  1   
 7                   0                  0                  0   
 8                   0                  0                  0   
 9                   0                  0                  0   
 10                  0                  0                  0   
 11                  0                  0                  0   
 12                  0                  0                  0   
 13                  0                  0                  0   
 14                  0                  0                  0   
 
                    A6                 A7                 A8  \
    Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1   
 0                   0                  0                  0   
 1                   0                  0                  0   
 2                   0                  0                  0   
 3                   1                  0                  0   
 4                   1                  1                  1   
 5                   1                  1                  1   
 6                   0                  0                  0   
 7                   1                  1                  0   
 8                   0                  0                  1   
 9                   0                  0                  0   
 10                  0                  0                  0   
 11                  0                  0                  0   
 12                  0                  0                  0   
 13                  0                  0                  0   
 14                  0                  0                  0   
 
                    A9                 A10                 A11  \
    Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1   
 0                   0                   0                   0   
 1                   0                   0                   0   
 2                   0                   0                   0   
 3                   0                   0                   0   
 4                   1                   1                   1   
 5                   1                   1                   1   
 6                   0                   0                   0   
 7                   0                   1                   0   
 8                   0                   0                   1   
 9                   1                   0                   1   
 10                  0                   1                   1   
 11                  0                   1                   1   
 12                  0                   0                   1   
 13                  0                   0                   0   
 14                  0                   1                   1   
 
                    A12                 A13       # Assignments  
    Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1  
 0                    0                   0                   4  
 1                    0                   0                   2  
 2                    0                   0                   3  
 3                    0                   0                   2  
 4                    1                   1                  11  
 5                    1                   1                  10  
 6                    0                   0                   2  
 7                    0                   0                   3  
 8                    0                   0                   2  
 9                    0                   0                   2  
 10                   0                   0                   2  
 11                   0                   0                   2  
 12                   0                   1                   2  
 13                   1                   1                   2  
 14                   1                   1                   4  ,
    Unnamed: 0_level_0                                            Level 3  \
               keyword                                 Unnamed: 1_level_1   
 0              python  reliable, efficient, pythonic code that consis...   
 1             process  Compare different ways that data science can f...   
 2              access  access data from both common and uncommon form...   
 3           construct       merge data that is not automatically aligned   
 4           summarize  Compute and interpret various summary statisti...   
 5           visualize  generate complex plots with pandas and plottin...   
 6             prepare  apply data reshaping, cleaning, and filtering ...   
 7      classification  fit and apply classification models and select...   
 8          regression  can fit and explain regrularized or nonlinear ...   
 9          clustering  apply multiple clustering techniques, and inte...   
 10           evaluate  Evaluate a model with multiple metrics and cro...   
 11           optimize  Select optimal parameters based of mutiple qua...   
 12            compare  Evaluate tradeoffs between different model com...   
 13       unstructured  apply multiple representations and compare and...   
 14           workflow  Scope, choose an appropriate tool pipeline and...   
 
                    P1                 P2                 P3                 P4  
    Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1  
 0                   1                  1                  0                  0  
 1                   0                  1                  1                  0  
 2                   1                  1                  0                  0  
 3                   1                  1                  0                  0  
 4                   1                  1                  0                  0  
 5                   1                  1                  0                  0  
 6                   1                  1                  0                  0  
 7                   0                  1                  1                  0  
 8                   0                  1                  1                  0  
 9                   0                  1                  1                  0  
 10                  0                  1                  1                  0  
 11                  0                  0                  1                  1  
 12                  0                  0                  1                  1  
 13                  0                  0                  1                  1  
 14                  0                  0                  1                  1  ]

This makes a list of pandas.DataFrame objects. We can check that with the following

Warning

This cell was added after class, but the explanation was given in class

type(pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html'))
list

To work with it, though, we should save it to a variable; then we can index into that list.

df_list = pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html')
df_list[0]
   Unnamed: 0_level_0                                             topics                           skills
                 week                                 Unnamed: 1_level_1               Unnamed: 2_level_1
0                   1                             [admin, python review]                          process
1                   2                        Loading data, Python review     [access, prepare, summarize]
2                   3                          Exploratory Data Analysis           [summarize, visualize]
3                   4                                      Data Cleaning  [prepare, summarize, visualize]
4                   5                      Databases, Merging DataFrames   [access, construct, summarize]
5                   6  Modeling, Naive Bayes, classification performa...       [classification, evaluate]
6                   7                   decision trees, cross validation       [classification, evaluate]
7                   8                                         Regression           [regression, evaluate]
8                   9                                         Clustering           [clustering, evaluate]
9                  10                              SVM, parameter tuning                [optimize, tools]
10                 11                              KNN, Model comparison                 [compare, tools]
11                 12                                      Text Analysis                   [unstructured]
12                 13                                     Topic Modeling            [unstructured, tools]
13                 14                                      Deep Learning                 [tools, compare]

When you display DataFrames in jupyter, they get nice formatting.

Review & Further Reading

If you’ve made it this far, let me know how you found these notes.