Class 3: Welcome to Week 2¶
This week we will:
clarify how this grading really works
learn about accessing data
use accessing data as motivation to review more python
Grading and Assignment 1¶
Solution function posted.
note: not a sum
read the rubric
Brightspace will show grades as they’re earned
In class, respond on prismia
Portfolio
will start posting prompts
The docstring functions like a property of the function object, so it has to be inside the function body.
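As a minimal sketch of the docstring point, the example below compares a function whose docstring is the first statement in its body with one that only has a comment above it. The function names `greet` and `no_doc` are made up for illustration and are not from the assignment.

```python
def greet(name):
    """Return a greeting for name."""
    return 'hello ' + name


# This comment sits outside the function, so it is not a docstring.
def no_doc(name):
    return 'hello ' + name


# The docstring is stored on the function object as the __doc__ attribute.
doc_inside = greet.__doc__
# With no string as the first statement in the body, __doc__ is None.
doc_outside = no_doc.__doc__

print(doc_inside)
print(doc_outside)
```

In a notebook, `help(greet)` displays the same docstring text.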
Iterables¶
Python has a general data type for objects that are designed to facilitate repetition of some sort; they’re called iterables.
We’ve already seen one: strings are iterables.
name = 'sarah'
Strings are also ordered sequences, which means we can index them, counting from 0
name[3]
'a'
Indexing with a negative number counts from the end
name[-1]
'h'
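A short sketch that puts the two indexing directions side by side, using the same `name` string. The extra variable names are only for illustration.

```python
name = 'sarah'

# Positive indices count from the start, beginning at 0.
first = name[0]
fourth = name[3]

# Negative indices count back from the end, beginning at -1.
last = name[-1]
second_to_last = name[-2]

# len() gives the number of characters, so the last positive index is len(name) - 1.
length = len(name)

print(first, fourth, last, second_to_last, length)
```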
Loops in Python have similar syntax to the if statements and functions we saw last week:
for char in name:
    print(char*3)
sss
aaa
rrr
aaa
hhh
some notes:
char is called the loop variable
name is called the collection; this can be any iterable object in Python
print(char*3) is called the loop body
Python lets us use some mathematical operators on strings; here * repeats each character three times
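The loop notes above can be sketched as follows. Instead of printing, this version collects the results in a list so they are easy to inspect, and it reuses the same loop syntax on a `range`, which is another iterable. The list names are chosen for this example only.

```python
name = 'sarah'

# Collect what the loop body produces instead of printing it.
tripled = []
for char in name:              # char is the loop variable, name is the collection
    tripled.append(char * 3)   # * repeats a string

# + joins two strings end to end.
joined = 'sa' + 'rah'

# The same loop syntax works for other iterables, such as a range of integers.
squares = []
for i in range(4):
    squares.append(i ** 2)

print(tripled)
print(joined)
print(squares)
```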
Lists and List Comprehensions¶
We make a list with square brackets
names = ['sarah', 'Jose', 'Cam', 'Bri']
we can also build lists by folding a loop into the list construction
['hello' + n for n in names]
['hellosarah', 'helloJose', 'helloCam', 'helloBri']
this is called a list comprehension
greetings = ['hello ' + n for n in names]
greetings[0]
'hello sarah'
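To make the "folding a loop into the list construction" idea concrete, here is a sketch that builds the same greetings with an explicit loop and with a comprehension, then adds an optional `if` filter. The names `greetings_loop`, `greetings_comp`, and `short_names` are illustrative.

```python
names = ['sarah', 'Jose', 'Cam', 'Bri']

# Explicit loop: start with an empty list and append one item per pass.
greetings_loop = []
for n in names:
    greetings_loop.append('hello ' + n)

# List comprehension: the same loop written inside the square brackets.
greetings_comp = ['hello ' + n for n in names]

# A comprehension can also keep only the items that pass a condition.
short_names = [n for n in names if len(n) == 3]

print(greetings_loop == greetings_comp)
print(short_names)
```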
Dictionaries¶
Dictionaries are a useful datatype in Python. A dictionary is denoted by {} and contains key: value pairs separated by commas.
gh_names = {'brownsarahm': 'Sarah Brown',
            'briannakathrynm1': 'Brianna MacDonald',
            'jdion62': 'Jacob Dion'}
gh_names
{'brownsarahm': 'Sarah Brown',
'briannakathrynm1': 'Brianna MacDonald',
'jdion62': 'Jacob Dion'}
You can think of it like a list of the values with a named index.
gh_names['jdion62']
'Jacob Dion'
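Beyond looking up a value by its key, a few other dictionary operations come up often. The sketch below uses a shortened copy of `gh_names`; the `'example_user'` entry and the `'not_a_user'` key are hypothetical and only there for illustration.

```python
gh_names = {'brownsarahm': 'Sarah Brown',
            'jdion62': 'Jacob Dion'}

# Assigning to a new key adds a pair to the dictionary.
gh_names['example_user'] = 'Example Person'  # hypothetical entry

# `in` checks whether a key is present.
has_sarah = 'brownsarahm' in gh_names

# Square brackets raise a KeyError for a missing key;
# .get returns None, or a default that we provide, instead.
missing = gh_names.get('not_a_user')
fallback = gh_names.get('not_a_user', 'unknown')

print(len(gh_names), has_sarah, missing, fallback)
```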
We can iterate over both the key and the value by using the items method on a dictionary. That makes another iterable object that can be used as a loop collection. It functions as a set of pairs, so we get two loop variables:
for key, value in gh_names.items():
    print(value, "'s username is ", key)
Sarah Brown 's username is brownsarahm
Brianna MacDonald 's username is briannakathrynm1
Jacob Dion 's username is jdion62
If we iterate over the dictionary without that method, we get the keys.
for key in gh_names:
    print(key)
brownsarahm
briannakathrynm1
jdion62
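The different ways of iterating over a dictionary can be compared in one place. This sketch collects the results in lists rather than printing inside the loop, and it uses an f-string to build the sentence, which is one way (an addition here, not something covered in class) to control the spacing.

```python
gh_names = {'brownsarahm': 'Sarah Brown',
            'briannakathrynm1': 'Brianna MacDonald',
            'jdion62': 'Jacob Dion'}

# Iterating over the dictionary itself gives the keys.
keys = list(gh_names)

# .values() gives only the values.
values = list(gh_names.values())

# .items() gives (key, value) pairs, which we unpack into two loop variables.
lines = [f"{value}'s username is {key}" for key, value in gh_names.items()]

print(keys)
print(values)
for line in lines:
    print(line)
```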
Libraries¶
To use libraries in Python we import them. We will use pandas a lot in this class. It’s the Python Data Analysis Library.
import pandas
Once we import a library, we can use the functions, datatypes, and values it provides by using a . after the name. In a notebook, pressing tab will show you the options. Calling read_csv without a file path raises an error, because that argument is required:
pandas.read_csv()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-374a1a6f9f7e> in <module>
----> 1 pandas.read_csv()
TypeError: read_csv() missing 1 required positional argument: 'filepath_or_buffer'
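The error says that `read_csv` needs to be told where the data is. As a sketch that runs without any file on disk, the example below wraps a small piece of CSV text in `io.StringIO`, which `read_csv` accepts in place of a path. The CSV content is invented for this example.

```python
import io

import pandas

# Made-up CSV text standing in for a file; the first line holds the column names.
csv_text = "name,username\nSarah Brown,brownsarahm\nJacob Dion,jdion62\n"

# read_csv takes a path, URL, or file-like object as its first argument.
df = pandas.read_csv(io.StringIO(csv_text))

print(df.shape)
print(list(df.columns))
```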
We can also use an alias to give a library a nickname to make it easier to use. pd is the standard alias for pandas.
import pandas as pd
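An alias is only a second name for the same library. The sketch below shows this with the standard library's `statistics` module, chosen here so the example needs nothing installed; the alias `st` is an arbitrary choice for illustration.

```python
import statistics
import statistics as st

# Both names refer to the same module object.
same_module = st is statistics

# Functions are reached through the alias with a dot, just as with the full name.
average = st.mean([2, 4, 6])

print(same_module, average)
```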
We can read in from a local path or a url. Let’s read in the course map page of our course website.
pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html')
[ Unnamed: 0_level_0 topics \
week Unnamed: 1_level_1
0 1 [admin, python review]
1 2 Loading data, Python review
2 3 Exploratory Data Analysis
3 4 Data Cleaning
4 5 Databases, Merging DataFrames
5 6 Modeling, Naive Bayes, classification performa...
6 7 decision trees, cross validation
7 8 Regression
8 9 Clustering
9 10 SVM, parameter tuning
10 11 KNN, Model comparison
11 12 Text Analysis
12 13 Topic Modeling
13 14 Deep Learning
skills
Unnamed: 2_level_1
0 process
1 [access, prepare, summarize]
2 [summarize, visualize]
3 [prepare, summarize, visualize]
4 [access, construct, summarize]
5 [classification, evaluate]
6 [classification, evaluate]
7 [regression, evaluate]
8 [clustering, evaluate]
9 [optimize, tools]
10 [compare, tools]
11 [unstructured]
12 [unstructured, tools]
13 [tools, compare] ,
Unnamed: 0_level_0 skill \
keyword Unnamed: 1_level_1
0 python pythonic code writing
1 process describe data science as a process
2 access access data in multiple formats
3 construct construct datasets from multiple sources
4 summarize Summarize and describe data
5 visualize Visualize data
6 prepare prepare data for analysis
7 classification Apply classification
8 regression Apply Regression
9 clustering Clustering
10 evaluate Evaluate model performance
11 optimize Optimize model parameters
12 compare compare models
13 unstructured model unstructured data
14 workflow use industry standard data science tools and w...
Level 1 \
Unnamed: 2_level_1
0 python code that mostly runs, occasional pep8 ...
1 Identify basic components of data science
2 load data from at least one format; identify t...
3 identify what should happen to merge datasets ...
4 Describe the shape and structure of a dataset ...
5 identify plot types, generate basic plots from...
6 identify if data is or is not ready for analys...
7 identify and describe what classification is, ...
8 identify what data that can be used for regres...
9 describe what clustering is
10 Explain basic performance metrics for differen...
11 Identify when model parameters need to be opti...
12 Qualitatively compare model classes
13 Identify options for representing text data an...
14 Solve well strucutred problems with a single t...
Level 2 \
Unnamed: 3_level_1
0 python code that reliably runs, frequent pep8 ...
1 Describe and define each stage of the data sci...
2 Load data for processing from the most common ...
3 apply basic merges
4 compute summary statndard statistics of a whol...
5 generate multiple plot types with complete lab...
6 apply data reshaping, cleaning, and filtering ...
7 fit preselected classification model to a dataset
8 can fit linear regression models
9 apply basic clustering
10 Apply basic model evaluation metrics to a held...
11 Manually optimize basic model parameters such ...
12 Compare model classes in specific terms and fi...
13 Apply at least one representation to transform...
14 Solve semi-strucutred, completely specified pr...
Level 3
Unnamed: 4_level_1
0 reliable, efficient, pythonic code that consis...
1 Compare different ways that data science can f...
2 access data from both common and uncommon form...
3 merge data that is not automatically aligned
4 Compute and interpret various summary statisti...
5 generate complex plots with pandas and plottin...
6 apply data reshaping, cleaning, and filtering ...
7 fit and apply classification models and select...
8 can fit and explain regrularized or nonlinear ...
9 apply multiple clustering techniques, and inte...
10 Evaluate a model with multiple metrics and cro...
11 Select optimal parameters based of mutiple qua...
12 Evaluate tradeoffs between different model com...
13 apply multiple representations and compare and...
14 Scope, choose an appropriate tool pipeline and... ,
Unnamed: 0_level_0 A1 A2 \
keyword Unnamed: 1_level_1 Unnamed: 2_level_1
0 python 1 1
1 process 1 1
2 access 0 1
3 construct 0 0
4 summarize 0 0
5 visualize 0 0
6 prepare 0 0
7 classification 0 0
8 regression 0 0
9 clustering 0 0
10 evaluate 0 0
11 optimize 0 0
12 compare 0 0
13 unstructured 0 0
14 workflow 0 0
A3 A4 A5 \
Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1
0 1 1 0
1 0 0 0
2 1 1 0
3 0 0 1
4 1 1 1
5 1 1 0
6 0 1 1
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A6 A7 A8 \
Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 0
4 1 1 1
5 1 1 1
6 0 0 0
7 1 1 0
8 0 0 1
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A9 A10 A11 \
Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 1 1 1
5 1 1 1
6 0 0 0
7 0 1 0
8 0 0 1
9 1 0 1
10 0 1 1
11 0 1 1
12 0 0 1
13 0 0 0
14 0 1 1
A12 A13 # Assignments
Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1
0 0 0 4
1 0 0 2
2 0 0 3
3 0 0 2
4 1 1 11
5 1 1 10
6 0 0 2
7 0 0 3
8 0 0 2
9 0 0 2
10 0 0 2
11 0 0 2
12 0 1 2
13 1 1 2
14 1 1 4 ,
Unnamed: 0_level_0 Level 3 \
keyword Unnamed: 1_level_1
0 python reliable, efficient, pythonic code that consis...
1 process Compare different ways that data science can f...
2 access access data from both common and uncommon form...
3 construct merge data that is not automatically aligned
4 summarize Compute and interpret various summary statisti...
5 visualize generate complex plots with pandas and plottin...
6 prepare apply data reshaping, cleaning, and filtering ...
7 classification fit and apply classification models and select...
8 regression can fit and explain regrularized or nonlinear ...
9 clustering apply multiple clustering techniques, and inte...
10 evaluate Evaluate a model with multiple metrics and cro...
11 optimize Select optimal parameters based of mutiple qua...
12 compare Evaluate tradeoffs between different model com...
13 unstructured apply multiple representations and compare and...
14 workflow Scope, choose an appropriate tool pipeline and...
P1 P2 P3 P4
Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1
0 1 1 0 0
1 0 1 1 0
2 1 1 0 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 0 1 1 0
8 0 1 1 0
9 0 1 1 0
10 0 1 1 0
11 0 0 1 1
12 0 0 1 1
13 0 0 1 1
14 0 0 1 1 ]
This makes a list of pandas.DataFrame objects, one for each table found on the page. We can check that with the following:
Warning
This cell was added after class, but the explanation was given in class
type(pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html'))
list
To work with it, though, we should save it to a variable; then we can index into that list.
df_list = pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html')
df_list[0]
Unnamed: 0_level_0 | topics | skills | |
---|---|---|---|
week | Unnamed: 1_level_1 | Unnamed: 2_level_1 | |
0 | 1 | [admin, python review] | process |
1 | 2 | Loading data, Python review | [access, prepare, summarize] |
2 | 3 | Exploratory Data Analysis | [summarize, visualize] |
3 | 4 | Data Cleaning | [prepare, summarize, visualize] |
4 | 5 | Databases, Merging DataFrames | [access, construct, summarize] |
5 | 6 | Modeling, Naive Bayes, classification performa... | [classification, evaluate] |
6 | 7 | decision trees, cross validation | [classification, evaluate] |
7 | 8 | Regression | [regression, evaluate] |
8 | 9 | Clustering | [clustering, evaluate] |
9 | 10 | SVM, parameter tuning | [optimize, tools] |
10 | 11 | KNN, Model comparison | [compare, tools] |
11 | 12 | Text Analysis | [unstructured] |
12 | 13 | Topic Modeling | [unstructured, tools] |
13 | 14 | Deep Learning | [tools, compare] |
When you display DataFrames in Jupyter, they get nice formatting.
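Working with a list of DataFrames can be sketched without a network connection. The two small DataFrames below are hand-built stand-ins for the tables that `read_html` finds on a page; their contents are abbreviated and are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for the tables read_html would return from a page.
df_list = [
    pd.DataFrame({'week': [1, 2], 'topics': ['admin', 'Loading data']}),
    pd.DataFrame({'keyword': ['python', 'process', 'access']}),
]

# The list behaves like any other list: we can take its length and index it.
n_tables = len(df_list)
first_table = df_list[0]

# Each item is a DataFrame, so DataFrame attributes such as .shape are available.
shapes = [df.shape for df in df_list]

print(type(df_list), n_tables)
print(shapes)
```

Checking `len(df_list)` and each `.shape` first is a quick way to find which index holds the table you want.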
Review & Further Reading¶
imported pandas and read data from a website
If you’ve made it this far, let me know how you found these notes.