Data Science Achievements#
In this course there are 5 learning outcomes that I expect you to achieve by the end of the semester. To get there, you’ll focus on 15 smaller achievements that will be the basis of your grade. This section describes how the topics, learning outcomes, and achievements are distributed over the semester. In the next section, you’ll see how these achievements turn into grades.
Learning Outcomes#
By the end of the semester, you should be able to:
(process) Describe the process of data science, define each phase, and identify standard tools
(data) Access and combine data in multiple formats for analysis
(exploratory) Perform exploratory data analyses including descriptive statistics and visualization
(modeling) Select models for data by applying and evaluating multiple models to a single dataset
(communicate) Communicate solutions to problems with data in common industry formats
We will build your skill in the process and communicate outcomes over the whole semester. The middle three skills will correspond roughly to the content taught for each of the first three portfolio checks.
Schedule#
The course will meet MWF 2-2:50pm in Ranger 302. Every class will include participatory live coding (the instructor types code while explaining; students follow along) and small exercises for you to progress toward level 1 achievements of the new skills introduced in class that day.
Each assignment will have a deadline posted on its page. Portfolio deadlines will be announced at least 2 weeks in advance.
| week | topics | skills |
|---|---|---|
| 1 | admin, python review | process |
| 2 | Loading data, Python review | access, prepare, summarize |
| 3 | Exploratory Data Analysis | summarize, visualize |
| 4 | Data Cleaning | prepare, summarize, visualize |
| 5 | Databases, Merging DataFrames | access, construct, summarize |
| 6 | Modeling, classification performance metrics, cross validation | evaluate |
| 7 | Naive Bayes, decision trees | classification, evaluate |
| 8 | Regression | regression, evaluate |
| 9 | Clustering | clustering, evaluate |
| 10 | SVM, parameter tuning | optimize, tools |
| 11 | KNN, Model comparison | compare, tools |
| 12 | Text Analysis | unstructured |
| 13 | Image Analysis | unstructured, tools |
| 14 | Deep Learning | tools, compare |
Achievement Definitions#
The table below describes how your participation, assignments, and portfolios will be assessed to earn each achievement. The keyword for each skill is a short name used to refer to that skill throughout the course materials; the full description of each skill is in this table.
| keyword | skill | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|
| python | pythonic code writing | python code that mostly runs, occasional pep8 adherence | python code that reliably runs, frequent pep8 adherence | reliable, efficient, pythonic code that consistently adheres to pep8 |
| process | describe data science as a process | Identify basic components of data science | Describe and define each stage of the data science process | Compare different ways that data science can facilitate decision making |
| access | access data in multiple formats | load data from at least one format; identify the most common data formats | Load data for processing from the most common formats; compare and contrast the most common formats | access data from both common and uncommon formats and identify best practices for formats in different contexts |
| construct | construct datasets from multiple sources | identify what should happen to merge datasets or when they can be merged | apply basic merges | merge data that is not automatically aligned |
| summarize | Summarize and describe data | Describe the shape and structure of a dataset in basic terms | compute standard summary statistics of a whole dataset and grouped data | Compute and interpret various summary statistics of subsets of data |
| visualize | Visualize data | identify plot types, generate basic plots from pandas | generate multiple plot types with complete labeling with pandas and seaborn | generate complex plots with pandas and plotting libraries and customize with matplotlib or additional parameters |
| prepare | prepare data for analysis | identify if data is or is not ready for analysis, potential problems with data | apply data reshaping, cleaning, and filtering as directed | apply data reshaping, cleaning, and filtering manipulations reliably and correctly by assessing data as received |
| evaluate | Evaluate model performance | Explain basic performance metrics for different data science tasks | Apply and interpret basic model evaluation metrics to a held out test set | Evaluate a model with multiple metrics and cross validation |
| classification | Apply classification | identify and describe what classification is, apply pre-fit classification models | fit, apply, and interpret a preselected classification model to a dataset | fit and apply classification models and select appropriate classification models for different contexts |
| regression | Apply Regression | identify what data that can be used for regression looks like | fit and interpret linear regression models | fit and explain regularized or nonlinear regression |
| clustering | Clustering | describe what clustering is | apply basic clustering | apply multiple clustering techniques, and interpret results |
| optimize | Optimize model parameters | Identify when model parameters need to be optimized | Optimize basic model parameters such as model order | Select optimal parameters based on multiple quantitative criteria and automate parameter tuning |
| compare | compare models | Qualitatively compare model classes | Compare model classes in specific terms and fit models in terms of traditional model performance metrics | Evaluate tradeoffs between different model comparison types |
| representation | Choose representations and transform data | Identify options for representing text and categorical data in many contexts | Apply at least one representation to transform unstructured or inappropriately formatted data for model fitting or summarizing | apply transformations in different contexts OR compare and contrast multiple representations of a single type of data in terms of model performance |
| workflow | use industry standard data science tools and workflows to solve data science problems | Solve well-structured, fully specified problems with a single tool pipeline | Solve well-structured, open-ended problems; apply common structure to learn new features of standard tools | Independently scope and solve realistic data science problems OR independently learn related tools and describe strengths and weaknesses of common tools |
Assignments and Skills#
Using the keywords from the table above, this table shows in which assignments you will be able to demonstrate each skill, and the total number of assignments that assess it. This total is the number of opportunities you have to earn Level 2 while still preserving 2 chances to earn Level 3 for each skill.
| keyword | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | # Assignments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| python | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| process | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 7 |
| access | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| construct | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
| summarize | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 |
| visualize | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 |
| prepare | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| evaluate | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 5 |
| classification | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |
| regression | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |
| clustering | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 |
| optimize | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
| compare | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
| representation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| workflow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 4 |
Warning
process achievements accumulate a little more slowly. Prior to portfolio check 1, only level 1 can be earned. Portfolio check 1 is the first chance to earn level 2 for process; level 3 can then be earned on portfolio check 2 or later.
Portfolios and Skills#
The objective of your portfolio submissions is to earn Level 3 achievements. The following table shows what Level 3 looks like for each skill and identifies in which portfolio submissions you can earn Level 3 for that skill.
| keyword | Level 3 | P1 | P2 | P3 | P4 |
|---|---|---|---|---|---|
| python | reliable, efficient, pythonic code that consistently adheres to pep8 | 1 | 1 | 0 | 1 |
| process | Compare different ways that data science can facilitate decision making | 0 | 1 | 1 | 1 |
| access | access data from both common and uncommon formats and identify best practices for formats in different contexts | 1 | 1 | 0 | 1 |
| construct | merge data that is not automatically aligned | 1 | 1 | 0 | 1 |
| summarize | Compute and interpret various summary statistics of subsets of data | 1 | 1 | 0 | 1 |
| visualize | generate complex plots with pandas and plotting libraries and customize with matplotlib or additional parameters | 1 | 1 | 0 | 1 |
| prepare | apply data reshaping, cleaning, and filtering manipulations reliably and correctly by assessing data as received | 1 | 1 | 0 | 1 |
| evaluate | Evaluate a model with multiple metrics and cross validation | 0 | 1 | 1 | 1 |
| classification | fit and apply classification models and select appropriate classification models for different contexts | 0 | 1 | 1 | 1 |
| regression | fit and explain regularized or nonlinear regression | 0 | 1 | 1 | 1 |
| clustering | apply multiple clustering techniques, and interpret results | 0 | 1 | 1 | 1 |
| optimize | Select optimal parameters based on multiple quantitative criteria and automate parameter tuning | 0 | 0 | 1 | 1 |
| compare | Evaluate tradeoffs between different model comparison types | 0 | 0 | 1 | 1 |
| representation | apply transformations in different contexts OR compare and contrast multiple representations of a single type of data in terms of model performance | 0 | 0 | 1 | 1 |
| workflow | Independently scope and solve realistic data science problems OR independently learn related tools and describe strengths and weaknesses of common tools | 0 | 0 | 1 | 1 |
Detailed Checklists#
python-level1#
python code that mostly runs, occasional pep8 adherence
logical use of control structures
callable functions
correct calls to functions
correct use of variables
use of logical operators
python-level2#
python code that reliably runs, frequent pep8 adherence
descriptive variable names
pythonic loops
efficient use of return vs side effects in functions
correct, effective use of builtin python iterable types (lists & dictionaries)
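For instance, a minimal sketch (with example data invented here) of returning values instead of relying on side effects, and of iterating pythonically over built-in iterable types:

```python
def word_lengths(words):
    """Return a dict mapping each word to its length (a return value, not a side effect)."""
    return {word: len(word) for word in words}

def longest_word(words):
    """Pythonic loop: iterate over items directly instead of indexing by position."""
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest

lengths = word_lengths(["data", "science", "python"])
# lengths == {"data": 4, "science": 7, "python": 6}
```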
python-level3#
reliable, efficient, pythonic code that consistently adheres to pep8
pep8 adherent variable, file, class, and function names
effective use of multi-paradigm abilities for efficiency gains
easy-to-read code that prioritizes readability over other rules
process-level1#
Identify basic components of data science
identify component disciplines OR
identify phases
process-level2#
Describe and define each stage of the data science process
correctly defines stages
identifies stages in use
describes general goals as well as specific processes
process-level3#
Compare different ways that data science can facilitate decision making
describes exceptions to process and iteration in process
connects choices at one phase to impacts in other phases
connects data science steps to real world decisions
access-level1#
load data from at least one format; identify the most common data formats
use at least one pandas read_ function correctly
name common types
describe the structure of common types
access-level2#
Load data for processing from the most common formats; Compare and constrast most common formats
load data from at least two of (.csv, .tsv, .dat, database, .json)
describe advantages and disadvantages of the most common types
describe how the most common types differ
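A small sketch of what loading two of these formats might look like; the inline strings below stand in for files on disk and are invented for illustration:

```python
import io

import pandas as pd

# Invented inline data standing in for files on disk.
csv_text = "name,score\nAda,90\nGrace,95\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Grace", "score": 95}]'

df_csv = pd.read_csv(io.StringIO(csv_text))     # comma-separated values
df_json = pd.read_json(io.StringIO(json_text))  # JSON records

# Both calls produce equivalent DataFrames with 2 rows and 2 columns.
```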
access-level3#
access data from both common and uncommon formats and identify best practices for formats in different contexts
load data from at least 1 uncommon format
describe when one format is better than another
construct-level1#
identify what should happen to merge datasets or when they can be merged
identify what the structure of a merged dataset should be (size, shape, columns)
identify when datasets can or cannot be merged
construct-level2#
apply basic merges
use 3 different types of merges
choose the right type of merge for realistic scenarios
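As a sketch of the basic merge types (the DataFrames below are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "city": ["NYC", "LA", "CHI"]})
right = pd.DataFrame({"id": [2, 3, 4], "pop": [4.0, 2.7, 0.7]})

inner = left.merge(right, on="id", how="inner")  # only ids present in both: 2, 3
left_j = left.merge(right, on="id", how="left")  # all ids from left; pop is NaN for id 1
outer = left.merge(right, on="id", how="outer")  # union of ids: 1 through 4
```

A realistic scenario drives the choice: an inner merge keeps only matched records, a left merge preserves the primary table, and an outer merge keeps everything.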
construct-level3#
merge data that is not automatically aligned
manipulate data to make it mergeable
identify how to combine data from many sources to answer a question
implement steps to combine data from multiple sources
summarize-level1#
Describe the shape and structure of a dataset in basic terms
use attributes to produce a description of a dataset
display parts of a dataset
summarize-level2#
compute and interpret standard summary statistics of a whole dataset and grouped data
compute descriptive statistics on whole datasets
apply individual statistics to datasets
group data by a categorical variable for analysis
apply split-apply-combine paradigm to analyze data
interpret statistics on whole datasets
interpret statistics on subsets of data
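A minimal sketch of the split-apply-combine paradigm, using a toy dataset invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [2.0, 4.0, 6.0, 8.0],
})

# Whole-dataset statistic
overall_mean = df["length"].mean()  # 5.0

# Split-apply-combine: split by the categorical variable,
# apply the mean to each group, combine into one result
group_means = df.groupby("species")["length"].mean()
# species a -> 3.0, species b -> 7.0
```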
summarize-level3#
Compute and interpret various summary statistics of subsets of data
produce custom aggregation tables to summarize datasets
compute multivariate summary statistics by grouping
compute custom calculations on datasets
visualize-level1#
identify plot types, generate basic plots from pandas
generate at least two types of plots with pandas
identify plot types by name
interpret basic information from plots
visualize-level2#
generate multiple plot types with complete labeling with pandas and seaborn
generate at least 3 types of plots
use correct, complete, legible labeling on plots
plot using both pandas and seaborn
interpret multiple types of plots to draw conclusions
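One possible sketch of a completely labeled plot with pandas (seaborn follows the same pattern); the data is invented, and the non-interactive backend is only so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "users": [10, 30, 25]})

# pandas plotting uses matplotlib under the hood and returns an Axes object
ax = df.plot(x="year", y="users", kind="line", title="Users per year")
ax.set_xlabel("Year")           # complete, legible labeling
ax.set_ylabel("Active users")
```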
visualize-level3#
generate complex plots with pandas and plotting libraries and customize with matplotlib or additional parameters
use at least two libraries to plot
generate figures with subplots
customize the display of a plot to be publication ready
interpret plot types and explain them for novices
choose appropriate plot types to convey information
explain why plotting common best practices are effective
prepare-level1#
identify if data is or is not ready for analysis, potential problems with data
identify problems in a dataset
anticipate how potential data issues will interfere with analysis
describe the structure of tidy data
label data as tidy or not
prepare-level2#
apply data reshaping, cleaning, and filtering as directed
reshape data to be analyzable as directed
filter data as directed
rename columns as directed
rename values to make data more analyzable
handle missing values in at least two ways
transform data to tidy format
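A small sketch of reshaping to tidy format and handling missing values in two ways, on invented data:

```python
import pandas as pd

# Wide ("untidy") data: one column per year
wide = pd.DataFrame({
    "city": ["NYC", "LA"],
    "2021": [100, None],
    "2022": [110, 95],
})

# Reshape to tidy/long format: one row per observation
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")

# Two ways to handle missing values
dropped = tidy.dropna()             # remove incomplete rows
filled = tidy.fillna({"sales": 0})  # or fill with a default value
```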
prepare-level3#
apply data reshaping, cleaning, and filtering manipulations reliably and correctly by assessing data as received
identify issues in a dataset and correctly implement solutions
convert variable representation by changing types
change variable representation using one hot encoding
evaluate-level1#
Explain basic performance metrics for different data science tasks
define at least two performance metrics
describe how those metrics compare or compete
evaluate-level2#
Apply and interpret basic model evaluation metrics to a held out test set
apply at least three performance metrics to models
apply metrics to subsets of data
apply disparity metrics
interpret at least three metrics
evaluate-level3#
Evaluate a model with multiple metrics and cross validation
explain cross validation
explain importance of held out test and validation data
describe why cross validation is important
identify appropriate metrics for different types of modeling tasks
use multiple metrics together to create a more complete description of a model’s performance
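As a sketch of cross validation with more than one metric, using scikit-learn's generic API on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data
X, y = make_classification(n_samples=200, random_state=0)

# 5-fold cross validation scored with two metrics at once
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=5, scoring=["accuracy", "f1"])

# scores["test_accuracy"] and scores["test_f1"] each hold 5 per-fold scores,
# giving a more complete description than a single train/test split
```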
classification-level1#
identify and describe what classification is, apply pre-fit classification models
describe what classification is
describe what a dataset must look like for classification
identify applications of classification in the real world
describe set up for a classification problem (test/train)
classification-level2#
fit, apply, and interpret preselected classification model to a dataset
split data for training and testing
fit a classification model
apply a classification model to obtain predictions
interpret the predictions of a classification model
examine parameters of at least one fit classifier to explain how the prediction is made
differentiate between model fitting and generating predictions
evaluate how model parameters impact model performance
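A minimal sketch of the fit/predict workflow with a held-out test set, using a standard scikit-learn example dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)     # fitting learns the model parameters
y_pred = clf.predict(X_test)  # predicting applies the fit model to new data

# Examine the fit model: feature_importances_ shows what drives predictions
importances = clf.feature_importances_
```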
classification-level3#
fit and apply classification models and select appropriate classification models for different contexts
choose appropriate classifiers based on application context
explain how at least 3 different classifiers make predictions
evaluate how model parameters impact model performance and justify choices when tradeoffs are necessary
regression-level1#
identify what data that can be used for regression looks like
identify data that is/not appropriate for regression
describe univariate linear regression
identify applications of regression in the real world
regression-level2#
fit and interpret linear regression models
split data for training and testing
fit univariate linear regression models
interpret linear regression models
fit multivariate linear regression models
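A sketch of fitting and interpreting a univariate linear regression, on synthetic data with a known slope and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 2x + 1
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1

reg = LinearRegression().fit(X, y)

# Interpret the fit: coef_ holds the slope(s), intercept_ the offset
slope, intercept = reg.coef_[0], reg.intercept_
```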
regression-level3#
fit and explain regularized or nonlinear regression
fit nonlinear or regularized regression models
interpret and explain nonlinear or regularized regression models
clustering-level1#
describe what clustering is
differentiate clustering from classification and regression
identify applications of clustering in the real world
clustering-level2#
apply basic clustering
fit k-means
interpret k-means results
evaluate clustering models
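A minimal k-means sketch on invented, well-separated data, with one common way to evaluate the clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated, invented blobs of points
X = np.vstack([np.random.RandomState(0).normal(0, 0.5, (20, 2)),
               np.random.RandomState(1).normal(5, 0.5, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Evaluate: a silhouette score near 1 means tight, well-separated clusters
score = silhouette_score(X, km.labels_)
```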
clustering-level3#
apply multiple clustering techniques, and interpret results
apply at least two clustering techniques
explain the differences between two clustering models
optimize-level1#
Identify when model parameters need to be optimized
identify when parameters might impact model performance
optimize-level2#
Optimize basic model parameters such as model order
automatically optimize multiple parameters
evaluate potential tradeoffs
interpret optimization results in context
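One common way to automate parameter tuning is a cross-validated grid search; this sketch tunes tree depth (a model-order parameter) on a standard example dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Automatically search over a model-order parameter (tree depth),
# scoring each candidate with 5-fold cross validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 5]}, cv=5)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]  # depth with the best CV score
```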
optimize-level3#
Select optimal parameters based on multiple quantitative criteria and automate parameter tuning
optimize models based on multiple metrics
describe when one model vs another is most appropriate
compare-level1#
Qualitatively compare model classes
compare models within the same task on complexity
compare-level2#
Compare model classes in specific terms and fit models in terms of traditional model performance metrics
compare models in multiple terms
interpret cross model comparisons in context
compare-level3#
Evaluate tradeoffs between different model comparison types
compare models on multiple criteria
compare optimized models
jointly interpret optimization result and compare models
compare models on quantitative and qualitative measures
representation-level1#
Identify options for representing text and categorical data in many contexts
describe the basic goals for changing the representation of data
representation-level2#
Apply at least one representation to transform unstructured or inappropriately formatted data for model fitting or summarizing
transform text or image data for use with ML
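A sketch of one text representation, bag-of-words counts, on invented documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data science is fun", "science needs data"]

# Bag-of-words: each document becomes a vector of word counts
vec = CountVectorizer()
X = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)
# X is a 2 x 5 count matrix, a numeric form ready for model fitting
```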
representation-level3#
apply transformations in different contexts OR compare and contrast multiple representations of a single type of data in terms of model performance
transform both text and image data for use in ML
evaluate the impact of representation on model performance
workflow-level1#
Solve well-structured, fully specified problems with a single tool pipeline
pseudocode out the steps to answer basic data science questions
workflow-level2#
Solve well-structured, open-ended problems; apply common structure to learn new features of standard tools
plan and execute the steps needed to answer an open-ended question
describe the necessary steps and tools
workflow-level3#
Independently scope and solve realistic data science problems OR independently learn related tools and describe strengths and weaknesses of common tools
scope and solve realistic data science problems
compare different data science tool stacks