Due: 2023-10-02 end of day
Submission¶
Solo¶
Add your work to an assignment2 branch in your portfolio and you do not need to edit the
Group¶
Coordinate so that the first person creates the team when they accept the assignment and the second (and third) join the same team when they accept the assignment.
Each person should upload their work to a branch named d1, d2, or d3, matching the dataset checklist they followed from below, and open a PR. (Each person should do a different dataset.)
In your portfolio, make an issue with a link to your team’s repo.
Related notes¶
Choose Datasets¶
Each dataset must have at least three variables, but can have more. Both datasets must have multiple types of variables. These can be datasets you used for Assignment 1, if they meet the criteria below. Every dataset must be different from the others, even within a group.
Dataset 1 (d1)¶
must include at least:
two continuous valued variables and
one categorical variable.
Dataset 2 (d2)¶
must include at least:
two categorical variables and
one continuous valued variable
Dataset 3 (d3)¶
must include at least:
two continuous valued variables and
one categorical variable.
EDA¶
Use a separate notebook for each dataset; name them dataset_0x.ipynb, where x is the number of the checklist you are following.
For each dataset, in the corresponding notebook complete the following:
Load the data to a notebook as a DataFrame from a URL or local path; if local, include the data file in your repository.
Write a short description of what the data contains and what it could be used for.
Explore the dataset in a notebook enough to describe its structure. Use the heading ## Description and include at least the following with interpretation. What does the structure imply about the conclusions you can draw from this data? Are there limitations in how to safely interpret the data that the summary helps you see? Are the variables what you expect?
shape
columns
variable types
overall summary statistics
Ask and answer at least 3 questions by using and interpreting statistics and visualizations as appropriate. Include a heading for each question using a markdown cell and H2: ##. Make sure your analyses meet the criteria in the checklists below. Use the checklists to think of what kinds of questions would use those types of analyses and to help shape your questions. Your questions can be related, or can be different levels of detail or views on a big-picture question, as long as the analysis addresses the checklist.
Describe what, if anything, might need to be done to clean or prepare this data for further analysis in a final ## Future analysis markdown cell in your notebook.
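As a minimal sketch of the loading and description steps above, the following assumes pandas and uses a small made-up CSV held in a string so that it runs anywhere. The column names (group, site, height_cm, weight_kg) and values are hypothetical; with a real dataset the argument to pd.read_csv would be your URL or local path instead.

```python
import io

import pandas as pd

# Hypothetical stand-in data; replace io.StringIO(csv_text) with a URL
# or a local path to a file that is included in your repository.
csv_text = """group,site,height_cm,weight_kg
a,east,150.0,50
a,west,160.0,60
b,west,165.0,62
b,west,175.0,74
c,east,170.0,70
c,east,185.0,86
"""
example_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = example_df.shape        # (number of rows, number of columns)
column_names = list(example_df.columns)  # variable names
column_types = example_df.dtypes         # one dtype per column
summary = example_df.describe()          # count, mean, std, min, quartiles, max (numeric columns)

print(n_rows, n_cols)
print(column_names)
print(column_types)
print(summary)
```

Each of these outputs still needs interpretation in the notebook; for example, describe() skips the non-numeric columns by default, which is itself a hint about the variable types.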
(overall) Question checklist¶
be sure that every question has:
a heading
at least 1 statistic or 1 plot (both is generally better)
interpretation of the statistic or plot that answers the question
the question does not include the name of the statistic or plot in it
Dataset 1 Checklist¶
make sure that your dataset_01.ipynb has:
Overall summary statistics grouped by a categorical variable
A single statistic grouped by a categorical variable
at least one plot that uses 3 total variables
a plot and summary table that convey the same information. This can be one statistic or many.
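One possible way to cover the Dataset 1 checklist items is sketched below, assuming pandas and matplotlib (the assignment does not name a plotting library, so matplotlib is an assumption here). The DataFrame, its column names, and its values are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one categorical and two continuous variables
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "height_cm": [150.0, 160.0, 170.0, 165.0, 175.0, 185.0],
    "weight_kg": [50.0, 60.0, 70.0, 62.0, 74.0, 86.0],
})

# Overall summary statistics grouped by a categorical variable
grouped_summary = toy_df.groupby("group").describe()

# A single statistic grouped by a categorical variable (also the summary table)
mean_height = toy_df.groupby("group")["height_cm"].mean()

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# A plot that conveys the same information as the mean_height table
mean_height.plot.bar(ax=ax_bar)
ax_bar.set_ylabel("mean height (cm)")

# A plot that uses 3 variables: x, y, and color by category
for name, subset in toy_df.groupby("group"):
    ax_scatter.scatter(subset["height_cm"], subset["weight_kg"], label=name)
ax_scatter.set_xlabel("height (cm)")
ax_scatter.set_ylabel("weight (kg)")
ax_scatter.legend(title="group")
```

In a notebook the Agg line can be dropped so the figure renders inline.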
Dataset 2 Checklist¶
make sure that your dataset_02.ipynb has:
overall summary statistics
two individual summary statistics for one variable
one summary statistic grouped by two categorical variables
a figure with a grid of subplots that correspond to two categorical variables
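The statistics items in the Dataset 2 checklist could look roughly like the sketch below, assuming pandas. The data and the column names (region, plan, monthly_cost) are hypothetical, and the grid of subplots is not shown here.

```python
import pandas as pd

# Hypothetical data: two categorical variables and one continuous variable
cost_df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north", "south"],
    "plan": ["basic", "premium", "basic", "premium", "basic", "premium"],
    "monthly_cost": [10.0, 30.0, 12.0, 28.0, 14.0, 32.0],
})

# Two individual summary statistics for one variable
cost_mean = cost_df["monthly_cost"].mean()
cost_max = cost_df["monthly_cost"].max()

# One summary statistic grouped by two categorical variables;
# unstack() moves the second grouping level (plan) into the columns
mean_by_region_plan = (
    cost_df.groupby(["region", "plan"])["monthly_cost"].mean().unstack()
)

print(cost_mean, cost_max)
print(mean_by_region_plan)
```

The rows-by-columns layout of the unstacked table mirrors how a grid of subplots over the same two categorical variables would be arranged.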
Dataset 3 Checklist¶
This is only for groups of 3
make sure that your dataset_03.ipynb has:
overall summary statistics
two individual summary statistics for one variable
at least one plot that uses 3 total variables
a plot and summary table that convey the same information. This can be one statistic or many.
Peer Review¶
If you work alone and complete 2 analyses you do not need to do this, but you might review the questions because they are similar to how we will grade.
With a partner (or in a group of 3, where person 1 reviews person 2’s work, 2 reviews 3’s, and 3 reviews 1’s), read your partner’s notebook and complete a peer review on their pull request. You can do peer review when you have done most of your analysis and explanation, even if some parts of the code do not work.
You will complete your review on the PR by reviewing it. If you want a big-picture overview of that, the GitHub PR review “course” is a good place to go; it is designed to take less than 30 minutes.
(optional) Use PR comments to denote places that are confusing, or to share solutions you see to problems your classmate could not solve. This is hard on notebook files, so it is okay to skip.
Prepare to submit your review
Use the list of questions below for your summary review (copy the template into the box and fill it in).
Your review should use the template for organization, but the questions guide what sorts of aspects to consider across the sections.
Review Questions¶
Describe how it was to read the analysis overall. Was it easy? hard? cohesive? jumpy?
How did the data summaries help prepare you to read the rest of the analysis? What do you think might be missing?
For each question, consider the following and write any tips for improvement
Does the question make sense based on the data? How does it relate to the real world? Is there a reasonable audience? How could the question be improved?
How well do the statistics and plots match the question?
Are the interpretations complete, clear, and consistent with the statistics and plots?
What could be done to make the explanations more clear and complete?
What additional analysis might make the analysis more compelling and clear?
Template¶
## Overall
<!-- Describe how it was to read the analysis overall. Was it easy? hard? cohesive? jumpy? -->
## Intro
## Question 1
## Question 2
## Question 3
Response¶
Respond to the review on your notebook either with inline comments, replies, or by updating your analysis accordingly.
Tips and Hints¶
Remember you can also use masking in your EDA, even though we did not do any in class.
To ensure you understand the checklist, you can optionally make an issue using the appropriate issue type from your repo and fill in what it should be, to get early feedback that you are on track.
Variable types are in the notes.
The DataFrame API reference shows all the methods (and more) grouped by high-level concepts.
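For the masking tip above, a minimal sketch of boolean masking in pandas follows. The DataFrame and its columns (team, score) are made up for illustration.

```python
import pandas as pd

# Hypothetical data
scores_df = pd.DataFrame({
    "team": ["x", "y", "x", "y", "x"],
    "score": [4.0, 9.0, 7.0, 2.0, 8.0],
})

# A mask is a boolean Series with one True/False value per row
high_score = scores_df["score"] > 5
team_x = scores_df["team"] == "x"

# Indexing with a mask keeps only the rows where it is True
high_rows = scores_df[high_score]

# Masks combine with & (and) and | (or); a statistic can then be
# computed on just the selected rows
high_team_x_mean = scores_df.loc[high_score & team_x, "score"].mean()

print(high_rows)
print(high_team_x_mean)
```

Masking like this is one way to answer a question about a subset of the data without creating a new file or editing the original.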