3. Assignment 3: Exploratory Data Analysis

__Due: 2022-09-28 11:59pm__

Template repo for submission

3.1. Objective & Evaluation

This week your goal is to do a small exploratory data analysis for two datasets of your choice.

Eligible Skills:

  • summarize

  • visualize

  • access

3.2. Choose Datasets

Each dataset must have at least three variables, but can have more. Both datasets must have multiple types of variables. You may reuse datasets from last week if they meet the criteria below.

3.2.1. Dataset 1 (d1)

must include at least:

  • two continuous valued variables and

  • one categorical variable.

3.2.2. Dataset 2 (d2)

must include at least:

  • two categorical variables and

  • one continuous valued variable
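One way to check a candidate dataset against these criteria is to count its columns by type. The sketch below is only an illustration: the toy DataFrame and the helper name `count_variable_types` are hypothetical, and it assumes that float columns are continuous and non-numeric columns are categorical, which will not hold for every dataset (integer columns, for example, need a judgment call).

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "height_cm": [150.2, 161.5, 172.1, 180.3],
    "weight_kg": [50.1, 61.0, 70.4, 82.2],
    "species": ["a", "b", "a", "b"],
    "site": ["north", "north", "south", "south"],
})


def count_variable_types(frame):
    """Return (continuous, categorical) column names.

    Assumption: float columns are continuous and non-numeric
    columns are categorical. Integer columns are left out because
    they could be either.
    """
    continuous = frame.select_dtypes(include="float").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return continuous, categorical


continuous, categorical = count_variable_types(df)
print(len(continuous), "continuous:", continuous)
print(len(categorical), "categorical:", categorical)
```

This toy frame would qualify as either d1 or d2, since it has two columns of each kind.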

3.3. EDA

Use a separate notebook for each dataset; name them dataset_01.ipynb and dataset_02.ipynb.

For each dataset, in a dedicated notebook, complete the following:

  1. Load the data into a notebook as a DataFrame from a URL or a local path. If you load from a local path, include the data file in your repository.

  2. Explore the dataset in a notebook enough to describe its structure, under the heading ## Description. Include:

    • shape

    • columns

    • variable types

    • overall summary statistics

  3. Write a short description of what the data contains and what it could be used for.

  4. Ask and answer 4 questions by using and interpreting statistics and visualizations as appropriate. Include a level-2 heading (##) for each question in a markdown cell. Make sure your analyses meet the criteria in the checklists below.

  5. Describe what, if anything, might need to be done to clean or prepare this data for further analysis in a final ## Future analysis markdown cell in your notebook.
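Steps 1 and 2 can be sketched roughly as follows. The CSV text here is made up and read through `io.StringIO` only so the example runs without a network connection or a data file; in a real notebook you would pass a URL or a relative path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a URL or local path (hypothetical data);
# pd.read_csv accepts either in place of this buffer.
csv_text = """height_cm,weight_kg,species
150.2,50.1,a
161.5,61.0,b
172.1,70.4,a
180.3,82.2,b
"""
df = pd.read_csv(io.StringIO(csv_text))

shape = df.shape            # (number of rows, number of columns)
columns = list(df.columns)  # column names
dtypes = df.dtypes          # variable type of each column
summary = df.describe()     # summary statistics for numeric columns

print(shape)
print(columns)
print(dtypes)
print(summary)
```

By default `describe` only summarizes numeric columns; passing `include="all"` adds counts and unique values for the others.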

3.3.1. Question checklist

Be sure that every question (all eight, 4 per dataset) has:

  • a heading

  • at least 1 statistic or plot

  • interpretation that answers the question

3.3.2. Dataset 1 Checklist

Make sure that your dataset_01.ipynb has:

  • Overall summary statistics grouped by a categorical variable

  • A single statistic grouped by a categorical variable

  • at least one plot that uses 3 total variables

  • a plot and summary table that convey the same information. This can be one statistic or many.
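The items in this checklist might look something like the sketch below. The data and column names are hypothetical, and plain matplotlib is used only to keep the example self-contained; any plotting library that can show three variables at once would serve equally well.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two continuous variables and one categorical
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 155.0, 175.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 54.0, 76.0],
    "species": ["a", "b", "a", "b", "a", "b"],
})

# Overall summary statistics grouped by a categorical variable
grouped_summary = df.groupby("species").describe()

# A single statistic grouped by a categorical variable
mean_height = df.groupby("species")["height_cm"].mean()

# One plot using three variables: x, y, and color by category
fig, ax = plt.subplots()
for name, group in df.groupby("species"):
    ax.scatter(group["height_cm"], group["weight_kg"], label=name)
ax.set_xlabel("height_cm")
ax.set_ylabel("weight_kg")
ax.legend(title="species")

# A summary table and a plot that convey the same information
mean_table = df.groupby("species")[["height_cm", "weight_kg"]].mean()
bar_ax = mean_table.plot(kind="bar")
```

Plotting directly from `mean_table` is one way to guarantee that the table and the bar chart show the same numbers.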

3.3.3. Dataset 2 Checklist

Make sure that your dataset_02.ipynb has:

  • two individual summary statistics for one variable

  • one summary statistic grouped by two categorical variables

  • a figure with a grid of subplots that correspond to two categorical variables

  • a plot and summary table that convey the same information. This can be one statistic or many.
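As a rough illustration of these items, the sketch below uses made-up data with two categorical variables and one continuous variable. The grid is built by hand with `plt.subplots` to keep the example dependency-light; this is an assumption about tooling, not a requirement, and seaborn's `catplot` with `row` and `col` is another way to get a similar grid.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two categorical variables and one continuous
df = pd.DataFrame({
    "site": ["north", "north", "south", "south"] * 2,
    "season": ["wet", "dry"] * 4,
    "rainfall_mm": [120.0, 30.0, 90.0, 20.0, 100.0, 50.0, 110.0, 40.0],
})

# Two individual summary statistics for one variable
rain_mean = df["rainfall_mm"].mean()
rain_max = df["rainfall_mm"].max()

# One summary statistic grouped by two categorical variables
mean_by_site_season = df.groupby(["site", "season"])["rainfall_mm"].mean()

# The same statistic as a table: sites as rows, seasons as columns
mean_table = mean_by_site_season.unstack()

# Grid of subplots: one row per site, one column per season
sites = sorted(df["site"].unique())
seasons = sorted(df["season"].unique())
fig, axes = plt.subplots(
    len(sites), len(seasons), sharex=True, sharey=True, squeeze=False
)
for i, site in enumerate(sites):
    for j, season in enumerate(seasons):
        subset = df[(df["site"] == site) & (df["season"] == season)]
        axes[i, j].hist(subset["rainfall_mm"], bins=5)
        axes[i, j].set_title(f"{site} / {season}")
```

Calling `mean_table.plot(kind="bar")` would give a plot that matches the unstacked table, which is one way to cover the last checklist item.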

Tip

Be sure to start early and use help hours to make sure you have a plan for all of these.

3.4. Peer Review

Note

This is optional, but if you do a review, you only need to do one analysis each.

Warning

Be familiar with the collaboration policy before you choose to go this route.

With a partner (or in a group of 3, where person 1 reviews person 2's work, 2 reviews 3, and 3 reviews 1), read your partner's notebook and complete a peer review on their pull request. You can do the peer review once you have done most of your analysis and explanation, even if some parts of the code do not work. After you each do your reviews, update your own analysis.

3.4.1. Review

In your review:

  • Use inline comments to mark places that are confusing, or where you see solutions to problems your classmate could not solve

  • Keep the questions below in mind

  • Use the template below for your summary review

3.4.1.1. Review Questions

  1. How was the analysis to read overall? Easy? Hard? Cohesive? Jumpy?

  2. Did the data summaries tell you enough about the data to understand the analysis and anticipate what kinds of questions could be answered? If not, what questions do you still have about the data?

  3. Do the questions make sense based on the data? Are they interesting questions? What could improve the questions?

  4. Are the statistics and plots appropriate for the questions?

  5. Are the interpretations complete, clear, and consistent with the statistics and plots?

  6. What could be done to make the explanations more clear and complete?

  7. What additional analysis might make the analysis more compelling and clear?

3.4.1.2. Review Template

<!-- delete sections that are not needed -->
## Overall  

This analysis was ...

## Data Summaries

- [ ] complete

To understand this analysis I still need to know ...

## Checklist

- [ ] questions fit the data
- [ ] questions are in natural language
- [ ] chosen statistics and plots match questions
- [ ] all statistics and plots have an interpretation in English

## Areas of improvement

3.4.2. Response

Respond to your review with inline comments or replies, and by updating your analysis accordingly.

Think Ahead

  1. How could you make more customized summary tables?

  2. Could you use any of the variables in this dataset to create new variables that would offer interesting ways to apply split-apply-combine? (e.g., thresholding a continuous value to make a categorical value)
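Both questions can be explored with a few lines of pandas. The sketch below is one possible direction, not the expected answer: the data, the output column names, and the 165 cm cut-off are all made up for illustration.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "species": ["a", "b", "a", "b"],
})

# A customized summary table: choose the statistics and name the columns
custom = df.groupby("species")["height_cm"].agg(
    shortest="min", tallest="max", average="mean"
)

# Threshold a continuous variable to make a categorical one
# (165 is an arbitrary, illustrative cut-off)
df["height_class"] = (df["height_cm"] > 165).map(
    {True: "tall", False: "short"}
)

# The new column can then drive split-apply-combine
counts = df.groupby(["species", "height_class"]).size()

print(custom)
print(counts)
```

For more than two bins, `pd.cut` is a common alternative to a single boolean threshold.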