
Assignment 2: Exploratory Data Analysis

Due: 2023-10-02 end of day

Submission

Solo

Add your work to an assignment2 branch in your portfolio and you do not need to edit the

Group

  1. Coordinate so that the first person makes the team when they accept the assignment.

  2. The second (and third) person joins the same team when they accept the assignment.

  3. Each person should upload their work to a branch named d1, d2, or d3, matching the dataset checklist they followed from below, and open a PR. Each person should do a different dataset.

  4. In your portfolio, make an issue with a link to your team’s repo.

Choose Datasets

Each dataset must have at least three variables, but can have more. Both datasets must have multiple types of variables. These can be datasets you used for Assignment 1, if they meet the criteria below. All datasets must be different datasets, even in a group.

Dataset 1 (d1)

must include at least:

Dataset 2 (d2)

must include at least:

Dataset 3 (d3)

must include at least:

EDA

Use a separate notebook for each dataset. Name them dataset_0x.ipynb, where x is the number of the checklist you are following.

For each dataset, in the corresponding notebook complete the following:

  1. Load the data into a notebook as a DataFrame from a URL or a local path. If local, include the data file in your repository.

  2. Write a short description of what the data contains and what it could be used for.

  3. Explore the dataset in a notebook enough to describe its structure. Use the heading ## Description and include at least the following, with interpretation. What does the structure imply about the conclusions you can draw from this data? Are there limitations in how to safely interpret the data that the summary helps you see? Are the variables what you expect?

    • shape

    • columns

    • variable types

    • overall summary statistics

  4. Ask and answer at least 3 questions by using and interpreting statistics and visualizations as appropriate. Include a heading for each question using a markdown cell and an H2 (##). Make sure your analyses meet the criteria in the checklists below. Use the checklists to think of what kinds of questions would use those types of analyses and to help shape your questions. Your questions can be related, or can be different levels of detail or views on a big-picture question, as long as the analysis addresses the checklist.

  5. Describe what, if anything, might need to be done to clean or prepare this data for further analysis in a final ## Future analysis markdown cell in your notebook.
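
As a rough illustration of steps 1, 3, and 4, the sketch below loads a small table as a DataFrame, collects the items listed under ## Description, and computes one grouped statistic for an example question. The data, the column names (species, island, body_mass_g), and the question itself are hypothetical placeholders, not part of the assignment; in your notebook you would pass your own URL or local path to `pd.read_csv` instead of the in-memory text used here.

```python
import io

import pandas as pd

# Hypothetical stand-in data so the sketch runs on its own.
# In the assignment, replace the StringIO buffer with a URL or a
# local path to a file that is committed to your repository.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Biscoe,3800
Gentoo,Biscoe,5000
Gentoo,Biscoe,5400
Chinstrap,Dream,3700
"""
df = pd.read_csv(io.StringIO(csv_text))

# Items for the ## Description section
n_rows, n_cols = df.shape                 # shape
column_names = list(df.columns)           # columns
variable_types = df.dtypes                # variable types
summary = df.describe(include="all")      # overall summary statistics

# Example question (hypothetical): how does mean body mass differ by species?
mean_mass_by_species = df.groupby("species")["body_mass_g"].mean()
```

In a notebook, displaying `summary` and `mean_mass_by_species` in their own cells, each followed by a markdown cell that interprets the numbers, keeps the statistic and its explanation together.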

(overall) Question checklist

be sure that every question has:

Dataset 1 Checklist

make sure that your dataset_01.ipynb has:

Dataset 2 Checklist

make sure that your dataset_02.ipynb has:

Dataset 3 Checklist

make sure that your dataset_03.ipynb has:

Peer Review

With a partner (or a group of 3, where person 1 reviews person 2’s work, 2 reviews 3’s, and 3 reviews 1’s), read your partner’s notebook and complete a peer review on their pull request. You can do peer review when you have done most of your analysis and explanation, even if some parts of the code do not work.

You will complete your review on a PR, by reviewing it. If you want a big-picture overview of that, the GitHub PR review “course” is a good place to go; it is designed to take <30 minutes.

  1. Start a review.

  2. (optional) Use PR comments to denote places that are confusing or where you see solutions to problems your classmate could not solve. This is hard on notebook files, so it is okay to skip.

  3. Prepare to submit your review

  4. Use the list of questions below for your summary review (copy the template into the box and fill it in).

Review Questions

  1. Describe how it was to read the analysis overall. Was it easy? Hard? Cohesive? Jumpy?

  2. How did the data summaries help prepare you to read the rest of the analysis? What do you think might be missing?

  3. For each question, consider the following and write any tips for improvement

    1. Does the question make sense based on the data? How does it relate to the real world? Is there a reasonable audience? How could the question be improved?

    2. How well do the statistics and plots match the question?

    3. Are the interpretations complete, clear, and consistent with the statistics and plots?

    4. What could be done to make the explanations more clear and complete?

    5. What additional analysis might make the analysis more compelling and clear?

Template

## Overall 
 <!-- Describe how it was to read the analysis overall. Was it easy? Hard? Cohesive? Jumpy? -->


## Intro

## Question 1 

## Question 2

## Question 3

Response

Respond to the review on your notebook either with inline comments, replies, or by updating your analysis accordingly.

Tips and Hints