3. Assignment 3: Exploratory Data Analysis

Due: 2023-10-01, end of day

3.1. Submission

Important

You have the option to work with a partner. You must plan this in advance so that you have access to collaborate.

3.1.1. Solo

Add your work to the assignment3 branch in your portfolio. You do not need to edit the a3_location.md file.

3.1.2. Group

  1. Coordinate so that the first person makes the team when they accept the assignment.

  2. The second (and third) person joins the same team when they accept the assignment.

  3. Each person should upload their work to a branch named d1, d2 or d3 for which dataset checklist you followed from below and open a PR. (each person should do a different dataset)

  4. In your portfolio, replace the contents of the a3_location.md file on the assignment3 branch with your team name. We will use that to create a PR to give you your individualized achievements update.

3.2. Objective & Evaluation

This week your goal is to do a small exploratory data analysis for two datasets (or one if in a group) of your choice.

Eligible skills: (links to checklists)

  • process 1

  • access 1 and 2

  • summarize 1 and 2

  • visualize 1

3.4. Choose Datasets

Each dataset must have at least three variables, but can have more. Both datasets must have multiple types of variables. These can be datasets you used for Assignment 2, if they meet the criteria below. All datasets must be different, even within a group.

3.4.1. Dataset 1 (d1)

must include at least:

  • two continuous valued variables and

  • one categorical variable.

Hint

A dataset from the UCI data repository that is for classification and has continuous features would work for this.

3.4.2. Dataset 2 (d2)

must include at least:

  • two categorical variables and

  • one continuous valued variable

3.4.3. Dataset 3 (d3)

Warning

This is only for groups of 3

must include at least:

  • two continuous valued variables and

  • one categorical variable.

3.5. EDA

Use a separate notebook for each dataset. Name them dataset_0x.ipynb, where x is the number of the checklist you are following.

For each dataset, in the corresponding notebook complete the following:

  1. Load the data into a notebook as a DataFrame from a URL or local path. If local, include the data file in your repository.

  2. Write a short description of what the data contains and what it could be used for

  3. Explore the dataset in a notebook enough to describe its structure. Use the heading ## Description and include at least the following with interpretation. What does the structure imply about the conclusions you can draw from this data? Are there limitations in how to safely interpret the data that the summary helps you see? Are the variables what you expect?

    • shape

    • columns

    • variable types

    • overall summary statistics

  4. Ask and answer at least 3 questions by using and interpreting statistics and visualizations as appropriate. Include a heading for each question using a markdown cell and H2:##. Make sure your analyses meet the criteria in the check lists below. Use the checklists to think of what kinds of questions would use those type of analyses and help shape your questions. Your questions can be related or different levels of detail or views on a big picture question as long as the analysis addresses the checklist.

  5. Describe what, if anything, might need to be done to clean or prepare this data for further analysis in a final ## Future analysis markdown cell in your notebook.
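
As a rough illustration of steps 1 and 3, the sketch below loads a small table and pulls out the four structural pieces listed above. The data, column names, and the name `example_df` are all made up for this example; in your own notebook you would pass a URL or a local file path to `pd.read_csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up stand-in data. In a real notebook, replace the StringIO buffer
# with a URL or a local path to your own dataset.
csv_text = """species,bill_length_mm,body_mass_g,island
Adelie,39.1,3750,Torgersen
Adelie,38.6,3800,Biscoe
Gentoo,46.1,4500,Biscoe
Gentoo,50.0,5700,Biscoe
Chinstrap,46.5,3500,Dream
Chinstrap,49.0,3950,Dream
"""
example_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = example_df.shape        # (number of rows, number of columns)
column_names = list(example_df.columns)  # variable names
variable_types = example_df.dtypes       # storage type of each column
overall_summary = example_df.describe()  # count, mean, std, quartiles of numeric columns
```

By default `describe()` only reports on numeric columns, which is itself worth interpreting: the categorical variables need a different kind of summary.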

3.5.1. (overall) Question checklist

be sure that every question (all six, 3 per dataset) has:

  • a heading

  • at least 1 statistic or plot

  • interpretation that answers the question

  • wording that does not include the name of the statistic or plot

3.5.2. Dataset 1 Checklist

make sure that your dataset_01.ipynb has:

  • Overall summary statistics grouped by a categorical variable

  • A single statistic grouped by a categorical variable

  • at least one plot that uses 3 total variables

  • a plot and summary table that convey the same information. This can be one statistic or many.
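
The first two checklist items both rely on split-apply-combine. A minimal sketch, assuming a made-up DataFrame named `toy_df` with one categorical column (`group`) and two continuous ones:

```python
import pandas as pd

# Made-up measurements, for illustration only
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "height": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "weight": [10.0, 12.0, 14.0, 20.0, 20.0, 26.0],
})

# Overall summary statistics, computed separately for each category
summary_by_group = toy_df.groupby("group").describe()

# A single statistic for one variable, grouped by the same categorical variable
mean_height_by_group = toy_df.groupby("group")["height"].mean()
```

The grouped `describe()` result has two-level column labels (variable, statistic), so a single cell can be read with something like `summary_by_group.loc["a", ("weight", "mean")]`.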

3.5.3. Dataset 2 Checklist

make sure that your dataset_02.ipynb has:

  • overall summary statistics

  • two individual summary statistics for one variable

  • one summary statistic grouped by two categorical variables

  • a figure with a grid of subplots that correspond to two categorical variables
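
One possible shape for the last three items is sketched below with plain pandas and matplotlib. The DataFrame `survey_df` and its columns are invented for the example, and a histogram is just one choice of plot for each cell of the grid.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with two categorical variables and one continuous variable
survey_df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north", "north", "south", "south"],
    "plan": ["basic", "plus", "basic", "plus", "basic", "plus", "basic", "plus"],
    "spend": [10.0, 30.0, 20.0, 40.0, 14.0, 34.0, 24.0, 44.0],
})

# Two individual summary statistics for one variable
spend_median = survey_df["spend"].median()
spend_max = survey_df["spend"].max()

# One summary statistic grouped by two categorical variables
mean_spend = survey_df.groupby(["region", "plan"])["spend"].mean()

# A grid of subplots: one row per region, one column per plan
regions = sorted(survey_df["region"].unique())
plans = sorted(survey_df["plan"].unique())
fig, axes = plt.subplots(len(regions), len(plans),
                         sharex=True, sharey=True, squeeze=False)
for i, region in enumerate(regions):
    for j, plan in enumerate(plans):
        in_cell = (survey_df["region"] == region) & (survey_df["plan"] == plan)
        axes[i, j].hist(survey_df.loc[in_cell, "spend"], bins=5)
        axes[i, j].set_title(f"region={region}, plan={plan}")
```

If you use seaborn, its `displot` and `catplot` functions accept `row=` and `col=` arguments and can build a similar grid in a single call.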

3.5.4. Dataset 3 Checklist

Warning

This is only for groups of 3

make sure that your dataset_03.ipynb has:

  • overall summary statistics

  • two individual summary statistics for one variable

  • at least one plot that uses 3 total variables

  • a plot and summary table that convey the same information. This can be one statistic or many.
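
For the last item, one way to guarantee that a plot and a table convey the same information is to draw the plot directly from the table. A minimal sketch with made-up data (the names `toy_df` and `mean_table` are assumptions for this example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up measurements, for illustration only
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "height": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "weight": [10.0, 12.0, 14.0, 20.0, 20.0, 26.0],
})

# Summary table: mean of each continuous variable per category
mean_table = toy_df.groupby("group")[["height", "weight"]].mean()

# Plot built from that same table: one bar per cell
ax = mean_table.plot(kind="bar")
ax.set_ylabel("mean value")
```

Because the bars are generated from `mean_table`, any change to the table (a different statistic, a filtered subset) carries through to the plot automatically.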

3.6. Peer Review

Note

If you work alone and complete 2 analyses you do not need to do this, but you might review these questions because they are similar to how we will grade.

With a partner (or in a group of 3, where person 1 reviews person 2’s work, 2 reviews 3’s, and 3 reviews 1’s), read your partner’s notebook and complete a peer review on their pull request. You can do the peer review once you have done most of your analysis and explanation, even if some parts of the code do not work.

You will complete your review on the PR itself, using GitHub's review feature. For a big-picture overview, the GitHub PR review “course” is a good place to start; it is designed to take less than 30 minutes.

  1. start a review

  2. (optional) Use PR comments to denote places that are confusing or where you see solutions to problems your classmate could not solve. This is hard on notebook files, so it is okay to skip.

  3. Prepare to submit your review

  4. Use the list of questions below for your summary review (copy the template into the box and fill it in).

Important

Your review should use the template for organization, but the questions guide what sorts of aspects to consider across the sections.

3.6.1. Review Questions

  1. Describe how it was to read the analysis overall. Was it easy? hard? cohesive? jumpy?

  2. How did the data summaries help prepare you to read the rest of the analysis? What do you think might be missing?

  3. For each question, consider the following and write any tips for improvement

    1. Does the question make sense based on the data? How does it relate to the real world? Is there a reasonable audience? How could the question be improved?

    2. How well do the statistics and plots match the question?

    3. Are the interpretations complete, clear, and consistent with the statistics and plots?

    4. What could be done to make the explanations more clear and complete?

    5. What additional analysis might make the analysis more compelling and clear?

3.6.1.1. Template

## Overall 
 <!-- Describe how it was to read the analysis overall. Was it easy? hard? cohesive? jumpy? -->


## Intro

## Question 1 

## Question 2

## Question 3 

3.6.2. Response

Respond to the review on your notebook either with inline comments, replies, or by updating your analysis accordingly.

3.7. Tips and Hints

  • Remember you can also use masking in your EDA even though we did not do any in class

  • To ensure you understand the checklist you can optionally make an issue using the appropriate issue type from your repo and fill in what it should be to get early feedback that you are on track

  • variable types are in the notes

  • the DataFrame API reference shows all the methods (and more) grouped by high level concepts.
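
Since masking was not covered in class, here is a minimal sketch of the idea using a made-up DataFrame (`toy_df` and its columns are assumptions for this example):

```python
import pandas as pd

# Made-up scores, for illustration only
toy_df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "score": [55, 72, 90, 64, 81],
})

# A mask is a boolean Series with one True/False value per row
high_score_mask = toy_df["score"] > 70

# Indexing with the mask keeps only the rows where it is True
high_scores = toy_df[high_score_mask]

# Masks combine with & (and), | (or), and ~ (not);
# wrap each comparison in parentheses
high_in_group_a = toy_df[(toy_df["score"] > 70) & (toy_df["group"] == "a")]
```

A masked DataFrame is still a DataFrame, so the same summary and grouping methods can be applied to the subset.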

3.8. Think Ahead

This can be added to any or all of the datasets.

## Thinking Ahead 
1. How could you make more customized summary tables?
1. Could you use any of the variables in this dataset to add more variables that would make interesting ways to apply split-apply-combine? (e.g. thresholding a continuous value to make a categorical value or like what we did with `commodity` in class)
1. Are there multiple ways to answer your big picture question (like different thresholding or subsets of the data)?
1. Could any cleaning improve your analysis?
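
To make the first two prompts concrete, the sketch below thresholds a continuous column into a new categorical one and then builds a summary table with a hand-picked set of statistics. The data, the column names, and the cutoff of 5.0 are all arbitrary choices for illustration.

```python
import pandas as pd

# Made-up prices, for illustration only
toy_df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "a"],
    "price": [2.0, 8.0, 3.0, 9.0, 7.0, 4.0],
})

# Threshold a continuous column to create a new categorical variable.
# The cutoff of 5.0 is arbitrary; a different cutoff may tell a different story.
toy_df["price_level"] = (toy_df["price"] > 5.0).map({True: "high", False: "low"})

# A customized summary table: choose exactly which statistics to report
custom_summary = toy_df.groupby("price_level")["price"].agg(["min", "mean", "max"])
```

Trying a few different cutoffs and comparing the resulting tables is one way to approach the third prompt.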