5. Assignment 5: Constructing Datasets and Using Databases

accept the assignment

Due: 2020-10-12 11:59pm

Table 5.1 access data from a database and merge multiple tables from a dataset

task                                                        skill
----                                                        -----
drop nan rows from a dataset                                prepare (2)
impute a value to fill missing values                       prepare (2)
filter data based on extreme values or other outliers       prepare (2)
convert a variable to one hot encoding                      prepare (2)
add a new column computed from one or more other columns    prepare (2)
transform a dataset to tidy format                          prepare (2)
append a dataset provided in pieces                         construct (2)
merge data with a shared column                             construct (2)
compute overall and individual summary statistics           summarize (2)
use split-apply-combine paradigm                            summarize (2)
generate at least two types of plots                        visualize (2)
interpret statistics and plots                              summarize, visualize
use list comprehensions or loops and pythonic conventions   python (2)
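
Two of the prepare tasks listed above, filtering outliers and one hot encoding, can be sketched as follows. This is only an illustration under assumptions: the `color` and `price` columns are made up, and the 1.5 × IQR rule is one common way to define an outlier, not a rule this assignment requires.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "price": [10.0, 12.0, 11.0, 500.0],
})

# Filter extreme values: keep rows within 1.5 * IQR of the quartiles
# (an assumed outlier definition; pick one that suits your data)
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = df[in_range]

# One hot encoding: one indicator column per remaining category
one_hot = pd.get_dummies(filtered, columns=["color"])
print(one_hot)
```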

5.1. Constructing Datasets

Your goal is to programmatically construct two ready-to-analyze datasets from multiple sources.

  • Each dataset must combine at least 2 source tables (4 total).

  • At least one source table (of the 4) must come from a SQLite database or from web scraping.

  • You should use at least two different joins (types of merges, or concat).
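
As a rough sketch of these requirements, the example below reads two tables from a SQLite database, combines them with two different merge types, and appends a piece with concat. The table names, column names, and in-memory database are hypothetical stand-ins for your own sources.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real .sqlite file
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"player_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]}
).to_sql("players", conn, index=False)
pd.DataFrame(
    {"player_id": [1, 2, 4], "points": [10, 7, 3]}
).to_sql("scores", conn, index=False)

# Read each source table into a DataFrame with a SQL query
players = pd.read_sql_query("SELECT * FROM players", conn)
scores = pd.read_sql_query("SELECT * FROM scores", conn)
conn.close()

# Inner join: keeps only player_id values present in both tables
inner = pd.merge(players, scores, on="player_id", how="inner")

# Outer join: keeps every player_id, filling the gaps with NaN
outer = pd.merge(players, scores, on="player_id", how="outer")

# concat: append a second piece of the scores data as extra rows
scores_part2 = pd.DataFrame({"player_id": [3], "points": [12]})
all_scores = pd.concat([scores, scores_part2], ignore_index=True)

print(inner.shape, outer.shape, all_scores.shape)
```

Comparing the shapes of the inner and outer results is a quick way to see how many rows each join type keeps.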

The notebook you submit should include:

  • a motivating question for why you’re combining the datasets in an introduction section

  • code and a description of how you built and prepared each dataset. For each step, describe what you’re about to do, show the code with its output, and give an interpretation that leads into the next step.

  • exploratory data analysis that shows why you built the data and confirms that it is prepared enough to analyze.

For the construct skill alone, this EDA can be very minimal.

You may build one dataset from three tables instead of two datasets from two tables each, if you’d like.

5.2. Earning additional achievements

To earn additional achievements, you must do more cleaning and/or exploratory data analysis.

5.2.1. Prepare level 2

To earn level 2 for prepare, you must, either on component table(s) or the final dataset:

  • transform into a tidy format

  • add a new column by computing from others

  • handle NaN values by dropping or filling

  • drop a column, row, or duplicates in another way
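
A minimal sketch of these preparation steps is below. The wide temperature table and its column names are invented for illustration, and filling with the column mean is just one possible choice.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year
wide = pd.DataFrame({
    "city": ["Providence", "Boston", "Boston", "Hartford"],
    "temp_2019": [50.0, 52.0, 52.0, None],
    "temp_2020": [51.0, None, None, 49.0],
})

# Drop exact duplicate rows (the repeated Boston row)
deduped = wide.drop_duplicates()

# Transform to tidy format: one row per (city, year) observation
tidy = deduped.melt(id_vars="city", var_name="year", value_name="temp_f")
tidy["year"] = tidy["year"].str.replace("temp_", "", regex=False).astype(int)

# Handle NaN values, option 1: drop rows missing the measurement
dropped = tidy.dropna(subset=["temp_f"])

# Handle NaN values, option 2: fill with the column mean (an assumed choice)
filled = tidy.fillna({"temp_f": tidy["temp_f"].mean()})

# Add a new column computed from an existing one
filled["temp_c"] = (filled["temp_f"] - 32) * 5 / 9

print(filled)
```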

5.2.2. Summarize and Visualize level 2

To earn level 2 for summarize and/or visualize, include additional analyses after building the datasets. Include:

  • compute overall summary statistics

  • compute individual summary statistics

  • use split-apply-combine with two categorical variables

  • generate at least two types of plots for visualize

  • use a categorical variable to modify the plot (color points or create subplots)
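
The summarize items above might look like the sketch below. The data and the column names (`species`, `site`, `mass_g`, `length_cm`) are hypothetical, and the plotting items are left out so the example depends only on pandas.

```python
import pandas as pd

# Hypothetical dataset with two categorical and two numeric variables
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "a", "b"],
    "site": ["north", "south", "north", "south", "north", "south"],
    "mass_g": [10.0, 12.0, 20.0, 22.0, 14.0, 24.0],
    "length_cm": [1.0, 1.5, 2.0, 2.5, 1.2, 2.7],
})

# Overall summary statistics for every numeric column
overall = df.describe()

# An individual summary statistic for a single column
mean_mass = df["mass_g"].mean()

# Split-apply-combine with two categorical variables:
# split by (species, site), apply mean, combine into one Series
by_group = df.groupby(["species", "site"])["mass_g"].mean()

print(overall)
print(mean_mass)
print(by_group)
```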

5.2.3. Python Level 2

Use pythonic naming conventions throughout, AND:

  • use pythonic loops with a list or dictionary, OR

  • use a list or dictionary comprehension
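
To illustrate both options, here is a small sketch that uses only the standard library. The column and table names are made up; the point is the snake_case naming, the comprehensions, and the loop that iterates directly over items rather than over indices.

```python
# Hypothetical raw column names, as they might arrive from a source table
raw_columns = ["Player ID", " Points Scored", "Team Name "]

# List comprehension: clean every name into snake_case
clean_columns = [name.strip().lower().replace(" ", "_") for name in raw_columns]

# Dictionary comprehension: map each old name to its cleaned name
rename_map = {old: new for old, new in zip(raw_columns, clean_columns)}

# Pythonic loop with a dictionary: iterate over the items themselves
table_names = ["players", "scores"]
queries = {}
for table_name in table_names:
    queries[table_name] = f"SELECT * FROM {table_name}"

print(clean_columns)
print(queries)
```

A mapping like `rename_map` could then be passed to `DataFrame.rename(columns=rename_map)` to apply the cleaned names to a table.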

Thinking Ahead

Compare the level 2 skill definitions to level 3. How could you extend and adapt what you’ve done to meet level 3?