5. Assignment 5: Constructing Datasets and Using Databases

Due date: 2023-10-10

Note

I encourage you to finish the assignment and then rest over the long weekend. However, the TA and I will also be resting this weekend, so grading will not begin until Tuesday; at that point we will grade every submission that is in.

Skills:

  • prepare level 1

  • summarize 1,2

  • visualize 1,2

  • python 1,2

Warning

Submit to the assignment5 branch. Note that there are no template files.

5.2. Constructing Datasets

Your goal is to programmatically construct a ready-to-analyze dataset that combines information from multiple sources. This can be done by:

  • crawling, in the same fashion as we did for the CS people

  • combining two tables with a merge.

If you use a merge to meet the multiple-sources criterion, only one source needs to be scraped; the second can be provided as tabular data.
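
As a minimal sketch of the merge option, the example below combines two small tables with `pd.merge`. Both tables and all of their column names are hypothetical placeholders: `scraped` stands in for data you scraped, and `provided` stands in for a table you were given as a file.

```python
import pandas as pd

# Hypothetical stand-in for a table you scraped
scraped = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "title": ["Professor", "Lecturer", "Professor"],
})

# Hypothetical stand-in for a provided table (e.g. loaded with pd.read_csv)
provided = pd.DataFrame({
    "name": ["Ada", "Grace", "Edsger"],
    "office": ["Room 101", "Room 202", "Room 303"],
})

# A left merge keeps every scraped row; indicator=True adds a "_merge"
# column recording whether each row found a match in the second table
combined = pd.merge(scraped, provided, on="name", how="left", indicator=True)
print(combined)
```

Inspecting the `_merge` column is one way to notice rows that did not match, which is worth interpreting before moving on to the next step.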

If you crawl, all the data can come from tables (files or read_html), as long as you pull the URLs programmatically.
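
One possible way to pull URLs programmatically is sketched below using only the Python standard library; the tool used in class may differ, and this is not meant as the required approach. The index page HTML, the `example.com` address, and the `LinkCollector` class are all hypothetical and exist only for illustration. In a real notebook the HTML would come from a request to the index page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                self.links.append(href)


# Hypothetical index page; normally this text is downloaded, not typed in
index_url = "https://example.com/people/"
index_html = """
<ul>
  <li><a href="ada.html">Ada</a></li>
  <li><a href="grace.html">Grace</a></li>
  <li><a name="no-link">not a link</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(index_html)

# Turn relative links into absolute URLs that can be visited one by one
page_urls = [urljoin(index_url, href) for href in collector.links]
print(page_urls)
```

Each URL in `page_urls` could then be read in a loop, for example with `pd.read_html`, and the resulting tables combined into one dataset.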

The notebook you submit should include:

  • a motivating question that explains why you are building this dataset

  • code and a description of how you built and prepared your dataset. For each step, include a description of what you are about to do, the code with its output, and an interpretation that leads into the next step.

  • exploratory data analysis that shows why you built the data and confirms that it is prepared enough to analyze.

  • a step that saves your dataset to CSV

If you are aiming for construct only, the EDA can be very minimal.
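
A minimal sketch of what very light EDA plus saving might look like is below. The `dataset` table, its columns, and the output file name are hypothetical, and the temporary directory is used only so the example runs anywhere; in your notebook you would write to a path inside your repository.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical finished dataset
dataset = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "title": ["Professor", "Lecturer", "Professor"],
    "courses": [2, 3, 1],
})

# Quick checks that the data is ready to analyze
print(dataset.dtypes)                     # are the column types sensible?
missing_per_column = dataset.isna().sum()  # any missing values left?
title_counts = dataset["title"].value_counts()
print(missing_per_column)
print(title_counts)

# Save without the index so the file reloads with the same columns
out_path = Path(tempfile.mkdtemp()) / "dataset.csv"  # hypothetical location
dataset.to_csv(out_path, index=False)

# Reload to confirm the saved file matches what was built
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```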

5.3. Additional achievements

To earn additional achievements, you must do more cleaning and/or exploratory data analysis.

Important

Make sure everything is well explained.

5.3.1. Prepare level 2

To earn level 2 for prepare, you must manipulate either a component table or the final dataset. Sample manipulations include:

  • transform into a tidy format

  • add a new column by computing from others

  • handle NaN values by dropping or filling

  • drop a column, row, or duplicates in another way

  • change a continuous value to categorical (there is an added section in the notes on quantizing that we did not cover in class, but it should be easy to follow)
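
The sample manipulations above could look roughly like the sketch below. The `wide` table, its columns, the fill value, and the bin edges are hypothetical choices made for illustration; your own data will call for different decisions, and each one should be explained in your notebook.

```python
import pandas as pd

# Hypothetical component table with a duplicate row and a missing value
wide = pd.DataFrame({
    "name": ["Ada", "Grace", "Grace", "Alan"],
    "fall": [30.0, 45.0, 45.0, None],
    "spring": [25.0, 50.0, 50.0, 20.0],
})

# Drop exact duplicate rows
deduped = wide.drop_duplicates()

# Handle NaN by filling (dropping with dropna is the alternative)
filled = deduped.fillna({"fall": 0.0})

# Add a new column computed from others
with_total = filled.assign(total=filled["fall"] + filled["spring"])

# Change a continuous value to categorical with hypothetical bin edges
size = pd.cut(with_total["total"], bins=[0, 50, 100], labels=["small", "large"])
with_size = with_total.assign(size=size)

# Transform into a tidy format: one row per (name, semester) observation
tidy = filled.melt(
    id_vars="name",
    value_vars=["fall", "spring"],
    var_name="semester",
    value_name="enrollment",
)
print(with_size)
print(tidy)
```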

5.3.2. Summarize and Visualize level 2

To earn level 2 for summarize and/or visualize, include additional analyses after building the datasets.

Connect your EDA to questions, and demonstrate items from one of the checklists in A3.

5.3.3. Python Level 2

Use pythonic naming conventions throughout, AND:

  • Use pythonic loops and a list or dictionary OR

  • use a list or dictionary comprehension

This can be in your cleanup or your EDA.
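
As an illustration of these two options, the sketch below cleans up column names with a dictionary comprehension and then uses a pythonic loop over the result. The column names and the `to_snake_case` helper are hypothetical.

```python
# Hypothetical column names as they might arrive from a scraped table
raw_columns = ["First Name", "Last Name", "Office #"]


def to_snake_case(label):
    """Lowercase a label and join its alphanumeric words with underscores."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in label)
    return "_".join(cleaned.lower().split())


# Dictionary comprehension: map each original name to its cleaned version
rename_map = {col: to_snake_case(col) for col in raw_columns}

# Pythonic loop: iterate over the items directly rather than over indices
for old_name, new_name in rename_map.items():
    print(f"{old_name!r} -> {new_name!r}")
```

A mapping like `rename_map` can be passed to `DataFrame.rename(columns=rename_map)` as part of cleanup.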

Thinking Ahead

Compare the level 2 skill definitions to level 3. How could you extend and adapt what you’ve done to meet level 3?

Thinking Ahead

You could also demonstrate understanding of how merges work by converting a dataset that is provided as a single table with redundant information into a number of smaller tables in a database.
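
A minimal sketch of that idea, assuming a hypothetical `flat` table in which the department name is repeated for every student, is below. It uses an in-memory SQLite database so nothing is written to disk; the table and column names are illustrative only.

```python
import sqlite3

import pandas as pd

# Hypothetical single table with redundant information (dept_name repeats)
flat = pd.DataFrame({
    "student": ["Ada", "Grace", "Alan"],
    "dept_code": ["CS", "CS", "MATH"],
    "dept_name": ["Computer Science", "Computer Science", "Mathematics"],
})

# Split into two smaller tables that share the dept_code key
students = flat[["student", "dept_code"]]
departments = flat[["dept_code", "dept_name"]].drop_duplicates()

# Store both tables in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
students.to_sql("students", conn, index=False)
departments.to_sql("departments", conn, index=False)

# A join on the shared key rebuilds the original table
rebuilt = pd.read_sql(
    "SELECT s.student, s.dept_code, d.dept_name "
    "FROM students AS s "
    "JOIN departments AS d ON s.dept_code = d.dept_code "
    "ORDER BY s.rowid",
    conn,
)
conn.close()
print(rebuilt)
```

Checking that the join recovers the original rows is one way to show that the split lost no information.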