5. Assignment 5: Constructing Datasets and Using Databases
Due date: 2023-10-10
Note
I encourage you to get it done and get rest over the long weekend. However, the TA and I are also going to rest this weekend, so grading will not begin until Tuesday, and we will grade any submissions that are in at that time.
Skills:
prepare level 1
summarize 1,2
visualize 1,2
python 1,2
Warning
Submit to the assignment5 branch; there are no template files.
5.2. Constructing Datasets
Your goal is to programmatically construct a ready to analyze dataset that combines information from multiple sources. This can be:
in a crawling fashion, like we did for the CS people
combining two tables with a merge.
If you use a merge to meet the multiple sources criterion, only one source must be scraped, the second can be provided as tabular data.
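As a rough sketch of the merge option: the two small tables below are hypothetical stand-ins (made-up column names and values) for one scraped source and one source provided as tabular data. Your real tables would come from your own scraping and files.

```python
import pandas as pd

# Hypothetical stand-in for a table you scraped from a web page
scraped = pd.DataFrame({
    "team": ["red", "blue", "green"],
    "wins": [3, 5, 2],
})

# Hypothetical stand-in for a second source provided as tabular data
provided = pd.DataFrame({
    "team": ["red", "blue", "gold"],
    "members": [10, 12, 8],
})

# An outer merge keeps rows that only appear in one source;
# indicator=True adds a _merge column showing where each row came from
combined = pd.merge(scraped, provided, on="team", how="outer", indicator=True)
print(combined)
```

Looking at the `_merge` column is one quick way to check whether the keys lined up the way you expected before you pick a final `how`.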
If you crawl, all the data could come from tables (files or read_html), as long as you pull the URLs programmatically.
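Here is a minimal sketch of what "pulling the URLs programmatically" can look like. It uses only the standard library's `html.parser` rather than whatever scraping library you prefer, and the index page text and `example.com` address are hypothetical; in practice the page text would come from an HTTP request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical index page and base address, for illustration only
base_url = "https://example.com/people/"
index_html = """
<ul>
  <li><a href="faculty.html">Faculty</a></li>
  <li><a href="staff.html">Staff</a></li>
  <li><a href="mailto:info@example.com">Contact</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(index_html)

# Keep only the page links and turn relative links into full URLs
page_urls = [
    urljoin(base_url, href)
    for href in collector.links
    if href.endswith(".html")
]
print(page_urls)
```

Each URL in `page_urls` could then be passed to `pd.read_html` and the resulting tables combined with `pd.concat`.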
The notebook you submit should include:
a motivating question for why you are building the dataset you are building
code and a description of how you built and prepared your dataset. For each step, describe what you’re about to do, show the code with its output, and give an interpretation that leads into the next step.
exploratory data analysis that shows why you built the data and confirms that it is prepared enough to analyze.
also save your dataset to csv
For construct only, this can be very minimal EDA.
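A small sketch of saving the dataset and checking that it reads back cleanly. The DataFrame contents and the file name `my_dataset.csv` are hypothetical, and the temporary directory is only there so the example runs anywhere; in your notebook you would save into your repository.

```python
import os
import tempfile

import pandas as pd

# Hypothetical finished dataset
dataset = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.0]})

# Write to a temporary folder for this sketch only
out_path = os.path.join(tempfile.mkdtemp(), "my_dataset.csv")

# index=False leaves out the row index, so no extra unnamed column appears on reload
dataset.to_csv(out_path, index=False)

# Read it back to confirm the shape and column types are what you expect
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
print(reloaded.dtypes)
```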
5.3. Additional achievements
To earn additional achievements, you must do more cleaning and/or exploratory data analysis.
Important
Make sure everything is well explained
5.3.1. Prepare level 2
To earn level 2 for prepare, you must manipulate either a component table or the final dataset. Sample manipulations include:
transform into a tidy format
add a new column by computing from others
handle NaN values by dropping or filling
drop a column, row, or duplicates in another way
change a continuous value to categorical (there is an added section in the notes on quantizing that we did not do in class, but it should be easy to follow)
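The sample manipulations above can be sketched on a small, made-up table. The column names, values, and bin edges here are all hypothetical, and `pd.cut` is just one possible way to go from continuous to categorical.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per quiz (out of 10)
wide = pd.DataFrame({
    "student": ["ana", "ben", "cam", "cam"],
    "quiz1": [8.0, 6.0, None, None],
    "quiz2": [9.0, 7.0, 5.0, 5.0],
})

# Drop exact duplicate rows (the repeated "cam" row)
deduped = wide.drop_duplicates()

# Transform to a tidy format: one row per (student, quiz) observation
tidy = deduped.melt(id_vars="student", var_name="quiz", value_name="score")

# Handle NaN values by dropping the rows with a missing score
tidy = tidy.dropna(subset=["score"])

# Add a new column computed from another one
tidy["percent"] = tidy["score"] * 10

# Change a continuous value to categorical using hypothetical bin edges
tidy["level"] = pd.cut(
    tidy["score"], bins=[0, 6, 8, 10], labels=["low", "mid", "high"]
)
print(tidy)
```

Whether dropping or filling is the right way to handle missing values depends on your motivating question, so explain the choice you make.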
5.3.2. Summarize and Visualize level 2
To earn level 2 for summarize and/or visualize, include additional analyses after building the datasets.
Connect your EDA to questions, and demonstrate items from one of the checklists in A3.
5.3.3. Python Level 2
Use pythonic naming conventions throughout, AND:
Use pythonic loops and a list or dictionary OR
use a list or dictionary comprehension
this can be in your cleanup or your EDA
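For illustration, here is one way these pieces might show up during cleanup. The raw column names are hypothetical, and the cleaning rules are only an example.

```python
# Hypothetical raw column names, as they might come from a scraped table
raw_columns = ["First Name", "Last Name ", "Office Hours", "E-mail"]

# Dictionary comprehension: map each raw name to a snake_case version
rename_map = {
    col: col.strip().lower().replace(" ", "_").replace("-", "")
    for col in raw_columns
}

# Pythonic loop: iterate over the items directly instead of indexing by position
for old_name, new_name in rename_map.items():
    print(f"{old_name!r} -> {new_name!r}")

# List comprehension with a condition
name_columns = [new for new in rename_map.values() if new.endswith("name")]
print(name_columns)
```

A dictionary like `rename_map` could then be passed to `DataFrame.rename(columns=rename_map)`.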
Thinking Ahead
Compare the level 2 skill definitions to level 3. How could you extend and adapt what you’ve done to meet level 3?
Thinking Ahead
You could also demonstrate understanding of how merges work by converting a dataset that is provided as a single table with redundant information into a number of smaller tables in a database.
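A minimal sketch of that idea, using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and rows are all hypothetical; your own dataset will suggest its own split.

```python
import sqlite3

# Hypothetical single table with redundant information:
# the building is repeated on every row for the same department
rows = [
    ("ana", "cs", "building_a"),
    ("ben", "cs", "building_a"),
    ("cam", "math", "building_b"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept TEXT PRIMARY KEY, building TEXT)")
conn.execute(
    "CREATE TABLE person (name TEXT PRIMARY KEY, "
    "dept TEXT REFERENCES department(dept))"
)

# Store each department once; the dictionary comprehension removes the repeats
departments = {dept: building for _, dept, building in rows}
conn.executemany("INSERT INTO department VALUES (?, ?)", departments.items())
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(name, dept) for name, dept, _ in rows],
)
conn.commit()

# A join (the SQL counterpart of a merge) rebuilds the original table
rebuilt = conn.execute(
    "SELECT person.name, person.dept, department.building "
    "FROM person JOIN department ON person.dept = department.dept "
    "ORDER BY person.name"
).fetchall()
conn.close()
print(rebuilt)
```

Checking that the join gives back the original rows is a simple way to show that no information was lost when splitting the table.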