4. Assignment 4: Cleaning Data

Due: 2023-10-03

Eligible skills:

prepare 1
access 2
python 1,2

4.1. Submission

Work on your assignment4 branch. When you are done, open a PR with:

base: feedback, compare: assignment4

and request a review from @surbhir08

4.3. Check the Datasets you have worked with already

Among the datasets you used in past assignments, or came across but decided you could not work with, identify at least one thing you could not do because the data was not in an appropriate format.

In a notebook file called dataset_fix.ipynb, apply one fix and show one summary statistic or plot that was not possible before, to demonstrate that the fix works.

Some examples:

a column that was a list or dictionary
missing values
a column that was continuous, but more interesting as a categorical
too many header rows
a data set that was wide, but tall would be better for plotting
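
As a rough illustration of three of the examples above (wide to tall, missing values, and continuous to categorical), here is a minimal pandas sketch. The data, the column names (student, hours, quiz1, quiz2), and the bin edges are all made up for illustration; your own dataset will call for different choices.

```python
import pandas as pd

# Hypothetical toy data: one row per student, one column per quiz (wide),
# with one missing score and a continuous 'hours' column.
wide = pd.DataFrame(
    {
        "student": ["a", "b", "c"],
        "hours": [1.5, 6.0, 12.0],
        "quiz1": [80.0, None, 95.0],
        "quiz2": [70.0, 85.0, 90.0],
    }
)

# Wide -> tall: one row per (student, quiz) pair, which is easier to plot.
tall = wide.melt(
    id_vars=["student", "hours"],
    value_vars=["quiz1", "quiz2"],
    var_name="quiz",
    value_name="score",
)

# Missing values: here we drop rows with no score; filling is another option.
tall_complete = tall.dropna(subset=["score"])

# Continuous -> categorical: bin study hours into labeled ranges.
# The bin edges are arbitrary assumptions for this toy example.
tall_complete = tall_complete.assign(
    effort=pd.cut(
        tall_complete["hours"],
        bins=[0, 3, 8, 24],
        labels=["low", "medium", "high"],
    )
)

# A summary that was awkward in the wide layout: mean score per quiz.
mean_by_quiz = tall_complete.groupby("quiz")["score"].mean()
print(mean_by_quiz)
```

Dropping the incomplete row is only one possible choice; whether to drop, fill, or flag missing values depends on the question you want to answer, so say which you chose and why in your notebook.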

4.4. CS Degrees

See the notebook on your assignment4 branch and complete the instructions there.

4.5. Study Cleaned Datasets

Tip

There is a dedicated section on the Data Sources page.

Read example data cleaning notes or scripts. To do this, find at least one dataset for which the messy version, the clean version, and a script or notes about how it was cleaned are all available. Then answer the following questions in a markdown file named cleaning_notes.md. (Some example datasets are on the datasets page, and one is in the notes added to the course website.)

What are 3 common problems to look for in a dataset? Describe them with examples.
Using one of the examples of cleaned data that you found, give an example of a question or context that would require different cleaning choices than the ones that were made. Include a bit about the data, what was done, the question, what would need to be done instead, and a justification.
Explain in your own words, with a concrete example, how domain expertise can help you when cleaning data. Use either a made-up example or one that you read about.

Warning

Some of these examples have both the clean and messy data files and an R script to do the cleaning. You are not required to know R, but looking at their R cleaning script could give hints of what things they fixed or changed. You could also compare the clean and messy versions by looking at them with a tool of your choice.
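
If pandas is your tool of choice for comparing the two versions, a comparison might look something like the sketch below. The helper compare_versions and the toy DataFrames are hypothetical and only for illustration; in practice you would load the real messy and clean files (for example with pd.read_csv) and adapt the checks to that dataset.

```python
import pandas as pd


def compare_versions(messy: pd.DataFrame, clean: pd.DataFrame) -> dict:
    """Summarize structural differences between two versions of a dataset.

    Hypothetical helper for illustration; not part of pandas.
    """
    messy_cols = set(messy.columns)
    clean_cols = set(clean.columns)
    shared = sorted(messy_cols & clean_cols)
    return {
        # Row counts before and after cleaning.
        "rows": (len(messy), len(clean)),
        # Columns that exist in only one of the two versions.
        "dropped_columns": sorted(messy_cols - clean_cols),
        "added_columns": sorted(clean_cols - messy_cols),
        # Shared columns whose dtype changed, e.g. text parsed to numbers.
        "dtype_changes": {
            col: (str(messy[col].dtype), str(clean[col].dtype))
            for col in shared
            if messy[col].dtype != clean[col].dtype
        },
        # Count of missing values per shared column, before and after.
        "missing_before_after": {
            col: (int(messy[col].isna().sum()), int(clean[col].isna().sum()))
            for col in shared
        },
    }


# Toy stand-ins for the messy and clean files (made-up values).
messy = pd.DataFrame(
    {"id": [1, 2, 3], "price": ["10", "n/a", "30"], "notes": ["x", None, "z"]}
)
clean = pd.DataFrame({"id": [1, 3], "price": [10.0, 30.0]})

report = compare_versions(messy, clean)
for key, value in report.items():
    print(key, value)
```

A report like this only hints at what changed: here the "n/a" price is stored as text, so it is not counted as missing in the messy version. The cleaning script or notes are still the place to learn why each change was made.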