
Assignment 1: Portfolio Setup

Setting

Next week, we will learn about summarizing data. In this assignment, you will build a small dataset about datasets. In class next week, we will combine everyone's datasets about datasets so that we can answer questions like:

Steps

Set up your Portfolio

  1. Create your portfolio repository by accepting the assignment from the course organization page

  2. Fill in the two files in the about folder

Find Datasets

Find 3 datasets of interest to you that are provided in at least two different file formats. Choose datasets that are not too big, so that each takes no more than a few seconds to load. At least one dataset must have non-numerical (e.g., string or boolean) data in at least one column.
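One way to check a candidate against these requirements is to time the load and list its non-numerical columns. The sketch below is only an illustration: the helper name `check_candidate` is hypothetical and not part of the assignment, and it assumes the dataset can be loaded by a single pandas reader such as `pd.read_csv`.

```python
import time

import pandas as pd


def check_candidate(load_function, source):
    """Load a dataset, returning the load time in seconds and its non-numerical columns.

    `check_candidate` is a hypothetical helper, not something the assignment requires.
    """
    start = time.perf_counter()
    df = load_function(source)
    elapsed = time.perf_counter() - start
    # Columns whose dtype is not numeric (strings, booleans, dates, ...)
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return elapsed, non_numeric
```

If the returned time is more than a few seconds, the dataset is probably too big for this assignment. If the returned list is empty for all three datasets, at least one of them needs to be swapped out.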

In a notebook called dataset_of_datasets.ipynb, create a markdown cell for each dataset that includes:

Store them for loading

Create a list of dictionaries in datasets.py, so that there is one dictionary for each dataset. Each dictionary should have the keys specified in Table 1.

Table 1: Metadata description of the dictionary to create

url: the full URL of the dataset

short_name: a short name

load_function: the function that should be used to load the data into a pandas.DataFrame (the actual function handle, not its name as a string)
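As a rough sketch of what datasets.py might contain, the block below builds such a list. The URLs and short names are placeholders invented for illustration, not real datasets, and the variable name `datasets` is an assumption. The loader for each entry depends on the file format you chose.

```python
import pandas as pd

# Placeholder entries: replace each URL and short name with your own datasets.
datasets = [
    {
        "url": "https://example.com/data/first.csv",  # hypothetical URL
        "short_name": "first",
        "load_function": pd.read_csv,  # the function itself, with no parentheses
    },
    {
        "url": "https://example.com/data/second.json",  # hypothetical URL
        "short_name": "second",
        "load_function": pd.read_json,
    },
    {
        "url": "https://example.com/data/third.xlsx",  # hypothetical URL
        "short_name": "third",
        "load_function": pd.read_excel,
    },
]
```

Storing `pd.read_csv` rather than `pd.read_csv(...)` or the string "read_csv" keeps the function callable later, so each dataset can be loaded with `entry["load_function"](entry["url"])`.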

Make a dataset about your datasets

In a notebook called dataset_of_datasets.ipynb, import the list of dictionaries from the datasets module you created in the step above. Then iterate over the list of dictionaries, and for each:

  1. load the dataset using the function from the dictionary

  2. save it to a local CSV file, using the short name you provided for the dataset as the file name and without writing the index column to the file

  3. record the attributes of the dataset listed in Table 2 in a list of lists or a dictionary of lists

  4. After the loop, use those records to create a DataFrame whose columns match the rows of the following table.

Table 2: Metadata description of the DataFrame to build

name: a short name for the dataset

source: a URL to where you found the data

num_rows: number of rows in the dataset

num_columns: number of columns in the dataset

num_numerical: number of numerical variables in the dataset
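The loop described in the steps above, together with the columns in Table 2, could be sketched as follows. This is one possible approach rather than the required solution: the function name `summarize_datasets` and the `out_dir` parameter are assumptions made for illustration, and counting numerical variables with `select_dtypes(include="number")` is one reasonable reading of "numerical" (it does not count boolean columns).

```python
from pathlib import Path

import pandas as pd


def summarize_datasets(datasets, out_dir="."):
    """Load each dataset, save it as a CSV file, and return a DataFrame of attributes.

    `datasets` is a list of dictionaries with the keys url, short_name,
    and load_function. `out_dir` is a hypothetical convenience parameter.
    """
    # Dictionary of lists, one list per column of the final DataFrame
    records = {
        "name": [],
        "source": [],
        "num_rows": [],
        "num_columns": [],
        "num_numerical": [],
    }
    for entry in datasets:
        # 1. load the dataset using the function stored in the dictionary
        df = entry["load_function"](entry["url"])
        # 2. save it locally under its short name, without the index column
        df.to_csv(Path(out_dir) / f"{entry['short_name']}.csv", index=False)
        # 3. record the attributes of this dataset
        records["name"].append(entry["short_name"])
        records["source"].append(entry["url"])
        records["num_rows"].append(df.shape[0])
        records["num_columns"].append(df.shape[1])
        records["num_numerical"].append(df.select_dtypes(include="number").shape[1])
    # 4. after the loop, build the DataFrame from the recorded attributes
    return pd.DataFrame(records)
```

In the notebook, this might be used as `summary = summarize_datasets(datasets)` after `from datasets import datasets`, assuming the list in datasets.py is named `datasets`.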

Explore Your Datasets

In a second notebook file called exploration.ipynb:

For one dataset that includes non-numerical data:

For any other dataset:

For the third dataset:

Upload (or push) your notebook and .py files

Add the files to your portfolio repository.