Data Sources#
This page is a semi-curated source of datasets for use in assignments. The different sections have datasets that are good for different assignments.
Best for loading directly into a notebook#
Tidy Tuesday inside the folder for each year, there is a README file with a list of the datasets. These are .csv files.
National Center for Education Statistics Digest 2019 These data tables are available for download as Excel files and are visible on the page.
Lots of Wikipedia pages have tables in them.
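A minimal sketch of what "loading directly" can look like, assuming pandas is installed in your notebook environment. The helper name `load_csv` and the inline sample data are hypothetical stand-ins, used so the example runs without a download.

```python
import io

import pandas as pd


def load_csv(source):
    """Read a .csv from a URL, a local path, or an open file-like object."""
    return pd.read_csv(source)


# Hypothetical inline data standing in for a real .csv file,
# so this sketch runs offline.
sample = io.StringIO("name,year\nA,2019\nB,2020\n")
df = load_csv(sample)
print(df.head())
```

In a notebook, passing the URL of a raw .csv file to `pd.read_csv` works the same way. For the other sources in this section, `pd.read_excel` reads Excel files and `pd.read_html` returns a list of the tables found on a web page; both may need an extra parser package installed (for example openpyxl or lxml).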
Cleaning Examples#
Messy Artists .csv file, that needs to be cleaned, containing data on artists
Messy Wheels .csv file, that needs to be cleaned, containing data on various wheel attractions around the globe
Clean Artists .csv file, already cleaned, containing data on artists
Clean Wheels .csv file, already cleaned, containing data on various wheel attractions around the globe
Data cleaning with OpenRefine on survey data This is a tutorial for cleaning data with another tool, but it demonstrates common problems with data well.
Data cleaning for ecology This is a tutorial for cleaning data with another tool, but it demonstrates common problems with data well.
General Sources#
These may require some more work before you can load them, such as downloading and unzipping.
Stack Overflow Developer Survey This data comes with README info, all packaged together in a .zip. You’ll need to unzip it first.
Kaggle Most Kaggle datasets will require you to download and unzip them first; then you can copy them into your repo folder.
UCI Data Repository Machine Learning focused datasets, can filter by task
A curated list of datasets by task It includes datasets for cleaning, visualization, machine learning, and “data analysis” which would align with EDA in this course.
Hugging Face NLP Datasets lots of text datasets
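For the sources above that arrive as a .zip, the following is a rough sketch of listing an archive's contents and reading one .csv out of it with only the Python standard library. The function name `read_csv_from_zip` and the file names `README.txt` and `survey.csv` are hypothetical; a real download will have its own file names, which you can find by listing the archive first.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member_name):
    """Return the rows of one .csv inside a .zip as a list of dicts."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member_name) as raw:
            # Archive members are opened as bytes; wrap them to read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny archive in memory (hypothetical contents) so the sketch
# runs without downloading anything.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README.txt", "hypothetical readme")
    archive.writestr("survey.csv", "id,answer\n1,yes\n2,no\n")
buffer.seek(0)

# List the files first to find out what the archive contains.
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()

buffer.seek(0)
rows = read_csv_from_zip(buffer, "survey.csv")
print(members)
print(rows)
```

With a downloaded file, pass its path (for example a .zip saved in your repo folder) in place of `buffer`. Reading directly from the archive avoids keeping a second, unzipped copy of a large dataset, though unzipping by hand and copying the files into your repo works just as well.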
Datasets in many parts#
Datasets with time#
Databases#
If you have others, please share by creating a pull request or issue on this repo (from the GitHub logo at the top right, choose suggest edit).