{ "cells": [ { "cell_type": "markdown", "id": "aadebc75", "metadata": {}, "source": [ "# Missing Data and Inconsistent Coding" ] }, { "cell_type": "code", "execution_count": 1, "id": "9da595e0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "# make plots look nicer and increase font size\n", "sns.set_theme(font_scale=2, palette='colorblind')\n", "\n", "# toy DataFrame that contains missing values\n", "na_toy_df = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, np.nan]])\n", "\n", "# cleaned arabica coffee data\n", "arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'\n", "coffee_df = pd.read_csv(arabica_data_url)\n", "\n", "# GitHub events for the course organization\n", "rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'\n", "course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)" ] }, { "cell_type": "markdown", "id": "ea31363a", "metadata": {}, "source": [ "So far, we've dealt with structural issues in data, but there's a lot more to\n", "cleaning.\n", "\n", "Today, we'll deal with how to fix the values within the data. To see the\n", "types of issues that come up in real datasets, read:\n", "\n", " - [Stanford Policy Lab Open Policing Project data readme](https://github.com/stanford-policylab/opp/blob/master/data_readme.md)\n", " - the \"How we acquired data\" section of [ProPublica Machine Bias](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)" ] }, { "cell_type": "markdown", "id": "ca4123c0", "metadata": {}, "source": [ "## Missing Values\n", "\n", "\n", "Dealing with missing data is a whole research area. 
There isn't one solution.\n", "\n", "In 2020 there was [a workshop on it](https://artemiss-workshop.github.io/).\n", "\n", "There are also many classic approaches, both when training and when [applying models](https://www.jmlr.org/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf).\n", "\n", "[Example application in breast cancer detection](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.4234&rep=rep1&type=pdf)\n", "\n", "In pandas, even representing [missing values](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) is under [experimentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na). Currently, it uses `numpy.nan` by default, but the experimental alternative is `pd.NA`.\n", "\n", "Missing values even cause the [datatypes to change](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-casting-rules-and-indexing); for example, an integer column that contains a missing value is cast to float.\n", "\n", "Pandas gives a few basic tools:\n", " - drop with `dropna`\n", " - fill with `fillna`" ] }, { "cell_type": "code", "execution_count": 2, "id": "5d557ca7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |\n",
"| 1 | 2 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 1 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |\n",
"| 2 | 3 | Arabica | grounds for health admin | Guatemala | san marcos barrancas \"san cristobal cuch | NaN | NaN | NaN | NaN | 1600 - 1800 m | ... | NaN | 0 | May 31st, 2011 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 1600.0 | 1800.0 | 1700.0 |\n",
"| 3 | 4 | Arabica | yidnekachew dabessa | Ethiopia | yidnekachew dabessa coffee plantation | NaN | wolensu | NaN | yidnekachew debessa coffee plantation | 1800-2200 | ... | Green | 2 | March 25th, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1800.0 | 2200.0 | 2000.0 |\n",
"| 4 | 5 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 2 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |\n",
"\n",
"5 rows × 44 columns
\n", "