{ "cells": [ { "cell_type": "markdown", "id": "aadebc75", "metadata": {}, "source": [ "# Missing Data and Inconsistent coding" ] }, { "cell_type": "code", "execution_count": 1, "id": "9da595e0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "sns.set_theme(palette= \"colorblind\")\n", "na_toy_df = pd.DataFrame(data = [[1,3,4,5],[2 ,6, np.nan]])\n", "\n", "# make plots look nicer and increase font size\n", "sns.set_theme(font_scale=2,palette='colorblind')\n", "arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'\n", "\n", "coffee_df = pd.read_csv(arabica_data_url)\n", "\n", "\n", "rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'\n", "course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)" ] }, { "cell_type": "markdown", "id": "ea31363a", "metadata": {}, "source": [ "So far, we've dealt with structural issues in data. but there's a lot more to\n", "cleaning. \n", "\n", "Today, we'll deal with how to fix the values wihtin the data. To see the\n", "types of things:\n", "\n", "[Stanford Policy Lab Open Policing Project data readme](https://github.com/stanford-policylab/opp/blob/master/data_readme.md)\n", "[Propublica Machine Bias](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm) the \"How we acquired data\" section" ] }, { "cell_type": "markdown", "id": "ca4123c0", "metadata": {}, "source": [ "## Missing Values\n", "\n", "\n", "Dealing with missing data is a whole research area. There isn't one solution.\n", "\n", "[in 2020 there was a workshop on it](https://artemiss-workshop.github.io/)\n", "\n", "There are also many classic approaches both when training and when [applying models](https://www.jmlr.org/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf).\n", "\n", "[example application in breast cancer detection](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.701.4234&rep=rep1&type=pdf)\n", "\n", "In pandas, even representing [missing values](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) is under [experimentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na). Currently, it uses `numpy.NaN`, but the experiment is with `pd.NA`.\n", "\n", "Missing values even causes the [datatypes to change](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-casting-rules-and-indexing)\n", "\n", "Pandas gives a few basic tools:\n", " - drop with (`dropna`)\n", " - fill with `fillna`" ] }, { "cell_type": "code", "execution_count": 2, "id": "5d557ca7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0SpeciesOwnerCountry.of.OriginFarm.NameLot.NumberMillICO.NumberCompanyAltitude...ColorCategory.Two.DefectsExpirationCertification.BodyCertification.AddressCertification.Contactunit_of_measurementaltitude_low_metersaltitude_high_metersaltitude_mean_meters
01Arabicametad plcEthiopiametad plcNaNmetad plc2014/2015metad agricultural developmet plc1950-2200...Green0April 3rd, 2016METAD Agricultural Development plc309fcf77415a3661ae83e027f7e5f05dad786e4419fef5a731de2db57d16da10287413f5f99bc2ddm1950.02200.02075.0
12Arabicametad plcEthiopiametad plcNaNmetad plc2014/2015metad agricultural developmet plc1950-2200...Green1April 3rd, 2016METAD Agricultural Development plc309fcf77415a3661ae83e027f7e5f05dad786e4419fef5a731de2db57d16da10287413f5f99bc2ddm1950.02200.02075.0
23Arabicagrounds for health adminGuatemalasan marcos barrancas \"san cristobal cuchNaNNaNNaNNaN1600 - 1800 m...NaN0May 31st, 2011Specialty Coffee Association36d0d00a3724338ba7937c52a378d085f2172daa0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660m1600.01800.01700.0
34Arabicayidnekachew dabessaEthiopiayidnekachew dabessa coffee plantationNaNwolensuNaNyidnekachew debessa coffee plantation1800-2200...Green2March 25th, 2016METAD Agricultural Development plc309fcf77415a3661ae83e027f7e5f05dad786e4419fef5a731de2db57d16da10287413f5f99bc2ddm1800.02200.02000.0
45Arabicametad plcEthiopiametad plcNaNmetad plc2014/2015metad agricultural developmet plc1950-2200...Green2April 3rd, 2016METAD Agricultural Development plc309fcf77415a3661ae83e027f7e5f05dad786e4419fef5a731de2db57d16da10287413f5f99bc2ddm1950.02200.02075.0
\n", "

5 rows × 44 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 Species Owner Country.of.Origin \\\n", "0 1 Arabica metad plc Ethiopia \n", "1 2 Arabica metad plc Ethiopia \n", "2 3 Arabica grounds for health admin Guatemala \n", "3 4 Arabica yidnekachew dabessa Ethiopia \n", "4 5 Arabica metad plc Ethiopia \n", "\n", " Farm.Name Lot.Number Mill ICO.Number \\\n", "0 metad plc NaN metad plc 2014/2015 \n", "1 metad plc NaN metad plc 2014/2015 \n", "2 san marcos barrancas \"san cristobal cuch NaN NaN NaN \n", "3 yidnekachew dabessa coffee plantation NaN wolensu NaN \n", "4 metad plc NaN metad plc 2014/2015 \n", "\n", " Company Altitude ... Color \\\n", "0 metad agricultural developmet plc 1950-2200 ... Green \n", "1 metad agricultural developmet plc 1950-2200 ... Green \n", "2 NaN 1600 - 1800 m ... NaN \n", "3 yidnekachew debessa coffee plantation 1800-2200 ... Green \n", "4 metad agricultural developmet plc 1950-2200 ... Green \n", "\n", " Category.Two.Defects Expiration Certification.Body \\\n", "0 0 April 3rd, 2016 METAD Agricultural Development plc \n", "1 1 April 3rd, 2016 METAD Agricultural Development plc \n", "2 0 May 31st, 2011 Specialty Coffee Association \n", "3 2 March 25th, 2016 METAD Agricultural Development plc \n", "4 2 April 3rd, 2016 METAD Agricultural Development plc \n", "\n", " Certification.Address \\\n", "0 309fcf77415a3661ae83e027f7e5f05dad786e44 \n", "1 309fcf77415a3661ae83e027f7e5f05dad786e44 \n", "2 36d0d00a3724338ba7937c52a378d085f2172daa \n", "3 309fcf77415a3661ae83e027f7e5f05dad786e44 \n", "4 309fcf77415a3661ae83e027f7e5f05dad786e44 \n", "\n", " Certification.Contact unit_of_measurement \\\n", "0 19fef5a731de2db57d16da10287413f5f99bc2dd m \n", "1 19fef5a731de2db57d16da10287413f5f99bc2dd m \n", "2 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m \n", "3 19fef5a731de2db57d16da10287413f5f99bc2dd m \n", "4 19fef5a731de2db57d16da10287413f5f99bc2dd m \n", "\n", " altitude_low_meters altitude_high_meters altitude_mean_meters \n", "0 1950.0 2200.0 2075.0 \n", "1 1950.0 2200.0 2075.0 \n", "2 1600.0 1800.0 1700.0 \n", "3 1800.0 2200.0 2000.0 \n", "4 1950.0 2200.0 2075.0 \n", "\n", "[5 rows x 44 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df.head()" ] }, { "cell_type": "markdown", "id": "4f9e7cdf", "metadata": {}, "source": [ "The 'Lot.Number' has a lot of NaN values, how can we explore it?\n", "\n", "We can look at the type:" ] }, { "cell_type": "code", "execution_count": 3, "id": "8cb018ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('O')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['Lot.Number'].dtype" ] }, { "cell_type": "markdown", "id": "4a46846e", "metadata": {}, "source": [ "And we can look at the value counts." ] }, { "cell_type": "code", "execution_count": 4, "id": "14d2c23c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 18\n", "020/17 6\n", "019/17 5\n", "2 3\n", "102 3\n", " ..\n", "11/23/0696 1\n", "3-59-2318 1\n", "8885 1\n", "5055 1\n", "017-053-0211/ 017-053-0212 1\n", "Name: Lot.Number, Length: 221, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['Lot.Number'].value_counts()" ] }, { "cell_type": "markdown", "id": "d12fedda", "metadata": {}, "source": [ "Filling can be good if you know how to fill reasonably, but don't have data to\n", "spare by dropping. 
For example\n", "- you can approximate with another column\n", "- you can approximate with that column from other rows\n", "\n", "\n", "\n", "We see that a lot are '1', maybe we know that when the data was collected, if the Farm only has one lot, some people recorded '1' and others left it as missing. So we could fill in with 1:" ] }, { "cell_type": "code", "execution_count": 5, "id": "4084244e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "Name: Lot.Number, dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['Lot.Number'].fillna(1).head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "c6d7c8e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 NaN\n", "2 NaN\n", "3 NaN\n", "4 NaN\n", "Name: Lot.Number, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['Lot.Number'].head()" ] }, { "cell_type": "markdown", "id": "08fdca76", "metadata": {}, "source": [ "```{tip}\n", "Note that even after we called `fillna` we display it again and the original data is unchanged.\n", "```\n", "\n", "To save the filled in column we have a few choices:\n", "- use the `inplace` parameter. This doesn't offer performance advantages, but does It still copies the object, but then reassigns the pointer. Its under discussion to [deprecate](https://github.com/pandas-dev/pandas/issues/16529)\n", "- write to a new DataFrame\n", "- add a column\n", "\n", "We'll use adding a column:" ] }, { "cell_type": "code", "execution_count": 7, "id": "96d39331", "metadata": {}, "outputs": [], "source": [ "coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna(1)" ] }, { "cell_type": "markdown", "id": "c3fbd86d", "metadata": {}, "source": [ "```{admonition} Question in Class\n", "When I use value counts it treats the filled ones as different. Why?\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "id": "2a923465", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 18\n", "020/17 6\n", "019/17 5\n", "2 3\n", "102 3\n", " ..\n", "11/23/0696 1\n", "3-59-2318 1\n", "8885 1\n", "5055 1\n", "017-053-0211/ 017-053-0212 1\n", "Name: Lot.Number, Length: 221, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['Lot.Number'].value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "id": "c967b54c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 1041\n", "1 18\n", "020/17 6\n", "019/17 5\n", "102 3\n", " ... \n", "3-59-2318 1\n", "8885 1\n", "5055 1\n", "MCCFWXA15/16 1\n", "017-053-0211/ 017-053-0212 1\n", "Name: lot_number_clean, Length: 222, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['lot_number_clean'].value_counts()" ] }, { "cell_type": "markdown", "id": "8e932398", "metadata": {}, "source": [ "If we swithc to `1` as a string, then we'd see all of the one values as the same thing." ] }, { "cell_type": "code", "execution_count": 10, "id": "62fc064e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 1059\n", "020/17 6\n", "019/17 5\n", "102 3\n", "103 3\n", " ... 
\n", "3-59-2318 1\n", "8885 1\n", "5055 1\n", "MCCFWXA15/16 1\n", "017-053-0211/ 017-053-0212 1\n", "Name: lot_number_clean, Length: 221, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna('1')\n", "\n", "coffee_df['lot_number_clean'].value_counts()" ] }, { "cell_type": "markdown", "id": "255bc996", "metadata": {}, "source": [ "This was our goal, so in this case, it's the right thing to do to overwrite the value.\n", "\n", "\n", "\n", "Dropping is a good choice when you otherwise have a lot of data and the data is\n", "missing at random.\n", "\n", "Dropping can be risky if it's not missing at random. For example, if we saw in\n", "the coffee data that one of the scores was missing for all of the rows from one\n", "country, or even just missing more often in one country, that could bias our\n", "results.\n", "\n", "To illustrate how `dropna` works, we'll use the `shape` method:" ] }, { "cell_type": "code", "execution_count": 11, "id": "12852dca", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1311, 45)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df.shape" ] }, { "cell_type": "markdown", "id": "abc41705", "metadata": {}, "source": [ "By default, it drops any row with one or more `NaN` values." ] }, { "cell_type": "code", "execution_count": 12, "id": "12b0d9ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(130, 45)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df.dropna().shape" ] }, { "cell_type": "markdown", "id": "c7e756dd", "metadata": {}, "source": [ "We could instead tell it to only drop rows with `NaN` in a subset of the columns.\n", "\n", "````{margin}\n", "```{note}\n", "subset operates along columns by default, because axis is set to 0, by default.\n", "```\n", "````" ] }, { "cell_type": "code", "execution_count": 13, "id": "2a1f87f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1084, 45)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df.dropna(subset=['altitude_low_meters']).shape" ] }, { "cell_type": "markdown", "id": "0eed435a", "metadata": {}, "source": [ "**whatever you do, document it**\n", "\n", "```{admonition} Try it Yourself\n", "use the `na_toy_df` DataFrame that's defined in the first cell, to experiment with subset and axis parameters to understand them better.\n", "```\n", "\n", "In the [Open Policing Project Data Summary](https://openpolicing.stanford.edu/data/) we saw that they made a summary information that showed which variables had at least 70% not missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the `thresh` parameter and `axis=1` to drop along columns." ] }, { "cell_type": "code", "execution_count": 14, "id": "b252812a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1311, 44)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n_rows, n_cols = coffee_df.shape\n", "\n", "coffee_df.dropna(thresh=.7*n_rows,axis=1).shape" ] }, { "cell_type": "markdown", "id": "bdf66ff0", "metadata": {}, "source": [ "This dataset is actually in pretty good shape, but if we use a more stringent threshold it drops more columns." 
] }, { "cell_type": "code", "execution_count": 15, "id": "94ed6bc9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1311, 34)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df.dropna(thresh=.85*n_rows,axis=1).shape" ] }, { "cell_type": "markdown", "id": "fc907f10", "metadata": {}, "source": [ "```{important}\n", "Everththing after this is new material that we did not have time for in class,\n", "but is important and helpful in your assignment (and for your portflio).\n", "```\n", "\n", "## Inconsistent values\n", "\n", "This was one of the things that many of you anticipated or had observed. A useful way to investigate for this, is to use `value_counts` and sort them alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the `value_counts()` Series, the values from the `coffee_df` become the index, so we use `sort_index`.\n", "\n", "Let's look at the `In.Country.Partner` column" ] }, { "cell_type": "code", "execution_count": 16, "id": "4bb3be0f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AMECAFE 205\n", "Africa Fine Coffee Association 49\n", "Almacafé 178\n", "Asociacion Nacional Del Café 155\n", "Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6\n", "Asociación de Cafés Especiales de Nicaragua 8\n", "Blossom Valley International 58\n", "Blossom Valley International\\n 1\n", "Brazil Specialty Coffee Association 67\n", "Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1\n", "Centro Agroecológico del Café A.C. 8\n", "Coffee Quality Institute 7\n", "Ethiopia Commodity Exchange 18\n", "Instituto Hondureño del Café 60\n", "Kenya Coffee Traders Association 22\n", "METAD Agricultural Development plc 15\n", "NUCOFFEE 36\n", "Salvadoran Coffee Council 11\n", "Specialty Coffee Ass 1\n", "Specialty Coffee Association 295\n", "Specialty Coffee Association of Costa Rica 42\n", "Specialty Coffee Association of Indonesia 10\n", "Specialty Coffee Institute of Asia 16\n", "Tanzanian Coffee Board 6\n", "Torch Coffee Lab Yunnan 2\n", "Uganda Coffee Development Authority 22\n", "Yunnan Coffee Exchange 12\n", "Name: In.Country.Partner, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coffee_df['In.Country.Partner'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "id": "3b1f4cdd", "metadata": {}, "source": [ "We can see there's only one `Blossom Valley International\\n` but 58 `Blossom Valley International`, the former is likely a typo, especially since `\\n` is a special character for a newline. Similarly, with 'Specialty Coffee Ass' and 'Specialty Coffee Association'.\n", "\n", "This is another job for dictionaries, we make one with the value to replace as the key and the value to insert as the value." ] }, { "cell_type": "code", "execution_count": 17, "id": "74dc079d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AMECAFE 205\n", "Africa Fine Coffee Association 49\n", "Almacafé 178\n", "Asociacion Nacional Del Café 155\n", "Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6\n", "Asociación de Cafés Especiales de Nicaragua 8\n", "Blossom Valley International 59\n", "Brazil Specialty Coffee Association 67\n", "Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1\n", "Centro Agroecológico del Café A.C. 
8\n", "Coffee Quality Institute 7\n", "Ethiopia Commodity Exchange 18\n", "Instituto Hondureño del Café 60\n", "Kenya Coffee Traders Association 22\n", "METAD Agricultural Development plc 15\n", "NUCOFFEE 36\n", "Salvadoran Coffee Council 11\n", "Specialty Coffee Association 296\n", "Specialty Coffee Association of Costa Rica 42\n", "Specialty Coffee Association of Indonesia 10\n", "Specialty Coffee Institute of Asia 16\n", "Tanzanian Coffee Board 6\n", "Torch Coffee Lab Yunnan 2\n", "Uganda Coffee Development Authority 22\n", "Yunnan Coffee Exchange 12\n", "Name: in_country_partner_clean, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "partner_corrections = {'Blossom Valley International\\n':'Blossom Valley International',\n", " 'Specialty Coffee Ass':'Specialty Coffee Association'}\n", "coffee_df['in_country_partner_clean'] = coffee_df['In.Country.Partner'].replace(\n", " to_replace=partner_corrections)\n", "coffee_df['in_country_partner_clean'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "id": "e04d215a", "metadata": {}, "source": [ "and now we see the corrected values. We can also pass lambdas or put lambas in the dictionary if there are systemic patterns.\n", "\n", "## Fixing data at load time\n", "\n", "Explore some of the different parameters in `read_csv`\n", "\n", "\n", "How can we read in data that looks like this:\n", "![mulitindex img of excel file](https://github.com/rhodyprog4ds/BrownFall20/raw/main/img/multiindex.png)\n", "\n", "```\n", "pd.read_csv('fancy_formatting.xlsx', header = list(range(4)))\n", "```\n", "\n", "Many problems can be repaired with parameters in [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).\n", "\n", "## A Cleaning Data Recipe\n", "\n", "__not everything possible, but good enough for this course__\n", "\n", "\n", "1. Can you use parameters to read the data in better?\n", "1. Fix the index and column headers (making these easier to use makes the rest easier)\n", "1. Is the data structured well?\n", "1. Are there missing values?\n", "1. Do the datatypes match what you expect by looking at the head or a sample?\n", "1. Are categorical variables represented in usable way?\n", "1. Does your analysis require filtering or augmenting the data?\n", "\n", "Things to keep in mind:\n", "- always save new copies of data when you mutate it\n", "- add new columns rather than overwriting columns\n", "- long variable names are better than ambiguous naming\n", "\n", "\n", "## Your observations from Monday:\n", "\n", "I promised we'd come back to your observations on what problems could occur in\n", "data. Here they are, organized by rough categories of when/how to fix them.\n", "\n", "\n", "We can fix while reading in data:\n", "- decimal was indicated with ',' insead of '.' 
 "We can fix by reshaping data:\n", "- Data can get read into tables in bizarre ways depending on how the data was entered originally.\n", "- every value in one column, instead of separated\n", "\n", "We can repair by changing values or filtering:\n", "- information represented inconsistently, e.g. \"Value\" and \" Value \", or twenty-two instead of 22\n", "- blank rows or blank columns or data that is N/A\n", "- date/time information can be represented in lots of different ways\n", "- categorical variables represented with numbers that are ambiguous\n", "- spaces or other symbols in column names\n", "- some numbers as strings, others as ints, within a column\n", "- symbols being misinterpreted\n", "\n", "\n", "Real problems, but beyond our scope:\n", "- corrupt data files\n", "\n", "\n", "## More Practice\n", "\n", "Instead of more practice with these manipulations, below are more\n", "examples of cleaning data to see how these types of manipulations get used. \n", "Your goal here is not to memorize every possible thing, but to build a general\n", "idea of what good data looks like and good habits for cleaning data and keeping\n", "it reproducible. \n", "- [Cleaning the Adult Dataset](https://ryanwingate.com/projects/machine-learning-data-prep/adult/adult-cleaning/)\n", "- [All Shades](https://github.com/the-pudding/data/tree/master/foundation-names#allshadescsv--allshadesr)\n", "\n", "Finally, this comprehensive [discussion of data cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4) also includes tips on general data management and organization." ] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.10.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "source_map": [ 12, 16, 33, 44, 65, 67, 72, 74, 77, 79, 89, 94, 96, 108, 110, 118, 123, 125, 129, 133, 149, 151, 155, 157, 167, 169, 179, 183, 187, 189, 203, 205, 210, 216 ] }, "nbformat": 4, "nbformat_minor": 5 }