{
"cells": [
{
"cell_type": "markdown",
"id": "8dba16a2",
"metadata": {},
"source": [
"# Building Datasets from Multiple Sources\n",
"\n",
"focus this week is on how to programmatically combine sources of data\n",
"\n",
"We will start by looking at combining multiple tabular data formats and see how to get data from other sources. \n",
"\n",
"\n",
"\n",
"## Self- assessment, what's your plan\n",
"\n",
"Take a few minutes to think about the following questions and make a few notes\n",
"for yourself whereever you need them to be (a planner, calendar, etc).\n",
"Share one takeaway or question you have below when you're done.\n",
"\n",
"1. What achievements have you earned?\n",
"1. Does BrightSpace seem accurate to what you've done?\n",
"1. If not, e-mail brownsarahm with [CSC310] or [DSP310] in the subject and specific details about what you think is missing and why\n",
"1. Are you on track to earn the grade you want in this class?\n",
"1. If not, what will you need to do (respond more in class, submit more\n",
" assignments, use your portfolio to catch up) to get back on track?\n",
"1. If you are on track and you want to earn above a B, take a minute to think\n",
"about your portfolio. (tip: post an idea as an issue to get early feedback and help shaping your idea)\n",
"\n",
"## Logistics\n",
"\n",
"Check your earned achievements. See the [instructions](gradecheck) that are now saved for future reference on the left side.\n",
"\n",
"\n",
"## Merges"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "14a0c74d",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "331ab0f0",
"metadata": {},
"source": [
"I created a folder with [datasets we use in class](https://github.com/rhodyprog4ds/rhodyds/tree/main/data)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "10d33a0b",
"metadata": {},
"outputs": [],
"source": [
"course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'"
]
},
{
"cell_type": "markdown",
"id": "7a31dfcc",
"metadata": {},
"source": [
" We can load in two data sets of player information."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ba2efd5b",
"metadata": {},
"outputs": [],
"source": [
"df_18 = pd.read_csv(course_data_url+ '2018-players.csv')\n",
"df_19 = pd.read_csv(course_data_url+ '2019-players.csv')"
]
},
{
"cell_type": "markdown",
"id": "810d1b21",
"metadata": {},
"source": [
"and take a peek at each"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cab99358",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON\n",
"0 1610612761 202695 2018\n",
"1 1610612761 1627783 2018"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18.head(2)"
]
},
{
"cell_type": "markdown",
"id": "f9ac8866",
"metadata": {},
"source": [
"```{important}\n",
"Remember `columns` is an attribute, so it does not need `()`\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4e7b14fe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['PLAYER_NAME', 'TEAM_ID', 'PLAYER_ID', 'SEASON'], dtype='object')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_19.columns"
]
},
{
"cell_type": "markdown",
"id": "2e7e71c4",
"metadata": {},
"source": [
"Let's make note of the shape of each"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e3a0d995",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((748, 3), (626, 4))"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18.shape, df_19.shape"
]
},
{
"cell_type": "markdown",
"id": "4a737616",
"metadata": {},
"source": [
"## What if we want to analyze them together?\n",
"\n",
"We can combine them vertically:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "55af8abf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1374, 4)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df_18,df_19]).shape"
]
},
{
"cell_type": "markdown",
"id": "d2b26b67",
"metadata": {},
"source": [
"Note that this has the maximum number of columns (because both had some overlapping\n",
" columns) and the total number of rows.\n",
"\n",
"\n",
"## How can we find which players changed teams?\n",
"\n",
"To do this we want to have one player column and a column with each year's team.\n",
"\n",
"We can use a merge to do that.\n",
"\n",
"````{margin}\n",
"```{note}\n",
"the `head` here is only to help us see more at a time\n",
"```\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "92612200",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
" PLAYER_NAME | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [TEAM_ID, PLAYER_ID, SEASON, PLAYER_NAME]\n",
"Index: []"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19,).head(2)"
]
},
{
"cell_type": "markdown",
"id": "af9383d1",
"metadata": {},
"source": [
"if we merge them without any parameters, it tries to merge on all shared columns. We want to merge them using the `PLAYER_ID` column though, we would say hat we are \"merging on player ID\" and we use the `on` parameter to do it"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0ba6118e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_x | \n",
" PLAYER_ID | \n",
" SEASON_x | \n",
" PLAYER_NAME | \n",
" TEAM_ID_y | \n",
" SEASON_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_x PLAYER_ID SEASON_x PLAYER_NAME TEAM_ID_y SEASON_y\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019\n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19, on = 'PLAYER_ID').head(2)"
]
},
{
"cell_type": "markdown",
"id": "03d1c7d3",
"metadata": {},
"source": [
"Since there are other columns that appear in both DataFrames, they get a suffis, which by default is `x` or `y`, we can specify them though."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fdcf9bff",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 \n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 \n",
"\n",
" SEASON_2019 \n",
"0 2019 \n",
"1 2019 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19, on = 'PLAYER_ID',\n",
" suffixes=('_2018','_2019')).head(2)"
]
},
{
"cell_type": "markdown",
"id": "bae3759c",
"metadata": {},
"source": [
"We also told it what to append to any column names that match, but are not the same across both datasets."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "25cc8f5e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON\n",
"0 1610612761 202695 2018"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a439cea8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PLAYER_NAME | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Royce O'Neale | \n",
" 1610612762 | \n",
" 1626220 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PLAYER_NAME TEAM_ID PLAYER_ID SEASON\n",
"0 Royce O'Neale 1610612762 1626220 2019"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_19.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f35d1d51",
"metadata": {},
"outputs": [],
"source": [
"df1819 = pd.merge(df_18,df_19, on = 'PLAYER_ID',\n",
" suffixes=('_2018','_2019'))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4a7dc765",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(538, 6)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1819.shape"
]
},
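{
"cell_type": "markdown",
"id": "3f1a2b3c",
"metadata": {},
"source": [
"To answer the section's question directly (a sketch, not executed here): a player changed teams if their 2018 team ID differs from their 2019 team ID in the merged frame. Players traded mid-season will show up in more than one such row.\n",
"\n",
"```python\n",
"changed = df1819[df1819['TEAM_ID_2018'] != df1819['TEAM_ID_2019']]\n",
"changed['PLAYER_NAME'].nunique()   # distinct players with a different team\n",
"```"
]
},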
{
"cell_type": "markdown",
"id": "223ef056",
"metadata": {},
"source": [
"By default, this uses an *inner* merge, so we get the players that are in both datasets only."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "fc013868",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612763 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" Danny Green | \n",
" 1610612747 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 \n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 \n",
"2 1610612761 201188 2018 Marc Gasol 1610612761 \n",
"3 1610612763 201188 2018 Marc Gasol 1610612761 \n",
"4 1610612761 201980 2018 Danny Green 1610612747 \n",
"\n",
" SEASON_2019 \n",
"0 2019 \n",
"1 2019 \n",
"2 2019 \n",
"3 2019 \n",
"4 2019 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1819.head()"
]
},
{
"cell_type": "markdown",
"id": "353f0257",
"metadata": {},
"source": [
"Some players still appear twice, because they were in one of the datsets twice, this\n",
"happens when a player plays for two team in one season.\n",
"\n",
"## Which players played in 2018, but not 2019?\n",
"\n",
"We have different types of merges, inner is both, out is either. Left and right keep all the rows of one dataFrame. We can use left with `df_18` as the left DataFrame to see which players played only in 18."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f2d7c973",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 749 | \n",
" 1610612752 | \n",
" 1629246 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 750 | \n",
" 1610612748 | \n",
" 1629159 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 751 | \n",
" 1610612762 | \n",
" 1629163 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 752 | \n",
" 1610612743 | \n",
" 1629150 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 753 | \n",
" 1610612738 | \n",
" 1629167 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"749 1610612752 1629246 2018 NaN NaN \n",
"750 1610612748 1629159 2018 NaN NaN \n",
"751 1610612762 1629163 2018 NaN NaN \n",
"752 1610612743 1629150 2018 NaN NaN \n",
"753 1610612738 1629167 2018 NaN NaN \n",
"\n",
" SEASON_2019 \n",
"749 NaN \n",
"750 NaN \n",
"751 NaN \n",
"752 NaN \n",
"753 NaN "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19,on='PLAYER_ID', how='left',suffixes=('_2018','_2019')).tail()"
]
},
{
"cell_type": "markdown",
"id": "f7bec7a6",
"metadata": {},
"source": [
"To pick out those rows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "6751eaa7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 9 | \n",
" 1610612761 | \n",
" 202391 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" 1610612761 | \n",
" 201975 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 18 | \n",
" 1610612744 | \n",
" 101106 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 23 | \n",
" 1610612744 | \n",
" 2733 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24 | \n",
" 1610612744 | \n",
" 201973 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 749 | \n",
" 1610612752 | \n",
" 1629246 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 750 | \n",
" 1610612748 | \n",
" 1629159 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 751 | \n",
" 1610612762 | \n",
" 1629163 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 752 | \n",
" 1610612743 | \n",
" 1629150 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 753 | \n",
" 1610612738 | \n",
" 1629167 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
216 rows × 6 columns
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"9 1610612761 202391 2018 NaN NaN \n",
"11 1610612761 201975 2018 NaN NaN \n",
"18 1610612744 101106 2018 NaN NaN \n",
"23 1610612744 2733 2018 NaN NaN \n",
"24 1610612744 201973 2018 NaN NaN \n",
".. ... ... ... ... ... \n",
"749 1610612752 1629246 2018 NaN NaN \n",
"750 1610612748 1629159 2018 NaN NaN \n",
"751 1610612762 1629163 2018 NaN NaN \n",
"752 1610612743 1629150 2018 NaN NaN \n",
"753 1610612738 1629167 2018 NaN NaN \n",
"\n",
" SEASON_2019 \n",
"9 NaN \n",
"11 NaN \n",
"18 NaN \n",
"23 NaN \n",
"24 NaN \n",
".. ... \n",
"749 NaN \n",
"750 NaN \n",
"751 NaN \n",
"752 NaN \n",
"753 NaN \n",
"\n",
"[216 rows x 6 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only = pd.merge(df_18,df_19,on='PLAYER_ID', how='left', suffixes=('_2018','_2019'))\n",
"df_18_only[df_18_only['SEASON_2019'].isna()]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "f4994850",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 9 | \n",
" 1610612761 | \n",
" 202391 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" 1610612761 | \n",
" 201975 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 18 | \n",
" 1610612744 | \n",
" 101106 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 23 | \n",
" 1610612744 | \n",
" 2733 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24 | \n",
" 1610612744 | \n",
" 201973 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"9 1610612761 202391 2018 NaN NaN \n",
"11 1610612761 201975 2018 NaN NaN \n",
"18 1610612744 101106 2018 NaN NaN \n",
"23 1610612744 2733 2018 NaN NaN \n",
"24 1610612744 201973 2018 NaN NaN \n",
"\n",
" SEASON_2019 \n",
"9 NaN \n",
"11 NaN \n",
"18 NaN \n",
"23 NaN \n",
"24 NaN "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_left = pd.merge(df_18,df_19.drop(columns=['SEASON'])\n",
" ,on='PLAYER_ID', how='left', suffixes=('_2018','_2019'))\n",
"df_18_only = df_18_only[df_18_only['TEAM_ID_2019'].isna()]\n",
"df_18_only.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "9aba64e4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"178"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n_18_only, _ = df_18_only.drop_duplicates(subset=['PLAYER_ID']).shape\n",
"n_18_only"
]
},
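{
"cell_type": "markdown",
"id": "5a6b7c8d",
"metadata": {},
"source": [
"Another way to get the same answer (a sketch using `merge`'s `indicator` parameter, which adds a `_merge` column labeling each row `'both'`, `'left_only'`, or `'right_only'`):\n",
"\n",
"```python\n",
"marked = pd.merge(df_18, df_19, on='PLAYER_ID', how='outer', indicator=True)\n",
"marked[marked['_merge'] == 'left_only']   # players only in the 2018 data\n",
"```"
]
},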
{
"cell_type": "markdown",
"id": "9aa66a32",
"metadata": {},
"source": [
"## Which players played for the same team both seasons?"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ee7c2c48",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON_x | \n",
" PLAYER_NAME | \n",
" SEASON_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 200768 | \n",
" 2018 | \n",
" Kyle Lowry | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 1627832 | \n",
" 2018 | \n",
" Fred VanVleet | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201586 | \n",
" 2018 | \n",
" Serge Ibaka | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON_x PLAYER_NAME SEASON_y\n",
"0 1610612761 1627783 2018 Pascal Siakam 2019\n",
"1 1610612761 201188 2018 Marc Gasol 2019\n",
"2 1610612761 200768 2018 Kyle Lowry 2019\n",
"3 1610612761 1627832 2018 Fred VanVleet 2019\n",
"4 1610612761 201586 2018 Serge Ibaka 2019"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_same_team = pd.merge(df_18,df_19, on = ['PLAYER_ID','TEAM_ID'],)\n",
"df_same_team.head()"
]
},
{
"cell_type": "markdown",
"id": "c5d0a55b",
"metadata": {},
"source": [
"In this case, the suffix only applies to season, but they're not telling us much, so we can clean it up using `drop` before we merge."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "970190c9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" PLAYER_NAME | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" Pascal Siakam | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 201188 | \n",
" Marc Gasol | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 200768 | \n",
" Kyle Lowry | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 1627832 | \n",
" Fred VanVleet | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201586 | \n",
" Serge Ibaka | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID PLAYER_NAME\n",
"0 1610612761 1627783 Pascal Siakam\n",
"1 1610612761 201188 Marc Gasol\n",
"2 1610612761 200768 Kyle Lowry\n",
"3 1610612761 1627832 Fred VanVleet\n",
"4 1610612761 201586 Serge Ibaka"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only_clean = pd.merge(df_18.drop(columns='SEASON'),df_19.drop(columns='SEASON'), on = ['PLAYER_ID','TEAM_ID'],)\n",
"df_18_only_clean.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "8f2802f6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(263, 3)"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only_clean.shape"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "f039703e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(263, 3)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only_clean.drop_duplicates(subset=['PLAYER_ID']).shape"
]
},
{
"cell_type": "markdown",
"id": "37a7475e",
"metadata": {},
"source": [
"We do not need to drop the duplicates in this case becaue we merged on two columns and there were no actual duplicates in the original dataset."
]
},
{
"cell_type": "markdown",
"id": "3505e302",
"metadata": {},
"source": [
"## Visualizing merges"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "676afd96",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame(\n",
" {\n",
" \"key\": [\"K0\", \"K1\", \"K2\", \"K4\"],\n",
" \"A\": [\"A0\", \"A1\", \"A2\", \"A4\"],\n",
" \"B\": [\"B0\", \"B1\", \"B2\", \"B4\"],\n",
" }\n",
")\n",
"\n",
"\n",
"right = pd.DataFrame(\n",
" {\n",
" \"key\": [\"K0\", \"K1\", \"K2\", \"K3\"],\n",
" \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
" \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "2134f10e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
"
\n",
" \n",
" 3 | \n",
" K4 | \n",
" A4 | \n",
" B4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B\n",
"0 K0 A0 B0\n",
"1 K1 A1 B1\n",
"2 K2 A2 B2\n",
"3 K4 A4 B4"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "2a4630c6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K3 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key C D\n",
"0 K0 C0 D0\n",
"1 K1 C1 D1\n",
"2 K2 C2 D2\n",
"3 K3 C3 D3"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "db26d2ab",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='inner',)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "f7960896",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K4 | \n",
" A4 | \n",
" B4 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" K3 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K4 A4 B4 NaN NaN\n",
"4 K3 NaN NaN C3 D3"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='outer')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "ef693fcd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K4 | \n",
" A4 | \n",
" B4 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K4 A4 B4 NaN NaN"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='left')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "b5e5ef82",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K3 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K3 NaN NaN C3 D3"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='right')"
]
},
{
"cell_type": "markdown",
"id": "bd6928c8",
"metadata": {},
"source": [
" We can merge on multiple columns too/"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "ae44c24f",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K0\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K1\", \"K0\", \"K1\"],\n",
" \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n",
" \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n",
" }\n",
")\n",
"\n",
"\n",
"right = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K1\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K0\", \"K0\", \"K0\"],\n",
" \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
" \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "dd65da6e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B\n",
"0 K0 K0 A0 B0\n",
"1 K0 K1 A1 B1\n",
"2 K1 K0 A2 B2\n",
"3 K2 K1 A3 B3"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "25cc9549",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K0 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 C D\n",
"0 K0 K0 C0 D0\n",
"1 K1 K0 C1 D1\n",
"2 K1 K0 C2 D2\n",
"3 K2 K0 C3 D3"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "6a5ba4a3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K1 K0 A2 B2 C1 D1\n",
"2 K1 K0 A2 B2 C2 D2"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "1c3705d6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 3 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 4 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" K2 | \n",
" K0 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K0 K1 A1 B1 NaN NaN\n",
"2 K1 K0 A2 B2 C1 D1\n",
"3 K1 K0 A2 B2 C2 D2\n",
"4 K2 K1 A3 B3 NaN NaN\n",
"5 K2 K0 NaN NaN C3 D3"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"],how='outer')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "b727354a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2_x | \n",
" A | \n",
" B | \n",
" key2_y | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" K0 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 3 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" K0 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 4 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
" K0 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2_x A B key2_y C D\n",
"0 K0 K0 A0 B0 K0 C0 D0\n",
"1 K0 K1 A1 B1 K0 C0 D0\n",
"2 K1 K0 A2 B2 K0 C1 D1\n",
"3 K1 K0 A2 B2 K0 C2 D2\n",
"4 K2 K1 A3 B3 K0 C3 D3"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right,on = 'key1' )"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "5547fa50",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame({\"A\": [1, 2], \"B\": [2, 2]})\n",
"\n",
"right = pd.DataFrame({\"A\": [4, 5, 6], \"B\": [2, 2, 2]})\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "3a9b0b5d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 2"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "908c67e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 4 2\n",
"1 5 2\n",
"2 6 2"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "f0c8734a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A_x | \n",
" B | \n",
" A_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A_x B A_y\n",
"0 1 2 4\n",
"1 1 2 5\n",
"2 1 2 6\n",
"3 2 2 4\n",
"4 2 2 5\n",
"5 2 2 6"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=\"B\", how=\"outer\")"
]
},
{
"cell_type": "markdown",
"id": "875ce932",
"metadata": {},
"source": [
"This last merge is many-to-many: every `B == 2` row on the left pairs with every `B == 2` row on the right, so we get 2 × 3 = 6 rows."
]
},
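{
"cell_type": "markdown",
"id": "9b0c1d2e",
"metadata": {},
"source": [
"If a row blow-up like that would be a bug in your data, `merge`'s `validate` parameter can catch it; a sketch, where the merge raises instead of silently multiplying rows:\n",
"\n",
"```python\n",
"try:\n",
"    pd.merge(left, right, on='B', validate='one_to_one')\n",
"except pd.errors.MergeError as err:\n",
"    print(err)   # the B keys are not unique in either frame\n",
"```"
]
},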
{
"cell_type": "markdown",
"id": "8a80198e",
"metadata": {},
"source": [
"## Questions After Class\n",
"\n",
"```{important}\n",
"several questions were easiest to answer within the narrative of the notes above.\n",
"```\n",
"\n",
"### How do I represent NaN as a variable? (Ex. df1819['TEAM_ID_y'==NaN]...something like that?)\n",
"\n",
"For that specific case, you can use the `isna` method as I did above, but if you need a NaN constant otherwise pandas provides"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "56321717",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA"
]
},
{
"cell_type": "markdown",
"id": "d0e38245",
"metadata": {},
"source": [
"and numpy provides"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "b7a79448",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"np.nan"
]
},
{
"cell_type": "markdown",
"id": "5ee7fb5d",
"metadata": {},
"source": [
"Watch out though because they are not the same:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "aaaf0b70",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan == pd.NA"
]
},
{
"cell_type": "markdown",
"id": "708ac17c",
"metadata": {},
"source": [
"and this cannot even assert"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "794a3da9",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "boolean value of NA is ambiguous",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[44], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m pd\u001b[38;5;241m.\u001b[39mNA \u001b[38;5;241m==\u001b[39m np\u001b[38;5;241m.\u001b[39mnan\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/missing.pyx:382\u001b[0m, in \u001b[0;36mpandas._libs.missing.NAType.__bool__\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: boolean value of NA is ambiguous"
]
}
],
"source": [
"assert pd.NA == np.nan"
]
},
{
"cell_type": "markdown",
"id": "1f9f7595",
"metadata": {},
"source": [
"The `pandas` `pd.isna` method is robust, and it knows how `numpy` works (because it is built on top and imports)."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "4c17cb24",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.isna(np.nan)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "b91595a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.isna(pd.NA)"
]
},
{
"cell_type": "markdown",
"id": "166e3e20",
"metadata": {},
"source": [
"However, `numpy` does not know `pandas`"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "a0aa05bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.isnan(pd.NA)"
]
},
{
"cell_type": "markdown",
"id": "313e21c0",
"metadata": {},
"source": [
"and returns a value that cannot be used for indexing\n",
"\n",
"### Is there an easier way to look at the changes with merge?\n",
"\n",
"Any of the tools we have to look at a DataFrame will work. I think using a small DataFrame is the best way to get a good idea of how they each work, so I included that. Overall, checking the shape gives a broad view and then comparing values gives the detials.\n",
"\n",
"### Do merges also delete any other overlapping columns other than the one they are told to consolidate on?\n",
"\n",
"No they copy them all\n",
"\n",
"### in pd.merge(p18,p19, on='PLAYER_ID',how='____').shape is p18 and p19 supposed to be df_18 and df_19?\n",
"\n",
"Yes, sorry, I read them in using different names than were in my notes\n",
"\n",
"### Learning more about accessing data from databases\n",
"\n",
"We will access databases on Wednesday\n",
"\n",
"### How to merge data that does not line up as well as these two datasets?\n",
"\n",
"We will see how to merge on columns that have the same data, but different columns this week. When there is no columns that match value to value, you have to transform and add new columns, combining what we learned about cleaning data to make them line up somehow. For example, maybe one dataset has dates like `YYYY-MM-DD` and the other data set has only months like \"January\". You could split the whole date into three columns and transform the strings to numbers or numbers to strings by treating the columns as [dates](https://pandas.pydata.org/docs/user_guide/timeseries.html)\n",
"\n",
"### Why did we merged on 'PLAYER_ID'?\n",
"\n",
"We were interested in linking based on the players.\n",
"\n",
"\n",
"### What is the second best source after documentation? For ex: you found a parameter that might do what you want it to in pandas, but where do you go after to make sure? I'm not sure if I ran it, I'd specifically know if it \"worked\"\n",
"\n",
"Ideally, you should know what you are looking for to know if it worked or not. If you are not sure what you are looking for, that is a good time to try to think that through, or attend office hours. Next is to create an issue on your assignment repo if the question might reveal a solution.\n",
"\n",
"\n",
"### I think I need to understand how GitHub works because I'm following the steps of the assignments and I feel like I'm doing something wrong every time I have to work on GitHub.\n",
"\n",
"If you are getting feedback, you are close enough. However, I am offering an extra session on GitHub on Friday."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.14.1"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"source_map": [
12,
44,
46,
50,
52,
56,
59,
63,
65,
71,
73,
77,
79,
85,
87,
105,
107,
111,
113,
117,
120,
124,
128,
132,
137,
139,
143,
145,
154,
156,
160,
165,
172,
175,
179,
182,
186,
191,
195,
197,
201,
205,
224,
228,
232,
236,
240,
244,
246,
250,
271,
275,
279,
283,
287,
291,
295,
302,
306,
310,
314,
316,
328,
330,
333,
336,
341,
343,
348,
350,
355,
359,
361,
366,
368
]
},
"nbformat": 4,
"nbformat_minor": 5
}