{
"cells": [
{
"cell_type": "markdown",
"id": "119fc7dd",
"metadata": {},
"source": [
"# Building Datasets From multiple Sources"
]
},
{
"cell_type": "markdown",
"id": "d968cd5a",
"metadata": {},
"source": [
"```\n",
"1. remember, that\n",
"1. with markdown\n",
"1. you can make a numbered list\n",
"1. using all `1.`\n",
"```\n",
"\n",
"renders as:\n",
"\n",
"1. remember, that\n",
"1. with markdown\n",
"1. you can make a numbered list\n",
"1. using all `1.`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f63d46ac",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "297dc008",
"metadata": {},
"outputs": [],
"source": [
"course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'"
]
},
{
"cell_type": "markdown",
"id": "0f10a4a0",
"metadata": {},
"source": [
"Today we're going to look at some data on NBA (National Basketball Association) players. \n",
"\n",
"## Combining Multiple Tables"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c945d3a8",
"metadata": {},
"outputs": [],
"source": [
"p18 = pd.read_csv(course_data_url +'2018-players.csv')\n",
"p19 = pd.read_csv(course_data_url +'2019-players.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4ad20749",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 200768 | \n",
" 2018 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON\n",
"0 1610612761 202695 2018\n",
"1 1610612761 1627783 2018\n",
"2 1610612761 201188 2018\n",
"3 1610612761 201980 2018\n",
"4 1610612761 200768 2018"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p18.head()"
]
},
{
"cell_type": "markdown",
"id": "c6534e3d",
"metadata": {},
"source": [
"### Stacking with Concat\n",
"\n",
"We've seen one way of putting dataframes together with `concat` we used it when\n",
"{ref}`unpacking-jsons`. In that case, we stacked them side by side, now we want\n",
"to stack them vertically, which is the default of concat."
]
},
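{
"cell_type": "markdown",
"id": "a1f0c001",
"metadata": {},
"source": [
"To see both directions on something tiny, here is a minimal sketch with toy frames (`df_a` and `df_b` are made up for illustration, not part of the NBA data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c002",
"metadata": {},
"outputs": [],
"source": [
"# toy frames to show the stacking directions of concat\n",
"df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})\n",
"df_b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})\n",
"# default axis=0 stacks vertically; axis=1 would stack side by side\n",
"pd.concat([df_a, df_b], ignore_index=True)"
]
},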
{
"cell_type": "code",
"execution_count": 5,
"id": "6b8d82df",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
" PLAYER_NAME | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 200768 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON PLAYER_NAME\n",
"0 1610612761 202695 2018 NaN\n",
"1 1610612761 1627783 2018 NaN\n",
"2 1610612761 201188 2018 NaN\n",
"3 1610612761 201980 2018 NaN\n",
"4 1610612761 200768 2018 NaN"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"players_df = pd.concat([p18,p19])\n",
"players_df.head()"
]
},
{
"cell_type": "markdown",
"id": "087001d9",
"metadata": {},
"source": [
"This has the same columns, and we can see what happened more by looking at the\n",
"shape of each."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "67d34d12",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1374, 4)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"players_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "80c5fe42",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((748, 3), (626, 4))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p18.shape, p19.shape"
]
},
{
"cell_type": "markdown",
"id": "e4056263",
"metadata": {},
"source": [
"We can verify that the length of the new data frame is the sum of the original\n",
"two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1d11dd59",
"metadata": {},
"outputs": [],
"source": [
"assert len(p18) + len(p19) == len(players_df)"
]
},
{
"cell_type": "markdown",
"id": "e63c4876",
"metadata": {},
"source": [
"### Combining Data with Merge\n",
"\n",
"What is we want to see which players changed teams from 2018 to 2019? We can do\n",
"this with `merge`. For `merge` we have to tell it a left DataFrame and a right\n",
"DataFrame and on what column to match them up. \n",
"\n",
"The left and right DataFrames will be used different ways, but any DataFrame\n",
"can be put in either position. \n",
"\n",
"For this case, we will use 2018 data as left ,and 2019 as right and then merge\n",
"`on='PLAYER_ID'`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "fdec6044",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_x | \n",
" PLAYER_ID | \n",
" SEASON_x | \n",
" PLAYER_NAME | \n",
" TEAM_ID_y | \n",
" SEASON_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612763 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" Danny Green | \n",
" 1610612747 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_x PLAYER_ID SEASON_x PLAYER_NAME TEAM_ID_y SEASON_y\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019\n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019\n",
"2 1610612761 201188 2018 Marc Gasol 1610612761 2019\n",
"3 1610612763 201188 2018 Marc Gasol 1610612761 2019\n",
"4 1610612761 201980 2018 Danny Green 1610612747 2019"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(p18,p19,on='PLAYER_ID').head()"
]
},
{
"cell_type": "markdown",
"id": "83607e44",
"metadata": {},
"source": [
"Now, we get what we expect, but the column names have `_x` and `_y` on the end\n",
"(as a {term}`suffix`, appended to the original). We'll add 2018 and 2019 respectively, separated with a `_`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "28f63de4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612763 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" Danny Green | \n",
" 1610612747 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 \n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 \n",
"2 1610612761 201188 2018 Marc Gasol 1610612761 \n",
"3 1610612763 201188 2018 Marc Gasol 1610612761 \n",
"4 1610612761 201980 2018 Danny Green 1610612747 \n",
"\n",
" SEASON_2019 \n",
"0 2019 \n",
"1 2019 \n",
"2 2019 \n",
"3 2019 \n",
"4 2019 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_over_year = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'))\n",
"year_over_year.head()"
]
},
{
"cell_type": "markdown",
"id": "d9f804dc",
"metadata": {},
"source": [
"Now that it's a little bit cleaner, we will examine how it works by looking at\n",
"the shape."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "7c4f683d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((538, 6), (748, 3), (626, 4))"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_over_year.shape, p18.shape, p19.shape"
]
},
{
"cell_type": "markdown",
"id": "b6863e97",
"metadata": {},
"source": [
"This kept only the players that played both years, with repetitions for each team they played on for each year.\n",
"\n",
"\n",
"We can check the calculation with set math (python `set` type has operations\n",
" from math sets like intersect and set difference)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9f81f18b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"538"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# player IDs for each year, no repeats\n",
"p19_u = set(p19['PLAYER_ID'])\n",
"p18_u = set(p18['PLAYER_ID'])\n",
"\n",
"# player IDs that played both years\n",
"p1819 = p18_u.intersection(p19_u)\n",
"\n",
"# teams per player per year\n",
"teams_per_p18 = p18['PLAYER_ID'].value_counts()\n",
"teams_per_p19 = p19['PLAYER_ID'].value_counts()\n",
"\n",
"# total number of team-player combinations\n",
"# multiply number of teams each player played for in 18 by number of teams in 19\n",
"# then sum. (most of of these are 1*1)\n",
"sum(teams_per_p19[p1819]* teams_per_p18[p1819])"
]
},
{
"cell_type": "markdown",
"id": "5409ca5d",
"metadata": {},
"source": [
"We can also merge so that we keep all players on either team using `how='outer'`\n",
"the default value for `how` is inner, which takes the intersection (but with duplicates, does some extra things as we saw). With `outer` it takes the union,\n",
"but with extra handling for the duplicates."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a44b9f0d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(927, 6)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer').shape"
]
},
{
"cell_type": "markdown",
"id": "f0de9486",
"metadata": {},
"source": [
"It's the total of the rows we had before, plus the total number of player-teams\n",
"for players that only played in one of the two years."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b63addd6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"927"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#players tha tonly played in one year\n",
"o18 = p18_u.difference(p19_u)\n",
"o19 = p19_u.difference(p18_u)\n",
"# teams those players played for + the above 538\n",
"teams_per_p19[o19].sum() + teams_per_p18[o18].sum() + sum(teams_per_p19[p1819]* teams_per_p18[p1819])"
]
},
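{
"cell_type": "markdown",
"id": "a1f0c003",
"metadata": {},
"source": [
"Another way to see where each row came from is merge's `indicator=True` option, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. A quick sketch as a cross-check (an addition, not part of the original calculation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c004",
"metadata": {},
"outputs": [],
"source": [
"# count rows by origin; 'both' should match the 538 rows of the inner merge\n",
"pd.merge(p18, p19, on='PLAYER_ID', suffixes=('_2018', '_2019'),\n",
"         how='outer', indicator=True)['_merge'].value_counts()"
]
},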
{
"cell_type": "markdown",
"id": "a452cf5f",
"metadata": {},
"source": [
"We can save this to a variable"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "0da1b41f",
"metadata": {},
"outputs": [],
"source": [
"year_over_year_outer = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer')"
]
},
{
"cell_type": "markdown",
"id": "658d61bc",
"metadata": {},
"source": [
"then look at a few rows to see that it has indeed filled in with NaN in the\n",
"places where there wasn't a value."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e0d7e8b9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 485 | \n",
" 1.610613e+09 | \n",
" 2772 | \n",
" 2018.0 | \n",
" Trevor Ariza | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 300 | \n",
" 1.610613e+09 | \n",
" 202714 | \n",
" 2018.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 639 | \n",
" 1.610613e+09 | \n",
" 1626150 | \n",
" 2018.0 | \n",
" Andrew Harrison | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 861 | \n",
" NaN | \n",
" 1629683 | \n",
" NaN | \n",
" Quinndary Weatherspoon | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 264 | \n",
" 1.610613e+09 | \n",
" 1629027 | \n",
" 2018.0 | \n",
" Trae Young | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 720 | \n",
" 1.610613e+09 | \n",
" 1629008 | \n",
" 2018.0 | \n",
" Michael Porter Jr. | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 883 | \n",
" NaN | \n",
" 1629860 | \n",
" NaN | \n",
" Jon Davis | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 826 | \n",
" NaN | \n",
" 1629076 | \n",
" NaN | \n",
" Tyler Cook | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 384 | \n",
" 1.610613e+09 | \n",
" 203967 | \n",
" 2018.0 | \n",
" Dario Saric | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 52 | \n",
" 1.610613e+09 | \n",
" 203090 | \n",
" 2018.0 | \n",
" Maurice Harkless | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME \\\n",
"485 1.610613e+09 2772 2018.0 Trevor Ariza \n",
"300 1.610613e+09 202714 2018.0 NaN \n",
"639 1.610613e+09 1626150 2018.0 Andrew Harrison \n",
"861 NaN 1629683 NaN Quinndary Weatherspoon \n",
"264 1.610613e+09 1629027 2018.0 Trae Young \n",
"720 1.610613e+09 1629008 2018.0 Michael Porter Jr. \n",
"883 NaN 1629860 NaN Jon Davis \n",
"826 NaN 1629076 NaN Tyler Cook \n",
"384 1.610613e+09 203967 2018.0 Dario Saric \n",
"52 1.610613e+09 203090 2018.0 Maurice Harkless \n",
"\n",
" TEAM_ID_2019 SEASON_2019 \n",
"485 1.610613e+09 2019.0 \n",
"300 NaN NaN \n",
"639 1.610613e+09 2019.0 \n",
"861 1.610613e+09 2019.0 \n",
"264 1.610613e+09 2019.0 \n",
"720 1.610613e+09 2019.0 \n",
"883 1.610613e+09 2019.0 \n",
"826 1.610613e+09 2019.0 \n",
"384 1.610613e+09 2019.0 \n",
"52 1.610613e+09 2019.0 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_over_year_outer.sample(10)"
]
},
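{
"cell_type": "markdown",
"id": "a1f0c005",
"metadata": {},
"source": [
"As a quick extra check (an addition to the notes), the column dtypes show which columns gained missing values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c006",
"metadata": {},
"outputs": [],
"source": [
"# columns that picked up NaN in the outer merge are now float64\n",
"year_over_year_outer.dtypes"
]
},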
{
"cell_type": "markdown",
"id": "9a0d0716",
"metadata": {},
"source": [
"```{note}\n",
"We can also tell that there are NaN beacuse it cast the year to float from int.\n",
"```\n",
"\n",
"## Merge types, in detail\n",
"\n",
"We can examine how these things work more visually with smaller DataFrames:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "20a22729",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K0\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K1\", \"K0\", \"K1\"],\n",
" \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n",
" \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n",
" }\n",
")\n",
"\n",
"\n",
"right = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K1\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K0\", \"K0\", \"K0\"],\n",
" \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
" \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
" }\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "2898399d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B\n",
"0 K0 K0 A0 B0\n",
"1 K0 K1 A1 B1\n",
"2 K1 K0 A2 B2\n",
"3 K2 K1 A3 B3"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "8514f091",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K0 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 C D\n",
"0 K0 K0 C0 D0\n",
"1 K1 K0 C1 D1\n",
"2 K1 K0 C2 D2\n",
"3 K2 K0 C3 D3"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "markdown",
"id": "b72018d3",
"metadata": {},
"source": [
"````{margin}\n",
"```{note}\n",
"inner is default, but we can state it to be more explicit\n",
"```\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "4fdb6342",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K1 K0 A2 B2 C1 D1\n",
"2 K1 K0 A2 B2 C2 D2"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"], how='inner')\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "8b7f0f6a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 3 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 4 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" K2 | \n",
" K0 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K0 K1 A1 B1 NaN NaN\n",
"2 K1 K0 A2 B2 C1 D1\n",
"3 K1 K0 A2 B2 C2 D2\n",
"4 K2 K1 A3 B3 NaN NaN\n",
"5 K2 K0 NaN NaN C3 D3"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"], how='outer' )"
]
},
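{
"cell_type": "markdown",
"id": "a1f0c007",
"metadata": {},
"source": [
"Besides `'inner'` and `'outer'`, `how` also accepts `'left'` and `'right'`, which keep every row of the named DataFrame and fill unmatched columns with NaN. A quick sketch (not run in the original notes) with the same toy frames:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c008",
"metadata": {},
"outputs": [],
"source": [
"# keep all rows of `left`; rows with no match in `right` get NaN in C and D\n",
"pd.merge(left, right, on=[\"key1\", \"key2\"], how='left')"
]
},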
{
"cell_type": "markdown",
"id": "ecca8861",
"metadata": {},
"source": [
"## Duplicate Keys"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "3f0ee2f2",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame({\"A\": [1, 2], \"B\": [2, 2]})\n",
"\n",
"right = pd.DataFrame({\"A\": [4, 5, 6], \"B\": [2, 2, 2]})\n",
"\n",
"result = pd.merge(left, right, on=\"B\", how=\"outer\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "e70cfbf5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 2"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "10878324",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 4 2\n",
"1 5 2\n",
"2 6 2"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "ccb8005f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A_x | \n",
" B | \n",
" A_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A_x B A_y\n",
"0 1 2 4\n",
"1 1 2 5\n",
"2 1 2 6\n",
"3 2 2 4\n",
"4 2 2 5\n",
"5 2 2 6"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result"
]
},
{
"cell_type": "markdown",
"id": "68964b61",
"metadata": {},
"source": [
"If we ask pandas to validate the merge with `validate` and a specific type of\n",
"merge, it will throw an error if that type of merge is not possible."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "043fc46d",
"metadata": {},
"outputs": [
{
"ename": "MergeError",
"evalue": "Merge keys are not unique in either left or right dataset; not a one-to-one merge",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mMergeError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [26]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmerge\u001b[49m\u001b[43m(\u001b[49m\u001b[43mleft\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mright\u001b[49m\u001b[43m,\u001b[49m\u001b[43mon\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mB\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43mvalidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mone_to_one\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:107\u001b[0m, in \u001b[0;36mmerge\u001b[0;34m(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)\u001b[0m\n\u001b[1;32m 90\u001b[0m \u001b[38;5;129m@Substitution\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124mleft : DataFrame or named Series\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 91\u001b[0m \u001b[38;5;129m@Appender\u001b[39m(_merge_doc, indents\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m 92\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mmerge\u001b[39m(\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 105\u001b[0m validate: \u001b[38;5;28mstr\u001b[39m \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[1;32m 106\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[0;32m--> 107\u001b[0m op \u001b[38;5;241m=\u001b[39m \u001b[43m_MergeOperation\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 108\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 109\u001b[0m \u001b[43m \u001b[49m\u001b[43mright\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 110\u001b[0m \u001b[43m \u001b[49m\u001b[43mhow\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mhow\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 111\u001b[0m \u001b[43m \u001b[49m\u001b[43mon\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mon\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 112\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mleft_on\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 113\u001b[0m \u001b[43m \u001b[49m\u001b[43mright_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mright_on\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 114\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft_index\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mleft_index\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 115\u001b[0m \u001b[43m \u001b[49m\u001b[43mright_index\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mright_index\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 116\u001b[0m \u001b[43m \u001b[49m\u001b[43msort\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msort\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 117\u001b[0m \u001b[43m \u001b[49m\u001b[43msuffixes\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msuffixes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 118\u001b[0m \u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 119\u001b[0m \u001b[43m \u001b[49m\u001b[43mindicator\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindicator\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 120\u001b[0m \u001b[43m \u001b[49m\u001b[43mvalidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mvalidate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 121\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 122\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m op\u001b[38;5;241m.\u001b[39mget_result()\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:710\u001b[0m, in \u001b[0;36m_MergeOperation.__init__\u001b[0;34m(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)\u001b[0m\n\u001b[1;32m 706\u001b[0m \u001b[38;5;66;03m# If argument passed to validate,\u001b[39;00m\n\u001b[1;32m 707\u001b[0m \u001b[38;5;66;03m# check if columns specified as unique\u001b[39;00m\n\u001b[1;32m 708\u001b[0m \u001b[38;5;66;03m# are in fact unique.\u001b[39;00m\n\u001b[1;32m 709\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m validate \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 710\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvalidate\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1427\u001b[0m, in \u001b[0;36m_MergeOperation._validate\u001b[0;34m(self, validate)\u001b[0m\n\u001b[1;32m 1425\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m validate \u001b[38;5;129;01min\u001b[39;00m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone_to_one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m1:1\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m 1426\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m left_unique \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m right_unique:\n\u001b[0;32m-> 1427\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MergeError(\n\u001b[1;32m 1428\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMerge keys are not unique in either left \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1429\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mor right dataset; not a one-to-one merge\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1430\u001b[0m )\n\u001b[1;32m 1431\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m left_unique:\n\u001b[1;32m 1432\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MergeError(\n\u001b[1;32m 1433\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMerge keys are not unique in left dataset; not a one-to-one merge\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1434\u001b[0m )\n",
"\u001b[0;31mMergeError\u001b[0m: Merge keys are not unique in either left or right dataset; not a one-to-one merge"
]
}
],
"source": [
"pd.merge(left, right,on='B',validate='one_to_one')"
]
},
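{
"cell_type": "markdown",
"id": "a1f0c009",
"metadata": {},
"source": [
"For this data, a `'many_to_many'` validation passes, since the key `B` is duplicated on both sides. This sketch gives the same result as the plain merge above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c010",
"metadata": {},
"outputs": [],
"source": [
"# many-to-many is what actually happened here, so this does not raise\n",
"pd.merge(left, right, on='B', validate='many_to_many')"
]
},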
{
"cell_type": "markdown",
"id": "4d9d1edd",
"metadata": {},
"source": [
"```{admonition} Further Reading\n",
"The [pandas documentation](https://pandas.pydata.org/docs/user_guide/merging.html#checking-for-duplicate-keys) is a good place to read through their\n",
"exampels on how validate works. it's important to note the types of possible joins from the [beginning of the section](https://pandas.pydata.org/docs/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra)\n",
"```\n",
"\n",
"## Questions at the End of Class\n",
"\n",
"\n",
"### how to remove certain columns in merges (i.e. the year columns in the basketball dataset since they are redundant)\n",
"````{toggle}\n",
"They can be dropped after (or before) using `drop()`\n",
"\n",
"```{code-cell} ipython3\n",
"year_over_year_clean = year_over_year.drop(columns = ['SEASON_2018','SEASON_2019'])\n",
"year_over_year_clean.head()\n",
"```\n",
"\n",
"````\n",
"\n",
"### Can we merge as many dataframes as we would want?\n",
"```{toggle}\n",
"Merge is specifically defined for two Data Frames, but the result returns a\n",
"a DataFrame, so it could be passed to another merge.\n",
"```\n",
"\n",
"\n",
"### Is there no way to merge the left and right for the A and B data?\n",
"```{toggle}\n",
"All merges are possible, they just each give a different result.\n",
"```\n",
"\n",
"\n",
"### Can you use merging to check correlation between two different sets?\n",
"```{toggle}\n",
"You can use merging to check *overlap* between two DataFrames, for actual sets, it's probably best to represent them as `set` type objects and use set operations.\n",
"\n",
"However, these type of operations are not actually correlations, which we'll learn a bit about in a few weeks when we talk about regression\n",
"```\n",
"\n",
"\n",
"### Is there any way to compare the team ids in both datasets so that it outputs the players that changed teams between those times?\n",
"\n",
"````{toggle}\n",
"From the `year_over_year` DataFrame we made above, we can check which players changed teams. Since players also can change teams *within* a season, this is a tricky question, but we could, for example look only for players that have no overlapping teams from 2018 to 2019.\n",
"\n",
"\n",
"\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "d53176b3",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'year_over_year_clean' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [27]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m change_check \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mlambda\u001b[39;00m r: \u001b[38;5;129;01mnot\u001b[39;00m(r[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mTEAM_ID_2018\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m==\u001b[39m r[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mTEAM_ID_2019\u001b[39m\u001b[38;5;124m'\u001b[39m])\n\u001b[0;32m----> 2\u001b[0m year_over_year_clean[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mCHANGE_TEAM\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43myear_over_year_clean\u001b[49m\u001b[38;5;241m.\u001b[39mapply(change_check,axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m)\n\u001b[1;32m 3\u001b[0m year_over_year_clean\u001b[38;5;241m.\u001b[39mhead()\n",
"\u001b[0;31mNameError\u001b[0m: name 'year_over_year_clean' is not defined"
]
}
],
"source": [
"change_check = lambda r: not(r['TEAM_ID_2018']== r['TEAM_ID_2019'])\n",
"year_over_year_clean['CHANGE_TEAM'] = year_over_year_clean.apply(change_check,axis=1)\n",
"year_over_year_clean.head()"
]
},
{
"cell_type": "markdown",
"id": "cc3f1ad1",
"metadata": {},
"source": [
"Then we can filter the data by the new column, then take out only the player information and dropa ny duplicates for players that played in multiple teams in one eyar or the other."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "f5f56a7b",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'year_over_year_clean' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [28]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m y_o_y_changes \u001b[38;5;241m=\u001b[39m \u001b[43myear_over_year_clean\u001b[49m[year_over_year_clean[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mCHANGE_TEAM\u001b[39m\u001b[38;5;124m'\u001b[39m]]\n\u001b[1;32m 2\u001b[0m y_o_y_change_players \u001b[38;5;241m=\u001b[39m y_o_y_changes[[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mPLAYER_ID\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mPLAYER_NAME\u001b[39m\u001b[38;5;124m'\u001b[39m]]\u001b[38;5;241m.\u001b[39mdrop_duplicates()\n\u001b[1;32m 3\u001b[0m y_o_y_change_players\u001b[38;5;241m.\u001b[39mhead()\n",
"\u001b[0;31mNameError\u001b[0m: name 'year_over_year_clean' is not defined"
]
}
],
"source": [
"y_o_y_changes = year_over_year_clean[year_over_year_clean['CHANGE_TEAM']]\n",
"y_o_y_change_players = y_o_y_changes[['PLAYER_ID','PLAYER_NAME']].drop_duplicates()\n",
"y_o_y_change_players.head()"
]
},
{
"cell_type": "markdown",
"id": "39123051",
"metadata": {},
"source": [
"and we can check how many players that is"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "415ca198",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'y_o_y_change_players' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [29]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43my_o_y_change_players\u001b[49m\u001b[38;5;241m.\u001b[39mshape\n",
"\u001b[0;31mNameError\u001b[0m: name 'y_o_y_change_players' is not defined"
]
}
],
"source": [
"y_o_y_change_players.shape"
]
},
{
"cell_type": "markdown",
"id": "19f28a89",
"metadata": {},
"source": [
"### in year_over_year, where suffixes=('_2018','_2019'), why is there an underscore?\n",
"```{toggle}\n",
"this is optional, but is best practice for pythonic variable naming. (and remember columns are like variables)\n",
"```\n",
"\n",
"\n",
"### If we merge two dataframes which each have a column (or columns) that are the same, but each have different data types in those columns, what happens?\n",
"\n",
"```{toggle}\n",
"If they're just named the same, but not the \"on\" column, you'll get the two columsn with different suffixes default `('_x','_y')`. If it's the `on` column, the merge will be empty\n",
"```\n",
"\n",
"\n",
"### how does suffix to know to change the date?\n",
"```{toggle}\n",
"Suffix appends what we tell it to to the column names that occur in both datasets\n",
"```\n",
"\n",
"## More Practice\n",
"\n",
"1. Use a merge to figure out how many players did not return for the 2019 season\n",
"1. Also load the `conferences.csv` data. Use this to figure out how many players moved conferences from 2018 to 2019."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.10.3"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}