{
"cells": [
{
"cell_type": "markdown",
"id": "119fc7dd",
"metadata": {},
"source": [
"# Building Datasets From multiple Sources"
]
},
{
"cell_type": "markdown",
"id": "d968cd5a",
"metadata": {},
"source": [
"```\n",
"1. remember, that\n",
"1. with markdown\n",
"1. you can make a numbered list\n",
"1. using all `1.`\n",
"```\n",
"\n",
"renders as:\n",
"\n",
"1. remember, that\n",
"1. with markdown\n",
"1. you can make a numbered list\n",
"1. using all `1.`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f63d46ac",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "297dc008",
"metadata": {},
"outputs": [],
"source": [
"course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'"
]
},
{
"cell_type": "markdown",
"id": "0f10a4a0",
"metadata": {},
"source": [
"Today we're going to look at some data on NBA (National Basketball Association) players. \n",
"\n",
"## Combining Multiple Tables"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c945d3a8",
"metadata": {},
"outputs": [],
"source": [
"p18 = pd.read_csv(course_data_url +'2018-players.csv')\n",
"p19 = pd.read_csv(course_data_url +'2019-players.csv')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4ad20749",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 200768 | \n",
" 2018 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON\n",
"0 1610612761 202695 2018\n",
"1 1610612761 1627783 2018\n",
"2 1610612761 201188 2018\n",
"3 1610612761 201980 2018\n",
"4 1610612761 200768 2018"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p18.head()"
]
},
{
"cell_type": "markdown",
"id": "c6534e3d",
"metadata": {},
"source": [
"### Stacking with Concat\n",
"\n",
"We've seen one way of putting dataframes together with `concat` we used it when\n",
"{ref}`unpacking-jsons`. In that case, we stacked them side by side, now we want\n",
"to stack them vertically, which is the default of concat."
]
},
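{
"cell_type": "markdown",
"id": "a1f0c001",
"metadata": {},
"source": [
"To see both directions on something tiny, here is a minimal sketch with toy frames (`df_a` and `df_b` are made up for illustration, not part of the NBA data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c002",
"metadata": {},
"outputs": [],
"source": [
"# toy frames to show the stacking directions of concat\n",
"df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})\n",
"df_b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})\n",
"# default axis=0 stacks vertically; axis=1 would stack side by side\n",
"pd.concat([df_a, df_b], ignore_index=True)"
]
},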
{
"cell_type": "code",
"execution_count": 5,
"id": "6b8d82df",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
" PLAYER_NAME | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 200768 | \n",
" 2018 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON PLAYER_NAME\n",
"0 1610612761 202695 2018 NaN\n",
"1 1610612761 1627783 2018 NaN\n",
"2 1610612761 201188 2018 NaN\n",
"3 1610612761 201980 2018 NaN\n",
"4 1610612761 200768 2018 NaN"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"players_df = pd.concat([p18,p19])\n",
"players_df.head()"
]
},
{
"cell_type": "markdown",
"id": "087001d9",
"metadata": {},
"source": [
"This has the same columns, and we can see what happened more by looking at the\n",
"shape of each."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "67d34d12",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1374, 4)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"players_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "80c5fe42",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((748, 3), (626, 4))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p18.shape, p19.shape"
]
},
{
"cell_type": "markdown",
"id": "e4056263",
"metadata": {},
"source": [
"We can verify that the length of the new data frame is the sum of the original\n",
"two DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1d11dd59",
"metadata": {},
"outputs": [],
"source": [
"assert len(p18) + len(p19) == len(players_df)"
]
},
{
"cell_type": "markdown",
"id": "e63c4876",
"metadata": {},
"source": [
"### Combining Data with Merge\n",
"\n",
"What is we want to see which players changed teams from 2018 to 2019? We can do\n",
"this with `merge`. For `merge` we have to tell it a left DataFrame and a right\n",
"DataFrame and on what column to match them up. \n",
"\n",
"The left and right DataFrames will be used different ways, but any DataFrame\n",
"can be put in either position. \n",
"\n",
"For this case, we will use 2018 data as left ,and 2019 as right and then merge\n",
"`on='PLAYER_ID'`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "fdec6044",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_x | \n",
" PLAYER_ID | \n",
" SEASON_x | \n",
" PLAYER_NAME | \n",
" TEAM_ID_y | \n",
" SEASON_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612763 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" Danny Green | \n",
" 1610612747 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_x PLAYER_ID SEASON_x PLAYER_NAME TEAM_ID_y SEASON_y\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019\n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019\n",
"2 1610612761 201188 2018 Marc Gasol 1610612761 2019\n",
"3 1610612763 201188 2018 Marc Gasol 1610612761 2019\n",
"4 1610612761 201980 2018 Danny Green 1610612747 2019"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(p18,p19,on='PLAYER_ID').head()"
]
},
{
"cell_type": "markdown",
"id": "83607e44",
"metadata": {},
"source": [
"Now, we get what we expect, but the column names have `_x` and `_y` on the end\n",
"(as a {term}`suffix`, appended to the original). We'll add 2018 and 2019 respectively, separated with a `_`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "28f63de4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612763 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" Danny Green | \n",
" 1610612747 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 \n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 \n",
"2 1610612761 201188 2018 Marc Gasol 1610612761 \n",
"3 1610612763 201188 2018 Marc Gasol 1610612761 \n",
"4 1610612761 201980 2018 Danny Green 1610612747 \n",
"\n",
" SEASON_2019 \n",
"0 2019 \n",
"1 2019 \n",
"2 2019 \n",
"3 2019 \n",
"4 2019 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_over_year = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'))\n",
"year_over_year.head()"
]
},
{
"cell_type": "markdown",
"id": "d9f804dc",
"metadata": {},
"source": [
"Now that it's a little bit cleaner, we will examine how it works by looking at\n",
"the shape."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "7c4f683d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((538, 6), (748, 3), (626, 4))"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_over_year.shape, p18.shape, p19.shape"
]
},
{
"cell_type": "markdown",
"id": "b6863e97",
"metadata": {},
"source": [
"This kept only the players that played both years, with repetitions for each team they played on for each year.\n",
"\n",
"\n",
"We can check the calculation with set math (python `set` type has operations\n",
" from math sets like intersect and set difference)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9f81f18b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"538"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# player IDs for each year, no repeats\n",
"p19_u = set(p19['PLAYER_ID'])\n",
"p18_u = set(p18['PLAYER_ID'])\n",
"\n",
"# player IDs that played both years\n",
"p1819 = p18_u.intersection(p19_u)\n",
"\n",
"# teams per player per year\n",
"teams_per_p18 = p18['PLAYER_ID'].value_counts()\n",
"teams_per_p19 = p19['PLAYER_ID'].value_counts()\n",
"\n",
"# total number of team-player combinations\n",
"# multiply number of teams each player played for in 18 by number of teams in 19\n",
"# then sum. (most of of these are 1*1)\n",
"sum(teams_per_p19[p1819]* teams_per_p18[p1819])"
]
},
{
"cell_type": "markdown",
"id": "5409ca5d",
"metadata": {},
"source": [
"We can also merge so that we keep all players on either team using `how='outer'`\n",
"the default value for `how` is inner, which takes the intersection (but with duplicates, does some extra things as we saw). With `outer` it takes the union,\n",
"but with extra handling for the duplicates."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a44b9f0d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(927, 6)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer').shape"
]
},
{
"cell_type": "markdown",
"id": "f0de9486",
"metadata": {},
"source": [
"It's the total of the rows we had before, plus the total number of player-teams\n",
"for players that only played in one of the two years."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b63addd6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"927"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#players tha tonly played in one year\n",
"o18 = p18_u.difference(p19_u)\n",
"o19 = p19_u.difference(p18_u)\n",
"# teams those players played for + the above 538\n",
"teams_per_p19[o19].sum() + teams_per_p18[o18].sum() + sum(teams_per_p19[p1819]* teams_per_p18[p1819])"
]
},
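{
"cell_type": "markdown",
"id": "a1f0c003",
"metadata": {},
"source": [
"Another way to see where each row came from is merge's `indicator=True` option, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. A quick sketch as a cross-check (an addition, not part of the original calculation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c004",
"metadata": {},
"outputs": [],
"source": [
"# count rows by origin; 'both' should match the 538 rows of the inner merge\n",
"pd.merge(p18, p19, on='PLAYER_ID', suffixes=('_2018', '_2019'),\n",
"         how='outer', indicator=True)['_merge'].value_counts()"
]
},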
{
"cell_type": "markdown",
"id": "a452cf5f",
"metadata": {},
"source": [
"We can save this to a variable"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "0da1b41f",
"metadata": {},
"outputs": [],
"source": [
"year_over_year_outer = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer')"
]
},
{
"cell_type": "markdown",
"id": "658d61bc",
"metadata": {},
"source": [
"then look at a few rows to see that it has indeed filled in with NaN in the\n",
"places where there wasn't a value."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e0d7e8b9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 485 | \n",
" 1.610613e+09 | \n",
" 2772 | \n",
" 2018.0 | \n",
" Trevor Ariza | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 300 | \n",
" 1.610613e+09 | \n",
" 202714 | \n",
" 2018.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 639 | \n",
" 1.610613e+09 | \n",
" 1626150 | \n",
" 2018.0 | \n",
" Andrew Harrison | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 861 | \n",
" NaN | \n",
" 1629683 | \n",
" NaN | \n",
" Quinndary Weatherspoon | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 264 | \n",
" 1.610613e+09 | \n",
" 1629027 | \n",
" 2018.0 | \n",
" Trae Young | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 720 | \n",
" 1.610613e+09 | \n",
" 1629008 | \n",
" 2018.0 | \n",
" Michael Porter Jr. | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 883 | \n",
" NaN | \n",
" 1629860 | \n",
" NaN | \n",
" Jon Davis | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 826 | \n",
" NaN | \n",
" 1629076 | \n",
" NaN | \n",
" Tyler Cook | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 384 | \n",
" 1.610613e+09 | \n",
" 203967 | \n",
" 2018.0 | \n",
" Dario Saric | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
" 52 | \n",
" 1.610613e+09 | \n",
" 203090 | \n",
" 2018.0 | \n",
" Maurice Harkless | \n",
" 1.610613e+09 | \n",
" 2019.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME \\\n",
"485 1.610613e+09 2772 2018.0 Trevor Ariza \n",
"300 1.610613e+09 202714 2018.0 NaN \n",
"639 1.610613e+09 1626150 2018.0 Andrew Harrison \n",
"861 NaN 1629683 NaN Quinndary Weatherspoon \n",
"264 1.610613e+09 1629027 2018.0 Trae Young \n",
"720 1.610613e+09 1629008 2018.0 Michael Porter Jr. \n",
"883 NaN 1629860 NaN Jon Davis \n",
"826 NaN 1629076 NaN Tyler Cook \n",
"384 1.610613e+09 203967 2018.0 Dario Saric \n",
"52 1.610613e+09 203090 2018.0 Maurice Harkless \n",
"\n",
" TEAM_ID_2019 SEASON_2019 \n",
"485 1.610613e+09 2019.0 \n",
"300 NaN NaN \n",
"639 1.610613e+09 2019.0 \n",
"861 1.610613e+09 2019.0 \n",
"264 1.610613e+09 2019.0 \n",
"720 1.610613e+09 2019.0 \n",
"883 1.610613e+09 2019.0 \n",
"826 1.610613e+09 2019.0 \n",
"384 1.610613e+09 2019.0 \n",
"52 1.610613e+09 2019.0 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_over_year_outer.sample(10)"
]
},
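{
"cell_type": "markdown",
"id": "a1f0c005",
"metadata": {},
"source": [
"As a quick extra check (an addition to the notes), the column dtypes show which columns gained missing values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c006",
"metadata": {},
"outputs": [],
"source": [
"# columns that picked up NaN in the outer merge are now float64\n",
"year_over_year_outer.dtypes"
]
},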
{
"cell_type": "markdown",
"id": "9a0d0716",
"metadata": {},
"source": [
"```{note}\n",
"We can also tell that there are NaN beacuse it cast the year to float from int.\n",
"```\n",
"\n",
"## Merge types, in detail\n",
"\n",
"We can examine how these things work more visually with smaller DataFrames:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "20a22729",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K0\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K1\", \"K0\", \"K1\"],\n",
" \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n",
" \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n",
" }\n",
")\n",
"\n",
"\n",
"right = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K1\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K0\", \"K0\", \"K0\"],\n",
" \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
" \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
" }\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "2898399d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B\n",
"0 K0 K0 A0 B0\n",
"1 K0 K1 A1 B1\n",
"2 K1 K0 A2 B2\n",
"3 K2 K1 A3 B3"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "8514f091",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K0 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 C D\n",
"0 K0 K0 C0 D0\n",
"1 K1 K0 C1 D1\n",
"2 K1 K0 C2 D2\n",
"3 K2 K0 C3 D3"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "markdown",
"id": "b72018d3",
"metadata": {},
"source": [
"````{margin}\n",
"```{note}\n",
"inner is default, but we can state it to be more explicit\n",
"```\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "4fdb6342",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K1 K0 A2 B2 C1 D1\n",
"2 K1 K0 A2 B2 C2 D2"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"], how='inner')\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "8b7f0f6a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 3 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 4 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" K2 | \n",
" K0 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K0 K1 A1 B1 NaN NaN\n",
"2 K1 K0 A2 B2 C1 D1\n",
"3 K1 K0 A2 B2 C2 D2\n",
"4 K2 K1 A3 B3 NaN NaN\n",
"5 K2 K0 NaN NaN C3 D3"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"], how='outer' )"
]
},
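{
"cell_type": "markdown",
"id": "a1f0c007",
"metadata": {},
"source": [
"Besides `'inner'` and `'outer'`, `how` also accepts `'left'` and `'right'`, which keep every row of the named DataFrame and fill unmatched columns with NaN. A quick sketch (not run in the original notes) with the same toy frames:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c008",
"metadata": {},
"outputs": [],
"source": [
"# keep all rows of `left`; rows with no match in `right` get NaN in C and D\n",
"pd.merge(left, right, on=[\"key1\", \"key2\"], how='left')"
]
},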
{
"cell_type": "markdown",
"id": "ecca8861",
"metadata": {},
"source": [
"## Duplicate Keys"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "3f0ee2f2",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame({\"A\": [1, 2], \"B\": [2, 2]})\n",
"\n",
"right = pd.DataFrame({\"A\": [4, 5, 6], \"B\": [2, 2, 2]})\n",
"\n",
"result = pd.merge(left, right, on=\"B\", how=\"outer\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "e70cfbf5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 2"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "10878324",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 4 2\n",
"1 5 2\n",
"2 6 2"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "ccb8005f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A_x | \n",
" B | \n",
" A_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A_x B A_y\n",
"0 1 2 4\n",
"1 1 2 5\n",
"2 1 2 6\n",
"3 2 2 4\n",
"4 2 2 5\n",
"5 2 2 6"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result"
]
},
{
"cell_type": "markdown",
"id": "68964b61",
"metadata": {},
"source": [
"If we ask pandas to validate the merge with `validate` and a specific type of\n",
"merge, it will throw an error if that type of merge is not possible."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "043fc46d",
"metadata": {},
"outputs": [
{
"ename": "MergeError",
"evalue": "Merge keys are not unique in either left or right dataset; not a one-to-one merge",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mMergeError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [26]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmerge\u001b[49m\u001b[43m(\u001b[49m\u001b[43mleft\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mright\u001b[49m\u001b[43m,\u001b[49m\u001b[43mon\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mB\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43mvalidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mone_to_one\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:107\u001b[0m, in \u001b[0;36mmerge\u001b[0;34m(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)\u001b[0m\n\u001b[1;32m 90\u001b[0m \u001b[38;5;129m@Substitution\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124mleft : DataFrame or named Series\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 91\u001b[0m \u001b[38;5;129m@Appender\u001b[39m(_merge_doc, indents\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m)\n\u001b[1;32m 92\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mmerge\u001b[39m(\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 105\u001b[0m validate: \u001b[38;5;28mstr\u001b[39m \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[1;32m 106\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[0;32m--> 107\u001b[0m op \u001b[38;5;241m=\u001b[39m \u001b[43m_MergeOperation\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 108\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 109\u001b[0m \u001b[43m \u001b[49m\u001b[43mright\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 110\u001b[0m \u001b[43m \u001b[49m\u001b[43mhow\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mhow\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 111\u001b[0m \u001b[43m \u001b[49m\u001b[43mon\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mon\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 112\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mleft_on\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 113\u001b[0m \u001b[43m \u001b[49m\u001b[43mright_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mright_on\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 114\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft_index\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mleft_index\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 115\u001b[0m \u001b[43m \u001b[49m\u001b[43mright_index\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mright_index\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 116\u001b[0m \u001b[43m \u001b[49m\u001b[43msort\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msort\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 117\u001b[0m \u001b[43m \u001b[49m\u001b[43msuffixes\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msuffixes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 118\u001b[0m \u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 119\u001b[0m \u001b[43m \u001b[49m\u001b[43mindicator\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindicator\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 120\u001b[0m \u001b[43m \u001b[49m\u001b[43mvalidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mvalidate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 121\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 122\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m op\u001b[38;5;241m.\u001b[39mget_result()\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:710\u001b[0m, in \u001b[0;36m_MergeOperation.__init__\u001b[0;34m(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)\u001b[0m\n\u001b[1;32m 706\u001b[0m \u001b[38;5;66;03m# If argument passed to validate,\u001b[39;00m\n\u001b[1;32m 707\u001b[0m \u001b[38;5;66;03m# check if columns specified as unique\u001b[39;00m\n\u001b[1;32m 708\u001b[0m \u001b[38;5;66;03m# are in fact unique.\u001b[39;00m\n\u001b[1;32m 709\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m validate \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m--> 710\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvalidate\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1427\u001b[0m, in \u001b[0;36m_MergeOperation._validate\u001b[0;34m(self, validate)\u001b[0m\n\u001b[1;32m 1425\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m validate \u001b[38;5;129;01min\u001b[39;00m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone_to_one\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m1:1\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m 1426\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m left_unique \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m right_unique:\n\u001b[0;32m-> 1427\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MergeError(\n\u001b[1;32m 1428\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMerge keys are not unique in either left \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1429\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mor right dataset; not a one-to-one merge\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1430\u001b[0m )\n\u001b[1;32m 1431\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m left_unique:\n\u001b[1;32m 1432\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m MergeError(\n\u001b[1;32m 1433\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMerge keys are not unique in left dataset; not a one-to-one merge\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1434\u001b[0m )\n",
"\u001b[0;31mMergeError\u001b[0m: Merge keys are not unique in either left or right dataset; not a one-to-one merge"
]
}
],
"source": [
"pd.merge(left, right,on='B',validate='one_to_one')"
]
},
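{
"cell_type": "markdown",
"id": "a1f0c009",
"metadata": {},
"source": [
"For this data, a `'many_to_many'` validation passes, since the key `B` is duplicated on both sides. This sketch gives the same result as the plain merge above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f0c010",
"metadata": {},
"outputs": [],
"source": [
"# many-to-many is what actually happened here, so this does not raise\n",
"pd.merge(left, right, on='B', validate='many_to_many')"
]
},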
{
"cell_type": "markdown",
"id": "4d9d1edd",
"metadata": {},
"source": [
"```{admonition} Further Reading\n",
"The [pandas documentation](https://pandas.pydata.org/docs/user_guide/merging.html#checking-for-duplicate-keys) is a good place to read through their\n",
"exampels on how validate works. it's important to note the types of possible joins from the [beginning of the section](https://pandas.pydata.org/docs/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra)\n",
"```\n",
"\n",
"## Questions at the End of Class\n",
"\n",
"\n",
"### how to remove certain columns in merges (i.e. the year columns in the basketball dataset since they are redundant)\n",
"````{toggle}\n",
"They can be dropped after (or before) using `drop()`\n",
"\n",
"```{code-cell} ipython3\n",
"year_over_year_clean = year_over_year.drop(columns = ['SEASON_2018','SEASON_2019'])\n",
"year_over_year_clean.head()\n",
"```\n",
"\n",
"````\n",
"\n",
"### Can we merge as many dataframes as we would want?\n",
"```{toggle}\n",
"Merge is specifically defined for two Data Frames, but the result returns a\n",
"a DataFrame, so it could be passed to another merge.\n",
"```\n",
"\n",
"\n",
"### Is there no way to merge the left and right for the A and B data?\n",
"```{toggle}\n",
"All merges are possible, they just each give a different result.\n",
"```\n",
"\n",
"\n",
"### Can you use merging to check correlation between two different sets?\n",
"```{toggle}\n",
"You can use merging to check *overlap* between two DataFrames, for actual sets, it's probably best to represent them as `set` type objects and use set operations.\n",
"\n",
"However, these type of operations are not actually correlations, which we'll learn a bit about in a few weeks when we talk about regression\n",
"```\n",
"\n",
"\n",
"### Is there any way to compare the team ids in both datasets so that it outputs the players that changed teams between those times?\n",
"\n",
"````{toggle}\n",
"From the `year_over_year` DataFrame we made above, we can check which players changed teams. Since players also can change teams *within* a season, this is a tricky question, but we could, for example look only for players that have no overlapping teams from 2018 to 2019.\n",
"\n",
"\n",
"\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "d53176b3",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'year_over_year_clean' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [27]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m change_check \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mlambda\u001b[39;00m r: \u001b[38;5;129;01mnot\u001b[39;00m(r[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mTEAM_ID_2018\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m==\u001b[39m r[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mTEAM_ID_2019\u001b[39m\u001b[38;5;124m'\u001b[39m])\n\u001b[0;32m----> 2\u001b[0m year_over_year_clean[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mCHANGE_TEAM\u001b[39m\u001b[38;5;124m'\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43myear_over_year_clean\u001b[49m\u001b[38;5;241m.\u001b[39mapply(change_check,axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1\u001b[39m)\n\u001b[1;32m 3\u001b[0m year_over_year_clean\u001b[38;5;241m.\u001b[39mhead()\n",
"\u001b[0;31mNameError\u001b[0m: name 'year_over_year_clean' is not defined"
]
}
],
"source": [
"change_check = lambda r: not(r['TEAM_ID_2018']== r['TEAM_ID_2019'])\n",
"year_over_year_clean['CHANGE_TEAM'] = year_over_year_clean.apply(change_check,axis=1)\n",
"year_over_year_clean.head()"
]
},
{
"cell_type": "markdown",
"id": "cc3f1ad1",
"metadata": {},
"source": [
"Then we can filter the data by the new column, then take out only the player information and dropa ny duplicates for players that played in multiple teams in one eyar or the other."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "f5f56a7b",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'year_over_year_clean' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [28]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m y_o_y_changes \u001b[38;5;241m=\u001b[39m \u001b[43myear_over_year_clean\u001b[49m[year_over_year_clean[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mCHANGE_TEAM\u001b[39m\u001b[38;5;124m'\u001b[39m]]\n\u001b[1;32m 2\u001b[0m y_o_y_change_players \u001b[38;5;241m=\u001b[39m y_o_y_changes[[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mPLAYER_ID\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mPLAYER_NAME\u001b[39m\u001b[38;5;124m'\u001b[39m]]\u001b[38;5;241m.\u001b[39mdrop_duplicates()\n\u001b[1;32m 3\u001b[0m y_o_y_change_players\u001b[38;5;241m.\u001b[39mhead()\n",
"\u001b[0;31mNameError\u001b[0m: name 'year_over_year_clean' is not defined"
]
}
],
"source": [
"y_o_y_changes = year_over_year_clean[year_over_year_clean['CHANGE_TEAM']]\n",
"y_o_y_change_players = y_o_y_changes[['PLAYER_ID','PLAYER_NAME']].drop_duplicates()\n",
"y_o_y_change_players.head()"
]
},
{
"cell_type": "markdown",
"id": "39123051",
"metadata": {},
"source": [
"and we can check how many players that is"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "415ca198",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'y_o_y_change_players' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Input \u001b[0;32mIn [29]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43my_o_y_change_players\u001b[49m\u001b[38;5;241m.\u001b[39mshape\n",
"\u001b[0;31mNameError\u001b[0m: name 'y_o_y_change_players' is not defined"
]
}
],
"source": [
"y_o_y_change_players.shape"
]
},
{
"cell_type": "markdown",
"id": "19f28a89",
"metadata": {},
"source": [
"### in year_over_year, where suffixes=('_2018','_2019'), why is there an underscore?\n",
"```{toggle}\n",
"this is optional, but is best practice for pythonic variable naming. (and remember columns are like variables)\n",
"```\n",
"\n",
"\n",
"### If we merge two dataframes which each have a column (or columns) that are the same, but each have different data types in those columns, what happens?\n",
"\n",
"```{toggle}\n",
"If they're just named the same, but not the \"on\" column, you'll get the two columsn with different suffixes default `('_x','_y')`. If it's the `on` column, the merge will be empty\n",
"```\n",
"\n",
"\n",
"### how does suffix to know to change the date?\n",
"```{toggle}\n",
"Suffix appends what we tell it to to the column names that occur in both datasets\n",
"```\n",
"\n",
"## More Practice\n",
"\n",
"1. Use a merge to figure out how many players did not return for the 2019 season\n",
"1. Also load the `conferences.csv` data. Use this to figure out how many players moved conferences from 2018 to 2019."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.10.3"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}