{
"cells": [
{
"cell_type": "markdown",
"id": "8dba16a2",
"metadata": {},
"source": [
"# Building Datasets from Multiple Sources\n",
"\n",
"focus this week is on how to programmatically combine sources of data\n",
"\n",
"We will start by looking at combining multiple tabular data formats and see how to get data from other sources. \n",
"\n",
"\n",
"\n",
"## Self- assessment, what's your plan\n",
"\n",
"Take a few minutes to think about the following questions and make a few notes\n",
"for yourself whereever you need them to be (a planner, calendar, etc).\n",
"Share one takeaway or question you have below when you're done.\n",
"\n",
"1. What achievements have you earned?\n",
"1. Does BrightSpace seem accurate to what you've done?\n",
"1. If not, e-mail brownsarahm with [CSC310] or [DSP310] in the subject and specific details about what you think is missing and why\n",
"1. Are you on track to earn the grade you want in this class?\n",
"1. If not, what will you need to do (respond more in class, submit more\n",
" assignments, use your portfolio to catch up) to get back on track?\n",
"1. If you are on track and you want to earn above a B, take a minute to think\n",
"about your portfolio. (tip: post an idea as an issue to get early feedback and help shaping your idea)\n",
"\n",
"## Logistics\n",
"\n",
"Check your earned achievements. See the [instructions](gradecheck) that are now saved for future reference on the left side.\n",
"\n",
"\n",
"## Merges"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "14a0c74d",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "331ab0f0",
"metadata": {},
"source": [
"I created a folder with [datasets we use in class](https://github.com/rhodyprog4ds/rhodyds/tree/main/data)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "10d33a0b",
"metadata": {},
"outputs": [],
"source": [
"course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'"
]
},
{
"cell_type": "markdown",
"id": "7a31dfcc",
"metadata": {},
"source": [
" We can load in two data sets of player information."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ba2efd5b",
"metadata": {},
"outputs": [],
"source": [
"df_18 = pd.read_csv(course_data_url+ '2018-players.csv')\n",
"df_19 = pd.read_csv(course_data_url+ '2019-players.csv')"
]
},
{
"cell_type": "markdown",
"id": "810d1b21",
"metadata": {},
"source": [
"and take a peek at each"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cab99358",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON\n",
"0 1610612761 202695 2018\n",
"1 1610612761 1627783 2018"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18.head(2)"
]
},
{
"cell_type": "markdown",
"id": "f9ac8866",
"metadata": {},
"source": [
"```{important}\n",
"Remember `columns` is an attribute, so it does not need `()`\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4e7b14fe",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['PLAYER_NAME', 'TEAM_ID', 'PLAYER_ID', 'SEASON'], dtype='object')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_19.columns"
]
},
{
"cell_type": "markdown",
"id": "2e7e71c4",
"metadata": {},
"source": [
"Let's make note of the shape of each"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e3a0d995",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((748, 3), (626, 4))"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18.shape, df_19.shape"
]
},
{
"cell_type": "markdown",
"id": "4a737616",
"metadata": {},
"source": [
"## What if we want to analyze them together?\n",
"\n",
"We can combine them vertically:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "55af8abf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1374, 4)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat([df_18,df_19]).shape"
]
},
{
"cell_type": "markdown",
"id": "d2b26b67",
"metadata": {},
"source": [
"Note that this has the maximum number of columns (because both had some overlapping\n",
" columns) and the total number of rows.\n",
"\n",
"\n",
"## How can we find which players changed teams?\n",
"\n",
"To do this we want to have one player column and a column with each year's team.\n",
"\n",
"We can use a merge to do that.\n",
"\n",
"````{margin}\n",
"```{note}\n",
"the `head` here is only to help us see more at a time\n",
"```\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "92612200",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
" PLAYER_NAME | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [TEAM_ID, PLAYER_ID, SEASON, PLAYER_NAME]\n",
"Index: []"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19,).head(2)"
]
},
{
"cell_type": "markdown",
"id": "af9383d1",
"metadata": {},
"source": [
"if we merge them without any parameters, it tries to merge on all shared columns. We want to merge them using the `PLAYER_ID` column though, we would say hat we are \"merging on player ID\" and we use the `on` parameter to do it"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0ba6118e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_x | \n",
" PLAYER_ID | \n",
" SEASON_x | \n",
" PLAYER_NAME | \n",
" TEAM_ID_y | \n",
" SEASON_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_x PLAYER_ID SEASON_x PLAYER_NAME TEAM_ID_y SEASON_y\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019\n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19, on = 'PLAYER_ID').head(2)"
]
},
{
"cell_type": "markdown",
"id": "03d1c7d3",
"metadata": {},
"source": [
"Since there are other columns that appear in both DataFrames, they get a suffis, which by default is `x` or `y`, we can specify them though."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fdcf9bff",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 \n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 \n",
"\n",
" SEASON_2019 \n",
"0 2019 \n",
"1 2019 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19, on = 'PLAYER_ID',\n",
" suffixes=('_2018','_2019')).head(2)"
]
},
{
"cell_type": "markdown",
"id": "bae3759c",
"metadata": {},
"source": [
"We also told it what to append to any column names that match, but are not the same across both datasets."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "25cc8f5e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON\n",
"0 1610612761 202695 2018"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a439cea8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PLAYER_NAME | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Royce O'Neale | \n",
" 1610612762 | \n",
" 1626220 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PLAYER_NAME TEAM_ID PLAYER_ID SEASON\n",
"0 Royce O'Neale 1610612762 1626220 2019"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_19.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f35d1d51",
"metadata": {},
"outputs": [],
"source": [
"df1819 = pd.merge(df_18,df_19, on = 'PLAYER_ID',\n",
" suffixes=('_2018','_2019'))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4a7dc765",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(538, 6)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1819.shape"
]
},
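{
"cell_type": "markdown",
"id": "3f1a2b3c",
"metadata": {},
"source": [
"To answer the section's question directly (a sketch, not executed here): a player changed teams if their 2018 team ID differs from their 2019 team ID in the merged frame. Players traded mid-season will show up in more than one such row.\n",
"\n",
"```python\n",
"changed = df1819[df1819['TEAM_ID_2018'] != df1819['TEAM_ID_2019']]\n",
"changed['PLAYER_NAME'].nunique()   # distinct players with a different team\n",
"```"
]
},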
{
"cell_type": "markdown",
"id": "223ef056",
"metadata": {},
"source": [
"By default, this uses an *inner* merge, so we get the players that are in both datasets only."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "fc013868",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 202695 | \n",
" 2018 | \n",
" Kawhi Leonard | \n",
" 1610612746 | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612763 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 1610612761 | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201980 | \n",
" 2018 | \n",
" Danny Green | \n",
" 1610612747 | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"0 1610612761 202695 2018 Kawhi Leonard 1610612746 \n",
"1 1610612761 1627783 2018 Pascal Siakam 1610612761 \n",
"2 1610612761 201188 2018 Marc Gasol 1610612761 \n",
"3 1610612763 201188 2018 Marc Gasol 1610612761 \n",
"4 1610612761 201980 2018 Danny Green 1610612747 \n",
"\n",
" SEASON_2019 \n",
"0 2019 \n",
"1 2019 \n",
"2 2019 \n",
"3 2019 \n",
"4 2019 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1819.head()"
]
},
{
"cell_type": "markdown",
"id": "353f0257",
"metadata": {},
"source": [
"Some players still appear twice, because they were in one of the datsets twice, this\n",
"happens when a player plays for two team in one season.\n",
"\n",
"## Which players played in 2018, but not 2019?\n",
"\n",
"We have different types of merges, inner is both, out is either. Left and right keep all the rows of one dataFrame. We can use left with `df_18` as the left DataFrame to see which players played only in 18."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f2d7c973",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 749 | \n",
" 1610612752 | \n",
" 1629246 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 750 | \n",
" 1610612748 | \n",
" 1629159 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 751 | \n",
" 1610612762 | \n",
" 1629163 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 752 | \n",
" 1610612743 | \n",
" 1629150 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 753 | \n",
" 1610612738 | \n",
" 1629167 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"749 1610612752 1629246 2018 NaN NaN \n",
"750 1610612748 1629159 2018 NaN NaN \n",
"751 1610612762 1629163 2018 NaN NaN \n",
"752 1610612743 1629150 2018 NaN NaN \n",
"753 1610612738 1629167 2018 NaN NaN \n",
"\n",
" SEASON_2019 \n",
"749 NaN \n",
"750 NaN \n",
"751 NaN \n",
"752 NaN \n",
"753 NaN "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(df_18,df_19,on='PLAYER_ID', how='left',suffixes=('_2018','_2019')).tail()"
]
},
{
"cell_type": "markdown",
"id": "f7bec7a6",
"metadata": {},
"source": [
"To pick out those rows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "6751eaa7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 9 | \n",
" 1610612761 | \n",
" 202391 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" 1610612761 | \n",
" 201975 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 18 | \n",
" 1610612744 | \n",
" 101106 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 23 | \n",
" 1610612744 | \n",
" 2733 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24 | \n",
" 1610612744 | \n",
" 201973 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 749 | \n",
" 1610612752 | \n",
" 1629246 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 750 | \n",
" 1610612748 | \n",
" 1629159 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 751 | \n",
" 1610612762 | \n",
" 1629163 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 752 | \n",
" 1610612743 | \n",
" 1629150 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 753 | \n",
" 1610612738 | \n",
" 1629167 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
216 rows × 6 columns
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"9 1610612761 202391 2018 NaN NaN \n",
"11 1610612761 201975 2018 NaN NaN \n",
"18 1610612744 101106 2018 NaN NaN \n",
"23 1610612744 2733 2018 NaN NaN \n",
"24 1610612744 201973 2018 NaN NaN \n",
".. ... ... ... ... ... \n",
"749 1610612752 1629246 2018 NaN NaN \n",
"750 1610612748 1629159 2018 NaN NaN \n",
"751 1610612762 1629163 2018 NaN NaN \n",
"752 1610612743 1629150 2018 NaN NaN \n",
"753 1610612738 1629167 2018 NaN NaN \n",
"\n",
" SEASON_2019 \n",
"9 NaN \n",
"11 NaN \n",
"18 NaN \n",
"23 NaN \n",
"24 NaN \n",
".. ... \n",
"749 NaN \n",
"750 NaN \n",
"751 NaN \n",
"752 NaN \n",
"753 NaN \n",
"\n",
"[216 rows x 6 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only = pd.merge(df_18,df_19,on='PLAYER_ID', how='left', suffixes=('_2018','_2019'))\n",
"df_18_only[df_18_only['SEASON_2019'].isna()]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "f4994850",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID_2018 | \n",
" PLAYER_ID | \n",
" SEASON_2018 | \n",
" PLAYER_NAME | \n",
" TEAM_ID_2019 | \n",
" SEASON_2019 | \n",
"
\n",
" \n",
" \n",
" \n",
" 9 | \n",
" 1610612761 | \n",
" 202391 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" 1610612761 | \n",
" 201975 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 18 | \n",
" 1610612744 | \n",
" 101106 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 23 | \n",
" 1610612744 | \n",
" 2733 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24 | \n",
" 1610612744 | \n",
" 201973 | \n",
" 2018 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 \\\n",
"9 1610612761 202391 2018 NaN NaN \n",
"11 1610612761 201975 2018 NaN NaN \n",
"18 1610612744 101106 2018 NaN NaN \n",
"23 1610612744 2733 2018 NaN NaN \n",
"24 1610612744 201973 2018 NaN NaN \n",
"\n",
" SEASON_2019 \n",
"9 NaN \n",
"11 NaN \n",
"18 NaN \n",
"23 NaN \n",
"24 NaN "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_left = pd.merge(df_18,df_19.drop(columns=['SEASON'])\n",
" ,on='PLAYER_ID', how='left', suffixes=('_2018','_2019'))\n",
"df_18_only = df_18_only[df_18_only['TEAM_ID_2019'].isna()]\n",
"df_18_only.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "9aba64e4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"178"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n_18_only, _ = df_18_only.drop_duplicates(subset=['PLAYER_ID']).shape\n",
"n_18_only"
]
},
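{
"cell_type": "markdown",
"id": "5a6b7c8d",
"metadata": {},
"source": [
"Another way to get the same answer (a sketch using `merge`'s `indicator` parameter, which adds a `_merge` column labeling each row `'both'`, `'left_only'`, or `'right_only'`):\n",
"\n",
"```python\n",
"marked = pd.merge(df_18, df_19, on='PLAYER_ID', how='outer', indicator=True)\n",
"marked[marked['_merge'] == 'left_only']   # players only in the 2018 data\n",
"```"
]
},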
{
"cell_type": "markdown",
"id": "9aa66a32",
"metadata": {},
"source": [
"## Which players played for the same team both seasons?"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ee7c2c48",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" SEASON_x | \n",
" PLAYER_NAME | \n",
" SEASON_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" 2018 | \n",
" Pascal Siakam | \n",
" 2019 | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 201188 | \n",
" 2018 | \n",
" Marc Gasol | \n",
" 2019 | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 200768 | \n",
" 2018 | \n",
" Kyle Lowry | \n",
" 2019 | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 1627832 | \n",
" 2018 | \n",
" Fred VanVleet | \n",
" 2019 | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201586 | \n",
" 2018 | \n",
" Serge Ibaka | \n",
" 2019 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID SEASON_x PLAYER_NAME SEASON_y\n",
"0 1610612761 1627783 2018 Pascal Siakam 2019\n",
"1 1610612761 201188 2018 Marc Gasol 2019\n",
"2 1610612761 200768 2018 Kyle Lowry 2019\n",
"3 1610612761 1627832 2018 Fred VanVleet 2019\n",
"4 1610612761 201586 2018 Serge Ibaka 2019"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_same_team = pd.merge(df_18,df_19, on = ['PLAYER_ID','TEAM_ID'],)\n",
"df_same_team.head()"
]
},
{
"cell_type": "markdown",
"id": "c5d0a55b",
"metadata": {},
"source": [
"In this case, the suffix only applies to season, but they're not telling us much, so we can clean it up using `drop` before we merge."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "970190c9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TEAM_ID | \n",
" PLAYER_ID | \n",
" PLAYER_NAME | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1610612761 | \n",
" 1627783 | \n",
" Pascal Siakam | \n",
"
\n",
" \n",
" 1 | \n",
" 1610612761 | \n",
" 201188 | \n",
" Marc Gasol | \n",
"
\n",
" \n",
" 2 | \n",
" 1610612761 | \n",
" 200768 | \n",
" Kyle Lowry | \n",
"
\n",
" \n",
" 3 | \n",
" 1610612761 | \n",
" 1627832 | \n",
" Fred VanVleet | \n",
"
\n",
" \n",
" 4 | \n",
" 1610612761 | \n",
" 201586 | \n",
" Serge Ibaka | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TEAM_ID PLAYER_ID PLAYER_NAME\n",
"0 1610612761 1627783 Pascal Siakam\n",
"1 1610612761 201188 Marc Gasol\n",
"2 1610612761 200768 Kyle Lowry\n",
"3 1610612761 1627832 Fred VanVleet\n",
"4 1610612761 201586 Serge Ibaka"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only_clean = pd.merge(df_18.drop(columns='SEASON'),df_19.drop(columns='SEASON'), on = ['PLAYER_ID','TEAM_ID'],)\n",
"df_18_only_clean.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "8f2802f6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(263, 3)"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only_clean.shape"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "f039703e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(263, 3)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_18_only_clean.drop_duplicates(subset=['PLAYER_ID']).shape"
]
},
{
"cell_type": "markdown",
"id": "37a7475e",
"metadata": {},
"source": [
"We do not need to drop the duplicates in this case becaue we merged on two columns and there were no actual duplicates in the original dataset."
]
},
{
"cell_type": "markdown",
"id": "3505e302",
"metadata": {},
"source": [
"## Visualizing merges"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "676afd96",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame(\n",
" {\n",
" \"key\": [\"K0\", \"K1\", \"K2\", \"K4\"],\n",
" \"A\": [\"A0\", \"A1\", \"A2\", \"A4\"],\n",
" \"B\": [\"B0\", \"B1\", \"B2\", \"B4\"],\n",
" }\n",
")\n",
"\n",
"\n",
"right = pd.DataFrame(\n",
" {\n",
" \"key\": [\"K0\", \"K1\", \"K2\", \"K3\"],\n",
" \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
" \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "2134f10e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
"
\n",
" \n",
" 3 | \n",
" K4 | \n",
" A4 | \n",
" B4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B\n",
"0 K0 A0 B0\n",
"1 K1 A1 B1\n",
"2 K2 A2 B2\n",
"3 K4 A4 B4"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "2a4630c6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K3 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key C D\n",
"0 K0 C0 D0\n",
"1 K1 C1 D1\n",
"2 K2 C2 D2\n",
"3 K3 C3 D3"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "db26d2ab",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='inner',)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "f7960896",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K4 | \n",
" A4 | \n",
" B4 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" K3 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K4 A4 B4 NaN NaN\n",
"4 K3 NaN NaN C3 D3"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='outer')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "ef693fcd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K4 | \n",
" A4 | \n",
" B4 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K4 A4 B4 NaN NaN"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='left')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "b5e5ef82",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K2 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K3 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K3 NaN NaN C3 D3"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, how='right')"
]
},
{
"cell_type": "markdown",
"id": "bd6928c8",
"metadata": {},
"source": [
" We can merge on multiple columns too/"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "ae44c24f",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K0\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K1\", \"K0\", \"K1\"],\n",
" \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n",
" \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n",
" }\n",
")\n",
"\n",
"\n",
"right = pd.DataFrame(\n",
" {\n",
" \"key1\": [\"K0\", \"K1\", \"K1\", \"K2\"],\n",
" \"key2\": [\"K0\", \"K0\", \"K0\", \"K0\"],\n",
" \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
" \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
" }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "dd65da6e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B\n",
"0 K0 K0 A0 B0\n",
"1 K0 K1 A1 B1\n",
"2 K1 K0 A2 B2\n",
"3 K2 K1 A3 B3"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "25cc9549",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 3 | \n",
" K2 | \n",
" K0 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 C D\n",
"0 K0 K0 C0 D0\n",
"1 K1 K0 C1 D1\n",
"2 K1 K0 C2 D2\n",
"3 K2 K0 C3 D3"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "6a5ba4a3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K1 K0 A2 B2 C1 D1\n",
"2 K1 K0 A2 B2 C2 D2"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "1c3705d6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" A | \n",
" B | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 3 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 4 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" K2 | \n",
" K0 | \n",
" NaN | \n",
" NaN | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K0 K1 A1 B1 NaN NaN\n",
"2 K1 K0 A2 B2 C1 D1\n",
"3 K1 K0 A2 B2 C2 D2\n",
"4 K2 K1 A3 B3 NaN NaN\n",
"5 K2 K0 NaN NaN C3 D3"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=[\"key1\", \"key2\"],how='outer')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "b727354a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2_x | \n",
" A | \n",
" B | \n",
" key2_y | \n",
" C | \n",
" D | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" K0 | \n",
" K0 | \n",
" A0 | \n",
" B0 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 1 | \n",
" K0 | \n",
" K1 | \n",
" A1 | \n",
" B1 | \n",
" K0 | \n",
" C0 | \n",
" D0 | \n",
"
\n",
" \n",
" 2 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" K0 | \n",
" C1 | \n",
" D1 | \n",
"
\n",
" \n",
" 3 | \n",
" K1 | \n",
" K0 | \n",
" A2 | \n",
" B2 | \n",
" K0 | \n",
" C2 | \n",
" D2 | \n",
"
\n",
" \n",
" 4 | \n",
" K2 | \n",
" K1 | \n",
" A3 | \n",
" B3 | \n",
" K0 | \n",
" C3 | \n",
" D3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2_x A B key2_y C D\n",
"0 K0 K0 A0 B0 K0 C0 D0\n",
"1 K0 K1 A1 B1 K0 C0 D0\n",
"2 K1 K0 A2 B2 K0 C1 D1\n",
"3 K1 K0 A2 B2 K0 C2 D2\n",
"4 K2 K1 A3 B3 K0 C3 D3"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right,on = 'key1' )"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "5547fa50",
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame({\"A\": [1, 2], \"B\": [2, 2]})\n",
"\n",
"right = pd.DataFrame({\"A\": [4, 5, 6], \"B\": [2, 2, 2]})\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "3a9b0b5d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 1 2\n",
"1 2 2"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "908c67e0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B\n",
"0 4 2\n",
"1 5 2\n",
"2 6 2"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"right"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "f0c8734a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A_x | \n",
" B | \n",
" A_y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
" 3 | \n",
" 2 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A_x B A_y\n",
"0 1 2 4\n",
"1 1 2 5\n",
"2 1 2 6\n",
"3 2 2 4\n",
"4 2 2 5\n",
"5 2 2 6"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.merge(left, right, on=\"B\", how=\"outer\")"
]
},
{
"cell_type": "markdown",
"id": "875ce932",
"metadata": {},
"source": [
"This last merge is many-to-many: every `B == 2` row on the left pairs with every `B == 2` row on the right, so we get 2 × 3 = 6 rows."
]
},
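{
"cell_type": "markdown",
"id": "9b0c1d2e",
"metadata": {},
"source": [
"If a row blow-up like that would be a bug in your data, `merge`'s `validate` parameter can catch it; a sketch, where the merge raises instead of silently multiplying rows:\n",
"\n",
"```python\n",
"try:\n",
"    pd.merge(left, right, on='B', validate='one_to_one')\n",
"except pd.errors.MergeError as err:\n",
"    print(err)   # the B keys are not unique in either frame\n",
"```"
]
},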
{
"cell_type": "markdown",
"id": "8a80198e",
"metadata": {},
"source": [
"## Questions After Class\n",
"\n",
"```{important}\n",
"several questions were easiest to answer within the narrative of the notes above.\n",
"```\n",
"\n",
"### How do I represent NaN as a variable? (Ex. df1819['TEAM_ID_y'==NaN]...something like that?)\n",
"\n",
"For that specific case, you can use the `isna` method as I did above, but if you need a NaN constant otherwise pandas provides"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "56321717",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.NA"
]
},
{
"cell_type": "markdown",
"id": "d0e38245",
"metadata": {},
"source": [
"and numpy provides"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "b7a79448",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"np.nan"
]
},
{
"cell_type": "markdown",
"id": "5ee7fb5d",
"metadata": {},
"source": [
"Watch out though because they are not the same:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "aaaf0b70",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan == pd.NA"
]
},
{
"cell_type": "markdown",
"id": "708ac17c",
"metadata": {},
"source": [
"and this cannot even assert"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "794a3da9",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "boolean value of NA is ambiguous",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[44], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m pd\u001b[38;5;241m.\u001b[39mNA \u001b[38;5;241m==\u001b[39m np\u001b[38;5;241m.\u001b[39mnan\n",
"File \u001b[0;32m/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/missing.pyx:382\u001b[0m, in \u001b[0;36mpandas._libs.missing.NAType.__bool__\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: boolean value of NA is ambiguous"
]
}
],
"source": [
"assert pd.NA == np.nan"
]
},
{
"cell_type": "markdown",
"id": "1f9f7595",
"metadata": {},
"source": [
"The `pandas` `pd.isna` method is robust, and it knows how `numpy` works (because it is built on top and imports)."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "4c17cb24",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.isna(np.nan)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "b91595a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.isna(pd.NA)"
]
},
{
"cell_type": "markdown",
"id": "166e3e20",
"metadata": {},
"source": [
"However, `numpy` does not know `pandas`"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "a0aa05bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.isnan(pd.NA)"
]
},
{
"cell_type": "markdown",
"id": "313e21c0",
"metadata": {},
"source": [
"and returns a value that cannot be used for indexing\n",
"\n",
"### Is there an easier way to look at the changes with merge?\n",
"\n",
"Any of the tools we have to look at a DataFrame will work. I think using a small DataFrame is the best way to get a good idea of how they each work, so I included that. Overall, checking the shape gives a broad view and then comparing values gives the detials.\n",
"\n",
"### Do merges also delete any other overlapping columns other than the one they are told to consolidate on?\n",
"\n",
"No they copy them all\n",
"\n",
"### in pd.merge(p18,p19, on='PLAYER_ID',how='____').shape is p18 and p19 supposed to be df_18 and df_19?\n",
"\n",
"Yes, sorry, I read them in using different names than were in my notes\n",
"\n",
"### Learning more about accessing data from databases\n",
"\n",
"We will access databases on Wednesday\n",
"\n",
"### How to merge data that does not line up as well as these two datasets?\n",
"\n",
"We will see how to merge on columns that have the same data, but different columns this week. When there is no columns that match value to value, you have to transform and add new columns, combining what we learned about cleaning data to make them line up somehow. For example, maybe one dataset has dates like `YYYY-MM-DD` and the other data set has only months like \"January\". You could split the whole date into three columns and transform the strings to numbers or numbers to strings by treating the columns as [dates](https://pandas.pydata.org/docs/user_guide/timeseries.html)\n",
"\n",
"### Why did we merged on 'PLAYER_ID'?\n",
"\n",
"We were interested in linking based on the players.\n",
"\n",
"\n",
"### What is the second best source after documentation? For ex: you found a parameter that might do what you want it to in pandas, but where do you go after to make sure? I'm not sure if I ran it, I'd specifically know if it \"worked\"\n",
"\n",
"Ideally, you should know what you are looking for to know if it worked or not. If you are not sure what you are looking for, that is a good time to try to think that through, or attend office hours. Next is to create an issue on your assignment repo if the question might reveal a solution.\n",
"\n",
"\n",
"### I think I need to understand how GitHub works because I'm following the steps of the assignments and I feel like I'm doing something wrong every time I have to work on GitHub.\n",
"\n",
"If you are getting feedback, you are close enough. However, I am offering an extra session on GitHub on Friday."
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.14.1"
}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"source_map": [
12,
44,
46,
50,
52,
56,
59,
63,
65,
71,
73,
77,
79,
85,
87,
105,
107,
111,
113,
117,
120,
124,
128,
132,
137,
139,
143,
145,
154,
156,
160,
165,
172,
175,
179,
182,
186,
191,
195,
197,
201,
205,
224,
228,
232,
236,
240,
244,
246,
250,
271,
275,
279,
283,
287,
291,
295,
302,
306,
310,
314,
316,
328,
330,
333,
336,
341,
343,
348,
350,
355,
359,
361,
366,
368
]
},
"nbformat": 4,
"nbformat_minor": 5
}