12. Building Datasets From Multiple Sources#
1. remember, that
1. with markdown
1. you can make a numbered list
1. using all `1.`

renders as:

1. remember, that
2. with markdown
3. you can make a numbered list
4. using all `1.`
import pandas as pd
course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'
Today we’re going to look at some data on NBA (National Basketball Association) players.
12.1. Combining Multiple Tables#
p18 = pd.read_csv(course_data_url +'2018-players.csv')
p19 = pd.read_csv(course_data_url +'2019-players.csv')
p18.head()
| | TEAM_ID | PLAYER_ID | SEASON |
|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 |
| 1 | 1610612761 | 1627783 | 2018 |
| 2 | 1610612761 | 201188 | 2018 |
| 3 | 1610612761 | 201980 | 2018 |
| 4 | 1610612761 | 200768 | 2018 |
12.1.1. Stacking with Concat#
We’ve seen one way of putting DataFrames together with `concat`; we used it when unpacking JSONs. In that case, we stacked them side by side; now we want to stack them vertically, which is the default for `concat`.
players_df = pd.concat([p18,p19])
players_df.head()
| | TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | NaN |
| 1 | 1610612761 | 1627783 | 2018 | NaN |
| 2 | 1610612761 | 201188 | 2018 | NaN |
| 3 | 1610612761 | 201980 | 2018 | NaN |
| 4 | 1610612761 | 200768 | 2018 | NaN |
This has the same columns, and we can see what happened more clearly by looking at the shape of each.
players_df.shape
(1374, 4)
p18.shape, p19.shape
((748, 3), (626, 4))
We can verify that the length of the new DataFrame is the sum of the lengths of the original two DataFrames.
assert len(p18) + len(p19) == len(players_df)
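One thing to watch with vertical stacking: by default `concat` keeps each original DataFrame's index, so row labels repeat. A minimal sketch (with hypothetical tiny frames standing in for `p18` and `p19`) of getting a fresh index with `ignore_index=True`:

```python
import pandas as pd

# two hypothetical frames standing in for p18 and p19
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# default concat keeps the original row labels: 0, 1, 0, 1
kept = pd.concat([a, b])
# ignore_index=True builds a fresh 0..n-1 index instead
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```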
12.1.2. Combining Data with Merge#
What if we want to see which players changed teams from 2018 to 2019? We can do this with `merge`. For `merge`, we have to tell it a left DataFrame, a right DataFrame, and what column to match them up on. The left and right DataFrames are used in different ways, but any DataFrame can be put in either position. For this case, we will use the 2018 data as left and the 2019 data as right, and merge `on='PLAYER_ID'`.
pd.merge(p18,p19,on='PLAYER_ID').head()
| | TEAM_ID_x | PLAYER_ID | SEASON_x | PLAYER_NAME | TEAM_ID_y | SEASON_y |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
| 2 | 1610612761 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 3 | 1610612763 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 4 | 1610612761 | 201980 | 2018 | Danny Green | 1610612747 | 2019 |
Now we get what we expect, but the overlapping column names have `_x` and `_y` on the end (as a suffix, appended to the original). We’ll use 2018 and 2019 respectively instead, separated from the original name with a `_`.
year_over_year = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'))
year_over_year.head()
| | TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
| 2 | 1610612761 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 3 | 1610612763 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 4 | 1610612761 | 201980 | 2018 | Danny Green | 1610612747 | 2019 |
Now that it’s a little bit cleaner, we will examine how it works by looking at the shape.
year_over_year.shape, p18.shape, p19.shape
((538, 6), (748, 3), (626, 4))
This kept only the players that played both years, with repetitions for each team they played on for each year.
We can check the calculation with set math (Python’s `set` type has operations from mathematical sets, like intersection and set difference).
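As a quick refresher, here is a sketch of those set operations on small hypothetical sets:

```python
# two small hypothetical sets
a = {1, 2, 3}
b = {2, 3, 4}

print(a.intersection(b))  # elements in both: {2, 3}
print(a.difference(b))    # in a but not in b: {1}
print(b.difference(a))    # in b but not in a: {4}
```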
# player IDs for each year, no repeats
p19_u = set(p19['PLAYER_ID'])
p18_u = set(p18['PLAYER_ID'])
# player IDs that played both years
# (as a list, since indexing with a set is deprecated)
p1819 = list(p18_u.intersection(p19_u))
# teams per player per year
teams_per_p18 = p18['PLAYER_ID'].value_counts()
teams_per_p19 = p19['PLAYER_ID'].value_counts()
# total number of team-player combinations
# multiply number of teams each player played for in 18 by number of teams in 19
# then sum. (most of these are 1*1)
sum(teams_per_p19[p1819]* teams_per_p18[p1819])
538
We can also merge so that we keep all players from either year using `how='outer'`. The default value for `how` is `'inner'`, which takes the intersection (though with duplicates it does some extra things, as we saw). With `'outer'` it takes the union, again with extra handling for the duplicates.
pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer').shape
(927, 6)
It’s the total of the rows we had before, plus the total number of player-teams for players that only played in one of the two years.
# players that only played in one year
# (as lists, since indexing with a set is deprecated)
o18 = list(p18_u.difference(p19_u))
o19 = list(p19_u.difference(p18_u))
# teams those players played for + the above 538
teams_per_p19[o19].sum() + teams_per_p18[o18].sum() + sum(teams_per_p19[p1819]* teams_per_p18[p1819])
927
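A related trick, not used above but worth knowing: `merge` accepts `indicator=True`, which adds a `_merge` column recording whether each row came from the left side only, the right side only, or both. A sketch with hypothetical tiny frames:

```python
import pandas as pd

# hypothetical tiny frames: player 2 appears in both years
left = pd.DataFrame({'PLAYER_ID': [1, 2], 'TEAM_ID': [10, 20]})
right = pd.DataFrame({'PLAYER_ID': [2, 3], 'TEAM_ID': [20, 30]})

both = pd.merge(left, right, on='PLAYER_ID', how='outer', indicator=True)
# _merge labels each row 'left_only', 'right_only', or 'both'
print(both[['PLAYER_ID', '_merge']])
```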
We can save this to a variable
year_over_year_outer = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer')
then look at a few rows to see that it has indeed filled in with NaN in the places where there wasn’t a value.
year_over_year_outer.sample(10)
| | TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 |
|---|---|---|---|---|---|---|
| 485 | 1.610613e+09 | 2772 | 2018.0 | Trevor Ariza | 1.610613e+09 | 2019.0 |
| 300 | 1.610613e+09 | 202714 | 2018.0 | NaN | NaN | NaN |
| 639 | 1.610613e+09 | 1626150 | 2018.0 | Andrew Harrison | 1.610613e+09 | 2019.0 |
| 861 | NaN | 1629683 | NaN | Quinndary Weatherspoon | 1.610613e+09 | 2019.0 |
| 264 | 1.610613e+09 | 1629027 | 2018.0 | Trae Young | 1.610613e+09 | 2019.0 |
| 720 | 1.610613e+09 | 1629008 | 2018.0 | Michael Porter Jr. | 1.610613e+09 | 2019.0 |
| 883 | NaN | 1629860 | NaN | Jon Davis | 1.610613e+09 | 2019.0 |
| 826 | NaN | 1629076 | NaN | Tyler Cook | 1.610613e+09 | 2019.0 |
| 384 | 1.610613e+09 | 203967 | 2018.0 | Dario Saric | 1.610613e+09 | 2019.0 |
| 52 | 1.610613e+09 | 203090 | 2018.0 | Maurice Harkless | 1.610613e+09 | 2019.0 |
Note

We can also tell that there are NaN values because pandas cast the year columns from int to float.
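If that float cast is undesirable, pandas also has a nullable integer dtype, `'Int64'`, that can hold missing values without casting; a minimal sketch:

```python
import pandas as pd

# a season column using the nullable Int64 dtype instead of plain int64
seasons = pd.Series([2018, 2019], dtype='Int64')
# appending a missing value does not force a cast to float
with_missing = pd.concat([seasons, pd.Series([pd.NA], dtype='Int64')],
                         ignore_index=True)

print(with_missing.dtype)  # Int64, even with a missing value
```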
12.2. Merge types, in detail#
We can examine how these things work more visually with smaller DataFrames:
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
left
| | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right
| | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
pd.merge(left, right, on=["key1", "key2"], how='inner')
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(left, right, on=["key1", "key2"], how='outer' )
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |
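Besides `inner` and `outer`, `how` can also be `'left'` or `'right'`. A sketch (rebuilding the same small frames) of a left join, which keeps every row of the left DataFrame and fills unmatched rows with NaN:

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})
right = pd.DataFrame({
    "key1": ["K0", "K1", "K1", "K2"],
    "key2": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

# every left row survives; (K1, K0) matches two right rows,
# and the two unmatched left rows get NaN in C and D
left_join = pd.merge(left, right, on=["key1", "key2"], how="left")
print(len(left_join))  # 5
```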
12.3. Duplicate Keys#
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
result = pd.merge(left, right, on="B", how="outer")
left
| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 2 |
right
| | A | B |
|---|---|---|
| 0 | 4 | 2 |
| 1 | 5 | 2 |
| 2 | 6 | 2 |
result
| | A_x | B | A_y |
|---|---|---|---|
| 0 | 1 | 2 | 4 |
| 1 | 1 | 2 | 5 |
| 2 | 1 | 2 | 6 |
| 3 | 2 | 2 | 4 |
| 4 | 2 | 2 | 5 |
| 5 | 2 | 2 | 6 |
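The row count here follows from the duplicate keys: each left row with `B == 2` pairs with each right row with `B == 2`, so the result has the product of the two counts. A sketch verifying that:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="outer")
# 2 left rows x 3 right rows, all sharing the key B=2
print(len(result))  # 6
```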
If we ask pandas to validate the merge with validate
and a specific type of
merge, it will throw an error if that type of merge is not possible.
pd.merge(left, right,on='B',validate='one_to_one')
---------------------------------------------------------------------------
MergeError Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 pd.merge(left, right,on='B',validate='one_to_one')
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:107, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
90 @Substitution("\nleft : DataFrame or named Series")
91 @Appender(_merge_doc, indents=0)
92 def merge(
(...)
105 validate: str | None = None,
106 ) -> DataFrame:
--> 107 op = _MergeOperation(
108 left,
109 right,
110 how=how,
111 on=on,
112 left_on=left_on,
113 right_on=right_on,
114 left_index=left_index,
115 right_index=right_index,
116 sort=sort,
117 suffixes=suffixes,
118 copy=copy,
119 indicator=indicator,
120 validate=validate,
121 )
122 return op.get_result()
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:710, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
706 # If argument passed to validate,
707 # check if columns specified as unique
708 # are in fact unique.
709 if validate is not None:
--> 710 self._validate(validate)
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1427, in _MergeOperation._validate(self, validate)
1425 if validate in ["one_to_one", "1:1"]:
1426 if not left_unique and not right_unique:
-> 1427 raise MergeError(
1428 "Merge keys are not unique in either left "
1429 "or right dataset; not a one-to-one merge"
1430 )
1431 elif not left_unique:
1432 raise MergeError(
1433 "Merge keys are not unique in left dataset; not a one-to-one merge"
1434 )
MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
Further Reading

The pandas documentation is a good place to read through their examples of how `validate` works. It’s important to note the types of possible joins described at the beginning of that section.
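For contrast, a sketch of a `validate` value that passes on the same small frames: `'many_to_many'` is the loosest check, so it does not raise here.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# keys repeat on both sides, so only the many_to_many validation succeeds
ok = pd.merge(left, right, on="B", validate="many_to_many")
print(ok.shape)  # (6, 3)
```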
12.4. Questions at the End of Class#
12.4.1. how to remove certain columns in merges (i.e. the year columns in the basketball dataset since they are redundant)#
They can be dropped after (or before) the merge using `drop()`.
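For example, with a tiny hypothetical stand-in for the merged `year_over_year` frame, the redundant season columns can be dropped like this:

```python
import pandas as pd

# a tiny stand-in for the merged year_over_year frame
merged = pd.DataFrame({'PLAYER_ID': [202695],
                       'SEASON_2018': [2018],
                       'SEASON_2019': [2019]})

# drop the redundant year columns after the merge
trimmed = merged.drop(columns=['SEASON_2018', 'SEASON_2019'])
print(trimmed.columns.tolist())  # ['PLAYER_ID']
```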
12.4.2. Can we merge as many dataframes as we would want?#
Merge is specifically defined for two DataFrames, but the result is itself a DataFrame, so it can be passed into another merge.
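Since each merge returns a DataFrame, a list of DataFrames can be folded pairwise; a sketch using `functools.reduce` with hypothetical frames sharing a `key` column:

```python
from functools import reduce
import pandas as pd

# hypothetical small frames that all share a 'key' column
frames = [
    pd.DataFrame({'key': [1, 2], 'a': [10, 20]}),
    pd.DataFrame({'key': [1, 2], 'b': [30, 40]}),
    pd.DataFrame({'key': [1, 2], 'c': [50, 60]}),
]

# merge the first two, then merge that result with the third, and so on
merged = reduce(lambda l, r: pd.merge(l, r, on='key'), frames)
print(merged.columns.tolist())  # ['key', 'a', 'b', 'c']
```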
12.4.3. Is there no way to merge the left and right for the A and B data?#
All of the merges are possible; they just each give a different result.
12.4.4. Can you use merging to check correlation between two different sets?#
You can use merging to check the overlap between two DataFrames; for actual sets, it’s probably best to represent them as `set` objects and use set operations. However, these types of operations are not actually correlations, which we’ll learn a bit about in a few weeks when we talk about regression.
12.4.5. Is there any way to compare the team ids in both datasets so that it outputs the players that changed teams between those times?#
From the `year_over_year` DataFrame we made above, we can check which players changed teams. Since players can also change teams within a season, this is a tricky question, but we could, for example, look only for players that have no overlapping teams from 2018 to 2019.
change_check = lambda r: not(r['TEAM_ID_2018'] == r['TEAM_ID_2019'])
year_over_year['CHANGE_TEAM'] = year_over_year.apply(change_check, axis=1)
year_over_year.head()
Then we can filter the data by the new column, take out only the player information, and drop any duplicates for players that played on multiple teams in one year or the other.
y_o_y_changes = year_over_year[year_over_year['CHANGE_TEAM']]
y_o_y_change_players = y_o_y_changes[['PLAYER_ID','PLAYER_NAME']].drop_duplicates()
y_o_y_change_players.head()
and we can check how many players that is
y_o_y_change_players.shape
12.4.6. in year_over_year, where suffixes=('_2018','_2019'), why is there an underscore?#
This is optional, but it is best practice for Pythonic variable naming (and remember, columns are like variables).
12.4.7. If we merge two dataframes which each have a column (or columns) that are the same, but each have different data types in those columns, what happens?#
If they’re just named the same, but not the `on` column, you’ll get the two columns with the default suffixes `('_x','_y')`. If it’s the `on` column, the merge will be empty.
12.4.8. how does suffix know what to change?#

The suffixes are appended, as we specify them, to the column names that occur in both DataFrames.
12.5. More Practice#
  * Use a merge to figure out how many players did not return for the 2019 season.
  * Also load the `conferences.csv` data. Use this to figure out how many players moved conferences from 2018 to 2019.