12. Building Datasets From Multiple Sources#

1. remember, that
1. with markdown
1. you can make a numbered list
1. using all `1.`

renders as:

  1. remember, that

  2. with markdown

  3. you can make a numbered list

  4. using all 1.

import pandas as pd
course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'

Today we’re going to look at some data on NBA (National Basketball Association) players.

12.1. Combining Multiple Tables#

p18 = pd.read_csv(course_data_url +'2018-players.csv')
p19 = pd.read_csv(course_data_url +'2019-players.csv')
p18.head()
TEAM_ID PLAYER_ID SEASON
0 1610612761 202695 2018
1 1610612761 1627783 2018
2 1610612761 201188 2018
3 1610612761 201980 2018
4 1610612761 200768 2018

12.1.1. Stacking with Concat#

We’ve seen one way of putting DataFrames together with concat; we used it when Unpacking Jsons. In that case, we stacked them side by side; now we want to stack them vertically, which is the default for concat.

players_df = pd.concat([p18,p19])
players_df.head()
TEAM_ID PLAYER_ID SEASON PLAYER_NAME
0 1610612761 202695 2018 NaN
1 1610612761 1627783 2018 NaN
2 1610612761 201188 2018 NaN
3 1610612761 201980 2018 NaN
4 1610612761 200768 2018 NaN

The two DataFrames share most of their columns (only p19 has PLAYER_NAME), and we can see more of what happened by looking at the shape of each.

players_df.shape
(1374, 4)
p18.shape, p19.shape
((748, 3), (626, 4))

We can verify that the length of the new data frame is the sum of the original two DataFrames.

assert len(p18) + len(p19) == len(players_df)
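For contrast, here is a minimal sketch of the side-by-side stacking we used for JSONs, passing axis=1 to concat. The two small DataFrames here are made up just for illustration.

# two toy DataFrames that share an index (made up for illustration)
scores = pd.DataFrame({'points': [10, 20]})
names = pd.DataFrame({'name': ['a', 'b']})

# axis=1 stacks side by side (adds columns); the default axis=0 stacks vertically (adds rows)
pd.concat([scores, names], axis=1)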

12.1.2. Combining Data with Merge#

What if we want to see which players changed teams from 2018 to 2019? We can do this with merge. For merge, we have to give it a left DataFrame and a right DataFrame, and tell it which column to match them up on.

The left and right DataFrames will be used in different ways, but any DataFrame can be put in either position.

For this case, we will use the 2018 data as left and the 2019 data as right, and then merge on='PLAYER_ID'.

pd.merge(p18,p19,on='PLAYER_ID').head()
TEAM_ID_x PLAYER_ID SEASON_x PLAYER_NAME TEAM_ID_y SEASON_y
0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019
1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019
2 1610612761 201188 2018 Marc Gasol 1610612761 2019
3 1610612763 201188 2018 Marc Gasol 1610612761 2019
4 1610612761 201980 2018 Danny Green 1610612747 2019

Now, we get what we expect, but the column names have _x and _y on the end (a suffix appended to the original name). We’ll use _2018 and _2019 instead, so the year is appended to the original name, separated with a _.

year_over_year = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'))
year_over_year.head()
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019
1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019
2 1610612761 201188 2018 Marc Gasol 1610612761 2019
3 1610612763 201188 2018 Marc Gasol 1610612761 2019
4 1610612761 201980 2018 Danny Green 1610612747 2019

Now that it’s a little bit cleaner, we will examine how it works by looking at the shape.

year_over_year.shape, p18.shape, p19.shape
((538, 6), (748, 3), (626, 4))

This kept only the players that played both years, with repetitions for each team they played on for each year.

We can check the calculation with set math (the Python set type has operations from mathematical sets, like intersection and set difference).

# player IDs for each year, no repeats
p19_u = set(p19['PLAYER_ID'])
p18_u = set(p18['PLAYER_ID'])

# player IDs that played both years
p1819 = p18_u.intersection(p19_u)

# teams per player per year
teams_per_p18 = p18['PLAYER_ID'].value_counts()
teams_per_p19 = p19['PLAYER_ID'].value_counts()

# total number of team-player combinations:
# multiply the number of teams each player played for in 2018 by the number in 2019,
# then sum (most of these are 1*1)
# note: index with a list; passing a set as an indexer is deprecated in pandas
sum(teams_per_p19[list(p1819)] * teams_per_p18[list(p1819)])
538

We can also merge so that we keep all players from either year using how='outer'. The default value for how is 'inner', which takes the intersection (though, with duplicate keys, it does some extra things, as we saw). With 'outer' it takes the union, again with extra handling for the duplicates.

pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer').shape
(927, 6)

It’s the total of the rows we had before, plus the total number of player-teams for players that only played in one of the two years.

# players that only played in one year
o18 = p18_u.difference(p19_u)
o19 = p19_u.difference(p18_u)
# teams those players played for + the 538 from above
# (again indexing with lists to avoid the deprecated set indexer)
teams_per_p19[list(o19)].sum() + teams_per_p18[list(o18)].sum() + sum(teams_per_p19[list(p1819)] * teams_per_p18[list(p1819)])
927

We can save this to a variable

year_over_year_outer =  pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer')

then look at a few rows to see that it has indeed filled in with NaN in the places where there wasn’t a value.

year_over_year_outer.sample(10)
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
485 1.610613e+09 2772 2018.0 Trevor Ariza 1.610613e+09 2019.0
300 1.610613e+09 202714 2018.0 NaN NaN NaN
639 1.610613e+09 1626150 2018.0 Andrew Harrison 1.610613e+09 2019.0
861 NaN 1629683 NaN Quinndary Weatherspoon 1.610613e+09 2019.0
264 1.610613e+09 1629027 2018.0 Trae Young 1.610613e+09 2019.0
720 1.610613e+09 1629008 2018.0 Michael Porter Jr. 1.610613e+09 2019.0
883 NaN 1629860 NaN Jon Davis 1.610613e+09 2019.0
826 NaN 1629076 NaN Tyler Cook 1.610613e+09 2019.0
384 1.610613e+09 203967 2018.0 Dario Saric 1.610613e+09 2019.0
52 1.610613e+09 203090 2018.0 Maurice Harkless 1.610613e+09 2019.0

Note

We can also tell that there are NaN values because the year columns were cast from int to float.
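If we wanted the season columns to stay integer-valued even with the missing entries, one option (a sketch, not something we did in class) is pandas' nullable Int64 dtype, which can hold missing values:

# cast the season columns to the nullable Int64 dtype so missing values don't force floats
year_over_year_outer['SEASON_2018'] = year_over_year_outer['SEASON_2018'].astype('Int64')
year_over_year_outer['SEASON_2019'] = year_over_year_outer['SEASON_2019'].astype('Int64')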

12.2. Merge types, in detail#

We can examine how these things work more visually with smaller DataFrames:

left = pd.DataFrame(
  {
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
  }
)


right = pd.DataFrame(
  {
    "key1": ["K0", "K1", "K1", "K2"],
    "key2": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
  }
)
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, on=["key1", "key2"], how='inner')
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
pd.merge(left, right, on=["key1", "key2"], how='outer' )
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
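The same small DataFrames can also illustrate the other two values of how, which keep all the key pairs from one side only; this is just a sketch of the calls. With 'left', every row of left is kept and C and D are filled with NaN where there is no match; 'right' does the same for right.

# keep all (key1, key2) pairs from the left DataFrame
pd.merge(left, right, on=["key1", "key2"], how='left')

# keep all (key1, key2) pairs from the right DataFrame
pd.merge(left, right, on=["key1", "key2"], how='right')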

12.3. Duplicate Keys#

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="outer")
left
A B
0 1 2
1 2 2
right
A B
0 4 2
1 5 2
2 6 2
result
A_x B A_y
0 1 2 4
1 1 2 5
2 1 2 6
3 2 2 4
4 2 2 5
5 2 2 6

If we ask pandas to validate the merge with validate and a specific type of merge, it will throw an error if that type of merge is not possible.

pd.merge(left, right,on='B',validate='one_to_one')
---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 pd.merge(left, right,on='B',validate='one_to_one')

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:107, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     90 @Substitution("\nleft : DataFrame or named Series")
     91 @Appender(_merge_doc, indents=0)
     92 def merge(
   (...)
    105     validate: str | None = None,
    106 ) -> DataFrame:
--> 107     op = _MergeOperation(
    108         left,
    109         right,
    110         how=how,
    111         on=on,
    112         left_on=left_on,
    113         right_on=right_on,
    114         left_index=left_index,
    115         right_index=right_index,
    116         sort=sort,
    117         suffixes=suffixes,
    118         copy=copy,
    119         indicator=indicator,
    120         validate=validate,
    121     )
    122     return op.get_result()

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:710, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    706 # If argument passed to validate,
    707 # check if columns specified as unique
    708 # are in fact unique.
    709 if validate is not None:
--> 710     self._validate(validate)

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1427, in _MergeOperation._validate(self, validate)
   1425 if validate in ["one_to_one", "1:1"]:
   1426     if not left_unique and not right_unique:
-> 1427         raise MergeError(
   1428             "Merge keys are not unique in either left "
   1429             "or right dataset; not a one-to-one merge"
   1430         )
   1431     elif not left_unique:
   1432         raise MergeError(
   1433             "Merge keys are not unique in left dataset; not a one-to-one merge"
   1434         )

MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
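For comparison, a validate value that matches the actual relationship between these two DataFrames does not raise. Here both sides repeat the key value in B, so a many-to-many merge is what we actually have (a sketch; this is the same merge as above):

# both left and right repeat the value in B, so this validation does not raise
pd.merge(left, right, on='B', validate='many_to_many')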

Further Reading

The pandas documentation is a good place to read through their examples of how validate works. It's important to note the types of possible joins described at the beginning of that section.

12.4. Questions at the End of Class#

12.4.1. how to remove certain columns in merges (i.e. the year columns in the basketball dataset since they are redundant)#

They can be dropped after (or before) the merge, using drop().
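For example, on the merged basketball data above, dropping the redundant season columns might look like this (a sketch using the year_over_year DataFrame from earlier):

# drop the redundant season columns from the merged data
year_over_year.drop(columns=['SEASON_2018', 'SEASON_2019']).head()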

12.4.2. Can we merge as many dataframes as we would want?#

Merge is specifically defined for two DataFrames, but the result is a DataFrame, so it could be passed into another merge.
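For example, with three small toy DataFrames that share a key column (made up for illustration), the result of one merge can be fed straight into the next:

# three toy DataFrames sharing a 'key' column (made up for illustration)
df_a = pd.DataFrame({'key': ['k0', 'k1'], 'a': [1, 2]})
df_b = pd.DataFrame({'key': ['k0', 'k1'], 'b': [3, 4]})
df_c = pd.DataFrame({'key': ['k0', 'k1'], 'c': [5, 6]})

# the result of the inner merge is a DataFrame, so it can be merged again
pd.merge(pd.merge(df_a, df_b, on='key'), df_c, on='key')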

12.4.3. Is there no way to merge the left and right for the A and B data?#

All merges are possible; they just each give a different result.

12.4.4. Can you use merging to check correlation between two different sets?#

You can use merging to check overlap between two DataFrames; for actual sets, it's probably best to represent them as set type objects and use set operations.

However, these kinds of operations are not actually correlations, which we'll learn a bit about in a few weeks when we talk about regression.

12.4.5. Is there any way to compare the team ids in both datasets so that it outputs the players that changed teams between those times?#

From the year_over_year DataFrame we made above, we can check which players changed teams. Since players can also change teams within a season, this is a tricky question, but we could, for example, look only for players that have no overlapping teams from 2018 to 2019.

change_check = lambda r: not(r['TEAM_ID_2018'] == r['TEAM_ID_2019'])
year_over_year['CHANGE_TEAM'] = year_over_year.apply(change_check, axis=1)
year_over_year.head()

Then we can filter the data by the new column, take out only the player information, and drop any duplicates for players that played for multiple teams in one year or the other.

y_o_y_changes = year_over_year[year_over_year['CHANGE_TEAM']]
y_o_y_change_players = y_o_y_changes[['PLAYER_ID','PLAYER_NAME']].drop_duplicates()
y_o_y_change_players.head()

and we can check how many players that is

y_o_y_change_players.shape

12.4.6. in year_over_year, where suffixes=(‘_2018’,‘_2019’), why is there an underscore?#

This is optional, but it is best practice for Pythonic variable naming (and remember, columns are like variables).
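For example, leaving the underscore out also works; the columns would just read TEAM_ID2018 and so on, which is harder to read (a sketch, same merge as above):

# same merge as before, but with suffixes that have no underscore
pd.merge(p18, p19, on='PLAYER_ID', suffixes=('2018', '2019')).columns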

12.4.7. If we merge two dataframes which each have a column (or columns) that are the same, but each have different data types in those columns, what happens?#

If they’re just named the same, but neither is the on column, you’ll get both columns with the default suffixes ('_x', '_y'). If it is the on column, the merge will be empty.

12.4.8. how does suffix know to change the date?#

The suffixes we pass are appended to the column names that occur in both DataFrames (other than the on column).

12.5. More Practice#

  1. Use a merge to figure out how many players did not return for the 2019 season

  2. Also load the conferences.csv data. Use this to figure out how many players moved conferences from 2018 to 2019.