12. Building Datasets From Multiple Sources#
1. remember, that
1. with markdown
1. you can make a numbered list
1. using all `1.`

renders as:

1. remember, that
2. with markdown
3. you can make a numbered list
4. using all `1.`
import pandas as pd
course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'
Today we’re going to look at some data on NBA (National Basketball Association) players.
12.1. Combining Multiple Tables#
p18 = pd.read_csv(course_data_url +'2018-players.csv')
p19 = pd.read_csv(course_data_url +'2019-players.csv')
p18.head()
| | TEAM_ID | PLAYER_ID | SEASON |
|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 |
| 1 | 1610612761 | 1627783 | 2018 |
| 2 | 1610612761 | 201188 | 2018 |
| 3 | 1610612761 | 201980 | 2018 |
| 4 | 1610612761 | 200768 | 2018 |
12.1.1. Stacking with Concat#
We’ve seen one way of putting DataFrames together with `concat`; we used it when unpacking JSONs. In that case, we stacked them side by side; now we want to stack them vertically, which is the default for `concat`.
players_df = pd.concat([p18,p19])
players_df.head()
| | TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | NaN |
| 1 | 1610612761 | 1627783 | 2018 | NaN |
| 2 | 1610612761 | 201188 | 2018 | NaN |
| 3 | 1610612761 | 201980 | 2018 | NaN |
| 4 | 1610612761 | 200768 | 2018 | NaN |
This has the same columns, and we can see what happened more clearly by looking at the shape of each.
players_df.shape
(1374, 4)
p18.shape, p19.shape
((748, 3), (626, 4))
We can verify that the length of the new DataFrame is the sum of the lengths of the original two DataFrames.
assert len(p18) + len(p19) == len(players_df)
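One thing to watch with vertical stacking: by default `concat` keeps each original DataFrame's index, so row labels repeat. A minimal sketch (with hypothetical tiny frames standing in for `p18` and `p19`) of getting a fresh index with `ignore_index=True`:

```python
import pandas as pd

# two hypothetical frames standing in for p18 and p19
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# default concat keeps the original row labels: 0, 1, 0, 1
kept = pd.concat([a, b])
# ignore_index=True builds a fresh 0..n-1 index instead
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```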
12.1.2. Combining Data with Merge#
What if we want to see which players changed teams from 2018 to 2019? We can do this with `merge`. For `merge`, we have to tell it a left DataFrame, a right DataFrame, and what column to match them up on. The left and right DataFrames are used in different ways, but any DataFrame can be put in either position. For this case, we will use the 2018 data as left and the 2019 data as right, and merge `on='PLAYER_ID'`.
pd.merge(p18,p19,on='PLAYER_ID').head()
| | TEAM_ID_x | PLAYER_ID | SEASON_x | PLAYER_NAME | TEAM_ID_y | SEASON_y |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
| 2 | 1610612761 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 3 | 1610612763 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 4 | 1610612761 | 201980 | 2018 | Danny Green | 1610612747 | 2019 |
Now we get what we expect, but the overlapping column names have `_x` and `_y` on the end (as a suffix, appended to the original). We’ll use 2018 and 2019 respectively instead, separated from the original name with a `_`.
year_over_year = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'))
year_over_year.head()
| | TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
| 2 | 1610612761 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 3 | 1610612763 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 4 | 1610612761 | 201980 | 2018 | Danny Green | 1610612747 | 2019 |
Now that it’s a little bit cleaner, we will examine how it works by looking at the shape.
year_over_year.shape, p18.shape, p19.shape
((538, 6), (748, 3), (626, 4))
This kept only the players that played both years, with repetitions for each team they played on for each year.
We can check the calculation with set math (Python’s `set` type has operations from mathematical sets, like intersection and set difference).
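As a quick refresher, here is a sketch of those set operations on small hypothetical sets:

```python
# two small hypothetical sets
a = {1, 2, 3}
b = {2, 3, 4}

print(a.intersection(b))  # elements in both: {2, 3}
print(a.difference(b))    # in a but not in b: {1}
print(b.difference(a))    # in b but not in a: {4}
```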
# player IDs for each year, no repeats
p19_u = set(p19['PLAYER_ID'])
p18_u = set(p18['PLAYER_ID'])
# player IDs that played both years
# (as a list, since indexing with a set is deprecated)
p1819 = list(p18_u.intersection(p19_u))
# teams per player per year
teams_per_p18 = p18['PLAYER_ID'].value_counts()
teams_per_p19 = p19['PLAYER_ID'].value_counts()
# total number of team-player combinations
# multiply number of teams each player played for in 18 by number of teams in 19
# then sum. (most of these are 1*1)
sum(teams_per_p19[p1819]* teams_per_p18[p1819])
538
We can also merge so that we keep all players from either year using `how='outer'`. The default value for `how` is `'inner'`, which takes the intersection (though with duplicates it does some extra things, as we saw). With `'outer'` it takes the union, again with extra handling for the duplicates.
pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer').shape
(927, 6)
It’s the total of the rows we had before, plus the total number of player-teams for players that only played in one of the two years.
# players that only played in one year
# (as lists, since indexing with a set is deprecated)
o18 = list(p18_u.difference(p19_u))
o19 = list(p19_u.difference(p18_u))
# teams those players played for + the above 538
teams_per_p19[o19].sum() + teams_per_p18[o18].sum() + sum(teams_per_p19[p1819]* teams_per_p18[p1819])
927
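A related trick, not used above but worth knowing: `merge` accepts `indicator=True`, which adds a `_merge` column recording whether each row came from the left side only, the right side only, or both. A sketch with hypothetical tiny frames:

```python
import pandas as pd

# hypothetical tiny frames: player 2 appears in both years
left = pd.DataFrame({'PLAYER_ID': [1, 2], 'TEAM_ID': [10, 20]})
right = pd.DataFrame({'PLAYER_ID': [2, 3], 'TEAM_ID': [20, 30]})

both = pd.merge(left, right, on='PLAYER_ID', how='outer', indicator=True)
# _merge labels each row 'left_only', 'right_only', or 'both'
print(both[['PLAYER_ID', '_merge']])
```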
We can save this to a variable
year_over_year_outer = pd.merge(p18,p19,on='PLAYER_ID',suffixes=('_2018','_2019'),how='outer')
then look at a few rows to see that it has indeed filled in with NaN in the places where there wasn’t a value.
year_over_year_outer.sample(10)
| | TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 |
|---|---|---|---|---|---|---|
| 485 | 1.610613e+09 | 2772 | 2018.0 | Trevor Ariza | 1.610613e+09 | 2019.0 |
| 300 | 1.610613e+09 | 202714 | 2018.0 | NaN | NaN | NaN |
| 639 | 1.610613e+09 | 1626150 | 2018.0 | Andrew Harrison | 1.610613e+09 | 2019.0 |
| 861 | NaN | 1629683 | NaN | Quinndary Weatherspoon | 1.610613e+09 | 2019.0 |
| 264 | 1.610613e+09 | 1629027 | 2018.0 | Trae Young | 1.610613e+09 | 2019.0 |
| 720 | 1.610613e+09 | 1629008 | 2018.0 | Michael Porter Jr. | 1.610613e+09 | 2019.0 |
| 883 | NaN | 1629860 | NaN | Jon Davis | 1.610613e+09 | 2019.0 |
| 826 | NaN | 1629076 | NaN | Tyler Cook | 1.610613e+09 | 2019.0 |
| 384 | 1.610613e+09 | 203967 | 2018.0 | Dario Saric | 1.610613e+09 | 2019.0 |
| 52 | 1.610613e+09 | 203090 | 2018.0 | Maurice Harkless | 1.610613e+09 | 2019.0 |
Note

We can also tell that there are NaN values because pandas cast the year columns from int to float.
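If that float cast is undesirable, pandas also has a nullable integer dtype, `'Int64'`, that can hold missing values without casting; a minimal sketch:

```python
import pandas as pd

# a season column using the nullable Int64 dtype instead of plain int64
seasons = pd.Series([2018, 2019], dtype='Int64')
# appending a missing value does not force a cast to float
with_missing = pd.concat([seasons, pd.Series([pd.NA], dtype='Int64')],
                         ignore_index=True)

print(with_missing.dtype)  # Int64, even with a missing value
```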
12.2. Merge types, in detail#
We can examine how these things work more visually with smaller DataFrames:
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
left
| | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right
| | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
pd.merge(left, right, on=["key1", "key2"], how='inner')
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(left, right, on=["key1", "key2"], how='outer' )
| | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |
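Besides `inner` and `outer`, `how` can also be `'left'` or `'right'`. A sketch (rebuilding the same small frames) of a left join, which keeps every row of the left DataFrame and fills unmatched rows with NaN:

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["K0", "K0", "K1", "K2"],
    "key2": ["K0", "K1", "K0", "K1"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})
right = pd.DataFrame({
    "key1": ["K0", "K1", "K1", "K2"],
    "key2": ["K0", "K0", "K0", "K0"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

# every left row survives; (K1, K0) matches two right rows,
# and the two unmatched left rows get NaN in C and D
left_join = pd.merge(left, right, on=["key1", "key2"], how="left")
print(len(left_join))  # 5
```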
12.3. Duplicate Keys#
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
result = pd.merge(left, right, on="B", how="outer")
left
| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 2 |
right
| | A | B |
|---|---|---|
| 0 | 4 | 2 |
| 1 | 5 | 2 |
| 2 | 6 | 2 |
result
| | A_x | B | A_y |
|---|---|---|---|
| 0 | 1 | 2 | 4 |
| 1 | 1 | 2 | 5 |
| 2 | 1 | 2 | 6 |
| 3 | 2 | 2 | 4 |
| 4 | 2 | 2 | 5 |
| 5 | 2 | 2 | 6 |
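The row count here follows from the duplicate keys: each left row with `B == 2` pairs with each right row with `B == 2`, so the result has the product of the two counts. A sketch verifying that:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="outer")
# 2 left rows x 3 right rows, all sharing the key B=2
print(len(result))  # 6
```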
If we ask pandas to validate the merge with validate
and a specific type of
merge, it will throw an error if that type of merge is not possible.
pd.merge(left, right,on='B',validate='one_to_one')
---------------------------------------------------------------------------
MergeError Traceback (most recent call last)
Input In [26], in <cell line: 1>()
----> 1 pd.merge(left, right,on='B',validate='one_to_one')
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:107, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
90 @Substitution("\nleft : DataFrame or named Series")
91 @Appender(_merge_doc, indents=0)
92 def merge(
(...)
105 validate: str | None = None,
106 ) -> DataFrame:
--> 107 op = _MergeOperation(
108 left,
109 right,
110 how=how,
111 on=on,
112 left_on=left_on,
113 right_on=right_on,
114 left_index=left_index,
115 right_index=right_index,
116 sort=sort,
117 suffixes=suffixes,
118 copy=copy,
119 indicator=indicator,
120 validate=validate,
121 )
122 return op.get_result()
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:710, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
706 # If argument passed to validate,
707 # check if columns specified as unique
708 # are in fact unique.
709 if validate is not None:
--> 710 self._validate(validate)
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/reshape/merge.py:1427, in _MergeOperation._validate(self, validate)
1425 if validate in ["one_to_one", "1:1"]:
1426 if not left_unique and not right_unique:
-> 1427 raise MergeError(
1428 "Merge keys are not unique in either left "
1429 "or right dataset; not a one-to-one merge"
1430 )
1431 elif not left_unique:
1432 raise MergeError(
1433 "Merge keys are not unique in left dataset; not a one-to-one merge"
1434 )
MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
Further Reading

The pandas documentation is a good place to read through their examples of how `validate` works. It’s important to note the types of possible joins described at the beginning of that section.
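For contrast, a sketch of a `validate` value that passes on the same small frames: `'many_to_many'` is the loosest check, so it does not raise here.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# keys repeat on both sides, so only the many_to_many validation succeeds
ok = pd.merge(left, right, on="B", validate="many_to_many")
print(ok.shape)  # (6, 3)
```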
12.4. Questions at the End of Class#
12.4.1. how to remove certain columns in merges (i.e. the year columns in the basketball dataset since they are redundant)#
They can be dropped after (or before) the merge using `drop()`.
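For example, with a tiny hypothetical stand-in for the merged `year_over_year` frame, the redundant season columns can be dropped like this:

```python
import pandas as pd

# a tiny stand-in for the merged year_over_year frame
merged = pd.DataFrame({'PLAYER_ID': [202695],
                       'SEASON_2018': [2018],
                       'SEASON_2019': [2019]})

# drop the redundant year columns after the merge
trimmed = merged.drop(columns=['SEASON_2018', 'SEASON_2019'])
print(trimmed.columns.tolist())  # ['PLAYER_ID']
```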
12.4.2. Can we merge as many dataframes as we would want?#
Merge is specifically defined for two DataFrames, but the result is itself a DataFrame, so it can be passed into another merge.
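Since each merge returns a DataFrame, a list of DataFrames can be folded pairwise; a sketch using `functools.reduce` with hypothetical frames sharing a `key` column:

```python
from functools import reduce
import pandas as pd

# hypothetical small frames that all share a 'key' column
frames = [
    pd.DataFrame({'key': [1, 2], 'a': [10, 20]}),
    pd.DataFrame({'key': [1, 2], 'b': [30, 40]}),
    pd.DataFrame({'key': [1, 2], 'c': [50, 60]}),
]

# merge the first two, then merge that result with the third, and so on
merged = reduce(lambda l, r: pd.merge(l, r, on='key'), frames)
print(merged.columns.tolist())  # ['key', 'a', 'b', 'c']
```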
12.4.3. Is there no way to merge the left and right for the A and B data?#
All of the merges are possible; they just each give a different result.
12.4.4. Can you use merging to check correlation between two different sets?#
You can use merging to check the overlap between two DataFrames; for actual sets, it’s probably best to represent them as `set` objects and use set operations. However, these types of operations are not actually correlations, which we’ll learn a bit about in a few weeks when we talk about regression.
12.4.5. Is there any way to compare the team ids in both datasets so that it outputs the players that changed teams between those times?#
From the `year_over_year` DataFrame we made above, we can check which players changed teams. Since players can also change teams within a season, this is a tricky question, but we could, for example, look only for players that have no overlapping teams from 2018 to 2019.
change_check = lambda r: not(r['TEAM_ID_2018'] == r['TEAM_ID_2019'])
year_over_year['CHANGE_TEAM'] = year_over_year.apply(change_check, axis=1)
year_over_year.head()
Then we can filter the data by the new column, take out only the player information, and drop any duplicates for players that played on multiple teams in one year or the other.
y_o_y_changes = year_over_year[year_over_year['CHANGE_TEAM']]
y_o_y_change_players = y_o_y_changes[['PLAYER_ID','PLAYER_NAME']].drop_duplicates()
y_o_y_change_players.head()
and we can check how many players that is
y_o_y_change_players.shape
12.4.6. in year_over_year, where suffixes=('_2018','_2019'), why is there an underscore?#
This is optional, but it is best practice for Pythonic variable naming (and remember, columns are like variables).
12.4.7. If we merge two dataframes which each have a column (or columns) that are the same, but each have different data types in those columns, what happens?#
If they’re just named the same, but not the `on` column, you’ll get the two columns with the default suffixes `('_x','_y')`. If it’s the `on` column, the merge will be empty.
12.4.8. how does suffix know what to change?#

The suffixes are appended, as we specify them, to the column names that occur in both DataFrames.
12.5. More Practice#
  * Use a merge to figure out how many players did not return for the 2019 season.
  * Also load the `conferences.csv` data. Use this to figure out how many players moved conferences from 2018 to 2019.