Class 12: Constructing Datasets from Multiple Sources¶
import pandas as pd
Introduction¶
Sometimes we use data from multiple sources. For example, data may be spread across several tables that we want to put together, or we may want to build our own dataset to answer the questions we want to ask. Suppose we have a dataset of drivers who have been pulled over by the police and need to compare the distribution of stops between day and night. If the data only records the time of each stop, without labeling day/night, we can use another dataset with the sunset time for every day and build our own dataset out of these two to find the answer. Later we will also learn about getting data out of databases, with some small database operations through SQLite and some Python libraries.
To use relative paths as in class (data/2018-games.csv) instead of a full URL (https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-games.csv), download the data from this GitHub repo by clicking on the green Code button and choosing .zip. Then unzip the data and save it in a data folder in the same folder as the notebook. For the class notes, the URLs let the notebook run without having to store the data in another place.
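One way to support both setups in the same notebook (a sketch; the data/ layout is the one described above) is to prefer the local copy and fall back to the URL when it is missing, since pd.read_csv accepts either a path or a URL:

```python
from pathlib import Path

# hypothetical local layout: a `data/` folder next to the notebook
local = Path('data/2018-games.csv')
url = 'https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-games.csv'

# use the local file if it exists, otherwise fall back to the URL
source = local if local.exists() else url
# games_df18 = pd.read_csv(source)
```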
games_df18 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-games.csv')
games_df19 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2019-games.csv')
Stacking DataFrames¶
Let’s look at the first couple of rows to see how the data is structured.
games_df18.head(2)
Unnamed: 0 | GAME_DATE_EST | GAME_ID | GAME_STATUS_TEXT | HOME_TEAM_ID | VISITOR_TEAM_ID | SEASON | TEAM_ID_home | PTS_home | FG_PCT_home | ... | AST_home | REB_home | TEAM_ID_away | PTS_away | FG_PCT_away | FT_PCT_away | FG3_PCT_away | AST_away | REB_away | HOME_TEAM_WINS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16196 | 2019-06-13 | 41800406 | Final | 1610612744 | 1610612761 | 2018 | 1610612744 | 110.0 | 0.488 | ... | 28.0 | 42.0 | 1610612761 | 114.0 | 0.476 | 0.793 | 0.394 | 25.0 | 39.0 | 0 |
1 | 16197 | 2019-06-10 | 41800405 | Final | 1610612761 | 1610612744 | 2018 | 1610612761 | 105.0 | 0.447 | ... | 19.0 | 43.0 | 1610612744 | 106.0 | 0.463 | 0.714 | 0.476 | 27.0 | 37.0 | 0 |
2 rows × 22 columns
games_df19.head(2)
Unnamed: 0 | GAME_DATE_EST | GAME_ID | GAME_STATUS_TEXT | HOME_TEAM_ID | VISITOR_TEAM_ID | SEASON | TEAM_ID_home | PTS_home | FG_PCT_home | ... | AST_home | REB_home | TEAM_ID_away | PTS_away | FG_PCT_away | FT_PCT_away | FG3_PCT_away | AST_away | REB_away | HOME_TEAM_WINS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2020-03-01 | 21900895 | Final | 1610612766 | 1610612749 | 2019 | 1610612766 | 85.0 | 0.354 | ... | 22.0 | 47.0 | 1610612749 | 93.0 | 0.402 | 0.762 | 0.226 | 20.0 | 61.0 | 0 |
1 | 1 | 2020-03-01 | 21900896 | Final | 1610612750 | 1610612742 | 2019 | 1610612750 | 91.0 | 0.364 | ... | 19.0 | 57.0 | 1610612742 | 111.0 | 0.468 | 0.632 | 0.275 | 28.0 | 56.0 | 0 |
2 rows × 22 columns
As we can see, both datasets have the same columns; they just cover two different years.
games_df18.shape
(1378, 22)
games_df19.shape
(965, 22)
Let’s concatenate the two DataFrames to make one DataFrame out of them.
games_df = pd.concat([games_df18,games_df19])
games_df.shape
(2343, 22)
As we can see, the rows add up, but we have the same number of columns.
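One thing to watch with concat: each frame keeps its own row index, so labels repeat in the stacked result. A small sketch with toy frames (ignore_index=True builds a fresh index instead):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# default concat keeps each frame's index, so labels repeat: 0, 1, 0, 1
stacked = pd.concat([a, b])

# ignore_index=True relabels the rows 0..n-1
restacked = pd.concat([a, b], ignore_index=True)
```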
games_df.columns
Index(['Unnamed: 0', 'GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT',
'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON', 'TEAM_ID_home', 'PTS_home',
'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home',
'TEAM_ID_away', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away',
'FG3_PCT_away', 'AST_away', 'REB_away', 'HOME_TEAM_WINS'],
dtype='object')
Let’s drop the column we do not need.
games_df.drop(columns='Unnamed: 0', inplace=True)
games_df.head(2)
GAME_DATE_EST | GAME_ID | GAME_STATUS_TEXT | HOME_TEAM_ID | VISITOR_TEAM_ID | SEASON | TEAM_ID_home | PTS_home | FG_PCT_home | FT_PCT_home | ... | AST_home | REB_home | TEAM_ID_away | PTS_away | FG_PCT_away | FT_PCT_away | FG3_PCT_away | AST_away | REB_away | HOME_TEAM_WINS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-06-13 | 41800406 | Final | 1610612744 | 1610612761 | 2018 | 1610612744 | 110.0 | 0.488 | 0.700 | ... | 28.0 | 42.0 | 1610612761 | 114.0 | 0.476 | 0.793 | 0.394 | 25.0 | 39.0 | 0 |
1 | 2019-06-10 | 41800405 | Final | 1610612761 | 1610612744 | 2018 | 1610612761 | 105.0 | 0.447 | 0.778 | ... | 19.0 | 43.0 | 1610612744 | 106.0 | 0.463 | 0.714 | 0.476 | 27.0 | 37.0 | 0 |
2 rows × 21 columns
Merging Data Frames¶
Now we read another dataset that shares some columns with games_df and has some different ones.
teams_df = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/teams.csv')
teams_df.head(2)
LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | OWNER | GENERALMANAGER | HEADCOACH | DLEAGUEAFFILIATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | Tony Ressler | Travis Schlenk | Lloyd Pierce | Erie Bayhawks |
1 | 0 | 1610612738 | 1946 | 2019 | BOS | Celtics | 1946 | Boston | TD Garden | 18624.0 | Wyc Grousbeck | Danny Ainge | Brad Stevens | Maine Red Claws |
We use left_on='TEAM_ID' and right_on='HOME_TEAM_ID' to match rows between the two DataFrames.
merge1_df = pd.merge(teams_df,games_df,left_on='TEAM_ID', right_on = 'HOME_TEAM_ID')
merge1_df.head(2)
LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | ... | AST_home | REB_home | TEAM_ID_away | PTS_away | FG_PCT_away | FT_PCT_away | FG3_PCT_away | AST_away | REB_away | HOME_TEAM_WINS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | ... | 29.0 | 61.0 | 1610612754 | 135.0 | 0.459 | 0.846 | 0.400 | 22.0 | 43.0 | 0 |
1 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | ... | 29.0 | 44.0 | 1610612755 | 122.0 | 0.459 | 0.579 | 0.323 | 27.0 | 57.0 | 1 |
2 rows × 35 columns
We want one row per game, with the team info appended onto each. So, let’s try other settings in our merge.
merge1_df.shape
(2343, 35)
merge2_df = pd.merge(teams_df, games_df,left_on='TEAM_ID', right_on = 'HOME_TEAM_ID', how='outer')
merge2_df.head(2)
LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | ... | AST_home | REB_home | TEAM_ID_away | PTS_away | FG_PCT_away | FT_PCT_away | FG3_PCT_away | AST_away | REB_away | HOME_TEAM_WINS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | ... | 29.0 | 61.0 | 1610612754 | 135.0 | 0.459 | 0.846 | 0.400 | 22.0 | 43.0 | 0 |
1 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | ... | 29.0 | 44.0 | 1610612755 | 122.0 | 0.459 | 0.579 | 0.323 | 27.0 | 57.0 | 1 |
2 rows × 35 columns
merge2_df.shape
(2343, 35)
The inner and outer merges have the same shape here because every home team in games_df also appears in teams_df, so the outer join adds no extra rows.
We can group by “ARENA” and then look at the “mean” statistics.
merge1_df.groupby('ARENA').mean()
LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | YEARFOUNDED | ARENACAPACITY | GAME_ID | HOME_TEAM_ID | VISITOR_TEAM_ID | SEASON | ... | AST_home | REB_home | TEAM_ID_away | PTS_away | FG_PCT_away | FT_PCT_away | FG3_PCT_away | AST_away | REB_away | HOME_TEAM_WINS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ARENA | |||||||||||||||||||||
AT&T Center | 0.0 | 1.610613e+09 | 1976.000000 | 2019.0 | 1976.000000 | 18694.0 | 2.184021e+07 | 1.610613e+09 | 1.610613e+09 | 2018.397436 | ... | 25.141026 | 45.782051 | 1.610613e+09 | 108.448718 | 0.449333 | 0.770808 | 0.348231 | 23.987179 | 43.833333 | 0.666667 |
American Airlines Center | 0.0 | 1.610613e+09 | 1980.000000 | 2019.0 | 1980.000000 | 19200.0 | 2.130985e+07 | 1.610613e+09 | 1.610613e+09 | 2018.426667 | ... | 23.920000 | 46.093333 | 1.610613e+09 | 108.960000 | 0.456093 | 0.777320 | 0.347707 | 22.960000 | 43.720000 | 0.546667 |
AmericanAirlines Arena | 0.0 | 1.610613e+09 | 1988.000000 | 2019.0 | 1988.000000 | 19600.0 | 2.105314e+07 | 1.610613e+09 | 1.610613e+09 | 2018.421053 | ... | 25.039474 | 46.276316 | 1.610613e+09 | 106.618421 | 0.442684 | 0.771987 | 0.344250 | 23.157895 | 42.131579 | 0.644737 |
Amway Center | 0.0 | 1.610613e+09 | 1989.000000 | 2019.0 | 1989.000000 | 0.0 | 2.171458e+07 | 1.610613e+09 | 1.610613e+09 | 2018.423077 | ... | 24.525641 | 45.935897 | 1.610613e+09 | 105.538462 | 0.454372 | 0.772487 | 0.359962 | 22.987179 | 44.769231 | 0.538462 |
Bankers Life Fieldhouse | 0.0 | 1.610613e+09 | 1976.000000 | 2019.0 | 1976.000000 | 18345.0 | 2.197454e+07 | 1.610613e+09 | 1.610613e+09 | 2018.441558 | ... | 26.363636 | 44.597403 | 1.610613e+09 | 103.051948 | 0.435494 | 0.761805 | 0.329442 | 23.272727 | 44.662338 | 0.675325 |
Barclays Center | 0.0 | 1.610613e+09 | 1976.000000 | 2019.0 | 1976.000000 | NaN | 2.197516e+07 | 1.610613e+09 | 1.610613e+09 | 2018.413333 | ... | 23.880000 | 48.000000 | 1.610613e+09 | 110.293333 | 0.448133 | 0.773533 | 0.338547 | 22.640000 | 45.000000 | 0.533333 |
Capital One Arena | 0.0 | 1.610613e+09 | 1961.000000 | 2019.0 | 1961.000000 | 20647.0 | 2.130190e+07 | 1.610613e+09 | 1.610613e+09 | 2018.418919 | ... | 25.891892 | 43.932432 | 1.610613e+09 | 115.310811 | 0.472041 | 0.765959 | 0.353324 | 24.554054 | 46.121622 | 0.500000 |
Chase Center | 0.0 | 1.610613e+09 | 1946.000000 | 2019.0 | 1946.000000 | 19596.0 | 2.350492e+07 | 1.610613e+09 | 1.610613e+09 | 2018.377778 | ... | 28.444444 | 44.722222 | 1.610613e+09 | 112.833333 | 0.461144 | 0.782122 | 0.374233 | 25.377778 | 43.900000 | 0.511111 |
Chesapeake Energy Arena | 0.0 | 1.610613e+09 | 1967.000000 | 2019.0 | 1967.000000 | 19163.0 | 2.171802e+07 | 1.610613e+09 | 1.610613e+09 | 2018.425000 | ... | 23.450000 | 45.387500 | 1.610613e+09 | 109.250000 | 0.455088 | 0.772437 | 0.351600 | 23.175000 | 44.700000 | 0.637500 |
FedExForum | 0.0 | 1.610613e+09 | 1995.000000 | 2019.0 | 1995.000000 | 18119.0 | 2.131630e+07 | 1.610613e+09 | 1.610613e+09 | 2018.421053 | ... | 25.394737 | 44.578947 | 1.610613e+09 | 108.236842 | 0.445092 | 0.786355 | 0.362816 | 23.473684 | 44.986842 | 0.526316 |
Fiserv Forum | 0.0 | 1.610613e+09 | 1968.000000 | 2019.0 | 1968.000000 | 17500.0 | 2.328482e+07 | 1.610613e+09 | 1.610613e+09 | 2018.385542 | ... | 26.819277 | 51.590361 | 1.610613e+09 | 106.421687 | 0.417289 | 0.758819 | 0.347325 | 23.253012 | 44.385542 | 0.843373 |
Golden 1 Center | 0.0 | 1.610613e+09 | 1948.000000 | 2019.0 | 1948.000000 | 17500.0 | 2.142553e+07 | 1.610613e+09 | 1.610613e+09 | 2018.416667 | ... | 24.125000 | 45.208333 | 1.610613e+09 | 112.361111 | 0.459417 | 0.772444 | 0.349514 | 23.805556 | 46.708333 | 0.527778 |
Little Caesars Arena | 0.0 | 1.610613e+09 | 1948.000000 | 2019.0 | 1948.000000 | 21000.0 | 2.171695e+07 | 1.610613e+09 | 1.610613e+09 | 2018.430380 | ... | 23.974684 | 44.670886 | 1.610613e+09 | 108.797468 | 0.469127 | 0.767203 | 0.348127 | 24.151899 | 43.417722 | 0.493671 |
Madison Square Garden | 0.0 | 1.610613e+09 | 1946.000000 | 2019.0 | 1946.000000 | 19763.0 | 2.105316e+07 | 1.610613e+09 | 1.610613e+09 | 2018.421053 | ... | 21.500000 | 45.447368 | 1.610613e+09 | 112.105263 | 0.463842 | 0.756461 | 0.365237 | 24.263158 | 45.263158 | 0.250000 |
Moda Center | 0.0 | 1.610613e+09 | 1970.000000 | 2019.0 | 1970.000000 | 19980.0 | 2.316313e+07 | 1.610613e+09 | 1.610613e+09 | 2018.373494 | ... | 22.361446 | 47.831325 | 1.610613e+09 | 112.060241 | 0.448145 | 0.793880 | 0.371578 | 22.939759 | 44.048193 | 0.662651 |
Pepsi Center | 0.0 | 1.610613e+09 | 1976.000000 | 2019.0 | 1976.000000 | 19099.0 | 2.369186e+07 | 1.610613e+09 | 1.610613e+09 | 2018.395062 | ... | 27.419753 | 46.419753 | 1.610613e+09 | 103.777778 | 0.448531 | 0.742173 | 0.323988 | 23.320988 | 42.172840 | 0.790123 |
Quicken Loans Arena | 0.0 | 1.610613e+09 | 1970.000000 | 2019.0 | 1970.000000 | 20562.0 | 2.132390e+07 | 1.610613e+09 | 1.610613e+09 | 2018.428571 | ... | 22.038961 | 44.025974 | 1.610613e+09 | 113.584416 | 0.489766 | 0.776896 | 0.384247 | 25.584416 | 42.597403 | 0.298701 |
Scotiabank Arena | 0.0 | 1.610613e+09 | 1995.000000 | 2019.0 | 1995.000000 | 19800.0 | 2.428270e+07 | 1.610613e+09 | 1.610613e+09 | 2018.377778 | ... | 25.433333 | 45.100000 | 1.610613e+09 | 106.144444 | 0.432111 | 0.770211 | 0.347544 | 24.155556 | 45.300000 | 0.744444 |
Smoothie King Center | 0.0 | 1.610613e+09 | 2002.000000 | 2019.0 | 2002.000000 | NaN | 2.156903e+07 | 1.610613e+09 | 1.610613e+09 | 2018.424658 | ... | 27.232877 | 46.945205 | 1.610613e+09 | 116.808219 | 0.474096 | 0.771562 | 0.362822 | 25.109589 | 44.561644 | 0.438356 |
Spectrum Center | 0.0 | 1.610613e+09 | 1988.000000 | 2019.0 | 1988.000000 | 19026.0 | 2.104182e+07 | 1.610613e+09 | 1.610613e+09 | 2018.413333 | ... | 23.746667 | 44.293333 | 1.610613e+09 | 109.386667 | 0.463920 | 0.761573 | 0.350853 | 25.066667 | 45.266667 | 0.493333 |
Staples Center | 0.0 | 1.610613e+09 | 1959.210191 | 2019.0 | 1959.210191 | 19060.0 | 2.139604e+07 | 1.610613e+09 | 1.610613e+09 | 2018.414013 | ... | 25.942675 | 47.286624 | 1.610613e+09 | 110.477707 | 0.443783 | 0.775064 | 0.338796 | 24.095541 | 45.133758 | 0.636943 |
State Farm Arena | 0.0 | 1.610613e+09 | 1949.000000 | 2019.0 | 1949.000000 | 18729.0 | 2.131766e+07 | 1.610613e+09 | 1.610613e+09 | 2018.434211 | ... | 25.736842 | 45.960526 | 1.610613e+09 | 117.894737 | 0.463447 | 0.767184 | 0.343553 | 25.500000 | 46.276316 | 0.421053 |
TD Garden | 0.0 | 1.610613e+09 | 1946.000000 | 2019.0 | 1946.000000 | 18624.0 | 2.235308e+07 | 1.610613e+09 | 1.610613e+09 | 2018.397436 | ... | 25.717949 | 45.974359 | 1.610613e+09 | 106.358974 | 0.442769 | 0.743064 | 0.335808 | 23.282051 | 44.153846 | 0.717949 |
Talking Stick Resort Arena | 0.0 | 1.610613e+09 | 1968.000000 | 2019.0 | 1968.000000 | NaN | 2.132518e+07 | 1.610613e+09 | 1.610613e+09 | 2018.441558 | ... | 25.298701 | 41.519481 | 1.610613e+09 | 113.064935 | 0.474922 | 0.762377 | 0.363143 | 24.285714 | 45.000000 | 0.311688 |
Target Center | 0.0 | 1.610613e+09 | 1989.000000 | 2019.0 | 1989.000000 | 19356.0 | 2.156304e+07 | 1.610613e+09 | 1.610613e+09 | 2018.402778 | ... | 24.458333 | 46.583333 | 1.610613e+09 | 113.208333 | 0.462306 | 0.763181 | 0.363625 | 24.402778 | 46.500000 | 0.444444 |
Toyota Center | 0.0 | 1.610613e+09 | 1967.000000 | 2019.0 | 1967.000000 | 18104.0 | 2.283925e+07 | 1.610613e+09 | 1.610613e+09 | 2018.387500 | ... | 21.712500 | 44.850000 | 1.610613e+09 | 110.275000 | 0.454438 | 0.759188 | 0.339250 | 24.200000 | 45.700000 | 0.737500 |
United Center | 0.0 | 1.610613e+09 | 1966.000000 | 2019.0 | 1966.000000 | 21711.0 | 2.107486e+07 | 1.610613e+09 | 1.610613e+09 | 2018.435897 | ... | 22.987179 | 43.512821 | 1.610613e+09 | 110.705128 | 0.475372 | 0.750628 | 0.355808 | 25.602564 | 46.192308 | 0.307692 |
Vivint Smart Home Arena | 0.0 | 1.610613e+09 | 1974.000000 | 2019.0 | 1974.000000 | 20148.0 | 2.197424e+07 | 1.610613e+09 | 1.610613e+09 | 2018.421053 | ... | 24.526316 | 47.592105 | 1.610613e+09 | 105.723684 | 0.443618 | 0.767803 | 0.351961 | 20.539474 | 42.276316 | 0.684211 |
Wells Fargo Center | 0.0 | 1.610613e+09 | 1949.000000 | 2019.0 | 1949.000000 | NaN | 2.282764e+07 | 1.610613e+09 | 1.610613e+09 | 2018.395062 | ... | 27.395062 | 47.617284 | 1.610613e+09 | 106.753086 | 0.445346 | 0.757247 | 0.343272 | 22.086420 | 41.160494 | 0.802469 |
29 rows × 25 columns
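Note that recent pandas versions raise an error when .mean() hits non-numeric columns (like the text columns in teams_df), so it can be safer to select the columns of interest first, or pass numeric_only=True. A minimal sketch with made-up data:

```python
import pandas as pd

games = pd.DataFrame({
    'ARENA': ['TD Garden', 'TD Garden', 'Moda Center'],
    'PTS_home': [100, 110, 120],
    'HEADCOACH': ['a', 'b', 'c'],   # a text column that mean() cannot average
})

# select the numeric column before aggregating
avg_pts = games.groupby('ARENA')['PTS_home'].mean()
```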
How to combine the same data for different outcomes¶
We can combine datasets in different ways to learn different things about the data. Let’s now read info about the players.
players18 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-players.csv')
players19 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2019-players.csv')
players18.head()
Unnamed: 0 | PLAYER_NAME | TEAM_ID | PLAYER_ID | SEASON | |
---|---|---|---|---|---|
0 | 626 | Kawhi Leonard | 1610612761 | 202695 | 2018 |
1 | 627 | Pascal Siakam | 1610612761 | 1627783 | 2018 |
2 | 628 | Marc Gasol | 1610612761 | 201188 | 2018 |
3 | 629 | Danny Green | 1610612761 | 201980 | 2018 |
4 | 630 | Kyle Lowry | 1610612761 | 200768 | 2018 |
First let’s look at the shape of each of them to have a reference for what happens when we try different merges.
players18.shape, players19.shape
((748, 5), (626, 5))
One thing we might want to do is to put all of the information into one long DataFrame. We can do this with concat.
pd.concat([players18,players19])
Unnamed: 0 | PLAYER_NAME | TEAM_ID | PLAYER_ID | SEASON | |
---|---|---|---|---|---|
0 | 626 | Kawhi Leonard | 1610612761 | 202695 | 2018 |
1 | 627 | Pascal Siakam | 1610612761 | 1627783 | 2018 |
2 | 628 | Marc Gasol | 1610612761 | 201188 | 2018 |
3 | 629 | Danny Green | 1610612761 | 201980 | 2018 |
4 | 630 | Kyle Lowry | 1610612761 | 200768 | 2018 |
... | ... | ... | ... | ... | ... |
621 | 621 | Anthony Bennett | 1610612745 | 203461 | 2019 |
622 | 622 | Ray Spalding | 1610612737 | 1629034 | 2019 |
623 | 623 | Devyn Marble | 1610612744 | 203906 | 2019 |
624 | 624 | Hassani Gravett | 1610612753 | 1629755 | 2019 |
625 | 625 | JaKeenan Gant | 1610612754 | 1629721 | 2019 |
1374 rows × 5 columns
This allows us to see all the players for each year; we could, for example, group by season and count to see how many players played each year.
We can check that this is the size we expected.
pd.concat([players18,players19]).shape
(1374, 5)
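The group-and-count idea mentioned above could look like this (a sketch with hypothetical player IDs standing in for the concatenated frame):

```python
import pandas as pd

# toy stand-in for the concatenated players frame
roster = pd.DataFrame({
    'PLAYER_ID': [1, 2, 3, 2, 3, 4],
    'SEASON':    [2018, 2018, 2018, 2019, 2019, 2019],
})

# number of players listed in each season
per_season = roster.groupby('SEASON')['PLAYER_ID'].count()
```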
If we use the default merge settings, we get an empty result: the two DataFrames have the same columns, so pandas tries to merge on all of them, but no rows have the same values in every column, so nothing is left.
pd.merge(players18,players19,)
Unnamed: 0 | PLAYER_NAME | TEAM_ID | PLAYER_ID | SEASON |
---|
If we merge with on='PLAYER_ID', only that one column has to match to pair rows from the two DataFrames. With the default value for how, or explicitly setting how='inner', we get the info of players who played in both seasons.
pd.merge(players18,players19,on='PLAYER_ID', how='inner')
Unnamed: 0_x | PLAYER_NAME_x | TEAM_ID_x | PLAYER_ID | SEASON_x | Unnamed: 0_y | PLAYER_NAME_y | TEAM_ID_y | SEASON_y | |
---|---|---|---|---|---|---|---|---|---|
0 | 626 | Kawhi Leonard | 1610612761 | 202695 | 2018 | 299 | Kawhi Leonard | 1610612746 | 2019 |
1 | 627 | Pascal Siakam | 1610612761 | 1627783 | 2018 | 275 | Pascal Siakam | 1610612761 | 2019 |
2 | 628 | Marc Gasol | 1610612761 | 201188 | 2018 | 276 | Marc Gasol | 1610612761 | 2019 |
3 | 1157 | Marc Gasol | 1610612763 | 201188 | 2018 | 276 | Marc Gasol | 1610612761 | 2019 |
4 | 629 | Danny Green | 1610612761 | 201980 | 2018 | 202 | Danny Green | 1610612747 | 2019 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
533 | 1339 | Abdul Gaddy | 1610612760 | 203583 | 2018 | 561 | Abdul Gaddy | 1610612760 | 2019 |
534 | 1342 | Andre Roberson | 1610612760 | 203460 | 2018 | 566 | Andre Roberson | 1610612760 | 2019 |
535 | 1348 | Norvel Pelle | 1610612755 | 203658 | 2018 | 24 | Norvel Pelle | 1610612755 | 2019 |
536 | 1350 | Denzel Valentine | 1610612741 | 1627756 | 2018 | 58 | Denzel Valentine | 1610612741 | 2019 |
537 | 1353 | C.J. Wilcox | 1610612754 | 203912 | 2018 | 573 | C.J. Wilcox | 1610612754 | 2019 |
538 rows × 9 columns
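The _x/_y endings above come from merge's suffixes parameter, which renames overlapping non-key columns so they stay distinguishable. A toy sketch using season-specific suffixes instead (the column values are made up):

```python
import pandas as pd

left = pd.DataFrame({'PLAYER_ID': [1], 'TEAM_ID': [10]})
right = pd.DataFrame({'PLAYER_ID': [1], 'TEAM_ID': [20]})

# TEAM_ID appears in both frames but is not a key, so it gets a suffix
merged = pd.merge(left, right, on='PLAYER_ID', suffixes=('_18', '_19'))
```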
When we use outer, we get one row for each player who played in either season, or both. From this we can, for example, see who changed teams, who the 2019 rookies are, and who retired or went unsigned in 2019.
pd.merge(players18,players19,on='PLAYER_ID', how='outer')
Unnamed: 0_x | PLAYER_NAME_x | TEAM_ID_x | PLAYER_ID | SEASON_x | Unnamed: 0_y | PLAYER_NAME_y | TEAM_ID_y | SEASON_y | |
---|---|---|---|---|---|---|---|---|---|
0 | 626.0 | Kawhi Leonard | 1.610613e+09 | 202695 | 2018.0 | 299.0 | Kawhi Leonard | 1.610613e+09 | 2019.0 |
1 | 627.0 | Pascal Siakam | 1.610613e+09 | 1627783 | 2018.0 | 275.0 | Pascal Siakam | 1.610613e+09 | 2019.0 |
2 | 628.0 | Marc Gasol | 1.610613e+09 | 201188 | 2018.0 | 276.0 | Marc Gasol | 1.610613e+09 | 2019.0 |
3 | 1157.0 | Marc Gasol | 1.610613e+09 | 201188 | 2018.0 | 276.0 | Marc Gasol | 1.610613e+09 | 2019.0 |
4 | 629.0 | Danny Green | 1.610613e+09 | 201980 | 2018.0 | 202.0 | Danny Green | 1.610613e+09 | 2019.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
922 | NaN | NaN | NaN | 1629097 | NaN | 619.0 | Terry Larrier | 1.610613e+09 | 2019.0 |
923 | NaN | NaN | NaN | 203461 | NaN | 621.0 | Anthony Bennett | 1.610613e+09 | 2019.0 |
924 | NaN | NaN | NaN | 203906 | NaN | 623.0 | Devyn Marble | 1.610613e+09 | 2019.0 |
925 | NaN | NaN | NaN | 1629755 | NaN | 624.0 | Hassani Gravett | 1.610613e+09 | 2019.0 |
926 | NaN | NaN | NaN | 1629721 | NaN | 625.0 | JaKeenan Gant | 1.610613e+09 | 2019.0 |
927 rows × 9 columns
Using left gives us the 'PLAYER_ID' values that are in the left (players18) DataFrame, including those that are in both DataFrames; right would instead give the players in the players19 DataFrame or both. With the left result, we can see who retired, but not the 2019 rookies.
pd.merge(players18,players19,on='PLAYER_ID', how='left')
Unnamed: 0_x | PLAYER_NAME_x | TEAM_ID_x | PLAYER_ID | SEASON_x | Unnamed: 0_y | PLAYER_NAME_y | TEAM_ID_y | SEASON_y | |
---|---|---|---|---|---|---|---|---|---|
0 | 626 | Kawhi Leonard | 1610612761 | 202695 | 2018 | 299.0 | Kawhi Leonard | 1.610613e+09 | 2019.0 |
1 | 627 | Pascal Siakam | 1610612761 | 1627783 | 2018 | 275.0 | Pascal Siakam | 1.610613e+09 | 2019.0 |
2 | 628 | Marc Gasol | 1610612761 | 201188 | 2018 | 276.0 | Marc Gasol | 1.610613e+09 | 2019.0 |
3 | 629 | Danny Green | 1610612761 | 201980 | 2018 | 202.0 | Danny Green | 1.610613e+09 | 2019.0 |
4 | 630 | Kyle Lowry | 1610612761 | 200768 | 2018 | 451.0 | Kyle Lowry | 1.610613e+09 | 2019.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
749 | 1369 | Tyrius Walker | 1610612752 | 1629246 | 2018 | NaN | NaN | NaN | NaN |
750 | 1370 | Marcus Lee | 1610612748 | 1629159 | 2018 | NaN | NaN | NaN | NaN |
751 | 1371 | Trey Lewis | 1610612762 | 1629163 | 2018 | NaN | NaN | NaN | NaN |
752 | 1372 | Emanuel Terry | 1610612743 | 1629150 | 2018 | NaN | NaN | NaN | NaN |
753 | 1373 | Justin Bibbs | 1610612738 | 1629167 | 2018 | NaN | NaN | NaN | NaN |
754 rows × 9 columns
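To pull out groups like rookies and retirees directly, merge also has an indicator=True option, which adds a _merge column labeling each row 'both', 'left_only', or 'right_only'. A sketch with hypothetical player IDs:

```python
import pandas as pd

p18 = pd.DataFrame({'PLAYER_ID': [1, 2, 3], 'SEASON': [2018] * 3})
p19 = pd.DataFrame({'PLAYER_ID': [2, 3, 4], 'SEASON': [2019] * 3})

both = pd.merge(p18, p19, on='PLAYER_ID', how='outer', indicator=True)

# left_only: only in 2018 (retired/unsigned); right_only: only in 2019 (rookies)
retired = both.loc[both['_merge'] == 'left_only', 'PLAYER_ID']
rookies = both.loc[both['_merge'] == 'right_only', 'PLAYER_ID']
```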
Try it yourself¶
Try different merges and inspect them:
how many rows & columns?
Where are NaN values inserted?
What rows from the original datasets are not included?
describe each type of merge in your own words
Split a DataFrame into separate DataFrames by subsetting the columns and indexing the rows with loc, then use concat to put it back together. Programmatically check that it’s back together correctly.
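For that last check, DataFrame.equals is one option. A minimal sketch with a toy frame, splitting by rows:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})

# split into two pieces by indexing rows with loc (loc slices are inclusive)
top = df.loc[:1]
bottom = df.loc[2:]

# stack back together and verify nothing was lost or reordered
rebuilt = pd.concat([top, bottom])
assert rebuilt.equals(df)
```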