Class 12: Constructing Datasets from Multiple Sources

import pandas as pd

Introduction

Sometimes we use data from multiple sources. For example, data may be provided across several tables that we want to put together, or we may want to build our own dataset to answer the questions we want to ask. Suppose we have a dataset of drivers who have been pulled over by the police and need to compare the distribution of stops between day and night. If we only have the time of each stop, without a day/night label, we can use another dataset with the sunset time for each day and build our own dataset out of these two to find the answer. Later we will also learn about getting data out of databases with some small database operations through SQLite and some Python libraries.
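The sunset example can be sketched with toy data. Everything here is hypothetical for illustration (the `stops` and `sunsets` frames, and the `stop_time`/`sunset_time` columns, are made up); it only labels stops after sunset as night, ignoring stops before sunrise, to keep the sketch short:

```python
import pandas as pd

# Hypothetical toy data: traffic stops with a time, and one sunset time per date
stops = pd.DataFrame({'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
                      'stop_time': ['14:30', '19:45', '06:10']})
sunsets = pd.DataFrame({'date': ['2020-01-01', '2020-01-02'],
                        'sunset_time': ['16:45', '16:46']})

# Merge on the shared date column, then compare zero-padded HH:MM strings
# (simplification: stops before sunrise are not labeled night here)
combined = stops.merge(sunsets, on='date')
combined['night'] = combined['stop_time'] > combined['sunset_time']
print(combined)
```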

To use relative paths as in class (data/2018-games.csv) instead of a full URL like https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-games.csv, download the data from this GitHub repo by clicking the green Code button and choosing .zip. Then unzip the data and save it in a data folder in the same folder as the notebook. For the class notes, the URLs make it so that the notebook can run without having to store the data elsewhere.

games_df18 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-games.csv')
games_df19 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2019-games.csv')

Stacking DataFrames

Let’s look at the first couple of rows to see how the data is organized.

games_df18.head(2)
Unnamed: 0 GAME_DATE_EST GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON TEAM_ID_home PTS_home FG_PCT_home ... AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
0 16196 2019-06-13 41800406 Final 1610612744 1610612761 2018 1610612744 110.0 0.488 ... 28.0 42.0 1610612761 114.0 0.476 0.793 0.394 25.0 39.0 0
1 16197 2019-06-10 41800405 Final 1610612761 1610612744 2018 1610612761 105.0 0.447 ... 19.0 43.0 1610612744 106.0 0.463 0.714 0.476 27.0 37.0 0

2 rows × 22 columns

games_df19.head(2)
Unnamed: 0 GAME_DATE_EST GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON TEAM_ID_home PTS_home FG_PCT_home ... AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
0 0 2020-03-01 21900895 Final 1610612766 1610612749 2019 1610612766 85.0 0.354 ... 22.0 47.0 1610612749 93.0 0.402 0.762 0.226 20.0 61.0 0
1 1 2020-03-01 21900896 Final 1610612750 1610612742 2019 1610612750 91.0 0.364 ... 19.0 57.0 1610612742 111.0 0.468 0.632 0.275 28.0 56.0 0

2 rows × 22 columns

As we can see, both datasets have the same columns; they just cover two different years.

games_df18.shape
(1378, 22)
games_df19.shape
(965, 22)

Let’s concatenate the two DataFrames to make one DataFrame out of them.

games_df = pd.concat([games_df18,games_df19])
games_df.shape
(2343, 22)

As we can see, the rows are added up, but we have the same number of columns.
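One thing to watch with concat: each frame keeps its original index, so the index labels repeat in the stacked result. A minimal sketch on toy frames (the names `a` and `b` are made up for illustration), showing how `ignore_index=True` builds a fresh index instead:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# concat keeps each input's original index, so labels repeat
stacked = pd.concat([a, b])
# ignore_index=True builds a fresh 0..n-1 index instead
reindexed = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())    # [0, 1, 0, 1]
print(reindexed.index.tolist())  # [0, 1, 2, 3]
```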

games_df.columns
Index(['Unnamed: 0', 'GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT',
       'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON', 'TEAM_ID_home', 'PTS_home',
       'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home',
       'TEAM_ID_away', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away',
       'FG3_PCT_away', 'AST_away', 'REB_away', 'HOME_TEAM_WINS'],
      dtype='object')

Let’s drop the column we do not need.

games_df.drop(columns='Unnamed: 0', inplace=True)
games_df.head(2)
GAME_DATE_EST GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON TEAM_ID_home PTS_home FG_PCT_home FT_PCT_home ... AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
0 2019-06-13 41800406 Final 1610612744 1610612761 2018 1610612744 110.0 0.488 0.700 ... 28.0 42.0 1610612761 114.0 0.476 0.793 0.394 25.0 39.0 0
1 2019-06-10 41800405 Final 1610612761 1610612744 2018 1610612761 105.0 0.447 0.778 ... 19.0 43.0 1610612744 106.0 0.463 0.714 0.476 27.0 37.0 0

2 rows × 21 columns
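As an aside, an 'Unnamed: 0' column usually appears when a CSV was saved together with its index. Assuming the first CSV column really is an old index, passing index_col=0 to read_csv avoids needing the drop step at all. A small sketch with an inline toy CSV:

```python
import pandas as pd
from io import StringIO

# A CSV that was saved with its index has an empty first header,
# which pandas names 'Unnamed: 0'
csv_text = ",PTS_home,PTS_away\n0,110.0,114.0\n1,105.0,106.0\n"

with_extra = pd.read_csv(StringIO(csv_text))
clean = pd.read_csv(StringIO(csv_text), index_col=0)

print(with_extra.columns.tolist())  # ['Unnamed: 0', 'PTS_home', 'PTS_away']
print(clean.columns.tolist())       # ['PTS_home', 'PTS_away']
```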

Merging DataFrames

Now we read another dataset; some of its columns are the same as in the “games_df” DataFrame and some are different.

teams_df = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/teams.csv')
teams_df.head(2)
LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY OWNER GENERALMANAGER HEADCOACH DLEAGUEAFFILIATION
0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 Tony Ressler Travis Schlenk Lloyd Pierce Erie Bayhawks
1 0 1610612738 1946 2019 BOS Celtics 1946 Boston TD Garden 18624.0 Wyc Grousbeck Danny Ainge Brad Stevens Maine Red Claws

We use left_on='TEAM_ID' and right_on='HOME_TEAM_ID' to match rows between the two DataFrames.

merge1_df = pd.merge(teams_df,games_df,left_on='TEAM_ID', right_on = 'HOME_TEAM_ID')
merge1_df.head(2)
LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY ... AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 ... 29.0 61.0 1610612754 135.0 0.459 0.846 0.400 22.0 43.0 0
1 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 ... 29.0 44.0 1610612755 122.0 0.459 0.579 0.323 27.0 57.0 1

2 rows × 35 columns

We want one row per game, with the team info appended onto it. So, let’s try other settings in our merge.

merge1_df.shape
(2343, 35)
merge2_df = pd.merge(teams_df, games_df,left_on='TEAM_ID', right_on = 'HOME_TEAM_ID', how='outer')
merge2_df.head(2)
LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY ... AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 ... 29.0 61.0 1610612754 135.0 0.459 0.846 0.400 22.0 43.0 0
1 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 ... 29.0 44.0 1610612755 122.0 0.459 0.579 0.323 27.0 57.0 1

2 rows × 35 columns

merge2_df.shape
(2343, 35)

The outer merge has the same shape as the inner one, so every home team in the games found a match in teams_df and no extra rows were added. We can group by “ARENA” and then look at the “mean” statistics.

merge1_df.groupby('ARENA').mean()
LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR YEARFOUNDED ARENACAPACITY GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON ... AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
ARENA
AT&T Center 0.0 1.610613e+09 1976.000000 2019.0 1976.000000 18694.0 2.184021e+07 1.610613e+09 1.610613e+09 2018.397436 ... 25.141026 45.782051 1.610613e+09 108.448718 0.449333 0.770808 0.348231 23.987179 43.833333 0.666667
American Airlines Center 0.0 1.610613e+09 1980.000000 2019.0 1980.000000 19200.0 2.130985e+07 1.610613e+09 1.610613e+09 2018.426667 ... 23.920000 46.093333 1.610613e+09 108.960000 0.456093 0.777320 0.347707 22.960000 43.720000 0.546667
AmericanAirlines Arena 0.0 1.610613e+09 1988.000000 2019.0 1988.000000 19600.0 2.105314e+07 1.610613e+09 1.610613e+09 2018.421053 ... 25.039474 46.276316 1.610613e+09 106.618421 0.442684 0.771987 0.344250 23.157895 42.131579 0.644737
Amway Center 0.0 1.610613e+09 1989.000000 2019.0 1989.000000 0.0 2.171458e+07 1.610613e+09 1.610613e+09 2018.423077 ... 24.525641 45.935897 1.610613e+09 105.538462 0.454372 0.772487 0.359962 22.987179 44.769231 0.538462
Bankers Life Fieldhouse 0.0 1.610613e+09 1976.000000 2019.0 1976.000000 18345.0 2.197454e+07 1.610613e+09 1.610613e+09 2018.441558 ... 26.363636 44.597403 1.610613e+09 103.051948 0.435494 0.761805 0.329442 23.272727 44.662338 0.675325
Barclays Center 0.0 1.610613e+09 1976.000000 2019.0 1976.000000 NaN 2.197516e+07 1.610613e+09 1.610613e+09 2018.413333 ... 23.880000 48.000000 1.610613e+09 110.293333 0.448133 0.773533 0.338547 22.640000 45.000000 0.533333
Capital One Arena 0.0 1.610613e+09 1961.000000 2019.0 1961.000000 20647.0 2.130190e+07 1.610613e+09 1.610613e+09 2018.418919 ... 25.891892 43.932432 1.610613e+09 115.310811 0.472041 0.765959 0.353324 24.554054 46.121622 0.500000
Chase Center 0.0 1.610613e+09 1946.000000 2019.0 1946.000000 19596.0 2.350492e+07 1.610613e+09 1.610613e+09 2018.377778 ... 28.444444 44.722222 1.610613e+09 112.833333 0.461144 0.782122 0.374233 25.377778 43.900000 0.511111
Chesapeake Energy Arena 0.0 1.610613e+09 1967.000000 2019.0 1967.000000 19163.0 2.171802e+07 1.610613e+09 1.610613e+09 2018.425000 ... 23.450000 45.387500 1.610613e+09 109.250000 0.455088 0.772437 0.351600 23.175000 44.700000 0.637500
FedExForum 0.0 1.610613e+09 1995.000000 2019.0 1995.000000 18119.0 2.131630e+07 1.610613e+09 1.610613e+09 2018.421053 ... 25.394737 44.578947 1.610613e+09 108.236842 0.445092 0.786355 0.362816 23.473684 44.986842 0.526316
Fiserv Forum 0.0 1.610613e+09 1968.000000 2019.0 1968.000000 17500.0 2.328482e+07 1.610613e+09 1.610613e+09 2018.385542 ... 26.819277 51.590361 1.610613e+09 106.421687 0.417289 0.758819 0.347325 23.253012 44.385542 0.843373
Golden 1 Center 0.0 1.610613e+09 1948.000000 2019.0 1948.000000 17500.0 2.142553e+07 1.610613e+09 1.610613e+09 2018.416667 ... 24.125000 45.208333 1.610613e+09 112.361111 0.459417 0.772444 0.349514 23.805556 46.708333 0.527778
Little Caesars Arena 0.0 1.610613e+09 1948.000000 2019.0 1948.000000 21000.0 2.171695e+07 1.610613e+09 1.610613e+09 2018.430380 ... 23.974684 44.670886 1.610613e+09 108.797468 0.469127 0.767203 0.348127 24.151899 43.417722 0.493671
Madison Square Garden 0.0 1.610613e+09 1946.000000 2019.0 1946.000000 19763.0 2.105316e+07 1.610613e+09 1.610613e+09 2018.421053 ... 21.500000 45.447368 1.610613e+09 112.105263 0.463842 0.756461 0.365237 24.263158 45.263158 0.250000
Moda Center 0.0 1.610613e+09 1970.000000 2019.0 1970.000000 19980.0 2.316313e+07 1.610613e+09 1.610613e+09 2018.373494 ... 22.361446 47.831325 1.610613e+09 112.060241 0.448145 0.793880 0.371578 22.939759 44.048193 0.662651
Pepsi Center 0.0 1.610613e+09 1976.000000 2019.0 1976.000000 19099.0 2.369186e+07 1.610613e+09 1.610613e+09 2018.395062 ... 27.419753 46.419753 1.610613e+09 103.777778 0.448531 0.742173 0.323988 23.320988 42.172840 0.790123
Quicken Loans Arena 0.0 1.610613e+09 1970.000000 2019.0 1970.000000 20562.0 2.132390e+07 1.610613e+09 1.610613e+09 2018.428571 ... 22.038961 44.025974 1.610613e+09 113.584416 0.489766 0.776896 0.384247 25.584416 42.597403 0.298701
Scotiabank Arena 0.0 1.610613e+09 1995.000000 2019.0 1995.000000 19800.0 2.428270e+07 1.610613e+09 1.610613e+09 2018.377778 ... 25.433333 45.100000 1.610613e+09 106.144444 0.432111 0.770211 0.347544 24.155556 45.300000 0.744444
Smoothie King Center 0.0 1.610613e+09 2002.000000 2019.0 2002.000000 NaN 2.156903e+07 1.610613e+09 1.610613e+09 2018.424658 ... 27.232877 46.945205 1.610613e+09 116.808219 0.474096 0.771562 0.362822 25.109589 44.561644 0.438356
Spectrum Center 0.0 1.610613e+09 1988.000000 2019.0 1988.000000 19026.0 2.104182e+07 1.610613e+09 1.610613e+09 2018.413333 ... 23.746667 44.293333 1.610613e+09 109.386667 0.463920 0.761573 0.350853 25.066667 45.266667 0.493333
Staples Center 0.0 1.610613e+09 1959.210191 2019.0 1959.210191 19060.0 2.139604e+07 1.610613e+09 1.610613e+09 2018.414013 ... 25.942675 47.286624 1.610613e+09 110.477707 0.443783 0.775064 0.338796 24.095541 45.133758 0.636943
State Farm Arena 0.0 1.610613e+09 1949.000000 2019.0 1949.000000 18729.0 2.131766e+07 1.610613e+09 1.610613e+09 2018.434211 ... 25.736842 45.960526 1.610613e+09 117.894737 0.463447 0.767184 0.343553 25.500000 46.276316 0.421053
TD Garden 0.0 1.610613e+09 1946.000000 2019.0 1946.000000 18624.0 2.235308e+07 1.610613e+09 1.610613e+09 2018.397436 ... 25.717949 45.974359 1.610613e+09 106.358974 0.442769 0.743064 0.335808 23.282051 44.153846 0.717949
Talking Stick Resort Arena 0.0 1.610613e+09 1968.000000 2019.0 1968.000000 NaN 2.132518e+07 1.610613e+09 1.610613e+09 2018.441558 ... 25.298701 41.519481 1.610613e+09 113.064935 0.474922 0.762377 0.363143 24.285714 45.000000 0.311688
Target Center 0.0 1.610613e+09 1989.000000 2019.0 1989.000000 19356.0 2.156304e+07 1.610613e+09 1.610613e+09 2018.402778 ... 24.458333 46.583333 1.610613e+09 113.208333 0.462306 0.763181 0.363625 24.402778 46.500000 0.444444
Toyota Center 0.0 1.610613e+09 1967.000000 2019.0 1967.000000 18104.0 2.283925e+07 1.610613e+09 1.610613e+09 2018.387500 ... 21.712500 44.850000 1.610613e+09 110.275000 0.454438 0.759188 0.339250 24.200000 45.700000 0.737500
United Center 0.0 1.610613e+09 1966.000000 2019.0 1966.000000 21711.0 2.107486e+07 1.610613e+09 1.610613e+09 2018.435897 ... 22.987179 43.512821 1.610613e+09 110.705128 0.475372 0.750628 0.355808 25.602564 46.192308 0.307692
Vivint Smart Home Arena 0.0 1.610613e+09 1974.000000 2019.0 1974.000000 20148.0 2.197424e+07 1.610613e+09 1.610613e+09 2018.421053 ... 24.526316 47.592105 1.610613e+09 105.723684 0.443618 0.767803 0.351961 20.539474 42.276316 0.684211
Wells Fargo Center 0.0 1.610613e+09 1949.000000 2019.0 1949.000000 NaN 2.282764e+07 1.610613e+09 1.610613e+09 2018.395062 ... 27.395062 47.617284 1.610613e+09 106.753086 0.445346 0.757247 0.343272 22.086420 41.160494 0.802469

29 rows × 25 columns
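Many of the averaged columns above (team IDs, game IDs) are not meaningful as means, and newer pandas versions may also require numeric_only=True for a frame-wide .mean(). Selecting one column of interest after the groupby keeps the result readable. A sketch on a toy stand-in for merge1_df (the `games` frame and its values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for merge1_df: one row per game with its arena attached
games = pd.DataFrame({'ARENA': ['A', 'A', 'B', 'B'],
                      'HOME_TEAM_WINS': [1, 0, 1, 1],
                      'PTS_home': [100, 90, 110, 120]})

# Selecting one column before .mean() skips columns (like IDs)
# whose averages are not meaningful; here it gives home win rate per arena
win_rate = games.groupby('ARENA')['HOME_TEAM_WINS'].mean()
print(win_rate)
```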

How to combine the same data for different outcomes

We can combine datasets in different ways to learn different things about the data. Let’s now read info about the players.

players18 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2018-players.csv')
players19 = pd.read_csv('https://raw.githubusercontent.com/rhodyprog4ds/inclass-data/main/2019-players.csv')
players18.head()
Unnamed: 0 PLAYER_NAME TEAM_ID PLAYER_ID SEASON
0 626 Kawhi Leonard 1610612761 202695 2018
1 627 Pascal Siakam 1610612761 1627783 2018
2 628 Marc Gasol 1610612761 201188 2018
3 629 Danny Green 1610612761 201980 2018
4 630 Kyle Lowry 1610612761 200768 2018

First let’s look at the shape of each of them to have a reference for what happens when we try different merges.

players18.shape, players19.shape
((748, 5), (626, 5))

One thing we might want to do is to put all of the information into one long DataFrame. We can do this with concat.

pd.concat([players18,players19])
Unnamed: 0 PLAYER_NAME TEAM_ID PLAYER_ID SEASON
0 626 Kawhi Leonard 1610612761 202695 2018
1 627 Pascal Siakam 1610612761 1627783 2018
2 628 Marc Gasol 1610612761 201188 2018
3 629 Danny Green 1610612761 201980 2018
4 630 Kyle Lowry 1610612761 200768 2018
... ... ... ... ... ...
621 621 Anthony Bennett 1610612745 203461 2019
622 622 Ray Spalding 1610612737 1629034 2019
623 623 Devyn Marble 1610612744 203906 2019
624 624 Hassani Gravett 1610612753 1629755 2019
625 625 JaKeenan Gant 1610612754 1629721 2019

1374 rows × 5 columns

This allows us to see all the players for each year. For example, we could use groupby and count to see how many players played each year.

We can check that this is the size we expected.

pd.concat([players18,players19]).shape
(1374, 5)
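This kind of check can be automated with assertions, sketched here on toy frames (the names `a` and `b` are made up): the stacked frame should have the rows of both inputs and the same columns.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'x': [4, 5]})
combined = pd.concat([a, b])

# Rows should add up while the columns stay the same;
# assert raises an error if the check fails, and is silent otherwise
assert combined.shape[0] == a.shape[0] + b.shape[0]
assert combined.shape[1] == a.shape[1] == b.shape[1]
```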

If we use the default merge settings, we get an empty result: the two DataFrames have the same columns, so pandas tries to merge on all of them, but no rows share the same values in every column, so there’s nothing left.

pd.merge(players18,players19,)
Unnamed: 0 PLAYER_NAME TEAM_ID PLAYER_ID SEASON

If we merge with on='PLAYER_ID', only that one column has to match for rows from the two DataFrames to be paired. With the default value for how, or explicitly setting how='inner', we get the info of players who played in both seasons.

pd.merge(players18,players19,on='PLAYER_ID', how='inner')
Unnamed: 0_x PLAYER_NAME_x TEAM_ID_x PLAYER_ID SEASON_x Unnamed: 0_y PLAYER_NAME_y TEAM_ID_y SEASON_y
0 626 Kawhi Leonard 1610612761 202695 2018 299 Kawhi Leonard 1610612746 2019
1 627 Pascal Siakam 1610612761 1627783 2018 275 Pascal Siakam 1610612761 2019
2 628 Marc Gasol 1610612761 201188 2018 276 Marc Gasol 1610612761 2019
3 1157 Marc Gasol 1610612763 201188 2018 276 Marc Gasol 1610612761 2019
4 629 Danny Green 1610612761 201980 2018 202 Danny Green 1610612747 2019
... ... ... ... ... ... ... ... ... ...
533 1339 Abdul Gaddy 1610612760 203583 2018 561 Abdul Gaddy 1610612760 2019
534 1342 Andre Roberson 1610612760 203460 2018 566 Andre Roberson 1610612760 2019
535 1348 Norvel Pelle 1610612755 203658 2018 24 Norvel Pelle 1610612755 2019
536 1350 Denzel Valentine 1610612741 1627756 2018 58 Denzel Valentine 1610612741 2019
537 1353 C.J. Wilcox 1610612754 203912 2018 573 C.J. Wilcox 1610612754 2019

538 rows × 9 columns

When we use outer we get one row for each player who played in either season or both. From this we can, for example, see who changed teams, who the 2019 rookies were, and who retired or went unsigned in 2019.

pd.merge(players18,players19,on='PLAYER_ID', how='outer')
Unnamed: 0_x PLAYER_NAME_x TEAM_ID_x PLAYER_ID SEASON_x Unnamed: 0_y PLAYER_NAME_y TEAM_ID_y SEASON_y
0 626.0 Kawhi Leonard 1.610613e+09 202695 2018.0 299.0 Kawhi Leonard 1.610613e+09 2019.0
1 627.0 Pascal Siakam 1.610613e+09 1627783 2018.0 275.0 Pascal Siakam 1.610613e+09 2019.0
2 628.0 Marc Gasol 1.610613e+09 201188 2018.0 276.0 Marc Gasol 1.610613e+09 2019.0
3 1157.0 Marc Gasol 1.610613e+09 201188 2018.0 276.0 Marc Gasol 1.610613e+09 2019.0
4 629.0 Danny Green 1.610613e+09 201980 2018.0 202.0 Danny Green 1.610613e+09 2019.0
... ... ... ... ... ... ... ... ... ...
922 NaN NaN NaN 1629097 NaN 619.0 Terry Larrier 1.610613e+09 2019.0
923 NaN NaN NaN 203461 NaN 621.0 Anthony Bennett 1.610613e+09 2019.0
924 NaN NaN NaN 203906 NaN 623.0 Devyn Marble 1.610613e+09 2019.0
925 NaN NaN NaN 1629755 NaN 624.0 Hassani Gravett 1.610613e+09 2019.0
926 NaN NaN NaN 1629721 NaN 625.0 JaKeenan Gant 1.610613e+09 2019.0

927 rows × 9 columns
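merge also accepts indicator=True, which adds a _merge column recording whether each row came from the left input only, the right only, or both; that labels the retired/unsigned players, the rookies, and the returners directly. A sketch on made-up toy rosters (`s18` and `s19` are hypothetical):

```python
import pandas as pd

# Toy rosters: player 1 appears only in 2018, player 4 only in 2019
s18 = pd.DataFrame({'PLAYER_ID': [1, 2, 3], 'TEAM_ID': [10, 10, 11]})
s19 = pd.DataFrame({'PLAYER_ID': [2, 3, 4], 'TEAM_ID': [10, 12, 11]})

both = pd.merge(s18, s19, on='PLAYER_ID', how='outer', indicator=True)
# _merge is 'left_only' (only 2018), 'right_only' (only 2019), or 'both'
print(both[['PLAYER_ID', '_merge']])
```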

Using left gives us the 'PLAYER_ID' values that are in the left (players18) DataFrame, including those that are in both. how='right' would instead give players in the players19 DataFrame or in both. With this result, we can see who retired, but not the 2019 rookies.

pd.merge(players18,players19,on='PLAYER_ID', how='left')
Unnamed: 0_x PLAYER_NAME_x TEAM_ID_x PLAYER_ID SEASON_x Unnamed: 0_y PLAYER_NAME_y TEAM_ID_y SEASON_y
0 626 Kawhi Leonard 1610612761 202695 2018 299.0 Kawhi Leonard 1.610613e+09 2019.0
1 627 Pascal Siakam 1610612761 1627783 2018 275.0 Pascal Siakam 1.610613e+09 2019.0
2 628 Marc Gasol 1610612761 201188 2018 276.0 Marc Gasol 1.610613e+09 2019.0
3 629 Danny Green 1610612761 201980 2018 202.0 Danny Green 1.610613e+09 2019.0
4 630 Kyle Lowry 1610612761 200768 2018 451.0 Kyle Lowry 1.610613e+09 2019.0
... ... ... ... ... ... ... ... ... ...
749 1369 Tyrius Walker 1610612752 1629246 2018 NaN NaN NaN NaN
750 1370 Marcus Lee 1610612748 1629159 2018 NaN NaN NaN NaN
751 1371 Trey Lewis 1610612762 1629163 2018 NaN NaN NaN NaN
752 1372 Emanuel Terry 1610612743 1629150 2018 NaN NaN NaN NaN
753 1373 Justin Bibbs 1610612738 1629167 2018 NaN NaN NaN NaN

754 rows × 9 columns
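The players who did not return can then be pulled out of a left merge by filtering for NaN in one of the right-hand columns. A sketch on made-up toy frames (`s18`, `s19`, and `TEAM_ID_19` are hypothetical names):

```python
import pandas as pd

s18 = pd.DataFrame({'PLAYER_ID': [1, 2, 3],
                    'PLAYER_NAME': ['Ann', 'Bo', 'Cy']})
s19 = pd.DataFrame({'PLAYER_ID': [2, 3], 'TEAM_ID_19': [10, 12]})

left = pd.merge(s18, s19, on='PLAYER_ID', how='left')

# Rows that picked up NaN in a right-hand column had no 2019 match
no_19 = left[left['TEAM_ID_19'].isna()]
print(no_19['PLAYER_NAME'].tolist())  # ['Ann']
```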

Try it yourself

Try different merges and inspect them:

  • how many rows & columns?

  • Where are NaN values inserted?

  • What rows from the original datasets are not included?

  • describe each type of merge in your own words

Split a DataFrame into separate data frames by subsetting the columns and indexing the rows with loc, then use concat to put it back together. Programmatically check that it’s back together correctly.
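One possible version of the check, sketched on a made-up toy frame (`df` and its columns are hypothetical; your split points will differ): split by rows and by columns, reassemble with concat, and use .equals to verify the round trip.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})

# Split the rows with loc (loc slices include both endpoints)
top, bottom = df.loc[:1], df.loc[2:]
rebuilt_rows = pd.concat([top, bottom])

# Split the columns, then stack side by side with axis=1
left, right = df[['a']], df[['b']]
rebuilt_cols = pd.concat([left, right], axis=1)

# equals checks that values, index, columns, and dtypes all match
assert rebuilt_rows.equals(df)
assert rebuilt_cols.equals(df)
```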