11. Building Datasets from Multiple Sources

The focus this week is on how to programmatically combine sources of data.

We will start by combining multiple tabular datasets and then see how to get data from other sources.

11.1. Self-assessment, what's your plan

Take a few minutes to think about the following questions and make a few notes for yourself wherever you need them to be (a planner, calendar, etc.). Share one takeaway or question you have below when you're done.

  1. What achievements have you earned?

  2. Does BrightSpace seem consistent with what you've done?

  3. If not, e-mail brownsarahm with [CSC310] or [DSP310] in the subject and specific details about what you think is missing and why

  4. Are you on track to earn the grade you want in this class?

  5. If not, what will you need to do (respond more in class, submit more assignments, use your portfolio to catch up) to get back on track?

  6. If you are on track and you want to earn above a B, take a minute to think about your portfolio. (tip: post an idea as an issue to get early feedback and help shaping your idea)

11.2. Logistics

Check your earned achievements. See the instructions, which are now saved on the left side for future reference.

11.3. Merges

import pandas as pd

I created a folder with the datasets we use in class.

course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'

We can load in two data sets of player information.

df_18 = pd.read_csv(course_data_url+ '2018-players.csv')
df_19 = pd.read_csv(course_data_url+ '2019-players.csv')

and take a peek at each

df_18.head(2)
TEAM_ID PLAYER_ID SEASON
0 1610612761 202695 2018
1 1610612761 1627783 2018

Important

Remember that columns is an attribute, so it does not need ().

df_19.columns
Index(['PLAYER_NAME', 'TEAM_ID', 'PLAYER_ID', 'SEASON'], dtype='object')

Let’s make note of the shape of each

df_18.shape, df_19.shape
((748, 3), (626, 4))

11.4. What if we want to analyze them together?

We can combine them vertically:

pd.concat([df_18,df_19]).shape
(1374, 4)

Note that this has the union of the columns (four: df_18's three columns all appear in df_19, which adds PLAYER_NAME) and the total number of rows (748 + 626 = 1374).
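We can sanity check that arithmetic (a quick sketch, reusing the shapes above; we did not run this in class):

# concat stacks the rows and takes the union of the columns
assert pd.concat([df_18, df_19]).shape == (df_18.shape[0] + df_19.shape[0], 4)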

11.5. How can we find which players changed teams?

To do this, we want one column identifying the player and a column with each year's team.

We can use a merge to do that.

pd.merge(df_18,df_19,).head(2)
TEAM_ID PLAYER_ID SEASON PLAYER_NAME

If we merge them without any parameters, pandas tries to merge on all shared columns. Here that includes SEASON, which never matches across the two DataFrames, so the result is empty. We want to merge using only the PLAYER_ID column; we would say that we are "merging on player ID," and we use the on parameter to do it.

pd.merge(df_18,df_19, on = 'PLAYER_ID').head(2)
TEAM_ID_x PLAYER_ID SEASON_x PLAYER_NAME TEAM_ID_y SEASON_y
0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019
1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019

Since there are other columns that appear in both DataFrames, they get a suffix, which by default is _x or _y. We can specify the suffixes, though.

pd.merge(df_18,df_19, on = 'PLAYER_ID',
        suffixes=('_2018','_2019')).head(2)
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019
1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019

The suffixes we passed tell merge what to append to any column names that appear in both DataFrames but are not merged on.

df_18.head(1)
TEAM_ID PLAYER_ID SEASON
0 1610612761 202695 2018
df_19.head(1)
PLAYER_NAME TEAM_ID PLAYER_ID SEASON
0 Royce O'Neale 1610612762 1626220 2019
df1819 = pd.merge(df_18,df_19, on = 'PLAYER_ID',
        suffixes=('_2018','_2019'))
df1819.shape
(538, 6)

By default, this uses an inner merge, so we get only the players that are in both datasets.
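Since inner is the default, passing how='inner' explicitly gives the same result; a quick check, not anything we ran in class:

# 'inner' is the default value of the how parameter
assert pd.merge(df_18, df_19, on='PLAYER_ID', how='inner',
                suffixes=('_2018', '_2019')).shape == df1819.shape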

df1819.head()
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019
1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019
2 1610612761 201188 2018 Marc Gasol 1610612761 2019
3 1610612763 201188 2018 Marc Gasol 1610612761 2019
4 1610612761 201980 2018 Danny Green 1610612747 2019

Some players still appear twice because they were in one of the datasets twice; this happens when a player plays for two teams in one season.
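We can spot those players directly in one season's data. This is a minimal sketch using duplicated, not something we ran in class:

# keep=False marks every occurrence of a repeated PLAYER_ID
df_18[df_18['PLAYER_ID'].duplicated(keep=False)].head()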

11.6. Which players played in 2018, but not 2019?

There are different types of merges: inner keeps keys in both, outer keeps keys in either, and left and right keep all the rows of one DataFrame. We can use a left merge with df_18 as the left DataFrame to see which players played only in 2018.

pd.merge(df_18,df_19,on='PLAYER_ID', how='left',suffixes=('_2018','_2019')).tail()
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
749 1610612752 1629246 2018 NaN NaN NaN
750 1610612748 1629159 2018 NaN NaN NaN
751 1610612762 1629163 2018 NaN NaN NaN
752 1610612743 1629150 2018 NaN NaN NaN
753 1610612738 1629167 2018 NaN NaN NaN

To pick out those rows:

df_18_only = pd.merge(df_18, df_19, on='PLAYER_ID', how='left', suffixes=('_2018', '_2019'))
# players with no 2019 row have NaN in all of the 2019 columns
df_18_only[df_18_only['SEASON_2019'].isna()]
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
9 1610612761 202391 2018 NaN NaN NaN
11 1610612761 201975 2018 NaN NaN NaN
18 1610612744 101106 2018 NaN NaN NaN
23 1610612744 2733 2018 NaN NaN NaN
24 1610612744 201973 2018 NaN NaN NaN
... ... ... ... ... ... ...
749 1610612752 1629246 2018 NaN NaN NaN
750 1610612748 1629159 2018 NaN NaN NaN
751 1610612762 1629163 2018 NaN NaN NaN
752 1610612743 1629150 2018 NaN NaN NaN
753 1610612738 1629167 2018 NaN NaN NaN

216 rows × 6 columns

# an alternative: drop the redundant SEASON column from df_19 before merging
df_18_left = pd.merge(df_18, df_19.drop(columns=['SEASON']),
                      on='PLAYER_ID', how='left', suffixes=('_2018', '_2019'))
# keep only the rows with no 2019 match
df_18_only = df_18_only[df_18_only['TEAM_ID_2019'].isna()]
df_18_only.head()
TEAM_ID_2018 PLAYER_ID SEASON_2018 PLAYER_NAME TEAM_ID_2019 SEASON_2019
9 1610612761 202391 2018 NaN NaN NaN
11 1610612761 201975 2018 NaN NaN NaN
18 1610612744 101106 2018 NaN NaN NaN
23 1610612744 2733 2018 NaN NaN NaN
24 1610612744 201973 2018 NaN NaN NaN
Since some of those players appear twice, we drop duplicates on PLAYER_ID before counting; unpacking the shape tuple gives just the number of rows.

n_18_only, _ = df_18_only.drop_duplicates(subset=['PLAYER_ID']).shape
n_18_only
178
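A right merge answers the symmetric question, which players played in 2019 but not 2018; a sketch we did not run in class:

df_19_only = pd.merge(df_18, df_19, on='PLAYER_ID', how='right',
                      suffixes=('_2018', '_2019'))
# rows with no 2018 match have NaN in the 2018 columns
df_19_only[df_19_only['SEASON_2018'].isna()]['PLAYER_ID'].nunique()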

11.7. Which players played for the same team both seasons?

df_same_team = pd.merge(df_18, df_19, on=['PLAYER_ID', 'TEAM_ID'])
df_same_team.head()
TEAM_ID PLAYER_ID SEASON_x PLAYER_NAME SEASON_y
0 1610612761 1627783 2018 Pascal Siakam 2019
1 1610612761 201188 2018 Marc Gasol 2019
2 1610612761 200768 2018 Kyle Lowry 2019
3 1610612761 1627832 2018 Fred VanVleet 2019
4 1610612761 201586 2018 Serge Ibaka 2019

In this case, the suffixes only apply to SEASON, but those columns are not telling us much, so we can clean up by dropping SEASON from each DataFrame before we merge.

df_same_team_clean = pd.merge(df_18.drop(columns='SEASON'), df_19.drop(columns='SEASON'),
                              on=['PLAYER_ID', 'TEAM_ID'])
df_same_team_clean.head()
TEAM_ID PLAYER_ID PLAYER_NAME
0 1610612761 1627783 Pascal Siakam
1 1610612761 201188 Marc Gasol
2 1610612761 200768 Kyle Lowry
3 1610612761 1627832 Fred VanVleet
4 1610612761 201586 Serge Ibaka
df_same_team_clean.shape
(263, 3)
df_same_team_clean.drop_duplicates(subset=['PLAYER_ID']).shape
(263, 3)

We do not need to drop duplicates in this case because we merged on two columns and there were no actual duplicates in the original datasets.
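With df1819 from above we can also finish answering the question that opened section 11.5, which players changed teams. A sketch we did not run in class:

# a team ID that differs across the two seasons indicates a change
changed_teams = df1819[df1819['TEAM_ID_2018'] != df1819['TEAM_ID_2019']]
changed_teams['PLAYER_NAME'].drop_duplicates().head()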

11.8. Visualizing merges

left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K4"],
        "A": ["A0", "A1", "A2", "A4"],
        "B": ["B0", "B1", "B2", "B4"],
    }
)


right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
left
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K4 A4 B4
right
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
pd.merge(left, right, how='inner',)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
pd.merge(left, right, how='outer')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K4 A4 B4 NaN NaN
4 K3 NaN NaN C3 D3
pd.merge(left, right, how='left')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K4 A4 B4 NaN NaN
pd.merge(left, right, how='right')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 NaN NaN C3 D3
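pandas can also label where each row came from with the indicator parameter; a quick sketch with the same small DataFrames, not something we ran in class:

# indicator=True adds a _merge column with values 'left_only', 'right_only', or 'both'
pd.merge(left, right, how='outer', indicator=True)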

We can merge on multiple columns too.

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, on=["key1", "key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
pd.merge(left, right, on=["key1", "key2"],how='outer')
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
pd.merge(left, right,on = 'key1' )
key1 key2_x A B key2_y C D
0 K0 K0 A0 B0 K0 C0 D0
1 K0 K1 A1 B1 K0 C0 D0
2 K1 K0 A2 B2 K0 C1 D1
3 K1 K0 A2 B2 K0 C2 D2
4 K2 K1 A3 B3 K0 C3 D3
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
left
A B
0 1 2
1 2 2
right
A B
0 4 2
1 5 2
2 6 2
pd.merge(left, right, on="B", how="outer")
A_x B A_y
0 1 2 4
1 1 2 5
2 1 2 6
3 2 2 4
4 2 2 5
5 2 2 6

Every row of left matches every row of right on B, so we get all 2 × 3 = 6 combinations; this is a many-to-many merge.

11.9. Questions After Class

Important

Several questions were easiest to answer within the narrative of the notes above.

11.9.1. How do I represent NaN as a variable? (Ex. df1819['TEAM_ID_y'==NaN]…something like that?)

For that specific case, you can use the isna method as I did above, but if you otherwise need a NaN constant, pandas provides

pd.NA
<NA>

and numpy provides

import numpy as np
np.nan
nan

Watch out though because they are not the same:

np.nan == pd.NA
<NA>

and this comparison cannot even be used in an assert:

assert pd.NA == np.nan
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[44], line 1
----> 1 assert pd.NA == np.nan

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/missing.pyx:382, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

The pandas pd.isna function is robust, and it knows how numpy works (because pandas is built on top of numpy and imports it).

pd.isna(np.nan)
True
pd.isna(pd.NA)
True

However, numpy does not know about pandas:

np.isnan(pd.NA)
<NA>

and returns a value that cannot be used for indexing.
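Since pd.isna handles both missing-value types, it is safe to use for boolean indexing; a small made-up example:

# a Series mixing both kinds of missing value
s = pd.Series([1, np.nan, pd.NA])
s[pd.isna(s)]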

11.9.2. Is there an easier way to look at the changes with merge?

Any of the tools we have to look at a DataFrame will work. I think using a small DataFrame is the best way to get a good idea of how each merge type works, so I included that above. Overall, checking the shape gives a broad view and then comparing values gives the details.

11.9.3. Do merges also delete any other overlapping columns other than the one they are told to consolidate on?

No, they copy them all (adding suffixes where the names collide, as we saw above).
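We can confirm that with the merged frame from earlier in these notes:

# every column from both inputs is present, suffixed where the names collided
df1819.columns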

11.9.4. In pd.merge(p18, p19, on='PLAYER_ID', how='____').shape, are p18 and p19 supposed to be df_18 and df_19?

Yes, sorry; I read them in using different names than were in my notes.

11.9.5. Learning more about accessing data from databases

We will access databases on Wednesday.

11.9.6. How to merge data that does not line up as well as these two datasets?

We will see how to merge on columns that contain the same data but have different names later this week. When there are no columns that match value to value, you have to transform and add new columns, combining what we learned about cleaning data to make them line up somehow. For example, maybe one dataset has dates like YYYY-MM-DD and the other dataset has only months like "January". You could split the full date into parts, or convert strings to numbers and numbers to strings, by treating the columns as dates, as in the sketch below.
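Here is a sketch of that date example; the DataFrames and column names are made up for illustration:

sales = pd.DataFrame({'date': ['2019-01-15', '2019-02-03'],
                      'amount': [100, 250]})
monthly = pd.DataFrame({'month': ['January', 'February'],
                        'budget': [120, 200]})

# derive a month-name column from the full date so the two tables line up
sales['month'] = pd.to_datetime(sales['date']).dt.month_name()
pd.merge(sales, monthly, on='month')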

11.9.7. Why did we merge on 'PLAYER_ID'?

We were interested in linking based on the players.

11.9.8. What is the second best source after documentation? For example: you found a parameter that might do what you want it to in pandas, but where do you go after to make sure? I'm not sure that if I ran it, I'd specifically know if it "worked"

Ideally, you should know what you are looking for in order to tell whether it worked or not. If you are not sure what you are looking for, that is a good time to think it through or attend office hours. Next best is to create an issue on your assignment repo, if the question might reveal a solution.

11.9.9. I think I need to understand how GitHub works because I'm following the steps of the assignments and I feel like I'm doing something wrong every time I have to work on GitHub.

If you are getting feedback, you are close enough. However, I am offering an extra session on GitHub on Friday.