11. Building Datasets from Multiple Sources#
The focus this week is on how to programmatically combine sources of data. We will start by looking at combining multiple tabular data formats and then see how to get data from other sources.
11.1. Self-assessment: what’s your plan#
Take a few minutes to think about the following questions and make a few notes for yourself wherever you need them to be (a planner, calendar, etc). Share one takeaway or question you have below when you’re done.
What achievements have you earned?
Does BrightSpace seem accurate to what you’ve done?
If not, e-mail brownsarahm with [CSC310] or [DSP310] in the subject and specific details about what you think is missing and why
Are you on track to earn the grade you want in this class?
If not, what will you need to do (respond more in class, submit more assignments, use your portfolio to catch up) to get back on track?
If you are on track and you want to earn above a B, take a minute to think about your portfolio. (tip: post an idea as an issue to get early feedback and help shaping your idea)
11.2. Logistics#
Check your earned achievements. See the instructions that are now saved for future reference on the left side.
11.3. Merges#
import pandas as pd
I created a folder with datasets we use in class
course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'
We can load in two data sets of player information.
df_18 = pd.read_csv(course_data_url+ '2018-players.csv')
df_19 = pd.read_csv(course_data_url+ '2019-players.csv')
and take a peek at each
df_18.head(2)
TEAM_ID | PLAYER_ID | SEASON | |
---|---|---|---|
0 | 1610612761 | 202695 | 2018 |
1 | 1610612761 | 1627783 | 2018 |
Important
Remember, columns is an attribute, so it does not need ()
df_19.columns
Index(['PLAYER_NAME', 'TEAM_ID', 'PLAYER_ID', 'SEASON'], dtype='object')
Let’s make note of the shape of each
df_18.shape, df_19.shape
((748, 3), (626, 4))
11.4. What if we want to analyze them together?#
We can combine them vertically:
pd.concat([df_18,df_19]).shape
(1374, 4)
Note that the result has the maximum number of columns (four, the union of the two column sets, which overlap) and the total number of rows (748 + 626 = 1374).
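As a small illustration (with made-up frames, not the class data), `pd.concat` stacks the rows, fills columns missing from one frame with NaN, and keeps each frame’s original index unless we ask it to renumber with `ignore_index`:

```python
import pandas as pd

# two small hypothetical frames with overlapping column sets
a = pd.DataFrame({"PLAYER_ID": [1, 2], "SEASON": [2018, 2018]})
b = pd.DataFrame({"PLAYER_ID": [2, 3], "SEASON": [2019, 2019],
                  "PLAYER_NAME": ["x", "y"]})

stacked = pd.concat([a, b])                        # index repeats: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # fresh index: 0..3
```

The stacked frame has the union of the columns (3) and the total rows (4); the rows from `a` get NaN in `PLAYER_NAME`.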
11.5. How can we find which players changed teams?#
To do this we want to have one player column and a column with each year’s team.
We can use a merge to do that.
pd.merge(df_18,df_19,).head(2)
TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
---|---|---|---|
If we merge them without any parameters, it tries to merge on all shared columns, and here that gives an empty result (no row matches on TEAM_ID, PLAYER_ID, and SEASON all at once, since the seasons differ). We want to merge using the PLAYER_ID column; we would say that we are “merging on player ID” and we use the on parameter to do it.
pd.merge(df_18,df_19, on = 'PLAYER_ID').head(2)
TEAM_ID_x | PLAYER_ID | SEASON_x | PLAYER_NAME | TEAM_ID_y | SEASON_y | |
---|---|---|---|---|---|---|
0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
Since there are other columns that appear in both DataFrames, they get a suffix, which by default is _x or _y; we can specify our own though.
pd.merge(df_18,df_19, on = 'PLAYER_ID',
suffixes=('_2018','_2019')).head(2)
TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 | |
---|---|---|---|---|---|---|
0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
We also told it what to append to any column names that appear in both DataFrames but are not merged on.
df_18.head(1)
TEAM_ID | PLAYER_ID | SEASON | |
---|---|---|---|
0 | 1610612761 | 202695 | 2018 |
df_19.head(1)
PLAYER_NAME | TEAM_ID | PLAYER_ID | SEASON | |
---|---|---|---|---|
0 | Royce O'Neale | 1610612762 | 1626220 | 2019 |
df1819 = pd.merge(df_18,df_19, on = 'PLAYER_ID',
suffixes=('_2018','_2019'))
df1819.shape
(538, 6)
By default, this uses an inner merge, so we get the players that are in both datasets only.
df1819.head()
TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 | |
---|---|---|---|---|---|---|
0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
2 | 1610612761 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
3 | 1610612763 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
4 | 1610612761 | 201980 | 2018 | Danny Green | 1610612747 | 2019 |
Some players still appear twice because they were in one of the datasets twice; this happens when a player plays for two teams in one season.
11.6. Which players played in 2018, but not 2019?#
We have different types of merges: inner keeps the rows in both, outer keeps the rows in either. Left and right keep all the rows of one DataFrame. We can use left with df_18
as the left DataFrame to see which players played only in 2018.
pd.merge(df_18,df_19,on='PLAYER_ID', how='left',suffixes=('_2018','_2019')).tail()
TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 | |
---|---|---|---|---|---|---|
749 | 1610612752 | 1629246 | 2018 | NaN | NaN | NaN |
750 | 1610612748 | 1629159 | 2018 | NaN | NaN | NaN |
751 | 1610612762 | 1629163 | 2018 | NaN | NaN | NaN |
752 | 1610612743 | 1629150 | 2018 | NaN | NaN | NaN |
753 | 1610612738 | 1629167 | 2018 | NaN | NaN | NaN |
To pick out those rows:
df_18_only = pd.merge(df_18,df_19,on='PLAYER_ID', how='left', suffixes=('_2018','_2019'))
df_18_only[df_18_only['SEASON_2019'].isna()]
TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 | |
---|---|---|---|---|---|---|
9 | 1610612761 | 202391 | 2018 | NaN | NaN | NaN |
11 | 1610612761 | 201975 | 2018 | NaN | NaN | NaN |
18 | 1610612744 | 101106 | 2018 | NaN | NaN | NaN |
23 | 1610612744 | 2733 | 2018 | NaN | NaN | NaN |
24 | 1610612744 | 201973 | 2018 | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... |
749 | 1610612752 | 1629246 | 2018 | NaN | NaN | NaN |
750 | 1610612748 | 1629159 | 2018 | NaN | NaN | NaN |
751 | 1610612762 | 1629163 | 2018 | NaN | NaN | NaN |
752 | 1610612743 | 1629150 | 2018 | NaN | NaN | NaN |
753 | 1610612738 | 1629167 | 2018 | NaN | NaN | NaN |
216 rows × 6 columns
df_18_left = pd.merge(df_18, df_19.drop(columns=['SEASON']),
                      on='PLAYER_ID', how='left', suffixes=('_2018','_2019'))
df_18_only = df_18_only[df_18_only['TEAM_ID_2019'].isna()]
df_18_only.head()
TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 | |
---|---|---|---|---|---|---|
9 | 1610612761 | 202391 | 2018 | NaN | NaN | NaN |
11 | 1610612761 | 201975 | 2018 | NaN | NaN | NaN |
18 | 1610612744 | 101106 | 2018 | NaN | NaN | NaN |
23 | 1610612744 | 2733 | 2018 | NaN | NaN | NaN |
24 | 1610612744 | 201973 | 2018 | NaN | NaN | NaN |
n_18_only, _ = df_18_only.drop_duplicates(subset=['PLAYER_ID']).shape
n_18_only
178
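Another way to find the rows unique to one side (a sketch on toy frames, not a step we used in class) is the merge indicator parameter, which adds a _merge column recording where each row came from:

```python
import pandas as pd

# hypothetical small versions of the two player tables
d18 = pd.DataFrame({"PLAYER_ID": [1, 2, 3], "TEAM_ID": [10, 11, 12]})
d19 = pd.DataFrame({"PLAYER_ID": [2, 3, 4], "TEAM_ID": [11, 13, 14]})

merged = pd.merge(d18, d19, on="PLAYER_ID", how="left",
                  suffixes=("_2018", "_2019"), indicator=True)
# rows tagged 'left_only' were in 2018 but not 2019
only_18 = merged[merged["_merge"] == "left_only"]
```

This replaces the manual isna check: `_merge` is `'both'` for matched rows and `'left_only'` for unmatched ones.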
11.7. Which players played for the same team both seasons?#
df_same_team = pd.merge(df_18,df_19, on = ['PLAYER_ID','TEAM_ID'],)
df_same_team.head()
TEAM_ID | PLAYER_ID | SEASON_x | PLAYER_NAME | SEASON_y | |
---|---|---|---|---|---|
0 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 2019 |
1 | 1610612761 | 201188 | 2018 | Marc Gasol | 2019 |
2 | 1610612761 | 200768 | 2018 | Kyle Lowry | 2019 |
3 | 1610612761 | 1627832 | 2018 | Fred VanVleet | 2019 |
4 | 1610612761 | 201586 | 2018 | Serge Ibaka | 2019 |
In this case, the suffix only applies to SEASON, but those columns are not telling us much, so we can clean up by using drop
before we merge.
df_18_only_clean = pd.merge(df_18.drop(columns='SEASON'),df_19.drop(columns='SEASON'), on = ['PLAYER_ID','TEAM_ID'],)
df_18_only_clean.head()
TEAM_ID | PLAYER_ID | PLAYER_NAME | |
---|---|---|---|
0 | 1610612761 | 1627783 | Pascal Siakam |
1 | 1610612761 | 201188 | Marc Gasol |
2 | 1610612761 | 200768 | Kyle Lowry |
3 | 1610612761 | 1627832 | Fred VanVleet |
4 | 1610612761 | 201586 | Serge Ibaka |
df_18_only_clean.shape
(263, 3)
df_18_only_clean.drop_duplicates(subset=['PLAYER_ID']).shape
(263, 3)
We do not need to drop the duplicates in this case because we merged on two columns and there were no actual duplicates in the original datasets.
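If you want to check that kind of uniqueness assumption automatically (a sketch on made-up frames, not part of the class example), merge has a validate parameter that raises a MergeError when the keys are not as unique as you expect:

```python
import pandas as pd

left = pd.DataFrame({"PLAYER_ID": [1, 2], "TEAM_ID": [10, 11], "pts": [5, 7]})
right = pd.DataFrame({"PLAYER_ID": [1, 2], "TEAM_ID": [10, 11], "reb": [3, 4]})

# passes: each (PLAYER_ID, TEAM_ID) pair appears once on both sides
ok = pd.merge(left, right, on=["PLAYER_ID", "TEAM_ID"], validate="one_to_one")

# a duplicated key pair on the right makes the same merge raise
dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup, on=["PLAYER_ID", "TEAM_ID"], validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

This fails loudly at merge time instead of silently producing extra rows you have to drop later.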
11.8. Visualizing merges#
left = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K4"],
"A": ["A0", "A1", "A2", "A4"],
"B": ["B0", "B1", "B2", "B4"],
}
)
right = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
left
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K4 | A4 | B4 |
right
key | C | D | |
---|---|---|---|
0 | K0 | C0 | D0 |
1 | K1 | C1 | D1 |
2 | K2 | C2 | D2 |
3 | K3 | C3 | D3 |
pd.merge(left, right, how='inner',)
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
pd.merge(left, right, how='outer')
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K4 | A4 | B4 | NaN | NaN |
4 | K3 | NaN | NaN | C3 | D3 |
pd.merge(left, right, how='left')
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K4 | A4 | B4 | NaN | NaN |
pd.merge(left, right, how='right')
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | NaN | NaN | C3 | D3 |
We can merge on multiple columns too.
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
pd.merge(left, right, on=["key1", "key2"])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(left, right, on=["key1", "key2"],how='outer')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
5 | K2 | K0 | NaN | NaN | C3 | D3 |
pd.merge(left, right,on = 'key1' )
key1 | key2_x | A | B | key2_y | C | D | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | K0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | K0 | C0 | D0 |
2 | K1 | K0 | A2 | B2 | K0 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | K0 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | K0 | C3 | D3 |
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
left
A | B | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 2 |
right
A | B | |
---|---|---|
0 | 4 | 2 |
1 | 5 | 2 |
2 | 6 | 2 |
pd.merge(left, right, on="B", how="outer")
A_x | B | A_y | |
---|---|---|---|
0 | 1 | 2 | 4 |
1 | 1 | 2 | 5 |
2 | 1 | 2 | 6 |
3 | 2 | 2 | 4 |
4 | 2 | 2 | 5 |
5 | 2 | 2 | 6 |
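The six rows above come from a many-to-many match: every left row with B == 2 pairs with every right row with B == 2, so we get 2 × 3 combinations. A quick check of that arithmetic on the same toy frames:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

merged = pd.merge(left, right, on="B", how="outer")
# every left match pairs with every right match on B == 2
assert merged.shape[0] == len(left) * len(right)
```

This blow-up is worth watching for: if your key column has repeated values on both sides, the merged row count can be much larger than either input.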
11.9. Questions After Class#
Important
several questions were easiest to answer within the narrative of the notes above.
11.9.1. How do I represent NaN as a variable? (Ex. df1819[‘TEAM_ID_y’==NaN]…something like that?)#
For that specific case, you can use the isna
method as I did above; if you otherwise need a missing-value constant, pandas provides
pd.NA
<NA>
and numpy provides
import numpy as np
np.nan
nan
Watch out though because they are not the same:
np.nan == pd.NA
<NA>
and this cannot even be used in an assert:
assert pd.NA == np.nan
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[44], line 1
----> 1 assert pd.NA == np.nan
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/missing.pyx:382, in pandas._libs.missing.NAType.__bool__()
TypeError: boolean value of NA is ambiguous
The pandas
pd.isna
function is robust, and it knows how numpy
works (because pandas is built on top of numpy and imports it).
pd.isna(np.nan)
True
pd.isna(pd.NA)
True
However, numpy
does not know about pandas:
np.isnan(pd.NA)
<NA>
and returns a value that cannot be used for indexing.
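For indexing, the safe pattern (a small sketch with made-up data) is to build the boolean mask with pd.isna or the .isna method, which handles both missing types:

```python
import pandas as pd
import numpy as np

# a frame containing both pd.NA and np.nan style missing values
df = pd.DataFrame({"team": ["A", pd.NA, "B"], "score": [1.0, np.nan, 3.0]})

# .isna() returns a plain boolean Series, safe to use as a mask
missing = df[df["score"].isna()]
```

The mask is True exactly where values are missing, regardless of which missing-value constant produced them.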
11.9.2. Is there an easier way to look at the changes with merge?#
Any of the tools we have to look at a DataFrame will work. I think using a small DataFrame is the best way to get a good idea of how they each work, so I included that. Overall, checking the shape gives a broad view and then comparing values gives the details.
11.9.3. Do merges also delete any other overlapping columns other than the one they are told to consolidate on?#
No, they are all copied into the result (with suffixes when the names collide).
11.9.4. in pd.merge(p18,p19, on=‘PLAYER_ID’,how=‘____’).shape is p18 and p19 supposed to be df_18 and df_19?#
Yes, sorry, I read them in using different names than were in my notes
11.9.5. Learning more about accessing data from databases#
We will access databases on Wednesday
11.9.6. How to merge data that does not line up as well as these two datasets?#
We will see how to merge on columns that have the same values but different column names this week. When there are no columns that match value to value, you have to transform and add new columns, combining what we learned about cleaning data to make them line up somehow. For example, maybe one dataset has dates like YYYY-MM-DD
and the other dataset has only months like “January”. You could split the full date into three columns, or parse the column as dates and pull out the month.
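A sketch of that kind of transformation (with hypothetical column names, not a dataset from class), using pandas datetime tools to derive a shared key:

```python
import pandas as pd

# dataset 1 has full dates; dataset 2 has only month names
dates = pd.DataFrame({"date": ["2019-01-05", "2019-02-10"], "sales": [3, 7]})
months = pd.DataFrame({"month": ["January", "February"], "temp": [30, 35]})

# derive a month-name column so the two frames share a key to merge on
dates["month"] = pd.to_datetime(dates["date"]).dt.month_name()
combined = pd.merge(dates, months, on="month")
```

The idea generalizes: clean or transform one side until both sides have a column whose values match, then merge on it.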
11.9.7. Why did we merge on ‘PLAYER_ID’?#
We were interested in linking based on the players.
11.9.8. What is the second best source after documentation? For ex: you found a parameter that might do what you want it to in pandas, but where do you go after to make sure? I’m not sure if I ran it, I’d specifically know if it “worked”#
Ideally, you should know what you are looking for to know if it worked or not. If you are not sure what you are looking for, that is a good time to try to think that through, or attend office hours. Next is to create an issue on your assignment repo if the question might reveal a solution.
11.9.9. I think I need to understand how GitHub works because I’m following the steps of the assignments and I feel like I’m doing something wrong every time I have to work on GitHub.#
If you are getting feedback, you are close enough. However, I am offering an extra session on GitHub on Friday.