8. Merging Data#
8.1. Merging Data#
The focus this week is on how to programmatically combine sources of data.
We will start by combining multiple tabular datasets and then see how to get data from other sources.
import pandas as pd
import sqlite3
from urllib import request
We are going to work with a set of datasets today that are stored in a repo.
course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'
We can load in two data sets of player information.
df_18 = pd.read_csv(course_data_url+'2018-players.csv')
df_19 = pd.read_csv(course_data_url+'2019-players.csv')
and take a peek at each
df_18.head(1)
|   | TEAM_ID | PLAYER_ID | SEASON |
|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 |
df_19.head(1)
|   | PLAYER_NAME | TEAM_ID | PLAYER_ID | SEASON |
|---|---|---|---|---|
| 0 | Royce O'Neale | 1610612762 | 1626220 | 2019 |
Important

Remember that `shape` is a property, not a method, so it does not need `()`.
Let’s make note of the shape of each
df_18.shape, df_19.shape
((748, 3), (626, 4))
8.1.1. What if we want to analyze them together?#
We can stack them, but this does not make it easy to see, for example, who changed teams.
pd.concat([df_18,df_19])
|   | TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | NaN |
| 1 | 1610612761 | 1627783 | 2018 | NaN |
| 2 | 1610612761 | 201188 | 2018 | NaN |
| 3 | 1610612761 | 201980 | 2018 | NaN |
| 4 | 1610612761 | 200768 | 2018 | NaN |
| ... | ... | ... | ... | ... |
| 621 | 1610612745 | 203461 | 2019 | Anthony Bennett |
| 622 | 1610612737 | 1629034 | 2019 | Ray Spalding |
| 623 | 1610612744 | 203906 | 2019 | Devyn Marble |
| 624 | 1610612753 | 1629755 | 2019 | Hassani Gravett |
| 625 | 1610612754 | 1629721 | 2019 | JaKeenan Gant |

1374 rows × 4 columns
We can see that this is the total number of rows:
748+626
1374
Note that this has the union of the columns from both DataFrames (rows get NaN in any column their DataFrame did not have) and the total number of rows.
8.1.2. How can we find which players changed teams?#
To do this we want to have one player column and a column with each year’s team.
We can use a merge to do that.
pd.merge(df_18,df_19).head(2)
| TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
|---|---|---|---|
If we merge them without any parameters, pandas tries to merge on all shared columns; here no rows match on all of them (the seasons differ), so the result is empty. We want to merge using the `PLAYER_ID` column; we would say that we are "merging on player ID", and we use the `on` parameter to do it. In this case, it looks for the values in the `PLAYER_ID` column that appear in both DataFrames and combines them into a single row.
pd.merge(df_18,df_19,on='PLAYER_ID').head(2)
|   | TEAM_ID_x | PLAYER_ID | SEASON_x | PLAYER_NAME | TEAM_ID_y | SEASON_y |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
Since there are other columns that appear in both DataFrames, they get a suffix, which by default is `_x` or `_y`; we can specify them, though.
pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_18','_19')).head(2)
|   | TEAM_ID_18 | PLAYER_ID | SEASON_18 | PLAYER_NAME | TEAM_ID_19 | SEASON_19 |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
By default, this uses an inner merge, so we only get the players that are in both datasets. If we want to see differences, we need another type of merge.
Some players still appear twice because they were in one of the datasets twice; this happens when a player plays for two teams in one season.
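A handy option for seeing where each row came from (an aside, not something we used in class) is the `indicator` parameter; `both_years` below is just an illustrative name.

# _merge is 'left_only', 'right_only', or 'both' for each row
both_years = pd.merge(df_18, df_19, on='PLAYER_ID', how='outer', indicator=True)
both_years['_merge'].value_counts()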
8.2. Merge type examples#
left = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
left
|   | key | A | B |
|---|---|---|---|
| 0 | K0 | A0 | B0 |
| 1 | K1 | A1 | B1 |
| 2 | K2 | A2 | B2 |
| 3 | K3 | A3 | B3 |
right
|   | key | C | D |
|---|---|---|---|
| 0 | K0 | C0 | D0 |
| 1 | K1 | C1 | D1 |
| 2 | K2 | C2 | D2 |
| 3 | K3 | C3 | D3 |
pd.merge(left,right)
|   | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | C2 | D2 |
| 3 | K3 | A3 | B3 | C3 | D3 |
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
result = pd.merge(left, right, on=["key1", "key2"])
result
|   | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(left, right, on=["key1", "key2"], how='outer')

|   | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |
| 5 | K2 | K0 | NaN | NaN | C3 | D3 |
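For completeness, a quick sketch of the other two merge types on the same toy frames (outputs not shown here):

# keeps all rows of left; right-only key combinations are dropped
pd.merge(left, right, on=["key1", "key2"], how='left')

# keeps all rows of right; left-only key combinations are dropped
pd.merge(left, right, on=["key1", "key2"], how='right')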
8.2.1. Which players played in 2018, but not 2019?#
We have different types of merges: inner keeps rows in both, outer keeps rows in either. Left and right keep all the rows of one DataFrame. We can use a left merge with `df_18` as the left DataFrame to see which players played only in 2018.
df_18_only = pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_18','_19'),how='left')
df_18_only.head(2)
|   | TEAM_ID_18 | PLAYER_ID | SEASON_18 | PLAYER_NAME | TEAM_ID_19 | SEASON_19 |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1.610613e+09 | 2019.0 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1.610613e+09 | 2019.0 |
df_18_only.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 754 entries, 0 to 753
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TEAM_ID_18 754 non-null int64
1 PLAYER_ID 754 non-null int64
2 SEASON_18 754 non-null int64
3 PLAYER_NAME 538 non-null object
4 TEAM_ID_19 538 non-null float64
5 SEASON_19 538 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 35.5+ KB
len(df_18_only[df_18_only['TEAM_ID_19'].isna()]['PLAYER_ID'].unique())
178
df_1819_outer = pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_18','_19'),how='outer')
Also, note that this has different types than before. There are some players who only played one season, so they have a NaN value in some columns, and pandas always casts a whole column to a type that can hold every value in it.
df_1819_outer.dtypes
TEAM_ID_18 float64
PLAYER_ID int64
SEASON_18 float64
PLAYER_NAME object
TEAM_ID_19 float64
SEASON_19 float64
dtype: object
NaN is a float:
import numpy as np
type(np.nan)
float
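An aside, in case the float casting is a problem for you: pandas also has a nullable integer dtype, `Int64` (with a capital I), that can hold missing values without casting the column to float. A minimal sketch:

# a nullable integer Series; the missing value displays as <NA>, not NaN
pd.Series([1610612761, None], dtype='Int64')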
df_1819_outer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 927 entries, 0 to 926
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TEAM_ID_18 754 non-null float64
1 PLAYER_ID 927 non-null int64
2 SEASON_18 754 non-null float64
3 PLAYER_NAME 711 non-null object
4 TEAM_ID_19 711 non-null float64
5 SEASON_19 711 non-null float64
dtypes: float64(4), int64(1), object(1)
memory usage: 43.6+ KB
Back to the question: the same approach works on the outer merge too. To pick out those rows:
df_1819_outer['TEAM_ID_19'].isna()
0 False
1 False
2 False
3 False
4 False
...
922 False
923 False
924 False
925 False
926 False
Name: TEAM_ID_19, Length: 927, dtype: bool
This gives us a boolean Series that is `False` where there is a value and `True` where there is a NaN. Since we applied this to the `TEAM_ID_19` column, it gives us a `True` for each row that represents a player who played in 2018, but not 2019.
However, this still has repetitions for the players that played for two teams in 2018. If we take the unique values from the `PLAYER_ID` column, we get the IDs for the players who played in 2018 but not 2019. Then we can use `len` (a built-in Python function) to get the number of players that played in 2018, but not 2019.
len(df_1819_outer[df_1819_outer['TEAM_ID_19'].isna()]['PLAYER_ID'].unique())
178
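As a quick sanity check, a set difference on the ID columns should give the same count (a sketch; `only_18_ids` is just an illustrative name):

# players whose ID appears in 2018 but not in 2019
only_18_ids = set(df_18['PLAYER_ID']) - set(df_19['PLAYER_ID'])
len(only_18_ids)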
df_18.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TEAM_ID 748 non-null int64
1 PLAYER_ID 748 non-null int64
2 SEASON 748 non-null int64
dtypes: int64(3)
memory usage: 17.7 KB
8.3. Getting Data from Databases#
8.3.1. What is a Database?#
A common attitude in Data Science is:
If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating. — Hadley Wickham
Businesses and research organizations nearly always have too much data to feasibly work without a database. Instead, they use different tools which are designed to scale to very large amounts of data. These tools are largely databases like Snowflake or Google’s BigQuery and distributed computing frameworks like Apache Spark.
Warning
We are going to focus on getting data out of a database so that you can use it, and on making sure you know what a database is.
You could spend a whole semester on databases:
CSC436 covers how to implement them in detail (recommended, but requires CSC212)
BAI456 covers only how to use them (counts for DS majors, but if you want to understand them more deeply, the CSC one is recommended)
For the purpose of this class the key attributes of a database are:
it is a collection of tables
the data is accessed live from disk (not RAM)
you send a query to the database to get the data (or your answer)
Databases can be designed in many different ways. For example, two popular ones:
SQLite is optimized for transactional workloads, meaning a high volume of requests that each insert or read a couple of things. This is good for, e.g., a web server.
DuckDB is optimized for analytical workloads, meaning a small number of requests that each require reading many records in the database. This is better for, e.g., data science.
Experimenting with DuckDB is a way to earn construct level 3
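If you want to try it, a minimal DuckDB sketch (this assumes `duckdb` is installed, e.g. via pip, and the table name `demo` is made up for the example):

import duckdb

con = duckdb.connect()  # an in-memory database
con.execute("CREATE TABLE demo AS SELECT 1 AS a, 2 AS b")
con.execute("SELECT * FROM demo").df()  # fetch the result as a pandas DataFrame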
8.3.2. Accessing a Database from Python#
We will use pandas again, as well as the `request` module from the `urllib` package and `sqlite3`.
Off the shelf, pandas cannot read databases by default; we'll use the `sqlite3` library here, but there are others, depending on the type of database.
First we need to download the database to work with it.
request.urlretrieve('https://github.com/rhodyprog4ds/rhodyds/raw/main/data/nba1819.db',
'nba1819.db')
('nba1819.db', <http.client.HTTPMessage at 0x7f508bffcd90>)
Next, we set up a connection that links the notebook to the database. To use it, we add a cursor.
conn = sqlite3.connect('nba1819.db')
cursor = conn.cursor()
We can use execute to pass SQL queries through the cursor to the database.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
<sqlite3.Cursor at 0x7f508bfd07a0>
Then we use `fetchall` to get the results of the query.
cursor.fetchall()
[('teams',),
('conferences',),
('playerGameStats2018',),
('playerGameStats2019',),
('teamGameStats2018',),
('teamGameStats2019',),
('playerTeams2018',),
('playerTeams2019',),
('teamDailyRankings2018',),
('teamDailyRankings2019',),
('playerNames',)]
If we fetch again, there is nothing to fetch; `fetchall` only pulls what was queued by the last `execute`.
cursor.fetchall()
[]
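An aside worth knowing: `sqlite3` supports parameterized queries with `?` placeholders, which is safer than building query strings by hand. A minimal sketch:

# the tuple supplies the value for the ? placeholder
cursor.execute("SELECT name FROM sqlite_master WHERE type=?", ('table',))
cursor.fetchall()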
8.4. Querying with pandas#
We can use `pd.read_sql` to send queries, get the results, and transform them into a DataFrame all at once.
We can pass the exact same queries if we want.
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';",conn)
|   | name |
|---|---|
| 0 | teams |
| 1 | conferences |
| 2 | playerGameStats2018 |
| 3 | playerGameStats2019 |
| 4 | teamGameStats2018 |
| 5 | teamGameStats2019 |
| 6 | playerTeams2018 |
| 7 | playerTeams2019 |
| 8 | teamDailyRankings2018 |
| 9 | teamDailyRankings2019 |
| 10 | playerNames |
Or we can get all of one of the tables:
pd.read_sql('SELECT * FROM teams',conn).head(1)
|   | index | LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | OWNER | GENERALMANAGER | HEADCOACH | DLEAGUEAFFILIATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | Tony Ressler | Travis Schlenk | Lloyd Pierce | Erie Bayhawks |
8.4.1. Which player was traded the most during the 2018 season? How many times?#
There is one row in the player tables per team a player played for per season, so if a player was traded (changed teams), they appear multiple times.
First, we’ll check the column names
pd.read_sql("SELECT * FROM playerTeams2018 LIMIT 1",conn)
|   | index | TEAM_ID | PLAYER_ID |
|---|---|---|---|
| 0 | 0 | 1610612761 | 202695 |
Then get the 2018 players; we only need the `PLAYER_ID` column for this question.
p18 = pd.read_sql("SELECT PLAYER_ID FROM playerTeams2018", conn)
Then we can use `value_counts`:
p18.value_counts().sort_values(ascending=False).head(10)
PLAYER_ID
1629150 4
202325 3
203092 3
201160 3
202328 3
1626150 3
1628393 3
202083 3
202692 3
203477 3
Name: count, dtype: int64
And we can get the player's name from the `playerNames` table (remember, our first query told us the names of all the tables).
pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = 1629150",conn)
|   | PLAYER_NAME |
|---|---|
| 0 | Emanuel Terry |
8.4.2. Did more players who changed teams from the 2018 season to the 2019 season stay in the same conferences or switch conferences?#
In the NBA, there are 30 teams organized into two conferences: East and West; the `conferences` table has the columns `TEAM_ID` and `CONFERENCE`.
Let's build a DataFrame that could answer the question.
I first pulled one row from each table I needed, to see the columns.
pd.read_sql('SELECT * FROM conferences LIMIT 1',conn)
|   | index | TEAM_ID | CONFERENCE |
|---|---|---|---|
| 0 | 0 | 1610612744 | West |
pd.read_sql('SELECT * FROM playerTeams2018 LIMIT 1',conn)
|   | index | TEAM_ID | PLAYER_ID |
|---|---|---|---|
| 0 | 0 | 1610612761 | 202695 |
pd.read_sql('SELECT * FROM playerTeams2019 LIMIT 1',conn)
|   | index | TEAM_ID | PLAYER_ID |
|---|---|---|---|
| 0 | 0 | 1610612762 | 1626220 |
Then I pulled the columns I needed from each of the 3 tables into a separate DataFrame.
conf_df = pd.read_sql('SELECT TEAM_ID,CONFERENCE FROM conferences',conn)
df18 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2018',conn)
df19 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2019',conn)
df18_c = pd.merge(df18,conf_df,on='TEAM_ID')
df19_c = pd.merge(df19,conf_df,on='TEAM_ID')
df1819_conf = pd.merge(df18_c,df19_c, on='PLAYER_ID',suffixes=('_2018','_2019'))
df1819_conf
|   | TEAM_ID_2018 | PLAYER_ID | CONFERENCE_2018 | TEAM_ID_2019 | CONFERENCE_2019 |
|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | East | 1610612746 | West |
| 1 | 1610612761 | 1627783 | East | 1610612761 | East |
| 2 | 1610612761 | 201188 | East | 1610612761 | East |
| 3 | 1610612763 | 201188 | West | 1610612761 | East |
| 4 | 1610612761 | 201980 | East | 1610612747 | West |
| ... | ... | ... | ... | ... | ... |
| 533 | 1610612739 | 1628021 | East | 1610612751 | East |
| 534 | 1610612739 | 201567 | East | 1610612739 | East |
| 535 | 1610612739 | 202684 | East | 1610612739 | East |
| 536 | 1610612739 | 1628424 | East | 1610612766 | East |
| 537 | 1610612739 | 1627819 | East | 1610612761 | East |

538 rows × 5 columns
Then I merged the conference with each set of player information on the teams. Then I merged the two expanded single-year DataFrames together.
Now, to answer the question, we have a bit more work to do. I'm going to use a `lambda` and `apply` to make a column that says "same" or "new" for the relative conference of the two seasons.
labels = {False:'new',True:'same'}
change_conf = lambda row: labels[row['CONFERENCE_2018']==row['CONFERENCE_2019']]
df1819_conf['conference_1819']= df1819_conf.apply(change_conf,axis=1)
df1819_conf.head()
|   | TEAM_ID_2018 | PLAYER_ID | CONFERENCE_2018 | TEAM_ID_2019 | CONFERENCE_2019 | conference_1819 |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | East | 1610612746 | West | new |
| 1 | 1610612761 | 1627783 | East | 1610612761 | East | same |
| 2 | 1610612761 | 201188 | East | 1610612761 | East | same |
| 3 | 1610612763 | 201188 | West | 1610612761 | East | new |
| 4 | 1610612761 | 201980 | East | 1610612747 | West | new |
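As an aside, the same column can be built without `apply`, using a vectorized comparison (a sketch; `same_conf` is just an illustrative name):

# compare the two columns elementwise, then map the booleans to labels
same_conf = df1819_conf['CONFERENCE_2018'] == df1819_conf['CONFERENCE_2019']
df1819_conf['conference_1819'] = same_conf.map({True: 'same', False: 'new'})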
Then I can use this DataFrame, grouped by my new column, to get the unique players in each situation, new or same conference.
df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique)
conference_1819
new [202695, 201188, 201980, 203961, 1626153, 1011...
same [1627783, 201188, 200768, 1627832, 201586, 162...
Name: PLAYER_ID, dtype: object
And finally, get the length of each of those lists.
df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819
new 119
same 385
Name: PLAYER_ID, dtype: int64
This, however, includes players who stayed on the same team, so we also need to split by who changed teams. First we add the team comparison column, then group by both and count unique players.
new_team = lambda row: labels[row['TEAM_ID_2018']==row['TEAM_ID_2019']]
df1819_conf['team_1819']= df1819_conf.apply(new_team,axis=1)
df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819 team_1819
new new 119
same new 135
same 263
Name: PLAYER_ID, dtype: int64
This is good, we could read the answer from here. It’s good practice, though, to be able to pull that value out programmatically.
player_counts_1819_team = df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
player_counts_1819_team.idxmax()
('same', 'same')
This tells us that the largest number of players stayed on the same team (and therefore the same conference). We're not interested in this, though; we're interested in those that changed teams, so we can drop the `(same, same)` value and then do this again.
player_counts_1819_team.drop(('same','same')).idxmax()
('same', 'new')
This tells us that more players changed teams within the same conference than changed teams and conferences. We can compare the two directly:
player_counts_1819_team['new','new'], player_counts_1819_team['same','new']
(119, 135)
Again, 135 is more than 119.
We can also make this a little neater to print as a DataFrame. If we use `reset_index` it will make a DataFrame, but the count column will still be named `PLAYER_ID`, so we can rename it.
player_counts_1819_team.reset_index().rename(columns={'PLAYER_ID':'num_players'})
|   | conference_1819 | team_1819 | num_players |
|---|---|---|---|
| 0 | new | new | 119 |
| 1 | same | new | 135 |
| 2 | same | same | 263 |
All in all, this gives us a good answer that we can both compute and display from the data, and it shows one way that using multiple data sources can help answer richer questions.
conn.close()
8.5. Questions After Class#
8.5.1. How to merge multiple DataFrames at a time / filter? I don't know if that is possible#
You can filter and merge in more complex ways in a database, at least in some cases, but in pandas `merge` is strictly two at a time; see the sketch below for chaining merges.
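One common workaround is to chain pairwise merges, for example with `functools.reduce` (a sketch; `df_a`, `df_b`, and `df_c` are hypothetical DataFrames that all share a `PLAYER_ID` column):

from functools import reduce

# merge the list two at a time, left to right
dfs = [df_a, df_b, df_c]
merged = reduce(lambda left, right: pd.merge(left, right, on='PLAYER_ID'), dfs)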
8.5.2. What do you recommend I know about SQL from someone who has not been exposed to it much before this class?#
Wizard Zines has a good reference, but it is not free. I have some of their other work, though, and it is all high quality; the preview is especially helpful. If the cost is prohibitive for you, but the preview looks like something you would like, send me an e-mail.
8.5.3. What other SQL ‘keywords’ in the queries are there? ex: SELECT, FROM, WHERE#
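Some of the most common others are `JOIN`, `GROUP BY`, `ORDER BY`, `LIMIT`, `HAVING`, and `DISTINCT`. A sketch combining a few of them against this database's tables (it assumes the `nba1819.db` connection is still open):

# count players per team in 2018 and show the five largest rosters
pd.read_sql(
    """
    SELECT TEAM_ID, COUNT(*) AS num_players
    FROM playerTeams2018
    GROUP BY TEAM_ID
    ORDER BY num_players DESC
    LIMIT 5
    """,
    conn,
)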
8.5.4. Is there a max DB size?#
Generally, no. In specific instances, yes. For example, MSFT SQL Server has a max size of 524,272 terabytes.
8.5.5. when can pandas not use SQL databases?#
The most important limit here is really that the computer you are working on will have limits on how much data you can pull from the database into local RAM.
8.5.6. When is sql more advantageous to use?#
Whenever you are working with a database; SQL is its query language.
8.5.7. how are databases and sql queries better than dataframes other than large datasets#
It allows you to have a single file instead of separate ones, but that’s it. The real motivation for databases is their advantages for large datasets.
On slower computers with less memory this was more important.
Learning databases is also a good way to learn about schemas and structure, if you learn them in depth, like in a full course.
8.5.8. Is it possible to do something analogous to a merge by making a SQL query through pandas?#
Yes! You can send any SQL query through pandas; see the sketch below.
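For example, a sketch of an inner merge done in SQL instead of pandas (again assuming the connection is still open):

# an inner join on PLAYER_ID, analogous to pd.merge(..., on='PLAYER_ID')
pd.read_sql(
    """
    SELECT p18.PLAYER_ID,
           p18.TEAM_ID AS TEAM_ID_18,
           p19.TEAM_ID AS TEAM_ID_19
    FROM playerTeams2018 AS p18
    JOIN playerTeams2019 AS p19
      ON p18.PLAYER_ID = p19.PLAYER_ID
    """,
    conn,
)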
8.5.9. Does fetchall() take the names (titles) for the data from the database?#
It returns the output of whatever query was last executed; in our first example, that happened to be the table names.
8.5.10. When you run database is there a clear way to see what structure the data is in?#
You have to run queries to see the data; for example, querying `sqlite_master` for the table names, as we did above, is one way to see the structure.
8.5.11. how can I practice on my own?#
Download the class database (nba1819.db) and try writing your own queries against it.
8.5.12. when is the first portfolio check again?#
Just posted to the portfolio page.
8.5.13. Is there a guideline for asking queries that we will learn about?#
The queries should match your questions.
8.5.14. Why do we use pandas to read a database if the database is too large to store in memory?#
It could be that the whole database is too large, but you run a query for a subset that does fit in memory, and then you want to plot it using seaborn.
For today, we also did that so that you can get a chance to see some SQL queries, without having to install a separate program on your computers.