9. Merging and Databases#
9.1. Announcements#
Assignment 3 is graded, but only as of today, so A4 is extended until tomorrow so that you can incorporate your A3 feedback into A4 (in particular, extending your EDA if you did not earn level 2 for summarize and visualize in A3).
Your achievement trackers are updated.
Take a few minutes to think about the following questions and make a few notes for yourself wherever you need them to be (a planner, calendar, etc.).
What achievements have you earned?
Does your tracking repo seem accurate to what you’ve done? If not, make an issue to ask about it.
Are you on track to earn the grade you want in this class?
If not, what will you need to do (respond more in class, submit more assignments, use your portfolio to catch up) to get back on track?
If you are on track and you want to earn above a B, take a minute to think about your portfolio. (Tip: post an idea as an issue to get early feedback and help shape your idea.)
9.2. Merging Data#
The focus this week is on how to programmatically combine sources of data.
We will start by looking at combining multiple tabular data formats and see how to get data from other sources.
import pandas as pd
import sqlite3
from urllib import request
We're going to work with a set of datasets today that are stored in a repo.
course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'
We can load in two data sets of player information.
df_18 = pd.read_csv(course_data_url+'2018-players.csv')
df_19 = pd.read_csv(course_data_url+'2019-players.csv')
and take a peek at each
df_18.head()
|   | TEAM_ID | PLAYER_ID | SEASON |
|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 |
| 1 | 1610612761 | 1627783 | 2018 |
| 2 | 1610612761 | 201188 | 2018 |
| 3 | 1610612761 | 201980 | 2018 |
| 4 | 1610612761 | 200768 | 2018 |
df_19.head()
|   | PLAYER_NAME | TEAM_ID | PLAYER_ID | SEASON |
|---|---|---|---|---|
| 0 | Royce O'Neale | 1610612762 | 1626220 | 2019 |
| 1 | Bojan Bogdanovic | 1610612762 | 202711 | 2019 |
| 2 | Rudy Gobert | 1610612762 | 203497 | 2019 |
| 3 | Donovan Mitchell | 1610612762 | 1628378 | 2019 |
| 4 | Mike Conley | 1610612762 | 201144 | 2019 |
Important

Remember, `columns` is an attribute, so it does not need `()`.
df_19.columns
Index(['PLAYER_NAME', 'TEAM_ID', 'PLAYER_ID', 'SEASON'], dtype='object')
Let’s make note of the shape of each
df_18.shape, df_19.shape
((748, 3), (626, 4))
9.2.1. What if we want to analyze them together?#
We can stack them, but this does not make it easy to see, for example, who changed teams.
pd.concat([df_18,df_19])
|   | TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | NaN |
| 1 | 1610612761 | 1627783 | 2018 | NaN |
| 2 | 1610612761 | 201188 | 2018 | NaN |
| 3 | 1610612761 | 201980 | 2018 | NaN |
| 4 | 1610612761 | 200768 | 2018 | NaN |
| ... | ... | ... | ... | ... |
| 621 | 1610612745 | 203461 | 2019 | Anthony Bennett |
| 622 | 1610612737 | 1629034 | 2019 | Ray Spalding |
| 623 | 1610612744 | 203906 | 2019 | Devyn Marble |
| 624 | 1610612753 | 1629755 | 2019 | Hassani Gravett |
| 625 | 1610612754 | 1629721 | 2019 | JaKeenan Gant |

1374 rows × 4 columns
pd.concat([df_18,df_19]).shape
(1374, 4)
Note that this has all of the columns from both DataFrames (they share three columns, so the stacked frame has four total) and the total number of rows.
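As an aside, if we only wanted the columns the two DataFrames share, `pd.concat` takes a `join` parameter. This is a minimal sketch, not something we need for our question:

# keep only the columns present in both DataFrames (drops PLAYER_NAME)
pd.concat([df_18, df_19], join='inner').shape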
9.2.2. How can we find which players changed teams?#
To do this we want to have one player column and a column with each year’s team.
We can use a merge to do that.
pd.merge(df_18, df_19).head(2)

|   | TEAM_ID | PLAYER_ID | SEASON | PLAYER_NAME |
|---|---|---|---|---|
If we merge them without any parameters, pandas tries to merge on all shared columns; here that includes `SEASON`, which never matches across the two years, so the result above is empty. We want to merge using the `PLAYER_ID` column; we say that we are "merging on player ID" and use the `on` parameter to do it. In this case, pandas looks for the values in the `PLAYER_ID` column that appear in both DataFrames and combines them into a single row.
pd.merge(df_18,df_19,on='PLAYER_ID')
|   | TEAM_ID_x | PLAYER_ID | SEASON_x | PLAYER_NAME | TEAM_ID_y | SEASON_y |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | 2018 | Kawhi Leonard | 1610612746 | 2019 |
| 1 | 1610612761 | 1627783 | 2018 | Pascal Siakam | 1610612761 | 2019 |
| 2 | 1610612761 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 3 | 1610612763 | 201188 | 2018 | Marc Gasol | 1610612761 | 2019 |
| 4 | 1610612761 | 201980 | 2018 | Danny Green | 1610612747 | 2019 |
| ... | ... | ... | ... | ... | ... | ... |
| 533 | 1610612760 | 203583 | 2018 | Abdul Gaddy | 1610612760 | 2019 |
| 534 | 1610612760 | 203460 | 2018 | Andre Roberson | 1610612760 | 2019 |
| 535 | 1610612755 | 203658 | 2018 | Norvel Pelle | 1610612755 | 2019 |
| 536 | 1610612741 | 1627756 | 2018 | Denzel Valentine | 1610612741 | 2019 |
| 537 | 1610612754 | 203912 | 2018 | C.J. Wilcox | 1610612754 | 2019 |

538 rows × 6 columns
Since other columns appear in both DataFrames, they get a suffix, which by default is `_x` or `_y`; we can specify our own suffixes, though.
df_1819_inner = pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_2018','_2019'))
df_1819_inner.shape
(538, 6)
By default, this uses an inner merge, so we only get the players that are in both datasets. If we want to see the differences, we need another type of merge.

Some players still appear twice because they were in one of the datasets twice; this happens when a player plays for two teams in one season.
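If those duplicates get in the way of an analysis, one option is to deduplicate on the merge key before merging. A sketch, which keeps an arbitrary one of a traded player's teams:

# keep one row per player per season before merging
df_18_dedup = df_18.drop_duplicates(subset='PLAYER_ID')
df_19_dedup = df_19.drop_duplicates(subset='PLAYER_ID')
pd.merge(df_18_dedup, df_19_dedup, on='PLAYER_ID', suffixes=('_2018', '_2019')).shape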
9.2.3. Which players played in 2018, but not 2019?#
There are different types of merges: inner keeps only the rows that match in both, outer keeps rows that appear in either. Left and right merges keep all the rows of one DataFrame. We can use a left merge with `df_18` as the left DataFrame to see which players played only in 2018; first, though, let's look at the outer merge.
df_1819_outer = pd.merge(df_18,df_19,how='outer',on='PLAYER_ID',suffixes=('_2018','_2019'),)
df_1819_outer
|   | TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 |
|---|---|---|---|---|---|---|
| 0 | 1.610613e+09 | 202695 | 2018.0 | Kawhi Leonard | 1.610613e+09 | 2019.0 |
| 1 | 1.610613e+09 | 1627783 | 2018.0 | Pascal Siakam | 1.610613e+09 | 2019.0 |
| 2 | 1.610613e+09 | 201188 | 2018.0 | Marc Gasol | 1.610613e+09 | 2019.0 |
| 3 | 1.610613e+09 | 201188 | 2018.0 | Marc Gasol | 1.610613e+09 | 2019.0 |
| 4 | 1.610613e+09 | 201980 | 2018.0 | Danny Green | 1.610613e+09 | 2019.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 922 | NaN | 1629097 | NaN | Terry Larrier | 1.610613e+09 | 2019.0 |
| 923 | NaN | 203461 | NaN | Anthony Bennett | 1.610613e+09 | 2019.0 |
| 924 | NaN | 203906 | NaN | Devyn Marble | 1.610613e+09 | 2019.0 |
| 925 | NaN | 1629755 | NaN | Hassani Gravett | 1.610613e+09 | 2019.0 |
| 926 | NaN | 1629721 | NaN | JaKeenan Gant | 1.610613e+09 | 2019.0 |

927 rows × 6 columns
Also, note that this has different types than before. Some players only played one season, so they have NaN values in some columns, and pandas always casts a whole column to a single type, so the columns containing NaN become float.
df_1819_inner.dtypes
TEAM_ID_2018 int64
PLAYER_ID int64
SEASON_2018 int64
PLAYER_NAME object
TEAM_ID_2019 int64
SEASON_2019 int64
dtype: object
df_1819_outer.dtypes
TEAM_ID_2018 float64
PLAYER_ID int64
SEASON_2018 float64
PLAYER_NAME object
TEAM_ID_2019 float64
SEASON_2019 float64
dtype: object
Remember that `nan` is a float:
import numpy as np
type(np.nan)
float
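If the float cast is a problem, pandas also has a nullable integer dtype that can hold missing values. A minimal sketch, converting one column:

# 'Int64' (capital I) is pandas' nullable integer type; NaN becomes <NA>
df_1819_outer['SEASON_2019'].astype('Int64').dtype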
Back to the question: we can also use a left merge, then pick out the rows where the 2019 season is missing:
df_1819left = pd.merge(df_18,df_19,how='left',on='PLAYER_ID',suffixes=('_2018','_2019'),)
df_18only = df_1819left[df_1819left['SEASON_2019'].isna()]
df_18only
|   | TEAM_ID_2018 | PLAYER_ID | SEASON_2018 | PLAYER_NAME | TEAM_ID_2019 | SEASON_2019 |
|---|---|---|---|---|---|---|
| 9 | 1610612761 | 202391 | 2018 | NaN | NaN | NaN |
| 11 | 1610612761 | 201975 | 2018 | NaN | NaN | NaN |
| 18 | 1610612744 | 101106 | 2018 | NaN | NaN | NaN |
| 23 | 1610612744 | 2733 | 2018 | NaN | NaN | NaN |
| 24 | 1610612744 | 201973 | 2018 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 749 | 1610612752 | 1629246 | 2018 | NaN | NaN | NaN |
| 750 | 1610612748 | 1629159 | 2018 | NaN | NaN | NaN |
| 751 | 1610612762 | 1629163 | 2018 | NaN | NaN | NaN |
| 752 | 1610612743 | 1629150 | 2018 | NaN | NaN | NaN |
| 753 | 1610612738 | 1629167 | 2018 | NaN | NaN | NaN |

216 rows × 6 columns
len(df_18only['PLAYER_ID'].unique())
178
df_18only['PLAYER_ID'].value_counts()
PLAYER_ID
1629150 4
202325 3
201160 3
1628393 3
202328 3
..
1628515 1
1627816 1
1628979 1
2736 1
1629167 1
Name: count, Length: 178, dtype: int64
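As a cross-check, we can find the same players without a merge by using `isin`. A minimal sketch:

# rows of df_18 whose PLAYER_ID never appears in df_19
only_18 = df_18[~df_18['PLAYER_ID'].isin(df_19['PLAYER_ID'])]
only_18['PLAYER_ID'].nunique()

This should give the same 178 unique players we found with the left merge.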
9.3. Getting Data from Databases#
9.3.1. What is a Database?#
A common attitude in Data Science is:
If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating. — Hadley Wickham
Businesses and research organizations nearly always have too much data to feasibly work without a database. Instead, they use different tools which are designed to scale to very large amounts of data. These tools are largely databases like Snowflake or Google’s BigQuery and distributed computing frameworks like Apache Spark.
Warning
We are going to focus on how to get data out of a database so that you can use it, and on making sure you know what a database is.
You could spend a whole semester on databases:
CSC436 covers how to implement them in detail (recommended, but requires CSC212)
BAI456 covers only how to use them (it counts for DS majors, but if you want to understand databases more deeply, the CSC course is recommended)
For the purpose of this class the key attributes of a database are:
it is a collection of tables
the data is accessed live from disk (not RAM)
you send a query to the database to get the data (or your answer)
Databases can be designed in many different ways. For example, here are two popular ones:

SQLite is optimized for transactional workloads, meaning a high volume of requests that each insert or read a couple of things. This is good for, e.g., a web server.

DuckDB is optimized for analytical workloads, meaning a small number of requests that each read many records in the database. This is better for, e.g., data science.
Experimenting with DuckDB is a way to earn construct level 3
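As a minimal sketch of what that could look like, assuming the `duckdb` package is installed (`pip install duckdb`):

import duckdb

# an in-memory analytical database; DuckDB can also run SQL directly
# over a pandas DataFrame in scope, like df_18 from earlier
duck_conn = duckdb.connect()
duck_conn.execute("SELECT COUNT(DISTINCT PLAYER_ID) FROM df_18").fetchall()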
9.3.2. Accessing a Database from Python#
We will use pandas again, as well as the `request` module from the `urllib` package and `sqlite3`.

Off the shelf, pandas cannot read databases directly. We'll use the `sqlite3` library, but there are others, depending on the type of database.
First we need to download the database to work with it.
request.urlretrieve('https://github.com/rhodyprog4ds/rhodyds/raw/main/data/nba1819.db',
'nba1819.db')
('nba1819.db', <http.client.HTTPMessage at 0x7f10ed5f5c10>)
Next, we set up a connection that links the notebook to the database. To use it, we add a cursor.
conn = sqlite3.connect('nba1819.db')
cursor = conn.cursor()
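As an aside, it is good practice to close a connection when you are done with it. A sketch using `contextlib.closing`, so the connection is closed even if a query raises an error:

from contextlib import closing

# open, use, and close a throwaway connection
with closing(sqlite3.connect('nba1819.db')) as tmp_conn:
    tmp_cursor = tmp_conn.cursor()
    tmp_cursor.execute("SELECT COUNT(*) FROM sqlite_master")
    print(tmp_cursor.fetchone())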
We can use execute to pass SQL queries through the cursor to the database.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
<sqlite3.Cursor at 0x7f10ed60d1f0>
Then we use `fetchall` to get the results of the query.
cursor.fetchall()
[('teams',),
('conferences',),
('playerGameStats2018',),
('playerGameStats2019',),
('teamGameStats2018',),
('teamGameStats2019',),
('playerTeams2018',),
('playerTeams2019',),
('teamDailyRankings2018',),
('teamDailyRankings2019',),
('playerNames',)]
If we fetch again, there is nothing to fetch. Fetch pulls what was queued by execute.
cursor.fetchall()
[]
We can run another query with execute, then fetch that result. The schema of a database is the description of its setup and layout, and the `*` in the query means to select all columns. Many database systems expose the schema through an `INFORMATION_SCHEMA` table, but as the error below shows, SQLite does not.
cursor.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS ")
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
Cell In[25], line 1
----> 1 cursor.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS ")
OperationalError: no such table: INFORMATION_SCHEMA.COLUMNS
Since that query failed, fetching now returns nothing.
cursor.fetchall()
[]
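SQLite's own way to get column information is the `PRAGMA table_info` statement. A sketch for the `teams` table:

# one tuple per column: (cid, name, type, notnull, default_value, pk)
cursor.execute("PRAGMA table_info(teams)")
cursor.fetchall()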
9.4. Querying with pandas#
We can use `pd.read_sql` to send queries, get the results, and transform them into a DataFrame all at once.

We can pass the exact same queries if we want.
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';",conn)
|   | name |
|---|---|
| 0 | teams |
| 1 | conferences |
| 2 | playerGameStats2018 |
| 3 | playerGameStats2019 |
| 4 | teamGameStats2018 |
| 5 | teamGameStats2019 |
| 6 | playerTeams2018 |
| 7 | playerTeams2019 |
| 8 | teamDailyRankings2018 |
| 9 | teamDailyRankings2019 |
| 10 | playerNames |
pd.read_sql('SELECT * FROM teams',conn).head(1)
|   | index | LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | OWNER | GENERALMANAGER | HEADCOACH | DLEAGUEAFFILIATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | Tony Ressler | Travis Schlenk | Lloyd Pierce | Erie Bayhawks |
9.4.1. Which player was traded the most during the 2018 season? How many times?#
There is one row in the players table per team a player played for per season, so if a player was traded (changed teams) mid-season, they appear multiple times.
First, we’ll check the column names
pd.read_sql("SELECT * FROM playerTeams2018 LIMIT 1",conn)
|   | index | TEAM_ID | PLAYER_ID |
|---|---|---|---|
| 0 | 0 | 1610612761 | 202695 |
Then we get the 2018 players; we only need the `PLAYER_ID` column for this question.
p18 = pd.read_sql("SELECT PLAYER_ID FROM playerTeams2018", conn)
Then we can use `value_counts`:
p18.value_counts().sort_values(ascending=False).head(10)
PLAYER_ID
1629150 4
202325 3
203092 3
201160 3
202328 3
1626150 3
1628393 3
202083 3
202692 3
203477 3
Name: count, dtype: int64
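We could also push the counting into SQL itself. A sketch using `GROUP BY`, which should agree with the `value_counts` result above:

# count appearances per player in SQL instead of in pandas
pd.read_sql("""SELECT PLAYER_ID, COUNT(*) AS n_teams
               FROM playerTeams2018
               GROUP BY PLAYER_ID
               ORDER BY n_teams DESC
               LIMIT 5""", conn)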
And we can get the player's name from the `playerNames` table (remember, our first query told us the names of all the tables).
pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = 1629150",conn)
|   | PLAYER_NAME |
|---|---|
| 0 | Emanuel Terry |
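A small aside: instead of pasting the ID into the query string, we can pass it through the `params` argument, which is safer. A sketch:

# '?' is sqlite3's placeholder; params supplies the value
pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = ?",
            conn, params=(1629150,))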
9.4.2. Did more players who changed teams from the 2018 season to the 2019 season stay in the same conferences or switch conferences?#
In the NBA, there are 30 teams organized into two conferences: East and West. The `conferences` table has the columns `TEAM_ID` and `CONFERENCE`.

Let's build a DataFrame that could answer the question.
I first pulled 1 row from each table I needed, to see the columns.
pd.read_sql('SELECT * FROM conferences LIMIT 1',conn)
|   | index | TEAM_ID | CONFERENCE |
|---|---|---|---|
| 0 | 0 | 1610612744 | West |
pd.read_sql('SELECT * FROM playerTeams2018 LIMIT 1',conn)
|   | index | TEAM_ID | PLAYER_ID |
|---|---|---|---|
| 0 | 0 | 1610612761 | 202695 |
pd.read_sql('SELECT * FROM playerTeams2019 LIMIT 1',conn)
|   | index | TEAM_ID | PLAYER_ID |
|---|---|---|---|
| 0 | 0 | 1610612762 | 1626220 |
Then I pulled the columns I needed from each of the 3 tables into a separate DataFrame.
conf_df = pd.read_sql('SELECT TEAM_ID,CONFERENCE FROM conferences',conn)
df18 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2018',conn)
df19 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2019',conn)
df18_c = pd.merge(df18,conf_df,on='TEAM_ID')
df19_c = pd.merge(df19,conf_df,on='TEAM_ID')
df1819_conf = pd.merge(df18_c,df19_c, on='PLAYER_ID',suffixes=('_2018','_2019'))
df1819_conf
|   | TEAM_ID_2018 | PLAYER_ID | CONFERENCE_2018 | TEAM_ID_2019 | CONFERENCE_2019 |
|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | East | 1610612746 | West |
| 1 | 1610612761 | 1627783 | East | 1610612761 | East |
| 2 | 1610612761 | 201188 | East | 1610612761 | East |
| 3 | 1610612763 | 201188 | West | 1610612761 | East |
| 4 | 1610612761 | 201980 | East | 1610612747 | West |
| ... | ... | ... | ... | ... | ... |
| 533 | 1610612739 | 1628021 | East | 1610612751 | East |
| 534 | 1610612739 | 201567 | East | 1610612739 | East |
| 535 | 1610612739 | 202684 | East | 1610612739 | East |
| 536 | 1610612739 | 1628424 | East | 1610612766 | East |
| 537 | 1610612739 | 1627819 | East | 1610612761 | East |

538 rows × 5 columns
Then I merged the conference with each year's player information on `TEAM_ID`, and then merged the two expanded single-year DataFrames together on `PLAYER_ID`.
Now, to answer the question, we have a bit more work to do. I'm going to use a `lambda` and `apply` to make a column that says "same" or "new" for the relative conference of the two seasons.
labels = {False:'new',True:'same'}
change_conf = lambda row: labels[row['CONFERENCE_2018']==row['CONFERENCE_2019']]
df1819_conf['conference_1819']= df1819_conf.apply(change_conf,axis=1)
df1819_conf.head()
|   | TEAM_ID_2018 | PLAYER_ID | CONFERENCE_2018 | TEAM_ID_2019 | CONFERENCE_2019 | conference_1819 |
|---|---|---|---|---|---|---|
| 0 | 1610612761 | 202695 | East | 1610612746 | West | new |
| 1 | 1610612761 | 1627783 | East | 1610612761 | East | same |
| 2 | 1610612761 | 201188 | East | 1610612761 | East | same |
| 3 | 1610612763 | 201188 | West | 1610612761 | East | new |
| 4 | 1610612761 | 201980 | East | 1610612747 | West | new |
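The same column can be built without `apply`, by comparing the two columns directly and mapping the booleans to labels; this is vectorized and usually faster. A sketch:

# element-wise comparison gives a boolean Series; map turns it into labels
same_conf = df1819_conf['CONFERENCE_2018'] == df1819_conf['CONFERENCE_2019']
df1819_conf['conference_1819'] = same_conf.map(labels)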
Then I can use this DataFrame grouped by my new column to get the unique players in each situation, new or same conference.
df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique)
conference_1819
new [202695, 201188, 201980, 203961, 1626153, 1011...
same [1627783, 201188, 200768, 1627832, 201586, 162...
Name: PLAYER_ID, dtype: object
And finally, get the length of each of those lists.
df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819
new 119
same 385
Name: PLAYER_ID, dtype: int64
This, however, includes players who stayed on the same team, so we also need to split by who changed teams. First we add the team comparison column, then group by both and count the unique players.
new_team = lambda row: labels[row['TEAM_ID_2018']==row['TEAM_ID_2019']]
df1819_conf['team_1819']= df1819_conf.apply(new_team,axis=1)
df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819 team_1819
new new 119
same new 135
same 263
Name: PLAYER_ID, dtype: int64
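As a shortcut, `nunique` counts distinct values in one step, equivalent to `.apply(pd.unique).apply(len)`. A sketch:

# count distinct players per (conference change, team change) group
df1819_conf.groupby(['conference_1819', 'team_1819'])['PLAYER_ID'].nunique()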
This is good: we could read the answer from here. It's good practice, though, to be able to pull that value out programmatically.
player_counts_1819_team = df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
player_counts_1819_team.idxmax()
('same', 'same')
This tells us that the largest number of players stayed on the same team (and therefore the same conference). We're not interested in this, though; we're interested in those that changed teams, so we can drop the `('same','same')` value and then do this again.
player_counts_1819_team.drop(('same','same')).idxmax()
('same', 'new')
This tells us that more players changed teams within the same conference than changed teams and conferences. We can compare the two directly:
player_counts_1819_team['new','new'], player_counts_1819_team['same','new']
(119, 135)
Again, 135 is more than 119.
We can also make this a little neater to print as a DataFrame. If we use `reset_index` it will make a DataFrame, but the count column will still be named `PLAYER_ID`, so we rename it.
player_counts_1819_team.reset_index().rename(columns={'PLAYER_ID':'num_players'})
|   | conference_1819 | team_1819 | num_players |
|---|---|---|---|
| 0 | new | new | 119 |
| 1 | same | new | 135 |
| 2 | same | same | 263 |
All in all, this gives us an answer that we can compute and display directly from the data, and it shows one way that using multiple data sources can help answer richer questions.
9.5. Questions After Class#
9.5.1. Is there a max DB size?#
Generally, no. In specific implementations, yes. For example, Microsoft SQL Server has a maximum database size of 524,272 terabytes.
9.5.2. When can pandas not use SQL databases?#
The most important limit here is really that the computer you are working on has limits on how much data you can pull from the database into local RAM.
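One mitigation when a query result is too big for RAM is the `chunksize` parameter of `pd.read_sql`, which returns the result in pieces. A minimal sketch, where `process` is a hypothetical placeholder for your per-chunk analysis:

# iterate over the result 1000 rows at a time instead of loading it all
for chunk in pd.read_sql("SELECT * FROM playerGameStats2018", conn, chunksize=1000):
    process(chunk)  # hypothetical per-chunk processing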
9.5.3. How do you learn more about different queries you can use for SQL?#
Wizard Zines has a good reference, but it is not free. I have some of their other work, though, and it is all high quality, and the free preview is especially helpful. If the cost is prohibitive for you but the preview looks like something you would find useful, send me an e-mail.