10. Merging Data & Databases#

10.1. Merging Data#

import pandas as pd
import sqlite3
from urllib import request

We’re going to work with a set of datasets today that are stored in a GitHub repo.

course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'

We can load in two data sets of NBA player information.

df_18 = pd.read_csv(course_data_url+'2018-players.csv')
df_19 = pd.read_csv(course_data_url+'2019-players.csv')

and take a peek at each

df_18.head()
TEAM_ID PLAYER_ID SEASON
0 1610612761 202695 2018
1 1610612761 1627783 2018
2 1610612761 201188 2018
3 1610612761 201980 2018
4 1610612761 200768 2018
df_19.head()
PLAYER_NAME TEAM_ID PLAYER_ID SEASON
0 Royce O'Neale 1610612762 1626220 2019
1 Bojan Bogdanovic 1610612762 202711 2019
2 Rudy Gobert 1610612762 203497 2019
3 Donovan Mitchell 1610612762 1628378 2019
4 Mike Conley 1610612762 201144 2019

Important

Remember shape is a property, not a method, so it does not need ()

Let’s make note of the shape of each

df_18.shape, df_19.shape
((748, 3), (626, 4))

This created a tuple of the two shapes (each DataFrame’s shape is itself a tuple).
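
Since each shape is a tuple, we can unpack it; a minimal sketch:

# unpack each shape into its row and column counts
(rows_18, cols_18), (rows_19, cols_19) = df_18.shape, df_19.shape
rows_18 + rows_19
1374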

10.1.1. What if we want to analyze them together?#

We can stack them, but this does not make it easy to see, for example, who changed teams.

pd.concat([df_18,df_19])
TEAM_ID PLAYER_ID SEASON PLAYER_NAME
0 1610612761 202695 2018 NaN
1 1610612761 1627783 2018 NaN
2 1610612761 201188 2018 NaN
3 1610612761 201980 2018 NaN
4 1610612761 200768 2018 NaN
... ... ... ... ...
621 1610612745 203461 2019 Anthony Bennett
622 1610612737 1629034 2019 Ray Spalding
623 1610612744 203906 2019 Devyn Marble
624 1610612753 1629755 2019 Hassani Gravett
625 1610612754 1629721 2019 JaKeenan Gant

1374 rows × 4 columns

pd.concat([df_18,df_19]).shape
(1374, 4)

We can see that this is the total number of rows:

748+626
1374

Note that this has the union of the columns (the three shared columns, plus PLAYER_NAME, which only df_19 has) and the total number of rows.
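
We can check the column claim directly; a minimal sketch:

# the stacked frame's columns are the union of the two frames' columns
stacked = pd.concat([df_18, df_19])
set(stacked.columns) == set(df_18.columns) | set(df_19.columns)
True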

10.1.2. How can we find which players changed teams?#

To do this we want to have one player column and a column with each year’s team.

We can use a merge to do that.

pd.merge(df_18,df_19).head(2)
TEAM_ID PLAYER_ID SEASON PLAYER_NAME

If we merge them without any parameters, it tries to merge on all shared columns; since no row has the same TEAM_ID, PLAYER_ID, and SEASON in both years, the result is empty. We want to merge using the PLAYER_ID column; we would say that we are “merging on player ID,” and we use the on parameter to do it. In this case, it looks for the values in the PLAYER_ID column that appear in both DataFrames and combines them into a single row.

pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_18','_19')).head(2)
TEAM_ID_18 PLAYER_ID SEASON_18 PLAYER_NAME TEAM_ID_19 SEASON_19
0 1610612761 202695 2018 Kawhi Leonard 1610612746 2019
1 1610612761 1627783 2018 Pascal Siakam 1610612761 2019

Since there are other columns that appear in both DataFrames, they get a suffix, which by default is _x or _y; we can specify them with the suffixes parameter, as above.
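
For comparison, a minimal sketch with the suffixes argument omitted shows the defaults:

# with no suffixes argument, overlapping columns get '_x' and '_y'
pd.merge(df_18, df_19, on='PLAYER_ID').columns
Index(['TEAM_ID_x', 'PLAYER_ID', 'SEASON_x', 'PLAYER_NAME', 'TEAM_ID_y',
       'SEASON_y'],
      dtype='object')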

pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_18','_19')).shape
(538, 6)

10.1.3. Which players played in 2018, but not 2019?#

We have different types of merges: inner keeps only the rows that match in both, outer keeps rows from either. Left and right keep all the rows of one DataFrame. We can use how='left' with df_18 as the left DataFrame to see which players played only in 2018.

pd.merge(df_18,df_19,on='PLAYER_ID',how='left',suffixes=('_18','_19')).shape
(754, 6)
pd.merge(df_18,df_19,on='PLAYER_ID',how='left',suffixes=('_18','_19')).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 754 entries, 0 to 753
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   TEAM_ID_18   754 non-null    int64  
 1   PLAYER_ID    754 non-null    int64  
 2   SEASON_18    754 non-null    int64  
 3   PLAYER_NAME  538 non-null    object 
 4   TEAM_ID_19   538 non-null    float64
 5   SEASON_19    538 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 35.5+ KB

If we save this to a variable, we can answer our question. (Note that the left merge has 754 rows, more than df_18’s 748, because a player who played for multiple teams in 2019 matches multiple times.)

df_18_only = pd.merge(df_18,df_19,on='PLAYER_ID',suffixes=('_18','_19'),how='left')
df_18_only.head(2)
TEAM_ID_18 PLAYER_ID SEASON_18 PLAYER_NAME TEAM_ID_19 SEASON_19
0 1610612761 202695 2018 Kawhi Leonard 1.610613e+09 2019.0
1 1610612761 1627783 2018 Pascal Siakam 1.610613e+09 2019.0
len(df_18_only[df_18_only['TEAM_ID_19'].isna()]['PLAYER_ID'].unique())
178

Also, note that this has different types than before. There are some players who only played one season, so they have a NaN value in some columns; since pandas always casts a whole column, the int64 columns that came from df_19 become float64 to hold the NaNs.
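
As an aside, merge’s indicator parameter offers another way to find the 2018-only players; a minimal sketch:

# indicator=True adds a _merge column marking each row's source
marked = pd.merge(df_18, df_19, on='PLAYER_ID', how='left', indicator=True)
marked.loc[marked['_merge'] == 'left_only', 'PLAYER_ID'].nunique()
178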

10.2. Getting Data from Databases#

10.2.1. What is a Database?#

A common attitude in Data Science is:

If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating. — Hadley Wickham

Businesses and research organizations nearly always have too much data for that to be feasible. Instead, they use tools designed to scale to very large amounts of data: databases like Snowflake or Google’s BigQuery, and distributed computing frameworks like Apache Spark.

Warning

We are going to focus on getting data out of a database so that you can use it, and on making sure you know what a database is.

You could spend a whole semester on databases:

  • CSC436 covers how to implement them in detail (recommended, but requires CSC212)

  • BAI456 covers only how to use them (counts for DS majors, but if you want to understand them more deeply, the CSC one is recommended)

For the purpose of this class the key attributes of a database are:

  • it is a collection of tables

  • the data is accessed live from disk (not RAM)

  • you send a query to the database to get the data (or your answer)

Databases can be designed in many different ways. For example, here are two popular ones:

  • SQLite is optimized for transactional workloads, which means a high volume of requests that each involve inserting or reading a few records. This is good for, e.g., a web server.

  • DuckDB is optimized for analytical workloads, which means a small number of requests that each require reading many records in the database. This is better for, e.g., data science.

Experimenting with DuckDB is a way to earn construct level 3
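
For a sense of what that looks like, here is a minimal DuckDB sketch, assuming the duckdb package is installed (we will not use it further here):

# connect to an in-memory DuckDB database and run an analytical query
import duckdb
con = duckdb.connect()
con.execute("CREATE TABLE demo AS SELECT * FROM range(5)")
con.execute("SELECT COUNT(*) FROM demo").df()
con.close()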

10.2.2. Accessing a Database from Python#

We will use pandas again, as well as the request module from the urllib package and sqlite3.

Off the shelf, pandas cannot read databases. We’ll use the sqlite3 library, but there are others, depending on the type of database.

First we need to download the database to work with it.

request.urlretrieve('https://github.com/rhodyprog4ds/rhodyds/raw/main/data/nba1819.db',
      'nba1819.db')
('nba1819.db', <http.client.HTTPMessage at 0x7f9ee829fd30>)

Next, we set up a connection that links the notebook to the database. To use it, we add a cursor.

conn = sqlite3.connect('nba1819.db')
cursor = conn.cursor()

We can use execute to pass SQL queries through the cursor to the database.

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
<sqlite3.Cursor at 0x7f9ee827fd50>

Then we use fetchall to get the results of the query.

cursor.fetchall()
[('teams',),
 ('conferences',),
 ('playerGameStats2018',),
 ('playerGameStats2019',),
 ('teamGameStats2018',),
 ('teamGameStats2019',),
 ('playerTeams2018',),
 ('playerTeams2019',),
 ('teamDailyRankings2018',),
 ('teamDailyRankings2019',),
 ('playerNames',)]

If we fetch again, there is nothing to fetch. Fetch pulls what was queued by execute.

cursor.fetchall()
[]
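
If we execute again, the rows are queued again; a minimal sketch contrasting fetchone and fetchall:

# re-running execute queues the result rows once more
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
cursor.fetchone()  # pulls a single row, as a tuple
cursor.fetchall()  # drains the remaining rows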
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';",conn)
name
0 teams
1 conferences
2 playerGameStats2018
3 playerGameStats2019
4 teamGameStats2018
5 teamGameStats2019
6 playerTeams2018
7 playerTeams2019
8 teamDailyRankings2018
9 teamDailyRankings2019
10 playerNames

10.3. Querying with pandas#

We can use pd.read_sql to send queries, get the results, and transform them into a DataFrame all at once.

We can pass the exact same queries if we want.

pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';",conn)
name
0 teams
1 conferences
2 playerGameStats2018
3 playerGameStats2019
4 teamGameStats2018
5 teamGameStats2019
6 playerTeams2018
7 playerTeams2019
8 teamDailyRankings2018
9 teamDailyRankings2019
10 playerNames

or we can get all of one of the tables:

pd.read_sql('SELECT * FROM teams',conn).head(1)
index LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY OWNER GENERALMANAGER HEADCOACH DLEAGUEAFFILIATION
0 0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 Tony Ressler Travis Schlenk Lloyd Pierce Erie Bayhawks

10.3.1. Which player was traded the most during the 2018 season? How many times?#

There is one row per team a player played for per season, so if a player was traded (changed teams), they appear multiple times.

First, we’ll check the column names

pd.read_sql("SELECT * FROM playerTeams2018 LIMIT 1",conn)
index TEAM_ID PLAYER_ID
0 0 1610612761 202695

Then we get the 2018 players; we only need the PLAYER_ID column for this question.

p18 = pd.read_sql("SELECT PLAYER_ID FROM playerTeams2018", conn)

Then we can use value_counts to see how many team rows each player has.

p18.value_counts().sort_values(ascending=False).head(10)
PLAYER_ID
1629150      4
202325       3
203092       3
201160       3
202328       3
1626150      3
1628393      3
202083       3
202692       3
203477       3
Name: count, dtype: int64
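
Alternatively, we could push the counting into the database itself; a sketch of the same question in SQL:

# let SQLite count team rows per player with GROUP BY
query = """SELECT PLAYER_ID, COUNT(*) AS n_teams
           FROM playerTeams2018
           GROUP BY PLAYER_ID
           ORDER BY n_teams DESC
           LIMIT 10"""
pd.read_sql(query, conn)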

We can get the player’s name from the playerNames table (remember, our first query told us all the tables):

pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = 1629150",conn)
PLAYER_NAME
0 Emanuel Terry
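
As a side note, pd.read_sql also accepts query parameters, which is safer than pasting values into the query string; a minimal sketch:

# '?' is sqlite3's placeholder; params supplies the value separately
pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = ?",
            conn, params=(1629150,))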

10.3.2. Did more players who changed teams from the 2018 season to the 2019 season stay in the same conferences or switch conferences?#

In the NBA, there are 30 teams organized into two conferences: East and West; the conferences table has the columns TEAM_ID and CONFERENCE.

Let’s build a DataFrame that could answer the question.

I first pulled 1 row from each table I needed to see the columns.

pd.read_sql('SELECT * FROM conferences LIMIT 1',conn)
index TEAM_ID CONFERENCE
0 0 1610612744 West
pd.read_sql('SELECT * FROM playerTeams2018 LIMIT 1',conn)
index TEAM_ID PLAYER_ID
0 0 1610612761 202695
pd.read_sql('SELECT * FROM playerTeams2019 LIMIT 1',conn)
index TEAM_ID PLAYER_ID
0 0 1610612762 1626220

Then I pulled the columns I needed from each of the 3 tables into a separate DataFrame.

conf_df = pd.read_sql('SELECT TEAM_ID,CONFERENCE FROM conferences',conn)
df18 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2018',conn)
df19 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2019',conn)
df18_c = pd.merge(df18,conf_df,on='TEAM_ID')
df19_c = pd.merge(df19,conf_df,on='TEAM_ID')
df1819_conf = pd.merge(df18_c,df19_c, on='PLAYER_ID',suffixes=('_2018','_2019'))
df1819_conf
TEAM_ID_2018 PLAYER_ID CONFERENCE_2018 TEAM_ID_2019 CONFERENCE_2019
0 1610612761 202695 East 1610612746 West
1 1610612761 1627783 East 1610612761 East
2 1610612761 201188 East 1610612761 East
3 1610612763 201188 West 1610612761 East
4 1610612761 201980 East 1610612747 West
... ... ... ... ... ...
533 1610612739 1628021 East 1610612751 East
534 1610612739 201567 East 1610612739 East
535 1610612739 202684 East 1610612739 East
536 1610612739 1628424 East 1610612766 East
537 1610612739 1627819 East 1610612761 East

538 rows × 5 columns

Then I merged the conference with each set of player information on the teams. Then I merged the two expanded single-year DataFrames together.

Now, to answer the question, we have a bit more work to do. I’m going to use a lambda and apply to make a column that says same or new for the relative conference of the two seasons.

labels = {False:'new',True:'same'}
change_conf = lambda row: labels[row['CONFERENCE_2018']==row['CONFERENCE_2019']]
df1819_conf['conference_1819']= df1819_conf.apply(change_conf,axis=1)
df1819_conf.head()
TEAM_ID_2018 PLAYER_ID CONFERENCE_2018 TEAM_ID_2019 CONFERENCE_2019 conference_1819
0 1610612761 202695 East 1610612746 West new
1 1610612761 1627783 East 1610612761 East same
2 1610612761 201188 East 1610612761 East same
3 1610612763 201188 West 1610612761 East new
4 1610612761 201980 East 1610612747 West new
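
An equivalent vectorized sketch, using numpy instead of a row-by-row apply:

import numpy as np

# np.where picks 'same' or 'new' element-wise across the whole column
df1819_conf['conference_1819'] = np.where(
    df1819_conf['CONFERENCE_2018'] == df1819_conf['CONFERENCE_2019'],
    'same', 'new')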

Then I can use this DataFrame grouped by my new column to get the unique players in each situation: new or same conference.

df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique)
conference_1819
new     [202695, 201188, 201980, 203961, 1626153, 1011...
same    [1627783, 201188, 200768, 1627832, 201586, 162...
Name: PLAYER_ID, dtype: object

And finally, get the length of each of those lists.

df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819
new     119
same    385
Name: PLAYER_ID, dtype: int64
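
The nunique method does both steps at once; a minimal sketch:

# count distinct players per group in one step
df1819_conf.groupby('conference_1819')['PLAYER_ID'].nunique()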

This, however, includes players who stayed on the same team, so we also need to split by who changed teams. First we add the team comparison column, then group by both columns and count unique players.

new_team = lambda row: labels[row['TEAM_ID_2018']==row['TEAM_ID_2019']]
df1819_conf['team_1819']= df1819_conf.apply(new_team,axis=1)
df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819  team_1819
new              new          119
same             new          135
                 same         263
Name: PLAYER_ID, dtype: int64

This is good; we could read the answer from here. It’s good practice, though, to be able to pull that value out programmatically.

player_counts_1819_team =  df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
player_counts_1819_team.idxmax()
('same', 'same')

This tells us that the largest number of players stayed on the same team (and therefore the same conference). We’re not interested in this, though; we’re interested in those that changed teams, so we can drop the (same, same) value and then do this again.

player_counts_1819_team.drop(('same','same')).idxmax()
('same', 'new')

This tells us that more players changed teams within the same conference than changed teams and conferences. We can compare the two directly:

player_counts_1819_team['new','new'], player_counts_1819_team['same','new']
(119, 135)

Again, 135 is more than 119.

We can also make this a little neater to print it as a DataFrame. If we use reset_index it will make a DataFrame, but the count column will still be named PLAYER_ID so we can rename it.

player_counts_1819_team.reset_index().rename(columns={'PLAYER_ID':'num_players'})
conference_1819 team_1819 num_players
0 new new 119
1 same new 135
2 same same 263

All in all, this gives us an answer we can compute and display directly from the data, and it is one way that using multiple data sources can help answer richer questions.

conn.close()

10.4. Questions#

10.4.1. do the level 3 attempts have a due date?#

Basically, the end of the semester.

10.4.2. how long is assignment 5?#

It can be a little long; web scraping is relatively open-ended.