
Merging & Databases

import pandas as pd
import sqlite3
from urllib import request

We're going to work with a set of datasets today that are stored in a repository.

course_data_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/'

We can load in two data sets of NBA player information.

df_18 = pd.read_csv(course_data_url+'2018-players.csv')
df_19 = pd.read_csv(course_data_url+'2019-players.csv')

and take a peek at each

df_18.head(1)
df_19.head(1)

Let’s make note of the shape of each

df_18.shape, df_19.shape
((748, 3), (626, 4))

This created a tuple of the two shapes.
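
Since it is just a tuple of tuples, we can also unpack it if we want the individual values (a small sketch):

# unpack the rows and columns for each season
(rows_18, cols_18), (rows_19, cols_19) = df_18.shape, df_19.shape
rows_18, rows_19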

What if we want to analyze them together?

We can stack them, but this does not make it easy to see, for example, who changed teams.

df_1819stack = pd.concat([df_18,df_19])
df_1819stack.head()
df_1819stack.shape
(1374, 4)

Note that this has the larger number of columns (four, since df_18's three columns all appear in df_19, so the union of the columns is four) and the total number of rows.

df_18.shape[0] + df_19.shape[0] == df_1819stack.shape[0]
True

df_18 has 748 rows and df_19 has 626, so the total number of rows in the vertically stacked DataFrame is 748 + 626 = 1374.

How can we find which players changed teams?

To do this we want to have one player column and a column with each year’s team.

We can use a merge to do that.

pd.merge(df_18,df_19).head(2)

If we merge them without any parameters, pandas tries to merge on all shared columns. We want to merge using the PLAYER_ID column, though; we would say that we are “merging on player ID” and we use the on parameter to do it. In this case, it looks for the values in the PLAYER_ID column that appear in both DataFrames and combines the matching rows into a single row. This is called an inner merge or inner join.

pd.merge(df_18, df_19, on='PLAYER_ID')

Since there are other columns that appear in both DataFrames, they get a suffix, which by default is _x or _y; we can specify the suffixes ourselves, though.

df_1819 = pd.merge(df_18, df_19, on='PLAYER_ID',suffixes=('_18','_19'))
df_1819.head()

This looks much better!

To get back to the question of which players changed teams from the 2018 season to the 2019 season, we need to find the players that have different teams in the two years.

changed_teams = df_1819['TEAM_ID_18'] != df_1819['TEAM_ID_19']
changed_teams
0       True
1      False
2      False
3       True
4      False
       ...
533    False
534    False
535    False
536     True
537     True
Length: 538, dtype: bool

Then we can use the mask to subset the data:

df_1819[changed_teams]

But some players play for more than one team within a season, so if we only want the ones that changed during the offseason, we also drop any players that appear more than once:

df_19diff18_team = df_1819[changed_teams].drop_duplicates(subset='PLAYER_ID')
df_19diff18_team.head()

So we see that 219 players changed teams, and we can see their information in the DataFrame.

Which players played in 2018, but not 2019?

There are different types of merges: inner keeps only rows that match in both, outer keeps rows that appear in either. Left and right merges keep all the rows of one DataFrame. We can use a left merge with df_18 as the left DataFrame to see which players played only in 2018.

df_1819_left = pd.merge(df_18, df_19, on='PLAYER_ID',suffixes=('_18','_19'),how='left')
df_1819_left.shape
(754, 6)

This left-merged DataFrame has slightly more rows than df_18 itself (shown below), because some 2018 players matched more than one 2019 row.

df_18.shape
(748, 3)
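
For a quick comparison (a sketch; the exact counts depend on the data), we can check how the number of rows changes with each type of merge:

# compare row counts across the four merge types
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df_18, df_19, on='PLAYER_ID',
                      suffixes=('_18', '_19'), how=how)
    print(how, merged.shape)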

Then again to answer the question, we want to drop the duplicates:

df_18_only =  pd.merge(df_18, df_19, on='PLAYER_ID',suffixes=('_18','_19'),
          how='left').drop_duplicates(subset='PLAYER_ID')

and we can answer by counting how many have a missing value in the 2019 team column:

df_18_only['TEAM_ID_19'].isna().sum()
np.int64(178)

Also, note that this has different types than before. There are some players who only played one season, so they have a NaN value in some columns, and pandas always casts a whole column to a type that can hold NaN.
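
For example (a small sketch, assuming the default numpy-backed dtypes), we can compare the two team columns in df_18_only: the 2019 column picked up NaN values in the left merge, so the whole column was cast to float, while the 2018 column stays an integer type.

# NaN cannot be stored in a numpy integer column, so TEAM_ID_19 becomes float64
df_18_only[['TEAM_ID_18', 'TEAM_ID_19']].dtypes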

Getting Data from Databases

What is a Database?

A common attitude in Data Science is:

If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating. — Hadley Wickham

Businesses and research organizations nearly always have too much data to feasibly work without a database. Instead, they use different tools which are designed to scale to very large amounts of data. These tools are largely databases like Snowflake or Google’s BigQuery and distributed computing frameworks like Apache Spark.

For the purpose of this class, the key attributes of a database are that the data lives on disk rather than in the notebook's memory, and that we retrieve only the parts we need by sending queries.

Databases can be designed in many different ways; relational databases like SQLite and analytics-oriented databases like DuckDB are two popular examples.

Experimenting with DuckDB is one way to earn an innovative achievement for data.

Accessing a Database from Python

We will use pandas again, as well as the request module from the urllib package and sqlite3.

Off the shelf, pandas cannot read databases by default. We’ll use the sqlite3 library, but there are others, depending on the type of database.
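
As one alternative (a sketch, assuming the sqlalchemy package is installed and the database file downloaded below is available), pandas can also read through a SQLAlchemy engine:

# connect through a SQLAlchemy engine instead of sqlite3 directly
from sqlalchemy import create_engine

engine = create_engine('sqlite:///nba1819.db')
pd.read_sql('SELECT * FROM teams', engine).head()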

First we need to download the database to work with it.

request.urlretrieve('https://github.com/rhodyprog4ds/rhodyds/raw/main/data/nba1819.db',
      'nba1819.db')
('nba1819.db', <http.client.HTTPMessage at 0x7fe5b05d1dc0>)

Next, we set up a connection that links the notebook to the database. To use it, we add a cursor.

conn = sqlite3.connect('nba1819.db')
cursor = conn.cursor()

We can use execute to pass SQL queries through the cursor to the database.

cursor.execute('SELECT name FROM sqlite_master WHERE type="table";')
<sqlite3.Cursor at 0x7fe5b05d9540>

Then we use fetchall to get the results of the query.

cursor.fetchall()
[('teams',), ('conferences',), ('playerGameStats2018',), ('playerGameStats2019',), ('teamGameStats2018',), ('teamGameStats2019',), ('playerTeams2018',), ('playerTeams2019',), ('teamDailyRankings2018',), ('teamDailyRankings2019',), ('playerNames',)]

If we fetch again, there is nothing to fetch. Fetch pulls what was queued by execute.

cursor.fetchall()
[]
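
If we want the results again, we have to execute the query again first (a small sketch):

# re-run the query so the cursor has results queued up again
cursor.execute('SELECT name FROM sqlite_master WHERE type="table";')
cursor.fetchone()   # fetchone pulls just the first queued row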

Querying with pandas

We can use pd.read_sql to send queries, get the results, and transform them into a DataFrame all at once.

We can pass the exact same queries if we want.

pd.read_sql('SELECT name FROM sqlite_master WHERE type="table";',
           conn)

or we can get all of one of the tables:

pd.read_sql('SELECT * FROM teams',conn).head()

Which player was traded the most during the 2018 season? How many times?

There is one row in the player tables per team a player played for per season, so if a player was traded (changed teams), they appear in there multiple times.

First, we’ll check the column names

pd.read_sql('SELECT * FROM playerTeams2018 LIMIT 1',conn)

Then we get the 2018 players; we only need the PLAYER_ID column for this question.

p18 = pd.read_sql('SELECT PLAYER_ID FROM playerTeams2018',conn)
p18.head()

Then we can use value_counts:

p18.value_counts().sort_values(ascending=False)
PLAYER_ID
1629150    4
1628393    3
201160     3
202692     3
202325     3
          ..
1629236    1
1629245    1
1629246    1
1629260    1
1629285    1
Name: count, Length: 627, dtype: int64
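
We could also pull out the most-traded player's ID and the number of teams programmatically (a small sketch):

# use the single-column Series so the result has a plain index
trade_counts = p18['PLAYER_ID'].value_counts()
trade_counts.idxmax(), trade_counts.max()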

And we can get the player’s name from the playerNames table (remember, our first query told us all of the tables).

pd.read_sql('SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = 1629150'
            ,conn)

Did more players who changed teams from the 2018 season to the 2019 season stay in the same conferences or switch conferences?

In the NBA, there are 30 teams organized into two conferences: East and West; the conferences table has the columns TEAM_ID and CONFERENCE.

Let’s build a DataFrame that could answer the question.

I first pulled one row from each table I needed, to see the columns.

pd.read_sql('SELECT * FROM conferences LIMIT 1',conn)
pd.read_sql('SELECT * FROM playerTeams2018 LIMIT 1',conn)
pd.read_sql('SELECT * FROM playerTeams2019 LIMIT 1',conn)

Then I pulled the columns I needed from each of the 3 tables into a separate DataFrame.

conf_df = pd.read_sql('SELECT TEAM_ID,CONFERENCE FROM conferences',conn)
df18 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2018',conn)
df19 = pd.read_sql('SELECT TEAM_ID,PLAYER_ID FROM playerTeams2019',conn)

Then I merged the conference with each set of player information on the teams.

df18_c = pd.merge(df18,conf_df,on='TEAM_ID')
df19_c = pd.merge(df19,conf_df,on='TEAM_ID')

Then I merged the two expanded single year DataFrames together.

df1819_conf = pd.merge(df18_c,df19_c, on='PLAYER_ID',suffixes=('_2018','_2019'))
df1819_conf.head()

Now, to answer the question, we have a bit more work to do. I’m going to use a lambda and apply to make a column that says same or new for the relative conference of the two seasons.

labels = {False:'new',True:'same'}
change_conf = lambda row: labels[row['CONFERENCE_2018']==row['CONFERENCE_2019']]
df1819_conf['conference_1819']= df1819_conf.apply(change_conf,axis=1)
df1819_conf.head()

The cell above defines a dictionary that maps the True/False comparison result to the labels 'same' and 'new', defines a lambda that compares one row's 2018 and 2019 conferences and looks up the label, and then applies that lambda to each row (axis=1) to create the new conference_1819 column.
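
An equivalent way to build the same column without apply (a sketch) is to compare the two columns directly and map the boolean result through the same labels dictionary:

# vectorized alternative to the row-wise apply above
same_conf = df1819_conf['CONFERENCE_2018'] == df1819_conf['CONFERENCE_2019']
df1819_conf['conference_1819'] = same_conf.map(labels)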

Then I can use this DataFrame grouped by my new column to get the unique players in each situation (new or same conference).

df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique)
conference_1819
new     [202695, 201980, 1628035, 201142, 203961, 1627...
same    [1627783, 201188, 200768, 1627832, 201586, 162...
Name: PLAYER_ID, dtype: object

And finally, get the length of each of those lists.

df1819_conf.groupby('conference_1819')['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819
new     119
same    385
Name: PLAYER_ID, dtype: int64

This, however, includes players who stayed on the same team, so we also need to split by who changed teams. First we add the team comparison column, then group by both columns and count the unique players.

new_team = lambda row: labels[row['TEAM_ID_2018']==row['TEAM_ID_2019']]
df1819_conf['team_1819']= df1819_conf.apply(new_team,axis=1)
df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
conference_1819  team_1819
new              new          119
same             new          135
                 same         263
Name: PLAYER_ID, dtype: int64

This is good, we could read the answer from here. It’s good practice, though, to be able to pull that value out programmatically.

player_counts_1819_team =  df1819_conf.groupby(['conference_1819','team_1819'])['PLAYER_ID'].apply(pd.unique).apply(len)
player_counts_1819_team.idxmax()
('same', 'same')

This tells us that the largest number of players stayed on the same team (and therefore the same conference). We’re not interested in this, though; we’re interested in those that changed teams, so we can drop the (same, same) value and then do this again.

Here we use a tuple as the index, because it is a MultiIndex.

player_counts_1819_team.drop(('same','same')).idxmax()
('same', 'new')

This tells us that more players changed teams within the same conference than changed teams and conferences. We can compare the two directly:

player_counts_1819_team['new','new'], player_counts_1819_team['same','new']
(np.int64(119), np.int64(135))

Again 135 is more than 119.

We can also make this a little neater by printing it as a DataFrame. If we use reset_index it will make a DataFrame, but the count column will still be named PLAYER_ID, so we can rename it.

player_counts_1819_team.reset_index().rename(columns={'PLAYER_ID':'num_players'})

All in all, this gives us an answer that we can both compute and display from the data, and it shows one way that using multiple data sources can help answer richer questions.

conn.close()

Questions

Will we be using databases instead of pandas?

You do not have to use a database for A2, but it is good to get practice using one for at least one later assignment.

They’re not only important for large, traditional data; they are also used, for example, in building AI-powered tools that can look up actual data, using RAG (retrieval augmented generation).

will we go over in class how to do the peer review?

No, the assignment has pretty detailed instructions, but using office hours on Thursday is also a good thing to do if you need help.

If that is all that’s left, your group could also come to mine on Friday.

can you go into more detail as to how the read_sql function works?

It sends the query to the database; the database executes it and returns the results, and then pandas fetches them and transforms them into a DataFrame.

the user guide has more detail with links to database docs too.
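
Roughly, for a simple SELECT, it is doing the same steps we did by hand with the cursor (a sketch, not pandas’ actual implementation, and it assumes the connection is still open):

# roughly what pd.read_sql does for a simple SELECT
query = 'SELECT TEAM_ID, PLAYER_ID FROM playerTeams2018'
cur = conn.cursor()
cur.execute(query)                           # the database runs the query
rows = cur.fetchall()                        # the results come back to Python
columns = [d[0] for d in cur.description]    # column names from the cursor
df = pd.DataFrame(rows, columns=columns)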

what do you suggest we do if we miss a line of code that’s important and want to see it again?

Let me know on Prismia ASAP so that I can send it (or raise your hand!)

If you fall behind, check Prismia to see if I sent the code; if not, message :rewind: (if you press tab after it, it will show as ⏪) or just ‘need help’ or ‘code please’.

So you could have multiple connections by naming them like nba_connection, etc.?

yes!
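
For example (a sketch; other_data.db is a hypothetical second database file):

# two independent connections, one per database file
nba_connection = sqlite3.connect('nba1819.db')
other_connection = sqlite3.connect('other_data.db')   # hypothetical second file

teams = pd.read_sql('SELECT * FROM teams', nba_connection)

nba_connection.close()
other_connection.close()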

How did this all work?

The database file itself is saved to the local computer; the conn variable connects the Python interpreter (in JupyterLab) to the database, and then we send queries to the database, which executes them and sends the data back to Python.

What would happen if you do not do conn.close()? Would resetting your computer/kernel give you the same result?

Restarting the computer definitely yes.

Resetting the kernel, I’m not sure but I suspect yes.

Stopping the Python interpreter, I think yes also.

I would need to look up the precise ways that the garbage collection and shutdowns happen to be sure.

It is definitely better to close it when you are done though!
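
One way to make sure the connection gets closed is the standard library’s contextlib.closing (a sketch; note that a sqlite3 connection’s own with block manages transactions but does not close the connection):

from contextlib import closing

# the connection is closed when the block ends, even if a query raises an error
with closing(sqlite3.connect('nba1819.db')) as conn:
    teams = pd.read_sql('SELECT * FROM teams', conn)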