Reviewing Merges & Databases
Contents
13. Reviewing Merges & Databases#
import pandas as pd
pd.merge?
a review problem
If weather_df
has a row for every date in the past 100 years, a column named date
and columns for the weather that day and birthdays_df
has a row for each member of the CS department with a column birthdate
and a second column name, how would you merge these two datasets to find out the weather on the day each person was born?
bday_weather = pd.merge(left= weather_df, right = birthdays_df,
left_on = 'date', right_on ='birthdate', how= 'inner')
13.1. Working with Databases#
Off the shelf, pandas cannot read databased by default. We’ll use the sqlite3
library.
import sqlite3
First, we set up a connection, that links the the notebook to the databse.
conn = sqlite3.connect('data/nba1819.db')
conn
<sqlite3.Connection at 0x7f2b569e85d0>
To use it, we add a cursor.
cursor = conn.cursor()
We can use execute to pass SQL queries through the cursor to the database.
cursor.execute('SELECT name FROM sqlite_master WHERE type="table"')
<sqlite3.Cursor at 0x7f2b56aa2570>
Then we use fetchall
to get the the results of the query.
cursor.fetchall()
[('teams',),
('conferences',),
('playerGameStats2018',),
('playerGameStats2019',),
('teamGameStats2018',),
('teamGameStats2019',),
('playerTeams2018',),
('playerTeams2019',),
('teamDailyRankings2018',),
('teamDailyRankings2019',),
('playerNames',)]
We can also pass queries to the database through pandas and then get it returned as a DataFrame.
pd.read_sql('SELECT PLAYER_ID, PLAYER_NAME FROM playerNames',conn).head(1)
PLAYER_ID | PLAYER_NAME | |
---|---|---|
0 | 202695 | Kawhi Leonard |
We can use *
to get all of the columns and LIMIT
to reduce the number of rows.
pd.read_sql('SELECT * FROM playerNames LIMIT 5',conn)
index | PLAYER_NAME | PLAYER_ID | |
---|---|---|---|
0 | 0 | Kawhi Leonard | 202695 |
1 | 1 | Pascal Siakam | 1627783 |
2 | 2 | Marc Gasol | 201188 |
3 | 3 | Danny Green | 201980 |
4 | 4 | Kyle Lowry | 200768 |
How can get all of the data from the teams table?
teams_df = pd.read_sql('SELECT * FROM teams',conn)
teams_df.head()
index | LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | OWNER | GENERALMANAGER | HEADCOACH | DLEAGUEAFFILIATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | Tony Ressler | Travis Schlenk | Lloyd Pierce | Erie Bayhawks |
1 | 1 | 0 | 1610612738 | 1946 | 2019 | BOS | Celtics | 1946 | Boston | TD Garden | 18624.0 | Wyc Grousbeck | Danny Ainge | Brad Stevens | Maine Red Claws |
2 | 2 | 0 | 1610612740 | 2002 | 2019 | NOP | Pelicans | 2002 | New Orleans | Smoothie King Center | NaN | Tom Benson | Trajan Langdon | Alvin Gentry | No Affiliate |
3 | 3 | 0 | 1610612741 | 1966 | 2019 | CHI | Bulls | 1966 | Chicago | United Center | 21711.0 | Jerry Reinsdorf | Gar Forman | Jim Boylen | Windy City Bulls |
4 | 4 | 0 | 1610612742 | 1980 | 2019 | DAL | Mavericks | 1980 | Dallas | American Airlines Center | 19200.0 | Mark Cuban | Donnie Nelson | Rick Carlisle | Texas Legends |
13.2. More practice#
For each of the following, think about what to query from the database and how to merge the tables after. Also, consider if these questions are all specific enough or require more decisions to be made.
How many players changed from the
East
conference to theWest
conference during the 2018 season?How many from the
East
conference to theWest
conference from the 2018 season to the 2019 season?Did teams that were founded earlier have better records in the 2018 season?
Think ahead
For a portfolio, you could ask questions like do events in a city impact if the home team wins later that week?