13. Reviewing Merges & Databases#

import pandas as pd
pd.merge?

a review problem

If weather_df has a row for every date in the past 100 years, a column named date and columns for the weather that day and birthdays_df has a row for each member of the CS department with a column birthdate and a second column name, how would you merge these two datasets to find out the weather on the day each person was born?

bday_weather = pd.merge(left= weather_df, right = birthdays_df,
                        left_on = 'date', right_on ='birthdate', how= 'inner')

13.1. Working with Databases#

Off the shelf, pandas cannot read databased by default. We’ll use the sqlite3 library.

import sqlite3

First, we set up a connection, that links the the notebook to the databse.

conn = sqlite3.connect('data/nba1819.db')
conn
<sqlite3.Connection at 0x7f2b569e85d0>

To use it, we add a cursor.

cursor = conn.cursor()

We can use execute to pass SQL queries through the cursor to the database.

cursor.execute('SELECT name FROM sqlite_master WHERE type="table"')
<sqlite3.Cursor at 0x7f2b56aa2570>

Then we use fetchall to get the the results of the query.

cursor.fetchall()
[('teams',),
 ('conferences',),
 ('playerGameStats2018',),
 ('playerGameStats2019',),
 ('teamGameStats2018',),
 ('teamGameStats2019',),
 ('playerTeams2018',),
 ('playerTeams2019',),
 ('teamDailyRankings2018',),
 ('teamDailyRankings2019',),
 ('playerNames',)]

We can also pass queries to the database through pandas and then get it returned as a DataFrame.

pd.read_sql('SELECT PLAYER_ID, PLAYER_NAME FROM playerNames',conn).head(1)
PLAYER_ID PLAYER_NAME
0 202695 Kawhi Leonard

We can use * to get all of the columns and LIMIT to reduce the number of rows.

pd.read_sql('SELECT * FROM playerNames LIMIT 5',conn)
index PLAYER_NAME PLAYER_ID
0 0 Kawhi Leonard 202695
1 1 Pascal Siakam 1627783
2 2 Marc Gasol 201188
3 3 Danny Green 201980
4 4 Kyle Lowry 200768

How can get all of the data from the teams table?

teams_df = pd.read_sql('SELECT * FROM teams',conn)
teams_df.head()
index LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY OWNER GENERALMANAGER HEADCOACH DLEAGUEAFFILIATION
0 0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 Tony Ressler Travis Schlenk Lloyd Pierce Erie Bayhawks
1 1 0 1610612738 1946 2019 BOS Celtics 1946 Boston TD Garden 18624.0 Wyc Grousbeck Danny Ainge Brad Stevens Maine Red Claws
2 2 0 1610612740 2002 2019 NOP Pelicans 2002 New Orleans Smoothie King Center NaN Tom Benson Trajan Langdon Alvin Gentry No Affiliate
3 3 0 1610612741 1966 2019 CHI Bulls 1966 Chicago United Center 21711.0 Jerry Reinsdorf Gar Forman Jim Boylen Windy City Bulls
4 4 0 1610612742 1980 2019 DAL Mavericks 1980 Dallas American Airlines Center 19200.0 Mark Cuban Donnie Nelson Rick Carlisle Texas Legends

13.2. More practice#

For each of the following, think about what to query from the database and how to merge the tables after. Also, consider if these questions are all specific enough or require more decisions to be made.

  1. How many players changed from the East conference to the West conference during the 2018 season?

  2. How many from the East conference to the West conference from the 2018 season to the 2019 season?

  3. Did teams that were founded earlier have better records in the 2018 season?

Think ahead

For a portfolio, you could ask questions like do events in a city impact if the home team wins later that week?