13. Reviewing Merges & Databases#

import pandas as pd

pd.merge?

a review problem

If weather_df has a row for every date in the past 100 years, a column named date and columns for the weather that day and birthdays_df has a row for each member of the CS department with a column birthdate and a second column name, how would you merge these two datasets to find out the weather on the day each person was born?

bday_weather = pd.merge(left= weather_df, right = birthdays_df,
                        left_on = 'date', right_on ='birthdate', how= 'inner')

13.1. Working with Databases#

Off the shelf, pandas cannot read databased by default. We’ll use the sqlite3 library.

import sqlite3

First, we set up a connection, that links the the notebook to the databse.

conn = sqlite3.connect('data/nba1819.db')

conn

<sqlite3.Connection at 0x7f2b569e85d0>

To use it, we add a cursor.

cursor = conn.cursor()

We can use execute to pass SQL queries through the cursor to the database.

cursor.execute('SELECT name FROM sqlite_master WHERE type="table"')

<sqlite3.Cursor at 0x7f2b56aa2570>

Then we use fetchall to get the the results of the query.

cursor.fetchall()

[('teams',),
 ('conferences',),
 ('playerGameStats2018',),
 ('playerGameStats2019',),
 ('teamGameStats2018',),
 ('teamGameStats2019',),
 ('playerTeams2018',),
 ('playerTeams2019',),
 ('teamDailyRankings2018',),
 ('teamDailyRankings2019',),
 ('playerNames',)]

We can also pass queries to the database through pandas and then get it returned as a DataFrame.

pd.read_sql('SELECT PLAYER_ID, PLAYER_NAME FROM playerNames',conn).head(1)

	PLAYER_ID	PLAYER_NAME
0	202695	Kawhi Leonard

We can use * to get all of the columns and LIMIT to reduce the number of rows.

pd.read_sql('SELECT * FROM playerNames LIMIT 5',conn)

	index	PLAYER_NAME	PLAYER_ID
0	0	Kawhi Leonard	202695
1	1	Pascal Siakam	1627783
2	2	Marc Gasol	201188
3	3	Danny Green	201980
4	4	Kyle Lowry	200768

How can get all of the data from the teams table?

teams_df = pd.read_sql('SELECT * FROM teams',conn)
teams_df.head()

	index	TEAM_ID	MIN_YEAR	MAX_YEAR	ABBREVIATION	NICKNAME	YEARFOUNDED	CITY	ARENA	ARENACAPACITY	OWNER	GENERALMANAGER	HEADCOACH	DLEAGUEAFFILIATION
0	0	1610612737	1949	2019	ATL	Hawks	1949	Atlanta	State Farm Arena	18729.0	Tony Ressler	Travis Schlenk	Lloyd Pierce	Erie Bayhawks
1	1	1610612738	1946	2019	BOS	Celtics	1946	Boston	TD Garden	18624.0	Wyc Grousbeck	Danny Ainge	Brad Stevens	Maine Red Claws
2	2	1610612740	2002	2019	NOP	Pelicans	2002	New Orleans	Smoothie King Center	NaN	Tom Benson	Trajan Langdon	Alvin Gentry	No Affiliate
3	3	1610612741	1966	2019	CHI	Bulls	1966	Chicago	United Center	21711.0	Jerry Reinsdorf	Gar Forman	Jim Boylen	Windy City Bulls
4	4	1610612742	1980	2019	DAL	Mavericks	1980	Dallas	American Airlines Center	19200.0	Mark Cuban	Donnie Nelson	Rick Carlisle	Texas Legends

13.2. More practice#

For each of the following, think about what to query from the database and how to merge the tables after. Also, consider if these questions are all specific enough or require more decisions to be made.

How many players changed from the East conference to the West conference during the 2018 season?
How many from the East conference to the West conference from the 2018 season to the 2019 season?
Did teams that were founded earlier have better records in the 2018 season?

Think ahead

For a portfolio, you could ask questions like do events in a city impact if the home team wins later that week?

Programming for Data Science at URI Fall 2021

Reviewing Merges & Databases

Contents

13. Reviewing Merges & Databases#

13.1. Working with Databases#

13.2. More practice#