13. Getting Data from Databases#

Important

Your first portfolio check is due October 17th. Submitting something for this check is important. You can treat it like a draft to get feedback and then you will be able to revise for the next one to improve your work if needed.

13.1. What is a Database?#

A common attitude in Data Science is:

If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating. — Hadley Wickham

Businesses and research organizations nearly always have too much data to feasibly work without a database. Instead, they use different tools which are designed to scale to very large amounts of data. These tools are largely databases like Snowflake or Google’s BigQuery and distributed computing frameworks like Apache Spark.

Warning

We are going to focus on the case of getting data out of a Database so that you can use it and making sure you know what a Database is.

You could spend a whole semester on databases:

  • CSC436 covers how to implement them in detail (recommended, but requires CSC212)

  • BAI456 only how to use them (counts for DS majors, but if you want to understand them deeper, the CSC one is recommended)

For the purpose of this class the key attributes of a database are:

  • it is a collection of tables

  • the data is accessed live from disk (not RAM)

  • you send a query to the database to get the data (or your answer)

Databases can be designed in many different ways. For examples two popular ones.

  • SQLite is optimized for transactional workloads, which means a high volume of requests that involving inserting or reading a couple things. This is good for eg a webserver.

  • DuckDB is optimized for analytical workloads, which means a small number of requests that each require reading many records in the database. This is better for eg: data science

Experimenting with DuckDB is a way to earn construct level 3

13.2. Accessing a Database from Python#

We will use pandas again, as well as the request module from the urllib package and sqlite3.

Off the shelf, pandas cannot read databased by default. We’ll use the sqlite3 library, but there are others, depending on the type of database.

import pandas as pd
from urllib import request
import sqlite3

First we need to download the database to work with it.

request.urlretrieve('https://github.com/rhodyprog4ds/rhodyds/raw/main/data/nba1819.db',
   'nba1819.db')
('nba1819.db', <http.client.HTTPMessage at 0x7f8b7c1b6b20>)

Next, we set up a connection, that links the the notebook to the database. To use it, we add a cursor.

conn = sqlite3.connect('nba1819.db')
cursor = conn.cursor()

We can use execute to pass SQL queries through the cursor to the database.

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
<sqlite3.Cursor at 0x7f8b7c19eb20>

Then we use fetchall to get the the results of the query.

cursor.fetchall()
[('teams',),
 ('conferences',),
 ('playerGameStats2018',),
 ('playerGameStats2019',),
 ('teamGameStats2018',),
 ('teamGameStats2019',),
 ('playerTeams2018',),
 ('playerTeams2019',),
 ('teamDailyRankings2018',),
 ('teamDailyRankings2019',),
 ('playerNames',)]

If we fetch again, there is nothing to fetch. Fetch pulls what was queued by execute.

cursor.fetchall()
[]

We can run another query with execute then fetch that result. This query gives us the column names.

The schema of a database is the description of its setup and layout. The * means to get all.

cursor.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS ")
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
Cell In[7], line 1
----> 1 cursor.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS ")

OperationalError: no such table: INFORMATION_SCHEMA.COLUMNS

Then we use fetchall to get the the results of the query.

cursor.fetchall()
[]

13.3. Querying with pandas#

We can use pd.read_sql to send queries, get the result sand transform them into a DataFrame all at once

pd.read_sql("SELECT * FROM teams",conn)
index LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY OWNER GENERALMANAGER HEADCOACH DLEAGUEAFFILIATION
0 0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 Tony Ressler Travis Schlenk Lloyd Pierce Erie Bayhawks
1 1 0 1610612738 1946 2019 BOS Celtics 1946 Boston TD Garden 18624.0 Wyc Grousbeck Danny Ainge Brad Stevens Maine Red Claws
2 2 0 1610612740 2002 2019 NOP Pelicans 2002 New Orleans Smoothie King Center NaN Tom Benson Trajan Langdon Alvin Gentry No Affiliate
3 3 0 1610612741 1966 2019 CHI Bulls 1966 Chicago United Center 21711.0 Jerry Reinsdorf Gar Forman Jim Boylen Windy City Bulls
4 4 0 1610612742 1980 2019 DAL Mavericks 1980 Dallas American Airlines Center 19200.0 Mark Cuban Donnie Nelson Rick Carlisle Texas Legends
5 5 0 1610612743 1976 2019 DEN Nuggets 1976 Denver Pepsi Center 19099.0 Stan Kroenke Tim Connelly Michael Malone No Affiliate
6 6 0 1610612745 1967 2019 HOU Rockets 1967 Houston Toyota Center 18104.0 Tilman Fertitta Daryl Morey Mike D'Antoni Rio Grande Valley Vipers
7 7 0 1610612746 1970 2019 LAC Clippers 1970 Los Angeles Staples Center 19060.0 Steve Ballmer Michael Winger Doc Rivers Agua Caliente Clippers of Ontario
8 8 0 1610612747 1948 2019 LAL Lakers 1948 Los Angeles Staples Center 19060.0 Jerry Buss Family Trust Rob Pelinka Frank Vogel South Bay Lakers
9 9 0 1610612748 1988 2019 MIA Heat 1988 Miami AmericanAirlines Arena 19600.0 Micky Arison Pat Riley Erik Spoelstra Sioux Falls Skyforce
10 10 0 1610612749 1968 2019 MIL Bucks 1968 Milwaukee Fiserv Forum 17500.0 Wesley Edens & Marc Lasry Jon Horst Mike Budenholzer Wisconsin Herd
11 11 0 1610612750 1989 2019 MIN Timberwolves 1989 Minnesota Target Center 19356.0 Glen Taylor Scott Layden Ryan Saunders Iowa Wolves
12 12 0 1610612751 1976 2019 BKN Nets 1976 Brooklyn Barclays Center NaN Joe Tsai Sean Marks Kenny Atkinson Long Island Nets
13 13 0 1610612752 1946 2019 NYK Knicks 1946 New York Madison Square Garden 19763.0 Cablevision (James Dolan) Steve Mills David Fizdale Westchester Knicks
14 14 0 1610612753 1989 2019 ORL Magic 1989 Orlando Amway Center 0.0 Rick DeVos John Hammond Steve Clifford Lakeland Magic
15 15 0 1610612754 1976 2019 IND Pacers 1976 Indiana Bankers Life Fieldhouse 18345.0 Herb Simon Kevin Pritchard Nate McMillan Fort Wayne Mad Ants
16 16 0 1610612755 1949 2019 PHI 76ers 1949 Philadelphia Wells Fargo Center NaN Joshua Harris Elton Brand Brett Brown Delaware Blue Coats
17 17 0 1610612756 1968 2019 PHX Suns 1968 Phoenix Talking Stick Resort Arena NaN Robert Sarver James Jones Monty Williams Northern Arizona Suns
18 18 0 1610612757 1970 2019 POR Trail Blazers 1970 Portland Moda Center 19980.0 Paul Allen Neil Olshey Terry Stotts No Affiliate
19 19 0 1610612758 1948 2019 SAC Kings 1948 Sacramento Golden 1 Center 17500.0 Vivek Ranadive Vlade Divac Luke Walton Stockton Kings
20 20 0 1610612759 1976 2019 SAS Spurs 1976 San Antonio AT&T Center 18694.0 Peter Holt Brian Wright Gregg Popovich Austin Spurs
21 21 0 1610612760 1967 2019 OKC Thunder 1967 Oklahoma City Chesapeake Energy Arena 19163.0 Clay Bennett Sam Presti Billy Donovan Oklahoma City Blue
22 22 0 1610612761 1995 2019 TOR Raptors 1995 Toronto Scotiabank Arena 19800.0 Maple Leaf Sports and Entertainment Masai Ujiri Nick Nurse Raptors 905
23 23 0 1610612762 1974 2019 UTA Jazz 1974 Utah Vivint Smart Home Arena 20148.0 Greg Miller Dennis Lindsey Quin Snyder Salt Lake City Stars
24 24 0 1610612763 1995 2019 MEM Grizzlies 1995 Memphis FedExForum 18119.0 Robert Pera Zach Kleiman Taylor Jenkins Memphis Hustle
25 25 0 1610612764 1961 2019 WAS Wizards 1961 Washington Capital One Arena 20647.0 Ted Leonsis Tommy Sheppard Scott Brooks Capital City Go-Go
26 26 0 1610612765 1948 2019 DET Pistons 1948 Detroit Little Caesars Arena 21000.0 Tom Gores Ed Stefanski Dwane Casey Grand Rapids Drive
27 27 0 1610612766 1988 2019 CHA Hornets 1988 Charlotte Spectrum Center 19026.0 Michael Jordan Mitch Kupchak James Borrego Greensboro Swarm
28 28 0 1610612739 1970 2019 CLE Cavaliers 1970 Cleveland Quicken Loans Arena 20562.0 Dan Gilbert Koby Altman John Beilein Canton Charge
29 29 0 1610612744 1946 2019 GSW Warriors 1946 Golden State Chase Center 19596.0 Joe Lacob Bob Myers Steve Kerr Santa Cruz Warriors

We can use * to get all of the columns and LIMIT to reduce the number of rows.

pd.read_sql(con=conn,sql="SELECT * FROM teams LIMIT 3")
index LEAGUE_ID TEAM_ID MIN_YEAR MAX_YEAR ABBREVIATION NICKNAME YEARFOUNDED CITY ARENA ARENACAPACITY OWNER GENERALMANAGER HEADCOACH DLEAGUEAFFILIATION
0 0 0 1610612737 1949 2019 ATL Hawks 1949 Atlanta State Farm Arena 18729.0 Tony Ressler Travis Schlenk Lloyd Pierce Erie Bayhawks
1 1 0 1610612738 1946 2019 BOS Celtics 1946 Boston TD Garden 18624.0 Wyc Grousbeck Danny Ainge Brad Stevens Maine Red Claws
2 2 0 1610612740 2002 2019 NOP Pelicans 2002 New Orleans Smoothie King Center NaN Tom Benson Trajan Langdon Alvin Gentry No Affiliate

13.3.1. Which player was traded the most during the 2018 season? How many times?#

Note that the NBA Data is a little complicated for the questions we were asking because there is one row in players per team a played for per season. SO if a player was traded (changed teams), they are in there multiple times.

pd.read_sql("SELECT * FROM playerTeams2018 LIMIT 1",conn)
index TEAM_ID PLAYER_ID
0 0 1610612761 202695
p18 =pd.read_sql("SELECT PLAYER_ID FROM playerTeams2018 ",conn)

Then we can use value counts

p18.value_counts().sort_values(ascending=False).head(10)
PLAYER_ID
1629150      4
202325       3
203092       3
201160       3
202328       3
1626150      3
1628393      3
202083       3
202692       3
203477       3
dtype: int64

and we can get the player’s name from the player name remember our first query told us all the tables

pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = 1629150",conn)
PLAYER_NAME
0 Emanuel Terry

13.3.2. Using multiple merges#

In the NBA, there are 30 teams organized into two conferences: East and West; the conferences table has the columns TEAM_ID and CONFERENCE

build a Dataframe that could answer the question:

Did more players who changed teams from the 2018 season to the 2019 season stay in the same conferences or switch conferences?

Warning

I am going to fill this in later: get notified by commenting on the issue

# get conferences

# get 2018 players

# add conferences to 2018 players wiht a left merge on teams


# get 2019 players

# add conference

# inner merge the two tables and add suffixes to conference

# subset the dataframe with only rows where 18 & 19 conference are diff

Important

Remmber you have a choice to focus on web scraping or databases for assignment 5.
You can then do the other or learn more about the same one for your portfolio.

13.4. Questions After class#

13.4.1. What other SQL ‘keywords’ in the queries are there? ex: SELECT, FROM, WHERE#

quick reference

13.4.2. what will you talk about on Monday#

13.4.3. Is it viable to utilize sqlite for all big datasets? For example genomics data?#

I am going to ask around more about the common formats for genomics data. I know that sometimes it is provided in plain text files, but it is not used by loading it all into RAM, for example this lesson works with the raw genomics data using bash scripts.

13.4.4. How to merge data together with sqlite?#

You can use the same types of merges we have seen directly in the database by passign them as your query.

13.4.5. Do you need to put SQL code in caps?#

Important

I have put one hot encoding in my notes to come back to it in two weeks during classificaiton, but we are giving you credit on A4 even if that is missed. I covered the apply and splitting a column differntly this year that led to fewer of you getting very confused, but also meant we did not get to one hot encoding, but I missed that it was still in a4.