Getting Data from Databases
Contents
13. Getting Data from Databases#
Important
Your first portfolio check is due October 17th. Submitting something for this check is important. You can treat it like a draft to get feedback and then you will be able to revise for the next one to improve your work if needed.
13.1. What is a Database?#
A common attitude in Data Science is:
If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating. — Hadley Wickham
Businesses and research organizations nearly always have too much data to feasibly work without a database. Instead, they use different tools which are designed to scale to very large amounts of data. These tools are largely databases like Snowflake or Google’s BigQuery and distributed computing frameworks like Apache Spark.
Warning
We are going to focus on the case of getting data out of a Database so that you can use it and making sure you know what a Database is.
You could spend a whole semester on databases:
CSC436 covers how to implement them in detail (recommended, but requires CSC212)
BAI456 only how to use them (counts for DS majors, but if you want to understand them deeper, the CSC one is recommended)
For the purpose of this class the key attributes of a database are:
it is a collection of tables
the data is accessed live from disk (not RAM)
you send a query to the database to get the data (or your answer)
Databases can be designed in many different ways. For examples two popular ones.
SQLite is optimized for transactional workloads, which means a high volume of requests that involving inserting or reading a couple things. This is good for eg a webserver.
DuckDB is optimized for analytical workloads, which means a small number of requests that each require reading many records in the database. This is better for eg: data science
Experimenting with DuckDB is a way to earn construct level 3
13.2. Accessing a Database from Python#
We will use pandas again, as well as the request
module from the urllib
package and sqlite3
.
Off the shelf, pandas cannot read databased by default. We’ll use the
sqlite3
library, but there
are others, depending on the type of database.
import pandas as pd
from urllib import request
import sqlite3
First we need to download the database to work with it.
request.urlretrieve('https://github.com/rhodyprog4ds/rhodyds/raw/main/data/nba1819.db',
'nba1819.db')
('nba1819.db', <http.client.HTTPMessage at 0x7f8b7c1b6b20>)
Next, we set up a connection, that links the the notebook to the database. To use it, we add a cursor.
conn = sqlite3.connect('nba1819.db')
cursor = conn.cursor()
We can use execute to pass SQL queries through the cursor to the database.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
<sqlite3.Cursor at 0x7f8b7c19eb20>
Then we use fetchall
to get the the results of the query.
cursor.fetchall()
[('teams',),
('conferences',),
('playerGameStats2018',),
('playerGameStats2019',),
('teamGameStats2018',),
('teamGameStats2019',),
('playerTeams2018',),
('playerTeams2019',),
('teamDailyRankings2018',),
('teamDailyRankings2019',),
('playerNames',)]
If we fetch again, there is nothing to fetch. Fetch pulls what was queued by execute.
cursor.fetchall()
[]
We can run another query with execute then fetch that result. This query gives us the column names.
The schema of a database is the description of its setup and layout. The *
means to get all.
cursor.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS ")
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
Cell In[7], line 1
----> 1 cursor.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS ")
OperationalError: no such table: INFORMATION_SCHEMA.COLUMNS
Then we use fetchall
to get the the results of the query.
cursor.fetchall()
[]
13.3. Querying with pandas#
We can use pd.read_sql
to send queries, get the result sand transform them
into a DataFrame all at once
pd.read_sql("SELECT * FROM teams",conn)
index | LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | OWNER | GENERALMANAGER | HEADCOACH | DLEAGUEAFFILIATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | Tony Ressler | Travis Schlenk | Lloyd Pierce | Erie Bayhawks |
1 | 1 | 0 | 1610612738 | 1946 | 2019 | BOS | Celtics | 1946 | Boston | TD Garden | 18624.0 | Wyc Grousbeck | Danny Ainge | Brad Stevens | Maine Red Claws |
2 | 2 | 0 | 1610612740 | 2002 | 2019 | NOP | Pelicans | 2002 | New Orleans | Smoothie King Center | NaN | Tom Benson | Trajan Langdon | Alvin Gentry | No Affiliate |
3 | 3 | 0 | 1610612741 | 1966 | 2019 | CHI | Bulls | 1966 | Chicago | United Center | 21711.0 | Jerry Reinsdorf | Gar Forman | Jim Boylen | Windy City Bulls |
4 | 4 | 0 | 1610612742 | 1980 | 2019 | DAL | Mavericks | 1980 | Dallas | American Airlines Center | 19200.0 | Mark Cuban | Donnie Nelson | Rick Carlisle | Texas Legends |
5 | 5 | 0 | 1610612743 | 1976 | 2019 | DEN | Nuggets | 1976 | Denver | Pepsi Center | 19099.0 | Stan Kroenke | Tim Connelly | Michael Malone | No Affiliate |
6 | 6 | 0 | 1610612745 | 1967 | 2019 | HOU | Rockets | 1967 | Houston | Toyota Center | 18104.0 | Tilman Fertitta | Daryl Morey | Mike D'Antoni | Rio Grande Valley Vipers |
7 | 7 | 0 | 1610612746 | 1970 | 2019 | LAC | Clippers | 1970 | Los Angeles | Staples Center | 19060.0 | Steve Ballmer | Michael Winger | Doc Rivers | Agua Caliente Clippers of Ontario |
8 | 8 | 0 | 1610612747 | 1948 | 2019 | LAL | Lakers | 1948 | Los Angeles | Staples Center | 19060.0 | Jerry Buss Family Trust | Rob Pelinka | Frank Vogel | South Bay Lakers |
9 | 9 | 0 | 1610612748 | 1988 | 2019 | MIA | Heat | 1988 | Miami | AmericanAirlines Arena | 19600.0 | Micky Arison | Pat Riley | Erik Spoelstra | Sioux Falls Skyforce |
10 | 10 | 0 | 1610612749 | 1968 | 2019 | MIL | Bucks | 1968 | Milwaukee | Fiserv Forum | 17500.0 | Wesley Edens & Marc Lasry | Jon Horst | Mike Budenholzer | Wisconsin Herd |
11 | 11 | 0 | 1610612750 | 1989 | 2019 | MIN | Timberwolves | 1989 | Minnesota | Target Center | 19356.0 | Glen Taylor | Scott Layden | Ryan Saunders | Iowa Wolves |
12 | 12 | 0 | 1610612751 | 1976 | 2019 | BKN | Nets | 1976 | Brooklyn | Barclays Center | NaN | Joe Tsai | Sean Marks | Kenny Atkinson | Long Island Nets |
13 | 13 | 0 | 1610612752 | 1946 | 2019 | NYK | Knicks | 1946 | New York | Madison Square Garden | 19763.0 | Cablevision (James Dolan) | Steve Mills | David Fizdale | Westchester Knicks |
14 | 14 | 0 | 1610612753 | 1989 | 2019 | ORL | Magic | 1989 | Orlando | Amway Center | 0.0 | Rick DeVos | John Hammond | Steve Clifford | Lakeland Magic |
15 | 15 | 0 | 1610612754 | 1976 | 2019 | IND | Pacers | 1976 | Indiana | Bankers Life Fieldhouse | 18345.0 | Herb Simon | Kevin Pritchard | Nate McMillan | Fort Wayne Mad Ants |
16 | 16 | 0 | 1610612755 | 1949 | 2019 | PHI | 76ers | 1949 | Philadelphia | Wells Fargo Center | NaN | Joshua Harris | Elton Brand | Brett Brown | Delaware Blue Coats |
17 | 17 | 0 | 1610612756 | 1968 | 2019 | PHX | Suns | 1968 | Phoenix | Talking Stick Resort Arena | NaN | Robert Sarver | James Jones | Monty Williams | Northern Arizona Suns |
18 | 18 | 0 | 1610612757 | 1970 | 2019 | POR | Trail Blazers | 1970 | Portland | Moda Center | 19980.0 | Paul Allen | Neil Olshey | Terry Stotts | No Affiliate |
19 | 19 | 0 | 1610612758 | 1948 | 2019 | SAC | Kings | 1948 | Sacramento | Golden 1 Center | 17500.0 | Vivek Ranadive | Vlade Divac | Luke Walton | Stockton Kings |
20 | 20 | 0 | 1610612759 | 1976 | 2019 | SAS | Spurs | 1976 | San Antonio | AT&T Center | 18694.0 | Peter Holt | Brian Wright | Gregg Popovich | Austin Spurs |
21 | 21 | 0 | 1610612760 | 1967 | 2019 | OKC | Thunder | 1967 | Oklahoma City | Chesapeake Energy Arena | 19163.0 | Clay Bennett | Sam Presti | Billy Donovan | Oklahoma City Blue |
22 | 22 | 0 | 1610612761 | 1995 | 2019 | TOR | Raptors | 1995 | Toronto | Scotiabank Arena | 19800.0 | Maple Leaf Sports and Entertainment | Masai Ujiri | Nick Nurse | Raptors 905 |
23 | 23 | 0 | 1610612762 | 1974 | 2019 | UTA | Jazz | 1974 | Utah | Vivint Smart Home Arena | 20148.0 | Greg Miller | Dennis Lindsey | Quin Snyder | Salt Lake City Stars |
24 | 24 | 0 | 1610612763 | 1995 | 2019 | MEM | Grizzlies | 1995 | Memphis | FedExForum | 18119.0 | Robert Pera | Zach Kleiman | Taylor Jenkins | Memphis Hustle |
25 | 25 | 0 | 1610612764 | 1961 | 2019 | WAS | Wizards | 1961 | Washington | Capital One Arena | 20647.0 | Ted Leonsis | Tommy Sheppard | Scott Brooks | Capital City Go-Go |
26 | 26 | 0 | 1610612765 | 1948 | 2019 | DET | Pistons | 1948 | Detroit | Little Caesars Arena | 21000.0 | Tom Gores | Ed Stefanski | Dwane Casey | Grand Rapids Drive |
27 | 27 | 0 | 1610612766 | 1988 | 2019 | CHA | Hornets | 1988 | Charlotte | Spectrum Center | 19026.0 | Michael Jordan | Mitch Kupchak | James Borrego | Greensboro Swarm |
28 | 28 | 0 | 1610612739 | 1970 | 2019 | CLE | Cavaliers | 1970 | Cleveland | Quicken Loans Arena | 20562.0 | Dan Gilbert | Koby Altman | John Beilein | Canton Charge |
29 | 29 | 0 | 1610612744 | 1946 | 2019 | GSW | Warriors | 1946 | Golden State | Chase Center | 19596.0 | Joe Lacob | Bob Myers | Steve Kerr | Santa Cruz Warriors |
We can use *
to get all of the columns and LIMIT
to reduce the number of rows.
pd.read_sql(con=conn,sql="SELECT * FROM teams LIMIT 3")
index | LEAGUE_ID | TEAM_ID | MIN_YEAR | MAX_YEAR | ABBREVIATION | NICKNAME | YEARFOUNDED | CITY | ARENA | ARENACAPACITY | OWNER | GENERALMANAGER | HEADCOACH | DLEAGUEAFFILIATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1610612737 | 1949 | 2019 | ATL | Hawks | 1949 | Atlanta | State Farm Arena | 18729.0 | Tony Ressler | Travis Schlenk | Lloyd Pierce | Erie Bayhawks |
1 | 1 | 0 | 1610612738 | 1946 | 2019 | BOS | Celtics | 1946 | Boston | TD Garden | 18624.0 | Wyc Grousbeck | Danny Ainge | Brad Stevens | Maine Red Claws |
2 | 2 | 0 | 1610612740 | 2002 | 2019 | NOP | Pelicans | 2002 | New Orleans | Smoothie King Center | NaN | Tom Benson | Trajan Langdon | Alvin Gentry | No Affiliate |
13.3.1. Which player was traded the most during the 2018 season? How many times?#
Note that the NBA Data is a little complicated for the questions we were asking because there is one row in players per team a played for per season. SO if a player was traded (changed teams), they are in there multiple times.
pd.read_sql("SELECT * FROM playerTeams2018 LIMIT 1",conn)
index | TEAM_ID | PLAYER_ID | |
---|---|---|---|
0 | 0 | 1610612761 | 202695 |
p18 =pd.read_sql("SELECT PLAYER_ID FROM playerTeams2018 ",conn)
Then we can use value counts
p18.value_counts().sort_values(ascending=False).head(10)
PLAYER_ID
1629150 4
202325 3
203092 3
201160 3
202328 3
1626150 3
1628393 3
202083 3
202692 3
203477 3
dtype: int64
and we can get the player’s name from the player name remember our first query told us all the tables
pd.read_sql("SELECT PLAYER_NAME FROM playerNames WHERE PLAYER_ID = 1629150",conn)
PLAYER_NAME | |
---|---|
0 | Emanuel Terry |
13.3.2. Using multiple merges#
In the NBA, there are 30 teams organized into two conferences: East and West;
the conferences
table has the columns TEAM_ID
and CONFERENCE
build a Dataframe that could answer the question:
Did more players who changed teams from the 2018 season to the 2019 season stay in the same conferences or switch conferences?
Warning
I am going to fill this in later: get notified by commenting on the issue
# get conferences
# get 2018 players
# add conferences to 2018 players wiht a left merge on teams
# get 2019 players
# add conference
# inner merge the two tables and add suffixes to conference
# subset the dataframe with only rows where 18 & 19 conference are diff
Important
Remmber you have a choice to focus on web scraping or databases for assignment 5.
You can then do the other or learn more about the same one for your portfolio.
13.4. Questions After class#
13.4.1. What other SQL ‘keywords’ in the queries are there? ex: SELECT, FROM, WHERE#
13.4.2. what will you talk about on Monday#
13.4.3. Is it viable to utilize sqlite for all big datasets? For example genomics data?#
I am going to ask around more about the common formats for genomics data. I know that sometimes it is provided in plain text files, but it is not used by loading it all into RAM, for example this lesson works with the raw genomics data using bash scripts.
13.4.4. How to merge data together with sqlite?#
You can use the same types of merges we have seen directly in the database by passign them as your query.
13.4.5. Do you need to put SQL code in caps?#
Important
I have put one hot encoding in my notes to come back to it in two weeks during classificaiton, but we are giving you credit on A4 even if that is missed. I covered the apply and splitting a column differntly this year that led to fewer of you getting very confused, but also meant we did not get to one hot encoding, but I missed that it was still in a4.