9. Web Scraping
Note
Dates:
A5 due 10/10
P1 due 10/16 (ideally you have already started this)
A6 due 10/18 (it’s short, I promise)
then back to assignments due Monday
import requests
from bs4 import BeautifulSoup
import pandas as pd
Warning
If it says it cannot load one of the libraries, use pip inside your notebook to install it:
pip install beautifulsoup4
Then restart your kernel (Kernel menu, choose Restart).
9.1. Getting Data From Websites
We have seen that read_html
can get content from an actual website, not just a data file that is hosted somewhere on the internet. It takes the tables on a web page and returns a list of DataFrames.
pd.read_html('https://rhodyprog4ds.github.io/BrownSpring23/syllabus/achievements.html')
[ Unnamed: 0_level_0 topics \
week Unnamed: 1_level_1
0 1 [admin, python review]
1 2 Loading data, Python review
2 3 Exploratory Data Analysis
3 4 Data Cleaning
4 5 Databases, Merging DataFrames
5 6 Modeling, classification performance metrics, ...
6 7 Naive Bayes, decision trees
7 8 Regression
8 9 Clustering
9 10 SVM, parameter tuning
10 11 KNN, Model comparison
11 12 Text Analysis
12 13 Images Analysis
13 14 Deep Learning
skills
Unnamed: 2_level_1
0 process
1 [access, prepare, summarize]
2 [summarize, visualize]
3 [prepare, summarize, visualize]
4 [access, construct, summarize]
5 [evaluate]
6 [classification, evaluate]
7 [regression, evaluate]
8 [clustering, evaluate]
9 [optimize, tools]
10 [compare, tools]
11 [unstructured]
12 [unstructured, tools]
13 [tools, compare] ,
Unnamed: 0_level_0 skill \
keyword Unnamed: 1_level_1
0 python pythonic code writing
1 process describe data science as a process
2 access access data in multiple formats
3 construct construct datasets from multiple sources
4 summarize Summarize and describe data
5 visualize Visualize data
6 prepare prepare data for analysis
7 evaluate Evaluate model performance
8 classification Apply classification
9 regression Apply Regression
10 clustering Clustering
11 optimize Optimize model parameters
12 compare compare models
13 representation Choose representations and transform data
14 workflow use industry standard data science tools and w...
Level 1 \
Unnamed: 2_level_1
0 python code that mostly runs, occasional pep8 ...
1 Identify basic components of data science
2 load data from at least one format; identify t...
3 identify what should happen to merge datasets ...
4 Describe the shape and structure of a dataset ...
5 identify plot types, generate basic plots from...
6 identify if data is or is not ready for analys...
7 Explain basic performance metrics for differen...
8 identify and describe what classification is, ...
9 identify what data that can be used for regres...
10 describe what clustering is
11 Identify when model parameters need to be opti...
12 Qualitatively compare model classes
13 Identify options for representing text and cat...
14 Solve well strucutred fully specified problems...
Level 2 \
Unnamed: 3_level_1
0 python code that reliably runs, frequent pep8 ...
1 Describe and define each stage of the data sci...
2 Load data for processing from the most common ...
3 apply basic merges
4 compute summary statndard statistics of a whol...
5 generate multiple plot types with complete lab...
6 apply data reshaping, cleaning, and filtering ...
7 Apply and interpret basic model evaluation met...
8 fit, apply, and interpret preselected classifi...
9 fit and interpret linear regression models
10 apply basic clustering
11 Optimize basic model parameters such as model ...
12 Compare model classes in specific terms and fi...
13 Apply at least one representation to transform...
14 Solve well-strucutred, open-ended problems, ap...
Level 3
Unnamed: 4_level_1
0 reliable, efficient, pythonic code that consis...
1 Compare different ways that data science can f...
2 access data from both common and uncommon form...
3 merge data that is not automatically aligned
4 Compute and interpret various summary statisti...
5 generate complex plots with pandas and plottin...
6 apply data reshaping, cleaning, and filtering ...
7 Evaluate a model with multiple metrics and cro...
8 fit and apply classification models and select...
9 fit and explain regrularized or nonlinear regr...
10 apply multiple clustering techniques, and inte...
11 Select optimal parameters based of mutiple qua...
12 Evaluate tradeoffs between different model com...
13 apply transformations in different contexts OR...
14 Independently scope and solve realistic data s... ,
Unnamed: 0_level_0 A1 A2 \
keyword Unnamed: 1_level_1 Unnamed: 2_level_1
0 python 1 1
1 process 1 0
2 access 0 1
3 construct 0 0
4 summarize 0 0
5 visualize 0 0
6 prepare 0 0
7 evaluate 0 0
8 classification 0 0
9 regression 0 0
10 clustering 0 0
11 optimize 0 0
12 compare 0 0
13 representation 0 0
14 workflow 0 0
A3 A4 A5 \
Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1
0 0 1 1
1 0 0 0
2 1 1 1
3 0 0 1
4 1 1 1
5 1 1 0
6 0 1 1
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A6 A7 A8 \
Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1
0 0 0 0
1 1 1 1
2 0 0 0
3 0 1 1
4 1 1 1
5 1 1 1
6 0 0 0
7 1 1 1
8 0 1 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
A9 A10 A11 \
Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
0 0 0 0
1 1 1 1
2 0 0 0
3 0 0 0
4 1 1 1
5 1 1 1
6 0 0 0
7 0 1 1
8 0 1 0
9 0 0 1
10 1 0 1
11 0 1 1
12 0 0 1
13 0 0 0
14 0 1 1
A12 A13 # Assignments
Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1
0 0 0 4
1 0 0 7
2 0 0 4
3 0 0 3
4 1 1 11
5 1 1 10
6 0 0 2
7 0 0 5
8 0 0 2
9 0 0 2
10 0 0 2
11 0 0 2
12 0 1 2
13 1 1 2
14 1 1 4 ,
Unnamed: 0_level_0 Level 3 \
keyword Unnamed: 1_level_1
0 python reliable, efficient, pythonic code that consis...
1 process Compare different ways that data science can f...
2 access access data from both common and uncommon form...
3 construct merge data that is not automatically aligned
4 summarize Compute and interpret various summary statisti...
5 visualize generate complex plots with pandas and plottin...
6 prepare apply data reshaping, cleaning, and filtering ...
7 evaluate Evaluate a model with multiple metrics and cro...
8 classification fit and apply classification models and select...
9 regression fit and explain regrularized or nonlinear regr...
10 clustering apply multiple clustering techniques, and inte...
11 optimize Select optimal parameters based of mutiple qua...
12 compare Evaluate tradeoffs between different model com...
13 representation apply transformations in different contexts OR...
14 workflow Independently scope and solve realistic data s...
P1 P2 P3 P4
Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1
0 1 1 0 1
1 0 1 1 1
2 1 1 0 1
3 1 1 0 1
4 1 1 0 1
5 1 1 0 1
6 1 1 0 1
7 0 1 1 1
8 0 1 1 1
9 0 1 1 1
10 0 1 1 1
11 0 0 1 1
12 0 0 1 1
13 0 0 1 1
14 0 0 1 1 ]
This gives us a list of DataFrames that come from the website. pandas
gets tables by looking in the HTML for the site and finding the <table>
tags.
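To make the idea of "finding the <table> tags" concrete, here is a rough sketch that does the same kind of search by hand with BeautifulSoup. It is only an illustration of the idea, not how pandas is implemented internally, and the HTML string is made up for this example.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML, made up for this sketch (not taken from the course site)
sample_html = """
<html><body>
<p>Some text that is not in a table.</p>
<table>
  <tr><th>week</th><th>topic</th></tr>
  <tr><td>1</td><td>Python review</td></tr>
  <tr><td>2</td><td>Loading data</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

tables = []
for table in soup.find_all('table'):
    # each <tr> is a row; <th> and <td> are the cells inside it
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
            for tr in table.find_all('tr')]
    # treat the first row as the header and the rest as data
    tables.append(pd.DataFrame(rows[1:], columns=rows[0]))

print(len(tables))
print(tables[0])
```

Every value stays a string in this sketch; read_html also tries to infer numeric types, which is one reason to prefer it when the data really is in a table.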
9.2. Everything is Data
For the purposes of this class, it is best to think of the content on a web page as a data structure.
There are tags (written inside <>)
that define the structure, and these can be further classified with classes.
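As a small illustration of tags and classes acting like a data structure, the sketch below parses a one-line snippet and reads back each tag's name, classes, and text. The snippet and its class names are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class names are made up for illustration
snippet = ('<div class="card featured">'
           '<h3 class="name">Ada</h3>'
           '<p class="role">Professor</p>'
           '</div>')

soup = BeautifulSoup(snippet, 'html.parser')
card = soup.div

print(card.name)      # the tag name
print(card['class'])  # classes come back as a list of strings

# find_all(True) matches every tag nested inside the div
inner = [(tag.name, tag['class'], tag.get_text()) for tag in card.find_all(True)]
print(inner)
```

The outer div holds two inner tags, so the page behaves like a nested container: tags give the structure and classes label what each piece is.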
9.3. Scraping a URI website
We’re going to create a DataFrame about URI CS & Statistics faculty
from the people page of the department website.
We can inspect the page to check that it’s well structured.
Warning
With great power comes great responsibility.
always check the robots.txt
do not do things that the owner says not to do
government websites are typically safe
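One way to check a robots.txt file from Python is the standard library's urllib.robotparser. The sketch below is a minimal example that parses a made-up set of rules written inline, so it runs without network access. The rules and the example.com URLs are assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written inline so the sketch runs offline
example_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch(user_agent, url) reports whether the rules allow that request
allowed_people = parser.can_fetch('*', 'https://example.com/people/')
allowed_private = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_people, allowed_private)
```

For a live site you would instead call set_url with the address of that site's robots.txt and then read() before asking can_fetch. Passing this check is only a starting point; a site's terms of use still apply.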
We’ll save the URL for easy use
cs_people_url = 'https://web.uri.edu/cs/people/'
Then we can use the requests
library to make a call to the internet. It actually gets back a response object, which has a lot of extra information. For today we only need the content
of the page, which is an attribute of that object:
cs_people_html = requests.get(cs_people_url).content
This is the raw HTML, as bytes:
cs_people_html[:200]
b'\n<!DOCTYPE html>\n<html lang="en-US">\n\t\n<head>\n<meta charset="UTF-8"><script type="text/javascript">(window.NREUM||(NREUM={})).init={privacy:{cookies_enabled:true},ajax:{deny_list:["bam.nr-data.net"]},'
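The response object carries more than the content, including whether the request succeeded. A slightly more defensive version of the request above might look like the sketch below. The helper name fetch_html and its get parameter are hypothetical, introduced only so the function can be tried without network access.

```python
import requests

def fetch_html(url, get=requests.get, timeout=10):
    """Return the raw bytes of a page, raising an error if the request failed.

    fetch_html and the get parameter are hypothetical names for this sketch;
    get defaults to requests.get and can be swapped for a stand-in when testing.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content

# Usage (needs network access, so it is left commented out):
# cs_people_html = fetch_html('https://web.uri.edu/cs/people/')
```

Checking the status first means a missing or blocked page produces an error right away, instead of an error page being parsed as if it were the data.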
But we do not need to manually write search tools; that’s what BeautifulSoup
is for.
cs_people = BeautifulSoup(cs_people_html,'html.parser')
type(cs_people)
bs4.BeautifulSoup
9.3.2. Searching the source
More helpful is the find_all
method. We want to find all div
tags that have the “peopleitem” class. We decided this by inspecting the page’s source code in the browser.
type(cs_people.find_all('div','peopleitem'))
bs4.element.ResultSet
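A string passed as the second argument to find_all is treated as a CSS class, and it matches any tag whose class list includes that name, even when the tag has other classes too. The sketch below shows this on a made-up snippet that imitates the multi-class divs on the people page. The names and the sidebar class are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the multi-class divs on the people page
snippet = """
<div class="peopleitem h-card has-thumbnail"><h3 class="p-name">Person A</h3></div>
<div class="peopleitem h-card"><h3 class="p-name">Person B</h3></div>
<div class="sidebar"><h3>Not a person</h3></div>
"""
soup = BeautifulSoup(snippet, 'html.parser')

# a string in the second position is shorthand for searching by class
positional = soup.find_all('div', 'peopleitem')
# the same search, written with the class_ keyword
keyword = soup.find_all('div', class_='peopleitem')

print(len(positional), len(keyword))
names = [div.h3.get_text() for div in positional]
print(names)
```

Both calls find the two peopleitem divs and skip the sidebar div, which is why searching for 'peopleitem' alone is enough even though the real tags also carry h-card and has-thumbnail.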
cs_people.find_all('div','peopleitem')
[<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor | Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/lisa-dipippo/"><img alt="" class="u-photo wp-post-image" height="126" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/lisa-dipippo-web.jpg" width="125"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/lisa-dipippo/">Lisa DiPippo</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Professor | Director of Undergraduate Studies</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:ldipippo@uri.edu">ldipippo@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/natallia-katenka/"><img alt="" class="u-photo wp-post-image" height="200" loading="lazy" sizes="(max-width: 200px) 100vw, 200px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/natalia-katenka-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/natalia-katenka-web.jpg 200w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/natalia-katenka-web-150x150.jpg 150w" width="200"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/natallia-katenka/">Natallia Katenka</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor | Director of Data Science</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><a class="u-email" href="mailto:nkatenka@uri.edu">nkatenka@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/krishna-venkatasubramanian/"><img alt="" class="u-photo wp-post-image" height="2703" loading="lazy" sizes="(max-width: 2437px) 100vw, 2437px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2.jpg 2437w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-270x300.jpg 270w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-768x852.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-923x1024.jpg 923w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-364x404.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-500x555.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-1000x1109.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-1280x1420.jpg 1280w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-2000x2218.jpg 2000w" width="2437"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/krishna-venkatasubramanian/">Krishna Venkatasubramanian</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor | Director of Graduate Studies</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:krish@uri.edu">krish@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/jing-wu/"><img alt="" class="u-photo wp-post-image" height="200" loading="lazy" sizes="(max-width: 200px) 100vw, 200px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jing-wu-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jing-wu-web.jpg 200w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/jing-wu-web-150x150.jpg 150w" width="200"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/jing-wu/">Jing Wu</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor | Director of Graduate Studies</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4504</span> <br/> <a class="u-email" href="mailto:jing_wu@uri.edu">jing_wu@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/yichi-zhang/"><img alt="" class="u-photo wp-post-image" height="240" loading="lazy" sizes="(max-width: 240px) 100vw, 240px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/yichi-zhang-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/yichi-zhang-web.jpg 240w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/yichi-zhang-web-150x150.jpg 150w" width="240"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/yichi-zhang/">Yichi Zhang</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor | Director of Undergraduate Studies</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><a class="u-email" href="mailto:yichizhang@uri.edu">yichizhang@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/marco-alvarez/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/marco-alvarez.png" width="120"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><span class="p-tel">401.874.5009</span> <br/> <a class="u-email" href="mailto:malvarez@uri.edu">malvarez@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card">
<header>
<div class="header">
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/samantha-armenti/">Samantha Armenti</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Teaching Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:sarmenti@uri.edu ">sarmenti@uri.edu </a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/sarah-brown/"><img alt="" class="u-photo wp-post-image" height="300" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Sarah-Brown.png" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Sarah-Brown.png 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Sarah-Brown-150x150.png 150w" width="300"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/sarah-brown/">Sarah Brown</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:brownsarahm@uri.edu">brownsarahm@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/michael-conti/"><img alt="" class="u-photo wp-post-image" height="2475" loading="lazy" sizes="(max-width: 2560px) 100vw, 2560px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-scaled.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-scaled.jpg 2560w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-300x290.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1024x990.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-768x743.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1536x1485.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-2048x1980.jpg 2048w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-364x352.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-500x483.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1000x967.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1280x1238.jpg 1280w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-2000x1934.jpg 2000w" width="2560"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/michael-conti/">Michael Conti</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Teaching Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:michaelconti@uri.edu ">michaelconti@uri.edu </a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/noah-daniels/"><img alt="" class="u-photo wp-post-image" height="219" loading="lazy" sizes="(max-width: 219px) 100vw, 219px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/noah-daniels-web.png" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/noah-daniels-web.png 219w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/noah-daniels-web-150x150.png 150w" width="219"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/noah-daniels/">Noah Daniels</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:noah_daniels@uri.edu">noah_daniels@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/victor-fay-wolfe/"><img alt="" class="u-photo wp-post-image" height="125" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/victor-fay-wolfe-web.jpg" width="125"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/victor-fay-wolfe/">Victor Fay-Wolfe</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:vfaywolfe@uri.edu">vfaywolfe@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card">
<header>
<div class="header">
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/lutz-hamel/">Lutz Hamel</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor </p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:lutzhamel@uri.edu">lutzhamel@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/abdeltawab-hendawi/"><img alt="" class="u-photo wp-post-image" height="885" loading="lazy" sizes="(max-width: 1000px) 100vw, 1000px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1.jpeg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1.jpeg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-300x266.jpeg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-768x680.jpeg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-364x322.jpeg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-500x443.jpeg 500w" width="1000"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/abdeltawab-hendawi/">Abdeltawab Hendawi</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor</p>
<p class="people-department">Data Science | Computer Science</p>
<p class="people-misc"><span class="p-tel">401.874.5738</span> <br/> <a class="u-email" href="mailto:hendawi@uri.edu">hendawi@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/jean-yves-herve/"><img alt="" class="u-photo wp-post-image" height="299" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jean-yves-herve-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jean-yves-herve-web.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/jean-yves-herve-web-150x150.jpg 150w" width="300"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/jean-yves-herve/">Jean-Yves Hervé</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:jyh@cs.uri.edu">jyh@cs.uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card">
<header>
<div class="header">
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/soheyb-kouider/">Soheyb Kouider</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Teaching Professor</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.2562</span> <br/> <a class="u-email" href="mailto:soheyb@uri.edu">soheyb@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/edmund-lamagna/"><img alt="" class="u-photo wp-post-image" height="150" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/edmund-lamagna-web.jpg" width="150"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/edmund-lamagna/">Edmund Lamagna</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:eal@cs.uri.edu">eal@cs.uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/indrani-mandal/"><img alt="" class="u-photo wp-post-image" height="280" loading="lazy" sizes="(max-width: 278px) 100vw, 278px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0.jpeg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0.jpeg 278w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-150x150.jpeg 150w" width="278"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/indrani-mandal/">Indrani Mandal</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Teaching Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:indrani_mandal@uri.edu ">indrani_mandal@uri.edu </a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card">
<header>
<div class="header">
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/jonathan-schrader/">Jonathan Schrader</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Teaching Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:jonathan.schrader@uri.edu">jonathan.schrader@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/shaun-wallace/"><img alt="Shaun Wallace" class="u-photo wp-post-image" height="300" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Shaun-Wallace_300.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Shaun-Wallace_300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Shaun-Wallace_300-150x150.jpg 150w" width="300"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/shaun-wallace/">Shaun Wallace</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:shaun.wallace@uri.edu">shaun.wallace@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card">
<header>
<div class="header">
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/yunshu-jasmine-wang/">Yunshu (Jasmine) Wang</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Teaching Professor</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><a class="u-email" href="mailto:yunshu_wang@uri.edu">yunshu_wang@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/haihan-mark-yu/"><img alt="Haihan Yu" class="u-photo wp-post-image" height="300" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/haihan_2307.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/haihan_2307.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/haihan_2307-150x150.jpg 150w" width="300"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/haihan-mark-yu/">Haihan (Mark) Yu</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><a class="u-email" href="mailto:haihan.yu@uri.edu">haihan.yu@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/guangyu-zhu/"><img alt="" class="u-photo wp-post-image" height="200" loading="lazy" sizes="(max-width: 200px) 100vw, 200px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Guangyu-Zhu-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Guangyu-Zhu-web.jpg 200w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Guangyu-Zhu-web-150x150.jpg 150w" width="200"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/guangyu-zhu/">Guangyu Zhu</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><a class="u-email" href="mailto:guangyuzhu@uri.edu">guangyuzhu@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/ashley-buchanan/"><img alt="" class="u-photo wp-post-image" height="966" loading="lazy" sizes="(max-width: 966px) 100vw, 966px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan.jpg 966w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-500x500.jpg 500w" width="966"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/ashley-buchanan/">Ashley Buchanan</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Limited Joint Appointment</p>
<p class="people-department">Biostatistics</p>
<p class="people-misc"><span class="p-tel">401.874.4739</span> <br/> <a class="u-email" href="mailto:buchanan@uri.edu">buchanan@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/nina-kajiji/"><img alt="" class="u-photo wp-post-image" height="125" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/nina-kajiji-web.jpg" width="125"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/nina-kajiji/">Nina Kajiji</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Adjunct Associate Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:nina@uri.edu">nina@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/rachel-schwartz/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Rachel-Schwartz-web.jpg" width="120"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/rachel-schwartz/">Rachel Schwartz</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor – Limited Joint Appointment</p>
<p class="people-department">Biological Sciences</p>
<p class="people-misc"><span class="p-tel">401.874.5404</span> <br/> <a class="u-email" href="mailto:rsschwartz@uri.edu">rsschwartz@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>,
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/ying-zhang/"><img alt="" class="u-photo wp-post-image" height="250" loading="lazy" sizes="(max-width: 249px) 100vw, 249px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/ying-zhang-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/ying-zhang-web.jpg 249w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/ying-zhang-web-150x150.jpg 150w" width="249"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/ying-zhang/">Ying Zhang</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor – Limited Joint Appointment</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><span class="p-tel">401.874.4915</span> <br/> <a class="u-email" href="mailto:yingzhang@uri.edu">yingzhang@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>]
This is a long object, and we can see that it looks iterable (note the [
at the start).
people_items = cs_people.find_all('div','peopleitem')
len(people_items)
27
Important
answer to questions about searching from the docs
We can also look at only the first instance
people_items[0]
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor | Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>
We notice that the name is inside a <h3>
tag with class p-name
and then inside an a tag after that. We also know from looking at the overall page that there are lots of other a tags, so we do not want to search all of those.
people_items[0].find('h3','p-name')
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
Then we see that there is an <a>
tag inside this, so we can pull that out next. We can use the tag attribute, because the first instance of the tag is exactly what we want.
people_items[0].find('h3','p-name').a
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a>
Inside that, the text is stored as a string, so we can pull that out.
people_items[0].find('h3','p-name').a.string
'Gavino Puggioni'
Finally, now that we know how to get one out, we can put it all in a list comprehension
names = [person.find('h3','p-name').a.string for person in people_items]
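The same pattern can be tried offline on a small, made-up HTML snippet. This is only a sketch: the names and URLs below are placeholders, not data from the URI page, and only the tag and class structure mirrors what we inspected.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure of the people page
toy_html = """
<div class="peopleitem"><h3 class="p-name"><a href="https://example.com/a/">Ada Example</a></h3></div>
<div class="peopleitem"><h3 class="p-name"><a href="https://example.com/g/">Grace Sample</a></h3></div>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')

# find_all returns every matching tag; the second argument filters by class
toy_items = toy_soup.find_all('div', 'peopleitem')

# for each item: find the h3, take its first <a>, then take that tag's text
toy_names = [item.find('h3', 'p-name').a.string for item in toy_items]
print(toy_names)
```

Working on a tiny snippet like this makes it easier to check each step before running it on the full page.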
9.4. Pulling more information#
First, we look at the whole person entry again.
people_items[0]
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor | Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>
How can we pull out the title for each person (e.g. Assistant Teaching Professor, Associate Professor)?
On one item, the p
tag with the people-title
class gets us what we want, so
we can try the same pattern we used for the names.
[person.find('p','people-title').a.string for person in people_items]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[19], line 1
----> 1 [person.find('p','people-title').a.string for person in people_items]
Cell In[19], line 1, in <listcomp>(.0)
----> 1 [person.find('p','people-title').a.string for person in people_items]
AttributeError: 'NoneType' object has no attribute 'string'
This gives an error because there is no <a>
tag inside the <p>
tag.
Python’s null concept (outside of pandas and NumPy, which use NaN float values) is None.
We can assign it to a variable if we want.
a = None
Its type is NoneType
type(a)
NoneType
So the error message says that the thing we are applying .string
to is None.
That thing is the .a,
which means there is no <a>
inside a <p class = 'people-title'>.
If we take out the .a,
it works, and then we can iterate and store the results as above.
titles = [person.find('p','people-title').string for person in people_items]
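To see the None behavior in isolation, here is a small sketch on a made-up snippet (the HTML is a placeholder). It also shows one possible way to guard against a missing tag. The helper name string_or_none is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

toy_p = BeautifulSoup('<p class="people-title">Professor</p>', 'html.parser').p

# there is no <a> inside this <p>, so the tag attribute gives None
print(toy_p.a is None)

# asking None for .string is what raised the AttributeError above
try:
    toy_p.a.string
except AttributeError as err:
    print(err)


def string_or_none(tag):
    """Hypothetical helper: return tag.string, or None if the tag is missing."""
    return tag.string if tag is not None else None


print(string_or_none(toy_p.a))   # the missing <a>
print(string_or_none(toy_p))     # the <p> itself
```

A guard like this is optional here, since every person on the page has a title, but it can help when some entries are incomplete.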
We can pull out two more things: the people-department class indicates who is CS and who is Statistics, and the u-email class holds each email address.
disciplines = [d.string for d in cs_people.find_all("p",'people-department')]
emails = [e.string for e in cs_people.find_all("a",'u-email')]
Finally, we can use the DataFrame constructor to make it a table. I chose to use a dictionary in class.
css_df = pd.DataFrame({'name':names,'title':titles,'email':emails,'discipline':disciplines})
css_df
 | name | title | email | discipline |
---|---|---|---|---|
0 | Gavino Puggioni | Associate Professor | Chair | gpuggioni@uri.edu | Statistics |
1 | Lisa DiPippo | Professor | Director of Undergraduate Studies | ldipippo@uri.edu | Computer Science |
2 | Natallia Katenka | Associate Professor | Director of Data Science | nkatenka@uri.edu | Statistics |
3 | Krishna Venkatasubramanian | Assistant Professor | Director of Graduate Stu... | krish@uri.edu | Computer Science |
4 | Jing Wu | Associate Professor | Director of Graduate Stu... | jing_wu@uri.edu | Statistics |
5 | Yichi Zhang | Assistant Professor | Director of Undergraduat... | yichizhang@uri.edu | Statistics |
6 | Marco Alvarez | Associate Professor | malvarez@uri.edu | Computer Science |
7 | Samantha Armenti | Associate Teaching Professor | sarmenti@uri.edu | Computer Science |
8 | Sarah Brown | Assistant Professor | brownsarahm@uri.edu | Computer Science |
9 | Michael Conti | Associate Teaching Professor | michaelconti@uri.edu | Computer Science |
10 | Noah Daniels | Associate Professor | noah_daniels@uri.edu | Computer Science |
11 | Victor Fay-Wolfe | Professor | vfaywolfe@uri.edu | Computer Science |
12 | Lutz Hamel | Associate Professor | lutzhamel@uri.edu | Computer Science |
13 | Abdeltawab Hendawi | Assistant Professor | hendawi@uri.edu | Data Science | Computer Science |
14 | Jean-Yves Hervé | Associate Professor | jyh@cs.uri.edu | Computer Science |
15 | Soheyb Kouider | Associate Teaching Professor | soheyb@uri.edu | Statistics |
16 | Edmund Lamagna | Professor | eal@cs.uri.edu | Computer Science |
17 | Indrani Mandal | Associate Teaching Professor | indrani_mandal@uri.edu | Computer Science |
18 | Jonathan Schrader | Assistant Teaching Professor | jonathan.schrader@uri.edu | Computer Science |
19 | Shaun Wallace | Assistant Professor | shaun.wallace@uri.edu | Computer Science |
20 | Yunshu (Jasmine) Wang | Assistant Teaching Professor | yunshu_wang@uri.edu | Statistics |
21 | Haihan (Mark) Yu | Assistant Professor | haihan.yu@uri.edu | Statistics |
22 | Guangyu Zhu | Assistant Professor | guangyuzhu@uri.edu | Statistics |
23 | Ashley Buchanan | Limited Joint Appointment | buchanan@uri.edu | Biostatistics |
24 | Nina Kajiji | Adjunct Associate Professor | nina@uri.edu | Computer Science |
25 | Rachel Schwartz | Assistant Professor – Limited Joint Appointment | rsschwartz@uri.edu | Biological Sciences |
26 | Ying Zhang | Assistant Professor – Limited Joint Appointment | yingzhang@uri.edu | Computer Science |
We could also try a list of lists with a separate list of column names.
css_df_b = pd.DataFrame(data=[names,titles,emails,disciplines],
columns =['names','titles','emails','disciplines'])
css_df_b
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:934, in _finalize_columns_and_data(content, columns, dtype)
933 try:
--> 934 columns = _validate_or_indexify_columns(contents, columns)
935 except AssertionError as err:
936 # GH#26429 do not raise user-facing AssertionError
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:981, in _validate_or_indexify_columns(content, columns)
979 if not is_mi_list and len(columns) != len(content): # pragma: no cover
980 # caller's responsibility to check for this...
--> 981 raise AssertionError(
982 f"{len(columns)} columns passed, passed data had "
983 f"{len(content)} columns"
984 )
985 if is_mi_list:
986 # check if nested list column, length of each sub-list should be equal
AssertionError: 4 columns passed, passed data had 27 columns
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Cell In[25], line 1
----> 1 css_df_b = pd.DataFrame(data=[names,titles,emails,disciplines],
2 columns =['names','titles','emails','disciplines'])
3 css_df_b
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:782, in DataFrame.__init__(self, data, index, columns, dtype, copy)
780 if columns is not None:
781 columns = ensure_index(columns)
--> 782 arrays, columns, index = nested_data_to_arrays(
783 # error: Argument 3 to "nested_data_to_arrays" has incompatible
784 # type "Optional[Collection[Any]]"; expected "Optional[Index]"
785 data,
786 columns,
787 index, # type: ignore[arg-type]
788 dtype,
789 )
790 mgr = arrays_to_mgr(
791 arrays,
792 columns,
(...)
795 typ=manager,
796 )
797 else:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:498, in nested_data_to_arrays(data, columns, index, dtype)
495 if is_named_tuple(data[0]) and columns is None:
496 columns = ensure_index(data[0]._fields)
--> 498 arrays, columns = to_arrays(data, columns, dtype=dtype)
499 columns = ensure_index(columns)
501 if index is None:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:840, in to_arrays(data, columns, dtype)
837 data = [tuple(x) for x in data]
838 arr = _list_to_arrays(data)
--> 840 content, columns = _finalize_columns_and_data(arr, columns, dtype)
841 return content, columns
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:937, in _finalize_columns_and_data(content, columns, dtype)
934 columns = _validate_or_indexify_columns(contents, columns)
935 except AssertionError as err:
936 # GH#26429 do not raise user-facing AssertionError
--> 937 raise ValueError(err) from err
939 if len(contents) and contents[0].dtype == np.object_:
940 contents = convert_object_array(contents, dtype=dtype)
ValueError: 4 columns passed, passed data had 27 columns
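The error arises because, given a list of lists, the constructor treats each inner list as a row, so it sees 4 rows of 27 values rather than 27 rows of 4. One way around this is to regroup the lists with zip before passing them in. This is a sketch with made-up values standing in for our lists.

```python
import pandas as pd

# placeholder values standing in for names, titles, emails, disciplines
toy_names = ['Ada Example', 'Grace Sample', 'Alan Demo']
toy_titles = ['Professor', 'Associate Professor', 'Assistant Professor']
toy_emails = ['ada@example.com', 'grace@example.com', 'alan@example.com']
toy_disciplines = ['Computer Science', 'Statistics', 'Computer Science']

# zip pairs up the i-th entry of every list, giving one tuple per person
rows = list(zip(toy_names, toy_titles, toy_emails, toy_disciplines))

toy_df = pd.DataFrame(data=rows,
                      columns=['name', 'title', 'email', 'discipline'])
print(toy_df.shape)
```

With the data arranged as one row per person, the four column names line up with the four values in each row.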
9.5. Crawling and scraping#
Remember that we pulled the names out of links. When we click on those links in the browser, we see that each one goes to a profile page. These pages list the person’s office location. Let’s add those to our DataFrame.
First, we will do it for one person, then make a loop.
people_items[0].find('h3','p-name').a
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a>
We see that the information we want is in the href
attribute. To read that, we check the documentation, which tells us there is a .attrs
attribute on the Python object we are working with.
people_items[0].find('h3','p-name').a.attrs
{'href': 'https://web.uri.edu/cs/meet/gavino-puggioni/'}
It’s a dictionary, and the name of the HTML attribute we want (href) is the key.
puggioni_url = people_items[0].find('h3','p-name').a.attrs['href']
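As a side note, BeautifulSoup tags also support reading attributes by indexing the tag directly and through a .get method. The sketch below compares the three on a placeholder link (the URL is made up).

```python
from bs4 import BeautifulSoup

# placeholder link, not a real profile page
toy_a = BeautifulSoup('<a href="https://example.com/meet/ada/">Ada Example</a>',
                      'html.parser').a

print(toy_a.attrs)            # all attributes as a dictionary
print(toy_a.attrs['href'])    # look up one key, as we did above
print(toy_a['href'])          # indexing the tag reads the same dictionary
print(toy_a.get('title'))     # .get gives None instead of a KeyError when missing
```

Which form to use is mostly a matter of taste; .get can be handy when an attribute might be absent.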
Now, we do the same thing we did above, request, pull the content from the response and then use the parser.
puggioni_html = requests.get(puggioni_url).content
puggioni_info = BeautifulSoup(puggioni_html,'html.parser')
Then we find the tag and class we need from inspecting and pull that.
puggioni_info.find_all('li','people-location')
[<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>]
It’s an iterable, so we pull the item out.
puggioni_info.find_all('li','people-location')[0]
<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>
Then we try to pull the string out, and that is empty. When a tag has more than one child, as this one does, .string is None.
puggioni_info.find_all('li','people-location')[0].string
Here, we could go to the documentation and look up what the object contains, but instead we can inspect the object directly.
We can use the Python __dict__
attribute to inspect the object and see where it stored what we want.
puggioni_info.find_all('li','people-location')[0].__dict__
{'parser_class': bs4.BeautifulSoup,
'name': 'li',
'namespace': None,
'_namespaces': {},
'prefix': None,
'sourceline': 372,
'sourcepos': 303,
'known_xml': False,
'attrs': {'class': ['people-location']},
'contents': [<strong>Office Location:</strong>, ' Tyler Hall 254'],
'parent': <ul class="people-list">
<li class="people-title">Associate Professor | Chair</li> <li class="people-department">Statistics</li> <li class="people-phone"><strong>Phone:</strong> 401.874.4388</li> <li class="people-email"><strong>Email:</strong> <a href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></li> <li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>
</ul>,
'previous_element': ' ',
'next_element': <strong>Office Location:</strong>,
'next_sibling': '\n',
'previous_sibling': ' ',
'hidden': False,
'can_be_empty_element': False,
'cdata_list_attributes': {'*': ['class', 'accesskey', 'dropzone'],
'a': ['rel', 'rev'],
'link': ['rel', 'rev'],
'td': ['headers'],
'th': ['headers'],
'form': ['accept-charset'],
'object': ['archive'],
'area': ['rel'],
'icon': ['sizes'],
'iframe': ['sandbox'],
'output': ['for']},
'preserve_whitespace_tags': {'pre', 'textarea'},
'interesting_string_types': (bs4.element.NavigableString, bs4.element.CData)}
We see it’s the second element in the list stored under the 'contents'
key.
puggioni_info.find_all('li','people-location')[0].contents[1]
' Tyler Hall 254'
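The same behavior can be reproduced on the single line of HTML shown in the output above, without any request. This sketch also strips the leading space with .strip(), which is an addition here and not something we did in class.

```python
from bs4 import BeautifulSoup

li_html = '<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>'
toy_li = BeautifulSoup(li_html, 'html.parser').li

# two children (a <strong> tag and a bare string), so .string is None
print(toy_li.string)

# .contents is the list of children; the office is the second one
print(toy_li.contents)

# strip() removes the leading space from the text
office = toy_li.contents[1].strip()
print(office)
```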
Now that we know how to do it, we can put it in a loop.
offices = []
for name_link in cs_people.find_all('h3','p-name'):
    # follow each person's link to their profile page
    url = name_link.a.attrs['href']
    person_html = requests.get(url).content
    person_info = BeautifulSoup(person_html,'html.parser')
    try:
        offices.append(person_info.find_all('li','people-location')[0].contents[1])
    except IndexError:
        # no office location listed on this page
        offices.append(pd.NA)
css_df['office'] = offices
We added the try
and except
to handle the case where there is no office location. In practice, this is something you would usually think to add after hitting an error.
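One way to make the loop body easier to test, sketched here under assumptions, is to move the parsing into a function that takes HTML text, so it can be checked on made-up pages without any requests. The function name office_from_html and both sample pages are hypothetical.

```python
from bs4 import BeautifulSoup


def office_from_html(page_html):
    """Hypothetical helper: return the office text from a profile page, or None."""
    page = BeautifulSoup(page_html, 'html.parser')
    location = page.find('li', 'people-location')
    # find returns None when no tag matches, so check before going further
    if location is None or len(location.contents) < 2:
        return None
    return location.contents[1].strip()


# made-up pages: one with an office, one without
with_office = '<ul><li class="people-location"><strong>Office Location:</strong> Tyler Hall 000</li></ul>'
without_office = '<ul><li class="people-email">someone@example.com</li></ul>'

print(office_from_html(with_office))
print(office_from_html(without_office))
```

This checks for the missing tag explicitly instead of relying on an exception, which is a design choice rather than a requirement.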
Here we check the info
to see how complete each column is.
css_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 27 non-null object
1 title 27 non-null object
2 email 27 non-null object
3 discipline 27 non-null object
4 office 25 non-null object
dtypes: object(5)
memory usage: 1.2+ KB
Finally, we can save our finished dataset:
css_df.to_csv('css_faculty.csv')
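By default, to_csv also writes the index as an unnamed first column, which comes back as an extra column when the file is read again. Here is a small sketch with placeholder data, writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'name': ['Ada Example', 'Grace Sample'],
                       'office': ['Tyler Hall 000', None]})

# default: the index is saved too
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# index=False leaves it out
no_index = io.StringIO()
toy_df.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = pd.read_csv(no_index).columns.tolist()
print(cols_no_index)
```

Either choice is fine as long as you read the file back the matching way, for example with index_col=0 when the index was saved.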
9.6. Questions after class#
9.6.1. what does .a do?#
It gives the first instance of the <a>
tag inside the tag we call it on.
9.6.2. is it worth it to try and web scrape a page that is poorly written?#
If it is important information. In these cases, you might have to do more manual parsing or even some manual fixes.
For this class, no.
9.6.3. Is there a way to check robots.txt through BeautifulSoup or must that be done manually in a browser?#
It could be read programmatically, but it doesn’t necessarily save time to do it that way.
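If you do want to try it, the Python standard library (not BeautifulSoup) has a parser for these files in urllib.robotparser. This is a minimal sketch on made-up rules, so it runs without contacting any site; the rules and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# made-up rules; a real file lives at <site>/robots.txt
toy_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(toy_rules)   # for a live site, use set_url(...) and read() instead

people_ok = parser.can_fetch('*', 'https://example.com/people/')
private_ok = parser.can_fetch('*', 'https://example.com/private/notes.html')
print(people_ok, private_ok)
```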
9.6.4. What else can I do with inspect?#
It lets you view the code. It’s most often used to debug websites.
9.6.5. when web scraping if the html is not set up well is it possible to change the html to make it easier to parse#
Technically you could manually edit a copy of it.
9.6.6. Are there instances where you can get data from websites that are not in tabular form?#
Web scraping is for when the website is not in tabular form. It should be structured, but the structure does not need to come from a single page. It could be that there are many pages structured similarly and you build most of the columns from the other pages, not the starting page.
For example, from the teams page of the NBA you can get to a page with info about each team that includes all-time records and the current rosters. On these individual pages, most info is an actual table, so you can use pd.read_html
for those, but the crawling part from the first page would count.
9.6.7. A source table would be the people’s page on the URI website, but when you click on their individual names does that count as another source table?#
Not as we did above because we combined the data by adding another column. If you built a whole table on each of the sub-pages it would count.
9.7. Portfolio Question#
9.7.1. I guess how to further edit the submission_1.intro. I know about the chapters you gotta add, but what else? Is submission_1.intro the only file you gotta edit?#
Yes, you edit that file and the _toc.yml.
There are instructions on the portfolio page.
There are also formatting tips and ideas.
9.7.2. for portfolio one, can we submit whatever we want? like as little OR as much?#
Yes, exactly!
However, I do want to really encourage you to submit whatever you are thinking of, even if it is not as complete as you want. If you submit, you will get feedback even if you do not earn all of the achievements you try. That will prepare you for the next one.