9. Web Scraping#



import requests
from bs4 import BeautifulSoup
import pandas as pd


If Python reports that it cannot load one of these libraries, install it with pip from inside your notebook:

pip install beautifulsoup4

Then restart your kernel (in the Kernel menu, choose Restart).

9.1. Getting Data From Websites#

We have seen that read_html can get content from an actual website, rather than from a data file that happens to be hosted somewhere on the internet. It takes the tables on a web page and returns a list of DataFrames.

pandas gets the tables by looking in the HTML for the site and finding the <table> tags.
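To make "finding the <table> tags" concrete, here is a rough sketch on a small made-up HTML string. It uses BeautifulSoup (introduced below) and is not how pandas is implemented internally; the snippet and the names sample_html and tables are invented for this illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# made-up page with one table; not taken from any real site
sample_html = """
<html><body>
<h1>Offices</h1>
<table>
  <tr><th>name</th><th>office</th></tr>
  <tr><td>Ada</td><td>Room 101</td></tr>
  <tr><td>Grace</td><td>Room 202</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

tables = []
for table in soup.find_all('table'):
    rows = table.find_all('tr')
    # the first row holds the column names, the remaining rows hold the data
    header = [cell.get_text(strip=True) for cell in rows[0].find_all('th')]
    data = [[cell.get_text(strip=True) for cell in row.find_all('td')]
            for row in rows[1:]]
    tables.append(pd.DataFrame(data, columns=header))

print(len(tables))
print(tables[0])
```

read_html does this kind of search for you and handles many more cases, returning one DataFrame per table it finds.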

9.2. Everything is Data#

For the purpose of this class, it is best to think of the content on a web page as a data structure.

html anatomy

HTML tree structure

There are tags (written inside <>) that define the structure, and these can be further classified with classes.
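As a small illustration of treating a page as a data structure, the sketch below parses a made-up one-line HTML snippet and reads off the tag name, its classes, and its children. The snippet and variable names are invented for this example.

```python
from bs4 import BeautifulSoup

# made-up snippet: one div with two classes and two child tags
snippet = '<div class="card featured"><h3 class="title">Hello</h3><p>Some text</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

div = soup.div
print(div.name)        # the tag name
print(div['class'])    # classes come back as a list, since a tag can have several

# the tags nested directly inside the div
child_names = [child.name for child in div.children]
print(child_names)
```

The nesting of tags is what gives the page its tree shape, and the classes are what we will search on later.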

9.3. Scraping a URI website#

We’re going to create a DataFrame about URI CS & Statistics faculty from the people page of the department website.

We can inspect the page to check that it’s well structured.


With great power comes great responsibility.

  • always check the robots.txt

  • do not do things that the owner says not to do

  • government websites are typically safe
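One possible way to check a robots.txt file from Python is the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt so it runs without a network connection; the rules and the example.com URLs are assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# made-up robots.txt rules for illustration
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) reports whether the rules allow that page
people_allowed = parser.can_fetch("*", "https://example.com/people/")
private_allowed = parser.can_fetch("*", "https://example.com/private/data.html")
print(people_allowed, private_allowed)

# for a real site you would instead point the parser at the file and read it:
# parser.set_url("https://example.com/robots.txt")
# parser.read()
```

A robots.txt file only covers what automated tools are asked not to fetch; any terms of use on the site still apply.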

We’ll save the URL for easy use

cs_people_url = 'https://web.uri.edu/cs/people/'

Then we can use the requests library to make a request over the internet. It actually gets back a response object, which has a lot of extra information. For today we only need the content of the page, which is an attribute of that object:

cs_people_html = requests.get(cs_people_url).content

This is the raw HTML, stored as bytes. The beginning of it looks like this:

b'\n<!DOCTYPE html>\n<html lang="en-US">\n\t\n<head>\n<meta charset="UTF-8"><script type="text/javascript">(window.NREUM||(NREUM={})).init={privacy:{cookies_enabled:true},ajax:{deny_list:["bam.nr-data.net"]},'
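The response object carries more than the content, including whether the request succeeded. A minimal sketch of a helper that checks this before returning the content might look like the following; the function name fetch_html and the 10 second timeout are assumptions made for this example, not part of the class code.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an error if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.content

# usage (makes a real network request, so it is left commented out here):
# cs_people_html = fetch_html('https://web.uri.edu/cs/people/')
```

Checking the status first means a missing page shows up as an error right away, rather than as confusing results when parsing.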

We do not need to write our own tools to search this raw text; that is what BeautifulSoup is for.

cs_people = BeautifulSoup(cs_people_html,'html.parser')

9.3.1. Looking at tags#

In this object we can use any tag name from the file as an attribute and get back the first instance of that tag. For example, cs_people.a gives:

<a class="skip-link screen-reader-text" href="#content">Skip to content</a>

We also see <h3> in the code, so we can get the first one in the same way, with cs_people.h3:

<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>

This cheatsheet shows lots of HTML tags, but for this purpose you do not really need it. You’ll be inspecting the page and then looking for what you want.
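Here is a short sketch of the "first instance" behavior on a made-up snippet, so it can be run without the real page. The HTML and the names in it are invented for illustration.

```python
from bs4 import BeautifulSoup

# made-up snippet with one standalone link and two headings
sample_html = """
<a href="#content">Skip to content</a>
<h3 class="p-name"><a href="/meet/person-one/">Person One</a></h3>
<h3 class="p-name"><a href="/meet/person-two/">Person Two</a></h3>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

first_link = soup.a    # the first <a> anywhere in the document
first_h3 = soup.h3     # the first <h3>; the second one is not returned
print(first_link)
print(first_h3)

# using the tag name as an attribute is shorthand for find()
same_result = soup.find('h3') == first_h3
print(same_result)
```

Because only the first match comes back, this style works best when you already know the first instance is the one you want.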

9.3.2. Searching the source#

More helpful is the find_all method. We want to find all div tags that are in the “peopleitem” class; we decided this by inspecting the code on the website.

The result is a long object, and we can see that it looks iterable (there is a [ at the start of the output).

people_items = cs_people.find_all('div','peopleitem')


The BeautifulSoup docs answer further questions about searching.
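As a sketch of how this search behaves, the example below runs find_all on a made-up snippet. It shows that the second positional argument is matched against the tag's classes, that a tag with several classes still matches, and that the class_ keyword gives the same result. The HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# made-up snippet: two "peopleitem" divs and one unrelated div
sample_html = """
<div class="peopleitem h-card"><h3 class="p-name">Person One</h3></div>
<div class="sidebar"><h3>Not a person</h3></div>
<div class="peopleitem h-card has-thumbnail"><h3 class="p-name">Person Two</h3></div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

by_position = soup.find_all('div', 'peopleitem')          # class as second argument
by_keyword = soup.find_all('div', class_='peopleitem')    # same search, spelled out
all_divs = soup.find_all('div')                           # no class filter

print(len(by_position), len(by_keyword), len(all_divs))
```

The keyword is class_ with a trailing underscore because class is a reserved word in Python.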

We can also look at only the first instance, people_items[0]:

<div class="peopleitem h-card has-thumbnail">
<div class="header">
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
<div class="inside">
<p class="people-title p-job-title">Associate Professor |  Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>

We notice that the name is inside an <h3> tag with class p-name, and then inside an <a> tag within that. We also know from looking at the overall page that there are lots of other <a> tags, so we do not want to search all of those. Instead, we search within this one item for the h3 tag with the p-name class:

<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>

In this result there is an <a> tag, so we can pull that out next. We can use the tag name as an attribute (.a), because the first instance of the tag is exactly what we want.

<a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a>

Inside that, the text is stored as a string, so we can pull that out with the .string attribute:

'Gavino Puggioni'

Finally, now that we know how to get one out, we can put it all in a list comprehension

names = [person.find('h3','p-name').a.string for person in people_items]
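The chain of steps above can be replayed one at a time on a small made-up snippet. This is only a sketch: the HTML below imitates the structure of the people page, and the two names in it are invented.

```python
from bs4 import BeautifulSoup

# made-up snippet imitating two entries on the people page
sample_html = """
<div class="peopleitem"><h3 class="p-name"><a href="/meet/person-one/">Person One</a></h3></div>
<div class="peopleitem"><h3 class="p-name"><a href="/meet/person-two/">Person Two</a></h3></div>
"""
people_items = BeautifulSoup(sample_html, 'html.parser').find_all('div', 'peopleitem')

# step by step for one item
first = people_items[0]
name_heading = first.find('h3', 'p-name')   # the <h3 class="p-name"> tag
name_link = name_heading.a                  # the first <a> inside it
name = name_link.string                     # the text inside the link
print(name)

# the same chain for every item at once
names = [person.find('h3', 'p-name').a.string for person in people_items]
print(names)
```

Working out the chain on one item first makes it easier to see which step fails if the page structure is not what you expected.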

9.4. Pulling more information#

First, we look at the whole person entry again.

<div class="peopleitem h-card has-thumbnail">
<div class="header">
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
<div class="inside">
<p class="people-title p-job-title">Associate Professor |  Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>

How do we pull out the title for each person (e.g. Assistant Teaching Professor, Associate Professor)?

On one item, the p tag with the people-title class gets us what we want, and then we can take its string, as we did for the names.
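A possible way to collect several fields and combine them into a DataFrame is sketched below on made-up data. The column names follow the DataFrame shown later in these notes, but which tag and class feeds each column is an assumption based on the entry above, and sample_df is a name invented for this example.

```python
from bs4 import BeautifulSoup
import pandas as pd

# made-up snippet imitating two entries on the people page
sample_html = """
<div class="peopleitem">
<h3 class="p-name"><a href="/meet/person-one/">Person One</a></h3>
<p class="people-title p-job-title">Associate Professor</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><a class="u-email" href="mailto:one@example.com">one@example.com</a></p>
</div>
<div class="peopleitem">
<h3 class="p-name"><a href="/meet/person-two/">Person Two</a></h3>
<p class="people-title p-job-title">Assistant Teaching Professor</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><a class="u-email" href="mailto:two@example.com">two@example.com</a></p>
</div>
"""
people_items = BeautifulSoup(sample_html, 'html.parser').find_all('div', 'peopleitem')

# one list per column, each built with the same find-then-string pattern
# str() converts BeautifulSoup's string type to a plain Python string
names = [str(p.find('h3', 'p-name').a.string) for p in people_items]
titles = [str(p.find('p', 'people-title').string) for p in people_items]
disciplines = [str(p.find('p', 'people-department').string) for p in people_items]
emails = [str(p.find('a', 'u-email').string) for p in people_items]

sample_df = pd.DataFrame({'name': names, 'title': titles,
                          'email': emails, 'discipline': disciplines})
print(sample_df)
```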

9.5. Crawling and scraping#

Remember that we pulled the names out of links. When we click on those links in the browser, we see that they go to a profile page for each person. These pages list the office number. Let’s add those to our DataFrame.

First, we will do it for one person, then make a loop.

<a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a>

We see that the information we want is in the href attribute. To read that, we check the documentation, which tells us there is an .attrs attribute on the Python object we are working with.

{'href': 'https://web.uri.edu/cs/meet/gavino-puggioni/'}

It’s a dictionary, and the name of the attribute we want (href) is the key we need.

puggioni_url = people_items[0].find('h3','p-name').a.attrs['href']
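For reference, here is a small sketch of reading attributes from a tag, using a made-up link. It shows the .attrs dictionary used above, plus two shorthands BeautifulSoup provides: indexing the tag directly and .get, which returns None for a missing attribute instead of raising an error.

```python
from bs4 import BeautifulSoup

# made-up link for illustration
link = BeautifulSoup('<a href="https://example.com/meet/person-one/">Person One</a>',
                     'html.parser').a

print(link.attrs)            # all attributes as a dictionary
print(link.attrs['href'])    # look up one attribute by key
print(link['href'])          # shorthand for the same lookup

missing = link.get('title')  # this link has no title attribute
print(missing)
```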

Now, we do the same thing we did above, request, pull the content from the response and then use the parser.

puggioni_html = requests.get(puggioni_url).content
puggioni_info = BeautifulSoup(puggioni_html,'html.parser')

Then we find the tag and class we need from inspecting the page and pull that out.

[<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>]

It’s an iterable, so we pull the item out:

<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>

Then we try to pull the string out, and that is empty. A tag's .string is None when the tag contains more than one piece of content, as this one does.


Here, we could go to the documentation and look up what the object contains, but instead we can inspect the object directly.
We can use the Python __dict__ attribute to see where the object stored what we want.

{'parser_class': bs4.BeautifulSoup,
 'name': 'li',
 'namespace': None,
 '_namespaces': {},
 'prefix': None,
 'sourceline': 372,
 'sourcepos': 303,
 'known_xml': False,
 'attrs': {'class': ['people-location']},
 'contents': [<strong>Office Location:</strong>, ' Tyler Hall 254'],
 'parent': <ul class="people-list">
 <li class="people-title">Associate Professor |  Chair</li> <li class="people-department">Statistics</li> <li class="people-phone"><strong>Phone:</strong> 401.874.4388</li> <li class="people-email"><strong>Email:</strong> <a href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></li> <li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>
 'previous_element': ' ',
 'next_element': <strong>Office Location:</strong>,
 'next_sibling': '\n',
 'previous_sibling': ' ',
 'hidden': False,
 'can_be_empty_element': False,
 'cdata_list_attributes': {'*': ['class', 'accesskey', 'dropzone'],
  'a': ['rel', 'rev'],
  'link': ['rel', 'rev'],
  'td': ['headers'],
  'th': ['headers'],
  'form': ['accept-charset'],
  'object': ['archive'],
  'area': ['rel'],
  'icon': ['sizes'],
  'iframe': ['sandbox'],
  'output': ['for']},
 'preserve_whitespace_tags': {'pre', 'textarea'},
 'interesting_string_types': (bs4.element.NavigableString, bs4.element.CData)}

We see it’s the second element in the list stored under the 'contents' key:

' Tyler Hall 254'
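The behavior above can be reproduced on the single li tag shown earlier. This sketch also shows get_text as one alternative when a tag holds mixed content; that alternative is an addition here, not something used in class.

```python
from bs4 import BeautifulSoup

item = BeautifulSoup(
    '<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>',
    'html.parser').li

# .string is None because the tag has two children: a <strong> tag and a text node
print(item.string)

# .contents lists the children in order
print(item.contents)

# the office is the second child; strip() removes the leading space
office = item.contents[1].strip()
print(office)

# get_text() joins all of the text inside the tag, label included
full_text = item.get_text()
print(full_text)
```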

Now that we know how to do it, we can put it in a loop.

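offices = []
for name_link in cs_people.find_all('h3','p-name'):
    # follow the link on each name to that person's profile page
    url = name_link.a.attrs['href']
    person_html = requests.get(url).content
    person_info = BeautifulSoup(person_html,'html.parser')
    try:
        # the office is the second item in the contents of the location tag
        offices.append(person_info.find_all('li','people-location')[0].contents[1])
    except IndexError:
        # no office location is listed on this profile page
        offices.append(pd.NA)

css_df['office'] = offices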

We added the try and except to handle the case where there is no office location. In practice, this is something you would often think to do only after running into an error.
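One way to try out this kind of loop without hitting the real site is to pass in the function that fetches a page, so a fake one can stand in during testing. The sketch below does that; collect_offices, the fetch parameter, the delay parameter, and the fake pages are all invented for this example.

```python
import time
from bs4 import BeautifulSoup

def collect_offices(urls, fetch, delay=0.0):
    """Visit each profile URL and return the office text, or None when it is missing.

    fetch is any function that takes a URL and returns that page's HTML.
    delay is the number of seconds to pause between pages.
    """
    offices = []
    for url in urls:
        page = BeautifulSoup(fetch(url), 'html.parser')
        try:
            location = page.find_all('li', 'people-location')[0]
            offices.append(location.contents[1].strip())
        except IndexError:
            # no location tag on this page, or no text after the label
            offices.append(None)
        time.sleep(delay)
    return offices

# made-up pages standing in for the real profile pages
fake_pages = {
    'https://example.com/one/':
        '<ul><li class="people-location"><strong>Office Location:</strong> Room 101</li></ul>',
    'https://example.com/two/':
        '<ul><li class="people-title">Lecturer</li></ul>',
}
offices = collect_offices(fake_pages, fake_pages.get)
print(offices)  # ['Room 101', None]
```

With the real site, fetch could be something like lambda url: requests.get(url).content, and a delay of a second or so keeps the loop from sending many requests in quick succession.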

Here we check the info to see how the DataFrame looks. The office column has 25 non-null values out of 27, so two people have no office listed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        27 non-null     object
 1   title       27 non-null     object
 2   email       27 non-null     object
 3   discipline  27 non-null     object
 4   office      25 non-null     object
dtypes: object(5)
memory usage: 1.2+ KB

Finally, we can save our finished dataset:
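A minimal sketch of saving a DataFrame to a CSV file is below. The small DataFrame, the file name people_sample.csv, and the temporary folder are assumptions for this example; in class the file name and location may have been different.

```python
import os
import tempfile
import pandas as pd

# made-up data standing in for the scraped DataFrame
sample_df = pd.DataFrame({'name': ['Person One', 'Person Two'],
                          'office': ['Room 101', None]})

# write to a temporary folder so the example does not clutter the working directory
out_path = os.path.join(tempfile.mkdtemp(), 'people_sample.csv')

# index=False leaves out the row numbers, which carry no information here
sample_df.to_csv(out_path, index=False)

# read it back to confirm what was saved
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```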


9.6. Questions after class#

9.6.1. what does .a do?#

It gives the first instance of the <a> tag.

9.6.2. is it worth it to try and web scrape a page that is poorly written?#

If the information is important, yes. In these cases, you might have to do more manual parsing or even some manual fixes.

For this class, no.

9.6.3. Is there a way to check robots.txt through BeautifulSoup or must that be done manually in a browser?#

It can be read programmatically (for example, Python's standard library includes urllib.robotparser), but it doesn’t necessarily save time to do it that way.

9.6.4. What else can I do with inspect?#

It lets you view the code. It’s most often used to debug websites.

9.6.5. when web scraping if the html is not set up well is it possible to change the html to make it easier to parse#

Technically you could manually edit a copy of it.

9.6.6. Are there instances where you can get data from websites that are not in tabular form?#

Web scraping is for when the website is not in tabular form. It should be structured, but the structure does not need to come from a single page. It could be that there are many pages structured similarly and you build most of the columns from the other pages, not the starting page.

For example, from the teams page of the NBA you can get to a page with info about each team that includes all-time records and the current rosters. On these individual pages, most info is an actual table, so you can use pd.read_html for those, but the crawling part from the first page would count.

9.6.7. A source table would be the people’s page on the URI website, but when you click on their individual names does that count as another source table?#

Not as we did above because we combined the data by adding another column. If you built a whole table on each of the sub-pages it would count.

9.7. Portfolio Question#

9.7.1. I guess how to further edit the submission_1.intro. I know about the chapters you gotta add, but what else? Is submission_1.intro the only file you gotta edit?#

Yes, you edit that file and the _toc.yml. There are instructions on the portfolio page.

There are also formatting tips and ideas.

9.7.2. for portfolio one, can we submit whatever we want? like as little OR as much?#

Yes, exactly!

However, I do want to really encourage you to submit whatever you are thinking of, even if it is not as complete as you want. If you submit, you will get feedback even if you do not earn all of the achievements you try for. That will prepare you for the next one.