9. Web Scraping

Note

Dates:

  • A5 due 10/10

  • P1 due 10/16 (ideally you have already started this)

  • A6 due 10/18 (it’s short, I promise)

  • then back to assignments due Monday

import requests
from bs4 import BeautifulSoup
import pandas as pd

Warning

If it says it cannot load one of the libraries, use pip inside your notebook to install it:

pip install beautifulsoup4

then restart your kernel (Kernel menu, choose restart)

9.1. Getting Data From Websites

We have seen that read_html can get content from an actual web page, as opposed to a data file hosted somewhere on the internet: it finds the tables on the page and returns a list of DataFrames.

pd.read_html('https://rhodyprog4ds.github.io/BrownSpring23/syllabus/achievements.html')
[   Unnamed: 0_level_0                                             topics  \
                  week                                 Unnamed: 1_level_1   
 0                   1                             [admin, python review]   
 1                   2                        Loading data, Python review   
 2                   3                          Exploratory Data Analysis   
 3                   4                                      Data Cleaning   
 4                   5                      Databases, Merging DataFrames   
 5                   6  Modeling, classification performance metrics, ...   
 6                   7                        Naive Bayes, decision trees   
 7                   8                                         Regression   
 8                   9                                         Clustering   
 9                  10                              SVM, parameter tuning   
 10                 11                              KNN, Model comparison   
 11                 12                                      Text Analysis   
 12                 13                                    Images Analysis   
 13                 14                                      Deep Learning   
 
                              skills  
                  Unnamed: 2_level_1  
 0                           process  
 1      [access, prepare, summarize]  
 2            [summarize, visualize]  
 3   [prepare, summarize, visualize]  
 4    [access, construct, summarize]  
 5                        [evaluate]  
 6        [classification, evaluate]  
 7            [regression, evaluate]  
 8            [clustering, evaluate]  
 9                 [optimize, tools]  
 10                 [compare, tools]  
 11                   [unstructured]  
 12            [unstructured, tools]  
 13                 [tools, compare]  ,
    Unnamed: 0_level_0                                              skill  \
               keyword                                 Unnamed: 1_level_1   
 0              python                              pythonic code writing   
 1             process                 describe data science as a process   
 2              access                    access data in multiple formats   
 3           construct           construct datasets from multiple sources   
 4           summarize                        Summarize and describe data   
 5           visualize                                     Visualize data   
 6             prepare                          prepare data for analysis   
 7            evaluate                         Evaluate model performance   
 8      classification                               Apply classification   
 9          regression                                   Apply Regression   
 10         clustering                                         Clustering   
 11           optimize                          Optimize model parameters   
 12            compare                                     compare models   
 13     representation          Choose representations and transform data   
 14           workflow  use industry standard data science tools and w...   
 
                                               Level 1  \
                                    Unnamed: 2_level_1   
 0   python code that mostly runs, occasional pep8 ...   
 1           Identify basic components of data science   
 2   load data from at least one format; identify t...   
 3   identify what should happen to merge datasets ...   
 4   Describe the shape and structure of a dataset ...   
 5   identify plot types, generate basic plots from...   
 6   identify if data is or is not ready for analys...   
 7   Explain basic performance metrics for differen...   
 8   identify and describe what classification is, ...   
 9   identify what data that can be used for regres...   
 10                        describe what clustering is   
 11  Identify when model parameters need to be opti...   
 12                Qualitatively compare model classes   
 13  Identify options for representing text and cat...   
 14  Solve well strucutred fully specified problems...   
 
                                               Level 2  \
                                    Unnamed: 3_level_1   
 0   python code that reliably runs, frequent pep8 ...   
 1   Describe and define each stage of the data sci...   
 2   Load data for processing from the most common ...   
 3                                  apply basic merges   
 4   compute summary statndard statistics of a whol...   
 5   generate multiple plot types with complete lab...   
 6   apply data reshaping, cleaning, and filtering ...   
 7   Apply and interpret basic model evaluation met...   
 8   fit, apply, and interpret preselected classifi...   
 9          fit and interpret linear regression models   
 10                             apply basic clustering   
 11  Optimize basic model parameters such as model ...   
 12  Compare model classes in specific terms and fi...   
 13  Apply at least one representation to transform...   
 14  Solve well-strucutred, open-ended problems, ap...   
 
                                               Level 3  
                                    Unnamed: 4_level_1  
 0   reliable, efficient, pythonic code that consis...  
 1   Compare different ways that data science can f...  
 2   access data from both common and uncommon form...  
 3        merge data that is not automatically aligned  
 4   Compute and interpret various summary statisti...  
 5   generate complex plots with pandas and plottin...  
 6   apply data reshaping, cleaning, and filtering ...  
 7   Evaluate a model with multiple metrics and cro...  
 8   fit and apply classification models and select...  
 9   fit and explain regrularized or nonlinear regr...  
 10  apply multiple clustering techniques, and inte...  
 11  Select optimal parameters based of mutiple qua...  
 12  Evaluate tradeoffs between different model com...  
 13  apply transformations in different contexts OR...  
 14  Independently scope and solve realistic data s...  ,
    Unnamed: 0_level_0                 A1                 A2  \
               keyword Unnamed: 1_level_1 Unnamed: 2_level_1   
 0              python                  1                  1   
 1             process                  1                  0   
 2              access                  0                  1   
 3           construct                  0                  0   
 4           summarize                  0                  0   
 5           visualize                  0                  0   
 6             prepare                  0                  0   
 7            evaluate                  0                  0   
 8      classification                  0                  0   
 9          regression                  0                  0   
 10         clustering                  0                  0   
 11           optimize                  0                  0   
 12            compare                  0                  0   
 13     representation                  0                  0   
 14           workflow                  0                  0   
 
                    A3                 A4                 A5  \
    Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1   
 0                   0                  1                  1   
 1                   0                  0                  0   
 2                   1                  1                  1   
 3                   0                  0                  1   
 4                   1                  1                  1   
 5                   1                  1                  0   
 6                   0                  1                  1   
 7                   0                  0                  0   
 8                   0                  0                  0   
 9                   0                  0                  0   
 10                  0                  0                  0   
 11                  0                  0                  0   
 12                  0                  0                  0   
 13                  0                  0                  0   
 14                  0                  0                  0   
 
                    A6                 A7                 A8  \
    Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1   
 0                   0                  0                  0   
 1                   1                  1                  1   
 2                   0                  0                  0   
 3                   0                  1                  1   
 4                   1                  1                  1   
 5                   1                  1                  1   
 6                   0                  0                  0   
 7                   1                  1                  1   
 8                   0                  1                  0   
 9                   0                  0                  1   
 10                  0                  0                  0   
 11                  0                  0                  0   
 12                  0                  0                  0   
 13                  0                  0                  0   
 14                  0                  0                  0   
 
                    A9                 A10                 A11  \
    Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1   
 0                   0                   0                   0   
 1                   1                   1                   1   
 2                   0                   0                   0   
 3                   0                   0                   0   
 4                   1                   1                   1   
 5                   1                   1                   1   
 6                   0                   0                   0   
 7                   0                   1                   1   
 8                   0                   1                   0   
 9                   0                   0                   1   
 10                  1                   0                   1   
 11                  0                   1                   1   
 12                  0                   0                   1   
 13                  0                   0                   0   
 14                  0                   1                   1   
 
                    A12                 A13       # Assignments  
    Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1  
 0                    0                   0                   4  
 1                    0                   0                   7  
 2                    0                   0                   4  
 3                    0                   0                   3  
 4                    1                   1                  11  
 5                    1                   1                  10  
 6                    0                   0                   2  
 7                    0                   0                   5  
 8                    0                   0                   2  
 9                    0                   0                   2  
 10                   0                   0                   2  
 11                   0                   0                   2  
 12                   0                   1                   2  
 13                   1                   1                   2  
 14                   1                   1                   4  ,
    Unnamed: 0_level_0                                            Level 3  \
               keyword                                 Unnamed: 1_level_1   
 0              python  reliable, efficient, pythonic code that consis...   
 1             process  Compare different ways that data science can f...   
 2              access  access data from both common and uncommon form...   
 3           construct       merge data that is not automatically aligned   
 4           summarize  Compute and interpret various summary statisti...   
 5           visualize  generate complex plots with pandas and plottin...   
 6             prepare  apply data reshaping, cleaning, and filtering ...   
 7            evaluate  Evaluate a model with multiple metrics and cro...   
 8      classification  fit and apply classification models and select...   
 9          regression  fit and explain regrularized or nonlinear regr...   
 10         clustering  apply multiple clustering techniques, and inte...   
 11           optimize  Select optimal parameters based of mutiple qua...   
 12            compare  Evaluate tradeoffs between different model com...   
 13     representation  apply transformations in different contexts OR...   
 14           workflow  Independently scope and solve realistic data s...   
 
                    P1                 P2                 P3                 P4  
    Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1  
 0                   1                  1                  0                  1  
 1                   0                  1                  1                  1  
 2                   1                  1                  0                  1  
 3                   1                  1                  0                  1  
 4                   1                  1                  0                  1  
 5                   1                  1                  0                  1  
 6                   1                  1                  0                  1  
 7                   0                  1                  1                  1  
 8                   0                  1                  1                  1  
 9                   0                  1                  1                  1  
 10                  0                  1                  1                  1  
 11                  0                  0                  1                  1  
 12                  0                  0                  1                  1  
 13                  0                  0                  1                  1  
 14                  0                  0                  1                  1  ]

This gives us a list of DataFrames that come from the website. pandas gets the tables by looking in the HTML for the site and finding the <table> tags.
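To make "finding the <table> tags" concrete, here is a minimal sketch that does a similar search by hand with BeautifulSoup. The HTML string and the variable names are made up for illustration, and this is not how pandas is implemented internally; it only shows the idea of locating a table and reading its cells.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page with one table (illustrative only, not from the course site)
toy_html = """
<html><body>
<p>Some text that is not in a table.</p>
<table>
  <tr><th>week</th><th>topic</th></tr>
  <tr><td>1</td><td>intro</td></tr>
  <tr><td>2</td><td>loading data</td></tr>
</table>
</body></html>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')

# Each <table> tag corresponds to one DataFrame in the list that read_html returns
tables = toy_soup.find_all('table')

# Collect the text of every header and data cell, row by row
rows = [[cell.get_text() for cell in tr.find_all(['th', 'td'])]
        for tr in tables[0].find_all('tr')]

print(len(tables))
print(rows)
```

Text outside a <table> tag, like the paragraph in this toy page, is ignored by this kind of search, which is why read_html only helps when the data you want is already laid out as a table.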

9.2. Everything is Data

For the purpose of this class, it is best to think of the content on a web page as a data structure.

[Figure: HTML anatomy]

[Figure: HTML tree structure]

There are tags, written inside angle brackets (<>), that define the structure, and these can be further classified with classes.
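As a small illustration of tags and classes, the sketch below parses a made-up snippet (the names and class values are invented, not from any real page). Both inner elements use the same tag, and only the class tells them apart.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two <p> tags, distinguished only by their class
snippet = '<div class="card"><p class="name">Ada</p><p class="job">Engineer</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

first_p = soup.find('p')
print(first_p.name)        # the tag name
print(first_p['class'])    # classes come back as a list, since a tag can have several

# Searching by tag and class together picks out the second <p>
job = soup.find('p', class_='job').get_text()
print(job)
```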

9.3. Scraping a URI website

We’re going to create a DataFrame about URI CS & Statistics faculty from the people page of the department website.

We can inspect the page to check that it’s well structured.

Warning

With great power comes great responsibility.

  • always check the robots.txt

  • do not do things that the owner says not to do

  • government websites are typically safe
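One way to check a robots.txt programmatically is Python's built-in urllib.robotparser. The sketch below parses a made-up set of rules from a string so that it runs without internet access; the rules and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration; a real file lives at <site>/robots.txt
example_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch answers: may this user agent request this URL?
allowed = parser.can_fetch('*', 'https://example.com/people/')
blocked = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed, blocked)
```

For a real site you would instead call set_url with the address of its robots.txt and then read() to download it before asking can_fetch. A robots.txt only covers what the owner wrote down, so reading the site's terms is still worthwhile.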

We’ll save the URL for easy use

cs_people_url = 'https://web.uri.edu/cs/people/'

Then we can use the requests library to make a request to the website. It actually gets back a response object, which has a lot of extra information. For today we only need the content of the page, which is an attribute of that object:

cs_people_html = requests.get(cs_people_url).content
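The response object also records whether the request succeeded (for example in its status_code attribute). A slightly more defensive version of the call might look like the sketch below; the helper name fetch_html and the 10 second timeout are hypothetical choices, not something the course code uses.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, failing loudly on HTTP errors."""
    # the timeout keeps the notebook from hanging forever if the site does not answer
    response = requests.get(url, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of silently parsing an error page
    response.raise_for_status()
    return response.content


# Example call (needs internet access, so it is left commented out):
# cs_people_html = fetch_html('https://web.uri.edu/cs/people/')
```

For a one-off exploration in a notebook the single-line version is fine; the extra checks matter more when the same cell gets re-run later and the site may have changed.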

This is the raw content of the page, as bytes:

cs_people_html[:200]
b'\n<!DOCTYPE html>\n<html lang="en-US">\n\t\n<head>\n<meta charset="UTF-8"><script type="text/javascript">(window.NREUM||(NREUM={})).init={privacy:{cookies_enabled:true},ajax:{deny_list:["bam.nr-data.net"]},'
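The b prefix on that output means the content is a bytes object rather than a string. A minimal sketch of the difference, using a short made-up bytes value that stands in for the start of a page:

```python
# A short bytes value standing in for the start of a downloaded page (made up for illustration)
raw = b'\n<!DOCTYPE html>\n<html lang="en-US">\n'
print(type(raw))

# Decoding turns the bytes into a regular string; UTF-8 is assumed here
# because the page above declares charset="UTF-8"
as_text = raw.decode('utf-8')
print(type(as_text))

# Once it is a string, ordinary string methods work as usual
first_line = as_text.strip().splitlines()[0]
print(first_line)
```

The response object from requests also has a text attribute that holds an already decoded string, so decoding by hand is rarely needed in practice.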

But we do not need to manually write search tools; that’s what BeautifulSoup is for.

cs_people = BeautifulSoup(cs_people_html,'html.parser')
type(cs_people)
bs4.BeautifulSoup

9.3.1. Looking at tags

In this object we can use any tag from the file as an attribute and get back the first instance of it:

cs_people.a
<a class="skip-link screen-reader-text" href="#content">Skip to content</a>

We also see <h3> in the code so we can get the first one like this:

cs_people.h3
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
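Once we have a tag, we usually want the pieces inside it rather than the whole element. The sketch below takes the heading shown above as a stand-alone string (so it runs without downloading the page) and pulls out the name and the link; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# The heading from the output above, used as a stand-alone string
heading_html = ('<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">'
                'Gavino Puggioni</a></h3>')
heading = BeautifulSoup(heading_html, 'html.parser').h3

name = heading.a.string      # the text inside the nested <a> tag
link = heading.a['href']     # attributes are accessed like dictionary keys
print(name)
print(link)
```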

This cheatsheet shows lots of HTML tags, but for this purpose you do not really need it. You’ll be inspecting the page and then looking for what you want.

9.3.2. Searching the source

More helpful is the find_all method. We want to find all div tags that have the “peopleitem” class. We decided this by inspecting the code on the website.

type(cs_people.find_all('div','peopleitem'))
bs4.element.ResultSet
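A ResultSet behaves like a list of tags, so we can loop over it. The sketch below uses a simplified stand-in for the people page with two entries (most of the real markup is left out) to show that a string in the second position of find_all is matched against the class, and how to pull one field out of each result. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the people page: two entries plus one unrelated div
mini_page = """
<div class="peopleitem h-card has-thumbnail">
  <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/lisa-dipippo/">Lisa DiPippo</a></h3>
  <p class="people-department">Computer Science</p>
</div>
<div class="peopleitem h-card">
  <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/lutz-hamel/">Lutz Hamel</a></h3>
  <p class="people-department">Computer Science</p>
</div>
<div class="inside">not a person</div>
"""
mini_soup = BeautifulSoup(mini_page, 'html.parser')

# A string as the second argument is treated as a class to match,
# so these two calls find the same tags
people = mini_soup.find_all('div', 'peopleitem')
same_people = mini_soup.find_all('div', class_='peopleitem')

# Each result is itself a tag, so we can search inside it
names = [person.h3.a.string for person in people]
departments = [person.find('p', 'people-department').string for person in people]
print(len(people), names, departments)
```

Note that a div matches even when it has other classes as well (here h-card and has-thumbnail), while the div with only the inside class is skipped.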
cs_people.find_all('div','peopleitem')
[<div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor |  Chair</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/lisa-dipippo/"><img alt="" class="u-photo wp-post-image" height="126" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/lisa-dipippo-web.jpg" width="125"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/lisa-dipippo/">Lisa DiPippo</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Professor | Director of Undergraduate Studies</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:ldipippo@uri.edu">ldipippo@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/natallia-katenka/"><img alt="" class="u-photo wp-post-image" height="200" loading="lazy" sizes="(max-width: 200px) 100vw, 200px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/natalia-katenka-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/natalia-katenka-web.jpg 200w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/natalia-katenka-web-150x150.jpg 150w" width="200"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/natallia-katenka/">Natallia Katenka</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor | Director of Data Science</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><a class="u-email" href="mailto:nkatenka@uri.edu">nkatenka@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/krishna-venkatasubramanian/"><img alt="" class="u-photo wp-post-image" height="2703" loading="lazy" sizes="(max-width: 2437px) 100vw, 2437px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2.jpg 2437w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-270x300.jpg 270w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-768x852.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-923x1024.jpg 923w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-364x404.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-500x555.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-1000x1109.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-1280x1420.jpg 1280w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/KV_biopic_2-2000x2218.jpg 2000w" width="2437"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/krishna-venkatasubramanian/">Krishna Venkatasubramanian</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor | Director of Graduate Studies</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:krish@uri.edu">krish@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/jing-wu/"><img alt="" class="u-photo wp-post-image" height="200" loading="lazy" sizes="(max-width: 200px) 100vw, 200px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jing-wu-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jing-wu-web.jpg 200w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/jing-wu-web-150x150.jpg 150w" width="200"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/jing-wu/">Jing Wu</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor | Director of Graduate Studies</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><span class="p-tel">401.874.4504</span> <br/> <a class="u-email" href="mailto:jing_wu@uri.edu">jing_wu@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/yichi-zhang/"><img alt="" class="u-photo wp-post-image" height="240" loading="lazy" sizes="(max-width: 240px) 100vw, 240px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/yichi-zhang-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/yichi-zhang-web.jpg 240w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/yichi-zhang-web-150x150.jpg 150w" width="240"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/yichi-zhang/">Yichi Zhang</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor | Director of Undergraduate Studies</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><a class="u-email" href="mailto:yichizhang@uri.edu">yichizhang@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/marco-alvarez/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/marco-alvarez.png" width="120"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><span class="p-tel">401.874.5009</span> <br/> <a class="u-email" href="mailto:malvarez@uri.edu">malvarez@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card">
 <header>
 <div class="header">
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/samantha-armenti/">Samantha Armenti</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Teaching Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:sarmenti@uri.edu ">sarmenti@uri.edu </a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/sarah-brown/"><img alt="" class="u-photo wp-post-image" height="300" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Sarah-Brown.png" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Sarah-Brown.png 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Sarah-Brown-150x150.png 150w" width="300"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/sarah-brown/">Sarah Brown</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:brownsarahm@uri.edu">brownsarahm@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/michael-conti/"><img alt="" class="u-photo wp-post-image" height="2475" loading="lazy" sizes="(max-width: 2560px) 100vw, 2560px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-scaled.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-scaled.jpg 2560w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-300x290.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1024x990.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-768x743.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1536x1485.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-2048x1980.jpg 2048w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-364x352.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-500x483.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1000x967.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-1280x1238.jpg 1280w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/mike-conti-2000x1934.jpg 2000w" width="2560"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/michael-conti/">Michael Conti</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Teaching Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:michaelconti@uri.edu ">michaelconti@uri.edu </a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/noah-daniels/"><img alt="" class="u-photo wp-post-image" height="219" loading="lazy" sizes="(max-width: 219px) 100vw, 219px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/noah-daniels-web.png" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/noah-daniels-web.png 219w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/noah-daniels-web-150x150.png 150w" width="219"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/noah-daniels/">Noah Daniels</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:noah_daniels@uri.edu">noah_daniels@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/victor-fay-wolfe/"><img alt="" class="u-photo wp-post-image" height="125" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/victor-fay-wolfe-web.jpg" width="125"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/victor-fay-wolfe/">Victor Fay-Wolfe</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:vfaywolfe@uri.edu">vfaywolfe@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card">
 <header>
 <div class="header">
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/lutz-hamel/">Lutz Hamel</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor </p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:lutzhamel@uri.edu">lutzhamel@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/abdeltawab-hendawi/"><img alt="" class="u-photo wp-post-image" height="885" loading="lazy" sizes="(max-width: 1000px) 100vw, 1000px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1.jpeg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1.jpeg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-300x266.jpeg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-768x680.jpeg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-364x322.jpeg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-1-500x443.jpeg 500w" width="1000"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/abdeltawab-hendawi/">Abdeltawab Hendawi</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor</p>
 <p class="people-department">Data Science | Computer Science</p>
 <p class="people-misc"><span class="p-tel">401.874.5738</span> <br/> <a class="u-email" href="mailto:hendawi@uri.edu">hendawi@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/jean-yves-herve/"><img alt="" class="u-photo wp-post-image" height="299" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jean-yves-herve-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/jean-yves-herve-web.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/jean-yves-herve-web-150x150.jpg 150w" width="300"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/jean-yves-herve/">Jean-Yves Hervé</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:jyh@cs.uri.edu">jyh@cs.uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card">
 <header>
 <div class="header">
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/soheyb-kouider/">Soheyb Kouider</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Teaching Professor</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><span class="p-tel">401.874.2562</span> <br/> <a class="u-email" href="mailto:soheyb@uri.edu">soheyb@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/edmund-lamagna/"><img alt="" class="u-photo wp-post-image" height="150" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/edmund-lamagna-web.jpg" width="150"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/edmund-lamagna/">Edmund Lamagna</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:eal@cs.uri.edu">eal@cs.uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/indrani-mandal/"><img alt="" class="u-photo wp-post-image" height="280" loading="lazy" sizes="(max-width: 278px) 100vw, 278px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0.jpeg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/0.jpeg 278w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/0-150x150.jpeg 150w" width="278"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/indrani-mandal/">Indrani Mandal</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Associate Teaching Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:indrani_mandal@uri.edu ">indrani_mandal@uri.edu </a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card">
 <header>
 <div class="header">
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/jonathan-schrader/">Jonathan Schrader</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Teaching Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:jonathan.schrader@uri.edu">jonathan.schrader@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/shaun-wallace/"><img alt="Shaun Wallace" class="u-photo wp-post-image" height="300" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Shaun-Wallace_300.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Shaun-Wallace_300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Shaun-Wallace_300-150x150.jpg 150w" width="300"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/shaun-wallace/">Shaun Wallace</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:shaun.wallace@uri.edu">shaun.wallace@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card">
 <header>
 <div class="header">
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/yunshu-jasmine-wang/">Yunshu (Jasmine) Wang</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Teaching Professor</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><a class="u-email" href="mailto:yunshu_wang@uri.edu">yunshu_wang@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/haihan-mark-yu/"><img alt="Haihan Yu" class="u-photo wp-post-image" height="300" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/haihan_2307.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/haihan_2307.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/haihan_2307-150x150.jpg 150w" width="300"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/haihan-mark-yu/">Haihan (Mark) Yu</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><a class="u-email" href="mailto:haihan.yu@uri.edu">haihan.yu@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/guangyu-zhu/"><img alt="" class="u-photo wp-post-image" height="200" loading="lazy" sizes="(max-width: 200px) 100vw, 200px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Guangyu-Zhu-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Guangyu-Zhu-web.jpg 200w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Guangyu-Zhu-web-150x150.jpg 150w" width="200"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/guangyu-zhu/">Guangyu Zhu</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor</p>
 <p class="people-department">Statistics</p>
 <p class="people-misc"><a class="u-email" href="mailto:guangyuzhu@uri.edu">guangyuzhu@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/ashley-buchanan/"><img alt="" class="u-photo wp-post-image" height="966" loading="lazy" sizes="(max-width: 966px) 100vw, 966px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan.jpg 966w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Ashley-Buchanan-500x500.jpg 500w" width="966"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/ashley-buchanan/">Ashley Buchanan</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Limited Joint Appointment</p>
 <p class="people-department">Biostatistics</p>
 <p class="people-misc"><span class="p-tel">401.874.4739</span> <br/> <a class="u-email" href="mailto:buchanan@uri.edu">buchanan@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/nina-kajiji/"><img alt="" class="u-photo wp-post-image" height="125" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/nina-kajiji-web.jpg" width="125"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/nina-kajiji/">Nina Kajiji</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Adjunct Associate Professor</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><a class="u-email" href="mailto:nina@uri.edu">nina@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/rachel-schwartz/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Rachel-Schwartz-web.jpg" width="120"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/rachel-schwartz/">Rachel Schwartz</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor – Limited Joint Appointment</p>
 <p class="people-department">Biological Sciences</p>
 <p class="people-misc"><span class="p-tel">401.874.5404</span> <br/> <a class="u-email" href="mailto:rsschwartz@uri.edu">rsschwartz@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/ying-zhang/"><img alt="" class="u-photo wp-post-image" height="250" loading="lazy" sizes="(max-width: 249px) 100vw, 249px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/ying-zhang-web.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/ying-zhang-web.jpg 249w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/ying-zhang-web-150x150.jpg 150w" width="249"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/ying-zhang/">Ying Zhang</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor – Limited Joint Appointment</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><span class="p-tel">401.874.4915</span> <br/> <a class="u-email" href="mailto:yingzhang@uri.edu">yingzhang@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>]

This is a long object, and we can see that it looks iterable (note the [ at the start).

people_items = cs_people.find_all('div','peopleitem')
len(people_items)
27
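As a small sketch of what that search does, the snippet below runs the same kind of `find_all` call on a made-up fragment (the HTML and the names `sample_html`, `by_position`, and `by_keyword` are assumptions for illustration, not the real page). It shows that a string passed as the second argument is matched against the CSS class, and that a tag with several classes still matches on any one of them.

```python
from bs4 import BeautifulSoup

# hypothetical miniature of the people page, for illustration only
sample_html = """
<div class="peopleitem h-card has-thumbnail"><h3 class="p-name"><a href="#a">Person A</a></h3></div>
<div class="peopleitem h-card"><h3 class="p-name"><a href="#b">Person B</a></h3></div>
<div class="footer"><a href="#c">Not a person</a></div>
"""
sample = BeautifulSoup(sample_html, 'html.parser')

# a string as the second positional argument is treated as a CSS class
by_position = sample.find_all('div', 'peopleitem')
# class_ is the keyword form of the same search
by_keyword = sample.find_all('div', class_='peopleitem')

print(len(by_position), len(by_keyword))
```

Both searches return the two `peopleitem` divs and skip the footer, even though the divs carry extra classes such as `h-card`.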

Important

answer to questions about searching from the docs

We can also look at only the first instance

people_items[0]
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor |  Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>

We notice that the name is inside an <h3> tag with class p-name, and then inside an <a> tag within that. We also know from looking at the overall page that there are lots of other <a> tags, so we do not want to search all of those.

people_items[0].find('h3','p-name')
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>

Then we see that there is an <a> tag inside this, so we can pull that out next. We can use the tag name as an attribute, because the first instance of the tag is exactly what we want.

people_items[0].find('h3','p-name').a
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a>

Inside that, the text is stored as a string, so we can pull that out.

people_items[0].find('h3','p-name').a.string
'Gavino Puggioni'

Finally, now that we know how to get one out, we can put it all in a list comprehension

names = [person.find('h3','p-name').a.string for person in people_items]

9.4. Pulling more information#

First, we look at the whole person entry again.

people_items[0]
<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/"><img alt="" class="u-photo wp-post-image" height="1600" loading="lazy" sizes="(max-width: 1600px) 100vw, 1600px" src="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg" srcset="https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381.jpg 1600w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-300x300.jpg 300w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1024x1024.jpg 1024w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-150x150.jpg 150w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-768x768.jpg 768w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1536x1536.jpg 1536w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-364x364.jpg 364w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-500x500.jpg 500w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1000x1000.jpg 1000w, https://web.uri.edu/cs/wp-content/uploads/sites/1531/Gavino_headshot1-e1610666088381-1280x1280.jpg 1280w" width="1600"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Associate Professor |  Chair</p>
<p class="people-department">Statistics</p>
<p class="people-misc"><span class="p-tel">401.874.4388</span> <br/> <a class="u-email" href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>

How do we pull out the title for each person (e.g., Assistant Teaching Professor, Associate Professor)?

On one item, the p tag with the people-title class gets us what we want, so we can try the same pattern we used for the names:

[person.find('p','people-title').a.string for person in people_items]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[19], line 1
----> 1 [person.find('p','people-title').a.string for person in people_items]

Cell In[19], line 1, in <listcomp>(.0)
----> 1 [person.find('p','people-title').a.string for person in people_items]

AttributeError: 'NoneType' object has no attribute 'string'

This gives an error because there is no <a> tag inside the <p> tag.

Python’s null concept (outside of pandas and numpy, which have NaN float values) is None. We can assign it to a variable if we want.

a = None

Its type is NoneType

type(a)
NoneType

So the error message says that the thing we are applying .string to is None. That thing is the result of .a, which means there is no <a> inside a <p class="people-title">.

If we take out the a it works and then we can iterate and store like above.

titles = [person.find('p','people-title').string for person in people_items]
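To make the None behavior concrete, here is a minimal sketch on a made-up fragment shaped like one person entry (the HTML and the variable names are assumptions for illustration). It shows `.a` returning None when there is no `<a>` inside the tag, and one possible way to guard a lookup that might not find anything.

```python
from bs4 import BeautifulSoup

# hypothetical fragment shaped like one person entry
item = BeautifulSoup(
    '<div class="peopleitem">'
    '<h3 class="p-name"><a href="#">Person A</a></h3>'
    '<p class="people-title">Professor</p>'
    '</div>', 'html.parser')

title_tag = item.find('p', 'people-title')
print(title_tag.a)       # None, because there is no <a> inside this <p>
print(title_tag.string)  # Professor

# find also returns None when nothing matches, so check before using .string
missing = item.find('p', 'people-department')  # not present in this fragment
department = missing.string if missing is not None else None
print(department)
```

Checking for None first trades a crash for a missing value, which is often easier to deal with later in pandas.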

We can pull out two more things: the department (people-department), which indicates who is in CS and who is in Statistics, and the email address (u-email).

disciplines = [d.string for d in cs_people.find_all("p",'people-department')]
emails = [e.string for e in cs_people.find_all("a",'u-email')]

We can finally use the DataFrame constructor to make it a table. I chose to use a dictionary in class

css_df = pd.DataFrame({'name':names,'title':titles,'email':emails,'discipline':disciplines})
css_df
    name                        title                                              email                      discipline
 0  Gavino Puggioni             Associate Professor | Chair                        gpuggioni@uri.edu          Statistics
 1  Lisa DiPippo                Professor | Director of Undergraduate Studies      ldipippo@uri.edu           Computer Science
 2  Natallia Katenka            Associate Professor | Director of Data Science     nkatenka@uri.edu           Statistics
 3  Krishna Venkatasubramanian  Assistant Professor | Director of Graduate Stu...  krish@uri.edu              Computer Science
 4  Jing Wu                     Associate Professor | Director of Graduate Stu...  jing_wu@uri.edu            Statistics
 5  Yichi Zhang                 Assistant Professor | Director of Undergraduat...  yichizhang@uri.edu         Statistics
 6  Marco Alvarez               Associate Professor                                malvarez@uri.edu           Computer Science
 7  Samantha Armenti            Associate Teaching Professor                       sarmenti@uri.edu           Computer Science
 8  Sarah Brown                 Assistant Professor                                brownsarahm@uri.edu        Computer Science
 9  Michael Conti               Associate Teaching Professor                       michaelconti@uri.edu       Computer Science
10  Noah Daniels                Associate Professor                                noah_daniels@uri.edu       Computer Science
11  Victor Fay-Wolfe            Professor                                          vfaywolfe@uri.edu          Computer Science
12  Lutz Hamel                  Associate Professor                                lutzhamel@uri.edu          Computer Science
13  Abdeltawab Hendawi          Assistant Professor                                hendawi@uri.edu            Data Science | Computer Science
14  Jean-Yves Hervé             Associate Professor                                jyh@cs.uri.edu             Computer Science
15  Soheyb Kouider              Associate Teaching Professor                       soheyb@uri.edu             Statistics
16  Edmund Lamagna              Professor                                          eal@cs.uri.edu             Computer Science
17  Indrani Mandal              Associate Teaching Professor                       indrani_mandal@uri.edu     Computer Science
18  Jonathan Schrader           Assistant Teaching Professor                       jonathan.schrader@uri.edu  Computer Science
19  Shaun Wallace               Assistant Professor                                shaun.wallace@uri.edu      Computer Science
20  Yunshu (Jasmine) Wang       Assistant Teaching Professor                       yunshu_wang@uri.edu        Statistics
21  Haihan (Mark) Yu            Assistant Professor                                haihan.yu@uri.edu          Statistics
22  Guangyu Zhu                 Assistant Professor                                guangyuzhu@uri.edu         Statistics
23  Ashley Buchanan             Limited Joint Appointment                          buchanan@uri.edu           Biostatistics
24  Nina Kajiji                 Adjunct Associate Professor                        nina@uri.edu               Computer Science
25  Rachel Schwartz             Assistant Professor – Limited Joint Appointment    rsschwartz@uri.edu         Biological Sciences
26  Ying Zhang                  Assistant Professor – Limited Joint Appointment    yingzhang@uri.edu          Computer Science

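We could also try a list of lists and a separate list of column names. Passing the four lists directly fails, though, because pandas treats each inner list as a row: it sees 4 rows of 27 values each, but only 4 column names.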

css_df_b = pd.DataFrame(data=[names,titles,emails,disciplines],
                       columns =['names','titles','emails','disciplines'])
css_df_b
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:934, in _finalize_columns_and_data(content, columns, dtype)
    933 try:
--> 934     columns = _validate_or_indexify_columns(contents, columns)
    935 except AssertionError as err:
    936     # GH#26429 do not raise user-facing AssertionError

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:981, in _validate_or_indexify_columns(content, columns)
    979 if not is_mi_list and len(columns) != len(content):  # pragma: no cover
    980     # caller's responsibility to check for this...
--> 981     raise AssertionError(
    982         f"{len(columns)} columns passed, passed data had "
    983         f"{len(content)} columns"
    984     )
    985 if is_mi_list:
    986     # check if nested list column, length of each sub-list should be equal

AssertionError: 4 columns passed, passed data had 27 columns

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[25], line 1
----> 1 css_df_b = pd.DataFrame(data=[names,titles,emails,disciplines],
      2                        columns =['names','titles','emails','disciplines'])
      3 css_df_b

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:782, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    780     if columns is not None:
    781         columns = ensure_index(columns)
--> 782     arrays, columns, index = nested_data_to_arrays(
    783         # error: Argument 3 to "nested_data_to_arrays" has incompatible
    784         # type "Optional[Collection[Any]]"; expected "Optional[Index]"
    785         data,
    786         columns,
    787         index,  # type: ignore[arg-type]
    788         dtype,
    789     )
    790     mgr = arrays_to_mgr(
    791         arrays,
    792         columns,
   (...)
    795         typ=manager,
    796     )
    797 else:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:498, in nested_data_to_arrays(data, columns, index, dtype)
    495 if is_named_tuple(data[0]) and columns is None:
    496     columns = ensure_index(data[0]._fields)
--> 498 arrays, columns = to_arrays(data, columns, dtype=dtype)
    499 columns = ensure_index(columns)
    501 if index is None:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:840, in to_arrays(data, columns, dtype)
    837     data = [tuple(x) for x in data]
    838     arr = _list_to_arrays(data)
--> 840 content, columns = _finalize_columns_and_data(arr, columns, dtype)
    841 return content, columns

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/construction.py:937, in _finalize_columns_and_data(content, columns, dtype)
    934     columns = _validate_or_indexify_columns(contents, columns)
    935 except AssertionError as err:
    936     # GH#26429 do not raise user-facing AssertionError
--> 937     raise ValueError(err) from err
    939 if len(contents) and contents[0].dtype == np.object_:
    940     contents = convert_object_array(contents, dtype=dtype)

ValueError: 4 columns passed, passed data had 27 columns
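The error message tells us the data is oriented the wrong way: each of our lists is a column, but the constructor reads each inner list as a row. One possible fix, sketched here with short made-up lists standing in for the scraped ones (the values and the names `rows` and `df_rows` are assumptions for illustration), is to use `zip` to regroup the lists into one tuple per person.

```python
import pandas as pd

# hypothetical stand-ins for the scraped lists (one entry per person)
names = ['Person A', 'Person B', 'Person C']
titles = ['Professor', 'Lecturer', 'Professor']
emails = ['a@example.edu', 'b@example.edu', 'c@example.edu']
disciplines = ['Computer Science', 'Statistics', 'Statistics']

# zip regroups the four long lists into one short tuple per person,
# so each inner item is now a row with four values
rows = list(zip(names, titles, emails, disciplines))

df_rows = pd.DataFrame(data=rows,
                       columns=['name', 'title', 'email', 'discipline'])
print(df_rows.shape)
```

With three people this gives a table with 3 rows and 4 columns, matching the four column names.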

9.5. Crawling and scraping#

Remember that we pulled the names out of links. When we click on those links in the browser, we see that each goes to a profile page. These pages list the office number. Let’s add those to our DataFrame.

First, we will do it for one person, then make a loop.

people_items[0].find('h3','p-name').a
<a href="https://web.uri.edu/cs/meet/gavino-puggioni/">Gavino Puggioni</a>

We see that the information we want is in the href attribute. To read that, we check the documentation, which tells us there is an .attrs attribute on the Python object we are working with.

people_items[0].find('h3','p-name').a.attrs
{'href': 'https://web.uri.edu/cs/meet/gavino-puggioni/'}

It’s a dictionary, and the name of the HTML attribute we want (href) is the key.

puggioni_url = people_items[0].find('h3','p-name').a.attrs['href']
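As a side note, Beautiful Soup tags also support dictionary-style access directly, so there is more than one way to read an attribute. This is a small sketch on a made-up link (the URL and variable names are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# hypothetical link, for illustration only
link = BeautifulSoup('<a href="https://example.com/profile/">Person A</a>',
                     'html.parser').a

via_attrs = link.attrs['href']  # the form used above
via_index = link['href']        # shorthand: index the tag like a dictionary
via_get = link.get('href')      # like dict.get, returns None if missing

print(via_attrs == via_index == via_get)
print(link.get('title'))        # this link has no title attribute
```

The `.get` form is handy when some tags may lack the attribute, since indexing with a missing key raises a KeyError.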

Now, we do the same thing we did above, request, pull the content from the response and then use the parser.

puggioni_html = requests.get(puggioni_url).content
puggioni_info = BeautifulSoup(puggioni_html,'html.parser')

Then we find the tag and class we need by inspecting the page, and pull that.

puggioni_info.find_all('li','people-location')
[<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>]

It’s an iterable, so we pull the item out.

puggioni_info.find_all('li','people-location')[0]
<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>

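Then we try to pull the string out, and it comes back empty (None). That happens because this <li> has more than one child: the <strong> tag and the text after it.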

puggioni_info.find_all('li','people-location')[0].string

Here, we could go to the documentation and look up what the object contains, but instead we can inspect the object directly.
We can use the Python __dict__ attribute to see what the object stores and where it put what we want.

puggioni_info.find_all('li','people-location')[0].__dict__
{'parser_class': bs4.BeautifulSoup,
 'name': 'li',
 'namespace': None,
 '_namespaces': {},
 'prefix': None,
 'sourceline': 372,
 'sourcepos': 303,
 'known_xml': False,
 'attrs': {'class': ['people-location']},
 'contents': [<strong>Office Location:</strong>, ' Tyler Hall 254'],
 'parent': <ul class="people-list">
 <li class="people-title">Associate Professor |  Chair</li> <li class="people-department">Statistics</li> <li class="people-phone"><strong>Phone:</strong> 401.874.4388</li> <li class="people-email"><strong>Email:</strong> <a href="mailto:gpuggioni@uri.edu">gpuggioni@uri.edu</a></li> <li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>
 </ul>,
 'previous_element': ' ',
 'next_element': <strong>Office Location:</strong>,
 'next_sibling': '\n',
 'previous_sibling': ' ',
 'hidden': False,
 'can_be_empty_element': False,
 'cdata_list_attributes': {'*': ['class', 'accesskey', 'dropzone'],
  'a': ['rel', 'rev'],
  'link': ['rel', 'rev'],
  'td': ['headers'],
  'th': ['headers'],
  'form': ['accept-charset'],
  'object': ['archive'],
  'area': ['rel'],
  'icon': ['sizes'],
  'iframe': ['sandbox'],
  'output': ['for']},
 'preserve_whitespace_tags': {'pre', 'textarea'},
 'interesting_string_types': (bs4.element.NavigableString, bs4.element.CData)}

We see it’s the second element of the list stored under the 'contents' key.

puggioni_info.find_all('li','people-location')[0].contents[1]
' Tyler Hall 254'
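To see why `.string` came back empty and what the alternatives look like, here is a short sketch that parses just the `<li>` shown above (the variable name `li` is an assumption). It compares `.string`, `.contents`, `.get_text()`, and `.stripped_strings` on the same tag.

```python
from bs4 import BeautifulSoup

# the same <li> as in the profile page output above
li = BeautifulSoup(
    '<li class="people-location"><strong>Office Location:</strong> Tyler Hall 254</li>',
    'html.parser').li

print(li.string)                  # None: the tag has two children, not one
print(li.contents[1].strip())     # Tyler Hall 254
print(li.get_text())              # Office Location: Tyler Hall 254
print(list(li.stripped_strings))  # ['Office Location:', 'Tyler Hall 254']
```

Indexing `.contents` works here because we know the label comes first. `get_text` and `stripped_strings` are options when the position of the text is less predictable.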

Now that we know how to do it, we can put it in a loop.

offices = []
for name_link in cs_people.find_all('h3','p-name'):
    # follow the link on each person's name to their profile page
    url = name_link.a.attrs['href']
    person_html = requests.get(url).content
    person_info = BeautifulSoup(person_html,'html.parser')
    try:
        offices.append(person_info.find_all('li','people-location')[0].contents[1])
    except IndexError:
        # no office location listed on this profile page
        offices.append(pd.NA)


css_df['office'] = offices

We added the try and except to handle the case where there is no office location. In practice, this is something you would often think to add only after hitting an error.
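An alternative to try and except is to check for the missing tag explicitly. The sketch below wraps that in a helper, `office_from_html`, which is a hypothetical name and not something from class. It takes the HTML of a profile page as a string, so it can be tried on made-up fragments without requesting anything. Unlike the loop above, it also strips the leading space.

```python
from bs4 import BeautifulSoup
import pandas as pd

def office_from_html(profile_html):
    """Return the office text from a profile page, or pd.NA if none is listed."""
    info = BeautifulSoup(profile_html, 'html.parser')
    location = info.find('li', 'people-location')
    # find returns None when there is no matching tag
    if location is None or len(location.contents) < 2:
        return pd.NA
    return str(location.contents[1]).strip()

# hypothetical fragments: one with an office, one without
with_office = ('<ul><li class="people-location">'
               '<strong>Office Location:</strong> Tyler Hall 254</li></ul>')
without_office = '<ul><li class="people-phone"><strong>Phone:</strong> 401.874.4388</li></ul>'

print(office_from_html(with_office))
print(office_from_html(without_office))
```

In the loop, this could be called on `requests.get(url).content` for each person in place of the try block.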

Here we check the info, and we can see that 25 of the 27 people have an office listed.

css_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        27 non-null     object
 1   title       27 non-null     object
 2   email       27 non-null     object
 3   discipline  27 non-null     object
 4   office      25 non-null     object
dtypes: object(5)
memory usage: 1.2+ KB

Finally, we can save our prepared dataset:

css_df.to_csv('css_faculty.csv')
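One detail to be aware of: by default `to_csv` also writes the index as an unnamed first column. The sketch below uses a tiny made-up DataFrame (not the scraped one) and writes to a string instead of a file, to show the difference `index=False` makes and that a missing office comes back as a missing value.

```python
import io
import pandas as pd

# hypothetical two-row table standing in for css_df
small_df = pd.DataFrame({'name': ['Person A', 'Person B'],
                         'office': ['Tyler Hall 254', pd.NA]})

# with no path, to_csv returns the CSV text instead of writing a file
with_index = small_df.to_csv()
without_index = small_df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,office
print(without_index.splitlines()[0])  # name,office

# reading it back: the missing office is still missing
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip['office'].isna().sum())
```

Whether to keep the index is a judgment call. Here it is just the row number, so dropping it loses nothing.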

9.6. Questions after class#

9.6.1. what does .a do?#

It gives the first instance of the <a> tag inside the tag it is applied to.
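A quick sketch on a made-up paragraph with two links (the HTML and variable names are assumptions for illustration) shows that `.a` behaves like `find('a')`, and that `find_all('a')` is what returns every link.

```python
from bs4 import BeautifulSoup

# hypothetical paragraph with two links
p = BeautifulSoup('<p><a href="#one">first</a> and <a href="#two">second</a></p>',
                  'html.parser').p

first = p.a                  # only the first <a>
same = p.find('a')           # equivalent search written out
all_links = p.find_all('a')  # every <a>, as a list

print(first['href'])
print(len(all_links))
print(p.strong)              # None: a tag name that is not present
```

The same shorthand works for any tag name, which is why `.a` on a tag with no link gave None earlier.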

9.6.2. is it worth it to try and web scrape a page that is poorly written?#

If it is important information. In these cases, you might have to do more manual parsing or even some manual fixes.

For this class, no.

9.6.3. Is there a way to check robots.txt through BeautifulSoup or must that be done manually in a browser?#

It could be read programmatically, but it doesn’t necessarily save time to do it that way.
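Beautiful Soup itself does not check robots.txt, but the Python standard library has `urllib.robotparser` for this. Below is a minimal sketch that parses made-up rules from a string (the rules and URLs are assumptions for illustration). For a live site, `set_url` followed by `read` would fetch the file instead.

```python
from urllib import robotparser

# hypothetical robots.txt content, for illustration only
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) answers whether that agent may request the URL
allowed = parser.can_fetch('*', 'https://example.com/cs/people/')
blocked = parser.can_fetch('*', 'https://example.com/private/data')
print(allowed, blocked)
```

This only reports what the file says. Respecting it, and the site's terms, is still up to us.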

9.6.4. What else can I do with inspect?#

It lets you view the code. It’s most often used to debug websites.

9.6.5. when web scraping if the html is not set up well is it possible to change the html to make it easier to parse#

Technically you could manually edit a copy of it.

9.6.6. Are there instances where you can get data from websites that are not in tabular form?#

Web scraping is for when the website is not in tabular form. It should be structured, but the structure does not need to come from a single page. It could be that there are many pages structured similarly and you build most of the columns from the other pages, not the starting page.

For example, from the teams page of the NBA you can get to a page with info about each team that includes all-time records and the current rosters. On these individual pages, most info is an actual table, so you can use pd.read_html for those, but the crawling part from the first page would count.

9.6.7. A source table would be the people’s page on the URI website, but when you click on their individual names does that count as another source table?#

Not as we did above because we combined the data by adding another column. If you built a whole table on each of the sub-pages it would count.

9.7. Portfolio Question#

9.7.1. I guess how to further edit the submission_1.intro. I know about the chapters you gotta add, but what else? Is submission_1.intro the only file you gotta edit?#

Yes, you edit that file and the _toc.yml. There are instructions on the portfolio page.

There are also formatting tips and ideas

9.7.2. for portfolio one, can we submit whatever we want? like as little OR as much?#

Yes, exactly!

However, I do want to really encourage you to submit whatever you are thinking of, even if it is not as complete as you want. If you submit, you will get feedback even if you do not earn all of the achievements you try for. That will prepare you for the next one.