# Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

````{admonition} Tip
To update a package from inside a Jupyter notebook, run:
```
!pip install --upgrade packagename
```
then restart the kernel.
````

## Figuring out what to scrape

We're going to create a DataFrame about URI CS & Statistics faculty
from the [people page](https://web.uri.edu/cs/people/) of the department website.


With great power comes great responsibility.

- always check [robots.txt](https://web.uri.edu/robots.txt)
- do not do things that the owner says not to do
- government websites are typically safe to scrape, because open data rules usually apply to them
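
The robots.txt check in the list above can also be done in code. This is a minimal sketch using the standard library's `urllib.robotparser`; the rules and the `example.com` URLs are made up for illustration and are not the contents of URI's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written out here so the example runs offline.
example_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(example_robots_txt.splitlines())

# can_fetch(useragent, url) returns True when the rules permit the request.
print(parser.can_fetch("*", "https://example.com/cs/people/"))
print(parser.can_fetch("*", "https://example.com/private/notes.html"))
```

To check a live site instead, we could call `parser.set_url(...)` with the address of its robots.txt and then `parser.read()` before asking `can_fetch`.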


We can inspect the page to check that it's well structured by right-clicking in
a browser tab and viewing the page source, which is the same code that our
browser uses to render the page.

````{margin}
```{admonition} Think Ahead
You can use this same basic logic, that anything can be data, to consider other
questions, for example:
- How do the sizes of datasets for Tidy Tuesday vary? (you don't need to download and load all of the data, only traverse the readmes)
- On average, how many code cells are there in each class? How much does the amount of code in the posted notes vary from the notes you take in class?

```
````
We can think of web scraping as loading data that's *not* tabular but
instead is formatted as HTML code.  HTML code *can* be well structured and
hierarchical within a single page, or you could collect information about a
broad topic by using a little bit from many pages.  We're going to work
from one page here.

HTML code consists of tags and the text of the page. The tags label the content
and define the structure of the page.

![HTML tree structure](http://www.w3schools.com/js/pic_htmltree.gif)
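
To see the nesting of tags concretely, here is a small sketch that prints an indented outline of a tiny hand-written page. It uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup, only to show the tree idea; the `TagOutline` class and the example HTML are made up for illustration, and the depth tracking assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Collect an indented outline of the tags and text in an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then go one level deeper.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one level.
        self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text, indented under its enclosing tag.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


# A tiny hand-written page (not the real department site).
example_html = (
    "<html><head><title>People</title></head>"
    "<body><h1>Faculty</h1><p>Hello</p></body></html>"
)

outline = TagOutline()
outline.feed(example_html)
print("\n".join(outline.lines))
```

Each level of indentation in the printed outline corresponds to one level of the tree in the figure.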

Once we've decided that it will work, we can begin working.

In [2]:
cs_people_url = 'https://web.uri.edu/cs/people/'

## Loading with requests



First, we `get` the data using the Python [requests](https://docs.python-requests.org/en/latest/)
library.

In [3]:
requests.get(cs_people_url)

<Response [200]>

This returns a response object, but we want the content from it, so we'll save that to a
variable.

In [4]:
cs_people_html = requests.get(cs_people_url).content

cs_people_html



This is exactly the HTML that a browser would receive, but we pulled it into
Python.
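
The `<Response [200]>` above tells us the request succeeded, but a request can also fail. As a hedged sketch, we could wrap the download in a small helper that stops early on an error status. The name `fetch_html` and the 10 second timeout are assumptions for illustration, not part of the class code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an error for 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError when the status is an error code.
    response.raise_for_status()
    return response.content
```

With this helper, `cs_people_html = fetch_html(cs_people_url)` would give the same bytes as the cell above when the page loads successfully.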


## Parsing with BeautifulSoup

Next, we'll use BeautifulSoup to parse the text.  Parsing means to make
sense of it. In this case, it transforms the page from a string of characters into a data structure
that we can work with.

In [5]:
cs_people = BeautifulSoup(cs_people_html,'html.parser')

First, note that the parsed object displays newlines as actual line breaks.

In [6]:
cs_people


<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<title>People – Department of Computer Science and Statistics</title>
<meta content="max-image-preview:large" name="robots">
<link href="//s.w.org" rel="dns-prefetch">
<link href="https://web.uri.edu/cs/feed/" rel="alternate" title="Department of Computer Science and Statistics » Feed" type="application/rss+xml"/>
<link href="https://web.uri.edu/cs/comments/feed/" rel="alternate" title="Department of Computer Science and Statistics » Comments Feed" type="application/rss+xml"/>
<script type="text/javascript">
			window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.1\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.1\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/web.uri.edu\/cs\/wp-includes\/js\/wp-emoji-release.min.js?ver=5.7.1"

We can use the `prettify` method to format it more nicely.  It adds indentation even
if there was none in the source code.

In [7]:
print(cs_people.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <title>
   People – Department of Computer Science and Statistics
  </title>
  <meta content="max-image-preview:large" name="robots">
   <link href="//s.w.org" rel="dns-prefetch">
    <link href="https://web.uri.edu/cs/feed/" rel="alternate" title="Department of Computer Science and Statistics » Feed" type="application/rss+xml"/>
    <link href="https://web.uri.edu/cs/comments/feed/" rel="alternate" title="Department of Computer Science and Statistics » Comments Feed" type="application/rss+xml"/>
    <script type="text/javascript">
     window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.1\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.0.1\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/web.uri.edu\/cs\/wp-includes\/js\/w

BeautifulSoup also exposes the tags (the structure in HTML) as attributes of the parsed object.

In [8]:
cs_people.title

<title>People – Department of Computer Science and Statistics</title>

For tags that have multiple instances like the `<a>` tag that defines a link, it returns the first instance.

In [9]:
cs_people.a

<a class="skip-link screen-reader-text" href="#content">Skip to content</a>
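
The "first instance" behavior can be checked on a small example. This sketch uses a hand-written snippet (not the real page) and compares attribute-style access with BeautifulSoup's `find` method.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet, not the real page.
example_html = '<p><a href="/first">First</a> and <a href="/second">Second</a></p>'
soup = BeautifulSoup(example_html, "html.parser")

# Attribute-style access and find() both return the first matching tag.
print(soup.a)
print(soup.find("a") == soup.a)

# A tag that does not occur anywhere gives None rather than raising an error.
print(soup.table)
```

Because a missing tag gives `None`, it is worth checking for `None` before chaining further lookups on the result.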

## Finding all instances of a tag

We can use `find_all` to make a list of all occurrences of a tag. For example, we
could get all of the links from a page:

In [10]:
cs_people.find_all('a')

[<a class="skip-link screen-reader-text" href="#content">Skip to content</a>,
 <a href="https://www.uri.edu/" title="University of Rhode Island"><div id="identity">University of Rhode Island</div></a>,
 <a href="https://www.uri.edu/gateway/future-students" role="menuitem">Future Students</a>,
 <a href="https://www.uri.edu/gateway/students" role="menuitem">Students</a>,
 <a href="https://www.uri.edu/gateway/faculty" role="menuitem">Faculty</a>,
 <a href="https://www.uri.edu/gateway/staff" role="menuitem">Staff</a>,
 <a href="https://www.uri.edu/gateway/families" role="menuitem">Parents and Families</a>,
 <a href="https://www.uri.edu/gateway/alumni" role="menuitem">Alumni</a>,
 <a href="https://www.uri.edu/gateway/community" role="menuitem">Community</a>,
 <a href="https://web.uri.edu/cs/" rel="home">
 			Department of Computer Science and Statistics		</a>,
 <a href="https://www.uri.edu/">URI</a>,
 <a href="https://web.uri.edu/artsci">Arts and Sciences</a>,
 <a href="https://web.uri.edu/

We noted before that each person's information is contained in a `div` tag with
`class="peopleitem"`.  `find_all` can also take values for the attributes of a
tag.  Attributes are the modifiers of a tag; here, a class is a label
that defines formatting and also acts as metadata about what is in the
div.  Scraping relies on the HTML code being well organized.

In [11]:
cs_people.find_all("div","peopleitem")

[<div class="peopleitem h-card has-thumbnail">
 <header>
 <div class="header">
 <figure>
 <a href="https://web.uri.edu/cs/meet/marco-alvarez/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/files/marco-alvarez.png" width="120"/></a>
 </figure>
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-job-title">Assistant Professor | Director of Graduate Studies</p>
 <p class="people-department">Computer Science</p>
 <p class="people-misc"><span class="p-tel">401.874.5009</span> – <a class="u-email" href="mailto:malvarez@uri.edu">malvarez@uri.edu</a></p>
 <div style="clear:both;"></div>
 </div>
 </div>,
 <div class="peopleitem h-card">
 <header>
 <div class="header">
 <h3 class="p-name"><a href="https://web.uri.edu/cs/meet/samantha-armenti/">Samantha Armenti</a></h3>
 </div>
 </header>
 <div class="inside">
 <p class="people-title p-

We could use this to count how many people there are:

In [12]:
len(cs_people.find_all("div","peopleitem"))

23

We can also search on a different class to count how many people have a thumbnail:

In [13]:
len(cs_people.find_all("div",{"has-thumbnail"}))

20
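
To make the class matching easier to see, here is a sketch on a few hand-written cards. The HTML below is made up, reusing only the class names from the page. A tag matches when the class we ask for is any one of its classes, and the `class_` keyword is an equivalent, more explicit spelling of the second positional argument.

```python
from bs4 import BeautifulSoup

# Hand-written cards that reuse the class names from the people page.
example_html = """
<div class="peopleitem h-card has-thumbnail"><h3 class="p-name">Person A</h3></div>
<div class="peopleitem h-card"><h3 class="p-name">Person B</h3></div>
<div class="footer">Not a person</div>
"""
soup = BeautifulSoup(example_html, "html.parser")

# A string as the second argument is treated as a class to match.
cards = soup.find_all("div", "peopleitem")

# The class_ keyword does the same thing, spelled out.
with_thumbnail = soup.find_all("div", class_="has-thumbnail")

print(len(cards), len(with_thumbnail))
```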

## Finding data we can make tabular

We can look at the first one in detail to determine what to extract for each of the columns.

In [14]:
cs_people.find_all("div","peopleitem")[0]

<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/marco-alvarez/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/files/marco-alvarez.png" width="120"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor | Director of Graduate Studies</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><span class="p-tel">401.874.5009</span> – <a class="u-email" href="mailto:malvarez@uri.edu">malvarez@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>

We can see that the name is an `<h3>` tag with `class = "p-name"`

In [15]:
first_name = cs_people.find_all("h3","p-name")[0]

We can examine this using our typical tools:

In [16]:
type(first_name)

bs4.element.Tag

It has attributes, since it's a tag object:

In [17]:
first_name.contents

[<a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a>]

What we want is the string:

In [18]:
first_name.string

'Marco Alvarez'

```{admonition} Question in class
How can we extract the link?
```

That's a child, because the `<a>` tag is inside of the `<h3>` tag, so let's explore the children.

In [19]:
[type(c) for c in first_name.children]

[bs4.element.Tag]

Alternatively, we can pick the `a` tag out by name.

In [20]:
first_name.a

<a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a>

The URL, or `href`, is an attribute of the `a` tag.

In [21]:
first_name.a.attrs

{'href': 'https://web.uri.edu/cs/meet/marco-alvarez/'}

In [22]:
first_name.a.attrs.href

AttributeError: 'dict' object has no attribute 'href'

Since `attrs` is a dictionary, dot access does not work; instead we use `[]` to index into the `attrs` dictionary:

In [23]:
first_name.a.attrs['href']

'https://web.uri.edu/cs/meet/marco-alvarez/'
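
As a short sketch of two related lookups: a tag can also be indexed directly, and its `get` method returns `None` for an attribute that is missing instead of raising an error. The snippet and its `example.com` URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# A hand-written heading shaped like the ones on the people page.
example_html = '<h3 class="p-name"><a href="https://example.com/meet/person-a/">Person A</a></h3>'
link = BeautifulSoup(example_html, "html.parser").a

# Indexing the tag directly is the same lookup as link.attrs["href"].
print(link["href"])

# get() returns the value when present and None when the attribute is missing.
print(link.get("href"))
print(link.get("title"))
```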

## Building a DataFrame

Now that we know what to look for, we can start building.  First, we'll find all of the names and extract the string from each using a list comprehension.

In [24]:
names = [name.string for name in cs_people.find_all("h3","p-name")]
names

['Marco Alvarez',
 'Samantha Armenti',
 'Sarah Brown',
 'Michael Conti',
 'Noah Daniels',
 'Lisa DiPippo',
 'Victor Fay-Wolfe',
 'Lutz Hamel',
 'Abdeltawab Hendawi',
 'Jean-Yves Hervé',
 'Natallia Katenka',
 'Soheyb Kouider',
 'Edmund Lamagna',
 'Indrani Mandal',
 'Gavino Puggioni',
 'Krishna Venkatasubramanian',
 'Jing Wu',
 'Yichi Zhang',
 'Guangyu Zhu',
 'Ashley Buchanan',
 'Nina Kajiji',
 'Rachel Schwartz',
 'Ying Zhang']

We can use the same process for each other attribute we want.  First, we'll
look at the whole peopleitem again, and then decide what we want.

In [25]:
cs_people.find_all("div","peopleitem")[0]

<div class="peopleitem h-card has-thumbnail">
<header>
<div class="header">
<figure>
<a href="https://web.uri.edu/cs/meet/marco-alvarez/"><img alt="" class="u-photo wp-post-image" height="120" loading="lazy" src="https://web.uri.edu/cs/files/marco-alvarez.png" width="120"/></a>
</figure>
<h3 class="p-name"><a href="https://web.uri.edu/cs/meet/marco-alvarez/">Marco Alvarez</a></h3>
</div>
</header>
<div class="inside">
<p class="people-title p-job-title">Assistant Professor | Director of Graduate Studies</p>
<p class="people-department">Computer Science</p>
<p class="people-misc"><span class="p-tel">401.874.5009</span> – <a class="u-email" href="mailto:malvarez@uri.edu">malvarez@uri.edu</a></p>
<div style="clear:both;"></div>
</div>
</div>

We'll extract the department, title, and e-mail.

In [26]:
disciplines = [d.string for d in cs_people.find_all("p",
                                                    'people-department')]
titles = [t.string for t in cs_people.find_all("p","people-title")]
emails = [e.string for e in cs_people.find_all("a",'u-email')]
pd.DataFrame({'name':names, 'title':titles,
              'e-mails':emails, 'discipline':disciplines})

Unnamed: 0,name,title,e-mails,discipline
0,Marco Alvarez,Assistant Professor | Director of Graduate Stu...,malvarez@uri.edu,Computer Science
1,Samantha Armenti,Lecturer,sarmenti@uri.edu,Computer Science
2,Sarah Brown,Assistant Professor,brownsarahm@uri.edu,Computer Science
3,Michael Conti,Lecturer,michaelconti@uri.edu,Computer Science
4,Noah Daniels,Assistant Professor,noah_daniels@uri.edu,Computer Science
5,Lisa DiPippo,Professor | Chair,ldipippo@uri.edu,Computer Science
6,Victor Fay-Wolfe,Professor,wolfe@cs.uri.edu,Computer Science
7,Lutz Hamel,Associate Professor,lutzhamel@uri.edu,Computer Science
8,Abdeltawab Hendawi,Assistant Professor,hendawi@uri.edu,Data Science | Computer Science
9,Jean-Yves Hervé,Associate Professor,jyh@cs.uri.edu,Computer Science
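
Building one list per column works here because every card has every field. If a card were missing a field, the lists would have different lengths and would no longer line up row by row. One alternative, sketched below on made-up cards, is to loop over the cards and look inside each one. The helper name `text_or_none` and the example HTML are assumptions for illustration; only the class names come from the page.

```python
from bs4 import BeautifulSoup

# Two hand-written cards; the second one has no e-mail on purpose.
example_html = """
<div class="peopleitem">
  <h3 class="p-name"><a href="https://example.com/a/">Person A</a></h3>
  <p class="people-title">Lecturer</p>
  <p class="people-department">Computer Science</p>
  <p class="people-misc"><a class="u-email" href="mailto:a@example.com ">a@example.com </a></p>
</div>
<div class="peopleitem">
  <h3 class="p-name"><a href="https://example.com/b/">Person B</a></h3>
  <p class="people-title">Professor</p>
  <p class="people-department">Statistics</p>
</div>
"""


def text_or_none(card, tag_name, css_class):
    """Return the stripped text of the first match inside card, or None."""
    match = card.find(tag_name, css_class)
    return match.get_text(strip=True) if match is not None else None


soup = BeautifulSoup(example_html, "html.parser")

# One dictionary per card keeps each person's fields together.
records = [
    {
        "name": text_or_none(card, "h3", "p-name"),
        "title": text_or_none(card, "p", "people-title"),
        "e-mail": text_or_none(card, "a", "u-email"),
        "discipline": text_or_none(card, "p", "people-department"),
    }
    for card in soup.find_all("div", "peopleitem")
]
print(records)
```

Passing `records` to `pd.DataFrame` would give one row per card, with a missing value wherever a field was absent. Stripping the text also removes the trailing spaces that some of the e-mail addresses on the page have.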


In [27]:
sp22_csc_sta_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/reg_CSCSTA_courses.csv'

In [28]:
courses_df = pd.read_csv(sp22_csc_sta_url)
courses_df.head()

Unnamed: 0,Subject,Cat#,Component,Section,GenEd,Title,Max Size,Campus,Acad Org,Class Stat,Course Topic,Instr Name,Instr Name 2,Instr Name 3
0,CSC,101,LEC,1,GE,Computing Concepts,35,URI,COMP_SCIEN,A,,"Fay-Wolfe,Victor",,
1,CSC,101,LEC,2,GE,Computing Concepts,35,ONLIN,COMP_SCIEN,A,,"Fay-Wolfe,Victor",,
2,CSC,101,LEC,3,GE,Computing Concepts,35,ONLIN,COMP_SCIEN,A,,Staff,,
3,CSC,101,LEC,L01,GE,Computing Concepts,35,ONLIN,COMP_SCIEN,A,,"Fay-Wolfe,Victor",,
4,CSC,104,LEC,1,GE,Puzzles+Games=Analytical Think,40,URI,COMP_SCIEN,A,,"Mandal,Indrani",,


We saw the `attrs` for a link above, where there was only one attribute on the tag. Other tags have many attributes, for example the images on each person's card.

In [29]:
cs_people.find_all("div","peopleitem")[0].img.attrs

{'width': '120',
 'height': '120',
 'src': 'https://web.uri.edu/cs/files/marco-alvarez.png',
 'class': ['u-photo', 'wp-post-image'],
 'alt': '',
 'loading': 'lazy'}

In [30]:
cs_people.find_all("p")

[<p class="people-title p-job-title">Assistant Professor | Director of Graduate Studies</p>,
 <p class="people-department">Computer Science</p>,
 <p class="people-misc"><span class="p-tel">401.874.5009</span> – <a class="u-email" href="mailto:malvarez@uri.edu">malvarez@uri.edu</a></p>,
 <p class="people-title p-job-title">Lecturer</p>,
 <p class="people-department">Computer Science</p>,
 <p class="people-misc"><a class="u-email" href="mailto:sarmenti@uri.edu ">sarmenti@uri.edu </a></p>,
 <p class="people-title p-job-title">Assistant Professor</p>,
 <p class="people-department">Computer Science</p>,
 <p class="people-misc"><a class="u-email" href="mailto:brownsarahm@uri.edu">brownsarahm@uri.edu</a></p>,
 <p class="people-title p-job-title">Lecturer</p>,
 <p class="people-department">Computer Science</p>,
 <p class="people-misc"><a class="u-email" href="mailto:michaelconti@uri.edu ">michaelconti@uri.edu </a></p>,
 <p class="people-title p-job-title">Assistant Professor</p>,
 <p class="pe

URI websites are probably formatted consistently, so we could build information about more departments.

- [csc/sta emeriti](https://web.uri.edu/cs/people/faculty-emeriti/)
- [a&s dean's office](https://web.uri.edu/artsci/people/)
- [math](https://www.math.uri.edu/people/)
- [philosophy](https://web.uri.edu/philosophy/people/)
- [business](https://web.uri.edu/business/people/faculty/)



## Thinking Ahead


The spreadsheet of spring classes in the department is posted:

```
sp22_csc_sta_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/reg_CSCSTA_courses.csv'
```

This is a minimal copy where I removed enrollments and locations in case those change.
It is derived from the last version the Dean asked us to make corrections to, so things
will definitely be different before registration opens (e.g., I'm teaching
CSC392 and it's listed as "Staff").

Sections have definitely been added or removed and teaching assignments changed, so
don't use this for making plans.

How could you merge this with the DataFrame we just scraped?
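
One possible starting point, offered as a sketch rather than the answer: the course table writes instructors as `Last,First` (for example `Fay-Wolfe,Victor`), while the scraped names are `First Last`, so the two need a common format before they can be matched. The function name `to_first_last` is made up for illustration.

```python
def to_first_last(instructor_name):
    """Convert 'Last,First' to 'First Last'; leave other values unchanged."""
    # Missing values and entries without a comma (such as "Staff") pass through.
    if not isinstance(instructor_name, str) or "," not in instructor_name:
        return instructor_name
    last, first = instructor_name.split(",", 1)
    return f"{first.strip()} {last.strip()}"


print(to_first_last("Fay-Wolfe,Victor"))
print(to_first_last("Staff"))
```

Applying this to the `Instr Name` column with `apply` would give a column that could be matched against the scraped `name` column in `pd.merge`, assuming the scraped DataFrame has been saved to a variable first.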




## More Practice

1. Add a phone number column
1. The page linked from each person's name lists their office number; add a column for office number.
1. Make the code we wrote in class into a function so that you can pass a page and get back a DataFrame.
1. Parse the [course descriptions](https://web.uri.edu/cs/academics/computer-science/course-descriptions/)  page and make a DataFrame with columns for subject code, course number, course title, and course description.
1. Merge the descriptions with the table about enrollments and instructors.