{ "cells": [ { "cell_type": "markdown", "id": "521b5929", "metadata": {}, "source": [ "# Web Scraping" ] }, { "cell_type": "code", "execution_count": 1, "id": "f55da20f", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "4346de7c", "metadata": {}, "source": [ "````{admonition} Tip\n", "To update a package from within a running Jupyter\n", "notebook:\n", "```\n", "!pip install --upgrade packagename\n", "```\n", "then restart the kernel.\n", "````\n", "\n", "## Figuring out what to scrape\n", "\n", "We're going to create a DataFrame about URI CS & Statistics Faculty\n", "from the [people page](https://web.uri.edu/cs/people/) of the department website.\n", "\n", "\n", "With great power comes great responsibility.\n", "\n", "- always check [robots.txt](https://web.uri.edu/robots.txt)\n", "- do not do things that the owner says not to do\n", "- government websites are typically safe, because of open data rules\n", "\n", "\n", "We can inspect the page to check that it's well structured by right-clicking in\n", "a browser tab and looking at the same code that our browser sees in order to\n", "render the page.\n", "\n", "````{margin}\n", "```{admonition} Think Ahead\n", "You can use this same basic logic, that anything can be data, to consider other\n", "questions, for example:\n", "- How do the sizes of datasets for Tidy Tuesday vary? (you don't need to download and load all of the data, only traverse the readmes)\n", "- On average, how many code cells are there in each class? How much does the amount of code in the posted notes vary from the notes you take in class?\n", "\n", "```\n", "````\n", "We can basically think of web scraping as loading data that's *not* tabular but\n", "instead is formatted as HTML code. 
HTML code *can* be well structured and\n", "hierarchical within a single page, or you could collect information about a\n", "broad topic by using a little bit from many pages. We're going to work\n", "from one page here.\n", "\n", "HTML code consists of tags and the text of the page. The tags label the content\n", "and define the structure of the page.\n", "\n", "![HTML tree structure](http://www.w3schools.com/js/pic_htmltree.gif)\n", "\n", "Once we've decided that it will work, we can begin working." ] }, { "cell_type": "code", "execution_count": 2, "id": "c526b5b9", "metadata": {}, "outputs": [], "source": [ "cs_people_url = 'https://web.uri.edu/cs/people/'" ] }, { "cell_type": "markdown", "id": "af77233f", "metadata": {}, "source": [ "## Loading with requests\n", "\n", "\n", "\n", "First, we `get` the data using the Python [requests](https://docs.python-requests.org/en/latest/)\n", "library." ] }, { "cell_type": "code", "execution_count": 3, "id": "be103791", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "requests.get(cs_people_url)" ] }, { "cell_type": "markdown", "id": "3254775c", "metadata": {}, "source": [ "This returns a response object, but we want the content from it, so we'll save that to a\n", "variable." ] }, { "cell_type": "code", "execution_count": 4, "id": "c954f57c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'\\n\\n\\n\\t\\n\\n\\n\\n\\n\\nPeople – Department of Computer Science and Statistics\\n\\r\\n\\t \\n\\n\\n\\n\\t\\t\\n\\t\\t\\n\\t\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\n\\n\\n\\n\\n\\n\\n\\n\\t\\n\\n\\t\\n\\t\\n
\\n\\tSkip to content\\n\\n\\n\\t
\\n\\t\\t\\n\\t
\\n\\t\\t\\n\\t\\t
\"University
\\n\\t\\t\\n\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t
\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t
\\n\\t\\t
\\n\\t\\t\\n\\t\\t
\\n\\t\\t\\t
\\n\\t\\t\\t\\t
University of Rhode Island
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t
\\n\\t\\t
\\n\\n\\t
\\n\\t\\t\\n\\t\\t
\\n\\n\\t\\t\\t\\n\\n
\\n\\t\\n\\t
\\n\\t\\t
\\n\\t\\t
\\n\\t
\\n\\t\\n\\t
\\n\\n\\t\\t\\n
\\n\\t\\t\\t\\n\\t

\\n\\t\\t\\n\\t\\t\\tDepartment of Computer Science and Statistics\\t\\t\\n\\t

\\n\\t\\t\\t

College of Arts and Sciences

\\n\\t\\n
\\n\\t\\t\\n\\t\\t
\\n\\t\\t\\t\\t\\t
\\n\\t\\t\\n\\t
\\n\\n
\\n\\n\\t\\t\\t
\\n\\t\\t\\t\\t\\n\\n\\n \\n\\t\\t\\t\\t\\n\\n\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t\\n\\t
\\n\\t\\t\\n\\t\\n\\n\\t
\\n\\n\\t\\t\\n
\\n\\t\\n\\t
\\n\\t\\t

People

\\n
\\n

Full-time Faculty

\\n

\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Marco Alvarez

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor | Director of Graduate Studies

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.5009malvarez@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Samantha Armenti

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Lecturer

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

sarmenti@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Sarah Brown

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

brownsarahm@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Michael Conti

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Lecturer

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

michaelconti@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Noah Daniels

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

noah_daniels@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Lisa DiPippo

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Professor | Chair

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

ldipippo@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Victor Fay-Wolfe

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

wolfe@cs.uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Lutz Hamel

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Associate Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

lutzhamel@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Abdeltawab Hendawi

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Data Science | Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.5738hendawi@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Jean-Yves Herv\\xc3\\xa9

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Associate Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

jyh@cs.uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Natallia Katenka

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Associate Professor | Director of Undergraduate Studies

\\n\\t\\t\\n\\t\\t\\t\\t

Statistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

nkatenka@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Soheyb Kouider

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Lecturer

\\n\\t\\t\\n\\t\\t\\t\\t

Statistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.2562soheyb@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Edmund Lamagna

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

eal@cs.uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Indrani Mandal

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Lecturer

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

indrani_mandal@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Gavino Puggioni

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Associate Professor | Statistics Section Head | Director of Graduate Studies

\\n\\t\\t\\n\\t\\t\\t\\t

Statistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.4388gpuggioni@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Krishna Venkatasubramanian

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

krish@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Jing Wu

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Statistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.4504jing_wu@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Yichi Zhang

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Statistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

yichizhang@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Guangyu Zhu

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Statistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

guangyuzhu@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n

\\n

Adjunct Faculty and Limited Join Appointments

\\n

\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Ashley Buchanan

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Limited Joint Appointment

\\n\\t\\t\\n\\t\\t\\t\\t

Biostatistics

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.4739buchanan@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Nina Kajiji

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Adjunct Associate Professor

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

nina@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Rachel Schwartz

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor \\xe2\\x80\\x93 Limited Joint Appointment

\\n\\t\\t\\n\\t\\t\\t\\t

Biological Sciences

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.5404rsschwartz@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n\\t
\\n\\t\\t
\\n\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\"\"\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t

Ying Zhang

\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t
\\n\\t
\\n\\t
\\n\\n\\t\\t

Assistant Professor \\xe2\\x80\\x93 Limited Joint Appointment

\\n\\t\\t\\n\\t\\t\\t\\t

Computer Science

\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\t\\t

401.874.4915yingzhang@uri.edu

\\n\\t\\t\\n\\t\\t
\\n\\t
\\n
\\n
\\n

\\n\\t

\\n\\n\\t
\\n\\n\\t
\\n\\n\\n\\t
\\n\\n\\t\\n\\t
\\n\\t\\t
\\n\\t\\t\\tConnectApplyTourGive\\t\\t
\\n\\t
\\n\\n\\t
\\n\\t\\t
\\n\\t\\t\\t
\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t
\\n\\t\\t\\t
\\n\\t\\t\\t
\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t
\\n\\t\\t
\\n\\t\\t
\\n\\t\\t
\\n\\t\\t\\t

Copyright © University of Rhode Island | University of Rhode Island, Kingston, RI 02881, USA | 1.401.874.1000

\\n\\t\\t\\t

URI is an equal opportunity employer committed to the principles of affirmative action.  Work at URI

\\n\\t\\t
\\n\\t
\\n
\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people_html = requests.get(cs_people_url).content\n", "\n", "cs_people_html" ] }, { "cell_type": "markdown", "id": "99a346ba", "metadata": {}, "source": [ "This is exactly the HTML that a browser would receive, but we pulled it into\n", "Python.\n", "\n", "\n", "## Parsing with BeautifulSoup\n", "\n", "Next, we'll use BeautifulSoup to parse the text. Parsing means to make\n", "sense of it. In this case, it transforms the content from a string of characters into a data structure\n", "that we can work with." ] }, { "cell_type": "code", "execution_count": 5, "id": "ef6165b1", "metadata": {}, "outputs": [], "source": [ "cs_people = BeautifulSoup(cs_people_html, 'html.parser')" ] }, { "cell_type": "markdown", "id": "ddadb82c", "metadata": {}, "source": [ "First, we note that the newlines are now displayed as actual line breaks." ] }, { "cell_type": "code", "execution_count": 6, "id": "4f9e05fb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "People – Department of Computer Science and Statistics\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "Skip to content\n", "
\n", "
\n", "
\"University
\n", "
\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "
University of Rhode Island
\n", "
\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", "\n", "\t\t\tDepartment of Computer Science and Statistics\t\t\n", "

\n", "

College of Arts and Sciences

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

People

\n", "
\n", "

Full-time Faculty

\n", "\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Marco Alvarez

\n", "
\n", "
\n", "
\n", "

Assistant Professor | Director of Graduate Studies

\n", "

Computer Science

\n", "

401.874.5009malvarez@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Samantha Armenti

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Computer Science

\n", "

sarmenti@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Sarah Brown

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Computer Science

\n", "

brownsarahm@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Michael Conti

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Computer Science

\n", "

michaelconti@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Noah Daniels

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Computer Science

\n", "

noah_daniels@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Lisa DiPippo

\n", "
\n", "
\n", "
\n", "

Professor | Chair

\n", "

Computer Science

\n", "

ldipippo@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Victor Fay-Wolfe

\n", "
\n", "
\n", "
\n", "

Professor

\n", "

Computer Science

\n", "

wolfe@cs.uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Lutz Hamel

\n", "
\n", "
\n", "
\n", "

Associate Professor

\n", "

Computer Science

\n", "

lutzhamel@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Abdeltawab Hendawi

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Data Science | Computer Science

\n", "

401.874.5738hendawi@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Jean-Yves Hervé

\n", "
\n", "
\n", "
\n", "

Associate Professor

\n", "

Computer Science

\n", "

jyh@cs.uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Natallia Katenka

\n", "
\n", "
\n", "
\n", "

Associate Professor | Director of Undergraduate Studies

\n", "

Statistics

\n", "

nkatenka@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Soheyb Kouider

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Statistics

\n", "

401.874.2562soheyb@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Edmund Lamagna

\n", "
\n", "
\n", "
\n", "

Professor

\n", "

Computer Science

\n", "

eal@cs.uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Indrani Mandal

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Computer Science

\n", "

indrani_mandal@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Gavino Puggioni

\n", "
\n", "
\n", "
\n", "

Associate Professor | Statistics Section Head | Director of Graduate Studies

\n", "

Statistics

\n", "

401.874.4388gpuggioni@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Krishna Venkatasubramanian

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Computer Science

\n", "

krish@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Jing Wu

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Statistics

\n", "

401.874.4504jing_wu@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Yichi Zhang

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Statistics

\n", "

yichizhang@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Guangyu Zhu

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Statistics

\n", "

guangyuzhu@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "

\n", "

Adjunct Faculty and Limited Join Appointments

\n", "

\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Ashley Buchanan

\n", "
\n", "
\n", "
\n", "

Limited Joint Appointment

\n", "

Biostatistics

\n", "

401.874.4739buchanan@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Nina Kajiji

\n", "
\n", "
\n", "
\n", "

Adjunct Associate Professor

\n", "

Computer Science

\n", "

nina@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Rachel Schwartz

\n", "
\n", "
\n", "
\n", "

Assistant Professor – Limited Joint Appointment

\n", "

Biological Sciences

\n", "

401.874.5404rsschwartz@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Ying Zhang

\n", "
\n", "
\n", "
\n", "

Assistant Professor – Limited Joint Appointment

\n", "

Computer Science

\n", "

401.874.4915yingzhang@uri.edu

\n", "
\n", "
\n", "
\n", "
\n", "

\n", "

\n", "
\n", "
\n", "
\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people" ] }, { "cell_type": "markdown", "id": "dbc730da", "metadata": {}, "source": [ "We can use `prettify` method to format it more nicely. This would add tabs even\n", "if there were none in the source code." ] }, { "cell_type": "code", "execution_count": 7, "id": "b2b93895", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " People – Department of Computer Science and Statistics\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " Skip to content\n", " \n", "
\n", "
\n", "
\n", " \"University\n", "
\n", "
\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", " University of Rhode Island\n", "
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Department of Computer Science and Statistics\n", " \n", "

\n", "

\n", " College of Arts and Sciences\n", "

\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", " \n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " People\n", "

\n", "
\n", "
\n", " \n", "
\n", "
\n", "

\n", " Full-time Faculty\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Marco Alvarez\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor | Director of Graduate Studies\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " 401.874.5009\n", " \n", " –\n", " \n", " malvarez@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Samantha Armenti\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Lecturer\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " sarmenti@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Sarah Brown\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " brownsarahm@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Michael Conti\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Lecturer\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " michaelconti@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Noah Daniels\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " noah_daniels@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Lisa DiPippo\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Professor | Chair\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " ldipippo@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Victor Fay-Wolfe\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " wolfe@cs.uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Lutz Hamel\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Associate Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " lutzhamel@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Abdeltawab Hendawi\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Data Science | Computer Science\n", "

\n", "

\n", " \n", " 401.874.5738\n", " \n", " –\n", " \n", " hendawi@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Jean-Yves Hervé\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Associate Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " jyh@cs.uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Natallia Katenka\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Associate Professor | Director of Undergraduate Studies\n", "

\n", "

\n", " Statistics\n", "

\n", "

\n", " \n", " nkatenka@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " \n", " Soheyb Kouider\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Lecturer\n", "

\n", "

\n", " Statistics\n", "

\n", "

\n", " \n", " 401.874.2562\n", " \n", " –\n", " \n", " soheyb@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Edmund Lamagna\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " eal@cs.uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Indrani Mandal\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Lecturer\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " indrani_mandal@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Gavino Puggioni\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Associate Professor | Statistics Section Head | Director of Graduate Studies\n", "

\n", "

\n", " Statistics\n", "

\n", "

\n", " \n", " 401.874.4388\n", " \n", " –\n", " \n", " gpuggioni@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Krishna Venkatasubramanian\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " krish@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Jing Wu\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Statistics\n", "

\n", "

\n", " \n", " 401.874.4504\n", " \n", " –\n", " \n", " jing_wu@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Yichi Zhang\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Statistics\n", "

\n", "

\n", " \n", " yichizhang@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Guangyu Zhu\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor\n", "

\n", "

\n", " Statistics\n", "

\n", "

\n", " \n", " guangyuzhu@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", "

\n", " Adjunct Faculty and Limited Join Appointments\n", "

\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Ashley Buchanan\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Limited Joint Appointment\n", "

\n", "

\n", " Biostatistics\n", "

\n", "

\n", " \n", " 401.874.4739\n", " \n", " –\n", " \n", " buchanan@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Nina Kajiji\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Adjunct Associate Professor\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " nina@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Rachel Schwartz\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor – Limited Joint Appointment\n", "

\n", "

\n", " Biological Sciences\n", "

\n", "

\n", " \n", " 401.874.5404\n", " \n", " –\n", " \n", " rsschwartz@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \"\"\n", " \n", "
\n", "

\n", " \n", " Ying Zhang\n", " \n", "

\n", "
\n", "
\n", "
\n", "

\n", " Assistant Professor – Limited Joint Appointment\n", "

\n", "

\n", " Computer Science\n", "

\n", "

\n", " \n", " 401.874.4915\n", " \n", " –\n", " \n", " yingzhang@uri.edu\n", " \n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", "

\n", "
\n", " \n", "
\n", " \n", "
\n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n" ] } ], "source": [ "print(cs_people.prettify())" ] }, { "cell_type": "markdown", "id": "2495b8c1", "metadata": {}, "source": [ "It also exposes the tags (the structure of the HTML) as attributes of the parsed object." ] }, { "cell_type": "code", "execution_count": 8, "id": "1bc07adc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "People – Department of Computer Science and Statistics" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.title" ] }, { "cell_type": "markdown", "id": "82728b93", "metadata": {}, "source": [ "For tags that have multiple instances, like the `<a>` tag that defines a link, it returns the first instance." ] }, { "cell_type": "code", "execution_count": 9, "id": "ab0a084d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Skip to content" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.a" ] }, { "cell_type": "markdown", "id": "a5567efb", "metadata": {}, "source": [ "## Finding all instances of a tag\n", "\n", "We can use `find_all` to make a list of all occurrences of a tag. For example, we\n", "could get all of the links from a page:" ] }, { "cell_type": "code", "execution_count": 10, "id": "bad8e488", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Skip to content,\n", "
University of Rhode Island
,\n", " Future Students,\n", " Students,\n", " Faculty,\n", " Staff,\n", " Parents and Families,\n", " Alumni,\n", " Community,\n", " \n", " \t\t\tDepartment of Computer Science and Statistics\t\t,\n", " URI,\n", " Arts and Sciences,\n", " Department of Computer Science and Statistics,\n", " Home,\n", " About,\n", " Academics,\n", " People,\n", " Research,\n", " News and Events,\n", " Contact,\n", " Faculty,\n", " Staff,\n", " Faculty Emeriti,\n", " \"\",\n", " Marco Alvarez,\n", " malvarez@uri.edu,\n", " Samantha Armenti,\n", " sarmenti@uri.edu ,\n", " \"\",\n", " Sarah Brown,\n", " brownsarahm@uri.edu,\n", " \"\",\n", " Michael Conti,\n", " michaelconti@uri.edu ,\n", " \"\",\n", " Noah Daniels,\n", " noah_daniels@uri.edu,\n", " \"\",\n", " Lisa DiPippo,\n", " ldipippo@uri.edu,\n", " \"\",\n", " Victor Fay-Wolfe,\n", " wolfe@cs.uri.edu,\n", " Lutz Hamel,\n", " lutzhamel@uri.edu,\n", " \"\",\n", " Abdeltawab Hendawi,\n", " hendawi@uri.edu,\n", " \"\",\n", " Jean-Yves Hervé,\n", " jyh@cs.uri.edu,\n", " \"\",\n", " Natallia Katenka,\n", " nkatenka@uri.edu,\n", " Soheyb Kouider,\n", " soheyb@uri.edu,\n", " \"\",\n", " Edmund Lamagna,\n", " eal@cs.uri.edu,\n", " \"\",\n", " Indrani Mandal,\n", " indrani_mandal@uri.edu ,\n", " \"\",\n", " Gavino Puggioni,\n", " gpuggioni@uri.edu,\n", " \"\",\n", " Krishna Venkatasubramanian,\n", " krish@uri.edu,\n", " \"\",\n", " Jing Wu,\n", " jing_wu@uri.edu,\n", " \"\",\n", " Yichi Zhang,\n", " yichizhang@uri.edu,\n", " \"\",\n", " Guangyu Zhu,\n", " guangyuzhu@uri.edu,\n", " \"\",\n", " Ashley Buchanan,\n", " buchanan@uri.edu,\n", " \"\",\n", " Nina Kajiji,\n", " nina@uri.edu,\n", " \"\",\n", " Rachel Schwartz,\n", " rsschwartz@uri.edu,\n", " \"\",\n", " Ying Zhang,\n", " yingzhang@uri.edu,\n", " Connect,\n", " Apply,\n", " Tour,\n", " Give,\n", " Leadership,\n", " Diversity and Inclusion,\n", " Global,\n", " Campuses,\n", " Safety,\n", " Housing,\n", " Dining,\n", " Athletics and Recreation,\n", " Health and Wellness,\n", " 
Events,\n", " Undergraduate,\n", " Graduate,\n", " Advising,\n", " Libraries,\n", " Internships,\n", " Facebook,\n", " Instagram,\n", " Twitter,\n", " YouTube,\n", " University of Rhode Island,\n", " Work at URI]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.find_all('a')" ] }, { "cell_type": "markdown", "id": "cc8d7be8", "metadata": {}, "source": [ "We noted before that each person's information is contained in a `div` tag with\n", "the `class = peopleitem`. `find_all` can also take values for attributes of a\n", "tag. Attributes are the modifiers of a tag. Here, the class is a label\n", "that defines formatting and also acts as metadata about what is in the\n", "div. Scraping relies on the HTML code being well organized." ] }, { "cell_type": "code", "execution_count": 11, "id": "ee4a79bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Marco Alvarez

\n", "
\n", "
\n", "
\n", "

Assistant Professor | Director of Graduate Studies

\n", "

Computer Science

\n", "

401.874.5009malvarez@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "

Samantha Armenti

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Computer Science

\n", "

sarmenti@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Sarah Brown

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Computer Science

\n", "

brownsarahm@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Michael Conti

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Computer Science

\n", "

michaelconti@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Noah Daniels

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Computer Science

\n", "

noah_daniels@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Lisa DiPippo

\n", "
\n", "
\n", "
\n", "

Professor | Chair

\n", "

Computer Science

\n", "

ldipippo@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Victor Fay-Wolfe

\n", "
\n", "
\n", "
\n", "

Professor

\n", "

Computer Science

\n", "

wolfe@cs.uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "

Lutz Hamel

\n", "
\n", "
\n", "
\n", "

Associate Professor

\n", "

Computer Science

\n", "

lutzhamel@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Abdeltawab Hendawi

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Data Science | Computer Science

\n", "

401.874.5738hendawi@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Jean-Yves Hervé

\n", "
\n", "
\n", "
\n", "

Associate Professor

\n", "

Computer Science

\n", "

jyh@cs.uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Natallia Katenka

\n", "
\n", "
\n", "
\n", "

Associate Professor | Director of Undergraduate Studies

\n", "

Statistics

\n", "

nkatenka@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "

Soheyb Kouider

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Statistics

\n", "

401.874.2562soheyb@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Edmund Lamagna

\n", "
\n", "
\n", "
\n", "

Professor

\n", "

Computer Science

\n", "

eal@cs.uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Indrani Mandal

\n", "
\n", "
\n", "
\n", "

Lecturer

\n", "

Computer Science

\n", "

indrani_mandal@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Gavino Puggioni

\n", "
\n", "
\n", "
\n", "

Associate Professor | Statistics Section Head | Director of Graduate Studies

\n", "

Statistics

\n", "

401.874.4388gpuggioni@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Krishna Venkatasubramanian

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Computer Science

\n", "

krish@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Jing Wu

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Statistics

\n", "

401.874.4504jing_wu@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Yichi Zhang

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Statistics

\n", "

yichizhang@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Guangyu Zhu

\n", "
\n", "
\n", "
\n", "

Assistant Professor

\n", "

Statistics

\n", "

guangyuzhu@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Ashley Buchanan

\n", "
\n", "
\n", "
\n", "

Limited Joint Appointment

\n", "

Biostatistics

\n", "

401.874.4739buchanan@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Nina Kajiji

\n", "
\n", "
\n", "
\n", "

Adjunct Associate Professor

\n", "

Computer Science

\n", "

nina@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Rachel Schwartz

\n", "
\n", "
\n", "
\n", "

Assistant Professor – Limited Joint Appointment

\n", "

Biological Sciences

\n", "

401.874.5404rsschwartz@uri.edu

\n", "
\n", "
\n", "
,\n", "
\n", "
\n", "
\n", "
\n", " \"\"\n", "
\n", "

Ying Zhang

\n", "
\n", "
\n", "
\n", "

Assistant Professor – Limited Joint Appointment

\n", "

Computer Science

\n", "

401.874.4915yingzhang@uri.edu

\n", "
\n", "
\n", "
]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.find_all(\"div\",\"peopleitem\")" ] }, { "cell_type": "markdown", "id": "6456b87c", "metadata": {}, "source": [ "We could use this to see how many people there are:" ] }, { "cell_type": "code", "execution_count": 12, "id": "107efeca", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "23" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(cs_people.find_all(\"div\",\"peopleitem\"))" ] }, { "cell_type": "markdown", "id": "e194af2d", "metadata": {}, "source": [ "or search for a different class to see how many people have a thumbnail:" ] }, { "cell_type": "code", "execution_count": 13, "id": "581c8694", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(cs_people.find_all(\"div\",{\"has-thumbnail\"}))" ] }, { "cell_type": "markdown", "id": "b1a71463", "metadata": {}, "source": [ "## Finding data we can make tabular\n", "\n", "We can look at the first one in detail to determine what to extract for each of the columns." ] }, { "cell_type": "code", "execution_count": 14, "id": "38ba1be4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Marco Alvarez

\n", "
\n", "
\n", "
\n", "

Assistant Professor | Director of Graduate Studies

\n", "

Computer Science

\n", "

401.874.5009malvarez@uri.edu

\n", "
\n", "
\n", "
" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.find_all(\"div\",\"peopleitem\")[0]" ] }, { "cell_type": "markdown", "id": "b0a66c92", "metadata": {}, "source": [ "We can see that the name is an `<h3>` tag with `class = \"p-name\"`" ] }, { "cell_type": "code", "execution_count": 15, "id": "d0a22f52", "metadata": {}, "outputs": [], "source": [ "first_name = cs_people.find_all(\"h3\",\"p-name\")[0]" ] }, { "cell_type": "markdown", "id": "284ff9b6", "metadata": {}, "source": [ "We can examine this using our typical tools:" ] }, { "cell_type": "code", "execution_count": 16, "id": "55b843f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(first_name)" ] }, { "cell_type": "markdown", "id": "e7668492", "metadata": {}, "source": [ "It has attributes, such as `contents`, since it's a tag object:" ] }, { "cell_type": "code", "execution_count": 17, "id": "b58bba35", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Marco Alvarez]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_name.contents" ] }, { "cell_type": "markdown", "id": "bcdfefad", "metadata": {}, "source": [ "What we want is the string:" ] }, { "cell_type": "code", "execution_count": 18, "id": "602683b5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Marco Alvarez'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_name.string" ] }, { "cell_type": "markdown", "id": "92b3f719", "metadata": {}, "source": [ "```{admonition} Question in class\n", "How can we extract the link?\n", "```\n", "\n", "That's a child, because the `<a>` tag is inside the `<h3>` tag, so let's explore the children." 
] }, { "cell_type": "code", "execution_count": 19, "id": "caa816a4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[bs4.element.Tag]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[type(c) for c in first_name.children]" ] }, { "cell_type": "markdown", "id": "392fe51e", "metadata": {}, "source": [ "Alternatively, we can pick the `a` tag out by name." ] }, { "cell_type": "code", "execution_count": 20, "id": "895e29fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Marco Alvarez" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_name.a" ] }, { "cell_type": "markdown", "id": "01c5fd2a", "metadata": {}, "source": [ "The URL, or `href`, is an attribute of the `a` tag." ] }, { "cell_type": "code", "execution_count": 21, "id": "f2a64557", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'href': 'https://web.uri.edu/cs/meet/marco-alvarez/'}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_name.a.attrs" ] }, { "cell_type": "code", "execution_count": 22, "id": "11e869d6", "metadata": { "tag": [ "raises-exception" ] }, "outputs": [ { "ename": "AttributeError", "evalue": "'dict' object has no attribute 'href'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "Input \u001b[0;32mIn [22]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mfirst_name\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43ma\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mattrs\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mhref\u001b[49m\n", "\u001b[0;31mAttributeError\u001b[0m: 'dict' object has no attribute 'href'" ] } ], "source": [ "first_name.a.attrs.href" ] }, { 
"cell_type": "markdown", "id": "460a9c97", "metadata": {}, "source": [ "`attrs` is a dictionary, so attribute access with `.href` fails; we can get the value out using `[]` to index into the `attrs` dictionary" ] }, { "cell_type": "code", "execution_count": 23, "id": "12f80a30", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://web.uri.edu/cs/meet/marco-alvarez/'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_name.a.attrs['href']" ] }, { "cell_type": "markdown", "id": "21a19ff5", "metadata": {}, "source": [ "## Building a DataFrame\n", "\n", "Now that we know what to look for, we can start building. First, we'll find all of the names and extract the string from each using a list comprehension." ] }, { "cell_type": "code", "execution_count": 24, "id": "76857c3e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Marco Alvarez',\n", " 'Samantha Armenti',\n", " 'Sarah Brown',\n", " 'Michael Conti',\n", " 'Noah Daniels',\n", " 'Lisa DiPippo',\n", " 'Victor Fay-Wolfe',\n", " 'Lutz Hamel',\n", " 'Abdeltawab Hendawi',\n", " 'Jean-Yves Hervé',\n", " 'Natallia Katenka',\n", " 'Soheyb Kouider',\n", " 'Edmund Lamagna',\n", " 'Indrani Mandal',\n", " 'Gavino Puggioni',\n", " 'Krishna Venkatasubramanian',\n", " 'Jing Wu',\n", " 'Yichi Zhang',\n", " 'Guangyu Zhu',\n", " 'Ashley Buchanan',\n", " 'Nina Kajiji',\n", " 'Rachel Schwartz',\n", " 'Ying Zhang']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names = [name.string for name in cs_people.find_all(\"h3\",\"p-name\")]\n", "names" ] }, { "cell_type": "markdown", "id": "be660517", "metadata": {}, "source": [ "We can use the same process for each other attribute we want. First, we'll\n", "look at the whole peopleitem again, and then decide what we want." ] }, { "cell_type": "code", "execution_count": 25, "id": "7625bfba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
\n", "
\n", "
\n", "
\n", "\"\"\n", "
\n", "

Marco Alvarez

\n", "
\n", "
\n", "
\n", "

Assistant Professor | Director of Graduate Studies

\n", "

Computer Science

\n", "

401.874.5009malvarez@uri.edu

\n", "
\n", "
\n", "
" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.find_all(\"div\",\"peopleitem\")[0]" ] }, { "cell_type": "markdown", "id": "f10089af", "metadata": {}, "source": [ "We'll extract the department, title, and e-mail." ] }, { "cell_type": "code", "execution_count": 26, "id": "0a96f872", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametitlee-mailsdiscipline
0Marco AlvarezAssistant Professor | Director of Graduate Stu...malvarez@uri.eduComputer Science
1Samantha ArmentiLecturersarmenti@uri.eduComputer Science
2Sarah BrownAssistant Professorbrownsarahm@uri.eduComputer Science
3Michael ContiLecturermichaelconti@uri.eduComputer Science
4Noah DanielsAssistant Professornoah_daniels@uri.eduComputer Science
5Lisa DiPippoProfessor | Chairldipippo@uri.eduComputer Science
6Victor Fay-WolfeProfessorwolfe@cs.uri.eduComputer Science
7Lutz HamelAssociate Professorlutzhamel@uri.eduComputer Science
8Abdeltawab HendawiAssistant Professorhendawi@uri.eduData Science | Computer Science
9Jean-Yves HervéAssociate Professorjyh@cs.uri.eduComputer Science
10Natallia KatenkaAssociate Professor | Director of Undergraduat...nkatenka@uri.eduStatistics
11Soheyb KouiderLecturersoheyb@uri.eduStatistics
12Edmund LamagnaProfessoreal@cs.uri.eduComputer Science
13Indrani MandalLecturerindrani_mandal@uri.eduComputer Science
14Gavino PuggioniAssociate Professor | Statistics Section Head...gpuggioni@uri.eduStatistics
15Krishna VenkatasubramanianAssistant Professorkrish@uri.eduComputer Science
16Jing WuAssistant Professorjing_wu@uri.eduStatistics
17Yichi ZhangAssistant Professoryichizhang@uri.eduStatistics
18Guangyu ZhuAssistant Professorguangyuzhu@uri.eduStatistics
19Ashley BuchananLimited Joint Appointmentbuchanan@uri.eduBiostatistics
20Nina KajijiAdjunct Associate Professornina@uri.eduComputer Science
21Rachel SchwartzAssistant Professor – Limited Joint Appointmentrsschwartz@uri.eduBiological Sciences
22Ying ZhangAssistant Professor – Limited Joint Appointmentyingzhang@uri.eduComputer Science
\n", "
" ], "text/plain": [ " name \\\n", "0 Marco Alvarez \n", "1 Samantha Armenti \n", "2 Sarah Brown \n", "3 Michael Conti \n", "4 Noah Daniels \n", "5 Lisa DiPippo \n", "6 Victor Fay-Wolfe \n", "7 Lutz Hamel \n", "8 Abdeltawab Hendawi \n", "9 Jean-Yves Hervé \n", "10 Natallia Katenka \n", "11 Soheyb Kouider \n", "12 Edmund Lamagna \n", "13 Indrani Mandal \n", "14 Gavino Puggioni \n", "15 Krishna Venkatasubramanian \n", "16 Jing Wu \n", "17 Yichi Zhang \n", "18 Guangyu Zhu \n", "19 Ashley Buchanan \n", "20 Nina Kajiji \n", "21 Rachel Schwartz \n", "22 Ying Zhang \n", "\n", " title \\\n", "0 Assistant Professor | Director of Graduate Stu... \n", "1 Lecturer \n", "2 Assistant Professor \n", "3 Lecturer \n", "4 Assistant Professor \n", "5 Professor | Chair \n", "6 Professor \n", "7 Associate Professor \n", "8 Assistant Professor \n", "9 Associate Professor \n", "10 Associate Professor | Director of Undergraduat... \n", "11 Lecturer \n", "12 Professor \n", "13 Lecturer \n", "14 Associate Professor | Statistics Section Head... 
\n", "15 Assistant Professor \n", "16 Assistant Professor \n", "17 Assistant Professor \n", "18 Assistant Professor \n", "19 Limited Joint Appointment \n", "20 Adjunct Associate Professor \n", "21 Assistant Professor – Limited Joint Appointment \n", "22 Assistant Professor – Limited Joint Appointment \n", "\n", " e-mails discipline \n", "0 malvarez@uri.edu Computer Science \n", "1 sarmenti@uri.edu Computer Science \n", "2 brownsarahm@uri.edu Computer Science \n", "3 michaelconti@uri.edu Computer Science \n", "4 noah_daniels@uri.edu Computer Science \n", "5 ldipippo@uri.edu Computer Science \n", "6 wolfe@cs.uri.edu Computer Science \n", "7 lutzhamel@uri.edu Computer Science \n", "8 hendawi@uri.edu Data Science | Computer Science \n", "9 jyh@cs.uri.edu Computer Science \n", "10 nkatenka@uri.edu Statistics \n", "11 soheyb@uri.edu Statistics \n", "12 eal@cs.uri.edu Computer Science \n", "13 indrani_mandal@uri.edu Computer Science \n", "14 gpuggioni@uri.edu Statistics \n", "15 krish@uri.edu Computer Science \n", "16 jing_wu@uri.edu Statistics \n", "17 yichizhang@uri.edu Statistics \n", "18 guangyuzhu@uri.edu Statistics \n", "19 buchanan@uri.edu Biostatistics \n", "20 nina@uri.edu Computer Science \n", "21 rsschwartz@uri.edu Biological Sciences \n", "22 yingzhang@uri.edu Computer Science " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "disciplines = [d.string for d in cs_people.find_all(\"p\",\n", " 'people-department')]\n", "titles = [t.string for t in cs_people.find_all(\"p\",\"people-title\")]\n", "emails = [e.string for e in cs_people.find_all(\"a\",'u-email')]\n", "pd.DataFrame({'name':names, 'title':titles,\n", " 'e-mails':emails, 'discipline':disciplines})" ] }, { "cell_type": "code", "execution_count": 27, "id": "1c86ab29", "metadata": {}, "outputs": [], "source": [ "sp22_csc_sta_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/reg_CSCSTA_courses.csv'" ] }, { "cell_type": "code", 
"execution_count": 28, "id": "9f703cfb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SubjectCat#ComponentSectionGenEdTitleMax SizeCampusAcad OrgClass StatCourse TopicInstr NameInstr Name 2Instr Name 3
0CSC101LEC1GEComputing Concepts35URICOMP_SCIENANaNFay-Wolfe,VictorNaNNaN
1CSC101LEC2GEComputing Concepts35ONLINCOMP_SCIENANaNFay-Wolfe,VictorNaNNaN
2CSC101LEC3GEComputing Concepts35ONLINCOMP_SCIENANaNStaffNaNNaN
3CSC101LECL01GEComputing Concepts35ONLINCOMP_SCIENANaNFay-Wolfe,VictorNaNNaN
4CSC104LEC1GEPuzzles+Games=Analytical Think40URICOMP_SCIENANaNMandal,IndraniNaNNaN
\n", "
" ], "text/plain": [ " Subject Cat# Component Section GenEd Title \\\n", "0 CSC 101 LEC 1 GE Computing Concepts \n", "1 CSC 101 LEC 2 GE Computing Concepts \n", "2 CSC 101 LEC 3 GE Computing Concepts \n", "3 CSC 101 LEC L01 GE Computing Concepts \n", "4 CSC 104 LEC 1 GE Puzzles+Games=Analytical Think \n", "\n", " Max Size Campus Acad Org Class Stat Course Topic Instr Name \\\n", "0 35 URI COMP_SCIEN A NaN Fay-Wolfe,Victor \n", "1 35 ONLIN COMP_SCIEN A NaN Fay-Wolfe,Victor \n", "2 35 ONLIN COMP_SCIEN A NaN Staff \n", "3 35 ONLIN COMP_SCIEN A NaN Fay-Wolfe,Victor \n", "4 40 URI COMP_SCIEN A NaN Mandal,Indrani \n", "\n", " Instr Name 2 Instr Name 3 \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "courses_df = pd.read_csv(sp22_csc_sta_url)\n", "courses_df.head()" ] }, { "cell_type": "markdown", "id": "29f21db8", "metadata": {}, "source": [ "We saw the `attrs` for a link above, where there was only one attribute on the tag, but for example, the images on each person's card have many attributes." ] }, { "cell_type": "code", "execution_count": 29, "id": "1f0f9a8f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'width': '120',\n", " 'height': '120',\n", " 'src': 'https://web.uri.edu/cs/files/marco-alvarez.png',\n", " 'class': ['u-photo', 'wp-post-image'],\n", " 'alt': '',\n", " 'loading': 'lazy'}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.find_all(\"div\",\"peopleitem\")[0].img.attrs" ] }, { "cell_type": "code", "execution_count": 30, "id": "b29b5c63", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[

Assistant Professor | Director of Graduate Studies

,\n", "

Computer Science

,\n", "

401.874.5009malvarez@uri.edu

,\n", "

Lecturer

,\n", "

Computer Science

,\n", "

sarmenti@uri.edu

,\n", "

Assistant Professor

,\n", "

Computer Science

,\n", "

brownsarahm@uri.edu

,\n", "

Lecturer

,\n", "

Computer Science

,\n", "

michaelconti@uri.edu

,\n", "

Assistant Professor

,\n", "

Computer Science

,\n", "

noah_daniels@uri.edu

,\n", "

Professor | Chair

,\n", "

Computer Science

,\n", "

ldipippo@uri.edu

,\n", "

Professor

,\n", "

Computer Science

,\n", "

wolfe@cs.uri.edu

,\n", "

Associate Professor

,\n", "

Computer Science

,\n", "

lutzhamel@uri.edu

,\n", "

Assistant Professor

,\n", "

Data Science | Computer Science

,\n", "

401.874.5738hendawi@uri.edu

,\n", "

Associate Professor

,\n", "

Computer Science

,\n", "

jyh@cs.uri.edu

,\n", "

Associate Professor | Director of Undergraduate Studies

,\n", "

Statistics

,\n", "

nkatenka@uri.edu

,\n", "

Lecturer

,\n", "

Statistics

,\n", "

401.874.2562soheyb@uri.edu

,\n", "

Professor

,\n", "

Computer Science

,\n", "

eal@cs.uri.edu

,\n", "

Lecturer

,\n", "

Computer Science

,\n", "

indrani_mandal@uri.edu

,\n", "

Associate Professor | Statistics Section Head | Director of Graduate Studies

,\n", "

Statistics

,\n", "

401.874.4388gpuggioni@uri.edu

,\n", "

Assistant Professor

,\n", "

Computer Science

,\n", "

krish@uri.edu

,\n", "

Assistant Professor

,\n", "

Statistics

,\n", "

401.874.4504jing_wu@uri.edu

,\n", "

Assistant Professor

,\n", "

Statistics

,\n", "

yichizhang@uri.edu

,\n", "

Assistant Professor

,\n", "

Statistics

,\n", "

guangyuzhu@uri.edu

,\n", "

\n", "

Adjunct Faculty and Limited Join Appointments

\n", "

,\n", "

Limited Joint Appointment

,\n", "

Biostatistics

,\n", "

401.874.4739buchanan@uri.edu

,\n", "

Adjunct Associate Professor

,\n", "

Computer Science

,\n", "

nina@uri.edu

,\n", "

Assistant Professor – Limited Joint Appointment

,\n", "

Biological Sciences

,\n", "

401.874.5404rsschwartz@uri.edu

,\n", "

Assistant Professor – Limited Joint Appointment

,\n", "

Computer Science

,\n", "

401.874.4915yingzhang@uri.edu

,\n", "

\n", "

,\n", "

Copyright © University of Rhode Island | University of Rhode Island, Kingston, RI 02881, USA | 1.401.874.1000

,\n", "

URI is an equal opportunity employer committed to the principles of affirmative action.  Work at URI

]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs_people.find_all(\"p\")" ] }, { "cell_type": "markdown", "id": "54aa12d3", "metadata": {}, "source": [ "URI websites are probably formatted consistently, so we could build information about more departments.\n", "\n", "- [csc/sta emeriti](https://web.uri.edu/cs/people/faculty-emeriti/)\n", "- [a&s dean's office](https://web.uri.edu/artsci/people/)\n", "- [math](https://www.math.uri.edu/people/)\n", "- [philosophy](https://web.uri.edu/philosophy/people/)\n", "- [business](https://web.uri.edu/business/people/faculty/)\n", "\n", "\n", "\n", "## Thinking Ahead\n", "\n", "\n", "The spreadsheet of spring classes in the department is posted:\n", "\n", "```\n", "sp22_csc_sta_url = 'https://raw.githubusercontent.com/rhodyprog4ds/rhodyds/main/data/reg_CSCSTA_courses.csv'\n", "```\n", "\n", "This is a minimal copy where I removed enrollments and locations in case those change.\n", "It is derived from the last version the Dean asked us to make corrections to, so things\n", "will definitely be different before registration opens (e.g., I'm teaching\n", "CSC392 and it's listed as \"Staff\").\n", "\n", "Sections have definitely been added/removed and teaching assignments changed, so\n", "don't use this for making plans.\n", "\n", "How could you merge this with the DataFrame we just scraped?\n", "\n", "\n", "\n", "\n", "## More Practice\n", "\n", "1. Add a phone number column.\n", "1. The page linked from each person's name lists their office number; add a column for office number.\n", "1. Make the code we wrote in class into a function so that you can pass a page and get back a DataFrame.\n", "1. Parse the [course descriptions](https://web.uri.edu/cs/academics/computer-science/course-descriptions/) page and make a DataFrame with columns for subject code, course number, course title, and course description.\n", "1. 
Merge the descriptions with the table about enrollments and instructors." ] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.10.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "source_map": [ 12, 16, 20, 71, 73, 83, 85, 90, 94, 106, 108, 111, 113, 117, 119, 123, 125, 128, 130, 138, 140, 148, 150, 152, 154, 156, 158, 163, 165, 168, 170, 173, 175, 178, 180, 183, 185, 192, 194, 197, 199, 202, 206, 209, 212, 214, 220, 223, 229, 231, 235, 244, 248, 251, 254, 259, 261 ] }, "nbformat": 4, "nbformat_minor": 5 }
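
One possible starting point for the merge question in Thinking Ahead: the scraped names look like `Victor Fay-Wolfe`, while the course sheet's `Instr Name` column uses `Fay-Wolfe,Victor`, so the two need a common key before they can be joined. The helper below is a minimal sketch of that normalization step only; the function name `to_first_last` is hypothetical (it is not from the class notes), and it assumes every entry is either `Last,First` or a placeholder such as `Staff` with no comma.

```python
def to_first_last(instr_name):
    """Convert 'Last,First' (the course sheet's format) to 'First Last'."""
    # Placeholder entries such as "Staff" contain no comma; return them unchanged.
    if "," not in instr_name:
        return instr_name.strip()
    # Split only on the first comma so anything after it stays with the first name.
    last, first = instr_name.split(",", 1)
    return f"{first.strip()} {last.strip()}"


# Values taken from the course sheet preview shown earlier in the notes.
examples = ["Fay-Wolfe,Victor", "Mandal,Indrani", "Staff"]
converted = [to_first_last(name) for name in examples]
print(converted)
```

With a column built this way (for example via `courses_df['Instr Name'].apply(to_first_last)`, after handling any missing values), `pd.merge` could join the courses to the scraped DataFrame on the name column. Note that the notes display the scraped DataFrame without assigning it to a variable, so it would need to be saved first.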