Web Scraping

Objectives

By the end of this chapter, you should be able to:

  • Understand what web scraping is useful for
  • Describe why robots.txt is important when scraping
  • Use BeautifulSoup to scrape websites

Web Scraping

Web scraping is the process of downloading and extracting data from a website. There are 3 main steps in scraping:

  1. Downloading the HTML document from a website
  2. Extracting data from the downloaded HTML
  3. Doing something with the data (usually saving it somehow)

Ideally, you would access a site's data through its API, but many websites don't offer one. When a website doesn't provide a programmatic way to download its data, web scraping is a great way to solve the problem!
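As a preview, here is what those three steps look like as a skeleton in Python, using urllib from the standard library (the URL here is just a placeholder):

import urllib.request

# Step 1: download the HTML document
html = urllib.request.urlopen('https://example.com/').read()

# Step 2: extract data from the HTML (we'll use BeautifulSoup for this below)
# Step 3: do something with the data (e.g. save it to a file)
print(html[:200])  # peek at the first 200 bytes we downloaded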

Robots.txt

Before you begin web scraping, it is a best practice to understand and honor the robots.txt file. A website may serve this file to tell programs (like our web scraper) which parts of the site they should and should not download. Here is Inf-Paces School's robots.txt file. As you can see, it doesn't impose any restrictions. Compare that to Craigslist's robots.txt file, which is much more restrictive about what a program may download.

You can find out more information about the robots.txt file at http://www.robotstxt.org/robotstxt.html
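If you want to check a robots.txt file from code, Python's standard library includes urllib.robotparser for exactly this purpose. Here is a minimal sketch (the user agent '*' means "any program"; the URLs are just examples):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://news.ycombinator.com/robots.txt')
rp.read()  # download and parse the robots.txt file

# can_fetch reports whether a given user agent may download a given URL
print(rp.can_fetch('*', 'https://news.ycombinator.com/'))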

Beautiful Soup

The library we will be using for web scraping is called BeautifulSoup, which you can install with pip3 install beautifulsoup4. You can read more about it in the official documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Keeping in mind that there are 3 steps to web scraping, BeautifulSoup is a tool that helps with the second step: extracting data from the downloaded HTML. Without a tool like BeautifulSoup, extracting data from raw HTML is a very hard problem.

Here is an example of BeautifulSoup in action. The code uses find_all and text to access the names in the li elements:

import bs4
html = """
<html>
<body>
  <h3>Names</h3>
  <ul>
    <li>Tim</li>
    <li>Matt</li>
    <li>Elie</li>
    <li>Janey</li>
  </ul>
</body>
</html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns every matching element in the document
for li in soup.find_all('li'):
    print(li.text)  # .text gives the element's text content

Notice that we're creating a BeautifulSoup object with two arguments: the HTML string first and the parser type second. For our purposes, the built-in html.parser will work well. (If they're installed, BeautifulSoup can also use third-party parsers like lxml or html5lib.)

Next, let's use BeautifulSoup to find an element by id:

import bs4
html = """
<html>
<body>
  <div id="interesting-data">Moxie is my cat</div>
</body>
</html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

div = soup.find(id="interesting-data")  # find returns the first match (or None)
print(div.text)

Here are some other useful methods and attributes that BeautifulSoup provides:

.select() - find elements using CSS selectors

.children - iterate over the direct children of an element

.parent - access the parent of an element
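To see these in action, here is a small sketch using inline HTML like the examples above (the markup is just invented for illustration):

import bs4
html = """
<html>
<body>
  <ul id="names">
    <li>Tim</li>
    <li>Matt</li>
  </ul>
</body>
</html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# .select() takes a CSS selector and returns a list of matching elements
for li in soup.select("#names li"):
    print(li.text)

# .children iterates over an element's direct children
ul = soup.find(id="names")
for child in ul.children:
    print(child)

# .parent walks back up to the enclosing element
first_li = soup.find('li')
print(first_li.parent)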

Downloading And Scraping a Page

Now that we're familiar with BeautifulSoup, we are going to download real html from a site and scrape some data!

Create a file called first_scraping.py and add the following (make sure you pip3 install beautifulsoup4 first).

import urllib.request
import bs4

url = 'https://news.ycombinator.com/'
data = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(data, "html.parser")

# each story title is a link with the class "storylink"; if Hacker News
# changes its markup, this selector will need updating
links = soup.select("a.storylink")

for link in links:
    print(f"{link['href']} {link.text}")

Now run python3 first_scraping.py and examine the data you scraped!

Saving Scraped Data To CSV

Let's modify the previous example and save our results to a TSV file (we chose TSV, tab-separated values, because article titles often contain commas):

import urllib.request
import bs4
import csv

url = 'https://news.ycombinator.com/'
data = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(data, "html.parser")

links = soup.select("a.storylink")

with open('articles.tsv', 'w', newline='') as tsvfile:  # newline='' is recommended for the csv module
    writer = csv.writer(tsvfile, delimiter="\t")
    writer.writerow(('Link', 'Title'))  # header row
    for link in links:
        writer.writerow((link['href'], link.text))

Great! Now you've done all 3 steps of scraping: downloading, extracting, and saving!
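If you'd like to verify what was written, here is a quick sketch that reads articles.tsv back using the same csv module (assuming the script above has already been run):

import csv

with open('articles.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter="\t")
    for row in reader:
        print(row)  # each row is a list of strings, starting with the header ['Link', 'Title']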

When you're ready, move on to Server Side HTTP Requests.

