By the end of this chapter, you should be able to:

- Explain why robots.txt is important when scraping
- Use BeautifulSoup to scrape web sites

Web scraping is the process of downloading and extracting data from a website. There are 3 main steps in scraping:

1. Downloading the HTML from the site
2. Extracting the data you want from the downloaded HTML
3. Saving the extracted data
Typically, you would want to access the data using a website's API, but many websites don't provide that kind of programmatic access. When a site doesn't offer a programmatic way to download its data, web scraping is a great way to solve the problem!
Before you begin web scraping, it is a best practice to understand and honor the robots.txt file. This file may exist on any website you visit, and its role is to tell programs (like our web scraper) which parts of the site they may and may not download. Here is Inf-Paces School's robots.txt file. As you can see, it doesn't impose any restrictions. Compare that file to Craigslist's robots.txt file, which is much more restrictive about what a program may download.

You can find more information about the robots.txt file at http://www.robotstxt.org/robotstxt.html.
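If you'd like to check a robots.txt file from code before scraping, Python's standard library ships with urllib.robotparser. Here is a minimal sketch; the Craigslist URL is just an example, and the answer it prints depends on the site's live robots.txt rules:

```python
import urllib.robotparser

# Point the parser at a site's robots.txt file and download it
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.craigslist.org/robots.txt")
rp.read()

# can_fetch() reports whether a given user agent may download a given URL
print(rp.can_fetch("*", "https://www.craigslist.org/about/"))
```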
The library we will be using for web scraping is called BeautifulSoup, which you can install with pip3 install beautifulsoup4. You can read more about it in the BeautifulSoup documentation.
Keeping in mind that there are 3 steps to web scraping, BeautifulSoup is a tool that helps with the second step: extracting data from the downloaded HTML. Without a tool like BeautifulSoup, extracting data from HTML is a very hard problem.
Here is an example of BeautifulSoup in action. The code uses find_all and text to access the names in the li elements:
import bs4 html = """ <html> <body> <h3>Names</h3> <ul> <li>Tim</li> <li>Matt</li> <li>Elie</li> <li>Janey</li> </ul> </body> </html> """ soup = bs4.BeautifulSoup(html, "html.parser") for li in soup.find_all('li'): print(li.text)
Notice that we're creating a BeautifulSoup object and passing it two parameters: the HTML is the first and the type of parser is the second. For our purposes, html.parser will work well.
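To make the two parameters concrete, here is a tiny sketch. The parser name is just a string; "html.parser" ships with Python, and (as an assumption about your environment) "lxml" would also work if you have installed the lxml package:

```python
import bs4

# First argument: the HTML to parse. Second argument: the parser to use.
soup = bs4.BeautifulSoup("<p>Hello, soup!</p>", "html.parser")

# Tag names are available as attributes on the soup object
print(soup.p.text)  # prints "Hello, soup!"
```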
Next, let's use BeautifulSoup to find an element by id:
import bs4 html = """ <html> <body> <div id="interesting-data">Moxie is my cat</div> </body> </html> """ soup = bs4.BeautifulSoup(html, "html.parser") div = soup.find(id="interesting-data") print(div.text)
Here are some other useful methods and attributes that you can take advantage of in BeautifulSoup (shown together in the sketch below):

- .select() - find an element or elements using CSS selectors
- .children - find all children of an element
- .parent - find the parent element of a specific element
Now that we're familiar with BeautifulSoup, we are going to download real HTML from a site and scrape some data!
Create a file called first_scraping.py and add the following (make sure you pip3 install beautifulsoup4 first):
```python
import urllib.request
import bs4

url = 'https://news.ycombinator.com/'
data = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(data, "html.parser")

links = soup.select("a.storylink")
for link in links:
    print(f"{link['href']} {link.text}")
```
Now run python3 first_scraping.py and examine the data you scraped!
Let's modify the previous example and save our results to a TSV file (we chose TSV, tab-separated values, because article titles often have commas in them):
```python
import urllib.request
import bs4
import csv

url = 'https://news.ycombinator.com/'
data = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(data, "html.parser")

links = soup.select("a.storylink")

with open('articles.tsv', 'w') as tsvfile:
    writer = csv.writer(tsvfile, delimiter="\t")
    writer.writerow(('Link', 'Title'))
    for link in links:
        writer.writerow((link['href'], link.text))
```
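If you want to confirm what was saved, a short sketch like the following reads the file back with the same csv module; it assumes articles.tsv was just created by the script above:

```python
import csv

# Read the TSV back and print each row (the first row is the header)
with open('articles.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter="\t")
    for link, title in reader:
        print(f"{title} -> {link}")
```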
Great! Now you've done all 3 steps of scraping: downloading, extracting, and saving!
When you're ready, move on to Server Side HTTP Requests