So far, we have implemented two simple crawlers that take advantage of the structure of our sample website to download all published countries. These techniques should be used when available, because they minimize the number of web pages to download. However, for other websites, we need to make our crawler act more like a typical user and follow links to reach the interesting content.
We could simply download the entire website by following every link. However, this would likely download many web pages we don't need. For example, to scrape user account details from an online forum, only account pages need to be downloaded and not discussion threads. The link crawler we use in this chapter will use regular expressions to determine which web pages it should download. Here is an initial version of the code:
import re


def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by link_regex
    """
    crawl_queue = [start_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)


def get_links(html):
    """ Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
To run this code, simply call the link_crawler function with the URL of the website you want to crawl and a regular expression to match links you want to follow. For the example website, we want to crawl the index with the list of countries and the countries themselves.
We know from looking at the site that the index links follow this format: the path /index/ followed by a page number, such as /index/1. The country web pages follow this format: the path /view/ followed by the country's name. So, a simple regular expression to match both types of web page is /(index|view)/.
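As a quick sanity check (the country page path below is illustrative, and /about stands in for any link we don't want to follow), re.match confirms that this pattern accepts both kinds of link while rejecting others:

>>> import re
>>> link_regex = '/(index|view)/'
>>> bool(re.match(link_regex, '/index/1'))
True
>>> bool(re.match(link_regex, '/view/Afghanistan-1'))
True
>>> bool(re.match(link_regex, '/about'))
False

What happens when the crawler is run with these inputs? You receive the following download error: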
>>> link_crawler('http://example.webscraping.com', '/(index|view)/')
Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
...
ValueError: unknown url type: /index/1
Regular expressions are great tools for extracting information from strings, and I recommend every programmer learn how to read and write a few of them. That said, they tend to be quite brittle and break easily. We'll cover more advanced ways to extract links and identify their pages as we advance through the book.
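To see the kind of brittleness this warning refers to, here is a small sketch (the HTML snippet is invented for illustration) showing that the get_links regular expression above silently skips a link whose href value is not wrapped in quotes:

>>> html = '<a href="/index/1">OK</a> <a href=/index/2>Missed</a>'
>>> get_links(html)
['/index/1']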
The problem with downloading /index/1 is that it only includes the path of the web page and leaves out the protocol and server, which is known as a relative link. Relative links work when browsing because the web browser knows which web page you are currently viewing and takes the steps necessary to resolve the link. However, urllib doesn't have this context. To help urllib locate the web page, we need to convert this link into an absolute link, which includes all the details to locate the web page. As might be expected, Python includes a module in urllib to do just this, called parse. Here is an improved version of link_crawler that uses the urljoin method to create the absolute links:
from urllib.parse import urljoin


def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by link_regex
    """
    crawl_queue = [start_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                crawl_queue.append(abs_link)
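To see what urljoin is doing for us here, a quick interactive check resolves the relative link from the earlier error against the start URL:

>>> from urllib.parse import urljoin
>>> urljoin('http://example.webscraping.com', '/index/1')
'http://example.webscraping.com/index/1'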
When this version of the crawler is run, you can see that it downloads the matching web pages; however, it keeps downloading the same locations over and over. The reason for this behavior is that these locations have links to each other. For example, Australia links to Antarctica and Antarctica links back to Australia, so the crawler will continue to queue the URLs and never reach the end of the queue. To prevent re-crawling the same links, we need to keep track of what's already been crawled. The following updated version of link_crawler stores the URLs seen before, to avoid downloading duplicates:
def link_crawler(start_url, link_regex):
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                # check if we have already seen this link
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)
When this script is run, it will crawl the locations and then stop as expected. We finally have a working link crawler!
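As a final check, the finished crawler can be called with the same arguments as before; the output below is abbreviated, and the exact sequence of Downloading: lines depends on the links found on each page:

>>> link_crawler('http://example.webscraping.com', '/(index|view)/')
Downloading: http://example.webscraping.com
...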