Crawling your first website
To scrape a website, we first need to download the web pages that contain the data of interest, a process known as crawling. There are a number of approaches that can be used to crawl a website, and the appropriate choice depends on the structure of the target website. This chapter will explore how to download web pages safely, and then introduce the following three common approaches to crawling a website:
- Crawling a sitemap
- Iterating the database IDs of each web page
- Following web page links
Downloading a web page
To crawl web pages, we first need to download them. Here is a simple Python script that uses Python's urllib2 module to download a URL:
import urllib2

def download(url):
    return urllib2.urlopen(url).read()
When a URL is passed, this function will download the web page and return the HTML. The problem with this snippet is that, when downloading the web page, we might encounter errors that are beyond our control; for example, the requested page may no longer exist. In these cases, urllib2 will raise an exception and exit the script. To be safer, here is a more robust version to catch these exceptions:
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
Now, when a download error is encountered, the exception is caught and the function returns None.
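For example, calling the function on a page that no longer exists (the URL below is purely illustrative) should print the error and return None instead of stopping the script:

# hypothetical missing page, used only for illustration
html = download('http://example.webscraping.com/no-such-page')
if html is None:
    print 'Failed to download the page'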
Retrying downloads
Often, the errors encountered when downloading are temporary; for example, the web server is overloaded and returns a 503 Service Unavailable error. For these errors, we can retry the download as the server problem may now be resolved. However, we do not want to retry downloading for all errors. If the server returns 404 Not Found, then the web page does not currently exist and the same request is unlikely to produce a different result.
The full list of possible HTTP errors is defined by the Internet Engineering Task Force, and is available for viewing at https://tools.ietf.org/html/rfc7231#section-6. In this document, we can see that the 4xx errors occur when there is something wrong with our request and the 5xx errors occur when there is something wrong with the server. So, we will ensure our download function only retries the 5xx errors. Here is the updated version to support this:
def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
Now, when a download error is encountered with a 5xx code, the download is retried by recursively calling itself. The function now also takes an additional argument for the number of times the download can be retried, which is set to two by default. We limit the number of times we attempt to download a web page because the server error may not be resolvable. To test this functionality, we can try downloading http://httpstat.us/500, which returns the 500 error code:
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
As expected, the download function now tries downloading the web page, and then on receiving the 500 error, it retries the download twice before giving up.
Setting a user agent
By default, urllib2 will download content with the Python-urllib/2.7 user agent, where 2.7 is the version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they experienced a poorly made Python web crawler overloading their server. For example, at the time of writing, http://www.meetup.com/ blocks requests made with Python's default user agent.
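To check which user agent is being sent, one option (a quick sketch, not part of the original example) is to request a service such as http://httpbin.org/user-agent, which simply echoes back the User-agent header it received:

import urllib2

# httpbin.org/user-agent returns the User-agent header it received,
# so this should print something like {"user-agent": "Python-urllib/2.7"}
print urllib2.urlopen('http://httpbin.org/user-agent').read()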
So, to download reliably, we will need to have control over setting the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):
def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                return download(url, user_agent, num_retries-1)
    return html
Now we have a flexible download function that can be reused in later examples to catch errors, retry the download when possible, and set the user agent.
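For instance, a hypothetical call overriding the default user agent would look like this:

# 'MyTestCrawler' is an arbitrary, illustrative user agent string
html = download('http://example.webscraping.com', user_agent='MyTestCrawler')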
Sitemap crawler
For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt to download all the web pages. To parse the sitemap, we will use a simple regular expression to extract URLs within the <loc> tags. Note that a more robust parsing approach called CSS selectors will be introduced in the next chapter. Here is our first example crawler:
import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
Now, we can run the sitemap crawler to download all countries from the example website:
>>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...
This works as expected, but as discussed earlier, Sitemap files often cannot be relied on to provide links to every web page. In the next section, another simple crawler will be introduced that does not depend on the Sitemap file.
ID iteration crawler
In this section, we will take advantage of a weakness in the website structure to easily access all the content. Here are the URLs of some sample countries:
- http://example.webscraping.com/view/Afghanistan-1
- http://example.webscraping.com/view/Aland-Islands-2
- http://example.webscraping.com/view/Albania-3
We can see that the URLs only differ at the end, with the country name (known as a slug) and ID. It is a common practice to include a slug in the URL to help with search engine optimization. Quite often, the web server will ignore the slug and only use the ID to match with relevant records in the database. Let us check whether this works with our example website by removing the slug and loading http://example.webscraping.com/view/1.
The web page still loads! This is useful to know because now we can ignore the slug and simply iterate database IDs to download all the countries. Here is an example code snippet that takes advantage of this trick:
import itertools

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass
Here, we iterate the ID until we encounter a download error, which we assume means that the last country has been reached. A weakness in this implementation is that some records may have been deleted, leaving gaps in the database IDs. Then, when one of these gaps is reached, the crawler will immediately exit. Here is an improved version of the code that allows a number of consecutive download errors before exiting:
import itertools

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of
            # consecutive errors so exit
            break
    else:
        # success - can scrape the result
        # ...
        num_errors = 0
The crawler in the preceding code now needs to encounter five consecutive download errors to stop iterating, which decreases the risk of stopping the iteration prematurely when some records have been deleted.
Iterating the IDs is a convenient approach to crawl a website, but, like the sitemap approach, it will not always be available. For example, some websites check whether the slug is as expected and, if not, return a 404 Not Found error. Also, other websites use large nonsequential or nonnumeric IDs, so iterating is not practical. For example, Amazon uses ISBNs as the IDs for its books, which have at least ten digits. Using ID iteration with Amazon would require testing billions of IDs, which is certainly not the most efficient approach to scraping their content.
Link crawler
So far, we have implemented two simple crawlers that take advantage of the structure of our sample website to download all the countries. These techniques should be used when available, because they minimize the number of web pages that need to be downloaded. However, for other websites, we need to make our crawler act more like a typical user and follow links to reach the content of interest.
We could simply download the entire website by following all links. However, this would download a lot of web pages that we do not need. For example, to scrape user account details from an online forum, only account pages need to be downloaded and not discussion threads. The link crawler developed here will use a regular expression to decide which web pages to download. Here is an initial version of the code:
import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
To run this code, simply call the link_crawler function with the URL of the website you want to crawl and a regular expression of the links that you need to follow. For the example website, we want to crawl the index with the list of countries and the countries themselves. The index links follow this format:
- http://example.webscraping.com/index/1
- http://example.webscraping.com/index/2
The country web pages follow this format:
- http://example.webscraping.com/view/Afghanistan-1
- http://example.webscraping.com/view/Aland-Islands-2
So a simple regular expression to match both types of web pages is /(index|view)/. What happens when the crawler is run with these inputs? You would find that we get the following download error:
>>> link_crawler('http://example.webscraping.com', '/(index|view)')
Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
    ...
ValueError: unknown url type: /index/1
The problem with downloading /index/1 is that it only includes the path of the web page and leaves out the protocol and server, which is known as a relative link. Relative links work when browsing because the web browser knows which web page you are currently viewing. However, urllib2 is not aware of this context. To help urllib2 locate the web page, we need to convert this link into an absolute link, which includes all the details to locate the web page. As might be expected, Python includes a module to do just this, called urlparse.
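As a quick illustration (an interactive sketch, not part of the original crawler code), urlparse.urljoin combines a base URL with a relative link to produce an absolute one:

>>> import urlparse
>>> urlparse.urljoin('http://example.webscraping.com', '/index/1')
'http://example.webscraping.com/index/1'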
Here is an improved version of link_crawler that uses the urlparse module to create the absolute links:
import urlparse

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)
When this example is run, you will find that it downloads the web pages without errors; however, it keeps downloading the same locations over and over. The reason for this is that these locations have links to each other. For example, Australia links to Antarctica and Antarctica links right back, and the crawler will cycle between these forever. To prevent re-crawling the same links, we need to keep track of what has already been crawled. Here is the updated version of link_crawler that stores the URLs seen before, to avoid redownloading duplicates:
def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
When this script is run, it will crawl the locations and then stop as expected. We finally have a working crawler!
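For example, the deduplicating crawler can be run over the example website with the same regular expression as before (the per-page output is omitted here):

# follow only index and country (view) pages, skipping links already seen
link_crawler('http://example.webscraping.com', '/(index|view)')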
Advanced features
Now, let's add some features to make our link crawler more useful for crawling other websites.
Parsing robots.txt
Firstly, we need to interpret robots.txt to avoid downloading blocked URLs. Python comes with the robotparser module, which makes this straightforward, as follows:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
The robotparser module loads a robots.txt file and then provides a can_fetch() function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page cannot be fetched, as was defined in the robots.txt of the example website.
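For reference, a robots.txt along the following lines (a sketch of the relevant rule, not a verbatim copy of the example website's file) would produce this behaviour by disallowing the BadCrawler user agent from the entire site:

User-agent: BadCrawler
Disallow: /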
To integrate this into the crawler, we add this check in the crawl loop:
...
while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url
Supporting proxies
Sometimes it is necessary to access a website through a proxy. For example, Netflix is blocked in most countries outside the United States. Supporting proxies with urllib2 is not as easy as it could be (for a more user-friendly Python HTTP module, try requests, documented at http://docs.python-requests.org/). Here is how to support a proxy with urllib2:
proxy = ...
opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)
Here is an updated version of the download function to integrate this:
def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download(url, user_agent, proxy, num_retries-1)
    return html
Throttling downloads
If we crawl a website too fast, we risk being blocked or overloading the server. To minimize these risks, we can throttle our crawl by waiting for a delay between downloads. Here is a class to implement this:
import datetime
import time
import urlparse

class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()
This Throttle class keeps track of when each domain was last accessed and will sleep if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling throttle before every download:
throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, proxy=proxy, num_retries=num_retries)
Avoiding spider traps
Currently, our crawler will follow any link that it has not seen before. However, some websites dynamically generate their content and can have an infinite number of web pages. For example, if the website has an online calendar with links provided for the next month and year, then the next month will also have links to the next month, and so on for eternity. This situation is known as a spider trap.
A simple way to avoid getting stuck in a spider trap is to track how many links have been followed to reach the current web page, which we will refer to as depth. Then, when a maximum depth is reached, the crawler does not add links from this web page to the queue. To implement this, we will change the seen variable, which currently tracks the visited web pages, into a dictionary that also records the depth at which they were found:
def link_crawler(..., max_depth=2):
    ...
    # record the depth at which each URL was found, starting from the seed
    seen = {seed_url: 0}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
Now, with this feature, we can be confident that the crawl will always complete eventually. To disable this feature, max_depth can be set to a negative number so that the current depth is never equal to it.
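For example (a sketch using the parameters introduced so far), the depth limit can be switched off like this:

# a negative max_depth means the depth check never triggers,
# so the crawl is limited only by the set of seen URLs
link_crawler('http://example.webscraping.com', '/(index|view)', max_depth=-1)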
Final version
The full source code for this advanced link crawler can be downloaded at https://bitbucket.org/wswp/code/src/tip/chapter01/link_crawler3.py. To test this, let us try setting the user agent to BadCrawler, which we saw earlier in this chapter was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:
>>> seed_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(seed_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/
Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:
>>> link_crawler(seed_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1
As expected, the crawl stopped after downloading the first page of countries.