[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box]
In this article, we will explore the primary challenges of web scraping and how to overcome them easily.
Developing a reliable scraper is never easy; there are so many "what ifs" that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover every "what if", we will discuss some common traps, challenges, and workarounds.
Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands:
docker pull mheydt/pywebscrapecookbook
docker run -p 5001:5001 pywebscrapecookbook
Failed page requests are easily handled by Scrapy using its retry middleware. When enabled, Scrapy will retry a request when it receives the following HTTP error codes:
[500, 502, 503, 504, 408]
The process can be further configured using the RETRY_ENABLED, RETRY_TIMES, and RETRY_HTTP_CODES settings.
The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy:
process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500
    },
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3
})
process.crawl(Spider)
process.start()
Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up.
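The list of retryable status codes can also be overridden. As a hedged aside (this setting is not shown in the book's script), Scrapy's RETRY_HTTP_CODES setting could be added to the same configuration, for example to also retry on HTTP 429:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500
    },
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3,
    # retry on the default codes plus 429 (too many requests)
    'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429]
})
# process.crawl(Spider) and process.start() follow as before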
Supporting page redirects

Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the REDIRECT_ENABLED and REDIRECT_MAX_TIMES settings.
The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. It configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content, which contains a large number of redirects, many of them from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few representative lines:
Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/>
['http://www.nasa.gov/content/earth-expeditions-above',
'https://www.nasa.gov/content/earth-expeditions-above']
This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see in the output where redirection exceeded the specified limit (2). The following is one example:
2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG:
Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>:
max redirections reached
The spider is defined as follows:
class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print("Parsing: ", response)
        print(response.request.meta.get('redirect_urls'))
This is identical to our previous NASA sitemap-based crawler, with the addition of one line that prints the redirect_urls. In any call to parse, this metadata will contain all of the redirects that occurred to reach this page. The crawling process is configured with the following code:
process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500
    },
    'REDIRECT_ENABLED': True,
    'REDIRECT_MAX_TIMES': 2
})
Redirects are enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20.
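Redirect behavior can also be controlled for individual requests. As a hedged aside (this is not part of the book's script), Scrapy's redirect middleware honors the dont_redirect key in a request's meta dictionary, which lets you keep redirects enabled globally while opting specific URLs out. A minimal sketch:

import scrapy

class NoRedirectSpider(scrapy.Spider):
    # hypothetical spider used only for this illustration
    name = 'no_redirect_example'
    start_urls = ['http://www.nasa.gov/content/earth-expeditions-above']

    def start_requests(self):
        for url in self.start_urls:
            # dont_redirect tells the redirect middleware to leave this response alone;
            # handle_httpstatus_list lets the raw 301/302 reach our callback
            yield scrapy.Request(
                url,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
                callback=self.parse,
            )

    def parse(self, response):
        print(response.status, response.url)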
A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there may still be content we need to access later, because there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all of the data has been loaded asynchronously after the initial page load.
Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button. When the button is pressed, we are shown a progress bar for five seconds, and when that completes, we are presented with Hello World!
Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and the wait has completed. How do we do this?
We can do this using Selenium. We will use two features of Selenium: the ability to click on page elements, and the ability to wait until an element with a specific ID is available on the page. Examining the page, the relevant HTML is the following:
<div id='start'>
    <button>Start</button>
</div>
<div id='finish'>
    <h4>Hello World!</h4>
</div>
You can try this by running 06/03_press_and_wait.py. Its output will be the following:
clicked
Hello World!
Now let's break down how it works:
from selenium import webdriver
from selenium.webdriver.support import ui

driver = webdriver.PhantomJS()
driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

# find and click the Start button
button = driver.find_element_by_xpath("//*/div[@id='start']/button")
button.click()
print("clicked")

# wait up to 10 seconds for the 'finish' div to appear in the DOM
wait = ui.WebDriverWait(driver, 10)
wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']"))

# read the text that was loaded asynchronously
finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4")
print(finish_element.text)
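As an aside (this is an assumption on my part, not code from the book), more recent Selenium code usually expresses the same wait with the expected_conditions helpers instead of a raw lambda. A minimal sketch of the equivalent logic:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")
driver.find_element_by_xpath("//*/div[@id='start']/button").click()

# block for up to 10 seconds until the 'finish' div is present in the DOM
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "finish")))
print(driver.find_element_by_xpath("//*/div[@id='finish']/h4").text)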
We can inform Scrapy to limit the crawl to pages within a specified set of domains. This is an important task, as links can point anywhere on the web, and we often want to control where a crawl ends up going. Scrapy makes this very easy to do: all that needs to be done is to set the allowed_domains field of your scraper class.
The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.
The code is the same as previous NASA site crawlers except that we include
allowed_domains=['nasa.gov']:
class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']
    allowed_domains = ['nasa.gov']

    def parse(self, response):
        print("Parsing: ", response)
The NASA site is fairly consistent about staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This setting prevents the crawl from moving to those external sites.
Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web pages' Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example.
Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page.
Once the page is open, go into your developer tools and select the network panel, then scroll to the bottom of the page. You will see new requests appear in the network panel. When we click on one of these requests, we can see the following JSON:
{
    "has_next": true,
    "page": 2,
    "quotes": [{
        "author": {
            "goodreads_link": "/author/show/82952.Marilyn_Monroe",
            "name": "Marilyn Monroe",
            "slug": "Marilyn-Monroe"
        },
        "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"],
        "text": "\u201cThis life is what you make it...."
    }, {
        "author": {
            "goodreads_link": "/author/show/1077326.J_K_Rowling",
            "name": "J.K. Rowling",
            "slug": "J-K-Rowling"
        },
        "tags": ["courage", "friends"],
        "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d"
    },
This is great because all we need to do is continually generate requests to /api/quotes?page=x, increasing x for as long as the has_next tag is set in the reply document. When there are no more pages, this tag will not be in the document.
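Before writing the spider, it is worth verifying this behavior by calling the API directly. The following is a small sketch (it is not part of the book's scripts) that pages through the API with the requests library until has_next is no longer reported:

import requests

page = 1
while True:
    resp = requests.get('http://spidyquotes.herokuapp.com/api/quotes',
                        params={'page': page})
    data = resp.json()
    print('page', page, 'returned', len(data['quotes']), 'quotes')
    # stop when the API no longer reports a next page
    if not data.get('has_next'):
        break
    page += 1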
The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages. Run it with your Python interpreter and you will see output similar to the following (the following is multiple excerpts from the output):
<200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200
http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“This life is what you make it. No matter what, you're going to
mess up sometimes, it's a universal truth. But the good part is you get to
decide how you're going to mess it up. Girls will be your friends - they'll
act like it anyway. But just remember, some come, some go. The ones that
stay with you through everything - they're your true best friends. Don't
let go of them. Also remember, sisters make the best friends in the world.
As for lovers, well, they'll come and go too. And baby, I hate to say it,
most of them - actually pretty much all of them are going to break your
heart, but you can't give up because if you give up, you'll never find your
soulmate. You'll never find that half who makes you whole and that goes for
everything. Just because you fail once, doesn't mean you're gonna fail at
everything. Keep trying, hold on, and always, always, always believe in
yourself, because if you don't, then who will, sweetie? So keep your head
high, keep your chin up, and most importantly, keep smiling, because life's
a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn
Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love',
'Sisters']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200
http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but
just as much to stand up to our friends.”', 'author': 'J.K. Rowling',
'tags': ['courage', 'friends']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200
http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“If you can't explain it to a six year old, you don't understand
it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity',
'Understand']}
When this gets to page 10, it will stop, as it will see that the next page flag is not set in the content.
Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL:
class Spider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes'
    start_urls = [quotes_base_url]
    download_delay = 1.5
The parse method then prints the response and also parses the JSON into the data variable:
def parse(self, response):
    print(response)
    data = json.loads(response.body)
Then it loops through all of the items in the quotes element of the JSON object. For each item, it yields a new Scrapy item back to the Scrapy engine:
for item in data.get('quotes', []):
    yield {
        'text': item.get('text'),
        'author': item.get('author', {}).get('name'),
        'tags': item.get('tags'),
    }
It then checks whether the has_next flag is set in the JSON data, and if so it computes the next page number and yields a new request back to Scrapy to parse that page:
if data['has_next']:
    next_page = data['page'] + 1
    yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page)
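The script then runs this spider with a CrawlerProcess, just like the earlier recipes. The exact settings used in 06/05_scrapy_continuous.py may differ; the following is a minimal sketch of how it could be wired up:

from scrapy.crawler import CrawlerProcess

# Spider here is the class defined above
process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG'
})
process.crawl(Spider)
process.start()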
It is also possible to process infinitely scrolling pages using Selenium. The following code is in 06/06_scrape_continuous_twitter.py:
from selenium import webdriver
import time

driver = webdriver.PhantomJS()
print("Starting")
driver.get("https://twitter.com")
scroll_pause_time = 1.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    print(last_height)
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(scroll_pause_time)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    print(new_height, last_height)
    if new_height == last_height:
        break
    last_height = new_height
The output would be similar to the following:
Starting
4882
8139 4882
8139
11630 8139
11630
15055 11630
15055
15055 15055
Process finished with exit code 0
This code starts by loading the page from Twitter. The call to .get() returns when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits a moment for the new content to load. The scrollHeight of the browser is retrieved again; if it differs from last_height, the loop continues processing. If it is the same as last_height, no new content has loaded, and you can then continue on and retrieve the HTML for the completed page.
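The script stops at that point; as a hedged illustration (this is not code from the book's script), retrieving the fully loaded HTML and handing it to a parser could look like the following, assuming BeautifulSoup is available:

from bs4 import BeautifulSoup

# driver is the PhantomJS instance from the scrolling loop above
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
print(len(html), "characters of HTML retrieved")
driver.quit()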
We have discussed the common challenges faced when performing web scraping with Python, and their workarounds.
If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.