Parsing a web page using BeautifulSoup
In this section, we will use the BeautifulSoup library to parse an HTML web page and extract information from it. This is particularly useful when you wish to interact with a website that does not provide an API for accessing its data. The drawback is that an application using this method is more likely to be broken by a change in the page structure than one built on an API, since APIs change rarely and, when they do, developers are typically given advance warning.
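As a rough illustration of the idea, the sketch below parses a small, hand-written HTML snippet rather than a live page. The markup, the gallery class name, and the extract_thumbnails helper are all assumptions made up for this example; they do not reflect the structure of any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="gallery">
    <a href="/works/1"><img src="/thumbs/1.jpg" alt="First"></a>
    <a href="/works/2"><img src="/thumbs/2.jpg" alt="Second"></a>
  </div>
</body></html>
"""


def extract_thumbnails(markup):
    """Return (alt, src) pairs for each <img> inside <div class="gallery">."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(markup, "html.parser")
    gallery = soup.find("div", class_="gallery")
    if gallery is None:
        # The expected container is missing, for example after a redesign.
        return []
    # src=True keeps only <img> tags that actually carry a src attribute.
    return [(img.get("alt", ""), img["src"])
            for img in gallery.find_all("img", src=True)]


thumbnails = extract_thumbnails(SAMPLE_HTML)
```

Note how the helper returns an empty list when the expected container is absent. That is the fragility described above: if the site renames or removes the element we depend on, the scraper silently finds nothing.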
In this next example, we will write a simple script to download low-resolution previews of images from Pixiv (www.pixiv.net). This script starts in a similar way to the others we have written so far. Note that the UTF-8 character encoding is required here, as the contents of the web pages are likely to contain Japanese characters.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2
import os
import sys
from string import Template
This string template...