Mission One – upgrade Beautiful Soup
It seems like the first practical piece of software that every agent needs is Beautiful Soup. We often make extensive use of this to extract meaningful information from HTML web pages. A great deal of the world's information is published in the HTML format. Sadly, browsers must tolerate broken HTML. Even worse, website designers have no incentive to make their HTML simple. This means that HTML extraction is something every agent needs to master.
Upgrading the Beautiful Soup package is a core mission that sets us up to do more useful espionage work. First, check the PyPI description of the package. Here's the URL: https://pypi.python.org/pypi/beautifulsoup4. The language is described as Python 3, which is usually a good indication that the package will work with any release of Python 3.
To confirm the Python 3 compatibility, track down the source of this at the following URL:
http://www.crummy.com/software/BeautifulSoup/.
This page simply lists Python 3 without any specific minor version number. That's encouraging. We can even look at the following link to see more details of the development of this package:
https://groups.google.com/forum/#!forum/beautifulsoup
The installation is generally just as follows:
MacBookPro-SLott:Code slott$ sudo pip3.4 install beautifulsoup4
Windows agents can omit the sudo prefix.
This will use the pip application to download and install BeautifulSoup. The output will look as shown in the following:
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.3.2.tar.gz (143kB)
    100% |████████████████████████████████| 143kB 1.1MB/s
Installing collected packages: beautifulsoup4
  Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4-4.3.2
Note that Pip 7 on Macintosh uses the █ character instead of # to show status. The installation was reported as successful. That means we can start using the package to analyze the data.
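Before moving on, a quick sanity check doesn't hurt. This is a small sketch of such a check, not part of the mission itself; it simply imports the freshly installed package and reports its version:

import bs4
print(bs4.__version__)    # should report 4.3.2, matching the install log above

If the import fails, the package was probably installed for a different Python interpreter than the one you're running.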
We'll finish this mission by gathering and parsing a very simple page of data.
We need to help agents make the sometimes dangerous crossing of the Gulf Stream between Florida and the Bahamas. Often, Bimini is used as a stopover; however, some faster boats can go all the way from Florida to Nassau in a single day. On a slower boat, the weather can change and an accurate multi-day forecast is essential.
The Georef code for this area is GHLL140032. For more information, look at the 25°32′N 79°46′W position on a world map. This will show the particular stretch of ocean for which we need to supply forecast data.
Here's a handy URL that provides weather forecasts for agents who are trying to make the passage between Florida and the Bahamas:
http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101.
This page includes a weather synopsis for the overall South Atlantic (the amz101 zone) and a day-by-day forecast specific to the Bahamas (the amz117 zone). We want to trim this down to the relevant text.
Getting an HTML page
The first step in using BeautifulSoup is to get the HTML page from the US National Weather Service and parse it into a proper document structure. We'll use urllib to get the document and create a Soup structure from that. Here's the essential processing:
from bs4 import BeautifulSoup
import urllib.request

query = "http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101"
with urllib.request.urlopen(query) as amz117:
    document = BeautifulSoup(amz117.read())
We've opened a URL and assigned the file-like object to the amz117 variable. We've done this in a with statement. Using with will guarantee that all network resources are properly disconnected when the execution leaves the indented body of the statement.

In the with statement, we've read the entire document available at the given URL. We've provided the sequence of bytes to the BeautifulSoup parser, which creates a parsed Soup data structure that we can assign to the document variable.
The with statement makes an important guarantee: when the indented body is complete, the resource manager is closed. In this example, the indented body is a single statement that reads the data from the URL and parses it to create a BeautifulSoup object. The resource manager is the connection to the Internet based on the given URL. We want to be absolutely sure that all operating system (and Python) resources that make this open connection work are properly released; this guaranteed release is exactly what the with statement offers.
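To make the guarantee concrete, here is roughly what the with statement is doing for us behind the scenes. This is an illustrative sketch only, not a replacement for the version shown above:

import urllib.request
from bs4 import BeautifulSoup

query = "http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101"
amz117 = urllib.request.urlopen(query)
try:
    # Read and parse while the connection is open.
    document = BeautifulSoup(amz117.read())
finally:
    # Runs even if read() or the parser raises an exception.
    amz117.close()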
Navigating the HTML structure
HTML documents are a mixture of tags and text. The parsed structure is iterable, allowing us to work through text and tags using the for statement. Additionally, the parsed structure contains numerous methods to search for arbitrary features in the document.
Here's the first example of using method names to pick apart a document:
content = document.body.find('div', id='content').div
When we use a tag name, such as body, as an attribute name, this is a search request for the first occurrence of that tag in the given container. We've used document.body to find the <body> tag in the overall HTML document.
The find() method finds the first matching instance using more complex criteria than the tag's name. In this case, we've asked to find <div id="content"> in the body tag of the document. In this identified <div>, we need to find the first nested <div> tag. This division has the synopsis and forecast.
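The one-line navigation shown above can also be spelled out step by step. This sketch is equivalent to the single statement; the intermediate names body and outer are ours, not part of the original example:

body = document.find('body')              # same as document.body
outer = body.find('div', id='content')    # the <div id="content"> tag
content = outer.find('div')               # the first nested <div>, same as .div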
The content in this division consists of a mixed sequence of text and tags. A little searching shows us that the synopsis text is the fifth item. Since Python sequences are based at zero, this has an index of four in the <div>. We'll use the contents attribute of a given object to identify tags or text blocks by position in a document object.
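How do we know the synopsis sits at index four? One way is to enumerate the contents and look. A short exploratory sketch like the following (our own addition, not part of the mission code) shows the position and type of each child:

for position, item in enumerate(content.contents):
    # Show the index, the kind of node, and the first few characters of its text.
    print(position, type(item).__name__, repr(str(item)[:40]))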
The following is how we can get the synopsis and forecast. Once we have the forecast, we'll need to create an iterator for each day in the forecast:
synopsis = content.contents[4]
forecast = content.contents[5]
strong_list = list(forecast.findAll('strong'))
timestamp_tag, *forecast_list = strong_list
We've extracted the synopsis as a block of text. The page has a quirky feature: the forecast is wrapped inside an <hr> tag. This is, in principle, invalid HTML. Even though it seems invalid, browsers tolerate it. It has the data that we want, so we're forced to work with it as we find it.
In the forecast <hr> tag, we've used the findAll() method to create a list of the sequence of <strong> tags. These tags are interleaved between blocks of text. Generally, the text in the tag tells us the day, and the text between the <strong> tags is the forecast for that day. We emphasize generally as there's a tiny, but important, special case.
Due to the special case, we've split the strong_list sequence into a head and a tail. The first item in the list is assigned to the timestamp_tag variable. All the remaining items are assigned to the forecast_list variable. We can use the value of timestamp_tag.string to recover the string value in the tag, which will be the timestamp for the forecast.
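The head-and-tail split relies on Python 3's extended unpacking. A tiny standalone example shows the idiom with made-up values:

first, *rest = ['timestamp', 'tonight', 'wed', 'wed night']
# first == 'timestamp'
# rest == ['tonight', 'wed', 'wed night']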
Your extension to this mission is to parse this timestamp with datetime.datetime.strptime(). Replacing strings with proper datetime objects will improve the overall utility of the data.
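As a starting point for that extension, here is one possible sketch. The format string assumes a timestamp such as "1130 AM EDT Tue Jun 30 2015"; the real page may differ, so inspect the actual tag text first. Because strptime's %Z handling of zone names such as EDT is unreliable, this sketch simply drops the timezone token:

import datetime

raw = timestamp_tag.string.strip()            # for example, "1130 AM EDT Tue Jun 30 2015"
parts = raw.split()
cleaned = " ".join(parts[:2] + parts[3:])     # drop the timezone token: "1130 AM Tue Jun 30 2015"
issued = datetime.datetime.strptime(cleaned, "%I%M %p %a %b %d %Y")
print(issued)                                 # 2015-06-30 11:30:00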
The value of the forecast_list variable is an alternating sequence of <strong> tags and forecast text. Here's how we can extract these pairs from the overall document:
for strong in forecast_list:
    desc = strong.string.strip()
    print(desc, strong.nextSibling.string.strip())
We've written a loop to step through the rest of the <strong> tags in the forecast_list object. Each item is a highlighted label for a given day. The value of strong.nextSibling will be the document object after the <strong> tag. We can use strong.nextSibling.string to extract the string from this block of text; this will be the details of the forecast.
We've used the strip() method of the string to remove extraneous whitespace around the forecast elements. This makes the resulting text block more compact.
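One possible formatting pass is sketched below. It adds a separator line under each day's label; attaching the calendar dates shown in the sample output would also require the strptime extension described earlier:

for strong in forecast_list:
    desc = strong.string.strip()
    details = strong.nextSibling.string.strip()
    print(desc)
    print('-' * 20)
    print(details)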
With a little more cleanup, we can have a tidy forecast that looks similar to the following:
TONIGHT 2015-06-30
--------------------
E TO SE WINDS 10 TO 15 KT...INCREASING TO 15 TO 20 KT LATE. SEAS 3 TO 5 FT ATLC EXPOSURES...AND 2 FT OR LESS ELSEWHERE.
WED 2015-07-01
--------------------
E TO SE WINDS 15 TO 20 KT...DIMINISHING TO 10 TO 15 KT LATE. SEAS 4 TO 6 FT ATLC EXPOSURES...AND 2 FT OR LESS ELSEWHERE.
We've stripped away a great deal of HTML overhead. We've reduced the forecast to the barest facts. With a little more fiddling, we can get it down to a pretty tiny block of text. We might want to represent this in JavaScript Object Notation (JSON). We can then encrypt the JSON string before transmission. Then, we could use steganography to embed the encrypted text in another kind of document to transmit to a friendly ship captain who works the route between Key Biscayne and Bimini. It may look as if we're just sending each other pictures of rainbow butterfly unicorn kittens.
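A sketch of the JSON step might look like the following. The forecast_pairs name and the field names are our own choices, not part of the original example; the encryption and steganography steps are left to the field office:

import json

forecast_pairs = []
for strong in forecast_list:
    forecast_pairs.append({
        "day": strong.string.strip(),
        "forecast": strong.nextSibling.string.strip(),
    })

message = json.dumps(forecast_pairs, indent=2)
print(message)    # compact, structured text, ready to be encrypted and embedded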
This mission demonstrates that we can use Python 3, urllib, and BeautifulSoup. Now, we've got a working environment.