Python for Secret Agents - Volume II
Steven F. Lott

Mission One – upgrade Beautiful Soup

The first practical piece of software that every agent needs seems to be Beautiful Soup. We make extensive use of it to extract meaningful information from HTML web pages. A great deal of the world's information is published in HTML. Sadly, browsers must tolerate broken HTML. Even worse, website designers have no incentive to keep their HTML simple. This means that HTML extraction is something every agent needs to master.

Upgrading the Beautiful Soup package is a core mission that sets us up to do more useful espionage work. First, check the PyPI description of the package. Here's the URL: https://pypi.python.org/pypi/beautifulsoup4. The language is described as Python 3, which is usually a good indication that the package will work with any release of Python 3.

To confirm the Python 3 compatibility, track down the source of this at the following URL:

http://www.crummy.com/software/BeautifulSoup/.

This page simply lists Python 3 without any specific minor version number. That's encouraging. We can even look at the following link to see more details of the development of this package:

https://groups.google.com/forum/#!forum/beautifulsoup

The installation is generally a single command:

MacBookPro-SLott:Code slott$ sudo pip3.4 install beautifulsoup4

Windows agents can omit the sudo prefix.
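For example, a Windows agent might type something like the following (the prompt shown here is hypothetical, and the exact launcher details vary by setup):

C:\Code> python -m pip install beautifulsoup4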

This will use the pip application to download and install Beautiful Soup. The output will look similar to the following:

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.3.2.tar.gz (143kB)
    100% |████████████████████████████████| 143kB 1.1MB/s 
Installing collected packages: beautifulsoup4
  Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4-4.3.2

Note that Pip 7 on Macintosh uses the █ character instead of # to show status. The installation was reported as successful. That means we can start using the package to analyze the data.
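To double-check, we can import the package in an interactive Python session and ask for its version. This is a quick sanity test; the number shown will match whatever pip just installed:

>>> import bs4
>>> bs4.__version__
'4.3.2'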

We'll finish this mission by gathering and parsing a very simple page of data.

We need to help agents make the sometimes dangerous crossing of the Gulf Stream between Florida and the Bahamas. Often, Bimini is used as a stopover; however, some faster boats can go all the way from Florida to Nassau in a single day. On a slower boat, the weather can change and an accurate multi-day forecast is essential.

The Georef code for this area is GHLL140032. For more information, look at the 25°32′N 79°46′W position on a world map. This will show the particular stretch of ocean for which we need to supply forecast data.

Here's a handy URL that provides weather forecasts for agents who are trying to make the passage between Florida and the Bahamas:

http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101.

This page includes a weather synopsis for the overall South Atlantic (the amz101 zone) and a day-by-day forecast specific to the Bahamas (the amz117 zone). We want to trim this down to the relevant text.

Getting an HTML page

The first step in using BeautifulSoup is to get the HTML page from the US National Weather Service and parse it into a proper document structure. We'll use urllib to get the document and create a Soup structure from it. Here's the essential processing:

from bs4 import BeautifulSoup
import urllib.request

# The amz117 zone is the Bahamas forecast; amz101 supplies the synopsis.
query = "http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101"
with urllib.request.urlopen(query) as amz117:
    # Naming a parser explicitly avoids bs4's "no parser specified" warning.
    document = BeautifulSoup(amz117.read(), 'html.parser')

We've opened a URL and assigned the file-like object to the amz117 variable. We've done this in a with statement; using with guarantees that all network resources are properly disconnected when execution leaves the indented body of the statement.

In the with statement, we've read the entire document available at the given URL. We've provided the sequence of bytes to the BeautifulSoup parser, which creates a parsed Soup data structure that we can assign to the document variable.

The with statement makes an important guarantee: when the indented body is complete, the resource manager is closed. In this example, the indented body is a single statement that reads the data from the URL and parses it to create a BeautifulSoup object. The resource manager is the connection to the Internet based on the given URL. We want to be absolutely sure that all operating system (and Python) resources that keep this connection open are properly released; this release-on-completion is exactly what the with statement guarantees.
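To see what this guarantee buys us, here's a sketch of the same download written without with. The try/finally bookkeeping below is exactly what the with statement handles for us:

import urllib.request
from bs4 import BeautifulSoup

query = "http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101"
amz117 = urllib.request.urlopen(query)
try:
    document = BeautifulSoup(amz117.read(), 'html.parser')
finally:
    # Runs even if read() or the parsing raises an exception.
    amz117.close()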

Navigating the HTML structure

HTML documents are a mixture of tags and text. The parsed structure is iterable, allowing us to work through text and tags using the for statement. Additionally, the parsed structure contains numerous methods to search for arbitrary features in the document.

Here's the first example of using tag names as attributes to pick apart a document:

content = document.body.find('div', id='content').div

When we use a tag name, such as body, as an attribute name, this is a search request for the first occurrence of that tag in the given container. We've used document.body to find the <body> tag in the overall HTML document.

The find() method finds the first matching instance using more complex criteria than the tag's name. In this case, we've asked to find <div id="content"> in the body tag of the document. In this identified <div>, we need to find the first nested <div> tag. This division has the synopsis and forecast.
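As a side note, the attribute shortcuts are simply abbreviations for find(). The one-line navigation above could be spelled out in full like this:

# .body is short for .find('body'); the trailing .div is short for .find('div')
content = document.find('body').find('div', id='content').find('div')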

The content in this division consists of a mixed sequence of text and tags. A little searching shows us that the synopsis text is the fifth item; since Python sequences are indexed from zero, it has an index of four in the <div>. We'll use the contents attribute of a given object to identify tags or text blocks by position in a document object.
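How do we do that little searching? An exploratory sketch like the following prints each child's position, type, and a preview of its text, which is enough to spot where the synopsis and the forecast live:

for position, item in enumerate(content.contents):
    # NavigableString children are text; Tag children are nested markup.
    print(position, type(item).__name__, repr(str(item)[:40]))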

The following is how we can get the synopsis and forecast. Once we have the forecast, we'll need to pick out the entry for each day:

synopsis = content.contents[4]
forecast = content.contents[5]
strong_list = list(forecast.find_all('strong'))
timestamp_tag, *forecast_list = strong_list

We've extracted the synopsis as a block of text. This page has a quirky feature: an <hr> tag that contains the forecast. This is, in principle, invalid HTML; even so, browsers tolerate it. It has the data that we want, so we're forced to work with it as we find it.

In the forecast <hr> tag, we've used the find_all() method to create a list of the sequence of <strong> tags. These tags are interleaved between blocks of text. Generally, the text in the tag tells us the day and the text between the <strong> tags is the forecast for that day. We say generally because there's a tiny, but important, special case.

Due to the special case, we've split the strong_list sequence into a head and a tail. The first item in the list is assigned to the timestamp_tag variable. All the remaining items are assigned to the forecast_list variable. We can use the value of timestamp_tag.string to recover the string value in the tag, which will be the timestamp for the forecast.

Your extension to this mission is to parse this timestamp with datetime.datetime.strptime(). Replacing strings with proper datetime objects will improve the overall utility of the data.
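Here's a sketch of how that parsing might look. The timestamp text below is hypothetical; inspect the actual value of timestamp_tag.string and adjust the format codes to match:

import datetime

raw = "1000 AM EDT Tue Jun 30 2015"  # hypothetical; use timestamp_tag.string
# strptime's %Z is unreliable for arbitrary zone abbreviations, so drop it.
parts = raw.split()
cleaned = " ".join(parts[:2] + parts[3:])
stamp = datetime.datetime.strptime(cleaned, "%I%M %p %a %b %d %Y")
print(stamp)  # 2015-06-30 10:00:00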

The value of the forecast_list variable is an alternating sequence of <strong> tags and forecast text. Here's how we can extract these pairs from the overall document:

for strong in forecast_list:
    # The <strong> tag holds the day label; the text after it is the forecast.
    desc = strong.string.strip()
    print(desc, strong.next_sibling.string.strip())

We've written a loop to step through the rest of the <strong> tags in the forecast_list object. Each item is a highlighted label for a given day. The value of strong.next_sibling will be the document object immediately after the <strong> tag. We can use strong.next_sibling.string to extract the string from this block of text; this will be the details of the forecast.

We've used the strip() method of the string to remove extraneous whitespace around the forecast elements. This makes the resulting text block more compact.

With a little more cleanup, we can have a tidy forecast that looks similar to the following:

TONIGHT 2015-06-30
--------------------
E TO SE WINDS 10 TO 15 KT...INCREASING TO 15 TO 20 KT
 LATE. SEAS 3 TO 5 FT ATLC EXPOSURES...AND 2 FT OR LESS
 ELSEWHERE.
WED 2015-07-01
--------------------
E TO SE WINDS 15 TO 20 KT...DIMINISHING TO 10 TO 15 KT
 LATE. SEAS 4 TO 6 FT ATLC EXPOSURES...AND 2 FT OR
 LESS ELSEWHERE.

Tip: Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

We've stripped away a great deal of HTML overhead and reduced the forecast to the barest facts. With a little more fiddling, we can get it down to a pretty tiny block of text. We might want to represent this in JavaScript Object Notation (JSON). We can then encrypt the JSON string before transmission. Then, we could use steganography to embed the encrypted text in another kind of document and transmit it to a friendly ship captain who is working the route between Key Biscayne and Bimini. It may look as if we're just sending each other pictures of rainbow butterfly unicorn kittens.
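As a sketch of that first step, suppose we collected the day and forecast text as pairs while looping over forecast_list (the pairs below are abbreviated stand-ins). The JSON encoding is then a one-liner:

import json

forecast_pairs = [
    ("TONIGHT", "E TO SE WINDS 10 TO 15 KT..."),
    ("WED", "E TO SE WINDS 15 TO 20 KT..."),
]
report = {"timestamp": "2015-06-30", "forecast": dict(forecast_pairs)}
print(json.dumps(report))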

This mission demonstrates that we can use Python 3, urllib, and BeautifulSoup. Now, we've got a working environment.
