In this article by Steven F. Lott, author of Python for Secret Agents, we will use Python to meet our secret informant at a good restaurant that's a reasonable distance from our base. In order to locate a good restaurant, we need to gather some additional information. In this case, good means a passing grade from the health inspectors.
Before we can even have a meeting, we'll need to use basic espionage skills to locate the health code survey results for local restaurants.
We'll create a Python application that combines several steps to sort through the results: locating the raw data, parsing it, and organizing it into useful objects.
In many cities, the health code data is available online, and a careful search will reveal a useful dataset. In other cities, the health inspection data isn't readily available online; we might have to dig quite deep to track down even a few restaurants near our base of operations.
Some cities use Yelp to publicize restaurant health code inspection data. We can read about the Yelp API for searching for restaurants at the following link:
http://www.yelp.com/developers/documentation
We might also find some useful data on InfoChimps at http://www.infochimps.com/tags/restaurant.
One complexity we often encounter is the use of HTML-based APIs for this kind of information. This is not intentional obfuscation, but the use of HTML complicates analysis of the data. Parsing HTML to extract meaningful information isn't easy; we'll need an extra library to handle this.
We'll look at two approaches: good, clean data and more complex HTML data parsing. In both cases, we need to create a Python object that acts as a container for a collection of attributes. First, we'll divert to look at the SimpleNamespace class. Then, we'll use this to collect information.
We have a wide variety of ways to define our own Python objects. We can use the central built-in types such as dict to define an object that has a collection of attribute values. When looking at information for a restaurant, we could use something like this:
some_place = { 'name': 'Secret Base', 'address': '333 Waterside Drive' }
Since this is a mutable object, we can add attribute values and change the values of the existing attributes. The syntax is a bit clunky, though. Here's what an update to this object looks like:
some_place['lat']= 36.844305
some_place['lng']= -76.29112
One common solution is to use a proper class definition. The syntax looks like this:
class Restaurant:
    def __init__(self, name, address):
        self.name= name
        self.address= address
We've defined a class with an initialization method, __init__(). The name of the initialization method is special, and only this name can be used. When the object is built, the initialization method is evaluated to assign initial values to the attributes of the object.
This allows us to create an object more succinctly:
some_place= Restaurant( name='Secret Base', address='333 Waterside Drive' )
We've used explicit keyword arguments. The use of name= and address= isn't required. However, as class definitions become more complex, it's often more flexible and clearer to use keyword argument values.
We can update the object nicely too, as follows:
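some_place.lat= 36.844305
some_place.lng= -76.29112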
This works out best when we have a lot of unique processing that is bound to each object. In this case, we don't actually have any processing to associate with the attributes; we just want to collect those attributes in a tidy capsule. The formal class definition is too much overhead for such a simple problem.
Python also gives us a very flexible structure called a namespace. This is a mutable object that we can access using simple attribute names, as shown in the following code:
from types import SimpleNamespace
some_place= SimpleNamespace( name='Secret Base', address='333 Waterside Drive' )
The syntax to create a namespace must use keyword arguments (name='The Name'). Once we've created this object, we can update it using a pleasant attribute access, as shown in the following snippet:
some_place.lat= 36.844305
some_place.lng= -76.29112
The SimpleNamespace class gives us a way to build an object that contains a number of individual attribute values.
We can also create a namespace from a dictionary using Python's ** notation. Here's an example:
>>> SimpleNamespace( **{'name': 'Secret Base', 'address': '333 Waterside Drive'} )
namespace(address='333 Waterside Drive', name='Secret Base')
The ** notation tells Python that a dictionary object contains keyword arguments for the function. The dictionary keys are the parameter names. This allows us to build a dictionary object and then use it as the arguments to a function.
Recall that JSON tends to encode complex data structures as a dictionary. Using this ** technique, we can transform a JSON dictionary into a SimpleNamespace object, and replace the clunky object['key'] notation with a cleaner object.key notation.
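For instance, here's a small illustration of that technique. The JSON text is invented for the example, but any flat JSON object works the same way:

import json
from types import SimpleNamespace

# A hypothetical JSON reply, used only to illustrate the ** technique.
reply = '{"name": "Secret Base", "address": "333 Waterside Drive"}'
document = json.loads(reply)
place = SimpleNamespace(**document)
print(place.name, place.address)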
In some cases, the data we want is tied up in HTML websites. The City of Norfolk, for example, relies on the State of Virginia's VDH health portal to store its restaurant health code inspection data.
In order to make sense of the intelligence encoded in the HTML notation on the WWW, we need to be able to parse the HTML markup that surrounds the data. Our job is greatly simplified by the use of special higher-powered weaponry; in this case, BeautifulSoup.
Start with https://pypi.python.org/pypi/beautifulsoup4/4.3.2 or http://www.crummy.com/software/BeautifulSoup/.
If we have Easy Install (or pip), we can use either tool to install BeautifulSoup.
We can use Easy Install to install BeautifulSoup like this:
sudo easy_install-3.3 beautifulsoup4
Mac OS X and GNU/Linux users will need to use the sudo command. Windows users won't use the sudo command.
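If pip is installed for Python 3 instead, the equivalent command is typically the following (again, with sudo on Mac OS X and GNU/Linux):

pip3 install beautifulsoup4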
Once we have BeautifulSoup, we can use it to parse the HTML code looking for specific facts buried in an otherwise cryptic jumble of HTML tags.
Before we can go on, you'll need to read the quickstart documentation and bring yourself up to speed on BeautifulSoup. Once you've done that, we'll move to extracting data from HTML web pages.
Start with http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start.
An alternative tool is Scrapy. For more information, see http://scrapy.org. Also, read Instant Scrapy Web Mining and Scraping, Travis Briggs, Packt Publishing, for details on using this tool. Unfortunately, as of this writing, Scrapy is focused on Python 2, not Python 3.
In the case of VDH health data for the City of Norfolk, the HTML scraping is reasonably simple. We can leverage the strengths of BeautifulSoup to dig into the HTML page very nicely.
Once we've created a BeautifulSoup object from the HTML page, we will have an elegant technique to navigate down through the hierarchy of the HTML tags. Each HTML tag name (html, body, and so on) is also a BeautifulSoup query that locates the first instance of that tag.
An expression such as soup.html.body.table can locate the first <table> in the HTML <body> tag. In the case of the VDH restaurant data, that's precisely the data we want.
Once we've found the table, we need to extract the rows. The HTML tag for each row is <tr> and we can use the BeautifulSoup table.find_all("tr") expression to locate all rows within a given <table> tag. Each tag's text is an attribute, .text. If the tag has attributes, we can treat the tag as if it's a dictionary to extract the attribute values.
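As a quick illustration of this navigation pattern, here's a small sketch. The HTML fragment is an invented stand-in for the real VDH page, used only to show the tag navigation and find_all() calls:

from bs4 import BeautifulSoup

# Invented HTML fragment; the real VDH page is much larger.
html = """<html><body><table>
<tr><td><a href="?RestrictToCategory=43F6BE85">Todd's Refresher</a></td>
<td>150 W. Main St #100</td></tr>
</table></body></html>"""

soup = BeautifulSoup(html)
table = soup.html.body.table
for row in table.find_all("tr"):
    for td in row.find_all("td"):
        # Print each cell's text, plus the href if the cell contains an <a> tag.
        print(td.text, td.a["href"] if td.a else None)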
We'll break down the processing of the VDH restaurant data into two parts: the web services query that builds Soup from HTML and the HTML parsing to gather restaurant information.
Here's the first part, which is getting the raw BeautifulSoup object:
scheme_host= "http://healthspace.com" def get_food_list_by_name(): path= "/Clients/VDH/Norfolk/Norolk_Website.nsf/Food-List-ByName" form = { "OpenView": "", "RestrictToCategory": "FAA4E68B1BBBB48F008D02BF09DD656F", "count": "400", "start": "1", } query= urllib.parse.urlencode( form ) with urllib.request.urlopen(scheme_host + path + "?" + query) as data: soup= BeautifulSoup( data.read() ) return soup
This repeats the web services queries we've seen before. We've separated three things here: the scheme_host string, the path string, and query. The reason for this is that our overall script will be using the scheme_host with other paths. And we'll be plugging in lots of different query data.
For this basic food_list_by_name query, we've built a form that will get 400 restaurant inspections. The RestrictToCategory field in the form has a magical key that we must provide to get the Norfolk restaurants. We found this via a basic web espionage technique: we poked around on the website and checked the URLs used when we clicked on each of the links. We also used the Developer mode of Safari to explore the page source.
In the long run, we want all of the inspections. To get started, we've limited ourselves to 400 so that we don't spend too long waiting to run a test of our script.
The response object was used by BeautifulSoup to create an internal representation of the web page. We assigned this to the soup variable and returned it as the result of the function.
In addition to returning the soup object, it can also be instructive to print it. It's quite a big pile of HTML. We'll need to parse this to get the interesting details away from the markup.
Once we have a page of HTML information parsed into a BeautifulSoup object, we can examine the details of that page. Here's a function that will locate the table of restaurant inspection details buried inside the page.
We'll use a generator function to yield each individual row of the table, as shown in the following code:
def food_table_iter( soup ):
    """Columns are 'Name', '', 'Facility Location', 'Last Inspection',
    plus an unnamed column with a RestrictToCategory key.
    """
    table= soup.html.body.table
    for row in table.find_all("tr"):
        columns = [ td.text.strip() for td in row.find_all("td") ]
        for td in row.find_all("td"):
            if td.a:
                url= urllib.parse.urlparse( td.a["href"] )
                form= urllib.parse.parse_qs( url.query )
                columns.append( form['RestrictToCategory'][0] )
        yield columns
Notice that this function begins with a triple-quoted string. This is a docstring and it provides documentation about the function. Good Python style insists on a docstring in every function. The Python help system will display the docstrings for functions, modules, and classes. We've omitted them to save space. Here, we included it because the results of this particular iterator can be quite confusing.
This function requires a parsed Soup object. The function uses simple tag navigation to locate the first <table> tag in the HTML <body> tag. It then uses the table's find_all() method to locate all of the rows within that table.
For each row, there are two pieces of processing. First, a list comprehension is used to find all the <td> tags within that row. Each <td> tag's text is stripped of excess white space, and the collection forms a list of cell values. In some cases, this kind of processing is sufficient.
In this case, however, we also need to decode an HTML <a> tag, which has a reference to the details for a given restaurant. We use a second find_all("td") expression to examine each column again. Within each column, we check for the presence of an <a> tag using a simple if td.a: test. If there is an <a> tag, we can get the value of the href attribute on that tag. When looking at the source HTML, this is the value inside the quotes of <a href="">.
This value of an HTML href attribute is a URL. We don't actually need the whole URL. We only need the query string within the URL. We've used the urllib.parse.urlparse() function to extract the various bits and pieces of the URL. The value of the url.query attribute is just the query string, after the ?.
It turns out, we don't even want the entire query string; we only want the value for the key RestrictToCategory. We can parse the query string with urllib.parse.parse_qs() to get a form-like dictionary, which we assigned to the variable form. This function is the inverse of urllib.parse.urlencode(). The dictionary built by the parse_qs() function associates each key with a list of values. We only want the first value, so we use form['RestrictToCategory'][0] to get the key required for a restaurant.
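Here's a small, self-contained sketch of that parsing, using a hypothetical href value; the real VDH links are longer, but the technique is identical:

import urllib.parse

# A hypothetical href value of the kind found inside each <a> tag.
href = "Food-Details?OpenView&RestrictToCategory=43F6BE8576FFC376852574CF005E3FC0"
url = urllib.parse.urlparse(href)
form = urllib.parse.parse_qs(url.query)
# parse_qs() maps each key to a list of values; we want the first one.
print(form['RestrictToCategory'][0])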
Since this food_table_iter() function is a generator, it must be used with a for statement or another generator function. We can use this function with a for statement as follows:
for row in food_table_iter(get_food_list_by_name()):
    print(row)
This prints each row of data from the HTML table. It starts like this:
['Name', '', 'Facility Location', 'Last Inspection']
["Todd's Refresher", '', '150 W. Main St #100', '6-May-2014', '43F6BE8576FFC376852574CF005E3FC0']
["'Chick-fil-A", '', '1205 N Military Highway', '13-Jun-2014', '5BDECD68B879FA8C8525784E005B9926']
This goes on for 400 locations.
The results are unsatisfying because each row is a flat list of attributes. The name is in row[0] and the address in row[2]. This kind of reference to columns by position can be obscure. It would be much nicer to have named attributes. If we convert the results to a SimpleNamespace object, we can then use the row.name and row.address syntax.
We really want to work with an object that has easy-to-remember attribute names and not a sequence of anonymous column names. Here's a generator function that will build a SimpleNamespace object from a sequence of values produced by a function such as the food_table_iter() function:
def food_row_iter( table_iter ):
    heading= next(table_iter)
    assert ['Name', '', 'Facility Location', 'Last Inspection'] == heading
    for row in table_iter:
        yield SimpleNamespace(
            name= row[0],
            address= row[2],
            last_inspection= row[3],
            category= row[4]
        )
This function's argument must be an iterator like food_table_iter(get_food_list_by_name()). The function uses next(table_iter) to grab the first row, since that's only going to be a bunch of column titles. We'll assert that the column titles really are the standard column titles in the VDH data. If the assertion ever fails, it's a hint that VDH web data has changed.
For every row after the first row, we build a SimpleNamespace object by taking the specific columns from each row and assigning them nice names.
We can use this function as follows:
soup= get_food_list_by_name()
raw_columns= food_table_iter(soup)
for business in food_row_iter( raw_columns ):
    print( business.name, business.address )
The processing can now use nice attribute names, for example, business.name, to refer to the data we extracted from the HTML page. This makes the rest of the programming meaningful and clear.
What's also important is that we've combined two generator functions. The food_table_iter() function will yield small lists built from HTML table rows. The food_row_iter() function expects a sequence of lists that can be iterated, and will build SimpleNamespace objects from that sequence of lists. This defines a kind of composite processing pipeline built from smaller steps. Each row of the HTML table that starts in food_table_iter() is touched by food_row_iter() and winds up being processed by the print() function.
The next steps are also excellent examples of the strengths of Python for espionage purposes. We need to geocode the restaurant addresses if it hasn't been done already. In some cases, geocoding is done for us. In other cases, we'll be using a web service for this. It varies from city to city whether or not the data is geocoded.
One popular geocoding service (Google) can be accessed using Python's http.client and json modules. In a few lines of code, we can extract the location of an address. We'll also need to implement the haversine formula for computing the distance between two points on the globe. This is not only easy, but the code is available on the web as a tidy example of good Python programming; it's well worth an agent's time to search for it.
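A minimal sketch of the haversine computation looks like the following; the parameter names and the radius value (an approximate Earth radius in statute miles) are our own choices for illustration, not necessarily the ones used later:

from math import radians, sin, cos, asin, sqrt

def haversine(lat_1, lon_1, lat_2, lon_2, R=3959):
    """Great-circle distance between two points.
    R defaults to an approximate Earth radius in statute miles."""
    d_lat = radians(lat_2 - lat_1)
    d_lon = radians(lon_2 - lon_1)
    a = sin(d_lat/2)**2 + cos(radians(lat_1))*cos(radians(lat_2))*sin(d_lon/2)**2
    return 2*R*asin(sqrt(a))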
Once we have the raw data on good restaurants close to our secret lair, we still need to filter and make the final decision. Given the work done in the previous steps, it's a short, clear Python loop that will show a list of restaurants with top health scores within short distances of our lair.
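As a sketch only, assuming each SimpleNamespace has been enriched with hypothetical distance and passed attributes by the geocoding and health-score steps, that final filter might look like this:

def good_nearby(inspections, max_miles=2.0):
    # Sketch: assumes hypothetical .distance (miles from our lair) and
    # .passed (health-code result) attributes added in earlier steps.
    return [b for b in inspections
            if b.distance <= max_miles and b.passed]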
As we noted above, we'll also need to communicate this. We can use steganography to encode a message into an image file. In addition to data scraping from the web, and using web services, Python is also suitable for this kind of bit-and-byte-level fiddling with the internals of a TIFF image.
Every secret agent can leverage Python for gathering, analyzing and distributing information.
In this article, we learned how to gather restaurant health inspection data from the web, parse the HTML with BeautifulSoup, and organize the results into tidy SimpleNamespace objects so that we can choose a safe place to meet our informant.