
Importing Dynamic Data

  • 19 min read
  • 20 Aug 2014


In this article by Chad Adams, author of the book Learning Python Data Visualization, we will go over the finer points of pulling data from the Web using the Python language and its built-in libraries, and cover parsing XML, JSON, and JSONP data.



Now that we have an understanding of how to work with the pygal library and how to build charts and graphics in general, it's time to start building an application using Python.

In this article, we will take a look at the fundamentals of pulling data from the Web, parsing that data into a usable format, and carrying those fundamentals over to our Python code. We will also cover parsing XML and JSON data.

Pulling data from the Web


For many non-developers, it may seem like witchcraft that developers are able to pull data from an online resource and integrate it with an iPhone app or a Windows Store app, or pull data to a cloud resource that can generate various versions of the data upon request.

To be fair, they do have a general understanding: data is pulled from the Web and formatted for their app of choice. They just may not know how that workflow happens. It's the same case with some developers as well; many work mainly with technologies that run in locked-down environments, or simply don't use the Internet in their applications. Again, they understand the logic behind it: somehow an RSS feed gets pulled into an application.

In many languages, the same task is done in different ways depending on which language is used. Let's take a look at an example that uses Packt's own news RSS feed: an iOS app pulling in data via Objective-C.

Now, if you're reading this and aren't familiar with Objective-C, that's OK; the important thing is that we have the inner XML contents of an XML file showing up in an iPhone application:

#import "ViewController.h"

@interface ViewController ()
@property (weak, nonatomic) IBOutlet UITextView *output;

@end

@implementation ViewController

- (void)viewDidLoad
{
    [super viewDidLoad];
    // Do any additional setup after loading the view, typically from a nib.

    NSURL *packtURL = [NSURL URLWithString:@"http://www.packtpub.com/rss.xml"];
    NSURLRequest *request = [NSURLRequest requestWithURL:packtURL];
    NSURLConnection *connection = [[NSURLConnection alloc]
        initWithRequest:request delegate:self startImmediately:YES];

    [connection start];
}

- (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data {
    NSString *downloadstring = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];

    [self.output setText:downloadstring];
}

- (void)didReceiveMemoryWarning
{
    [super didReceiveMemoryWarning];
    // Dispose of any resources that can be recreated.
}

@end


Here, we can see in the iPhone Simulator that our XML output is pulled dynamically through HTTP from the Web. This is what we want to start doing in Python:

[Screenshot: the raw XML feed contents displayed in the iOS Simulator]


The XML refresher


Extensible Markup Language (XML) is a data markup language that applies a set of rules and a hierarchy to a group of data, which is stored as a static file. Typically, servers update these XML files on the Web periodically so they can be reused as data sources. XML is really simple to pick up, as it's similar to HTML. To start, here's the document declaration:

<?xml version="1.0" encoding="utf-8"?>


Next, a root node is set. A node is like an HTML tag (which is also called a node). You can tell it's a node by the angle brackets around the node's name. For example, here's a node named root:

<root></root>


Note that we close the node by creating a same-named node that starts with a forward slash. We can also add attributes to a node and assign each a value, as shown in the following root node:

<root parameter="value"></root>


Data in XML is set through a hierarchy. To declare that hierarchy, we create another node and place that inside the parent node, as shown in the following code:

<root parameter="value">
     <subnode>Subnode's value</subnode>
</root>


In the preceding code, we created a subnode inside the parent node. Inside the subnode, we have an inner value, Subnode's value. Now, in programmatic terms, getting data out of an XML file is a process called parsing. With parsing, we specify where in the XML structure we would like to get a specific value; for instance, we can crawl the XML structure and get the inner contents with a path like this:

/root/subnode


This is commonly referred to as XPath syntax, a cross-language way of navigating an XML file. For more on XML and XPath, check out the full specs at http://www.w3.org/TR/REC-xml/ and http://www.w3.org/TR/xpath/, respectively.
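As a quick preview of where we're headed with Python, here's a minimal sketch (using Python's built-in xml.etree.ElementTree module, which we'll cover properly later in this article) that parses the small example above and grabs the subnode's inner value:

# -*- coding: utf-8 -*-

from xml.etree import ElementTree

#The example XML from above, as a string.
xml = '<root parameter="value"><subnode>Subnode\'s value</subnode></root>'

#fromstring() parses the string and returns the root Element.
root = ElementTree.fromstring(xml)

#Find the 'subnode' child and print its inner text.
print root.find('subnode').text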

RSS and ATOM


Really Simple Syndication (RSS) is simply a variation of XML. RSS is a spec that defines specific nodes that are common for sending data. Typically, many blogs offer an RSS feed so users can pull down the latest information from those sites. Some of the nodes used in RSS include rss, channel, item, title, description, pubDate, link, and guid.

Looking at our iPhone example from the Pulling data from the Web section, we can see what a typical RSS structure entails. RSS feeds are usually easy to spot, since the spec requires the root node to be named rss for the file to be true RSS.

In some cases, websites and services use a .rss extension rather than .xml; this is typically fine, since most RSS readers pull in the RSS data like an XML file, just as in the iPhone example.

Another form of XML is called ATOM. ATOM is a spec similar to RSS, but it was developed much later. Because of this, ATOM has more features than RSS: XML namespacing, specified content formats (video or audio-specific URLs), and support for internationalization and multiple languages, to name a few.

ATOM does have a few different nodes compared to RSS; for instance, while the root node of an RSS feed is <rss>, ATOM's root is <feed>, so it's pretty easy to spot the difference. Another difference is that an ATOM feed's URL can end in either .atom or .xml.
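As an illustrative sketch (not from the book's code), here's one way to tell the two apart in Python by checking the root node's name; note that ElementTree reports ATOM's root as {http://www.w3.org/2005/Atom}feed because of XML namespacing:

# -*- coding: utf-8 -*-

from xml.etree import ElementTree

def feed_type(xml_string):
    #Parse the feed and inspect the root node's tag.
    root = ElementTree.fromstring(xml_string)
    if root.tag == 'rss':
        return 'RSS'
    if root.tag == 'feed' or root.tag.endswith('}feed'):
        return 'ATOM'
    return 'Unknown'

print feed_type('<rss version="2.0"><channel></channel></rss>')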

For more on the RSS and ATOM specs, check out the RSS 2.0 specification and RFC 4287 (the Atom Syndication Format).

Understanding HTTP


All these samples pulling from the RSS feed of the Packt Publishing website show one commonality regardless of the technology they're coded in: the static files are pulled down via the Hypertext Transfer Protocol (HTTP). HTTP is the foundation of Internet communication, and it gives us two distinct request types we'll use here: a request for data, called GET, and a push of data, called POST.

Typically, when we download data over HTTP, we use the GET method to pull down the data. The GET request returns a string (or another data type, if we specify one), which we can either use directly or save to a variable.

With a POST request, we send values to a service that handles incoming data. Say we created a new blog post title and needed to add it to a list of current titles; a common way of doing that is with URL parameters. A URL parameter is an existing URL suffixed with a key-value pair.

The following is a mock example of a POST request with a URL parameter:

http://www.yourwebsite.com/blogtitles/?addtitle=Your%20New%20Title

If our service is set up correctly, it will scan the request for a key named addtitle and read its value, in this case: Your New Title. Notice the %20 in our request; this is an escape sequence that allows us to send a value containing spaces. Here, %20 is a placeholder for a space in our value.
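Rather than typing escape sequences like %20 by hand, we can let Python build them for us. Here's a minimal sketch using the quote() function from Python 2's built-in urllib module (the endpoint URL is the same mock one as above):

# -*- coding: utf-8 -*-

import urllib

#URL-encode the value; the space becomes %20 automatically.
title = urllib.quote('Your New Title')

#Suffix the key-value pair onto our mock endpoint.
url = 'http://www.yourwebsite.com/blogtitles/?addtitle=' + title

print url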

Using HTTP in Python


The RSS samples from the Packt Publishing website show a few commonalities we account for when programming with HTTP: we always allow for the possibility of something going wrong with the connection, and we always close our request when finished. Here's an example of how the same RSS feed request is done in Python using a built-in library called urllib2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    #Read the file to a variable we named 'xml'
    xml = response.read()
    #print to the console.
    print(xml)
    #Finally, close our open network.
    response.close()
except:
    #If we have an issue show a message and alert the user.
    print('Unable to connect to RSS...')


If we look in the following console output, we can see the XML contents just as we saw in our iOS code example:

[Screenshot: console output showing the raw XML contents of the feed]


In the example, notice that we wrapped our HTTP request in a try...except block. For those coming from another language, except can be considered the same as a catch statement. This allows us to detect an error, such as an incorrect URL or a lack of connectivity, and programmatically set an alternate result for our Python script.

Parsing XML in Python with HTTP


These examples, including our Python version of the script, still return one long string, which by itself isn't of much use when we want specific values. In order to grab specific strings and values from XML pulled through HTTP, we need to parse it. Luckily, Python has a built-in object for this, called ElementTree, which is part of the xml library in Python.

Let's incorporate ElementTree into our example and see how that works:

# -*- coding: utf-8 -*-

import urllib2
from xml.etree import ElementTree

try:
    #Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')

    tree = ElementTree.parse(response)
    root = tree.getroot()

    #Create an 'Element' group from our XPATH using findall.
    news_post_title = root.findall("channel//title")

    #Iterate in all our searched elements and print the inner text for each.
    for title in news_post_title:
        print title.text

    #Finally, close our open network.
    response.close()
except Exception as e:
    #If we have an issue show a message and alert the user.
    print(e)


The following screenshot shows the results of our script:

[Screenshot: console output listing the title of each blog post]



As we can see, our output shows the title of each blog post. Notice how we used channel//title in our findall() method. This is XPath, which gives us a shorthand way of navigating an XML structure. It works like this: in the http://www.packtpub.com feed, we have an rss root, followed by a channel node, which contains a series of item nodes, each with its own title. The channel//title path matches every title element at any depth beneath channel.
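To see that shorthand in action on a small scale, here's an illustrative sketch (using a tiny hardcoded RSS string rather than the live feed) comparing the // pattern with an explicit path:

# -*- coding: utf-8 -*-

from xml.etree import ElementTree

rss = ('<rss><channel><title>Feed</title>'
       '<item><title>Post One</title></item>'
       '<item><title>Post Two</title></item>'
       '</channel></rss>')

root = ElementTree.fromstring(rss)

#channel//title matches every title at any depth beneath channel,
#including the channel's own title.
for title in root.findall('channel//title'):
    print title.text

#channel/item/title matches only titles directly inside items.
for title in root.findall('channel/item/title'):
    print title.text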

The findall() method found each matching element, saved each one as an Element type specific to the XML library ElementTree uses in Python, and collected them in a list. We can then use a for loop to iterate over them and print the inner text using the text property of the Element type.

You may notice that in the recent example, I extended except with a bit of extra code: Exception as e. This helps us debug issues by printing them to a console or displaying more in-depth feedback to the user. Exception is the generic base class for the warnings and errors built into the Python libraries, so catching it lets us print them to a console or handle them in code. Python also has more specific exceptions we can catch, such as IOError, which is specific to file reading and writing.
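As a sketch of what that extra specificity buys us, here's a hedged variation on our request that handles different failures separately; note that urllib2.HTTPError is a subclass of urllib2.URLError, so it has to be caught first:

# -*- coding: utf-8 -*-

import urllib2

try:
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    xml = response.read()
    response.close()
except urllib2.HTTPError as e:
    #The server answered, but with an error status such as 404 or 500.
    print('Server returned HTTP error %d' % e.code)
except urllib2.URLError as e:
    #The connection itself failed, for example no network or a bad host.
    print('Unable to connect: %s' % e.reason)
except Exception as e:
    #Anything else that goes wrong later in the script.
    print(e)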

About JSON


Now, another data type that's becoming more and more common when working with web data is JSON. JSON is an acronym for JavaScript Object Notation and, as the name implies, its syntax comes straight from JavaScript. It has become popular in recent years with the rise of mobile apps and Rich Internet Applications (RIA).

JSON is great for JavaScript developers, as it's easier to work with inside HTML content than XML. Because of this, JSON is becoming a more common data type for web and mobile application development.

Parsing JSON in Python with HTTP


Parsing JSON data in Python is a pretty similar process; however, in this case, our ElementTree library isn't needed, since it only works with XML. Instead, we need a library designed to parse JSON. Luckily, Python's standard library already includes one, simply called json.

Let's build an example similar to our XML code using the json import; of course, we need a different data source, since we won't be working with XML. One thing to note is that there aren't many public JSON feeds; many require an API key that grants a developer permission to generate a JSON feed, such as Twitter's JSON API. To avoid this, we will use a sample URL from Google's Books API, which serves demo data for Pride and Prejudice by Jane Austen. We can view the JSON (or download the file) by typing in the following URL:

https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ

Notice that the API uses HTTPS. Many JSON APIs are moving to secure methods of transmitting data, so be sure to use https:// rather than http:// where it's offered.

Let's take a look at the JSON output:

{
 "kind": "books#volume",
 "id": "s1gVAAAAYAAJ",
 "etag": "yMBMZ85ENrc",
 "selfLink": "https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ",
 "volumeInfo": {
  "title": "Pride and Prejudice",
  "authors": [
   "Jane Austen"
  ],
  "publisher": "C. Scribner's Sons",
  "publishedDate": "1918",
  "description": "Austen's most celebrated novel tells
the story of Elizabeth Bennet, a bright, lively young woman
with four sisters, and a mother determined to marry
them to wealthy men. At a party near
the Bennets' home in the English countryside, Elizabeth meets the wealthy,
proud Fitzwilliam Darcy. Elizabeth initially finds Darcy haughty
and intolerable, but circumstances continue to unite the pair.
Mr. Darcy finds himself captivated by Elizabeth's wit and candor,
while her reservations about his character slowly vanish. The story is as
much a social critique as it is a love story, and the prose crackles
with Austen's wry wit.",
  "readingModes": {
   "text": true,
   "image": true
  },
  "pageCount": 401,
  "printedPageCount": 448,
  "dimensions": {
   "height": "18.00 cm"
  },
  "printType": "BOOK",
  "averageRating": 4.0,
  "ratingsCount": 433,
  "contentVersion": "1.1.5.0.full.3",
  "imageLinks": {
   "smallThumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec
=frontcover&img=1&zoom=5&edge=curl&imgtk=AFLRE73F8btNqKpVjGX6q7V3XS77
QA2PftQUxcEbU3T3njKNxezDql_KgVkofGxCPD3zG1yq39u0XI8s4wjrqFahrWQ-
5Epbwfzfkoahl12bMQih5szbaOw&source=gbs_api",
   "thumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec=
frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE70tVS8zpcFltWh_
7K_5Nh8BYugm2RgBSLg4vr9tKRaZAYoAs64RK9aqfLRECSJq7ATs_j38JRI3D4P48-2g_
k4-EY8CRNVReZguZFMk1zaXlzhMNCw&source=gbs_api",
   "small": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec
=frontcover&img=1&zoom=2&edge=curl&imgtk=AFLRE71qcidjIs37x0jN2dGPstn
6u2pgeXGWZpS1ajrGgkGCbed356114HPD5DNxcR5XfJtvU5DKy5odwGgkrwYl9gC9fo3y-
GM74ZIR2Dc-BqxoDuUANHg&source=gbs_api",
   "medium": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec=
frontcover&img=1&zoom=3&edge=curl&imgtk=AFLRE73hIRCiGRbfTb0uNIIXKW
4vjrqAnDBSks_ne7_wHx3STluyMa0fsPVptBRW4yNxNKOJWjA4Od5GIbEKytZAR3Nmw_
XTmaqjA9CazeaRofqFskVjZP0&source=gbs_api",
   "large": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec=
frontcover&img=1&zoom=4&edge=curl&imgtk=AFLRE73mlnrDv-rFsL-
n2AEKcOODZmtHDHH0QN56oG5wZsy9XdUgXNnJ_SmZ0sHGOxUv4sWK6GnMRjQm2eEwnxIV4dcF9eBhghMcsx
-S2DdZoqgopJHk6Ts&source=gbs_api",
   "extraLarge": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec=
frontcover&img=1&zoom=6&edge=curl&imgtk=AFLRE73KIXHChszn
TbrXnXDGVs3SHtYpl8tGncDPX_7GH0gd7sq7SA03aoBR0mDC4-euzb4UCIDiDNLYZUBJwMJxVX_
cKG5OAraACPLa2QLDcfVkc1pcbC0&source=gbs_api"
  },
  "language": "en",
  "previewLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api",
  "infoLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api",
  "canonicalVolumeLink": "http://books.google.com/books/about/
Pride_and_Prejudice.html?hl=&id=s1gVAAAAYAAJ"
 },
 "layerInfo": {
  "layers": [
   {
    "layerId": "geo",
    "volumeAnnotationsVersion": "6"
   }
  ]
 },
 "saleInfo": {
  "country": "US",
  "saleability": "FREE",
  "isEbook": true,
  "buyLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&buy=&source=gbs_api"
 },
 "accessInfo": {
  "country": "US",
  "viewability": "ALL_PAGES",
  "embeddable": true,
  "publicDomain": true,
  "textToSpeechPermission": "ALLOWED",
  "epub": {
   "isAvailable": true,
   "downloadLink": "http://books.google.com/books/download
/Pride_and_Prejudice.epub?id=s1gVAAAAYAAJ&hl=&output=epub
&source=gbs_api"
  },
  "pdf": {
   "isAvailable": true,
   "downloadLink": "http://books.google.com/books/download/Pride_and_Prejudice.pdf
?id=s1gVAAAAYAAJ&hl=&output=pdf&sig=ACfU3U3dQw5JDWdbVgk2VRHyDjVMT4oIaA
&source=gbs_api"
  },
  "webReaderLink": "http://books.google.com/books/reader
?id=s1gVAAAAYAAJ&hl=&printsec=frontcover&
output=reader&source=gbs_api",
  "accessViewStatus": "FULL_PUBLIC_DOMAIN",
  "quoteSharingAllowed": false
 }
}


One downside to JSON is that complex data can be hard to read. So, if we run across a complex JSON feed, we can visualize it using a JSON visualizer. Visual Studio includes one in all its editions, and a web search will turn up multiple online sites where you can paste JSON to see an easy-to-understand data structure. Here's an example using http://jsonviewer.stack.hu/ to load our example JSON URL:

[Screenshot: the example JSON feed loaded in the online JSON viewer]
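If we'd rather stay in Python, the json library itself can pretty-print data for us once it's parsed. Here's a minimal sketch using a small hardcoded object in place of the full feed:

# -*- coding: utf-8 -*-

import json

#A small sample object; a real feed would come from json.loads().
data = {'volumeInfo': {'title': 'Pride and Prejudice', 'language': 'en'}}

#indent sets the number of spaces per nesting level.
print json.dumps(data, indent=2, sort_keys=True)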


Next, let's reuse some of our existing Python code, using our urllib2 library to request the JSON feed, and then parse it with Python's json library. We'll parse the book's volumeInfo node, starting at the JSON root node, with volumeInfo as its subnode. Here's our example from the XML section, reworked using JSON to print all of volumeInfo's child elements:

# -*- coding: utf-8 -*-

import urllib2
import json
try:
    #Set a URL variable.
    url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
    #Open the file via HTTP.
    response = urllib2.urlopen(url)

    #Read the request as one string.
    bookdata = response.read()

    #Convert the string to a JSON object in Python.
    data = json.loads(bookdata)

    #Iterate over the keys of the 'volumeInfo' object.
    for r in data['volumeInfo']:
        print r

    #Close our response.
    response.close()

except:
    #If we have an issue show a message and alert the user.
    print('Unable to connect to JSON API...')


Here's our output. It matches the child nodes of volumeInfo, as shown in the following screenshot:

[Screenshot: console output listing the child keys of volumeInfo]


Well done! Now, let's grab the value for title. Look at the following example and notice we have two keys in brackets: one for volumeInfo and another for its child, title. This is similar to navigating our XML hierarchy:

# -*- coding: utf-8 -*-

import urllib2
import json

try:
    #Set a URL variable.
    url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'

    #Open the file via HTTP.
    response = urllib2.urlopen(url)

    #Read the request as one string.
    bookdata = response.read()

    #Convert the string to a JSON object in Python.
    data = json.loads(bookdata)

    print data['volumeInfo']['title']

    #Close our response.
    response.close()

except Exception as e:
    #If we have an issue show a message and alert the user.
    #'Unable to connect to JSON API...'
    print(e)


The following screenshot shows the results of our script:

[Screenshot: console output showing Pride and Prejudice]


As you can see in the preceding screenshot, we return one line with Pride and Prejudice parsed from our JSON data.

About JSONP


JSONP, or JSON with Padding, is actually JSON, but it is set up differently compared to traditional JSON files. JSONP is a workaround for the browser restrictions on cross-domain scripting. Some web services serve up JSONP rather than pure JSON. The issue is that JSONP isn't compatible with many Python-based JSON parsers, including the one covered here, so you will want to avoid JSONP-style JSON whenever possible.

So how can we spot JSONP files? Do they have a different extension? No; JSONP is simply a wrapper around JSON data. Here's an example without JSONP:

/*
 * Regular JSON
 */
{ "authorname": "Chad Adams" }

The same example with JSONP:

/*
 * JSONP
 */
callback({ "authorname": "Chad Adams" });


Notice that we wrapped our JSON data in a function wrapper, or a callback. Typically, this is what breaks our parsers and is the giveaway that this is a JSONP-formatted file. In JavaScript, we can consume it by defining the callback function in code like this:

/*
 * Using JSONP in JavaScript
 */
callback = function (data) {
    alert(data.authorname);
};

JSONP with Python


We can get around a JSONP data source if we need to, though; it just requires a bit of work. We can use the str.replace() method in Python to strip out the callback before running the string through our JSON parser. If we were parsing our example JSONP file with our JSON parser example, the key line would look something like this:

#Convert the string to a JSON object in Python,
#stripping the callback wrapper first.
data = json.loads(bookdata.replace('callback(', '').replace(');', ''))
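Here's a minimal, self-contained sketch of the whole idea, using a hardcoded JSONP string in place of a live feed (the callback name and data are made up for illustration):

# -*- coding: utf-8 -*-

import json

#A mock JSONP response; a real one would arrive via urllib2.
jsonp = 'callback({ "authorname": "Chad Adams" });'

#Strip the callback wrapper, leaving plain JSON.
raw = jsonp.replace('callback(', '').replace(');', '')

data = json.loads(raw)
print data['authorname']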

Summary


In this article, we covered HTTP concepts and methodologies for pulling data from the Web. We learned how to do that in Python using the urllib2 library, and we parsed XML and JSON data. We also discussed the differences between JSON and JSONP, and how to work around JSONP if needed.
