
Scraping the Web with Python - Quick Start

  • 9 min read
  • 17 Feb 2016


In this article, we're going to acquire intelligence data from a variety of sources. We might interview people. We might steal files from a secret underground base. We might search the World Wide Web (WWW).


Accessing data from the Internet

The WWW and Internet are based on a series of agreements called Request for Comments (RFC). The RFCs define the standards and protocols to interconnect different networks, that is, the rules for internetworking. The WWW is defined by a subset of these RFCs that specifies the protocols, behaviors of hosts and agents (servers and clients), and file formats, among other details.

In a way, the Internet is a controlled chaos. Most software developers agree to follow the RFCs. Some don't. If their idea is really good, it can catch on, even though it doesn't precisely follow the standards. We often see this in the way some browsers don't work with some websites. This can cause confusion and questions. We'll often have to perform both espionage and plain old debugging to figure out what's available on a given website.

Python provides a variety of modules that implement the software defined in the Internet RFCs. We'll look at some of the common protocols to gather data through the Internet and the Python library modules that implement these protocols.

Background briefing – the TCP/IP protocols

The essential idea behind the WWW is the Internet. The essential idea behind the Internet is the TCP/IP protocol stack. The IP part of this is the Internet Protocol, which defines how messages can be routed between networks. Layered on top of IP is the TCP protocol, which connects two applications to each other. TCP connections are often made via a software abstraction called a socket. In addition to TCP, there's also UDP; it's not used as much for the kind of WWW data we're interested in.

In Python, we can use the low-level socket library to work with the TCP protocol, but we won't. A socket is a file-like object that supports open, close, input, and output operations. Our software will be much simpler if we work at a higher level of abstraction. The Python libraries that we'll use will leverage the socket concept under the hood.
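To get a feel for what those higher-level libraries are doing for us, here's a minimal sketch that talks to a web server using the socket module directly. The host name and hand-written request line are illustrative only; this is exactly the bookkeeping we'd rather not do by hand.

import socket

# Open a TCP connection to the server (the host name here is illustrative).
with socket.create_connection(("example.com", 80)) as sock:
    # Hand-craft a bare-bones HTTP/1.1 GET request.
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))
    # Read raw bytes until the server closes the connection.
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b"".join(chunks)[:200])   # headers plus the start of the body

Everything after this point—splitting headers from the body, handling status codes, decoding content—would be ours to implement, which is exactly why we'll let http.client do it instead.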

The Internet RFCs define a number of protocols that build on TCP/IP sockets. These provide more useful definitions of the interactions between host computers (servers) and user agents (clients). We'll look at two of these: Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP).

Using http.client for HTTP GET

The essence of web traffic is HTTP. This is built on TCP/IP. HTTP defines two roles: host and user agent, also called server and client, respectively. We'll stick to server and client. HTTP defines a number of request types, including GET and POST.

A web browser is one kind of client software we can use. This software makes GET and POST requests, and displays the results from the web server. We can do this kind of client-side processing in Python using two library modules.

The http.client module allows us to make GET and POST requests as well as PUT and DELETE. We can read the response object. Sometimes, the response is an HTML page. Sometimes, it's a graphic image. There are other things too, but we're mostly interested in text and graphics.

Here's a picture of a mysterious device we've been trying to find. We need to download this image from http://upload.wikimedia.org/wikipedia/commons/7/72/IPhone_Internals.jpg to our computer so that we can see it and send it to our informant:

[Image: IPhone_Internals.jpg]

Here's a picture of the currency we're supposed to track down and pay with:

[Image: 1drachmi_1973.jpg]

We need to download this image. Here is the link:

http://upload.wikimedia.org/wikipedia/en/c/c1/1drachmi_1973.jpg

Here's how we can use http.client to get these two image files:

import http.client
import contextlib

path_list = [
    "/wikipedia/commons/7/72/IPhone_Internals.jpg",
    "/wikipedia/en/c/c1/1drachmi_1973.jpg",
]
host = "upload.wikimedia.org"

with contextlib.closing(http.client.HTTPConnection(host)) as connection:
    for path in path_list:
        connection.request("GET", path)
        response = connection.getresponse()
        print("Status:", response.status)
        print("Headers:", response.getheaders())
        _, _, filename = path.rpartition("/")
        print("Writing:", filename)
        with open(filename, "wb") as image:
            image.write(response.read())

We're using http.client to handle the client side of the HTTP protocol. We're also using the contextlib module to politely disentangle our application from network resources when we're done using them.

We've assigned a list of paths to the path_list variable. This example introduces list objects without providing any background. It's important that lists are surrounded by [] and the items are separated by ,. Yes, there's an extra , at the end. This is legal in Python.
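As a quick aside, here's the same list syntax on its own, with nothing to do with HTTP; the values are purely illustrative:

colors = [
    "red",
    "green",
    "blue",   # the trailing comma is legal and convenient when editing
]
print(len(colors))   # 3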


We created an http.client.HTTPConnection object using the host computer name. This connection object is a little like a file; it entangles Python with operating system resources on our local computer plus a remote server. Unlike a file, an HTTPConnection object isn't a proper context manager. As we really like context managers to release our resources, we made use of the contextlib.closing() function to handle the context management details. The connection needs to be closed; the closing() function assures that this will happen by calling the connection's close() method.
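Roughly speaking, contextlib.closing() saves us from writing the try/finally bookkeeping ourselves. A hedged sketch of the equivalent long-hand form looks like this:

import http.client

connection = http.client.HTTPConnection("upload.wikimedia.org")
try:
    connection.request("GET", "/wikipedia/en/c/c1/1drachmi_1973.jpg")
    response = connection.getresponse()
    print("Status:", response.status)
finally:
    # This is what contextlib.closing() guarantees for us.
    connection.close()

Using closing() keeps the happy path and the cleanup in one tidy with statement.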

For all of the paths in our path_list, we make an HTTP GET request. This is what browsers do to get the image files mentioned in an HTML page. We print a few things from each response. The status, if everything worked, will be 200. If the status is not 200, then something went wrong and we'll need to read up on the HTTP status codes to see what happened.

If you use a coffee shop Wi-Fi connection, perhaps you're not logged in. You might need to open a browser to set up a connection.

An HTTP response includes headers that provide some additional details about the request and response. We've printed the headers because they can be helpful in debugging any problems we might have. One of the most useful headers is ('Content-Type', 'image/jpeg'). This confirms that we really did get an image.
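If we wanted to be a bit more defensive, we could check both the status and the Content-Type header before writing the file. Here's a hedged sketch built on the same example; the fetch() name is ours, not part of any library:

import contextlib
import http.client

def fetch(host, path, filename):
    # Download one file over HTTP, saving it only if the response looks sane.
    with contextlib.closing(http.client.HTTPConnection(host)) as connection:
        connection.request("GET", path)
        response = connection.getresponse()
        if response.status != 200:
            print("Failed:", response.status, response.reason)
            return False
        content_type = response.getheader("Content-Type", "")
        if not content_type.startswith("image/"):
            print("Unexpected content type:", content_type)
            return False
        with open(filename, "wb") as image:
            image.write(response.read())
        return True

fetch("upload.wikimedia.org",
      "/wikipedia/commons/7/72/IPhone_Internals.jpg",
      "IPhone_Internals.jpg")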

We used _, _, filename = path.rpartition("/") to locate the right-most / character in the path. Recall that the partition() method locates the left-most instance. We're using the right-most one here. We assigned the directory information and separator to the variable _. Yes, _ is a legal variable name. It's easy to ignore, which makes it a handy shorthand for "we don't care". We kept the filename in the filename variable.
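Here's the difference on one of our paths:

path = "/wikipedia/en/c/c1/1drachmi_1973.jpg"

head, _, filename = path.rpartition("/")   # splits at the right-most "/"
print(filename)                            # 1drachmi_1973.jpg

first, _, rest = path.partition("/")       # splits at the left-most "/"
print(repr(first), repr(rest))             # '' 'wikipedia/en/c/c1/1drachmi_1973.jpg'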

We create a nested context for the resulting image file. We can then read the body of the response—a collection of bytes—and write these bytes to the image file. In one quick motion, the file is ours.

The HTTP GET request is what underlies much of the WWW. Programs such as curl and wget are expansions of this example. They execute batches of GET requests to locate one or more pages of content. They can do quite a bit more, but this is the essence of extracting data from the WWW.

Changing our client information

An HTTP GET request includes several headers in addition to the URL. In the previous example, we simply relied on the Python http.client library to supply a suitable set of default headers. There are several reasons why we might want to supply different or additional headers.

First, we might want to tweak the User-Agent header to change the kind of browser that we're claiming to be. We might also need to provide cookies for some kinds of interactions. For information on the user agent string, see http://en.wikipedia.org/wiki/User_agent_string#User_agent_identification.

This information may be used by the web server to determine if a mobile device or desktop device is being used. We can use something like this:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14

This makes our Python request appear to come from the Safari browser instead of a Python application. We can use something like this to appear to be a different browser on a desktop computer:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0

We can use something like this to appear to be an iPhone instead of a Python application:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D201 Safari/9537.53

We make this change by adding headers to the request we're making. The change looks like this:

connection.request("GET", path, headers={
    'User-Agent':
        'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D201 Safari/9537.53',
})

This will make the web server treat our Python application like it's on an iPhone. This might lead to a more compact page of data than might be provided to a full desktop computer that makes the same request.

The header information is a dictionary with the { key: value, } syntax. It's important that dictionaries are surrounded by {}, the keys and values are separated by :, and each key-value pair is separated by ,. Yes, there's an extra , at the end. This is legal in Python.
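Putting it all together, here's a hedged sketch of the earlier download loop with the iPhone User-Agent header added to every request; this time we just print the status and content type rather than saving the files:

import contextlib
import http.client

iphone_headers = {
    'User-Agent':
        'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_1 like Mac OS X) '
        'AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 '
        'Mobile/11D201 Safari/9537.53',
}

host = "upload.wikimedia.org"
path_list = [
    "/wikipedia/commons/7/72/IPhone_Internals.jpg",
    "/wikipedia/en/c/c1/1drachmi_1973.jpg",
]

with contextlib.closing(http.client.HTTPConnection(host)) as connection:
    for path in path_list:
        connection.request("GET", path, headers=iphone_headers)
        response = connection.getresponse()
        print(path, "->", response.status, response.getheader("Content-Type"))
        response.read()   # drain the body so the connection can be reused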

There are many more HTTP headers we can provide. The User-Agent header is perhaps the most important one for gathering different kinds of intelligence data from web servers.

You can refer to more books related to this topic via the following links:

  • Python for Secret Agents - Volume II: (https://www.packtpub.com/application-development/python-secret-agents-volume-ii)
  • Expert Python Programming: (https://www.packtpub.com/application-development/expert-python-programming)
  • Raspberry Pi for Secret Agents: (https://www.packtpub.com/hardware-and-creative/raspberry-pi-secret-agents)
