Python for Secret Agents - Volume II

Gather, analyze, and decode data to reveal hidden facts using Python, the perfect tool for all aspiring secret agents

Product type: Paperback
Published: Dec 2015
ISBN-13: 9781785283406
Length: 180 pages
Edition: 2nd Edition
Author: Steven F. Lott

Mission to expand our toolkit

Now that we know our Python 3 is up-to-date, we can add some additional tools. We'll be using several advanced packages to help in acquiring and analyzing raw data.

We're going to need to tap into the social network. There are a large number of candidate social networks that we could mine for information. We'll start with Twitter. We can access the Twitter feed using direct API requests. Rather than working through the protocols at a low level, we'll make use of a Python package that provides some simplifications.

Our first choice is the Twitter API project on PyPI, as follows: https://pypi.python.org/pypi/TwitterAPI/2.3.3.

This can be installed using sudo pip3.4 install twitterapi.

We have some alternatives, one of which is the Twitter project from sixohsix. Here's the URL: https://pypi.python.org/pypi/twitter/1.17.0.

We can install this using sudo pip3.4 install twitter.

We'll focus on the twitterapi package. Here's what happens when we do the installation:

MacBookPro-SLott:Code slott$ sudo -H pip3.4 install twitterapi
Password:
Collecting twitterapi
  Downloading TwitterAPI-2.3.3.1.tar.gz
Collecting requests (from twitterapi)
  Downloading requests-2.7.0-py2.py3-none-any.whl (470kB)
    100% |████████████████████████████████| 471kB 751kB/s 
Collecting requests-oauthlib (from twitterapi)
  Downloading requests_oauthlib-0.5.0-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests-oauthlib->twitterapi)
  Downloading oauthlib-0.7.2.tar.gz (106kB)
    100% |████████████████████████████████| 106kB 1.6MB/s 
Installing collected packages: requests, oauthlib, requests-oauthlib, twitterapi
  Running setup.py install for oauthlib
  Running setup.py install for twitterapi
Successfully installed oauthlib-0.7.2 requests-2.7.0 requests-oauthlib-0.5.0 twitterapi-2.3.3.1

We used the sudo -H option, as required by Mac OS X. Windows agents would omit this. Some Linux agents can omit the -H option as it may be the default behavior.

Note that four packages were installed. The twitterapi package requires the requests and requests-oauthlib packages. The requests-oauthlib package, in turn, requires the oauthlib package, which was downloaded automatically for us.

The missions for using this package start in Chapter 3, Following the Social Network. For now, we'll count the installation as a successful preliminary mission.
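Before moving on, an agent may want to confirm that Python can actually locate the four packages. The following is a minimal sketch, not part of the original mission; the check_installed() helper and the REQUIRED list are names invented here for illustration. It assumes the usual import names, which differ slightly from the names used with pip.

```python
import importlib.util

# Import names are assumptions: the TwitterAPI distribution is imported
# as "TwitterAPI", and requests-oauthlib as "requests_oauthlib".
REQUIRED = ["TwitterAPI", "requests", "requests_oauthlib", "oauthlib"]

def check_installed(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }

if __name__ == "__main__":
    for name, found in check_installed(REQUIRED).items():
        print("{:20s} {}".format(name, "ok" if found else "MISSING"))
```

Any line marked MISSING suggests that the pip command was run against a different Python installation than the one running the script.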

Scraping data from PDF files

In addition to HTML, a great deal of data is packaged as PDF files. PDF files are designed to produce printed output consistently across a variety of devices. When we look at the structure of these documents, we find that we have a complex and compressed storage format. In this structure, there are fonts, rasterized images, and descriptions of text elements in a simplified version of the PostScript language.

There are several issues that come into play here, as follows:

  • The files are quite complex. We don't want to tackle the algorithms that are required to read the streams encoded in the PDF since we're focused on the content.
  • The content is organized for tidy printing. What we perceive as a single page of text is really just a collection of text blobs. We've been taught how to identify the text blobs as headers, footers, sidebars, titles, code examples, and other semantic features of a page. This is actually a pretty sophisticated bit of pattern matching. There's an implicit agreement between readers and book designers to stick to some rules to place the content on the pages.
  • It's possible that a PDF can be created from a scanned image. This will require Optical Character Recognition (OCR) in order to recover useful text from the image.
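To make the idea of text blobs concrete, here is a minimal sketch of the kind of pattern matching a reader does without thinking. It is not the algorithm from Chapter 4. The Blob tuple, the classify() function, the page size (US Letter, 792 points tall), and the one-inch margins are all assumptions made for illustration. Real documents generally need more clues, such as font size and horizontal position.

```python
from collections import namedtuple

# Hypothetical representation of one text blob: its text plus the
# position of its top edge, measured in points up from the page bottom.
Blob = namedtuple("Blob", ["text", "top"])

def classify(blob, page_height=792.0, margin=72.0):
    """Label a blob by where it sits on the page (assumed 1-inch margins)."""
    if blob.top > page_height - margin:
        return "header"
    if blob.top < margin:
        return "footer"
    return "body"

# Made-up sample data standing in for blobs extracted from one page.
page = [
    Blob("Chapter 1", 770.0),
    Blob("We'll start with Twitter.", 500.0),
    Blob("[ 12 ]", 40.0),
]
labels = [classify(blob) for blob in page]
print(labels)  # ['header', 'body', 'footer']
```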

In order to extract text from a PDF, we'll need to use a tool such as the PDF Miner 3k. Look for this package at https://pypi.python.org/pypi/pdfminer3k/1.3.0.

An alternative is the pdf package. You can look at https://pypi.python.org/pypi/PDF/1.0 for the package.

In Chapter 4, Dredging up History, we'll look at the kinds of algorithms that we'll need to write in order to extract useful content from PDF files.

However, for now, we need to install this package in order to be sure that we can process PDF files. We'll use sudo -H pip3.4 install pdfminer3k to do the installation. The output looks as shown in the following:

MacBookPro-SLott:Code slott$ sudo -H pip3.4 install pdfminer3k
Collecting pdfminer3k
  Downloading pdfminer3k-1.3.0.tar.gz (9.7MB)
    100% |████████████████████████████████| 9.7MB 55kB/s 
Collecting pytest>=2.0 (from pdfminer3k)
  Downloading pytest-2.7.2-py2.py3-none-any.whl (127kB)
    100% |████████████████████████████████| 131kB 385kB/s 
Collecting ply>=3.4 (from pdfminer3k)
  Downloading ply-3.6.tar.gz (281kB)
    100% |████████████████████████████████| 282kB 326kB/s 
Collecting py>=1.4.29 (from pytest>=2.0->pdfminer3k)
  Downloading py-1.4.30-py2.py3-none-any.whl (81kB)
    100% |████████████████████████████████| 86kB 143kB/s 
Installing collected packages: py, pytest, ply, pdfminer3k
  Running setup.py install for ply
  Running setup.py install for pdfminer3k
Successfully installed pdfminer3k-1.3.0 ply-3.6 py-1.4.30 pytest-2.7.2

Windows agents will omit the sudo -H prefix. This is a large and complex installation. The package itself is pretty big (almost 10 MB). It requires additional packages such as pytest and py. It also incorporates ply, which is an interesting tool in its own right.

Interestingly, the documentation for how to use this package can be hard to locate. Here's the link to locate it:

http://www.unixuser.org/~euske/python/pdfminer/index.html.

Note that the documentation is older than the actual package as it says (in red) Python 3 is not supported. However, the pdfminer3k project clearly states that pdfminer3k is a Python 3 port of pdfminer. While the software may have been upgraded, some of the documentation still needs work.

We can learn more about ply here at https://pypi.python.org/pypi/ply/3.6. The lex and yacc summary may not be too helpful for most of the agents. These terms refer to the two classic programs that are widely used to create the tools that support software development.

Sidebar on the ply package

When we work with the Python language, we rarely give much thought to how the Python program actually works. We're mostly interested in results, not the details of how Python language statements lead to useful processing by the Python program. The ply package solves the problem of translating characters to meaningful syntax.

Agents that are interested in the details of how Python works will need to consider the source text that we write. When we write Python code, we're writing a sequence of intermingled keywords, symbols, operators, and punctuation. These various language elements are just sequences of Unicode characters that follow a strict set of rules. One wrong character and we get errors from Python.

There's a two-tier process to translate a .py file of the source text to something that is actionable.

At the lowest tier, an algorithm must do the lexical scanning of our text. A lexical scanner identifies the keywords, symbols, literals, operators, and punctuation marks; the generic term for these various language elements is tokens. A classic program to create lexical scanners is called lex. The lex program uses a set of rules to transform a sequence of characters into a sequence of higher-level tokens.
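We can watch this first tier at work on Python itself. The standard library's tokenize module exposes Python's own lexical scanner. The following sketch uses it rather than lex or ply; the scan() function is a name invented here for illustration.

```python
import io
import tokenize

def scan(source):
    """Return (token type name, token text) pairs for Python source text."""
    # Line-ending and end-of-file tokens are dropped to keep the output short.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]

print(scan("x = 42 + y\n"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NAME', 'y')]
```

Ten characters of source text become five tokens. The whitespace has done its job of separating tokens and is not a token itself.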

The process of compiling Python tokens to useful statements is the second tier. The classic program for this is called yacc (Yet Another Compiler Compiler). The yacc language contains the rules to interpret a sequence of tokens as a valid statement in the language. Along with the rules to parse a target language, the yacc language also contains statements for an action to be taken when a statement is recognized. The yacc program compiles the rules and statements into a new program that we call a compiler.
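Python's second tier is also available from the standard library through the ast module. The sketch below is only an illustration, and the describe() function is a hypothetical helper. It reports which kind of statement the parser recognized, or the error we get when the text breaks the rules.

```python
import ast

def describe(source):
    """Parse one statement and name the kind of statement recognized."""
    try:
        tree = ast.parse(source)
    except SyntaxError as error:
        return "SyntaxError: {}".format(error.msg)
    return type(tree.body[0]).__name__

print(describe("x = 42 + y"))  # a valid assignment statement
print(describe("x = 42 +"))    # the final operand is missing
```

The first call reports an Assign statement. The second shows what happens when the token sequence doesn't match any rule: the exact wording of the message varies between Python versions.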

The ply Python package implements both tiers. We can use it to define a lexical scanner and a parser that is based on the classic lex and yacc concepts. Software developers will use a tool such as ply to process statements in a well-defined formal language.
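To show how the two tiers fit together for a language of our own, here is a hand-written sketch for a tiny arithmetic language. It does not use ply's API; it only mimics the idea of lexical rules feeding grammar rules that carry actions. Every name in it (TOKEN_RULES, scan(), parse_expr(), evaluate(), and so on) is invented for this illustration.

```python
import re

# Tier 1: lexical rules, each a (token name, regular expression) pair.
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"\s+"),
]
SCANNER = re.compile("|".join("(?P<%s>%s)" % rule for rule in TOKEN_RULES))

def scan(text):
    """Turn characters into (name, text) tokens, dropping whitespace."""
    tokens, position = [], 0
    while position < len(text):
        match = SCANNER.match(text, position)
        if match is None:
            raise SyntaxError("unexpected character %r" % text[position])
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens

# Tier 2: grammar rules, each paired with an action that computes a value.
#   expr : term (PLUS term)*
#   term : NUMBER (TIMES NUMBER)*
def parse_number(tokens):
    if not tokens or tokens[0][0] != "NUMBER":
        raise SyntaxError("expected a number")
    return int(tokens.pop(0)[1])       # action: convert text to an int

def parse_term(tokens):
    value = parse_number(tokens)
    while tokens and tokens[0][0] == "TIMES":
        tokens.pop(0)
        value *= parse_number(tokens)  # action: multiply
    return value

def parse_expr(tokens):
    value = parse_term(tokens)
    while tokens and tokens[0][0] == "PLUS":
        tokens.pop(0)
        value += parse_term(tokens)    # action: add
    return value

def evaluate(text):
    """Scan, then parse; any leftover token means the text broke the rules."""
    tokens = scan(text)
    value = parse_expr(tokens)
    if tokens:
        raise SyntaxError("unexpected token %r" % tokens[0][1])
    return value

print(evaluate("2 + 3 * 4"))  # 14
```

A tool such as ply generates this kind of scanner and parser from the rules, so we don't have to write the loops by hand.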

Building our own gadgets

Sometimes, we need to move beyond the data that is readily available on computers. We might need to build our own devices for espionage. There are a number of handy platforms that we can use to build our own sensors and gadgets. These are all single-board computers. These computers have a few high-level interfaces, often USB-based, along with a lot of low-level interfaces that allow us to create simple and interactive devices.

To work with these, we'll create software on a larger computer, such as a laptop or desktop system. We'll upload our software to our single-board computer and experiment with the gadget that we're building.

There are a variety of these single-board computers. Two popular choices are the Raspberry Pi and the Arduino. One of the notable differences between these devices is that a Raspberry Pi runs a small GNU/Linux operating system, whereas an Arduino doesn't offer much in the way of OS features.

Both devices allow us to create simple, interactive devices. There are ways to run Python on Raspberry Pi using the RPi GPIO module. Our gadget development needs to focus on Arduino as there is a rich variety of hardware that we can use. We can find small, robust Arduinos that are suitable for harsh environments.

A simple Arduino Uno isn't the only thing that we'll need. We'll also need some sensors and wires. We'll save the detailed shopping list for Chapter 5, Data Collection Gadgets. At this point, we're only interested in software tools.

Getting the Arduino IDE

To work with Arduino, we'll need to download the Arduino Integrated Development Environment (IDE). This will allow us to write programs in the Arduino language, upload them to our Arduino, and do some basic debugging. An Arduino program is called a sketch.

We'll need to get the Arduino IDE from https://www.arduino.cc/en/Main/Software. On the right-hand side of this web page, we can pick the OS for our working computer and download the proper Arduino tool set. Some agents prefer the idea of making a contribution to the Arduino foundation. However, it's possible to download the IDE without making a contribution.

For Mac OS X, the download will be a .ZIP file. This will unpack into the IDE application; we can copy this to our Applications folder and we're ready to go.

For Windows agents, we can download a .MSI file that will do the complete installation. This is preferred for computers where we have full administrative access. In some cases, where we may not have administrative rights, we'll need to download the .ZIP file, which we can unpack in a C:\Arduino directory.

We can open the Arduino application to see an initial sketch. The screen looks similar to the following screenshot:

[Screenshot: Getting the Arduino IDE]

The sketch name will be based on the date on which you run the application. Also, the communications port shown in the lower right-hand corner may change, depending on whether your Arduino is plugged in.

We don't want to do anything more than be sure that the Arduino IDE program runs. Once we see that things are working, we can quit the IDE application.

An alternative is the Fritzing application. Refer to http://www.fritzing.org for more information. We can use this software to create engineering diagrams and lists of parts for a particular gadget. In some cases, we can also use this to save software sketches that are associated with a gadget. The Arduino IDE is used by the Fritzing tool. Go to http://fritzing.org/download/ to download Fritzing.

Getting a Python serial interface

In many cases, we'll want to have a more complex interaction between a desktop computer and an Arduino-based sensor. This will often lead to using the USB devices on our computer from our Python applications. If we want to interact directly with an Arduino (or other single-board computer) from Python, we'll need PySerial. An alternative is the USPP (Universal Serial Port Python) library. This allows us to communicate without having the Arduino IDE running on our computer. It also allows us to separate our data gathering from our software development.

For PySerial, refer to https://pypi.python.org/pypi/pyserial/2.7. We can install this with sudo -H pip3.4 install pyserial.

Here's how the installation looks:

MacBookPro-SLott:Code slott$ sudo -H pip3.4 install pyserial
Password:
Collecting pyserial
  Downloading pyserial-2.7.tar.gz (122kB)
    100% |████████████████████████████████| 122kB 1.5MB/s 
Installing collected packages: pyserial
  Running setup.py install for pyserial
Successfully installed pyserial-2.7

Windows agents will omit the sudo -H prefix. This command has downloaded and installed the small PySerial module.

We can leverage this to communicate with an Arduino (or any other device) through a USB port. We'll look at the interaction in Chapter 5, Data Collection Gadgets.
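As a preview, here is a minimal sketch of what reading from a device might look like. It is not the code from Chapter 5. It assumes an Arduino sketch that prints one number per line at 9600 baud, which is only a guess at a typical setup. The parse_reading() and read_readings() functions are names invented here, and the port name is something each agent must look up on their own computer.

```python
def parse_reading(raw):
    """Convert one line of bytes from the device into a float, or None."""
    text = raw.decode("ascii", errors="replace").strip()
    try:
        return float(text)
    except ValueError:
        return None  # an empty, incomplete, or garbled line

def read_readings(port, count=5, baud=9600):
    """Collect up to `count` numeric readings from a serial port."""
    import serial  # PySerial; imported here so parse_reading() works without it
    readings = []
    device = serial.Serial(port, baud, timeout=2)
    try:
        for _ in range(count):
            # readline() returns b"" if the timeout expires with no data.
            value = parse_reading(device.readline())
            if value is not None:
                readings.append(value)
    finally:
        device.close()
    return readings

print(parse_reading(b"23.5\r\n"))  # 23.5
```

With a device plugged in, read_readings() would be called with the port name shown in the lower right-hand corner of the Arduino IDE window. Keeping the parsing separate from the port handling lets us test the parsing without any hardware attached.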
