Essential Scripting and File Information Recipes

The following recipes are covered in this chapter:

Handling arguments like an adult
Iterating over loose files
Recording file attributes
Copying files, attributes, and timestamps
Hashing files and data streams
Keeping track with a progress bar
Logging results
Multiple hands make light work

Introduction

Digital forensics involves the identification and analysis of digital media to assist in legal, business, and other types of investigations. Oftentimes, results stemming from our analysis have a major impact on the direction of an investigation. With Moore’s law more or less holding true, the amount of data we are expected to review is steadily growing. Given this, it’s a foregone conclusion that an investigator must rely on some level of automation to effectively review evidence. Automation, much like a theory, must be thoroughly vetted and validated so as not to allow for falsely drawn conclusions. Unfortunately, investigators may use a tool to automate some process but not fully understand the tool, the underlying forensic artifact, or the output’s significance. This is where Python comes into play.

In Python Digital Forensics Cookbook, we develop and detail recipes covering a number of typical scenarios. The purpose is to not only demonstrate Python features and libraries for those learning the language but to also illustrate one of its great benefits: namely, a forced basic understanding of the artifact. Without this understanding, it is impossible to develop the code in the first place, thereby forcing you to understand the artifact at a deeper level. Add to that the relative ease of Python and the obvious benefits of automation, and it is easy to see why this language has been adapted so readily by the community.

One method of ensuring that investigators understand the product of our scripts is to provide meaningful documentation and explanation of the code. Hence the purpose of this book. The recipes demonstrated throughout show how to configure argument parsing that is both easy to develop and simple for the user to understand. To add to the script's documentation, we will cover techniques to effectively log the process that was taken and any errors encountered by the script.

Another unique feature of scripts designed for digital forensics is the interaction with files and their associated metadata. Forensic scripts and applications require the accurate retrieval and preservation of file attributes, including dates, permissions, and file hashes. This chapter will cover methods to extract and present this data to the examiner.

Interaction with the operating system and files found on attached volumes are at the core of any script designed for use in digital forensics. During analysis, we need to access and parse files with a wide variety of structures and formats. For this reason, it's important to accurately and properly handle and interact with files. The recipes presented in this chapter cover common libraries and techniques that will continue to be used throughout the book:

Parsing command-line arguments
Recursively iterating over files and folders
Recording and preserving file and folder metadata
Generating hash values of files and other content
Monitoring code with progress bars
Logging recipe execution information and errors
Improving performance with multiprocessing

Visit www.packtpub.com/books/content/support to download the code bundle for this chapter.

Handling arguments like an adult

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Any

Person A: I came here for a good argument!
Person B: Ah, no you didn't, you came here for an argument!
Person A: An argument isn't just contradiction.
Person B: Well! it can be!
Person A: No it can't! An argument is a connected series of statements
intended to establish a proposition.
Person B: No it isn't!
Person A: Yes it is! It isn't just contradiction.

Monty Python (http://www.montypython.net/scripts/argument.php) aside, arguments are an integral part of any script. Arguments allow us to provide an interface for users to specify options and configurations that change the way the code behaves. Effective use of arguments, not just contradictions, can make a tool more versatile and a favorite among examiners.

Getting started

All libraries used in this script are present in Python's standard library. While there are other argument-handling libraries available, such as optparse and ConfigParser, our scripts will leverage argparse as our de facto command-line handler. While optparse was the library to use in prior versions of Python, argparse has served as the replacement for creating argument handling code. The ConfigParser library parses arguments from a configuration file instead of the command line. This is useful for code that requires a large number of arguments or has a significant number of options. We will not cover ConfigParser in this book, though it is worth exploring if you find your argparse configuration becomes difficult to maintain.

To learn more about the argparse library, visit https://docs.python.org/3/library/argparse.html.

How to do it…

In this script, we perform the following steps:

Create positional and optional arguments.
Add descriptions to arguments.
Configure arguments with select choices.

How it works…

To begin, we import print_function and the argparse module. By importing the print_function from the __future__ library we can write print statements as they are written in Python 3.X but still run them in Python 2.X. This allows us to make recipes compatible with both Python 2.X and 3.X. Where possible, we carry this through with most recipes in the book.

After creating a few descriptive variables about the recipe, we initialize our ArgumentParser instance. Within the constructor, we define the description and epilog keyword arguments. This data will display when the user specifies the -h argument and can give the user additional context about the script being run. The argparse library is very flexible and can scale in complexity if required for a script. Throughout this book, we cover many of the library's different features, which are detailed on its document page:

from __future__ import print_function
import argparse

__authors__ = ["Chapin Bryce", "Preston Miller"]
__date__ = 20170815
__description__ = 'A simple argparse example'

parser = argparse.ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(
        ", ".join(__authors__), __date__)
)

With the parser instance created, we can now begin adding arguments to our command-line handler. There are two types of arguments: positional and optional. Positional arguments start with an alphabetic character, unlike optional arguments, which start with a dash, and are required to execute the script. Optional arguments start with a single or double dash character and are non-positional (that is, the order does not matter). These characteristics can be manually specified to overwrite the default behavior we’ve described if desired. The following code block illustrates how to create two positional arguments:

# Add Positional Arguments
parser.add_argument("INPUT_FILE", help="Path to input file")
parser.add_argument("OUTPUT_FILE", help="Path to output file")

In addition to changing whether an argument is required, we can specify help information, create default values, and other actions. The help parameter is useful in conveying what the user should provide. Other important parameters are default, type, choices, and action. The default parameter allows us to set a default value, while type converts the type of the input, which is a string by default, to the specified Python object type. The choices parameter uses a defined list, dictionary, or set to create valid options the user can select from.
The action parameter specifies the type of action that should be applied to a given argument. Some common actions include store, which is the default and stores the passed value associated with the argument; store_true, which assigns True to the argument; and version, which prints the version of the code specified by the version parameter:

# Optional Arguments
parser.add_argument("--hash", help="Hash the files", action="store_true")

parser.add_argument("--hash-algorithm",
                    help="Hash algorithm to use. ie md5, sha1, sha256",
                    choices=['md5', 'sha1', 'sha256'], default="sha256"
                    )

parser.add_argument("-v", "--version", "--script-version",
                    help="Displays script version information",
                    action="version", version=str(__date__)
                    )

parser.add_argument('-l', '--log', help="Path to log file", required=True)

With our arguments defined and configured, we can now parse them and use the provided inputs in our code. The following snippet shows how we can access the values and test whether the user specified an optional argument. Notice how we refer to arguments by the name we assign them. If we specify a short and long argument name, we must use the long name:

# Parsing and using the arguments
args = parser.parse_args()

input_file = args.INPUT_FILE
output_file = args.OUTPUT_FILE

if args.hash:
    ha = args.hash_algorithm
    print("File hashing enabled with {} algorithm".format(ha))
if not args.log:
    print("Log file not defined. Will write to stdout")

When combined into a script and executed at the command line with the -h argument, the preceding code will provide the following output:

As seen here, the -h flag displays the script help information, automatically created by argparse, along with the valid options for the --hash-algorithm argument. We can also use the -v option to display the version information. The --script-version argument displays the version in the same manner as the -v or -version arguments as shown here:

The following screenshot shows the message printed to the console when we select one of our valid hashing algorithms:

There's more…

This script can be further improved. We have provided a couple of recommendations here:

Explore additional argparse functionality. For example, the argparse.FileType object can be used to accept a File object as an input.
We can also use the argparse.ArgumentDefaultsHelpFormatter class to show defaults we set to the user. This is helpful when combined with optional arguments to show the user what will be used if nothing is specified.

Iterating over loose files

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Any

Often it is necessary to iterate over a directory and its subdirectories to recursively process all files. In this recipe, we will illustrate how to use Python to walk through directories and access files within them. Understanding how you can recursively navigate a given input directory is key as we frequently perform this exercise in our scripts.

Getting started

All libraries used in this script are present in Python's standard library. The preferred library, in most situations, for handling file and folder iteration is the built-in os library. While this library supports many useful operations, we will focus on the os.path() and os.walk() functions. Let’s use the following folder hierarchy as an example to demonstrate how directory iteration works in Python:

SecretDocs/
|-- key.txt
|-- Plans
|   |-- plans_0012b.txt
|   |-- plans_0016.txt
|   `-- Successful_Plans
|       |-- plan_0001.txt
|       |-- plan_0427.txt
|       `-- plan_0630.txt
|-- Spreadsheets
|   |-- costs.csv
|   `-- profit.csv
`-- Team
    |-- Contact18.vcf
    |-- Contact1.vcf
    `-- Contact6.vcf

4 directories, 11 files

How to do it…

The following steps are performed in this recipe:

Create a positional argument for the input directory to scan.
Iterate over all subdirectories and print file paths to the console.

How it works…

We create a very basic argument handler that accepts one positional input, DIR_PATH, the path of the input directory to iterate. As an example, we will use the ~/Desktop path, the parent of SecretDocs, as the input argument for the script. We parse the command-line arguments and assign the input directory to a local variable. We’re now ready to begin iterating over this input directory:

from __future__ import print_function
import argparse
import os

__authors__ = ["Chapin Bryce", "Preston Miller"]
__date__ = 20170815
__description__ = "Directory tree walker"

parser = argparse.ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(
        ", ".join(__authors__), __date__)
)
parser.add_argument("DIR_PATH", help="Path to directory")
args = parser.parse_args()
path_to_scan = args.DIR_PATH

To iterate over a directory, we need to provide a string representing its path to os.walk(). This method returns three objects in each iteration, which we have captured in the root, directories, and files variables:

root: This value provides the relative path to the current directory as a string. Using the example directory structure, root would start as SecretDocs and eventually become SecretDocs/Team and SecretDocs/Plans/SuccessfulPlans.
directories: This value is a list of sub-directories located within the current root location. We can iterate through this list of directories, although the entries in this list will become part of the root value during successive os.walk() calls. For this reason, the value is not frequently used.
files: This value is a list of files in the current root location.

Be careful in naming the directory and file variables. In Python the dir and file names are reserved for other uses and should not be used as variable names.

# Iterate over the path_to_scan
for root, directories, files in os.walk(path_to_scan):

It is common to create a second for loop, as shown in the following code, to step through each of the files located in that directory and perform some action on them. Using the os.path.join() method, we can join the root and file_entry variables to obtain the file’s path. We then print this file path to the console. We may also, for example, append this file path to a list that we later iterate over to process each of the files:

    # Iterate over the files in the current "root"
    for file_entry in files:
        # create the relative path to the file
        file_path = os.path.join(root, file_entry)
        print(file_path)

We can also use root + os.sep() + file_entry to achieve the same effect, but it is not as Pythonic as the method we're using to join paths. Using os.path.join(), we can pass two or more strings to form a single path, such as directories, subdirectories, and files.

When we run the preceding script with our example input directory, we see the following output:

As seen, the os.walk() method iterates through a directory, then will descend into any discovered sub-directories, thereby scanning the entire directory tree.

There's more…

This script can be further improved. Here's a recommendation:

Check out and implement similar functionality using the glob library which, unlike the os module, allows for wildcard pattern recursive searches for files and directories

Recording file attributes

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Any

Now that we can iterate over files and folders, let’s learn to record metadata about these objects. File metadata plays an important role in forensics, as collecting and reviewing this information is a basic task during most investigations. Using a single Python library, we can gather some of the most important attributes of files across platforms.

Getting started

All libraries used in this script are present in Python’s standard library. The os library, once again, can be used here to gather file metadata. One of the most helpful methods for gathering file metadata is the os.stat() function. It's important to note that the stat() call only provides information available with the current operating system and the filesystem of the mounted volume. Most forensic suites allow an examiner to mount a forensic image as a volume on a system and generally preserve the file attributes available to the stat call. In Chapter 8, Working with Forensic Evidence Containers Recipes, we will demonstrate how to open forensic acquisitions to directly extract file information.

To learn more about the os library, visit https://docs.python.org/3/library/os.html.

How to do it…

We will record file attributes using the following steps:

Obtain the input file to process.
Print various metadata: MAC times, file size, group and owner ID, and so on.

How it works…

To begin, we import the required libraries: argparse for argument handling, datetime for interpretation of timestamps, and os to access the stat() method. The sys module is used to identify the platform (operating system) the script is running on. Next, we create our command-line handler, which accepts one argument, FILE_PATH, a string representing the path to the file we will extract metadata from. We assign this input to a local variable before continuing execution of the script:

from __future__ import print_function
import argparse
from datetime import datetime as dt
import os
import sys

__authors__ = ["Chapin Bryce", "Preston Miller"]
__date__ = 20170815
__description__ = "Gather filesystem metadata of provided file"

parser = argparse.ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(", ".join(__authors__), __date__)
)
parser.add_argument("FILE_PATH",
                    help="Path to file to gather metadata for")
args = parser.parse_args()
file_path = args.FILE_PATH

Timestamps are one of the most common file metadata attributes collected. We can access the creation, modification, and access timestamps using the os.stat() method. The timestamps are returned as a float representing the seconds since 1970-01-01. Using the datetime.fromtimestamp() method, we convert this value into a readable format.

The os.stat() module interprets timestamps differently depending on the platform. For example, the st_ctime value on Windows displays the file's creation time, while on macOS and UNIX this same attribute displays the last modification of the file's metadata, similar to the NTFS entry modified time. This is not the only part of os.stat() that varies by platform, though the remainder of this recipe uses items that are common across platforms.

stat_info = os.stat(file_path)
if "linux" in sys.platform or "darwin" in sys.platform:
    print("Change time: ", dt.fromtimestamp(stat_info.st_ctime))
elif "win" in sys.platform:
    print("Creation time: ", dt.fromtimestamp(stat_info.st_ctime))
else:
    print("[-] Unsupported platform {} detected. Cannot interpret "
          "creation/change timestamp.".format(sys.platform)
          )
print("Modification time: ", dt.fromtimestamp(stat_info.st_mtime))
print("Access time: ", dt.fromtimestamp(stat_info.st_atime))

We continue printing file metadata following the timestamps. The file mode and inode properties return the file permissions and inode as an integer, respectively. The device ID refers to the device the file resides on. We can convert this integer into major and minor device identifiers using the os.major() and os.minor() methods:

print("File mode: ", stat_info.st_mode)
print("File inode: ", stat_info.st_ino)
major = os.major(stat_info.st_dev)
minor = os.minor(stat_info.st_dev)
print("Device ID: ", stat_info.st_dev)
print("\tMajor: ", major)
print("\tMinor: ", minor)

The st_nlink property returns a count of the number of hard links to the file. We can print the owner and group information using the st_uid and st_gid properties, respectively. Lastly, we can gather file size using st_size, which returns an integer representing the file's size in bytes.

Be aware that if the file is a symbolic link, the st_size property reflects the length of the path to the target file rather than the target file’s size.

print("Number of hard links: ", stat_info.st_nlink)
print("Owner User ID: ", stat_info.st_uid)
print("Group ID: ", stat_info.st_gid)
print("File Size: ", stat_info.st_size)

But wait, that’s not all! We can use the os.path() module to extract a few more pieces of metadata. For example, we can use it to determine whether a file is a symbolic link, as shown below with the os.islink() method. With this, we could alert the user if the st_size attribute is not equivalent to the target file's size. The os.path() module can also gather the absolute path, check whether it exists, and get the parent directory. We can also gather the parent directory using the os.path.dirname() function or by accessing the first element of the os.path.split() function. The split() method is more commonly used to acquire the filename from a path:

# Gather other properties
print("Is a symlink: ", os.path.islink(file_path))
print("Absolute Path: ", os.path.abspath(file_path))
print("File exists: ", os.path.exists(file_path))
print("Parent directory: ", os.path.dirname(file_path))
print("Parent directory: {} | File name: {}".format(
    *os.path.split(file_path)))

By running the script, we can relevant metadata about the file. Notice how the format() method allows us to print values without concern for their data types. Normally, we would have to convert integers and other data types to strings first if we were to try printing the variable directly without string formatting:

There's more…

This script can be further improved. We have provided a couple of recommendations here:

Integrate this recipe with the Iterating over loose files recipe to recursively extract metadata for files in a given series of directories
Implement logic to filter by file extension, date modified, or even file size to only collect metadata information on files matching the desired criteria

Copying files, attributes, and timestamps

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Windows

Preserving files is a fundamental task in digital forensics. It is often preferable to containerize files in a format that can store hashes and other metadata of loose files. However, sometimes we need to copy files in a forensic manner from one location to another. Using this recipe, we will demonstrate some of the methods available to copy files while preserving common metadata fields.

Getting started

This recipe requires the installation of two third-party modules pywin32 and pytz. All other libraries used in this script are present in Python's standard library. This recipe will primarily use two libraries, the built-in shutil and a third-party library, pywin32. The shutil library is our go-to for copying files within Python, and we can use it to preserve most of the timestamps and other file attributes. The shutil module, however, is unable to preserve the creation time of files it copies. Rather, we must rely on the Windows-specific pywin32 library to preserve it. While the pywin32 library is platform specific, it is incredibly useful to interact with the Windows operating system.

To learn more about the shutil library, visit https://docs.python.org/3/library/shutil.html.

To install pywin32, we need to access its SourceForge page at https://sourceforge.net/projects/pywin32/ and download the version that matches our Python installation. To check our Python version, we can import the sys module and call sys.version within an interpreter. Both the version and the architecture are important when selecting the correct pywin32 installer.

To learn more about the sys library, visit https://docs.python.org/3/library/sys.html.

In addition to the installation of the pywin32 library, we need to install pytz, a third-party library used to manage time zones in Python. We can install this library using the pip command:

pip install pytz==2017.2

How to do it…

We perform the following steps to forensically copy files on a Windows system:

Gather source file and destination arguments.
Use shutil to copy and preserve most file metadata.
Manually set timestamp attributes with win32file.

How it works…

Let’s now dive into copying files and preserving their attributes and timestamps. We use some familiar libraries to assist us in the execution of this recipe. Some of the libraries, such as pytz, win32file, and pywintypes are new. Let’s briefly discuss their purpose here. The pytz module allows us to work with time zones more granularly and allows us to initialize dates for the pywin32 library.

To allow us to pass timestamps in the correct format, we must also import pywintypes. Lastly, the win32file library, available through our installation of pywin32, provides various methods and constants for file manipulation in Windows:

from __future__ import print_function
import argparse
from datetime import datetime as dt
import os
import pytz
from pywintypes import Time
import shutil
from win32file import SetFileTime, CreateFile, CloseHandle
from win32file import GENERIC_WRITE, FILE_SHARE_WRITE
from win32file import OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL

__authors__ = ["Chapin Bryce", "Preston Miller"]
__date__ = 20170815
__description__ = "Gather filesystem metadata of provided file"

This recipe's command-line handler takes two positional arguments, source and dest, which represent the source file to copy and the output directory, respectively. This recipe has an optional argument, timezone, which allows the user to specify a time zone.

To prepare the source file, we store the absolute path and split the filename from the rest of the path, which we may need to use later if the destination is a directory. Our last bit of preparation involves reading the timezone input from the user, one of the four common US time zones, and UTC. This allows us to initialize the pytz time zone object for later use in the recipe:

parser = argparse.ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(
        ", ".join(__authors__), __date__)
)
parser.add_argument("source", help="Source file")
parser.add_argument("dest", help="Destination directory or file")
parser.add_argument("--timezone", help="Timezone of the file's timestamp",
                    choices=['EST5EDT', 'CST6CDT', 'MST7MDT', 'PST8PDT'],
                    required=True)
args = parser.parse_args()

source = os.path.abspath(args.source)
if os.sep in args.source:
    src_file_name = args.source.split(os.sep, 1)[1]
else:
    src_file_name = args.source

dest = os.path.abspath(args.dest)
tz = pytz.timezone(args.timezone)

At this point, we can copy the source file to the destination using the shutil.copy2() method. This method accepts either a directory or file as the destination. The major difference between the shutil copy() and copy2() methods is that the copy2() method also preserves file attributes, including the last written time and permissions. This method does not preserve file creation times on Windows, for that we need to leverage the pywin32 bindings.

To that end, we must build the destination path for the file copied by the copy2() call by using the following if statement to join the correct path if the user provided a directory at the command line:

shutil.copy2(source, dest)
if os.path.isdir(dest):
    dest_file = os.path.join(dest, src_file_name)
else:
    dest_file = dest

Next, we prepare the timestamps for the pywin32 library. We use the os.path.getctime() methods to gather the respective Windows creation times, and convert the integer value into a date using the datetime.fromtimestamp() method. With our datetime object ready, we can make the value time zone-aware by using the specified timezone and providing it to the pywintype.Time() function before printing the timestamps to the console:

created = dt.fromtimestamp(os.path.getctime(source))
created = Time(tz.localize(created))
modified = dt.fromtimestamp(os.path.getmtime(source))
modified = Time(tz.localize(modified))
accessed = dt.fromtimestamp(os.path.getatime(source))
accessed = Time(tz.localize(accessed))

print("Source\n======")
print("Created: {}\nModified: {}\nAccessed: {}".format(
    created, modified, accessed))

With the preparation complete, we can open the file with the CreateFile() method and pass the string path, representing the copied file, followed by arguments specified by the Windows API for accessing the file. Details of these arguments and their meanings can be reviewed at https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx:

handle = CreateFile(dest_file, GENERIC_WRITE, FILE_SHARE_WRITE,
                    None, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, None)
SetFileTime(handle, created, accessed, modified)
CloseHandle(handle)

Once we have an open file handle, we can call the SetFileTime() function to update, in order, the file's created, accessed, and modified timestamps. With the destination file's timestamps set, we need to close the file handle using the CloseHandle() method. To confirm to the user that the copying of the file's timestamps was successful, we print the destination file's created, modified, and accessed times:

created = tz.localize(dt.fromtimestamp(os.path.getctime(dest_file)))
modified = tz.localize(dt.fromtimestamp(os.path.getmtime(dest_file)))
accessed = tz.localize(dt.fromtimestamp(os.path.getatime(dest_file)))
print("\nDestination\n===========")
print("Created: {}\nModified: {}\nAccessed: {}".format(
    created, modified, accessed))

The script output shows copying a file from the source to the destination with timestamps successfully preserved:

There's more…

This script can be further improved. We have provided a couple of recommendations here:

Hash the source and destination files to ensure they were copied successfully. Hashing files are introduced in the hashing files and data streams recipe in the next section.
Output a log of the files copied and any exceptions encountered during the copying process.

Hashing files and data streams

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Any

File hashes are a widely accepted identifier for determining file integrity and authenticity. While some algorithms have become vulnerable to collision attacks, the process is still important in the field. In this recipe, we will cover the process of hashing a string of characters and a stream of file content.

Getting started

All libraries used in this script are present in Python’s standard library. For generating hashes of files and other data sources, we implement the hashlib library. This built-in library has support for common algorithms, such as MD5, SHA-1, SHA-256, and more. As of the writing of this book, many tools still leverage the MD5 and SHA-1 algorithms, though the current recommendation is to use SHA-256 at a minimum. Alternatively, one could use multiple hashes of a file to further decrease the odds of a hash collision. While we'll showcase a few of these algorithms, there are other, less commonly used, algorithms available.

To learn more about the hashlib library, visit https://docs.python.org/3/library/hashlib.html.

How to do it…

We hash files with the following steps:

Print hashed filename using the specified input file and algorithm.
Print hashed file data using the specified input file and algorithm.

How it works…

To begin, we must import hashlib as shown in the following. For ease of use, we have defined a dictionary of algorithms that our script can use: MD5, SHA-1, SHA-256 and SHA-512. By updating this dictionary, we can support other hash functions that have update() and hexdigest() methods, including some from libraries other than hashlib:

from __future__ import print_function
import argparse
import hashlib
import os

__authors__ = ["Chapin Bryce", "Preston Miller"]
__date__ = 20170815
__description__ = "Script to hash a file's name and contents"

available_algorithms = {
    "md5": hashlib.md5,
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "sha512": hashlib.sha512
}

parser = argparse.ArgumentParser(
    description=__description__,
    epilog="Developed by {} on {}".format(", ".join(__authors__), __date__)
)
parser.add_argument("FILE_NAME", help="Path of file to hash")
parser.add_argument("ALGORITHM", help="Hash algorithm to use",
                    choices=sorted(available_algorithms.keys()))
args = parser.parse_args()

input_file = args.FILE_NAME
hash_alg = args.ALGORITHM

Notice how we define our hashing algorithm object using our dictionary and the argument provided at the command line, followed by open and close parentheses to initiate the object. This provides additional flexibility when adding new hashing algorithms.

With our hash algorithms defined, we now can hash the file's absolute path, a similar method employed during file naming for iTunes backups of an iOS device, by passing the string into the update() method. When we are ready to display the hex value of the calculated hash, we can call the hexdigest() method on our file_name object:

file_name = available_algorithms[hash_alg]()
abs_path = os.path.abspath(input_file)
file_name.update(abs_path.encode())

print("The {} of the filename is: {}".format(
    hash_alg, file_name.hexdigest()))

Let's move onto opening the file and hashing its contents. While we can read the entire file and pass it to the hash function, not all files are small enough to fit in memory. To ensure our code works on larger files, we will use the technique in the following example to read a file in a piecemeal fashion and hash it in chunks.

By opening the file as rb, we will ensure that we are reading the binary contents of the file, not the string content that may exist. With the file open, we will define the buffer size to read in content and then read the first chunk of data in.

Entering a while loop, we will update our hashing object with the new content for as long as there is content in the file. This is possible as the read() method allows us to pass an integer of the number of bytes to read and, if the integer is larger than the number of bytes remaining in the file, will simply pass us the remaining bytes.

Once the entire file is read, we call the hexdigest() method of our object to display the file hash to the examiner:

file_content = available_algorithms[hash_alg]()
with open(input_file, 'rb') as open_file:
    buff_size = 1024
    buff = open_file.read(buff_size)

    while buff:
        file_content.update(buff)
        buff = open_file.read(buff_size)

print("The {} of the content is: {}".format(
    hash_alg, file_content.hexdigest()))

When we execute the code, we see the output from the two print statements revealing the hash value of the file's absolute path and content. We can generate additional hashes for the file by changing the algorithm at the command line:

There's more…

This script can be further improved. Here's a recommendation:

Add support for additional hashing algorithms and create the appropriate entry within the available_algorithms global variable

Keeping track with a progress bar

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Any

Long-running scripts are unfortunately commonplace when processing data measured in gigabytes or terabytes. While your script may be processing this data smoothly, a user may think it's frozen after three hours with no indication of progress. Luckily, several developers have built an incredibly simple progress bar library, giving us little excuse for not incorporating this into our code.

Getting started

This recipe requires the installation of the third-party module tqdm. All other libraries used in this script are present in Python's standard library. The tqdm library, pronounced taqadum, can be installed via pip or downloaded from GitHub at https://github.com/tqdm/tqdm. To use all of the features shown in this recipe, ensure you are using release 4.11.2, available on the tqdm GitHub page or with pip using the following command:

pip install tqdm==4.11.2

How to do it…

To create a simple progress bar, we follow these steps:

Import tqdm and time.
Create multiple examples with tqdm and loops.

How it works…

As with all other recipes, we begin with the imports. While we only need the tqdm import to enable the progress bars, we will use the time module to slow down our script to better visualize the progress bar. We use a list of fruits as our sample data and identify which fruits containing "berry" or "berries" in their name:

from __future__ import print_function
from time import sleep
import tqdm

fruits = [
    "Acai", "Apple", "Apricots", "Avocado", "Banana", "Blackberry",
    "Blueberries", "Cherries", "Coconut", "Cranberry", "Cucumber",
    "Durian", "Fig", "Grapefruit", "Grapes", "Kiwi", "Lemon", "Lime",
    "Mango", "Melon", "Orange", "Papaya", "Peach", "Pear", "Pineapple",
    "Pomegranate", "Raspberries", "Strawberries", "Watermelon"
]

The following for loop is very straightforward and iterates through our list of fruits, checking for the substring berr is within the fruit's name before sleeping for one-tenth of a second. By wrapping the tqdm() method around the iterator, we automatically have a nice-looking progress bar giving us the percentage complete, elapsed time, remaining time, the number of iterations complete, and total iterations.

These display options are the defaults for tqdm and gather all of the necessary information using properties of our list object. For example, the library knows almost all of these details for the progress bar just by gathering the length and calculating the rest based on the amount of time per iteration and the number elapsed:

contains_berry = 0
for fruit in tqdm.tqdm(fruits):
    if "berr" in fruit.lower():
        contains_berry += 1
    sleep(.1)
print("{} fruit names contain 'berry' or 'berries'".format(contains_berry))

Extending the progress bar beyond the default configuration is as easy as specifying keyword arguments. The progress bar object can also be created prior to the start of the loop and using the list object, fruits, as the iterable argument. The following code exhibits how we can define our progress bar with our list, a description, and providing the unit name.

If we were not using a list but another iterator type that does not have a __len__ attribute defined, we would need to manually supply a total with the total keyword. Only basic statistics about elapsed time and iterations per second display if the total number of iterations is unavailable.

Once we are in the loop, we can display the number of results discovered using the set_postfix() method. Each iteration will provide an update of the number of hits we have found to the right of the progress bar:

contains_berry = 0
pbar = tqdm.tqdm(fruits, desc="Reviewing names", unit="fruits")
for fruit in pbar:
    if "berr" in fruit.lower():
        contains_berry += 1
    pbar.set_postfix(hits=contains_berry)
    sleep(.1)
print("{} fruit names contain 'berry' or 'berries'".format(contains_berry))

One other common use case for progress bars is to measure execution in a range of integers. Since this is a common use of the library the developers built a range call into the library, called trange(). Notice how we can specify the same arguments here as before. One new argument that we will use here, due to the larger numbers, is the unit_scale argument, which simplifies large numbers into a small number with a letter to designate the magnitude:

for i in tqdm.trange(10000000, unit_scale=True, desc="Trange: "):
    pass

When we execute the code, the following output is visible. Our first progress bar displays the default format, while the second and third show the customizations we have added:

There's more…

This script can be further improved. Here's a recommendation:

Further explore the capabilities the tqdm library affords developers. Consider using the tqdm.write() method to print status messages without breaking the progress bar.

Logging results

Recipe Difficulty: Easy

Python Version: 2.7 or 3.5

Operating System: Any

Outside of progress bars, we generally need to provide messages to the user to describe any exceptions, errors, warnings, or other information that has occurred during execution. With logging, we can provide this information at execution and in a text file for future reference.

Getting started

All libraries used in this script are present in Python’s standard library. This recipe will use the built-in logging library to generate status messages to the console and a text file.

To learn more about the logging library, visit https://docs.python.org/3/library/logging.html.

How to do it…

The following steps can be used to effectively log program execution data:

Create a log formatting string.
Log various message types during script execution.

How it works…

Let's now learn to log results. After our imports, we create our logger object by initializing an instance using the script's name represented by the __file__ attribute. With our logging object initiated, we will set the level and specify various formatters and handlers for this script. The formatters provide the flexibility to define what fields will be displayed for each message, including timestamps, function name, and the message level. The format strings follow the standards of Python string formatting, meaning we can specify padding for the following strings:

from __future__ import print_function
import logging
import sys

logger = logging.getLogger(__file__)
logger.setLevel(logging.DEBUG)

msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-20s"
                            "%(levelname)-8s %(message)s")

The handlers allow us to specify where the log message should be recorded, including a log file, standard output (console), or standard error. In the following example, we use the standard output for our stream handler and the script's name with the .log extension for the file handler. Lastly, we register these handlers with our logger object:

strhndl = logging.StreamHandler(sys.stdout)
strhndl.setFormatter(fmt=msg_fmt)

fhndl = logging.FileHandler(__file__ + ".log", mode='a')
fhndl.setFormatter(fmt=msg_fmt)

logger.addHandler(strhndl)
logger.addHandler(fhndl)

The logging library by default uses the following levels in increasing order of severity: NOTSET, DEBUG, INFORMATION, WARNING, ERROR, and CRITICAL. To showcase some of the features of the format string, we will log a few types of messages from functions:

logger.info("information message")
logger.debug("debug message")


def function_one():
    logger.warning("warning message")


def function_two():
    logger.error("error message")


function_one()
function_two()

When we execute this code, we can see the following message information from the invocation of the script. Inspection of the generated log file matches what was recorded in the console:

There’s more…

This script can be further improved. Here's a recommendation:

It is often important to provide as much information as possible to the user in the event of an error in the script or for a user's validation of the process. Therefore, we recommend implementing additional formatters and logging levels. Using the stderr stream is best practice for logging, as we can provide the output at the console while not disrupting stdout.

Multiple hands make light work

Recipe Difficulty: Medium

Python Version: 2.7 or 3.5

Operating System: Any

While Python is known for being single threaded, we can use built-in libraries to spin up new processes to handle tasks. Generally, this is preferred when there are a series of tasks that can be run simultaneously and the processing is not already bound by hardware limits, such as network bandwidth or disk speed.

Getting started

All libraries used in this script are present in Python’s standard library. Using the built-in multiprocessing library, we can handle the majority of situations where we would need multiple processes to efficiently tackle a problem.

To learn more about the multiprocessing library, visit https://docs.python.org/3/library/multiprocessing.html.

How to do it…

With the following steps, we showcase basic multiprocessing support in Python:

Set up a log to record multiprocessing activity.
Append data to a list using multiprocessing.

How it works…

Let's now look at how we can achieve multiprocessing in Python. Our imports include the multiprocessing library, shortened to mp, as it is quite lengthy otherwise; the logging and sys libraries for thread status messages; the time library to slow down execution for our example; and the randint method to generate times that each thread should wait for:

from __future__ import print_function
import logging
import multiprocessing as mp
from random import randint
import sys
import time

Before creating our processes, we set up a function that they will execute. This is where we put the task each process should execute before returning to the main thread. In this case, we take a number of seconds for the thread to sleep as our only argument. To print a status message that allows us to differentiate between the processes, we use the current_process() method to access the name property for each thread:

def sleepy(seconds):
    proc_name = mp.current_process().name
    logger.info("{} is sleeping for {} seconds.".format(
        proc_name, seconds))
    time.sleep(seconds)

With our worker function defined, we create our logger instance, borrowing code from the previous recipe, and set it to only record to the console.

logger = logging.getLogger(__file__)
logger.setLevel(logging.DEBUG)
msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-7s "
                            "%(levelname)-8s %(message)s")
strhndl = logging.StreamHandler(sys.stdout)
strhndl.setFormatter(fmt=msg_fmt)
logger.addHandler(strhndl)

We now define the number of workers we want to spawn and create them in a for loop. Using this technique, we can easily adjust the number of processes we have running. Inside of our loop, we define each worker using the Process class and set our target function and the required arguments. Once the process instance is defined, we start it and append the object to a list for later use:

num_workers = 5
workers = []
for w in range(num_workers):
    p = mp.Process(target=sleepy, args=(randint(1, 20),))
    p.start()
    workers.append(p)

By appending the workers to a list, we can join them in sequential order. Joining, in this context, is the process of waiting for a process to complete before execution continues. If we do not join our process, one of them could continue to the end of the script and complete the code before other processes complete. While that wouldn't cause huge problems in our example, it can cause the next snippet of code to start too early:

for worker in workers:
    worker.join()
    logger.info("Joined process {}".format(worker.name))

When we execute the script, we can see the processes start and join over time. Since we stored these items in a list, they will join in an ordered fashion, regardless of the time it takes for one worker to finish. This is visible below as Process-5 slept for 14 seconds before completing, and meanwhile, Process-4 and Process-3 had already completed:

There's more…

This script can be further improved. We have provided a recommendation here:

Rather than using function arguments to pass data between threads, review pipes and queues as alternatives to sharing data. Additional information about these objects can be found at https://docs.python.org/3/library/multiprocessing.html#exchanging-objects-between-processes.