2.1 General data acquisition
All data analysis processing starts with the essential step of acquiring the data from a source.
This statement may seem obvious, but mistakes at this step often lead to complicated rework later. It’s important to recognize that data exists in two essential forms:
Python objects, usable in analytic programs. While the obvious candidates are numbers and strings, this includes using packages like Pillow to operate on images as Python objects. A package like librosa can create objects representing audio data.
A serialization of a Python object. There are many choices here:
Text. Some kind of string. There are numerous syntax variants, including CSV, JSON, TOML, YAML, HTML, XML, etc.
Pickled Python Objects. These are created by the pickle module.
Binary Formats. Tools like Protobuf can serialize native Python objects into a stream of bytes. Some YAML extensions, similarly, can serialize an object in a binary format that isn’t text. Images and audio samples are often stored in compressed binary formats.
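The distinction between a Python object and its serializations can be sketched with the standard library alone. This is a minimal illustration, not part of the project code; the sample record and its field names (x, y) are assumptions made up for the example.

```python
import csv
import io
import json
import pickle

# A hypothetical sample record; the field names are illustrative only.
sample = {"x": 10.0, "y": 8.04}

# Text serialization: JSON keeps the distinction between numbers and strings.
as_json = json.dumps(sample)

# Text serialization: CSV stores every value as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["x", "y"])
writer.writeheader()
writer.writerow(sample)
as_csv = buffer.getvalue()

# Pickle: a Python-specific byte stream.
as_pickle = pickle.dumps(sample)

# Deserializing turns each form back into a Python object.
from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)
from_csv = next(csv.DictReader(io.StringIO(as_csv)))

print(from_json)    # floats are restored
print(from_pickle)  # floats are restored
print(from_csv)     # values come back as strings
```

Note that the CSV round trip returns strings rather than floats. The conversion back to useful Python types is a separate step that the application has to perform.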
The format for the source data is, almost universally, not fixed by any rules or conventions. Writing an application based on the assumption that source data is always a CSV-format file can lead to problems when a new format is required.
It’s best to treat all input formats as subject to change. Once acquired, the data can be saved in a common format that the analysis pipeline uses and that is independent of the source format (we’ll get to persistence in Clean, validate, standardize, and persist).
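One way to keep the source format from leaking into the rest of the pipeline is to hide each format behind a reader function that yields rows in a single common shape. The following is a rough sketch under stated assumptions, not the design of the Data Acquisition Base Application: the names Row, read_csv, read_json_lines, READERS, and acquire are hypothetical, and newline-delimited JSON is used only as an example of a second format.

```python
import csv
import io
import json
from collections.abc import Callable, Iterator
from typing import TextIO

# Hypothetical common shape: every reader yields dictionaries of strings.
Row = dict[str, str]


def read_csv(source: TextIO) -> Iterator[Row]:
    """Yield one row per CSV line, keyed by the header row."""
    yield from csv.DictReader(source)


def read_json_lines(source: TextIO) -> Iterator[Row]:
    """Yield one row per non-empty line of newline-delimited JSON."""
    for line in source:
        if line.strip():
            record = json.loads(line)
            # Convert values to strings so both readers yield the same shape.
            yield {key: str(value) for key, value in record.items()}


# Supporting a new format means adding one entry here.
READERS: dict[str, Callable[[TextIO], Iterator[Row]]] = {
    "csv": read_csv,
    "ndjson": read_json_lines,
}


def acquire(source: TextIO, format_name: str) -> list[Row]:
    """Read all rows from source using the reader registered for format_name."""
    try:
        reader = READERS[format_name]
    except KeyError:
        raise ValueError(f"unsupported format: {format_name!r}") from None
    return list(reader(source))


csv_rows = acquire(io.StringIO("x,y\n10.0,8.04\n"), "csv")
json_rows = acquire(io.StringIO('{"x": 10.0, "y": 8.04}\n'), "ndjson")
print(csv_rows == json_rows)
```

With this arrangement, code downstream of acquire() sees the same rows whether the source was CSV or JSON, so a change of input format touches only the reader table.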
We’ll start with Project 1.1: “Acquire Data”. This will build the Data Acquisition Base Application. It will acquire CSV-format data and serve as the basis for adding formats in later projects.
There are a number of variants on how data is acquired. In the next few chapters, we’ll look at some alternative data extraction approaches.