Chapter 9
Project 3.1: Data Cleaning Base Application
Data validation, cleaning, converting, and standardizing are the steps required to transform raw data acquired from source applications into something that can be used for analytical purposes. Since we started with a small data set of very clean data, we may need to improvise a bit to create some "dirty" raw data. A good alternative is to search for more complicated raw data.
This chapter will guide you through the design of a data cleaning application, separate from the raw data acquisition application. Many details of cleaning, converting, and standardizing will be left for subsequent projects. This initial project creates a foundation that will be extended by adding features. The idea is to work toward a complete data pipeline that starts with acquisition and passes the data through a separate cleaning stage. We want to exploit the Linux principle of connecting applications through a shared buffer, often referred to as a shell pipeline.
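To make the shell-pipeline idea concrete, here is a minimal sketch of a cleaning stage that can sit downstream of an acquisition program, for example in a command like python acquire.py | python clean.py. The module names, the newline-delimited JSON format, and the single cleaning rule are assumptions for illustration only, not the design we will build in this project.

```python
import json
import sys


def clean(record: dict) -> dict:
    """Apply a placeholder cleaning rule to one raw record."""
    # Hypothetical rule: strip surrounding whitespace from string values.
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def main() -> None:
    # Read newline-delimited JSON from stdin; write cleaned NDJSON to stdout.
    for line in sys.stdin:
        if not line.strip():
            continue  # skip blank lines
        raw = json.loads(line)
        print(json.dumps(clean(raw)))


if __name__ == "__main__":
    main()
```

Because the stage reads from standard input and writes to standard output, the shell's pipe operator provides the shared buffer that connects it to the acquisition stage.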
This chapter will cover a number of skills related to the design of data validation and cleaning applications:
CLI architecture and how to design a pipeline of processes
The core concepts of validating, cleaning, converting, and standardizing raw data (sketched briefly after this list)
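The distinctions among these four operations will be developed throughout the chapter. As a rough, hypothetical sketch, they can be thought of as separate functions applied in sequence to each raw record; the class name, field names, and specific rules below are illustrative assumptions, not the project's final design.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical target structure with native Python types."""
    name: str
    reading: float


def validate(raw: dict[str, str]) -> dict[str, str]:
    """Reject records that are missing required fields."""
    if not raw.get("name") or not raw.get("reading"):
        raise ValueError(f"invalid record: {raw!r}")
    return raw


def clean(raw: dict[str, str]) -> dict[str, str]:
    """Remove noise, such as surrounding whitespace, from each field."""
    return {key: value.strip() for key, value in raw.items()}


def convert(raw: dict[str, str]) -> Sample:
    """Convert the cleaned text into native Python types."""
    return Sample(name=raw["name"], reading=float(raw["reading"]))


def standardize(sample: Sample) -> Sample:
    """Apply a standard form, for example title-casing names."""
    return Sample(name=sample.name.title(), reading=sample.reading)


def process(raw: dict[str, str]) -> Sample:
    """Apply the four operations in sequence to one raw record."""
    return standardize(convert(clean(validate(raw))))
```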
We won’t address all the aspects of converting and standardizing data in this chapter. The projects in Chapter 10, Data Cleaning Features, will expand on many conversion topics. The project in Chapter 12, Project 3.8: Integrated Data Acquisition Web Service, will address the integrated pipeline idea. For now, we want to build an adaptable base application that can be extended to add features.
We’ll start with a description of an idealized data cleaning application.