Identifying data transformations and optimizations
In a typical data analytics project, we ingest data from multiple data sources and then perform transforms on those datasets to optimize them for the required analytics.
In Chapter 7, Transforming Data to Optimize for Analytics, we will take a deeper dive into typical transformations and optimizations, but here we provide a high-level overview of the most common ones.
File format optimizations
CSV, XML, JSON, and other plaintext file formats are commonly used to store structured and semi-structured data. These formats are useful when manually exploring data, but binary formats designed for analytics are far more efficient for computer-based processing. A common example is Apache Parquet, a columnar format that is optimized for read-heavy analytics. A common transformation is therefore to convert plaintext files into an optimized format such as Apache Parquet.
Data standardization
When building out a pipeline, we...