Data cleansing and preprocessing
As I mentioned earlier, we’ve been pretty lucky. Some of the data my team works with can be downright filthy. When we use terms such as “dirty,” “filthy,” and “cleansing” to describe data, we’re talking about the format of the data and its fitness for processing. Data is only useful if it’s in a format we can work with, and structured data is what we always prefer.
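To make “dirty” a little more concrete, here is a minimal Python sketch of the kind of cleanup I have in mind. Everything in it is hypothetical: the sample text in RAW, the column names, and the clean_rows helper are made up for illustration and are not taken from any dataset used so far. It assumes the only problems are stray whitespace, inconsistent capitalization, and an empty value.

```python
import csv
import io

# Hypothetical "dirty" input: stray whitespace, inconsistent case,
# and a missing age in the last record.
RAW = "name, city ,age\n Alice ,PARIS,34\nbob,  paris ,\n"


def clean_rows(text):
    """Parse CSV text and normalize each field into a predictable form."""
    reader = csv.reader(io.StringIO(text))
    # Normalize the header so columns can be looked up reliably.
    header = [h.strip().lower() for h in next(reader)]
    rows = []
    for record in reader:
        fields = [f.strip() for f in record]
        row = dict(zip(header, fields))
        row["name"] = row["name"].title()
        row["city"] = row["city"].title()
        # Represent a missing age as None rather than an empty string.
        row["age"] = int(row["age"]) if row["age"] else None
        rows.append(row)
    return rows


cleaned = clean_rows(RAW)
for row in cleaned:
    print(row)
```

Real data is rarely this cooperative, so treat this as a starting point: each new source tends to need its own rules for what counts as fit for processing.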
Structured data refers to data that is split into identifiable fields. We’ve seen comma-separated and tab-separated text. Other examples of structured data include formats such as XML, JSON, Parquet, and HDF5. The first two, XML and JSON, are very common and have the advantage of being text formats. The latter two, Parquet and HDF5, are binary formats specialized for storing datasets larger than would be comfortable to work with as text. As we’ve seen, most tools, including...