Real-world datasets vary widely: variables can be textual, numerical, or categorical, and observations can be missing, incorrect, or anomalous (outliers). To perform a proper data analysis, we need to understand how to correctly parse data, clean it, and create an output matrix that is well suited to machine learning analysis. To extract knowledge, it is essential that the reader be able to create an observation matrix using different data analysis and cleaning techniques.
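To make the idea of an observation matrix concrete, here is a minimal sketch in plain Python (standard library only, not Cloud Dataprep). The sample data, the column names `age` and `city`, the function name `build_matrix`, and the outlier threshold of five median absolute deviations are all assumptions chosen for illustration; a real project would pick rules that suit its own data.

```python
import csv
import io
import statistics

# Hypothetical raw data: one numeric and one categorical column,
# containing a missing value and an implausible value (410).
RAW = "age,city\n34,Rome\n,Milan\n29,Rome\n31,Turin\n410,Milan\n"


def build_matrix(raw_text, mad_factor=5.0):
    """Parse, clean, and encode raw CSV text into a numeric matrix."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))

    # Parse: convert the numeric column, mapping empty cells to None.
    ages = [float(r["age"]) if r["age"].strip() else None for r in rows]

    # Clean (outliers): mark values far from the median as missing.
    # The median absolute deviation (MAD) rule is one option among many.
    known = [a for a in ages if a is not None]
    med = statistics.median(known)
    mad = statistics.median([abs(a - med) for a in known])
    ages = [
        None if a is not None and mad > 0 and abs(a - med) > mad_factor * mad else a
        for a in ages
    ]

    # Clean (missing): impute with the median of the remaining values.
    fill = statistics.median([a for a in ages if a is not None])
    ages = [fill if a is None else a for a in ages]

    # Encode: one-hot encode the categorical column.
    categories = sorted({r["city"] for r in rows})
    header = ["age"] + [f"city={c}" for c in categories]
    matrix = [
        [age] + [1.0 if r["city"] == c else 0.0 for c in categories]
        for age, r in zip(ages, rows)
    ]
    return header, matrix


header, matrix = build_matrix(RAW)
print(header)
for row in matrix:
    print(row)
```

Each row of `matrix` is one observation and each column is one numeric feature, which is the shape most machine learning libraries expect as input. Here both the empty cell and the value 410 end up replaced by the median of the plausible ages.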
In this chapter, we'll present Cloud Dataprep, a service for preprocessing data, extracting features, and cleaning up records. We'll also cover Cloud Dataflow, a service for implementing streaming and batch processing. We'll go into some practical details with real-life examples. We'll start by exploring different ways to transform data and the degree of cleaning data...