Section 2: Data Science
Once we have clean data in a data lake, we can start performing data science and machine learning on historical data. This section explains why scalable machine learning matters. Its chapters show how to perform exploratory data analysis, feature engineering, and machine learning model training in a scalable, distributed fashion using PySpark. The section also introduces MLflow, an open source tool for managing the machine learning life cycle, which is useful for tracking experiments and productionizing models. Finally, it presents some techniques for scaling out single-machine machine learning libraries based on standard Python.
This section includes the following chapters:
Chapter 5, Scalable Machine Learning with PySpark
Chapter 6, Feature Engineering – Extraction, Transformation, and Selection
Chapter 7, Supervised Machine Learning
Chapter...