What this book covers
Chapter 1, Getting Started and Lakehouse Concepts, covers the different techniques and methods for data engineering and machine learning used throughout the book. The goal is not to unveil never-before-seen insights into data; if that were the case, this would be an academic paper. Instead, the goal of this chapter is to use open and free data to demonstrate advanced technology and best practices. The chapter also lists and describes each dataset used in the book.
Chapter 2, Designing Databricks: Day One, covers workspace design, model life cycle practices, naming conventions, what not to put in DBFS, and other preparatory topics. The Databricks platform is simple to use. However, there are many options available to cater to the different needs of different organizations. During my years as a contractor and my time at Databricks, I have seen teams succeed and fail. In this chapter, I will share the dynamics of successful teams, along with the configurations that accompany those insights.
Chapter 3, Building the Bronze Layer, begins your data journey in the Databricks DI Platform by exploring the fundamentals of the Bronze layer of the Medallion architecture. The Bronze layer is the first step in transforming your data for downstream projects, and this chapter will focus on the Databricks features and techniques you have available for the necessary transformations. We will start by introducing you to Auto Loader, a tool to automate data ingestion, which you can implement with or without Delta Live Tables (DLT) to ingest and transform your data.
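To give a flavor of what Chapter 3 builds toward, here is a minimal Auto Loader sketch. The paths, catalog, and table name are illustrative placeholders, not examples from the book; the point is only that Auto Loader incrementally discovers new files in cloud storage and streams them into a Bronze Delta table.

```python
# Minimal Auto Loader sketch (placeholder paths and table name).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

raw_stream = (
    spark.readStream.format("cloudFiles")            # "cloudFiles" is the Auto Loader source
    .option("cloudFiles.format", "json")              # format of the incoming raw files
    .option("cloudFiles.schemaLocation", "/tmp/schemas/bronze_demo")  # where the inferred schema is tracked
    .load("/tmp/landing/bronze_demo")                  # landing directory being monitored for new files
)

(
    raw_stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/bronze_demo")
    .trigger(availableNow=True)                        # process all available files, then stop
    .toTable("dev_catalog.bronze.demo_raw")            # hypothetical Unity Catalog table
)
```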
Chapter 4, Getting to Know Your Data, explores the features within the Databricks DI Platform that help improve and monitor data quality and facilitate data exploration. There are numerous approaches to getting to know your data better with Databricks. First, we cover how to oversee data quality with DLT to catch quality issues early and prevent the contamination of entire pipelines. We will take our first close look at Lakehouse Monitoring, which helps us analyze data changes over time and can alert us to changes of concern.
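As a preview of the DLT quality checks covered in Chapter 4, here is a minimal sketch of expectations on a pipeline table. The table, column, and rule names are placeholders; expectations let DLT drop or flag bad records early, before they contaminate downstream tables.

```python
# Minimal DLT expectations sketch (placeholder table and column names).
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Silver table with basic quality checks applied")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")      # drop rows that fail this rule
@dlt.expect("non_negative_amount", "amount >= 0")       # record violations without dropping rows
def demo_silver():
    # "demo_bronze" is a hypothetical upstream table in the same pipeline.
    return dlt.read_stream("demo_bronze").withColumn("ingested_at", F.current_timestamp())
```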
Chapter 5, Feature Engineering on Databricks, progresses from Chapter 4, where we harnessed the power of Databricks to explore and refine our datasets, to delve into the components of Databricks that enable the next step – feature engineering. We will start by covering Databricks Feature Engineering (DFE) in Unity Catalog (UC) to show you how you can efficiently manage engineered features using Unity Catalog. Understanding how to leverage DFE in UC is crucial for creating reusable and consistent features across training and inference. Then, you will learn how to leverage Structured Streaming to calculate features on a stream, which allows you to create stateful features needed for models to make quick decisions.
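For orientation, here is a minimal sketch of registering engineered features with DFE in Unity Catalog. The catalog, schema, table, and column names are placeholders; the primary key is what lets training and inference look up the same features consistently.

```python
# Minimal DFE-in-UC sketch (placeholder catalog, schema, and columns).
from databricks.feature_engineering import FeatureEngineeringClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
fe = FeatureEngineeringClient()

# Tiny illustrative feature DataFrame; in practice this comes from your pipeline.
customer_features_df = spark.createDataFrame(
    [(1, 3, 120.0), (2, 7, 410.5)],
    ["customer_id", "orders_last_30d", "spend_last_30d"],
)

fe.create_table(
    name="dev_catalog.features.customer_features",    # hypothetical UC feature table
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Example engineered customer features",
)
```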
Chapter 6, Tools for Model Training and Experimenting, examines how to use data science to search for a signal hidden in the noise of data. We will leverage the features we created within the Databricks platform during the previous chapter. We will start by using AutoML in a basic modeling approach, providing auto-generated code and quickly enabling data scientists to establish a baseline model to beat. When searching for a signal, we experiment with different features, hyperparameters, and models. Historically, tracking these configurations and their corresponding evaluation metrics has been a time-consuming project in and of itself. A low-overhead tracking mechanism, such as that provided by MLflow, an open source platform for managing data science projects and supporting MLOps, reduces the burden of manually capturing configurations. More specifically, we’ll introduce MLflow Tracking, an MLflow component that makes it significantly easier to track the many outputs of each permutation. However, that is only the beginning.
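To show what this low-overhead tracking looks like, here is a minimal MLflow Tracking sketch. The experiment path, parameters, and metric values are placeholders; each run captures the configuration and evaluation metrics of one modeling permutation so that nothing has to be recorded by hand.

```python
# Minimal MLflow Tracking sketch (placeholder experiment, params, and metrics).
import mlflow

mlflow.set_experiment("/Shared/demo_experiment")       # hypothetical experiment path

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model_type", "random_forest")     # configuration for this permutation
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("val_auc", 0.87)                   # corresponding evaluation metric
```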
Chapter 7, Productionizing ML on Databricks, explores productionizing a machine learning model using Databricks products; incorporating functionality such as the model registry in Unity Catalog, Databricks Workflows, Databricks Asset Bundles, and Model Serving makes the journey more straightforward and cohesive. This chapter will cover the tools and practices to take your models from development to production.
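As a small taste of that journey, here is a sketch of logging a model and registering it in Unity Catalog, where Workflows and Model Serving endpoints can then pick up versions for production use. The model, training data, and registered model name are illustrative placeholders, not the book's example.

```python
# Minimal sketch: log a model and register it in Unity Catalog (placeholder names).
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_registry_uri("databricks-uc")               # point the MLflow registry at Unity Catalog

# Toy stand-in for a real training step.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="dev_catalog.models.demo_model",  # hypothetical UC model name
    )
```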
Chapter 8, Monitoring, Evaluating, and More, covers how to create visualizations in both the new Lakeview dashboards and the standard DBSQL dashboards. Deployed models can also be shared via a web application, so we will introduce Hugging Face Spaces and deploy the RAG chatbot as a Gradio app to apply what we have learned.