Understanding the typical machine learning process
The following diagram summarizes the ML process in an organization:
Figure 1.1 – The data science development life cycle consists of three main stages – data preparation, modeling, and deployment
Note
It is an iterative process. The raw structured and unstructured data first lands in a data lake from different sources. A data lake uses the scalable, low-cost storage provided by cloud object stores such as Amazon Simple Storage Service (S3) or Azure Data Lake Storage (ADLS), depending on which cloud provider an organization uses. Due to regulatory requirements, many organizations have a multi-cloud strategy, which makes it essential to choose cloud-agnostic technologies and frameworks that simplify infrastructure management and reduce operational overhead.
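For illustration, the following minimal PySpark sketch reads raw structured and semi-structured data from cloud object storage and writes it, unchanged, into the raw zone of a data lake. The bucket, container, and path names are hypothetical placeholders:

# Minimal sketch: land raw data in a data lake with PySpark.
# The bucket/container paths below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-ingestion").getOrCreate()

# A structured source (CSV exports) and a semi-structured source (JSON events)
orders_raw = spark.read.option("header", "true").csv("s3://landing-bucket/orders/")
events_raw = spark.read.json("abfss://landing@storageacct.dfs.core.windows.net/events/")

# Persist the data as-is in the raw zone of the lake for downstream processing
orders_raw.write.mode("append").parquet("s3://data-lake/raw/orders/")
events_raw.write.mode("append").parquet("s3://data-lake/raw/events/")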
Databricks defined a design pattern called the medallion architecture to organize data in a data lake. Before moving forward, let’s briefly understand what the medallion architecture is:
Figure 1.2 – Databricks medallion architecture
The medallion architecture is a data design pattern that’s used in a Lakehouse to organize data logically. It involves structuring data into layers (Bronze, Silver, and Gold) to progressively improve its quality and structure. The medallion architecture is also referred to as a “multi-hop” architecture.
The Lakehouse architecture, which combines the best features of data lakes and data warehouses, offers several benefits, including a simple data model, ease of implementation, incremental extract, transform, and load (ETL), and the ability to recreate tables from raw data at any time. It also provides features such as ACID transactions and time travel for data versioning and historical analysis. We will expand on the lakehouse in the Exploring the Databricks Lakehouse architecture section.
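As a quick illustration of time travel, the following sketch assumes a Delta Lake-enabled Spark session (named spark, as provided in a Databricks notebook) and an existing Delta table at a hypothetical path. Each write to the table is an ACID transaction, and earlier versions remain queryable:

# Sketch of Delta Lake time travel; the table path is a hypothetical placeholder.
table_path = "s3://data-lake/silver/customers"

# Read the current version of the table
current_df = spark.read.format("delta").load(table_path)

# Read the table as it existed at an earlier version or timestamp
v0_df = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
ts_df = (spark.read.format("delta")
         .option("timestampAsOf", "2024-01-01")
         .load(table_path))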
In the medallion architecture, the Bronze layer holds raw data sourced from external systems, preserving its original structure along with additional metadata. The focus here is on quick change data capture (CDC) and maintaining a historical archive. The Silver layer, on the other hand, houses cleansed, conformed, and “just enough” transformed data. It provides an enterprise-wide view of key business entities and serves as a source for self-service analytics, ad hoc reporting, and advanced analytics.
The Gold layer holds curated, business-level tables that have been organized for consumption and reporting. This layer uses denormalized, read-optimized data models with fewer joins. Complex transformations and data quality rules are applied here, making it the final presentation layer for projects such as customer analytics, product quality analytics, inventory analytics, and more. Traditional data marts and enterprise data warehouses (EDWs) can also be integrated into the lakehouse to enable comprehensive “pan-EDW” advanced analytics and ML.
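To make the multi-hop flow concrete, here is an illustrative PySpark and Delta Lake sketch that moves a hypothetical orders dataset from Bronze to Silver to Gold. The table names and transformation rules are invented for the example, and an existing SparkSession named spark is assumed:

# Illustrative multi-hop (medallion) pipeline; table names and rules are hypothetical.
from pyspark.sql import functions as F

# Bronze: raw ingested records, kept close to the source format
bronze_df = spark.table("bronze.orders_raw")

# Silver: cleansed, de-duplicated, and conformed view of the same entity
silver_df = (bronze_df
             .dropDuplicates(["order_id"])
             .filter(F.col("order_amount").isNotNull())
             .withColumn("order_date", F.to_date("order_timestamp")))
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: denormalized, read-optimized aggregate for reporting
gold_df = (silver_df
           .groupBy("customer_id", "order_date")
           .agg(F.sum("order_amount").alias("daily_spend")))
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.customer_daily_spend")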
The medallion architecture also aligns well with the concept of a data mesh: Bronze and Silver tables can feed multiple downstream tables in a “one-to-many” fashion, enhancing data scalability and autonomy.
Over the last six years, Apache Spark has overtaken Hadoop as the de facto standard for processing data at scale, thanks to its performance advances and broad adoption and support from the developer community. There are many excellent books on Apache Spark written by its creators; these are listed in the Further reading section and offer more insight into Spark’s other benefits.
Once the cleansed data lands in the Gold tables, features are generated by combining Gold datasets; these features act as the input for ML model training.
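As a hedged example, the following sketch joins two hypothetical Gold tables to produce a feature set for a churn model; the table and column names are placeholders:

# Hypothetical feature-generation step: join Gold tables into a training set.
from pyspark.sql import functions as F

spend = spark.table("gold.customer_daily_spend")     # hypothetical Gold table
profiles = spark.table("gold.customer_profiles")     # hypothetical Gold table

features_df = (spend
               .groupBy("customer_id")
               .agg(F.avg("daily_spend").alias("avg_daily_spend"),
                    F.count("*").alias("active_days"))
               .join(profiles.select("customer_id", "tenure_months", "churned"),
                     on="customer_id", how="inner"))

# Persist the features for model training
features_df.write.format("delta").mode("overwrite").saveAsTable("gold.churn_features")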
During the model development and training phase, various sets of hyperparameters and ML algorithms are tested to identify the optimal combination of model and hyperparameters. This process is guided by predetermined evaluation metrics such as accuracy, R2 score, and F1 score.
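The following sketch illustrates such a search with scikit-learn, sweeping two candidate algorithms and small hyperparameter grids on synthetic data and selecting the best combination by F1 score; the grids and data are illustrative only:

# Illustrative model/hyperparameter sweep with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=42),
     {"n_estimators": [100, 300], "max_depth": [5, None]}),
]

best = None
for estimator, grid in candidates:
    # Cross-validated grid search, scored with the F1 metric
    search = GridSearchCV(estimator, grid, scoring="f1", cv=5)
    search.fit(X_train, y_train)
    if best is None or search.best_score_ > best.best_score_:
        best = search

print(best.best_estimator_, best.best_score_)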
In the context of ML, hyperparameters are parameters that govern the learning process of a model. They are not learned from the data itself but are set before training. Examples of hyperparameters include the learning rate, regularization strength, number of hidden layers in a neural network, or the choice of a kernel function in a support vector machine. Adjusting these hyperparameters can significantly impact the performance and behavior of the model.
On the other hand, training an ML model involves deriving values for the model’s parameters, such as node weights or model coefficients. These parameters are learned from the training data by minimizing a chosen loss or error function. They are specific to the model being trained and are determined through optimization techniques such as iterative gradient descent or closed-form solutions.
Expanding beyond node weights, model parameters can also include coefficients in regression models, intercept terms, feature importance scores in decision trees, or filter weights in convolutional neural networks. These parameters are directly learned from the data during the training process and contribute to the model’s ability to make predictions.
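The following short scikit-learn example contrasts the two: the hyperparameters (C, penalty, and max_iter) are set before training, while the coefficients and intercept are learned from the data during fitting. The dataset is synthetic and the values are arbitrary:

# Hyperparameters are set before training; parameters are learned from data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hyperparameters: chosen by the practitioner before fitting
model = LogisticRegression(C=0.5, penalty="l2", max_iter=1000)

# Parameters: coefficients and intercept learned during training
model.fit(X, y)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)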
Parameters
You can learn more about parameters at https://en.wikipedia.org/wiki/Parameter.
The finalized model is deployed for batch, streaming, or real-time inference; in the real-time case, it is typically exposed as a Representational State Transfer (REST) endpoint using containers. In this phase, we set up drift monitoring and governance around the deployed models to manage the model life cycle and enforce access control over usage. Let’s take a look at the different personas involved in taking an ML use case from development to production.