Overview of the ML process
Unfortunately, there is no established how-to guide for performing ML, because every ML use case is unique and specific to the application that uses the resulting ML model. Instead, there is a general process pattern that most data scientists, ML engineers, and other ML practitioners follow. This process model is called the Cross-Industry Standard Process for Data Mining (CRISP-DM). While not everyone follows its steps verbatim, most production ML models have probably been built, in some shape or form, within the guardrails that the CRISP-DM methodology provides.
So, when we refer to the ML process, we are referring to the overall methodology of building production-ready ML models using the guardrails from CRISP-DM.
The following diagram shows an overview of the CRISP-DM guidelines for creating a typical process that an ML practitioner might follow:
In a nutshell, the process starts with the ML practitioner being tasked with providing an ML model that addresses a specific business use case. The ML practitioner then finds, ingests, and analyzes an appropriate dataset that can be effectively leveraged to accomplish the goals of the ML project.
Once the data has been analyzed, the ML practitioner determines the most applicable modeling techniques that extract the most relevant information from the data to address the use case. These techniques include the following:
- Determining the most applicable ML algorithm
- Creating new aspects (engineering new features) of the data that can further improve the chosen model's overall effectiveness
- Separating the data into training and testing sets for model training and evaluation
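To make the last two techniques concrete, here is a minimal sketch that engineers one new feature and then holds out part of the data for testing. It uses only the Python standard library. The records, the `area_per_room` feature, and the `add_features` and `train_test_split` helpers are all hypothetical, chosen purely for illustration, and a real project would typically rely on a data library rather than hand-written helpers:

```python
import random

# Hypothetical toy records: (square_metres, rooms, price). Values are made up.
records = [
    (50.0, 2, 150.0),
    (65.0, 3, 190.0),
    (80.0, 3, 240.0),
    (95.0, 4, 275.0),
    (110.0, 4, 330.0),
    (120.0, 5, 350.0),
    (140.0, 5, 420.0),
    (160.0, 6, 470.0),
]

def add_features(row):
    """Engineer a new feature (area per room) from the raw columns."""
    area, rooms, price = row
    return {
        "area": area,
        "rooms": rooms,
        "area_per_room": area / rooms,  # the engineered feature
        "price": price,
    }

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

dataset = [add_features(r) for r in records]
train_rows, test_rows = train_test_split(dataset)
print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Keeping the testing rows out of training is what allows the later evaluation step to estimate how the model behaves on data it has not seen.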
The ML practitioner then codifies the algorithm's architecture and training/testing/evaluation routines. These routines are then executed to determine the best possible model parameters – ones that optimize the model to fit both the data and the business use case.
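As a rough illustration of these routines, the following sketch tries several candidate settings for a very simple k-nearest-neighbors classifier and keeps the one that scores best. Everything here is an assumption made for the example: the toy data points, the `predict` and `accuracy` helpers, the candidate values of `k`, and the use of a separate validation set to choose the setting before a final check on the test set. It is not the method prescribed by CRISP-DM, just one common way to organize the search:

```python
import math
from collections import Counter

# Hypothetical 2-D toy points as ((x, y), label); values are illustrative only.
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"), ((2.5, 2.5), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.0), "b"), ((5.0, 5.5), "b"),
    ((3.2, 3.1), "b"),  # one deliberately noisy point inside the "a" region
]
validation = [((1.2, 1.8), "a"), ((3.0, 3.0), "a"), ((6.2, 6.4), "b"), ((4.5, 5.0), "b")]
test = [((2.0, 2.0), "a"), ((6.8, 6.5), "b")]

def predict(point, k):
    """Classify a point by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(rows, k):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(p, k) == label for p, label in rows) / len(rows)

# Score every candidate setting on the validation set and keep the best one.
candidates = [1, 3, 5]
scores = {k: accuracy(validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])

# Only the selected setting is scored on the held-out test set.
test_score = accuracy(test, best_k)
print("best k:", best_k, "validation:", scores[best_k], "test:", test_score)
```

In practice, the "best" setting also has to satisfy the business use case, for example a latency or cost limit, so the score alone rarely decides the outcome.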
Finally, the best model is deployed into production to serve predictions that match the initial objective of the business use case.
As you can see, the overall process seems relatively straightforward and easy to follow. So, you may be wondering what all the fuss is about. For example, you may be asking yourself, "Where is the complexity in this process?" or "Why do you say that this is so hard to automate?"
While the process may look simple, executing it in practice is very different. The following diagram provides a more realistic representation of what an ML practitioner may observe when developing an ML use case:
As you can see, the overall process is far more convoluted than the typical representation shown in Figure 1.1. There are potentially multiple paths through the process, and each course of action is based on the results captured from the previous step. Additionally, a particular course of action may not always yield the desired results, forcing the ML practitioner to reset, or to go back and choose a different set of criteria that will hopefully produce a better result.
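The back-and-forth nature of this process can be sketched as a simple loop: try a course of action, check the result, and go back to try another if the result falls short. The `iterate_until_acceptable` function, the option names, the placeholder scores, and the target value below are all hypothetical. The scores stand in for real experiment results and do not come from any actual model:

```python
from typing import Callable, Iterable, List, Optional, Tuple

def iterate_until_acceptable(
    options: Iterable[str],
    evaluate: Callable[[str], float],
    target: float,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Try each option in turn and stop at the first whose score meets the target."""
    history = []
    for option in options:
        score = evaluate(option)
        history.append((option, score))
        if score >= target:
            return option, history  # good enough: continue to the next stage
        # Otherwise, go back and try a different set of criteria.
    return None, history  # nothing met the target: revisit an earlier stage

# Placeholder scores standing in for real experiment results (made up).
placeholder_scores = {
    "raw features": 0.71,
    "scaled features": 0.78,
    "engineered features": 0.86,
}

chosen, history = iterate_until_acceptable(
    placeholder_scores,
    lambda name: placeholder_scores[name],
    target=0.85,
)
print("chosen:", chosen, "after", len(history), "attempts")
```

When no option meets the target, the function returns `None`, which mirrors the case where the practitioner has to step back to an earlier stage, such as finding different data.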
So, now that we have provided a high-level overview of what the typical ML process should entail, let's examine some of the complexities and challenges that make the ML process difficult.