Exploring the ML life cycle
The ML life cycle refers to the various stages in the conceptualization, design, development, and deployment of an ML model. These stages in the ML model development process consist of a few key steps that help data scientists come up with the best possible outcome for the problem at hand. These steps are usually repeatable and iterative and are combined into a pipeline commonly known as the ML pipeline. An ideal ML pipeline is automated and repeatable so it can be deployed and maintained as a production pipeline. Here are the common stages of an ML life cycle.
Figure 1.6 – A diagram showing the steps of an ML life cycle
Figure 1.6 shows the various steps of the ML life cycle. It starts with having a business understanding of the problem and ends with a deployed model. The iterative steps such as data preparation and model training are denoted by loops to depict that the data scientists would perform those steps repeatedly until they are satisfied with the results. Let us now look at the steps in more detail.
Problem definition
A common mistake is to think ML can solve any problem! Problem definition is key to determining whether ML can be utilized to solve it. In this step, data scientists work with business stakeholders to find out whether the problem satisfies the key tenets of a good ML problem:
- Predictive element: During the ML problem definition, data scientists try to determine whether the problem has a predictive element. It may well be the case that the output being requested can be modeled as a rule that is calculated using existing data instead of creating a model to predict it.
For example, let us take into consideration the problem of health insurance claim fraud identification. There are some tell-tale signs of a claim being fraudulent that are derivable from the existing claims database using data transformations and analytical metrics. For example, verifying whether it’s a duplicate claim, whether the claim amount is unusually high, whether the reason for the claim matches the patient demographic or history, and so on. These attributes can help determine the high-risk claim transactions, which can then be flagged. For this particular problem, there is no need for an ML model to flag such claim transactions as the rules applied to existing claim transaction data are enough to achieve what is needed. On the other hand, if the solution requires a deeper analysis of multiple sources of data and looks at patterns across a large volume of such transactions, it may not be a good candidate for rules or analytical metrics. Applying conventional analytics to large volumes of heterogeneous datasets can result in extremely complicated analytical queries that are hard to debug and maintain. Moreover, the processing of rules on these large volumes of data can be compute-intensive and may become a bottleneck for the timely identification of fraudulent claims. In such cases, applying ML can be beneficial. A model can look at features from different sources of data and learn how they are associated with the target variable (fraud versus no fraud). It can then be used to generate a risk score for each new claim.
It is important to talk to key business stakeholders to understand the different factors that go into determining whether a claim is fraudulent or not. In the process, data scientists document a list of input features that can be used in the ML model. These factors help in the overall determination of the predictive element of the problem statement.
- Availability of dataset: Once the problem is determined to be a good candidate for ML, the next important thing to check is the availability of a high-quality labeled dataset. We cannot train models without the available data. The dataset should also be clean, with no missing values, and be evenly distributed across all features and values. It should have mapping to the target variable and the target itself should be evenly distributed across the dataset. This is obviously the ideal condition and real-world scenarios may be far from ideal. However, the closer we can get to this ideal state of data, the easier it is to produce a highly accurate model from it. In some cases, data scientists may recommend to the business they collect more data containing more examples of a certain type or even more features before starting to experiment with ML methods. In other cases, labeling and annotation of the raw data by subject matter experts (SMEs) may be needed. This is a time-consuming step and may require multiple rounds of discussions with the SMEs, business stakeholders, and data scientists before arriving at an appropriate dataset to begin the ML modeling process. It is worth the time, as utilizing the right dataset ensures the success of the ML project.
- Appetite for experimentation: In a few scenarios, it is important to highlight the fact that data science is a process of experimentation and the chances of it being successful are not always high. In a software development exercise, the work involved in each phase of requirements gathering, development, testing, and deployment can be largely predictable and can be used to accurately estimate the time it will take to complete the project. In an ML project, that may be difficult to determine from the outset. Steps such as data gathering and training the tuning hyperparameters are highly iterative and it could take a long time to come up with the best model. In some cases where the problem and dataset are well known, it may be easier to estimate the time as the results have been proven. However, the time taken to solve novel problems using ML methods could be difficult to determine. It is therefore recommended that the key stakeholders are aware of this and make sure the business has an appetite for experimentation.
Data processing and feature engineering
Before data can be fed into an algorithm for training a model, it needs to be transformed, cleaned, and formatted in a way that can be understood by ML algorithms. For example, raw data may have missing values and may not be standardized across all columns. It may also need transformations to create new derived columns or drop a few columns that may not be needed for ML. Once these data processing steps are complete, the data needs to be made suitable for ML algorithms for training. As you know by now, algorithms are representative of a mathematical equation that accepts the input values of the training datasets and tries to learn its association with the target. Therefore, it cannot accept non-numeric values. In a typical training dataset, you may have numeric, categorical, or text values that have to be appropriately engineered to make them appropriate for training. Some of the common techniques of feature engineering are as follows:
- Scaling: This is a technique by which a feature that may vary a lot across the dataset can be represented at a common scale. This allows the final model to be less sensitive to the variations in the feature.
- Standardizing: This technique allows the feature distribution to have a mean value of zero and a standard deviation of one.
- Binning: This approach allows for granular numerical values to be grouped into a set, resulting in categorical variables. For example, people above 60 years of age are old, between 18 and 60 are adults, and below 18 are young.
- Label encoding: This technique is used to convert categorical features into numeric features by associating a numerical value to each unique value of the categorical variable. For example, if a feature named
color
consists of three unique values –Blue
,Black
, andRed
– label encoders can associate a unique number with each of those colors, such asBlue=1
,Black=2
, andRed=3
. - One-hot encoding: This is another technique for encoding categorical variables. Instead of assigning a unique number to each value of a categorical feature, this technique converts each feature into a column in the dataset and assigns it a 1 or 0. Here is an example:
Price
Model
1000
iPhone
800
Samsung
900
Sony
700
Motorola
Table 1.1 – A table showing data about cell phone models and their price
Applying one-hot encoding to the preceding table will result in the following structure.
Price |
iPhone |
Samsung |
Sony |
Motorola |
1000 |
1 |
0 |
0 |
0 |
800 |
0 |
1 |
0 |
0 |
900 |
0 |
0 |
1 |
0 |
700 |
0 |
0 |
0 |
1 |
Table 1.2 – A table showing the results of one-hot encoding applied to the table in Figure 1.7
The resulting table is sparse in nature and consists of numeric features that can be fed into an ML algorithm for training.
The data processing and feature engineering steps you ultimately apply depend on your source data. We will look at some of these techniques applied to datasets in subsequent chapters where we will see examples of building, training, and deploying ML models with different datasets.
Model training and deployment
Once the features have been engineered and are ready, it is time to enter into the training and deployment phase. As mentioned earlier, it’s a highly repetitive phase of the ML life cycle where the training data is fed into the algorithm to come up with the best fit model. This process involves analyzing the output of the training metrics and tweaking the input features and/or the hyperparameters to achieve a better model. Tuning the hyperparameters of a model is driven by intuition and experience. Experienced data scientists select the initial parameters based on their knowledge of solving similar problems using the algorithm of choice and can come up with the best fit model faster. However, the trial-and-error process can be time-consuming for a new data scientist who is starting off with a random search of the parameters. This process of identifying the best hyperparameters of a model is known as hyperparameter tuning.
The trained model is then deployed typically as a REST API that can be invoked for generating predictions. It’s important to note that training and deployment is a continuous process in an ML life cycle. As discussed earlier, models that perform well in the training phase may degrade in performance in production over a period of time and may require retraining. It is also important to keep training the model at regular intervals with newly available real-world data to make sure it is able to predict accurately in all variations of production data. For this reason, ML engineers prefer to create a repeatable ML pipeline that continuously trains, tunes, and deploys newer versions of models as needed. This process is known as ML Operations, or simply MLOps, and the pipeline that performs these tasks is known as an MLOps pipeline.