An example of the end-to-end ML process
To better illustrate that the overall ML process is hard and that automation is challenging but crucial, we will set the stage with a hands-on example use case.
Introducing ACME Fishing Logistics
ACME Fishing Logistics is a fictitious organization that's concerned with the overfishing of the Sea Snail or Abalone population. Their primary goal is to educate fishermen on how to determine whether an abalone is old enough for breeding. What makes the age determination process challenging is that to verify the abalone's age, it needs to be shucked so that the inside of the shell can be stained and then the number of rings can be counted through a microscope. This involves destroying the abalone to determine whether it is old enough to be kept or returned to the ocean. So, ACME's charter and the goal behind their website is to help fishermen evaluate the various physical characteristics of an abalone so that they can determine its age without killing it.
The case for ML
As you can probably imagine, ACME has not been incredibly successful in its endeavor to prevent abalone overfishing through a simple education process. The CTO has determined that a more proactive strategy must be implemented and has therefore tasked the website manager with using ML to make a more accurate prediction of an abalone's age when fishermen enter the physical characteristics of their catch into the new Age Calculator module of the website. This is where you come in, as ACME's resident ML practitioner – it is your job to create the ML model that serves abalone age predictions to the new Age Calculator.
We can start by using the CRISP-DM guidelines to frame the business use case. The business use case is an all-encompassing step that establishes the overall framework and incorporates the individual steps of the CRISP-DM process.
The purpose of this stage of the process is to establish what the business goals are and to create a project plan that achieves these goals. This stage also includes determining the relevant criteria that define whether, from a business perspective, the project is deemed a success; for example:
- Business Goal: The goal of this initiative is to create an Age Calculator web application that enables fishermen to determine the age of their abalone catch to determine whether it is below the breeding age threshold. To establish how this business goal can be achieved, several questions arise. For example, how accurate does the age prediction need to be? What evaluation metrics will be used to determine the prediction's accuracy? What is the acceptable accuracy threshold? Is there valid data for the use case? How long will the project take? Having questions like these helps set realistic goals for planning.
- Project Plan: A project plan can be formulated by investigating what the answers to some of these questions might be. For example, by investigating what data to use and where to find it, we can start to gauge the difficulties in acquiring the data, which affects how long the project might take. Additionally, we need to understand the model's complexity, which also impacts project timelines, as more complicated models require more time to build, evaluate, and tweak.
- Success Criteria: As the project plan starts to take shape, we start to get a picture of what success looks like and how to measure it. For example, if we know that creating a complicated model will negatively impact the delivery timeline, we can relax the acceptable prediction accuracy criteria for the model and reduce the time it takes to develop a production-grade model. Additionally, if the business goal is simply to help the fishermen determine the abalone age but we have no way of tracking whether they abide by the recommendation, then our success criteria can be measured not in terms of the model's accuracy but in terms of how often the Age Calculator is accessed and used. For instance, if we get 10 application hits a day, then the project can be deemed successful.
While these are only examples of what this stage of the process might look like, it illustrates that careful forethought and planning, along with a very specific set of objectives, must be outlined before any ML processes can start. It also illustrates that this stage of the process cannot be automated, though having a set plan with predefined objectives creates the foundation on which an automation framework could potentially be incorporated.
Getting insights from the data
Now that the overall business case is in place, we can dive into the meat of the actual ML process, starting with the data stage. As shown in the following diagram, the data stage is the first individual step within the framework of the business case:
It is at this point that we determine what data is available, how to ingest the data, what the data looks like, which characteristics of the data are most relevant to predicting the age, and which features need to be re-engineered to create an optimal production-ready model.
Important Note
It is a well-known fact that the data acquisition and exploratory analysis part of the process can account for 70%–80% of the overall effort.
A model worthy of being considered production-ready is only as good as the data it has been trained on. The data needs to be fully analyzed and completely understood to extract the most relevant features for model building and training. We can accomplish this using a technique commonly referred to as Exploratory Data Analysis (EDA), where we assess the statistical components of the data, potentially visualizing and charting it to fully grasp feature relevance. Once we have grasped each feature's importance, we might choose to get more important data, remove unimportant data, and potentially engineer new facets of the data, all to have the trained model learn from these optimal features.
Let's walk through an example of what this stage of the process might look like for the Age Calculator use case.
Sourcing, ingesting, and understanding the data
For our example, we will be using the Abalone Dataset.
Note
The Abalone Dataset is sourced from the University of California, Irvine's ML repository: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
This dataset contains the various physical characteristics of the abalone that can be used to determine its age. The following steps will walk you through how to access and explore the dataset:
- We can load the dataset with the following sample Python code, which uses the pandas library (https://pandas.pydata.org) to ingest the data in comma-separated value (csv) format using the read_csv() method. Since the source data doesn't have any column names, we can review the Attribute Information section of the dataset website and manually create our column_names:

    import pandas as pd

    column_names = ["sex", "length", "diameter", "height",
                    "whole_weight", "shucked_weight",
                    "viscera_weight", "shell_weight", "rings"]
    abalone_data = pd.read_csv(
        "http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
        names=column_names
    )
- Now that the data has been downloaded, we can start analyzing it as a DataFrame. First, we will view the first five rows of the data to ensure we have successfully downloaded it and verify that it matches the attribute information highlighted on the website. The following sample Python code calls the head() method on the abalone_data DataFrame:

    abalone_data.head()
The following screenshot shows the output of executing this call:
Although we are only viewing the first five rows of the data, it matches the attribute information provided by the repository website. For example, we can see that the sex column has nominal values showing whether the abalone is male (M), female (F), or an infant (I). We also have the rings column, which is used to determine the age of the abalone. The remaining columns, such as weight, diameter, and height, detail other physical characteristics of the abalone. These characteristics all contribute to determining its age (in years), which is calculated as the number of rings plus 1.5.
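As a minimal sketch of that calculation, the following hypothetical snippet derives the age in years from the rings column; the resulting values are purely illustrative and are not used in the rest of this example:

    # Age in years = number of rings + 1.5
    age_in_years = abalone_data["rings"] + 1.5
    print(age_in_years.head())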
- Next, we can use the following sample code to call the describe() method on the abalone_data DataFrame:

    abalone_data.describe()
The following screenshot shows the summary statistics of the dataset, as well as various statistical details, such as the percentile, mean, and standard deviation:
Note
At this point, we can gain an understanding of the data by visualizing and plotting any correlations between the key features to further understand how the data is distributed, as well as to determine the most important features in the dataset. We should also determine whether we have missing data and if we have enough data.
Only using summary statistics to understand the data can often be misleading. Although we will not be performing these visualization tasks on this example, you can review why using graphical techniques is so important to understanding data by looking at the Anscombe's Quartet example on Kaggle (https://www.kaggle.com/carlmcbrideellis/anscombe-s-quartet-and-the-importance-of-eda).
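As a brief illustration of the kind of graphical check this note recommends, the following sketch plots the distribution of the rings column and prints the pairwise correlations between the numeric features; the specific plots chosen here are an assumption for demonstration purposes only:

    import matplotlib.pyplot as plt

    # Distribution of the target variable (rings)
    abalone_data["rings"].hist(bins=29)
    plt.title("Distribution of 'rings'")
    plt.xlabel("rings")
    plt.ylabel("count")
    plt.show()

    # Pairwise correlations between the numeric features
    print(abalone_data.select_dtypes("number").corr())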
The previous tasks highlight a few important observations that can be derived from the summary statistics of the dataset. For example, after reviewing the descriptive statistics (Figure 1.5), we can make the following observations:
- The count value for each column is 4177. We can deduce that we have the same number of observations for each feature and, therefore, no missing values. This means that we won't have to somehow infer what these missing values might be or remove the rows containing them from the data. Most ML algorithms fail if data is missing.
- If you look at the 75th percentile value for the rings column, which is 11, there is a significant gap between it and the maximum number of rings, which is 29. This means that the data potentially contains outliers that could add unnecessary noise and influence the effectiveness of the trained model.
- While the sex column is visible in Figure 1.4, the summary statistics displayed in Figure 1.5 do not include it. This is because of the type of data in this column. If you refer to the Attribute Information section of the dataset's website (https://archive.ics.uci.edu/ml/datasets/abalone), you will see that the sex column consists of nominal data. This type of data provides a label or category for data that doesn't have a quantitative value. Since there is no quantitative value, summary statistics for this column cannot be displayed. Depending on the type of ML algorithm that's selected to address the business objective, we may need to convert this data into a quantitative format, as not all ML algorithms work with nominal data.
The next set of steps will help us apply what we have learned from the dataset to make it more compatible with the model training part of the process:
- In this step, we focus on converting the sex column into quantitative data. The following sample code uses the get_dummies() method on the abalone_data DataFrame, which converts the categories of Male (M), Female (F), and Infant (I) into separate feature columns. The data in these new columns represents each category with a one (1) if true or a zero (0) if false:

    abalone_data = pd.get_dummies(abalone_data)
- Running the head() method again now shows the first five rows of the newly converted data:

    abalone_data.head()
The following screenshot shows the first five rows of the converted dataset. Here, you can see that the sex column has been removed and that, in its place, there are three new columns (one for each new category) with the data now represented as discrete values of 1 or 0:
- The next step in preparing the data for model building and training is to separate the rings column from the data to establish it as the target, or variable, we are trying to predict. The following sample code shows this:

    y = abalone_data.rings.values
    del abalone_data["rings"]
- Now that the target variable has been isolated, we can normalize the features. Not all datasets require normalization; however, by looking at Figure 1.5, we can see from the summary statistics that the features have different ranges. These different ranges, especially if the values are large, can influence the overall effectiveness of the model during training. By normalizing the features, the model can converge to a global minimum much faster. The following code sample shows how the existing features can be normalized by first converting them into a NumPy array (https://numpy.org) and then using the normalize() method from the scikit-learn, or sklearn, Python library (https://scikit-learn.org/stable/):

    import numpy as np
    from sklearn import preprocessing

    X = abalone_data.values.astype(float)  # np.float is deprecated in newer NumPy versions
    X = preprocessing.normalize(X)
Based on the initial observations from the dataset, we have applied the necessary transformations to prepare the features for model training. For example, we converted the sex column from a nominal data type into a quantitative data type since this data will play an important part in determining the age of an abalone.
From this example, you can see that the goal of the data stage is to explore and understand the dataset. We also use this stage to apply what we've learned and change, or preprocess, the data into a representation that suits the downstream model building and training process.
Building the right model
Now that the data has been ingested, analyzed, and processed, we are ready to move on to the next stage of the ML process, where we will look at building the right ML model to suit both the business use case and our newly acquired understanding of the data:
Unfortunately, there is no one-size-fits-all algorithm that can be applied to every use case. However, by taking the knowledge we have gleaned from both the business objective and the dataset, we can define a list of potential algorithms to use.
For example, we know from our business case that we want to predict the age of the abalone, which is derived from the number of rings. We also know from analyzing and understanding the dataset that we have a target, or labeled, variable in the rings column. This target variable is a discrete numerical value between 1 and 29, so we can refine our list of possible algorithms to supervised learning algorithms that predict a numerical value from a discrete set of possible values.
The following are just a few of the possible algorithms that could be applied to the example business case:
- Linear regression
- Support vector machines
- Decision trees
- Naïve Bayes
- Neural networks
Once again, no single algorithm in this list perfectly matches the use case and the data. Therefore, the ML process is an experiment in which we work through multiple possible permutations, get insight from each one, and apply what has been learned to further refine the optimal model.
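To make this experimental mindset concrete, the following sketch compares two simple scikit-learn regressors on the preprocessed features using cross-validation. The choice of candidate models and the scoring metric here are illustrative assumptions rather than part of ACME's prescribed workflow:

    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    # Rough comparison of two candidate algorithms using 5-fold cross-validation
    for name, estimator in [("Linear regression", LinearRegression()),
                            ("Decision tree", DecisionTreeRegressor(random_state=42))]:
        scores = cross_val_score(estimator, X, y,
                                 scoring="neg_root_mean_squared_error", cv=5)
        print(f"{name}: RMSE = {-scores.mean():.2f}")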
Some of the additional factors that influence which algorithm to start with are based on the ML practitioner's experience, plus how the chosen algorithm addresses the required business goals and success measurements. For example, if a required success criterion is to have the model completed within 2 weeks, then that might eliminate the option to use a more complicated algorithm.
Building a neural network model
Continuing with the Age Calculator experiment, we will implement a neural network algorithm, also referred to as an Artificial Neural Network (ANN), a Deep Neural Network (DNN), or a Multilayer Perceptron (MLP).
At a high level, a neural network is an artificial construct modeled on the brain, whereby small, non-linear calculations are made on the data by what is commonly referred to as a neuron or perceptron. By grouping these neurons into individual layers and then compounding these layers together, we can assemble the building blocks of a mechanism that takes the data as input and finds the dependencies (or correlations) for the output (or target). Through an optimization process, these dependencies are further refined to get the predicted output as close as possible to the actual target value.
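As a minimal illustration of what a single neuron computes, the following sketch applies a weighted sum, a bias, and a ReLU activation to a handful of inputs; the weights, bias, and input values are arbitrary assumptions used purely for demonstration:

    import numpy as np

    def relu(z):
        """ReLU activation: passes positive values through and clips negatives to 0."""
        return np.maximum(0, z)

    inputs = np.array([0.45, 0.36, 0.11])   # example feature values
    weights = np.array([0.2, -0.5, 0.8])    # learned during training
    bias = 0.1

    # One neuron: weighted sum of the inputs plus a bias, passed through an activation
    print(relu(np.dot(weights, inputs) + bias))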
Note
The primary reason a neural network model is being used in this example is to introduce a deep learning framework. Deep learning frameworks, such as PyTorch (https://pytorch.org/), TensorFlow (https://www.tensorflow.org/), and MXNet (https://mxnet.apache.org/), can be used to create more complicated neural networks. However, from the perspective of ML process automation, they can also introduce several complexities. So, by making use of a deep learning framework, we can lay the foundation to address some of these complexities later in this book.
The following is a graphical representation of the neural network architecture that we will be building for our example:
The individual components that make up this architecture will be explained in the following steps:
- To start building the model architecture, we need to load the necessary libraries from the TensorFlow deep learning framework. Along with the tensorflow libraries, we will also import the Keras API. The Keras (https://keras.io/) library allows us to create higher-level abstractions of the neural network architecture that are easier to understand and work with. For example, from Keras, we also load the Sequential and Dense classes. These classes allow us to define a model architecture that uses sequential neural network layers and define the type and quantity of neurons in each of these layers:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
- Next, we can use the Dense class to define the list of layers that make up the neural network:

    network_layers = [
        Dense(256, activation='relu', kernel_initializer="normal", input_dim=10),
        Dense(128, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='linear')
    ]
- Next, we must define the model as being a Sequential() model, or simply a list of layers:

    model = Sequential(network_layers)
- Once the model structure has been defined, we must compile it for training using the compile() method:

    model.compile(optimizer="adam", loss="mse", metrics=["mae", "accuracy"])
- Once the model has been compiled, the summary() method can be called to view its architecture:

    model.summary()
The following screenshot shows the results of calling this method. Even though it's showing text output, the network architecture matches the one shown in Figure 1.8:
As you can see, the first layer of the model matches Layer 1 in Figure 1.8, where the Dense() class is used to express that this layer has 256 neurons, or units, that connect to every neuron in the next layer. Layer 1 also initializes the parameters (model weights and bias) so that each neuron behaves differently and captures the different patterns we wish to optimize through training. Layer 1 is also configured to expect input data that has 10 dimensions. These dimensions correspond to the following features of the Abalone Dataset:
- Length
- Diameter
- Height
- Whole Weight
- Shucked Weight
- Viscera Weight
- Shell Weight
- Sex_F
- Sex_I
- Sex_M
Layer 1 is also configured to use the nonlinear Rectified Linear Unit (ReLU) activation function, which allows the neural network to learn complex relationships from the dataset. We then repeat the process, adding Layer 2 through Layer 5 and specifying that these layers have 128, 64, 32, and 1 neuron(s) or unit(s), respectively. The final layer has only a single output – the predicted number of rings. Since the objective of the model is to determine how this output relates to the actual number of rings in the dataset, a linear activation function is used.
Once we have constructed the model architecture, we use the following important parameters to compile the model using the compile() method:
- Loss: This parameter specifies the type of objective function (also referred to as the cost function) that will be used. At a high level, the objective function calculates how far away or how close the predicted result is to the actual value. It calculates the amount of error between the number of rings that the model predicts, based on the input data, versus what the actual number of rings is. In this example, the Mean Squared Error (MSE) is used as the objective function, where the average amount of error is measured across all the data points.
- Optimizer: The objective during training is to minimize the amount of error between the predicted number of rings and the actual number of rings. The Adam optimizer is used to iteratively update the neural network weights that contribute to reducing the loss (or error).
- Metrics: The evaluation metrics, Mean Absolute Error (MAE) and prediction accuracy, are captured during model training and used to provide insight into how effectively the model is learning from the input data.
Note
If you are unfamiliar with any of these terms, there are a significant amount of references available when you search for them. Additionally, you may find it helpful to take the Deep Learning Specialization course offered by Coursera (https://www.coursera.org/specializations/deep-learning). Further details on these parameters can be found in the Keras API documentation (https://keras.io/api/models/model_training_apis/#compile-method).
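For reference, MSE and MAE can be computed directly with NumPy. The predicted and actual values below are made-up numbers used only to illustrate the two formulas:

    import numpy as np

    actual = np.array([9, 10, 7, 12])        # actual number of rings (illustrative)
    predicted = np.array([8.5, 11, 7, 14])   # model predictions (illustrative)

    mse = np.mean((predicted - actual) ** 2)    # Mean Squared Error
    mae = np.mean(np.abs(predicted - actual))   # Mean Absolute Error
    print(f"MSE: {mse:.3f}, MAE: {mae:.3f}")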
Now that we have built the architecture for the neural network algorithm, we need to see how it fits on top of the preprocessed dataset. This task is commonly referred to as training the model.
Training the model
The next step of the ML process, as illustrated in the following diagram, is to train the model on the preprocessed abalone data:
Training the compiled model is relatively straightforward. The following steps outline how to kick off the model training part of the process:
- This first step is not necessary to train the model, but sometimes, the output from the training process can be unwieldy and difficult to interpret. Therefore, a custom class called cleanPrint() can be created to ensure that the training output is neat. This class extends the Keras Callback class to print a dash ("-") as the training output:

    class cleanPrint(keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs):
            if (epoch + 1) % 100 == 0:
                print("!")
            else:
                print("-", end="")
Note

It is good practice to display the model's performance at each epoch, as this provides insight into the improvements after each epoch. However, since we are training for 2000 epochs, we are using the cleanPrint() class to make the output neater. We will remove this callback later.

- Next, we must separate the preprocessed abalone data into two main groups – one for the training data and one for the testing data. The split is performed using the train_test_split() method from the model_selection module of the sklearn library:

    from sklearn.model_selection import train_test_split

    training_features, testing_features, training_labels, testing_labels = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
- The final part of this stage is to launch the model training itself. This is done by calling the fit() method on the compiled model and supplying the training_features and training_labels datasets, as shown in the following example code:

    training_results = model.fit(training_features,
                                 training_labels,
                                 validation_data=(testing_features, testing_labels),
                                 batch_size=32,
                                 epochs=2000,
                                 shuffle=True,
                                 verbose=0,
                                 callbacks=[cleanPrint()])
Now that the model training process has started, we can review a few key aspects of our code. First, splitting the data into training and testing datasets is typically performed as part of the data preprocessing step. However, we are performing this task during the model training step to provide additional context for the loss and optimization functions. Creating these two separate datasets is an important part of evaluating how well the model is being trained: the model is trained using the training dataset, and then its effectiveness is evaluated against the testing dataset. This evaluation procedure guides the model (via the loss function and the optimization function) to reduce the amount of error between the predicted number of rings and the actual number of rings. In essence, this optimizes the model. The call to train_test_split() returns four datasets, as follows:
- training_features: The 10 columns of the Abalone Dataset that correspond to the abalone attributes, comprising 80% of the observations.
- testing_features: The same 10 columns of the Abalone Dataset, comprising the other 20% of the observations.
- training_labels: The number of rings (target label) for each observation in the training_features dataset.
- testing_labels: The number of rings (target label) for each observation in the testing_features dataset.

Tip
Further details about each of these parameters, as well as more parameters that you can use to tweak the training process, can be found in the Keras API documentation (https://keras.io/api/models/model_training_apis/#fit-method).
Secondly, once the data has been successfully split, we can use the fit() method and add the following parameters to further govern the training process:
- validation_data: The testing_features and testing_labels datasets, which the model uses to evaluate how well the trained neural network weights reduce the amount of error between the predicted number of rings and the actual number of rings in the testing data.
- batch_size: This parameter defines the number of samples from the training data that are propagated through the neural network, and it can be used to influence the overall speed of the training process. The higher batch_size is, the higher the number of samples from the training data that are combined to estimate the loss before updating the neural network's weights.
- epochs: This parameter defines how many times the training process will iterate through the training data. The higher epochs is, the more passes are made through the training data to optimize the neural network's weights.
- shuffle: This parameter specifies whether to shuffle the data before starting a training iteration. Shuffling the data each time the model iterates through it forces the model to generalize better and prevents it from learning ordered patterns in the training data.
- verbose and callbacks: These parameters control how the training progress and output are displayed for each epoch. Setting verbose to zero and using the cleanPrint() class will simply display a dash (-) as the output for each epoch.
The training process should take 12 minutes to complete, providing us with a trained model object. In the next section, we will use the trained model to evaluate how well it makes predictions on new data.
Evaluating the trained model
Once the model has been trained, we can move on to the next stage of the ML process: the model evaluation stage. It is at this stage that the trained model is evaluated against the objectives and success criteria that were established within the business use case, with the goal of determining whether the trained model is ready for production:
When evaluating a trained model, most ML practitioners simply score the quality of the model predictions using an evaluation metric that is suited to the type of model. Other ML practitioners go one step further to visualize and further understand the predictions. The following steps will walk you through using the latter of these two approaches:
- Using the following sample code, we can load the necessary Python libraries. The first is matplotlib, whose pyplot module provides a collection of functions for interactive and programmatic plot generation. The second import, the mean_squared_error() function, comes from the sklearn package and provides the ML practitioner with an easy way to evaluate the quality of the model using the Root Mean Squared Error (RMSE) metric. Since the neural network model is a supervised learning-based regression model, RMSE is a popular metric for measuring the error rate of the model's predictions:

    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error
- The imported libraries are then used to visualize the predictions to provide a better understanding of the model's quality. The following code generates a plot that incorporates the information that's required to quantify the prediction quality:

    fig, ax = plt.subplots(figsize=(15, 10))
    ax.plot(testing_labels, model.predict(testing_features), "ob")
    ax.plot([0, 25], [0, 25], "-r")
    ax.text(8, 1,
            f"RMSE = {mean_squared_error(testing_labels, model.predict(testing_features), squared=False)}",
            color="r", fontweight=1000)
    plt.grid()
    plt.title("Abalone Model Evaluation", fontweight="bold", fontsize=12)
    plt.xlabel("Actual 'Rings'", fontweight="bold", fontsize=12)
    plt.ylabel("Predicted 'Rings'", fontweight="bold", fontsize=12)
    plt.legend(["Predictions", "Regression Line"], loc="upper left", prop={"weight": "bold"})
    plt.show()
Executing this code will plot two series on a single set of axes. The first is a scatterplot displaying the model's predictions on the test dataset against the ground truth labels. The second superimposes a regression line over these predictions to highlight the linear relationship between the predicted number of rings and the actual number of rings. The rest of the code labels the various properties of the plot and displays the RMSE score of the predictions. The following is an example of this plot:
Three things should immediately stand out here:
- The RMSE evaluation metric scores the trained model at 2.54.
- The regression line depicting the correlation between the actual number of rings and the predicted number of rings does not pass through the majority of the predictions.
- There are a significant number of predictions that are far away from the regression line on both the positive and negative scales. This shows a high error rate between the number of rings that are predicted versus the actual number of rings for a data point.
These observations and others should be compared to the objectives and success criteria that are outlined in the business use case. Both the ML practitioner and business owner can then judge whether the trained model is ready for production.
For example, if the primary objective of the Age Calculator application is to use the model predictions as a rough guide for the fishermen to get a simple idea of the abalone age, then the model does this and can therefore be considered ready for production. If, on the other hand, the primary goal of the Age Calculator application is to provide an accurate age prediction, then the example model probably cannot be considered production-ready.
So, if we determine that the model is not ready for production, what are the subsequent steps of the ML process? The next section will review some options.
Exploring possible next steps
Since the model has been deemed unfit for production, several approaches can be taken after the model evaluation stage. The following diagram highlights three possible options that can be considered as possible next steps:
Let's explore these three possible next steps in more depth to determine which option best suits the objectives of the Age Calculator use case.
Option 1 – get more data
The first option requires the ML practitioner to go back to the beginning of the process and acquire more data. Since the UCI abalone repository is the only publicly available dataset, this task might involve physically gathering more observations by manually fishing for abalone or conducting a survey with fishermen on their catch. Either way, this takes time!
However, simply adding more observations to the dataset does not necessarily translate to a better-quality model. So, getting more data could also mean getting better-quality features. This means that the ML practitioner would need to reevaluate the existing data, dive further into the analysis to better understand which of the features are of the most importance, and then re-engineer those features or create new features from them. This too is time-consuming!
Option 2 – choose another model
The second option to consider involves building an entirely new model using a completely different algorithm that still matches the use case. For example, the ML practitioner might investigate using another supervised learning, regression-based algorithm.
Different algorithms might also require the data to be restructured so that it's more suited to the algorithm's required type of input. For example, choosing a Gradient Boosting Regression algorithm, such as XGBoost, requires the target label to be the first column in the dataset. Choosing another algorithm and reengineering the data requires additional time!
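As a minimal sketch of that kind of restructuring, the following hypothetical snippet rebuilds a DataFrame with the rings target as the first column, followed by the feature columns; the resulting frame is illustrative and is not used elsewhere in this example:

    # Reattach the target (rings) as the first column, followed by the features,
    # to match the column layout some algorithm implementations expect
    xgb_ready_data = pd.concat(
        [pd.Series(y, name="rings"), abalone_data.reset_index(drop=True)],
        axis=1
    )
    print(xgb_ready_data.head())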
Option 3 – tune the existing model
Recall that when the existing neural network model was built, there were a few tunable parameters that were configured during its compilation. For example, the model was compiled using particular optimizer and loss functions.
Additionally, when the existing neural network model was trained, other tunable parameters were supplied, such as the number of epochs and the batch size.
Note
There is no best practice for choosing the right option. Remember that each iteration through the process is an experiment whereby the goal is to glean more information from the experiment to determine the next course of action or next option.
While Option 3 may seem straightforward, in the next section, you will see that this option also involves multiple potential iterations and is therefore also time-consuming.
Tuning our model
As we've already highlighted, multiple parameters, or hyperparameters, can be adjusted to better tune or optimize an existing model. Hence, this stage of the process is also referred to as hyperparameter optimization. The following diagram shows what the hyperparameter optimization process entails:
After evaluating the model to determine which hyperparameters can be tweaked, the model is retrained using these parameters. The retrained model is, once again, compared against the business objectives and success criteria to determine whether it is ready for production. This process is then repeated, constantly tweaking, training, and evaluating, until a production-ready model is produced.
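As a rough sketch of what one pass through this loop could look like in code, the following example rebuilds and retrains the Keras model for a few candidate combinations of batch size and epoch count, keeping the combination with the lowest validation loss. The candidate values and the build_model() helper are illustrative assumptions, not settings prescribed by the use case:

    def build_model():
        """Rebuild the same architecture so that each trial starts from fresh weights."""
        candidate = Sequential([
            Dense(256, activation='relu', kernel_initializer="normal", input_dim=10),
            Dense(128, activation='relu'),
            Dense(64, activation='relu'),
            Dense(32, activation='relu'),
            Dense(1, activation='linear')
        ])
        candidate.compile(optimizer="adam", loss="mse", metrics=["mae"])
        return candidate

    best_loss, best_params = float("inf"), None
    for batch_size in [8, 32]:
        for epochs in [100, 200]:
            candidate = build_model()
            history = candidate.fit(training_features, training_labels,
                                    validation_data=(testing_features, testing_labels),
                                    batch_size=batch_size, epochs=epochs, verbose=0)
            val_loss = history.history["val_loss"][-1]
            if val_loss < best_loss:
                best_loss, best_params = val_loss, (batch_size, epochs)

    print(f"Best (batch_size, epochs): {best_params}, validation loss: {best_loss:.3f}")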
Determining the best hyperparameters to tune
Once again, there is no exact approach to getting the optimal hyperparameters. Each iteration through the process helps narrow down which combination of hyperparameters contributes to a more optimized model.
However, a good place to start the process is to dive deeper into what is happening during model training and derive further insights into how the model is learning from the data.
You will recall that, when executing the fit() method to train the model, we bound the results to the training_results variable, which captures additional metrics that are needed for model tuning. The following steps will walk you through an example of how to extract and visualize these metrics:

- By using the history attribute of the training_results object, we can use the following sample code to plot the prediction error for both the training and testing processes:

    plt.rcParams["figure.figsize"] = (15, 10)
    plt.plot(training_results.history["loss"])
    plt.plot(training_results.history["val_loss"])
    plt.title("Training vs. Testing Loss", fontweight="bold", fontsize=14)
    plt.ylabel("Loss", fontweight="bold", fontsize=14)
    plt.xlabel("Epochs", fontweight="bold", fontsize=14)
    plt.legend(["Training Loss", "Testing Loss"], loc="upper right", prop={"weight": "bold"})
    plt.grid()
    plt.show()
The following is an example of what the plot might look like after executing the preceding code:
- Similarly, by replacing the loss and val_loss keys in the sample code with mae and val_mae, respectively, we can see a consistent trend:

    plt.rcParams["figure.figsize"] = (15, 10)
    plt.plot(training_results.history["mae"])
    plt.plot(training_results.history["val_mae"])
    plt.title("Training vs. Testing Mean Absolute Error", fontweight="bold", fontsize=14)
    plt.ylabel("mae", fontweight="bold", fontsize=14)
    plt.xlabel("Epochs", fontweight="bold", fontsize=14)
    plt.legend(["Training MAE", "Testing MAE"], loc="upper right", prop={"weight": "bold"})
    plt.grid()
    plt.show()
After executing the preceding code, we will get the following output:
Both Figure 1.15 and Figure 1.16 clearly show a few especially important trends:
- There is a clear divergence between what the model is learning from the training data and its predictions on the testing data. This indicates that the model is not learning anything new as it trains and is essentially overfitting the data. The model has memorized the training data and is unable to generalize to the new, unseen data in the testing dataset.
- This divergence seems to happen around 250 epochs/training iterations. Since the training process was set to 2,000 epochs, this indicates that the model is being over-trained, which could be the reason it is overfitting the training data.
- Both the testing MAE and the testing loss have an erratic gradient. This means that as the model parameters are being updated through the training process, the magnitude of the updates is too large, resulting in an unstable neural network, and therefore unstable predictions on the testing data. So, the fluctuations depicted by the plot essentially highlight an exploding gradient problem, indicating that the model is overfitting the data.
Based on these observations, several hyperparameters can be tuned. For example, an obvious parameter to change is the number of epochs or training iterations to prevent overfitting. Similarly, we could change the optimization function from Adam to Stochastic Gradient Descent (SGD). SGD allows a specific learning rate to be set as one of its parameters, as opposed to the adaptive learning rate used by the Adam optimizer. By specifying a small learning rate parameter, we are essentially rescaling the model updates to ensure that they are small and controlled.
Another solution might be to use a regularization technique, such as L1 or L2 regularization, to penalize some of the neurons on the model, thus creating a simpler neural network. Likewise, simplifying the neural network architecture by reducing the number of layers and neurons within each layer would have the same effect as regularization.
Lastly, reducing the number of samples or batch size can control the stability of the gradient during training.
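To make these options concrete, the following sketch shows how the compile and fit calls could be adjusted to use SGD with a small, explicit learning rate, L2 regularization on the hidden layers, a smaller batch size, and fewer epochs. The specific values chosen here are illustrative assumptions rather than the settings this example ultimately arrives at:

    from tensorflow.keras.optimizers import SGD
    from tensorflow.keras.regularizers import l2

    regularized_layers = [
        Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_dim=10),
        Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
        Dense(1, activation='linear')
    ]
    tuned_model = Sequential(regularized_layers)

    # SGD with a small, explicit learning rate keeps the weight updates small and controlled
    tuned_model.compile(optimizer=SGD(learning_rate=0.01), loss="mse", metrics=["mae"])

    tuned_results = tuned_model.fit(training_features, training_labels,
                                    validation_data=(testing_features, testing_labels),
                                    batch_size=8, epochs=200, shuffle=True, verbose=0)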
Now that we have a fair idea of which hyperparameters to tweak, the next section will show you how to further optimize the model.
Tuning, training, and reevaluating the existing model
We can start model tuning by walking through the following steps:
- The first change we must make is to the neural network architecture itself. The following example code depicts the new structure, where only two hidden layers are used instead of four, each with only 64 neurons:

    network_layers = [
        Dense(64, activation='relu', kernel_initializer="normal", input_dim=10),
        Dense(64, activation='relu'),
        Dense(1, activation='linear')
    ]
- Once again, the model is recompiled using the same parameters as those from the previous example:

    model = Sequential(network_layers)
    model.compile(optimizer="adam", loss="mse", metrics=["mae", "accuracy"])
    model.summary()
The following screenshot shows the text summary of the tuned neural network architecture:
The following diagram shows a visual representation of the tuned neural network architecture:
- Lastly, the fit() method is called on the new model. However, this time, the number of epochs has been reduced to 200 and batch_size has also been reduced to 8:

    training_results = model.fit(training_features,
                                 training_labels,
                                 validation_data=(testing_features, testing_labels),
                                 batch_size=8,
                                 epochs=200,
                                 shuffle=True,
                                 verbose=1)

Note

In the previous code example, the cleanPrint() callback has been removed to show the evaluation metrics on both the training and validation data at 200 epochs.

- Once the new model training has been completed, the previously used evaluation code can be re-executed to display the evaluation scatterplot. The following is an example of this scatterplot:
The new model still does not capture all the data points, as there are several outliers on both the positive and negative scales. However, there is a drastic improvement in the overall fit on most data points. This is further quantified by the RMSE score dropping from 2.54 to 2.08.
Once again, these observations should be compared to the objectives and the success criteria that are outlined in the business use case to gauge whether the model is ready for production.
As the following diagram illustrates, if a production-ready model cannot be found, then the options to further tune the model, get and engineer more data, or build a completely different model are still available:
Should the model be deemed production-ready, the ML practitioner can move on to the final stage of the ML process. As shown in the following diagram, this is the model deployment stage:
In the next section, we will review the processes involved in deploying the model into production.
Deploying the optimized model into production
Model deployment is somewhat of a gray area in that some ML practitioners do not apply this stage to their ML process. For example, some ML practitioners may feel that the scope of their task is to simply provide a production-ready ML model that addresses the business use case. Once this model has been trained, they simply hand it over to the application development teams or application owners for them to test and integrate the model into the application.
Alternatively, some ML practitioners will work with the application teams to deploy the model into a test or Quality Assurance (QA) environment to ensure that the trained model successfully integrates with the application.
Whatever the scope of the ML practitioner role, model deployment is part of the CRISP-DM methodology and should always be factored into the overall ML process, especially if the ML process is to be automated.
While the CRISP-DM methodology ends with the model deployment stage, as shown in the preceding diagram, the process is, in fact, continuous. Once the model has been deployed into a production application, it needs to be constantly monitored to ensure that it does not drift from its intended purpose, which is to consistently provide accurate predictions on new, unseen data. Should this situation arise, the ML practitioner will be called upon to start the ML process again to reoptimize the model and make it generalize to this new data. The following diagram shows what the ML process looks like in reality:
So, once again, why is the ML process hard?
Using this simple example use case, you can hopefully see that not only are there inherent complexities to the process of exploring the data, as well as building, training, evaluating, tuning, deploying, and monitoring the model – the entire process is also complex, manual, iterative, and continuous.
How can we streamline the process to ensure that the outcome is always an optimized model that matches the business use case? This is where AutoML comes into play.
Streamlining the ML process with AutoML
AutoML is a broad term that has a different meaning depending on who you ask. When referring to AutoML, some ML practitioners may point to a dedicated software application, a set of tools or libraries, or even a dedicated cloud service. In a nutshell, AutoML is a methodology that allows you to create a repeatable, reliable, streamlined, and, of course, automated ML process.
The process is repeatable in that it follows the same pattern every time it is executed. The process is reliable in that it guarantees that an optimized model matching the use case is always produced. The process is streamlined in that any unnecessary steps are removed, making it as efficient as possible. Finally, and most importantly, the process can be started and executed automatically, triggered by an event such as the detection of model concept drift that requires the model to be retrained.
AWS provides multiple capabilities that can be used to build a streamlined AutoML process. In the next section, I will highlight some of the dedicated cloud services, as well as other services, that can be leveraged to make the ML process easier and automated.