You're reading from R Machine Learning Projects Implement supervised, unsupervised, and reinforcement learning techniques using R 3.5

Product type Paperback

Published in Jan 2019

Publisher Packt

ISBN-13 9781789807943

Length 334 pages

Edition 1st Edition

Languages

Tools

H2O

Concepts

Machine Learning

Author (1):

Dr. Sunil Kumar Chinnamgari

View More author details

ML project pipeline

Most of the content available on ML projects, either through books, blogs, or tutorials, explains the mechanics of machine learning in such a way that the dataset available is split into training, validation, and test datasets. Models are built using training datasets, and model improvements through hyperparameter tuning are done iteratively through validation data. Once a model is built and improved upon to a point that is acceptable, it is tested for goodness with unseen test data and the results of testing are reported out. Most of the public content available, ends at this point.

In reality, the ML projects in a business situation go beyond this step. We may observe that if one stops at testing and reporting a built model performance, there is no real use of the model in terms of predicting about data that is coming up in future. We also need to realize that the idea of building a model is to be able to deploy the model in production and have the predictions based on new data so that businesses can take appropriate action.

In a nutshell, the model needs to be saved and reused. This also means that any new data on which predictions need to be made needs to be preprocessed in the same way as training data. This ensures that, the new data has the same number of columns and also the same types of columns as training data. This part of productionalization of the models built in the lab is totally ignored when being taught. This section covers an end-to-end pipeline for the models, right from data preprocessing to building the models in the lab to productionalization of the models.

ML pipelines describe the entire process from raw data acquisition to obtaining post processing of the prediction results on unseen data so as to make it available for some kind of action by business. It is possible that a pipeline may be depicted at a generalized level or described at a very granular level. This current section focuses on describing a generic pipeline that may be applied to any ML project. Figure 1.8 shows the various components of the ML project pipeline otherwise known as the cross-industry standard process for data mining (CRISP-DM).

Business understanding

Once the problem description that is to be solved using ML is clearly articulated, the first step in the ML pipeline is to be able to ascertain if the problem is of relevance to business and if the goals of the project are laid out without any ambiguities. It may also be wise to check if the problem at hand is feasible to be solved as an ML problem. These are the various aspects typically covered during the business understanding step.

Understanding and sourcing the data

The next step is to identify all the sources of data that are relevant to the business problem at hand. Organizations will have one or more systems, such as HR management systems, accounting systems, and inventory management systems. Depending on the nature of the problem at hand, we may need to fetch the data from multiple sources. Also, data that is obtained through the data acquisition step need not always be structured as tabular data; it could be unstructured data, such as emails, recorded voice files, and images.

In corporate organizations of reasonable size, it may not be possible for an ML professional to work all alone on the task of fetching the data from the diverse range of systems. Tight collaboration with other professionals in the organization may be required to complete this step of the pipeline successfully:

Preparing the data

Data preparation enables the creation of input data for ML algorithms to consume. Raw data that we get from data sources is often not very clean. Sometimes, the data cannot be readily fused into an ML algorithm to create a model. We need to ensure that the raw data is cleaned up and it is prepared in a format that is acceptable for the ML algorithm to take as input.

EDA is a substep in the process of creating the input data. It is a process of using visual and quantitative aids to understand the data without getting prejudice about the contents of the data. EDA gives us deeper insights into the data available at hand. It helps us to understand the required data preparation steps. Some of the insights that we could obtain during EDA are the existence of outliers in the data, missing values existence in the data, and the duplication of data. All of these problems are addressed during data cleansing which is another substep in data preparation. Several techniques may be adopted during data cleansing and the following mentioned are some of the popular techniques:

Deleting records that are outliers
Deleting redundant columns and irrelevant columns in data
Missing values imputation—filling missing values with special value NA or a blank or median or mean or mode or with a regressed value
Scaling the data
Removing stop words such as a, and, and how, from unstructured text data
Normalizing words in unstructured text documents with techniques such as stemming, and lemmatization
Eliminating non-dictionary words in text data
Spelling corrections on misspelled words in text documents
Replacing non-recognizable domain-specific acronyms in the text with actual word descriptions
Rotation, scaling, and translation of image data

Representing the unstructured data as vectors, providing labels for the records if the problem at hand needs to be dealt with by supervised learning, handling class imbalance problems in the data, feature engineering, transforming the data through transformation functions such as log transform, min-max transform, square root transform, and cube transform, are all part of the data preparation process.

The output of the data preparation step is tabular data that can be fit readily into an ML algorithm as input in order to create models.

An additional substep that is typically done in data preparation is to divide the dataset into training data, validation data, and test data. These various datasets are used for specific purposes in the model-building step.

Model building and evaluation

Once the data is ready and prior to the creation of the model, we need to pick and select the features from the list of features available. This can be accomplished through several off-the-shelf feature selection techniques. Some ML algorithms (for example, XGBoost) have feature selection inbuilt within the algorithm, therefore we need not explicitly perform feature selection prior to carrying out the modeling activity.

A suite of ML algorithms is available to try and create models on the data. Additionally, models may be created through ensembling techniques as well. One needs to pick the algorithm(s) and create models using training datasets, then tune the hyperparameters of the model using validation datasets. Finally, the model that is created can be tested using the test dataset. Issues, such as selecting the right metric to evaluate model performance, overfitting, underfitting, and acceptable performance thresholds all need to be taken care of in the model-building step itself. It may be noted that if we do not obtain acceptable performance on the model, it is required to go back to previous steps in order to get more data or to create additional features and then repeat the model-building step once again to check if the model performance improves. This may be done as many times as required until the desired level of performance is achieved by the model.

At the end of the modeling step, we might end up with a suite of models each having its own performance measurement on the unseen test data. The model that has the best performance can be selected for use in production.

Model deployment

The next step is to save the final model that can be used for the future. There are several ways to save the model as an object. Once saved, the model can be reloaded any time and it can be used for scoring the new data. Saving the model as an object is a trivial task and a number of libraries are available in Python and R to achieve it. As a result of saving the model, the model object gets persisted to the disk as a .sav file or a .pkl file or a .pmml object depending on the library used. The object can then be loaded into the memory to perform scoring on unseen data.

The final model that is selected for use in production can be deployed to score unseen data in the following two modes:

Batch mode: Batch mode scoring is when one accumulates the unseen data to be scored in a file, then run a batch job (just another executable script) at a predetermined time to perform scoring. The job loads the model object from disk to the memory and runs on each of the records in the file that needs to be scored. The output is written to another file at a specified location as directed in the batch job script. It may be noted that the records to be scored should have the same number of columns as in the training data and the type of columns should also comply with the training data. It should be ensured that the number of levels in factors columns (nominal type data) should also match with that of the training data.
Real-time mode: There are times where the business needs model scoring to happen on the fly. In this case, unlike the batch mode, data is not accumulated and we do not wait until the batch job runs for scoring. The expectation is that each record of the data, as and when it is available for scoring should be scored by the model. The result of the scoring is to be available to business users almost instantaneously. In this case, a model needs to be deployed as a web service that can serve any requests that come in. The record to be scored can be passed to the web service through a simple API call which, in turn, returns the scored result that can be consumed by the downstream applications. Again, the unscored data record that is passed in the API call should comply with the format of the training data records.

Yet another way of achieving near real-time results is by running the model job on micro batches of data several times a day and at very frequent intervals. The data gets accumulated between the intervals until a point where the model job kicks off. The model job scores and outputs the results for the data that is accumulated similar to batch mode. The business user gets to see the scored results as soon as the micro batch job finishes execution. The only difference between the micro batches processing versus the batch is that unlike the batch mode, business users need not wait until the next business day to get the scored results.

Though, the model building pipeline ends with successfully deploying the ML model and making it available for scoring, in real-world business situations, the job does not end here. Of course, the success parties flow in but there is a need to look again at the models post a certain point in time (maybe in a few months post the deployment). A model that is not maintained at regular intervals does not get very well used by businesses.

To avoid the models from perishing and not being used by business users, it is important to collect feedback on the performance of the model over a period of time and capture if any improvements need to be incorporated in the models. The unseen data does not come with labels, therefore comparing the model output with that of the desired output by business is a manual exercise. Collaborating with business users is a strong requirement to get feedback in this situation.

If there is a continued business need for the model and if the performance is not up to the mark on the unseen data that is scored with existing model, it needs to be investigated to identify the root cause(s). It may so happen that several things have changed in the data that is scored over a period of time when compared to the data on which model was initially trained. In which case, there is a strong need to recalibrate the model and it is essentially a jolly good idea to start once again!

Now that the book has covered all the essentials of ML and the project pipeline, the next topic to be covered is the learning paradigm, which will help us learn several ML algorithms.