The next step is to save the final model so that it can be reused later. A model can be saved as an object in several ways. Once saved, it can be reloaded at any time and used to score new data. Saving a model as an object is a straightforward task, and a number of libraries are available in Python and R to do it. Depending on the library used, the model object is persisted to disk as a .sav file, a .pkl file, or a .pmml file. The object can then be loaded into memory to score unseen data.
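As a minimal sketch of the save-and-reload cycle, the following uses Python's built-in `pickle` module to write an object to a .pkl file and read it back. The `model` dictionary, the helper names `save_model` and `load_model`, and the file name `final_model.pkl` are assumptions made for illustration; in a real project the object would be the fitted estimator produced by the chosen ML library.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a dict of linear coefficients.
# In practice this would be the fitted estimator returned by an ML library.
model = {"intercept": 0.5, "coefficients": {"age": 0.02, "income": 0.0001}}


def save_model(model_obj, path):
    """Persist the model object to disk as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model_obj, f)


def load_model(path):
    """Load a previously saved model object back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Illustrative location; a real deployment would use a managed model directory.
model_path = os.path.join(tempfile.gettempdir(), "final_model.pkl")
save_model(model, model_path)
reloaded = load_model(model_path)
```

Pickle files can execute arbitrary code when loaded, so a model file should only be loaded if it comes from a trusted source.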
The final model selected for production can be deployed to score unseen data in the following two modes:
- Batch mode: In batch mode scoring, the unseen data to be scored is accumulated in a file, and a batch job (simply an executable script) runs at a predetermined time to score it. The job loads the model object from disk into memory and scores each record in the file. The output is written to another file at the location specified in the batch job script. The records to be scored must have the same number of columns as the training data, and the column types must match those of the training data. Factor columns (nominal data) must also have the same levels as in the training data.
- Real-time mode: At times the business needs model scoring to happen on the fly. In this case, unlike batch mode, data is not accumulated and there is no waiting for a batch job to run. Each record is expected to be scored by the model as soon as it becomes available, and the result should reach business users almost instantaneously. To achieve this, the model is deployed as a web service that serves incoming requests. The record to be scored is passed to the web service through a simple API call, which returns the scored result for downstream applications to consume. Again, the unscored record passed in the API call must comply with the format of the training data records.
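The two modes can share the same validation and scoring logic. The sketch below is one possible arrangement, not a prescribed design: `TRAINING_SCHEMA`, `TRAINING_LEVELS`, the `MODEL` dictionary, and all function names are hypothetical, and the model is a simple linear stand-in for an object loaded from disk. `run_batch_job` scores a whole file of accumulated records, while `handle_request` shows what a web service endpoint might do with the body of a single API call.

```python
import csv
import io
import json

# Hypothetical schema captured at training time: column name -> type converter,
# plus the factor levels seen for each nominal column.
TRAINING_SCHEMA = {"age": float, "income": float, "segment": str}
TRAINING_LEVELS = {"segment": {"retail", "corporate"}}

# Hypothetical stand-in for a model object loaded from disk.
MODEL = {
    "intercept": 0.5,
    "coefficients": {"age": 0.02, "income": 0.0001},
    "segment_effect": {"retail": 0.0, "corporate": 1.0},
}


def validate_record(raw):
    """Check columns, types, and factor levels against the training schema."""
    if set(raw) != set(TRAINING_SCHEMA):
        raise ValueError(f"expected columns {sorted(TRAINING_SCHEMA)}, got {sorted(raw)}")
    record = {col: convert(raw[col]) for col, convert in TRAINING_SCHEMA.items()}
    for col, levels in TRAINING_LEVELS.items():
        if record[col] not in levels:
            raise ValueError(f"unseen level {record[col]!r} in column {col!r}")
    return record


def score_record(raw):
    """Validate one record and return its score from the linear stand-in model."""
    record = validate_record(raw)
    score = MODEL["intercept"] + MODEL["segment_effect"][record["segment"]]
    for col, weight in MODEL["coefficients"].items():
        score += weight * record[col]
    return score


def run_batch_job(in_file, out_file):
    """Batch mode: score every accumulated record and write the results out."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames + ["score"])
    writer.writeheader()
    for row in reader:
        writer.writerow({**row, "score": score_record(row)})


def handle_request(body):
    """Real-time mode: score the single record carried in one API call."""
    try:
        return json.dumps({"score": score_record(json.loads(body))})
    except ValueError as err:  # also covers malformed JSON
        return json.dumps({"error": str(err)})


# In-memory files stand in for the input and output files of a batch job.
batch_input = io.StringIO("age,income,segment\n30,50000,retail\n")
batch_output = io.StringIO()
run_batch_job(batch_input, batch_output)

response = handle_request('{"age": 30, "income": 50000, "segment": "corporate"}')
```

Keeping the schema check in one function means a record with a missing column, a non-numeric value, or an unseen factor level is rejected in the same way in both modes.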
Yet another way of achieving near real-time results is to run the model job on micro batches of data several times a day, at very frequent intervals. Data accumulates between runs until the model job kicks off. The job then scores the accumulated data and outputs the results, just as in batch mode. The business user sees the scored results as soon as the micro-batch job finishes execution. The only difference between micro-batch processing and batch mode is that business users need not wait until the next business day to get the scored results.
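The accumulate-then-score cycle of micro batching can be sketched as follows. The `MicroBatchScorer` class and its parameters are hypothetical; a production setup would more likely rely on a job scheduler that triggers the scoring script at fixed intervals. The clock is passed in as an argument only so that the behavior can be exercised without actually waiting.

```python
import time


class MicroBatchScorer:
    """Accumulate records and score them together once an interval elapses."""

    def __init__(self, score_fn, interval_seconds, clock=time.monotonic):
        self.score_fn = score_fn              # hypothetical scoring function
        self.interval_seconds = interval_seconds
        self.clock = clock                    # injectable so it can be tested
        self.buffer = []
        self.last_run = clock()

    def add(self, record):
        """Buffer a record; run the job if the interval has elapsed."""
        self.buffer.append(record)
        if self.clock() - self.last_run >= self.interval_seconds:
            return self.flush()
        return []

    def flush(self):
        """Score everything accumulated so far, as a batch job would."""
        results = [(record, self.score_fn(record)) for record in self.buffer]
        self.buffer = []
        self.last_run = self.clock()
        return results
```

Choosing the interval is a trade-off: shorter intervals bring the results closer to real time, at the cost of running the job more often.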
Although the model building pipeline ends with successfully deploying the ML model and making it available for scoring, in real-world business situations the job does not end there. The success parties are well deserved, but the models need to be looked at again after some time, perhaps a few months after deployment. A model that is not maintained at regular intervals tends to fall out of use by the business.
To prevent models from perishing and falling out of use, it is important to collect feedback on their performance over a period of time and to identify any improvements that need to be incorporated. Unseen data does not come with labels, so comparing the model's output with the outcome the business expected is a manual exercise. Close collaboration with business users is therefore essential for gathering this feedback.
If there is a continued business need for the model and its performance on unseen data is not up to the mark, the root cause(s) must be investigated. It may well be that the data being scored has changed in several ways over time compared to the data on which the model was initially trained. In that case, the model needs to be recalibrated, and it is a jolly good idea to start the pipeline once again!
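One way to check whether the scored data has drifted away from the training data is to compare how a feature is distributed in each. The sketch below uses the population stability index (PSI), a technique chosen here purely for illustration and not one prescribed by the pipeline described above. The function names, the bin edges, and the `eps` floor for empty bins are all assumptions.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges):
    """Fraction of values falling into each bin defined by sorted inner edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]


def population_stability_index(train_values, scored_values, edges, eps=1e-6):
    """PSI of one feature: sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = bin_shares(train_values, edges)
    actual = bin_shares(scored_values, edges)
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor the shares to avoid log(0) on empty bins
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical values of a single feature at training time and at scoring time.
train_age = [20, 25, 30, 35, 40, 45, 50, 55]
scored_age = [45, 50, 55, 60, 65, 70, 75, 80]
drift = population_stability_index(train_age, scored_age, edges=[30, 45])
```

A PSI of zero means the two binned distributions are identical, and larger values indicate a bigger shift. What value should trigger recalibration is a judgment call to be agreed with the business.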
Now that the book has covered all the essentials of ML and the project pipeline, the next topic is the learning paradigm, which will help us learn several ML algorithms.