Introducing machine learning pipelines
In Chapter 6, Solving Real-World Data Science Problems with LightGBM, we gave a detailed overview of the data science life cycle, which includes various steps to train an ML model. If we were to focus only on the steps required to train a model, given data that has already been collected, those would be as follows:
- Data cleaning and preparation
- Feature engineering
- Model training and tuning
- Model evaluation
- Model deployment
In previous case studies, we applied these steps manually while working through a Jupyter notebook. However, what would happen if we shifted the context to a long-term ML project? Whenever new data became available, we would have to repeat the same procedure, step by step, to rebuild the model successfully.
Similarly, when we use the model to score new data, we must apply the same steps, with the same parameters and configuration, every time.
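To make the scoring requirement concrete, the following is a minimal sketch, not taken from the case studies, of what applying the steps "with the correct parameters" can mean for a single numeric feature. The helper names `fit_preparation` and `apply_preparation`, the toy values, and the choice of median imputation followed by standardization are all assumptions made for illustration. The point is only that the preparation parameters are estimated once, on the training data, and then reused unchanged when new data is scored.

```python
from statistics import mean, median, pstdev


def fit_preparation(values):
    """Learn preparation parameters from the training data only (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)  # value used to impute missing entries
    filled = [fill_value if v is None else v for v in values]
    return {
        "fill_value": fill_value,
        "mean": mean(filled),
        "std": pstdev(filled) or 1.0,  # guard against a zero standard deviation
    }


def apply_preparation(values, params):
    """Apply stored parameters; nothing is re-estimated here (hypothetical helper)."""
    filled = [params["fill_value"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


# Toy data, assumed for illustration: one feature with a missing value.
train_feature = [1.0, 2.0, None, 4.0, 5.0]
new_feature = [None, 3.0]

# Training time: estimate the parameters once and keep them.
params = fit_preparation(train_feature)
train_prepared = apply_preparation(train_feature, params)

# Scoring time: reuse the stored parameters on the new data.
new_prepared = apply_preparation(new_feature, params)
```

If we instead re-estimated the median and scale from `new_feature` alone, the prepared values would no longer match what the model saw during training, which is exactly the kind of mistake that is easy to make when each step is repeated by hand.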
In a sense, these...