Roadmap for Building Machine Learning Models
The roadmap for building machine learning models is straightforward and consists of five major steps, which are explained here:
- Data Pre-processing
This is the first step in building a machine learning model. Data pre-processing refers to the transformation of data before feeding it into the model. It deals with the techniques that are used to convert unusable raw data into clean reliable data.
Since data collection is often not performed in a controlled manner, raw data often contains outliers (for example, age = 120), nonsensical data combinations (for example, model: bicycle, type: 4-wheeler), missing values, scale problems, and so on. Because of this, raw data cannot be fed into a machine learning model because it might compromise the quality of the results. As such, this is the most important step in the process of data science.
- Model Learning
After pre-processing the data and splitting it into train/test sets (more on this later), we move on to modeling. Models are nothing but sets of well-defined methods called algorithms that use pre-processed data to learn patterns, which can later be used to make predictions. There are different types of learning algorithms, including supervised, semi-supervised, unsupervised, and reinforcement learning. These will be discussed later.
- Model Evaluation
In this stage, the models are evaluated with the help of specific performance metrics. With these metrics, we can go on to tune the hyperparameters of a model in order to improve it. This process is called hyperparameter optimization. We will repeat this step until we are satisfied with the performance.
- Prediction
Once we are happy with the results from the evaluation step, we will then move on to predictions. Predictions are made by the trained model when it is exposed to a new dataset. In a business setting, these predictions can be shared with decision makers to make effective business choices.
- Model Deployment
The whole process of machine learning does not just stop with model building and prediction. It also involves making use of the model to build an application with the new data. Depending on the business requirements, the deployment may be a report, or it may be some repetitive data science steps that are to be executed. After deployment, a model needs proper management and maintenance at regular intervals to keep it up and running.
This chapter will mainly focus on pre-processing. We will cover the different tasks involved in data pre-processing, such as data representation, data cleaning, and others.