A data science project aims to infuse an application with intelligence extracted from data. In this section, you will discover the common tasks and key considerations needed within such a project. There are quite a few well-established life cycle processes, such as Team Data Science Process (TDSP) and Cross-Industry Standard Process for Data Mining (CRISP-DM), that describe the iterative stages executed in a typical project. The most common stages are shown in Figure 1.1:
Figure 1.1 – The iterative stages of a data science project
Although the diagram shows some indicative flows between the phases, you are free to jump from one phase to any other if needed. Moreover, this approach is iterative, and the data science team should go through multiple iterations, improving its business understanding and the resulting model until the success criteria are met. You will read more about the benefits of an iterative process in this chapter's Adopting the DevOps mindset section. The data science process starts from the business understanding phase, something you will read more about in the next section.
Understanding of the business problem
The first stage in a data science project is that of business understanding. In this stage, the data science team collaborates with the business stakeholders to define a short, straightforward question that machine learning will try to answer.
Figure 1.2 shows the five most frequent questions that machine learning can answer:
Figure 1.2 – Five questions machine learning can answer
Behind each of those questions, there is a group of modeling techniques you will use:
- Regression models allow you to predict a numeric value based on one or more features. For example, in Chapter 8, Experimenting with Python Code, you will try to predict a numeric value based on 10 measurements taken one year before the value you are trying to predict. Training a regression model is a supervised machine learning task, meaning that you need to provide enough sample data to train the model to predict the desired numeric value.
- Classification models allow you to predict a class label for a given set of inputs. This label can be as simple as a yes/no label or a blue, green, or red color. For example, in Chapter 5, Letting the Machines Do the Model Training, you will be training a classification model to detect whether a customer is going to cancel their phone subscription or not. Predicting whether a person is going to stop doing something is referred to as churn or attrition detection. Training a classification model is a supervised machine learning task and requires a labeled dataset to train the model. A labeled dataset contains both the inputs and the label that you want the model to predict.
- Clustering is an unsupervised machine learning task that groups data. In contrast to the previous two model types, clustering doesn't require any training data. It operates on the given dataset and creates the desired number of clusters, assigning each data point to the cluster to which it belongs (see the short sketch after this list). A common use case of clustering models is when you try to identify distinct consumer groups in your customer base that you will then target with specific marketing campaigns.
- Recommender systems are designed to recommend the best options based on user profiles. Search engines, e-shops, and popular video streaming platforms utilize this type of model to produce personalized recommendations on what to do next.
- Anomaly detection models can detect outliers from a dataset or within a data stream. Outliers are items that don't belong with the rest of the elements, indicating anomalies. For example, if a vibration sensor of a machine starts sending abnormal measurements, it may be a good indication that the device is about to fail.
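To make the clustering case more concrete, here is a minimal sketch using scikit-learn's KMeans; the two-feature dataset and the choice of three clusters are arbitrary assumptions for illustration:

from sklearn.cluster import KMeans
import numpy as np

# six hypothetical customers described by two features, such as age and monthly spend
data = np.array([[25, 40], [27, 42], [60, 15], [62, 13], [45, 30], [44, 28]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
print(kmeans.fit_predict(data))  # the cluster assigned to each data point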
During the business understanding phase, you will try to understand the problem statement and define the success criteria. Setting proper expectations of what machine learning can and cannot do is key to ensuring alignment between teams.
Throughout a data science project, it is common to have multiple rounds of business understanding. The data science team acquires a lot of insights after exploring datasets or training a model. It is helpful to bring those insights back to the business stakeholders to either verify your team's hypotheses or gain even more insight into the problem you are tackling. For example, business stakeholders may be able to explain the rationale behind a pattern that you detected in the data but could not interpret yourself.
Once you get a good grasp of what you are trying to solve, you need to get some data, explore it, and even label it, something you will read about in the next section.
Acquiring and exploring the data
After understanding the problem you are trying to solve, it's time to gather the data to support the machine learning process. Data comes in many forms and formats. It can be well-structured tabular data stored in database systems, or files, such as images, stored in file shares. Initially, you will not know which data to collect, but you must start somewhere. A running joke while looking for data is that there is always an Excel file somewhere containing critical information, and you must keep asking for it until you find it.
Once you have located the data, you will have to analyze it to understand whether the dataset is complete. Data is often stored in on-premises systems or Online Transaction Processing (OLTP) databases that you cannot easily access. Even if the data is accessible, it is not advisable to explore it directly within the source system, as you may accidentally impact the performance of the underlying engine that hosts the data. For example, a complex query on top of a sales table may affect the performance of the e-shop solution. In these cases, it is common to export the required datasets in a file format, such as the highly interoperable Comma-Separated Values (CSV) format or the Parquet format, which is far better optimized for analytical processing. These files are then uploaded to cheap cloud storage and become available for further analysis.
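As an indicative example, here is a minimal sketch of converting such an exported CSV file into the Parquet format with pandas; the file names are hypothetical, and the to_parquet call requires the pyarrow or fastparquet package:

import pandas as pd

sales = pd.read_csv("sales_export.csv")   # hypothetical exported dataset
sales.to_parquet("sales_export.parquet")  # columnar format, better suited for analytics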
Within Microsoft Azure, the most common target is either a Blob container within a storage account or a folder in the filesystem of Azure Data Lake Storage Gen2, which offers a far more granular access control mechanism. Copying the data can be done in a one-off manner using tools such as AzCopy or Storage Explorer. If you would like to configure a repeatable process that incrementally pulls new data on a schedule, you can use more advanced tools, such as the pipelines of Azure Data Factory or Azure Synapse Analytics. In Chapter 4, Configuring the Workspace, you will review the components needed to pull data from on-premises systems and the available datastores you can connect to from within the AzureML workspace to access the various datasets. In the Working with datasets section of Chapter 4, Configuring the Workspace, you will read about the dataset types supported by AzureML and how you can explore them to gain insights into the information stored within them.
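For a one-off programmatic upload, a minimal sketch using the azure-storage-blob Python package could look as follows; the connection string, container, and blob names are hypothetical placeholders:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="datasets",
    blob_name="sales_export.parquet",
)
with open("sales_export.parquet", "rb") as data:
    blob.upload_blob(data, overwrite=True)  # replace any existing blob with the same name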
A common task when gathering data is the data cleansing step. During this step, you remove duplicate records, impute missing values, and fix common data entry issues. For example, you could harmonize a country text field by replacing UK records with United Kingdom. Within AzureML, you can perform such cleansing operations either in the designer that you will see in Chapter 6, Visual Model Training and Publishing, or through the notebooks experience you will be working with from Chapter 7, The AzureML Python SDK, onward. Although you may start doing these cleansing operations with AzureML, as the project matures, they tend to move into the pipelines of Azure Data Factory or Azure Synapse Analytics, which pull the data out of the source systems.
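The following is a minimal pandas sketch of the cleansing steps just described, assuming dataset is a pandas DataFrame; the age column is a hypothetical example:

dataset = dataset.drop_duplicates()  # remove duplicate records
dataset["age"] = dataset["age"].fillna(dataset["age"].median())  # impute missing values
dataset["country"] = dataset["country"].replace("UK", "United Kingdom")  # harmonize entries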
Important note
While doing data cleansing, be aware of yak shaving. The term yak shaving was coined in the 90s to describe the situation where, while working on a task, you realize that you must do another task, which leads to another one, and so on. This chain of tasks may take you away from your original goal. For example, you may realize that, in the country text field example, some records have invalid character encoding, although you can still understand the referenced country. You decide to change the export encoding of the CSV file, only to realize that the export tool you were using is old and doesn't support UTF-8. That leads you on a quest to find a system administrator to get your software updated. Instead of going down that route, make a note of what needs to be done and add it to your backlog. You can fix the issue in the next iteration, when you will have a better understanding of whether you actually need this field.
Another common task is labeling the dataset, especially if you are dealing with supervised machine learning models. For example, if you are curating a dataset to predict whether a customer will churn, you will have to flag the records of the customers who canceled their subscriptions. A more complex labeling case is when you create a sentiment analysis model for social media messages. In that case, you will need to get a feed of messages, go through them, and assign a label to each indicating whether its sentiment is positive or negative.
Within AzureML Studio, you can create labeling projects that allow you to scale out the labeling effort. AzureML allows you to define either a text labeling or an image labeling task. You then bring in team members to label the data based on the given instructions. Once the team has started labeling the data, AzureML automatically trains a model appropriate to the defined task. When the model is good enough, it starts providing suggestions to the labelers to improve their productivity. Figure 1.3 shows the labeling project creation wizard and the various options currently available in the image labeling task:
Figure 1.3 – Creating an AzureML labeling project
Through this project phase, you should have discovered the related source systems and produced a cleansed dataset ready for model training. In the next section, you will learn how to create additional data features that assist the model training, a practice known as feature engineering.
Feature engineering
During the feature engineering phase, you will be generating new data features that better represent the problem you are trying to solve and help the machines learn from the dataset. For example, the following code block creates a new feature named product_id by transforming the product column of the sales dataset:
product_map = { "orange juice": 1, "lemonade juice": 2 }
dataset["product_id"] = dataset["product"].map(product_map)
This code block uses the pandas map method to convert text into numeric values. The product column is referred to as a categorical variable, as all records fall within a finite number of categories, in this case, orange juice or lemonade juice. If you had a 1-to-5 rating feature in the same dataset, that would have been a discrete numeric variable, with a finite number of values it can take, in this case, only 1, 2, 3, 4, or 5. If you had a column recording how many liters or gallons a customer bought, that would have been a continuous numeric variable, able to take any numeric value greater than or equal to zero, such as half a liter. Besides numeric values, date fields are also considered continuous variables.
Important note
Although the product_id feature is a discrete numeric variable in the preceding example, features like it are commonly treated as categorical variables, as you will see in Chapter 5, Letting the Machines Do the Model Training.
There are many featurization techniques available, including scaling numeric values to a common range, binning continuous variables, and one-hot encoding categorical ones. In Chapter 10, Understanding Model Results, you will use the MinMaxScaler class from the sklearn library to scale numeric features.
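As a taste of what that looks like, here is a minimal sketch, assuming dataset is a pandas DataFrame and liters_bought is a hypothetical numeric column:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # scales each feature to the [0, 1] range by default
dataset[["liters_bought"]] = scaler.fit_transform(dataset[["liters_bought"]])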
As a last step in the feature engineering stage, you normally remove unnecessary or highly correlated features, a process called feature selection. You will be dropping columns that will not be used to train the machine learning model. By dropping those columns, you reduce the memory requirements of the machines that will be doing the training, you reduce the computation time needed to train the model, and the resulting model will be much smaller in size.
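For example, since the product column of the earlier example is now fully represented by the engineered product_id feature, a minimal feature selection step could simply drop the original column:

dataset = dataset.drop(columns=["product"])  # remove the redundant text column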
While creating those features, it is logical that you may need to go back to the Acquiring and exploring the data phase or even to the Understanding of the business problem stage to get more data and insights. At some point, though, your training dataset will be ready to train the model, something you will read about in the next section.
Training the model
As soon as you have prepared the dataset, the machine learning training process can begin. If the model requires supervised learning and you have enough data, you split it into a training dataset and a validation dataset at a 70/30 or 80/20 ratio. You select the model type you want to train, specify the model's training parameters (called hyperparameters), and train the model. With the held-out validation dataset, you evaluate the trained model's performance according to a metric and decide whether the model is good enough to move to the next stage, or whether you should return to the Understanding of the business problem stage. The training process of a supervised model is depicted in Figure 1.4:
Figure 1.4 – Training a supervised machine learning model
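As a minimal sketch of such a split, assuming scikit-learn and that the X (features) and y (labels) arrays already exist, a 70/30 split could look as follows:

from sklearn.model_selection import train_test_split

# hold out 30% of the data for validation; random_state makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)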
There are a couple of variations to the preceding statement:
- If the model is in the unsupervised learning category, such as the clustering algorithms, you just pass all the data to train the model. You then evaluate whether the detected clusters address the business need or not, modify the hyperparameters, and try again.
- If you have a model that requires supervised learning but don't have enough data, the k-fold cross-validation technique is commonly used. With k-fold, you specify the number of folds into which you want to split the dataset. AzureML's AutoML, for example, performs 10-fold cross-validation if the dataset has fewer than 1,000 rows, or 3-fold if it has between 1,000 and 20,000 rows. Once you have those folds, you start an iterative process (automated in the sketch after this list) where you do the following:
1. Keep one fold aside for validation and train a new model with the remaining folds.
2. Evaluate the produced model against the fold that you kept aside.
3. Record the model's score and discard the model.
4. Repeat from step 1, keeping a different fold aside for validation, until every fold has been used for validation.
5. Produce the aggregated model performance.
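This whole loop is what utilities such as scikit-learn's cross_val_score automate. Here is a minimal sketch, assuming the model and the X and y arrays already exist, with an arbitrary choice of 5 folds:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)  # one score per held-out fold
print(scores.mean())  # the aggregated model performance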
Important note
In the machine learning research literature, there is an approach called semi-supervised learning. In that approach, a small amount of labeled data is combined with a large amount of unlabeled data to train the model.
Instead of training a single model, evaluating the results, and trying again with a different set of hyperparameters, you can automate the process and evaluate multiple models in parallel. This process is called hyperparameter tuning, something you will dive deep into in Chapter 9, Optimizing the ML Model. In the same chapter, you will learn how you can even automate the model selection, an AzureML capability referred to as AutoML.
Metrics help you select the model that minimizes the difference between the predicted values and the actual ones, and they differ depending on the model type you are training. In regression models, metrics measure the error between the predicted value and the actual one. The most common ones are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Squared Error (RSE), Relative Absolute Error (RAE), the coefficient of determination (R²), and Normalized Root Mean Squared Error (NRMSE), which you are going to see in Chapter 8, Experimenting with Python Code.
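As a minimal sketch, here is how two of these metrics could be computed with scikit-learn, assuming y_val holds the actual values and predictions holds the model's outputs:

from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_val, predictions)  # average absolute difference from the actuals
r2 = r2_score(y_val, predictions)              # the coefficient of determination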
In a classification model, metrics are slightly different, as they have to capture both how many results the model got right and how it misclassified the rest. For example, in the churn binary classification problem, there are four possible results:
- The model predicted that the customer would churn, and the customer churned. This is considered a True Positive (TP).
- The model predicted that the customer would churn, but the customer remained loyal. This is considered a False Positive (FP), since the model was wrong about the customer leaving.
- The model predicted that the customer would not churn, and the customer churned. This is considered a False Negative (FN), since the model was wrong about the customer being loyal.
- The model predicted that the customer would not churn, and the customer remained loyal. This is considered a True Negative (TN).
These four states make up the confusion matrix that is shown in Figure 1.5:
Figure 1.5 – The classification model's evaluation
Through that confusion matrix, you can calculate other metrics, such as accuracy, which measures the proportion of correct results in the evaluation set (in this case, 1132 TP + 2708 TN = 3840 records out of 2708 + 651 + 2229 + 1132 = 6720 total records). Precision, or Positive Predictive Value (PPV), evaluates how many of the positive predictions are actually positive (in this case, 1132 TP out of 1132 + 2229 total positive predictions). Recall, also known as sensitivity, measures how many actual positives were correctly classified (in this case, 1132 TP out of 1132 + 651 actual positives). Depending on the business problem you are trying to solve, you will have to find the balance between the various metrics, as one metric may be more helpful than others. For example, during the COVID-19 pandemic, a model that determines whether someone is infected and has a recall equal to one would identify all infected patients. However, it may also have misclassified some non-infected people as infected, which other metrics, such as precision, would have caught.
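These metrics follow directly from the confusion matrix, as the following short Python sketch shows, using the figures from Figure 1.5:

tp, fp, fn, tn = 1132, 2229, 651, 2708  # the four cells of the confusion matrix

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 3840 / 6720, roughly 0.57
precision = tp / (tp + fp)                  # 1132 / 3361, roughly 0.34
recall = tp / (tp + fn)                     # 1132 / 1783, roughly 0.63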
Important note
Be wary when your model fits your training data too well. This is referred to as overfitting, and it may indicate that the model has identified a pattern within your training dataset that may not exist in real life. Such models tend to perform poorly when put into production and asked to make inferences on unseen data. A common reason for overfitting is a biased training dataset that exposes only a subset of real-world examples. Another reason is target leakage, which means that the value you are trying to predict is somehow passed as an input to the model, perhaps through a feature engineered from the target column. See the Further reading section for guidance on how to handle overfitting and imbalanced data.
As you have seen so far, there are many things to consider while training a machine learning model, and throughout this book, you will get some hands-on experience in training models. In most cases, the first thing you will have to select is the type of computer that is going to run the training process. Currently, you have two options: Central Processing Unit (CPU) or Graphics Processing Unit (GPU) compute targets. Both have at least a CPU, as this is the core element of any modern computer. The difference is that GPU compute targets also offer very powerful graphics cards that can perform massive parallel data processing, making training much faster. To take advantage of a GPU, the model you are training needs to support GPU-based training. GPUs are usually used in neural network training with frameworks such as TensorFlow, PyTorch, and Keras.
Once you have trained a machine learning model that satisfies the success criteria defined during the Understanding of the business problem stage of the data science project, it is time to operationalize it and start making inferences with it. That's what you will read about in the next section.
Deploying the model
When it comes to model operationalization, you have two main approaches:
- Real-time inferences: The model is always loaded, waiting to make inferences on top of incoming data. Typical use cases are web and mobile applications that invoke a model to predict based on user input.
- Batch inferences: The model is loaded every time the batch process is invoked, and it generates predictions on top of the incoming batch of records. For example, imagine that you have trained a model to identify your face in pictures and you want to label all the images you have on your hard drive. You will configure a process to use the model against each image, storing the results in a text or CSV file.
The main difference between the two is whether you already have the data on which to perform the predictions. If you already have the data and it does not change, you can make inferences in batch mode. For example, if you are trying to predict the football scores for next week's matches, you can run a batch inference and store the results in a database. When customers ask for specific predictions, you retrieve the values from the database. During a football match, though, a model predicting the final score needs features such as the current number of players and how many injuries there are, information that only becomes available in real time. In those situations, you will want to deploy a web service that exposes a REST API, where you send in the required information and the model makes the inference in real time. You will dive deep into both the real-time and batch approaches in Chapter 12, Operationalizing Models with Code.
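To give a flavor of the real-time approach, here is a minimal sketch of invoking such a REST endpoint with Python's requests package; the URL and the payload schema are hypothetical placeholders:

import requests

response = requests.post(
    "https://<your-endpoint>/score",  # hypothetical scoring URI
    json={"data": [[3, 1, 0, 2]]},    # hypothetical feature payload
)
print(response.json())  # the model's real-time prediction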
In this section, you reviewed the life cycle of a data science project and went through all the stages, from understanding what needs to be done all the way to operationalizing a model by deploying a batch or real-time service. Especially for real-time streaming, you may have heard the term Structured Streaming, a scalable processing engine built on Spark that allows developers to perform real-time inferences the same way they would perform batch inferences on top of static data. You will learn more about Spark in the next section.