A real-world ML project is always filled with some unexpected challenges that we get to experience at different stages. The main reason for this is that the data present in the real world, and the ML algorithms, are not perfect. Though these challenges hamper the performance of the overall ML setup, they don’t prevent us from creating a valuable ML application. In a new ML project, it is difficult to know the challenges up front. They are often found during different stages of the project. Some of these challenges are not obvious and require skilled or experienced ML practitioners (or data scientists) to identify them and apply countermeasures to reduce their effect.
In this section, we will understand some of the common challenges encountered during the development of a typical ML solution. The following list shows some common challenges we will discuss in more detail:
- Data collection and security
- Non-representative training set
- Poor quality of data
- Underfitting of the training dataset
- Overfitting of the training dataset
- Infrastructure requirements
Now, let’s learn about each of these common challenges in detail.
Data collection and security
One of the most common challenges that organizations face is data availability. ML algorithms require a large amount of good-quality data in order to provide quality results. Thus, the availability of raw data is critical for a business if it wants to implement ML. Sometimes, even if the raw data is available, gathering data is not the only concern; we often need to transform or process the data in a way that our ML algorithm supports.
Data security is another important challenge that is very frequently faced by ML developers. When we get data from a company, it is essential to differentiate between sensitive and non-sensitive information to implement ML correctly and efficiently. The sensitive part of data needs to be stored in fully secured servers (storage systems) and should always be kept encrypted. Sensitive data should be avoided for security purposes, and only the less-sensitive data access should be given to trusted team members working on the project. If the data contains Personally Identifiable Information (PII), it can still be used by anonymizing it properly.
Non-representative training data
A good ML model is one that performs equally well on unseen data and training data. It is only possible when your training data is a good representative of most possible business scenarios. Sometimes, when the dataset is small, it may not be a true representative of the inherent distribution, and the resulting model may provide inaccurate predictions on unseen datasets despite having high-quality results on the training dataset. This kind of non-representative data is either the result of sampling bias or the unavailability of data. Thus, an ML model trained on such a non-representative dataset may have less value when it is deployed in production.
If it is impossible to get a true representative training dataset for a business problem, then it’s better to limit the scope of the problem to only the scenarios for which we have a sufficient amount of training samples. In this way, we will only get known scenarios in the unseen dataset, and the model should provide quality predictions. Sometimes, the data related to a business problem keeps changing with time, and it may not be possible to develop a single static model that works well; in such cases, continuous retraining of the model on the latest data becomes essential.
Poor quality of data
The performance of ML algorithms is very sensitive to the quality of training samples. A small number of outliers, missing data cases, or some abnormal scenarios can affect the quality of the model significantly. So, it is important to treat such scenarios carefully while analyzing the data before training any ML algorithm. There are multiple methods for identifying and treating outliers; the best method depends upon the nature of the problem and the data itself. Similarly, there are multiple ways of treating the missing values as well. For example, mean, median, mode, and so on are some frequently used methods to fill in missing data. If the training data size is sufficiently large, dropping a small number of rows with missing values is also a good option.
As discussed, the quality of the training dataset is important if we want our ML system to learn accurately and provide quality results on the unseen dataset. It means that the data pre-processing part of the ML life cycle should be taken very seriously.
Underfitting the training dataset
Underfitting an ML model means that the model is too simple to learn the inherent information or structure of the training dataset. It may occur when we try to fit a non-linear distribution using a linear ML algorithm such as linear regression. Underfitting may also occur when we utilize only a minimal set of features (that may not have much information about the target distribution) while training the model. This type of model can be too simple to learn the target distribution. An underfitted model learns too little from the training data and, thus, makes mistakes on unseen or test datasets.
There are multiple ways to tackle the problem of underfitting. Here is a list of some common methods:
- Feature engineering – add more features that represent target distribution
- Non-linear algorithms – switch to a non-linear algorithm if the target distribution is not linear
- Removing noise from the data
- Add more power to the model – increase trainable parameters, increase depth or number of trees in tree-based ensembles
Just like underfitting the model on training data, overfitting is also a big issue. Let’s deep dive into it.
Overfitting the training dataset
The overfitting problem is the opposite of the underfitting problem. Overfitting is the scenario when the ML model learns too much unnecessary information from the training data and fails to generalize on a test or unseen dataset. In this case, the model performs extremely well on the training dataset, but the metric value (such as accuracy) is very low on the test set. Overfitting usually occurs when we implement a very complex algorithm on simple datasets.
Some common methods to address the problem of overfitting are as follows:
- Increase training data size – ML models often overfit on small datasets
- Use simpler models – When problems are simple or linear in nature, choose simple ML algorithms
- Regularization – There are multiple regularization methods that prevent complex models from overfitting on the training dataset
- Reduce model complexity – Use a smaller number of trainable parameters, train for a smaller number of epochs, and reduce the depth of tree-based models
Overfitting and underfitting are common challenges and should be addressed carefully, as discussed earlier. Now, let’s discuss some infrastructure-related challenges.
Infrastructure requirements
ML is expensive. A typical ML project often involves crunching large datasets with millions or billions of samples. Slicing and dicing such datasets requires a lot of memory and high-end multi-core processors. Additionally, once the development of the project is complete, dedicated servers are required to deploy the models and match the scale of consumers. Thus, business organizations willing to practice ML need some dedicated infrastructure to implement and consume ML efficiently. This requirement increases further when working with large, deep learning models such as transformers, large language models (LLMs), and so on. Such models usually require a set of accelerators, graphical processing units (GPUs), or tensor processing units (TPUs) for training, finetuning, and deployment.
As we have discussed, infrastructure is critical for practicing ML. Companies that lack such infrastructure can consult with other firms or adopt cloud-based offerings to start developing ML-based applications.
Now that we understand the common challenges faced during the development of an ML project, we should be able to make more informed decisions about them. Next, let’s learn about some of the limitations of ML.