Choosing the right approach
Before deciding to use ML for a project, understand the problem first and assess whether ML can solve it. Invest enough time with the right stakeholders to learn what their expectations are. Some problems are better suited to traditional approaches, such as when a system already has predefined business rules. It is faster and easier to code rules than it is to train a model, and you do not need a large amount of data.
When deciding whether to use ML, consider whether your problem actually needs results derived from patterns learned in data, or whether explicit rules would do. If you are building a system that reads an airline's frequent-flyer database to find customers to whom you want to send a promotion, a rule-based system may give you perfectly acceptable results. An ML-based system may give you better matches in certain scenarios, but will the time spent building it be worth it?
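To make the comparison concrete, here is a minimal sketch of what the rule-based option might look like for the frequent-flyer example. The field names, thresholds, and tier labels are all hypothetical, chosen only for illustration; a real airline would define its own rules with the business stakeholders.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical fields; a real frequent-flyer schema will differ.
    customer_id: str
    miles_last_year: int
    flights_last_year: int
    tier: str


# Illustrative thresholds (assumptions, not real business rules).
MIN_MILES = 25_000
MIN_FLIGHTS = 4


def is_promo_candidate(customer: Customer) -> bool:
    """Rule: active travelers who are not already in the top tier."""
    return (
        customer.miles_last_year >= MIN_MILES
        and customer.flights_last_year >= MIN_FLIGHTS
        and customer.tier != "platinum"
    )


def select_promo_customers(customers):
    """Return the IDs of all customers who satisfy the rule."""
    return [c.customer_id for c in customers if is_promo_candidate(c)]


customers = [
    Customer("C001", 40_000, 10, "gold"),
    Customer("C002", 5_000, 1, "basic"),
    Customer("C003", 90_000, 30, "platinum"),
    Customer("C004", 26_000, 4, "silver"),
]
selected = select_promo_customers(customers)
print(selected)
```

A rule like this needs no training data and is easy to explain and change, which is the trade-off the paragraph above describes. Whether it is good enough depends on how much better a learned model would match customers in practice.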
The importance of data
The performance of your ML model depends on the quality and accuracy of its data. Unfortunately, data collection and processing often do not get the attention they deserve, which proves costly in later stages of the project, when the model turns out to be unsuitable for the given task.
The paper cited here discusses this challenge. An interesting example quoted in the paper is of a team building a model to detect a particular pattern in patient scans, which worked brilliantly on test data. However, the model failed in production because the scans being fed into it contained tiny dust particles, which degraded its performance. This is a classic case of a team focusing on model building and not on how the model will be used in the real world.
One thing teams should focus on is data validation and cleansing. Data is often missing or incorrect: for example, a string value in a numeric column, different date formats in the same field, or the same identifier (ID) for different records when the records come from different systems. Any of these anomalies can produce a model that performs poorly.
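The checks described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout (`id`, `source`, `amount`, `date`), the expected date format, and the function name `validate_records` are all hypothetical, and a real pipeline would likely use a dedicated validation tool.

```python
from collections import defaultdict
from datetime import datetime

# Assumed canonical date format for the hypothetical "date" field.
EXPECTED_DATE_FORMAT = "%Y-%m-%d"


def validate_records(records):
    """Return a list of (row_index, problem) tuples; row_index is None
    for problems that span several rows."""
    problems = []
    sources_by_id = defaultdict(set)

    for i, rec in enumerate(records):
        # Missing values and strings in a numeric column.
        amount = rec.get("amount")
        if amount is None or amount == "":
            problems.append((i, "amount is missing"))
        else:
            try:
                float(amount)
            except (TypeError, ValueError):
                problems.append((i, "amount is not numeric"))

        # Mixed date formats in the same field.
        try:
            datetime.strptime(rec.get("date"), EXPECTED_DATE_FORMAT)
        except (TypeError, ValueError):
            problems.append((i, "date is not in the expected format"))

        # Remember which source systems use each ID.
        sources_by_id[rec.get("id")].add(str(rec.get("source", "unknown")))

    # The same ID arriving from different systems may denote different records.
    for rec_id, sources in sources_by_id.items():
        if len(sources) > 1:
            problems.append(
                (None, f"id {rec_id} appears in multiple sources: {sorted(sources)}")
            )
    return problems


records = [
    {"id": "A1", "source": "crm", "amount": "120.50", "date": "2023-01-15"},
    {"id": "A2", "source": "crm", "amount": "twelve", "date": "2023-01-16"},
    {"id": "A3", "source": "billing", "amount": "", "date": "16/01/2023"},
    {"id": "A1", "source": "billing", "amount": "75", "date": "2023-02-01"},
]
problems = validate_records(records)
for row, problem in problems:
    print(row, problem)
```

The sketch only reports problems rather than fixing them, since the right correction (dropping a row, imputing a value, remapping an ID) usually depends on the data and the task.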
Once you've been through this process and come to the decision that yes, ML is the way to go… what next?