Organizations are eager to adopt ML to drive business growth. In many projects, however, teams become so focused on technical brilliance that they fail to deliver the business value expected from the ML initiative. This can cause early failures that may result in reduced investment in future projects. Businesses face two main challenges in making ML mainstream across the organization:
- Keeping the focus on the big picture
- Siloed teams
Focusing on the big picture
The first challenge organizations face is building an ecosystem in which ML models create value for the business. Teams often concentrate on a few specific areas of a project rather than on all of its aspects, and the business value suffers as a result.
How many organizations do we know of that are successful in their ML journey? Beyond the Googles, Metas (formerly Facebook), and Netflixes of the world, there are few success stories. The number one reason is that teams focus only on building the model. So, what else is there beyond the algorithm? Google published a paper about the hidden technical debt in ML projects (see the Further reading section at the end of this chapter), and it provides a good summary of the things we need to consider to be successful.
Have a look at the following diagram:
Figure 1.2 – The components of an ML system
Can you see the small block in Figure 1.2? The block in the picture captioned ML is the ML model development part, and you can see that there are a lot more processes involved in ML projects. Let's understand a few of them, as follows:
- Data collection and data verification: To have a reliable and trustworthy model, we need a good set of data. ML is all about finding patterns in the data and using those patterns to predict results for unseen data. Therefore, the better the quality of your data, the better your model will perform. The data, however, comes in all shapes and sizes. Some of it may reside in files and some in proprietary databases; some may arrive as data streams, and some may need to be harvested from Internet of Things (IoT) devices. On top of that, the data may be owned by different teams with different security and regulatory requirements. Therefore, you need to think about technologies that allow you to collect, transform, and process data from various sources and in a variety of formats.
- Feature extraction and analysis: Often, assumptions about data quality and completeness are incorrect. Data science teams perform an activity called exploratory data analysis (EDA) in which they read and process data from various sources as fast as they can. Teams further improve their understanding of the data before they invest time in processing the data at scale and going to the model-building stage. Think about how your team or organization can facilitate the data exploration to speed up your ML journey.
Data analysis leads to a better understanding of the data, but feature extraction is a separate activity. It is the process of identifying, through experiments, which data attributes influence the accuracy of the model output and which attributes are irrelevant, or noise. For example, in an ML model that classifies whether a bank transaction is fraudulent, the name of the account holder is likely to be irrelevant, or noise, while the amount of the transaction could be an important feature. The output of this process is a transformed version of the dataset that contains only relevant features and is formatted for consumption by the ML model training process or fitness function. This is sometimes called a feature set. Teams need a tool for performing such analysis and transforming data into a format that is consumable for model training. Data collection, feature extraction, and analysis are sometimes collectively referred to as feature engineering (FE).
- Infrastructure, monitoring, and resource management: You need computers to process and explore data, build and train your models, and deploy ML models for consumption. All these activities need processing power and storage capacity, at the lowest possible cost. Think about how your team will get access to hardware resources on demand and in a self-service fashion. You need to plan how data scientists and engineers will be able to request the required resources as quickly as possible while still following your organization's policies and procedures. You also need system monitoring to optimize resource utilization and improve the operability of your ML platform.
- Model development: Once you have data available in the form of consumable features, you need to build your models. Model building requires many iterations with different algorithms and different parameters. Think about how to track the outcomes of different experiments and where to store your models. Often, different teams can reuse each other's work to increase the velocity of the teams further. Think about how teams can share their findings. Teams must have a tool that can facilitate model training and experiment runs, record model performance and experiment metadata, store models, and manage the tagging of models and promotion to an acceptable and deployable state.
- Process management: As you can see, there is a lot to be done to make a useful model. Think about how to automate model deployment and monitoring. Different personas work on different things, such as data tasks, model tasks, infrastructure tasks, and more. These personas need to collaborate and share to achieve a particular outcome. The real world keeps on changing: once your model is deployed into production, you may need to retrain it regularly with new data. All these activities need well-defined processes and automated stages so that the team can continue working on high-value tasks.
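To make the feature extraction step more concrete, here is a minimal sketch in plain Python of turning raw records into a feature set for the fraud example above. The field names (`account_holder`, `amount`, `hour_of_day`, `is_fraud`), the sample values, and the `build_feature_set` helper are all hypothetical and exist only for illustration. In a real project, the list of relevant attributes would come out of experiments rather than being hardcoded.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical raw transactions; field names and values are illustrative only.
RAW_TRANSACTIONS: List[Dict[str, object]] = [
    {"account_holder": "A. Smith", "amount": 25.0, "hour_of_day": 14, "is_fraud": 0},
    {"account_holder": "B. Jones", "amount": 9800.0, "hour_of_day": 3, "is_fraud": 1},
    {"account_holder": "C. Wu", "amount": 40.5, "hour_of_day": 10, "is_fraud": 0},
]


def build_feature_set(
    records: Sequence[Dict[str, object]],
    relevant: Sequence[str],
    label: str,
) -> Tuple[List[List[float]], List[int]]:
    """Keep only the relevant attributes and format them for model training.

    Attributes that are not listed in `relevant` (for example, the account
    holder's name) are treated as noise and dropped.
    """
    features = [[float(record[name]) for name in relevant] for record in records]
    labels = [int(record[label]) for record in records]
    return features, labels


# Assumption: experiments suggested that these two attributes matter.
features, labels = build_feature_set(
    RAW_TRANSACTIONS, relevant=["amount", "hour_of_day"], label="is_fraud"
)
```

The resulting `features` and `labels` lists are the kind of numeric, training-ready output that the text calls a feature set.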
In summary, you will need an ecosystem that provides solution components for all the needs of an ML system. A single platform of this kind increases the team's velocity by giving everyone a consistent experience across the following building blocks:
- Fetching, storing, and processing data
- Training, tuning, and tracking models
- Deploying and monitoring models
- Automating repetitive tasks, such as data processing and model deployment
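As an illustration of the "training, tuning, and tracking models" building block, the following sketch shows the kind of bookkeeping an experiment-tracking tool performs: recording each run's algorithm, parameters, and metrics, and promoting the best run to a deployable state. It is a toy in-memory example, not the API of any real tool. The class names, the stage names (`candidate`, `deployable`), and the metric values are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One experiment run: what was tried and how it performed."""
    algorithm: str
    params: Dict[str, float]
    metrics: Dict[str, float]
    stage: str = "candidate"  # hypothetical stage name


@dataclass
class ExperimentTracker:
    """Toy in-memory tracker; a real platform would persist this data."""
    runs: List[Run] = field(default_factory=list)

    def log_run(self, algorithm: str, params: Dict[str, float],
                metrics: Dict[str, float]) -> Run:
        run = Run(algorithm, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric: str) -> Run:
        # Assumes a higher metric value is better and at least one run exists.
        return max(self.runs, key=lambda run: run.metrics[metric])

    def promote_best(self, metric: str) -> Run:
        best = self.best_run(metric)
        best.stage = "deployable"  # hypothetical stage name
        return best


tracker = ExperimentTracker()
# Placeholder metric values, not real results.
tracker.log_run("logistic_regression", {"C": 1.0}, {"accuracy": 0.91})
tracker.log_run("random_forest", {"n_estimators": 100}, {"accuracy": 0.94})
promoted = tracker.promote_best("accuracy")
```

Keeping this information in one shared place is what lets different teams compare experiments and reuse each other's models instead of repeating work.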
But how can we make different teams collaborate and use a common platform to do their tasks?
Breaking down silos
To complete an ML project, you need a team that comprises various roles. However, diverse roles bring challenges of communication, team dynamics, and conflicting priorities. In enterprises, these roles often belong to different teams in different business units (BUs).
ML projects need a variety of teams and personas to be successful. The following figure shows some of the roles and responsibilities that are required to complete a simple ML project:
Figure 1.3 – Silos involved in ML projects
Let's look at these roles in more detail here:
- Data scientist: This role is the most widely understood. This persona or team is responsible for exploring the data and running experiment iterations to determine which algorithm is suitable for a given problem.
- Data engineers: The persona or team in this role is responsible for ingesting data from various sources, cleaning the data, and making it useful for the data science teams.
- Developers and operations: Once the model is built, this team is responsible for taking the model and deploying it to be used. The operations team is responsible for making sure that computers and storage are available for the other teams to perform data processing, model life-cycle operations, and model inference.
- Business subject-matter expert (SME): Even though data scientists build the ML model, understanding the data and the business domain is critical to building the right model. Imagine a data scientist who is building a model for predicting COVID-19 without understanding the different parameters. An SME, who would be a medical doctor in this case, would be required to help the data scientists understand the data before they move on to the model-building phase.
Of course, even with the building blocks in place, you're unlikely to succeed at the first attempt.
Fail-fast culture
Building a cross-functional team is not enough. Make sure that the team is empowered to make its own decisions and feels comfortable experimenting with different approaches. The data and ML fields are fast-moving, and the team may choose to adopt a recent technology or process, or let go of an existing one, based on the given success criteria.
Form a team of people who are passionate about the work and give them autonomy; this gives you the best chance of a good outcome. Enable your teams so that they can adapt to change quickly and deliver value for your business. Establish a fast, iterative feedback cycle in which teams receive feedback on the work that has been delivered so far. A quick feedback loop keeps the focus on solving the business problem.
However, this approach brings its own challenges. Adopting modern technologies may be difficult and time-consuming. Think of Amazon Marketplace: if you want to sell a hot new product, using the marketplace lets you bring it to market faster because the marketplace takes care of many of the moving parts required to make a sale. Similarly, the ML platform you will learn about in this book lets you experiment with modern approaches and technologies with ease by supplying basic common services and sandbox environments in which your team can experiment fast.
It is critical to the success of projects that people from distinct groups form a cross-functional and autonomous team. This new team will move with higher velocity, with less internal friction and fewer tedious processes and delays. The team must also be empowered to make its own decisions and be supported by self-service platforms so that it can work independently. The ML platform you will see in this book provides the basis of one such platform, where teams can collaborate and share.
Now, let's take a look at what kind of platform will help you address the challenges we have discussed.