Companies interested in creating value with AI/ML have a lot to gain compared to their more hesitant competitors. According to McKinsey Global Institute, “Companies that fully absorb AI in their value-producing workflows by 2025 will dominate the 2030 world economy with +120% cash flow growth.” Embracing AI and productionizing it – whether in your product or for internal purposes – is complex, technical debt-heavy, and expensive. Once your models and use cases are chosen, making them work in production becomes a difficult program to manage, and many companies will struggle with it as industries outside tech start to take on the challenge of embracing AI. The complicated parts are operationalizing the process, updating the models, keeping the data fresh and clean, and organizing experiments, as well as the validation, testing, and storage associated with it all.
To make this entire process more digestible, we're going to present it step by step; there are varying layers of complexity, but the basic components will be the same. Once you have gotten through the easy bit and you've settled on the models and algorithms you feel are optimal for your use case, you can begin to refine your process for managing your AI system.
Step 1 – Data availability and centralization
Essentially, you’ll need a central place to store the data that your AI/ML models and algorithms will be learning from. Depending on the databases you invest in or legacy systems you’re using, you might have a need for an ETL pipeline and data engineering to make the layers of data and metadata available for your productionized AI/ML models to ingest and offer insights from. Think of this as creating the pipeline needed to feed your AI/ML system.
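Such a feed pipeline can be sketched in a few lines. The following is a minimal, illustrative extract-transform-load flow; the record fields, table name, and in-memory SQLite store are all hypothetical stand-ins for your real source systems and central store:

```python
# A minimal sketch of an ETL feed for an AI/ML system. The record fields,
# the "features" table, and SQLite are hypothetical stand-ins.
import sqlite3

def extract():
    # In practice, this would read from a legacy system, API, or export.
    return [
        {"customer_id": 1, "spend": "120.50", "region": " east "},
        {"customer_id": 2, "spend": "80.00", "region": "WEST"},
    ]

def transform(rows):
    # Clean types and normalize categorical fields before model ingestion.
    return [
        (r["customer_id"], float(r["spend"]), r["region"].strip().lower())
        for r in rows
    ]

def load(rows, conn):
    # Land the cleaned rows in the central store the models will read from.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features "
        "(customer_id INTEGER, spend REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*) FROM features").fetchone()[0])  # 2
```

The important design point is the separation of stages: each can be swapped out (a new source system, a new cleaning rule, a new central store) without rewriting the others.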
AI feeds on data, and if your system for delivering data is clunky or slow, you'll run into issues in production later. Choosing your preferred way of storing data is tricky in and of itself. You don't know how your tech stack will evolve as you scale, so choosing a cost-effective and reliable solution is a mission in and of itself. For example, as we started to add more and more customers at a cybersecurity company we were previously working for, we noticed the load time for certain customer-facing dashboards was lagging. Part of the issue was that the number of customers, and their metadata, had grown too large for the pipelines we already had in place.
Step 2 – Continuous maintenance
At this point, you have your models and algorithms and you've chosen a system for delivering data to them. Now, you're going to be in the flow of constantly maintaining this system. In DevOps, this is referred to as continuous integration (CI)/continuous delivery (CD). In later chapters, we will cover the concept of AI Operations (AIOps), but for now, here are the four major components of the continuous maintenance process, tailored for AI pipelines:
- CI: Testing/validating code and components, along with data, data schemas, and models
- CD: Code changes or updates to your model are passed on continuously so that once you’ve made changes, they are slated to appear in the testing environment before going to production without pauses
- CT: We’ve mentioned the idea of continuous learning being important for ML, and continuous training productionizes this process so that as your data feeds are refreshed, your models are consistently training and learning from that new data
- CM: We can’t have ML/AI models continuously running without also continuously monitoring them to make sure something isn’t going horribly wrong
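The CM stage, in particular, can start very simply. The following sketch flags drift when a live feature's mean shifts too far from its training baseline; the feature values and the 20% tolerance are purely illustrative, and a production system would use proper statistical tests:

```python
# A minimal sketch of continuous monitoring (CM): alert when a live
# feature's mean drifts from its training baseline. The tolerance and
# feature values are illustrative, not recommendations.
from statistics import mean

def drift_alert(baseline, live, tolerance=0.2):
    """Return True when the live mean deviates from the baseline mean
    by more than `tolerance` as a fraction of the baseline mean."""
    base_mean = mean(baseline)
    return abs(mean(live) - base_mean) > tolerance * abs(base_mean)

training_spend = [100, 110, 95, 105]
print(drift_alert(training_spend, [102, 98, 107]))   # stable window -> False
print(drift_alert(training_spend, [160, 170, 155]))  # shifted window -> True
```

A check like this would run on a schedule against each refreshed data window, with alerts feeding whatever incident process your team already uses.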
You can't responsibly manage an AI program if you aren't constantly iterating on your process. Your models and hyperparameters will become stale. Your data will become stale, and when an iterative process like this stagnates, it will stop being effective. You'll need to stay on top of performance constantly, because a lack of performance will be self-evident, whether it is client-facing or not. With that said, things can also go wrong. For example, lags in performance or in the frequency of model updates can lead to people losing their jobs, not getting a competitive rate on a mortgage, or getting an unfair prison sentence. Major consequences can arise from downstream effects due to improper model maintenance. We recommend exploring the Additional resources section at the end of this chapter for more examples and information on how stagnant AI systems can wreak havoc on environments and people.
Storage 101 – databases, warehouses, data lakes, and lakehouses
AI/ML products run on data. Where and how you store your data is a big consideration that impacts your AI/ML performance, and in this section, we will go through some of the most popular storage vehicles for your data. Figuring out the optimal way to store, access, and train on your data is a specialization in and of itself, but if you're in the business of AI product management, eventually, you're going to need to understand the basic building blocks of what makes your AI product work. In a few words, data does.
Because AI requires big data, this is going to be a significant strategic decision for your product and business. If you don’t have a well-oiled machine, pun intended, you’re going to run into snags that will impair the performance of your models and, by extension, your product itself. Having a good grasp of the most cost-effective and performance-driven solution for your particular product, and finding the balance within these various facets, is going to help your success as a product manager. Yes, you will depend on your technical executives for a lot of these decisions, but you’ll be at the table helping make these decisions, so some familiarity is needed here.
Let’s look at some of the different options to store data for AI/ML products.
Database
Depending on your organization's goals and budget, you'll be centralizing your data somehow between a data lake, a database, and a data warehouse, and you might even be considering a new option: the data lakehouse. If you're just getting your feet wet, you're likely just storing your data in a relational database so that you can access it and query it easily. Databases are a great way to do this if you have a relatively simple setup. With a relational database, you're operating under a particular schema; if you wanted to combine this data with data that's in another database, you would run into problems aligning those schemas later.
If your primary use of the database is querying to access data and use only a certain subset of your company’s data for general trends, a relational database might be enough. If you’re looking to combine various datasets from disparate areas of your business and you’re looking to accomplish more advanced analytics, dashboards, or AI/ML functions, you’ll need to read on.
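For the simple querying case, this is all it takes. The sketch below uses SQLite as a hypothetical stand-in for your relational database, with an invented `orders` table, to show the kind of subset-and-aggregate query a simple setup supports well:

```python
# A minimal sketch of relational querying for general trends. SQLite and
# the "orders" table are hypothetical stand-ins for your real database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# Aggregate a subset of the company's data for a general trend.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
):
    print(region, total)
```

The moment you need to join this against data living under a different schema in a different system, the alignment problems described above appear, and the options below become relevant.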
Data warehouse
If you’re looking to combine data into a location where you can centralize it somewhere and you’ve got lots of structured data coming in, you’re more likely going to use a data warehouse. This is really the first step toward maturity because it will allow you to leverage insights and trends across your various business units quickly. If you’re looking to leverage AI/ML in various ways, rather than one specific specialized way, this will serve you well.
Let’s say, for example, that you want to add AI features to your existing product as well as within your HR function. You’d be leveraging your customer data to offer trends or predictions to your customers based on the performance of others in their peer group, as well as using AI/ML to make predictions or optimizations for your internal employees. Both these use cases would be well served with a data warehouse.
Data warehouses do, however, require some upfront investment to create a plan and design your data structures. They are also costly to run because they make data available for analysis on demand, so you're paying a premium for keeping that data readily available. Depending on how advanced your internal users are, you could opt for cheaper options, but a warehouse is optimal for organizations where most business users are looking for easily digestible ways to analyze data. Either way, a data warehouse will allow you to create dashboards for your internal users and stakeholder teams.
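The warehouse pattern from the example above can be sketched as structured data from two business units landing in one queryable store. The table names and figures are invented, and in-memory SQLite stands in for a real warehouse such as BigQuery, Snowflake, or Redshift:

```python
# A minimal sketch of the warehouse pattern: structured data from two
# business units in one centralized, queryable store. Tables, values,
# and SQLite itself are hypothetical stand-ins for a real warehouse.
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE product_usage (customer TEXT, logins INTEGER)")
wh.execute("CREATE TABLE hr_headcount (team TEXT, heads INTEGER)")
wh.executemany("INSERT INTO product_usage VALUES (?, ?)",
               [("acme", 42), ("globex", 17)])
wh.executemany("INSERT INTO hr_headcount VALUES (?, ?)",
               [("support", 5), ("sales", 8)])

# One store serving two use cases: customer-facing trends and internal HR.
print(wh.execute("SELECT SUM(logins) FROM product_usage").fetchone()[0])  # 59
print(wh.execute("SELECT SUM(heads) FROM hr_headcount").fetchone()[0])    # 13
```

The point of the centralization is that dashboards and AI/ML features for both use cases can query the same store, rather than each business unit maintaining its own silo.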
Data lake (and lakehouse)
If you’re sitting on lots of raw, unstructured data, and you want to have a more cost-effective place to store it, you’d be looking at a data lake. Here, you can store unstructured, semi-structured, as well as structured data that can be easily accessed by your more tech-savvy internal users. For instance, data scientists and ML engineers would be able to work with this data because they would be creating their own data models to transform and analyze the data on the fly, but this isn’t the case at most companies.
Keeping your data in a data lake would be cheap if you've got lots of data your business users don't need immediately, but you won't ever really be able to replace a warehouse or a database with one. It's more of a "nice to have." If you're sitting on a massive data lake of historical data you want to use in the future for analytics, you'll need to consider another way to store it to get those insights.
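The defining trait of a lake is schema-on-read: raw files are dumped cheaply as-is, and structure is imposed only when a technical user analyzes them. In the sketch below, a temporary directory stands in for object storage, and the event files and fields are invented:

```python
# A minimal sketch of a data lake: raw, schema-on-read files in cheap
# storage. A temp directory stands in for object storage (e.g. S3);
# file names and fields are hypothetical.
import json
import os
import tempfile

lake = tempfile.mkdtemp()

# Raw events land as-is, with no upfront schema design.
events = [
    {"event": "login", "user": "a"},
    {"event": "click", "user": "a", "target": "dashboard"},
]
for i, ev in enumerate(events):
    with open(os.path.join(lake, f"event_{i}.json"), "w") as f:
        json.dump(ev, f)

# Schema-on-read: structure is imposed only at analysis time, typically
# by a data scientist or ML engineer building their own data model.
logins = sum(
    1
    for name in os.listdir(lake)
    if json.load(open(os.path.join(lake, name)))["event"] == "login"
)
print(logins)  # 1
```

This flexibility is exactly why a lake suits technical users and frustrates business users: nothing is pre-modeled, so every question requires writing parsing code like the above.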
You might also come across the term lakehouse. There are many databases, data warehouses, and data lakes out there. However, the only lakehouse we’re aware of has been popularized by a company called Databricks, which offers something like a data lake but with some of the capabilities you get with data warehouses, namely, the ability to showcase data, make it available and ingestible for non-technical internal users, and create dashboards with it. The biggest advantage here is that you’re storing it and paying for the data to be stored upfront with the ability to access and manipulate it downstream.
Data pipelines
Regardless of the tech you use to maintain and store your data, you’re still going to need to put up pipelines to make sure your data is moving, that your dashboards are refreshing as readily as your business requires, and that data is flowing the way it needs to. There are also multiple ways of processing and passing data. You might be doing it in batches (batch processing) for large amounts of data being moved at various intervals, or in real-time pipelines for getting data in real time as soon as it’s generated. If you’re looking to leverage predictive analytics, enable reporting, or have a system in place to move, process, and store data, a data pipeline will likely be enough. However, depending on what your data is doing and how much transformation is required, you’ll likely be using both data pipelines and perhaps, more specifically, ETL pipelines.
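The two delivery modes above differ only in when records move. The following sketch contrasts them with an invented buffer-and-flush handler; the batch size of three stands in for whatever interval or volume threshold your business requires:

```python
# A minimal sketch contrasting batch processing (accumulate, then move
# at intervals) with real-time streaming (move each record on arrival).
# The batch size and record values are illustrative.
from collections import deque

buffer = deque()
processed = []

def on_event_batch(record, batch_size=3):
    """Batch mode: hold records until a size/interval threshold, then flush."""
    buffer.append(record)
    if len(buffer) >= batch_size:
        processed.append(list(buffer))  # one bulk load per interval
        buffer.clear()

def on_event_stream(record):
    """Streaming mode: process each record as soon as it is generated."""
    processed.append([record])  # immediate, per-record load

for r in range(6):
    on_event_batch(r)
print(processed)  # [[0, 1, 2], [3, 4, 5]]
```

Batch mode trades latency for throughput and cost, which is why it suits large periodic loads, while streaming suits dashboards and features that must reflect data the moment it arrives.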
ETL stands for extract, transform, and load, so your data engineers are going to be creating specific pipelines for more advanced needs such as centralizing all your data into one place, adding data or data enrichment, connecting your data with customer relationship management (CRM) tools, or even transforming the data and adding structure to it between systems. ETL is a necessary step when using a data warehouse or database; if you're exclusively using a data lake, you'll already have all the metadata you need to analyze the data and get your insights as you like. In most cases, if you're working with an AI/ML product, you're going to be working with a data engineer who will power the data flow needed to make your product a success, because you're likely using a relational database as well as a data warehouse. The analytics required to enable AI/ML features will most likely need to be powered by a data engineer who focuses on the ETL pipeline.
Managing and maintaining this system will also be the work of your data engineer, and we encourage every product manager to have a close relationship with the data engineer(s) that support their products. One key difference between the two pipeline types is that ETL pipelines are generally updated in batches, not in real time. If you're using an ETL pipeline, for instance, to update historical daily information about how your customers are using your product to offer client-facing insights in your platform, it might be optimal to keep this batch updating twice daily. However, if you need insights to arrive in real time for a dashboard used by internal business users who rely on that data to make daily decisions, you'll likely need a data pipeline that's updated continuously.
Now that we understand the different available options to store data and how to choose the right option for the business, let's discuss how to manage our projects.