Succeeding in AI – how well-managed AI companies do infrastructure right
It’s indicative of the complexity of ML systems that many large technology companies that depend heavily on ML have dedicated teams and platforms that focus on building, training, deploying, and maintaining ML models. The following are a few examples of the platforms these companies have built, some of which are options you can adopt when building your own ML/AI program:
- Databricks has MLflow: MLflow is an open source platform developed by Databricks to help manage the complete ML life cycle for enterprises. It allows you to run experiments and work with any library, framework, or language. The main benefits are experiment tracking (so you can compare how your models perform across experiments), model management (so teammates can manage all versions of a model), and model deployment (so you can get a quick view of deployments within the tool).
- Google has TensorFlow Extended (TFX): TFX is Google’s end-to-end platform, built on TensorFlow, for deploying production-level ML pipelines. It allows you to collaborate within and between teams and offers robust capabilities for scalable, high-performance environments.
- Uber has Michelangelo: Uber is a great example of a company creating its own ML management tool in-house for collaboration and deployment. Previously, its teams were siloed and used disparate languages, models, and algorithms. After implementing Michelangelo, Uber was able to bring varying skill sets and capabilities under one system. It needed one place for a reliable, reproducible, and standardized pipeline to create, manage, and deploy models and to make predictions at scale.
- Meta has FBLearner Flow: Meta also created its own system for managing its numerous AI projects. Since ML is such a foundational part of its product, Meta needed a platform that would allow the following:
  - Every ML algorithm to be implemented once and then reused by someone else at a later date
  - Every engineer to write a training pipeline that can be reused
  - Model training to be easy and automated
  - Everybody to search past projects and experiments easily
Effectively, Meta created an easy-to-use knowledge base and workflow to centralize all its ML ops.
- Amazon has SageMaker: SageMaker is Amazon’s product for building, training, and deploying ML models and programs with a collection of fully managed infrastructure tools and workflows. The idea is to meet customers where they are by offering low-code or no-code UIs, whether you employ ML engineers or business analysts. It is also a good fit if you already use Amazon services for your cloud infrastructure, since you can take that a step further to automate and standardize your ML/AI program and operations at scale.
- Airbnb has Bighead: Airbnb created its own ML infrastructure to standardize and centralize work across its AI/ML organizations. It used a collection of tools such as Zipline, Redspot, and DeepThought to orchestrate its ML platform, with the same aims as Meta and Uber: to mitigate errors and discrepancies and to minimize repeated work.
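Two capabilities recur across these platforms: logging each experiment's parameters and results, and letting anyone search past runs. To make that idea concrete, here is a minimal sketch in plain Python. It is not the API of MLflow, FBLearner Flow, or any other platform named above; the `Run` and `ExperimentTracker` names, the method names, and the sample accuracy values are all hypothetical and chosen only for illustration. A real platform would persist runs to a shared store rather than keep them in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Run:
    """One training run: the parameters used and the metrics it produced."""
    run_id: int
    name: str
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


class ExperimentTracker:
    """Hypothetical in-memory tracker (illustration only, not a real platform API)."""

    def __init__(self) -> None:
        self._runs: List[Run] = []

    def log_run(self, name: str, params: Dict[str, float],
                metrics: Dict[str, float]) -> Run:
        """Record a finished run so teammates can find and compare it later."""
        run = Run(run_id=len(self._runs) + 1, name=name,
                  params=dict(params), metrics=dict(metrics))
        self._runs.append(run)
        return run

    def search(self, metric: str, min_value: float) -> List[Run]:
        """Return past runs whose metric is at least min_value."""
        return [r for r in self._runs
                if r.metrics.get(metric, float("-inf")) >= min_value]

    def best(self, metric: str) -> Optional[Run]:
        """Return the run with the highest metric, or None if no run logged it."""
        candidates = [r for r in self._runs if metric in r.metrics]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r.metrics[metric])


# Made-up runs and scores, purely for demonstration.
tracker = ExperimentTracker()
tracker.log_run("baseline", {"learning_rate": 0.1}, {"accuracy": 0.81})
tracker.log_run("lower_lr", {"learning_rate": 0.01}, {"accuracy": 0.86})
tracker.log_run("more_epochs", {"learning_rate": 0.01, "epochs": 20},
                {"accuracy": 0.84})

best_run = tracker.best("accuracy")
strong_runs = tracker.search("accuracy", 0.84)
print(best_run.name, [r.name for r in strong_runs])
```

Even in this toy form, the design choice is visible: because every run goes through one shared interface, comparing experiments and finding earlier work becomes a query rather than a hunt through individual notebooks, which is the kind of centralization the platforms above aim for at much larger scale.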
As we can see, there are multiple platforms that can be used to create, train, and deploy ML models. Finally, let’s see what the future of AI looks like.