An overview of the ML platform
In this section, we will talk about the capabilities of the ML platform that you will need to consider. The aim is to make you aware of the basic building blocks that could form an ecosystem for your team to help you in your ML journey. An ML platform can be thought of as a set of components that assists in the faster development and deployment of ML models and data pipelines.
There are three main characteristics of an ML platform, as outlined here:
- A complete ecosystem: The platform should provide an end-to-end (E2E) solution that includes data life-cycle management, ML life-cycle management, application life-cycle management, and observability.
- Built on open standards: The platform should provide a way to extend and build on the existing baseline. Because the field is fast-moving, it is critical that you can further enhance, tailor, and optimize platforms for your specific needs.
- Self-serving: The platform should be able to provide the resources required by teams automatically and on-demand, from hardware requests to deploying software in production. The platform automates the provisioning of resources based on enterprise controls and recovers them once the job is completed. The resources can be central processing units (CPUs), memory, or disk, or can be software such as integrated development environments (IDEs) to write code or a combination of these.
The following diagram shows the various components of an ML platform that serves different personas, allowing them to collaborate on a common platform:
Apart from the characteristics presented in Figure 1.4, the platform must have the following technical capabilities:
- Workflow automation: The platform should have some form of workflow automation capability where both data engineers can create jobs that perform repetitive tasks such as data ingestion and preparation and data scientists can orchestrate model training and automate model deployments.
- Security: The platform must be secured to prevent data leaks and data loss that can have a negative impact on the business.
- Observability: We do not want to run applications without observability, whether it is a traditional application or an ML model. Deploying applications in production without observability is like riding a bike blindfolded. The platform should have a good amount of observability where you can monitor the health and performance of the entire system or sub-system in near real time. This should also include an alerting capability.
- Logging: Logging plays a key role in understanding what happened when systems start behaving in an unexpected way. The platform must have a solid logging mechanism to allow operations teams to better support the ML project.
- Data processing and pipelining: Because ML projects rely on a huge amount of data, the platform must include a reliable fully featured data processing and data pipelining solution that can scale horizontally.
- Model packaging and deployment: Not all data scientists are experienced software engineers. Although some may have an experience in writing applications, it is not safe to assume that all data scientists can write production-grade applications and deploy them to production. Therefore, the platform must be able to automatically package an ML model into an application and serve it.
- ML life cycle: The platform must also be capable of managing ML experiments, tracking performance, storing training and experiment metadata and feature sets, and versioning models. This not only allows data scientists to work efficiently, but also allows them to work collaboratively.
- On-demand resource allocation: One important feature an ML platform should have is the capability that allows data scientists and data engineers to provision their own runtime resources automatically and on-demand. This eliminates the need for manual requisition of resources and eliminates time wasted on waiting and handovers with operations teams. The platform must allow platform users to create their own environment and to allocate the right amount of compute resources they need to do their jobs.
There are already platform products that have most, if not all, of the capabilities you have just learned about. What you will learn in the later chapters of this book is how to build one such platform based on OSS on top of Kubernetes.