The Operator Framework defines a Capability Model (https://operatorframework.io/operator-capabilities/) that categorizes Operators based on their functionality and design. This model helps to break down Operators based on their maturity, and also describes the extent of an Operator's interoperability with OLM and the capabilities users can expect when using the Operator.
The Capability Model is divided into five hierarchical levels. Operators can be published at any one of these levels and, as they grow, may evolve and graduate from one level to the next as features and functionality are added. In addition, the levels are cumulative, with each level generally encompassing all features of the levels below it.
The current level of an Operator is part of the CSV, and this level is displayed on its OperatorHub listing. The level is based on somewhat subjective yet guided criteria and is purely an informational metric.
Each level has specific functionalities that define it. These functionalities are broken down into Basic Install, Seamless Upgrades, Full Lifecycle, Deep Insights, and Auto Pilot. The specific levels of the Capability Model are outlined here:
- Level I—Basic Install: This level represents the most basic of Operator capabilities. At Level I, an Operator is only capable of installing its Operand in the cluster and conveying the status of the workload to cluster administrators. This means that it can set up the basic resources required for an application and report when those resources are ready to be used by the cluster.
At Level I, an Operator also allows for simple configuration of the Operand. This configuration is specified through the Operator's Custom Resource. The Operator is responsible for reconciling the configuration specifications with the running Operand workload. However, it may not be able to react if the Operand reaches a failed state, whether due to malformed configuration or outside influence.
Going back to our example web application from the start of the chapter, a Level I Operator for this application would handle the basic setup of the workloads and nothing else. This is good for a simple application that needs to be quickly set up on many different clusters, or one that should be easily shared with users for them to install themselves.
- Level II—Seamless Upgrades: Operators at Level II offer the features of basic installation, with added functionality around upgrades. This includes upgrades for the Operand but also upgrades for the Operator itself.
Upgrades are a critical part of any application. As bug fixes are implemented and more features are added, being able to smoothly transition between versions helps ensure application uptime. An Operator that handles its own upgrades can either upgrade its Operand when it upgrades itself or manually upgrade its Operand by modifying the Operator's Custom Resource.
For seamless upgrades, an Operator must also be able to upgrade older versions of its Operand (which may exist because they were managed by an older version of the Operator). This kind of backward compatibility is essential for both upgrading to newer versions and handling rollbacks (for example, if a new version introduces a high-visibility bug that can't wait for an eventual fix to be published in a patch version).
Our example web application Operator could offer the same set of features. This means that if a new version of the application were released, the Operator could handle upgrading the deployed instances of the application to the newer version. Or, if changes were made to the Operator itself, then it could manage its own upgrades (and later upgrade the application, regardless of version skew between Operator and Operand).
- Level III—Full Lifecycle: Level III Operators offer at least one out of a list of Operand lifecycle management features. Being able to offer management during the Operand's lifecycle implies that the Operator is more than just passively operating on a workload in a set and forget fashion. At Level III, Operators are actively contributing to the ongoing function of the Operand.
The features relevant to the lifecycle management of an Operand include the following:
- The ability to create and/or restore backups of the Operand.
- Support for more complex configuration options and multistep workflows.
- Failover and failback mechanisms for disaster recovery (DR). When the Operator encounters an error (either in itself or the Operand), it needs to be able to either re-route to a backup process (fail over) or roll the system back to its last known functioning state (fail back).
- The ability to manage clustered Operands, and—specifically—support for adding and removing members to and from Operands. The Operator should be capable of considering quorum for Operands that run multiple replicas.
- Similarly, support for scaling an Operand with worker instances that operate with read-only functionality.
Any Operator that implements one or more of these features can be considered to be at least a Level III Operator. The simple web application Operator could take advantage of a few of these, such as DR and scaling. As the user base grows and resources demands increase, an administrator could instruct the Operator to scale the application with additional replica Pods to handle the increased load.
Should any of the Pods fail during this process, the Operator would be smart enough to know to fail over to a different Pod or cluster zone entirely. Alternatively, if a new version of the web app was released that introduced an unexpected bug, the Operator could be aware of the previous successful version and provide ways to downgrade its Operand workloads if an administrator noticed the error.
- Level IV—Deep Insights: While the previous levels focus primarily on Operator features as they relate to functional interaction with the application workload, Level IV emphasizes monitoring and metrics. This means an Operator is capable of providing measurable insights to the status of both itself and its Operand.
Insights may be seen as less important from a development perspective relative to features and bug fixes, but they are just as critical to an application's success. Quantifiable reports about an application's performance can drive ongoing development and highlight areas that need improvement. Having a measurable system to push these efforts allows a way to scientifically prove or disprove which changes have an effect.
Operators most commonly provide their insights in the form of metrics. These metrics are usually compatible with metrics aggregation servers such as Prometheus. (Interestingly enough, Red Hat publishes an Operator for Prometheus that is a Level IV Operator. That Operator is available on OperatorHub at https://operatorhub.io/operator/prometheus.)
However, Operators can provide insights through other means as well. These include alerts and Kubernetes Events. Events are built-in cluster resource objects that are used by core Kubernetes objects and controllers.
Another key insight that Level IV Operators report is the performance of the Operator and Operand. Together, these insights help inform administrators about the health of their clusters.
Our simple web application Operator could provide insights about the performance of the Operand. Requests to the app would provide information about the current and historic load on the cluster. Additionally, since the Operator can identify failed states at this point, it could trigger an alert when the application is unhealthy. Many alerts would indicate a reliability issue that would gain the attention of an administrator.
- Level V—Auto Pilot: Level V is the most sophisticated level for Operators. It includes Operators that offer the highest capabilities, in addition to the features in all four previous levels. This level is called Auto Pilot because the features that define it focus on being able to run almost entirely autonomously. These capabilities include Auto Scaling, Auto-Healing, Auto-Tuning, and Abnormality Detection.
Auto Scaling is the ability for an Operator to detect the need to scale an application up or down based on demand. By measuring the current load and performance, an Operator can determine whether more or fewer resources are necessary to satisfy the current usage. Advanced Operators can even try to predict the need to scale based on current and past data.
Auto-Healing Operators can react to applications that are reporting unhealthy conditions and work to correct them (or, at least, prevent them from getting any worse). When an Operand is reporting an error, the Operator should take reactive steps to rectify the failure. In addition, Operators can use current metrics to proactively prevent an Operand from transitioning to a failure state.
Auto-Tuning means that an Operator can dynamically modify an Operand for peak performance. This involves tuning the settings of an Operand automatically. It can even include complex operations such as shifting workloads to entirely different nodes that are better suited than their current nodes.
Finally, Abnormality Detection is the capability of an Operator to identify suboptimal or off-pattern behavior in an Operand. By measuring performance, an Operator has a picture of the application's current and historical levels of functioning. That data can be compared to a manually defined minimum expectation or used to dynamically inform the Operator of that expectation.
All of these features are heavily dependent upon the use of metrics to automatically inform the Operator of the need to act upon itself or its Operand. Therefore, a Level V Operator is an inherent progression from Level IV, which is the level at which an Operator exposes advanced metrics.
At Level V, the simple web application Operator would manage most of the aspects of the application for us. It has insights into the current number of requests, so it can scale up copies of the app on demand. If this scaling starts to cause errors (for example, too many concurrent database calls), it can identify the number of failing Pods and prevent further scaling. It would also attempt to modify parameters of the web app (such as request timeouts) to help rectify the situation and allow the auto-scaling to proceed. When the load peak subsided, the Operator would then automatically scale down the application to its baseline service levels.
Levels I and II (Basic Install and Seamless Upgrades) can be used with the three facets of the Operator SDK: Helm, Ansible, and Go. However, Level III and above (Full Lifecycle, Deep Insights, and Auto Pilot) are only possible with Ansible and Go. This is because the functionality at these higher levels requires more intricate logic than what is available through Helm charts alone.
We have now explained the three main pillars of the Operator Framework: Operator SDK, OLM, and OperatorHub. We learned how each contributes different helpful features to the development and usage of Operators. We also learned about the Capability Model, which serves as a reference for the different levels of functionality that Operators can have. In the next section, we'll apply this knowledge to a sample application.