Defining an SLA for a data platform
When operating a data platform, it is essential to define what a healthy state looks like for the platform as a whole and then to maintain that state. Start by thinking about what state the data platform should be in. A useful way to capture this is to define a service-level agreement (SLA) as an indicator of health. This SLA does not always need to be communicated to end users; it serves as an internal indicator for measuring whether your data platform is healthy.
The basic strategy is to keep the data platform in a state where the SLA is met and to return it to that normal state whenever the SLA is violated. In other words, monitoring detects when the platform has deviated from a normal state to an abnormal one, and recovery returns the platform from the abnormal state to the normal state, as illustrated in the following diagram:
Figure 11.1 – The monitoring cycle
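To make the cycle concrete, here is a minimal Python sketch of one monitoring pass followed by recovery. It is only an illustration under stated assumptions: the `SlaCheck` class, the `run_cycle` function, the `data_freshness` indicator, and the 60-minute threshold are all hypothetical names and values chosen for this example, and the platform state is simulated with a plain dictionary rather than read from a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SlaCheck:
    """One internal health indicator (hypothetical structure for illustration)."""
    name: str
    measure: Callable[[], float]  # returns the current value of the indicator
    threshold: float              # largest value still considered healthy
    recover: Callable[[], None]   # action intended to restore the normal state

    def is_healthy(self) -> bool:
        return self.measure() <= self.threshold


def run_cycle(checks: list[SlaCheck]) -> list[str]:
    """Run one monitoring pass and trigger recovery for each violated check.

    Returns the names of the checks that were found in an abnormal state.
    """
    violated = []
    for check in checks:
        if not check.is_healthy():
            violated.append(check.name)
            check.recover()
    return violated


# Simulated platform state: the latest data is 90 minutes old (assumed value).
state = {"delay_minutes": 90.0}


def rerun_pipeline() -> None:
    # Stand-in for a real recovery action; here it just resets the delay.
    state["delay_minutes"] = 10.0


checks = [
    SlaCheck(
        name="data_freshness",
        measure=lambda: state["delay_minutes"],
        threshold=60.0,  # assumed SLA: data no more than 60 minutes old
        recover=rerun_pipeline,
    )
]

first_pass = run_cycle(checks)   # abnormal state detected, recovery triggered
second_pass = run_cycle(checks)  # platform is back in the normal state
print(first_pass, second_pass)
```

In this sketch, the first pass detects the violated indicator and runs its recovery action, and the second pass finds nothing to report. In practice, the measurement and recovery functions would be replaced by whatever checks and remediation steps fit your platform.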
Now, I would like to look at an example of how to define the health of a data platform...