Monitoring the SLA of a data platform
Consider how to implement a mechanism that monitors the health of a data platform. There are two common strategies for identifying the platform's state:
- Fact-based approach: Inspect real end-user activity and retrieve the metrics it produced.
- Simulation-based approach: Simulate end-user activity and measure the metrics yourself.
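To make the difference between the two strategies concrete, here is a minimal Python sketch. It is not tied to any AWS API: the function names `fact_based_violations` and `simulated_probe`, the millisecond latency threshold `sla_ms`, and the `run_query` callable are all hypothetical names chosen for illustration. The fact-based function only reads durations that were already recorded for real queries, while the simulation-based function runs a synthetic workload and times it itself.

```python
import time
from typing import Callable, Iterable, List, Tuple


def fact_based_violations(observed_ms: Iterable[float], sla_ms: float) -> List[float]:
    """Fact-based: inspect durations already recorded for real user queries.

    Returns the recorded durations (in milliseconds) that exceed the SLA.
    """
    return [duration for duration in observed_ms if duration > sla_ms]


def simulated_probe(run_query: Callable[[], object], sla_ms: float) -> Tuple[float, bool]:
    """Simulation-based: execute a synthetic query and time it ourselves.

    `run_query` is a hypothetical callable standing in for whatever submits
    the probe query to the platform. Returns (elapsed_ms, within_sla).
    """
    start = time.perf_counter()
    run_query()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, elapsed_ms <= sla_ms


if __name__ == "__main__":
    # Fact-based: durations here are made-up sample values, not real measurements.
    recorded = [120.0, 950.0, 3100.0]
    print("violations:", fact_based_violations(recorded, sla_ms=1000.0))

    # Simulation-based: a sleep stands in for a real probe query.
    elapsed, ok = simulated_probe(lambda: time.sleep(0.05), sla_ms=1000.0)
    print(f"probe took {elapsed:.1f} ms, within SLA: {ok}")
```

In practice the fact-based input would come from the platform's own metrics or logs, and the probe would submit a representative query on a schedule. The trade-off is that the fact-based check costs nothing extra but only sees what users happened to run, whereas the probe gives a steady signal at the price of extra load.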
To monitor performance and cost SLAs, you can inspect end-user activity through metrics and log messages. Amazon Athena exposes a variety of metrics, including query planning time and total execution time, via Amazon CloudWatch (https://docs.aws.amazon.com/athena/latest/ug/query-metrics-viewing.html) or Amazon Athena’s query history (https://docs.aws.amazon.com/athena/latest/ug/querying.html#queries-viewing-history). For Amazon Redshift, you can rely on system tables: SVL_QUERY_SUMMARY (https://docs.aws.amazon.com/redshift/latest/dg/using-SVL-Query-Summary.html) and SVL_QUERY_REPORT...