Bringing it all together for RCA
We are at the point now where we can now discuss how we can bring everything together. In our desire to increase our effectiveness in IT operations and look more holistically at application health, we now need to operationalize what we've prepared in the prior sections and configure our anomaly detection jobs accordingly. To that end, let's work through a real-life scenario in which Elastic ML helped us get to the root cause of an operational problem.
Outage background
This scenario is loosely based on a real application outage, although the data has been somewhat simplified and sanitized to obfuscate the original customer. The problem was with a retail application that processed gift card transactions. Occasionally, the app would stop working and transactions could not be processed. This would only be discovered when individual stores called headquarters to complain. The root cause of the issue was unknown and couldn't be ascertained...