Dealing with the plethora of data
IT departments have invested in monitoring tools for decades, and it is not uncommon to have a dozen or more tools actively collecting and archiving data that can be measured in terabytes, or even petabytes, per day. The data can range from rudimentary infrastructure- and network-level data to deep diagnostic data and/or system and application log files.
Business-level key performance indicators (KPIs) could also be tracked, sometimes including data about the end user's experience. The sheer depth and breadth of data available, in some ways, is the most comprehensive than it has ever been. To detect emerging problems or threats hidden in that data, there have traditionally been several main approaches to distilling the data into informational insights:
- Filter/search: Some tools allow the user to define searches to help trim down the data into a more manageable set. While extremely useful, this capability is most often used in an ad hoc fashion once a problem is suspected. Even then, the success of using this approach usually hinges on the ability for the user to know what they are looking for and their level of experience—both with prior knowledge of living through similar past situations and expertise in the search technology itself.
- Visualizations: Dashboards, charts, and widgets are also extremely useful to help us understand what data has been doing and where it is trending. However, visualizations are passive and require being watched for meaningful deviations to be detected. Once the number of metrics being collected and plotted surpasses the number of eyeballs available to watch them (or even the screen real estate to display them), visual-only analysis becomes less and less useful.
- Thresholds/rules: To get around the requirement of having data be physically watched in order for it to be proactive, many tools allow the user to define rules or conditions that get triggered upon known conditions or known dependencies between items. However, it is unlikely that you can realistically define all appropriate operating ranges or model all of the actual dependencies in today's complex and distributed applications. Plus, the amount and velocity of changes in the application or environment could quickly render any static rule set useless. Analysts find themselves chasing down many false positive alerts, setting up a boy who cried wolf paradigm that leads to resentment of the tools generating the alerts and skepticism of the value that alerting could provide.
Ultimately, there needed to be a different approach—one that wasn't necessarily a complete repudiation of past techniques, but could bring a level of automation and empirical augmentation of the evaluation of data in a meaningful way. Let's face it, humans are imperfect—we have hidden biases and limitations of capacity for remembering information and we are easily distracted and fatigued. Algorithms, if used correctly, can easily make up for these shortcomings.