Situations similar to the one just described are more common than anyone would like. A system fault with no visible symptoms beforehand is relatively rare. A subsection of UNIX administration horror stories (http://www-uxsup.csx.cam.ac.uk/misc/horror.txt) containing only stories about faults that weren't noticed in time could probably be compiled easily.
As experience shows, problems tend to happen when we are least equipped to solve them. To work with them on our terms, we turn to a class of software commonly referred to as network monitoring software. Such software usually allows us to constantly monitor things happening in a computer network using one or more methods and notify the persons responsible if a metric passes a defined threshold.
One of the first monitoring solutions most administrators implement is a simple shell script invoked from crontab, which checks some basic parameters, such as disk usage, or some service state, such as an Apache server. As the server and monitored parameter count grows, a neat and clean script system starts to grow into a performance-hogging script hairball that costs more time in upkeep than it saves. While the do-it-yourself crowd claims that nobody needs dedicated software for most tasks (monitoring included), most administrators will disagree as soon as they have to add switches, UPSes, routers, IP cameras, and a myriad of other devices to the swarm of monitored objects.
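As an illustration of such a do-it-yourself check, here is a minimal sketch in Python rather than shell. The function names and the 90 percent threshold are assumptions chosen for this example, not part of any existing tool. The script prints a warning only when the threshold is reached, on the assumption that cron is set up to mail any output to the crontab owner.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path="/", threshold=90.0):
    """Return a warning string if usage is at or above threshold, else None.

    The default threshold of 90 percent is an arbitrary example value.
    """
    percent = disk_usage_percent(path)
    if percent >= threshold:
        return f"WARNING: {path} is {percent:.1f}% full (threshold {threshold:.0f}%)"
    return None


if __name__ == "__main__":
    # Stay silent when everything is fine, so cron has nothing to report.
    message = check_disk("/")
    if message:
        print(message)
```

A script like this covers one parameter on one host. Every additional server, service, or device needs another script or another branch, which is how the hairball described above tends to start.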
So, what basic functionality can we expect from a monitoring solution? Let's take a look:
- Data gathering: This is where everything starts. Usually, data is gathered using various methods, including Simple Network Management Protocol (SNMP), Zabbix agents, Intelligent Platform Management Interface (IPMI), and Java Management Extensions (JMX).
- Data storage: Once we have gathered the data, it doesn't make sense to throw it away, so we will often want to store it for later analysis.
- Alerting: Gathered data can be compared to thresholds and alerts sent out when required using different channels, such as email or SMS.
- Visualization: Humans are better at distinguishing visualized data than raw numbers, especially when there's a lot of data. As we have data already gathered and stored, it is easy to generate simple graphs from it.
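The four functions listed above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Zabbix or any other product implements them: the `MetricStore` class, the `check_threshold` and `text_graph` helpers, and the metric name `cpu` are all hypothetical, and data gathering is simulated with hard-coded values.

```python
import time


class MetricStore:
    """Data storage: keeps (timestamp, value) samples per metric in memory."""

    def __init__(self):
        self.samples = {}

    def record(self, metric, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.samples.setdefault(metric, []).append((ts, value))

    def latest(self, metric):
        values = self.samples.get(metric)
        return values[-1][1] if values else None


def check_threshold(store, metric, threshold):
    """Alerting: return a message if the latest value is above threshold."""
    value = store.latest(metric)
    if value is not None and value > threshold:
        return f"ALERT: {metric} is {value}, above threshold {threshold}"
    return None


def text_graph(store, metric, step=10):
    """Visualization: one bar of '#' characters per stored sample."""
    rows = store.samples.get(metric, [])
    return "\n".join(f"{value:6.1f} " + "#" * int(value // step) for _, value in rows)


if __name__ == "__main__":
    store = MetricStore()
    # Data gathering is simulated here; a real system would poll SNMP, agents, and so on.
    for reading in (20, 50, 95):
        store.record("cpu", reading)
    print(text_graph(store, "cpu"))
    alert = check_threshold(store, "cpu", 90)
    if alert:
        print(alert)  # a real system would send email or SMS instead
```

Even in this toy form, the separation is visible: gathering feeds storage, and both alerting and visualization read from what has been stored.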
Sounds simple? That's because it is. But then we start to want more features, such as easy and efficient configuration, escalations, and permission delegation. If we sit down and start listing the things we want to keep an eye on, it may turn out that our area of interest extends beyond the network. Examples include a hard drive that has Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) errors logged, an application that has too many threads, or a UPS that has one phase overloaded. It is much easier to manage the monitoring of all of these different problem categories from a single configuration point.
In the quest for a manageable monitoring system, wondrous adventurers stumbled upon collections of scripts much like the ones they had implemented themselves, obscure and not-so-obscure workstation-level software, and heavy, expensive monitoring systems from big vendors.
Many went with a different category: free software. We will look at a free software monitoring solution, Zabbix.