The first steps in monitoring
Situations similar to the one just described are actually more common than desired. A system fault that had no symptoms visible before is relatively rare. A subsection of UNIX Administration Horror Stories (http://www-uxsup.csx.cam.ac.uk/misc/horror.txt) that only contains stories about faults that weren't noticed in time could probably be compiled easily.
As experience shows, problems tend to happen when we are least equipped to solve them. To work with them on our terms, we turn to a class of software commonly referred to as network monitoring software. Such software usually allows us to constantly monitor things happening in a computer network using one or more methods and notify the persons responsible, if a metric passes a defined threshold.
One of the first monitoring solutions most administrators implement is a simple shell script invoked from a crontab
that checks some basic parameters such as disk usage or some service state, such as an Apache server. As the server and monitored-parameter count grows, a neat and clean script system starts to grow into a performance-hogging script hairball that costs more time in upkeep than it saves. While the do-it-yourself crowd claims that nobody needs dedicated software for most tasks (monitoring included), most administrators will disagree as soon as they have to add switches, UPSes, routers, IP cameras, and a myriad of other devices to the swarm of monitored objects.
So, what basic functionality can one expect from a monitoring solution? Let's take a look:
Data gathering: This is where everything starts. Usually, data is gathered using various methods, including Simple Network Management Protocol (SNMP), agents, and Intelligent Platform Management Interface (IPMI).
Alerting: Gathered data can be compared to thresholds and alerts sent out when required using different channels, such as e-mail or SMS.
Data storage: Once we have gathered the data, it doesn't make sense to throw it away, so we will often want to store it for later analysis.
Visualization: Humans are better at distinguishing visualized data than raw numbers, especially when there's a lot of data. As we have data already gathered and stored, it is easy to generate simple graphs from it.
Sounds simple? That's because it is. But then we start to want more features, such as easy and efficient configuration, escalations, and permission delegation. If we sit down and start listing the things we want to keep an eye out for, it may turn out that that area of interest extends beyond the network, for example, a hard drive that has Self-Monitoring, Analysis, and Reporting Technology (SMART) errors logged, an application that has too many threads, or a UPS that has one phase overloaded. It is much easier to manage the monitoring of all these different problem categories from a single configuration point.
In the quest for a manageable monitoring system, wondrous adventurers stumbled upon collections of scripts much like the way they themselves implemented obscure and not-so-obscure workstation-level software and heavy, expensive monitoring systems from big vendors.
Many went with a different category—free software. We will look at a free software monitoring solution, Zabbix.