Chapter 1. Introducing Nagios
Imagine you're an administrator of a large IT infrastructure. You have just started receiving e-mails that a web application has suddenly stopped working. When you try to access the same page, it just does not load. What are the possibilities? Is it the router? Maybe the firewall? Perhaps the machine hosting the page is down? The server process has crashed? Before you even start thinking rationally about what to do, your boss calls about the critical situation and demands explanations. In all this panic, you'll probably start plugging everything in and out of the network, rebooting the machine... and it still doesn't help.
After hours of nervous digging into the issue, you've finally found the root cause: although the web server was working properly, it continuously timed out during communication with the database server. This is because the machine with the database did not get an IP address assigned. Your organization requires all IP addresses to be configured using the DHCP protocol and the local DHCP server ran out of memory and killed several processes, including the dhcpd process responsible for assigning IP addresses. Imagine how much time it would take to determine all this manually! To make things worse, the database server could be located in another branch of the company or in a different time zone, and it could be the middle of the night over there.
But what if you had Nagios up and running across your entire company? You would just go to the web interface and see that there are no problems with the web server and the machine on which it is running. There would also be a list of issues—the machine serving IP addresses to the entire company does not do its job and the database is down. If the setup also monitored the DHCP server, you'd get a warning e-mail that little swap memory is available or too many processes are running. Maybe it would even have an event handler for such cases to just kill or restart non-critical processes. Also, Nagios would try to restart the dhcpd process over the network in case it is down.
In the worst case, Nagios would reduce hours of investigation to ten minutes. Ideally, you would just get an e-mail that there was such a problem and another e-mail that it's already fixed. You would just disable a few services and increase the swap size for the DHCP machine and solve the problem permanently. Hopefully, it would be solved fast enough so that nobody would notice that there was a problem in the first place!