Troubleshooting Zabbix
All of the previous Q&As cover a few of the most common issues new users might encounter. There are a lot of other issues one might run into, and with new versions of Zabbix, new issues will appear. While it's good to have quick solutions to common problems, let's look at some details that could be helpful when debugging Zabbix problems.
The Zabbix log file format
One of the first places we should check when there's an unexplained issue is log files. This is not just a Zabbix-specific thing; log files are great. Sometimes. Other times, they do not help, but we will discuss some other options for when log files do not provide the answer. To be able to find the answer, though, it is helpful to know some basics about the log file format. The Zabbix log format is as follows:
PPPPPP:YYYYMMDD:HHMMSS.mmm
Here, PPPPPP is the process ID, space-padded to 6 characters, YYYYMMDD is the current date, HHMMSS is the current time, and mmm is the milliseconds of the timestamp. The colons and the dot are literal symbols. This prefix is followed by a space and then by the actual log message. Here's an example log entry:
10372:20151223:134406.865 database is down: reconnecting in 10 seconds
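As a quick illustration (plain shell parameter expansion, not a Zabbix tool), the prefix of such an entry can be split into its fields; note that real entries may carry leading spaces because the PID is padded to 6 characters:

```shell
# Split the Zabbix log prefix into PID, date, and time (sample line below).
line='10372:20151223:134406.865 database is down: reconnecting in 10 seconds'
pid=${line%%:*}                      # everything before the first colon
rest=${line#*:}
logdate=${rest%%:*}                  # YYYYMMDD
logtime=${rest#*:}
logtime=${logtime%% *}               # HHMMSS.mmm
echo "$pid $logdate $logtime"
```

Running this prints the three prefix fields: 10372 20151223 134406.865.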
If there's a line in the log file without this prefix, it is most likely coming from an external source such as a script, or maybe from some library such as Net-SNMP.
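Such foreign lines can be picked out by filtering on the prefix pattern. Here's a minimal sketch; the sample file name and both sample lines are made up for illustration:

```shell
# Build a two-line sample log (for illustration only) and keep just the
# lines that lack the standard PID:date:time prefix, i.e. foreign output.
printf '%s\n' \
  '10372:20151223:134406.865 database is down: reconnecting in 10 seconds' \
  'NET-SNMP version 5.7.3' > sample.log
grep -vE '^ *[0-9]+:[0-9]{8}:[0-9]{6}\.[0-9]{3} ' sample.log
```

Only the NET-SNMP line survives the filter, since it carries no Zabbix prefix.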
During startup, output similar to the following will be logged:
12331:20151215:163629.968 Starting Zabbix Server. Zabbix 3.0.0 (revision {ZABBIX_REVISION}).
12331:20151215:163630.020 ****** Enabled features ******
12331:20151215:163630.020 SNMP monitoring:        YES
12331:20151215:163630.020 IPMI monitoring:        YES
12331:20151215:163630.020 Web monitoring:         YES
12331:20151215:163630.020 VMware monitoring:      YES
12331:20151215:163630.020 SMTP authentication:    YES
12331:20151215:163630.020 Jabber notifications:   NO
12331:20151215:163630.020 Ez Texting notifications: YES
12331:20151215:163630.020 ODBC:                   YES
12331:20151215:163630.020 SSH2 support:           YES
12331:20151215:163630.020 IPv6 support:           NO
12331:20151215:163630.020 TLS support:            NO
12331:20151215:163630.020 ******************************
12331:20151215:163630.020 using configuration file: /usr/local/etc/zabbix_server.conf
12331:20151215:163630.067 current database version (mandatory/optional): 03000000/03000000
12331:20151215:163630.067 required mandatory version: 03000000
The first line prints out the daemon type and version. Depending on how it was compiled, it might also include the current SVN revision number. A list of the compiled-in features follows—very useful to know whether you should expect SNMP, IPMI, or VMware monitoring to work at all. Then, the path to the currently used configuration file is shown—helpful when we want to figure out whether the file we changed was the correct one. In the server and proxy log files, both the current and the required database versions are present—we discussed those in Chapter 22, Zabbix Maintenance.
After the database versions, the internal process startup messages can be found:
2583:20151231:155712.323 server #0 started [main process]
2592:20151231:155712.334 server #5 started [poller #3]
2591:20151231:155712.336 server #4 started [poller #2]
2590:20151231:155712.337 server #3 started [poller #1]
2593:20151231:155712.339 server #6 started [poller #4]
There will be many more lines like these; the output here is trimmed. This might help verify that the expected number of processes of some type has been started. When looking at log file contents, it is not always obvious which process logged a specific line—and this is where the startup messages can help. If we see a line such as the following, we can find out which process logged it:
21974:20151231:184520.117 Zabbix agent item "vfs.fs.size[/,free]" on host "A test host" failed: another network error, wait for 15 seconds
We can do that by looking for the startup message with the same PID:
# grep 21974 zabbix_server.log | grep started
21974:20151231:184352.921 server #8 started [unreachable poller #1]
Note
If more than one line is returned, apply common sense to identify the actual startup message.
This demonstrates that hosts are deferred to the unreachable poller after the first network failure.
But what if the log file has been rotated and the original startup messages are lost? Besides more advanced detective work, there's a simple method, provided that the daemon is still running. We will look at that method a bit later in this chapter.
Reloading the configuration cache
We met the configuration cache in Chapter 2, Getting Your First Notification, and we discussed ways to monitor it in Chapter 22, Zabbix Maintenance. While it helps a lot performance-wise, it can be a bit of a problem if we are trying to quickly test something. It is possible to force Zabbix server to reload the configuration cache. Run the following to display Zabbix server options:
# zabbix_server --help
Note
We briefly discussed Zabbix proxy configuration cache reloading in Chapter 19, Using Proxies to Monitor Remote Locations.
In the output, look for the runtime control options section:
-R --runtime-control runtime-option   Perform administrative functions

Runtime control options:
  config_cache_reload             Reload configuration cache
Thus, reloading the server configuration cache can be initiated by the following:
# zabbix_server --runtime-control config_cache_reload
zabbix_server [2682]: command sent successfully
Examining the server log file will reveal that it has received the signal:
forced reloading of the configuration cache
In the background, the sending of the signal happens like this:
1. The server binary looks up the default configuration file.
2. It then looks for the file specified in the PidFile option.
3. It sends the signal to the process with that ID.
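The lookup part of these steps can be sketched in shell. This is a rough illustration only: the configuration file here is generated on the spot with made-up contents, and the real runtime control delivers the signal internally rather than via an external command:

```shell
# Sketch of how the binary locates the target process (illustration only):
# write a tiny sample configuration file, then extract the PidFile value.
CONF=zabbix_server.conf.sample
printf 'LogFile=/tmp/zabbix_server.log\nPidFile=/tmp/zabbix_server.pid\n' > "$CONF"
PIDFILE=$(awk -F= '$1 == "PidFile" {print $2}' "$CONF")
PIDFILE=${PIDFILE:-/tmp/zabbix_server.pid}   # fall back to a default if unset
echo "$PIDFILE"
# The daemon then reads the PID from this file and signals that process.
```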
As discussed in Chapter 19, Using Proxies to Monitor Remote Locations, the great thing with this feature is that it's also supported for active Zabbix proxies. Even better, when an active proxy is instructed to reload its configuration cache, it connects to the Zabbix server, gets all the latest configuration, and then reloads the local configuration cache. If such a signal is sent to a passive proxy, it ignores the signal.
What if you have several proxies running on the same system—how can you tell the binary which exact instance should reload the configuration cache? Looking back at the steps that were taken to deliver the signal to the process, all that is needed is specifying the correct configuration file. If running several proxies on the same system, each must have its own configuration file already, specifying different PID files, log files, listening ports, and so on. Instructing a proxy that used a specific configuration file to reload the configuration cache would be this simple:
# zabbix_proxy -c /path/to/zabbix_proxy.conf --runtime-control config_cache_reload
Note
An absolute path must be provided for the configuration file; a relative path is not supported.
Note
The same principle applies for servers and proxies, but it is even less common to run several Zabbix servers on the same system.
Manually reloading the configuration cache is useful if we have a large Zabbix server instance and have significantly increased the CacheUpdateFrequency parameter.
Controlling running daemons
A configuration cache reload was only one of the things available in the runtime section. Let's look at the remaining options in there:
housekeeper_execute             Execute the housekeeper
log_level_increase=target       Increase log level, affects all processes if target is not specified
log_level_decrease=target       Decrease log level, affects all processes if target is not specified

Log level control targets:
  pid                           Process identifier
  process-type                  All processes of specified type (for example, poller)
  process-type,N                Process type and number (e.g., poller,3)
As discussed in Chapter 22, Zabbix Maintenance, the internal housekeeper is first run 30 minutes after the server or proxy startup. The housekeeper_execute runtime option allows us to run it at will:
# zabbix_server --runtime-control housekeeper_execute
Even more interesting is the ability to change the log level of a running process. This feature first appeared in Zabbix 2.4, and it made debugging much, much easier. Zabbix daemons are usually started and just work, until we have to change something. While we cannot tell any of the daemons to re-read their configuration file, there are a few more options that allow us to control some aspects of a running daemon. As briefly mentioned in Chapter 22, Zabbix Maintenance, the DebugLevel parameter allows us to set the log level when the daemon starts, with the default being 3. Log level 4 adds all the SQL queries, and log level 5 also adds the content received from web monitoring and VMware monitoring. For the uninitiated, anything above level 3 can be very surprising and intimidating. Even a very small Zabbix server can easily log tens of megabytes in a few minutes at log level 4. As some problems might not appear immediately, one might have to run at log level 4 or 5 for hours or days. Imagine dealing with gigabytes of logs you are not familiar with. The ability to set the log level for a running process allows us to increase the log level during a problem situation and lower it later, without requiring a daemon restart.
Even better, when using the runtime log level feature, we can select which exact components should have their log level changed. Individual processes can be identified by either their system PID or by the process number inside Zabbix. Specifying processes by the system PID could be done like this:
# zabbix_server --runtime-control log_level_increase=1313
Specifying an individual Zabbix process is done by choosing the process type and then passing the process number:
# zabbix_server --runtime-control log_level_increase=trapper,3
A fairly useful and common approach is changing the log level for all processes of a certain type—for example, we don't know which trapper will receive the connection that causes the problem, so we could easily increase the log level for all trappers by omitting the process number:
# zabbix_server --runtime-control log_level_increase=trapper
And if no parameter is passed to this runtime option, it will affect all Zabbix processes:
# zabbix_server --runtime-control log_level_increase
When processes are told to change their log level, they log an entry about it and then change the log level:
21975:20151231:190556.881 log level has been increased to 4 (debug)
Note that there is no way to query the current log level or to set a specific level. If you are not sure about the current log level of all the processes, there are two ways to sort it out:

1. Restart the daemon.
2. Decrease or increase the log level five times so that it's guaranteed to be at 0 or 5, then set the desired level.
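The second approach can be wrapped in a small helper function. This is just a sketch; the daemon binary name is passed as an argument so the same function works for zabbix_server and zabbix_proxy:

```shell
# reset_log_level BINARY: drive all processes down to level 0, then raise
# the level three times to land on the default DebugLevel of 3.
reset_log_level() {
    for i in 1 2 3 4 5; do
        "$1" --runtime-control log_level_decrease
    done
    for i in 1 2 3; do
        "$1" --runtime-control log_level_increase
    done
}
# Usage on a live system: reset_log_level zabbix_server
```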
As a simple test of the options we just explored, increase the log level for all pollers:
# zabbix_server --runtime-control log_level_increase=poller
Follow the Zabbix server log file:
# tail -f /tmp/zabbix_server.log
Notice the amount of data just 5 poller processes on a tiny Zabbix server can generate. Then decrease the log level:
# zabbix_server --runtime-control log_level_decrease=poller
Runtime process status
Zabbix has another small trick to help with debugging. Run top and see which sorting mode gives you a more stable and longer list of Zabbix processes: sorting by processor usage (hit Shift + P) or by memory usage (hit Shift + M).
Note
Alternatively, hit o and type COMMAND=zabbix_server.
Press C and notice how the Zabbix processes have updated their command lines to show which exact internal process each one is and what it is doing:
zabbix_server: poller #1 [got 0 values in 0.000005 sec, idle 1 sec]
zabbix_server: poller #4 [got 1 values in 0.000089 sec, idle 1 sec]
zabbix_server: poller #5 [got 0 values in 0.000004 sec, idle 1 sec]
Follow their status and see how the task, and the time it takes, changes for some of the processes. The output can also be redirected or filtered through other commands:
# top -c -b | grep zabbix_server
The -c option tells top to show the command line, the same thing we achieved by hitting C before. The -b option tells top to run in batch mode, not accepting input and just printing the results. We could also specify -n 1 to run it only once, or any other number as needed.
It might be more convenient to use ps:
# ps -f -C zabbix_server
The -f flag enables full output, which includes the command line. The -C flag filters by the executable name:

zabbix   21969 21962  0 18:43 ?  00:00:00 zabbix_server: poller #1 [got 0 values in 0.000006 sec, idle 1 sec]
zabbix   21970 21962  0 18:43 ?  00:00:00 zabbix_server: poller #2 [got 0 values in 0.000008 sec, idle 1 sec]
zabbix   21971 21962  0 18:43 ?  00:00:00 zabbix_server: poller #3 [got 0 values in 0.000004 sec, idle 1 sec]
The full format prints out some extra columns; if all we needed was the PID and the command line, we could limit the columns in the output with the -o flag, like this:

# ps -o pid=,command= -C zabbix_server
21975 zabbix_server: trapper #1 [processed data in 0.000150 sec, waiting for connection]
21976 zabbix_server: trapper #2 [processed data in 0.001312 sec, waiting for connection]
Note
The equals sign after pid and command tells ps not to print a header for these columns.
And to see a dynamic list that shows the current status, we can use the watch command:
# watch -n 1 'ps -o pid=,command= -C zabbix_server'
This list will be updated every second. Note that the interval parameter -n also accepts decimals, so to update twice every second, we could use -n 0.5.
This is also the method to find out which PID corresponds to which process type if the startup messages are not available in the log file: we can see the process type and PID in the output of top or ps.
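For example, to check a single PID, we can ask ps for just its command line. The PID below is the current shell, standing in as a placeholder for a real Zabbix process:

```shell
# Show the command line of one specific PID; on a Zabbix host, replace $$
# with the PID taken from the log line you are investigating.
ps -o command= -p $$
```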
Further debugging
There are a lot of things that could go wrong, and a lot of tools to help find out why. If you are familiar with the standard toolbox, including tools such as tcpdump, strace, ltrace, and pmap, you should be able to figure out most Zabbix problems.
Note
Some people claim that everything is a DNS problem. Often, they are right—if nothing else helps, check the DNS. Just in case.
We won't discuss them here, as it would be quite out of scope; that's general Linux or Unix debugging. Of course, there are still a lot of Zabbix-specific things that could go wrong. You might want to check out the Zabbix troubleshooting page on the wiki: http://zabbix.org/wiki/Troubleshooting. If that does not help, make sure to check the community and commercial support options, such as the Zabbix IRC channel we will discuss in Appendix B, Being Part of the Community.