Review – example troubleshooting session
Let’s look at an example troubleshooting session. All we know is that one specific Linux server is running extremely slowly.
To begin with, we want to see what’s happening on the system. You just learned that you can see a live view of processes running on a system by running the interactive top command. Let’s try that now.
Figure 2.9: Example of output of the top command
By default, the top command sorts processes by CPU usage, so we can simply look at the first listed process to find the offending one. Indeed, the first process in the list is using 94% of one CPU’s available processing time.
As a result of running top, we’ve gotten a few useful pieces of information:
- The problem is CPU usage, as opposed to some other kind of resource contention.
- The offending process is PID 1763, and the command being run (listed in the COMMAND column) is bzip2, which is a compression program.
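For a scripted or logged check, a one-shot snapshot can stand in for the interactive view. The following is a minimal sketch, assuming the procps version of ps found on most Linux distributions; the function name top_cpu is hypothetical. Note that ps reports CPU time averaged over each process’s lifetime, so its numbers will not exactly match the short-interval figures that top shows.

```shell
# top_cpu [N]: print the N processes with the highest %CPU (default 5).
# Hypothetical helper; relies on procps ps options (-e, -o, --sort).
top_cpu() {
    n=${1:-5}
    # PID, %CPU and command name, highest CPU first; +1 keeps the header row.
    ps -eo pid,pcpu,comm --sort=-pcpu | head -n "$((n + 1))"
}

top_cpu 5
```

The first row under the header plays the same role as the first process listed by top: it is the PID to look at next.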
We determine that this bzip2 process doesn’t need to be running here, and we decide to stop it. Using the kill command, we ask the process to terminate:
kill 1763
After waiting a few seconds, we check to see if this (or any other) bzip2 process is running:
pgrep bzip2
Unfortunately, we see that the same PID is still running. It’s time to get serious:
kill -9 1763
This orders the operating system to kill the process without allowing the process to trap (and potentially ignore) the signal. A SIGKILL (signal #9) simply kills the process where it stands.
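The "ask first, then force" sequence above can be wrapped in a small function. This is only a sketch under stated assumptions: the name stop_pid and the default five-second grace period are hypothetical choices, not something prescribed here. It uses kill -0, which sends no signal and only reports whether the PID still exists.

```shell
# stop_pid PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 5, an arbitrary choice), then fall back to SIGKILL.
stop_pid() {
    pid=$1
    timeout=${2:-5}

    # Ask politely first; SIGTERM (15) is the default signal for kill.
    # Return 1 if the signal cannot be sent (no such PID, or no permission).
    kill "$pid" 2>/dev/null || return 1

    # Poll once per second; kill -0 only checks that the PID exists.
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Still there: SIGKILL (9) cannot be trapped or ignored.
    kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
    return 0
}
```

In the session above, this would be called as stop_pid 1763. Giving the process a grace period matters because SIGTERM lets a well-behaved program clean up temporary files before exiting, while SIGKILL does not.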
Now that we’ve killed the offending process, the server is running smoothly again and we can start tracking down the developer who thought it was a good idea to compress large source directories on this machine.
In this example, we followed the most common systems troubleshooting pattern in existence:
- We looked at resource usage (via top in this example). This can be any of the other tools we discussed, depending on which resource is the one being exhausted.
- We found a PID to investigate.
- We acted on that process. In this example, no further investigation was necessary and we sent a signal, asking it to shut down (15, SIGTERM). When the process was still running after that, we forced it to stop (9, SIGKILL).
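The pattern leaves room for an investigation step between finding a PID and acting on it. As a sketch of what that step might look like, the hypothetical helper below prints who owns a process, its parent PID, how long it has been running, and its full command line. It assumes the procps version of ps; the shell’s own PID ($$) is used so that the example runs anywhere.

```shell
# inspect_pid PID: show owner, parent, elapsed time, %CPU and full command.
# Hypothetical helper; the column names are standard procps ps format keywords.
inspect_pid() {
    ps -o pid,ppid,user,etime,pcpu,args -p "$1"
}

# Inspect the current shell; on the slow server this would be the suspect PID.
inspect_pid "$$"
```

The owner and parent PID are often enough to tell whether a process was started by a person, a scheduled job, or a service, which helps in deciding whether it is safe to stop.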