Monitoring your server's operation (Medium)
Monitoring is a part of any operation management practice. As with any other monitoring discipline, you will need to choose which Key Performance Indicators (KPIs) are applicable to your business and application, and which thresholds/ranges are acceptable and unacceptable for you.
Getting ready
Yes, you can do manual, real-time monitoring by running command-line tools yourself. Let's face it, a lot of systems administrators still log in via SSH to their servers and run top for a reason: immediate snapshots have immediate value in the monitoring process. But you can also install a monitoring product and chart the values over time. A mix of the two approaches is valuable when managing a Debian system.
Some of the monitoring products out there, such as Munin, Nagios, and Zenoss, have default values for most of the usual metrics monitored for web servers; however, you need to perform test runs that might span a couple of weeks or so, with different types of loads, to understand your acceptable ranges.
How it works…
The following are some examples of KPIs for your web server on Debian:
Ping round-trip time (RTT), which at a coarse level helps you detect network outages and, at a finer level, helps you understand latency issues to your server.
Network interface throughput, which helps you understand capacity and usage.
Disk usage, memory usage, and CPU usage. Tools such as vmstat will also help you do real-time analysis. Combined, the three metrics help you find bottlenecks in your application, which in most cases will be I/O bound. Taken individually, each metric only gives you an idea of whether you need to add capacity.
TCP response times, which help measure how consistently the network stack responds to a request on the port where your server is running (both the web and the database servers, if you're using TCP connections).
HTTP request/response times, which help measure how consistently the web server responds to a request. More advanced monitoring includes checking for expected responses, specific URIs, and form workloads for testing (which may be useful for detecting defacements, for example).
Database query response times, which help find bottlenecks in your application. In teams with DBAs, this is usually done by the DBA as part of a performance optimization effort, but a DevOps team might add it to its own monitoring plate.
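As a rough illustration of the first KPI, the following shell sketch pulls the average RTT out of the summary line that ping prints and compares it with a threshold. It assumes the "min/avg/max" summary format used by the Linux ping; the function names, the 100 ms threshold, and the www.example.com host are placeholders for illustration, not recommended values.

```shell
#!/bin/sh
# Sketch: extract the average RTT (ms) from ping output.
# Assumes a summary line such as:
#   rtt min/avg/max/mdev = <min>/<avg>/<max>/<mdev> ms

# avg_rtt: reads ping output on stdin, prints the average RTT in ms.
avg_rtt() {
    awk -F'=' '/min\/avg\/max/ { split($2, v, "/"); print v[2] }'
}

# rtt_status RTT MAX: prints "ok" if RTT <= MAX (both in ms), else "slow".
rtt_status() {
    awk -v rtt="$1" -v max="$2" 'BEGIN {
        if (rtt + 0 <= max + 0) print "ok"; else print "slow"
    }'
}

# Example (needs network access, so it is left commented out):
# rtt=$(ping -c 5 www.example.com | avg_rtt)
# echo "average RTT: ${rtt} ms ($(rtt_status "$rtt" 100))"
```

An empty result from avg_rtt means ping printed no summary line with timings, which usually points to an outage rather than a latency problem.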
How to do it…
The easiest way to get a monitoring glimpse of your server's operation is to log in via SSH and interpret the output of certain commands.
Run vmstat. You will get a fairly cryptic snapshot of the following:
procs-r: It indicates processes waiting for run time. It is a potential indicator of CPU-bound applications. The lower, the better; 0 is best (test under load).
procs-b: It indicates processes in uninterruptible sleep, which is not good. 0 is best.
memory-swpd: It indicates swapped memory. Here, 0 is normal. Having any amount here is bad because processes will have to wait for the disk whenever they touch swapped-out pages. You can also tweak the swappiness (tendency to swap). For swap in/swap out (si/so), the lower the better; they will be 0 if you have no swapped memory.
memory-cache: It indicates cached memory. It is generally good as it will avoid a slower I/O.
memory-free: It indicates free memory. Linux tends to use otherwise idle memory for buffers and cache, so a low value here is not always an indicator of anything in particular; if it's 0, though, you're running out of memory.
io-bi/bo: It indicates blocks in/out of disk. Here, lower is better. High numbers that grow without control or pattern under load are indicative of an I/O-bound application. Either find the I/O bottlenecks and remove them, or invest in faster storage or a different storage architecture. Also see CPU waiting time (wa), which is the time spent waiting for I/O (here, lower is better).
cpu-us/id: It indicates the CPU usage versus the idle time. Idle means responsiveness but also underconsumption of CPU power. A fair, consistent usage amount is a good indicator of a stable load.
You can use vmstat n to get a sample every n seconds. For example, vmstat 5 is good when you're doing a load test to see where your app bottlenecks are. Here is an example of a sample web server when running httperf. The first two samples are before the test, and the next two are during the test:
Can you spot the bottleneck? Yes, it's disk I/O. You can see swap is good, and the CPU and memory usage rates are good as well, so there is no need to increase capacity.
This setup runs on a VM (specifically, a VPS), which tends to have slow storage (some people prefer to use network-based storage to avoid using the disk drivers of their hypervisors).
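If you would rather not read the vmstat columns by eye during a load test, a small filter can flag the suspicious samples for you. The following is a minimal sketch, not a finished tool: the vmstat_flags name and the 20 percent default threshold for wa are assumptions made for illustration, and you should pick thresholds from your own test runs. Column positions are read from the header row because the set of columns varies between vmstat versions.

```shell
#!/bin/sh
# Sketch: flag vmstat samples that show swap activity or high I/O wait.

# vmstat_flags MAX_WA: reads vmstat output on stdin and prints one line
# per flagged sample. MAX_WA is a hypothetical threshold in percent.
vmstat_flags() {
    awk -v max_wa="${1:-20}" '
        # Header row (r b swpd free ...): remember where each column is.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data rows start with a number; the banner line does not.
        $1 ~ /^[0-9]+$/ && ("wa" in col) {
            n++
            if ($(col["si"]) + $(col["so"]) > 0)
                printf "sample %d: swapping (si=%s so=%s)\n", n, $(col["si"]), $(col["so"])
            if ($(col["wa"]) + 0 > max_wa + 0)
                printf "sample %d: I/O wait %s%% exceeds %s%%\n", n, $(col["wa"]), max_wa
        }
    '
}

# Example: take four samples, five seconds apart, while the load test runs.
# vmstat 5 4 | vmstat_flags 20
```

Keep in mind that the first row vmstat prints reports averages since boot, so a flag on sample 1 says little about the current load.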
You can use jnettop (called as jnettop -i <interface name>) to check the bandwidth usage of each TCP connection. You can see aggregates and check whether the HTTP requests or the SQL connections are using all of your available bandwidth. There are several strategies to increase the bandwidth, which we cover in this book under the Using proxies, caches, and clusters to scale your architecture recipe.
As an alternative, you can use tcptrack to track the TCP state as well; although, for network load, we like to use jnettop, which looks like:
Previously we used httperf to simulate a load scenario. Now let's suppose you want to use it to actually measure the response time of your web server, simulating 10 connections using the command httperf --hog --server=www.example.com --num-conns=10. It should look like the following screenshot:
Our application is able to handle 2.4 req/s (given the 10-connection workload and the fact that we ran this locally), or conversely it takes 0.4 seconds to reply to one request. This operation used 0.9 KB/s of bandwidth. For production environments, you would want to set up a more complex test environment, with remote computers simulating hundreds or thousands of connections from different sources.
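The relationship between the request rate and the time per request can be sketched in shell as follows. This assumes the "Request rate:" summary line that httperf prints at the end of a run; the function names are made up for this example. The inversion is only meaningful when requests are issued one after another, as in the command above, which does not set a target rate.

```shell
#!/bin/sh
# Sketch: read the request rate from httperf output and invert it.

# request_rate: reads httperf output on stdin, prints the req/s figure.
request_rate() {
    awk '$1 == "Request" && $2 == "rate:" { print $3 }'
}

# seconds_per_request RATE: converts req/s into seconds per request.
seconds_per_request() {
    awk -v rate="$1" 'BEGIN {
        if (rate + 0 <= 0) { print "n/a"; exit }
        printf "%.2f\n", 1 / rate
    }'
}

# Example (needs httperf and a reachable server, so it is commented out):
# rate=$(httperf --hog --server=www.example.com --num-conns=10 | request_rate)
# echo "${rate} req/s is about $(seconds_per_request "$rate") s per request"
```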
Measuring the response time for a SQL query is trivial. If you can write your query on the command line (for example, with mysql -e or psql -c), you can wrap the entire statement in a time call: time (mysql -u root -p tsa -e 'select count(*) from token').
Take a look at the user and sys values; since this statement prompts for a password, real is artificially higher. Also notice that this measurement includes the time necessary to start the MySQL binary, connect, and so on, so it might be biased; for single queries, the mysql console already gives you an execution time in seconds. You could also compare the value over time by wrapping everything in a watch statement. But soon you will find out that the query response time depends on a lot of variables such as server load, I/O load, and so on, and that it is more efficient to focus on the queries that are systematically slow.
If using MySQL, edit /etc/mysql/my.cnf and uncomment the log_slow_queries directive. Queries taking more than long_query_time to complete will get logged to that file. Then your programmers, DBA, and you can sit and work on that query.
If using Postgres, edit /etc/postgresql/9.1/main/postgresql.conf and set log_min_duration_statement to a value (for example, 250ms).
Restart your database with sudo service mysql restart or sudo service postgresql restart and start taking a look at the logs.
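Once the PostgreSQL log starts filling up, a filter such as the following can help you find the worst offenders. It is only a sketch: it assumes log entries that contain "duration: <value> ms", which is what log_min_duration_statement produces, and it ignores whatever prefix your log_line_prefix setting puts in front. The slow_statements name, the 500 ms cutoff, and the log path in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: list PostgreSQL log entries whose duration is above a cutoff.

# slow_statements MIN_MS: reads a PostgreSQL log on stdin and prints
# "<duration in ms><TAB><original line>" for entries at or above MIN_MS.
slow_statements() {
    awk -v min_ms="${1:-250}" '
        {
            # Find the "duration:" token; the next field is the value in ms.
            for (i = 1; i < NF; i++) {
                if ($i == "duration:" && $(i + 1) + 0 >= min_ms + 0) {
                    print $(i + 1) "\t" $0
                    break
                }
            }
        }
    '
}

# Example: the ten slowest logged statements (the log path is an assumption).
# slow_statements 500 < /var/log/postgresql/postgresql-9.1-main.log | sort -rn | head
```

Printing the duration first keeps the output easy to sort numerically, so the slowest statements end up at the top.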