Monitoring Elasticsearch
Monitoring distributed systems is difficult: as the number of nodes, the number of users, and the amount of data grow, problems inevitably begin to crop up.
Furthermore, it may not be immediately obvious when an error occurs. Often, the cluster will keep running and try to recover from the error automatically. As shown in Figures 1.2, 1.3, and 1.4 earlier, a node failed, but Elasticsearch brought itself back to a green state without any action on our part. Unless the cluster is monitored, failures like these can go unnoticed, which can have a detrimental impact on system performance and reliability. Fewer nodes means less processing power to respond to queries, and, as in the previous example, if another node fails, our cluster won't be able to return to a green state.
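As a quick illustration, and assuming a cluster listening on the default http://localhost:9200, the cluster health API is the simplest way to see the current status at a glance. The following is a minimal sketch using only the Python standard library:

```python
import json
from urllib.request import urlopen

# Query the cluster health API; assumes Elasticsearch on localhost:9200
with urlopen("http://localhost:9200/_cluster/health") as resp:
    health = json.load(resp)

# 'status' is green, yellow, or red; unassigned shards usually mean
# a node has dropped out or replicas have nowhere to be allocated.
print("status:            ", health["status"])
print("nodes:             ", health["number_of_nodes"])
print("unassigned shards: ", health["unassigned_shards"])
```

Anything other than green here is usually the first hint that corrective action may be needed.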
The aspects of an Elasticsearch cluster that we'll want to keep track of include the following:
- Cluster health and data availability
- Node failures
- Elasticsearch JVM memory usage
- Elasticsearch cache size
- System utilization (CPU, memory, and disk)
- Query response times
- Query rate
- Data index times
- Data index rate
- Number of indices and shards
- Index and shard size
- System configuration
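Several of these metrics are exposed directly by Elasticsearch's REST API. As a rough sketch, again assuming a cluster on the default http://localhost:9200 (the endpoints below are standard, though the exact response layout can vary between Elasticsearch versions), per-node JVM heap usage and shard counts can be pulled like this:

```python
import json
from urllib.request import urlopen

BASE = "http://localhost:9200"  # assumed local cluster


def get(path):
    """Fetch a JSON document from the Elasticsearch REST API."""
    with urlopen(BASE + path) as resp:
        return json.load(resp)


# Per-node JVM heap usage from the nodes stats API
stats = get("/_nodes/stats/jvm")
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    print(f'{node["name"]}: heap {heap_pct}% used')

# Shard counts from the cluster health API
health = get("/_cluster/health")
print("active primary shards:", health["active_primary_shards"])
print("active shards:        ", health["active_shards"])
```

The tools covered in later chapters, such as Marvel, Bigdesk, and Nagios, collect and chart these same values over time, which is usually far more practical than polling the API by hand.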
In this book, we'll go over how to understand each of these variables in context and how understanding them can help diagnose, recover from, and prevent problems in our cluster. It's certainly not possible to preemptively stop all Elasticsearch errors. However, by proactively monitoring our cluster, we'll have a good idea of when things are awry and will be better positioned to take corrective action.
In the following chapters, we'll go over everything from web-based cluster monitoring tools to Unix command-line tools and log file monitoring. Some of the specific tools this book covers are as follows:
- Elasticsearch-head
- Bigdesk
- Marvel
- Kopf
- Kibana
- Nagios
- Unix command-line tools
These tools will give us the information we need to effectively diagnose, solve, and prevent problems with Elasticsearch.