Reviewing some case studies
This section walks through real-world scenarios of Elasticsearch node failure and how to address them.
The ES process quits unexpectedly
A few weeks ago, we noticed in Marvel that the Elasticsearch process was down on one of our nodes. We restarted Elasticsearch on this node, and everything seemed to return to normal. However, when we checked Marvel later that week, the node was down again. We looked at the Elasticsearch log files but did not find any exceptions. Because nothing in the Elasticsearch log explained the shutdown, we suspected that the operating system had killed Elasticsearch. Checking syslog at /var/log/syslog, we saw the error:
Out of memory: Kill process 5969 (java) score 446 or sacrifice child
This confirmed that the operating system killed Elasticsearch because the system was running out of memory. We checked the Elasticsearch configuration and did not see any issues. This node was configured in the same way as the other nodes in the cluster...
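The syslog check described above can be automated with a short script. The following is a minimal sketch, not part of the original investigation: the names OOM_PATTERN and find_oom_kills are hypothetical, and the pattern assumes the exact message format quoted above. Kernel versions word this message differently, so the expression may need adjusting for your system.

```python
import re

# Assumption: the OOM-killer message has the same shape as the line quoted
# above ("Out of memory: Kill process <pid> (<name>) score <n> ...").
OOM_PATTERN = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def find_oom_kills(lines):
    """Return a (pid, process_name, score) tuple for each OOM-killer line."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append(
                (int(match.group("pid")), match.group("name"), int(match.group("score")))
            )
    return kills


if __name__ == "__main__":
    # The message from the case study, used here as sample input.
    sample = ["Out of memory: Kill process 5969 (java) score 446 or sacrifice child"]
    for pid, name, score in find_oom_kills(sample):
        print(f"OOM killer terminated {name} (pid {pid}, score {score})")
```

To scan a real log, pass an open file object instead of the sample list, for example find_oom_kills(open("/var/log/syslog", errors="replace")). Reading syslog may require elevated privileges on some systems.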