The simple answer is everything, or as much as you can. You can never predict what scenario may be forced upon you and your cluster, and having the correct monitoring and alerting in place can mean the difference between handling a situation gracefully and having a full-scale outage. The things that should be monitored are listed below in decreasing order of importance.
What should be monitored
Ceph health
The most important thing to capture is the health status of Ceph. The main reporting item is the overall health status of the cluster, either HEALTH_OK, HEALTH_WARN, or HEALTH_ERR. By monitoring this state, you will be alerted any time Ceph itself thinks that something is not right. In addition to this, you may also want...
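As a minimal sketch of capturing the overall health status, the following Python helper runs `ceph health --format json` and maps the result to an exit code. Several things here are assumptions rather than anything stated above: the function names are hypothetical, the Nagios-style exit codes (0, 1, 2, 3) are just one common convention, and the JSON key holding the status is assumed to be `status` on newer releases or `overall_status` on older ones, so check the output of your own Ceph version before relying on it.

```python
import json
import subprocess

# Assumed Nagios-style mapping; adapt to whatever your monitoring system expects.
EXIT_CODES = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}
UNKNOWN = 3


def parse_health(raw_json):
    """Extract the overall health string from `ceph health --format json` output."""
    data = json.loads(raw_json)
    # Assumption: newer releases report "status"; some older ones used "overall_status".
    return data.get("status") or data.get("overall_status") or "UNKNOWN"


def health_exit_code(status):
    """Map a health string to an exit code, treating anything unexpected as UNKNOWN."""
    return EXIT_CODES.get(status, UNKNOWN)


def check_cluster(timeout=30):
    """Query the local `ceph` CLI and return the overall health string.

    Returns "UNKNOWN" if the command is missing, fails, times out,
    or prints something that is not valid JSON.
    """
    try:
        result = subprocess.run(
            ["ceph", "health", "--format", "json"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        return parse_health(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        return "UNKNOWN"
```

A monitoring agent could call `check_cluster()` on a schedule and pass the result to `health_exit_code()` to decide whether to raise a warning or a critical alert. Treating an unreachable cluster as UNKNOWN rather than OK is deliberate: a check that cannot run should not be read as a healthy cluster.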