High availability and fault tolerance
High availability, in simple terms, means achieving very close to 100% system uptime by ensuring that there is no single point of failure. This is typically done by incorporating redundancy mechanisms, such as backup processes that take over immediately from failed ones.
Mastering high availability
In Mesos, this is achieved using Apache ZooKeeper, a centralized coordination service. Multiple masters are set up (one active leader, with the others acting as backups), and ZooKeeper coordinates the leader election and lets other Mesos components, such as slaves and frameworks, detect the current leading master.
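To make the idea of an elected leader with standby backups more concrete, the following is a minimal in-memory sketch of the general pattern used by ZooKeeper-style leader election: each candidate registers under an increasing sequence number, the lowest live number is treated as the leader, and a failed candidate's registration disappears. It does not talk to ZooKeeper and is not how Mesos itself is implemented; the class ElectionSim and the names master-1, master-2, and master-3 are hypothetical and exist only for illustration.

```python
import itertools


class ElectionSim:
    """In-memory stand-in for an election path (hypothetical, not a real ZooKeeper client)."""

    def __init__(self):
        self._counter = itertools.count()
        self._nodes = {}  # sequence number -> master name

    def join(self, master):
        # Each candidate registers with the next sequence number,
        # mimicking an ephemeral sequential node.
        seq = next(self._counter)
        self._nodes[seq] = master
        return seq

    def fail(self, master):
        # A failed master's registration disappears, as an ephemeral
        # node would when its session ends.
        self._nodes = {s: m for s, m in self._nodes.items() if m != master}

    def leader(self):
        # The live candidate with the lowest sequence number leads.
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


sim = ElectionSim()
for name in ("master-1", "master-2", "master-3"):
    sim.join(name)

first_leader = sim.leader()   # leader while all masters are up
sim.fail("master-1")          # the current leader goes down
second_leader = sim.leader()  # a backup takes over
```

In this sketch, slaves and frameworks would simply call leader() again after a failure to find the new leading master, which loosely mirrors the detection role that ZooKeeper plays in the description above.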
A minimum of three master nodes is required to maintain a quorum in a high availability setup. For production systems, however, the recommendation is at least five. The leader election process is described in detail at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.
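The arithmetic behind these numbers can be sketched as follows, assuming the usual majority definition of a quorum (more than half of the masters). The function names quorum_size and tolerated_failures are hypothetical helpers introduced for this example.

```python
def quorum_size(masters):
    """Smallest number of masters that forms a strict majority."""
    return masters // 2 + 1


def tolerated_failures(masters):
    """How many masters can fail while a majority is still reachable."""
    return masters - quorum_size(masters)


for n in (1, 3, 5):
    print(n, "masters: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "failure(s)")
```

Under this assumption, three masters give a quorum of two and survive the loss of one master, while five masters give a quorum of three and survive the loss of two. A single master tolerates no failures at all.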
The state of a failed master can be recreated on whichever master gets elected...