Recovering from a disaster
Failures are a regular occurrence in large clusters. Hard drives fail, servers fail, and even entire data centers go dark. Shifting services to cloud platforms such as AWS and Azure has helped, but even they have had entire regions go down. Using containers may make your applications more resistant to failure, but the hosts running those containers can still fail for any number of reasons. A properly engineered cluster should be able to cope with disaster. Here are a few things to bear in mind to keep your cluster safe.
Restarting the full cluster
There may be times when the entire swarm has to be shut down. Hopefully, there will be time to properly shut down running services and the hosts. When the time comes to shut down the hosts, start with the workers, then shut down the managers. When the cluster is started up again, start the managers first, then the workers. Make sure that the managers have the same IP addresses or your nodes will come up and not be able...
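The ordering described above (workers down first and managers last, then the reverse on the way back up) can be sketched as a small planning helper. This is only an illustration: the Node class, the function names, and the host names are hypothetical, and the code only computes an order. It does not talk to Docker or power anything off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    hostname: str  # hypothetical host name, for illustration only
    role: str      # "manager" or "worker"


def shutdown_order(nodes):
    """Return host names in shutdown order: workers first, then managers."""
    workers = [n.hostname for n in nodes if n.role == "worker"]
    managers = [n.hostname for n in nodes if n.role == "manager"]
    return workers + managers


def startup_order(nodes):
    """Return host names in startup order: managers first, then workers."""
    managers = [n.hostname for n in nodes if n.role == "manager"]
    workers = [n.hostname for n in nodes if n.role == "worker"]
    return managers + workers


# Example inventory; these names are assumptions, not taken from a real cluster.
nodes = [
    Node("mgr-1", "manager"),
    Node("wrk-1", "worker"),
    Node("mgr-2", "manager"),
    Node("wrk-2", "worker"),
]

print(shutdown_order(nodes))  # ['wrk-1', 'wrk-2', 'mgr-1', 'mgr-2']
print(startup_order(nodes))   # ['mgr-1', 'mgr-2', 'wrk-1', 'wrk-2']
```

In practice the inventory would come from your own records or from the swarm itself, and each host would be shut down or started by whatever tooling you already use. The sketch only fixes the sequence.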