Disaster recovery for Kubernetes essentially comes down to a backup-and-restore strategy for the cluster state. Let's first look at which components in Kubernetes are stateful:
- Etcd cluster (https://etcd.io/) that persists the state for Kubernetes API server resources.
- Persistent volumes used by pods.
And surprisingly (or not), that is all! The master node components and the pods running on worker nodes hold no unrecoverable state; if you provision a replacement node, Kubernetes can move the workload onto it and keep the business running. Once your etcd cluster is restored, Kubernetes takes care of reconciling the state of the cluster components.
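Since etcd holds the API server state, backing it up usually means taking a snapshot with `etcdctl snapshot save`. The following is a minimal sketch of how such a backup could be scripted, not a definitive procedure. The helper names (`build_snapshot_command`, `snapshot_filename`, `save_snapshot`) are hypothetical, and the endpoint and certificate paths you pass in depend on how your cluster was deployed.

```python
import os
import subprocess
from datetime import datetime, timezone


def build_snapshot_command(snapshot_path, endpoint, cacert, cert, key):
    """Return the etcdctl argument list that saves a snapshot to snapshot_path."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",  # CA certificate used to verify the etcd server
        f"--cert={cert}",      # client certificate
        f"--key={key}",        # client private key
        "snapshot",
        "save",
        snapshot_path,
    ]


def snapshot_filename(now=None):
    """Build a timestamped file name so successive backups do not overwrite each other."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("etcd-snapshot-%Y%m%dT%H%M%SZ.db")


def save_snapshot(backup_dir, endpoint, cacert, cert, key):
    """Run etcdctl (assumed to be on PATH) and return the path of the new snapshot."""
    path = os.path.join(backup_dir, snapshot_filename())
    # ETCDCTL_API=3 selects the v3 API, which provides the snapshot subcommand.
    env = dict(os.environ, ETCDCTL_API="3")
    subprocess.run(
        build_snapshot_command(path, endpoint, cacert, cert, key),
        env=env,
        check=True,  # raise CalledProcessError if etcdctl exits with an error
    )
    return path
```

On a node that can reach an etcd member, a call might look like `save_snapshot("/var/backups/etcd", "https://127.0.0.1:2379", "/etc/kubernetes/pki/etcd/ca.crt", "/etc/kubernetes/pki/etcd/server.crt", "/etc/kubernetes/pki/etcd/server.key")`. These paths assume a kubeadm-style certificate layout and the backup directory is only an example; adjust both to your environment.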
Let's take a look at how to back up and restore persistent volumes. It all depends on how your persistent volumes are provisioned...