Preventing failures is vital to achieving operational excellence. Failures are bound to happen, so it's critical to identify them as far in advance as possible. During architecture design, anticipate failures and design for them so that the failure of one component does not bring down the whole system. Assume that everything will fail all the time and have a backup plan ready. Perform regular pre-mortem exercises to identify potential sources of failure, and remove or mitigate any resource that could cause a failure during system operation.
Create test scenarios based on your SLA, which may include a system Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Run these scenarios and make sure you understand their impact. Prepare your team to respond to any incident by simulating incidents in a production-like environment. Test your response procedures to make sure they resolve issues effectively, and build a confident team that is familiar with response execution...
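As a rough illustration of checking a recovery drill against RTO and RPO targets, the following Python sketch compares timestamps recorded during a simulated outage with the agreed objectives. It takes RTO to be the maximum acceptable time to restore service and RPO to be the maximum acceptable window of lost data. The DrillResult class, the evaluate_drill function, and all timestamps and targets are hypothetical and chosen only for illustration; real values would come from your own SLA and monitoring data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps captured during a simulated outage (hypothetical values)."""
    last_good_backup: datetime   # most recent recoverable data point
    failure_time: datetime       # when the simulated failure began
    service_restored: datetime   # when the system was serving requests again


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured downtime and data-loss window with the RTO/RPO targets."""
    downtime = result.service_restored - result.failure_time
    data_loss_window = result.failure_time - result.last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Example drill: assumed targets of a 1-hour RTO and a 5-minute RPO.
drill = DrillResult(
    last_good_backup=datetime(2024, 1, 1, 11, 50),
    failure_time=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 45),
)
report = evaluate_drill(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
print(report["rto_met"], report["rpo_met"])
```

In this made-up drill, service returns within the assumed RTO, but the last backup is older than the assumed RPO allows, which would suggest more frequent backups or replication. Recording results like these after every drill gives the team a concrete measure of whether the response procedure is improving.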