Disaster recovery is the set of measures you undertake when your deployment has undergone a significant failure that exceeds the capabilities of your fault tolerance. Some example scenarios that might require disaster recovery efforts include the following:
- Storage failures: For example, your storage environment has two redundant disk controllers and both fail before you can return the system to full capacity, or more than one disk fails simultaneously in a RAID-5 disk volume.
- Virtual machine host failures: For example, your environment comprises virtual machines and the underlying hypervisors fail in a way that prevents the virtual machines supporting your environment from powering up.
- Software updates: This could apply to operating system updates, application platform updates, driver updates, or other application updates that render the system unusable.
- Database failure: Since the majority of SharePoint Server's services rely on storing and retrieving information from databases, a catastrophic database failure could prevent components in the farm from working correctly.
- Primary data center site compromise: Any event that impacts your primary data center, such as an extended power outage, a flood or another natural disaster, a network connectivity service interruption, or military action.
Your organization may require you to be prepared to resume activities in the event of any of these scenarios (or others that may apply to your environment). The ability to recover or restore operations is gauged by three measurements:
- Recovery Point Objective (RPO): The RPO can be expressed in several ways, such as "the last available backup from which to initiate a restore" or "the acceptable amount of data loss."
- Recovery Level Objective (RLO): A sub-function of the RPO, the RLO defines the granularities that you need to be able to recover (such as a data center, rack, host, farm, server, application, database, site, document library, folder, or file).
- Recovery Time Objective (RTO): The amount of time it takes to get a system operational with the data parameters of the RPO. This can also be referred to as how long the outage can last or "how long we're down."
A business recovery objective or requirement might be expressed as follows:
Must be able to recover a SharePoint farm at the document library level (RLO) in less than 2 hours (RTO) with no more than 2 hours of potential data loss (RPO).
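As a rough illustration of how a requirement like this can be checked against the results of a recovery drill, here is a minimal Python sketch. The names `RecoveryObjective`, `RecoveryDrill`, and `unmet_objectives` are hypothetical and are not part of SharePoint Server or any Microsoft tooling. The sketch also assumes that the granularities listed in the RLO definition can be treated as a strict coarse-to-fine ranking, which may not hold in every environment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List

# Recovery granularities in the order the RLO definition lists them.
# Treating this list as a strict coarse-to-fine ranking is an assumption.
RLO_LEVELS = [
    "data center", "rack", "host", "farm", "server", "application",
    "database", "site", "document library", "folder", "file",
]


@dataclass
class RecoveryObjective:
    """A business recovery requirement (hypothetical structure)."""
    rlo: str          # finest granularity that must be recoverable
    rto: timedelta    # recovery must complete in less than this time
    rpo: timedelta    # data loss must not exceed this window


@dataclass
class RecoveryDrill:
    """Measured results of a recovery test (hypothetical structure)."""
    finest_level_restored: str
    time_to_restore: timedelta
    data_loss_window: timedelta


def unmet_objectives(objective: RecoveryObjective, drill: RecoveryDrill) -> List[str]:
    """Return the objectives the drill failed to meet; an empty list means all were met."""
    unmet = []
    # RLO: the drill must restore at the required granularity or finer.
    if RLO_LEVELS.index(drill.finest_level_restored) < RLO_LEVELS.index(objective.rlo):
        unmet.append("RLO")
    # RTO: "in less than 2 hours" is a strict bound.
    if drill.time_to_restore >= objective.rto:
        unmet.append("RTO")
    # RPO: "no more than 2 hours" allows exactly the stated window.
    if drill.data_loss_window > objective.rpo:
        unmet.append("RPO")
    return unmet


# The example requirement: document library level, under 2 hours, at most 2 hours of data loss.
requirement = RecoveryObjective("document library", timedelta(hours=2), timedelta(hours=2))
drill = RecoveryDrill("document library", timedelta(minutes=90), timedelta(hours=2))
print(unmet_objectives(requirement, drill))
```

Expressing the requirement as data makes it easier to compare each recovery test against the same targets, rather than judging the outcome informally after the fact.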
As you put together a disaster recovery plan for SharePoint Server, it's important to start with the organization's goals (such as the number of hours of downtime or how much potential data loss is acceptable), and then recommend strategies, processes, and products based on that business requirement.
Outage costs
Outages fall into three categories, generally as follows:
- Planned loss of application or service (such as a service upgrade or scheduled maintenance)
- Unplanned loss of application or service
- Loss of data
Loss of an application or service may prevent your organization from generating revenue or performing activities that the business needs in order to operate. The financial impact depends on which application or service is inoperable. An application or service can also suffer a partial loss (such as running in a degraded fashion), which may leave the system usable for some activities and not for others.
Planned outages are typically communicated to business users or customers and are scheduled to happen during low periods of activity. Unplanned outages, conversely, happen without notice due to some type of system failure.
Loss of data, depending on the type of data affected, could have a significant financial impact on an organization.
Depending on the type of application or data hosted by a SharePoint Server environment and the type of outage incurred, you may need to evaluate one or more disaster recovery options.
Disaster recovery options, costs, and considerations
Disaster recovery options (and their costs) can be quite varied, from a simple backup and restore to full standby data center solutions. Here are some examples of disaster recovery options:
| Type | Components | Notes | Relative deployment and maintenance cost | Recovery time |
| --- | --- | --- | --- | --- |
| Tape or disk-to-disk backup solution | Tape or disk-to-disk backup hardware, software | This simplest form of disaster recovery covers only the applications and data. It is typically the cheapest option to deploy and maintain, but it depends on the organization being able to provide infrastructure, should the need to recover data arise. | Lowest | Longest |
| Cold standby infrastructure | Dedicated servers ready to be configured in the event of a disaster | This solution builds on having backups by providing dedicated hardware. This hardware is not configured or maintained but is waiting for a disaster so that it can be configured to meet the exact recovery requirements. Cold standby infrastructure is typically infrastructure that can be available within hours or days. | Low | Long |
| Warm standby infrastructure | Dedicated servers that are regularly maintained and available | A warm standby infrastructure disaster recovery scenario uses dedicated equipment that is kept up to date on a schedule using regular restores or synchronizations of data. Warm standby infrastructure can typically be used to make a solution available within minutes to hours. | Medium | Medium |
| Hot standby infrastructure | Dedicated servers that are regularly maintained and kept up to date, ready for failover | Hot standby infrastructure, like warm standby infrastructure, is dedicated equipment that is kept up to date. Unlike warm standby infrastructure, however, hot standby infrastructure is ready to take over within seconds to minutes. Hot standby infrastructure plans frequently rely on load balancing and data replication technologies. | Expensive | Shortest |
| Cold standby data center | Dedicated data center space with equipment ready to be provisioned | A cold standby data center strategy relies on having available equipment and backups at a secondary location. This is a somewhat expensive solution to maintain (data center space, networking equipment, and server equipment are required, and backups must be kept available) and has both high recovery time and high recovery point objectives. It will likely take days or weeks to get a cold standby data center operational. | Somewhat expensive | Long |
| Warm standby data center | Dedicated data center space with pre-configured equipment, ready to accept failover or restores | Similar to a warm standby infrastructure solution, a warm standby data center disaster recovery solution means you have equipment that is mostly up to date at a remote location. The most recent data can be applied to this environment, typically within minutes or hours. | More expensive | Medium |
| Hot standby data center | Dedicated servers that are regularly maintained and kept up to date, ready for failover in a separate data center space | Building on the concepts of hot standby infrastructure, a hot standby data center recovery strategy is the most resilient (and expensive) solution to maintain, as it requires both investment (data center space, dedicated equipment, software, networking, and communications) and sound process execution. Hot standby data centers can be ready within seconds to minutes and can have the lowest recovery time and recovery point objectives for overcoming a full primary site disaster. | Most expensive | Shortest |
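To show how the options above might be narrowed down from a business requirement, the following Python sketch encodes them and picks the lowest-cost option that satisfies a recovery time target. Several parts are assumptions made for illustration only: the numeric `cost_rank` values impose one possible ordering on the relative cost labels (in particular, ranking "Somewhat expensive" below "Expensive"), the `separate_site` flag assumes that only the data center options survive loss of the primary site, and the names `DrOption`, `RecoveryTime`, and `cheapest_option` are hypothetical.

```python
from enum import IntEnum
from typing import NamedTuple, Optional


class RecoveryTime(IntEnum):
    """Relative recovery time classes from the options above (lower is faster)."""
    SHORTEST = 1
    MEDIUM = 2
    LONG = 3
    LONGEST = 4


class DrOption(NamedTuple):
    name: str
    cost_rank: int            # assumed ordinal ranking of the relative cost labels
    recovery_time: RecoveryTime
    separate_site: bool       # assumption: True only for the data center options


OPTIONS = [
    DrOption("Tape or disk-to-disk backup solution", 1, RecoveryTime.LONGEST, False),
    DrOption("Cold standby infrastructure", 2, RecoveryTime.LONG, False),
    DrOption("Warm standby infrastructure", 3, RecoveryTime.MEDIUM, False),
    DrOption("Cold standby data center", 4, RecoveryTime.LONG, True),
    DrOption("Hot standby infrastructure", 5, RecoveryTime.SHORTEST, False),
    DrOption("Warm standby data center", 6, RecoveryTime.MEDIUM, True),
    DrOption("Hot standby data center", 7, RecoveryTime.SHORTEST, True),
]


def cheapest_option(slowest_acceptable: RecoveryTime,
                    needs_separate_site: bool) -> Optional[DrOption]:
    """Return the lowest-cost option that meets the constraints, or None if none does."""
    candidates = [
        option for option in OPTIONS
        if option.recovery_time <= slowest_acceptable
        and (option.separate_site or not needs_separate_site)
    ]
    return min(candidates, key=lambda option: option.cost_rank, default=None)


choice = cheapest_option(RecoveryTime.MEDIUM, needs_separate_site=True)
print(choice.name if choice else "No option meets the requirement")
```

A real evaluation would replace the ordinal ranks with actual cost estimates and measured recovery times for your environment; the point of the sketch is only that the requirement drives the choice, not the other way around.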
As with designing a fault-tolerance strategy, you'll also want to design a disaster recovery strategy that takes failure domains into account. These failure domains might include the following:
- Application, workload, database, or service
- Infrastructure or platform
- Farm
- Data center
Finally, no disaster recovery plan is complete without documentation that allows the technicians or support staff to return services to their full operational status. These operational recovery plans (sometimes referred to as runbooks or playbooks) should include things such as the following:
- Step-by-step printed instructions used to recover services from each failure or disaster mode, covering details such as operating system installation and configuration, IP address schemes, and database names
- Tested scripts for building, deploying, and testing the configuration
- Operational procedures for restoring data
- Correct versions of software installation media and any applicable licensing information (such as key files, licenses, or other activation/registration information necessary to bring the service online)
- Emergency contact information for building access, infrastructure personnel, and application or business owners
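One way to keep a runbook honest is to check it for completeness whenever it changes. The Python sketch below is a minimal example of that idea, assuming the runbook is tracked as a simple dictionary; the section keys and the `missing_sections` function are hypothetical names chosen to mirror the list above, not part of any existing tool.

```python
from typing import Dict, List

# Section keys mirror the runbook contents listed above (hypothetical names).
REQUIRED_SECTIONS = (
    "recovery_instructions",
    "tested_scripts",
    "data_restore_procedures",
    "installation_media_and_licensing",
    "emergency_contacts",
)


def missing_sections(runbook: Dict[str, list]) -> List[str]:
    """Return required sections that are absent from the runbook or empty."""
    return [section for section in REQUIRED_SECTIONS if not runbook.get(section)]


# Illustrative runbook with placeholder entries; two sections are incomplete.
runbook = {
    "recovery_instructions": ["Restore the content database"],
    "tested_scripts": ["build-farm script"],
    "data_restore_procedures": [],
    "emergency_contacts": ["Data center building access desk"],
}
print(missing_sections(runbook))
```

A check like this only confirms that each section exists. It does not replace regularly rehearsing the recovery steps themselves.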
Evaluating the business objectives (recovery time objective and recovery point objective) in conjunction with the budget will help you arrive at an appropriate disaster recovery strategy for your organization.
Next, we'll look at backup and restore as part of the SharePoint Server planning process.