What is High Availability and Disaster Recovery?
High Availability
High availability refers to providing an agreed level of system or application availability by minimizing the downtime caused by infrastructure or hardware failure.
When the hardware fails, there's not much you can do other than switch the application to a different computer so as to make sure that the hardware failure doesn't cause application downtime.
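An agreed availability level is usually stated as a percentage, and that percentage translates directly into a downtime budget. The following is a minimal sketch of the arithmetic. The function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each additional "nine" in the target cuts the budget by a factor of ten, which is why higher targets generally cannot be met without a standby computer that is ready to take over.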
Disaster Recovery
Business continuity and disaster recovery, though often used interchangeably, are different concepts.
Disaster recovery refers to re-establishing application or system connectivity and availability at an alternate site, commonly known as a DR site, after an outage at the primary site. The outage can be caused by a site-wide (data center-wide) infrastructure failure or a natural disaster.
Business continuity is a strategy that ensures that a business is up and running with minimal or zero downtime or service outage. For example, as a part of business continuity, an organization may plan to decouple an application into small standalone applications and deploy each one in a different region. Let's say that the finance application is deployed in region one and the sales application is deployed in region two. If a disaster hits region one, the finance application will go down, and the company will follow the disaster recovery plan to recover it. However, the sales application in region two will stay up and running.
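The effect of spreading applications across regions can be sketched in a few lines. This is only an illustration of the example above. The `DEPLOYMENTS` map and the `split_by_outage` helper are hypothetical names, not part of any product.

```python
from typing import Dict, List, Tuple

# Hypothetical deployment map taken from the example: one application per region.
DEPLOYMENTS: Dict[str, str] = {
    "finance": "region one",
    "sales": "region two",
}


def split_by_outage(deployments: Dict[str, str], failed_region: str) -> Tuple[List[str], List[str]]:
    """Return (applications needing disaster recovery, applications still running)."""
    needs_recovery = sorted(app for app, region in deployments.items() if region == failed_region)
    still_running = sorted(app for app, region in deployments.items() if region != failed_region)
    return needs_recovery, still_running


if __name__ == "__main__":
    recover, running = split_by_outage(DEPLOYMENTS, "region one")
    print("Follow the DR plan for:", recover)
    print("Unaffected:", running)
```

With a disaster in region one, only the finance application ends up in the recovery list, while the sales application is reported as unaffected.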
High availability and disaster recovery are not only required during hardware failures; you also need them in the following scenarios:
System upgrades: Critical upgrades to software, hardware, network, or storage often require the system to be rebooted, and configuration changes made during the upgrade may cause further application downtime afterward. If an HA setup is present, the upgrade can be done with little or no downtime, for example by upgrading one server while the other serves users.
Human errors: As it's rightly said, to err is human. We can't avoid human errors; however, we can have a system in place to recover from them. A deployment error, an application misconfiguration, or bad code can cause an application to fail. An example of this is the GitLab outage on January 31, 2017, which was caused by the accidental removal of customer data from the primary database server, resulting in an overall downtime of 18 hours.
Note
You can read more about the GitLab outage post-mortem here: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/.
Security breaches: Cyber-attacks are a lot more common these days and can result in downtime while you find and fix the issue. In most cases, moving the application to a secondary database server helps reduce the downtime while you fix the security issue.
Let's look at an example of how high availability and disaster recovery work to provide business continuity in the case of outages.
Consider the following diagram:
The preceding diagram shows a common HA and DR implementation with the following configuration:
The primary and secondary servers (SQL Server instances) are in Virginia. This is for high availability (a standby system is available to take over).
The primary and secondary servers are in the same data center and are connected over LAN.
A DR server (a third SQL Server instance) is in Ohio, which is far away from Virginia. This instance serves as the DR site.
The DR site is connected to the primary site over a long-distance network link. This is usually a private network, rather than the public internet, for added security.
The primary SQL Server (node 1) is active and is currently serving user transactions.
The secondary and DR servers are inactive or passive and are not serving user transactions.
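The configuration listed above can be written down as data and checked against the rules it implies. This is a minimal sketch. The `Node` fields and the `check_topology` helper are assumptions made for illustration and do not correspond to any SQL Server API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    role: str     # "primary", "secondary", or "dr"
    region: str
    active: bool  # True if the node is serving user transactions


# Layout from the description: an HA pair in Virginia and a DR server in Ohio.
TOPOLOGY = [
    Node("node 1", "primary", "Virginia", active=True),
    Node("node 2", "secondary", "Virginia", active=False),
    Node("node 3", "dr", "Ohio", active=False),
]


def check_topology(nodes: List[Node]) -> List[str]:
    """Return the problems found; an empty list means the layout matches the description."""
    problems = []
    by_role = {node.role: node for node in nodes}
    if sum(node.active for node in nodes) != 1:
        problems.append("exactly one node should be active")
    if by_role["primary"].region != by_role["secondary"].region:
        problems.append("primary and secondary should share a region")
    if by_role["dr"].region == by_role["primary"].region:
        problems.append("DR node should be in a different region than the primary")
    return problems


if __name__ == "__main__":
    print(check_topology(TOPOLOGY) or "layout matches the description")
```

The check that matters most for disaster recovery is the last one: a DR server in the same region as the primary would be lost in the same region-wide outage.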
Let's say there is a motherboard failure on node 1 and it crashes. This causes node 2 to become active automatically and start serving user transactions. This is shown in the following diagram:
This is an example of high availability where the system automatically switches to the secondary node within the same data center or a different data center in the same region (Virginia here).
The system can fail back to the primary node once it's fixed and up and running.
Note
A data center is a facility that's typically owned by a third-party organization, from which customers can rent or lease infrastructure. A node here refers to a standalone physical computer. A disaster recovery site is a data center in a different geographical region than that of the primary site.
Now, let's say that while the primary server, node 1, was being recovered, there was a region-wide failure that caused the secondary server, node 2, to go down. At this point, the region is down; therefore, the system will fail over to the DR server, node 3, and it'll start serving user transactions, as shown in the following diagram:
This is an example of disaster recovery. Once the primary and secondary servers are up and running, the system can fail back to the primary server.
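The whole walkthrough (node failure, region failure, and the return to the primary) reduces to picking the first healthy server in a fixed preference order. The sketch below illustrates that idea only. `pick_active` is a hypothetical helper, and real clusters rely on health checks and other safeguards that are not modeled here.

```python
# Preference order taken from the walkthrough: primary, then the secondary in
# the same region, then the DR server in another region.
NODES = [
    ("node 1", "Virginia"),
    ("node 2", "Virginia"),
    ("node 3", "Ohio"),
]


def pick_active(failed_nodes=frozenset(), failed_regions=frozenset()):
    """Return the name of the first healthy node in preference order, or None."""
    for name, region in NODES:
        if name not in failed_nodes and region not in failed_regions:
            return name
    return None


if __name__ == "__main__":
    # All servers healthy: the primary serves user transactions.
    print(pick_active())
    # Motherboard failure on node 1: high availability within the region.
    print(pick_active(failed_nodes={"node 1"}))
    # Region-wide outage while node 1 is still down: disaster recovery.
    print(pick_active(failed_nodes={"node 1"}, failed_regions={"Virginia"}))
```

Calling `pick_active()` again with no failures returns node 1, which corresponds to returning to the primary server once it has been repaired.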
Note
Organizations periodically perform DR drills (mock DR) to make sure that the DR solution works as expected and to estimate the downtime that an actual DR scenario may cause.
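One simple way to turn drill results into a downtime estimate is to record when the simulated outage started and when service was restored on the DR server. The sketch below assumes exactly that. The helper name and the timestamps are made up for illustration and are not measurements.

```python
from datetime import datetime


def drill_downtime_minutes(outage_start: datetime, service_restored: datetime) -> float:
    """Return the downtime observed in one drill, in minutes."""
    if service_restored < outage_start:
        raise ValueError("service cannot be restored before the outage starts")
    return (service_restored - outage_start).total_seconds() / 60


# Made-up drill timestamps (outage start, service restored), for illustration only.
drills = [
    (datetime(2024, 1, 13, 2, 0), datetime(2024, 1, 13, 2, 25)),
    (datetime(2024, 4, 13, 2, 0), datetime(2024, 4, 13, 2, 40)),
]

observed = [drill_downtime_minutes(start, end) for start, end in drills]
worst_case = max(observed)

if __name__ == "__main__":
    print("Observed downtime per drill (minutes):", observed)
    print("Worst case so far (minutes):", worst_case)
```

Planning around the worst observed drill rather than the average is a conservative choice, since an actual disaster is unlikely to go more smoothly than a rehearsal.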