Now that we have high availability and failover at the namespace level between datacenters, we need to achieve the same for the Mailbox server role. This is accomplished in a similar way to Exchange 2010, by extending a DAG across two or more datacenters.
An organization's SLA covering failure and disaster recovery scenarios is what mostly influences a DAG's design. Every aspect needs to be considered: the number of DAGs to deploy, the number of members in each DAG, the number of database copies, whether site resilience is to be used and whether it covers all users or only a subset, whether the multiple-site solution will be active/active or active/passive, and so on. As to the latter, there are generally three main scenarios to consider in a two-datacenter model.
In an active/passive configuration, all users' databases are mounted in an active (primary) datacenter, with a passive (standby) datacenter used only in the event of a disaster affecting the active datacenter; this is shown in the following diagram:
There are several reasons why organizations might choose this model. Usually, it is because the passive datacenter is not as well equipped as the active one and, as such, is not capable of efficiently hosting all the services provided in the active datacenter. Sometimes, it is simply due to the fact that most or all users are closer to the active datacenter.
In this example, a copy of each database in the New York datacenter is replicated to the MBX4 server in the New Jersey datacenter, so these copies can be used in a disaster recovery scenario to provide messaging services to users. By also having database replicas in New York, we provide intrasite resilience. For example, if server MBX1 goes down, database DB1, which is currently mounted on MBX1, will automatically fail over to server MBX2 or MBX3 without users even noticing or being affected.
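To see where a database is currently active and whether its copies are healthy, the Get-MailboxDatabaseCopyStatus cmdlet can be used. A minimal sketch, reusing the DB1 database name from the example above:
# List the status of every copy of DB1 across the DAG members
Get-MailboxDatabaseCopyStatus -Identity DB1 | Format-Table Name, Status, CopyQueueLength, ContentIndexState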
In some failure scenarios where a server shutdown is initiated (for example, when an Uninterruptible Power Supply (UPS) issues a shutdown command to the server), Exchange tries to activate another copy of the database(s) that the server is hosting before the shutdown completes. In the case of a hard failure (for example, a hardware failure), the other servers detect the problem and automatically mount the affected database(s) on another server.
In this scenario, we could lose up to two Mailbox servers in New York before having to perform a datacenter switchover. As New York is considered the primary datacenter, the witness server is placed in New York. If, for some reason, the primary site is lost, the majority of the quorum voters is lost, so the entire DAG goes offline. At this stage, administrators have to perform a datacenter switchover, just like in Exchange 2010. However, because the recovery of a DAG is decoupled from the recovery of the namespace, it becomes much easier to perform the switchover, assuming a global namespace is being used. All that the administrators need to do is run the following three cmdlets to get the DAG up and running again in New Jersey:
Stop-DatabaseAvailabilityGroup <DAG_Name> -ActiveDirectorySite NewYork
Stop-Service clussvc   # run on each DAG member in New Jersey
Restore-DatabaseAvailabilityGroup <DAG_Name> -ActiveDirectorySite NewJersey
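If the New York servers are completely unreachable, the first command can be run with the ConfigurationOnly switch so that only Active Directory is updated and, because the witness server in New York is also down, an alternate witness in New Jersey can be supplied during the restore. A sketch of this variant, where NJ-FS01 is a hypothetical file server name:
Stop-DatabaseAvailabilityGroup <DAG_Name> -ActiveDirectorySite NewYork -ConfigurationOnly
Stop-Service clussvc   # run on each DAG member in New Jersey
Restore-DatabaseAvailabilityGroup <DAG_Name> -ActiveDirectorySite NewJersey -AlternateWitnessServer NJ-FS01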
It is true that placing the witness server in the passive datacenter (when a DAG has the same number of nodes in both datacenters) would allow Exchange to automatically failover the DAG to the passive datacenter if the active site went down. However, there is a major disadvantage of doing this: if the passive site were to go down, even though it does not host any active databases, the entire DAG would go offline as the members in the active datacenter would not have quorum. This is why, in this scenario, it is recommended to always place the witness server in the active site.
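The witness server and directory for a DAG are configured (or moved) with the Set-DatabaseAvailabilityGroup cmdlet. A brief sketch, assuming a hypothetical file server named NYC-FS01 in the active New York site:
# Place the witness on a non-DAG server in the active (New York) datacenter
Set-DatabaseAvailabilityGroup <DAG_Name> -WitnessServer NYC-FS01 -WitnessDirectory C:\DAGWitness\<DAG_Name>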
In order to prevent databases in New Jersey from being automatically mounted by Exchange, the Set-MailboxServer cmdlet can be used together with the DatabaseCopyAutoActivationPolicy parameter to specify the type of automatic activation on selected Mailbox servers. This parameter can be configured to any of the following values:
Blocked: Databases are never automatically activated on the selected server(s).
IntrasiteOnly: Database copies are only automatically activated on servers within the same Active Directory site as the failed server, preventing cross-site failover or activation.
Unrestricted: There are no special restrictions on automatically activating database copies on the selected server(s); this is the default value.
For the preceding example, we would run the following cmdlet in order to prevent database copies from being automatically activated on MBX4:
Set-MailboxServer MBX4 -DatabaseCopyAutoActivationPolicy Blocked
As New York and New Jersey are in different AD sites, setting the DatabaseCopyAutoActivationPolicy parameter to IntrasiteOnly would achieve the same result. In either case, when performing a database switchover, administrators first need to remove the restriction on the target server, as shown in the following command; otherwise, they will not be able to mount any databases on it.
Set-MailboxServer MBX4 -DatabaseCopyAutoActivationPolicy Unrestricted
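The current policy on each Mailbox server can be checked with Get-MailboxServer, and once the restriction is lifted, a copy can be activated on MBX4 with Move-ActiveMailboxDatabase. A short example, reusing the server and database names from this scenario:
# Confirm the activation policy configured on every Mailbox server
Get-MailboxServer | Format-Table Name, DatabaseCopyAutoActivationPolicy
# Manually switch over DB1 to its copy on MBX4
Move-ActiveMailboxDatabase DB1 -ActivateOnServer MBX4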
In this configuration, users' mailboxes are hosted across both datacenters. This is a very common scenario for deployments with a user population close to both locations. If Exchange fails for users in either datacenter, services for those users are activated in the other datacenter. Instead of simply having some active databases on the MBX4 server (refer to the preceding diagram), multiple DAG members are deployed in the New Jersey datacenter in order to provide protection against additional failures, as well as additional capacity so that it can support the entire user population should the New York datacenter fail.
By having more than one member in each datacenter, we are able to provide both intrasite and intersite resilience. Proper planning is crucial, especially capacity planning, so that each server is capable of hosting all workloads, including protocol request handling, processing, and data rendering from other servers, without impacting performance.
In this example, the DAG is extended across both datacenters to provide site resilience for users on both sites. However, this particular scenario has a single point of failure: the network connection (most likely a WAN) between the datacenters. Remember that the majority of the voters must be active and able to talk to each other in order to maintain quorum. In the preceding diagram, the majority of voters are located in the New York datacenter, meaning that a WAN outage would cause a service failure for users whose mailboxes are mounted in New Jersey. This happens because when the WAN connection fails, only the DAG members in the New York datacenter are able to maintain the quorum. As such, servers in New Jersey will automatically bring their active database copies offline.
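The quorum-related details of a DAG, including which witness is currently in use and which members are operational, can be inspected with Get-DatabaseAvailabilityGroup; the real-time fields are only populated when the Status switch is used. For example:
# Show witness and membership details for the DAG
Get-DatabaseAvailabilityGroup <DAG_Name> -Status | Format-List Name, WitnessServer, WitnessShareInUse, Servers, OperationalServers, PrimaryActiveManager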
In order to overcome this single point of failure, multiple DAGs should be implemented, with each DAG having a majority of voters in different datacenters, as shown in the following diagram:
In this example, we would configure DAG1 with its witness server in New York and DAG2 with its witness server in New Jersey. By doing so, if a WAN outage happens, replication will fail, but all users will still have messaging services as both DAGs continue to retain quorum and are able to provide services to their local user population.
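As a sketch of how the two DAGs in this design might be created, each with its witness in the datacenter holding its voter majority (the witness server names below are hypothetical):
# DAG1 has the majority of its members and its witness in New York
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer NYC-FS01 -WitnessDirectory C:\DAGWitness\DAG1
# DAG2 has the majority of its members and its witness in New Jersey
New-DatabaseAvailabilityGroup -Name DAG2 -WitnessServer NJ-FS01 -WitnessDirectory C:\DAGWitness\DAG2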
The third scenario is the only one to provide automatic DAG failover between datacenters. It involves splitting a DAG across two datacenters, as in the previous scenarios, with the difference that the witness server is placed in a third location. This allows the witness to be arbitrated by DAG members in both datacenters regardless of the state of the network between the two sites. As such, it is key to place the witness server in a location that is isolated from possible network failures that might affect either location containing the DAG members.
This was fully supported in Exchange 2010, but the downside was that the solution itself would not fail over automatically, because the namespace still needed to be switched over manually. For this reason, it was not recommended. Going back to the advantage of the namespace in Exchange 2013 not needing to move with the DAG, this entire process now becomes automatic, as shown in the following diagram:
However, even though this scenario is now recommended, special consideration needs to be given to the case where the network link between the two datacenters hosting Exchange mailboxes fails. For example, even though the DAG will continue to be fully operational in both datacenters, CASs in New York will not be able to proxy requests to servers in New Jersey. As such, proper planning is necessary in order to minimize user impact in such an event. One way of doing this is to ensure that DNS servers local to users only resolve the namespace to the VIP on the same site. This negates the advantages of a single, global namespace, but as a workaround during an outage it reduces cross-site connections, ensuring users are not affected.
Microsoft has been testing the possibility of placing a DAG's witness server in a Windows Azure IaaS (Infrastructure as a Service) environment. However, this infrastructure does not yet support the necessary network components to cater to this scenario. At the time of writing this book, Azure supports two types of networks: single site-to-site VPN (a network connecting two locations) and one or more point-to-site VPNs (a network connecting a single VPN client to a location). The issue is that in order for a server to be placed in Azure and configured as a witness server, two site-to-site VPNs would be required, connecting each datacenter hosting Exchange servers to Azure, which is not possible today. As such, the use of Azure to place a DAG's witness server is not supported at this stage.
Datacenter Activation Coordination (DAC) mode is a DAG setting that is disabled by default. It is used to control the startup behavior of databases.
Let us suppose the following scenario: a DAG split across two datacenters, like the one shown in the first diagram of the section Scenario 2 – active/active, and the primary datacenter suffers a complete power outage. All servers and the WAN link are down, so a decision is made to activate the standby datacenter. Usually, in such scenarios, WAN connectivity is not instantly restored when the primary datacenter gets its power back. When this happens, members of the DAG in the primary datacenter are powered up but are not able to communicate with the other members in the standby datacenter that is currently active. As the primary datacenter contains most of the DAG quorum voters (or so it should), when power is restored, the DAG members located in the primary datacenter have the majority and, therefore, quorum. The issue is that with quorum they can mount their databases (assuming everything required to do so is operational, such as storage), which causes a discrepancy with the active databases already mounted in the standby datacenter. We would then have the exact same databases mounted simultaneously on separate servers. This is commonly known as split brain syndrome.
DAC was specifically created to prevent a split brain scenario. It does so through the Datacenter Activation Coordination Protocol (DACP). Once such a failure occurs, when the DAG is recovered, it will not automatically mount databases even if it has quorum. Instead, Active Manager uses DACP to evaluate the DAG's current state and determine whether databases should be mounted on each server.
Active Manager stores a bit in memory (a 0 or a 1) so that the DAG knows whether it can mount databases that are assigned as active on the local server. When DAC is enabled, every time Active Manager starts, it sets the bit to 0, meaning it is not allowed to mount any databases. The server is then forced to establish communication with the other DAG members in order to ask another server whether it is allowed to mount its local databases. The answer from the other members is simply their own bit setting in the DAG. If a server replies that its bit is set to 1, the starting server is permitted to mount databases and also sets its own bit to 1.
When restoring the DAG in the preceding scenario, on the other hand, all the members of the DAG in the primary datacenter will have their DACP bit set to 0. As such, none of the servers powering up in the recovered primary datacenter are allowed to mount any databases, because none of them can communicate with a server that has its DACP bit set to 1.
Besides preventing split brain scenarios, enabling DAC mode allows administrators to use the built-in site resilience cmdlets to carry out datacenter switchovers:
Stop-DatabaseAvailabilityGroup
Restore-DatabaseAvailabilityGroup
Start-DatabaseAvailabilityGroup
When DAC mode is disabled, both Exchange and cluster management tools need to be used when performing datacenter switchovers.
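For example, once the New York datacenter and the WAN link are back and replication has caught up, switching back with DAC mode enabled could look something like the following sketch (server, database, and site names as used earlier in this article):
# Rejoin the recovered New York members to the DAG
Start-DatabaseAvailabilityGroup <DAG_Name> -ActiveDirectorySite NewYork
# Once the copies are healthy, move the databases back to their preferred servers
Move-ActiveMailboxDatabase DB1 -ActivateOnServer MBX1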
DAC can only be enabled or disabled using the Set-DatabaseAvailabilityGroup cmdlet together with the DatacenterActivationMode parameter. To enable DAC mode, this parameter is set to DagOnly; to disable it, it is set to Off:
Set-DatabaseAvailabilityGroup <DAG_Name> -DatacenterActivationMode DagOnly
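The current setting can be verified afterwards, for example:
# Confirm that DAC mode is now enabled for the DAG
Get-DatabaseAvailabilityGroup <DAG_Name> | Format-List Name, DatacenterActivationMode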
When designing and configuring a DAG, it is important to consider the location of the witness server. As we have seen, this is very much dependent on the business requirements and what is available to the organization. As already discussed, Exchange 2013 allows scenarios that were not previously recommended, such as placing a witness server on a third location.
The following table summarizes the general recommendations around the placement of the witness servers according to different deployment scenarios:
| # DAGs | # Datacenters | Place Witness Server In... |
| --- | --- | --- |
| 1 | 1 | The datacenter where the DAG members are located. |
| 1 | 2 | The primary datacenter (refer to the diagrams of section Scenario 1), or a third location that is isolated from possible network failures that might affect either datacenter containing DAG members. |
| 2+ | 1 | The datacenter where the DAG members are located. The same witness server can be used for multiple DAGs, and a DAG member can be used as a witness server for another DAG. |
| 2+ | 2 | The datacenter where the DAG members are located (the same witness server can be used for multiple DAGs, and a DAG member can be used as a witness server for another DAG), or a third location that is isolated from possible network failures that might affect either datacenter containing DAG members. |
| 1 or 2+ | 3+ | The datacenter where administrators want the majority of voters to be, or a third location that is isolated from possible network failures that might affect either datacenter containing DAG members. |
Throughout this article, we explored the enhancements made to Exchange 2013 with regard to site resilience for the Mailbox server. As the recovery of a DAG is no longer tied to the recovery of the client access namespace, each of these components can easily be switched over between datacenters without affecting the other. This also allows administrators to place a witness server in a third datacenter in order to provide automatic failover for a DAG in either of the datacenters it is split across, something that was not recommended in Exchange 2010.