Why is Disaster Recovery Needed?
A lot of people may ask themselves: "Why would we need a 'guide' for Disaster Recovery? If a Domain Controller (DC) has a critical failure, we just install another one". This might seem to work at first, and even for a longer period in small organizations, but in the long run there will be problems and a lot of error messages. Correct recovery is crucial to ensure a stable AD environment. Problems appear far more quickly if there are multiple locations of various sizes across different time zones and countries. For example, let's say a company called Nail Corporation (www.nailcorp.com) has its headquarters in Los Angeles, California, and a branch office with several hundred employees in Munich, Germany, in addition to branch offices in Brazil and India.
NailCorp has one big AD domain and a data center in Brazil having a 512 kilobit link to the headquarters. Let's suppose that the data center in Brazil is partially destroyed due to an earthquake. Network connectivity is restored fairly quickly, but both DCs are physically broken and have therefore become non-functional. The company has around 10,000 employees and, according to Microsoft's AD Sizer software, the space requirement for each Global Catalog server is about 5GB.
Because you have to start the rebuild process from scratch, and you have no other DC at the site, you have to replicate 5GB over a 512 kilobit link. Even at the full line rate, with no other traffic flowing at the same time, that takes close to 22 hours. That ideal is nearly impossible to reach, because your users will inevitably boot their machines and want to start working, so in practice you would need well over a day to replicate the database. This extends your restoration time by at least a day.
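The estimate for the Brazil site can be checked with a few lines of arithmetic. The following is a minimal sketch, not a sizing tool: the function name `transfer_time_hours` and the `efficiency` parameter are illustrative assumptions, the 5GB figure is treated as decimal gigabytes, and protocol overhead and replication compression are ignored.

```python
def transfer_time_hours(size_gb: float, link_kbps: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to copy size_gb gigabytes over a link.

    link_kbps is the nominal link speed in kilobits per second.
    efficiency is the assumed fraction of the link available to
    replication (1.0 means the link carries nothing else).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    size_bits = size_gb * 1e9 * 8                 # decimal gigabytes -> bits
    effective_bps = link_kbps * 1e3 * efficiency  # usable bits per second
    return size_bits / effective_bps / 3600       # seconds -> hours


# NailCorp's Brazil site: a 5GB Global Catalog over a 512 kilobit link.
ideal = transfer_time_hours(5, 512)
# Hypothetical case: user traffic takes half of the link.
shared = transfer_time_hours(5, 512, efficiency=0.5)

print(f"Ideal link:  {ideal:.1f} hours")   # 21.7 hours
print(f"Shared link: {shared:.1f} hours")  # 43.4 hours
```

The 50 percent figure is only an assumption chosen to show how quickly the estimate grows once users start competing for the link.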
After a disaster at a company such as NailCorp, you would want to replicate and rebuild as fast as possible. During the rebuild, however, the machines at the affected site authenticate against the other domain controllers in your company (assuming your DNS service is globally configured to support failover). That traffic shares the same link, so your replication will be much slower. In this case, you should have more plans in place than just installing another DC.
Note
To learn more about how DNS and authentication (DC selection) for Windows XP clients work, please read Microsoft Knowledge Base article 314861 (http://support.microsoft.com/kb/314861).
Another good example is an application that authenticates against a specific DC, or pulls specific information from one. If that DC breaks, it will have to be rebuilt with the same name. If you do not do this the right way, you may see strange things happening. This is not very far-fetched, especially in, for example, a software development company.
The need for Disaster Recovery is ever-increasing, and there are several books that touch upon the subject. But none of them are dedicated to different scenarios, and certainly none of them explain the entire process.
Recovering AD from any kind of disaster is trickier than most people think. If you do not understand the processes associated with recovery, you can damage more than you fix.
In order to prevent any kind of major interruption, and to speed up recovery in the event of a disaster, there are several things that can be done.
For example, AD relies extremely heavily on DNS. So if you use AD Integrated (ADI) DNS zones, you should have a standard backup DNS server that has a complete copy of your zones in a non-integrated form. This DNS server should be on an isolated network, and should contain only the records and zones relating to AD, and not all existing dynamic updates.
You should also have a Delayed Replication Site (DRS), also called a lag site. This is a standard part of your AD domain. It should have one or two DCs, maybe a DNS server, and even a standby Exchange server in case one is needed. However, the AD replication is set up with a high link cost in order to prevent replication for a longer time period. Alternatively, you can make it a completely isolated site behind a firewall and force replication only once every one to three months. This gives you a copy of your infrastructure in a known stable state. This state may be up to three months old, but if anything happens you can have a running AD within a few hours, instead of days.
Virtualization can be a boon, especially in this case. Buying a server is fairly cheap nowadays, and for a DRS the main requirement is a lot of memory in the machine. VMware Server (http://vmware.com/products/server/) and Microsoft Virtual Server (http://www.microsoft.com/windowsserversystem/virtualserver/) can be downloaded and used for free nowadays. Both of these systems allow the DRS to be run in a virtualized, isolated environment.
Having a DRS can reduce restore time tremendously because, even if there is a global failure, the old DCs can be removed and new ones installed to replicate from the DRS.