Designing fault-tolerant hierarchies
In today's world, the systems we design and deploy are under more pressure than ever. We are expected to deliver them on smaller budgets, which can lead to design mistakes or errors in judgment. One area that usually suffers is the design of fault tolerance and disaster recovery. Traditionally, Configuration Manager has been seen as a tool used only by IT departments to manage machines, and one that delivered little business benefit. With the new wave of mobile devices, tablets, and scenarios such as bring your own device, Configuration Manager has become a critical application used to manage all of these devices from a single pane of glass.
It is no secret that true fault tolerance in the form of clustering is missing from Configuration Manager, but that does not mean we cannot produce a design that provides a fault-tolerant service or is able to switch to a disaster recovery scenario.
Fault tolerance in site systems
The central administration site, primary site, and secondary site are all site systems. The site systems themselves cannot be part of a cluster or any type of load balancing. What we can do, though, is make the database that stores our configuration, inventory, and other information highly available.
This can be done in a number of ways; for example, we can host the database on a traditional SQL Server cluster. Whichever configuration we choose, the goal is the same: to make the database highly available.
Returning to the site systems: if we deploy our site system servers as virtual machines, we can take advantage of Hyper-V Replica or similar technologies in VMware. This allows the site system server to be switched over should the workload need moving in the event of a failure. In this scenario, if the site system server is deployed on the same server as SQL Server, we might not need to worry about making the database highly available separately, because the database moves with the virtual machine.
Fault tolerance in site system roles
Any other service you deploy in Configuration Manager, such as the management point, distribution point, or fallback status point, to name a few, is known as a site system role. For some roles, we can deploy multiple instances to provide fault tolerance, though not in the sense of a cluster.
Some site system roles can be deployed only once per hierarchy. The Endpoint Protection Point is one example.
The management point is a good exception to this. While we cannot choose which management point a client will communicate with in a primary site, we can deploy multiple management points to give the client options. If our hierarchy is running in an HTTPS configuration, the client orders HTTPS-enabled management points above any HTTP management points when selecting one to use.
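The HTTPS-before-HTTP preference can be illustrated with a minimal Python sketch. It is not how the Configuration Manager client is implemented, and real selection takes other factors into account. The `ManagementPoint` type, the server names, and the `order_management_points` function are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementPoint:
    name: str            # hypothetical server name
    https_enabled: bool  # True if the management point accepts HTTPS


def order_management_points(points):
    """Return the candidates with HTTPS-enabled management points first.

    sorted() is stable, so points keep their original relative order
    within the HTTPS group and within the HTTP group.
    """
    return sorted(points, key=lambda mp: not mp.https_enabled)


# Hypothetical primary site with three management points.
candidates = [
    ManagementPoint("mp-http-01", https_enabled=False),
    ManagementPoint("mp-https-01", https_enabled=True),
    ManagementPoint("mp-https-02", https_enabled=True),
]

ordered = order_management_points(candidates)
print([mp.name for mp in ordered])
# ['mp-https-01', 'mp-https-02', 'mp-http-01']
```

The point of the sketch is that deploying several management points gives the client a list to work through: if the first entry is unavailable, other candidates remain.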
The same can be said for the distribution point: we can deploy multiple instances to give clients options when deciding which one to use for downloading content. Software Update Points, for example, can be added to an NLB cluster, which must be configured using PowerShell. With newer versions of Configuration Manager, however, multiple Software Update Points can also exist in the same hierarchy without the need for an NLB cluster.
How much fault tolerance we build into the site system roles depends on the requirements of the design and on how important Configuration Manager is to the recovery of a data center.