The following steps will guide you through the various high availability options you have when designing a SCOM infrastructure.
In previous versions of SCOM, high availability was achieved through clustering, owing to the reliance on a component known as the Root Management Server (RMS). Starting with the 2012 release and carrying forward to the R2 release, the RMS role was removed, so clustering is no longer required.
High availability in SCOM is now achieved by deploying multiple management servers and grouping these into resource pools.
It is recommended, even in a simple deployment, to always deploy a minimum of two management servers, as this will provide failover of monitoring and access while simplifying the maintenance of your SCOM infrastructure.
As you scale out your infrastructure for larger monitoring deployments, you need to consider adding management servers that can be allocated to dedicated resource pools used for specific areas of monitoring.
The resource pools typically seen in implementations are All Management Servers, UNIX/Linux servers, and network monitoring.
By default, all management servers are added to the All Management Servers resource pool. Once you have correctly sized your environment, if you need to dedicate management servers to monitoring, say, network devices, ensure that you add these servers to a dedicated resource pool and remove them from the All Management Servers resource pool.
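As a minimal sketch of this using the Operations Manager PowerShell module (the pool name and server FQDNs that follow are placeholders, not values from your environment):

# Create a dedicated resource pool for network monitoring
$Members = Get-SCOMManagementServer -Name "PONSCOMMS03.contoso.com","PONSCOMMS04.contoso.com"
New-SCOMResourcePool -DisplayName "Network Monitoring Resource Pool" -Member $Members

# Switch the All Management Servers resource pool to manual membership
# so that the dedicated servers can then be removed from it
Get-SCOMResourcePool -DisplayName "All Management Servers Resource Pool" | Set-SCOMResourcePool -EnableAutomaticMembership 0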
You must be aware that at least 50 percent of the servers in the All Management Servers resource pool need to be running in order for SCOM to fully function, so you should always have, at a minimum, two servers remaining in this pool.
Basic implementations will see consoles connecting to a specifically named management server. If this server is offline, another server must be specified during the connection.
To provide high availability for console connections, and to employ a more seamless connection method, Network Load Balancing can be implemented across the management servers that provide console access (or even SDK access for systems such as the System Center 2012 R2 Service Manager connectors). Consoles then connect to the DNS name allocated to the virtual IP address instead of to a specific management server.
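As an illustrative sketch only (the cluster name, virtual IP, node name, and interface name are all assumptions), Windows NLB could be configured with PowerShell along these lines:

# On the first management server: create the NLB cluster; the DNS name
# (scom.contoso.com) is registered against the virtual IP for consoles to use
New-NlbCluster -InterfaceName "Ethernet" -ClusterName "scom.contoso.com" -ClusterPrimaryIP 192.168.1.50

# Add the second management server to the cluster
Get-NlbCluster | Add-NlbClusterNode -NewNodeName "PONSCOMMS02" -NewNodeInterface "Ethernet"

# Replace the default all-ports rule with one for the SDK port (TCP 5724)
Get-NlbCluster | Get-NlbClusterPortRule | Remove-NlbClusterPortRule -Force
Get-NlbCluster | Add-NlbClusterPortRule -StartPort 5724 -EndPort 5724 -Protocol Tcp -Affinity Single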
Operational and Data Warehouse SQL databases
Since SCOM relies on its SQL databases in order to function, these should, at a minimum, be made highly available.
Normal SQL high availability scenarios apply here, with the use of either standard failover clustering or the newer SQL Server 2012 AlwaysOn availability groups.
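As a brief sketch using the SQLPS module (the instance name PONSQL01 and availability group name SCOMAG are assumptions), the Operational database could be added to an existing availability group as follows; the database must already be in the full recovery model with a full backup taken:

# Add the SCOM Operational database to an existing availability group
Import-Module SQLPS -DisableNameChecking
Add-SqlAvailabilityDatabase -Path "SQLSERVER:\SQL\PONSQL01\DEFAULT\AvailabilityGroups\SCOMAG" -Database "OperationsManager"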
SCOM uses SQL Server Reporting Services (SSRS) as its reporting mechanism, and this role itself cannot be made highly available. The underlying database for SSRS can, however, be made highly available by utilizing either a traditional SQL cluster or SQL Server 2012 AlwaysOn. Nevertheless, it is possible to quickly restore the reporting component as long as the main SQL and SCOM components are still intact.
Audit Collection Services
In a default deployment, ACS will usually be installed with a single ACS Collector and ACS Database pair. You can then implement multiple ACS Forwarders that point to this collector, but if the collector goes offline, the Security Event Log on the forwarder will effectively become a queue for the backlog until it can reconnect to a collector.
Using this configuration has the benefit of simplicity, and if the original ACS Collector can be brought back online within the Event Retention Period, or the ACSConfig.xml file restored to a new ACS Collector, then potentially there would be no loss or duplication of data.
Tip
ACS Collectors use a configuration file named ACSConfig.xml, which is stored in %systemroot%\System32\Security\AdtServer.
This configuration file, which is updated every 5 minutes, keeps track of each forwarder communicating with the collector and a sequence number corresponding to the EventRecordID. This allows the collector to be aware of which events have been collected.
Using this simple configuration, however, does leave open the possibility of loss of data (Security Events not captured) if the ACS Collector is offline for longer than the retention period (the default is 72 hours), or duplication of data in the database if the original ACSConfig.xml file is not restored.
Another option would be to implement multiple ACS Collector/ACS Database pairs. This would allow you to specify a failover ACS Collector when deploying an ACS Forwarder and would provide you with automatic failover in case of an outage.
However, while this does provide automatic failover, it is important to note that each ACS Collector/ACS Database pair is independent; after a failover, Security Event data would be spread across databases, making it harder to query when reporting. It would also mean duplication of data after a failover, as the ACSConfig.xml file on the failover ACS Collector would not be aware of the EventRecordID sequence that the original ACS Collector had reached.
A third option is to deploy a second ACS Collector that points to the same ACS Database as the primary, but with its collector service stopped, and then use a monitor with a recovery task to start that service automatically if the primary collector goes offline. This will provide automatic failover with no data loss and minimal data duplication while maintaining a single database for ease of reporting.
The SCOM web console can be made highly available and scalable by utilizing the normal Network Load Balancing technique used with most IIS websites.
This could either be through the use of dedicated hardware-based network load balancers or the built-in Windows Network Load Balancing role.
Agents don't specifically have, or need, high availability at the client level, as that would defeat the objective of monitoring: being able to see that the server went offline even if the agent was able to continue working. You can, however, implement multihoming, which allows the client to communicate with up to four management groups per agent. This is ideal for obtaining data from live systems in both test and production SCOM environments.
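As a hedged sketch of the agent-side configuration (the management group name and server FQDN are placeholders), multihoming can be scripted against the agent's local COM configuration interface:

# Run locally on the agent: add a second management group so that the
# agent reports to both production and test environments
$Config = New-Object -ComObject "AgentConfigManager.MgmtSvcCfg"
$Config.AddManagementGroup("TESTMG", "PONSCOMTST01.contoso.com", 5723)

# List the management groups the agent now reports to
$Config.GetManagementGroups() | Format-Table ManagementGroupName, ManagementServer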
Multiple gateway servers can be deployed and agents pointed at both a primary server and multiple failover gateway servers to provide high availability.
To do this, use PowerShell to designate the primary and failover server(s), and then set the agent configuration using the Set-SCOMParentManagementServer command with the -Agent switch.
For example, to set the Failover Management Server (PONSCOMGW02) on the Agent (PONDC01), commands along the following lines could be used (the FQDNs shown are assumptions):
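# Retrieve the agent and the gateway server that will act as its failover
$Agent = Get-SCOMAgent -DNSHostName "PONDC01.contoso.com"
$Failover = Get-SCOMGatewayManagementServer -Name "PONSCOMGW02.contoso.com"

# Assign the gateway as the agent's failover server
Set-SCOMParentManagementServer -Agent $Agent -FailoverServer $Failover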
The gateway servers themselves can also be pointed at multiple management servers as both primary and failovers. This technique uses the same PowerShell command, but with the -Gateway switch.
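For example (the management server names below are placeholders), the gateway could be pointed at a primary and a failover management server as follows:

# Retrieve the gateway and the management servers it should report to
$Gateway = Get-SCOMGatewayManagementServer -Name "PONSCOMGW02.contoso.com"
$Primary = Get-SCOMManagementServer -Name "PONSCOMMS01.contoso.com"
$Failover = Get-SCOMManagementServer -Name "PONSCOMMS02.contoso.com"

# Primary and failover are assigned in separate calls, as they belong
# to different parameter sets of the cmdlet
Set-SCOMParentManagementServer -Gateway $Gateway -PrimaryServer $Primary
Set-SCOMParentManagementServer -Gateway $Gateway -FailoverServer $Failover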
At the base layer, all the relevant databases for SCOM should reside on a highly available SQL instance. This may either be a traditional cluster or one utilizing the AlwaysOn features of SQL Server 2012.
It would make sense, even for a very small deployment, to have at least two management servers, as this allows for easier maintenance with reduced downtime, and you can then scale from there.