Root cause analysis
Once the problem has been correctly understood and a fix has been applied and tested, the next step is to determine the root cause and apply corrective actions if needed.
Root Cause Analysis and Corrective Actions (RCCA) is the final step in troubleshooting a problem. It involves determining the root cause of the issue and recommending actions that can be implemented to prevent the underlying issue from recurring.
Most of the problems encountered in the Citrix world can be grouped into three categories:
- Performance issues, for example, applications are slow to start, the network is unreliable, and so on
- Incorrect configuration, for example, XenApp is not properly configured during the initial installation or a subsequent change
- Broken code leading to unexpected behavior from XenApp or underlying components; these are the trickiest to debug and probably the least frequently encountered
Most root cause analyses reveal either a performance issue or an incorrect configuration.
Where a root cause is deemed to be performance related, tackling it usually requires improvements in the infrastructure: more bandwidth, more servers, faster disks, and so on. The real challenge is determining how much to scale the infrastructure so that performance falls back within acceptable parameters without spending a large amount of money.
Preventive steps for these types of problems could be:
- Ensuring a capacity management process is in place
- Monitoring Citrix infrastructure for active usage
- Creating an easily scalable Citrix architecture
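As a rough illustration of the sizing question above, the following sketch estimates how many session hosts a given peak load would need. It is a minimal example under stated assumptions, not a Citrix sizing method: the function name `servers_needed`, the 20 percent headroom, the single spare server, and all the figures in the usage lines are hypothetical and should be replaced with values measured through monitoring.

```python
import math


def servers_needed(peak_sessions: int, sessions_per_server: int,
                   headroom: float = 0.2, spare_servers: int = 1) -> int:
    """Estimate the number of session hosts needed for a given peak load.

    All parameters are illustrative assumptions; measure real values
    from your own monitoring data.
    """
    if sessions_per_server <= 0:
        raise ValueError("sessions_per_server must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in the range [0, 1)")
    # Plan to load each server only up to (1 - headroom) of its measured capacity.
    usable_sessions = sessions_per_server * (1.0 - headroom)
    base = math.ceil(peak_sessions / usable_sessions)
    # Add spare servers so that a host failure does not overload the rest.
    return base + spare_servers


# Hypothetical figures: 450 peak sessions, 40 sessions per server, 10 servers today.
current_servers = 10
required = servers_needed(peak_sessions=450, sessions_per_server=40)
print(f"Required: {required}, additional: {max(0, required - current_servers)}")
# prints "Required: 16, additional: 6"
```

Keeping the headroom and spare count as explicit parameters makes the trade-off between cost and performance margin visible, which is the decision the capacity management process has to make.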
Incorrect configurations are usually self-evident: if an administrator makes a change that negatively affects the Citrix infrastructure, the effect is usually visible almost immediately. The root cause analysis, therefore, focuses on the following questions:
- Has the change management process been followed?
- Have the risks been properly established and highlighted?
- Have actions been considered to minimize the risks?
- Is there a backup plan in place in case a rollback is needed?
- What is the impact of a failed change and how will it affect users or production environments?
Changes are perfectly acceptable when the risks have been appropriately highlighted ("Changing X setting has the risk of bringing down the Citrix site for 15 minutes"), the change is performed out of hours (minimizing the risks), and a proper rollback plan is in place.
Most changes have the potential to cause downtime, but if the proper change management process is followed, the risks are minimized and the length of any outage is reduced.
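The review questions above can be captured as a simple checklist. The sketch below is one possible way to record the answers for a change and list whatever is still unresolved; the `ChangeReview` class, its field names, and the sample values are hypothetical and do not correspond to any Citrix or ITSM tool.

```python
from dataclasses import dataclass


@dataclass
class ChangeReview:
    """Answers to the review questions for one change (hypothetical structure)."""
    summary: str
    process_followed: bool
    risks_highlighted: bool
    mitigations_considered: bool
    rollback_plan: bool
    impact_assessed: bool


def open_questions(review: ChangeReview) -> list[str]:
    """Return the review questions that were not answered with a yes."""
    checks = [
        (review.process_followed,
         "Has the change management process been followed?"),
        (review.risks_highlighted,
         "Have the risks been properly established and highlighted?"),
        (review.mitigations_considered,
         "Have actions been considered to minimize the risks?"),
        (review.rollback_plan,
         "Is there a backup plan in place in case a rollback is needed?"),
        (review.impact_assessed,
         "What is the impact of a failed change and how will it affect users?"),
    ]
    return [question for ok, question in checks if not ok]


# Hypothetical change with no rollback plan recorded.
review = ChangeReview(
    summary="Changing X setting",
    process_followed=True,
    risks_highlighted=True,
    mitigations_considered=True,
    rollback_plan=False,
    impact_assessed=True,
)
for question in open_questions(review):
    print("Unresolved:", question)
```

An empty result suggests the change followed the process; any remaining question points at a gap the root cause analysis should record.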
Preventive steps for this type of problem could be:
- Ensuring the risks have been correctly identified and presented to the business
- Ensuring steps to minimize the risks have been identified
- Ensuring there is a clear backup plan in place
Finally, during troubleshooting, a number of changes might need to be made before the final fix is found. It is, therefore, a good idea to keep track of these changes while troubleshooting is still in progress.
Once the correct fix has been identified, a retroactive change request should be logged in the IT system. Although in this instance the change has not followed the standard change management approval process, it is still useful to have it logged in case it needs to be looked up in the future, for example when troubleshooting the effects of previous changes.
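One lightweight way to keep track of changes made during troubleshooting is a small in-memory log that can later be turned into the text of a retroactive change request. The sketch below assumes nothing about the actual IT system: the `TroubleshootingLog` class, the incident number `INC-0001`, the server name, and the recorded changes are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeEntry:
    """One change made while troubleshooting."""
    timestamp: datetime
    target: str
    description: str
    kept: bool  # False if the change was reverted afterwards


@dataclass
class TroubleshootingLog:
    """Changes recorded for a single incident (hypothetical structure)."""
    incident: str
    entries: list[ChangeEntry] = field(default_factory=list)

    def record(self, target: str, description: str, kept: bool = True) -> None:
        """Append a change with the current UTC time."""
        self.entries.append(
            ChangeEntry(datetime.now(timezone.utc), target, description, kept)
        )

    def retroactive_request(self) -> str:
        """Build the text for a retroactive change request."""
        lines = [f"Retroactive change request for incident {self.incident}"]
        for entry in self.entries:
            status = "kept" if entry.kept else "reverted"
            stamp = entry.timestamp.strftime("%Y-%m-%d %H:%M UTC")
            lines.append(
                f"- {stamp} | {entry.target} | {entry.description} | {status}"
            )
        return "\n".join(lines)


# Hypothetical incident and changes, for illustration only.
log = TroubleshootingLog(incident="INC-0001")
log.record("XA-SRV-01", "Restarted the print spooler service", kept=False)
log.record("XA-SRV-01", "Changed X setting", kept=True)
print(log.retroactive_request())
```

Recording reverted changes as well as the final fix keeps the history complete, so a later investigation can see what was tried and ruled out.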