vRealize Operations Manager helps manage the health, efficiency, and compliance of virtualized environments. This tutorial covers the latest troubleshooting techniques for vRealize Operations. Let's see what we can do to monitor and troubleshoot the key vRealize Operations components and services.
This is an excerpt from Mastering vRealize Operations Manager - Second Edition written by Spas Kaloferov, Scott Norris, and Christopher Slater.
Let's first take a look into some of the most important vRealize Operations services.
vRealize Operations internally uses the Apache2 web server to host the Admin UI, the Product UI, and the Suite API. Here are some useful service log locations:
Log file name | Location | Purpose
access_log & error_log | storage/log/var/log/apache2 | Stores log information for the Apache service
Watchdog is a vRealize Operations service that maintains the necessary daemons/services and attempts to restart them as necessary should there be any failure. The vcops-watchdog is a Python script that runs every 5 minutes by means of the vcops-watchdog-daemon with the purpose of monitoring the various vRealize Operations services, including CaSA.
The Watchdog service performs the following checks:
Here are some useful service log locations:
Log file name | Location | Purpose
vcops-watchdog.log | /usr/lib/vmware-vcops/user/log/vcops-watchdog/ | Stores log information for the Watchdog service
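Because the watchdog script runs every 5 minutes, one quick sanity check is whether its log has been written to recently. The following is a minimal sketch, not a VMware-provided tool. It assumes that the watchdog appends to vcops-watchdog.log on each run (the text does not confirm this) and that find supports -mmin. The function name log_is_fresh and the 10-minute window are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE was modified within the last MINUTES minutes.
# Assumes find(1) supports -mmin (GNU and BSD find do; it is not in POSIX).
log_is_fresh() {
    file=$1
    minutes=$2
    [ -f "$file" ] || return 1
    # find prints the path only when the modification time is recent enough.
    [ -n "$(find "$file" -mmin "-$minutes" 2>/dev/null)" ]
}

# Usage on a vRealize Operations node (log location as listed above):
#   log_is_fresh /usr/lib/vmware-vcops/user/log/vcops-watchdog/vcops-watchdog.log 10 \
#       || echo "watchdog log has not been updated in the last 10 minutes"
```

A stale log does not prove that the watchdog has stopped, but it is a reasonable cue to look at the service more closely.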
The Collector service sends a heartbeat to the controller every 30 seconds. By default, the Collector will wait for 30 minutes for adapters to synchronize.
The collector properties, including enabling or disabling Self Protection, can be configured in the collector.properties file located in /usr/lib/vmware-vcops/user/conf/collector.
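Before changing a value in collector.properties, it can help to read the current setting from the command line. The sketch below is a generic reader for simple key=value property files. The function name get_prop is illustrative, and the text does not name any property keys, so the usage line keeps a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a Java-style .properties FILE.
# Handles simple "key=value" lines only; comment lines (#) are skipped, and
# line continuations or ":" separators are not supported.
get_prop() {
    key=$1
    file=$2
    awk -v k="$key" '
        /^[ \t]*#/ { next }              # skip comment lines
        {
            pos = index($0, "=")
            if (pos == 0) next           # not a key=value line
            name = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == k) {
                val = substr($0, pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
                found = 1
                exit
            }
        }
        END { exit found ? 0 : 1 }       # non-zero status when KEY is absent
    ' "$file"
}

# Usage (replace <property_name> with a key that exists in your file):
#   get_prop <property_name> /usr/lib/vmware-vcops/user/conf/collector/collector.properties
```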
Here are some useful service log locations:
Log file name | Location | Purpose
collector.log | /storage/log/vcops/log/ | Stores log information for the Collector service
The Controller service is part of the analytics engine. The controller decides where new objects should reside, and it manages the storage and retrieval of the inventory of the objects within the system.
The Controller service has the following uses:
Here are some useful service file locations:
File name | Location | Purpose
controller.properties | /usr/lib/vmware-vcops/user/conf/controller/ | Stores properties information for the Controller service
As we learned, vRealize Operations contains quite a few databases, all of which are of great importance for the function of the product. Let’s take a deeper look into those databases.
Currently, Cassandra DB stores the following information:
Cassandra stores all the information which we see in the content folder; basically, any settings which are applied globally.
You can log in to the Cassandra database from any analytic node. The information is the same across nodes.
There are two ways to connect to the Cassandra database:
To connect to the DB, run the following from the command line (SSH):
$VMWARE_PYTHON_BIN $ALIVE_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $ALIVE_BASE/user/conf/cassandra/cqlshrc
Once you are connected to the DB, switch to the globalpersistence keyspace using the following command:
vcops_user@cqlsh> use globalpersistence ;
Once you are logged on to the Cassandra DB, you can run the following commands to see information:
Command syntax | Purpose
Describe tables | To list all the relations (tables) in the current keyspace
Describe <table_name> | To show the schema (column definitions) of that particular table
Exit | To exit the Cassandra command line
select command | To select any column data from a table
delete command | To delete any column data from a table
Some of the important tables in Cassandra are:
Table name | Purpose
activity_2_tbl | Stores all the activities
tbl_2a8b303a3ed03a4ebae2700cbfae90bf | Stores the shard mapping information of an object (the table name may differ in each environment)
supermetric | Stores the defined super metrics
policy | Stores all the defined policies
Auth | Stores all the user details in the cluster
global_settings | Stores all the configured global settings
namespace_to_classtype | Indicates what type of data is stored in which table under Cassandra
symptomproblemdefinition | Stores all the defined symptoms
certificates | Stores all the adapter and data source certificates
The Cassandra database has the following configuration files:
File type | Location
Cassandra.yaml | /usr/lib/vmware-vcops/user/conf/cassandra/
vcops_cassandra.properties | /usr/lib/vmware-vcops/user/conf/cassandra/
Cassandra conf scripts | /usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/
The Cassandra.yaml file stores certain information such as the default location to save data (/storage/db/vcops/cassandra/data). The file contains information about all the nodes. When a new node joins the cluster, it refers to this file to make sure it contacts the right node (master node). It also has all the SSL certificate information.
Cassandra is started and stopped via the CaSA service, but just because CaSA is running does not mean that the Cassandra service is necessarily running.
The service command to check the status of the Cassandra DB service from the command line (SSH) is:
service vmware-vcops status cassandra
The Cassandra cassandraservice.sh script is located in:
$VCOPS_BASE/cassandra/bin/
Here are some useful Cassandra DB log locations:
Log file name | Location | Purpose
System.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores all the Cassandra-related activities
watchdog_monitor.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores Cassandra Watchdog logs
wrapper.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores Cassandra start and stop-related logs
configure.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores logs related to Python scripts of vRops
initial_cluster_setup.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores logs related to Python scripts of vRops
validate_cluster.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores logs related to Python scripts of vRops
To enable debug logging for the Cassandra database, edit the logback.xml file located in:
/usr/lib/vmware-vcops/user/conf/cassandra/
Change <root level="INFO"> to <root level="DEBUG">.
The System.log will now show the debug logs for Cassandra.
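The logback.xml edit described above can also be scripted, which makes it easier to switch back once troubleshooting is done. This is a minimal sketch under a few assumptions: the file contains a single <root level="..."> element as the text implies, and the function name set_root_level and the .bak backup suffix are illustrative choices rather than part of the product.

```shell
#!/bin/sh
# Hypothetical helper: set the <root level="..."> attribute in a logback.xml FILE.
# Keeps a copy of the original as FILE.bak before rewriting.
set_root_level() {
    file=$1
    level=$2
    case $level in
        TRACE|DEBUG|INFO|WARN|ERROR) ;;
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    cp -p "$file" "$file.bak" || return 1
    # Rewrite only the level attribute of the <root> element.
    sed "s/<root level=\"[A-Za-z]*\"/<root level=\"$level\"/" "$file.bak" > "$file"
}

# Usage on a vRealize Operations node:
#   set_root_level /usr/lib/vmware-vcops/user/conf/cassandra/logback.xml DEBUG
#   ... reproduce the issue, then restore the default:
#   set_root_level /usr/lib/vmware-vcops/user/conf/cassandra/logback.xml INFO
```

Debug logging is verbose, so returning to INFO after collecting the logs keeps the log directory from filling up.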
If you experience any of the following issues, this may be an indicator that the database has performance issues, and so you may need to take the appropriate steps to resolve them:
Cassandra can tolerate 5 ms of latency at any given point of time.
You can validate the cluster by running the following command from the command line (SSH). First, navigate to the following folder:
/usr/lib/vmware-vcopssuite/utilities/
Run the following command to validate the cluster:
$VMWARE_PYTHON_BIN -m vmware.vcops.cassandra.validate_cluster <IP_ADDRESS_1 IP_ADDRESS_2>
You can also use nodetool to perform a health check, and possibly resolve database load issues.
The command to check the load status of the activity tables in vRealize Operations version 6.3 and older is as follows:
$VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status
For the 6.5 release and newer, VMware added the requirement of using a 'maintenanceAdmin' user along with a password file. The new command is as follows:
$VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
Regardless of which method you choose to perform the health check, if any of the nodes have over 600 MB of load, you should consult with VMware Global Support Services on the next steps to take, and how to alleviate the load issues.
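The 600 MB threshold can be checked automatically by filtering the nodetool status output. The sketch below assumes the usual nodetool status layout, where each node row starts with a two-letter status code (such as UN) followed by the address, the load value, and the load unit. The function name nodes_over_load is illustrative, and the unit handling is an assumption that should be verified against the output on your own nodes.

```shell
#!/bin/sh
# Hypothetical helper: read `nodetool status` output on stdin and print the
# address and load of every node whose load exceeds LIMIT_MB (default 600).
# Assumes the usual layout: "UN  <address>  <load> <unit>  ..." per node.
nodes_over_load() {
    limit_mb=${1:-600}
    awk -v limit="$limit_mb" '
        # Node rows start with a two-letter status/state code such as UN or DN.
        # Rows without a numeric load (for example "?") are skipped.
        $1 ~ /^[UD][NLJM]$/ && $3 ~ /^[0-9.]+$/ {
            mb = -1
            if ($4 == "bytes")       mb = $3 / (1024 * 1024)
            else if ($4 ~ /^Ki?B$/)  mb = $3 / 1024
            else if ($4 ~ /^Mi?B$/)  mb = $3
            else if ($4 ~ /^Gi?B$/)  mb = $3 * 1024
            else if ($4 ~ /^Ti?B$/)  mb = $3 * 1024 * 1024
            if (mb > limit) printf "%s %s %s\n", $2, $3, $4
        }
    '
}

# Usage (6.5 and newer, using the command shown above):
#   $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl \
#       -u maintenanceAdmin \
#       --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status \
#       | nodes_over_load 600
```

No output means that no node was found above the limit. Any line printed identifies a node worth raising with support.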
The Postgres database was introduced in 6.1. It has two instances in version 6.6. The Central Postgres DB, also called repl, and the Alerts/HIS Postgres DB, also called data, are two separate database instances under the database called vcopsdb.
The central DB exists only on the master and the master replica nodes when HA is enabled. It is accessible via port 5433 and it is located in /storage/db/vcops/vpostgres/repl.
Currently, the database stores the Resources inventory.
You can connect to the central DB from the command line (SSH). Log in on the analytic node you wish to connect to and run:
su - postgres
The command should not prompt for a password if run as root.
Once logged in, connect to the database instance by running:
/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433
The service command to start the central DB from the command line (SSH) is:
service vpostgres-repl start
Here are some useful log locations:
Log file name | Location | Purpose
postgresql-<xx>.log | /storage/db/vcops/vpostgres/data/pg_log | Provides information on Postgres database cleanup and other disk-related information
The Alerts DB, called data, exists on all the data nodes, including the master and master replica nodes.
It was also introduced in 6.1. Starting from 6.2, the Historical Inventory Service xDB was merged with the Alerts DB. It is accessible via port 5432 and it is located in /storage/db/vcops/vpostgres/data.
Currently, the database stores:
You can connect to the Alerts DB from the command line (SSH). Log in on the analytic node you wish to connect to and run:
su - postgres
The command should not prompt for a password if run as root.
Once logged in, connect to the database instance by running:
/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5432
The service command to start the Alerts DB from the command line (SSH) is:
service vpostgres start
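Since the two Postgres instances differ only in their port and service name, a small lookup helper can reduce the chance of connecting to the wrong one. The sketch below uses only the values given in the text (repl on port 5433 with the vpostgres-repl service, data on port 5432 with the vpostgres service). The function names pg_instance_info and pg_connect are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: map a vPostgres instance name to its port and service,
# using the values given in the text (repl: 5433, data: 5432).
pg_instance_info() {
    case $1 in
        repl) echo "5433 vpostgres-repl" ;;   # Central DB
        data) echo "5432 vpostgres" ;;        # Alerts/HIS DB
        *)    echo "unknown instance: $1 (expected repl or data)" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper: open psql against the chosen instance.
# Run it as the postgres user (su - postgres), as described above.
pg_connect() {
    info=$(pg_instance_info "$1") || return 1
    port=${info%% *}                          # first field is the port
    /opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p "$port"
}

# Usage:
#   pg_connect repl    # Central DB, only on the master and master replica nodes
#   pg_connect data    # Alerts/HIS DB
```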
The File System Database (FSDB) contains all raw time series metrics and super metrics data for the discovered resources.
If you are troubleshooting FSDB performance issues, you should start from the Self Performance Details dashboard, more precisely, the FSDB Data Processing widget. We covered both of these earlier in this chapter.
You can also take a look at the metrics provided by the FSDB Metric Picker:
You can access it by navigating to the Environment tab, vRealize Operations Clusters, selecting a node, and selecting vRealize Operations Manager FSDB. Then, select the All Metrics tab.
You can check the synchronization state of the FSDB to determine the overall health of the cluster by running the following command from the command line (SSH):
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo
By restarting the FSDB, you can trigger synchronization of all the data by getting missing data from other FSDBs. Synchronization takes place only when you have vRealize Operations HA configured.
Here are some useful log locations for FSDB:
Log file name | Location | Purpose
Analytics-<UUID>.log | /usr/lib/vmware-vcops/user/log | Used to check the Sharding Module functionality; can trace which objects have synced
ShardingManager_<UUID>.log | /usr/lib/vmware-vcops/user/log | Can be used to get the total time the sync took
fsdb-accessor-<UUID>.log | /usr/lib/vmware-vcops/user/log | Provides information on FSDB database cleanup and other disk-related information
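Because these log file names include a UUID that differs per node, it can be convenient to locate them by pattern rather than by exact name. The following sketch lists the FSDB-related logs in a directory using the name patterns listed above. The function name fsdb_logs is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: list the FSDB-related log files in DIR
# (defaults to the log directory given in the text).
fsdb_logs() {
    dir=${1:-/usr/lib/vmware-vcops/user/log}
    for f in "$dir"/Analytics-*.log "$dir"/ShardingManager_*.log "$dir"/fsdb-accessor-*.log; do
        # Unmatched patterns stay literal, so keep only paths that exist.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# Usage: show the last 20 lines of each FSDB-related log.
#   fsdb_logs | while read -r f; do echo "== $f"; tail -n 20 "$f"; done
```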
Platform-cli is a tool for retrieving information from various databases, including the GemFire cache, Cassandra, and the Alerts/HIS persistence databases.
To run this Python script, use the following command:
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py
The following example lists all the resources in ascending order and shows which shard each one is stored on:
$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getResourceToShardMapping
We discussed how to troubleshoot some of the most important components like services and databases along with troubleshooting failures in the upgrade process. To know more about self-monitoring dashboards and infrastructure compliance, check out this book Mastering vRealize Operations Manager - Second Edition.