Troubleshooting techniques in vRealize Operations components

vRealize Operations Manager helps in managing the health, efficiency, and compliance of virtualized environments. This tutorial covers troubleshooting techniques for vRealize Operations: how to monitor and troubleshoot its key components and services.

This is an excerpt from Mastering vRealize Operations Manager - Second Edition written by Spas Kaloferov, Scott Norris, and Christopher Slater.

Troubleshooting Services

Let's first take a look at some of the most important vRealize Operations services.

The Apache2 service

vRealize Operations internally uses the Apache2 web server to host Admin UI, Product UI, and Suite API. Here are some useful service log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| access_log & error_log | /storage/log/var/log/apache2 | Stores log information for the Apache service |

The Watchdog service

Watchdog is a vRealize Operations service that monitors the necessary daemons/services and attempts to restart them if they fail. vcops-watchdog is a Python script that the vcops-watchdog-daemon runs every 5 minutes to monitor the various vRealize Operations services, including CaSA.

The Watchdog service performs the following checks:

PID file of the service

Service status

Here are some useful service log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| vcops-watchdog.log | /usr/lib/vmware-vcops/user/log/vcops-watchdog/ | Stores log information for the Watchdog service |
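To make the PID file check listed above more concrete, here is a minimal sketch of how such a check can be written in shell. It is not the actual vcops-watchdog implementation (which is a Python script); the function name pid_file_alive and any PID file path passed to it are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only (not part of vRealize Operations).
# Returns 0 if the PID file exists and the process it names is alive.
pid_file_alive() {
    pid_file="$1"
    # The PID file must exist and be readable.
    [ -r "$pid_file" ] || return 1
    pid=$(head -n 1 "$pid_file" | tr -d '[:space:]')
    # Reject empty or non-numeric content.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Signal 0 tests whether the process exists without sending a signal.
    # Note: this also fails if the caller lacks permission to signal the process.
    kill -0 "$pid" 2>/dev/null
}
```

A stale PID file (the file exists but the process is gone) makes the function return a non-zero status, which is the situation in which a watchdog would typically attempt a restart.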

The Collector service

The Collector service sends a heartbeat to the controller every 30 seconds. By default, the Collector will wait for 30 minutes for adapters to synchronize.

The collector properties, including enabling or disabling Self Protection, can be configured in the collector.properties file located in /usr/lib/vmware-vcops/user/conf/collector.
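Before changing a properties file, it helps to check the current value of a setting. The following sketch reads one key from a key=value style properties file. The function name get_property and the key exampleSetting are hypothetical; the real key names must be taken from your own collector.properties file.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key from a key=value properties file.
# Exits with a non-zero status if the key is not present.
get_property() {
    file="$1"
    key="$2"
    awk -v k="$key" '
        # Skip comment lines starting with # or !
        /^[ \t]*[#!]/ { next }
        {
            pos = index($0, "=")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == k) {
                # Keep the last occurrence of the key.
                value = substr($0, pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                found = 1
            }
        }
        END { if (found) print value; else exit 1 }
    ' "$file"
}
```

For example, get_property /usr/lib/vmware-vcops/user/conf/collector/collector.properties exampleSetting would print the value of the (hypothetical) exampleSetting key, if it exists.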

Here are some useful service log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| collector.log | /storage/log/vcops/log/ | Stores log information for the Collector service |

The Controller service

The Controller service is part of the analytics engine. The controller does the decision making on where new objects should reside. The controller manages the storage and retrieval of the inventory of the objects within the system.

The Controller service is responsible for the following:

Monitoring the collector status every minute

Determining how long a deleted resource remains available in the inventory

Determining how long a non-existing resource is stored in the database

Here are some useful service file locations:

| File name | Location | Purpose |
| --- | --- | --- |
| controller.properties | /usr/lib/vmware-vcops/user/conf/controller/ | Stores properties information for the Controller service |

Databases

As we learned, vRealize Operations contains quite a few databases, all of which are of great importance for the function of the product. Let’s take a deeper look into those databases.

Cassandra DB

Currently, Cassandra DB stores the following information:

User Preferences and Config

Alert definitions

Customizations

Dashboards, Policies, and Views

Reports and Licensing

Shard Maps

Activities

Cassandra stores all the information which we see in the content folder; basically, any settings which are applied globally.

You are able to log into the Cassandra database from any Analytic Node. The information is the same across nodes.

There are two tools for working with the Cassandra database:

cqlsh, a built-in command-line tool for querying the data within Cassandra in a SQL-like fashion

nodetool, a command-line tool that can be used to perform health checks

To connect to the DB with cqlsh, run the following from the command line (SSH):

$VMWARE_PYTHON_BIN $ALIVE_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $ALIVE_BASE/user/conf/cassandra/cqlshrc

Once you are connected to the DB, navigate to the globalpersistence keyspace using the following command:

vcops_user@cqlsh> use globalpersistence ;

Once you are logged on to the Cassandra DB, you can run the following commands to see information:

| Command syntax | Purpose |
| --- | --- |
| Describe tables | To list all the relations (tables) in the current database instance |
| Describe <table_name> | To show the definition (columns and settings) of that particular table |
| Exit | To exit the Cassandra command line |
| select command | To select any column data from a table |
| delete command | To delete any column data from a table |

Some of the important tables in Cassandra are:

| Table name | Purpose |
| --- | --- |
| activity_2_tbl | Stores all the activities |
| tbl_2a8b303a3ed03a4ebae2700cbfae90bf | Stores the shard mapping information of an object (the table name may differ in each environment) |
| supermetric | Stores the defined super metrics |
| policy | Stores all the defined policies |
| Auth | Stores all the user details in the cluster |
| global_settings | Stores all the configured global settings |
| namespace_to_classtype | Records what type of data is stored in which table under Cassandra |
| symptomproblemdefinition | Stores all the defined symptoms |
| certificates | Stores all the adapter and data source certificates |

The Cassandra database has the following configuration files:

| File type | Location |
| --- | --- |
| Cassandra.yaml | /usr/lib/vmware-vcops/user/conf/cassandra/ |
| vcops_cassandra.properties | /usr/lib/vmware-vcops/user/conf/cassandra/ |
| Cassandra conf scripts | /usr/lib/vmware-vcopssuite/utilities/vmware/vcops/cassandra/ |

The Cassandra.yaml file stores certain information such as the default location to save data (/storage/db/vcops/cassandra/data). The file contains information about all the nodes. When a new node joins the cluster, it refers to this file to make sure it contacts the right node (master node). It also has all the SSL certificate information.

Cassandra is started and stopped via the CaSA service, but just because CaSA is running does not mean that the Cassandra service is necessarily running.

The service command to check the status of the Cassandra DB service from the command line (SSH) is:

service vmware-vcops status cassandra

The Cassandra service script, cassandraservice.sh, is located in:

$VCOPS_BASE/cassandra/bin/

Here are some useful Cassandra DB log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| System.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores all the Cassandra-related activities |
| watchdog_monitor.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores Cassandra Watchdog logs |
| wrapper.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores Cassandra start and stop-related logs |
| configure.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores logs related to Python scripts of vRops |
| initial_cluster_setup.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores logs related to Python scripts of vRops |
| validate_cluster.log | /usr/lib/vmware-vcops/user/log/cassandra | Stores logs related to Python scripts of vRops |

To enable debug logging for the Cassandra database, edit the logback.xml file located in:

 /usr/lib/vmware-vcops/user/conf/cassandra/

Change <root level="INFO"> to <root level="DEBUG">.

The System.log will now show the debug logs for Cassandra.
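The root level change described above can also be scripted. The following is a minimal sketch, assuming the file contains a root element written as <root level="...">; the function name set_root_log_level is hypothetical, and the sketch keeps a .bak copy so that the original file can be restored.

```shell
#!/bin/sh
# Hypothetical helper: set the level attribute of the <root> element in a logback.xml file.
set_root_log_level() {
    file="$1"
    level="$2"
    # Keep a backup next to the original before rewriting it.
    cp -p "$file" "$file.bak" || return 1
    # Replace only the level attribute of the <root> element.
    sed "s/<root level=\"[A-Za-z]*\"/<root level=\"$level\"/" "$file.bak" > "$file"
}
```

For example, set_root_log_level /usr/lib/vmware-vcops/user/conf/cassandra/logback.xml DEBUG switches the root logger to DEBUG, and running it again with INFO reverts the change once troubleshooting is finished.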

If you experience any of the following issues, this may be an indicator that the database has performance issues, and so you may need to take the appropriate steps to resolve them:

Relationship changes are not reflected immediately

Deleted objects still show up in the tree

Views and reports are slow to process and take a long time to open

Logging on to the Product UI takes a long time for any user

An alert was supposed to trigger, but it never did

Cassandra can tolerate 5 ms of latency at any given point in time.

You can validate the cluster by running the following command from the command line (SSH). First, navigate to the following folder:

/usr/lib/vmware-vcopssuite/utilities/

Run the following command to validate the cluster:

$VMWARE_PYTHON_BIN -m vmware.vcops.cassandra.validate_cluster <IP_ADDRESS_1 IP_ADDRESS_2>

You can also use the nodetool to perform a health check, and possibly resolve database load issues.

The command to check the load status of the activity tables in vRealize Operations version 6.3 and older is as follows:

$VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool --port 9008 status

For the 6.5 release and newer, VMware added the requirement of using a 'maintenanceAdmin' user along with a password file. The new command is as follows:

$VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status

Regardless of which method you choose to perform the health check, if any of the nodes have over 600 MB of load, you should consult with VMware Global Support Services on the next steps to take and how to alleviate the load issues.
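The 600 MB check can be automated by filtering the nodetool status output. The sketch below assumes the usual layout of that output, in which each node row starts with a two-letter status/state code (such as UN) followed by the address, the load value, and the load unit. The function name nodes_over_load_limit is hypothetical, and the column layout should be verified against the output of your own version.

```shell
#!/bin/sh
# Hypothetical helper: read "nodetool status" output on stdin and print the
# address and load (in MB) of every node whose load exceeds the limit.
nodes_over_load_limit() {
    limit_mb="${1:-600}"
    awk -v limit="$limit_mb" '
        # Node rows start with a status/state code such as UN or DN.
        /^[UD][NLJM][ \t]/ {
            value = $3; unit = $4
            # Skip rows whose load is not reported as a number.
            if (value !~ /^[0-9.]+$/) next
            if (unit == "bytes")  mb = value / 1048576
            else if (unit ~ /^K/) mb = value / 1024
            else if (unit ~ /^M/) mb = value
            else if (unit ~ /^G/) mb = value * 1024
            else if (unit ~ /^T/) mb = value * 1048576
            else next
            if (mb > limit) printf "%s %.1f MB\n", $2, mb
        }
    '
}
```

To use it, pipe either of the nodetool status commands shown above into nodes_over_load_limit 600. Empty output means that no node reported a load above the limit.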

Central (Repl DB)

The Postgres database was introduced in 6.1. It has two instances in version 6.6. The Central Postgres DB, also called repl, and the Alerts/HIS Postgres DB, also called data, are two separate database instances under the database called vcopsdb.

The central DB exists only on the master and the master replica nodes when HA is enabled. It is accessible via port 5433 and it is located in /storage/db/vcops/vpostgres/repl.

Currently, the database stores the Resources inventory.

You can connect to the central DB from the command line (SSH). Log in on the analytic node you wish to connect to and run:

su - postgres

The command should not prompt for a password if run as root.

Once logged in, connect to the database instance by running:

/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433

The service command to start the central DB from the command line (SSH) is:

service vpostgres-repl start

Here are some useful log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| postgresql-<xx>.log | /storage/db/vcops/vpostgres/data/pg_log | Provides information on Postgres database cleanup and other disk-related information |

Alerts/HIS (Data) DB

The Alerts DB, called data, exists on all the data nodes, including the master and master replica nodes.

It was also introduced in 6.1. Starting from 6.2, the Historical Inventory Service xDB was merged with the Alerts DB. It is accessible via port 5432 and it is located in /storage/db/vcops/vpostgres/data.

Currently, the database stores:

Alerts and alarm history

History of resource property data

History of resource relationship

You can connect to the Alerts DB from the command line (SSH). Log in on the analytic node you wish to connect to and run:

su - postgres

The command should not prompt for a password if run as root.

Once logged in, connect to the database instance by running:

/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5432
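The two Postgres instances differ only in the port that psql connects to: 5433 for the Central (repl) DB and 5432 for the Alerts/HIS (data) DB. As a small convenience, the sketch below maps the instance name to the matching psql command line. The function name vrops_psql_command is hypothetical; it only prints the command so that it can be reviewed before being run as the postgres user.

```shell
#!/bin/sh
# Hypothetical helper: print the psql command line for a given vRealize Operations
# Postgres instance ("repl" for the Central DB, "data" for the Alerts/HIS DB).
vrops_psql_command() {
    case "$1" in
        repl) port=5433 ;;   # Central DB
        data) port=5432 ;;   # Alerts/HIS DB
        *) echo "usage: vrops_psql_command repl|data" >&2; return 1 ;;
    esac
    echo "/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p $port"
}
```

For example, vrops_psql_command data prints the command used above to connect to the Alerts DB.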

The service command to start the Alerts DB from the command line (SSH) is:

service vpostgres start

FSDB

The File System Database (FSDB) contains all raw time series metrics and super metrics data for the discovered resources.

What is FSDB in vRealize Operations Manager?

FSDB is a GemFire server and runs inside analytics JVM.

FSDB in vRealize Operations uses the Sharding Manager to distribute data between nodes (new objects). (We will discuss what vRealize Operations cluster nodes are later in this chapter.)

The File System Database is available in all the nodes of a vRops Cluster deployment.

It has its own properties file.

FSDB stores time series data collected by adapters, as well as data that is generated or calculated (system, super, badge, CIQ metrics, and so on) based on analysis of that data.

If you are troubleshooting FSDB performance issues, you should start from the Self Performance Details dashboard, more precisely, the FSDB Data Processing widget. We covered both of these earlier in this chapter.

You can also take a look at the metrics provided by the FSDB Metric Picker:

[Figure: FSDB Metric Picker]

You can access it by navigating to the Environment tab, vRealize Operations Clusters, selecting a node, and selecting vRealize Operations Manager FSDB. Then, select the All Metrics tab.

You can check the synchronization state of the FSDB to determine the overall health of the cluster by running the following command from the command line (SSH):

$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getShardStateMappingInfo

By restarting the FSDB, you can trigger synchronization of all the data by getting missing data from other FSDBs. Synchronization takes place only when you have vRealize Operations HA configured.

Here are some useful log locations for FSDB:

| Log file name | Location | Purpose |
| --- | --- | --- |
| Analytics-<UUID>.log | /usr/lib/vmware-vcops/user/log | Used to check the Sharding Module functionality; can trace which objects have synced |
| ShardingManager_<UUID>.log | /usr/lib/vmware-vcops/user/log | Can be used to get the total time the sync took |
| fsdb-accessor-<UUID>.log | /usr/lib/vmware-vcops/user/log | Provides information on FSDB database cleanup and other disk-related information |

Platform-cli

Platform-cli is a tool by which we can get information from various databases, including the GemFire cache, Cassandra, and the Alerts/HIS persistence databases.

In order to run this Python script, you need to run the following command:

$VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py

The following example of using this command will list all the resources in ascending order and also show you which shard it is stored on:

 $VMWARE_PYTHON_BIN /usr/lib/vmware-vcops/tools/vrops-platform-cli/vrops-platform-cli.py getResourceToShardMapping

We discussed how to troubleshoot some of the most important components like services and databases along with troubleshooting failures in the upgrade process. To know more about self-monitoring dashboards and infrastructure compliance, check out this book Mastering vRealize Operations Manager - Second Edition.
