In this article by Tony Campbell, author of the book Troubleshooting OpenStack, we will cover some of the chronic issues that might be early signs of trouble. This article is more about prevention and aims to help you avoid emergency troubleshooting as much as possible.


Database

Many OpenStack services make heavy use of a database. Production deployments will typically use MySQL or Postgres as the backend database server. As you may have learned, a failing or misconfigured database will quickly lead to trouble in your OpenStack cluster. Database problems can also present more subtle concerns that may grow into huge problems if neglected.

Availability

Your database server can become a single point of failure if it is not deployed in a highly available configuration. OpenStack does not require a highly available database installation, and as a result, many installations skip this step. However, production deployments of OpenStack should ensure that their database can survive the failure of a single database server.

MySQL with Galera Cluster

For installations using the MySQL database engine, there are several options for clustering your installation. One popular method is to use Galera Cluster (http://galeracluster.com/). Galera Cluster for MySQL uses synchronous replication and provides a multi-master cluster, which offers high availability for your OpenStack databases.
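Once a Galera cluster is running, each node reports its view of the cluster through the wsrep_* status variables that Galera adds to SHOW STATUS. The following Python sketch shows one way those values might be checked. It is an illustration rather than a definitive health check: the function name galera_node_problems and the expected_size parameter are hypothetical, and the status dictionary is assumed to have been fetched with whatever MySQL client you already use.

```python
def galera_node_problems(status, expected_size):
    """Return a list of problems found in a node's wsrep_* status values.

    `status` maps variable names to string values, as returned by
    SHOW STATUS LIKE 'wsrep_%'. `expected_size` is the number of nodes
    you expect in the cluster (an assumption about your deployment).
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced with the cluster")
    if int(status.get("wsrep_cluster_size", 0)) < expected_size:
        problems.append("cluster has fewer nodes than expected")
    return problems


# Hypothetical values for a healthy node in a three-node cluster.
healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_ready": "ON",
    "wsrep_local_state_comment": "Synced",
    "wsrep_cluster_size": "3",
}
print(galera_node_problems(healthy, expected_size=3))  # []
```

An empty list means none of the checked variables indicate trouble. A non-empty list can be passed to your alerting system.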

Postgres

Installations that use the Postgres database engine have several options for high availability, load balancing, and replication. These options include block device replication with DRBD, log shipping, Master-Standby replication based on triggers, statement-based replication, and asynchronous multi-master replication. For details, refer to the Postgres High Availability Guide (http://www.postgresql.org/docs/current/static/high-availability.html).

Performance

Database performance can degrade over time. If an administrator does not pay attention to small problems in this area, they can eventually become large problems. A wise administrator will regularly monitor the performance of their database and will constantly be on the lookout for slow queries, high database loads, and other indications of trouble.

MySQL

There are several options for monitoring your MySQL server, some commercial and many others open source. Administrators should evaluate the options available and select a solution that fits their current set of tools and operating environment. There are several performance metrics you will want to monitor.

Show Status

The MySQL SHOW STATUS statement can be executed from the mysql command prompt. The output of this statement is server status information with over 300 variables reported. To narrow down the information, you can use a LIKE clause on the variable name to display only the variables you are interested in. Here is an abbreviated sample of the output returned by SHOW STATUS:

mysql> SHOW STATUS;
+------------------------------------------+-------------+
| Variable_name                            | Value       |
+------------------------------------------+-------------+
| Aborted_clients                          | 29          |
| Aborted_connects                         | 27          |
| Binlog_cache_disk_use                    | 0           |
| Binlog_cache_use                         | 0           |
| Binlog_stmt_cache_disk_use               | 0           |
| Binlog_stmt_cache_use                    | 0           |
| Bytes_received                           | 614         |
| Bytes_sent                               | 33178       |
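If you want to feed these values into a monitoring system, the tabular output can be turned into a dictionary. The Python sketch below is one possible approach, assuming the text was captured from the mysql client, for example after running SHOW STATUS LIKE 'Aborted%'. The function name parse_status_table is hypothetical, and the sample reuses values from the output above.

```python
def parse_status_table(text):
    """Parse the tabular output of the mysql client into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Data rows look like "| Name | Value |"; skip borders and blanks.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and anything that is not a name/value pair.
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue
        values[cells[0]] = cells[1]
    return values


sample = """
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Aborted_clients  | 29    |
| Aborted_connects | 27    |
| Bytes_received   | 614   |
| Bytes_sent       | 33178 |
+------------------+-------+
"""

status = parse_status_table(sample)
print(status["Aborted_connects"])  # 27
```

The values are kept as strings because not every status variable is numeric. Convert the ones you track with int() as needed.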

Mytop

Mytop is a command-line utility inspired by the Linux top command. Mytop retrieves data from the MySQL SHOW PROCESSLIST and SHOW STATUS statements. Data from these statements is refreshed, processed, and displayed in the output of the Mytop command. The Mytop output includes a header, which contains summary data, followed by a thread section.

Mytop header section

Here is an example of the header output from the Mytop command:

MySQL on localhost (5.5.46)          load 1.01 0.85 0.79 4/538 23573 up 5+02:19:24 [14:35:24]
 Queries: 3.9M     qps:    9 Slow:     0.0         Se/In/Up/De(%):    49/00/08/00
 Sorts:     0 qps now:   10 Slow qps: 0.0  Threads:   30 (   1/   4) 40/00/12/00
 Cache Hits: 822.0 Hits/s:  0.0 Hits now:   0.0  Ratio:  0.0% Ratio now:  0.0%
 Key Efficiency: 97.3%  Bps in/out:  1.7k/ 3.1k   Now in/out:  1.0k/ 3.9k

As demonstrated in the preceding output, the header section for the Mytop command includes the following information:

The hostname and MySQL version
The server load
The MySQL server uptime
The total number of queries
The average number of queries
Slow queries
The percentage of Select, Insert, Update, and Delete queries
Queries per second
Threads
Cache hits
Key efficiency

Mytop thread section

The Mytop thread section lists as many threads as it can display. The threads are ordered by the Time column, which displays each thread's idle time:

    Id  User      Host/IP         DB        Time  Cmd    State  Query
    --  ----      -------         --        ----  ---    -----  -----
  3461  neutron   174.143.201.98  neutron   5680  Sleep
  3477  glance    174.143.201.98  glance    1480  Sleep
  3491  nova      174.143.201.98  nova       880  Sleep
  3512  nova      174.143.201.98  nova       281  Sleep
  3487  keystone  174.143.201.98  keystone   280  Sleep
  3489  glance    174.143.201.98  glance     280  Sleep
  3511  keystone  174.143.201.98  keystone   280  Sleep
  3513  neutron   174.143.201.98  neutron    280  Sleep
  3505  keystone  174.143.201.98  keystone   279  Sleep
  3514  keystone  174.143.201.98  keystone   141  Sleep
  ...

The Mytop thread section displays the ID of each thread followed by the user and host. It then displays the database, idle time, command, state, and query. Mytop allows you to keep an eye on the performance of your MySQL database server.
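The same thread data can be examined in a script, since it comes from SHOW PROCESSLIST. The Python sketch below picks out sleeping connections that have been idle longer than a threshold. It assumes the rows have already been fetched as dictionaries keyed by the SHOW PROCESSLIST column names. The function name long_idle_threads and the 1000-second threshold are hypothetical, and the sample rows reuse values from the output above.

```python
def long_idle_threads(rows, max_idle_seconds):
    """Return sleeping threads idle longer than the threshold, longest first."""
    idle = [
        row for row in rows
        if row["Command"] == "Sleep" and row["Time"] > max_idle_seconds
    ]
    return sorted(idle, key=lambda row: row["Time"], reverse=True)


rows = [
    {"Id": 3477, "User": "glance", "db": "glance", "Command": "Sleep", "Time": 1480},
    {"Id": 3461, "User": "neutron", "db": "neutron", "Command": "Sleep", "Time": 5680},
    {"Id": 3514, "User": "keystone", "db": "keystone", "Command": "Sleep", "Time": 141},
]

for row in long_idle_threads(rows, max_idle_seconds=1000):
    print(row["Id"], row["User"], row["Time"])
```

Whether a long idle time is a problem depends on how your OpenStack services pool their database connections, so treat the threshold as something to tune for your environment.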

Percona toolkit

The Percona Toolkit is a very useful set of command-line tools for performing MySQL operations and system tasks. The toolkit can be downloaded from Percona at https://www.percona.com/downloads/percona-toolkit/. The output from these tools can be fed into your monitoring system, allowing you to effectively monitor your MySQL installation.

Postgres

Like MySQL, the Postgres database also has a series of tools that can be used to monitor database performance. In addition to standard Linux troubleshooting tools, such as top and ps, Postgres also offers its own collection of statistics.

The PostgreSQL statistics collector

The statistics collector in Postgres allows you to collect data about server activity. The statistics collected by this tool are varied and may be helpful for troubleshooting or system monitoring. In order to use the statistics collector, you must turn on the functionality in the postgresql.conf file. The settings are commented out by default in the RUNTIME STATISTICS section of the configuration file. Uncomment the lines in the Query/Index Statistics Collector subsection:

#------------------------------------------------------------------------------
# RUNTIME STATISTICS
#------------------------------------------------------------------------------

# - Query/Index Statistics Collector -

track_activities = on
track_counts = on
track_io_timing = off
track_functions = none                 # none, pl, all
track_activity_query_size = 1024       # (change requires restart)
update_process_title = on
stats_temp_directory = 'pg_stat_tmp'

Once the statistics collector is configured, restart the database server or execute a pg_ctl reload for the configuration to take effect. The collected statistics are then available through a series of views named with the prefix “pg_stat”. These views can be queried for relevant statistics about the Postgres database server.
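As one example of what these views offer, the pg_stat_database view reports cumulative blks_hit and blks_read counters for each database, from which a buffer cache hit ratio can be derived. The Python sketch below shows the arithmetic only. The function name cache_hit_ratio and the sample rows are hypothetical, and running the query is left to whatever Postgres client you already use.

```python
# Query against the pg_stat_database view; blks_hit and blks_read are
# cumulative counters maintained by the statistics collector.
CACHE_QUERY = "SELECT datname, blks_hit, blks_read FROM pg_stat_database"


def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from the Postgres buffer cache."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no activity recorded yet
    return blks_hit / total


# Hypothetical rows, shaped like the result of CACHE_QUERY.
rows = [("nova", 9900, 100), ("glance", 0, 0)]
for datname, hit, read in rows:
    print(datname, cache_hit_ratio(hit, read))
```

Note that blks_read counts blocks requested from the operating system, some of which may still be served from the kernel's own cache, so a falling ratio is a hint to investigate rather than proof of disk pressure.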

Database backups

A diligent operator will ensure that a backup of the database for each OpenStack project is created. Since most OpenStack services make heavy use of the database for persisting things such as state and metadata, a corruption or loss of data can render your OpenStack cloud unusable. Current database backups can help rescue you from this fate. MySQL users can use the mysqldump utility to back up all of the OpenStack databases:

mysqldump --opt --all-databases > all_openstack_dbs.sql

Similarly, Postgres users can back up all OpenStack databases with a command similar to the following:

pg_dumpall > all_openstack_dbs.sql

Your cadence for backups will depend on your environment and tolerance for data corruption or loss. You should store these backups in a safe place and occasionally perform test restores from the data to ensure that they work as expected.
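When backups run on a schedule, each dump needs its own file name so that a new backup does not overwrite the previous one. The Python sketch below wraps the mysqldump command shown earlier and writes to a timestamped file. It is only an illustration: the function names, the file name pattern, and the /var/backups/openstack directory are hypothetical, and it assumes mysqldump can authenticate without prompting.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def backup_path(directory, now):
    """Build a timestamped file name so successive dumps do not collide."""
    return Path(directory) / f"all_openstack_dbs-{now:%Y%m%d-%H%M%S}.sql"


def run_mysqldump(directory):
    """Run mysqldump for all databases, writing to a timestamped file."""
    target = backup_path(directory, datetime.now())
    with open(target, "wb") as out:
        # check=True raises CalledProcessError if mysqldump exits non-zero.
        subprocess.run(
            ["mysqldump", "--opt", "--all-databases"], stdout=out, check=True
        )
    return target


# Only the name is computed here; run_mysqldump() needs a MySQL server.
print(backup_path("/var/backups/openstack", datetime(2016, 1, 15, 2, 30, 0)))
```

A failed dump raises an exception but leaves a partial file behind, so a production script would also want to remove or flag that file.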

Monitoring

Monitoring is often your early warning system that something is going wrong in your cluster. Your monitoring system can also be a rich source of information when the time comes to troubleshoot issues with the cluster. There are multiple options available for monitoring OpenStack. Many of your current application monitoring platforms will handle OpenStack just as well as any other Linux system. Regardless of the tool you select for monitoring, there are several parts of OpenStack that you should focus on.

Resource monitoring

OpenStack is typically deployed on a series of Linux servers. Monitoring the resources on those servers is essential. A set-it-and-forget-it attitude is a recipe for disaster. The things you may want to monitor on your host servers include the following:

CPU
Disk
Memory
Log file size
Network I/O
Database
Message broker
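A dedicated monitoring platform will normally collect these metrics for you, but the basic checks are simple. The Python sketch below covers two items from the list above, disk usage and CPU load, using only the standard library. The function names and the thresholds (90 percent disk usage, a load average above the CPU count) are assumptions to adjust for your hosts, and os.getloadavg is only available on Unix-like systems.

```python
import os
import shutil


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_host(disk_limit=90.0, load_limit=None):
    """Return a list of warnings for this host; thresholds are assumptions."""
    warnings = []
    disk = disk_percent_used("/")
    if disk > disk_limit:
        warnings.append(f"disk usage at {disk:.1f}%")
    # Default the load limit to the number of CPUs on the host.
    if load_limit is None:
        load_limit = os.cpu_count() or 1
    load_1min = os.getloadavg()[0]
    if load_1min > load_limit:
        warnings.append(f"1-minute load average at {load_1min:.2f}")
    return warnings


print(check_host())
```

An empty list means both checks passed. The same pattern extends to memory, log file size, and the other items in the list.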

OpenStack quotas

OpenStack operators have the option to set usage quotas for each tenant/project. As an administrator, it is helpful to monitor a project’s usage as it pertains to these quotas. Once users reach a quota, they may not be able to deploy additional resources. Users may misinterpret this as an error in the system and report it to you. By keeping an eye on the quotas, you can proactively warn users as they reach their thresholds, or you can decide to increase the quotas as appropriate. Some of the services have client commands that can be used to retrieve quota statistics. As an example, we demonstrate the nova absolute-limits command here:

nova absolute-limits
+--------------------+------+-------+
| Name               | Used | Max   |
+--------------------+------+-------+
| Cores              | 1    | 20    |
| FloatingIps        | 0    | 10    |
| ImageMeta          | -    | 128   |
| Instances          | 1    | 10    |
| Keypairs           | -    | 100   |
| Personality        | -    | 5     |
| Personality Size   | -    | 10240 |
| RAM                | 512  | 51200 |
| SecurityGroupRules | -    | 20    |
| SecurityGroups     | 1    | 10    |
| Server Meta        | -    | 128   |
| ServerGroupMembers | -    | 10    |
| ServerGroups       | 0    | 10    |
+--------------------+------+-------+

The absolute-limits command in Nova is nice because it displays the project’s current usage alongside the quota maximum, making it easy to notice that a project/tenant is coming close to the limit.
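To warn users before they hit a limit, the Used and Max columns can be compared in a script. The Python sketch below parses text in the table format shown above and flags limits at or above a chosen fraction of the maximum. The function name near_quota and the 80 percent threshold are hypothetical, and the sample values are made up for illustration.

```python
def near_quota(table_text, threshold=0.8):
    """Return (name, used, max) for limits at or above `threshold` of the max."""
    flagged = []
    for line in table_text.splitlines():
        line = line.strip()
        # Data rows start with "|"; skip borders and blank lines.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, used, maximum = cells
        # Skip the header row and limits that report no usage ("-").
        if not (used.isdigit() and maximum.isdigit()):
            continue
        if int(maximum) > 0 and int(used) / int(maximum) >= threshold:
            flagged.append((name, int(used), int(maximum)))
    return flagged


# Hypothetical usage figures in the same layout as the command output.
sample = """
| Name        | Used | Max   |
| Cores       | 18   | 20    |
| FloatingIps | 0    | 10    |
| ImageMeta   | -    | 128   |
| Instances   | 1    | 10    |
"""
print(near_quota(sample))  # [('Cores', 18, 20)]
```

Rows whose values are not plain non-negative integers are skipped, which also ignores any limit reported as a negative number.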

RabbitMQ

RabbitMQ is the default message broker used in OpenStack installations. However, if it is installed as is out of the box, it can become a single point of failure. Administrators should consider clustering RabbitMQ and activating mirrored queues.

Summary

OpenStack is the leading open source software for running private clouds. Its popularity has grown exponentially since it was founded by Rackspace and NASA. The output of this engaged community is staggering, resulting in plenty of new features finding their way into OpenStack with each release. The project is at a size now where no one can truly know the details of each service. When working with such a complex project, it is inevitable that you will run into problems, bugs, errors, issues, and plain old trouble.