In this article by Tony Campbell, author of the book Troubleshooting OpenStack, we will cover some of the chronic issues that might be early signs of trouble. This article is more about prevention and aims to help you avoid emergency troubleshooting as much as possible.
(For more resources related to this topic, see here.)
Many OpenStack services make heavy use of a database. Production deployments will typically use MySQL or Postgres as the backend database server. As you may have learned, a failing or misconfigured database will quickly lead to trouble in your OpenStack cluster. Database problems can also present more subtle concerns that may grow into huge problems if neglected.
This database server can become a single point of failure if your database server is not deployed in a highly available configuration. OpenStack does not require a high availability installation of your database, and as a result, many installations may skip this step. However, production deployments of OpenStack should take care to ensure that their database can survive the failure of a single database server.
For installations using the MySQL database engine, there are several options for clustering your installation. One popular method is to leverage Galera Cluster (http://http://galeracluster.com/). Galera Cluster for MySQL leverages synchronous replication and provides a multi-master cluster, which offers high availability for your OpenStack databases.
Installations that use the Postgres database engine have several options such as high availability, load balancing, and replication. These options include block device replication with DRBD, log shipping, Master-Standby replication based on triggers, statement-based replication, and asynchronous multi-master replication. For details, refer to the Postgres High Availability Guide (http://www.postgresql.org/docs/current/static/high-availability.html).
Database performance is one of those metrics that can degrade over time. For an administrator who does not pay attention to small problems in this area, these can eventually become large problems. A wise administrator will regularly monitor the performance of their database and will constantly be on the lookout for slow queries, high database loads, and other indications of trouble.
There are several options for monitoring your MySQL server, some of which are commercial and many others that are open source. Administrators should evaluate the options available and select a solution that fits their current set of tools and operating environment. There are several performance metrics you will want to monitor.
The MySQL SHOW STATUS statement can be executed from the mysql command prompt. The output of this statement is server status information with over 300 variables reported. To narrow down the information, you can leverage a LIKE clause on the variable_name to display the sections you are interested in. Here is an abbreviated list of the output instances returned by SHOW STATUS:
mysql> SHOW STATUS;
+------------------------------------------+-------------+
| Variable_name | Value |
+------------------------------------------+-------------+
| Aborted_clients | 29 |
| Aborted_connects | 27 |
| Binlog_cache_disk_use | 0 |
| Binlog_cache_use | 0 |
| Binlog_stmt_cache_disk_use | 0 |
| Binlog_stmt_cache_use | 0 |
| Bytes_received | 614 |
| Bytes_sent | 33178 |
Mytop is a command-line utility inspired by the Linux top command. Mytop retrieves data from the MySql SHOW PROCESSLIST command and the SHOW STATUS command. Data from these commands is refreshed, processed, and displayed in the output of the Mytop command. The Mytop output includes a header, which contains summary data followed by a thread section.
Here is an example of the header output from the Mytop command:
MySQL on localhost (5.5.46) load 1.01 0.85 0.79 4/538 23573 up 5+02:19:24 [14:35:24]
Queries: 3.9M qps: 9 Slow: 0.0 Se/In/Up/De(%): 49/00/08/00
Sorts: 0 qps now: 10 Slow qps: 0.0 Threads: 30 ( 1/ 4) 40/00/12/00
Cache Hits: 822.0 Hits/s: 0.0 Hits now: 0.0 Ratio: 0.0%
Ratio now: 0.0%
Key Efficiency: 97.3% Bps in/out: 1.7k/ 3.1k Now in/out: 1.0k/ 3.9k
As demonstrated in the preceding output, the header section for the Mytop command includes the following information:
They Mytop thread section will list as many threads as it can display. The threads are ordered by the Time column, which displays the threads' idle time:
Id User Host/IP DB Time Cmd State Query
-- ---- ------- -- ---- --- ----- ----------
3461 neutron 174.143.201.98 neutron 5680 Sleep
3477 glance 174.143.201.98 glance 1480 Sleep
3491 nova 174.143.201.98 nova 880 Sleep
3512 nova 174.143.201.98 nova 281 Sleep
3487 keystone 174.143.201.98 keystone 280 Sleep
3489 glance 174.143.201.98 glance 280 Sleep
3511 keystone 174.143.201.98 keystone 280 Sleep
3513 neutron 174.143.201.98 neutron 280 Sleep
3505 keystone 174.143.201.98 keystone 279 Sleep
3514 keystone 174.143.201.98 keystone 141 Sleep
...
The Mytop thread section displays the ID of each thread followed by the user and host. Finally, this section will display the database, idle time, and state or command query. Mytop will allow you to keep an eye on the performance of your MySql database server.
The Percona Toolkit is a very useful set of command-line tools for performing MySQL operations and system tasks. The toolkit can be downloaded from Percona at https://www.percona.com/downloads/percona-toolkit/. The output from these tools can be fed into your monitoring system allowing you to effectively monitor your MyQL installation.
Like MySQL, the Postgres database also has a series of tools, which can be leveraged to monitor database performance. In addition to standard Linux troubleshooting tools, such as top and ps, Postgres also offers its own collection of statistics.
The statistics collector in Postgres allows you to collect data about server activity. The statistics collected in this tool are varied and may be helpful for troubleshooting or system monitoring. In order to leverage the statistics collector, you must turn on the functionality in the postgresql.conf file. The settings are commented out by default in the RUNTIME STATISTICS section of the configuration file. Uncomment the lines in the Query/Index Statistics Collector subsection.
#------------------------------------------------------------------------------
# RUNTIME STATISTICS
#------------------------------------------------------------------------------
# - Query/Index Statistics Collector -
track_activities = on
track_counts = on
track_io_timing = off
track_functions = none # none, pl, all
track_activity_query_size = 1024 # (change requires restart)
update_process_title = on
stats_temp_directory = 'pg_stat_tmp'
Once the statistics collector is configured, restart the database server or execute a pg_ctl reload for the configuration to take effect. Once the collector has been configured, there will be a series of views created and named with the prefix “pg_stat”. These views can be queried for relevant statistics in the Posgres database server.
A diligent operator will ensure that a backup of the database for each OpenStack project is created. Since most OpenStack services make heavy use of the database for persisting things such as states and metadata, a corruption or loss of data can render your OpenStack cloud unusable. The current database backups can help rescue you from this fate. MySQL users can use the mysqldump utility to back up all of the OpenStack datbases.
mysqldump --opt --all-databases > all_openstack_dbs.sql
Similarly, Postgres users can back up all OpenStack databases with a command similar to the following:
pg_dumpall > all_openstack_dbs.sql
Your cadence for backups will depend on your environment and tolerance for data corruption of loss. You should store these backups in a safe place and occasional deploy test restores from the data to ensure that they work as expected.
Monitoring is often your early warning system that something is going wrong in your cluster. Your monitoring system can also be a rich source of information when the time comes to troubleshoot issues with the cluster. There are multiple options available for monitoring OpenStack. Many of your current application monitoring platforms will handle OpenStack just as well as any other Linux system. Regardless of the tool you select to for monitoring, there are several parts of OpenStack that you should focus on.
OpenStack is typically deployed on a series of Linux servers. Monitoring the resources on those servers is essential. A set it and forget it attitude is a recipe for disaster. The things you may want to monitor on your host servers include the following:
OpenStack operators have the option to set usage quotas for each tenant/project. As an administrator, it is helpful to monitor a project’s usage as it pertains to these quotas. Once users reach a quota, they may not be able to deploy additional resources. Users may misinterpret this as an error in the system and report it to you . By keeping an eye on the quotas, your can proactively warn users as they reach their thresholds or you can decide to increase the quotas as appropriate. Some of the services have client commands that can be used to retrieve quota statistics. As an example, we demonstrate the nova absolute-limits command here:
nova absolute-limits
+--------------------+------+-------+
| Name | Used | Max |
+--------------------+------+-------+
| Cores | 1 | 20 |
| FloatingIps | 0 | 10 |
| ImageMeta | - | 128 |
| Instances | 1 | 10 |
| Keypairs | - | 100 |
| Personality | - | 5 |
| Personality Size | - | 10240 |
| RAM | 512 | 51200 |
| SecurityGroupRules | - | 20 |
| SecurityGroups | 1 | 10 |
| Server Meta | - | 128 |
| ServerGroupMembers | - | 10 |
| ServerGroups | 0 | 10 |
+--------------------+------+-------+
The absolute-limits command in Nova is nice because it displays the project’s current usage alongside the quota maximum, making it easy to notice that a project/tenant is coming close to the limit.
RabbitMQ is the default message broker used in OpenStack installations. However, if it is installed as is out the box, it can become a single point of failure. Administrators should consider clustering RabbitMQ and activating mirrored queues.
OpenStack is the leading open source software for running private clouds. Its popularity has grown exponentially since it was founded by Rackspace and NASA. The output of this engaged community is staggering, resulting in plenty of new features finding their way into OpenStack with each release. The project is at a size now where no one can truly know the details of each service. When working with such a complex project, it is inevitable that you will run into problems, bugs, errors, issues, and plain old trouble.
Further resources on this subject: