OpenStack is a complex suite of software that can make tracking down issues and faults quite daunting to beginners and experienced system administrators alike. While there is no single approach to troubleshooting systems, understanding where OpenStack logs vital information and what tools are available to help track down bugs will help resolve the issues we encounter. However, OpenStack, like all software, will have bugs that we are not able to solve ourselves. In that case, we will show you how to gather the required information so that the OpenStack community can identify bugs and suggest fixes, ensuring that those bugs or issues are dealt with quickly and efficiently.
Understanding logging
Logging is important in all computer systems, but the more complex the system, the more you rely on logging to be able to spot problems and cut down on troubleshooting time. Understanding logging in OpenStack is important to ensure your environment is healthy and you are able to submit relevant log entries back to the community to help fix bugs.
Getting ready
Log in as the root user on the appropriate servers where the OpenStack services are installed. This makes troubleshooting easier, as root privileges are required to view all the logs.
How to do it...
OpenStack produces a large number of logs that help troubleshoot our OpenStack installations. The following details outline where these services write their logs:
OpenStack Compute services logs
Logs for the OpenStack Compute services are written to /var/log/nova/, which is owned by the nova user, by default. To read these, log in as the root user (or use sudo privileges when accessing the files). The following is a list of services and their corresponding logs. Note that not all logs exist on all servers. For example, nova-compute.log exists on your compute hosts only:
nova-compute: /var/log/nova/nova-compute.log
Log entries regarding the spinning up and running of the instances
nova-network: /var/log/nova/nova-network.log
Log entries regarding network state, assignment, routing, and security groups
nova-manage: /var/log/nova/nova-manage.log
Log entries produced when running the nova-manage command
nova-conductor: /var/log/nova/nova-conductor.log
Log entries regarding services making requests for database information
nova-scheduler: /var/log/nova/nova-scheduler.log
Log entries pertaining to the scheduler, its assignment of tasks to nodes, and messages from the queue
nova-api: /var/log/nova/nova-api.log
Log entries regarding user interaction with OpenStack as well as messages regarding interaction with other components of OpenStack
nova-cert: /var/log/nova/nova-cert.log
Entries regarding the nova-cert process
nova-console: /var/log/nova/nova-console.log
Details about the nova-console VNC service
nova-consoleauth: /var/log/nova/nova-consoleauth.log
Authentication details related to the nova-console service
nova-dhcpbridge: /var/log/nova/nova-dhcpbridge.log
Network information regarding the dhcpbridge service
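Since these services write to the same directory, a quick first pass when troubleshooting is to search every Nova log for ERROR entries. The following sketch builds a tiny sample log so it is self-contained; on a real host, point grep at /var/log/nova/*.log instead:

```shell
#!/bin/sh
# Illustrative only: build a small sample log, then scan it the same way
# you would scan the real logs with: grep ERROR /var/log/nova/*.log
LOGDIR=$(mktemp -d)

cat > "$LOGDIR/nova-compute.log" <<'EOF'
2013-06-18 16:47:35 INFO nova.compute.manager [-] Starting instance...
2013-06-18 16:47:36 ERROR nova.compute.manager [-] Instance failed to spawn
EOF

# Print every ERROR line, prefixed with the file it came from
grep -H 'ERROR' "$LOGDIR"/*.log
```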
OpenStack Dashboard logs
OpenStack Dashboard (Horizon) is a web application that runs through Apache by default, so any errors and access details will be in the Apache logs. These can be found in /var/log/apache2/*.log, which will help you understand who is accessing the service as well as the report on any errors seen with the service.
OpenStack Storage logs
OpenStack Object Storage (Swift) writes logs to syslog by default. On an Ubuntu system, these can be viewed in /var/log/syslog. On other systems, these might be available at /var/log/messages.
The OpenStack Block Storage service, Cinder, will produce logs in /var/log/cinder by default. The following list is a breakdown of the log files:
cinder-api: /var/log/cinder/cinder-api.log
Details about the cinder-api service
cinder-scheduler: /var/log/cinder/cinder-scheduler.log
Details related to the operation of the Cinder scheduling service
cinder-volume: /var/log/cinder/cinder-volume.log
Log entries related to the Cinder volume service
OpenStack Identity logs
The OpenStack Identity service, Keystone, writes its logging information to /var/log/keystone/keystone.log. Depending on how you have Keystone configured, the information in this log file can range from very sparse to extremely verbose, including complete plaintext requests.
OpenStack Image Service logs
The OpenStack Image Service, Glance, stores its logs in /var/log/glance/*.log, with a separate log file for each service. The following is a list of the default log files:
api: /var/log/glance/api.log
Entries related to the glance API
registry: /var/log/glance/registry.log
Log entries related to the Glance registry service. Things like metadata updates and access will be stored here depending on your logging configuration.
OpenStack Network Service logs
The OpenStack Networking service, formerly Quantum, now Neutron, stores its log files in /var/log/quantum/*.log, with a separate log file for each service. The following is a list of the corresponding logs:
dhcp-agent: /var/log/quantum/dhcp-agent.log
Log entries pertaining to the dhcp-agent
l3-agent: /var/log/quantum/l3-agent.log
Log entries related to the l3 agent and its functionality
metadata-agent: /var/log/quantum/metadata-agent.log
This file contains log entries related to requests Quantum has proxied to the Nova metadata service.
openvswitch-agent: /var/log/quantum/openvswitch-agent.log
Entries related to the operation of Open vSwitch. When implementing OpenStack Networking, if you use a different plugin, its log file will be named accordingly.
server: /var/log/quantum/server.log
Details and entries related to the quantum API service
Open vSwitch server: /var/log/openvswitch/ovs-vswitchd.log
Details and entries related to the Open vSwitch daemon, ovs-vswitchd
Changing log levels
By default, each OpenStack service has a sane level of logging, with the log level set to WARNING. That is, it will log enough information to provide you the status of the running system, as well as some basic troubleshooting information. However, there will be times when you need to adjust the logging verbosity either up or down to help diagnose an issue or reduce logging noise.
As each service can be configured similarly, we will show you how to make these changes on the OpenStack Compute service.
Log-level settings in OpenStack Compute services
To do this, log into the server where the OpenStack Compute service is running and execute the following command:
sudo vim /etc/nova/logging.conf
Change the log levels to DEBUG, INFO, or WARNING for any of the services listed:
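The logger names and layout of logging.conf vary between releases; the fragment below is an illustrative sketch only (verify the section names against your installed file), with level being the value to change:

```ini
# /etc/nova/logging.conf -- illustrative fragment; logger names vary by release
[logger_root]
level = WARNING
handlers = null

[logger_nova]
# Raise to DEBUG while diagnosing an issue, and drop back afterwards
level = DEBUG
handlers = stderr
qualname = nova
```

Restart the Nova services after editing the file so that the new levels take effect.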
Log-level settings in other OpenStack services
Other services such as Glance and Keystone currently have their log-level settings within their main configuration files such as /etc/glance/glance-api.conf. Adjust the log levels by altering the following lines to achieve INFO or DEBUG levels:
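As an illustrative sketch (the option names here match Grizzly-era configuration files, so verify them against your installed version), the relevant lines in /etc/glance/glance-api.conf look like the following:

```ini
[DEFAULT]
# Show more verbose log output (sets log level to INFO)
verbose = True
# Show debugging output in logs (sets log level to DEBUG)
debug = False
```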
Restart the relevant service to pick up the log-level change.
How it works...
Logging is an important activity in any software, and OpenStack is no different. It allows an administrator to track down problematic activity that can be used in conjunction with the community to help provide a solution. Understanding where the services log and managing those logs to allow someone to identify problems quickly and easily are important.
Checking OpenStack services
OpenStack provides tools to check on its services. In this section, we'll show you how to check the operational status of these services. We will also use common system commands to check whether our environment is running as expected.
Getting ready
To check our OpenStack Compute host, we must log into that server, so do this now before following the given steps.
How to do it...
To check that OpenStack Compute is running the required services, we invoke the nova-manage tool and ask it various questions about the environment, as follows:
Checking OpenStack Compute Services
To check our OpenStack Compute services, issue the following command:
sudo nova-manage service list
You will see an output similar to the following. The :-) indicates that everything is fine.
nova-manage service list
The fields are defined as follows:
Binary: This is the name of the service that we're checking the status of.
Host: This is the name of the server or host where this service is running.
Zone: This refers to the OpenStack Zone that is running that service. A zone can run different services. The default zone is called nova.
Status: This states whether or not an administrator has enabled or disabled that service.
State: This refers to whether that running service is working or not.
Updated_At: This indicates when that service was last checked.
If OpenStack Compute has a problem, you will see XXX in place of :-), as in the following output:
nova-compute compute.book nova enabled XXX 2013-06-18 16:47:35
If you do see XXX, the answer to the problem will be in the logs at /var/log/nova/.
If you get intermittent XXX and :-) for a service, first check whether the clocks are in sync.
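With many hosts, scanning this output by eye gets tedious. The following sketch filters out only the services reporting XXX; it runs against captured sample output so it is self-contained, but you can pipe the real sudo nova-manage service list output through the same awk filter:

```shell
#!/bin/sh
# Captured sample 'nova-manage service list' output; the column layout is
# Binary, Host, Zone, Status, State, Updated_At.
SAMPLE='nova-compute compute.book nova enabled :-) 2013-06-18 16:47:35
nova-network compute.book nova enabled XXX 2013-06-18 16:40:11
nova-scheduler controller nova enabled :-) 2013-06-18 16:47:30'

# Report only the services whose State column shows XXX
echo "$SAMPLE" | awk '$5 == "XXX" { print $1 " on " $2 " is down" }'
```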
OpenStack Image Service (Glance)
The OpenStack Image Service, Glance, while critical to OpenStack's ability to provision new instances, does not include its own tool to check the status of the service. Instead, we rely on some built-in Linux tools, as follows:
ps -ef | grep glance
netstat -ant | grep 9292.*LISTEN
These should return process information for Glance to show that it is running, and 9292 is the default port that should be open in LISTEN mode on your server, ready for use. The output of these commands will be similar to the following:
ps -ef | grep glance
This produces output like the following:
To check if the correct port is in use, issue the following command:
netstat -ant | grep 9292
tcp 0 0 0.0.0.0:9292 0.0.0.0:* LISTEN
Other services that you should check
Should Glance be having issues even though the above checks pass, you will want to check the following supporting services as well:
rabbitmq: For rabbitmq, run the following command:
sudo rabbitmqctl status
For example, output from rabbitmqctl (when everything is running OK) should look similar to the following screenshot:
If rabbitmq isn't working as expected, you will see output similar to the following indicating that the rabbitmq service or node is down:
ntp: For ntp (Network Time Protocol, for keeping nodes in time-sync), run the following command:
ntpq -p
ntp is required for multi-host OpenStack environments, but it may not be installed by default. Install the ntp package with sudo apt-get install -y ntp.
This should return output regarding contacting NTP servers, for example:
MySQL Database Server: For MySQL Database Server, run the following commands:
PASSWORD=openstack
mysqladmin -uroot -p$PASSWORD status
This will return some statistics about MySQL, if it is running, as shown in the following screenshot:
Checking OpenStack Dashboard (Horizon)
Like the Glance Service, the OpenStack Dashboard service, Horizon, does not come with a built-in tool to check its health.
Instead, since Horizon relies on the Apache web server to serve pages, we check the health of the web service. To do so, log into the server running Horizon and execute the following command:
ps -ef | grep apache
This command produces output like the following screenshot:
To check that Apache is running on the expected port, TCP Port 80, issue the following command:
netstat -ano | grep :80
This command should show the following output:
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN off (0.00/0/0)
To test access to the web server from the command line issue the following command:
telnet localhost 80
This command should show the following output:
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Checking OpenStack Identity (Keystone)
Keystone comes with a command-line client, python-keystoneclient. We use this tool to check the status of our Keystone services.
To check that Keystone is running the required services, we invoke the keystone command:
# keystone user-list
This produces output like the following screenshot:
Additionally, you can use the following commands to check the status of Keystone. The following command checks the status of the service:
# ps -ef | grep keystone
This should show output similar to the following:
keystone 5441 1 0 Jun20 ? 00:00:04 /usr/bin/python /usr/bin/keystone-all
Next you can check that the service is listening on the network. The following command can be used:
netstat -anlp | grep 5000
This command should show output like the following:
tcp 0 0 0.0.0.0:5000 0.0.0.0:* LISTEN 54421/python
Checking OpenStack Networking (Neutron)
When running the OpenStack Networking service, Neutron, there are a number of services that should be running on various nodes. These are depicted in the following diagram:
On the Controller node, check that the Quantum server API service is listening on TCP Port 9696, as follows:
sudo netstat -anlp | grep 9696
The command brings back output like the following:
tcp 0 0 0.0.0.0:9696 0.0.0.0:* LISTEN 22350/python
On the Compute nodes, check the following services are running using the ps command:
ovsdb-server
ovs-vswitchd
quantum-openvswitch-agent
For example, run the following command:
ps -ef | grep ovsdb-server
On the Network node, check the following services are running:
ovsdb-server
ovs-vswitchd
quantum-openvswitch-agent
quantum-dhcp-agent
quantum-l3-agent
quantum-metadata-agent
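These per-node checks can be scripted. The sketch below (process names taken from the lists above; the sample ps listing is made up) reports any required Network node service that is missing:

```shell
#!/bin/sh
# Sketch: flag any required Network-node process missing from a 'ps' listing.
# SAMPLE_PS stands in for the output of: ps -e -o comm=
REQUIRED="ovsdb-server ovs-vswitchd quantum-openvswitch-agent \
quantum-dhcp-agent quantum-l3-agent quantum-metadata-agent"

SAMPLE_PS='ovsdb-server
ovs-vswitchd
quantum-openvswitch-agent
quantum-dhcp-agent
quantum-metadata-agent'

MISSING=""
for svc in $REQUIRED; do
    echo "$SAMPLE_PS" | grep -qx "$svc" || MISSING="$MISSING $svc"
done
echo "missing:$MISSING"
```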
To check our Neutron agents are running correctly, issue the following command from the Controller host when you have the correct OpenStack credentials sourced into your environment:
quantum agent-list
This will bring back output like the following screenshot when everything is running correctly:
Checking OpenStack Block Storage (Cinder)
To check the status of the OpenStack Block Storage service, Cinder, you can use the following commands:
Use the following command to check if Cinder is running:
ps -ef | grep cinder
This command produces output like the following screenshot:
Use the following command to check if iSCSI target is listening:
netstat -anp | grep 3260
This command produces output like the following:
tcp 0 0 0.0.0.0:3260 0.0.0.0:* LISTEN 10236/tgtd
Use the following command to check that the Cinder API is listening on the network:
netstat -an | grep 8776
This command produces output like the following:
tcp 0 0 0.0.0.0:8776 0.0.0.0:* LISTEN
To validate the operation of the Cinder service, if all of the above is functional, you can try to list the volumes Cinder knows about using the following:
cinder list
This produces output like the following:
Checking OpenStack Object Storage (Swift)
The OpenStack Object Storage service, Swift, has a few built-in utilities that allow us to check its health. To do so, log into your Swift node and run the following commands:
Use the following commands for checking the Swift service:
Using Swift Stat:
swift stat
This produces output like the following:
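The stat output is a series of Name: value lines, which makes it easy to pull individual fields out in scripts. A small sketch using made-up sample values:

```shell
#!/bin/sh
# Made-up sample of 'swift stat' output; the real command prints similar
# 'Name: value' lines for the authenticated account.
SAMPLE='   Account: AUTH_demo
Containers: 2
   Objects: 14
     Bytes: 10485760'

# Pull a single field out, e.g. the container count
echo "$SAMPLE" | awk -F': *' '/Containers/ { print $2 }'
```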
Using PS:
There will be a service for each configured container, account, and object store.
ps -ef | grep swift
This should produce output like the following screenshot:
Use the following command for checking the Swift API:
ps -ef | grep swift-proxy
This should produce output like the following screenshot:
Use the following command for checking if Swift is listening on the network:
netstat -anlp | grep 8080
This should produce output like the following:
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 9818/python
How it works...
We have used some basic commands that communicate with OpenStack services to show they're running. This elementary level of checking helps with troubleshooting our OpenStack environment.
Troubleshooting OpenStack Compute services
OpenStack Compute services are complex, and being able to diagnose faults is an essential part of ensuring the smooth running of the services. Fortunately, OpenStack Compute provides some tools to help with this process, along with tools provided by Ubuntu to help identify issues.
How to do it...
Troubleshooting OpenStack Compute services can be a complex issue, but working through problems methodically and logically will help you reach a satisfactory outcome. Carry out the following suggested steps when encountering the different problems presented.
Steps for when you cannot ping or SSH to an instance
When launching instances, we specify a security group. If none is specified, a security group named default is used. These mandatory security groups ensure security is enabled by default in our cloud environment, and as such, we must explicitly state that we require the ability to ping our instances and SSH to them. For such a basic activity, it is common to add these abilities to the default security group.
Network issues may prevent us from accessing our cloud instances. First, check that the compute hosts are able to forward packets from the public interface to the bridged interface. Use the following command to do so:
sysctl -A | grep ip_forward
net.ipv4.ip_forward should be set to 1. If it isn't, check that /etc/sysctl.conf has the following option uncommented:
net.ipv4.ip_forward=1
Then, run the following command to pick up the change:
sudo sysctl -p
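A related check that needs no root privileges is to read the live value directly from /proc, which is where sysctl gets it from:

```shell
#!/bin/sh
# Read the current forwarding setting straight from /proc (no root needed);
# this is the same value that 'sysctl net.ipv4.ip_forward' reports.
val=$(cat /proc/sys/net/ipv4/ip_forward)
if [ "$val" -eq 1 ]; then
    echo "ip_forward is enabled"
else
    echo "ip_forward is disabled: instance traffic will not be forwarded"
fi
```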
Other network issues could be routing issues. Check that we can communicate with the OpenStack Compute nodes from our client and that any routing to get to these instances has the correct entries.
We may have a conflict with IPv6 if IPv6 isn't required. If this is the case, try adding --use_ipv6=false to your /etc/nova/nova.conf file, and restart the nova-compute and nova-network services. We may also need to disable IPv6 in the operating system, which can be achieved using something like the following line in /etc/modprobe.d/ipv6.conf:
install ipv6 /bin/true
If using OpenStack Neutron, check the status of the Neutron services on the host and that the correct IP namespace is being used (see Troubleshooting OpenStack Networking).
Reboot your host.
Methods for viewing the Instance Console log
When using the command line, issue the following commands:
nova list
nova console-log INSTANCE_ID
For example:
nova console-log ee0cb5ca-281f-43e9-bb40-42ffddcb09cd
When using Horizon, carry out the following steps:
Navigate to the list of instances and select an instance.
You will be taken to an Overview screen. Along the top of the Overview screen is a Log tab. This is the console log for the instance.
When viewing the logs directly on a nova-compute host, look for the following file:
The console logs are owned by root, so only an administrator can view them. They are placed at /var/lib/nova/instances/<instance_id>/console.log.
Instance fails to download meta information
If an instance fails to communicate and download the extra information that can be supplied via the instance metadata, we can end up in a situation where the instance is up but you're unable to log in, as the SSH key information is injected using this method.
Viewing the console log will show output like in the following screenshot:
If you are not using Neutron, ensure the following:
nova-api is running on the Controller host (in a multi_host environment, ensure that the nova-api-metadata and nova-network packages are installed and running on the Compute host).
Perform the following iptables check on the Compute node:
sudo iptables -L -n -t nat
We should see a line in the output like in the following screenshot:
If not, restart your nova-network services and check again.
Sometimes there are multiple copies of dnsmasq running, which can cause this issue. Ensure that there is only one instance of dnsmasq running:
ps -ef | grep dnsmasq
This will bring back two process entries: the parent dnsmasq process and a spawned child (verify by the PIDs). If there are any other instances of dnsmasq running, kill them. Once killed, restart nova-network, which will spawn dnsmasq again without any conflicting processes.
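This check can be scripted by counting matching processes. The following sketch uses a made-up ps listing containing a conflicting third dnsmasq so that it is self-contained:

```shell
#!/bin/sh
# Sample 'ps -ef' output with a third, conflicting dnsmasq process; a healthy
# nova-network host shows exactly two (the parent and its spawned child).
SAMPLE='root   1201    1 dnsmasq --conf-file=/var/lib/nova/networks/nova-br100.conf
nobody 1202 1201 dnsmasq --conf-file=/var/lib/nova/networks/nova-br100.conf
nobody 3410    1 dnsmasq --conf-file=/etc/dnsmasq.conf'

count=$(echo "$SAMPLE" | grep -c dnsmasq)
if [ "$count" -gt 2 ]; then
    echo "found $count dnsmasq processes; expected 2"
fi
```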
If you are using Neutron:
The first place to look is in the /var/log/quantum/metadata_agent.log on the Network host. Here you may see Python stack traces that could indicate a service isn't running correctly. A connection refused message may appear here suggesting the metadata agent running on the Network host is unable to talk to the Metadata service on the Controller host via the Metadata Proxy service (also running on the Network host).
The metadata service runs on port 8775 on our Controller host, so checking that it is running involves checking that the port is open and that it is serving the metadata service. To do this on the Controller host, run the following:
sudo netstat -antp | grep 8775
This will bring back the following output if everything is OK:
tcp 0 0 0.0.0.0:8775 0.0.0.0:* LISTEN
If nothing is returned, check that the nova-api service is running and if not, start it.
Instance launches; stuck at Building or Pending
Sometimes, a little patience is needed before assuming the instance has not booted, because the image may be copied across the network to a node that has not seen it before. At other times, though, if the instance has been stuck in a booting or similar state for longer than normal, it indicates a problem. The first place to look is for errors in the logs. A quick way of doing this from the controller server is by issuing the following command:
sudo nova-manage logs errors
A common error that is usually present relates to AMQP being unreachable. Generally, these errors can be ignored unless, having checked the timestamps, you find the errors are appearing now. You tend to see a number of these messages from when the services first started up, so look at the timestamp before reaching conclusions.
This command brings back any log line with ERROR as the log level, but you will need to view the logs in more detail to get a clearer picture.
A key log file when troubleshooting instances that are not booting properly is on the controller host at /var/log/nova/nova-scheduler.log. This file tends to reveal why an instance is stuck in the Building state. Another file with further information is on the compute host at /var/log/nova/nova-compute.log. Look here at the time you launch the instance. In a busy environment, you will want to tail the log file and filter for the instance ID.
Check /var/log/nova/nova-network.log (for Nova Network) and /var/log/quantum/*.log (for Neutron) for any reason why instances aren't being assigned IP addresses. It could be issues around DHCP preventing address allocation or quotas being reached.
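On a busy host, filtering a log by the instance's UUID quickly isolates the relevant lines. The sketch below uses a fixture file in place of /var/log/nova/nova-compute.log, with made-up UUIDs:

```shell
#!/bin/sh
# Fixture log standing in for /var/log/nova/nova-compute.log on a busy host;
# filtering on the instance UUID isolates just the lines we care about.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2013-06-18 16:47:35 INFO nova.compute.manager [instance: ee0cb5ca-281f-43e9-bb40-42ffddcb09cd] Starting instance...
2013-06-18 16:47:36 INFO nova.compute.manager [instance: 11111111-2222-3333-4444-555555555555] Starting instance...
2013-06-18 16:47:40 ERROR nova.compute.manager [instance: ee0cb5ca-281f-43e9-bb40-42ffddcb09cd] Build timed out
EOF

INSTANCE_ID=ee0cb5ca-281f-43e9-bb40-42ffddcb09cd
grep "$INSTANCE_ID" "$LOG"
```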
Error codes such as 401, 403, 500
The majority of the OpenStack services are web services, meaning the responses from the services are well defined.
40X: This refers to a service that is up but responding to an event produced by some user error. For example, a 401 is an authentication failure, so check the credentials used when accessing the service.
500: These errors mean a connecting service is unavailable, or has produced an error that the service has interpreted as a failure. Common problems here are services that have not started properly, so check for running services.
If all avenues have been exhausted when troubleshooting your environment, reach out to the community, using the mailing list or IRC, where there is a raft of people willing to offer their time and assistance. See the Getting help from the community recipe at the end of this article for more information.
Listing all instances across all hosts
From the OpenStack controller node, you can execute the following command to get a list of the running instances in the environment:
sudo nova-manage vm list
To view all instances across all tenants, as a user with an admin role execute the following command:
nova list --all-tenants
These commands are useful in identifying any failed instances and the hosts on which they are running. You can then investigate further.
How it works...
Troubleshooting OpenStack Compute problems can be quite complex, but looking in the right places can help solve some of the more common problems. Unfortunately, like troubleshooting any computer system, there isn't a single command that can help identify all the problems that you may encounter, but OpenStack provides some tools to help you identify some problems. Having an understanding of managing servers and networks will help troubleshoot a distributed cloud environment such as OpenStack.
There's more than one place where you can go to identify the issues, as they can stem from the environment to the instances themselves. Methodically working your way through the problems though will help lead you to a resolution.