The CPU is the main component whose power management directly affects performance, because of the time it takes to wake up from sleep states. CPU power management should be disabled as much as possible. However, the available options vary depending on the hardware manufacturer.
If you're wondering what the effects of power management are, it's simple. If the activity of the processor is reduced too much, a standard action that takes 1 second could take more than 2 seconds. During this period, CPU power management techniques are changing the CPU state (lowering the frequency, lowering the voltage, deactivating some subsystems, and so on).
When the workload increases, subsystems are reactivated, but C-State reactivation implies some latency (from microseconds to milliseconds, depending on the state) before the CPU is back at maximum performance.
Power management optimization
Most of the following power management commands may not work under virtual machines. So, you should consider having a physical machine to test and run them.
cpufreq allows the OS to control P-States. This means that the OS, depending on how idle the system is, can lower the frequency of the CPU to reduce power consumption.
To get cpufreq information on a specified core, look at the cpufreq folder by running the following command:
When enabled in the BIOS, cpufreq drivers are loaded and the cpufreq directory in sysfs is available. You can look at the governor in use (the power management mechanism) using the following command:
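As a minimal sketch of this check, the following function prints the governor of every core that exposes a cpufreq directory. The function name show_governors and its optional cpu_root argument are assumptions made for illustration; the sysfs path is the standard one on a physical Linux machine.

```shell
# Print "cpuN: governor" for every core that exposes a cpufreq directory.
# cpu_root is a hypothetical parameter so the function can be pointed at
# another tree; it defaults to the usual sysfs location.
show_governors() (
    cpu_root="${1:-/sys/devices/system/cpu}"
    for gov in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$gov" ] || continue      # no cpufreq driver loaded: nothing to show
        core="${gov%/cpufreq/scaling_governor}"
        printf '%s: %s\n' "${core##*/}" "$(cat "$gov")"
    done
)
```

Calling show_governors without arguments reads the live system. If it prints nothing, the cpufreq drivers are probably not loaded (BIOS setting disabled, or a virtual machine).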
If the BIOS settings are disabled, no drivers are loaded and no scaling frequency is allowed. That means your server works at maximum performance. However, you can also tune the scaling governor for performance purposes.
To always get the maximum performance without disabling the options in the BIOS, we'll install a package that will configure all the cores on your machine:
Then, edit the configuration file (/etc/default/cpufrequtils) and set the new configuration:
You can get all the available governors with the following command:
Then, you can restart the cpufrequtils service and the governor of each core will be set to performance.
cpuidle allows the OS to control the CPU C-States (that is, how the CPU goes into an idle/sleep state). Depending on the CPU manufacturer and model, several C-States are available. The standard ones are C0 and C1. C0 is the running state while C1 is an idle state.
Even if C-States have been disabled in the BIOS settings, the cpuidle driver can be loaded and managed. To look at the loaded driver, run the following command:
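A minimal sketch of that lookup, assuming the kernel exposes the cpuidle directory in sysfs. The function name cpuidle_driver and its optional directory argument are illustrative only.

```shell
# Print the cpuidle driver currently in use, or "none" when cpuidle
# is not exposed by the kernel.
cpuidle_driver() (
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpuidle}"
    if [ -r "$cpuidle_dir/current_driver" ]; then
        cat "$cpuidle_dir/current_driver"
    else
        echo "none"
    fi
)
```

On Intel hardware, the value is typically intel_idle or acpi_idle.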
The intel_idle driver handles more C-States and aggressively puts the CPU into deeper idle states. Since waking up from deeper C-States has significant latency, this can affect performance.
When the intel_idle driver is loaded, specific cpuidle configurations are available for each CPU:
Each of the C-States is described here with its wake-up latency. To know the time it takes a C-State to wake up, check its latency file by running the following command:
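To read all the states of one core at once, a small loop over the state directories is enough. This is a sketch, assuming the standard cpuidle layout in sysfs (one stateN directory per C-State, each with a name and a latency file); list_cstates is a hypothetical helper name.

```shell
# List every idle state of one core with its exit latency.
# The kernel reports the latency file in microseconds.
list_cstates() (
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$cpuidle_dir"/state[0-9]*; do
        [ -r "$state/latency" ] || continue
        printf '%s %s %s us\n' "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/latency")"
    done
)
```

Run list_cstates with no argument to inspect cpu0, or pass the cpuidle directory of another core.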
In the preceding output, the wake-up latency reported for C-State 3 is 150; note that the latency file is expressed in microseconds, not milliseconds. To avoid having all the C-States enabled, change the grub boot configuration and add the following option (in /etc/default/grub):
To make it work, update the grub configuration and reboot:
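The following sketch appends a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT without touching the rest of the file. It assumes that the option meant here is the intel_idle.max_cstate kernel parameter and that the file uses the usual double-quoted form; add_kernel_param is a hypothetical helper, not a grub tool.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already there. grub_file defaults to the Debian location.
# Note: the parameter must not contain "|" or "&" (sed metacharacters here).
add_kernel_param() (
    param="$1"
    grub_file="${2:-/etc/default/grub}"
    # Nothing to do when the parameter is already on the command line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file" | grep -qF -- "$param"; then
        exit 0
    fi
    # Insert the parameter just before the closing double quote.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" \
        "$grub_file" > "$grub_file.new" && mv "$grub_file.new" "$grub_file"
)
```

For example, add_kernel_param intel_idle.max_cstate=0 (run as root) edits /etc/default/grub; running update-grub and rebooting then applies it.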
Disk and filesystem optimization
For disks and filesystems, there are multiple factors that can slow down your MariaDB instance:
- Magnetic drives' rotation per minute
- Magnetic drives with data at the beginning of the disk
- Partitions not aligned to the disk
- Small partitions at the end of the disk
- Disk bus speed
- Magnetic drives' seek time
- Active SWAP partitions
Some of these factors can only be resolved by changing the hardware, but others can be changed by tuning the operating system.
Kernel disks' I/O schedulers
The kernel I/O scheduler permits us to change the way we read and write data on the disk. There are three kinds of schedulers. You can select a disk and look at the currently used scheduler using the following command:
The I/O scheduler used here is Completely Fair Queuing (CFQ).
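The scheduler file lists every available scheduler and marks the active one with square brackets. As a sketch, the following function extracts only the bracketed name; active_scheduler and its optional file argument are illustrative. Keep in mind that recent kernels offer a different set of names (for example mq-deadline or none), so the list you see may not match the three described here.

```shell
# Print the active I/O scheduler of a block device, that is, the name
# shown between square brackets in the scheduler file.
active_scheduler() (
    sched_file="${2:-/sys/block/$1/queue/scheduler}"
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$sched_file"
)
```

For example, active_scheduler sda prints cfq on a system where the file contains noop deadline [cfq].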
The noop scheduler queues requests as they are sent to the I/O.
The deadline scheduler prevents excessive seek movement by serving I/O requests that are near to the new location on the disk. This is the best solution for SSDs.
The CFQ scheduler is the default scheduler on most Linux distributions. The goal of this scheduler is to minimize seek head movements. This is the best solution for magnetic disks if there is no other mechanism above it (such as RAID, Fusion-io, and so on). In the case of SSDs, you have to use the deadline scheduler.
To change the disk I/O scheduler to deadline, use the following command:
You have to replace the device placeholder with your device name (such as sda).
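Writing a scheduler name into the scheduler file of the device is all it takes, but it is worth checking first that the kernel offers that name. The following is a sketch of that idea; set_scheduler is a hypothetical helper and the third argument only exists to make it testable.

```shell
# Select an I/O scheduler for one device, refusing names that the
# kernel does not list for it. Must be run as root on a real device.
set_scheduler() (
    wanted="$1"
    sched_file="${3:-/sys/block/$2/queue/scheduler}"
    # The file lists the choices separated by spaces, the active one in brackets.
    if tr -d '[]' < "$sched_file" | tr ' ' '\n' | grep -qx -- "$wanted"; then
        echo "$wanted" > "$sched_file"
    else
        echo "scheduler '$wanted' is not available for $2" >&2
        exit 1
    fi
)
```

For example, set_scheduler deadline sda switches sda to deadline. The change is lost at reboot, which is why a persistent method is described next.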
Another solution to avoid changing the disk I/O scheduler manually is to install sysfsutils by using the following command:
Then, you have to configure it in /etc/sysfs.conf:
Easy to use and understand, sysfsutils applies the changes you list for /sys automatically at boot (as there is no sysctl for /sys).
Now, it could be a problem if you have a lot of disks on your machine and want to set the same I/O scheduler on all devices. In that case, simply change the grub boot settings (/etc/default/grub) with the elevator option:
To make the preceding setting work, update the grub configuration and reboot:
If you want to go further, there are several options for each I/O scheduler, and there is no optimal configuration. For example, on MyISAM, you need to increase nr_requests to multiply the throughput. You have to test them and find the solution that best matches your needs.
The latest version of CFQ automatically detects whether the disk is magnetic and adapts itself, which avoids having to change the elevator value. You can find all the required information on the Linux kernel website (http://www.kernel.org).
The goal of partition alignment is to match logical block partitions with physical blocks to limit the number of disk operations. You must make the first partition begin from the disk sector 2048. However, it can be done automatically if you're using the parted command. First of all, install the package:
Here is an example:
In the preceding example, we set /dev/sdb as the disk device, then created a GPT partition table, and finally created a single partition that takes the full disk size. 0% means the partition starts at the beginning of the disk (which in fact starts at 2M) and 100% means it goes to the end. The optimal option means we want the best partition alignment to get the best performance.
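To see why sector 2048 is a good starting point, the arithmetic can be checked directly: 2048 sectors of 512 bytes is exactly 1 MiB. The following sketch tests whether a start sector falls on a 1 MiB boundary; the 1 MiB criterion is a common convention taken as an assumption here, and is_aligned is a hypothetical helper.

```shell
# Succeed when a partition start is a multiple of 1 MiB.
# start_sector is counted in 512-byte units, which is how sysfs reports
# it in /sys/block/<disk>/<partition>/start.
is_aligned() {
    start_sector="$1"
    [ $(( (start_sector * 512) % 1048576 )) -eq 0 ]
}
```

On a real system, you could feed it the content of /sys/block/sdb/sdb1/start. The parted command also provides its own check with align-check optimal followed by the partition number.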
Since version 2.6.33 of the kernel, you can enable TRIM support. Btrfs, Ext4, JFS, and XFS are optimized for TRIM when you activate this option. The TRIM feature tells the SSD which blocks of data are no longer considered in use and can be wiped internally. It allows the SSD to handle garbage collection overhead that would otherwise slow down future operations on those blocks.
Ext4 is one of the best solutions for high performance. To enable TRIM on it, modify your fstab (/etc/fstab) configuration to add the discard option:
Now, remount your partition to enable TRIM support for Ext4:
On LVM, you can also enable TRIM for all the logical volumes by changing the issue_discards option in your LVM configuration file (/etc/lvm/lvm.conf):
Finally, we want to limit needless utilization of the SSD, and this can be done by placing temporary folders in RAM using the tmpfs filesystem. To achieve this, edit the fstab file at /etc/fstab and add the following three lines:
Mount the preceding partitions to make them active.
Tip
On Debian, you do not need to change /etc/fstab; you can configure tmpfs in /etc/default/tmpfs instead.
Several kinds of filesystems exist and their performance generally depends on their usage. For MariaDB, I've performed several tests against XFS. My conclusion is the same as what we can find on most of the specialized websites on the Internet: XFS is a good solution but Ext4 is slightly faster.
On Ext4, you can add several interesting options to limit write access on the disk. You can, for example, disable the access time on all files and folders. This avoids writing the last access time information for every accessed file on the partition. As MariaDB often needs to access the same files, their access times are updated on each MariaDB modification (insert/update/delete), which consumes disk I/O.
This could be a problem in some cases (for example, if you absolutely need these updates), but most of the time, it can be disabled by adding the following options in the fstab configuration (/etc/fstab):
On a system with high disk I/O, this will reduce disk accesses significantly.
You may also have noticed that we used data=writeback. This option means that only metadata writes are journaled. It works well with InnoDB and is safe. Why? Because InnoDB has its own transaction logs, there is no need to duplicate the same action. This is the fastest solution, but if you prefer a safer one, you can use data=ordered instead to get data written before metadata.
Another interesting filesystem performance solution is to separate the Ext4 journal from the data disk (as in journaling, the filesystem writes data twice). Place the journal on a separate fast drive such as SSD. By default, the journal occupies between 2.5 percent and 5 percent of the filesystem size. It's suggested to keep the size at minimum for performance (it could be reduced on a very large data size).
First of all, check your current filesystem block size (here /dev/mapper/vg-home):
Here, we've got a 4096 block size and the journal needs to have the same block size as well.
To dedicate a journal to an existing partition, we need to unmount it to be sure that there is no access, remove the current journal from the partition, create the journal partition on the dedicated device (partition size * 5 / 100), attach it to the target partition, and then remount it:
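Because these steps are destructive, the following sketch only prints the commands instead of running them, so that you can review them first. It relies on the standard e2fsprogs options (tune2fs -O ^has_journal, mke2fs -O journal_dev, and tune2fs -J device=); the helper names and the /dev/sdc1 journal device used in the example are assumptions for illustration.

```shell
# Print (without running them) the commands that move an Ext4 journal
# to a dedicated device. All arguments are placeholders to adapt.
journal_plan() (
    fs_dev="$1"; journal_dev="$2"; mount_point="$3"; block_size="${4:-4096}"
    echo "umount $mount_point"
    echo "tune2fs -O ^has_journal $fs_dev"                    # drop the internal journal
    echo "mke2fs -O journal_dev -b $block_size $journal_dev"  # format the external journal
    echo "tune2fs -J device=$journal_dev $fs_dev"             # attach it to the filesystem
    echo "mount $mount_point"
)

# Journal size from the rule given above: partition size * 5 / 100.
journal_size_mb() { echo $(( $1 * 5 / 100 )); }
```

For example, journal_plan /dev/mapper/vg-home /dev/sdc1 /home prints the five steps for the volume used in this section, and journal_size_mb 20480 gives the journal size in MB for a 20 GB partition.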
Now, check your partition to see whether the journal is located on another partition:
To locate it, use the following command:
As you now have a dedicated journal for your partition, add two other options to /etc/fstab (journal_async_commit). The advantage is that the commit block can be written to disk without waiting for the descriptor blocks. This option will boost performance. The code is as follows:
Another option exists for Ext4: barrier=0. It will boost performance as well. Do not use it if you have a standalone server, because it will delay journal data writes and you may not be able to recover your data if your system crashes. You should only use barrier=0 if you're using a RAID card with a BBU.
Tip
The Linux kernel evolves very quickly. XFS has new options, new filesystems appear, and Ext4 may not be the best solution in all cases. You should stay in touch with all the kernel-related news and test your usage cases yourself.
As SWAP is used on a physical disk (magnetic or SSD), it's slower than RAM. Linux, by default, likes swapping for several reasons. To avoid your MariaDB data being placed in SWAP instead of RAM, you have to play with a kernel parameter called swappiness.
A swappiness value is used to change the balance between swapping out runtime memory and dropping pages from the system page cache. The higher the value, the more the system will swap. The lower the value, the less the system will swap. The maximum value is 100, the minimum is 0, and 60 is the default. To make this parameter persistent, add this line to your sysctl.conf file in /etc/sysctl.conf:
To avoid a system reboot to get this value set on the running system, you can launch the following command:
And now check the value to be sure it has been applied:
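As a small sketch, the following function validates a value against the 0 to 100 range described above and prints the corresponding sysctl.conf line. The function name is hypothetical, and the value 10 used in the example is only an illustration, not a recommendation from this section.

```shell
# Validate a swappiness value (0 to 100) and print the matching
# line for /etc/sysctl.conf.
swappiness_line() (
    value="$1"
    case "$value" in
        ''|*[!0-9]*) echo "not a number: $value" >&2; exit 1 ;;
    esac
    [ "$value" -le 100 ] || { echo "out of range: $value" >&2; exit 1; }
    echo "vm.swappiness = $value"
)
```

For example, swappiness_line 10 prints vm.swappiness = 10. On the running system, sysctl -w vm.swappiness=10 applies the value immediately, and reading /proc/sys/vm/swappiness confirms it.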
Dedicating hardware with cgroups
The Linux kernel provides a feature called cgroups (since version 2.6.24) that permits the isolation of a process from others. If we want to dedicate CPU, RAM, or disk I/O, we can use cgroups to do it (it also provides other interesting features if you want to go further). With this solution, you can be sure to dedicate hardware to your MariaDB instance.
To start using cgroups, we must first prepare the environment. In fact, cgroups needs a specific folder hierarchy to work, but you'll see the advantages when we use it. So, edit the fstab file in /etc/fstab to mount cgroups at each machine startup, and add the following line:
Mount cgroup now to make cgroups available:
To get all the CPU and memory features enabled, you need to change the grub configuration by adding two new options in /etc/default/grub (cgroup_enable and swapaccount):
Then, update your grub settings and reboot:
After the machine has rebooted, you can check whether your cgroup hierarchy exists:
Let's create our first cgroup, the MariaDB one! Create a folder with a name of your choice in the cgroup folder:
If we now look at the mariadb_cgroup content, we can see all the limitations that the cgroup features are able to offer:
You can see that there's a lot of stuff! Ok, now let's look at your processor information to see how many cores you've got:
I can see that I've got four cores available on this machine. For example, let's say I want to dedicate two cores to my MariaDB instance. The first thing to do is to assign two cores to the mariadb_cgroup cgroup:
You can set multiple cores separated by commas, or use the minus character if you want a CPU range (0-3 to select cores 0 to 3).
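To make the list syntax concrete, the following sketch expands such a list into one core number per line. It only handles the two forms described above (commas and ranges), and expand_cpu_list is a hypothetical helper, not part of the cgroup tools.

```shell
# Expand a cpuset list such as "0-3" or "0,2-3" into one core number
# per line. The function body runs in a subshell, so IFS is restored
# automatically when it returns.
expand_cpu_list() (
    IFS=','
    for part in $1; do
        case "$part" in
            *-*) i="${part%-*}"; last="${part#*-}"
                 while [ "$i" -le "$last" ]; do
                     echo "$i"
                     i=$((i + 1))
                 done ;;
            *)   echo "$part" ;;
        esac
    done
)
```

For example, expand_cpu_list 0,2-3 prints 0, 2, and 3 on separate lines, which is a quick way to verify which cores a cpuset value really covers.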
In this case, I've just asked the cgroup to be bound to the last two cores. That means this cgroup is only able to use those two cores, but it doesn't mean it is the only one able to use them. Those cores are still sharable with other processes. To make them dedicated to this cgroup, simply use the following command:
You can check the configuration of your cgroup simply with cat:
We also need to specify the memory nodes that the tasks will be allowed to access. First, let's take a look at the available memory nodes:
Then, set the desired memory node (here 0):
Now, the cgroup is ready to dedicate cores to a process ID:
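The manual steps of this section can be gathered into one function. This is a sketch that assumes the cgroup v1 layout mounted on /sys/fs/cgroup as described above, where the cpuset files are named cpuset.cpus and cpuset.mems and processes are attached through the tasks file; pin_to_cpuset is a hypothetical helper name.

```shell
# Create a cpuset cgroup and move one process into it.
# The cores and memory nodes must be set before a task can be attached.
pin_to_cpuset() (
    cgroup_root="$1"; name="$2"; cpus="$3"; mems="$4"; pid="$5"
    group="$cgroup_root/$name"
    mkdir -p "$group" || exit 1
    echo "$cpus" > "$group/cpuset.cpus"   # cores the group may run on
    echo "$mems" > "$group/cpuset.mems"   # memory nodes it may allocate from
    echo "$pid"  > "$group/tasks"         # attach the process
)
```

For example, as root, pin_to_cpuset /sys/fs/cgroup mariadb_cgroup 2,3 0 followed by the PID of the MariaDB server binds that process to the last two cores and memory node 0.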
That is it! If you want to be sure that you've correctly configured your cgroups, you can add to that cgroup the PID of another process that saturates the two cores, and check with the top or htop command, for example.
You can check your configuration using a PID in the following way:
Automatic solution using the cgconfig daemon
It's preferable to be able to manage the manual solution before the automatic solution to check whether your configuration works as expected.
Now, if you want to have it enabled on boot and automatically configured correctly, you will need to use the cgconfig daemon. It will load a configuration and then watch all the launched processes. If one matches its set configuration, it will automatically apply the defined rules.
To get cgconfig, you'll need to install the following package:
The cgroup-bin package in Debian wheezy is still a little bit young, so we need to manually set up the init file and the configuration from the package documentation.
Unfortunately, you need to do a little hack with the init skeleton file to be able to use the update-rc.d command for the cgconfig services because the original init files are not 100 percent Debian-compliant yet:
In the meantime, we've updated a Red Hat path to a Debian one (sysconfig | defaults), modified the folder to store the lock file of the daemon, and changed the default cgred init to correct some bugs.
Regarding the configuration files, let's start with /etc/cgconfig.conf:
Here, we've got the cgroup name mariadb_cgroup. When a user, mysql, launches an operation, the cpuset configuration will be applied. In the same way as in the manual method, we've limited the mysql user process to the second and third cores.
The last thing to configure is the cgrules.conf file in /etc/cgrules.conf, which will indicate which process belongs to which cgroup. You need to add the user mysql to modify the cpu information and the cgroup folder name where it should be placed:
Of course, you can check your configuration in /sys/fs/cgroup when you want.
When you've finished configuring your cgroup and want the new configuration to be active, restart the services in the following order:
/etc/init.d/cgred stop
/etc/init.d/cgconfig stop
umount /sys/fs/cgroup 2>/dev/null
rmdir /sys/fs/cgroup/* /sys/fs/cgroup 2>/dev/null
mount /sys/fs/cgroup
/etc/init.d/cgconfig start
/etc/init.d/cgred start
Dedicating hardware optimization with NUMA
With large InnoDB databases (~ >32G), it becomes important to take a look at this kind of optimization.
In the old/classic Uniform Memory Access (UMA) architecture, all the memory was shared among all the processors with equal access. There wasn't any affinity and performance was equal for all cores accessing the memory bank. With the Non-Uniform Memory Access (NUMA) architecture (starting with Intel Nehalem and AMD Opteron), this is totally different:
Each core has a local memory bank that gives closer access and thus reduces the latency. Of course, the whole system is visible as one unit, but optimization can be done to restrict a processor to its local memory bank. If there is no NUMA optimization, a core can ask for memory outside its local memory, which will increase the latency and lower the overall performance.
By default, Linux automatically knows when it runs on a NUMA architecture and performs the following kind of operations natively:
- Collects hardware information to understand the running architecture
- Binds the correct memory module to the local core it belongs to
- Splits physical processors to nodes
- Collects cost information regarding inter-node communication
To look at the NUMA hardware on a running system, you can use the numactl command (install it first if not present):
We can see two different nodes here that indicate two different physical CPUs and the physical allocated RAM.
The node distances represent the cost of interconnect access. The weight for node 0 to access its local bank is 10, and to access the bank of node 1, it's 20. The same constraint applies for node 1 to access node 0.
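The same distance information is also exposed by the kernel in sysfs, with one distance file per node. The following sketch prints one row per node; it assumes the standard /sys/devices/system/node layout, and node_distances is a hypothetical helper name.

```shell
# Print the distance row of every NUMA node. Each row lists the cost
# for that node to reach node 0, node 1, and so on.
node_distances() (
    node_root="${1:-/sys/devices/system/node}"
    for node in "$node_root"/node[0-9]*; do
        [ -r "$node/distance" ] || continue
        printf '%s: %s\n' "${node##*/}" "$(cat "$node/distance")"
    done
)
```

On a two-node machine matching the description above, the rows would read node0: 10 20 and node1: 20 10.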
You can see the NUMA policy and information using the following command:
Now, if you want to bind a process to a CPU, use the following command:
To allocate the local memory of a NUMA node, use the following command:
Now, if you want to get stats and see how your NUMA system behaves (NUMA hits and misses, interleave hits, and so on), use the numastat command:
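The counters shown by numastat are also available per node in sysfs, which makes it easy to compute a ratio. As a sketch, the following function reports the share of pages allocated on a node that were originally intended for another node (numa_miss compared to numa_hit plus numa_miss); the function name and the choice of this ratio are assumptions for illustration.

```shell
# Print, as a percentage, the share of allocations on one node that
# were intended for another node (numa_miss / (numa_hit + numa_miss)).
numa_miss_percent() (
    stat_file="${1:-/sys/devices/system/node/node0/numastat}"
    awk '$1 == "numa_hit"  { hit = $2 }
         $1 == "numa_miss" { miss = $2 }
         END { if (hit + miss > 0) printf "%.1f\n", 100 * miss / (hit + miss) }' "$stat_file"
)
```

A value that keeps growing over time suggests that processes are frequently served from a node other than the one they asked for, which is a hint to review the NUMA binding of your MariaDB instance.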