BIOS power management optimization
By default, new servers are configured for low power consumption (green energy). That's great for the environment! However, reducing electrical power also means reducing computing power.
The CPU is the component that most directly affects performance, because of the time it needs to wake up from its sleep states. CPU power management should therefore be disabled as much as possible. However, the available options vary depending on the hardware manufacturer.
The effects of that power management are simple to describe: if processor activity drops low enough, a standard action that normally takes 1 second can take more than 2 seconds. During that time, CPU power management techniques are changing the CPU state (lowering the frequency, lowering the voltage, deactivating some subsystems, and so on).
When the workload increases, the subsystems are reactivated, but waking up from a C-State implies some latency (milliseconds to seconds) before the CPU is back at maximum performance.
C-States
C-States are idle CPU states. C0 is the working state, while C1 is the first idle level of the processor. No instructions are executed during the C1 state, and CPU manufacturers have introduced deeper C-States to further reduce power consumption during idle periods; the deeper the state, the longer the wake-up latency.
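If the cpuidle driver is loaded (we'll come back to it later in this chapter), you can get a quick overview of the C-States your CPU exposes and their wake-up latencies. This is a minimal sketch assuming the standard sysfs layout on core 0; state names and values vary by CPU model:
> grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency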
P-States
P-States are different levels of the operation state of the processor. Each level differs from the others by different working voltages/frequencies. For example, at P0, a processor can run at 3 GHz, but at P1 it can run at only 1.5 GHz. Voltage is also scaled to reduce power consumption.
In a P-State, the CPU is working, performing operations and executing instructions. This is not an idle state.
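If the cpufreq subsystem described in the following sections is enabled, you can watch the CPU move between P-States by reading its current frequency from sysfs. A quick illustration, assuming core 0 (the value is expressed in kHz, so 3000000 corresponds to a 3 GHz P-State):
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq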
Manufacturer name options
Depending on the CPU manufacturer, the power management technologies have different names. Here is a list of features that should be disabled. For Intel, they are called:
- SpeedStep
- Turbo Boost
- C1E
- QPI Power Management
For AMD, the functionalities are as follows:
- PowerNow!
- Cool'n'Quiet
- Turbo Core
Some manufacturers hide extra power management features behind keyboard tricks (for example, on HP hardware, pressing Ctrl + A in the BIOS reveals an additional Service Options menu).
Note
In most cases, it's recommended to look at the BIOS manufacturer's documentation to see what should be turned on or off.
Power management optimization
Most of the following power management commands may not work in virtual machines, so you should use a physical machine to test and run them.
cpufreq
cpufreq allows the OS to control P-States. This means that the OS can lower the frequency of an idle CPU to reduce power consumption.
To get cpufreq information on a specific core, look at the cpufreq folder by running the following command:
> ls -1 /sys/devices/system/cpu/cpu0/cpufreq
affected_cpus
bios_limit
cpuinfo_cur_freq
cpuinfo_max_freq
cpuinfo_min_freq
cpuinfo_transition_latency
freqdomain_cpus
related_cpus
scaling_available_frequencies
scaling_available_governors
scaling_cur_freq
scaling_driver
scaling_governor
scaling_max_freq
scaling_min_freq
scaling_setspeed
When enabled in the BIOS, the cpufreq drivers are loaded and the cpufreq directory in sysfs is available. You can look at the governor in use (the power management mechanism) using the following command:
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
If the BIOS settings are disabled, no drivers are loaded and no frequency scaling occurs. That means your server works at maximum performance. However, you can also tune the scaling governor for performance purposes.
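For instance, to switch a single core to the performance governor by hand (a temporary change that is lost on reboot, assuming the cpufreq driver is loaded):
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor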
To always get the maximum performance without disabling the options in the BIOS, we'll install a package that will configure all the cores on your machine:
aptitude install cpufrequtils
cp /usr/share/doc/cpufrequtils/examples/cpufrequtils.loadcpufreq.sample /etc/default/cpufrequtils
Then, edit the configuration file (/etc/default/cpufrequtils) and set the new configuration:
# /etc/default/loadcpufreq sample file
#
# Use this file to override the CPUFreq kernel module to
# be loaded or disable loading at all
ENABLE=true
FREQDRIVER=performance
You can get all the available governors with the following command:
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
ondemand performance
Then, you can restart the cpufrequtils service, and the governor of each core will be updated to performance.
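A quick way to apply and verify the change (assuming the init script shipped with the Debian package):
/etc/init.d/cpufrequtils restart
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor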
cpuidle
cpuidle allows the OS to control the CPU C-States (that is, how the CPU goes into an idle/sleep state). Depending on the CPU manufacturer and model, several C-States are available. The standard ones are C0 and C1: C0 is a running state, while C1 is an idle state.
Even if C-States have been disabled in the BIOS settings, the cpuidle driver can still be loaded and used. To look at the loaded driver, run the following command:
> cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
The intel_idle driver handles more C-States and aggressively puts the CPU into lower idle modes. Since waking up from the deeper C-States incurs significant latency, this can affect performance.
When the intel_idle driver is loaded, specific cpuidle configurations are available for each CPU:
> ls -1R /sys/devices/system/cpu/cpu0/cpuidle/
/sys/devices/system/cpu/cpu0/cpuidle/:
state0
state1
state2
state3

/sys/devices/system/cpu/cpu0/cpuidle/state0:
desc
latency
name
power
time
usage
[...]
/sys/devices/system/cpu/cpu0/cpuidle/state3:
desc
latency
name
power
time
usage
Each of the C-States is described here, along with its wake-up latency. To find out how long a C-State takes to wake up, check the latency file by running the following command:
> cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
150
In the preceding output, the wake-up latency of C-State 3 is 150 microseconds (the latency file is expressed in µs). To avoid having all the C-States enabled, change the grub boot configuration (in /etc/default/grub) and add the following option:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0"
To make it work, update the grub configuration and reboot:
update-grub
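After the reboot, you can verify that intel_idle is out of the picture. With max_cstate=0, the driver is not used and the kernel falls back to another idle mechanism, so the following should no longer report intel_idle:
> cat /sys/devices/system/cpu/cpuidle/current_driver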
Disk and filesystem optimization
For disks and filesystems, there are multiple factors that can slow down your MariaDB instance:
- Magnetic drives' rotation per minute
- Magnetic drives with data at the beginning of the disk
- Partitions not aligned to the disk
- Small partitions at the end of the disk
- Disk bus speed
- Magnetic drives' seek time
- Active SWAP partitions
Some of these factors can only be resolved by changing the hardware, but others can be changed by tuning the operating system.
Kernel disk I/O schedulers
The kernel I/O scheduler determines how read and write requests are submitted to the disk. There are three schedulers to choose from. You can pick a disk and look at the scheduler currently in use with the following command:
# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
The I/O scheduler used here is Completely Fair Queuing (CFQ).
The noop scheduler queues requests in the order they are submitted to the I/O subsystem.
The deadline scheduler prevents excessive seek movement by serving I/O requests that are close to the disk head's current location. This is the best solution for SSDs.
The CFQ scheduler is the default on most Linux distributions. The goal of this scheduler is to minimize seek head movement. This is the best solution for magnetic disks if there is no other mechanism in between (such as a RAID controller, Fusion-io, and so on). For SSDs, you have to use the deadline scheduler.
To change the disk I/O scheduler to deadline, use the following command:
echo deadline > /sys/block/device/queue/scheduler
Replace device with the actual device name (such as sda).
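You can then confirm the change; the active scheduler is the one shown in brackets (here, assuming sda):
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq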
Another solution that avoids changing the disk I/O scheduler manually is to install sysfsutils using the following command:
aptitude install sysfsutils
Then, you have to configure it in /etc/sysfs.conf:
block/sdb/queue/scheduler = deadline
Easy to use and understand, sysfsutils is a service that applies changes in /sys automatically (as there is no sysctl equivalent for /sys).
Now, it could be a problem if you have a lot of disks in your machine and want to set the same I/O scheduler on all the devices. In that case, simply change the grub boot settings (/etc/default/grub) with the elevator option:
GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"
To make the preceding setting work, update the grub configuration and reboot:
update-grub
If you want to go further, each I/O scheduler has several tunables, and there is no single optimal configuration. For example, with MyISAM, you may need to increase nr_requests to multiply the throughput. You have to test them and find the configuration that best matches your needs.
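As an illustration, nr_requests is a sysfs tunable like any other; 1024 here is a hypothetical value that you should benchmark before adopting it:
echo 1024 > /sys/block/sda/queue/nr_requests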
The latest versions of CFQ automatically detect whether the disk is magnetic or not and adapt themselves, which may make changing the elevator unnecessary. You can find all the related information on the Linux kernel website (http://www.kernel.org).
Partition alignment
The goal of partition alignment is to match logical partition blocks with physical blocks to limit the number of disk operations. The first partition should begin at disk sector 2048. This can be done automatically if you're using the parted command. First of all, install the package:
aptitude install parted
Here is an example:
device=/dev/sdb
parted -s -a optimal $device mklabel gpt
parted -s -a optimal $device mkpart primary ext4 0% 100%
In the preceding example, we set /dev/sdb as the disk device, then created a gpt partition table, and finally created a single partition spanning the full disk. 0% means the beginning of the disk (which in fact starts at 2 MiB, that is, sector 2048) and 100% means the end. The optimal option requests the best partition alignment to get the best performance.
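parted can also verify the result for you. The following checks whether partition 1 on the device created above is optimally aligned:
parted /dev/sdb align-check optimal 1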
SSD optimization
Starting from kernel version 2.6.33, you can enable TRIM support. Btrfs, Ext4, JFS, and XFS are optimized for TRIM when you activate this option. The TRIM feature tells the SSD which blocks of data are no longer in use so that they can be wiped internally. It allows the SSD to handle the garbage collection overhead that would otherwise slow down future operations on those blocks.
Ext4 is one of the best solutions for high performance. To enable TRIM on it, modify your fstab configuration (/etc/fstab) to add the discard option:
/dev/sda2 / ext4 rw,discard,errors=remount-ro 0 1
Now, remount your partition to enable TRIM support for Ext4:
mount -o remount /
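If you would rather avoid the small per-deletion cost of the discard mount option, an alternative (assuming your util-linux version ships it) is to trim the filesystem in batch, for example from a cron job:
fstrim -v /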
On LVM, you can also enable TRIM for all the logical volumes by changing the issue_discards option in your LVM configuration file (/etc/lvm/lvm.conf):
issue_discards = 1
Finally, we want to limit needless writes to the SSD. This can be done by placing temporary folders in RAM using the tmpfs filesystem. To achieve this, edit the fstab file (/etc/fstab) and add the following three lines:
tmpfs /tmp tmpfs defaults,noatime,mode=1777 0 0
tmpfs /var/lock tmpfs defaults,noatime,mode=1777 0 0
tmpfs /var/run tmpfs defaults,noatime,mode=1777 0 0
Mount the preceding partitions to make them active.
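For example, using the fstab entries above:
mount /tmp
mount /var/lock
mount /var/run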
Tip
On Debian, you do not need to change /etc/fstab; you can enable tmpfs in /etc/default/tmpfs instead.
Filesystem options
Several kinds of filesystems exist, and their performance generally depends on usage. For MariaDB, I've performed several tests with XFS. My conclusion is the same as what you can find on most specialized websites on the Internet: XFS is a good solution, but Ext4 is slightly faster.
On Ext4, you can add several interesting options to limit writes to the disk. You can, for example, disable the access time on all files and folders. This avoids writing last-access-time information for any file accessed on the partition. As MariaDB frequently accesses the same files, their access times would otherwise be updated on each MariaDB operation (insert/update/delete), which consumes disk I/O.
This could be a problem in some cases (for example, if you absolutely need access time updates), but most of the time, it can be disabled by adding the following options to the fstab configuration (/etc/fstab):
/dev/sda2 / ext4 rw,noatime,nodiratime,data=writeback,discard 0 1
On a high disk I/O system, this will reduce disk accesses significantly.
You may also have noticed that we used data=writeback. This option means that only metadata writes are journaled. It works well with InnoDB and is safe. Why? Because InnoDB has its own transaction logs, so there is no need to journal the same action twice. This is the fastest solution, but if you prefer a safer one, you can use data=ordered instead, which writes data before metadata.
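One caveat worth hedging, as the exact behavior has varied between kernel versions: the data= mode of an already-mounted root filesystem usually cannot be changed with a simple remount. You can instead store it as a default mount option in the superblock:
tune2fs -o journal_data_writeback /dev/sda2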
Another interesting filesystem performance solution is to separate the Ext4 journal from the data disk (since, with journaling, the filesystem writes data twice). Place the journal on a separate fast drive such as an SSD. By default, the journal occupies between 2.5 percent and 5 percent of the filesystem size. It's suggested to keep the size at a minimum for performance (it could be reduced for very large data sizes).
First of all, check your current filesystem block size (here, /dev/mapper/vg-home):
> dumpe2fs /dev/mapper/vg-home | grep "^Block"
dumpe2fs 1.42.5 (29-Jul-2012)
Block count: 1327104
Block size: 4096
Blocks per group: 32768
Here, we've got a 4096-byte block size, and the journal needs to have the same block size as well.
To dedicate a journal to an existing partition, we need to unmount it to be sure there is no access. Then, remove the current journal from the partition, create the journal on the dedicated device (partition size * 5 / 100), attach it to the desired partition, and remount it:
> umount /home
> dumpe2fs /dev/mapper/vg-home | grep "Journal"
Journal inode: 8
Journal backup: inode blocks
Journal features: (none)
Journal size: 128M
Journal length: 32768
Journal sequence: 0x0000002f
Journal start: 0
> tune2fs -f -O ^has_journal /dev/mapper/vg-home
> mke2fs -O journal_dev -b 4096 /dev/sdb1
> tune2fs -j -J device=/dev/sdb1 /dev/mapper/vg-home
> mount /home
Now, check your partition to see whether the journal is located on the other device:
> dumpe2fs /dev/mapper/vg-home | grep "Journal"
Journal UUID: 8a3c6cec-2d45-4aa9-ac2f-4a181093a92e
Journal device: 0x0811
Journal backup: inode blocks
To locate it, use the following command:
> blkid | grep 8a3c6cec-2d45-4aa9-ac2f-4a181093a92e
/dev/mapper/vg-home: UUID="6b8f2604-e1ac-4bea-a5c9-e7acf08cec8c" TYPE="ext4" EXT_JOURNAL="8a3c6cec-2d45-4aa9-ac2f-4a181093a92e"
/dev/sdb1: UUID="8a3c6cec-2d45-4aa9-ac2f-4a181093a92e" TYPE="jbd"
As you now have a dedicated journal for your partition, add one more option to /etc/fstab: journal_async_commit. The advantage is that the commit block can be written to disk without waiting for the descriptor blocks. This option will boost performance. The code is as follows:
/dev/mapper/vg-home /home ext4 rw,noatime,nodiratime,data=writeback,discard,journal_async_commit 0 2
Another option exists for Ext4: barrier=0. It will boost performance as well, but do not use it on a standalone server: it delays journal data writes, and you may not be able to recover your data if your system crashes. Only use barrier=0 if you're using a RAID card with a BBU (battery backup unit).
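As a sketch, a complete fstab line for such a battery-backed setup could combine all the options seen so far (only do this with a BBU-protected RAID card):
/dev/mapper/vg-home /home ext4 rw,noatime,nodiratime,data=writeback,discard,journal_async_commit,barrier=0 0 2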
Tip
The Linux kernel evolves very quickly: XFS gains new options, new filesystems appear, and Ext4 may not be the best solution in all cases. You should stay up to date with kernel-related news and test your own use cases.
SWAP
As swap lives on a physical disk (magnetic or SSD), it's slower than RAM. By default, Linux is inclined to swap for several reasons. To avoid your MariaDB data being swapped out to disk instead of staying in RAM, you have to play with a kernel parameter called swappiness.
The swappiness value changes the balance between swapping out runtime memory and dropping pages from the system page cache. The higher the value, the more the system will swap; the lower the value, the less it will swap. The maximum value is 100, the minimum is 0, and 60 is the default. To make this parameter persistent, add this line to your sysctl.conf file (/etc/sysctl.conf):
vm.swappiness = 0
To apply this value to the running system without a reboot, you can launch the following command:
sysctl -w vm.swappiness=0
And now check the value to be sure it has been applied:
> sysctl vm.swappiness
vm.swappiness = 0
Dedicating hardware with cgroups
The Linux kernel brings a feature called cgroups (available since version 2.6.24) that permits the isolation of processes from one another. If we want to dedicate CPU, RAM, or disk I/O, we can use cgroups to do it (they also provide other interesting features if you want to go further). With this solution, you can be sure to dedicate hardware to your MariaDB instance.
To start using cgroups, we must first prepare the environment: cgroups needs a specific folder hierarchy to work, and you'll see its advantages when we use it. So, edit the fstab file (/etc/fstab) to mount cgroups at each machine startup, and add the following line:
cgroup /sys/fs/cgroup cgroup defaults 0 0
Mount cgroup now to make cgroups available:
mount /sys/fs/cgroup
To get all the CPU and memory features enabled, you need to change the grub configuration by adding two new options in /etc/default/grub (cgroup_enable and swapaccount):
GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"
Then, update your grub settings and reboot:
update-grub
After the machine has rebooted, you can check whether your cgroup hierarchy exists:
> mount | grep ^cgroup
cgroup on /sys/fs/cgroup type cgroup (rw,relatime,perf_event,blkio,net_cls,freezer,devices,memory,cpuacct,cpu,cpuset)
Manual solution
Let's create our first cgroup, the MariaDB one! Create a folder with a name of your choice in the cgroup folder:
mkdir /sys/fs/cgroup/mariadb_cgroup
If we now look at the mariadb_cgroup content, we can see all the limitations that the cgroup features are able to offer:
> ls -1 /sys/fs/cgroup/mariadb_cgroup/
[...]
cpuset.cpu_exclusive
cpuset.cpus
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.mems
[...]
tasks
You can see that there's a lot of stuff! OK, now let's look at your processor information to see how many cores you've got:
> cat /proc/cpuinfo | grep ^processor
processor : 0
processor : 1
processor : 2
processor : 3
I can see that I've got four cores available on this machine. Let's say I want to dedicate two of them to my MariaDB instance. The first thing to do is assign two cores to the mariadb_cgroup cgroup:
echo 2,3 > /sys/fs/cgroup/mariadb_cgroup/cpuset.cpus
You can set multiple cores separated by commas, or use the minus character for a CPU range (for example, 0-3 to use cores 0 through 3).
Here, I've just asked the cgroup to be bound to the last two cores. That means this cgroup is only able to use those two cores, but it doesn't mean it's the only one able to use them: those cores are still shared with other processes. To dedicate them to this cgroup, simply use the following command:
echo 1 > /sys/fs/cgroup/mariadb_cgroup/cpuset.cpu_exclusive
You can check the configuration of your cgroup simply with cat:
> cat /sys/fs/cgroup/mariadb_cgroup/cpuset.cpu*
1
2-3
We also need to specify the memory nodes that the tasks will be allowed to access. First, let's take a look at the available memory nodes:
> numactl --hardware | grep ^available
available: 1 nodes (0)
Then, set the desired memory node (here, 0):
echo 0 > /sys/fs/cgroup/mariadb_cgroup/cpuset.mems
Now, the cgroup is ready to dedicate cores to a process ID:
echo $(pidof mysqld) > /sys/fs/cgroup/mariadb_cgroup/tasks
That's it! If you want to be sure that you've configured your cgroup correctly, you can add another PID to that cgroup, make it saturate the two cores, and check with the top or htop command, for example.
You can check your configuration using a PID in the following way:
> cat /proc/$(pidof mysqld)/status | grep _allowed
Cpus_allowed: c
Cpus_allowed_list: 2-3
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
Automatic solution using the cgconfig daemon
It's preferable to master the manual solution before moving on to the automatic one, so you can check whether your configuration works as expected.
Now, if you want this to be enabled at boot and automatically configured, you will need to use the cgconfig daemon. It loads a configuration and then watches all launched processes; if one matches the configured rules, they are automatically applied.
To get cgconfig, you'll need to install the following packages:
aptitude install cgroup-bin daemon
The cgroup-bin package in Debian Wheezy is a little young, so we need to manually set up the init file and the configuration from the package documentation.
Unfortunately, you need to do a little hack with the init skeleton file to be able to use the update-rc.d command for the cgconfig services, because the original init files are not 100 percent Debian-compliant yet:
cd /etc/init.d
cp skeleton cgconfig
cp skeleton cgred
chmod 755 cgconfig cgred
sed -i 's/skeleton/cgconfig/' cgconfig
sed -i 's/skeleton/cgred/' cgred
update-rc.d cgconfig defaults
update-rc.d cgred defaults
cd /usr/share/doc/cgroup-bin/examples/
cp cgred.conf /etc/default/
cp cgconfig.conf cgrules.conf /etc/
gzip -d cgconfig.gz
cp cgconfig cgred /etc/init.d/
cd /etc/init.d/
sed -i 's/sysconfig/defaults/' cgred cgconfig
sed -i 's/\/etc\/rc.d\/init.d\/functions/\/lib\/init\/vars.sh/' cgred
sed -i 's/--check/--name/' cgred
sed -i 's/killproc.*/kill $(cat $pidfile)/' cgred
sed -i 's/touch "$lockfile"/test -d \/var\/lock\/subsys || mkdir \/var\/lock\/subsys\n\t&/' cgconfig
chmod 755 cgconfig cgred
Along the way, we've updated a Red Hat path to a Debian one (sysconfig to defaults), changed the folder that stores the daemon's lock file, and modified the default cgred init script to correct some bugs.
Regarding the configuration files, let's start with /etc/cgconfig.conf:
#
# Copyright IBM Corporation. 2007
#
# Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
# This program is free software; you can redistribute it and/or modify it
# under the terms of version 2.1 of the GNU Lesser General Public License
# as published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
group mariadb_cgroup {
    perm {
        admin {
            uid = mysql;
        }
        task {
            uid = mysql;
        }
    }
    cpuset {
        cpuset.mems = 0;
        cpuset.cpus = "2,3";
        cpuset.cpu_exclusive = 1;
    }
}
Here, we've got a cgroup named mariadb_cgroup. When the mysql user launches a process, the cpuset configuration will be applied to it. In the same way as in the manual method, we've limited the mysql user's processes to the last two cores (2 and 3).
The last thing to configure is the cgrules.conf file (/etc/cgrules.conf), which indicates which process belongs to which cgroup. You need to add the mysql user, the controller it is allowed to use (cpu), and the name of the cgroup folder where its processes should be placed:
mysql cpu mariadb_cgroup/
Of course, you can check your configuration in /sys/fs/cgroup whenever you want.
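You can also verify which cgroup a running process ended up in (a quick sanity check, assuming mysqld is running):
> cat /proc/$(pidof mysqld)/cgroup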
When you've finished configuring your cgroup and want the new configuration to be active, restart the services in the following order:
/etc/init.d/cgred stop
/etc/init.d/cgconfig stop
umount /sys/fs/cgroup 2>/dev/null
rmdir /sys/fs/cgroup/* /sys/fs/cgroup 2>/dev/null
mount /sys/fs/cgroup
/etc/init.d/cgconfig start
/etc/init.d/cgred start
Dedicating hardware optimization with NUMA
With large InnoDB databases (roughly 32 GB and above), it becomes important to take a look at this kind of optimization.
In the old/classic Uniform Memory Architecture (UMA), all the memory was shared among all the processors with equal access. There was no affinity, and performance from any core to the memory bank was uniform. With the Non-Uniform Memory Access (NUMA) architecture (starting with Intel Nehalem and AMD Opteron), this is totally different:
Each core has a local memory bank, which gives closer access and thus reduces latency. Of course, the whole system is visible as one unit, but optimization can be done to restrict a processor to its local memory bank. Without NUMA optimization, a core can ask for memory outside its local bank, which increases latency and lowers overall performance.
By default, Linux automatically knows when it runs on a NUMA architecture and performs the following kind of operations natively:
- Collects hardware information to understand the running architecture
- Binds the correct memory module to the local core it belongs to
- Splits physical processors to nodes
- Collects cost information regarding inter-node communication
To look at the NUMA hardware on a running system, you can use the numactl command (install it first if it's not present):
> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 65490 MB
node 0 free: 56085 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 65536 MB
node 1 free: 35150 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
We can see two different nodes here that indicate two different physical CPUs and their physically allocated RAM.
The node distances represent the cost of interconnect access: the weight for node 0 to access its local bank is 10, while accessing node 1 costs 20. The same constraint applies for node 1 accessing node 0.
You can see the NUMA policy and information using the following command:
> numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
cpubind: 0 1
nodebind: 0 1
membind: 0 1
Now, if you want to launch a process bound to specific CPUs, use the following command:
numactl --physcpubind=0,1 <command>
To also allocate memory from the bound CPU's local NUMA node, use the following command:
numactl --physcpubind=0 --localalloc <command>
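For a MariaDB instance whose InnoDB buffer pool exceeds a single node's memory bank, another commonly used approach (an option to validate for your own workload, not an official MariaDB recommendation) is to interleave allocations across all the nodes, so that one node doesn't start swapping while another still has free memory:
numactl --interleave=all /usr/sbin/mysqld ...
Here, /usr/sbin/mysqld is an assumed daemon path; adapt it (or the mysqld_safe wrapper) to your installation.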
Now, if you want to get statistics and see how your NUMA system behaves (how often the interleave has been hit, and so on), use the numastat command:
> numastat
node0 node1
numa_hit 407467513 656867541
numa_miss 0 0
numa_foreign 0 0
interleave_hit 32442 32470
local_node 407037248 656824235
other_node 430265 43306