Tools and techniques for performance tuning in Linux

An Example

Linux provides quick allocation and deallocation of frequently used objects in caches called "slabs." To provide better performance, Christoph Lameter introduced a new slab allocator called SLUB.

However, we found that hackbench, a scheduler performance benchmark, shows a big difference in run time on kernel 2.6.24/2.6.25-rc between a system with 16 CPU cores and a system with eight CPU cores. Hackbench is expected to run faster on the 16-core system than on the 8-core system, but in our tests the 16-core machine required three times more run time than the 8-core machine, which indicates a possible performance issue.

The vmstat utility provides the output shown in Listing 6.

Listing 6

Starting with vmstat

procs  ------------memory----------- ---swap-- ----io----- --system--- ------cpu------
  r  b   swpd     free   buff  cache   si   so    bi    bo   in     cs us  sy id wa st
360  0      0 15730644  17980 120336    0    0     0     0  320 140047  0 100  0  0  0
327  0      0 15739216  17980 120336    0    0     0     0  322 256259  1  99  0  0  0
412  0      0 15743084  17988 120336    0    0     0    16  282  74537  0 100  0  0  0
421  0      0 15741076  17988 120336    0    0     0     0  311  51750  0 100  0  0  0
334  0      0 15745048  17988 120332    0    0     0     0  295  95434  0 100  0  0  0
468  0      0 15747460  17988 120336    0    0     0     0  251  94440  0 100  0  0  0
373  0      0 15750844  17988 120336    0    0     0     0  268 104569  0 100  0  0  0

Notice the high context switch (cs) count and the large number of runnable processes (r). In this case, hackbench simulates many chat rooms with a large number of users passing messages back and forth in each room. The idle (id) column stays at 0 while the system (sy) column sits at 99 to 100, which indicates that the CPUs are fully busy and spend nearly all of their time in kernel code.
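
The reading of the vmstat output described above can also be automated. The following minimal sketch averages the r, cs, sy, and id columns over a few samples; the function name summarize_vmstat and the shortened SAMPLE text are illustrative assumptions, not part of vmstat itself.

```python
def summarize_vmstat(text):
    """Average the r, cs, sy, and id columns of vmstat output."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # Column-name line: remember the names for the rows that follow.
            header = fields
        elif header and fields[0].isdigit() and len(fields) == len(header):
            rows.append(dict(zip(header, map(int, fields))))
    if not rows:
        raise ValueError("no vmstat sample rows found")
    return {col: sum(row[col] for row in rows) / len(rows)
            for col in ("r", "cs", "sy", "id")}


# Two sample rows copied from Listing 6 (whitespace condensed).
SAMPLE = """\
 r b swpd free buff cache si so bi bo in cs us sy id wa st
360 0 0 15730644 17980 120336 0 0 0 0 320 140047 0 100 0 0 0
327 0 0 15739216 17980 120336 0 0 0 0 322 256259 1 99 0 0 0
"""

print(summarize_vmstat(SAMPLE))
```

A run queue in the hundreds combined with an idle average of 0 is the pattern that prompted the closer look with oprofile.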

The next step is to use oprofile to find out where the CPU is spending its time. The oprofile data in Listing 7 shows that about 88% of the CPU time is spent in allocating slabs, adding to partially filled slabs, and freeing slabs. The benchmark generates a large number of messages that must be allocated, passed between processes, and freed, so memory management is where the system spends most of its time.

Listing 7

Studying CPU Usage with oprofile

CPU: Core 2, speed 1602 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples   %        image name        app name          symbol name
46746994  43.3801  linux-2.6.25-rc4  linux-2.6.25-rc4  __slab_alloc
45986635  42.6745  linux-2.6.25-rc4  linux-2.6.25-rc4  add_partial
2577578    2.3919  linux-2.6.25-rc4  linux-2.6.25-rc4  __slab_free
1301644    1.2079  linux-2.6.25-rc4  linux-2.6.25-rc4  sock_alloc_send_skb
1185888    1.1005  linux-2.6.25-rc4  linux-2.6.25-rc4  copy_user_generic_string
969847     0.9000  linux-2.6.25-rc4  linux-2.6.25-rc4  unix_stream_recvmsg
806665     0.7486  linux-2.6.25-rc4  linux-2.6.25-rc4  kmem_cache_alloc
731059     0.6784  linux-2.6.25-rc4  linux-2.6.25-rc4  unix_stream_sendmsg
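
The figure of about 88% quoted for Listing 7 is the sum of the shares of __slab_alloc, add_partial, and __slab_free. As a rough sketch, the sum can be reproduced from the report text; the helper name slab_share and the choice of which symbols to count are assumptions made for this illustration.

```python
SLAB_SYMBOLS = {"__slab_alloc", "add_partial", "__slab_free"}


def slab_share(report, symbols=SLAB_SYMBOLS):
    """Sum the percentage column for the given symbol names."""
    total = 0.0
    for line in report.splitlines():
        fields = line.split()
        # Sample rows have five fields: samples, %, image, app, symbol.
        if len(fields) == 5 and fields[0].isdigit() and fields[4] in symbols:
            total += float(fields[1])
    return total


# The first rows of Listing 7.
REPORT = """\
46746994 43.3801 linux-2.6.25-rc4 linux-2.6.25-rc4 __slab_alloc
45986635 42.6745 linux-2.6.25-rc4 linux-2.6.25-rc4 add_partial
2577578   2.3919 linux-2.6.25-rc4 linux-2.6.25-rc4 __slab_free
1301644   1.2079 linux-2.6.25-rc4 linux-2.6.25-rc4 sock_alloc_send_skb
"""

print(f"{slab_share(REPORT):.2f}% of samples in slab management")
```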

This result indicates the need to take a closer look at what is going on with the slabs. A utility called slabinfo provides a report on slab activity. (The source code for the slabinfo utility is with the kernel source under Documentation/vm/slabinfo.c.) To obtain information about the most actively used objects, invoke the slabinfo utility (see Listing 8).

Listing 8

slabinfo

# slabinfo -AD
Name            Objects     Alloc      Free  %Fast
:0000192           3428  80093958  80090708  92   8
:0000512            374  80016030  80015715  68   7
vm_area_struct     2875    224524    221868  94  20
:0000064          12408    134273    122227  98  47
:0004096             24    127397    127395  99  98
:0000128           4596     57837     53432  97  48
dentry            15659     51402     35824  95  64
:0000016           4584     29327     27161  99  76
:0000080          12784     33674     21206  99  97
:0000096           2998     26264     23757  99  93

The objects of size 192 and 512 bytes are the ones hackbench uses most actively for its messages: One is for the socket buffer header and one is for the message body.

Basically, the SLUB implementation keeps a per-cpu cache for each slab type. When the kernel allocates an object, it checks the per-cpu cache first without locking. Such an allocation is very fast and is called a fast path. If the per-cpu cache has no free objects, the kernel allocates from shared pages with a lock, which is slow. The more often the slow path is taken, the more lock contention occurs. The free procedure also has a fast path and a slow path. Because free uses a distributed lock (page lock) and the allocation process uses more exclusive locks, allocation by fast path is more important.

For these two objects, we noted that the free operation is quite slow, and allocation is not fast, either. For example, for objects of size 512, only 68% of allocations and 7% of frees take the fast path.
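
To get a feel for what these percentages mean in absolute terms, the slow path counts can be estimated from the Alloc, Free, and %Fast columns of Listing 8. This is a back-of-the-envelope sketch; the helper name slow_path_ops is an assumption, and the result is only as precise as the rounded percentages that slabinfo prints.

```python
def slow_path_ops(total_ops, fast_percent):
    """Estimate how many operations did not take the fast path."""
    return round(total_ops * (100 - fast_percent) / 100)


# (allocs, frees, % fast allocs, % fast frees) copied from Listing 8.
CACHES = {
    ":0000192": (80093958, 80090708, 92, 8),
    ":0000512": (80016030, 80015715, 68, 7),
}

for name, (allocs, frees, fast_alloc, fast_free) in CACHES.items():
    print(name,
          "slow allocs:", slow_path_ops(allocs, fast_alloc),
          "slow frees:", slow_path_ops(frees, fast_free))
```

For the 512-byte objects, this works out to roughly 25 million allocations that had to take a lock.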

To reduce slow path allocations, we can ask for bigger slabs, which increases the number of objects available to the per-cpu cache. We raise slub_max_order from its default of 1 to 3 and set slub_min_objects to 32 by adding slub_max_order=3 slub_min_objects=32 to the kernel boot command line. The slub_min_objects parameter sets the minimum number of objects that must fit into one slab for a slab size to be acceptable, and slub_max_order limits how large a slab can grow to meet that minimum. With more objects per slab, the kernel is less likely to allocate objects by slow path.
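
The interaction between the two parameters can be illustrated with a deliberately simplified model. The sketch below assumes a 4096-byte page and ignores per-slab metadata, alignment, and the waste and fallback heuristics of the real SLUB code; the function names are hypothetical and do not correspond to kernel functions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def objects_per_slab(object_size, order, page_size=PAGE_SIZE):
    """Objects that fit into a slab of 2**order pages (simplified model)."""
    return (page_size << order) // object_size


def smallest_order(object_size, min_objects, max_order, page_size=PAGE_SIZE):
    """Lowest order that holds min_objects, or None if max_order is too small."""
    for order in range(max_order + 1):
        if objects_per_slab(object_size, order, page_size) >= min_objects:
            return order
    return None


# With a maximum order of 1, a slab holds only 16 objects of 512 bytes,
# so a minimum of 32 objects cannot be met in this model.
print(objects_per_slab(512, 1), smallest_order(512, 32, max_order=1))

# With a maximum order of 3, the model settles on order 2 (four pages).
print(smallest_order(512, 32, max_order=3))
```

In this model, a minimum of 32 objects of 512 bytes only becomes reachable once the maximum order is raised above 1, which is why both parameters are set together.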

This step improved the throughput significantly, requiring just one tenth the time needed in the previous test. By extensive testing with different slub_min_objects settings, we found a correlation between slub_min_objects and the number of CPUs.

In most cases, we get the best result with slub_min_objects=cpu_number*2. Setting slub_min_objects to a larger value does not provide much further improvement.
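
As a small illustration of this rule of thumb, the following sketch builds the boot parameters for a given CPU count. The function name suggested_boot_params is hypothetical, and keeping slub_max_order at 3 for every machine size is an assumption carried over from the 16-core test, not a tested recommendation.

```python
import os


def suggested_boot_params(cpu_count, max_order=3):
    """Build boot parameters with slub_min_objects = cpu_count * 2."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return f"slub_max_order={max_order} slub_min_objects={cpu_count * 2}"


# os.cpu_count() can return None, so fall back to 1 in that case.
print(suggested_boot_params(os.cpu_count() or 1))
```

For the 16-core test machine, the rule yields slub_min_objects=32, the value used in the test above.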

At this point, we went back to the 8-core machine and did extensive testing to confirm our findings. After we discussed the problem with the SLUB maintainers, a patch that scales slub_min_objects as a function of the number of CPU cores was merged into the Linux kernel.

Conclusions

In this article, we provided a quick tour of some useful tools for diagnosing common performance issues. Of course, this brief introduction is not intended as a comprehensive description of the performance tuning craft, but it should provide you with a good starting point for discovering and fixing performance bottlenecks on your Linux systems.

Power Performance

Power consumption is another aspect of system performance. Most recent processors are equipped with processor performance states (P-states) and sleep states (C-states). If the system is not fully loaded, it is better to switch to a P-state that operates the processor at a lower frequency and voltage. If the processor is idle, the system should switch to a sleep state.

To take advantage of these features, make sure the SpeedStep and C-state features are enabled in the BIOS. To take advantage of the P-state feature in the CPU, you need to make sure that a suitable CPU frequency governor is enabled for the system. To see what governors are available, use:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
ondemand userspace performance

With the following command, you can determine the current governor:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

The ondemand governor has the best power-saving characteristics and is typically recommended, whereas the performance governor will put the CPU at the maximum frequency and voltage. To switch to the ondemand governor, issue the following command:

# echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
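
The commands above address cpu0 only. On multi-core systems, the other CPUs typically expose their own scaling_governor files under the same sysfs directory, so a script can check or set them all. The following is a minimal sketch under that assumption; the function names are hypothetical, and writing to the real sysfs files requires root privileges.

```python
import glob
import os

CPU_ROOT = "/sys/devices/system/cpu"


def governor_files(root=CPU_ROOT):
    """Find the scaling_governor file of every cpuN directory."""
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    return sorted(glob.glob(pattern))


def current_governors(root=CPU_ROOT):
    """Map each CPU name (for example, cpu0) to its current governor."""
    result = {}
    for path in governor_files(root):
        cpu = path.split(os.sep)[-3]
        with open(path) as f:
            result[cpu] = f.read().strip()
    return result


def set_governor(name, root=CPU_ROOT):
    """Write the governor name to every CPU and return how many were set."""
    paths = governor_files(root)
    for path in paths:
        with open(path, "w") as f:
            f.write(name)
    return len(paths)


print(current_governors())
```

Calling set_governor("ondemand") as root would then apply the ondemand governor to every CPU that the pattern finds.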

To take advantage of the CPU C-states, you need to enable the tickless idle feature in the kernel. Traditionally, the Linux kernel has a periodic timer tick that wakes up the CPU, which prevents the CPU from staying in a sleep state. With the recent addition of tickless idle, the kernel stops this timer tick while the CPU is idle, which allows the CPU to sleep longer in power-saving mode. If you compile your own kernel, you should enable the option CONFIG_NO_HZ=y.
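
If you run a distribution kernel, you can check whether the option is already set by inspecting the kernel configuration file. The sketch below assumes the configuration is installed as /boot/config-<release>, which is common but not universal; the function names are hypothetical.

```python
import os


def config_enabled(config_text, option):
    """Return True if the option is set to y in kernel config text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == option:
            return value == "y"
    return False


def running_kernel_config_path():
    """Assumed location of the running kernel's configuration file."""
    return f"/boot/config-{os.uname().release}"


path = running_kernel_config_path()
if os.path.exists(path):
    with open(path) as f:
        print("CONFIG_NO_HZ enabled:", config_enabled(f.read(), "CONFIG_NO_HZ"))
else:
    print("No kernel config found at", path)
```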

The PowerTOP utility [3] is a useful tool for checking P-state and C-state status in the system. PowerTOP will show the current P-state and C-state, report on which applications wake up the CPU, and provide additional power-saving hints tailored to your system.

Additional power-saving tips can be found at the Less Watts website [4].

The Author

Tim Chen is a staff engineer of the Open Source Technology Center at Intel Corporation. His current focus is mainly on Linux performance. Before working at Intel, he worked at Trillium Digital Systems on telecommunications systems and at Hughes Space and Communications on mobile satellite systems. He graduated from UCLA in 1995 with a Ph.D. degree in Electrical Engineering.

Alex Shi joined Intel's Open Source Technology Center as a software engineer in 2005. He works on Linux performance and power tuning.

Yanmin Zhang, from the Open Source Technology Center of Intel Corporation, has worked on Linux projects for five years, including processor and chipset enabling for the Intel i386, x86-64, and Itanium architectures and for PCI-Express. He is currently working on the Linux Kernel Performance project. Before joining Intel, Yanmin worked for Bell Labs, Lucent Technologies, on network management system development.

