A Proposed Buddy System Allocator Patch

Donald E. Porter
porterde@cs.utexas.edu
10/29/2007
Updated with subsequent data (see "Subsequent Data" below).

Intuition

The high-level idea is to acquire and release the zone lock on every iteration of a bulk allocate/free rather than holding it for the entire operation. This appears to improve performance to varying degrees in most cases, and harms performance only negligibly in a very few cases.

Judging by the comments in the code and my own experience, the conventional rationale for holding the lock across the entire bulk operation is to maximize throughput. Releasing and re-acquiring the lock may increase the number of cache misses: in the uncontended case, the cache line holding the lock may be evicted, and in the contended case, the entire data structure may ping-pong between CPUs. Thus, from the perspective of a given critical section, throughput is maximized by completing the largest possible amount of work per lock acquisition.

It turns out, however, that this may not be the best approach for minimizing the execution time of parallel applications. For the set of benchmarks below, there is generally a marginal performance improvement when the zone lock is released after each iteration. My intuition is that smaller allocations and frees spend less time waiting behind larger ones, allowing more threads to make progress sooner.

Essentially, then, optimizing for latency at a slight cost to throughput within the zone allocator decreases execution time for these workloads.
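
In the abstract, the change replaces one critical section around the whole bulk operation with one short critical section per page. A schematic sketch (do_one_page() is just a stand-in for the per-page work; the real code is in the patch below):

        /* Before: hold the zone lock across the entire bulk operation. */
        spin_lock(&zone->lock);
        while (count--)
                do_one_page(zone);
        spin_unlock(&zone->lock);

        /* After: one short critical section per page, so that other threads
         * waiting on the zone lock can interleave with the bulk operation. */
        while (count--) {
                spin_lock(&zone->lock);
                do_one_page(zone);
                spin_unlock(&zone->lock);
        }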

The patch

diff -uprN linux-2.6.23.1/mm/page_alloc.c linux-2.6.23.1-opt/mm/page_alloc.c
--- linux-2.6.23.1/mm/page_alloc.c      2007-10-12 11:43:44.000000000 -0500
+++ linux-2.6.23.1-opt/mm/page_alloc.c  2007-10-29 18:29:05.000000000 -0500
@@ -477,19 +477,19 @@ static inline int free_pages_check(struc
 static void free_pages_bulk(struct zone *zone, int count,
                                        struct list_head *list, int order)
 {
-       spin_lock(&zone->lock);
        zone->all_unreclaimable = 0;
        zone->pages_scanned = 0;
        while (count--) {
                struct page *page;
+               spin_lock(&zone->lock);
 
                VM_BUG_ON(list_empty(list));
                page = list_entry(list->prev, struct page, lru);
                /* have to delete it as __free_one_page list manipulates */
                list_del(&page->lru);
                __free_one_page(page, zone, order);
+               spin_unlock(&zone->lock);
        }
-       spin_unlock(&zone->lock);
 }
 
 static void free_one_page(struct zone *zone, struct page *page, int order)
@@ -665,14 +665,17 @@ static int rmqueue_bulk(struct zone *zon
 {
        int i;
        
-       spin_lock(&zone->lock);
        for (i = 0; i < count; ++i) {
-               struct page *page = __rmqueue(zone, order);
+               struct page *page;
+               spin_lock(&zone->lock);
+
+               page = __rmqueue(zone, order);
                if (unlikely(page == NULL))
                        break;
                list_add_tail(&page->lru, list);
+               spin_unlock(&zone->lock);
        }
-       spin_unlock(&zone->lock);
+
        return i;
 }
 

Performance

Simulated Hardware

I originally discovered that this trick would make a difference while trying to tune the performance of TxLinux, a variant of the Linux kernel that uses and supports hardware transactional memory. TxLinux runs on top of MetaTM, a hardware transactional memory simulator that is built as a module for Virtutech Simics.

It turns out that the optimization helps performance with both locks and transactions, hence the motivation for submitting it back to the kernel developers.

Simulation parameters: I simulated 8-, 16-, and 32-CPU SMP (Pentium 4) systems with a clock speed of 1 GHz and an IPC of 1. Each CPU has private L1 and L2 caches. Each L1 cache has 256 64-byte lines, 4-way associativity, and a 0-cycle access time. Each L2 cache has 64K 64-byte lines, 8-way associativity, and a 16-cycle access time. Main memory is 1 GB with a 200-cycle access latency. The hard drive is modeled with a fixed 5.5 ms latency. I realize that the fixed IPC and disk latency are particularly unsatisfying limitations of the simulator.

These comparisons were made using kernel version 2.6.16.1. To exercise the kernel, I ran the following parallel workloads:

Name        Description
find        32 instances of the find command, each in a different directory, searching files from the Linux 2.6.16 kernel for a text string that is not found. Each directory is 4.6--5.0 MB and contains 333--751 files and 144--254 directories.
pmake       Runs make with -j set to twice the number of processors to compile 27 source files totaling 6,031 lines of code from the libFLAC 1.1.2 source tree in parallel.
dpunish     A locally developed micro-benchmark that stresses synchronization in the VFS directory entry cache: parallel lookups and renames across multiple memory-based file systems.
bonnie++    Simulates file system bottleneck activity on Squid and INN servers, stressing create/stat/unlink. 32 instances of 'bonnie++ -d /var/local -n 1', run with disk delay disabled.
MAB         Runs one instance per processor of the Modified Andrew Benchmark, without the compile phase.
config      Runs several parallel instances of the configure script for a large software package, one per processor.

The mean execution time (in seconds), standard deviation, speedup, and 95% confidence intervals are presented below, for 4 runs of each experiment using pseudo-random L2 cache miss perturbation as described by Alameldeen and Wood. Data from asterisked experiments (*) are available but too noisy to be helpful.
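
For reference, the confidence-interval columns below are consistent with the normal-approximation half-width 1.96 * s / sqrt(n); a minimal sketch of that computation:

        #include <math.h>

        /* 95% confidence-interval half-width under the normal approximation:
         * half_width = 1.96 * stdev / sqrt(n), where n is the number of runs. */
        static double ci95_half_width(double stdev, int n)
        {
                return 1.96 * stdev / sqrt((double)n);
        }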

8 CPU
             Mean              Std Dev           95% CI            Speedup
             unmod    opt      unmod    opt      unmod    opt
find*        --       --       --       --       --       --       --
pmake        3.291    3.351    .05      .096     .049     .094     .982
dpunish      5.773    5.793    .044     .082     .044     .080     .997
bonnie++*    --       --       --       --       --       --       --
MAB          5.934    5.916    .009     .014     .009     .013     1.003
config       12.898   12.833   .096     .466     .096     .466     1.005

16 CPU
             Mean              Std Dev           95% CI            Speedup
             unmod    opt      unmod    opt      unmod    opt
find         1.647    1.704    .213     .310     .209     .304     .967
pmake        2.606    2.518    .110     .058     .108     .057     1.035
dpunish      4.436    4.450    .031     .017     .030     .017     .997
bonnie++*    --       --       --       --       --       --       --
MAB          3.499    3.546    .044     .130     .043     .127     .987
config       9.526    9.361    .339     .154     .332     .151     1.018

32 CPU
             Mean              Std Dev           95% CI            Speedup
             unmod    opt      unmod    opt      unmod    opt
find         1.059    1.108    .086     .102     .085     .100     .956
pmake        2.640    2.524    .174     .077     .171     .075     1.046
dpunish      3.630    3.609    .046     .017     .045     .017     1.006
bonnie++*    --       --       --       --       --       --       --
MAB          3.400    3.379    .047     .020     .046     .020     1.006
config       8.985    8.937    .043     .089     .042     .088     1.005

Real Hardware

To gain some confidence that this would also work on real hardware, I ran some experiments on the largest multiprocessor machine I had root access to: a Dell PowerEdge 2900, a 2-chip x 4-core CMP (Intel Xeon X5355, 8 cores total), with each core running at 2.66 GHz. The machine has 8 GB of RAM.

I used the following benchmarks to exercise the kernel on the real hardware.

Name                    Description
MAB                     Same as above.
config                  Same as above.
objdump                 For the 2.6.23.1 kernel image, execute: objdump -d -l vmlinux | grep mov > /dev/null
Kernel compile          Execute 'make -j 16' on the 2.6.23.1 kernel source used for these experiments.
Kernel compile (fast)   Execute 'make -j 16' on the 2.6.16.1 kernel source used for the simulation experiments. It has many fewer Kconfig options selected.
hdparm                  Execute 'hdparm -t /dev/sda1' to check for adverse effects on the I/O processing path. Unlike the others, which report seconds, this reports MB/s.
lmbench                 Ran the default lmbench suite three times for each kernel; results are listed separately (the first three runs are unmod).

The performance data are presented below, for 15 samples on each kernel. All measurements are in seconds, except for hdparm, which is reported in MB per second.
                        Mean              Std Dev           95% CI            Speedup
                        unmod    opt      unmod    opt      unmod    opt
MAB                     2.916    2.872    .103     .090     .052     .046     1.015*
config                  2.350    2.337    .159     .178     .080     .090     1.006
objdump                 23.234   23.057   .255     .393     .180     .199     1.008*
Kernel compile          255.870  258.027  5.861    4.903    2.966    2.481    0.992*
Kernel compile (fast)   37.764   36.030   3.397    2.985    1.719    1.511    1.048*
hdparm (MB/s)           87.543   87.377   .307     1.066    .155     .539     1.002

The starred speedups are statistically significant according to the Wilcoxon signed-rank test.

Conclusions

I propose a very modest change to the synchronization of the Buddy System allocator that helps performance in some cases and does no significant harm in the worst case. Most of these numbers are not clearly "out of the noise," yet the ones that are significant are positive. The trend is generally positive, and the performance tends to improve at higher CPU counts, implying potentially better scalability.

Speaking broadly, the spinlocks in the Linux kernel are already highly tuned, leaving developers on the wrong side of the law of diminishing returns. It should therefore be unsurprising that tuning a given lock yields a small delta in performance.

Subsequent Data

Feedback from the kernel mailing list led me to collect more data points to further understand this patch's performance.

Average size and order of allocation/deallocation

Within our simulation environment, I instrumented the bulk allocation and deallocation routines in the kernel to determine the average number of pages requested and the order of the allocations requested. The means and standard deviations are listed below (a sketch of this kind of instrumentation follows the table):
                            config          MAB             dpunish         pmake
                            mean    stdev   mean    stdev   mean    stdev   mean    stdev
free_pages_bulk()  count    7.01    0.74    6.82    1.01    7.00    0.14    7.00    0.00
                   order    0.00    0.05    0.03    0.17    0.00    0.06    0.00    0.00
rmqueue_bulk()     count    7.03    1.70    7.21    2.42    7.07    1.29    6.28    1.77
                   order    0.21    0.65    0.31    0.87    0.13    0.14    0.57    1.07
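
A minimal sketch of the sort of instrumentation involved, assuming simple global accumulators that are read out and post-processed after each run (the names and placement below are illustrative, not taken from the patch):

        #include <asm/atomic.h>

        /* Illustrative counters: enough to recover the mean and standard
         * deviation of the "count" argument; "order" can be tracked the
         * same way, and rmqueue_bulk() gets an analogous set. */
        static atomic_long_t bulk_free_calls;
        static atomic_long_t bulk_free_count;
        static atomic_long_t bulk_free_count_sq;

        /* Added at the top of free_pages_bulk(): */
        atomic_long_inc(&bulk_free_calls);
        atomic_long_add(count, &bulk_free_count);
        atomic_long_add((long)count * count, &bulk_free_count_sq);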

Integration with Ticket Spinlocks

I measured the performance effects of this patch when integrated with the ticket spinlocks patch like so:
        if (lock_is_contended(lock)) {
                spin_unlock(lock);
                spin_lock(lock);                /* To the back of the queue */
        }
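
The fragment above is shown out of context; one plausible placement (the exact integration point is not shown here, so this is an assumption) is inside rmqueue_bulk(), holding the lock across the loop and handing it off only when another CPU is actually waiting:

        /* Sketch only: lock_is_contended() stands in for the ticket-spinlock
         * contention query, and the placement inside rmqueue_bulk() is assumed. */
        spin_lock(&zone->lock);
        for (i = 0; i < count; ++i) {
                struct page *page = __rmqueue(zone, order);
                if (unlikely(page == NULL))
                        break;
                list_add_tail(&page->lru, list);
                if (lock_is_contended(&zone->lock)) {
                        spin_unlock(&zone->lock);
                        spin_lock(&zone->lock);         /* to the back of the queue */
                }
        }
        spin_unlock(&zone->lock);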

I used kernel version 2.6.24-rc7 as a baseline (unmod). I compared this to ticket spinlocks alone (ts), ticket spins + my optimization (ts+opt), and ticket spins + my optimization + putting the zone lock in its own cache line (ts+opt+cl).
Per Andrew Morton's request, I added the dd regression test, which executes this command:
        time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
The performance data are presented below, for 15 samples on each kernel (except the kernel compiles, which had 5). All measurements are in seconds, except for hdparm, which is reported in MB per second.
Benchmark               Kernel       Mean      Std Dev   95% CI    Speedup
MAB                     unmod        3.139     .174      .088      --
                        ts           3.239     .175      .088      0.969*
                        ts+opt       3.190     .188      .095      1.015*
                        ts+opt+cl    3.291     .261      .132      0.969*
config                  unmod        2.520     .202      .102      --
                        ts           2.562     .256      .130      0.983*
                        ts+opt       2.498     .205      .104      1.026*
                        ts+opt+cl    2.528     .230      .116      0.988
objdump                 unmod        22.744    .447      .226      --
                        ts           22.921    .583      .295      0.992*
                        ts+opt       22.476    .500      .253      1.020*
                        ts+opt+cl    23.363    .512      .259      0.962*
Kernel compile          unmod        239.501   6.180     5.417     --
                        ts           239.754   2.459     2.155     0.999
                        ts+opt       241.821   4.489     3.934     0.991
                        ts+opt+cl    240.022   4.489     3.963     1.007
Kernel compile (fast)   unmod        37.204    2.861     2.508     --
                        ts           38.795    4.416     3.871     0.959
                        ts+opt       39.715    3.081     2.701     0.977
                        ts+opt+cl    38.872    5.167     4.529     1.022
hdparm (MB/s)           unmod        87.496    .127      .064      --
                        ts           87.510    .062      .032      1.000
                        ts+opt       87.493    .104      .053      1.000
                        ts+opt+cl    87.513    .057      .029      1.000
dd                      unmod        0.079     .001      .000      --
                        ts           0.078     .002      .001      1.002
                        ts+opt       0.078     .002      .001      1.012*
                        ts+opt+cl    0.079     .003      .001      0.984*

The starred speedups are statistically significant according to the Wilcoxon signed-rank test.

Each speedup is relative to the configuration in the preceding row for that benchmark: the speedup of ts is over unmod, the speedup of ts+opt is over ts, and so on.

These data indicate that there is a small performance penalty (1-2%) incurred by adding ticket spinlocks alone, probably because they are not exercised heavily enough in the kernel to repay their implementation cost. By releasing and re-acquiring the lock at just these two places, the 1-2% overhead of ticket spinlocks is reclaimed in most benchmarks, making overall performance comparable to the baseline kernel.

The only data inconsistent with these results are the kernel compilation benchmarks. These data indicate that the best performance is from the baseline kernel. It is not clear to me what property of kernel compilation causes it to have this performance profile.

Placing the zone spin lock in its own cache line hurts performance.

The dd regression test actually shows a similar trend to the other benchmarks; the ts+opt kernel performs best.