Zack's Kernel News

Article from Issue 275/2023
Author(s): Zack Brown

In kernel news: Heap Hardening Against Hostile Spraying; and Core Contention Improvements … or Not.

Heap Hardening Against Hostile Spraying

Ruiqi Gong wanted to address the generic security threat of "heap spraying" in the Linux kernel. Heap spraying is when an attacker triggers a large number of memory allocations, filled with data of their choosing, in the hope that one of them lands at an address used by a critical part of the system. In the event of certain types of kernel bugs, one of those allocations might end up occupying that memory, giving the attacker the ability to overwrite the critical part of the system with their own malicious data. Heap spraying is not a security vulnerability in itself, because software could legitimately trigger many allocations. Rather, heap spraying is a way to exploit other bugs that may exist in the kernel.

Because heap spraying itself is not a bug or vulnerability, it poses a fascinating problem in terms of how best to reduce the ability of attackers to make use of it. For example, Ruiqi pointed out that slab caches can be shared among different subsystems and modules. A slab cache is a pool of memory set aside ahead of time and carved into same-sized objects, on the grounds that the kernel will need objects of that size again and again.

Slab caches are more efficient than doing a bunch of piecemeal allocations as needed, but they also present a visible target for heap spraying attacks. At the same time, as Ruiqi said, it wouldn't be realistic to give up shared slab caches in the kernel, because that mechanism is in wide use throughout the kernel code, all of which would need to be rewritten and would suffer a significant performance penalty.
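To make the target a little more concrete, here is a toy model in Python of why a shared cache is attractive to an attacker. It is only a sketch: the class `ToySlabCache`, the last-in-first-out free list, and the slot count are my own simplifications for illustration, not the kernel's actual allocator.

```python
class ToySlabCache:
    """Toy model of a slab cache: fixed-size slots plus a LIFO free list."""

    def __init__(self, name, num_slots):
        self.name = name
        self.slots = [None] * num_slots
        # Reversed so that pop() hands out slot 0 first.
        self.free_list = list(range(num_slots - 1, -1, -1))

    def alloc(self, data):
        slot = self.free_list.pop()  # the most recently freed slot is reused first
        self.slots[slot] = data
        return slot

    def free(self, slot):
        # Only the free list changes; nothing stops stale users of `slot`.
        self.free_list.append(slot)


# Two unrelated "subsystems" share one generic cache (the name is illustrative).
shared = ToySlabCache("kmalloc-64", num_slots=8)

victim = shared.alloc("victim object")  # subsystem A allocates an object...
shared.free(victim)                     # ...frees it, but a bug keeps the old slot in use

# Subsystem B, driven by the attacker, allocates over and over ("spraying").
sprayed = [shared.alloc(f"attacker data {i}") for i in range(4)]

reused = victim in sprayed
print("victim slot reused by the spray:", reused)
print("victim slot now holds:", shared.slots[victim])
```

In this model, the stale slot ends up holding attacker data because both subsystems draw from the same pool. If the two subsystems used different pools, the spray would never touch the victim's slot.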

Ruiqi posted a patch, proposing the following mitigating approach. He said, "to efficiently prevent heap spraying, we propose the following approach: to create multiple copies of generic slab caches that will never be merged, and random one of them will be used at allocation. The random selection is based on the address of code that calls 'kmalloc()', which means it is static at runtime (rather than dynamically determined at each time of allocation, which could be bypassed by repeatedly spraying in brute force). In this way, the vulnerable object and memory allocated in other subsystems and modules will (most probably) be on different slab caches, which prevents the object from being sprayed."
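The selection scheme Ruiqi describes can be sketched roughly as follows. This is an illustration under my own assumptions rather than the code from the patch: the function `pick_cache_copy`, the count of 16 copies, the per-boot seed, and the use of SHA-256 as the mixing function are all hypothetical stand-ins.

```python
import hashlib

NUM_COPIES = 16  # hypothetical number of copies of each generic cache


def pick_cache_copy(caller_addr: int, boot_seed: int, num_copies: int = NUM_COPIES) -> int:
    """Map the address of an allocating call site to one cache copy.

    The result depends only on the call site and the seed, so a given call
    site always lands in the same copy for as long as the seed is unchanged.
    """
    mixed = (caller_addr ^ boot_seed) & 0xFFFFFFFFFFFFFFFF
    digest = hashlib.sha256(mixed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % num_copies


boot_seed = 0x9E3779B97F4A7C15  # made-up seed, imagined as chosen once at boot
call_sites = [0xFFFFFFFF81000000 + 0x40 * i for i in range(1000)]  # made-up addresses

choices = [pick_cache_copy(addr, boot_seed) for addr in call_sites]
stable = all(pick_cache_copy(a, boot_seed) == c for a, c in zip(call_sites, choices))

print("same call site always gets the same copy:", stable)
print("copies in use:", len(set(choices)), "of", NUM_COPIES)
```

The point of keying on the call site rather than rolling the dice at each allocation is visible here: repeating the same allocation a thousand times from one call site never reaches a different copy, so brute-force spraying from that site gains nothing.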

He posted some benchmarks comparing the kernel with and without his patch, which showed a small performance hit. And this is the crux of the debate over hardening the kernel against such attacks versus only fixing known bugs and vulnerabilities.

As Hyeonggon Yoo put it, "I don't think adding a hardening feature by sacrificing one digit percent performance (and additional complexity) is worth [it]. Heap spraying can only occur when the kernel contains security vulnerabilities, and if there is no known ways of performing such an attack, then we would simply be paying a consistent cost."

Pedro Falcato replied, "And does the kernel not contain security vulnerabilities? :v This feature is opt-in and locked behind a CONFIG_ and the kernel most certainly has security vulnerabilities. So… I don't see why adding the hardening feature would be a bad idea."

Ruiqi amplified Pedro's sentiment, saying, "unfortunately there are always security vulnerabilities in the kernel, which is a fact that we have to admit. Having a useful mitigation mechanism at the expense of a little performance loss would be, in my opinion, quite a good deal in many circumstances. And people can still choose not to have it by setting the config to n."

Vlastimil Babka also replied, "as a slab maintainer I don't mind adding such things if they don't complicate the code excessively, and have no overhead when configured out. This one would seem to be acceptable at first glance, although maybe the CONFIG space is too wide, and the amount of #defines in slab_common.c is also large (maybe there's a way to make it more concise, maybe not)."

However, he went on to say, "But I don't have enough insight into hardening to decide if it's a useful mitigation that people would enable, so I'd hope for hardening folks to advise on that."

Ruiqi replied, "For the effectiveness of this mechanism, I would like to provide some results of the experiments I did. I conducted actual defense tests [...] by reverting fixing patch to recreate exploitable environments, and running the exploits/PoCs on the vulnerable kernel with and without our randomized kmalloc caches patch. With our patch, the originally exploitable environments were not pwned by running the PoCs."

Kees Cook came into the discussion at this point, saying that he heartily agreed with the need for better approaches to heap spraying attacks and other potential exploits, in particular Use After Free (UAF). UAF is a type of vulnerability where code keeps using a pointer to memory after that memory has been freed. If an attacker can get the freed memory handed out again and filled with their own data, which is exactly what heap spraying tries to do, the buggy code ends up operating on attacker-controlled contents.

Kees said of Ruiqi's slab cache patch, "This is a nice balance between the best option we have now ('slub_nomerge') and most invasive changes (type-based allocation segregation, which requires at least extensive compiler support), forcing some caches to be 'out of reach'."

Kees found Ruiqi's benchmarks to show a relatively tiny impact on the kernel, which pleased him greatly. And he gave some comments relating Ruiqi's work to other similar work:

"Back when we looked at cache quarantines, Jann pointed out that it was still possible to perform heap spraying – it just needed more allocations. In this case, I think that's addressed (probabilistically) by making it less likely that a cache where a UAF is reachable is merged with something with strong exploitation primitives (e.g. msgsnd).

"In light of all the UAF attack/defense breakdowns in Jann's blog post (https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html), I'm curious where this defense lands. It seems like it would keep the primitives described there (i.e. 'upgrading' the heap spray into a page table 'type confusion') would be addressed probabilistically just like any other style of attack."

And Kees said in summary, "So, yes, I think this is worth it, but I'd like to see what design holes Jann can poke in it first. :)"

Hyeonggon now felt that his performance objections had been answered, and the minor performance hit seemed like an appropriate trade-off.

At this point, the developers dove into an implementation discussion, which eventually petered out.

Although this particular discussion seemed to fall in favor of Ruiqi's proposed hardening feature, there remains a heated debate among developers, not just in the Linux kernel but in the operating system space generally, as to where to draw the line. I know Linus Torvalds has at one time or another expressed reluctance to include features aimed at attacks that may never succeed because the bugs they rely on don't exist, in favor of fixing known security holes as they appear. I'd be interested to learn about the final outcome of this particular patch, which seems to have such a low cost to overall efficiency.

Core Contention Improvements … or Not

Ying Huang from Intel posted a set of patches to address the problem of CPUs contending with each other for access to system resources, especially RAM. There are already mechanisms in place in the kernel to handle memory allocations, so Ying's patches generated quite a bit of discussion.

Generally, memory is divided into "zones," with ZONE_NORMAL representing ordinary RAM, and other zones, such as ZONE_DMA, ZONE_MOVABLE, ZONE_DEVICE, etc., representing regions of memory with special characteristics. As Ying put it, "all cores in one physical CPU will contend for the page allocation on one zone in most cases. This causes heavy zone lock contention in some workloads. And the situation will become worse and worse in the future."

As with all operating systems that run multiple simultaneous processes, Linux implements locks so that only one process can access a given resource (in this case a zone of memory) at a given time. In general, a lock is held for a microscopic amount of time and then released for the next process that needs the resource. To the user, everything appears to run at once, with multiple pieces of software reading and writing to memory, and so on. In reality, wherever a shared resource is protected by a lock, processes take turns.

With the growing number of CPU cores, Ying said, cores were starting to form long lines waiting for locks to be freed so they could access the zones of memory they needed. While they waited, those cores had to just sit idle. This wouldn't bring the system to a standstill, he said. But as he put it, "For example, on an 2-socket Intel server machine with 224 logical CPUs, if the kernel is built with `make -j224`, the zone lock contention cycles% can reach up to about 12.7%." With his patch series, he went on to say, "the zone lock contention cycles% reduces to less than 1.6% in the above kbuild test case when 4 zone instances are created for ZONE_NORMAL."

That is a significant improvement. Ying achieved this by splitting memory zones into multiple instances of the same zone type. As he put it, "we will create one zone instance for each about 256 GB memory of a zone type generally. That is, one large zone type will be split into multiple zone instances. Then, different logical CPUs will prefer different zone instances based on the logical CPU No. So the total number of logical CPUs contend on one zone will be reduced. Thus the scalability is improved."
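As a back-of-the-envelope illustration of that splitting, consider the sketch below. The 224 logical CPUs and the four zone instances come from Ying's kbuild example. The 1 TiB zone size and the simple modulo mapping from CPU number to zone instance are my own assumptions, because the quoted text only says that the preference is "based on the logical CPU No."

```python
GIB = 1024 ** 3
INSTANCE_SIZE = 256 * GIB  # "one zone instance for each about 256 GB"


def num_zone_instances(zone_bytes: int) -> int:
    """Number of instances for a zone type, rounding up, with at least one."""
    return max(1, -(-zone_bytes // INSTANCE_SIZE))


def preferred_instance(cpu: int, n_instances: int) -> int:
    """Assumed mapping: spread logical CPUs round-robin over the instances."""
    return cpu % n_instances


def cpus_per_instance(n_cpus: int, n_instances: int) -> list:
    """Count how many logical CPUs prefer each zone instance."""
    counts = [0] * n_instances
    for cpu in range(n_cpus):
        counts[preferred_instance(cpu, n_instances)] += 1
    return counts


zone_normal_bytes = 1024 * GIB  # assumed 1 TiB of ZONE_NORMAL
n = num_zone_instances(zone_normal_bytes)
counts = cpus_per_instance(224, n)

print("zone instances:", n)
print("CPUs preferring each instance:", counts)
```

Under these assumptions, each zone lock is the first choice of 56 logical CPUs instead of all 224, which is the intuition behind the reduced contention.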

Ying added, "another choice is to create zone instances based on the total number of logical CPUs. We choose to use memory size because it is easier to be implemented. In most cases, the more the cores, the larger the memory size is. And, on system with larger memory size, the performance requirement of the page allocator is usually higher."

Dave Hansen, also from Intel, replied:

"A few anecdotes for why I think _some_ people will like this:

"Some Intel hardware has a 'RAM' caching mechanism. It either caches DRAM in High-Bandwidth Memory or Persistent Memory in DRAM. This cache is direct-mapped and can have lots of collisions. One way to prevent collisions is to chop up the physical memory into cache-sized zones and let users choose to allocate from one zone. That fixes the conflicts.

"Some other Intel hardware a ways to chop a NUMA node representing a single socket into slices. Usually one slice gets a memory controller and its closest cores. Intel calls these approaches Cluster on Die or Sub-NUMA Clustering and users can select it from the BIOS.

"In both of these cases, users have reported scalability improvements. We've gone as far as to suggest the socket-splitting options to folks today who are hitting zone scalability issues on that hardware.

"That said, those _same_ users sometimes come back and say something along the lines of: 'So… we've got this app that allocates a big hunk of memory. It's going slower than before.' They're filling up one of the chopped-up zones, hitting _some_ kind of undesirable reclaim behavior [...].

"Anyway, _if_ you do this, you might also consider being able to dynamically adjust a CPU's zonelists somehow. That would relieve pressure on one zone for those uneven allocations."

Ying replied, "Yes. For the requirements you mentioned above, we need a mechanism to adjust a CPU's zonelists dynamically. I will not implement that in this series. But I think that it's doable based on the multiple zone instances per zone type implementation in this series."

Elsewhere, Ying's whole approach was called into question.

Michal Hocko said, "It is not really clear to me why you need a new zone for all this rather than partition free lists internally within the zone?" He added, "I am also missing some information why pcp caches tunning is not sufficient."

Per-CPU Pageset (PCP) caching is another way, already in the kernel, to reduce zone lock contention. Each CPU core keeps a cache of pages set aside ahead of time, just for its own use. When processes on that core request memory, it's taken out of that cache, thus avoiding the need to take the lock on that memory zone. Because the pages have already been set aside for that core, there's no risk of any other core trying to use them. Meanwhile, when a process on that core frees memory, the pages go back into the same cache, which replenishes it without locking the zone.

Ying replied to Michal's email, saying, "PCP does improve the page allocation scalability greatly! But it doesn't help much for workloads that allocating pages on one CPU and free them in different CPUs. PCP tuning can improve the page allocation scalability for a workload greatly. But it's not trivial to find the best tuning parameters for various workloads and workload run time statuses (workloads may have different loads and memory requirements at different time). And we may run different workloads on different logical CPUs of the system. This also makes it hard to find the best PCP tuning globally. It would be better to find a solution to improve the page allocation scalability out of box or automatically."
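The difference between the good case and the bad case Ying describes can be seen in a toy model. This is a sketch under my own assumptions: the class names, the batch size of 32, and the high-water mark of 64 are hypothetical, and the zone "lock" is merely counted rather than contended for.

```python
class Zone:
    """Toy zone: a page counter whose lock acquisitions we simply count."""

    def __init__(self, pages):
        self.free_pages = pages
        self.lock_acquisitions = 0

    def take(self, n):
        self.lock_acquisitions += 1  # refilling a PCP cache needs the zone lock
        self.free_pages -= n

    def give(self, n):
        self.lock_acquisitions += 1  # draining a PCP cache needs it too
        self.free_pages += n


class PcpCache:
    """Toy per-CPU page cache with a refill batch and a high-water mark."""

    def __init__(self, zone, batch=32, high=64):
        self.zone, self.batch, self.high = zone, batch, high
        self.count = 0

    def alloc_page(self):
        if self.count == 0:  # cache empty: refill a whole batch from the zone
            self.zone.take(self.batch)
            self.count = self.batch
        self.count -= 1

    def free_page(self):
        self.count += 1
        if self.count > self.high:  # cache too full: drain a batch to the zone
            self.zone.give(self.batch)
            self.count -= self.batch


def run(cross_cpu: bool, pages: int = 3200) -> int:
    """Allocate on CPU 0 and free on CPU 0 or CPU 1; return zone lock count."""
    zone = Zone(pages=1_000_000)
    cpu0, cpu1 = PcpCache(zone), PcpCache(zone)
    for _ in range(pages):
        cpu0.alloc_page()
        (cpu1 if cross_cpu else cpu0).free_page()
    return zone.lock_acquisitions


same = run(cross_cpu=False)
cross = run(cross_cpu=True)
print("zone lock taken (alloc and free on same CPU):", same)
print("zone lock taken (alloc on one CPU, free on another):", cross)
```

When allocation and freeing happen on the same CPU, the cache recycles its own pages and the zone lock is barely touched. When pages are freed on a different CPU, one cache keeps running dry while the other keeps overflowing, and both have to go back to the zone repeatedly.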

Michal replied, "this makes sense. Does that mean that the global pcp tuning is not keeping up and we need to be able to do more auto-tuning on local bases rather than global?"

Ying said, "I think that PCP helps the good situations performance greatly, and splitting zone can help the bad situations scalability. They are working at the different levels." He added, "As for PCP auto-tuning, I think that it's hard to implement it to resolve all problems (that is, makes PCP never be drained). And auto-tuning doesn't sound easy."

David Hildenbrand replied, "I agree with Michal that looking into auto-tuning PCP would be preferred." And he added, "If we could avoid instantiating more zones and rather improve existing mechanisms (PCP), that would be much more preferred IMHO. I'm sure it's not easy, but that shouldn't stop us from trying ;)."

Ying absolutely agreed that "improving PCP or adding another level of cache will help performance and scalability." And he also said that "it has value too to improve the performance of zone itself. Because there will be always some cases that the zone lock itself is contended."

He added pointedly, "That is, PCP and zone works at different level, and both deserve to be improved." And continued, "I do agree that it's valuable to make PCP etc. cover more use cases. I just think that this should not prevent us from optimizing zone itself to cover remaining use cases."

However, David Hildenbrand and Michal did not agree.

David Hildenbrand explained his overall position:

"Well, the zone is kind-of your "global" memory provider, and PCPs cache a fraction of that to avoid exactly having to mess with that global datastructure and lock contention. [...] As soon as you manage the memory in multiple zones of the same kind, you lose that "global" view of your memory that is of the same kind, but managed in different bucks. You might end up with a lot of memory pressure in a single such zone, but still have plenty in another zone. [...] As one example, hot(un)plug of memory is easy: there is only a single zone. No need to make smart decisions or deal with having memory we're hotunplugging be stranded in multiple zones."

David Hildenbrand concluded, "I really don't like the concept of replicating zones of the same kind for the same NUMA node. But that's just my personal opinion maintaining some memory hot(un)plug code :)."

Michal said, "Increasing the zone number sounds like a hack to me TBH. It seems like an easier way but it allows more subtle problems later on. E.g. hard to predict per-zone memory consumption and memory reclaim disbalances."

Ying concluded the debate, saying, "At least, we all think that improving PCP is something deserved to be done." He said he would look into it himself at some point, and the discussion ended there.

This discussion is fascinating to me, because it represents a tension between two important values: the desire to speed things up versus the desire to keep the code maintainable. Ying's patches cut zone lock contention in his kbuild test from about 12.7 percent of cycles to less than 1.6 percent. Yet David Hildenbrand and Michal felt that they represented a change that would complicate future development decisions that might have to be made. Although perhaps a more difficult problem in the short term, they felt that improving PCP caching would avoid those complexities while potentially achieving an efficiency improvement similar to Ying's zone-splitting patchset. Still, it's hard to overlook Ying's performance improvements. It's possible that if no equivalent PCP improvements are found soon, his patches might make a comeback.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
