Zack's Kernel News
Zack Brown reports on improving spinlock efficiency and adding up a few extra bytes.
Improving Spinlock Efficiency
Alex Kogan, from Oracle, wanted to eke out a little more efficiency from spinlocks. Locks are part of any multitasking operating system – they make sure only one process can access a given system resource at a time. Spinlocks are locks that just loop around and around, waiting for one process to let go of a resource so the next process can claim it. There are lots of different kinds of locks, suited to a ton of different special contexts. For example, an MCS lock is a particular type of spinlock named after the people who first described it: Mellor-Crummey and Scott.
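To make the idea concrete, here is a minimal user-space sketch of a test-and-set spinlock in C. It is purely illustrative; the kernel's real qspinlock and MCS code is far more elaborate.

#include <stdatomic.h>

/* A toy test-and-set spinlock, for illustration only.
 * Initialize with: struct demo_spinlock lock = { ATOMIC_FLAG_INIT }; */
struct demo_spinlock {
    atomic_flag locked;
};

static void demo_lock(struct demo_spinlock *l)
{
    /* Keep looping ("spinning") until the current holder lets go. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;
}

static void demo_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}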
In this case, Alex said, "In CNA [compact NUMA-aware], spinning threads are organized in two queues, a primary queue for threads running on the same node as the current lock holder, and a secondary queue for threads running on other nodes. After acquiring the MCS lock and before acquiring the spinlock, the MCS lock holder checks whether the next waiter in the primary queue (if it exists) is running on the same NUMA [non-uniform memory access] node. If it is not, that waiter is detached from the main queue and moved into the tail of the secondary queue. This way, we gradually filter the primary queue, leaving only waiters running on the same preferred NUMA node."
Alex was talking about having spinlocks be aware of the special context of the threads waiting for the resource they guard. Specifically, on systems with more than one CPU (which is pretty much all of them these days), RAM is attached to particular CPUs, and a thread might be waiting for memory attached to a CPU other than the one it is running on. Whenever a thread uses memory attached to a different CPU, that access is slower than it would be for a thread running on the CPU the memory belongs to. Also, the RAM attached to one CPU may be inherently faster or slower than the RAM attached to another. Systems with this kind of unequal memory access are called NUMA systems.
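The same locality matters in user space. The following sketch uses the libnuma library (assuming it is installed; link with -lnuma) to allocate a buffer on the NUMA node of the CPU the program happens to be running on:

#define _GNU_SOURCE
#include <numa.h>      /* from libnuma; link with -lnuma */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();          /* the CPU this thread is running on */
    int node = numa_node_of_cpu(cpu);   /* the NUMA node that CPU belongs to */

    /* Allocate 1MB of memory local to our own node; accessing this
     * buffer avoids the slower remote-node path. */
    void *buf = numa_alloc_onnode(1 << 20, node);

    printf("CPU %d is on node %d, local buffer at %p\n", cpu, node, buf);

    numa_free(buf, 1 << 20);
    return 0;
}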
As Alex described it, his patch would check the threads waiting for access to a piece of RAM. If the threads were on different CPUs, the patch would move those threads to the back of the line so the more efficient threads (on the same CPU as the RAM they wanted to access) could have priority.
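In rough terms, the filtering step Alex described might look something like the following. The data structures here are made up for illustration; the actual patches operate on the kernel's MCS/qspinlock node structures.

#include <stddef.h>

/* Made-up data structures for illustration only. */
struct waiter {
    int numa_node;              /* node the waiting thread runs on */
    struct waiter *next;
};

struct cna_queues {
    struct waiter *primary;     /* waiters on the lock holder's node */
    struct waiter *sec_head;    /* waiters on other nodes */
    struct waiter *sec_tail;
};

/* Run by the current lock holder: if the next waiter in the primary
 * queue sits on a different NUMA node, detach it and append it to the
 * tail of the secondary queue, so same-node waiters are served first. */
static void cna_filter_next(struct cna_queues *q, int holder_node)
{
    struct waiter *w = q->primary;

    if (!w || w->numa_node == holder_node)
        return;                 /* next waiter is local: nothing to do */

    q->primary = w->next;       /* detach from the primary queue */
    w->next = NULL;

    if (q->sec_tail)            /* append to the secondary queue */
        q->sec_tail->next = w;
    else
        q->sec_head = w;
    q->sec_tail = w;
}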
He pointed out a known issue, saying, "Note that this variant of CNA may introduce starvation by continuously passing the lock between waiters in the main queue. This issue will be addressed later in the series." In other words, the lower priority threads might never get access to the RAM they want, if more and more higher priority threads keep making requests.
At first, there was no significant criticism of Alex's project. Davidlohr Bueso had some technical questions, and Waiman Long offered some background explanations. Barry Song liked the patches in general and suggested that they might have a broader value beyond only NUMA systems. Though he did remark, "anyway, it looks like a good start to be aware of numa only for this moment."
The only developer to seriously question the necessity of Alex's patch was Peter Zijlstra.
This started because Alex's patch, in its initial form, only works on x86 systems. Guo Ren remarked that he was actually working on a RISC-V architecture variant of Alex's feature. It was at this point that Peter questioned the whole value of Alex's (and Guo's) CNA approach. He asked, "What has been your reason for working on CNA? What lock has been so contended you need this?"
Guo pointed to his post in another thread where he explained that on certain specific systems that had just been released he was hopeful the CNA approach would show a significant improvement. He said, "The two NUMA nodes platform has come out, so we want to measure the benefit of CNA qspinlock."
But Peter replied, "CNA should only show a benefit when there is strong inter-node contention, and in that case it is typically best to fix the kernel side locking. Hence the question as to what lock prompted you to look at this."
Guo replied, "I met the long lock queue situation when the hardware gave an overly aggressive store queue merge buffer delay mechanism." He gave a link to another post of his, in which he identified a hardware bug in certain processors resulting in this behavior.
Peter replied, "*groan*, so you're using it to work around 'broken' hardware :-(. Wouldn't that hardware have horrifically bad lock throughput anyway? Everybody would end up waiting on that store buffer delay."
Guo agreed that the hardware should be improved and that Peter was right about the awful throughput. But he said, "This problem is only found in the lock torture case, and only one entry left in the store buffer would cause the problem. We are now widely stress testing userspace parallel applications to find a second case."
The dispute was very technical and ended inconclusively; Peter and Guo even observed different behaviors on the relevant hardware, so further tests may be needed to resolve the question of CNA's ultimate usefulness. Guo's main point appears to be that CNA would be a good and inexpensive workaround for the hardware bugs he and Alex were targeting. Peter's main point seems to be that CNA represents a significant and high-maintenance new feature, and that the hardware bugs Guo is concerned about could be addressed more easily and cheaply by being more careful about when and how locks are taken in the kernel code. There isn't necessarily a need, he seems to say, to analyze the nature or contexts of the threads requesting the locks.
This is a situation that often arises in Linux kernel development. There may be general agreement that a problem exists, even though the manifestation of that problem and the right way to fix it may be obvious in different ways to different people involved in the discussion. Sometimes the resolution turns out to be a matter of principle, such as, we will never break the application binary interface. Other times it's a matter of practicality, such as, this code introduces a lot of complexity in order to solve a rare corner case that probably nobody will ever encounter. And sometimes the issues are overlapping and contradictory, making it very difficult to settle on a real resolution – especially when developers have put a lot of effort into their favorite solution already.
It's unclear whether Alex's and Guo's patches will make it into the kernel. Maybe Peter is right, and the solution is for the kernel to use locking more wisely to avoid the problems in the first place. Maybe Alex and Guo are right that their code will speed up the problem case on the bad hardware. And maybe the real solution will involve the best points from both sides. We never know in advance.
Adding Up a Few Extra Bytes
Sharing resources between systems on a network, between CPUs on a motherboard, between processes on a CPU, and in many other situations, is the way of the world these days. Virtualization and cloud services run the world, and all of them run on Linux. Depending on the application, a few bytes saved in the kernel can turn into terabytes on a large system, and terabytes can turn into yottabytes on the cloud – at least maybe some day.
For example, database applications value speed and efficiency above almost all other considerations. Sharing memory across multiple database processes bypasses the overhead of pipes and sockets and really revs things up. Databases have done that pretty much forever. However, there is apparently some room for improvement.
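A common way to set up such sharing is a POSIX shared memory object. The minimal sketch below (using a hypothetical object name, "/demo_shm"; link with -lrt on older systems) shows the idea: every process that opens and maps the same object sees the same physical pages, with no pipe or socket in between.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Create or open a named shared memory object; every process that
     * opens "/demo_shm" and maps it sees the same physical pages. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* No pipe or socket involved: this write is immediately visible
     * to every other process that maps the same object. */
    strcpy(p, "hello from a shared page");

    munmap(p, 4096);
    close(fd);
    /* shm_unlink("/demo_shm") removes the object when no longer needed. */
    return 0;
}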
Khalid Aziz of Oracle recently pointed out that for each memory page shared between processes on Linux, a separate identical page table entry (PTE) had to be maintained within each of those processes. You already see where he's going with this.
First of all, Linux uses a virtual memory system to map all the various RAM chips and other bits and pieces of available memory on a given system into one address space for ease of reference. Otherwise the nightmares would get too intense. Processes rely on PTEs to map those virtual addresses to the physical hardware. So processes use PTEs not just for shared memory, but for all of their memory accesses. And for shared memory, those PTEs are simply identical in whichever process uses that memory.
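You can even watch that translation from user space via /proc/self/pagemap, which exposes one 64-bit entry per virtual page (bit 63 says whether the page is present; the physical frame number in the low 55 bits reads as zero without privileges on current kernels). A minimal sketch:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = malloc(page_size);
    buf[0] = 1;                     /* touch the page so it gets mapped */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* One 64-bit entry per virtual page: bit 63 = present,
     * bits 0-54 = physical frame number (0 without privileges). */
    uint64_t entry;
    off_t offset = ((uintptr_t)buf / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        perror("pread"); return 1;
    }

    printf("virtual %p: present=%d pfn=0x%llx\n", (void *)buf,
           (int)(entry >> 63),
           (unsigned long long)(entry & ((1ULL << 55) - 1)));

    close(fd);
    free(buf);
    return 0;
}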
Khalid wanted to share PTEs alongside the shared memory they referenced. As he put it, "When very few memory pages are shared between processes, the number of page table entries (PTEs) to maintain is mostly constrained by the number of pages of memory on the system. As the number of shared pages and the number of times pages are shared goes up, amount of memory consumed by page tables starts to become significant."
He also pointed out that this problem did not exist with a process's internal threads of execution, because threads already share PTEs, along with the memory they share within their process.
Threads and processes are fundamentally the same – they are both sequences of running code on the system; they go through the kernel to access RAM, disk, and other system resources; and they run in parallel with each other under the careful tending of the Linux scheduler, to give us that wonderful multitasking feeling as we browse the web. However, processes are fundamentally independent. Your web browser and database application are different tools invoked independently by you. They don't necessarily know or care much about each other, and they don't have rights to each other's data. They have their own private address spaces and are essentially protected from each other by the operating system, whereas threads are invoked from within a single application, as part of the coordinated logic and operation of that application. There are way fewer security considerations for threads, because they are all created by the benevolent parent process.
Khalid said that at his company, "field deployments commonly see memory pages shared across 1000s of processes. On x86_64, each page requires a PTE that is only 8 bytes long which is very small compared to the 4K page size. When 2000 processes map the same page in their address space, each one of them requires 8 bytes for its PTE and together that adds up to 8K of memory just to hold the PTEs for one 4K page."
For example, he said, "On a database server with 300GB SGA [System Global Area], a system crash was seen with out-of-memory condition when 1500+ clients tried to share this SGA even though the system had 512GB of memory." (SGA is an Oracle-specific shared memory pool.)
Khalid concluded, "mapping every page from SGA would have required 878GB+ for just the PTEs. If these PTEs could be shared, amount of memory saved is very significant."
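A quick back-of-the-envelope check (mine, not taken from the patch posting) shows where that figure comes from: a 300GB region in 4K pages needs about 78.6 million PTEs, or roughly 600MB of page tables per process, and 1,500 processes multiply that to just under 879GB.

#include <stdio.h>

int main(void)
{
    unsigned long long sga_bytes = 300ULL << 30;   /* 300GB SGA     */
    unsigned long long page_size = 4096;           /* 4K pages      */
    unsigned long long pte_size  = 8;              /* bytes per PTE */
    unsigned long long processes = 1500;

    unsigned long long ptes_per_proc  = sga_bytes / page_size;      /* ~78.6 million */
    unsigned long long bytes_per_proc = ptes_per_proc * pte_size;   /* 600MB         */
    unsigned long long total          = bytes_per_proc * processes; /* ~879GB        */

    printf("PTEs per process: %llu (%lluMB), total for %llu processes: %lluGB\n",
           ptes_per_proc, bytes_per_proc >> 20, processes, total >> 30);
    return 0;
}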
He posted a patch to modify the mmap() system call so that processes could let the kernel know that they were cool with sharing their PTE for a given shared memory page with other processes using that memory. He said that thereafter, "any other process that mmaps the same file with MAP_SHARED_PT flag can then share the same page table entries."
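In user-space terms, the proposed interface might be used roughly as follows. This is only a sketch against an RFC: MAP_SHARED_PT is not defined in any mainline kernel header, the flag value below is a placeholder of my own, and the semantics could still change before (or if) the work is merged.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Placeholder value for illustration only; the real definition
 * lives in the RFC patches, not in mainline headers. */
#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT 0x800000
#endif

void *map_shared_sga(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    /* Under the proposal, every process that maps the same file with
     * MAP_SHARED_PT would also share the page table entries themselves. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_SHARED_PT, fd, 0);
    close(fd);    /* the mapping stays valid after the fd is closed */
    return p;
}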
Khalid described his initial, still-to-be-improved handshaking mechanism, whereby processes could gain access to the shared PTEs. However, he emphasized that this was a work in progress, and he himself had implementation questions and even concerns. For example, he said, "When page tables for a process are freed upon process exit, pmd_free_tlb() gets called at one point to free all PMDs allocated by the process. For a shared page table, shared PMDs can not be released when a guest process exits" because that would leave all the other processes relying on that PTE chewing on dirt. Unsure of how to deal with this, Khalid proposed some possibilities.
Mike Kravetz, also from Oracle, had some technical issues with Khalid's code, although he wasn't opposed to the basic idea. He pointed out that one of Linux's security features applying to processes as opposed to threads was address space layout randomization (ASLR), which ensured that each process's set of virtual memory addresses would be randomly scrambled relative to the physical memory on the system, as a way of stymieing attackers. As Khalid's patch stood currently, the virtual addresses used by multiple processes would need to be the same. Mike had some technical suggestions for how to get around that, maintaining security while still sharing the PTE.
Khalid replied that the requirement for identical virtual memory addresses could potentially be relaxed, as Mike suggested. He also made reference to an earlier attempt to do essentially the same thing as his current patch – the older mshare implementation, which would come up again later in the conversation; however, at this point the conversation became a bit more disjointed and less of an easily described narrative.
Meanwhile, Peter Xu of Red Hat pointed out that the hugetlb patches also supported something along the lines of Khalid's feature, and he asked for a comparison between the two approaches. Hugetlb is a kernel feature that lets applications use the very large memory pages that modern CPUs support. Peter was hopeful that if Khalid's feature made it into the kernel, it might pave the way for hugetlb's similar feature to go in too.
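Hugetlb itself is easy to try from user space, assuming some huge pages have been reserved (for example, by writing to /proc/sys/vm/nr_hugepages). The page-table sharing Peter referred to applies to shared hugetlb mappings; this minimal sketch just shows an anonymous huge page mapping.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)    /* one 2MB huge page */

int main(void)
{
    /* Ask for an anonymous mapping backed by a huge page; this fails
     * unless huge pages have been reserved beforehand. */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    printf("2MB huge page mapped at %p\n", p);
    munmap(p, LEN);
    return 0;
}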
A little time passed with no movement on the mailing list, until Rongwei Wang of Alibaba asked about the status of Khalid's patch; he also asked Matthew Wilcox about the status of mshare. Rongwei said, "I remember you [Matthew] said you will redesign the mshare to per-vma not per-mapping (apologize if remember wrongly) in last time MM alignment session. And I also refer to you to re-code this part in our internal version (based on this RFC). It seems that per VMA will can simplify the structure of pgtable sharing, even doesn't care the different permission of file mapping."
But Matthew replied, "It was David [Hildenbrand] who wants to make page table sharing be per-VMA. I think he is advocating for the wrong approach. In any case, I don't have time to work on mshare and Khalid is on leave until September, so I don't think anybody is actively working on mshare."
David Hildenbrand of Red Hat also replied to Rongwei, saying, "I also don't have any time to look into this, but my comment essentially was that we should try decoupling page table sharing (reduce memory consumption, shorter rmap walk) from the mprotect(PROT_READ) use case." But he concluded, "Again, I don't have time to look into that (just like everybody else as it appears) and might miss something important. Just sharing my thoughts that I raised in the call."
However, it did seem as though Rongwei had time to work on this, and he said regarding some of Matthew's and David's technical comments, "Your words are very helpful to me. I try to design our internal version about this feature in a right way."
David and Matthew had some more technical comments, and at one point Matthew pointed out that David's use case was too narrow to handle all situations. He said, "This is why I proposed mshare(). Anyone can use it for anything. We have such a diverse set of users who want to do stuff with shared page tables that we should not be tying it to memfd or any other filesystem. Not to mention that it's more flexible; you can map individual 4kB files into it and still get page table sharing." He added later, "arbitrary filesystems was one of the origin use cases; where the database is stored on a persistent memory filesystem, and neither the kernel nor userspace has a cache. The Postgres & Commercial Database use-cases collapse into the same case, and we want to mmap the files directly and share the page tables."
Finally, as the thread came to an end, there was some confusion about what was being discussed. Rongwei asked, "I'm a little confused about this mshare(). Which one is the mshare() you refer to here, previous mshare() based on filesystem or this RFC v2 posted by Khalid? IMHO, they have much difference between previously mshare() and MAP_SHARED_PT now."
Matthew replied, "I haven't read this version of the patchset. I'm describing the original idea, not what it may have turned into. As far as I'm concerned, we're still trying to decide what functionality we actually want," and the thread came to an end.
This can sometimes happen in Linux kernel development. There's a problem to be solved, such as a lot of extra memory usage that might be reduced, and it turns out there are actually several people trying to implement roughly similar ideas in various parts of the kernel, each with their own ideas on how it should go. And many of them are being pulled into other tasks by their corporate teams, or even just trying to go on vacation. A feature like that – which can make a big difference to a small subset of users, but that isn't as clear-cut as a simple bug fix – may not make much headway at first. It could take years for such features to settle into any sort of final form that works in all the various microscopic contexts that might need them.