Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
Making Device-Based RAM Available To Arbitrary User Processes
Anshuman Khandual wanted to come up with a way for all attached devices to make their on-board RAM available to a running Linux system as just a normal region of RAM that any process could use without having to do anything special.
This is a tough nut to crack, because it's not just a simple matter of mapping the memory and making it available for allocation by user processes. As Anshuman put it, "To achieve seamless integration between system RAM and coherent device memory it must be able to utilize core memory kernel features like anon mapping, file mapping, page cache, driver managed pages, HW poisoning, migrations, reclaim, compaction, etc."
Each of the above requirements had its own set of difficulties, which Anshuman enumerated. For one thing, unlike regular RAM, device RAM couldn't be made available until after a given device had been initialized. That constraint would need to be handled properly. Likewise, each issue had its own constraints and caveats that would need to be papered over so that regular user code would perceive the RAM as simply being available for use.
Anshuman went over a wide-ranging array of steps that he felt would need to be taken to accommodate device RAM.
One requirement he identified was the need for the device's memory page data structures to be used cleanly within the kernel's LRU (least recently used) lists of available memory. However, Jerome Glisse felt that this might not be as important as it seemed. The problem, as he saw it, was that the kernel's page cache code would sometimes need to add any given memory page to the LRU lists. If this wasn't available, the page cache code would be flummoxed.
However, Jerome thought there was a reasonable workaround available for this. In his own work on HMM (heterogeneous memory management), he explained, "I am using ZONE_DEVICE and because memory is not accessible from CPU (not everyone is blessed with a decent system bus like CAPI, CCIX, Gen-Z, …) so in my case a file back page must always be spawned first from a regular page and once read from disk then I can migrate to GPU page. So if you accept this intermediary step you can easily use ZONE_DEVICE for device memory. This way no lru, no complex dance to make the memory out of reach from regular memory allocator."
Jerome suggested that he and Anshuman pool their efforts, as it seemed there was a good opportunity to simplify and enhance both projects.
While Jerome was suggesting that some of Anshuman's requirements might not be strictly necessary, Dave Hansen felt that Anshuman had left out some crucial requirements, specifically support for autonuma (automatic NUMA balancing, where NUMA stands for non-uniform memory access) and hugetlbfs. Both are core memory management features (autonuma migrates pages toward the CPUs that use them, and hugetlbfs provides explicitly reserved huge pages), so presumably they'd be relevant to Anshuman's work.
David Nellans, working with Anshuman on this project, replied that this had been a conscious choice, to make the device RAM pages migratable between CPUs via an explicit decision, rather than by autonuma's automated method. As for ignoring hugetlbfs, David said that Anshuman's code would rely more on THP (transparent huge pages), which performs a similar function to hugetlbfs.
Dave wasn't satisfied with this explanation and said that in particular it would force some complexity into the hugetlbfs code to exclude this type of device RAM explicitly.
Various folks descended into a technical debate of the issues surrounding this and Jerome's suggestion. The question of what to support and what to work around was crucial to identifying whether the feature could be implemented at all: Support too much, and the code becomes too complicated/slow/unmaintainable. Work around too much, and the code becomes too inconsistent/unusable/breakable. The goal is ultimately to create features that everyone can use without too many headaches and that can be maintained without too much insanity.
For example, at one point Anshuman objected to Jerome's earlier suggestion, saying that "there are problems now migrating user mapped pages back and forth between LRU system RAM memory and non LRU device memory which is yet to be solved. Because you are proposing a non LRU based design with ZONE_DEVICE, how we are solving/working around these problems for bi-directional migration?" And he added, "Before non LRU migration support patch series from Minchan, it was not possible to migrate non LRU pages which are generally driver managed through migrate_pages interface. This was affecting the ability to do compaction on platforms which has a large share of non LRU pages. That series actually solved the migration problem and allowed compaction. But it still did not solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages are mapped into a process's page table and being accessed from user space, it can not be moved using migrate_pages interface."
Jerome replied, "Minchan is trying to allow migration for device driver kernel allocated memory ie not memory that end inside a regular vma (non special vma) but only inside a device driver file vma if at all. So we are targeting different problem. Me I only care about "regular" process memory is private anonymous, or share memory (either back by regular file or pure share memory). I do not want to mess with any of the device driver vma or any special vma that are under control of an unknown device driver. Trying to migrate any such special memory is just not going to work. Moreover I believe it is not something we care [about] in the first place."
So, there is a range of issues surrounding the various aspects of support for device-based RAM in the general memory pool. Different ongoing kernel projects have a vested interest in seeing a compatible implementation, and it is rarely obvious in advance which approach will turn out to be the simplest way to satisfy all of them.
Ultimately, projects of this type will inevitably attract the attention of hard-core kernel hackers who can definitively rule out certain approaches. At that point I'd expect a certain amount of goal realignment, especially if the folks involved in the initial patches have gotten too far off into a world of feature compromise and requiring user code to be too aware of kernel internals. This is exactly the sort of situation where we might see Linus Torvalds come along and propose a dramatically simpler approach that satisfies none of the original proponents but that allows device-based RAM to be used by certain processes in simple ways.
Blunting Hostile Code
Juerg Haefliger posted some code to implement eXclusive Page Frame Ownership (XPFO), which he felt would help guard against ret2dir attacks. In a ret2dir (return-to-direct-mapped memory) attack, hostile code or data that an attacker has placed in user memory is reached through the kernel's own direct mapping of physical RAM, so that a hijacked kernel code path ends up using it via a kernel address. It's a sibling to a ret2usr attack, in which the kernel is redirected to the user-space address of the hostile code instead.
XPFO addresses ret2dir by creating exclusive ownership of memory pages, so that user code would not be able to allocate memory from the kernel, insert hostile code, and then cause the kernel to attempt to run that code.
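The ownership idea can be illustrated with a toy model. This is only a sketch of the concept, not Juerg's patch: the class and method names are hypothetical, and it assumes that "exclusive ownership" is enforced by dropping a page from the kernel's view while user space holds it and restoring it when the page is freed.

```python
class PageOwnershipModel:
    """Toy model of exclusive page frame ownership (not kernel code)."""

    def __init__(self, num_pages):
        # Initially every page frame is reachable by the kernel.
        self.kernel_visible = set(range(num_pages))
        self.user_owned = set()

    def alloc_to_user(self, pfn):
        # Handing a page to user space removes it from the kernel's view,
        # so the same frame is never reachable from both sides at once.
        self.kernel_visible.discard(pfn)
        self.user_owned.add(pfn)

    def free_from_user(self, pfn):
        # Returning the page gives ownership back to the kernel.
        self.user_owned.discard(pfn)
        self.kernel_visible.add(pfn)

    def kernel_can_reach(self, pfn):
        return pfn in self.kernel_visible


model = PageOwnershipModel(num_pages=4)
model.alloc_to_user(2)
print(model.kernel_can_reach(2))  # False while user space owns the page
model.free_from_user(2)
print(model.kernel_can_reach(2))  # True once the page is returned
```

In this model, anything an attacker writes into page 2 while owning it is simply not reachable through a kernel address, which is the property the ownership scheme is after.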
Laura Abbott had some technical suggestions, mostly about portability and how to improve locking efficiency in Juerg's code, and she offered suggestions for how to contact the relevant maintainers.
Juerg liked Laura's code review and pointers and started posting updated patches. The technical discussion went on for a bit, but there were no serious objections, and it looked as though Juerg was making good progress toward putting together an acceptable patch.
Some argue that little security fixes like this don't make much difference in the grand scheme of things, especially when they guard against attacks that can only occur when user code does something dumb; however, others think that any opportunity to remove an attack vector should be taken. Ultimately, there's a give-and-take between usefulness and bloat, but it does look as though Juerg's code is likely to make it into the kernel eventually.
New Kernel Messaging System
David Herrmann introduced the idea of a kernel messaging bus called bus1.ko, inspired by the existing kdbus project but going its own way in terms of design. The bus1.ko project would implement interprocess communication (IPC) that would be completely divorced from user space.
Communication would take the form of nodes containing the message data and handles containing references to nodes. Peer processes would send messages to one another via handles that would give the target process access to the referenced node. Once a process was in possession of a handle, the only way to cut off that process's access to the referenced node would be for the original process to destroy the node in question. There would be no way to modify or revoke the handle once it was sent.
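A minimal sketch of the node and handle relationship, as described above, might look like the following. The names Node, Handle, destroy, and read are hypothetical and are not taken from the bus1.ko code; the point is only that a handle has no revoke operation, so destroying the node is the owner's sole way to cut off access.

```python
class Node:
    """Holds the message data; owned by the peer that created it."""

    def __init__(self, owner, payload):
        self.owner = owner
        self.payload = payload
        self.alive = True

    def destroy(self, caller):
        # Only the owning peer may destroy the node.
        if caller != self.owner:
            raise PermissionError("only the owning peer may destroy a node")
        self.alive = False


class Handle:
    """A reference to a node; deliberately has no revoke or modify method."""

    def __init__(self, node):
        self._node = node

    def read(self):
        if not self._node.alive:
            raise LookupError("node was destroyed by its owner")
        return self._node.payload


node = Node(owner="alice", payload="hello")
handle = Handle(node)          # alice sends this handle to bob
print(handle.read())           # bob can reach the node through the handle
node.destroy("alice")          # the only way for alice to cut off access
try:
    handle.read()
except LookupError as err:
    print("access lost:", err)
```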
Linus Torvalds took a look at this and gave his assessment:
The thing that tends to worry me about these is resource management.
If I understood the documentation correctly, this has per-user resource management, which guarantees that at least the system won't run out of memory. Good. The act of sending a message transfers the resource to the receiver end. Fine.
However, the usual problem ends up being that a bad user can basically DoS a system agent, especially since for obvious performance reasons the send/receive has to be asynchronous.
So the usual DoS model is that some user just sends a lot of messages to a system agent, filling up the system agent resource quota, and basically killing the system. No, it didn't run out of memory, but the system agent may not be able to do anything more, since it is now out of resources.
Keeping the resource management with the sender doesn't solve the problem, it just reverses it: Now the attack will be to send a lot of queries to the system agent, but then just refuse to listen to the replies – again causing the system agent to run out of resources.
Usually the way this is resolved is by forcing a "request-and-reply" resource management model, where the person who sends out a request is not only the one who is accounted for the request, but also accounted for the reply buffer. That way the system agent never runs out of resources, because it's always the requesting party that has its resources accounted, never the system agent.
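To make the request-and-reply model Linus describes concrete, here is a small illustrative simulation. It is not bus1.ko code; Quota, QuotaExceeded, and send_request are hypothetical names, and the sizes are arbitrary. The requester is charged up front for both the request and the reply buffer, so a flood of requests exhausts only the requester's own quota.

```python
class QuotaExceeded(Exception):
    pass


class Quota:
    """Simple per-party resource counter."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, amount):
        if self.used + amount > self.limit:
            raise QuotaExceeded(f"{self.used} + {amount} > {self.limit}")
        self.used += amount


def send_request(requester, request_size, reply_size):
    # The requester pays for the request AND for the buffer that will
    # hold the reply; the system agent is never charged.
    requester.charge(request_size + reply_size)


agent = Quota(limit=100)
client = Quota(limit=10)
accepted = 0
try:
    while True:
        send_request(client, request_size=3, reply_size=2)
        accepted += 1
except QuotaExceeded:
    pass

print(accepted, agent.used)  # 2 0
```

The misbehaving client is cut off after two requests, while the agent's own quota is untouched and it can keep serving everyone else.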
David replied that for this version of the design, bus1.ko did all resource accounting via user ID in order to match POSIX specs as well as possible. He said, "More advanced accounting is left as a future extension."
He went on to explain that the tricky part of resource accounting came when users needed to transfer resources. He said, "Before SEND, a resource is always accounted on the sender. After RECV, a resource is accounted on the receiver. That is, resource ownership is transferred. In most cases this is obvious: Memory is copied from one address-space into another, or file-descriptors are added into the file-table of another process, etc."
David added, "at the time of SEND resource ownership is transferred (unlike sender-accounting, which would transfer it at time of RECV). The reasons are manifold, but mainly we want RECV to not fail due to accounting, resource exhaustion, etc. We wanted SEND to do the heavy-lifting, and RECV to just dequeue. By avoiding sender-based accounting, we avoid attacks where a receiver does not dequeue messages and thus exhausts the sender's limits."
Finally, David said, "The issue left is senders DoS'ing a target user. To mitigate this, we implemented a quota system. Whenever a sender wants to transfer resources to a receiver, it only gets access to a subset of the receiver's resource limits. The inflight resources are accounted on a uid <--> uid basis, and the current algorithm allows a receiver access to at most half the limit of the destination not currently used by anyone else."
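One way to read the quota rule David quotes is sketched below. This is an interpretation, not the actual bus1.ko algorithm: the function may_send, its parameters, and the numbers are all hypothetical. It assumes each sender may hold in flight at most half of whatever part of the receiver's limit is not already used by other senders.

```python
def may_send(limit, inflight, sender, amount):
    """Return True if `sender` may put `amount` more in flight.

    limit    -- the receiver's total resource limit
    inflight -- dict mapping sender uid -> resources currently in flight
    """
    used_by_others = sum(v for uid, v in inflight.items() if uid != sender)
    available = limit - used_by_others       # what nobody else is using
    mine = inflight.get(sender, 0)
    return mine + amount <= available // 2   # at most half of that


limit = 1024
inflight = {}

print(may_send(limit, inflight, "uid1000", 512))  # True: half of 1024
print(may_send(limit, inflight, "uid1000", 513))  # False

inflight["uid1000"] = 512
print(may_send(limit, inflight, "uid1001", 256))  # True: half of the remaining 512
print(may_send(limit, inflight, "uid1001", 257))  # False
```

Under this reading, each additional sender gets access to a shrinking share (512, then 256, then 128, and so on), so no single sender can consume the receiver's entire limit.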
Richard Weinberger asked whether messages could be passed between cgroup containers as well as between processes within a single system; and Tom Gundersen replied, "There is no restriction with respect to containers. The metadata is translated between namespaces, obviously, but you can send messages to anyone you have a handle to."
The discussion petered out around there. It's unclear whether Linus was satisfied with David's answer regarding DoS possibilities, so it's difficult to tell what would become of David's design. It's also unclear who would be the big users of this message-passing system. So, overall, it's too soon to predict the fate of any patches coming out of this design.