Zack's Kernel News
Chronicler Zack Brown reports on want vs. need, and hiding system resources ... from the system.
Want vs. Need
Ryan Houdek wanted to enhance compatibility layers in Linux. A compatibility layer is used when you have a piece of software that was compiled to run on a different system, and you want it to run on yours. Maybe the software expects a certain system file to exist, or certain opcodes at the CPU level, or certain system calls. A compatibility layer will provide those things so the software can run. A lot of cloud service companies like Google and Amazon use Linux's compatibility layers to make one piece of hardware look like a whole bunch of other pieces of hardware.
So compatibility layers are not new in Linux, but Ryan wanted to run old software compiled for 32-bit CPUs on 64-bit systems and offered up a general justification for compatibility layers. One of his main points was that "Not all software is open source or easy to convert to 64-bit," and that a lot of gaming software fell into this category.
Ryan pointed to various attempts in the Linux world to work around these problems, such as QEMU, a generic CPU emulator. But the problem with such attempts, he said, was not emulating the CPU; it was emulating various system resources such as memory handling and input/output controls.
He posted a patch to address the whole issue in what he felt was a more comprehensive way, by exposing compatibility system calls to user space. System calls are the library of functions the kernel provides so that user space can use the hardware and other resources on a given system. Ryan wanted to create a new set of system calls that would behave the way older software expected.
This was always going to be an iffy proposition.
Steven Price, for example, remarked, "Running 32-bit processes under a compatibility layer is a fine goal, but it's not clear why the entire 32-bit compat syscall layer is needed for this." And he added, "QEMU's user mode emulation already achieves this in many cases without any changes to the kernel."
Steven went on to say, "Clearly there are limitations, but in general I can happily 'chroot' into a distro filesystem using an otherwise incompatible architecture using a qemu-xxx-static binary."
So exposing an entirely new set of system calls didn't appeal to Steven, though he agreed that memory and input/output handling were serious issues when it came to any sort of compatibility layer.
In particular, Steven agreed that "ioctls are a mess."
Input/output controls (ioctls) are a nightmarish fantasy of one of the outer gods, possibly Nyarlathotep. Ioctls exist in the nether region between what you need the hardware to do, and what system calls are able to provide.
The ioctl() system call is intended to be extended by device driver writers, so that the inputs to ioctl() can be relevant to any particular device driver. In this way, the kernel doesn't need to be loaded up with new system calls every time a new device comes out on the market. Instead, the single ioctl() system call can take all of that malignant energy unto itself, growing darkly beneath the surface for all time. If you asked a kernel developer about documenting all the behaviors of ioctl(), they would begin to laugh, cry, and explode simultaneously. Try it and see. Or don't. They have suffered enough.
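To make the extensibility concrete, here is a minimal sketch of how an ioctl command number is put together. It assumes the generic Linux encoding used on x86 and Arm (8 number bits, 8 type bits, 14 size bits, 2 direction bits); a few architectures, such as PowerPC and MIPS, use different field widths. The helper names ioc() and ior() are invented for this illustration and only mirror the kernel's _IOC and _IOR macros.

```python
# Sketch of the generic Linux ioctl command encoding (asm-generic/ioctl.h).
# Assumption: generic field widths; PowerPC, MIPS, and a few others differ.
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14

IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS      # 8
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS  # 16
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS   # 30

IOC_NONE, IOC_WRITE, IOC_READ = 0, 1, 2


def ioc(direction: int, type_char: str, nr: int, size: int) -> int:
    """Pack direction, driver type, command number, and argument size."""
    return (
        (direction << IOC_DIRSHIFT)
        | (size << IOC_SIZESHIFT)
        | (ord(type_char) << IOC_TYPESHIFT)
        | (nr << IOC_NRSHIFT)
    )


def ior(type_char: str, nr: int, size: int) -> int:
    """Illustrative stand-in for _IOR(): user space reads data from the driver."""
    return ioc(IOC_READ, type_char, nr, size)


# EVIOCGVERSION in linux/input.h is _IOR('E', 0x01, int), with a 4-byte int.
EVIOCGVERSION = ior("E", 0x01, 4)
print(hex(EVIOCGVERSION))  # 0x80044501
```

Because the argument size is part of the number, a driver structure that is 12 bytes in a 32-bit build and 16 bytes in a 64-bit build produces two different command numbers for what is logically the same request.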
Steven remarked, "ioctls are going to be a problem whatever you do, and I don't think there is much option other than having a list of known ioctls and translating them in user space."
Steven also agreed that memory handling was difficult to manage, in terms of converting between 32- and 64-bit systems. But he also mentioned, "I've seen examples of MAP_32BIT being (ab)used to do this, but it actually restricts to 31 bits and it's not even available on arm64. Here I think you'd be better off focusing on coming up with a new (generic) way of restricting the addresses that the kernel will pick."
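The 31-bit remark is plain arithmetic: the mmap(2) man page describes MAP_32BIT as placing the mapping in the first 2 GiB of the address space, which is half of what a 32-bit pointer can reach. A minimal sketch of the range check involved; fits_below() is a hypothetical helper for illustration, not a kernel or libc function.

```python
GIB = 1 << 30


def fits_below(addr: int, length: int, bits: int) -> bool:
    """True if the range [addr, addr + length) lies entirely below 2**bits."""
    return addr >= 0 and length > 0 and addr + length <= (1 << bits)


limit_31_gib = (1 << 31) // GIB  # 2 GiB window
limit_32_gib = (1 << 32) // GIB  # 4 GiB window

# A 1 MiB mapping at 3 GiB is reachable through a 32-bit pointer,
# but lies outside the 31-bit window.
addr = 3 * GIB
print(fits_below(addr, 1 << 20, 32), fits_below(addr, 1 << 20, 31))  # True False
```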
In any event, Steven said, even exposing a full system call compatibility layer would not save anyone from having to do "a load of fixups in user space due to the differing alignment/semantics of the architectures." And he pointed out, "You are already going to have to have an allow-list of ioctls that are handled because any unknown ioctl is likely to blow up in strange ways due to the likes of structure alignment differences."
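Steven's two points, the allow-list and the alignment fixups, can be sketched together. Everything specific here is an assumption made for illustration: the command number FAKE_GET_INFO and its argument structure are invented, and the layouts model i386 (which aligns a 64-bit field to 4 bytes inside a structure) against x86-64 (which aligns it to 8 bytes).

```python
import errno
import struct

# Hypothetical argument: struct { uint32_t id; uint64_t value; }
LAYOUT_I386 = struct.Struct("<IQ")      # 12 bytes, no padding
LAYOUT_X86_64 = struct.Struct("<I4xQ")  # 16 bytes, 4 bytes of padding after id

FAKE_GET_INFO = 0x1234  # invented command number, not a real ioctl


def widen_info(arg32: bytes) -> bytes:
    """Re-pack the 32-bit layout into the layout the 64-bit kernel expects."""
    ident, value = LAYOUT_I386.unpack(arg32)
    return LAYOUT_X86_64.pack(ident, value)


# The allow-list: only commands with a known translator are passed on.
TRANSLATORS = {FAKE_GET_INFO: widen_info}


def translate_ioctl(cmd: int, arg32: bytes) -> bytes:
    """Translate a known ioctl argument, or refuse an unknown command."""
    try:
        translator = TRANSLATORS[cmd]
    except KeyError:
        # ENOTTY is the conventional errno for an unsupported ioctl.
        raise OSError(errno.ENOTTY, f"ioctl {cmd:#x} is not on the allow-list") from None
    return translator(arg32)


arg32 = LAYOUT_I386.pack(7, 0x1122334455667788)
arg64 = translate_ioctl(FAKE_GET_INFO, arg32)
print(len(arg32), len(arg64))  # 12 16
```

Refusing unknown commands outright is the conservative choice: passing the 12-byte buffer through unchanged would make the kernel read the value field from the wrong offset.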
Meanwhile, Mark Brown pointed out that a certain amount of compatibility layering did already exist in Linux. He said, "this has been deployed on Debian for a long time – you can install any combination of Debian architectures on a single system and it will use qemu to run binaries that can't be supported natively by the hardware."
In response to that, Catalin Marinas remarked, "The only downside I think is that for some syscalls it's not that efficient. Those using struct iovec come to mind, qemu probably duplicates the user structures, having to copy them in both directions."
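The duplication Catalin describes follows from the size of struct iovec: a pointer plus a length, so 8 bytes for a 32-bit process and 16 bytes for a 64-bit one. A minimal sketch of the copy, assuming little-endian layouts on both sides; widen_iovecs() is a name invented here, and a real emulator would also have to translate each guest address into a host address.

```python
import struct

IOVEC32 = struct.Struct("<II")  # 32-bit iov_base, 32-bit iov_len
IOVEC64 = struct.Struct("<QQ")  # 64-bit iov_base, 64-bit iov_len


def widen_iovecs(raw32: bytes) -> bytes:
    """Copy an array of 32-bit iovec entries into the 64-bit layout."""
    out = bytearray()
    for base, length in IOVEC32.iter_unpack(raw32):
        out += IOVEC64.pack(base, length)
    return bytes(out)


guest = IOVEC32.pack(0x08049000, 64) + IOVEC32.pack(0x0804A000, 128)
host = widen_iovecs(guest)
print(len(guest), len(host))  # 16 32
```

Every readv() or writev() style call pays for this extra pass over the array before the real system call can be made.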
Regardless, Catalin opposed Ryan's patch, saying "Those binary translation tools need to explore the user-only options first and come up with some perf numbers to justify the proposal." The perf tool is for performance analysis of the Linux kernel. It's common to see developers asking for the perf numbers for a given patch, to make sure it doesn't slow things down too much.
Arnd Bergmann also spoke in favor of QEMU, saying, "qemu does a nice job at translating the interfaces for many combinations of host and target architectures at a decent speed, and is improving at both the compatibility and the performance over time."
But he also acknowledged:
"The ioctl emulation in qemu is limited in multiple ways:
* It needs to duplicate the kernel's compat emulation for every single command it wants to handle, and will always lag behind what gets merged into the kernel and what drivers a particular distro ships.
* Some ioctl commands cannot be emulated in user space because the compat code relies on tracking device state in the kernel.
* In some cases, emulation can be expensive, both for runtime overhead and for code complexity."
Arnd opposed Ryan's patch as well, or at least was not convinced it was needed. He thought it might be better to try to address the ioctl insanity on its own, rather than the entire system call layer. And once ioctls had been brought under control, Arnd felt, the rest of the system call layer would not pose many more problems.
David Laight also had issues with Ryan's patch. As Catalin pointed out already, performance would be a significant question. But David said, "I don't think the problem is only the performance. The difficulty is knowing when structures need changing. A typical example is driver ioctl requests. Any user space adaption layer would have to know which actual driver has been opened and what internal structures it has. [...] It is much easier to get it right in the code that knows about the actual structures."
Mark Rutland also made the point that exposing a full set of compatibility system calls might be more than Ryan actually needed. For example, all Ryan really needed, Mark R. said, was "being able to limit the range of mmap() and friends." He went on, "I think that for this series x86 emulation is a red herring. ABIs differ across 32-bit arm and 32-bit x86 (and they have distinct arch-specific syscalls), and so those need distinct compatibility layers. If you're wanting to accelerate x86 emulation, I don't think this is the right approach."
Mark R. went on to say that, in fact, a more targeted approach could actually benefit more projects. He said, "For example, having variants with an explicit address mask would also benefit JITs which want to turn VA bits into additional tag bits."
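The tag-bit idea is easy to picture: if the kernel guaranteed that every mapping stayed below a given address mask, the bits above the mask would be free for a JIT to use. A minimal sketch under assumptions: the 48-bit address width and the function names are chosen for illustration and do not come from the patch discussion.

```python
ADDR_BITS = 48                      # assumed usable address width
ADDR_MASK = (1 << ADDR_BITS) - 1
TAG_LIMIT = 1 << (64 - ADDR_BITS)   # 16 spare bits in a 64-bit word


def tag_pointer(addr: int, tag: int) -> int:
    """Store a small tag in the bits above the address mask."""
    if addr & ~ADDR_MASK:
        raise ValueError("address does not fit under the mask")
    if not 0 <= tag < TAG_LIMIT:
        raise ValueError("tag does not fit in the spare bits")
    return (tag << ADDR_BITS) | addr


def untag_pointer(word: int) -> tuple:
    """Split a tagged word back into (address, tag)."""
    return word & ADDR_MASK, word >> ADDR_BITS


word = tag_pointer(0x7F00DEADB000, 0x2A)
print(hex(word))  # 0x2a7f00deadb000
```

The scheme only works if no valid address ever sets a bit above the mask, which is why a way to ask the kernel for that guarantee would help.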
Finally, Mark R. voiced his opposition to Ryan's patch, adding, "However, I do think that we can make emulation easier by extending some syscalls (e.g. mmap()), and that this would be worthwhile regardless of emulation." He concluded, "I appreciate that people have 32-bit applications, and want to run those, but I'm not convinced that this requires these specific changes. Especially considering that the cross-architecture cases you mention are not addressed by this, and need syscall emulation in userspace; that implies that in general userspace needs to handle conversion of semantics, and what the kernel needs to do is provide the primitives onto which userspace can map things in order to get the desired semantics (which is not the same as blindly exposing compat syscalls)."
Amid all the voices expressing concern, Amanieu d'Antras said he liked Ryan's patch and thought it should be accepted.
First of all, Amanieu felt that speed and efficiency were not as important as others had suggested, and that the main point was correct emulation. And the bottom line, he said, was that user space simply "does not have the information or the capabilities needed to ensure that the 32-bit syscall ABI is correctly emulated."
Amanieu also pointed out that while exporting a compatibility system call layer would not solve all the problems, it would allow emulators to take up the various slack of "memory management, signal handling, /proc emulation, ptrace emulation, etc." With that done, the emulator could pass the resulting system call through to the user code, where it would properly do its thing.
He listed a bunch of technical requirements that he felt couldn't be done in user space, saying, "these issues are all solved by exposing compat syscalls to 64-bit processes and ensuring is_compat_task/in_compat_syscall is true for the duration of that syscall."
Amanieu said something similar was already done in Linux, specifically "on x86, syscalls made with int 0x80 are treated as 32-bit syscalls even if they come from a 64-bit process."
Mark R. sympathized with the point that "there are cases where the emulator cannot do the right thing due to lack of a mechanism." But rather than try to compensate for this by exposing more mechanisms from the kernel, Mark R. said, "where the emulator does not have knowledge, I don't think that it can safely invoke the syscall."
He went on to say that there were numerous cases where the kernel would be unable to determine what the correct behavior might be. And that, "the kernel cannot possibly do something that is correct in this case." He concluded, "Maybe your emulator avoids these, but that's no justification for the kernel to expose such broken behaviour."
Amanieu completely disagreed with Mark R.'s take on the situation, saying, "I disagree that any broken behavior is exposed here."
At that point, the conversation came to an end.
It's doubtful that anything short of absolute necessity would lead Linus Torvalds to accept a full-on system call compatibility layer into the kernel. It would lock Linux into emulating ancient hardware essentially forever.
But Ryan and his supporters have some valid points as well. Aside from everything else, it's fundamentally good to be able to run compiled binaries where the source code has been lost, and where the only other way to run that code would require a chronometric displacement aperture of human proportions, in which case we would have other problems.
Ultimately, there seems to be plenty of room for compromise – I would guess Linus would want the kernel to export anything that was truly needed by user space emulators, and the emulator people will be satisfied with that since it will let them do their main thing.
Hiding System Resources … from the System
Mike Rapoport recently submitted a patch to implement secret memory areas in Linux. As he explained, "The pages in that mapping will be marked as not present in the direct map and will be present only in the page table of the owning mm." And he went on, "such secret mappings are useful for environments where a hostile tenant is trying to trick the kernel into giving them access to other tenants' mappings."
Part of Mike's idea was to hide the secret memory in a file – or at least to make it accessible via a file descriptor, as if it were an ordinary file. He explained:
"Hiding secret memory mappings behind an anonymous file allows usage of the page cache for tracking pages allocated for the 'secret' mappings as well as using address_space_operations for e.g. page migration callbacks.
"The anonymous file may be also used implicitly, like hugetlb files, to implement mmap(MAP_SECRET) and use the secret memory areas with 'native' mm ABIs in the future."
In Mike's vision, the feature would be disabled by default and would require a boot-time command-line argument to activate.
David Hildenbrand had some questions. Among other things, he said, "secret" memory allocations would be invisible to the various memory management features in the kernel. This would mean that blocks of memory would be sitting immobile in RAM, while other blocks could be moved around and reorganized as needed. So a lot of Linux features, such as process migration and other cool things, might be tripped up by "secret" memory.
Mike clarified, saying that actually, "secret" memory (which he called secretmem) would only be allocated from a region of RAM that was never moved around anyway. So it wasn't that secretmem was especially unmovable; it was that the kernel already had unmovable memory, and secretmem used that.
This didn't satisfy David, and the two – along with Michal Hocko – embarked on a technical discussion of different types of memory and the conditions under which they were used. As Michal put it:
"A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE. As I've said it is quite easy to land at the similar situation even with tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is really uncommon. It would be even worse that those would be allowed to consume both CMA/ZONE_MOVABLE. One has to be very careful when relying on CMA or movable zones."
Throughout the discussion, everyone also made reference to various aspects of memory handling that needed to be documented somewhere in the kernel. So, the lack of that documentation may have made the conversation more difficult.
Another element that may have made the conversation difficult was the fact that other areas of the kernel have to deal with similar issues – hot plugging, for example, as Michal pointed out.
There was also the question of how users could analyze their own systems. As David said at one point, "With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be a big lie of how your system behaves."
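The two figures David mentions are the ones most people read first. A minimal sketch of pulling them out of /proc/meminfo style text; the sample values are made up, and parse_meminfo() is a helper written for this illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Map each field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Made-up sample in the format of /proc/meminfo.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:         8192000 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
print(info["MemTotal"] - info["MemFree"])  # 8192000
```

On a live system, the same function can be fed the contents of /proc/meminfo. The subtraction says nothing about how much of the remaining memory can actually be moved or reclaimed, which is the gap David is pointing at.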
David also pointed out that at least in its current form, secretmem might allow processes to hog memory that should stay available to other processes. As he put it, "Secretmem gives user space the option to allocate a lot of GFP_{HIGH}USER memory. If I am not wrong, 'ulimit -a' tells me that each application on F33 can allocate 16 GiB (!) of secretmem. Which other ways do you know where random user space can do something similar? I'd be curious what other scenarios there are where user space can easily allocate a lot of unmovable memory."
And at one point, David gave some of his own context for the discussion:
"I am constantly trying to fight for making more stuff MOVABLE instead of going into the other direction (e.g., because it's easier to implement, which feels like the wrong direction).
"Maybe I am the only person that really cares about ZONE_MOVABLE these days :) I can't stop such new stuff from popping up, so at least I want it to be documented."
Michal, for his part, confirmed that "MOVABLE zone is certainly an important thing to keep working. And there is still quite a lot of work on the way. But as I've said this is more of a outlier than a norm. On the other hand movable zone is kinda hard requirement for a lot of application[s] and it is to be expected that many features will be less than 100% compatible. Some usecases even impossible."
The technical discussion continued for quite a while, and James Bottomley joined in at one point with a virtual machine developer's perspective.
Ultimately the discussion ended inconclusively. It's clear that there are many perspectives to consider and many stakeholders within the kernel, as well as a general lack of documentation and a lack of clarity on the behavior of various parts of the system.
secretmem is also a security feature, but not an essential one. It doesn't seem to plug any particular hole, but is intended more as camouflage to guard against the possibility of an attack. As such, it's not the sort of feature that Linus Torvalds tends to leap towards. Linus seems to be more focused on closing known security holes, rather than on reducing the surface area of potential attacks. So for now, it's unclear if secretmem could make it into the kernel, either in this form or another.