Zack's Kernel News

Article from Issue 212/2018

Chronicler Zack Brown reports on the NOVA filesystem, making system calls userspace only, and extending module support to plain executables. 

The NOVA Filesystem

As flash drives and other forms of nonvolatile RAM continue to supplement or replace traditional optical and magnetic technologies, new filesystems emerge to support them. Recently Andiry Xu, Lu Zhang, and Steven Swanson posted an update to their nascent NOn-Volatile memory Accelerated (NOVA) filesystem.

NOVA uses a log-based approach, which means that it writes new data sequentially onto the drive, leaving old data to sit until it's reclaimed by the system. One benefit of this is that older versions of files can be kept as "snapshots." Traditional filesystems written for drives based on rotating magnetic disks tend to update files by seeking to the file's location and updating its data in place. That reduces file fragmentation on those drives, which reduces the amount of time spent on seeking to different parts of the file. Nonvolatile RAM doesn't pay that penalty: it can let a file's data end up scattered across sequentially written chunks, because seek times aren't an issue for that kind of hardware.
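The log-based idea described above can be illustrated with a toy model. This is only a sketch of the general append-only technique, not NOVA's actual on-media layout; the class and method names (LogStore, snapshot, reclaimable) are invented for illustration.

```python
class LogStore:
    """Toy append-only store: every write lands at the tail of the log."""

    def __init__(self):
        self.log = []    # (name, data) records, in the order they were written
        self.index = {}  # name -> position of the newest record for that name

    def write(self, name, data):
        # Sequential write: old records are never modified in place.
        self.log.append((name, data))
        self.index[name] = len(self.log) - 1

    def read(self, name):
        return self.log[self.index[name]][1]

    def snapshot(self):
        # A snapshot is a frozen copy of the index; the records it
        # points at are still sitting in the log.
        return dict(self.index)

    def read_at(self, snap, name):
        return self.log[snap[name]][1]

    def reclaimable(self, snapshots=()):
        # Positions that neither the live index nor any kept snapshot
        # refers to; a real system would eventually reclaim this space.
        live = set(self.index.values())
        for snap in snapshots:
            live.update(snap.values())
        return [pos for pos in range(len(self.log)) if pos not in live]


store = LogStore()
store.write("notes.txt", b"first draft")
snap = store.snapshot()
store.write("notes.txt", b"second draft")

print(store.read("notes.txt"))           # newest version
print(store.read_at(snap, "notes.txt"))  # older version, via the snapshot
print(store.reclaimable())               # stale records if no snapshot is kept
print(store.reclaimable([snap]))         # nothing stale while the snapshot lives
```

In this model, keeping a snapshot costs little more than keeping a copy of the index, which is why snapshots come naturally to log-based designs.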

The NOVA filesystem has reached the stage where it can do some good things, but its developers aren't yet ready to add it into the kernel for general consumption. This time around, they asked for feedback from the kernel folks. There was not a huge amount of discussion, but enough bug reports and general encouragement to indicate that NOVA is likely to go into the kernel as soon as it's ready.

Making System Calls Userspace Only

Lately, there's been a lot of interest in isolating user space from kernel space, partly because of some severe flaws in Intel hardware that have been discovered recently, affecting the security of a large number of computers on planet Earth. The Linux developers want to work around these flaws, while sacrificing as little as possible, especially in the area of speed.

One area affected by this has been the kernel's use of system calls. Traditionally, a system call is the way user software requests services from the kernel. You use system calls to open and close files, write data to disk, and so on. But the kernel itself has also traditionally used system calls to do all those things. To keep system calls entirely a userspace tool, and not a mechanism for hostile actors to gain access to kernel space, the massive task of removing all syscall invocations from the kernel has become necessary.
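For readers who haven't seen system calls up close, here is a minimal userspace sketch. It assumes a POSIX-like system and uses Python's os module, whose functions are thin wrappers around the corresponding calls; the helper name write_and_read_back and the file name are made up for the example.

```python
import os
import tempfile


def write_and_read_back(message: bytes) -> bytes:
    """Write a small message to a scratch file and read it back,
    using the low-level wrappers around the open/write/read/close
    family of system calls."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "demo.txt")

        # Ask the kernel for a file descriptor (open family).
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, message)  # hand the bytes to the kernel
            os.fsync(fd)           # ask the kernel to flush them to storage
        finally:
            os.close(fd)           # give the descriptor back

        fd = os.open(path, os.O_RDONLY)
        try:
            return os.read(fd, len(message))
        finally:
            os.close(fd)


print(write_and_read_back(b"hello from user space\n"))
```

Each of these calls crosses the boundary from user space into the kernel, which is exactly the boundary the developers want system calls to be reserved for.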

There's also some urgency, because the best current workarounds to Intel's hardware problems involve taking a sizable speed hit. The kernel developers want to gradually migrate away from those workarounds and replace them with sleeker, hipper solutions that know how to dance.

Like other efforts to massively change the way the kernel does things, this one can't be done all at once. Many system call invocations can be removed from the kernel relatively easily, but some have subtleties that will take time to figure out and resolve. This is reminiscent of the effort to get rid of the Big Kernel Lock (BKL), in which many instances could be easily replaced by simpler locks, but some had specific requirements that needed more complicated solutions.

Like the BKL effort, this will be an ongoing project. Some parts of the kernel will have to masquerade as user space in order to retain permission to use system calls, at least for now. And of course, the whole effort is being done in conjunction with many related projects, all trying to find better ways of dealing with the Intel hardware flaws.

Extending Module Support to Plain Executables

It may soon be possible to load and run plain ELF executables as kernel modules. Alexei Starovoitov has been working on this, and it seems to offer some security benefits. But it also may introduce new risks that will require careful implementation.

Alexei posted some patches recently for this and listed off some of the benefits. Unlike a regular kernel module, the ELF executable would run as a user-mode process, which means that if it crashed, it wouldn't necessarily bring the entire system down with it. It would also be subject to all the normal controls placed on user processes, including the out-of-memory (OOM) killer, which steps in when a system is almost out of memory and tries to identify and kill whichever process is most likely to be the cause. Exactly how well the OOM killer is able to do that is a tough question that is the subject of much ongoing work; however, with Alexei's patch, it would have a new pool of processes to consider. Debugging, testing, and profiling ELF executables is also something that can be done with regular user tools, which are more plentiful and possibly also more familiar.

These are all good reasons to include the feature in the kernel, and initially Linus Torvalds was very enthusiastic about it. His only initial suggestion was to increase the amount of logging that Alexei's code performed, so it would never be possible for a module to be loaded without the user seeing it.

There were a couple of criticisms at first that turned out not to be much of a problem. For example, Andy Lutomirski suggested that instead of coding this feature into the kernel itself, it could simply be added to the modprobe program and exist entirely in user space. Traditionally, anything that can go into user space should go into user space. Why not this?

But Linus felt that the module loading logic had become a real mess when it had been left out of the kernel and was susceptible to poor decisions by its maintainers, with which the kernel would then have to live. Moving the module logic inside the kernel, he said, had greatly improved the situation. The only thing modprobe should do, he said, was to track dependencies between modules and load them in without performing any additional checks or changes. Let the kernel handle that stuff. He remarked, "I do *not* want the kmod project that is then taken over by systemd, and breaks it the same way they broke firmware loading. [...] Right now kmod is a nice simple project. Lots of testsuite stuff, and a very clear goal. Let's keep kmod doing one thing, and not even have to care about internal kernel decisions like 'oh, this module might not be a module, but an executable'. If anything, I think we want to keep our options open, in the case we need or want to ever consider short-circuiting things and allowing direct loading of the simple cases and bypassing modprobe entirely."

So that objection turned out to be a nonissue. Likewise, Kees Cook had an objection that didn't go very far – he was concerned that Alexei's patch might make security exploits easier, in the event of certain types of bugs appearing in the kernel. Specifically, if it ever inadvertently became possible for a hostile user to break a module out of a virtual system, such as a container, running on top of the kernel, Alexei's feature would allow the ELF module to execute arbitrary code deep within the kernel.

But Linus didn't find that argument compelling either. Specifically, as he and others pointed out, in the circumstance Kees mentioned, a regular kernel module would be much more powerful and dangerous to let loose than an ELF binary. So Alexei's code wouldn't make the risk any worse than it already was. Additionally, they said, Kees's objection depended on there already being an exploitable security hole in the container code that might allow the hostile user to break a module out of that confinement in the first place. Any such security hole would be treated as a bug and fixed. So to Linus, it was also a nonissue.

But over the course of defending Alexei's patch from these criticisms, additional problems came up that were not so easily dismissed.

Linus raised one of these himself, pointing out that when a module was loaded, first its signature was checked, and then if it was OK, the module file would be loaded and run. But there was a moment in between checking the signature and loading the module, where "the execve() will end up not using the actual buffer we checked the signature on, but instead just re-reading the file." In which case, he went on, "somebody could maybe try to time it and modify the file after-the-fact of the signature check, and then we execute something else."
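The gap Linus described is a classic check-then-use race. The sketch below is a userspace analogy only, not the kernel's module loader: a plain SHA-256 digest stands in for a real module signature, and every name in it (load_racy, load_safe, demo, module.bin) is hypothetical.

```python
import hashlib
import os
import tempfile


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def load_racy(path, expected_digest, between=None):
    """Verify one read of the file, then hand back a second read."""
    with open(path, "rb") as f:
        checked = f.read()
    if sha256(checked) != expected_digest:
        raise ValueError("digest mismatch")
    if between:
        between()        # stand-in for an attacker winning the race
    with open(path, "rb") as f:
        return f.read()  # NOT the bytes that were verified


def load_safe(path, expected_digest, between=None):
    """Verify a buffer and hand back that same buffer."""
    with open(path, "rb") as f:
        checked = f.read()
    if sha256(checked) != expected_digest:
        raise ValueError("digest mismatch")
    if between:
        between()
    return checked       # exactly the bytes that were verified


def demo():
    good, evil = b"trusted module", b"attacker payload"
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "module.bin")

        def reset():
            with open(path, "wb") as f:
                f.write(good)

        def swap():
            with open(path, "wb") as f:
                f.write(evil)

        reset()
        racy = load_racy(path, sha256(good), between=swap)
        reset()
        safe = load_safe(path, sha256(good), between=swap)
    return racy, safe


racy, safe = demo()
print("racy loader ran:", racy)
print("safe loader ran:", safe)
```

The racy loader ends up with the swapped-in bytes even though the check passed, while the safe loader only ever uses the buffer it verified. Whatever fix the kernel adopts would need the same property: the thing that runs must be the thing that was checked.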

There were a couple of things standing in the way of doing this. And at first, it had seemed to Linus that anyone with sufficient privileges on the running system to modify files in the modules directory would already be able to simply run anything as root to begin with. If they could do that, there would be no reason for them to mess with modules, so there was no reason to try to guard against that possibility.

But on reflection, he'd realized that the hostile user wouldn't need to modify files in the modules directory; they would only need to copy files – something they could do with fewer permissions, yet still use the exploit to run arbitrary code in the kernel.

So, Linus said, this had to be addressed before Alexei's patch could go into the tree.

Additionally, Andy noticed something that was not a security hole in Alexei's patch, but was something almost as bad – a break in the kernel's application binary interface (ABI) backward compatibility. That refers to the ability of an arbitrary piece of compiled code to rely on the kernel behaving the way it did in the past. If the kernel breaks ABI compatibility, then certain pieces of existing compiled user code would break and need to be recompiled from source. And since not all source code is available for all the binaries running on Linux systems today, that represents an unacceptable change. Linus wants everything that can currently run on a Linux system to continue to be able to run.

It's not inconceivable that Linus might allow something to break part of the ABI, but it would be an exceptional circumstance. For example, if it were discovered that a certain part of the ABI contained a security hole that could only be fixed at the expense of ABI compatibility, Linus would make the change without hesitation. Other circumstances are much more iffy, though at times various kernel developers have argued in favor of lumping a bunch of hotly desired ABI changes into one big patch, like taking a swig of medicine and getting it over with.

So Andy's assertion that Alexei's code broke ABI compatibility was also a potential showstopper. Specifically, he explained to Alexei, "Without your patch, init_module doesn't keep using the file, so it's common practice to load a module and then delete or unmount it. With your patch, the unmount case breaks. This is likely to break existing user space."

Both Linus's and Andy's issues spawned a lot of discussion over how to fix them in Alexei's patch. And while no absolutely clear solutions seemed to emerge from the discussion, certainly something will be found to address those issues, and the code will eventually go into the kernel. It just seems like a feature a lot of people want, including Linus. Even Kees, who didn't like the security risks, felt that the feature itself was good. So there's plenty of motivation to iron out the details.

An interesting aspect of the whole issue is what does and does not constitute a security problem. As we saw, Kees had a security issue at the start of the discussion that Linus did not consider important. But Kees is the module security person, with authority (granted by Linus) to veto patches that introduce security holes in the kernel's module support. He's clearly very knowledgeable. And yet Linus did not agree with him about the significance of the issue he raised.

This is not just a question of Linus being right and Kees being wrong. A number of high-powered security people in the world would still agree with Kees that reducing the available attack surface of a given vulnerability is an important thing to do. And so Kees's initial objection – that if certain bugs appeared in the kernel then Alexei's feature could offer a large and tempting target to hostile attackers – is one with which many of those security people would agree. They would probably argue that by reducing the attack surface, it becomes easier to test for and guard against any such attack. By exposing a larger attack surface, an attacker might find an exploit that would be more difficult for the kernel developers to identify and fix.

Linus does not share that view. He would probably argue that a security hole is a security hole – if it exists, it should be closed; if it doesn't exist, then what's the problem? He has also said in the past that once an attacker is able to gain a certain level of privileges on a system, trying to mitigate what they can do with those privileges is sort of a waste of time, given that the attacker could use those privileges to work around whatever mitigations are put in place. In other words, once a hostile user gains root on a system, that's the ball game, and adding more patches won't change that.

It's a different approach to security. And it's one that leads some security-minded users to claim that Linux is less secure than, for example, some of the free software BSD systems.

My personal view is that these two approaches don't represent such vastly different positions as their advocates may think. If a security hole were to be discovered on either a BSD system or a Linux system, it would immediately be closed by the developers, and patches released to users. It's never the case that either the BSD or Linux projects just blithely release new versions with known security exploits in them. In BSD and Linux development, security issues trump all other considerations. The developers of any of these projects would remove whole subsystems without a second thought, rather than include a known security exploit.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
