Zack's Kernel News
In kernel news: Vulnerabilities When Using a 32-Bit Kernel on a 64-Bit CPU; Working Around Hardware Security Vulnerabilities; and When It's OK to Panic.
Vulnerabilities When Using a 32-Bit Kernel on a 64-Bit CPU
Naresh Kamboju of Linaro reported that the kernel was issuing serious Spectre and Retbleed security warnings on the Linaro test farm. He said they were running the i386 kernel on a 64-bit Intel Skylake CPU – or at least, the debugging output he posted identified that specific CPU.
Greg Kroah-Hartman remarked that this particular combination – a 32-bit kernel running on a 64-bit architecture – was sort of pathological. People should really use a faster and more efficient 64-bit kernel if their architecture supports it. Greg said this particular environment might not be something the kernel developers needed to care about. He suggested that Pawan Gupta from Intel might be the person to ask.
Meanwhile, Peter Zijlstra remarked, "Yeah, so far nobody cared to fix 32bit. If someone *realllllly* cares and wants to put the effort in I suppose I'll review the patches, but seriously, you shouldn't be running 32bit kernels on Skylake / Zen based systems, that's just silly."
Pawan did reply to Greg, saying, "Intel is not aware of production environments that use 32-bit mode on Skylake-gen CPUs. So this should not be a concern."
However, Adam Borowski gave the following evaluation of the plight of potentially many users:
"Alas, some people still run [32-bit kernels on 64-bit CPUs] because of not knowing any better. Until not so long ago, they were proposed with two install media, '32-bit' and '64-bit', but no explanation. Upgrades keep working, crossgrades are still only for the brave of the heart, and reinstalling might not appear to have a reason compelling enough. And for quite some tasks, halved word size (thus ~2/3 memory usage) can overcome register starvation and win benchmarks.
"Thus I wonder: perhaps such combinations we consider to be invalid should refuse to boot unless given a cmdline parameter?"
Jan Engelhardt wryly remarked, "So how many benchmarks does a 32-bit userspace with a 32-bit kernel win over 32-bit userspace with a 64-bit kernel?" To which Adam replied, "Likely none or almost none."
However, Adam went on to say:
"What we want is for people to run 64-bit kernel, there are no real issues with userland.
"Valid uses to run 32-bit kernel:
- ancient hardware (so much more prevalent than m68k we support!; non-hobbyists should upgrade to reduce power costs)
- hardware to run that 100$k-1M ISA industrial control/medical imaging card (which, having ISA, is necessarily ancient too)
- us devs testing the above
"Only the last case will have a modern CPU, thus requiring an explicit override won't hurt less educated users – while telling the latter to grab a 64-bit kernel if their hardware isn't ancient would have other benefits for them beside just vulnerabilities."
It's fascinating to me to watch the developers weigh which kernel/architecture combinations they should care about. It's certainly true that some combinations make no practical sense, because an alternative would run much faster and more safely. It's also true that many users aren't aware of the ins and outs of those choices and don't realize that their decision might affect the security of their system. To some extent, user ignorance may not be the kernel developers' problem, and distribution maintainers may be better placed to guard against certain risky user errors. But that division of labor is a fascinating question, and I always enjoy watching how the developers come down on one side or the other of it.
Working Around Hardware Security Vulnerabilities
Recently on the Linux kernel mailing list, Thomas Gleixner remarked, "Back in the good old spectre v2 days (2018) we decided to not use IBRS. In hindsight this might have been the wrong decision because it did not force people to come up with alternative approaches."
Indirect Branch Restricted Speculation (IBRS) is a hardware feature built into some Intel CPUs in response to security vulnerabilities, such as Spectre, that have been identified in recent years. When enabled, IBRS restricts how the CPU speculates across indirect branches, which blocks the exploitation of some of those vulnerabilities; the kernel turns the feature on by setting a bit in a model-specific register (MSR).
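To make that concrete, the control works roughly as follows. MSR_IA32_SPEC_CTRL and SPEC_CTRL_IBRS are the kernel's real names for the register and the bit; the function wrapping them here is a simplified sketch, not the kernel's actual mitigation code:

#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/* Simplified sketch: enable IBRS by setting its bit in the
 * IA32_SPEC_CTRL model-specific register. */
static void enable_ibrs_sketch(void)
{
        u64 spec_ctrl;

        rdmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
        spec_ctrl |= SPEC_CTRL_IBRS;    /* restrict indirect branch speculation */
        wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
}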
But like all such attempts to mitigate these tough security vulnerabilities, there are trade-offs: For example, one security solution might be to avoid a large swath of CPU features, which in turn means that those features must be implemented in the kernel itself, less efficiently. So it's never an obvious choice to simply accept the hardware security features Intel provides; the kernel developers must weigh the pros and cons in each case.
For example, as Thomas expressed this time, "It was already discussed back then to try software based call depth accounting [...] to avoid the insane overhead of IBRS."
Thomas proceeded to explore the specific aspects of IBRS overhead, including increased memory usage and accounting errors, which he said themselves could be exploited by an attacker.
He went on to say, "As IBRS is a performance horror show, Peter Zijlstra and me revisited the call depth tracking approach and implemented it in a way which is hopefully more palatable and avoids the downsides of the original attempt. We both unsurprisingly hate the result with a passion."
Thomas described his and Peter's implementation approach, which was a nightmare-tentacled god of death involving somehow identifying code that was making certain calls, patching those calls to be different (all while the system was in full operation), and then doing various tasks before sending those calls back to where they thought they were going in the first place.
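Conceptually, the accounting those patched-in thunks perform looks something like the following C-flavored sketch. The real implementation is hand-written assembly patched in at runtime; the names, the helper, and the structure here are invented purely to illustrate the idea:

#include <linux/percpu.h>

/* Invented illustration of call depth accounting: count calls, and when
 * the matching returns are about to exhaust the CPU's Return Stack
 * Buffer (RSB), refill it with harmless entries so return speculation
 * cannot be steered by an attacker. */
static DEFINE_PER_CPU(int, call_depth);

void stuff_rsb_with_safe_entries(void);  /* hypothetical helper */

static void on_function_call(void)
{
        this_cpu_inc(call_depth);
}

static void on_function_return(void)
{
        if (this_cpu_dec_return(call_depth) < 0) {
                stuff_rsb_with_safe_entries();
                this_cpu_write(call_depth, 0);
        }
}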
He added, "This does not need a new compiler and avoids almost all overhead for non-affected machines." But Thomas went on to say that in their solution, "The memory consumption is impressive. On a affected server with a Debian config this results in about 1.8MB call thunk memory and 2MB btree memory to keep track of the thunks. The call thunk memory is overcommitted due to the way how objtool collects symbols. This probably could be cut in half, but we need to allocate a 2MB region anyway."
Thomas went into some detail on the scholarly basis for how IBRS (and his and Peter's code) would defeat the various security exploits. Although he also added, "There is obviously no scientific proof that this will withstand future research progress, but all we can do right now is to speculate about that."
He posted a batch of performance benchmarks, showing that their software-based solution was at least never slower than Intel's hardware-based IBRS solution. Of course, one benefit of a software-based solution is that it can be revised and improved by developers, while hardware is forever. Thomas and Peter's work turned out to be a case in point, as the best was most definitely yet to come.
Thomas and David Laight discussed some of the technical details, including ways of improving efficiency and reducing overhead. In fact, part of the potential improvements also involved patches to the GNU C Compiler (GCC). Thomas summarized these changes as "Let the compiler add a 16 byte padding in front of each function entry point and put the call depth accounting there. That avoids calling out into the module area and reduces ITLB pressure."
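That mechanism builds on GCC's existing patchable function entry support, in which the compiler reserves NOP bytes around each function that can be rewritten at runtime. As a rough illustration (the kernel applies this globally via the -fpatchable-function-entry=16,16 compiler flag rather than per function), the same padding can be requested with a function attribute:

/* GCC's patchable_function_entry(N, M) attribute emits N NOP bytes,
 * M of them before the function's entry point. With N = M = 16, all
 * 16 padding bytes sit in front of the function, ready to be patched
 * at runtime with the call depth accounting code. */
__attribute__((patchable_function_entry(16, 16)))
void padded_example(void)
{
        /* function body as usual */
}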
At this point Linus Torvalds joined the discussion, saying:
"Ooh.
"I actually like this a lot better.
"Could we just say 'use this instead if you have SKL and care about the issue?'
"I don't hate your module thunk trick, but this does seem *so* much simpler, and if it performs better anyway, it really does seem like the better approach.
"And people and distros who care would have an easy time adding that simple compiler patch instead."
To which Thomas replied, "Yes, Peter and I came from avoiding a new compiler and the overhead for everyone when putting the padding into the code. We realized only when staring at the perf data that this padding in front of the function might be an acceptable solution. I did some more tests today on different machines with mitigations=off with kernels compiled with and without that padding. I couldn't find a single test case where the result was outside of the usual noise. But then my tests are definitely incomplete."
Peter joined the discussion at this point, as well, along with Joao Moreira and others, and the bunch of them tried to figure out the best form for the GCC patch, along with other implementation details for the rest of the fix. The discussion delved pretty deep into technical details and ranged all over the place, with pretty much everyone being more or less confused at least some of the time.
At one point, Tim Chen of Intel posted some performance benchmarks using the new "padding" approach from Thomas and Peter, and he said, "Padding improves performance significantly."
Linus looked at those numbers and replied, "That certainly looks oh-so-much better than those disgusting ibrs numbers."
The discussion ended roughly around there, with people seeming very enthusiastic about the prospect of a much more efficient approach to Retbleed security problems.
It's an ongoing process – the latest improvement is merely the new worst-case that everyone will love to improve upon. And it's amazing to know that in the open source world, we can watch the developers struggle in real time to protect everyone from attackers that themselves work night and day to crack into all of our systems.
When It's OK to Panic
David Hildenbrand submitted some security code updates that ran into resistance from Linus Torvalds. Among David's various changes was the use of the VM_BUG_ON() macro. This macro and its various siblings check the state of the running system and bring it to a halt if that state seems to have become pathological. This seems sensible: If your system is horked, you may not want to blithely continue whatever it is you're doing.
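For reference, the distinction David and Linus go on to argue about is visible in the macro itself. Simplified from include/linux/mmdebug.h:

/* Simplified from include/linux/mmdebug.h: with CONFIG_DEBUG_VM set,
 * VM_BUG_ON() is a full BUG_ON() and halts the machine; without it,
 * the check compiles away entirely (while still type-checking the
 * condition). */
#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
#define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)
#endif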
However, in response to David's use of that macro, Linus replied, "STOP DOING THIS."
He went on to explain:
"Using BUG_ON() for debugging is simply not ok.
"And saying 'but it's just a VM_BUG_ON()' does not change *anything*. At least Fedora enables that unconditionally for normal people, it is not some kind of 'only VM people do this'.
"Really. BUG_ON() IS NOT FOR DEBUGGING.
"Stop it. Now.
"If you have a condition that must not happen, you either write that condition into the code, or – if you are convinced it cannot happen – you make it a WARN_ON_ONCE() so that people can report it to you.
"The BUG_ON() will just make the machine die.
"And for the facebooks and googles of the world, the WARN_ON() will be sufficient."
David, unflustered, replied, "I totally agree with BUG_ON … but if I get talked to in all-caps on a Thursday evening and feel like I just touched the forbidden fruit, I have to ask for details."
David continued:
"VM_BUG_ON is only active with CONFIG_DEBUG_VM. … which indicated some kind of debugging at least to me. I *know* that Fedora enables it and I *know* that this will make Fedora crash.
"I know why Fedora enables this debug option, but it somewhat destroys the whole purpose of VM_BUG_ON kind of nowadays?
"For this case, this condition will never trigger and I consider it much more a hint to the reader that we can rest assured that this condition holds. And on production systems, it will get optimized out.
"Should we forbid any new usage of VM_BUG_ON just like we mostly do with BUG_ON?"
Linus explained:
"VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is 'we can make the code smaller because these are less important'.
"The only possible case where BUG_ON can validly be used is 'I have some fundamental data corruption and cannot possibly return an error'.
"This kind of 'I don't think this can happen' is _never_ an excuse for it.
"Honestly, 99% of our existing BUG_ON() ones are completely bogus, and left-over debug code that wasn't removed because they never triggered. I've several times considered just using a coccinelle script to remove every single BUG_ON() (and VM_BUG_ON()) as simply bogus. Because they are pure noise.
"I just tried to find a valid BUG_ON() that would make me go 'yeah, that's actually worth it', and couldn't really find one. Yeah, there are several ones in the scheduler that make me go 'ok, if that triggers, the machine is dead anyway', so in that sense there are certainly BUG_ON()s that don't _hurt_.
"But as a very good approximation, the rule is 'absolutely no new BUG_ON() calls _ever_'. Because I really cannot see a single case where 'proper error handling and WARN_ON_ONCE()' isn't the right thing.
"Now, that said, there is one very valid sub-form of BUG_ON(): BUILD_BUG_ON() is absolutely 100% fine."
Jason Gunthorpe remarked that he had heard people make the argument, "Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything." Jason added that "I've generally maintained that people who set the panic_on_warn *want* these crashes, because that is the entire point of it. So we should use WARN_ON with an error recovery for 'can't happen' assertions like these."
To which Linus offered this fascinating explanation:
"If you set 'panic_on_warn' you get to keep both pieces when something breaks.
"The thing is, there are people who *do* want to stop immediately when something goes wrong in the kernel.
"Anybody doing large-scale virtualization presumably has all the infrastructure to get debug info out of the virtual environment.
"And people who run controlled loads in big server machine setups and have a MIS department to manage said machines typically also prefer for a machine to just crash over continuing.
"So in those situations, a dead machine is still a dead machine, but you get the information out, and panic_on_warn is fine, because panic and reboot is fine.
"And yes, that's actually a fairly common case. Things like syzkaller etc *wants* to abort on the first warning, because that's kind of the point.
"But while that kind of virtualized automation machinery is very very common, and is a big deal, it's by no means the only deal, and the most important thing to the point where nothing else matters.
"And if you are *not* in a farm, and if you are *not* using virtualization, a dead machine is literally a useless brick. Nobody has serial lines on individual machines any more. In most cases, the hardware literally doesn't even exist any more.
"So in that situation, you really cannot afford to take the approach of 'just kill the machine'. If you are on a laptop and are doing power management code, you generally cannot do that in a virtual environment, and you already have enough problems with suspend and resume being hard to debug, without people also going 'oh, let's just BUG_ON() and kill the machine'.
"Because the other side of that 'we have a lot of machine farms doing automated testing' is that those machine farms do not generally find a lot of the exciting cases.
"Almost every single merge window, I end up having to bisect and report an oops or a WARN_ON(), because I actually run on real hardware. And said problem was never seen in linux-next.
"So we have two very different cases: the 'virtual machine with good logging where a dead machine is fine' – use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users.
"Both are valid. But the second case means that BUG_ON() is basically _never_ valid."
So there you have it.