Zack's Kernel News

Article from Issue 210/2018
Author(s): Zack Brown

Zack Brown reports on improving a hashing function, constant values adjustable at boot time, and dealing with an Intel design flaw. 

Improving a Hashing Function

Amir Goldstein posted a small fix to a kernel function that produces a hash value for an input string. He noticed that for 64-bit values, some bits were lost before the hash was calculated – resulting in a slightly weaker hash. For 32-bit values, he said, no bits were lost, and his patch left that code unchanged. Since the relevant stringhash.h file had no official maintainer, and Linus Torvalds had done some work on it in the past, Amir asked Linus what he thought of the fix and for advice on how to test it. As Amir put it, "I wouldn't even know where to begin testing its affects, or how to prove if there really is a problem."
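The patch itself isn't reproduced here, but the general idea is easy to sketch. The following is a purely hypothetical illustration – not Amir's actual patch; the helper name and the folding step are invented for the example – of how the upper bits of a 64-bit value could be folded into the result rather than dropped before the final 32-bit hash:

/*
 * Hypothetical illustration only -- not Amir's actual patch.
 * On 64-bit, fold the upper half of the running hash into the
 * lower 32 bits instead of silently dropping it, then run the
 * usual 32-bit finalizer.
 */
#include <linux/hash.h>		/* __hash_32() */
#include <linux/bitops.h>	/* BITS_PER_LONG */

static inline unsigned int end_name_hash_folded(unsigned long hash)
{
#if BITS_PER_LONG == 64
	hash ^= hash >> 32;	/* keep the high bits in play */
#endif
	return __hash_32(hash);
}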

Linus dug back into his memory, saying he believed the lost bits were intentional, though he acknowledged, "it's a long time ago, and we've changed some of the hashing since."

But Linus also said, "you're wrong that it's a no-op on 32-bit. It's a very expensive and pointless multiplication there too, even if the shift ends up being a no-op. The name hashing is pretty performance-sensitive."

Amir corrected Linus, saying that he hadn't claimed the patch would result in a no-op (code that does literally nothing and takes no time); he'd said that the patch left the code for the 32-bit case unchanged, except for the bit shift Linus had mentioned.

Linus pointed out that while the shift might cost nothing, the __hash_32() used nearby was very expensive. So, he concluded, "the patch as-is doesn't seem to buy anything, and only adds cost." He also remarked, "we did have numbers at one point. That's what really matters: how good the actual hashing ends up being. So for me to take the patch, I would need to see that it actually improves the hash bucket spreading enough to be worth the cost."

Amir replied that it was not his patch that added the __hash_32() call at all – it was already there.

To which Linus replied, "Oh, duh. My bad. It was indeed there already. Let me go back and look at the history of this thing."

A few days later, he wrote back more fully:

After having looked more at it, I take back all my complaints about the patch; you were right, and I was misreading things or just being stupid.

I also don't worry too much about the possible performance impact of this on 64-bit, since most architectures that actually care about performance end up not using this very much (the dcache code is the most performance-critical, but the word-at-a-time case uses its own hashing anyway).

So this ends up being mostly used for filesystems that do their own degraded hashing (usually because they want a case-insensitive comparison function).

A _tiny_ worry remains, in that not everybody uses DCACHE_WORD_ACCESS, and then this potentially makes things more expensive on 64-bit architectures with slow or lacking multipliers even for the normal case.

That said, realistically the only such architecture I can think of is PA-RISC. Nobody really cares about performance on that; it's more of a "look ma, I've got warts^W an odd machine" platform.

So I think your patch is fine, and all my initial worries were just misplaced from not looking at this properly.

Sorry.

Linus also remarked, "I *would* love to get some kind of estimate on how this changes the hash distribution. Do you have anything like that? The way this function is designed to be used, the upper bits should be used for the hash bucket information, so doing some bucket size distribution on (say) the upper ~12 bits or something with some real string input would be really good."

Amir replied, "I've added bucket counters and tracing number of entries/number of non-empty buckets and largest bucket to dcache (patch attached). Increased d_hash_shift to take only 12 bits for hash table offset. The results vary a bit per run, but below are some outputs from sample runs of find/of 55K entries right after boot. The bottom line is that, at least w.r.t this simplistic bucket analysis and this specific workload, the patch does not seem to move the needle. The bucket distribution of before and after patch are similar and also similar to the word-at-a-time results."

He included some numbers, and the thread ended.
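For readers curious what such a bucket analysis looks like, here is a minimal user-space sketch (the hash function is just a stand-in – a real test would call the kernel's own string hash from stringhash.h): it reads names from standard input, indexes buckets by the upper 12 bits of each hash, and prints the same statistics Amir mentions – total entries, non-empty buckets, and the largest bucket.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BUCKET_BITS 12
#define NBUCKETS    (1u << BUCKET_BITS)

/* Stand-in 64-bit string hash; a real test would use the kernel's
 * own hashing (full_name_hash()/hashlen_string()) instead. */
static uint64_t hash64(const char *s)
{
	uint64_t h = 0;

	while (*s)
		h = h * 0x9e3779b97f4a7c15ULL + (unsigned char)*s++;
	return h;
}

int main(void)
{
	static unsigned int counts[NBUCKETS];
	unsigned int used = 0, largest = 0, total = 0;
	char name[256];

	while (fgets(name, sizeof(name), stdin)) {
		name[strcspn(name, "\n")] = '\0';
		/* bucket index comes from the upper 12 bits of the hash */
		unsigned int b = hash64(name) >> (64 - BUCKET_BITS);

		if (counts[b]++ == 0)
			used++;
		if (counts[b] > largest)
			largest = counts[b];
		total++;
	}
	printf("%u entries, %u/%u buckets used, largest bucket %u\n",
	       total, used, NBUCKETS, largest);
	return 0;
}

Feeding it something like the output of find /usr -printf '%f\n' roughly approximates the directory-entry workload Amir describes.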

Constant Values Adjustable at Boot Time

Some parts of the kernel run so hot that even the small overhead of reading a value from a variable, rather than using a true constant, can make a measurable speed difference. Kirill A. Shutemov recently posted a patch to introduce "patchable constants." These are constant values that can be set at boot time, either as part of the kernel configuration itself or by the user on the bootloader command line. Once set, they behave as true constant values in the running kernel, rather than as variables.

As Kirill explained it, "Patchable constants implemented by replacing a constant with call to inline function that returns the constant value using inline assembler. In inline assembler, we also write down into separate section location of the instruction that loads the constant. This way we can find the location later and adjust the value."
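Kirill's macros aren't quoted in full, but the mechanism he describes can be sketched roughly as follows (the function name and section name are illustrative assumptions, not his actual code):

/* Rough illustration of the mechanism Kirill describes -- not his actual
 * patch.  The constant value is baked into the movabsq immediate, and the
 * instruction's location is recorded in a separate section so boot code
 * can find it later and patch in the real value. */
static __always_inline unsigned long physical_mask_const(void)
{
	unsigned long mask;

	asm("1:	movabsq $0x0123456789abcdef, %0\n\t"
	    ".pushsection .patchable_consts, \"a\"\n\t"
	    ".quad 1b\n\t"		/* address of the load instruction */
	    ".popsection"
	    : "=r" (mask));
	return mask;
}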

The reason Kirill wanted to implement this general feature was to solve the specific case of the __PHYSICAL_MASK variable, which had to become dynamically settable to support memory encryption. But because memory encryption has a speed impact on absolutely everything, Kirill wanted to shave as many assembly instructions off the code as possible. Reimplementing the variable as a patchable constant solved that problem.

However, Kirill acknowledged that it came with an unacceptable cost. He said, "This conversion makes GCC generate worse code. Conversion __PHYSICAL_MASK to a patchable constant adds about 5k in .text on defconfig and makes it slightly slower at runtime (~0.2% on my box)."

So the very effort to speed things up in one way ended up slowing the code down in another. Kirill pointed out that he was submitting this patch not for inclusion in the kernel, but in the hope that someone might have a better idea for how to get the benefits of patchable constants without this unwanted cost.

Linus Torvalds replied, saying:

I actually wanted something very close to this, but I think your approach is much too simplistic. You force all constants into a register, which means that the resulting code is always going to be very far from optimal. You also force a big movabsq instruction, which really is huge, and almost never needed. Together with the "use a register," it just makes for big code.

He added, "What I wanted was something that can take things like a shift by a variable that is set once and turn it into a shift by a boot-time constant. Which means that you literally end up patching the 8-bit immediate in the shift instruction itself."

He posted some assembly code, showing how several lines could logically be boiled down to just one, which he said would result in "much smaller code, and register %rcx isn't used at all. And no D$ miss on loading that constant (that is a constant depending on boot-time setup only)." He went on to say, "It's rather more complex, but it actually gives a much bigger win. The code itself will be much better, and smaller. The *infrastructure* for the code gets pretty hairy, though."
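Linus's exact assembly isn't reproduced here, but the shape of his idea can be sketched like this (again purely illustrative – the section name and labels are assumptions): instead of loading a boot-time value into a register and shifting by %cl, the shift count lives as an 8-bit immediate inside the instruction itself, and only that byte gets patched during boot.

/* Illustrative sketch of the idea, not Linus's posted code.  Instead of
 *	movq	boot_shift(%rip), %rcx
 *	shrq	%cl, %rax
 * the count is an immediate in the shrq instruction, and its location is
 * recorded so early boot code can overwrite that single byte. */
static __always_inline unsigned long shift_by_boot_const(unsigned long val)
{
	asm("shrq $12, %0\n"		/* placeholder count, patched at boot */
	    "1:\n\t"
	    ".pushsection .patch_shift, \"a\"\n\t"
	    ".quad 1b - 1\n\t"		/* address of the 8-bit immediate */
	    ".popsection"
	    : "+r" (val));
	return val;
}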

He pointed to a discussion from 18 months earlier, in which H. Peter Anvin had done some work on a similar idea. Peter had called it a "disgusting pseudo-self-modifying code idea," and it hadn't been completed because of all the extra complexity it introduced.

Peter replied, "I am currently working on it much more comprehensive set of patches for extremely this. I am already much further ahead and support most operations." He added, "The patchset I have is about 85% complete. It mostly needs cleanup, testing, and breaking into reasonable chunks." He also remarked, "The main reason I haven't submitted it yet is that I got a bit overly ambitious and wanted to implement a whole bunch of more complex subcases, such as 64-bit shifts on a 32-bit kernel. The win in that case is actually quite huge, but it requires data-dependent code patching and not just immediate patching, which requires augmentation of the alternative framework."

The discussion ended at that point.

Because open source is so decentralized, overlapping projects like this do sometimes happen, often resulting in a lot of wasted work or ongoing controversy over which approach is better. But in this case, given that Kirill had posted his patch specifically because he had not found a way to solve the underlying problems, it doesn't look like there will be much controversy, at least over which project to accept into the kernel. There still may be some debate over the value of adding so much complex infrastructure to handle self-modifying constants. It may turn out to be so bug-prone and unmaintainable that it defeats the value of the speed increase.

Dealing with an Intel Design Flaw

A serious design flaw in Intel microprocessors has recently forced various operating systems, including Linux, to come up with workarounds. KarimAllah Ahmed posted a patch to control Indirect Branch Speculation (IBS) – essentially an attempt to manage the hardware deficiencies of Intel chips by enabling speculation features when they are needed and safe, and locking them down when their flaws might result in security issues.

Linus replied, "All of this is pure garbage. Is Intel really planning on making this shit architectural? Has anybody talked to them and told them they are f*cking insane? Please, any Intel engineers here – talk to your managers."

David Woodhouse said, "If the alternative was a two-decade product recall and giving everyone free CPUs, I'm not sure it was entirely insane." He agreed that KarimAllah's patch was a "nasty hack," but added, "hey – the world was on fire, and in the end we didn't have to just turn the datacentres off and go back to goat farming, so it's not all bad."

Linus replied, "You seem to have bought into the cool-aid. Please add a healthy dose of critical thinking. Because this isn't the kind of cool-aid that makes for a fun trip with pretty pictures. This is the kind that melts your brain."

He went on to say that the true situation was much worse than a "nasty hack."

First, Linus acknowledged that, "The speculation control cpuid stuff shows that Intel actually seems to plan on doing the right thing for Meltdown (the main question being _when_). Which is not a huge surprise, since it should be easy to fix, and it's a really honking big hole to drive through. Not doing the right thing for Meltdown would be completely unacceptable."

But he went on to say, "the IBRS garbage implies that Intel is _not_ planning on doing the right thing for the indirect branch speculation. Honestly, that's completely unacceptable too."

He continued:

The whole IBRS_ALL feature to me very clearly says "Intel is not serious about this; we'll have a ugly hack that will be so expensive that we don't want to enable it by default, because that would look bad in benchmarks."

So instead they try to push the garbage down to us. And they are doing it entirely wrong, even from a technical standpoint.

I'm sure there is some lawyer there who says "we'll have to go through motions to protect against a lawsuit." But legal reasons do not make for good technology, or good patches that I should apply.

Earlier, David had remarked, "We do need the IBPB feature to complete the protection that retpoline gives us – it's that or rebuild all of userspace with retpoline." To which Linus said:

BULLSHIT.

Have you _looked_ at the patches you are talking about? You should have – several of them bear your name.

The patches do things like add the garbage MSR writes to the kernel entry/exit points. That's insane. That says "we're trying to protect the kernel." We already have retpoline there, with less overhead.

So somebody isn't telling the truth here. Somebody is pushing complete garbage for unclear reasons. Sorry for having to point that out.

If this was about flushing the BTB at actual context switches between different users, I'd believe you. But that's not at all what the patches do.

As it is, the patches are COMPLETE AND UTTER GARBAGE.

They do literally insane things. They do things that do not make sense. That makes all your arguments questionable and suspicious. The patches do things that are not sane.

WHAT THE F*CK IS GOING ON?

And that's actually ignoring the much _worse_ issue, namely that the whole hardware interface is literally mis-designed by morons.

It's mis-designed for two major reasons:

  • the "the interface implies Intel will never fix it" reason.

See the difference between IBRS_ALL and RDCL_NO. One implies Intel will fix something. The other does not. Do you really think that is acceptable?

  • the "there is no performance indicator"

The whole point of having cpuid and flags from the microarchitecture is that we can use those to make decisions. But since we already know that the IBRS overhead is huge on existing hardware, all those hardware capability bits are just complete and utter garbage. Nobody sane will use them, since the cost is too damn high. So you end up having to look at "which CPU stepping is this" anyway.

I think we need something better than this garbage.

David was actually in full agreement. Instead of Intel making a fix "opt-in," he said, they should just fix it, since failing to opt in would always be unacceptable. But he felt that for the kernel, a short-term fix that was horrendously ugly would still be OK, given that it was just a short-term fix.

He also pointed out that Linus had been looking slightly in the wrong place when complaining about David's patches – David pointed out that Linus had been looking at IBRS rather than IBPB. To which Linus replied, "Ehh. Odd Intel naming detail." And he went on, "If you look at this series, it very much does that kernel entry/exit stuff. It was patch [number 10 of 10 in the series]. In fact, the patch I was replying to was explicitly setting that garbage up. And I really don't want to see these garbage patches just mindlessly sent around."

David replied with a very interesting and useful summary of the whole situation. He said:

I think we've covered the technical part of this now, not that you like it – not that any of us *like* it. But since the peanut gallery is paying lots of attention, it's probably worth explaining it a little more for their benefit.

This is all about Spectre variant 2, where the CPU can be tricked into mis-predicting the target of an indirect branch. And I'm specifically looking at what we can do on *current* hardware, where we're limited to the hacks they can manage to add in the microcode.

The new microcode from Intel and AMD adds three new features.

One new feature (IBPB) is a complete barrier for branch prediction. After frobbing this, no branch targets learned earlier are going to be used. It's kind of expensive (order of magnitude ~4,000 cycles).

The second (STIBP) protects a hyperthread sibling from following branch predictions which were learned on another sibling. You *might* want this when running unrelated processes in userspace, for example. Or different VM guests running on HT siblings.

The third feature (IBRS) is more complicated. It's designed to be set when you enter a more privileged execution mode (i.e., the kernel). It prevents branch targets learned in a less-privileged execution mode, BEFORE IT WAS MOST RECENTLY SET, from taking effect. But it's not just a "set-and-forget" feature, it also has barrier-like semantics and needs to be set on *each* entry into the kernel (from userspace or a VM guest). It's *also* expensive. And a vile hack, but for a while it was the only option we had.

Even with IBRS, the CPU cannot tell the difference between different userspace processes, and between different VM guests. So in addition to IBRS to protect the kernel, we need the full IBPB barrier on context switch and vmexit. And maybe STIBP while they're running.

Then along came Paul with the cunning plan of "oh, indirect branches can be exploited? Screw it, let's not have any of *those* then," which is retpoline. And it's a *lot* faster than frobbing IBRS on every entry into the kernel. It's a massive performance win.

So now we *mostly* don't need IBRS. We build with retpoline, use IBPB on context switches/vmexit (which is in the first part of this patch series before IBRS is added), and we're safe. We even refactored the patch series to put retpoline first.

But wait, why did I say "mostly"? Well, not everyone has a retpoline compiler yet … but OK, screw them; they need to update.

Then there's Skylake, and that generation of CPU cores. For complicated reasons they actually end up being vulnerable not just on indirect branches, but also on a "ret" in some circumstances (such as 16+ CALLs in a deep chain).

The IBRS solution, ugly though it is, did address that. Retpoline doesn't. There are patches being floated to detect and prevent deep stacks and deal with some of the other special cases that bite on SKL, but those are icky too. And in fact IBRS performance isn't anywhere near as bad on this generation of CPUs as it is on earlier CPUs *anyway*, which makes it not quite so insane to *contemplate* using it as Intel proposed.

That's why my initial idea, as implemented in this RFC patchset, was to stick with IBRS on Skylake and use retpoline everywhere else. I'll give you "garbage patches," but they weren't being "just mindlessly sent around." If we're going to drop IBRS support and accept the caveats, then let's do it as a conscious decision having seen what it would look like, not just drop it quietly because poor Davey is too scared that Linus might shout at him again. :)

I have seen *hand-wavy* analyses of the Skylake thing that mean I'm not actually lying awake at night fretting about it, but nothing concrete that really says it's OK.

If you view retpoline as a performance optimization, which is how it first arrived, then it's rather unconventional to say "well, it only opens a *little* bit of a security hole, but it does go nice and fast, so let's do it."

But fine, I'm content with ditching the use of IBRS to protect the kernel, and I'm not even surprised. There's a *reason* we put it last in the series, as both the most contentious and most dispensable part. I'd be *happier* with a coherent analysis showing Skylake is still OK, but hey-ho, screw Skylake.

The early part of the series adds the new feature bits and detects when it can turn KPTI off on non-Meltdown-vulnerable Intel CPUs and also supports the IBPB barrier that we need to make retpoline complete. That much I think we definitely *do* want. There have been a bunch of us working on this behind the scenes; one of us will probably post that bit in the next day or so.

I think we also want to expose IBRS to VM guests, even if we don't use it ourselves. Because Windows guests (and RHEL guests; yay!) do use it.

If we can be done with the shouty part, I'd actually quite like to have a sensible discussion about when, if ever, we do IBPB on context switch (ptraceability and dumpable have both been suggested) and when, if ever, we set STIBP in userspace.
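On current hardware, the three features David describes all come down to writes to a pair of model-specific registers. A rough sketch using the publicly documented MSR numbers (the real kernel code has its own constants and helpers for this):

#include <asm/msr.h>			/* wrmsrl() */

#define MSR_IA32_SPEC_CTRL	0x00000048
#define SPEC_CTRL_IBRS		(1ULL << 0)	/* restrict indirect branch speculation */
#define SPEC_CTRL_STIBP		(1ULL << 1)	/* don't follow the HT sibling's predictions */

#define MSR_IA32_PRED_CMD	0x00000049
#define PRED_CMD_IBPB		(1ULL << 0)	/* one-shot prediction barrier */

/* Flush previously learned indirect branch predictions, e.g. when
 * context switching to an unrelated process or on vmexit. */
static inline void issue_ibpb(void)
{
	wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
}

/* Set IBRS, e.g. on each entry into the kernel in the scheme David
 * describes; clearing it again on exit is likewise a wrmsrl(). */
static inline void kernel_entry_ibrs(void)
{
	wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
}

Roughly speaking, the expense David cites comes largely from the MSR writes themselves, which is why performing one on every kernel entry and exit hurts so much more than retpoline's compile-time approach.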

Luke Kenneth Casson Leighton thanked David for this explanation, saying, "there is actually a significant benefit to what you're doing, not just peanut-gallery-ing." He said, "this is a cluster-f* where every single Intel (and AMD) engineer is prevented and prohibited from talking directly to you as they develop the microcode. … They've been ignored and demoralized. It's a lesson that I'm not sure their management are capable of learning, despite the head of the Intel open source innovation centre has been trying to get across to them for many years: OPEN UP THE FUCKING FIRMWARE AND MICROCODE."

He went on, "unfortunately, the burden is on you, the members of the linux kernel team, to read between the lines, express things clearly here on LKML so that the intel engineers who are NOT PERMITTED to talk directly to you can at least get some clear feedback … to indicate *to them* that you fully grasp the technical situation … whilst at the same time not being permitted access to the fucking microcode."

Meanwhile, David was joined by Ingo Molnár and many others working on the technical side of generating new patches to solve the various issues.

There was clearly a lot of bitterness among the kernel developers – for example, as expressed by Pavel Machek at one point in the discussion, where he said, "someone at Intel put world on fire. And then was selling faulty CPUs for half a year while world was on fire; they knew they are faulty yet they sold them anyway. Then Intel talks about how great they are and how security is important for them … . Intentionally confusing between Meltdown and Spectre so they can mask how badly they screwed. And without apologies."

He went on to say, "and now Intel wants to cheat at benchmarks, to put companies that do right thing at disadvantage and thinks that that's okay because world was on fire? At this point, I believe that yes, product recall would be appropriate. If Intel is not willing to do it on their own, well, perhaps courts can force them. Ouch and I would not mind some jail time for whoever is responsible for selling known-faulty CPUs to the public."

The discussion continued for quite a while, mostly on a technical level.

Aside from the unhappiness toward Intel, the issue is fascinating and helps illuminate some of the fundamental questions confronting a world in which open source has essentially taken over. How can the relationships and conflicting motivations between open source projects and for-profit corporations be properly managed? How can private companies reveal crucial information about proprietary microcode and similar internals, while protecting their competitive secrets from other corporations? And if, as Linus guessed in an early post, Intel intends to maintain a significant flaw over the long term, how can open source developers and users communicate with, and perhaps put pressure on, such companies to change that kind of decision?

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
