Zack's Kernel News
Zack's Kernel News
Improving the Android low memory killer; randomizing the Kernel stack; and best practices.
Improving the Android Low Memory Killer
Sultan Alsawaf submitted patches to implement a low memory killer for Android (i.e., something that detects which process to kill if the system starts to run out of available RAM). Low memory killers are very useful in emergencies, since the alternative is to crash and burn.
In this case, however, Greg Kroah-Hartman pointed out that an existing in-kernel low memory killer had recently been removed from the kernel source tree in favor of a userspace low memory killer that he felt worked just fine.
But Sultan replied that in his opinion, the userspace version sucked lemons. Among other things, he said, it was slow, killed too many processes, and was overly complex. As he put it, "The original reasoning behind why the old kernel low memory killer was removed is also a bit vague to me. It just seemed to be abandonware, and all of a sudden a userspace daemon was touted as the solution."
He added, "This driver is like an Android-flavored version of the kernel's OOM [Out of Memory] killer and has proven quite effective for me on my Pixel. Processes are killed exactly when a page allocation fails, so memory use is maximized. There is no complexity to try and estimate how full memory is either."
Michal Hocko was also opposed to the whole idea, saying, "we are not going to maintain two different OOM implementations in the kernel. From a quick look the implementation is quite a hack, which is not really suitable for anything but a very specific use case. E.g., reusing a freed page for a waiting allocation sounds like an interesting idea, but it doesn't really work for many reasons. E.g., any NUMA affinity is broken; zone protection doesn't work either. Not to mention how the code hooks into the allocator hot paths. This is simply [a] no no."
Michal pointed out that the userspace version was the right place to implement what Sultan wanted.
Sultan, having had no illusions about the future of his patch, replied, "I had no doubt that this would be vehemently rejected on the mailing list, but I wanted feedback/opinions on it and thus sent it as an RFC. At best, I thought perhaps the mechanisms I've employed might serve as inspiration for LMKD [low memory killer daemon] improvements in Android."
Suren Baghdasaryan replied that maybe some of Sultan's ideas could be incorporated into the userspace code and invited Sultan to discuss it further. But he confirmed that there was no way in hell that an in-kernel implementation would get into the codebase.
But Joel Fernandes felt that Sultan's basic assumptions were not good enough to actually work in any sort of general-purpose environment. Sultan's code essentially tried to very quickly identify which process was overusing memory and kill it immediately. This was the speed improvement he wanted to bring to the userspace version. However, as Joel put it, "the point is that a transient temporary memory spike should not be a signal to kill any process. The reaction to kill shouldn't be so spontaneous that unwanted tasks are killed, because the system went into panic mode. It should be averaged out, which I believe is what PSI does." In other words, the existing system wasn't simply slow, it was slow because being faster might be dangerous.
Still, Sultan argued that his patch was just about as fast as the existing in-kernel OOM killer code, which meant that if his patch had the problem Joel pointed out, then the kernel's OOM killer would too. From this he deduced that his code didn't run the risk Joel ascribed to it. Sultan added, "The decision to kill a process occurs after the page allocator has tried very hard to satisfy a page allocation via alternative means, such as utilizing compaction, flushing file-backed pages to disk via kswapd
, and direct reclaim. Once all of those means have failed, it is quite reasonable to kill a process to free memory. Trying to wait out the memory starvation at this point would be futile."
Meanwhile, Suren said that the slower speed of the userspace code was a known issue, which the developers were continuing to address. He remarked, "For example the recent LMKD change to assign processes being killed to a cpuset
cgroup containing big cores cuts the kill time considerably." However, he acknowledged that this was not actually an ideal solution, so he and others were thinking about further alternatives.
To this, Michal remarked, "The only way to control the OOM behavior proactively is to throttle allocation speed. We have memcg
high limit for that purpose. Along with PSI, I can imagine a reasonably working userspace early OOM notifications and reasonable acting upon that." Suren said this made good sense, and the developers were aiming in exactly that direction.
But ultimately, the difference between all these alternatives and the in-kernel OOM killer is that the OOM killer is an out-of-memory killer, while the others are simply low-memory killers. That is, both Suren's and Sultan's code are trying to get a little ahead of the game, with the idea of avoiding having to fall back on the default in-kernel OOM killer.
As Michal put it at one point, "If you really want to reach the real OOM situation, then you can very well rely on the in-kernel OOM killer. The only reason you want a customized OOM killer is the tasks classification. And that is a different story. Userspace hints on the victim selection has been a topic for quite a while. It never gets to any conclusion as interested parties have always lost an interest because it got hairy quickly."
The discussion eventually was overrun by the userspace folks discussing ways to improve the userspace code. It's likely this was exactly what Sultan was hoping to inspire with his original patch. There did seem to be a lot of progress made during that phase of the discussion, although Sultan did not weigh in again to give a thumbs up or down.
It's a little unusual these days to have someone write and post a patch with the express purpose of challenging an alternative's defenders to do better. But it does seem as though an in-kernel low memory killer is not likely to get into Linux again in the near future. It's an odd approach to project management; however, if it works, it works.
Randomizing the Kernel Stack
Elena Reshetova recently posted some code to utterly randomize the kernel stack offset in response to every system call. The idea is of course to make it harder to crack into the running system via stack-based attacks. Though as she said in her post, "as Linux kernel stack protections have been constantly improving (vmap-based stack allocation with guard pages, removal of thread_info
, STACKLEAK
), attackers have to find new ways for their exploits to work."
She also added that her patch was not aimed at any particular attack, but was intended more as a general coat of armor for any such attacks that might wish to be rebuffed.
Everyone welcomed the patch – although in general, Linus Torvalds seems to prefer security patches that fix security holes, rather than speculatively addressing a region of potential-yet-not-actual attack vectors. But in this case, no one had any serious objections – just implementation suggestions.
Josh Poimboeuf, however, did remark, "Now that thread_info
is off the stack, and vmap stack guard pages exist, it's not clear to me what the benefit is." But no one else took up the charge, and even Josh was ready with implementation suggestions.
Elsewhere, Andy Lutomirski had an issue with what seemed to be an alignment error. Elena's code introduced 8 bits of randomness – which was fine – but also added 4 bits of zero value, to properly align the stack. Andy said this seemed unnecessary because "x86-64 Linux only maintains 8-byte stack alignment."
But David Laight replied, "The GCC developers arbitrarily changed the alignment a few years ago. If the stack is only 8-byte aligned and you allocate a variable that requires 16-byte alignment, you need GCC to generate the extra stack frame to align the stack. I don't remember seeing the relevant GCC options on the linux GCC command lines."
Andy said, "On older GCC, you can't set the relevant command-line options, because GCC was daft. So we just crossed [our] fingers and [hoped] for the best. On newer GCC, we set the options. Fortunately, 32-byte stack variable alignment works regardless. AFAIK x86-64 Linux has never aligned the stack to 16 bytes."
Elena said she'd take another look, and see if she'd been wrong to add the four zero bits.
Another issue raised by Andy was the source of randomness. He didn't like the speed of existing techniques, and he felt that some sources of randomness might be too easy to defeat. He said, "Perhaps we need a little percpu
buffer that collects 64 bits of randomness at a time, shifts out the needed bits, and refills the buffer when we run out."
But Kees Cook remarked, "I'd like to avoid saving the exact details of where the next offset will be, but if nothing else works, this should be okay. We can use 8 bits at a time and call prandom_u32()
every fourth call. Something like prandom_bytes()
, but where it doesn't throw away the unused bytes."
Elena also said that Andy's suggestion might not provide enough random bits. She said, "We might have to refill pretty often on a syscall
-hungry workloads. If we need 8 bits for each sys call, then we will refill every eight syscall
s, which is of course better than each one, but is it an acceptable penalty?"
Regarding Kees's suggested solution, Elena replied, "I think this would make the end result even worse security-wise than simply using rdtsc()
on every syscall
. Saving the randomness in percpu
buffer, which is probably easily accessible and can be probed if needed, would supply [an] attacker with much more knowledge about the next three-four random offsets than what he would get if we use 'weak' rdtsc
. Given that for a successful exploit, an attacker would need to have stack aligned once only, having a knowledge of three-four next offsets sounds like a present to an exploit writer…."
Kees replied, "That certainly solidifies my concern against saving randomness."
There was a bit more discussion, but the proof was in the pudding, and Elena posted new versions of her patch that seemed to lead the way forward better than the theoretical discussions; especially when the speed of an actual implementation was the best way to determine on which implementation to rely.
It still seems like an odd patch to go into the kernel, since it doesn't address any known exploits. It would not be the first time that developers put a lot of work into a security fix, only to be told that it wasn't a real fix, because there were no exploits to guard against.
So I'll be curious to see whether Elena's code makes it into the tree. For something that will happen every time anything invokes a system call, I would imagine the justification would have to be very clear, because there's bound to be some kind of performance hit, no matter how tiny, with any version of Elena's patch.
Best Practices
There are some things that you know will be controversial as soon as someone mentions them. Who's your favorite president in US history? Should recreational marijuana be legal, or should we just keep breaking the law? When boiling an egg, do you put it in straight from the fridge, or at room temperature?
Or, in the Linux kernel world, how many bytes can you put in a line of text?
Recently, Alastair D'Silva in all innocence suggested, "With modern high resolution screens, we can display more data, which makes life a bit easier when debugging. Allow 64 bytes to be dumped per line."
He posted a patch to increase the options for row length from 16 and 32 bytes to 16, 32, and 64 bytes.
Petr Mladek decided to nip this sacrilegious rebellion in the bud, saying, "I have quite some doubts about this feature. We are talking about more than 256 characters per line. I wonder if such a long line is really easier to read for a human." He went on, "I am not expert, but there is a reason why the standard is 80 characters per line. I guess that anything above 100 characters is questionable. https://en.wikipedia.org/wiki/Line_length somehow confirms that."
Delving deeper, Petr continued, "If we take 8 pixels per character. Then we need 2048 to show the 256 characters. It is more than HD. IMHO, there is still [a] huge number of people that even do not have [an] HD display, especially on a notebook."
But Alastair said the patch worked perfectly well for him, and that "it's basically two separate panes of information side by side, the hex dump and the ASCII version." And he remarked, "The intent is to make debugging easier when dealing with large chunks of binary data. I don't expect end users to see this output."
Petr replied, "I am sure that it works for you. But I do not believe that it would be useful in general." To which Alastair said, equally pugnaciously, "I do, and I believe the choice of the output length should be in the hands of the caller."
Alastair then doubled down (literally) on the whole issue, saying that in fact, "it would make more sense to remove the hard-coded list of sizes and just enforce a power of two."
But this was too much for David Laight, whose head had been spinning with steam shrieking out of his ears for too long during this debate already. David asked, "Why powers of two? You may want the length to match sizeof
(struct foo
). You might even want the address increment to be larger that the number of lines dumped."
To which Alastair replied, "Good point, the base requirement is that it should be a multiple of groupsize
."
At this point, the conversation went offline, or at least they took it out into the street somewhere, and that was the end of it.
My favorite part of that particular debate is that if it continues for long enough, Linus Torvalds will have to make a decision on it one way or the other. Because this issue has never, ever, ever come up before…
…Well, maybe it did in 2012, when someone tried to increase the maximum line length for patch submissions from 80 characters to 100. At that time, Linus said:
"Quite frankly, I think we should still keep it at 80 columns.
"The problem is not the 80 columns; it's that damn patch-check script that warns about people occasionally going over 80 columns.
"But usually it's better to have the occasional 80+ column line, than try to split it up. So we do have lines that are longer than 80 columns, but that's not because 100 columns is ok – it's because 80+ columns is better than the alternative.
"So it's a trade-off. Thinking that there is a hard limit is the problem. And extending that hard limit (and thinking that it's 'ok' to be over 80 columns) is also a problem.
"So no, 100-char columns are not ok."
Buy this article as PDF
(incl. VAT)