Zack's Kernel News
Zack Brown reports on container-aware cgroups, a different type of RAM chip on a single system, new SARA security framework, and improving GPIO interrupt handling.
Container-Aware Cgroups
Roman Gushchin didn't like the way the out-of-memory (OOM) killer targeted individual processes for termination. On a system with many virtual systems on top, he said, the current OOM killer would not behave ideally. It would not recognize individual processes as belonging to particular containers, so it might unexpectedly kill some random process within the container. Or a very large container might not be recognized as a proper target for the OOM killer if it simply contained a large number of very small processes. The OOM killer might target a much smaller container instead, only because that container had a couple of large processes.
Roman wanted to address these problems by creating an OOM killer that would treat a single container as having the size of all processes running within it. Then the OOM killer might properly target that container and kill all the processes associated with it. In cases where no such containers existed, the OOM killer would fall back to its traditional per-process targeting system.
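The difference between the two targeting schemes can be illustrated with a small simulation. This is only a sketch, not the kernel's code or Roman's patch: the process list, the cgroup names, and the sizes are invented, and processes outside any container are simply ignored whenever at least one container exists, which is a simplification made for this example.

```python
from collections import defaultdict

# Hypothetical snapshot: (pid, cgroup name or None, memory in MiB).
PROCESSES = [
    (101, "web", 300),
    (102, "web", 300),
    (103, "web", 300),
    (201, "batch", 700),
    (301, None, 100),
]

def pick_victim_per_process(processes):
    """Traditional targeting: the single largest process is the victim."""
    pid, _, _ = max(processes, key=lambda p: p[2])
    return [pid]

def pick_victim_cgroup_aware(processes):
    """Treat each container as one unit sized as the sum of its processes."""
    totals = defaultdict(int)
    members = defaultdict(list)
    for pid, cgroup, mem in processes:
        if cgroup is not None:
            totals[cgroup] += mem
            members[cgroup].append(pid)
    if not totals:
        # No containers at all: fall back to per-process targeting.
        return pick_victim_per_process(processes)
    biggest = max(totals, key=totals.get)
    # Kill every process that belongs to the largest container.
    return sorted(members[biggest])
```

With this data, the per-process scheme picks the lone 700MiB process in the smaller "batch" container, while the container-aware scheme picks all three processes of the 900MiB "web" container.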
He posted a patch to implement this, but Michal Hocko objected. The real problem with the OOM killer is similar to the problem with context switching, in which the kernel switches rapidly between processes to give the illusion that they are all running simultaneously. The problem with context-switching algorithms is that different user behaviors call for different switching algorithms; the same is true for the OOM killer. There's no obviously correct way to choose which process to kill during OOM conditions.
Michal pointed this out and reminded Roman that among the kernel developers there was still no consensus about which processes the OOM killer should target in general. And he said that therefore trying to extend the OOM killer to handle cgroups might be jumping the gun.
Johannes Weiner, on the other hand, felt that Roman's patches were not dangerously related to OOM killer policy. He felt that in Roman's patches, the OOM killer was still expected to do the standard thing – identify which process to kill, according to its existing set of policies – but under Roman's patches, the OOM killer could simply consider a whole container as a single process and make its assessment the same as it would for any other process.
As Johannes put it, "All we want is the OOM policy, whatever it is, applied to cgroups."
But Balbir Singh agreed with Michal. He started to pose algorithmic policy questions related to how best to assess a given container as being a good or bad target for the OOM killer.
But Johannes explained, "The problem is when OOM happens, we really want the biggest *job* to get killed. Before cgroups, we assumed jobs were processes. But with cgroups, the user is able to define a group of processes as a job, and then an individual process is no longer a first-class memory consumer." He went on, "Without a patch like this, the OOM killer will compare the sizes of the random subparticles that the jobs in the system are composed of and kill the single biggest particle, leaving behind the incoherent remains of one of the jobs. That doesn't make a whole lot of sense."
As the discussion went on, it became clear that the two sides were interested in different things. The pro-patch folks wanted containers to be treated as discrete killable jobs, because killing a random part inside one of them might leave the whole container in an unworkable state, whereas the anti-patch folks – Balbir in particular – were more concerned with finding ways to avoid hitting an OOM condition in the first place. They wanted to find ways to handle memory that would be less likely to overcommit RAM. As Balbir put it, "OOM is a big hammer and having allocations fail is far more acceptable than killing processes. I believe that several applications may have much larger VM than actual memory usage, but I believe with a good overcommit/virtual memory limiter the problem can be better tackled."
At a certain point, Vladimir Davydov entered the discussion, with an objection to Roman's patch that was more along the lines of the kind of feedback Roman had probably hoped for initially. Vladimir said:
I agree that the current OOM victim selection algorithm is totally unfair in a system using containers and it has been crying for rework for the last few years now, so it's great to see this finally coming.
However, I don't reckon that killing a whole leaf cgroup is always the best practice. It does make sense when cgroups are used for containerizing services or applications, because a service is unlikely to remain operational after one of its processes is gone, but one can also use cgroups to containerize processes started by a user. Kicking a user out for one of her process[es] has gone mad doesn't sound right to me.
Another example when the policy you're suggesting fails in my opinion is in case a service (cgroup) consists of sub-services (sub-cgroups) that run processes. The main service may stop working normally if one of its sub-services is killed. So it might make sense to kill not just an individual process or a leaf cgroup, but the whole main service with all its sub-services.
And both kinds of workloads (services/applications and individual processes run by users) can co-exist on the same host – consider the default systemd setup, for instance.
Because of these two equally valid possibilities, Vladimir suggested allowing the user to choose which policy they preferred. He said, "we could introduce a per-cgroup flag that would tell the kernel whether the cgroup can tolerate killing a descendant or not. If it can, the kernel will pick the fattest sub-cgroup or process and check it. If it cannot, it will kill the whole cgroup and all its processes and sub-cgroups."
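Vladimir's suggestion reads like a recursive walk down the cgroup tree, which can be sketched as follows. Everything here is hypothetical: the `Node` class, the `tolerates_partial_kill` flag name, and the example tree are made up for illustration and do not correspond to any real kernel interface.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    mem: int = 0                          # memory of a process; 0 for cgroups
    children: list = field(default_factory=list)
    is_cgroup: bool = False
    tolerates_partial_kill: bool = True   # hypothetical per-cgroup flag

def size(node):
    """Total memory of a process, or of a cgroup and everything below it."""
    return node.mem + sum(size(child) for child in node.children)

def processes(node):
    """Names of all processes at or below this node."""
    if not node.is_cgroup:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(processes(child))
    return found

def select_victims(node):
    """Descend into the fattest child until a process or an
    intolerant cgroup is reached, then return what should be killed."""
    if not node.is_cgroup:
        return [node.name]
    if not node.tolerates_partial_kill or not node.children:
        return processes(node)  # kill the whole cgroup and its sub-cgroups
    fattest = max(node.children, key=size)
    return select_victims(fattest)

# Invented example: a user session next to a service with sub-services.
user = Node("user", is_cgroup=True, children=[
    Node("editor", mem=100), Node("runaway", mem=500)])
service = Node("service", is_cgroup=True, tolerates_partial_kill=False,
               children=[
                   Node("db", is_cgroup=True,
                        children=[Node("db-worker", mem=400)]),
                   Node("api", is_cgroup=True,
                        children=[Node("api-worker", mem=300)]),
               ])
root = Node("root", is_cgroup=True, children=[user, service])
```

In this tree, the service is the fattest child of the root and cannot tolerate a partial kill, so all of its processes are selected. If only the user cgroup were considered, just the single runaway process would be chosen.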
Roman saw a lot of value in Vladimir's scenarios. But he was hesitant to create a per-cgroup flag that would have to be supported in all future kernels. Presumably such a flag would eventually no longer be needed, so it would be better not to carry it in the kernel for all time. But he did agree that there should be an option to disable the cgroup-aware OOM killer on a system-wide basis, if only for backward compatibility purposes.
The debate continued, especially between the "don't change policy" and the "we're not changing policy" positions. Ultimately it doesn't seem to be as much like bickering as it might appear. The folks protesting against policy changes are very sensitive to policy changes because that's the part of the code that concerns them, and they see the subtle ways that Roman's patch does actually change the way the OOM killer targets processes. The folks claiming Roman's patch doesn't implement any policy changes are sensitive to the plight of the virtual system, and they just don't want the OOM killer doing something that seems nonsensical, like destroying a core component of a container, leaving the rest of the container unusable, but without actually freeing up the memory used by the now-useless container.
A Different Type of RAM Chip on a Single System
Ross Zwisler of Intel pointed out that modern devices would typically have multiple different kinds of RAM associated with a given CPU, and that the kernel needed to handle that properly. He said, "These disparate memory ranges will have some characteristics in common, such as CPU cache coherence, but they can have wide ranges of performance both in terms of latency and bandwidth." He went on, "consider a system that contains persistent memory, standard DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU. There could potentially be an order of magnitude or more difference in performance between the slowest and fastest memory attached to that CPU."
He said the problem was that "NUMA nodes are CPU-centric, so all the memory attached to a given CPU will be lumped into the same NUMA node. This makes it very difficult for userspace applications to understand the performance of different memory ranges on a given CPU."
He suggested adding sysfs files to indicate performance characteristics of given memory ranges. Then user applications could choose the memory they really wanted to use for a given allocation.
To do this, Intel saw two possible options. Either they could directly export data into sysfs from the newly implemented Heterogeneous Memory Attribute Table (HMAT) code, which contains information about memory, or they could use a library and a running daemon to provide users with an API to access the same data via function calls. Or maybe, Ross said, there was a third way the Linux developers might prefer over either of those options.
There were no serious objections to Ross's proposal, nor did anyone seem to have a strong opinion on the best way to expose the data to user code. Mostly, folks just seemed interested in the whole issue, the types of devices, the use cases, and whatnot.
Bob Liu raised the only dissenting voice, suggesting that most users probably wouldn't care which piece of RAM they allocated, so he wouldn't want user code to be forced to pay attention to these various new APIs and exported sysfs files, to which Ross replied that there would be a decent set of default policies, so that only the users who did care about which memory they wanted would need to pay attention to the new interfaces.
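The idea of exposing per-range performance data, with a default for applications that don't care, might look roughly like this from user space. This sketch does not read any real sysfs file or HMAT data: the range names, the attribute keys, the numbers, and the choice of DDR as the default are all invented for illustration.

```python
# Hypothetical performance attributes for three memory ranges attached
# to one CPU. Keys and numbers are made up, not measured values.
MEMORY_RANGES = {
    "pmem": {"latency_ns": 300, "bandwidth_gbs": 10},
    "ddr": {"latency_ns": 80, "bandwidth_gbs": 100},
    "hbm": {"latency_ns": 100, "bandwidth_gbs": 400},
}

DEFAULT_RANGE = "ddr"  # assumed default for callers with no preference

def choose_range(ranges, prefer=None):
    """Pick a memory range by the characteristic the caller cares about."""
    if prefer is None:
        # Most applications never ask, so they get the default policy.
        return DEFAULT_RANGE
    if prefer == "latency":
        return min(ranges, key=lambda name: ranges[name]["latency_ns"])
    if prefer == "bandwidth":
        return max(ranges, key=lambda name: ranges[name]["bandwidth_gbs"])
    raise ValueError(f"unknown preference: {prefer!r}")
```

An application that streams large buffers could ask for "bandwidth", a latency-sensitive one for "latency", and everything else would pass nothing and get the default.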
This seems like one of those moments where Intel – or some other hardware maker – is doing something in hardware that obviously needs to be supported because the alternative is just not to support it; they're trying to let the kernel folks know that this is going to be a thing. So this whole thread seemed essentially like a little "heads-up" to the kernel developers that something's coming and folks should be ready when the patches arrive. There's also an aspect of Intel covering its butt, making sure that the direction they're going internally with the patches is less likely to violate some kind of unexpected kernel requirement.
New SARA Security Framework
Salvatore Mesoraca posted a patch to implement SARA (short for "SARA is Another Recursive Acronym"), a new general-purpose security framework intended to let users build security sub-modules to match particular needs. One sub-module Salvatore included was a USB filtering system to better control the kind of data that could pass along a USB connection. Another was a WX protection system to ensure that a piece of memory could be either writable or executable, but not both. Other sub-modules could be designed to meet other specific needs.
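The rule behind WX protection is simple to state in code. The following is only an illustration of the principle, not SARA's implementation: the function names are hypothetical, and the constants mirror the Linux `PROT_*` values used with mmap and mprotect.

```python
# Linux protection flag values, as used by mmap() and mprotect().
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4

def violates_wx(prot):
    """True if a region would be writable and executable at once."""
    return bool(prot & PROT_WRITE) and bool(prot & PROT_EXEC)

def check_mapping(prot):
    """Hypothetical policy hook: refuse any W+X protection request."""
    if violates_wx(prot):
        raise PermissionError("mapping may not be both writable and executable")
    return prot
```

A request for read plus execute, or read plus write, passes the check, while any request combining write and execute is refused.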
Mickaël Salaün suggested merging the WX protection sub-module with his own security project, TPE (Trusted Path Execution)/shebang LSM. He explained that TPE could prevent a user's binaries and even scripts from being executed, but he said, "there is always a way for a process to mmap/mprotect arbitrary data and make it executable, be it intentional or not." He suggested that Salvatore's SARA framework could "make exceptions by marking a file with dedicated xattr values. This kind of exception fit well with TPE to get a more hardened executable security policy (e.g. forbid an user to execute his own binaries or to mmap arbitrary executable code)."
Salvatore agreed that the two projects complemented each other, although he saw some difficulties with merging the two projects, in particular their very different configuration systems. SARA implemented its own, whereas TPE relied on xattrs. Salvatore said he wouldn't mind if TPE implemented the SARA configuration system, but he was loath to abandon his own system in favor of xattrs. He didn't really see a need to merge them, since they could both be used side by side without any cost.
Around the same time, Matt Brown announced that he'd merged his own TPE work with Mickaël's and added some additional shebang features.
The discussion veered off into technical implementation details. Clearly there's some motivation for these various security systems to merge together, although it seems to be too soon to point to any final front end as the most likely interface.