Zack's Kernel News

Article from Issue 169/2014

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Extending Containers

Marian Marinov noticed that if he ran multiple containers, they all shared the same process counters. In other words, if two containers used the same user ID numbers and group ID numbers, then the processes owned by those IDs, but on completely separate containers, would appear to be owned by the same user. This would mess with his ability to do process resource limiting on a per-container basis.

This situation caused problems for Marian, because his containers were all instantiated from an identical template (hence identical UID and GID numbers) that contained a large number of files for a particular project. Changing the ownership of those files within the running container would be a very time-consuming task, and abandoning his template would require a lot of redesign. He proposed modifying some kernel data structures to isolate each container's user namespace, so the process counters would treat the containers as separate from each other.
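To make the problem concrete, here is a toy userspace model of the two behaviors. It is not kernel code, and the class and method names are invented for illustration. It only assumes that a process limit is enforced against a counter keyed either by UID alone (the behavior Marian ran into) or by container plus UID (roughly what he was asking for).

```python
from collections import Counter


class SharedCounters:
    """Toy model: one process counter per UID, shared by every container."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = Counter()

    def key(self, container, uid):
        # The container is ignored, so identical UIDs collide.
        return uid

    def fork(self, container, uid):
        """Return True if a new process is allowed under the limit."""
        k = self.key(container, uid)
        if self.counts[k] >= self.limit:
            return False
        self.counts[k] += 1
        return True


class PerContainerCounters(SharedCounters):
    """Toy model: the counter is keyed by (container, UID) instead."""

    def key(self, container, uid):
        return (container, uid)


shared = SharedCounters(limit=2)
shared.fork("a", 1000)
shared.fork("a", 1000)
blocked = not shared.fork("b", 1000)   # container b is refused because of a

isolated = PerContainerCounters(limit=2)
isolated.fork("a", 1000)
isolated.fork("a", 1000)
allowed = isolated.fork("b", 1000)     # container b has its own counter
```

With a limit of two, the shared model refuses the first process in container b because container a has already used up UID 1000's allowance, while the per-container model lets it through.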

Eric W. Biederman replied that the current behavior was actually intentional, and he felt it would be bad to have per-user-namespace data structures in the kernel. He did say, though, that he'd been considering allowing resource limits that would detect the different containers and apply the limits appropriately.

In the meantime, Eric offered a workaround. He suggested that Marian untar his large group of files within the container, which would quickly assign a different owner to the relevant files. However, Marian said this would be a step backward, because he already used a fancy snapshotting tool for the files. He said this was much more efficient than using tar, but it left him with the namespace problem.

Serge Hallyn sympathized with Marian and suggested a solution of "a very thin stackable filesystem which does uid shifting, or, better yet, a non-stackable way of shifting uids at mount."

Pavel Emelyanov liked Serge's idea, especially the non-stackable bit, mostly because, as he said, "even simple stacking is quite a challenge."
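The arithmetic behind "uid shifting" is simple, even if doing it inside the kernel is not. The sketch below is a userspace illustration, not anything proposed on the list. It borrows the three-column layout of /proc/[pid]/uid_map (ID inside the namespace, ID outside, range length); the helper name and the example ranges are assumptions made up for this example.

```python
from typing import List, NamedTuple, Optional


class IdRange(NamedTuple):
    inside_start: int    # first ID as seen inside the container
    outside_start: int   # first ID as seen on the host
    length: int          # how many consecutive IDs the range covers


def shift_to_host(uid: int, ranges: List[IdRange]) -> Optional[int]:
    """Translate a container-side ID to a host-side ID, or None if unmapped."""
    for r in ranges:
        if r.inside_start <= uid < r.inside_start + r.length:
            return r.outside_start + (uid - r.inside_start)
    return None


# Two containers built from the same template, shifted to disjoint host ranges.
container_a = [IdRange(0, 100000, 65536)]
container_b = [IdRange(0, 200000, 65536)]

host_uid_a = shift_to_host(1000, container_a)   # 101000
host_uid_b = shift_to_host(1000, container_b)   # 201000
```

The same UID 1000 from the template lands on two different host-side IDs, so per-UID accounting on the host would no longer lump the two containers together.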

They started going over the technical details of how this might be approached, although Serge pointed out that they'd need to catch UID and GID values at every point where kernelspace touched userspace. There were many such points, and catching them all would be difficult.

At one point, Eric suggested, "There is a simple solution. We pick the filesystems we choose to support. We add privileged mounting in a user namespace. We create the user and mount namespace. Global root goes into the target mount namespace with setns and performs the mounts."

He added, "As long as we don't plan to support XFS (as XFS likes to expose it's implementation details to userspace) it should be quite straightforward."

Marian replied, "This may solve one of the problems, but it does not solve the issue with UID/GID maps that overlap in different user namespaces. In our cases, this means breaking container migration mechanisms."

James Bottomley added, "Any implementation which doesn't support XFS is unviable from a distro point of view. The whole reason we're fighting to get USER_NS enabled in distros goes back to lack of XFS support (they basically refused to turn it on until it wasn't a choice between XFS and USER_NS). If we put them in a position where they choose a namespace feature or XFS, they'll choose XFS."

Eric argued that the situations weren't quite the same. In the previous case, the problem was that XFS wouldn't build at all, and so the distros had to make a choice between it and USER_NS. In the current case, the problem was that XFS would work fine, unless it was inside a user namespace. It was a completely new scenario, and no existing use cases would break, he said.

Eric felt that there were appropriate reasons to exclude XFS from Marian's UID/GID isolation idea. First, he said, XFS's journal replay would already not support migrating a filesystem from a system with one endian/wordsize combination to another. So Marian's use case was already excluded – he wanted to migrate filesystems between containers, and XFS already didn't fully support that. Therefore, nothing was lost by excluding it from this feature as well.
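Why byte order matters for something like journal replay can be shown with a tiny example. This is not XFS code and says nothing about XFS's actual log layout. It only assumes, for illustration, that a 32-bit value was written in the writer's native byte order and then read back by a machine with the opposite order.

```python
import struct

value = 0x01020304

# Bytes as a little-endian machine would write them.
written = struct.pack("<I", value)

# The same bytes read back correctly, and by a big-endian reader.
read_same_order = struct.unpack("<I", written)[0]
read_other_order = struct.unpack(">I", written)[0]

print(hex(read_same_order))    # 0x1020304
print(hex(read_other_order))   # 0x4030201
```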

Eric added that because of XFS's implementation of certain features, it would be a much larger coding effort to make the changes needed. Better to support "normal" filesystems first, he said, and worry about XFS when it had overcome some of those deeper implementation issues.

The discussion ended around there, with no real conclusion. It's not clear how the kernel will ultimately address Marian's issue. The existence of a tar-based workaround suggests that there's no urgency, which in turn suggests that the deeper workaround of excluding XFS from a partial solution is also not urgent. On the other hand, there doesn't seem to be any inherent opposition to isolating user IDs and group IDs in containers; there's only disagreement on the right approach to doing it.

New Random Number Generators in the Kernel

Stephan Mueller submitted some patches to implement a deterministic random number generator, as defined by SP800-90A [1], a document put out by the US Department of Commerce's National Institute of Standards and Technology (NIST).

As stated in the document, "NIST is responsible for developing standards and guidelines, including minimum requirements, for providing adequate information security for all agency operations and assets, but such standards and guidelines shall not apply to national security systems. This recommendation has been prepared for use by Federal agencies. It may be used by nongovernmental organizations on a voluntary basis and is not subject to copyright."

Stephan implemented all three random number generators – HMAC, hash, and CTR, with HMAC being the simplest, and CTR being the most complex. As stated in the document, "The methods provided are based on either hash functions, block cipher algorithms or number theoretic problems."
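To give a feel for what "deterministic" means here, the following is a minimal Python sketch of the HMAC-based construction as SP800-90A describes its update and generate steps. It is not Stephan's kernel code, which is written in C. The class name and the choice of SHA-256 are assumptions for the example. It also leaves out the reseed counter, request-size limits, and prediction resistance, so it should not be used for anything real.

```python
import hashlib
import hmac


class HmacDrbg:
    """Simplified HMAC-based deterministic random bit generator (sketch only)."""

    def __init__(self, entropy: bytes, nonce: bytes = b"", personalization: bytes = b""):
        self._hash = hashlib.sha256
        outlen = self._hash().digest_size
        self._key = b"\x00" * outlen      # initial K
        self._value = b"\x01" * outlen    # initial V
        self._update(entropy + nonce + personalization)

    def _hmac(self, key: bytes, data: bytes) -> bytes:
        return hmac.new(key, data, self._hash).digest()

    def _update(self, provided: bytes = b"") -> None:
        # Mix optional input into the internal state (K, V).
        self._key = self._hmac(self._key, self._value + b"\x00" + provided)
        self._value = self._hmac(self._key, self._value)
        if provided:
            self._key = self._hmac(self._key, self._value + b"\x01" + provided)
            self._value = self._hmac(self._key, self._value)

    def generate(self, n: int) -> bytes:
        """Return n pseudorandom bytes and advance the internal state."""
        out = b""
        while len(out) < n:
            self._value = self._hmac(self._key, self._value)
            out += self._value
        self._update()
        return out[:n]


first = HmacDrbg(b"example seed material, not secret").generate(16)
second = HmacDrbg(b"example seed material, not secret").generate(16)
same_seed_same_output = first == second
```

Two generators seeded with the same material produce the same bytes, which is the whole point of the design: all of the unpredictability has to come from the seed.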

There was no discussion, and Herbert Xu accepted the patches unchanged.

The printk() Untouchable Behemoth

Petr Mladek submitted a patch to support calling printk() from within a non-maskable interrupt (NMI). The problem had been that NMIs could not themselves be interrupted, whereas printk() required taking an internal lock. This meant that calling printk() from within an NMI could potentially cause a deadlock. If printk() required a lock that was held by another process, it would wait for that process to release the lock, then printk() would take the lock itself. However, the other process would never release the lock, because the NMI would never be interrupted to give CPU cycles to that process.

Petr's solution was for printk() merely to try to take the lock. If it failed, it would simply buffer whatever it was trying to print and complete its print operation at the first opportunity after the NMI finished executing.

Petr's code had to overcome several problems. For one thing, the amount of buffered output could grow without bound if printk() kept being called. Petr's code also had to make sure that the buffered messages would be printed in the correct order. In the worst-case scenario, if the buffer itself was too small to hold all the messages, Petr's code had to make sure the most recent messages, rather than the earliest, were the ones that ultimately got displayed. In addition, his code had to make sure the messages were printed to the console as soon as possible. Any delay risked not printing the messages at all if the NMI were part of a system crash.
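The general shape of "try the lock, otherwise buffer" can be modeled in a few lines of userspace Python. This is only a sketch of the idea, not Petr's patch: the class and method names are invented, a threading.Lock stands in for printk()'s internal lock, and a bounded deque stands in for the side buffer. A real NMI handler would also need the buffering step itself to be NMI-safe, which this model ignores.

```python
import threading
from collections import deque


class DeferredLogger:
    """Toy model of trylock-or-buffer logging (hypothetical names)."""

    def __init__(self, backlog_size=4):
        self.lock = threading.Lock()               # stands in for the log lock
        self.output = []                           # stands in for the main log
        self.backlog = deque(maxlen=backlog_size)  # full deque drops the oldest

    def _flush_locked(self):
        # Caller holds the lock; emit buffered messages oldest first.
        while self.backlog:
            self.output.append(self.backlog.popleft())

    def log_from_nmi(self, msg):
        """Never blocks: print if the lock is free, otherwise buffer."""
        if self.lock.acquire(blocking=False):
            try:
                self._flush_locked()    # keep earlier messages in order
                self.output.append(msg)
            finally:
                self.lock.release()
            return True
        self.backlog.append(msg)
        return False

    def flush(self):
        """Called from a normal context once the lock can be taken safely."""
        with self.lock:
            self._flush_locked()


log = DeferredLogger(backlog_size=2)
log.lock.acquire()            # someone else holds the lock
for m in ("a", "b", "c"):
    log.log_from_nmi(m)       # all three are buffered; "a" is dropped
log.lock.release()
log.log_from_nmi("d")         # lock is free: backlog is flushed, then "d"
print(log.output)             # ['b', 'c', 'd']
```

The bounded deque gives the "most recent messages win" behavior for free, and flushing the backlog before appending the new message preserves ordering.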

Jiri Kosina felt that Petr's code was important and addressed real problems. He said, "we've actually seen the lockups triggered by RCU stall detector trying to dump stacks on all CPUs, and hard-locking machine up while doing so. So this really needs to be solved."

Frédéric Weisbecker pointed out that the code was unpleasant to review, "due to a not very appealing changestat on an old codebase that is already unpopular."

Having said that, Frédéric did look at the code, and said, "Your patches look clean and pretty nice actually. They must be seriously considered if we want to keep the current locked ring buffer design and extend it to multiple per context buffers."

He also added that the printk() code used an ancient design and that Petr's code made a 1,000-line change, just to overcome printk()'s fundamental flaws. Frédéric suggested, "shouldn't we rather redesign it to use a lockless ring buffer like ftrace or perf ones?"

Jiri agreed that "printk() has grown over years to a stinking pile of you-know-what, no argument to that." And, he agreed that a redesign was the right move, in the long run. However, he added that the patch for that would be even bigger and scarier than what Petr had already produced. Jiri also pointed out that Petr's code was primarily a bug-fix, addressing real-world deadlocks that users had experienced. Even if a redesign was the right approach overall, Petr's bug-fix was still necessary in the short-term.

Jan Kara also agreed that Petr's fix was good for the moment, especially "Given how difficult/time consuming is it to push any complex changes to printk." Jan also wanted to get a broader consensus on the appropriate new printk() design, to avoid putting a lot of effort into the wrong direction. Also, as Frédéric put it, "There is also a big risk that if we push back this bugfix, nobody will actually do that desired rewrite."

Frédéric CC'd Linus Torvalds to get his take on the situation, and Linus replied:

Printing from NMI context isn't really supposed to work, and we all *know* it's not supposed to work.

I'd much rather disallow it, and if there is one or two places that really want to print a warning and know that they are in NMI context, have a special workaround just for them, with something that does *not* try to make printk in general work any better.

Dammit, NMI context is special. I absolutely refuse to buy into the broken concept that we should make more stuff work in NMI context. Hell no, we should *not* try to make more crap work in NMI. NMI people should be careful.

Make a trivial 'printk_nmi()' wrapper that tries to do a trylock on logbuf_lock, and *maybe* the existing sequence of

    if (console_trylock_for_printk())
        console_unlock();

then works for actually triggering the printout. But the wrapper should be 15 lines of code for 'if possible, try to print things,' and *not* a thousand lines of changes.

Jiri replied that printing from NMI context was actually "rather useful in a few scenarios – particularly [if] it's the only way to dump stacktraces from remote CPUs in order to obtain traces that actually make sense (in situations like RCU stall); using workqueue-based dumping is useless there."

Petr also replied to Linus's suggested 15-line solution, saying, "I am afraid that basically this is done in my patch set. It does trylock and uses the main buffer when possible. I am just not able to squeeze it into 15 lines."

Linus's comment essentially put a stop to the discussion for a while, because no one knew exactly how to address the issue. The real-world deadlocks still needed fixing, and Linus didn't seem to like Petr's code, which implemented his suggestion but took 1,000 lines instead of 15.

Jiri got the conversation started again, pointing out that the printk() code had recently become much more complex, forcing Linus's 15-line fix to turn into a "crazy mess due to handling of all the indexes, sequence numbers, etc."

Jiri added, "I find it rather outrageous that fixing *real bugs* (leading to hangs) becomes impossible due to printk() being too complex. It's very unfortunate that the same level of pushback didn't happen when new features (that actually *made* it so complicated) have been pushed; that would be much more valuable an[d] appropriate."

Paul McKenney joined the discussion at this point, with an alternative. He said he could possibly change the read-copy-update (RCU) code "to allow people to tell it not to use NMIs to dump the stack." But Jiri replied that this would make the RCU stall detector less useful for identifying problems. The alternative "workqueue-based stack dumping is very unlikely to point its finger to the real offender, as it'd be coming way too late," said Jiri.

Paul replied, "I would not use workqueues, but rather have the CPU detecting the stall grovel through the other CPUs' stacks, which is what I do now for architectures that don't support NMI-based stack dumps."

Jiri agreed this would handle the printk() problem, although it would leave a huge number of NMI-specific tests sprinkled in kernel code, which would all need to be taken out by hand.

Meanwhile, Linus also replied to Paul, saying that his suggestion didn't go far enough. Linus said: "I don't think it should be an 'option'."

He continued, "We should stop using nmi as if it was something 'normal'. It isn't. Code running in nmi context should be special, and should be very very aware that it is special. That goes way beyond 'don't use printk'. We seem to have gone way way too far in using nmi context. So we should get *rid* of code in nmi context rather than then complain about printk being buggy."

Paul posted a small patch to accomplish this and included the following changelog entry: "Although NMI-based stack dumps are in principle more accurate, they are also more likely to trigger deadlocks. This commit therefore replaces all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so that the CPU detecting an RCU CPU stall does the stack dumping."

There was not universal love for Paul's patch. Steven Rostedt and Jiri thought it gave up too much, making it more difficult to debug problems. Jiri pointed out, "This is prone to producing not really consistent stacktraces though, right? As the target task is still running at the time the stack is being walked, it might produce stacktraces that are potentially nonsensical."

Paul agreed that "Yes, there is some potential for confusion. My (admittedly limited) rcutorture testing produced sensible stack traces, but things might be a bit uglier in other situations."

They discussed some possible alternatives but agreed that Paul's solution was a decent workaround for the moment. Still, Jiri said, "I feel bad about the fact that we are now hostages of our printk() implementation, which doesn't allow for any fixes/improvements. Having the possibility to printk() from NMI would be nice and more robust … otherwise, we'll be getting people trying to do it in the future over and over again, even if we now get rid of it at once."

This discussion reminds me a bit of the way folks used to talk about the big kernel lock (BKL). It had gotten so ensconced in so many parts of the system, each with its own unique locking requirements, that no suitable reimplementation could be found. It was an untouchable mess that held the whole kernel hostage. Ultimately, the solution was to decouple all the BKL users from the single centralized implementation, copy the implementation out to the periphery so that each use case had its own identical BKL implementation, and then rewrite each one individually, so it met the locking needs of its particular situation. Who knows, maybe something similar would work for printk().


  1. "Recommendation for Random Number Generation Using Deterministic Random Bit Generators," NIST SP800-90A:

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
