Zack's Kernel News
Zack Brown discusses implementing digital rights management in-kernel, improving lighting controls, and updating printk().
Implementing Digital Rights Management In-Kernel
Content providers are always interested in ways to stream audio and video in such a way that the data cannot be copied by the recipient. Sean Paul recently posted a patch that the Chrome OS team has been using for three years to control content on Exynos, MediaTek, and Rockchip hardware. The patch can turn content protection off entirely, request that the hardware driver enable content protection, and actually stream protected data.
The patch was received with suspicion by kernel developers.
Pavel Machek specifically said that he couldn't see any case where a user would set the feature to anything other than "off." He also asked, "If kernel implements this, will it mean hardware vendors will have to prevent user[s] from updating the kernel on machines they own?" And wondered, "If this is merged, does it open kernel developers to DMCA threats if they try to change it?"
Daniel Vetter made the case that this particular patch would only encrypt data over a cable, using High-bandwidth Digital Content Protection (HDCP). It wouldn't implement any other aspects of content protection and was thus a generic data security feature. He added, "If you want to actually lock down a machine to implement content protection, then you need secure boot without an unlockable bootloader and a pile more bits in userspace. If you do all that, only then do you have full content protection. And yes, then you don't really own the machine fully."
Pavel replied, "This patch makes it more likely to see machines with locked down kernels, preventing developers from working with systems on their own, running hardware. That is evil and [a] direct threat to the Free software movement."
He added, "Users compiling their own kernels get no benefit from it. Actually it looks like this only benefits Intel and Disney. We don't want that." And he concluded, "it does not belong in kernel."
Alex Deucher suggested that the patch might be useful for "sensitive video streams in government offices where you want to avoid a spy potentially tapping the cable to see the video stream." He added that it was not just Intel and Disney who would benefit, but also "just about every SoC manufacturer and Google and Amazon and a ton of other companies and organizations."
Alex suggested that if the kernel folks didn't want a patch like this, then they should also remove support for encrypted filesystems and encrypted virtual machines. To which Pavel replied, "Encrypted filesystems benefit users. Encrypted video is designed to work against users. In particular, users don't have encryption keys for video they generate. I'd have nothing against [a] feature that would let users encrypt video with keys they control."
Meanwhile, Sean pointed out that his patch only enabled features that were already present in the machine's hardware. He wasn't implementing anything new, just giving the kernel the ability to control hardware features that were already present. He said, "those registers exist and _can_ be used for HDCP; it's just that now you know about it. Having all of the code in the open allows users to see what is happening with their hardware; how is this a bad thing?"
And along the same lines, Daniel also said to Pavel, "you can't claim to speak for the entire kernel and FLOSS community of users and developers. The feature is optional: It does not enforce additional constraints on users but exposes additional functionality already present in hardware, for those who wish to opt in to it. Those who wish to avoid it can do so, by simply not making active use of it."
At this point, Alan Cox came in, saying that he was speaking for himself at this time, and not for his employer, Intel. He said:
"The upstream policy has always been that we don't merge things which don't have an open usable user space. Is the HDCP encryption feature useful on its own? What do users get from it?
If this is just an enabler for a lump of binary stuff in Chrome OS, then I don't think it belongs; if it is useful standalone, then it seems it does belong?"
He also added in response to Alex's scenario regarding secure government communications, "Last time I checked HDCP did not meet government security requirements – which is hardly surprising since you can buy $10 boxes from China to de-HDCP video streams."
Daniel pointed out that everything going into Chrome OS was open source and said that Chrome OS had very strict requirements about what could go into the userspace side of things as well.
That was the end of the discussion. It looks as if some very heavy hitters are opposed to this going into the kernel, although in a case like this, it's impossible to predict what Linus Torvalds will ultimately decide. As Alan says, if there's a legitimate use for the code, Linus would be more likely to include it, while if the only use is to lock people out of their own systems, Linus would refuse. It's also possible that someone might notice a way that this hardware could be used to benefit regular users, while completely failing to satisfy the large content providers. In that case, Linus would be likely to simply repurpose the hardware to do the user-beneficial thing, in complete disregard of the original intent of the hardware.
I would guess that, as it stands, the code is DOA, and the Chrome OS people will have to keep maintaining it on their own codebase for the foreseeable future.
Improving Lighting Controls
Enric Balletbo i Serra posted a patch to adjust the way Linux controls back-lit screens. As he pointed out, the human eye perceives changes to light levels differently in low-light situations. He felt that the CIE 1931 lightness formula represented the proper way to calculate Linux screen behavior during user configuration. He said, "This patch adds support to compute the brightness levels based on a static table filled with the numbers provided by the CIE 1931 algorithm."
There were a few technical comments. Daniel Thompson felt the data table could be made smaller, and Pavel Machek felt the table could probably just be generated on the fly. But Enric replied, "This was discussed a bit in previous RFC which had the code to generate the table on the fly [...]. The use of a fixed table or an on-the-fly table is something that I'll let the maintainers decide. I've no strong opinion on the use of the fly table."
Other technical comments revolved around the relationship between the actual amount of light and the way a human would perceive it, and how best to adjust those numbers.
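To make the idea of a CIE 1931 brightness table concrete, here is a minimal sketch in Python. It is not Enric's patch: the function names, the table size, and the maximum duty-cycle value are assumptions chosen for illustration. It uses the standard inverse of the CIE lightness formula, which maps a perceived lightness L* (0 to 100) to a relative luminance Y (0 to 1), and then scales Y to a hypothetical hardware range.

```python
def cie1931_luminance(lightness):
    """Map perceived lightness L* (0..100) to relative luminance Y (0..1).

    Inverse of the CIE lightness formula: linear below L* = 8,
    cubic above it.
    """
    if lightness <= 8.0:
        return lightness / 903.3
    return ((lightness + 16.0) / 116.0) ** 3


def brightness_table(steps, max_duty):
    """Build a table of hardware levels for evenly spaced perceived steps.

    `steps` and `max_duty` are illustrative parameters (assumptions),
    not values taken from the patch.
    """
    if steps < 2:
        raise ValueError("need at least two brightness steps")
    table = []
    for i in range(steps):
        lightness = 100.0 * i / (steps - 1)  # evenly spaced in L*
        table.append(round(cie1931_luminance(lightness) * max_duty))
    return table


if __name__ == "__main__":
    # The same function serves both designs discussed on the list:
    # run it once to fill a static table, or call it on the fly.
    print(brightness_table(16, 255))
```

The curve is strongly nonlinear: a perceived lightness of 50 corresponds to a relative luminance of only about 0.18, which is why equal steps in the raw hardware value do not look like equal steps to the eye.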
But there were no outcries against the patch. Everyone seemed in agreement that human perceptions should be given preference in Linux over those of other animals, even in spite of New Zealand's 2015 legal recognition of animals as sentient beings. Hopefully a dolphin-centric version of Enric's code will be coming soon.
Updating printk()
Sergey Senozhatsky posted a patch to give the printk() function its own thread of execution on the running system. This has been an ongoing effort, with many versions of the patch coming down the pike. The basic issue is that printk() is not safe to call everywhere in the kernel. The reason is that the call to output its log messages, console_unlock(), may loop forever, which means that if printk() is called in an atomic (uninterruptible) context, it could lock the system. This isn't really a danger, since kernel code knows to avoid that situation. But it does mean that printk() may not be called to log messages that really should be logged.
Sergey's patch offloads the entire question to another thread, whose primary purpose is to recognize and break out of loops, returning control to the system. There are also certain emergency circumstances, like during a system panic, where normal interrupts no longer take place, and printk() must continue to log messages even without its dedicated thread.
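The general shape of that design, a fast logging call that only queues the message, a dedicated thread that does the slow output, and an emergency path that flushes directly when the thread cannot be relied on, can be sketched as a user-space analogy in Python. This is not Sergey's kernel code; the class and method names are hypothetical, and real kernel constraints (atomic context, interrupts, console locking) are not modeled.

```python
import queue
import threading


class DeferredLogger:
    """User-space analogy: log() only enqueues; a worker does the output."""

    def __init__(self, sink, start_worker=True):
        self._sink = sink              # callable performing the slow output
        self._pending = queue.Queue()  # messages waiting to be written
        self._worker = None
        if start_worker:
            self._worker = threading.Thread(target=self._flush_loop,
                                            daemon=True)
            self._worker.start()

    def log(self, message):
        """Fast path: never calls the slow sink from the caller."""
        self._pending.put(message)

    def _flush_loop(self):
        """Worker thread: write queued messages until told to stop."""
        while True:
            message = self._pending.get()
            if message is None:        # sentinel queued by close()
                break
            self._sink(message)

    def emergency_flush(self):
        """Panic-style path: drain the queue in the caller's own context,
        for situations where the worker thread will not run."""
        while True:
            try:
                message = self._pending.get_nowait()
            except queue.Empty:
                break
            if message is not None:
                self._sink(message)

    def close(self):
        """Stop the worker after it has written everything queued so far."""
        if self._worker is not None:
            self._pending.put(None)
            self._worker.join()
            self._worker = None


if __name__ == "__main__":
    lines = []
    logger = DeferredLogger(lines.append)
    for text in ("first", "second", "third"):
        logger.log(text)
    logger.close()
    print(lines)
```

The trade-off visible even in this toy version is the one the thread argues about: the caller returns quickly, but messages reach the output only when the worker gets to run, so a separate direct path is needed for the cases where it never will.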
As with everything having to do with bootup and shutdown, the code is insane. Hence, the many versions submitted for review.
Petr Mladek preferred a separate approach, written by Steven Rostedt, in spite of that code being complex and prone to false bug reports from people who couldn't figure out what it was doing. On top of that, it didn't fully solve the problem of locking the system. However, Petr preferred it because, as messy as it was, it represented a more modular approach that was still less complicated and insane than Sergey's code.
Tejun Heo said he didn't care which approach got in; he just wanted something that would work (i.e., not lock the system). But he felt both patches were way too complicated.
Meanwhile, although Steven's version theoretically could still lock the system, no one has been able to reliably reproduce such a case. Steven came in at this point to remark, "I still don't believe there is one. And it's all hand waving until there's an actual report that we can lock up the system with my approach."
Sergey objected to Steven's approach, because he said it required calling printk() from a CPU that was not in atomic mode. But he said, "what happens if there is NO non-atomic CPU or that non-atomic simply misses the console_owner != false point?"
The bottom line, Sergey said, was that the code had to do the exact right thing at the exact right time on the exact right CPU, and the exact right time was a vanishingly tiny window.
But this, in fact, was the case that Steven felt had not yet been proved. He asked if anyone could confirm that there was indeed a way to lock the system like that. Before going with Sergey's more complex approach, Steven wanted real evidence that his own not-quite-as-complex approach was truly insufficient.
Sergey replied that at his company the engineers encountered this problem fairly frequently; he posted some info about how it came about.
But Petr was still skeptical. He said that the "console_lock() owner is able to sleep. Therefore there is no risk of a soft lockup. Sure, many messages will get stacked in the meantime, and the console owner m[a]y get then passed to another owner in atomic context. But do you really see this in the real life?"
As he put it, "My current view is that Steven's patch could not make things worse. I was afraid of possible deadlock, but it seems that I was wrong. Other than that, the patch should make things just better because it allows you to pass the work from time to time a safe way."
But Sergey replied, "we are not looking for a solution that does not make things worse. We are looking for a solution that improves the thing."
And Petr replied that they should just push the code into the kernel and see what shook loose. If there were bug reports, then the kernel developers could act on them. But he didn't want to preemptively try to patch a bug that no one would ever encounter, at the cost of increasing code complexity. To which Steven agreed wholeheartedly.
Sergey began in earnest to try to design a sequence of events to trigger the lockup that he believed was there. But he also remarked, "I don't even understand what our plan is. I don't see how are we going to verify the patch. Over the last 3 years, how many emails do you have from Facebook or Samsung or Linaro, or any other company reporting the printk-related lockups? I don't have that many in my Gmail inbox, to be honest. and this is not because there were no printk lockups observed. This is because people usually don't report those issues to the upstream community. Especially vendors that use outdated kernels, that are approx 1-2 year(s) behind the mainline. And I don't see why it should be different this time. It will take years before vendors pick the next LTS kernel, which will have that patch in it. But the really big problem here is that we already know that the patch has some problems. Are we going to conclude that 'no emails === no problems'? With all my respect, it does seem like, in the grand scheme of things, we are going to do the same thing, yet expect a different result."
He said in conclusion that Steven's patch did not meet the real-world needs he was encountering at his own company, and that Steven's code resulted in worse behavior than the current version of printk(), given that it would sleep longer, cause userspace applications to time out, and still allow the system lockup that Sergey wanted to fix.
Finally, Sergey did post some process traces showing a lockup. But Steven didn't find these convincing. He said, "The traces I've seen from you were from non-realistic scenarios. But I have hit issues with printk()s happening that cause one CPU to do all the work, where my patch would fix that. Those are the scenarios I'm talking about."
Steven and Sergey went back and forth for a bit, each growing more and more frustrated with the other.
Other people started jumping in at this point, and eventually Tejun returned, preferring Sergey's approach because he, too, was seeing the lockups and needed them to be addressed. He said to Steven, "I tried your v4 patch and ran the test module and could easily reproduce RCU stall and other issues stemming from a CPU getting pegged down by printk flushing." He continued, "this isn't a theoretical problem. We see these stalls a lot. Preemption isn't enabled to begin with. Memory pressure is high, and OOM triggers and printk starts printing out OOM warnings; then, a network packet comes in, which triggers allocations in the network layer, which fails due to memory pressure, which then generates memory allocation failure messages, which then generates netconsole packets, which then tries to allocate more memory and so on. It's just that there's no one else to give that flushing duty too, so the ping-ponging that your patch implements can't really help anything."
Tejun concluded, "You argue that it isn't made worse by your patch, which may be true, but your patch doesn't solve actual problems and is most likely an unnecessary complication, which gets in the way for the actual solution."
Steven looked at Tejun's scenario, but said, "WTF! You are printing 10,000 printk messages from an interrupt context??? And to top it off, I ran this on my box, switching printk() to trace_printk() (which is extremely low overhead). And it is triggered on the same CPU that did the printk() itself on. Yeah, there is no hand off, because you are doing a shitload of printks on one CPU and nothing on any of the other CPUs. This isn't the problem that my patch was set out to solve, nor is it a very realistic problem."
Well, the discussion is ongoing. We have multiple people accusing each other of not reading what they're writing, curses flying back and forth, and people generally talking past each other. There's no end in sight.
And yet, even in the heat of frustration and disagreement, both sides are still taking each other seriously and trying to address each other's concerns. Steven has begun to think that Tejun's scenario is a bug in another part of the kernel code and has begun trying to diagnose that. At which point, possibly, both Sergey and Tejun would agree that Steven's code addresses the only real problems that remain. Meanwhile, both Sergey and Tejun have been trying to post real-world scenarios, process traces, and code that reveal the bug that Sergey's code attempts to fix, but that Steven's does not.
And of course, this debate has been ongoing for quite a while already, with multiple patches, and multiple debates, occurring over several years. There's no way to know where this will lead or how it will pan out. The code is already insanely complex. The patches to fix it are more complex still. And workloads that reproduce the problems may or may not be related to what the developers are trying to fix.