Zack's Kernel News
Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
Revamping the Firmware API
Luis R. Rodriguez didn't like that the kernel's firmware API had been gradually extended into a nightmare mash of hopeless deformity. Any time someone updated a firmware routine, or added a new one, all existing firmware users had to be sifted and updated to account for the change. The API had also been extended to the point that user code had begun to use it for things other than firmware. As Luis put it, "an example here is the p54 driver enables users to provide a custom EEPROM through this interface. Another example is optional CPU microcode updates. This list is actually quite endless." He added that there was a queue of other subsystems wanting to use the firmware API for their own non-firmware purposes.
Luis proposed, first of all, continuing to support the existing firmware API. Beyond that, it would provide a superset of features, some of which could be used for firmware, but that would have more general "data helper" value as well. The new API would handle calls synchronously or asynchronously as needed, and in general the API would be extensible in natural ways that would not require going back to change all existing usage every time a new extension arrived.
Greg Kroah-Hartman had no major objections and mostly offered documentation suggestions.
One thing Luis's code did not do was address the firmware fallback options. If a firmware update doesn't work, it's nice to be able to use a previous version while you replace the new version. But Luis found the fallback code to be "hairy" and wanted to tackle it properly. So his initial plan was to leave all fallback fixes out and just deal with them as soon as possible. As he put it, "I really am trying hard to make that fallback mechanism *work* fine; right now its just hairballs."
On another note, Luis added, "There are also some more longer term things to consider which I'd like to address as well, for instance, the firmware code is the only code using the general UMH [User Mode Helper] lock, although it was originally added to help warn for firmware API uses on suspend/resume, the firmware cache mechanism now helps resolve the issues – the only case which cannot take advantage of this is the custom fallback mechanism. One also needs to consider if the UMH lock or a replacement should be considered for other kernel UMH."
So Luis would not be fixing absolutely everything all at once. But he also pointed out that there were currently race conditions with respect to the kernel's init code, that couldn't really be addressed properly with the old firmware code still in place. Once the new code had been adopted, he said, the races would be much more fixable.
Bjorn Andersson asked for some reassurance that when the new fallback design did get coded into the kernel, it would be compatible with existing user space, "or will we forever have two duplicate systems for loading 'firmware' in the kernel?"
Luis replied that the old API would be sticking around indefinitely. But that it would become static, and all new features would be added to the new API. Beyond that, he said, he wasn't certain exactly how compatible the old and new APIs would be, and so he couldn't say for sure that the old API would definitely go away.
I love reading about these crazy rescue missions. It's always something everyone relies on, but that hasn't been maintained in years and has just been accruing patches like crustaceans until the whole thing is ready to tip over and fall back into the sea. Except someone decides to write a whole new thing that's not quite as susceptible to crustacean infestation.
Crypto Optimization
Binoy Jayan wanted to optimize some of the kernel's cryptographic routines. Specifically, he pointed out that the initialization vectors (IVs) were generated in the dm-crypt.c file. If they could be implemented as template ciphers in the kernel's cryptographic layer, the IVs could run directly in hardware, rather than being implemented in software. This would produce a significant speedup, he said.
With the IV generator code migrated, the dm-layer code could be optimized to send a whole "bio" at a time, and Binoy explained that "each bio contains the in memory representation of physically contiguous disk blocks. The dm layer sets up a chained scatterlist of these blocks split into physically contiguous segments in memory so that DMA can be performed."
Milan Broz cautioned against Binoy's approach. Some of the IV generator code, he said, were "hacks," and some existed only for compatibility reasons and should not be regarded as safe. He also said that if the IV generator code was moved into the kernel's crypto layer, it would make it difficult to change the cryptographic key structure later on, if they should ever want to.
Herbert Xu felt that the code could remain in the dm-crypt.c file under the control of the dm-crypt team but still be registered in the kernel's crypto layer to take advantage of the greater hardware efficiency.
Herbert and Binoy went back and forth for a bit on implementation issues and came to some kind of understanding about how to divvy up different pieces of the code.
Meanwhile Gilad Ben-Yossef was not even able to test Binoy's patches because they wouldn't apply cleanly against his tree. Binoy explained that "there were some key structure changes in dm-crypt after I sent out v2. I have resolved them while working on v3. Please wait for the next version of the patchset. I'll send it probably by next week. I wanted to incorporate a few changes suggested by Herbert before sending them."
At the same time, Ondrej Mosnacek came in with his own objections:
"I like what you are trying to achieve; however, I don't think the solution you are heading towards (passing sector number to a special crypto template) would be the best approach here. Milan is currently trying to add authenticated encryption support to dm-crypt and as part of this change, a new random IV mode would be introduced. This mode generates a random IV for each sector write, includes it in the authenticated data and stores it in the sector's metadata (in a separate part of the disk). In this case, dm-crypt will need to have control over the IV generation (or at least be able to somehow retrieve it after the crypto operation).
That said, I believe a different approach would be preferable here. I would suggest, instead of moving the IV generation to the crypto layer, to add a new type of request to skcipher API (let's call it skcipher_bulk_request), which could be used to submit several messages at once (together in a single sg list), each with their own IV, to a skcipher. This would allow drivers to optimize handling of such requests (e.g., the SIMD ciphers could call kernel_fpu_begin/end just once for the whole request). It could be done in such a way that implementing this type of requests would be optional and a fallback implementation, which would just split the request into regular skcipher_requests, would be automatically set for the ciphers that do not set it themselves. That way, this would require no changes to crypto drivers in the beginning and optimizations could be added incrementally.
The advantage of this approach to handling such 'bulk' requests is that crypto drivers could just optimize regular algorithms (xts(aes), cbc(aes), etc.) and wouldn't need to mess with dm-crypt-specific IV generation. This also means that other users that could potentially benefit from bulking requests (perhaps network stack?) could use the same functionality."
Mosnacek added that he was currently working on a set of patches to implement a proof-of-concept for this idea and planned to post something soon. But at this point, the conversation came to an end.
This discussion yielded no clarity about how things would eventually work out. Which is great – it's just folks talking about how to make something cool. What I like about these kinds of discussions is how people from disparate areas of the kernel bring what they know, often in ways that surprise each other, and end up coalescing around a solution none of them could have envisaged at the start.
Fixing Memory Access for PCI Devices
Nikita Yushchenko pointed out that a given PCI device could conceivably support up to 64-bit direct memory access (DMA) addressing. But PCI host bridge had limitations on its addressing abilities that prevented it from accessing that much RAM. Nikita felt that instead of allowing PCI devices to claim the ability to access 64-bits when in practice they wouldn't be able to, it would be better to prevent them from claiming that ability in the first place.
He posted a patch to implement that. Will Deacon objected that the patch treated the problem as if it were architecture-specific, which it wasn't. He suggested that "another hack you could try would be to register a PCI bus notifier in the host bridge looking for BUS_NOTIFY_BIND_DRIVER, [and] then you could proxy the DMA ops for each child device before the driver has probed, but adding a dma_set_mask callback to limit the mask to what you need."
Arnd Bergmann posted an experimental patch of his own, saying, "This is what I prototyped a long time ago when this first came up. I still think this needs to be solved properly for all of ARM64, not with a PCI-specific hack, and in particular not using notifiers."
Nikita objected that Arnd's patch was far from complete and added, "In current device trees no dma-ranges is defined for nodes that are parents to PCI host bridges. This will make of_dma_configure() to fall back to 32-bit size for all devices on all current platforms. Thus applying this patch will immediately break 64-bit DMA masks on all hardware that supports it."
Arnd disagreed with Nikita's diagnosis, saying that the patch wouldn't break 64-bit DMA masks. But he also felt that it wasn't clear where the true underlying kernel issue was. He asked, "Is it actually the PCI host bridge that limits the ranges here, or the bus that it is connected to?"
Nikita acknowledged that he wasn't sure where the true source of the limitation was. He and Arnd dove into a technical exploration of the problem. At one point, Arnd remarked, "we have to guarantee that the fallback to 32-bit DMA always succeeds. There are also a lot of drivers that try to set a 64-bit mask but don't implement bounce buffers for streaming mappings if that fails, and swiotlb is what we use to make those drivers work."
(He also added, "And yes, the API is a horrible mess." And Nikita replied, "The entire infrastructure to allocate and use DMA memory is messy.")
At one point, Arnd gave an historical analysis of the situation. "What I think happened here in chronological order is:
- In the old days, 64-bit architectures tended to use an IOMMU all the time to work around 32-bit limitations on DMA masters.
- Some architectures had no IOMMU that fully solved this and the dma-mapping API required drivers to set the right mask and check the return code. If this failed, the driver needed to use its own bounce buffers as network and scsi do. See also the grossly misnamed PCI_DMA_BUS_IS_PHYS macro.
- As we never had support for bounce buffers in all drivers, and early 64-bit Intel machines had no IOMMU, the swiotlb code was introduced as a workaround, so we can use the IOMMU case without driver-specific bounce buffers everywhere.
- As most of the important 64-bit architectures (x86, arm64, powerpc) now always have either IOMMU or swiotlb enabled, drivers like NVMe started relying on it, and no longer handle a dma_set_mask failure properly."
At some point, Christoph Hellwig joined in the fun, and he and Arnd continued the technical analysis. Eventually, the thread petered out inconclusively.
Ultimately, no clear diagnosis of the problem came out of the discussion, but it's clear that the problem exists and the current code is broken. So that's something. It's also clear that several folks are delving into the code to try and untangle what went wrong, so there will probably be some kind of fix in the near future.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Rhino Linux Announces Latest "Quick Update"
If you prefer your Linux distribution to be of the rolling type, Rhino Linux delivers a beautiful and reliable experience.
-
Plasma Desktop Will Soon Ask for Donations
The next iteration of Plasma has reached the soft feature freeze for the 6.2 version and includes a feature that could be divisive.
-
Linux Market Share Hits New High
For the first time, the Linux market share has reached a new high for desktops, and the trend looks like it will continue.
-
LibreOffice 24.8 Delivers New Features
LibreOffice is often considered the de facto standard office suite for the Linux operating system.
-
Deepin 23 Offers Wayland Support and New AI Tool
Deepin has been considered one of the most beautiful desktop operating systems for a long time and the arrival of version 23 has bolstered that reputation.
-
CachyOS Adds Support for System76's COSMIC Desktop
The August 2024 release of CachyOS includes support for the COSMIC desktop as well as some important bits for video.
-
Linux Foundation Adopts OMI to Foster Ethical LLMs
The Open Model Initiative hopes to create community LLMs that rival proprietary models but avoid restrictive licensing that limits usage.
-
Ubuntu 24.10 to Include the Latest Linux Kernel
Ubuntu users have grown accustomed to their favorite distribution shipping with a kernel that's not quite as up-to-date as other distros but that changes with 24.10.
-
Plasma Desktop 6.1.4 Release Includes Improvements and Bug Fixes
The latest release from the KDE team improves the KWin window and composite managers and plenty of fixes.
-
Manjaro Team Tests Immutable Version of its Arch-Based Distribution
If you're a fan of immutable operating systems, you'll be thrilled to know that the Manjaro team is working on an immutable spin that is now available for testing.