Zack's Kernel News
Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
Supporting Peer-to-Peer Memory Devices
Logan Gunthorpe recently posted a patch to support peer-to-peer PCI memory devices. He explained that this is "a PCI card with a BAR space that points to regular memory. This may be an independent PCI card or part of another completely unrelated device (like an IB card or a NVMe card)." A base access register (BAR) is used for addressing memory on the device. Using PCI memory devices as regular system memory may increase the latency between memory transfers as compared with the primary RAM on the system motherboard, but it could come in very handy in cases where the primary RAM starts to be used up.
Logan, however, acknowledged there were caveats to this approach, particularly when dealing with older PCI devices. As a result, he explained, "The code is designed to only utilize the p2pmem device if all the devices involved in a transfer are behind the same PCI switch. Other cases may still work or be desirable for some end users but it was decided this would be the best course of action to prevent users enabling it and wondering why their performance dropped."
Sinan Kaya had some concerns about the portability of Logan's patch. He understood the desire to avoid unexplained performance hits, but he felt that requiring all devices to be behind the same switch was essentially arbitrary and should be left as a policy decision to be made by the user. He said, "I'd rather see the feature enabled by default without any assumptions. Using it with a switch is just a use case that you happened to test."
Logan said that in principle this would be fine, except for the fact that there was a huge pile of devices to consider, some of which would work as peer-to-peer PCI memory devices and some wouldn't. Trying to identify a complete blacklist would be a big effort. On the other hand, blacklisting everything by default, and white listing only PCI switches would be easy and simple, since PCI switches were known to work. So, it wasn't so much a policy decision as a shortcut that made sure the patch worked at all. Logan remarked that the load sharing facility (LSF) folks "were primarily concerned with not having users enable the feature and see breakage or terrible performance."
Logan said that originally he was in favor of letting users configure which specific PCI devices would be used by his patch, and then they'd be responsible for testing and benchmarking it. But not enough people liked that idea, so he went with the behind-the-same-switch approach.
Sinan saw Logan's point, but said he was "trying to generalize what you are doing to a little bigger context so that I can use it on another architecture like arm64 where I may or may not have a switch." So he still wanted to stick with a blacklist instead of a white list; however, instead of painstakingly sifting through the wreckage of human progress, Sinan suggested simply blacklisting any device created before the year 2016. Presto!
Logan asked, "How do you get a manufacturing date for a given device within the kernel? Is this actually something generically available?" Sinan replied that such information was used all over the place in the kernel, "for introducing new functionality while maintaining backwards compatibility." He suggested checking out the drivers/pci
and drivers/acpi
directories in the source tree for examples of SMBIOS calls.
Logan investigated but was only able to confirm that these calls provided the date the firmware was written for a piece of hardware – not the date of manufacture of the hardware itself. He said, "Saying that the system's firmware has to be written after 2016 seems like an arbitrary restriction that isn't likely to correlate to any working systems."
Logan stuck to his original idea: "Allow all switches and then add a white list of root ports that are known to work well. If we care about preventing broken systems in a comprehensive way then that's the only thing that is going to work."
Sinan agreed that relying only on the firmware date would not be enough. But he explained, "I recommended a combination of blacklist + P2P capability + BIOS date. Not just BIOS date." He went on to propose:
"PCI defines capability registers for discovering features. Unfortunately, there is no direct P2P capability register.
However, Access Control Services (ACS) capability register has flags indicating P2P functionality. P2P feature needs to be discovered from ACS: https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf. This is just one of the many P2P capability flags.
'ACS P2P Request Redirect: must be implemented by Root Ports that support peer-to-peer traffic with other Root Ports5; must be implemented by Switch Downstream Ports.'
If the root port or a switch does not have ACS capability, P2P is not allowed. If these P2P flags are not set, don't allow P2P feature.
The normal expectation from any system (root port/switch) is not to set these bits unless P2P feature is present/working. However, there could be systems in the field with ACS capability but broken HW or broken FW. This is when the BIOS date helps so that you don't break existing systems.
The right thing in my opinion is: 1. Blacklist by pci vendor/device id like any other pci quirk in quirks.c. 2. Require this feature for recent HW/BIOS by checking the BIOS date. 3. Check the P2P capability from ACS."
Sinan added that Logan's idea of only supporting devices behind the same PCI switch couldn't be relied on either. He pointed out that in general, when it came to portability it was important for device drivers to avoid making architecture-specific decisions within their own code. He said, "There are hundreds of device drivers in the kernel. None of them are guaranteed to work in any architecture, but they don't prohibit use either. System integrators like me test these drivers against their own systems, find bugs to remove arch specific assumptions, and post patches."
In the case of Logan's code, Sinan said, the peer-to-peer memory feature was just one of the many users of peer-to-peer support within Linux. It had to make use of that support in some kind of generic way that would work for other architectures as well.
Logan replied, "I'd agree that the final code for determining P2P capability should belong in the pci code. Or more likely an even more generic interface with struct device that is bus agnostic. Though, I'd hope that a lot of this could happen later when there are more kernel users actually wanting to use this code. It's hard to design a generic interface when you only have one user at present."
Logan also pointed out that his patch was not enabled by default, specifically because of the risk of lowering memory performance. The user had to explicitly make the choice, or else we'd see cases where users upgraded their systems, only to discover big unexplained slowdowns.
Sinan replied that this actually made the whole problem much simpler. He hadn't realized the code was disabled by default. In that case, he said, "Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch. [...] If you are just worried about performance, the switch recommendation belongs to your particular product tuning guide or a how-to document not into the actual code itself. I think you should get rid of all pci searching business in your code." Logan liked this idea a lot and said it was what he'd originally preferred but had gotten too much opposition from the LSF folks.
At this point the discussion ended, although presumably the LSF folks will have their own response. Personally, I find this type of debate fascinating, because you can never tell when something that seems like a trivial detail will turn out to be the linchpin of some other element of the kernel, with its own set of requirements. Often these days, security seems to be the sudden wrench in the works of a hot new feature, but in the peer-to-peer memory case it seems to be portability. When the LSF folks have their say, another element may turn out to be crucially important.
Adding a Kernel Thread for printk()
Sergey Senozhatsky posted a patch to move printk() to its own separate kernel thread. This would allow the kernel to print messages to the console, while avoiding certain race conditions that might lead to system lockup. He'd been using this patch in production for nearly a year, with only a few issues. The biggest of these was that having a separate printk() thread changed the behavior of printk() in a couple particular corner cases. In one case, if one CPU on the system is much faster than the CPU running the printk() thread, it could result in some printk() messages never reaching the console. In the other corner case, if one CPU called a ton of printk()s, it could interfere with the other CPUs being able to call printk() successfully.
Petr Mladek liked Sergey's patch in general and offered some minor naming suggestions. Peter Zijlstra also offered his ideas, although he said he was not yet convinced that a whole new thread for printk() was truly justified. After all, each new thread on the system is something that the scheduler has to cycle through, which slows things down, even if it's just by a small amount.
Sergey agreed in general, but he said he saw no other way to cleanup the lockups and other problems currently associated with printk(). Sergey and Petr continued to discuss message passing issues, but it wasn't clear that Peter's objections were satisfied. Eventually the discussion petered out (ouch) without resolution.
Ultimately, there is likely to be resistance from many kernel folks, who don't want to add new kernel threads without very compelling justification. In the case of printk(), that justification may not exist. Yes, it's currently necessary for printk() users to be aware of when and how they call printk(), or else risk triggering a lockup or other problem. But, it is actually possible for those users to be aware of those things and avoid the problem. So, Sergey's patch may represent just a simple convenience and not a real improvement for the kernel. And, if simple convenience were all that was needed to justify implementing new kernel threads, many other kernel developers with their own pet projects would be submitting them as well. So, unless Sergey's code turns out to make things a lot better than they are now, he's likely to encounter trouble getting it into the main source tree.
AVR32 Architecture End Of Life
Hans-Christian Noren Egtvedt pointed out that the AVR32 architecture has not been well-maintained lately in the kernel. And because it shared so many drivers with Atmel ARM system on a chip (SoC) – which was actively maintained – it was starting to be harder to keep those drivers up to date for that architecture. Hans proposed ditching the AVR32 architecture entirely and just taking it out of the kernel.
Both of these architectures come out of the Atmel corporation and are SoC solutions. But as Hans put it, "all AVR32 AP7 SoC processors are end of lifed from Atmel" and would no longer be supported. Also, he said, the GCC tool chain was stuck at an older version, as it had not received any patches from Atmel in recent times. He suspected there were very few AVR32 users left in the world, if any at all. Better to put it out of its misery than keep it on life support.
Andy Shevchenko also pointed out that the Buildroot distribution had already removed support for AVR32 back in 2015. He approved Hans's patch wholeheartedly. Boris Brezillon also approved the patch and offered up his grateful thanks. Nicolas Ferre volunteered to help clean up any code fallout that resulted from removing the AVR32 architecture. And, Håvard Skinnemoen, who had worked alongside Hans on all of this, offered some additional suggestions of remaining pieces to expunge.
Ultimately, no one objected to getting rid of AVR32, so it does seem destined for the chopping block in the immediate future.
Buy this article as PDF
(incl. VAT)