Zack's Kernel News
Zack Brown discusses how to be a maintainer, power-up/power-down control, and smoothing out disk caching.
How To Be a Maintainer
Tobin C. Harding posted a patch to create a new documentation file in the kernel source tree, describing how to be a kernel driver or subsystem maintainer. As with Linus Torvalds' requirements for a usable revision control system long ago, it's a little surprising that the information hadn't already been pulled together. Regardless, Tobin's done it now.
Most of the text was taken from a mailing list discussion between Greg Kroah-Hartman and Linus. Essentially, it describes the Git features and use cases that are most relevant to being a maintainer. For example, all patches need to be digitally signed to confirm that they're really coming from the person they claim to come from. So, to be a maintainer, you need to be able to set up a public key using GPG2 and configure Git to use it by default.
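A minimal sketch of that setup might look like the following (the key ID ABCD1234 is a placeholder, and exact gpg2 options vary between GnuPG versions):

  # Generate a new GPG key pair (gpg2 prompts for the details)
  gpg2 --full-generate-key

  # List your keys to find the key ID (ABCD1234 stands in for it below)
  gpg2 --list-secret-keys --keyid-format long

  # Tell Git which key to sign with, and sign by default
  git config --global user.signingkey ABCD1234
  git config --global commit.gpgsign true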
Tobin's doc goes on to say that you'll also need to be able to create pull requests – i.e., to let other maintainers know that you've got some code to share and how they can get it. This involves creating a new named branch containing all the changes you want to share. The tag marking that branch's tip can be digitally signed, or not – maintainers differ in their willingness to pull unsigned tags from other maintainers. However, Linus will only pull signed tags into his tree.
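In Git terms, a minimal sketch of that might be the following (the branch and tag names are hypothetical):

  # Create a topic branch for the work you want to share
  git checkout -b fixes-for-foo

  # ... commit your changes ...

  # Mark the branch head with a GPG-signed tag
  git tag -s fixes-for-foo-tag -m "Bug fixes for the foo driver"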
Of crucial importance is the message that accompanies your pull request. This message goes into the Git project history and may need to be relied on in the future. As Linus says, "it should not just make sense to me, but make sense as a historical record too." Partly for those reasons, Linus and other maintainers may reserve the right to edit the text of any pull messages they receive, before sending them along to the actual tree.
Once you've got the branch the way you like it, you'll need to generate a pull request against the tree of the maintainer you're sending it to. Presumably this is the same tree as yours, although perhaps in a slightly different historical state. The pull request reconciles those differences into something the recipient will be able to make sense of.
Finally, you submit the pull request in an email, just as if it were an ordinary patch.
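Concretely, this is what git request-pull produces; the base branch, repository URL, and tag name below are hypothetical examples:

  # Generate the pull request text: everything between the base
  # (here 'master') and the signed tag, fetchable at the given URL
  git request-pull master git://git.example.org/foo.git fixes-for-foo-tag

  # Then paste the command's output into an email to the maintainer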
That's as far as Tobin's doc went. Greg said it was an excellent beginning and proposed adding a section on how to set up Git on a given system.
Dan Williams also liked what Tobin had done so far and suggested two additional sections: one on how to "age" commits in the -next tree so that they could be culled or sent along to Linus at a certain point and another describing various techniques for avoiding rebasing. In the Git world, rebasing is a way of cleaning up a branch's history, but it also reshuffles patches, essentially rewriting history. When you're in the midst of feeding your patches up the food chain to a higher maintainer, this can mess with that maintainer's ability to see what actually happened in your project's history. The whole issue is a bit of a religious debate, but the Linux kernel has its own set of preferences, handed down from Linus.
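As a rough illustration of the difference:

  # Rebasing replays your commits on top of the new upstream tip,
  # creating new commit IDs - i.e., rewriting history
  git rebase upstream/master

  # Merging records a join point and leaves your original
  # commits - and their history - untouched
  git merge upstream/master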
Tobin replied that he'd like to read that section himself!
Meanwhile, folks like Mauro Carvalho Chehab offered their own nuanced suggestions for things like Git branch naming conventions (specifically, he felt the names didn't matter much).
It seems as though there is plenty of room for the doc to grow, and it'll probably do so in some fairly fascinating ways. It'll be a mix of specific Git use cases and oral history converted to firm policy. It's nice too, because anyone will be able to adapt the doc to their own open source projects.
Power-Up/Power-Down Control
Jon Hunter proposed replacing or extending the power management (PM) domain framework to be more flexible. The current system, GenPD, would let users associate a particular PM domain with a particular device. This domain would define a hierarchical structure of devices that would be powered on and off together, in a given sequence. The problem, Jon said, came when various pieces of hardware didn't necessarily need to be powered on and off together, although they might in certain circumstances. GenPD provided no alternative way of powering these devices up and down.
As an example, he said that the Tegra124/210 XUSB subsystem involved several pieces of hardware that were able to be used independently of each other. There was no specific need to power them up and down together, except that GenPD provided no alternative.
His proposal, he said, "extends the generic PM domain framework to allow a device to define more than one PM domain in the device-tree 'power-domains' property. If there is more than one, then the assumption is that these PM domains will be controlled explicitly by the consumer, and the device will not be automatically bound to any PM domain." More specifically, he said the new code would "add new APIs for GenPD to allow consumers to get, power on, power off, and put PM domains so that they can be explicitly controlled by the consumer. These new APIs for powering on and off the PM domains call into the existing internal functions, genpd_sync_power_on/off(), to power them on and off. To ensure that PM domains that are controlled [both] explicitly (via these new APIs) and implicitly (via runtime-pm callbacks) do not conflict, the PM domain device_count and suspended_count counters are used to ensure the PM domain is in the correct state."
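From the consumer's side, the flow Jon describes might look roughly like the sketch below. Note that the pm_genpd_get(), pm_genpd_poweron(), pm_genpd_poweroff(), and pm_genpd_put() names are invented placeholders for the proposed get/power-on/power-off/put APIs, not functions that exist in the kernel:

  /*
   * Hypothetical consumer sketch; all pm_genpd_* names here are
   * illustrative placeholders for the proposed APIs. Only
   * genpd_sync_power_on/off(), which they would call internally,
   * is mentioned in the actual posting.
   */
  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/pm_domain.h>

  static int foo_use_unit(struct device *dev)
  {
          struct generic_pm_domain *pd;

          /* Get the first domain listed in the DT "power-domains" property */
          pd = pm_genpd_get(dev, 0);
          if (IS_ERR(pd))
                  return PTR_ERR(pd);

          /* Explicitly power the domain on before touching the hardware */
          pm_genpd_poweron(pd);

          /* ... program the hardware unit ... */

          /* Power down and release the reference once the unit is idle */
          pm_genpd_poweroff(pd);
          pm_genpd_put(pd);
          return 0;
  }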
Rajendra Nayak was glad to see this work being done and offered some technical suggestions. In particular, he wanted to be able to track the devices directly to check their power status, rather than relying on the state of an associated variable. He also suggested isolating the implementation from the interface – providing users with a "handle" to access a given device, rather than exposing the contents of particular data structures that might change in future patches.
Meanwhile, Ulf Hansson wasn't sure adding a whole new set of API calls could be justified for what he felt might just be a few corner cases, but he acknowledged, "However, we currently know about at least two different SoCs [Systems on a Chip] that need this." So, he seemed to be on board with the idea that something needed to be done, even if it was more lightweight than Jon's original suggestion. He suggested that "it may be better to add some kind of aggregation layer on top of the current PM domain infrastructure" instead of an actual set of API calls.
Geert Uytterhoeven felt that there was probably enough relevant hardware to say that they weren't corner cases but deserved a more robust and flexible power-up/power-down approach.
Meanwhile, Rafael J. Wysocki pointed out that "the PM core takes only one set of PM operation[s] per device into account, and therefore, in every stage of system suspend, for example, the callback invoked by it has to take care of all actions that need to be carried out for the given device, possibly by invoking callbacks from other code layers. That limitation cannot be removed easily, because it is built into the PM core design quite fundamentally." He tended to believe that extending GenPD was not the way to go, "because power-on/power-off operations used by GenPD can be implemented in terms of lower-level power resource control, and I don't see the reason for mixing the two in one framework."
Jon was fine with this, because he felt that maybe GenPD should be a client sitting on top of whatever services his code would offer, "so that power domains are registered once and can be used by either method."
Nevertheless, the whole issue began to seem more complicated, because of Rafael's point about the inflexibility of the underlying PM infrastructure and the difficulty of retooling it to be more flexible.
At one point, Jon confessed that he didn't see a way forward, and Ulf wasn't sure either. There were a couple of alternatives suggested, but the thread petered out inconclusively.
Ultimately, it does seem as though something along the lines of this feature is needed, especially because, as Jon pointed out, his company needs the feature for Tegra hardware. Certainly other users are waiting in the wings, as well, so there'll definitely be a patch to support this at some point.
However, if the underlying kernel infrastructure is really as unfriendly as Rafael seemed to indicate, the whole project may end up being bigger than Jon or anyone else would have hoped.
Smoothing Out Disk Caching
Konstantin Khlebnikov wanted to change the way data is written to disk in Linux. Generally, when you write to a file, the OS doesn't necessarily access the disk right away. Disk hardware tends to be slow, so there's value in batching up writes in memory and dumping them all to disk at certain times. In general, this is all done highly efficiently, so you never actually notice it happening, but in a few cases, you might save a big file and then abruptly power off the system, only to discover on the next boot that the new data was never actually written to disk. That's one of the reasons why it's best to power down normally – it gives the system a chance to flush any remaining disk writes before powering off.
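Applications that can't afford that risk don't have to wait for the cache to flush on its own; a minimal user-space sketch, using the standard fsync() call:

  /* Force one file's data out of the page cache and onto the disk. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          const char *msg = "important data\n";
          int fd = open("data.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

          if (fd < 0) {
                  perror("open");
                  return EXIT_FAILURE;
          }

          /* write() only dirties pages in the page cache ... */
          if (write(fd, msg, strlen(msg)) < 0)
                  perror("write");

          /* ... fsync() blocks until the data actually reaches the disk */
          if (fsync(fd) < 0)
                  perror("fsync");

          close(fd);
          return EXIT_SUCCESS;
  }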
That's not what Konstantin was concerned with, though. Write caching is just par for the course and is one of the reasons the Linux user interface is so snappy and responsive. Konstantin's concern was that, although normal operations were indeed snappier with write caching, any task that required low disk latency would run into trouble, because the filesystem load would spike during cache writes, slowing down any other disk operations happening at the time.
He pointed out that although it was possible to tune the behavior of the Linux filesystem to some extent, there were not actually enough adjustable variables to influence this particular use case. The only other available option would be to run the filesystem in "sync mode," which would flush every disk write as soon as it occurred, eliminating caching entirely – but then the whole OS, especially the user experience, would grind and lurch.
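For reference, the existing knobs live in the vm.dirty_* sysctls; the values below are illustrative examples, not recommendations:

  # Existing writeback tunables (example values only)
  sysctl -w vm.dirty_background_ratio=10     # start background flushing at 10% dirty memory
  sysctl -w vm.dirty_ratio=20                # throttle writers at 20% dirty memory
  sysctl -w vm.dirty_expire_centisecs=3000   # write back data dirtied more than 30s ago
  sysctl -w vm.dirty_writeback_centisecs=500 # wake the flusher threads every 5s

  # The blunt alternative: remount a filesystem in sync mode
  mount -o remount,sync /mnt/data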
He said:
This patch implements write-behind policy, which tracks sequential writes and starts background writeback when [they] have enough dirty pages in a row.
Write-behind tracks current writing position and looks into two windows behind it: [the] first represents unwritten pages, [the] second – async writeback.
Next write starts background writeback when [the] first window exceed[s the] threshold and waits for pages falling behind [the] async writeback window. This allows [it] to combine small writes into bigger requests and maintain optimal I/O depth.
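Put differently, each file tracks its current write position plus two trailing windows. A loose sketch of that bookkeeping follows, with all names and thresholds invented for illustration rather than taken from the actual patch:

  /*
   * Loose sketch of per-file write-behind bookkeeping. All names and
   * thresholds here are illustrative, not from Konstantin's patch.
   */
  struct writebehind {
          unsigned long pos;         /* current write position (in pages) */
          unsigned long dirty_start; /* oldest dirty page not yet queued */
          unsigned long async_start; /* oldest page under async writeback */
  };

  #define WB_WINDOW 256              /* e.g. 1MB worth of 4KB pages */

  static void writebehind_advance(struct writebehind *wb, unsigned long new_pos)
  {
          wb->pos = new_pos;

          /* First window: enough sequential dirty pages in a row?
           * Then kick off background writeback for them. */
          if (wb->pos - wb->dirty_start >= WB_WINDOW) {
                  /* queue_async_writeback(wb->dirty_start, wb->pos); */
                  wb->async_start = wb->dirty_start;
                  wb->dirty_start = wb->pos;
          }

          /* Second window: don't let the writer run arbitrarily far
           * ahead of in-flight writeback - wait, bounding the I/O depth. */
          if (wb->pos - wb->async_start >= 2 * WB_WINDOW)
                  ; /* wait_for_writeback(wb->async_start); */
  }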
Linus Torvalds replied:
This looks lovely to me.
I do wonder if you also looked at finishing the background write-behind at close() time, because it strikes me that once you start doing that async writeout, it would probably be good to make sure you try to do the whole file.
I'm thinking of filesystems that do delayed allocation etc. – I'd expect that you'd want the whole file to get allocated on disk together, rather than have the 'first 256kB aligned chunks' allocated thanks to write-behind, and then the final part allocated much later (after other files may have triggered their own write-behind). Think loads like copying lots of pictures around, for example.
I don't have any particularly strong feelings about this, but I do suspect that once you have started that I/O, you do want to finish it all up as the file write is done. No?
It would also be really nice to see some numbers. Perhaps a comparison of 'vmstat 1' or similar when writing a big file to some slow medium like a USB stick (which is something we've done very very badly at, and this should help smooth out)?
Jens Axboe was also thrilled with Konstantin's general idea, but he said, "My only concerns would be around cases where we don't expect the writes to ever make it to media. It's not an uncommon use case – [an] app dirties some memory in a file and expects to truncate/unlink it before it makes it to disk. We don't want to trigger writeback for those."
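The use case Jens describes is the classic scratch-file pattern, which in user space looks something like this sketch:

  /*
   * Dirty some file-backed memory, then delete the file before
   * writeback would ever need to touch the disk.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          char scratch[4096];
          FILE *fp = fopen("scratch.tmp", "w+");

          if (!fp)
                  return EXIT_FAILURE;

          /* Drop the name right away; the file persists until fclose() */
          unlink("scratch.tmp");

          /* Dirty the page cache with data that ideally never hits disk */
          memset(scratch, 'x', sizeof(scratch));
          fwrite(scratch, 1, sizeof(scratch), fp);

          /* ... use the scratch data ... */

          fclose(fp);
          return EXIT_SUCCESS;
  }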
Regarding Jens' issue, Konstantin agreed that "this is [a] case where serious degradation might happen." He felt, though, that there might be some relatively simple workarounds to avoid the issue in most cases.
Dave Chinner also agreed that Konstantin's basic idea was excellent, although he felt it needed some tweaking. He offered a series of problematic use cases, saying, "rapid write-behind behavior might not significantly affect initial write performance on an empty filesystem. It will, in general, increase file fragmentation, increase interleaving of metadata and data, reduce metadata writeback and read performance, increase free space fragmentation, reduce data read performance, and speed up the onset of aging related performance degradation."
He added, "a write-behind default of 1MB is bordering on insane because it pushes most filesystems straight into the above problems. At minimum, per-file write-behind needs to have a much higher default threshold and write-back chunk size to allow filesystems to avoid the above problems."
He also suggested implementing "a small per-backing dev threshold where the behavior is the current write-back behavior, but once it's exceeded we then switch to write-behind so that the amount of dirty data doesn't exceed that threshold."
Linus really liked that idea, although he acknowledged, "part of the problem there is that we don't have that historical 'what is dirty', because it would often be in previous files. Konstantin's patch is simple partly because it has only that single-file history to worry about."
Elsewhere, Andreas Dilger chimed in with his own experience, saying, "Lustre clients have been doing 'early writes' forever, when at least a full/contiguous RPC worth (1MB) of dirty data is available, because network bandwidth is a terrible thing to waste. The oft-cited case of 'app writes to a file that only lives a few seconds on disk before it is deleted' is IMHO fairly rare in real life, mostly dbench and back in the days of disk based /tmp. Delaying data writes for large files means that 30s*bandwidth of data could have been written before VM page aging kicks in, unless memory pressure causes writeout first. With fast devices/networks, this might be many GB of data filling up memory that could have been written out."
The conversation petered out inconclusively, although I'd expect Linus' preferences to be treated like gold and subsequent versions of the patch to adhere pretty strictly to his and Dave's suggestions. It's not that Linus requires that, but in general, developers tend to want to make him happy, especially when his wishes are so clearly expressed. Back when people were trying to pry out his preferences for a revision control system, there was plenty of resentment when, for example, Tom Lord's old "arch" project spent years in development without ever being picked up for kernel management.
In terms of Konstantin's write-behind patches, it seems like something very similar to his original patch will be accepted fairly quickly, with more tweaks and enhancements going in as well.