Zack's Kernel News

Zack's Kernel News

Article from Issue 193/2016
Author(s):

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Improving Kernel Locks

Waiman Long posted a patch to implement a new type of futex, called the throughput-optimized (TO) futex. He intended it specifically for use in cases where data throughput was more important than any other issue, including ensuring that all processes were given fair access to resources protected by that lock.

Other futexes used for similar purposes include wait-wake and priority inheritance (PI) futexes. On certain benchmarks, Waiman saw that TO futexes performed significantly better than the others.

Thomas Gleixner had some concerns. Primarily, he wasn't sure a new futex was really needed in the kernel. He also felt that user space would need to support a similar locking mechanism, possibly via libc, if the TO futex were adopted in the kernel.

Davidlohr Bueso replied, saying he felt that the TO futex was better than the alternative, and that if there wasn't a need for an additional futex in the kernel, the existing one should be the one to get rid of.

At one point Waiman explained part of his motivation for writing this patch. He said that at his company, he was "part of the team that help[s] large application vendors to tune their application performance on our large SMP systems. Those application vendors tend to use futex directly instead of relying on glibc. We had seen spinlock contention in the futex could sometimes be a significant portion of the CPU cycles consumed depending on the workloads that were being run. We had been providing suggestions on the best practice of how to use futexes. But there is only so much you can do with tuning their locking code implementation."

At one point, Thomas said he thought users would find it hard to know which futex to use under which circumstances. The differences between Waiman's TO futex and the other similar futexes were subtle and unintuitive. Thomas said, "So the benefit of these newfangled futexes is only there for extreme short critical sections and a gazillion of threads fighting for the same futex, right? I really wonder how the average programmer should pick the right flavour, not to talk about any useful decision for something like glibc to pick the proper one."

Waiman replied, "Lock stealing will help performance when a gazillion of threads [fight] for the same futex. Optimistic spinning will help to reduce the lock transfer latency because the waiter isn't sleeping no matter the number of threads. One set of data that I haven't shown so far is that the performance delta between wait-wait and TO futexes actually increases as the critical section is lengthened. This is because for short critical section[s], the waiters of [a] wait-wake futex may not actually go to sleep because of the latency introduced by the code that has to be run before they do a final check to see if the futex value change[s] before going to sleep. The longer the critical section, the higher the chance that they actually sleep and hence their performance is getting worse relative to the TO futexes. For example, with the critical section of 50 pause instructions instead of 5, the performance gain is about 5X instead of about 1.6X in the latter case."

Waiman felt that the new TO futexes would perform better than other futexes most of the time, so maybe users wouldn't have to make the choice very often.

My personal sense is that kernel locking is one of those nightmare regions of code that make very little sense to the developers using them – and sometimes not even to the people writing the locking code itself. There's a reason why the Big Kernel Lock (BKL) lasted so long before being removed – it was far simpler and easier to use than the odd array of locking options the kernel provides today.

But it's also good to pay attention to why the developers worked so hard to get rid of the BKL in the first place. Finer grained locking makes for a more efficient use of system resources, as well as a smoother user experience. Locks are taken and released many times per second on any running Linux system. Even a slight algorithmic improvement could add up to a huge speed improvement for the user. Maybe that's worth a little bewildering complexity.

Cleaning Up an Error Case

Al Viro noticed that the writev() system call had a non-intuitive behavior on a certain error condition. The writev() call is used to write an array of memory buffers to a specified file descriptor. However, when the address of one of the middle buffers was undefined, Al noticed that writev() would still write a single segment of data before giving up at the undefined address.

He felt it would make more sense to not write any data at all and to return the EFAULT error code. He consulted the POSIX standard and found that it was vague enough to allow the implementation he proposed.

Al gave an example of what he was talking about:

Suppose we have a buffer spanning 10 pages (amd64, so these are 4K ones) – 7 valid, 3 invalid,

VVVVIIIVV

and it starts 100 bytes into the first page. And write goes into a regular file on e.g. tmpfs, starting at offset 31. We _can't_ write more than 4*4096-100 bytes, no matter what. It will be a short write. As the matter of fact, it will be even shorter than that – it will be 3*4096-31 bytes, up to the last pagecache boundary we can cover completely. That obviously depends upon the filesystem – not everything uses pagecache, for starters. However, the caller is *not* guaranteed that write() with an invalid page in the middle of a buffer would write everything up to the very beginning of the invalid page. A short write will happen, but the amount written might be up to [a] page size less than the actual length of valid part in the beginning of the buffer.

Now, for writev() we could have invalid pages in any iovec; again, we obviously can't write anything past the first invalid page – we'll get either a short write or -EFAULT (if nothing got written). That's fine; the question is what the caller can count upon wrt shortening.

Again, we are *not* guaranteed writing up to [an] exact boundary. However, the current implementation will end up shortening no more than to the iovec boundary. I.e. if the first iovec contains only valid pages and there's an invalid one in the second iovec, the current implementation will write at least everything in the first iovec. That's _not_ promised by POSIX or our man pages; moreover, I'm not sure if it's even true for each filesystem. And keeping that property is actually inconvenient – if we could discard it, we could make partial-copy ->write_end() calls a lot more infrequent.

He said he would really like the logic of writev() to be, "if some addresses in the buffer(s) we are asked to write are invalid, the write will be shortened by up to a PAGE_SIZE from the first such invalid address," which would bring the writev() behavior into line with write()'s behavior."

Linus Torvalds replied, "I'm pretty sure you can and should do that." He added, "there is no reason for the particular behavior."

However, Alan Cox said that POSIX version 1003.1 said, "Each iovec entry specifies the base address and length of an area in memory from which data should be written. The writev() function shall always write a complete area before proceeding to the next." This seemed to clearly contradict Al's preference. Alan remarked, "The moment you pass an invalid address you are in the land of undefined behaviour, so I would read the standard as actually trying to deal with the behaviour in defined situations (e.g., out of disk space mid-writev())."

Linus replied:

But as you note, the EFAULT case is undefined behavior, so what that POSIX language is *really* about is presumably making sure that readers of a file cannot see the "later" writes without seeing the earlier ones.

So you cannot do some fancy threaded thing where you do different iovec parts concurrently, because that could be seen by a reader (or more likely mmap) as doing the writes out of order.

Or, as you mention, the disk-full case.

The conversation ended there inconclusively, but for this sort of situation, I'd expect Linus to choose the sane behavior over POSIX compliance. He has always seemed to regard POSIX compliance as a convenient guide rather than as something to be blindly adhered to, and in this case, he seemed to want to find reasons to bring the writev() behavior into consistency with write().

Abortive Attempt at Alerts from User Space

Luis R. Rodriguez noticed a possible race condition at bootup, if the user used the kernel_read_file_from_path() call to read a file from the system's filesystem. If the call occurred before a certain point in the boot process, the filesystem wouldn't be available and the read would fail; however, after a certain point, the read would succeed.

Luis described his patch, saying, "We define kernel critical filesystems as filesystems which the kernel needs for kernel_read_file_from_path(). Since only user space can know when kernel critical filesystems are mounted and ready, let user space notify the kernel of this, and enable a new kernel configuration which lets the kernel wait for this event before enabling reads from kernel_read_file_from_path(). A default timeout of 10s is used for now."

Linus Torvalds took one look at this and said, "I really think this is a horrible hack. It's basically the kernel giving up, and relying on user space to give a single flag, and it's broken nasty crap. Worse, it's broken nasty crap with a user interface, so we'll be stuck with it forever. Please no."

He asked which drivers needed this and suggested that they simply be fixed to not hit that particular race.

Dmitry Torokhov agreed that the userspace interaction was bad, but he felt it was necessary nevertheless. He didn't see how the drivers could possibly be "fixed" to avoid the problem. He said:

Some devices do need to have firmware loaded so we know their capabilities, so we really can't push the firmware loading into 'open'. If your touch controller for some reason decided to crap over it's nvram and comes in bootloader mode it is nice for it to slurp in config data or firmware so use does not have to trigger update manually. And while the controller is in bootloader mode we can't create [an] input device because we do not know what capabilities to declare.

These devices we want to probe asynchronously and simply tell [the] firmware loader to wait for firmware to become available. The problem [is] we do not know when to give up, since we do not know where the firmware might be. But user space knows and can tell us.

Meanwhile Bjorn Andersson said that he had specific cases that would be helped by Luis' code. He said, "I have several cases where remoteproc drivers are used [to] boot DSPs upon boot of the device, but the firmware files are way too big for being stored in initramfs and all consumers of the provided services are (semi-) probable as the remote processor is booted. I.e., we need some way to figure out when these files become available so we can bring these remote processors up."

One solution that Linus had offered earlier was to require drivers running into this problem, to be built as loadable modules. But Bjorn remarked, "I really do not like the deployment issues that come with kernel modules during development. (The firmware and remoteproc driver normally do not have the same flow through a development process.)"

So while he agreed that Luis' proposal was "a horrible hack," he said, "I would appreciate a automagical mechanism that would relieve user space from having to signal to the kernel that the firmware partition has been mounted."

Linus wasn't satisfied with any of these arguments. He said if a driver definitely needed firmware (as opposed to it being simply optional), then "why don't we just tie the firmware and module together?" He added, "Really. If the driver doesn't work without the firmware, then why the hell is it separated from it in the first place?"

Bjorn responded from his own personal experience:

… generally there's a much stronger bond between the kernel and the driver than between the driver and the firmware in my cases.

E.g., we have a single remoteproc driver loading and controlling the Hexagon DSP found in several Qualcomm platforms, so a single kernel binary could (practically) load hundreds of variants of the firmware.

Both the kernel binary and the firmware in this example are side-loaded onto the device during development – independently of each other, as they are developed by different teams (or maybe even different companies).

I assume that you're not suggesting to actually tie the module together, as that would be practically difficult and a waste of resources.

Which leaves us with the suggestion that we should store the kernel module with the firmware file, which is just infeasible from a few practical reasons – again mostly related to the development flow and how the files are contained on the devices.

Bjorn reiterated that Luis's proposal seemed like an ugly hack – but he just didn't see an alternative.

Linus said, "all these arguments make no sense at all," and added, "Let me be very clear. I'm not applying that shit-for-brains stupid patch, and will not be pulling it unless somebody tricks me into it."

He went on:

If the driver doesn't work without the firmware, then anybody who distributes the driver binary without a firmware is just *fundamentally* doing something insane. You may do it for *development* purposes, but doing so for actual *use* would be entirely pointless.

See my point? If a distribution is distributing the driver without the firmware, then what the hell is the point of such a thing?

But even if you decide to do that for some odd reason, the patch is still just stupid. Instead of adding some crazy infrastructure for "now I've mounted everything," you could equally well just:

(a) Make the driver fail the module load if it cannot find a firmware binary

(b) After user space has mounted everything, just do "insmod -a" again (or insmod just that driver).

See? The point is, this "generic" hacky interface is just stupid. It's not adding any value. If you add user space "I'm ready now" points anyway, you might as well make those points do the right thing and just load the module that is now loadable.

We could mark such "late loading" modules explicitly if people want to, so that you can automate the whole thing about delaying the loading in user space.

At no point does it make sense to say "I have now mounted all the important filesystems." Maybe the firmware is extracted later by user space downloading it from the Internet, and the module will then work only after that point.

This whole "I have mounted important filesystems" is just pure and utter garbage. Stop pushing this shit.

Luis replied that, "this isn't just about firmware since we now have a generic API to read files from the kernel, so kernel_read_file_from_path(). Firmware is now just *one* user case. Also a complexity here is this is not just for modules but also for built-in drivers as well. And you want the option to update the files without the driver. The proposed solution provides a generic broad syfs entry for letting user space inform the kernel when files from the filesystems where it would typically read from (I'm calling them critical filesystems for lack of a better term) can be accessible. Something more specific requires a bit more thought given [that] this is not anymore about just firmware, [but] must also address built-in drivers, and allow for updates."

He added, "I can see the syfs approach being considered hacky – but I gladly welcome an actual alternative suggestion."

Luis also pointed out to Linus, that "the only reason I proposed this was to get the ball rolling in terms of finding a solution to the problem for why some folks claim they *need* the firmware usermode helper. Granted, upstream we only have two explicit users left. I'm told some out-of-tree users still need and use the usermode helper. They claim that without it there is the race between being ready and driver asking for the firmware. I was told there were quite a bit of out-of-tree hacks to address this without using the usermode helper, the goal of this patch was to create the discussion needed to [find] a proper resolution to this."

Several folks jumped into the conversation at this point, offering suggestions for possible alternative solutions. But Linus was having none of it. Finally he offered some back-handed advice for how to go about these sorts of kernel feature design discussions. He said:

The reason I've hated this whole discussion is that it's full of 'let's re-architect everything', and then it has these horribly warty interfaces. It's classic second-system syndrome.

Just do *one* thing, and do it well. Don't change anything else. Don't force existing drivers to use new interfaces. Don't over-architect, and don't do stupid interfaces.

If user space mounts a new filesystem (or just unpacks files from a tar file that has firmware images in it, for Chrissake), that is not some magical "critical mount event." The whole concept is just stupid. Is it a "mount event" when the user downloads a new firmware image from the Internet?

HELL NO.

But what is equally stupid is to then dismiss simple models because [of] some totally unrelated 'beyond firmware' issue.

Anything that is 'beyond firmware' shouldn't even be discussed, for Chrissake! It has nothing what-so-ever to do with firmware loading. If there ends up being some common helper functions and shared code, that *still* doesn't make it so.

Basic rules of thumb:

(a) Don't over-design.

(b) Don't have stupid illogical interfaces.

(c) Don't conflate different issues just because you think they may have shared code.

(4) Be consistent. Don't make up new interfaces, and most certainly do *NOT* dismiss something just because it's what we have done before.

The discussion petered out shortly afterward. This was an example of Linus really slamming some developers for using practices that he didn't like, but it was also an example of developers bearing up under that kind of blistering criticism, avoiding counterattacks, and simply proceeding with a consideration of what to try next. This is actually the usual case when Linus comes down hard on a particular approach. No matter how harsh he becomes, the developers nearly always simply extract the technical considerations and respond to them as best as they can.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

  • ZACK'S KERNEL NEWS
  • sysdig

    Many Linux diagnostic tools require knowledge of a special syntax, which complicates handling and confuses the output. Sysdig groups several important tools into a single interface.

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

comments powered by Disqus