Kernel News

Article from Issue 276/2023
Author(s): Zack Brown

In Kernel News: Removing the Thorn and What Do You Do When There's Nothing to Do?

Removing the Thorn

Sometimes in Linux kernel development, the best thing to come out of a discussion thread is an update to the documentation.

David Howells recently reported a problem that had been found by Matt Whitlock. Kernel code using the splice() system call to pipe data from a file into a running process could see data corruption: splice() could return successfully, yet the data it had sent into the pipe could still be modified by another system call before the target process finished reading from that pipe.

To fix this problem, David posted a patch that changed the data's owner before sending it into the pipe. That way, nothing would be able to modify that data. In the kernel, changing owners of RAM pages to protect them this way is called "stealing."

However, this was nowhere near the end of the story. Miklos Szeredi pointed out that "removing spliced pages from the cache is basically guaranteed to result in a performance regression for any application using splice." He added, "The above behavior is well known, so it's not likely to be a problem."

But Matt replied, "it's not well-known, as it's not documented. If the splice(2) man page had mentioned that pages can be mutated after they're already ostensibly at rest in the output pipe buffer, then my nightly backups wouldn't have been incurring corruption silently for many months."

Miklos quoted the "Notes" section of the splice() man page, which said that splice() did not actually copy data. Instead, splice() created references to the data's pages and copied only those pointers, not the data itself. This would result in the behavior David and Matt had seen – data sent into a pipe that could nevertheless be modified after splice() returned.
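To make the hazard concrete, here is a minimal userspace sketch of the pattern under discussion, assuming Linux; the file name is illustrative and error handling is omitted:

/* Minimal sketch of the splice() hazard (Linux-only). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0600);
    off64_t off = 0;
    int p[2];
    char buf[9] = { 0 };

    pwrite(fd, "original", 8, 0);
    pipe(p);

    /* Zero-copy: the pipe now references the file's page cache
     * pages rather than holding a private copy of the bytes. */
    splice(fd, &off, p[1], NULL, 8, 0);

    /* Modify the file *after* splice() has returned... */
    pwrite(fd, "MODIFIED", 8, 0);

    /* ...and the reader on the pipe may see the new bytes. */
    read(p[0], buf, 8);
    printf("pipe delivered: %s\n", buf);
    return 0;
}

On a typical kernel, this program may well print "MODIFIED" rather than "original" – precisely the silent corruption that had been hitting Matt's backups.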

Matthew Wilcox complained that this was not at all clear from the documentation or from any semantics he would expect from a system call that was specifically supposed to copy data. He asked where he could find the library calls that would actually copy data, given that splice() didn't appear to do this.

At this point, Linus Torvalds brought the hammer down on the discussion, saying:

"It's called 'read()' and 'write()'.

"Seriously.

"The *ONLY* reason for splice() existing is for zero-copy. If you don't want zero-copy (aka 'copy by reference'), don't use splice.

"Stop arguing against it. If you don't want zero-copy, you use read() and write(). It really is that simple.

"And no, we don't start some kind of crazy 'versioned zero-copy with COW'. That's a fundamental mistake. It's a mistake that has been done – several times – and made perhaps most famous by Hurd, that made that a big thing.

"And yes, this has been documented *forever*. It may not have been documented on the first line, because IT WAS SO OBVIOUS. The whole reason splice() is fast is because it avoids the actual copy, and does a copy-by-reference.

"That's still a copy. But a copy-by-reference is a special thing. If you don't know what copy-by-reference is, or don't want it, don't use splice().

"I don't know how many different ways I can say the same thing.

"IF YOU DON'T WANT ZERO-COPY, DON'T USE SPLICE.

"IF YOU DON'T UNDERSTAND THE DIFFERENCE BETWEEN COPY-BY-VALUE AND COPY-BY-REFERENCE, DON'T USE SPLICE.

"IF YOU DON'T UNDERSTAND THE *POINT* OF SPLICE, DON'T USE SPLICE.

"It's kind of a bit like pointers in C: if you don't understand pointers but use them anyway, you're going to have a hard time. That's not the fault of the pointers. Pointers are very very powerful. But if you are used to languages that only do copy-by-value, you are going to think they are bad things."

Matt replied, "Thanks for being so condescending. Your reputation is deserved."

He also said that, at least on a cursory glance at the documentation, "it is not unreasonable to believe that the point of splice is to avoid copying between user-space and kernel-space," rather than the point of splice() being speed.

And in that case, he argued, splice() was the right choice because, "If you use read() and write(), then you're making two copies. If you use splice(), then you're making one copy (or zero, but that's an optimization that should be invisible to the user)."

To which Linus replied:

"No. It really isn't.

"It is an optimization that is INHERENT IN THE INTERFACE and has been there literally since day #1. It was *never* invisible. It was the *point*.

"You getting this use case wrong is not an excuse to change reality. It is, at most, a good reason to clarify the documentation.

"The 'without copying between kernel address space and user address space' is about the ability to not copy AT ALL, and yes, let's by all means clarify that part."

Linus also added, "When documentation and reality collide, documentation loses. That's how this works."

He went on to propose another way to get what Matt wanted:

"If you want 'one-copy', what you can do is:

- mmap() the file data (zero copy, not stable yet)

- use 'write()' to write the data to the network. This will copy it to the skbs before the write() call returns and that copy makes it stable."
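In code, that recipe might look roughly like the following hypothetical sketch; the helper name is illustrative, not from the thread, and error handling is minimal:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper showing the one-copy recipe Linus describes. */
ssize_t one_copy_send(const char *path, int out_fd)
{
    struct stat st;
    ssize_t n;
    void *map;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    /* Zero copy: map the file's pages directly. At this point the
     * data is not yet stable - the file can still change. */
    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The one copy: write() duplicates the bytes into kernel
     * buffers (skbs, for a socket), and that copy is stable. */
    n = write(out_fd, map, st.st_size);

    munmap(map, st.st_size);
    close(fd);
    return n;
}

Unlike the splice() case, once write() returns, the data handed to the kernel is no longer tied to the file's pages, so later modifications to the file cannot affect what was sent.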

Matt had had enough and replied:

"This entire complaint/discussion/argument would have been avoided if splice(2) had contained a sentence like this one from sendfile(2):

"'If out_fd refers to a socket or pipe with zero-copy support, callers must ensure the transferred portions of the file referred to by in_fd remain unmodified until the reader on the other end of out_fd has consumed the transferred data.'

"That is a clear warning of the perils of the implementation under the hood, and it could/should be copied, more or less verbatim, to splice(2)."

Linus agreed with that proposal, saying, "Internally in the kernel, the two really have always been more or less of intermingled. In fact, I think splice()/sendfile()/tee() could – and maybe should – actually be a single man-page to make it clear that they are all facets of the same thing."

The discussion veered off at that point into speculative consideration of additional data-copying methods, like the mmap()-and-write() approach Linus had identified.

But clearly, the splice() documentation will be updated.

This was a fascinating discussion, and I hope that Matt discovered that his backups were corrupt before he actually needed to restore his data from those backups. David's patch may have been replaced by a single sentence in the documentation, but it's a single sentence that could save users from a lot of pain.

What Do You Do When There's Nothing to Do?

The automated bug-discovering tool syzbot posts anything it finds to the Linux Kernel Mailing List. Recently it reported a bug in Linux's Hierarchical File System (HFS) support. HFS is a very obsolete Apple filesystem that dates back to 1985. Even Apple doesn't support it anymore. Its big claim to fame was that it supported subdirectories – hence, "hierarchical."

Using git bisect (a powerful tool for finding the exact patch that introduced a problem), syzbot laid the blame at the feet of Arnd Bergmann, for a patch he had submitted in November 2021. Arnd, in his own defense, said the patch was a simple attempt to get rid of a compiler warning after Linus Torvalds announced that no more compiler warnings would be tolerated in kernel code.

"My patch was a mechanical conversion from '/* panic? */' to 'WARN_ON()' to work around a compiler warning, and the previous code had been in there since the 2004 HFS rewrite by Roman Zippel," Arnd said.

Linus joined the discussion right away, noting, "since 2004, as you point out – you have to go back to before the git days to even see any development in this area."

Linus took a stab at debugging the syzbot report and untangled it to some extent, though he pointed out that this bug was very unlikely to ever be triggered. He remarked, "Looks like this is syzbot just mounting a garbage image (or is it actually some real hfs thing?)."

He went on to say:

"I suspect this code is basically all dead. From what I can tell, hfs only gets updates for

(a) syzbot reports

(b) vfs interface changes

"and the last real changes seem to have been by Ernesto A. Fernandez back in 2018."

Linus also reported, "Hmm. Looking at that code, we have another bug in there, introduced by an earlier fix for a similar issue [....] So we should probably fix that too." He whipped up a patch and sent it to the mailing list for testing.

As for whether HFS itself was truly dead, Arnd replied, "There is clearly no new work going into it, and most data exchange with MacOS would use HFS+, but I think there are still some users."

Jeffrey Walton spoke up, saying, "I've been running Linux on an old PowerMac and an old Intel MacBook since about 2014 or 2015 or so. I have needed the HFS/HFS+ filesystem support for about 9 years now [....] There's never been a problem with Linux and the Apple filesystems. Maybe it speaks to the maturity/stability of the code that already exists. The code does not need a lot of attention nowadays."

John Paul Adrian Glaubitz also said:

"HFS/HFS+ is also used by PowerPC users on PowerMac hardware for both exchanging data between MacOS and Linux as well as for booting Linux using GRUB with a HFS/HFS+ /boot partition.

"Debian's grub-installer creates an HFS partition from which GRUB loads the kernel and the initrd. In order to be able to update kernel and initrd, the running Linux system needs to be able to read/write HFS/HFS+ partitions which is used for the /boot partition."

Viacheslav Dubeyko took a look at the bug(s) and at Linus's patch, but he was not able to reproduce the issue. When debugging, the ability to cause the same problem to occur at will is generally a necessary first step. But Viacheslav said that as far as he could tell, the data corruption reported by syzbot "had happened somehow earlier. The reported issue is only a side effect of volume corruption. The real issue of HFS volume corruption had taken place before. And it was a silent issue somehow."

But Matthew Wilcox pointed out that "Syzbot generates deliberately-corrupted (aka fuzzed) filesystem images. So basically, you can't trust anything you read from the disc."

Viacheslav replied, "If the volume has been deliberately corrupted, then no guarantee that file system driver will behave nicely. Technically speaking, inode write operation should never happened for corrupted volume because the corruption should be detected. [...] If we would like to achieve such nice state of HFS/HFS+ drivers, then it requires a lot of refactoring/implementation efforts."

However, he said he didn't think it was really worth the trouble, because very few people used these filesystems in Linux.

Michael Schmitz also looked at Linus's untested patch and wondered "whether the missing fd.entrylength size test in the HFS_IS_RSRC(inode) case was due to the fact that a file's resource fork may be empty?" However, Linus replied, "But if that is the case, then the subsequent hfs_bnode_read would return garbage, no?"

Linus continued, "But I really don't know the code, so this is all from just looking at it and going 'that makes no sense'. Maybe it _does_ make sense to people who have more background on it." Michael replied, "But I don't really understand the code too well either. I'll have to see for myself whether or not your patch does cause a regression on HFS filesystems such as the OF bootstrap partition used on PowerPC Macs."

Michael did a quick test and reported that Linus's patch seemed to work fine.

But the story doesn't end there.

Dmitry Vyukov remarked at some point during the conversation, "Most popular distros will happily auto-mount HFS/HFS+ from anything inserted into USB (e.g. what one may think is a charger). This creates interesting security consequences for most Linux users. An image may also be corrupted non-deliberately, which will lead to random memory corruptions if the kernel trusts it blindly."

To which Matthew said flatly, "Then we should delete the HFS/HFS+ filesystems. They're orphaned in MAINTAINERS and if distros are going to do such a damnfool thing, then we must stop them."

For John Paul Adrian, that seemed like going way overboard. He replied:

"Both HFS and HFS+ work perfectly fine. And if distributions or users are so sensitive about security, it's up to them to blacklist individual features in the kernel.

"Both HFS and HFS+ have been the default filesystem on MacOS for 30 years and I don't think it's justified to introduce such a hard compatibility breakage just because some people are worried about theoretical evil maid attacks.

"HFS/HFS+ [is] mandatory if you want to boot Linux on a classic Mac or PowerMac and I don't think it's okay to break all these systems running Linux."

Matthew shot back, "If they're so popular, then it should be no trouble to find somebody to volunteer to maintain those filesystems. Except they've been marked as orphaned since 2011 and effectively were orphaned several years before that (the last contribution I see from Roman Zippel is in 2008, and his last contribution to hfs was in 2006)."

In a later email he went on to say, "There are bugs in how this filesystem handles intentionally-corrupted filesystems. That's being reported as a critical bug because apparently some distributions automount HFS/HFS+ filesystems presented to them on a USB key. Nobody is being paid to fix these bugs. Nobody is volunteering to fix these bugs out of the kindness of their heart. What choice do we have but to remove the filesystem, regardless of how many happy users it has?"

Linus came down on the other side of that issue. He replied to Matthew, saying:

"We have tons of sane options. The obvious one is 'just don't mount untrusted media'.

"Now, the kernel doesn't know which media is trusted or not, since the kernel doesn't actually see things like /etc/mtab and friends. So we in the kernel can't do that, but distros should have a very easy time just fixing their crazy models.

"Saying that the kernel should remove a completely fine filesystem just because some crazy use-cases that nobody cares about are broken, now *that* [is] just crazy.

"Now, would it be good to have a maintainer for hgs [HFS]? Obviously. But no, we don't remove filesystems just because they don't have maintainers.

"And no, we have not suddenly started saying 'users don't matter'."

The discussion continued among various developers, but the decision had been made: HFS support would stay. However, at one point Matthew offered some perspective:

"Google have decided to subject the entire kernel (including obsolete unmaintained filesystems) to stress tests that it's never had before. IOW these bugs have been there since the code was merged. There's nothing to back out. There's no API change to blame. It's always been buggy and it's never mattered before.

"It wouldn't be so bad if Google had also decided to fund people to fix those bugs, but no, they've decided to dump them on public mailing lists and berate developers into fixing them."

Eric W. Biederman offered some additional perspective:

"There are no known linux filesystems that are safe to mount when someone has written a deliberately corrupted filesystem on a usb stick.

"Some filesystems like ext4 make a best effort to fix bugs of this sort as they are discovered but unless something has changed since last I looked no one makes the effort to ensure that it is 100% safe to mount any possible corrupted version of any Linux filesystem.

"If there is any filesystem in Linux that is safe to automount from an untrusted USB stick I really would like to hear about it."

It's amazing to me that Linux continues to run on every piece of hardware that anyone actually uses, supporting all sorts of protocols and filesystems that have any users at all, even when the companies, like Apple, who created those things have long since stopped supporting them. It would have been the easiest thing in the world for Linus to throw up his hands and rip out that ancient code, but no. There are people using that code, so it stays in. Bug fixes will be accepted, and someday maybe even a maintainer will step forward to stand behind those remaining users until the end.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
