Zack's Kernel News
Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
Race Conditions on File Operations
Michael Kerrisk discovered a POSIX violation inside the kernel and brought it up on the mailing list. Apparently the read()
, write()
, and other related system calls did not perform atomic input and output actions. In his tests, Michael found that multithreaded write operations would receive conflicting file offset information, causing them to overwrite each other's file edits.
He posted some code that would reliably reproduce the problem and suggested that his test results pointed to the VFS (virtual filesystem) layer as the likely culprit. Michael also did some research, finding some very old discussions (some as many as eight years old) claiming that similar situations were not actually POSIX violations. But, Michael asserted that they really were. He said:
"Linux isn't conformant with SUSv3 and SUSv4 and isn't consistent with other implementations such as FreeBSD and Solaris. And I'm pretty sure Linux isn't consistent with UNIX since early times. (e.g., page 191 of the 1992 edition of Stevens APUE discusses the sharing of the file offset between the parent and child after fork(). Although Stevens didn't explicitly spell out the atomicity guarantee, the discussion there would be a bit nonsensical without the presumption of that guarantee.)"
Linus Torvalds agreed that there did appear to be a POSIX violation because of the file position pointer data structure (f_pos
). He said:
"Long long ago we used to just pass in the pointer to f_pos directly, and then the low-level write would update it all under the inode semaphore, and all was good. And then we ended up having tons of problems with non-regular files and drivers accessing f_pos and having nasty races with it because they did *not* have any locking (and very fundamentally didn't want any), and we broke the serialization of f_pos. We still do the *IO* atomically, but yes, the f_pos access itself is outside the lock."
Linus asked Al Viro if he had an idea what to do about this. Al asked what exactly POSIX expected from the read()
and write()
system calls, and Michael replied that it should not be possible for one system call to be interrupted by another, in between performing an input/output operation and updating the f_pos
data structure. He quoted the POSIX requirement, as being "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them."
Michael pointed out that a lot of system calls fell under this requirement and that he was sure a lot of violations could be found.
A few days later, Linus posted a patch – essentially adding a mutex for f_pos
so it would be locked along with other file data. He felt it was pretty simple and straightforward. There was a bit of technical discussion over exactly how to structure the source to generate the most efficient machine code, as well as exactly when to take the lock, and which cases didn't really need it. Al tested the code and, in particular, made sure all code paths worked well with Linus's patch.
At one point, Linus and Al got into a minor debate about whether it was OK to return register pairs. Some of Linus's code seemed a bit too refined for Al's tastes. Al thought that a number of architectures, including the PowerPC, Alpha, s390, and 32-bit Sparc, would not support Linus's attempt to return register pairs. He said those machines would return the values on the stack, which wouldn't work as well.
Linus argued that returning register pairs was a common programming pattern, and that the kernel did that sort of thing elsewhere already.
Al said he was sure that at least PowerPC, MIPS, and ARM did not support it.
Linus replied, "I doubt it's worth caring about. Even when passing things in memory, the end result isn't that much worse than the fget_light() model that passes just one of the two fields in memory."
He also said, "If the ARM/PPC people end up caring, they could add the struct-return support to gcc. It should be basically just adding the flag for enabling the calling convention – the core gcc support is obviously all there, and it's just an issue of whether the calling conventions allow it or not."
Al chose to disagree by describing a technical approach that would solve the whole problem better than Linus's method. Linus had one minor comment and added that he'd "happily take that approach, yes."
Al updated his code and ran a bunch of tests, compiling for various architectures. In every case, he thought his patch produced better machine code, and worked better all around, than Linus's patch.
Linus suggested that this patch should also go into Greg Kroah-Hartman's stable tree. Michael also confirmed that Al's code made his original bug disappear.
The whole issue of POSIX compatibility is interesting because it's so murky and unclear, in spite of POSIX's nature as a standard. Traditionally, Linus has treated POSIX as something to conform to only when doing so would be convenient and useful. In some cases, such as when implementing multithreading, Linus decided that the POSIX way was suboptimal, so he simply did things differently.
This attitude extends to other areas as well. Certain standard or semi-standard approaches to security have ended up being oddly implemented in the kernel because Linus insisted on implementing only portions of those approaches that actually helped protect users. This applied to certain aspects of Trusted Computing, for example.
So, these types of issues are always interesting to me, because I wonder where Linus will draw the line relative to one standard or another. Also, it's nice to see standards treated as tools that may or may not be used, rather than authorities to be obeyed.
On the other hand, I also root against file operations clobbering one another.
Tracking CPU Events
Andi Kleen pointed out that the perf performance analysis tool did not have great support for low-level CPU events. It supported high-level events, but Andi said that tracking low-level state changes "required specifying the event in raw form (very awkward) or using nonstandard front ends like ocperf or patching in libpfm."
He added, "Intel CPUs can have very large event files (Haswell has ~336 core events, much more if you add uncore or all the offcore combinations), which is too large to describe through the kernel interface. It would require tying up significant amounts of unswappable memory for this."
Andi said that typically CPU vendors maintained publicly available lists of events, and the OProfile developers would snarf those lists and incorporate them into their code. But, he went on to say that OProfile would often fall out of date with respect to these lists and that, even in the best case, "it ties up quite a bit of disk space, mostly for CPUs you don't have."
He posted some code that enabled perf to auto-download Intel's updated event lists for the user's current CPU, parse the files – they were in JSON format – and store them locally on the user's system. Then, perf could abstract the event descriptions and perform its analysis, all without putting any extra burden on the kernel itself.
Jiri Olsa was pleased to see this and offered to review Andi's code.
Ben Hutchings didn't like that Andi's code stored Intel's event lists in the ~/.events
directory. He felt it should be under ~/.cache
(or whatever the user had set $XDG_CACHE_DIR
to) and should be called ~/.cache/perf-events
.
Andi thought that was too perf-specific, but he agreed that just calling it "events" was too general. He opted for ~/.cache/pmu-events
(performance monitoring unit).
Ben also suggested using the more globally available download directory /var/cache/pmu-events
, but Andi didn't like this. He said it would encourage people to download the event files as root, which would never be a good idea. Keeping the directory within their home directory would emphasize that this was something they'd do as their own user.
Prettying Up Stack Dumps
Sasha Levin pointed out that stack dumps looked awful. Most of the output, he said, was "pure garbage." In particular, kernel addresses were displayed as file offsets such as ffffffff811f0ec8
, which meant "nuthin' to nobody." He posted a script to translate these offsets into meaningful line numbers in source files, such as fs/proc/generic.c:445
.
Linus Torvalds replied that he liked the sentiment, but that he'd prefer to fix the kernel so it wouldn't output those meaningless file offsets in the first place. He posted his own patch to make kernel stack dumps a bit prettier, but he said he still needed something to resolve the symbolic addresses.
Sasha replied that he'd already considered doing the change within the kernel, but that "it seems that it would require a rather large code addition just to deal with getting pretty line numbers."
Linus replied, "No no no. The *kernel* will never do line numbers, especially since only people who don't care about build performance compile with debug info, and even if you do do that, the kernel won't load it anyway."
He added, "The kernel is going to *remove* all the hex numbers that your script relies on, because those hex numbers are completely worthless. They are worthless and annoying now, but they are *doubly* worthless if the kernel is compiled with base address randomization, since nobody will know what the hex numbers mean."
This made sense to Sasha. He said he'd fix up his script to deal with the new situation.
Linus pointed out that this might be more difficult than Sasha thought. He said that the addr2line
tool that Sasha used in his script "doesn't take symbolic names. So either addr2line needs to be improved (which really would be a good idea regardless), or your script needs to use 'nm' or 'gdb' or something to first translate the symbol+off into a hex number (which will not match the kernel-provided hex number when base randomization is in effect, but it will match the pre-randomized data in the vmlinux file, so then addr2line would work)."
Sasha replied that he'd just take the data from the System.map
file and do his own math.
Linus agreed this would work, although he added, "reading the symbols from the vmlinux file would be *much* more convenient, since I know that not everybody saves the System.map file (cough cough me)." He said, "it would actually be much better to use 'nm vmlinux' as the source of the base information instead, and have just one file you depend on."
At this point, the discussion veered off sharply into debugging a problem with perf top
, and no more was said about Sasha's script. It seems as if, in general, everyone ended up happy, though. Linus made the kernel spit out less ugly stack dumps, and Sasha made his script rely on a more robust source of data.
Buy this article as PDF
(incl. VAT)