Kernel News
Kernel News
Zack Brown reports on bug hunting and process appreciation, and refusing "useful" patches.
Bug Hunting and Process Appreciation
There was a fun little bug hunt recently, when Mikael Pettersson reported that Linux 5.10.56 wouldn't boot on his system. This was particularly interesting because the 5.10 kernel is the "stable" series, meaning it receives long-term support from Greg Kroah-Hartman.
The stable vs. development dichotomy is one that has gradually emerged over many years. Originally, there was no such distinction, and Linus Torvalds simply released new patches and new versions as developers submitted them. The kernel was generally stable, but there was not yet any concept of a series of releases that actually aimed for stability, either in terms of bugs or features.
Eventually, Linus began to alternate between stable and development cycles. During the stable cycle, he would accept almost exclusively bug fixes and no new features. During the development cycle, it was a free-for-all, with everyone's favorite project hurtling headlong into the kernel.
But that approach was reviled, especially by developers who had to vacuum-style breathe during the stabilization periods. Eventually Linus hit upon a better approach, which is the current Release Candidate (RC) cycle. With this technique, he puts out a series of RC kernel versions over the short term. Each of these have increasing stability goals, until the final RC becomes the official kernel release.
With the RC technique, developers have frequent opportunities to hurl their ongoing work in Linus's direction, with relatively short periods of stabilization in between.
The result is more stable than simply doing ongoing hard-core development, but it's not as stable as a lot of users, especially corporate users, need. So, Greg decided to pick certain official releases and do his own stabilization cycle (with massive community assistance).
The 5.10 series, as one of Greg's picks for the stable cycle, is therefore relied on by, well, everyone. A 5.10 kernel that fails to boot is a 5.10 kernel that will receive a lot of TLC quickly.
So when Mikael reported that 5.10.56 wouldn't boot on his system, it was always going to be a thing. In this case, he knew that he did not have the problem before the 5.10.47 release, and he noticed it on the 5.10.56 release. So, one of the patches that went into the Git tree between those two releases had to be the culprit.
It's worth mentioning that the git
tool is Linus's other successful attempt to take over the world. And with git
, Mikael was able to bisect
the revision log in order to find exactly which patch introduced the problem. In fact, he had a slightly tougher time than just simply doing that, but, yes, he did that.
Bisecting is simply a binary search, and everyone uses it in kernel development and other software projects. Since Mikael had one version that was known to work and one version that was known to be broken, he could choose the patch directly in the middle of those two, and test it. If it was also broken, he could choose the patch directly in the middle between this test version and the known working version. By jumping from midpoint to midpoint in this way, each test cuts in half the total number of patches that might be causing the problem. For example, if Mikael had 4,000 patches to sift through, it would take him only 12 tests to find the bad one using the bisecting technique.
In fact, Mikael made a mistake with his bisection – at first he thought he had bisected to one particular patch that came after Linux version 5.10.52.
Finn Thain responded to this bug report, pointing out that the very same patch also went into Linus Torvalds's main-line Git tree, which would make the problem even more widespread. Mikael tested the main-line tree and did not see the problem. He said, "I suspect the commit has some dependency that hasn't been backported to 5.10 stable."
Finn continued the hunt, trying to isolate the potential bug in the main-line tree, on the assumption that even if Mikael couldn't reproduce the problem locally, the bad code from the stable branch was still also in the main-line tree and needed to be rooted out.
But at this point, while helping Finn track the problem into the main-line tree, Mikael realized he had made a mistake with his initial bisection. He ran it again, this time marking all the known good versions of the kernel, and the bisection landed on a different patch. Reverting that patch fixed the problem.
Michael Rapoport looked at the specific patch Mikael was talking about and confirmed that, yes, there did seem to be something wrong in that code. He posted a two-line patch and asked Mikael to test it out. Mikael did and confirmed that this fixed his boot problem.
For me, the most interesting parts of kernel development are not new features or new hardware support but, instead, the development process itself, as it has evolved and continues to evolve over time: The emergence of the long-term stable tree, the patterns of stabilization in the development tree, the invention of Git and the strange twists and turns that preceded it.
In this particular case, I find the relatively quick and straightforward process of identifying a bug and fixing it interesting. If you look at other software projects, bug hunting is not necessarily this quick and straightforward. The reason it is in this case is because of additional policies and practices put in place by Linus. The requirement is that the kernel should at the very least be compilable and bootable after every patch. Not all software projects have that requirement. But because of this requirement, something like Mikael's bisection process could be largely automated, the problem patch quickly identified, and the fix found.
These kernel development aspects are not just little details. Linus receives such a massive truckload of patches each month that when his work first started to be trackable via revision control a lot of long-time developers were shocked at the sheer quantity of it all. It simply hadn't occurred to anyone that a single person could be the focus of so many contributions and still be able to handle it all. It's the development of these essentially social aspects of kernel development that make such things possible.
Refusing "Useful" Patches
There is a certain type of patch that would improve things for certain users but that nevertheless has a near-zero chance of ever getting into the kernel. These are patches that help users recover from things they shouldn't have been doing in the first place. And one reason those patches don't tend to go into the kernel is that it is really a bottomless pit.
The motivations, however, are pure. For example, Florian Weimer recently said, "We have a persistent issue with people using cp (or similar tools) to replace system libraries. Since the file is truncated first, all relocations and global data are replaced by file contents, result[ing] in difficult-to-diagnose crashes. It would be nice if we had a way to prevent this mistake."
This was part of a discussion (and patch series) dealing with file permissions and related control over who could write to a filesystem and when.
There started to be some discussion about how to block users from doing this particular thing. And, at a certain point, Linus Torvalds brought the hammer down, saying:
"This is definitely a 'if you overwrite a system library while it's being used, you get to keep both pieces' situation.
"The kernel ETXTBUSY thing is purely a courtesy feature, and as people have noticed it only really works for the main executable because of various reasons. It's not something user space should even rely on, it's more of a 'ok, you're doing something incredibly stupid, and we'll help you avoid shooting yourself in the foot when we notice'.
"Any distro should make sure their upgrade tools don't just truncate/write to random libraries executables.
"And if they do, it's really not a kernel issue."
The whole discussion took place amidst a patch submission from David Hildenbrand that addressed file permissions and related user controls, which is why the issue came up. But Linus added, "This patch series basically takes this very historical error return, and simplifies and clarifies the implementation, and in the process might change some very subtle corner case (unmapping the original executable entirely?). I hope (and think) it wouldn't matter exactly because this is a 'courtesy error' rather than anything that a sane setup would _depend_ on."
Still, Florian's concerns are eternally compelling. And Eric W. Biederman replied to Linus, "I am trying to come up with advice on how userspace implementations can implement their tools to use other mechanisms that solve the overwriting shared libraries and executables problem that are not broken by design."
Eric went on to say, "today the best advice I can give to userspace is to mark their executables and shared libraries as read-only and immutable. Otherwise a change to the executable file can change what is mapped into memory."
And while Eric agreed that this was fundamentally not a kernel issue, he did add, "What is a kernel issue is giving people good advice on how to use kernel features to solve real world problems. I have seen the write to a mapped executable/shared lib problem, and Florian has seen it. So while rare the problem is real and a pain to debug."
He added, "As I am learning with my two year old, it helps to give a constructive suggestion of alternative behavior instead of just saying no. Florian reported that there remains a problem in userspace. So I am coming up with a constructive suggestion. My apologies for going off into the weeds for a moment."
In the course of the discussion, Linus also had some interesting comments about GNU Hurd, the operating system that was supposed to be the crown jewel of the GNU system before Linux came along and stole its thunder.
Eric had said at one point, "Given that MAP_PRIVATE for shared libraries is our strategy for handling writes to shared libraries, perhaps we just need to use MAP_POPULATE or a new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that everything mapped from the executable is guaranteed to be visible from the time of the mmap, and any changes from the filesystem side after that are guaranteed to cause a copy on write."
And Florian had remarked, "I think this is called MAP_COPY:" and gave a link to some GNU Hurd documentation [1]. He added, "If we could get that functionality, we would certainly use it in the glibc dynamic loader. And it's not just dynamic loaders that would benefit."
But now Linus really put on his stomping boots, saying:
"Please don't even consider the crazy notions that GNU Hurd did.
"It's a fundamental design mistake. The Hurd VM was horrendous, and MAP_COPY was a prime example of the kinds of horrors it had.
"I'm not sure how much of the mis-designs were due to Hurd, and how much of it due to Mach 3. But please don't point to Hurd VM documentation except possibly to warn people. We want people to _forget_ those mistakes, not repeat them."
Returning to the main topic, Linus went on:
"I'll just repeat: stop arguing about this case. If somebody writes to a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care at all about it apart from the 'you get what you deserve'.
"What's next? Do you think glibc should also map every byte in the user address space so that user programs don't get SIGSEGV when they have wild pointers?
"Again – that's a user BUG and trying to 'work around' a wild pointer is a worse fix than the problem it tries to fix.
"The exact same thing is true for shared library (or executable) mappings. Trying to work around people writing to them is *worse* than the bug of doing so.
"Stop this completely inane discussion already."
Andy Lutomirski, never one to simply sit down and shut up, replied, "How about we attack this in the opposite direction: remove the deny write mechanism entirely. In my life, I've encountered -ETXTBUSY intermittently, and it invariably means that I somehow failed to finish killing a program fast enough for whatever random rebuild I'm doing to succeed. It's at best erratic – it only applies for static binaries, and it has never once saved me from a problem I care about. If the program I'm recompiling crashes, I don't care – it's probably already part way through dying from an unrelated fatal signal. What actually happens is that I see -ETXTBUSY, think 'wait, this isn't Windows, why are there file sharing rules,' then think 'wait, Linux has *one* half baked file sharing rule,' and go on with my life. Seriously, can we deprecate and remove the whole thing?"
To which Linus, always happy to continue a discussion after asking for it to end, replied, "I think that would be ok, except I can see somebody relying on it. It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time. But you are right that we have removed parts of it over time (no more MAP_DENYWRITE, no more uselib()) so that what we have today is a fairly weak form of what we used to do. And nobody really complained when we weakened it, so maybe removing it entirely might be acceptable."
He added, "I guess we could just try it and see…. Worst comes to worst, we'll have to put it back, but at least we'd know what crazy thing still wants it."
Meanwhile Al Viro remarked, "I'm fairly sure that there used to be suckers that did replacement of binary that way (try to write, count on exclusion with execve while it's being written to) instead of using rename. Install scripts of weird crap and stuff like that. [...] and before anyone goes off – I certainly agree that using that behaviour is not a good idea and had never been one. All I'm saying is that there at least used to be very random (and rarely exercised) bits of userland relying upon that behaviour."
And David, whose patch inspired the whole conversation, replied, "Removing it completely is certainly more controversial than limiting it to the main executable. [...] I'd vote for keeping it in, and if we decide to rip it out completely, do it [as] a separate, more careful step."
The discussion continued for a bit, with nothing being really decided. But it's funny that rather than enhance a user protection against doing something dumb, Linus would rather remove the protection in its entirety, on the grounds that it's not the kernel's job to stop users from doing dumb things.
Infos
- GNU Hurd documentation: https://www.gnu.org/software/hurd/glibc/mmap.html
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Rhino Linux Announces Latest "Quick Update"
If you prefer your Linux distribution to be of the rolling type, Rhino Linux delivers a beautiful and reliable experience.
-
Plasma Desktop Will Soon Ask for Donations
The next iteration of Plasma has reached the soft feature freeze for the 6.2 version and includes a feature that could be divisive.
-
Linux Market Share Hits New High
For the first time, the Linux market share has reached a new high for desktops, and the trend looks like it will continue.
-
LibreOffice 24.8 Delivers New Features
LibreOffice is often considered the de facto standard office suite for the Linux operating system.
-
Deepin 23 Offers Wayland Support and New AI Tool
Deepin has been considered one of the most beautiful desktop operating systems for a long time and the arrival of version 23 has bolstered that reputation.
-
CachyOS Adds Support for System76's COSMIC Desktop
The August 2024 release of CachyOS includes support for the COSMIC desktop as well as some important bits for video.
-
Linux Foundation Adopts OMI to Foster Ethical LLMs
The Open Model Initiative hopes to create community LLMs that rival proprietary models but avoid restrictive licensing that limits usage.
-
Ubuntu 24.10 to Include the Latest Linux Kernel
Ubuntu users have grown accustomed to their favorite distribution shipping with a kernel that's not quite as up-to-date as other distros but that changes with 24.10.
-
Plasma Desktop 6.1.4 Release Includes Improvements and Bug Fixes
The latest release from the KDE team improves the KWin window and composite managers and plenty of fixes.
-
Manjaro Team Tests Immutable Version of its Arch-Based Distribution
If you're a fan of immutable operating systems, you'll be thrilled to know that the Manjaro team is working on an immutable spin that is now available for testing.