Zack's Kernel News

Article from Issue 176/2015

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Librarifying the Kernel

Hajime Tazaki wanted to decouple the network stack from the rest of the kernel and run it as a shared library, and he posted a patch to do just that. If accepted, anyone would be able to replace the network stack with a home-rolled version. The code, however, got a mixed reception on the linux-kernel mailing list.

Richard Weinberger objected that there might be too many similarities between Hajime's project and user-mode Linux (UML). If Hajime could retool his library to be part of the UML code, that would avoid having multiple user-mode-type parts of the kernel.

Hajime pointed out that one of the goals with his kernel stack work was for the shared library to work on any POSIX-compatible operating system (e.g., OS X), so he didn't want to rely too heavily on Linux-specific implementation details.

Richard suggested that, just in terms of namespaces, Hajime could use UML's arch/um directory instead of his intended arch/lib directory. He also suggested that Hajime's code could be adapted to enhance UML with a new "library operation" mode that would not interfere with the ability to run the shared library on other POSIX systems. Richard also mentioned that "Jeff (the original author of UML) wanted to create a special linker script such that you can build UML as shared object."

Hajime thought that it made sense, to some extent, to share code between his librarified network stack and UML. However, he also thought the two projects differed fundamentally in areas such as process context design and system call execution, and that treating his code as a separate architecture felt justified. Richard said, "I'd say you should try hard to re-use/integrate your work in arch/um. With um we already have an architecture which targets userspace; having two needs a very good justification."

Hajime replied that his library "is not limited to run on user-mode: It is just a library which can be used with various programs. Thus it has a potential (not implemented yet) to run on a hypervisor like OSv or MirageOS does for application containment, or run on a bare-metal machine as rumpkernel does. We already have a clear interface for the underlying layer to be able to add such backend."

At this point, Richard dug into Hajime's patch and rooted around in it for a bit. Then, he replied that he agreed that Hajime's library approach was significantly different from something that could be simply incorporated into the UML codebase. Overall, Richard said, he liked Hajime's whole idea.

However, he also suggested that Hajime shouldn't treat his code as a separate architecture, which was what led to some of their earlier disagreement. He explained, "You don't implement an architecture, you take some part of Linux (the networking stack) and create stubs around it to make it work. That means that we'd also have to duplicate kernel functions into arch/lib to keep it running."

Hajime asked Richard to clarify what he meant by an "architecture," and why UML could be counted as one, whereas Hajime's library could not. Richard replied, "UML is an architecture as it binds the whole kernel to a computer interface. Linux userspace in that case." He went on, "Your arch/lib does not bind the Linux kernel to an interface. It takes some part of Linux and duplicates kernel core subsystems to make that part work on its own. For example arch/lib contains a stub implementation of core VFS functions like register_filesystem( )."

Richard also said that he was concerned that "arch/lib will be broken most of the time as every time the networking stack references a new symbol it has to be duplicated into arch/lib."

At this point, Rusty Russell joined the discussion, saying that this was "Exactly why I look forward to getting this in-tree." He said, "Jeremy Kerr and I wrote nfsim back in 2005(!) which stubbed around the netfilter infrastructure; with failtest and valgrind it found some nasty bugs. It was too much hassle to maintain out-of-tree though (I look forward to a flood of great bugfixes from this work)."

In line with Rusty's comment, Hajime replied that the code had already resulted in identifying a bug in the xfrm code in the kernel.

To Richard, Hajime said he was open to an alternative to treating his library as an architecture. However, he was concerned that migrating the code into the core kernel subsystem might require a lot of #ifdef directives in the source, which he wasn't sure the kernel folks would find palatable. Treating the library as its own architecture avoided this and required no changes to core kernel code. Richard offered a new suggestion:

What about putting libos into tools/testing/ and make it much more generic and framework alike. With more generic I mean that libos could be a stubbing framework for the kernel, i.e., you specify the subsystem you want to test/stub and the framework helps you do so. A lot of the stubs you're placing in arch/lib could be auto-generated as the vast majority of all kernel methods you stub are no-ops which call only lib_assert(false).

Using that approach only very few kernel core components have to be duplicated and actually implemented by hand. Hence, less maintenance overhead, and libos is not broken all the time.

Antti Kantee joined the conversation at this point. He said he'd been working on a similar project for NetBSD for the past eight years and had some insight to offer. He supported Hajime's work as a way to support a generic library mode, which was something he felt all OSs should offer. However, he said, "Autogenerating stubs only means that the libos will build, not that it won't be broken. Figuring out how to make the libos as close to zero-maintenance as possible is indeed the trick."

Antti thought that the best idea would be to focus on arranging the deeper kernel architecture to support Hajime's librarification approach, as opposed to making only minor changes to tolerably accommodate that approach. He added that, in his case, "There are practically no stubs in the NetBSD implementation; somewhere between 0 and 20 depending on what you count as a stub." Antti concluded, "Continuous testing is paramount. Running the kernel as a lib provides an unparalleled method for testing most of the kernel. It will improve testing capabilities dramatically, and on the flipside it will keep the libos working. Everyone wins."

Richard was not entirely convinced. Even if Antti's code was maintainable at the slower pace of NetBSD development, he suggested, a similar thing might not be maintainable at all at the much faster pace of Linux kernel development. He also pointed out that the need for stubbed testing in the Linux world had diminished over time, as other tools like lockdep, KASan, and kmemleak had been developed.

Hajime replied to both Richard and Antti that being able to auto-generate library stubs would greatly reduce the maintenance requirements.

Antti replied to Richard's concern about the maintenance burden for Linux versus NetBSD. He said, "it took four years to figure out how to make things as maintainable as I could figure out how to make them. The good news is that the results of those four years are more or less generic and independent of OS." He suggested evaluating Hajime's code for inclusion in the main kernel tree and then seeing, from that evaluation, what the maintenance costs might be.

At this point, Richard, Hajime, and the rest dug a bit deeper into the code itself, coming up with various technical suggestions and identifying more infrastructure requirements, such as the need for Kbuild integration. A few days later, Hajime posted an updated patch reflecting the conversation thus far.

Richard suggested, "Maintain LibOS in your git tree and follow Linus' tree. Make sure that all kernel releases build and work. This way you can experiment with automation and other stuff. If it works well you can ask for mainline inclusion after a few kernel releases. Your git history will show how much maintenance burden LibOS has and how much with every merge window breaks and needs manual fixup."

Hajime agreed with this, and the thread ended. The outcome was a bit inconclusive, but it's at least clear that extracting deep portions of the kernel and turning them into shared libraries has support from big-time kernel contributors, and that folks are generally in favor of finding the right place for something like this in the main kernel tree. It's also likely that any solution that ultimately gets accepted into the kernel will support the librarification of other parts of the kernel, beyond just the networking stack.

Generic Filesystem Event Notification

Beata Michalska posted a patch to give filesystems a generic event-notification interface, so that any filesystem could report problems in a consistent way. She said:

The events have been split into four base categories: information, warnings, errors and threshold notifications, with some very basic event types like running out of space or filesystem being remounted as read-only.

Threshold notifications have been included to allow triggering an event whenever the amount of free space drops below a certain level – or levels to be more precise as two of them are being supported: the lower and the upper range. The notifications work both ways: once the threshold level has been reached, an event shall be generated whenever the number of available blocks goes up again re-activating the threshold.

The interface configuration would use sysfs and appear at the /sys/fs/events mountpoint, as an ftrace-style virtual filesystem. The user could set up an event notification by writing to the "events" file. By default, no event would trigger a notification.

Darrick J. Wong and Eric Sandeen dug into the code and offered some technical suggestions. Eric pointed out some global spinlock usage that might cause a system slowdown, and Beata replied, "I do agree that this might be a performance bottleneck even though I've tried to keep this to a minimum – it's being taken only for hashtable look-up. But still, … I was considering placing the trace object within the super_block to skip this look-up part."

Meanwhile, Heinrich Schuchardt loved Beata's basic idea of generic notifications for all filesystems. In fact, he thought the scope of the project could even be enlarged. Heinrich posed several scenarios in rapid succession:

Many filesystems are remote (e.g. CIFS/Samba) or distributed over many network nodes (e.g. Lustre). How should filesystem notification work here?

How will FUSE filesystems be served?

The current point of reference is a single mount point. Every time I insert a USB stick, several filesystems may be automounted. I would like to receive events for these automounted filesystems.

A similar case arises when starting new virtual machines. How will I receive events on the host system for the filesystems of the virtual machines?

Beata replied that she'd be very happy to extend the notification scope to include all of Heinrich's use cases. She pointed out that flexibility was already a goal, so new cases – including those mentioned by Heinrich – could be added fairly easily. However, she acknowledged that she wasn't first and foremost a filesystem person, so she welcomed more comments and suggestions from folks.

Jan Kara also replied with enthusiasm to Beata's initial announcement and offered some technical suggestions. Austin Hemmelgarn also offered some technical suggestions, specifically: "For some filesystems, it may make sense to differentiate between a generic warning and an error. For Btrfs and ZFS for example, if there is a csum error on a block, this will get automatically corrected in many configurations and won't require anything like fsck to be run, but monitoring applications will still probably want to be notified." Jan replied, "in that case just create an event CORRECTED_CHECKSUM_ERROR and use that. Then userspace knows what it should do with the event. No need to hide it behind warning/error category."

John Spray liked Austin's warning/error categories, saying, "Another key differentiation IMHO is between transient errors (like server is unavailable in a distributed filesystem) that will block the filesystem but might clear on their own, vs. permanent errors like unreadable drives that definitely will not clear until the administrator takes some action. It's usually a reasonable approximation to call transient issues warnings, and permanent issues errors."

Jan replied, "So you can have events like FS_UNAVAILABLE and FS_AVAILABLE but what use would this have?" Jan remarked in a later post that he wouldn't necessarily oppose reporting those events, he just wanted to make sure they'd actually be useful.

Austin replied to Jan, saying, "The use-case that immediately comes to mind for me would be diskless nodes with root-on-nfs needing to know if they can actually access the root filesystem." But Beata herself said that this didn't seem like a convincing use case: "most apps will access the root filesystem regardless of what we send over netlink, … so I don't see netlink events improving the situation there too much. You could try to use it for something like failover but even there I'm not too convinced – just doing some IO, waiting for timeout, and failing over if IO doesn't complete works just fine for that these days."

John also responded to Jan, saying that he could think of a couple of use-cases for something like FS_UNAVAILABLE and FS_AVAILABLE. For example, he said, "a cluster scheduling service (think MPI jobs or Docker containers) might check for events like this. If it can see the cluster filesystem is unavailable, then it can avoid scheduling the job, so that the (multi-node) application does not get hung on one node with a bad mount. If it sees a mount go bad (unavailable, or client evicted) partway through a job, then it can kill -9 the process that was relying on the bad mount, and go run it somewhere else."

John added, "We don't have to invent these event types now of course, but something to bear in mind. Hopefully if/when any of the distributed filesystems (Lustre/Ceph/etc.) choose to implement this, we can look at making the event types common at that time though."

Andreas Dilger applauded John's examples, saying, "Some users were just asking yesterday at the Lustre User Group meeting about adding an interface to notify job schedulers for [your point], and I'd much rather use a generic interface than inventing our own for Lustre."

Overall, amid the technical discussion, there were no voices objecting to the feature itself. It seems likely that Beata's feature will go into the kernel at some point, though the precise form may still be up for grabs. The interface itself, as well as configuration and various other technical details, don't seem to be fully nailed down yet, but the concept of a generic filesystem notification system seems to be thoroughly welcomed by all.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
