Zack's Kernel News


Article from Issue 184/2016

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Fixing Memory Usage by Not Fixing It

Al Viro recently posted some "flagday" patches – changes so invasive that they couldn't be done piecemeal. His idea was to convert the kernel's memory handling APIs so that allocation functions like __get_free_page() and a bunch of others would return a pointer instead of just a plain number. His thinking was that everyone doing anything with RAM wanted a usable memory pointer, instead of having to do a typecast every time they called the memory handling functions.

Linus Torvalds, however, put the kibosh on the whole idea. Changing an API that had been in place for almost the entire lifespan of the Linux kernel project would cause a lot of confusion among developers. It would also, he said, make backporting new features to earlier versions of the kernel an even bigger headache than it already was, because the backport would have to make sure it undid all of Al's flagday changes, just to get a patch that would successfully apply to the earlier kernel version.

The proper way to do what Al had proposed, said Linus, would be to create a new set of functions that had different names and to allow both versions of each function to exist side by side. That way, people could migrate their portions of the kernel to the new functions in a piecemeal way over time.

But even creating new functions with different names, he said, was probably not a good idea either, just because the existing set of functions worked fine and had a long history of use.

Al replied that he was fine with Linus's decision, but he wanted to make it clear that the vast majority of calls to these functions didn't want the standard return values and had to use typecasts to get what they wanted. By his count, 1,408 typecasts could be removed from everybody's code if he made this change.

He said, "For me the bottom line so far is that we have a lot of places where page allocator is used and the majority of those uses the result as a pointer. That, with the calling conventions we have (and had all along), means tons of boilerplate. It also means a lot of opportunities to mix physical, virtual and DMA addresses, since typechecking is completely bypassed by those typecasts."

Linus agreed that the existing function behaviors were not what users wanted. But he said:

That doesn't mean that we should just convert a legacy interface. We should either just create a new interface and leave old users alone, or if we care about that code and really want to remove the cast, maybe it should just use kmalloc() instead.

Long ago, allocating a page using kmalloc() was a bad idea, because there was overhead for it in the allocation and the code.

These days, kmalloc() not only doesn't have the allocation overhead, but may actually scale better too, thanks to percpu caches etc.

So my point here is that not only is it wrong to change the calling convention for a legacy function (and it really probably doesn't get much more legacy than get_free_page – I think it's been around forever), but even the "let's make up a new name" conversion may be wrong, because it's entirely possible that the code in question should just be using kmalloc().

So I don't think an automatic conversion is a good idea. I suspect that old code that somebody isn't actively working on should just be left alone, and code that *is* actively worked on should maybe consider kmalloc().

And if the code really explicitly wants a page (or set of aligned pages) for some vm reason, I suspect having the cast there isn't a bad thing. It's clearly not just a random pointer allocation if the bit pattern of the pointer matters.

And yes, most of the people who used to want "unsigned long" have long since been converted to take "struct page *" instead, since things like the VM wants highmem pages etc. There's a reason why the historical interface returns "unsigned long": it _used_ to be the right thing for a lot of code. The fact that there now are more casts than not are about changing use patterns, but I don't think that means that we should change the calling convention that has a historical reason for it.

Al confirmed that the functions were "present in v0.01, with similar situation re callers even back then."

He went through the entire corpus of Linux code and produced a statistical analysis of which behaviors were needed by which parts of the kernel and whether those behaviors were truly needed or could be replaced by something else for a better result.

In the end, Al posted his recommended guidelines for memory usage within the kernel and requested feedback:

1) Most of the time kmalloc() is the right thing to use. Limitations: alignment is no better than word, not available very early in bootstrap, allocated memory is physically contiguous, so large allocations are best avoided.

2) kmem_cache_alloc() allows to specify the alignment at cache creation time. Otherwise it's similar to kmalloc(). Normally it's used for situations where we have a lot of instances of some type and want dynamic allocation of those.

3) vmalloc() is for large allocations. They will be page-aligned, but *not* physically contiguous. OTOH, large physically contiguous allocations are generally a bad idea. Unlike other allocators, there's no variant that could be used in interrupt; freeing is possible there, but allocation is not. Note that non-blocking variant *does* exist – __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL) can be used in atomic contexts; it's the interrupt ones that are no-go.

4) if it's very early in bootstrap, alloc_bootmem() and friends may be the only option. Rule of the thumb: if it's already printed "Memory: …../….. available….." you shouldn't be using that one. Allocations are physically contiguous and at that point large physically contiguous allocations are still OK.

5) if you need to allocate memory for DMA, use dma_alloc_coherent() and friends. They'll give you both the virtual address for your use and DMA address referring to the same memory for use by device; do *NOT* try to derive the latter from the former; use of virt_to_bus() et al. is a Bloody Bad Idea(tm).

6) if you need a reference to struct page, use alloc_page/alloc_pages.

7) in some cases (page tables, for the most obvious example), __get_free_page() and friends might be the right answer. In principle, it's case (6), but it returns page_address(page) instead of the page itself. Historically that was the first API introduced, so a _lot_ of places that should've been using something else ended up using that. Do not assume that being lower level makes it faster than e.g. kmalloc() – this is simply not true.

Improving System Call Error Reporting

Alexander Shishkin recently posted a patch to improve the way system calls reported errors. This had been a thorn in a lot of folks' sides for quite a while. Specifically, Alexander explained that some system calls would take dozens of parameters, and if any of them was incorrect or failed a particular validation check, the only return value would be EINVAL – invalid input. The user would then have to sift through all the parameters and perform many tests just to identify the one incorrect item.

Alexander's patch was a generic approach to error reporting that allowed the called routines to annotate their return values with JSON data that could then be parsed and used to debug whatever problems there were.

To do this, he had to make sure that existing code would still be able to see the same return values they always had. He explained, "Each error 'site' is referred to by its index, which is folded into an integer error value within the range of [-EXT_ERRNO, -MAX_ERRNO], where EXT_ERRNO is chosen to be below any known error codes, but still leaving enough room to enumerate error sites. This way, all the traditional macros will still handle these as error codes and we'd only have to convert them to their original values right before returning to userspace. At that point we'd also store a pointer to the error descriptor in the task_struct, so that a subsequent prctl() call can retrieve it."

Jonathan Corbet took a look at Alexander's code and had some issues with it. He said, "if I read this correctly, once an extended error has been signalled, it will remain forever in the task state until another extended error overwrites it, right? What that says to me is that there will be no way to know whether an error description returned from prctl() corresponds to an error just reported to the application or not; it could be an old error from last week."

Alexander confirmed that this was the intended behavior and explained, "It seems to make sense to allow the program to clear it (via a flag in that prctl(), for example). That is, if we get an error, we try to fetch the extended description, clear it and forget about it. Then, this prctl() may be a part of the syscall wrapper (or a library function that uses that syscall), which might or might not want to leave the extended error code for the main program to inspect. Or a debugger might call this prctl() for its debugging purposes, but still keep it around for the main program."

Johannes Berg felt this could get dicey. He replied to Alexander, saying, "imagine a library wanting to use the prctl(), but the main application isn't doing that. Should the library clear it before every call, to be sure it's not getting stale data?"

Jonathan also said, "anything other than the errno 'grab it now or lose it' behavior will prove confusing. I don't think there is any other way to know that a given error report corresponds to a specific system call. Library calls can mess it up. Kernel changes adding extended reporting to new system calls can mess it up. Applications cannot possibly be expected to know which system calls might change the error-reporting status, they *have* to assume all of them will."

And Johannes followed up with, "an application that expects a certain syscall to have extended errors will get confused if running on an older kernel where that syscall in fact does *not* have extended errors (and thus also doesn't clear extended errors) and therefore the extended error from a previous syscall could still be lingering on (for example because the application didn't care to fetch it for that previous syscall.)"

The discussion petered out there, with no resolution. Over the years, there have been various calls to clean up system call error reporting, but apparently the best way to do this is not yet known. Alexander's approach, leaving errors available for inspection, seems to add confusion because if nothing inspects the error, it could get stale and become misleading. But if the errors are made to be use-it-or-lose-it, it might be difficult for them to reach the layer of code that most needs them.

Fixing the Y2038 Bug

Because Unix timestamps are stored as 32-bit numbers, they are set to roll over in the year 2038. One way to deal with this would be to use 64-bit numbers instead. Deepa Dinamani recently pointed out that the VFS (virtual filesystem) still used 32-bit representations for timestamps on inodes and other data structures. She posted a patch to convert those timestamps to 64-bit numbers and to align and format them properly for minimal RAM requirements.

To prevent code elsewhere in the kernel from running into problems, Deepa also implemented accessor aliases so that the routines using inode and other structures containing the new timestamps would continue to see the data in the expected way.

Dave Chinner had no immediate objection to Deepa's overall goal, but he thought that her efforts to conserve RAM made things more complex than the value of the RAM they saved. As Deepa had described it, the code would "lay them out such that they are packed on a naturally aligned boundary on 64 bit arch as 4 bytes are used to store nsec. This makes each tuple(sec, nsec) use 12 bytes instead of 16 bytes."

Deepa replied to Dave, saying that the savings was significant – roughly 4MB on a lightly loaded system.

Dave had also objected to her accessor macros, saying he didn't see the need for them. In response to this, Deepa went over some of the alternatives she'd considered and some of the problems she'd had to solve. For example, as she put it, "there already are accessors in the VFS: generic_fillattr() and setattr_copy(). The problem really is that there is more than one way of updating these attributes (timestamps in this particular case). The side effect of this is that we don't always call timespec_trunc() before assigning timestamps which can lead to inconsistencies between on-disk and in-memory inode timestamps."

Dave still didn't see the value in Deepa's approach. He said, "you've got a wonderfully complex solution to a problem that we don't need to solve to support timestamps >y2038. It's also why it goes down the wrong path at this point – most of the changes are not necessary if all we need to do is a simple timespec -> timespec64 type change and the addition [of] timestamp range limiting in the existing truncation function."

Arnd Bergmann spoke up at this point, saying, "I originally suggested doing the split representation because I was worried about the downsides of using timespec64 on 32-bit systems after looking at actual memory consumption on my test box. At this moment, I have a total of 145712700 inodes in memory on a machine with 64GB ram; saving 12 bytes on each amounts to a total of 145MB. I think it was more than that when I first looked, so it's between 0.2% and 0.3% of savings in total memory, which is certainly worth discussing about, given the renewed interest in conserving RAM in general. If we want to save this memory, then doing it at the same time as the timespec64 conversion is the right time so we don't need to touch every file twice."

But, Dave wanted to separate the two issues: on the one hand, fixing the Y2038 bug and, on the other, optimizing RAM usage. Given that the Y2038 bug was the only real requirement, whereas the RAM optimization was optional, he wanted to focus on fixing the bug first and deal with optimizations later.

Arnd said that would be fine with him, and Deepa also began to simplify her code in preparation for another version of the patch.

At this point, Deepa, Arnd, and Dave continued delving into the technical requirements for the patch. These included the post-2038 need to mount pre-2038 filesystems, as well as the kinds of service contracts that would require a fix on systems deployed decades before the bug actually manifested. There was also the issue of making sure that each of the many filesystems had its own Y2038 fix. Not all of them dealt with time in the same way, and not all of them could be fixed using the same approach, but one way or another, they all would have to have Y2038 fixes.

Ultimately, there turned out to be many problems associated with the overall bug fix and many angles to consider. At the time of this writing, no complete solution had emerged, and the three developers seemed to be carving off sections of the problem that could be dealt with, in the hopes that the remaining pieces might start to look more tractable afterward.
