Zack's Kernel News

Article from Issue 218/2019

Zack Brown reports on Linus returning to the Kernel, and coscheduling and Intel vulnerabilities.

Linus Returns to the Kernel

Linus Torvalds's self-imposed exile from kernel development lasted roughly five weeks, ending on October 23 when he began to catch up on kernel merges. He had originally decided to take a break to reflect on his sometimes harsh treatment of Linux developers.

Given the extreme reactions to his departure – or at least to the accompanying "code of conduct" that entered the kernel source tree at the same moment – his return was greeted peacefully. Mostly he addressed the question of whether developers would prefer to receive an email acknowledgement when they submitted merge requests; the responses ranged all over the place, with some saying yes, some saying no, some explaining their own recipes for handling merge requests, and some suggesting various techniques and alternatives for automating various parts of Linus's workflow.

The real debate will come when, for example, someone disingenuously tries to get "security" code into the kernel that could be used to lock users out of controlling their own systems. How will Linus respond when someone refuses to acknowledge the validity of technical objections and continues to push for their particular patch?

My personal guess is that Linus and other top developers may now start to freeze certain people out. Instead of posting harsh responses, they may simply fail to reply to those posts at all. They won't make any public argument against certain approaches to the Linux kernel; those approaches will be granted a de facto legitimacy, even though their code won't immediately go into the kernel tree.

Coscheduling and Intel Vulnerabilities

Jan H. Schönherr wanted to extend the Completely Fair Scheduler (CFS) to support coscheduling, a technique that schedules related processes to run at the same time on separate CPUs. It's not necessarily better than other scheduling approaches; its advantages and disadvantages depend on the situation and on the implementation details of each approach.

Jan gave several examples of possible use cases that would benefit from coscheduling. For example, "if we can derive subsets of tasks, where tasks in a subset don't interfere much with each other when executed in parallel, then coscheduling can be used to realize this more efficient schedule. And 'resource' is a really loose term here: from execution units in an SMT system, over cache pressure, over memory bandwidth, to a processor's power budget and resulting frequency selection."

Jan was not trying to implement coscheduling for all circumstances. In particular, Jan's code would coschedule tasks that were not already being handled by other scheduling techniques in CFS; coscheduling would not be the primary scheduling approach.
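
To make the basic idea concrete, here is a minimal Python sketch of gang-style coscheduling. It is a toy model, not Jan's patch: the function name coschedule, the group names, and the round-robin policy are all assumptions made for illustration. In each time slice, one group's tasks run side by side across the CPUs, and any CPU without a task from that group sits idle.

```python
from itertools import cycle


def coschedule(groups, num_cpus, num_slices):
    """Toy gang-style coscheduler (hypothetical, for illustration only).

    groups: dict mapping a group name to a list of task names.
    Returns one list per time slice with the task placed on each CPU;
    None marks a CPU that is left idle for that slice.
    """
    timeline = []
    order = cycle(sorted(groups))  # plain round-robin over the groups
    for _ in range(num_slices):
        group = next(order)
        # Tasks beyond num_cpus are simply ignored in this toy model.
        tasks = groups[group][:num_cpus]
        # CPUs with no task from the selected group stay idle rather
        # than running a task that belongs to a different group.
        slots = tasks + [None] * (num_cpus - len(tasks))
        timeline.append(slots)
    return timeline


groups = {"vm_a": ["a0", "a1"], "vm_b": ["b0"]}
timeline = coschedule(groups, num_cpus=2, num_slices=4)
for index, slots in enumerate(timeline):
    print(index, slots)
```

With two CPUs, the one-task group leaves a CPU idle during each of its slices, which is the kind of cost any real implementation would have to weigh.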

As another caveat, Jan explained, "The collective context switch from one coscheduled set of tasks to another – while fast – is not atomic. If a use case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional handshake/barrier needs to be added."
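
The handshake Jan mentions can be pictured with a small userspace analogy. The following Python sketch is not kernel code and is not taken from the patch; the names cpu_worker, switch_barrier, and the event labels are hypothetical. Each thread stands in for a CPU, and a barrier ensures that no thread starts its task from the next set until every thread has stopped its task from the previous set.

```python
import threading

NUM_CPUS = 4
events = []                      # shared log of (cpu, event) tuples
log_lock = threading.Lock()
switch_barrier = threading.Barrier(NUM_CPUS)


def log(cpu, event):
    # Serialize appends so the log order reflects a real global order.
    with log_lock:
        events.append((cpu, event))


def cpu_worker(cpu):
    log(cpu, "stop_prev")        # this "CPU" has stopped its old task
    switch_barrier.wait()        # handshake: wait until every CPU has stopped
    log(cpu, "start_next")       # only now start the task from the next set


threads = [threading.Thread(target=cpu_worker, args=(c,)) for c in range(NUM_CPUS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Position of the last "stop" and the first "start" in the global log.
last_stop = max(i for i, (_, e) in enumerate(events) if e == "stop_prev")
first_start = min(i for i, (_, e) in enumerate(events) if e == "start_next")
print(last_stop < first_start)
```

Without the barrier, a fast thread could log its start before a slow thread had logged its stop, which corresponds to the non-atomic switch described in the quote.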

Nishanth Aravamudan did some testing and found that he could hang the system under some circumstances. Jan looked at Nishanth's setup and felt that it should work fine – though he acknowledged that he too saw a system lockup with the same configuration. The two went back and forth hunting the bug, and Jan eventually solved several issues that had contributed to the lockups.

In his initial post, Jan had said that Peter Zijlstra had once called coscheduling a "scalability nightmare waiting to happen." Meanwhile, Peter also replied to Jan's initial post, saying, "this isn't anywhere near something to consider merging."

In particular, Peter said that there were certain issues in the existing kernel cgroup code, used for virtual machines (VMs), that made it hard to scale upward to many VMs – and he said Jan's coscheduling code made these particular scalability issues "a ton worse" and "many times worse." He felt that the cgroup code needed to be thoroughly gone through, cleaned up, and optimized, before anything like Jan's current patch could even be considered.

Peter added some context, saying, "I detest cgroups; for their inherent complexity and the performance costs associated with them. … It is after all, perfectly possible to run a KVM thingy without cgroups."

He felt that CFS itself was entirely the wrong place for the kind of "gang scheduling" Jan had in mind. He said such things were possible, "but not within the confines of something like CFS; they are also fairly inefficient because, as you do note, you will have to explicitly schedule idle time for idle vCPUs."

Peter was also unsatisfied with Jan's technical explanation of the coscheduling implementation details, saying "You gloss over a ton of details here; many of which are non trivial." He said, "Unless you have solid suggestions on how to deal with all of them, this is a complete non-starter."

He went into a bunch of technical details here, regarding specific areas of Jan's patch that Peter felt had no clear design. Peter also had a negative critique of the patch code itself, saying: "What about that atrocious locking you sprinkle all over the place? 'Some additional lock contention' doesn't even begin to describe that horror show. Hint: We're not going to increase the lockdep subclasses, and most certainly not for scheduler locking."

Jan replied that, "This patch set should 'just' give the user the additional ability to coordinate scheduling decisions across multiple CPUs. At least, that's my goal. If someone doesn't need it, they don't have to use it. Just like task groups."

On the other hand, Jan continued, if someone did want to use coscheduling, the patch would present them with the ability to experiment with coordinated scheduling decisions. He pointed out that there was a lot of interesting research about the kind of benefits that coscheduling could provide if it were ever implemented – so why not implement it? Jan reiterated that "existing scheduler features, like preemption, (should) still work as before with this variant of coscheduling, with the same look and feel."

Jan acknowledged that – as Peter had said – the patch wasn't ready to be merged into the kernel tree. He wasn't submitting it for inclusion, but was just trying to start a discussion about the mechanics and implementation.

Jan also addressed a number of Peter's technical objections, while reiterating that many of these were made against code that was acknowledged to be not yet ready for inclusion.

Peter replied again with a bit more of his own context. He explained, "I have, of course, been looking at (SMT) coscheduling, specifically in the context of L1TF, myself. I came up with a vastly different approach. … Note, that even though I wrote much of that code, I don't particularly like it either."

L1TF (L1 Terminal Fault) is a security hole in Intel chips, revealed in August 2018, which comes on the heels of a number of other very serious hardware flaws that have been discovered relatively recently in Intel chips. In the case of L1TF, the vulnerability allows hostile code to use speculative execution to read data sitting in the CPU's L1 data cache. Because sibling hyperthreads on a core share that cache, the flaw is closely tied to the question of which tasks are allowed to run alongside each other, and that is a scheduling decision.

One of Peter's original objections had been that Jan's sole motivation for writing this patch was to try to address L1TF, which Peter felt could really be addressed in better ways. Later, though, Peter relented and acknowledged that regarding Jan's motivations, "I might have jumped to conclusions here."

Still, after Peter's acknowledgement, Rik van Riel wanted to know what other motivations beyond L1TF Jan had for producing the patch. Specifically, "What are the other use cases, and what kind of performance numbers do you have to show examples of workloads where coscheduling provides a performance benefit?"

Jan replied at length (complete with bibliographic references):

Many coscheduling use cases are not primarily about performance.

Sure, there are the resource contention use cases, which are barely about anything else. See, e.g., [1] for a survey with further pointers to the potential performance gains. Realizing those use cases would require either a userspace component driving this, or another kernel component performing a function similar to the current auto-grouping with some more complexity depending on the desired level of sophistication. This extra component is out of my scope. But I see a coscheduler like this as an enabler for practical applications of these kind of use cases.

If you use coscheduling as part of a solution that closes a side channel, performance is a secondary aspect, and hopefully we don't lose much of it.

Then, there's the large fraction of use cases, where coscheduling is primarily about design flexibility, because it enables different (old and new) application designs, which usually cannot be executed in an efficient manner without coscheduling. For these use cases performance is important, but there is also a trade-off against development costs of alternative solutions to consider. These are also the use cases where we can do measurements today, i.e., without some yet-to-be-written extra component. For example, with coscheduling it is possible to use active waiting instead of passive waiting/spin-blocking on non-dedicated systems, because lock holder preemption is not an issue anymore. It also allows using applications that were developed for dedicated scenarios in non-dedicated settings without loss in performance – like an (unmodified) operating system within a VM, or HPC code. Another example is cache optimization of parallel algorithms, where you don't have to resort to cache-oblivious algorithms for efficiency, but where you can stay with manually tuned or auto-tuned algorithms, even on non-dedicated systems. (You're even able to do the tuning itself on a system that has other load.)

Now, you asked about performance numbers, that I have.

If a workload has issues with lock-holder preemption, I've seen up to 5x to 20x improvement with coscheduling. (This includes parallel programs [2] and VMs with unmodified guests without PLE [3].) That is of course highly dependent on the workload. I currently don't have any numbers comparing coscheduling to other solutions used to reduce/avoid lock holder preemption, that don't mix in any other aspect like resource contention. These would have to be micro-benchmarked.

If you're happy to compare across some more moving variables, then more or less blind coscheduling of parallel applications with some automatic workload-driven (but application-agnostic) width adjustment of coscheduled sets yielded an overall performance benefit between roughly 10% to 20% compared to approaches with passive waiting [2]. It was roughly on par with pure space-partitioning approaches (slight minus on performance, slight plus on flexibility/fairness).

I never went much into the resource contention use cases myself. Though, I did use coscheduling to extend the concept of "nice" to sockets by putting all niced programs into a coscheduled task group with appropriately reduced shares. This way, niced programs don't just get any and all idle CPU capacity – taking away parts of the energy budget of more important tasks all the time – which leads to important tasks running at turbo frequencies more often. Depending on the parallelism of niced workload and the parallelism of normal workload, this translates to a performance improvement of the normal workload that corresponds roughly to the increase in frequency (for CPU-bound tasks) [4]. Depending on the processor, that can be anything from just a few percent to about a factor of 2.

Rik replied, "I like the idea of having some coscheduling functionality in Linux, but I absolutely abhor the idea of making CFS even more complicated than it already is. The current code is already incredibly hard to debug or improve. Are you getting much out of CFS with your current code? It appears that your patches are fighting CFS as much as they are leveraging it."

He felt that coscheduling would probably be much better as its own scheduler class and leave it out of CFS entirely. In another email, he reiterated, "CFS is already complicated enough that it borders on unmaintainable. I would really prefer to have the coscheduler code separate from CFS, unless there is a really compelling reason to do otherwise."

Jan replied that he still felt that the coscheduling code was not so terrible in CFS, and that really it just leveraged all the hard work that had already gone into CFS. But, he went on, if he were going to consider taking the coscheduling code out of CFS, "I'd overhaul the scheduling class concept as it exists today. Instead, I'd probably attempt to schedule instantiations of scheduling classes. In its easiest setup, nothing would change: one CFS instance, one RT instance, one DL instance, strictly ordered by priority (on each CPU). The coscheduler as it is posted (and task groups in general) are effectively some form of multiple CFS instances being governed by a CFS instance. This approach would allow, for example, multiple CFS instances that are scheduled with explicit priorities; or some tasks that are scheduled with a custom scheduling class, while the whole group of tasks competes for time with other tasks via CFS rules."
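
Jan's idea of scheduling instances of scheduling classes can be sketched roughly as follows. This is only an illustration under stated assumptions: the SchedInstance class, the instance names, and the FIFO picking are hypothetical stand-ins, not kernel data structures, and a real CFS instance would pick tasks by its own fairness rules. The sketch shows only the outer layer, where instances are consulted in strict priority order on one CPU, including two CFS-like instances with explicit priorities.

```python
from collections import deque


class SchedInstance:
    """Hypothetical scheduler instance: a named run queue with FIFO picking."""

    def __init__(self, name, tasks=()):
        self.name = name
        self.queue = deque(tasks)

    def pick_next(self):
        # Return the next runnable task, or None if this instance is empty.
        return self.queue.popleft() if self.queue else None


def pick_next_task(instances):
    """Ask each instance in strict priority order; the first task found wins."""
    for inst in instances:
        task = inst.pick_next()
        if task is not None:
            return inst.name, task
    return None


# Highest priority first: deadline, real-time, then two CFS-like instances.
cpu0 = [
    SchedInstance("DL"),
    SchedInstance("RT", ["irq_thread"]),
    SchedInstance("CFS-high", ["db"]),
    SchedInstance("CFS-low", ["batch"]),
]

picks = [pick_next_task(cpu0) for _ in range(4)]
for pick in picks:
    print(pick)
```

With a single CFS-like instance in the list, this collapses to the simple setup Jan describes, where nothing changes from today's ordering of classes.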

There was quite a bit of implementation discussion surrounding the debate. The primary debate ended roughly here though, and it seems clear that there is plenty of support for coscheduling in general. Efforts to route around L1TF are definitely motivating factors, although coscheduling has its own independent appeal as well.

The big issue mostly seems to center around how to implement coscheduling in a way that is not psychotically complex. Apparently, all the code it touches or might touch is already bordering on insane. For the people objecting to Jan's patch – primarily Rik and Peter – their main concern seems to be just finding a way to maintain the code once it goes into the kernel.


  1. Zhuravlev, S., J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys, 2012;45(1):4.1-4.28
  2. Schönherr, J. H., B. Juurlink, and J. Richling. TACO: A scheduling scheme for parallel applications on multicore architectures. Scientific Programming, 2014;22(3):223-237
  3. Schönherr, J. H., B. Lutz, and J. Richling. "Non-intrusive coscheduling for general purpose operating systems." In: Proceedings of the International Conference on Multicore Software Engineering, Performance, and Tools (MSEPT '12), ser. Lecture Notes in Computer Science, vol. 7303 (2012). Berlin/Heidelberg, Germany: Springer, pp. 66-77
  4. Schönherr, J. H., J. Richling, M. Werner, and G. Mühl. "A scheduling approach for efficient utilization of hardware-driven frequency scaling." In: M. Beigl and F. J. Cazorla-Almeida, eds. Workshop Proceedings of the 23rd International Conference on Architecture of Computing Systems (ARCS Workshops, 2010), Berlin, Germany: VDE Verlag, pp. 367-376

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
