Zack's Kernel News

Crashing and Warning

While submitting a patch, Yu Zhao made the lovely statement, "To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only."

His patch itself was not discussed, because a deeper issue came up. It turned out that Yu's code contained several BUG_ON() calls. This macro is a debugging feature that tests whether a certain horrifying condition has occurred and, if so, induces a crash (either of a specific process or of the kernel itself). Andrew Morton noticed Yu's use of this call and pointed out, "General rule: don't add new BUG_ONs, because they crash the kernel. It's better to use WARN_ON or WARN_ON_ONCE [and] then try to figure out a way to keep the kernel limping along. At least so the poor user can gather logs."
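The difference Andrew describes can be illustrated with a minimal userspace sketch. The names SKETCH_BUG_ON, sketch_warn_on_once, and clamp_count are hypothetical stand-ins invented for this example; the real kernel macros are more involved and are not reproduced here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for BUG_ON(): report and stop. */
#define SKETCH_BUG_ON(cond)                                        \
    do {                                                           \
        if (cond) {                                                \
            fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); \
            abort(); /* nothing after this point runs */           \
        }                                                          \
    } while (0)

/* Hypothetical stand-in for WARN_ON_ONCE(): report the first hit only,
 * then hand the condition back so the caller can try to recover.
 * 'warned' is a flag owned by the call site. */
static int sketch_warn_on_once(int cond, int *warned,
                               const char *file, int line)
{
    if (cond && !*warned) {
        *warned = 1;
        fprintf(stderr, "WARNING at %s:%d\n", file, line);
    }
    return cond != 0;
}

/* Example caller: a negative count "must not happen", but instead of
 * crashing, warn once and fall back to a safe value. */
int clamp_count(int count)
{
    static int warned;

    if (sketch_warn_on_once(count < 0, &warned, __FILE__, __LINE__))
        return 0; /* limp along with a sane default */
    return count;
}
```

Calling clamp_count() with a negative value twice prints a single warning and returns 0 both times, so the program keeps running and the log entry survives. Swapping in SKETCH_BUG_ON(count < 0) would end the program at the first bad value.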

Yu replied that his particular use of BUG_ON() (i.e., VM_BUG_ON()) was something that would only affect the kernel build process, not runtime. But Andrew replied, "I'm told that many production builds enable runtime VM_BUG_ONning."
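The disagreement turns on how VM_BUG_ON() is gated by a configuration option. The following userspace sketch models that gating under stated assumptions: SKETCH_DEBUG_VM stands in for CONFIG_DEBUG_VM, and SKETCH_VM_BUG_ON, probe, and run_check are hypothetical names made up for illustration, not kernel code.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef SKETCH_DEBUG_VM
/* Debug build: the condition is evaluated at runtime and a hit aborts. */
#define SKETCH_VM_BUG_ON(cond)                                        \
    do {                                                              \
        if (cond) {                                                   \
            fprintf(stderr, "VM BUG at %s:%d\n", __FILE__, __LINE__); \
            abort();                                                  \
        }                                                             \
    } while (0)
#else
/* Non-debug build: the expression is still type-checked by the
 * compiler, but sizeof never evaluates it, so no runtime check runs. */
#define SKETCH_VM_BUG_ON(cond) ((void)sizeof((long)(cond)))
#endif

int checks_evaluated;

/* Counts how often the checked condition is actually evaluated. */
int probe(void)
{
    checks_evaluated++;
    return 0; /* the "bad" condition is false here */
}

/* Returns the number of runtime evaluations the macro performed. */
int run_check(void)
{
    checks_evaluated = 0;
    SKETCH_VM_BUG_ON(probe());
    return checks_evaluated;
}
```

Built without -DSKETCH_DEBUG_VM, run_check() reports zero evaluations, which matches Yu's view of the check as a build-time matter. Built with the flag, the condition runs on every call, which is the situation Andrew was worried about in production builds.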

Yu pushed back, arguing:

"Nobody wants to debug VM in production. Some distros that offer both the latest/LTS kernels do enable CONFIG_DEBUG_VM in the former so the latter can have better test coverage when it becomes available. Do people use the former in production? Absolutely, otherwise we won't have enough test coverage. Are we supposed to avoid CONFIG_DEBUG_VM? I don't think so, because it defeats the purpose of those distros enabling it in the first place.

"The bottomline is that none of RHEL 8.5, SLES 15, [or] Debian 11 enables CONFIG_DEBUG_VM."

Andrew went looking for proof and found specific instances of Red Hat Linux enabling CONFIG_DEBUG_VM in its build system. He showed these to Yu, who replied, "Yes, Fedora/RHEL is one concrete example of the model I mentioned above (experimental/stable). I added Justin, the Fedora kernel maintainer, and he can further clarify. If we don't want more VM_BUG_ONs, I'll remove them. But (let me reiterate) it seems to me that just defeats the purpose of having CONFIG_DEBUG_VM."

Andrew was not unsympathetic and said, "It was never expected that VM_BUG_ON() would get subverted in this fashion." He suggested potentially designing an additional BUG_ON() function that would be better. But he also pointed out, "none of this addresses the core problem: *_BUG_ON() often kills the kernel. So guess what we just did? We killed the user's kernel at the exact time when we least wished to do so: when they have a bug to report to us. So the thing is self-defeating. It's much much better to WARN and to attempt to continue. This makes it much more likely that we'll get to hear about the kernel flaw."

Linus Torvalds joined the discussion at this point, saying, "There is absolutely _zero_ advantage to killing the machine. If you want to be notified about 'this must not happen', then WARN_ON_ONCE() is the right thing to use. BUG_ON() is basically always the wrong thing to do."

Yu stood his ground, replying to Linus, "for the greater good, do we want to inflict more pain on a small group of users running experimental kernels so that they'd come back and yell at us quicker and louder? BUG_ONs are harmful but problems that trigger them would be presumably less penetrating to the user base; on the other hand, from my experience working with some testers (ordinary users), they ignore WARN_ON_ONCEs until the kernel crashes."

But Linus felt that argument did not hold water. He replied:

"First you say that VM_BUG_ON() is only for VM developers.

"Then you say 'some testers (ordinary users) ignore WARN_ON_ONCEs until the kernel crashes'.

"So which is it?

"VM developers, or ordinary users?

"Honestly, if a VM developer is ignoring a WARN_ON_ONCE() from the VM subsystem, I don't even know what to say.

"And for ordinary users, a WARN_ON_ONCE() is about a million times better, because:

  • the machine will hopefully continue working, so they can report the warning
  • even when they don't notice them, distros tend to have automated reporting infrastructure

"That's why I absolutely *DETEST* those stupid BUG_ON() cases – they will often kill the machine with nasty locks held, resulting in a completely undebuggable thing that never gets reported.

"Yes, you can be careful and only put BUG_ON() in places where recovery is possible. But even then, they have no actual _advantages_ over just a WARN_ON_ONCE."

Yu clarified that he had not been talking about kernel developers ignoring warnings, nor did he believe that VM_BUG_ON() was only for VM developers. His real concern was the ordinary user, who might be more inclined to report a crash to the kernel developers than a mere warning. To Linus's point, he said, "I hear you, and I wasn't arguing about anything, just sharing my two cents."

Meanwhile, Justin Forbes from Red Hat explained some of the thinking behind their use of CONFIG_DEBUG_VM. He said, "We almost split into 3 scenarios. In rawhide we run a standard Fedora config for rcX releases and .0, but git snapshots are built with debug configs only. The trade off is that we can't turn on certain options which kill performance, but we do get more users running these kernels which expose real bugs. The rawhide kernel follows Linus' tree and is rebuilt most weekdays. Stable Fedora is not a full debug config, but in cases where we can keep a debug feature on without it much getting in the way of performance, as is the case with CONFIG_DEBUG_VM, I think there is value in keeping those on, until there is not. And of course RHEL is a much more conservative config, and a much more conservative rebase/backport codebase."

He added, "If keeping the option on becomes problematic, we can simply turn it off. Fedora certainly has a more diverse installed base than typical enterprise distributions, and much more diverse than most QA pools. Both in the array of hardware, and in the use patterns, so things do get uncovered that would not be seen otherwise."

And, regarding the desire to warn rather than crash, Justin said, "I agree very much with this. We hear about warnings from users, they don't go unnoticed, and several of these users are willing to spend time to help get to the bottom of an issue. They may not know the code, but plenty are willing to test various patches or scenarios."

At a certain point in the discussion, Yu said, "Based on all the feedback, my action item is to replace all VM_BUG_ONs with VM_WARN_ON_ONCEs."

And that was the end of the discussion.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
