Zack's Kernel News
Panic on OOM Timeout
Michal Hocko posted some patches implementing a "panic on OOM timeout" feature. In other words, when the system detected an out-of-memory condition, it would start a timer. If the OOM killer happened to kill the correct process and free up enough memory, then the system would continue running. But if it couldn't find the right process in the time allotted, there would be a panic, producing some hopefully usable debugging information, and the system could be rebooted in an orderly fashion. Without the timer, Michal argued, the OOM killer could just go on killing the wrong processes, leaving the system unusable for an unpredictable amount of time. The timer, he said, added an important element of predictability to the situation.
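The control flow described above can be illustrated with a small user-space simulation. This is only a sketch of the idea, not Michal's patch or any kernel code: the names `handle_oom`, `Process`, and `SimulatedPanic`, the abstract time "ticks," and the fixed cost per kill are all assumptions made for illustration. Panicking when no candidates remain is likewise an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


class SimulatedPanic(Exception):
    """Stands in for a kernel panic in this user-space sketch."""


@dataclass
class Process:
    name: str
    rss: int  # memory released if the process is killed (arbitrary units)


def handle_oom(
    processes: List[Process],
    free_memory: int,
    needed: int,
    timeout: int,
    pick_victim: Callable[[List[Process]], Process],
    kill_cost: int = 1,
) -> List[str]:
    """Kill victims until `needed` memory is free, or panic on timeout.

    Time is simulated: each kill advances the clock by `kill_cost` ticks
    (an assumption that keeps the example deterministic).
    """
    elapsed = 0  # ticks since the out-of-memory condition was detected
    remaining = list(processes)
    killed: List[str] = []

    while free_memory < needed:
        # The timer expired before enough memory was recovered: give up.
        if elapsed >= timeout:
            raise SimulatedPanic(f"OOM not resolved after {elapsed} ticks")
        # Nothing left to kill (treated as a panic in this sketch).
        if not remaining:
            raise SimulatedPanic("no killable processes left")

        victim = pick_victim(remaining)
        remaining.remove(victim)
        killed.append(victim.name)
        free_memory += victim.rss
        elapsed += kill_cost

    return killed  # enough memory was freed; the system keeps running
```

With a victim-selection policy that finds the large process right away, for example `lambda ps: max(ps, key=lambda p: p.rss)`, the function returns after a single kill. With a policy that keeps choosing small processes, the timeout bounds how long the wrong processes keep getting killed before the simulated panic is raised.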
Tetsuo Handa, who had worked on a similar feature some months earlier, agreed with the feature in principle but had questions about the implementation. The two immediately launched into a technical comparison of their two patch sets, discussing specific scenarios that could lead to the timeout taking too long and other undesirable end results.
Part of the complexity of the debate arose from the fact that, in an out-of-memory situation, the system is already ailing. The questions then become which ailments are preferable to others, when a given code path is trying too hard to solve a problem that won't make enough of a difference anyway, and whether a given code path gives up and shuts down the system while there is still hope of resurrecting it.
The two went back and forth for a while, each essentially defending his own implementation while also submitting additional patches that might either win the other over or address the other's concerns. Ultimately, their approaches grew closer together, but there was no true resolution by the end of the discussion.
The whole question of how to handle out-of-memory conditions is very thorny. It's possible that a technically superior approach might be rejected by Linus Torvalds or someone else along the way just because of the maintenance burden it would create. To some extent, the proper behavior might depend on the most likely user, which can also be hard to identify.