Linus Torvalds Upset over Ext3 and Ext4

Mar 30, 2009

Linus Torvalds, Ted Ts'o, Alan Cox, Ingo Molnar, Andrew Morton and other Linux kernel developers are embroiled in a contentious discussion over the sense -- or nonsense -- of journaling and delayed allocation before a commit in the ext3 and ext4 filesystems. Harsh words are flying.

It all started with a request for help from Jesper Krogh in one of the first responses to Torvalds's March 24 announcement of kernel 2.6.29 on the gmane.linux.kernel mailing list. Krogh reported a significant delay when writing from cache with the ext3 filesystem, despite faster hardware and extensive RAM, and asked whether there was a way to autotune it. Ingo Molnar opined that Krogh's wait time of 10 minutes was totally unacceptable: "it is the year 2009, not 1959." His personal "pain threshold" is about one second: "the historic limit for the hung tasks check was 10 seconds, then 60 seconds."

Ted Ts'o, a pioneer of the filesystem's development, joined the discussion. He had only recently been confronted by users who lost data after installing their apps on the new ext4 filesystem. Ts'o dug into the problem, investigating the source and providing a detailed explanation. He again described the delay in writing data: synchronization in ext3 occurs every five seconds, whereas ext4 normally writes from cache every two minutes. Ts'o got pretty defensive: "People can call file system developers idiots if it makes them feel better --- sure, OK, we all suck. If someone wants to try to create a better file system, show us how to do better, or send us some patches."

Torvalds, for one, didn't seem too excited about the delayed synchronization. He wrote on the mailing list, "Doesn't at least ext4 default to the insane model of 'data is less important than metadata, and it doesn't get journalled'? And ext3 with 'data=writeback' does the same, no? Both of which are -- as far as I can tell -- total brain damage. At least with ext3 it's not the default mode." To avoid the synchronization problem, Ts'o had recommended limiting ext4, at least temporarily, to a few separate systems. Torvalds called this "crappy" advice and added that "we might as well go back to ext2 then."

In his response, Ts'o fell back on the performance benefits of delayed allocation, which POSIX has long permitted. In his experience, the difference between five seconds and three minutes "wasn't that big of a deal" in practice, "at least in the days when people were proud of their Linux systems having 2-3 year uptimes." Plus there was a remedy: "For precious files, applications that use fsync() will be safe." If this were a problem for some, they could "turn off delayed allocation with the nodelalloc mount option."
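The fsync() remedy Ts'o mentions can be sketched from the application side. The following is a minimal illustration, not code from the mailing list thread: the helper name durable_write is hypothetical, and the write-to-temp-file, fsync, rename sequence is a commonly used pattern that is assumed here rather than prescribed by any of the developers quoted. It also assumes a POSIX system where a directory can be opened and synced.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace `path` with `data` so a crash leaves either the old or the new file.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to write the data to stable storage
        os.replace(tmp_path, path)   # atomically rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is persisted as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling durable_write("notes.txt", b"new contents") pays the cost of a synchronous flush on every save, which is exactly the performance trade-off the developers are arguing about.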

Kernel chief Torvalds was hardly convinced by these arguments. In his view, "if you write your metadata earlier (say, every 5 sec) and the real data later (say, every 30 sec), you're actually more likely to see corrupt files than if you try to write them together... This is why I absolutely detest the idiotic ext3 writeback behavior. It literally does everything the wrong way around -- writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
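Torvalds's ordering argument can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of how ext3 or ext4 actually schedule writes: it assumes that metadata and data are each flushed at fixed periodic intervals (the 5 and 30 seconds from his example), and the function names are made up for illustration.

```python
def crash_outcome(crash_time, write_time, metadata_interval, data_interval):
    """Classify what a crash finds on disk in a toy periodic-flush model."""
    # Assumption: each kind of block is flushed at the first multiple of its
    # interval strictly after the write. Real kernels are far more complex.
    metadata_flush = (write_time // metadata_interval + 1) * metadata_interval
    data_flush = (write_time // data_interval + 1) * data_interval
    metadata_on_disk = crash_time >= metadata_flush
    data_on_disk = crash_time >= data_flush
    if metadata_on_disk and data_on_disk:
        return "intact"
    if metadata_on_disk:
        return "corrupt"   # metadata points at blocks that were never written
    return "old"           # new metadata not on disk yet; the previous state survives


def corrupt_seconds(write_time, metadata_interval, data_interval, horizon=60):
    """Count whole-second crash times that would leave a corrupt file."""
    return sum(
        crash_outcome(t, write_time, metadata_interval, data_interval) == "corrupt"
        for t in range(write_time, horizon)
    )
```

In this model, a write at second 1 with metadata flushed every 5 seconds and data every 30 seconds leaves 25 one-second crash points at which the file is corrupt, while flushing both on the same interval leaves none. The model only shows where the window comes from; it says nothing about how likely a crash is in practice.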

  • You might as well use XFS

    If you go for big delayed writes to gain performance, we might as well go for XFS, or put our development efforts in it.
  • What FS Does Linus Use/Like?

    Well looking at this, I'd be curious to know what FS Linus likes to use then?

  • Ext3/4 reliability

    Well, it's the old saw about performance vs. reliability in this case. In my opinion, as a designer and developer of large-scale distributed transaction processing systems, data is king. If a transaction commits, the data is on disc. I think that this should be the case for file systems as well. Journaling should allow roll-forward recovery (deltas are written, but the full file might not be updated) so that when the caller gets back control, the journal has been written to persistent store, including any required metadata. If the system fails after that fact, then roll-forward recovery should restore the data that the user application had written. In any case, if you want to look at power-fail safe file systems, look at QNX. They wrote the book on high-reliability file systems, IMO.
  • Mechanism for ext3/ext4 data loss?

    Has anyone worked through the/any mechanism by which data is lost?

    I can see that if there is a power failure when data is in memory and it hasn't been written to a journal somewhere - it can be lost. A fairly old fix to this is to write the journal to a battery backed memory on the disk controller. If this write can be done before the main power supply capacitors are depleted, there shouldn't be any loss. Maybe there is a less expensive way to do it.
  • Delayed sync

    You can delay sync and it (probably) does not matter in the majority of cases if that delay is 2 minutes. Because in that time any reloads are likely to be still cached locally anyway.

    Unless it is a day or time when your machine is busy.

    But imagine the situation on a fresh install or when copying huge amounts of data. I can't help feeling that the caching system is going to be a terrible bottleneck.

    Run an rsync in the background while you are editing pictures: when exactly is the sync going to catch up?

    I know the ext4 guys are getting hot under the collar, but surely they can understand that people are going to wonder at how good the ext4 is at deciding on the best action on the fly. Journal now or journal after a delay?

    As for those data losses, why respond with "If you know a better way tell us..."? Well, I know one that might be better: don't lose data.