Why Can't Computers Just Work All the Time?

Dec 04, 2008

The feud between Minix inventor and operating system czar Andrew S. Tanenbaum and Linus Torvalds is legendary in the OS world. Before Linux there was Minix: Torvalds was a Minix user, and he built the first version of Linux in 1991 on Professor Tanenbaum’s operating system. Mr. Tanenbaum has now agreed to write a guest editorial for Linux Magazine. His opinion has not changed over the years: Linux (and Windows) are “unreliable.”

Computer users are changing. Ten years ago, most computer users were young people or professionals with lots of technical expertise. When things went wrong, which they often did, they knew how to fix them. Nowadays, the average user is far less sophisticated, perhaps a 12-year-old girl or a grandfather. Most of them know about as much about fixing computer problems as the average computer nerd knows about repairing his car. What they want more than anything else is a computer that works all the time, with no glitches and no failures.

Many users automatically compare their computer to their television set. Both are full of magical electronics and have big screens. Most users have an implicit model of a television set: (1) you buy the set; (2) you plug it in; (3) it works perfectly without any failures of any kind for the next 10 years. They expect that from the computer, and when they do not get it, they get frustrated. When computer experts tell them: "If God had wanted computers to work all the time, He wouldn't have invented RESET buttons" they are not impressed.

Professor Andrew S. Tanenbaum

For lack of a better definition of dependability, let us adopt this one: A device is said to be dependable if 99% of the users never experience any failures during the entire period they own the device. By this definition, virtually no computers are dependable, whereas most TVs, iPods, digital cameras, camcorders, etc. are. Techies are willing to forgive a computer that crashes once or twice a year; ordinary users are not.

Home users aren't the only ones annoyed by the poor dependability of computers. Even in highly technical settings, the low dependability of computers is a problem. Companies like Google and Amazon, with hundreds of thousands of servers, experience many failures every day. They have learned to live with this, but they would really prefer systems that just worked all the time. Unfortunately, current software fails them.

The basic problem is that software contains bugs, and the more software there is, the more bugs there are. Various studies have shown that the number of bugs per thousand lines of code (KLoC) varies from 1 to 10 in large production systems. A really well-written piece of software might have 2 bugs per KLoC over time, but not fewer. An operating system with, say, 4 million lines of code is thus likely to have at least 8000 bugs. Not all are fatal, but some will be. A study at Stanford University showed that device drivers, which make up 70% of the code base of a typical operating system, have bug rates 3x to 7x higher than the rest of the system. Device drivers have higher bug rates because (1) they are more complicated and (2) they are inspected less. While many people study the scheduler, few look at printer drivers.
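
As a quick sanity check on that estimate (using the conservative figure of 2 bugs per KLoC quoted above, the other rates only make the number larger):

    \[
      4{,}000{,}000\ \text{LoC} \times \frac{2\ \text{bugs}}{1000\ \text{LoC}} = 8000\ \text{bugs}
    \]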

The Solution: Smaller Kernels

The solution to this problem is to move code out of the kernel, where it can do maximal damage, and put it into user-space processes, where bugs cannot cause system crashes. This is how MINIX 3 is designed. The current MINIX system is the (second) successor to the original MINIX, which was launched in 1987 as an educational operating system and has since been radically revised into a highly dependable, self-healing system. What follows is a brief description of the MINIX architecture; you can find more at www.minix3.org.

MINIX 3 is designed to run as little code as possible in kernel mode, where bugs can easily be fatal. Instead of the 3-4 million lines of code found in conventional kernels, the MINIX 3 kernel has about 5000 lines. Kernels this small are sometimes called microkernels. They handle low-level process management, scheduling, interrupts, and the clock, and they provide some low-level services to user-space components.

The bulk of the operating system runs as a collection of device drivers and servers, each an ordinary user-space process with restricted privileges. None of these drivers and servers run as superuser or the equivalent. They cannot even access I/O devices or the MMU hardware directly; they have to use kernel services to read and write the hardware. The layer of processes running in user mode directly above the kernel consists of the device drivers: the disk driver, the Ethernet driver, and all the other drivers run as separate processes, protected by the MMU hardware so that they cannot execute privileged instructions and cannot read or write any memory except their own.
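
To make that restriction concrete, the sketch below shows, in C, the rough shape of a driver that has to ask the kernel to perform port I/O on its behalf. The request structure and the kernel_request() call are invented for this illustration (they are not the real MINIX 3 kernel-call interface), and the kernel side is simulated by an ordinary function so the example is self-contained.

    #include <stdint.h>
    #include <stdio.h>

    enum { KREQ_PORT_OUT = 1 };      /* the only request type in this sketch */

    struct kreq {
        int      type;   /* what the driver wants done     */
        uint16_t port;   /* I/O port involved              */
        uint32_t value;  /* value the driver wants written */
    };

    /* Stand-in for the trap into the kernel.  A real kernel would first
     * check that this particular driver has been granted access to the
     * requested port before touching the hardware on its behalf. */
    static int kernel_request(const struct kreq *req)
    {
        printf("kernel: driver asks to write 0x%x to port 0x%x\n",
               (unsigned)req->value, (unsigned)req->port);
        return 0;   /* pretend the checked I/O succeeded */
    }

    int main(void)
    {
        /* The driver never executes an OUT instruction of its own. */
        struct kreq req = { KREQ_PORT_OUT, 0x300, 0x1 };
        return kernel_request(&req);
    }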

Above the driver layer comes the server layer, with a file server, a process server, and other servers. The servers make use of the drivers as well as kernel services. For example, to read from a file, a user process sends a message to the file server, which then sends a message to the disk driver to fetch the blocks needed. When the file server has them in its buffer cache, it calls the kernel to move them to the user's address space.
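
The read path just described can be sketched as follows. All of the names below are invented for the example; in a real microkernel each hop would be an IPC message to a separate user-space process, whereas here the hops are ordinary function calls so the flow is easy to trace end to end.

    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 32   /* size of one "disk block" in this sketch */

    /* --- disk driver process ----------------------------------------- */
    static void disk_driver_read(int block, char *buf)
    {
        /* Pretend the block was fetched via kernel-mediated I/O. */
        snprintf(buf, BLOCK_SIZE, "contents of block %d", block);
    }

    /* --- kernel service ------------------------------------------------ */
    static void kernel_copy(const char *src, char *dst, size_t len)
    {
        /* The kernel moves data between address spaces on request. */
        memcpy(dst, src, len);
    }

    /* --- file server process ------------------------------------------- */
    static void file_server_read(int block, char *user_buf)
    {
        char cache[BLOCK_SIZE];                    /* buffer cache entry      */
        disk_driver_read(block, cache);            /* "message" to the driver */
        kernel_copy(cache, user_buf, BLOCK_SIZE);  /* kernel copies to caller */
    }

    /* --- user process --------------------------------------------------- */
    int main(void)
    {
        char buf[BLOCK_SIZE];
        file_server_read(7, buf);                  /* "message" to file server */
        printf("user process received: %s\n", buf);
        return 0;
    }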

In addition to these servers, there is another server called the reincarnation server. The reincarnation server is the parent of all the driver and server processes and monitors their behavior. If it discovers a process that is not responding to its pings, it starts a fresh copy from disk (except for the disk driver, which is shadowed in RAM). The system has been designed so that many (but not all) of the critical drivers and servers can be replaced automatically, while the system is operating, without disturbing running user processes or even notifying the user. In this way, the system is self-healing.
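
The reincarnation idea can be sketched with plain POSIX calls. This is not MINIX 3 code: the real reincarnation server also pings components to detect hangs and keeps a table of drivers and their restart policies, but the core loop (notice that a child has died, then start a fresh copy) looks roughly like this, where ./eth_driver is a hypothetical driver binary.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Start one driver process and return its pid. */
    static pid_t start_driver(const char *path)
    {
        pid_t pid = fork();
        if (pid == 0) {                      /* child: become the driver */
            execl(path, path, (char *)NULL);
            _exit(1);                        /* exec failed              */
        }
        return pid;                          /* parent: remember child   */
    }

    int main(void)
    {
        const char *driver = "./eth_driver"; /* hypothetical driver binary */
        pid_t pid = start_driver(driver);
        if (pid < 0)
            return 1;                        /* could not start it at all */

        for (;;) {
            int status;
            if (waitpid(pid, &status, 0) == pid) {   /* block until it dies */
                fprintf(stderr, "driver exited (status %d); restarting\n",
                        status);
                sleep(1);                            /* avoid a tight loop  */
                pid = start_driver(driver);          /* reincarnate it      */
            }
        }
    }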

To test whether these ideas work in practice, we ran the following experiment. We started a fault-injection process that overwrote 100 machine instructions in the running binary of the Ethernet driver to see what would happen if one of them were executed. If nothing happened for a few seconds, another 100 were injected, and so on. In all, we injected 800,000 faults into each of three different Ethernet drivers and caused 18,000 driver crashes. In every case, the driver was replaced automatically by the reincarnation server. Despite injecting 2.4 million faults into the system, not once did the operating system crash. Needless to say, if a fatal error occurs in a Linux or Windows driver running in the kernel, the entire operating system will crash instantly.
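
For readers who want a feel for the experiment, here is a toy version of such a fault injector. It corrupts random bytes in a local buffer standing in for the driver's text segment; the real injector patched the code of the live Ethernet driver, which requires operating system support and is not shown here.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define TEXT_SIZE        65536   /* stand-in for the driver's text segment */
    #define FAULTS_PER_ROUND   100   /* "instructions" corrupted per round     */
    #define ROUNDS            8000   /* 8000 * 100 = 800,000 faults per driver */

    static unsigned char text[TEXT_SIZE];   /* fake code being corrupted */

    /* Overwrite FAULTS_PER_ROUND random bytes, just as the injector
     * overwrote random machine instructions in the running driver. */
    static void inject_round(void)
    {
        for (int i = 0; i < FAULTS_PER_ROUND; i++) {
            size_t off = (size_t)(rand() % TEXT_SIZE);
            text[off] = (unsigned char)(rand() & 0xff);
        }
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        for (int round = 0; round < ROUNDS; round++)
            inject_round();
        printf("injected %d faults into the fake text segment\n",
               ROUNDS * FAULTS_PER_ROUND);
        return 0;
    }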

Is there a downside to this approach? Yes. There is a performance hit. We have not measured it extensively, but the research group at Karlsruhe, which developed its own microkernel, L4, and then ran Linux on top of it as a single user process, has been able to get the performance hit down to 5%. We believe that, if we put some effort into it, we could get the overhead down to the 5-10% range, too. Performance has not been a priority for us, since most users reading their e-mail or looking at Facebook pages are not limited by CPU performance. What they do want, however, is a system that just works all the time.

If Microkernels Are So Dependable, Why Doesn't Anyone Use Them?

Actually, they do. Probably you run many of them. Your mobile phone, for example, is a small but otherwise normal computer, and there is a good chance it runs L4 or Symbian, another microkernel. Cisco's high-performance router uses one. In the military and aerospace markets, where dependability is paramount, Green Hills Integrity, another microkernel, is widely used. PikeOS and QNX are also microkernels widely used in industrial and embedded systems. In other words, when it really matters that the system "just works all the time" people turn to microkernels. For more on this topic, see www.cs.vu.nl/~ast/reliable-os/.

In conclusion, it is our belief, based on many conversations with nontechnical users, that what they want above all else is a system that works perfectly all the time. They have a very low tolerance for unreliable systems but currently have no choice. We believe that microkernel-based systems can lead the way to more dependable systems.

Related content

  • Minix 3

    Minix is often viewed as the spiritual predecessor of Linux, but these two Unix cousins could never agree on the kernel design. Now a new Minix with a BSD-style free license is poised to attract a new generation of users.

  • FOSDEM 2010: Andrew Tanenbaum Sets Reliability Before Performance

    Computer science veteran Andrew Tanenbaum presented the third version of his Minix operating system at the FOSDEM 2010 conference on February 6-7 in Brussels, Belgium.

  • Kernel Up Close

    We celebrate 30 years of Linux with a special issue that takes you inside the kernel and shows you how to take your first steps with the kernel community.

  • Exploring the Hurd

    The GNU project hasn't given up on the venerable Hurd project, and this long-running GNU Free OS project has recently received a burst of new energy with the release of a new Debian port.

  • 17 Years Now: Linus Torvalds Introduces Linux

    Exactly 17 years ago, on October 5, 1991, Linus Torvalds sent an email to the comp.os.minix newsgroup.

Comments

  • What breaks and what doesn't.

    In my experience using desktop GNU/Linux, it isn't the kernel that has broken, but rather the large applications I run on top of X, which likely have a lot of lines of code to break. My idea for a reliable system is a kernel (micro or otherwise) sitting under an embedded shell (e.g. busybox) with a small X window system (e.g. nano-X) and a bare browser (e.g. the bare components of XULRunner). This is what I'm going to be making instead of a desktop OS.

    However, I do value your opinion, and I would use a microkernel if I only had to support one device. But I need to support lots of different desktop and laptop PCs, and hardware support is my immediate concern for my OS, so I chose Linux, with built-in drivers instead of modules.

    My attempt at this and progress is at http://sechoes.homelinux.com/

    Using my way, there is less software to cause a crash, and thus in my opinion less crashing.

    Dan Dart
  • Self healing is good, but...

    Minix 3.0 was able to self-heal an Ethernet driver bug 2.4 million times without a crash. That is great, but the big question is whether the system was able to perform its functions despite the bug. Most bugs are found, reported, and fixed because they cause a system crash or a program crash. I am not so sure a self-healing system that just continuously restarts processes in spite of fatal bugs is actually moving you in the direction of a more reliable system. Proper beta and release-candidate testing should catch most of the big ones, and then constant monitoring and bug-fixing with frequent bug-fix releases will keep the systems running. Professor Tanenbaum has a great idea in theory. However, for the end-user kernels he was writing about, I think that the Linux method of constant kernel bug-fix releases and upgrades is a superior way to get a kernel out the door and into productive use. Say what you will about Windows, but Microsoft has had to respond to the challenges from Apple and Linux by increasing its bug-fix schedule and improving its Service Pack roll-out process.
  • The GNU based microkernel operating system

    Why is there no mention and/or critique of the GNU Hurd?

    The GNU Hurd is a collection of servers that run on the Mach microkernel, and it is the GNU project's replacement for the Unix kernel.

    http://www.gnu.org/software/hurd/
  • Stability of a microkernel is not disputed.

    Epicanis

    The simple fact is that, due to how it operates, the microkernel is one of the most stable designs out there. Even Linus does not dispute that fact. What Linus disputes is the amount of performance you have to give up to use a microkernel design.

    Performance vs. stability: the problem is picking the right point. The problem with Minix is that it goes after 100 percent stability no matter the performance cost.

    Linux and most other monolithic kernels went after higher performance and paid the price of kernel crashes when drivers are wrong.

    The problem is that microkernel stability only extends as far as kernel operations. Since drivers in microkernels run in user space, a malfunctioning driver cannot cause a kernel crash, so the kernel can recover.

    There is a big, big problem here. If I am an end user and my download or application crashes because a driver goes down, then even if the system does not crash, the lost data means I will still hate the OS.

    All the arguments over microkernels vs. monolithic kernels forget the most important thing: the user. If the user's data is hurt, it really does not matter whether the OS crashes or not. A lot of work is needed on making applications crash-recoverable no matter what.
  • This is a little off the subject, but...

    "We believe that microkernel-based systems can lead the way to more dependable systems. [...]"

    Who is this "we"? Does Andrew Tanenbaum have an Evil (or perhaps Good) Twin?

    (I WORK in academia, and I still think this "'we' meaning 'I'" thing is a seriously obnoxious grammatical quirk of academics.)
  • MINIX3 distribution?

    If microkernels are so great, why are there hundreds of Linux distributions, and no MINIX3 distributions? Where can I download Kubuntu based on MINIX?

    The point is, microkernels might be great in small (functionality-wise) environments. But as with a lot of new things that arrive claiming "we are small and fast and reliable," they tend to become big and slow as functionality is added.

    I am willing to test Mr. Tanenbaum's claims (which he has been making for the last few decades) on my desktop, as long as I can e-mail, surf the web, run OOo and develop web applications.

    uKernels might be great, but Linux has been delivering for more than 10 years now.

    Or to put it differently: microkernels sound nice, but repeat after me "show me the desktop! Show me the desktop!" (thanks to the movie Jerry Maguire).
  • It's always more complex than that.

    The issue here is that microkernels are always put up as the cure-all.

    The NT design started off as a microkernel and ended up a hybrid for speed reasons. There is a balance between stability and speed.

    Linux is currently going through a few interesting alterations. Ksplice is getting to the point where it can replace every single part of the kernel while it is running. So as long as the kernel has not crashed yet, any defect that is there can be removed. This really does put the Linux kernel slightly ahead of Minix. Yes, Minix has a small code base, but can it replace it on the fly if it has to?

    CUSE and FUSD are out to provide user-space drivers for Linux like the ones Minix uses. Older, unsupported drivers, which are where most of the defects in the Linux kernel turn up, are going to be migrated to user space so their stability issues are contained there.

    Really, the best solution is a merge of the three existing operating system designs: monolithic like Linux, microkernel like Minix, and hybrid like Windows.

    Users can then choose what they want. It is a bit like a high-speed BIOS option: it does not check your hardware completely, so it runs faster but is more likely to crash, whereas the turtle option is slower and more stable.

    There is no generic answer for everyone; that is where Mr. Tanenbaum and many others hit the wall.

    Until an OS provides the generic answer, we will be stuck without the options to set our fate. If code can never get to zero errors, then a microkernel can never be perfect either.

    No common OS out there supports updating all of its running code on the fly yet. Until that day, users will always have to restart their machines due to defects or run the risk of a virus strike bringing their systems down.

    Mr. Tanenbaum is also forgetting that the OS does not have to crash for the user to be hurt. Only the application holding key information has to crash with no form of recovery.

    It is time to forget about the microkernel/monolithic arguments and move on to how we can make application crashes recoverable and the defects that cause them removable without the user ever needing to find out.
  • Reliability

    My TV broke in its first year. A bad capacitor was the culprit. My washer lasted 13 months, then broke down, bad door switch. My dryer failed two months later, a bad computer chip and it just failed again. I've had a couple of check engine light problems in my car over the years, bad sensors. We replaced our toaster after only two years. I have some Linux servers that have been up over a year (not connected to the Internet). Reliability.....
  • Odd definition of "failure"

    It seems Dr. T. has a strange, sort of two faced, definition of failure. In one sentence he speaks of TVs that "work perfectly without any failures of any kind for the next 10 years" and then he talks of moving buggy code out of the kernel and into user space "where bugs cannot cause system crashes". I'm sorry but if the "tuner process" of my TV blew up, got restarted by the TV OS, and left the TV tuned to the previous channel, or left the tuner in an undefined state, after I had directed it to change the channel, I would not call that "working perfectly without any failures of any kind for the next 10 years". Do I care that the TV blinked and came back on the same old channel after the OS rebooted or that the TV blinked and came back on the same old channel after the OS restarted the tuner process? No, of course I don't.

    Fact is, bugs are bugs whether they live in the kernel or in userspace. Users are users and they don't even know what a kernel is unless it is popcorn. They do know what userspace is, that's where they sit.

    The only time u-kernels are helpful to reliability is in multitasking systems where a task failure can be tolerated but a total system crash can not. These are few and far between. I note that there is no mention in the article of lost or garbled data or response latency after Dr T blew up his Ethernet driver. Again, do the airliner passengers care if the autopilot flies them into the ground because of an OS crash that caused garbled commands to be sent to the flight surface actuators or because of a process crash that caused garbled commands to be sent to the flight surface actuators? No. Neither is tolerable so you have to TEST, FIND the bug, and then FIX the bug. What is helping military and aerospace systems be reliable is engineering discipline that is applied with an iron fist and with relatively little concern for COST.

    P.S. Speaking of TVs that DO NOT work perfectly, I have a DVR/cable box distributed by Comcast/Time Warner with Motorola's name on it that I'd like to demonstrate to you sometime. I'm certain that if I had written that code and said it was good to go, I'd have been fired for sure.

  • Andy forgot to mention a few things...

    First one is that no matter how reliable a system, if it's significantly slower than the alternative, it won't be used. That's why 250 million mobile phones run OKL4, and none run Minix.

    The other one is what's actually the killer argument for microkernels: they are small enough that you can mathematically prove them correct -- the ultimate assurance of reliability, safety and security. Google "L4.verified".
  • True But....

    True, but in five years I've only had Linux lock up so badly that I couldn't reboot properly maybe three times. And Windows XP? I don't think I've ever BSODed. Maybe once. I use Linux for everything all day except games and Photoshop, for which I use Windows XP. So I think the people getting all these kernel panics are doing weird stuff like installing whatever they find on the net.
  • Computer users don't value reliability.

    They certainly don't let reliability alter their buying habits. Look at ECC memory, if you can find any. It is unavailable and unsupported on anything short of a server. For $50 I just bought 4GB of memory that will generate undetected random errors somewhere in the once-a-day to once-a-week range. One-ninth more memory and a bit more logic would decrease this risk enormously, but the only computer buyers who care to pay for that are buying large servers, so the option is gone from the mainstream.

    Reliability doesn't sell software either, network effects sell software.

    If reliability were the goal, there is a great deal of low-hanging fruit that should be picked before the question of microkernel vs. monolithic kernel makes a substantial difference. Consider Windows, which on seeing a USB device plugged in automatically installs and runs software from it. How can genuine bugs, which developers really do work hard to avoid and fix, compare with such an enormous and obvious security hole that is actively exploited by organized criminals the world over? Consider the way the vast majority of non-free software competes to auto-run, auto-update, run background processes, and fight for ownership of file types and space on the screen. How reliable can a computer be when it is in effect a battlefield for software competing to use people's computers for sales opportunities or darker purposes?

    Compartmentalization is a wonderful way to improve reliability. Far more profit is to be had compartmentalizing applications, everything that is already outside the kernel, and beating down the core war inside our computers. Even this is hard, because the software vendors fight it. They want their programs to be able to take over instead of living in safe, protected, and of course enclosed little jails. The challenge to reliability is far more than mere bugs; it is different people's competing interests.

    I'm glad there are people who care about microkernels and investigate ways to make software more reliable by confining potential bugs to specific components. Perhaps good things will come of this, though I do wonder whether the natural level of sharing inherent in what we want the kernel to do makes it impractical. That judgment belongs to people who know the technical issues better than I do. What I am confident of is that the unreliability of the computers we use is dominated by issues more mundane and present than microkernel vs. monolithic. If the people signing the checks actually cared, things might improve.
  • I agree

    I agree. Microkernels are the way to go.
    Does Minix3 support Windows and Linux devices?
    If it did it could rule.