Why Can't Computers Just Work All the Time?
The feud between Minix inventor and operating system czar Andrew S. Tanenbaum and Linus Torvalds is legendary in the OS world. Before Linux there was Minix. Torvalds was a Minix user who built his first Linux version in 1991 on Professor Tanenbaum's operating system. Mr. Tanenbaum has now agreed to write a guest editorial for Linux Magazine. His opinion has not changed over the years: Linux (and Windows) are "unreliable."
Computer users are changing. Ten years ago, most computer users were young people or professionals with lots of technical expertise. When things went wrong, which they often did, they knew how to fix them. Nowadays, the average user is far less sophisticated, perhaps a 12-year-old girl or a grandfather. Most of them know about as much about fixing computer problems as the average computer nerd knows about repairing his car. What they want more than anything else is a computer that works all the time, with no glitches and no failures.
Many users automatically compare their computer to their television set. Both are full of magical electronics and have big screens. Most users have an implicit model of a television set: (1) you buy the set; (2) you plug it in; (3) it works perfectly without any failures of any kind for the next 10 years. They expect that from the computer, and when they do not get it, they get frustrated. When computer experts tell them: "If God had wanted computers to work all the time, He wouldn't have invented RESET buttons" they are not impressed.
For lack of a better definition of dependability, let us adopt this one: A device is said to be dependable if 99% of the users never experience any failures during the entire period they own the device. By this definition, virtually no computers are dependable, whereas most TVs, iPods, digital cameras, camcorders, etc. are. Techies are willing to forgive a computer that crashes once or twice a year; ordinary users are not.
Home users aren't the only ones annoyed by the poor dependability of computers. Even in highly technical settings, the low dependability of computers is a problem. Companies like Google and Amazon, with hundreds of thousands of servers, experience many failures every day. They have learned to live with this, but they would really prefer systems that just worked all the time. Unfortunately, current software fails them.
The basic problem is that software contains bugs, and the more software there is, the more bugs there are. Various studies have shown that the number of bugs per thousand lines of code (KLoC) varies from 1 to 10 in large production systems. A really well-written piece of software might get down to 2 bugs per KLoC over time, but not fewer. An operating system with, say, 4 million lines of code is thus likely to have at least 8,000 bugs. Not all are fatal, but some will be. A study at Stanford University showed that device drivers, which make up 70% of the code base of a typical operating system, have bug rates 3x to 7x higher than the rest of the system. Device drivers have higher bug rates because (1) they are more complicated and (2) they are inspected less. While many people study the scheduler, few look at printer drivers.
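As a back-of-envelope check of these figures (the numbers below are just the illustrative ones from the paragraph above, not measurements):

    /* Rough bug estimate: 4 million lines at 2 bugs per KLoC. */
    #include <stdio.h>

    int main(void)
    {
        double kloc          = 4000.0;  /* 4,000,000 lines = 4000 KLoC   */
        double bugs_per_kloc = 2.0;     /* a very well-written code base */
        printf("expected bugs: %.0f\n", kloc * bugs_per_kloc);  /* 8000  */
        return 0;
    }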
The Solution: Smaller Kernels
The solution to this problem is to move code out of the kernel, where it can do maximal damage, and put it into user-space processes, where bugs cannot cause system crashes. This is how MINIX 3 is designed. The current MINIX system is the (second) successor to the original MINIX, which was launched in 1987 as an educational operating system but has since been radically revised into a highly dependable, self-healing system. What follows is a brief description of the MINIX architecture; you can find more at www.minix3.org.
MINIX 3 is designed to run as little code as possible in kernel mode, where bugs can easily be fatal. Instead of 3-4 million lines of kernel code, MINIX 3 has about 5000 lines of kernel code. Sometimes kernels this small are called microkernels. They handle low-level process management, scheduling, interrupts, and the clock, and they provide some low-level services to user-space components.
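To give a feel for how little such a kernel needs to export, here is a minimal sketch of a microkernel-style message-passing interface in C. The names and fields are illustrative, not the actual MINIX 3 headers; the real system's message format and kernel calls differ in detail.

    /* Illustrative sketch of synchronous message-passing primitives.
     * This is not the real MINIX 3 API; it only shows the shape of it. */

    typedef struct {
        int  m_source;        /* process that sent the message          */
        int  m_type;          /* request or reply code                  */
        long m_payload[6];    /* small, fixed-size payload              */
    } message;

    /* Roughly all a tiny kernel has to offer user-space components:    */
    int send(int dest, message *m);     /* block until 'dest' receives   */
    int receive(int src, message *m);   /* block until a message arrives */
    int sendrec(int dest, message *m);  /* send, then wait for the reply */

Everything else (drivers, file systems, networking) is built on top of primitives like these in user space.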
The bulk of the operating system runs as a collection of device drivers and servers, each an ordinary user-space process with restricted privileges. None of these drivers and servers runs as superuser or equivalent. They cannot even access I/O devices or the MMU hardware directly; they have to use kernel services to read from and write to the hardware. The layer of user-mode processes directly above the kernel consists of the device drivers: the disk driver, the Ethernet driver, and all the other drivers run as separate processes, protected by the MMU hardware so that they cannot execute privileged instructions and cannot read or write any memory except their own.
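In code, the difference is that a driver never touches a device register itself; it asks the kernel to do so on its behalf. The sketch below is illustrative only: kcall_port_read() is a made-up name standing in for the kernel's device-I/O services, not a real MINIX 3 call.

    /* Hypothetical user-space driver reading a device status register
     * through a kernel call instead of executing I/O instructions.    */
    #include <stdint.h>

    int kcall_port_read(uint16_t port, uint32_t *value);  /* provided by the kernel (illustrative) */

    int read_status_register(uint16_t status_port, uint32_t *status)
    {
        /* The driver has no I/O privileges of its own.  If kernel policy
         * says this driver may not touch status_port, the call simply
         * fails instead of the whole machine misbehaving.               */
        return kcall_port_read(status_port, status);
    }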
Above the driver layer comes the server layer, with a file server, a process server, and other servers. The servers make use of the drivers as well as kernel services. For example, to read from a file, a user process sends a message to the file server, which then sends a message to the disk driver to fetch the blocks needed. When the file system has them in its buffer cache, it calls the kernel to move them to the user's address space.
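Expressed with the illustrative message primitives from the sketch above, the read path looks roughly like this; the endpoint and message-type constants are invented for the example, not the real MINIX 3 values.

    /* Hypothetical user-side read(): ask the file server, which in turn
     * asks the disk driver; the kernel copies the data into our space. */
    #include <sys/types.h>

    #define FS_PROC   1     /* endpoint of the file server (illustrative) */
    #define REQ_READ  117   /* message type for a read request            */

    ssize_t my_read(int fd, void *buf, size_t nbytes)
    {
        message m;                        /* from the earlier sketch */
        m.m_type       = REQ_READ;
        m.m_payload[0] = fd;
        m.m_payload[1] = (long) buf;
        m.m_payload[2] = (long) nbytes;

        /* Block until the file server has fetched the blocks via the
         * disk driver and the kernel has copied them into buf.        */
        if (sendrec(FS_PROC, &m) != 0)
            return -1;
        return (ssize_t) m.m_payload[0];  /* bytes read, or an error code */
    }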
In addition to these servers, there is another server called the reincarnation server. The reincarnation server is the parent of all the driver and server processes and monitors their behavior. If it discovers a process that is not responding to its pings, it starts a fresh copy from disk (except for the disk driver, which is shadowed in RAM). The system has been designed so that many (but not all) of the critical drivers and servers can be replaced automatically, while the system is operating, without disturbing running user processes or even notifying the user. In this way, the system is self-healing.
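Conceptually, the reincarnation server's job is a simple supervision loop. The sketch below only conveys the idea; ping(), is_alive(), and restart_from_disk() are hypothetical helpers, not the actual MINIX 3 interfaces.

    /* Illustrative supervision loop in the style of the reincarnation
     * server.  All helper functions here are invented for the sketch.  */
    #include <unistd.h>

    #define NR_MONITORED 16

    extern void ping(int endpoint);              /* ask a component to reply */
    extern int  is_alive(int endpoint);          /* did it reply in time?    */
    extern void restart_from_disk(int endpoint); /* start a fresh copy       */

    void reincarnation_loop(int monitored[NR_MONITORED])
    {
        for (;;) {
            for (int i = 0; i < NR_MONITORED; i++) {
                ping(monitored[i]);
                if (!is_alive(monitored[i])) {
                    /* Replace the dead driver or server transparently;
                     * user processes keep running, unaware of the swap. */
                    restart_from_disk(monitored[i]);
                }
            }
            sleep(1);   /* check again on the next tick */
        }
    }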
To test whether these ideas work in practice, we ran the following experiment. We started a fault-injection process that overwrote 100 machine instructions in the running binary of the Ethernet driver to see what would happen if one of them were executed. If nothing happened for a few seconds, another 100 were injected, and so on. In all, we injected 800,000 faults into each of three different Ethernet drivers and caused 18,000 driver crashes. In all cases, the driver was replaced automatically by the reincarnation server. Despite injecting 2.4 million faults into the system, not once did the operating system crash. Needless to say, if a fatal error occurs in a Linux or Windows driver running in the kernel, the entire operating system will crash instantly.
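The fault injector itself is conceptually tiny: pick random locations in the driver's text segment and overwrite them with garbage. The sketch below shows the idea only; write_driver_text() is a hypothetical privileged helper, and the real experiment used MINIX-specific tooling.

    /* Illustrative fault injector: corrupt 'count' instruction bytes at
     * random offsets in a running driver's text segment.                */
    #include <stdint.h>
    #include <stdlib.h>

    extern int write_driver_text(int driver, uintptr_t offset, uint8_t byte);

    void inject_faults(int driver, size_t text_size, int count)
    {
        for (int i = 0; i < count; i++) {
            uintptr_t offset  = (uintptr_t)(rand() % text_size);
            uint8_t   garbage = (uint8_t)(rand() & 0xff);
            /* If the driver ever executes the corrupted instruction, the
             * driver may crash; the kernel and the rest of the system
             * should not.                                                */
            write_driver_text(driver, offset, garbage);
        }
    }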
Is there a downside to this approach? Yes. There is a performance hit. We have not measured it extensively, but the research group at Karlsruhe, which developed its own microkernel, L4, and then ran Linux on top of it as a single user process, has gotten the performance hit down to 5%. We believe that, if we put some effort into it, we could get the overhead down to the 5-10% range, too. Performance has not been a priority for us, as most users reading their e-mail or looking at Facebook pages are not limited by CPU performance. What they do want, however, is a system that just works all the time.
If Microkernels Are So Dependable, Why Doesn't Anyone Use Them?
Actually, they do. Probably you run many of them. Your mobile phone, for example, is a small but otherwise normal computer, and there is a good chance it runs L4 or Symbian, another microkernel. Cisco's high-performance router uses one. In the military and aerospace markets, where dependability is paramount, Green Hills Integrity, another microkernel, is widely used. PikeOS and QNX are also microkernels widely used in industrial and embedded systems. In other words, when it really matters that the system "just works all the time" people turn to microkernels. For more on this topic, see www.cs.vu.nl/~ast/reliable-os/.
In conclusion, it is our belief, based on many conversations with nontechnical users, that what they want above all else is a system that works perfectly all the time. They have a very low tolerance for unreliable systems but currently have no choice. We believe that microkernel-based systems can lead the way to more dependable systems.
What breaks and what doesn't.
I do value your opinion, and I would use a microkernel if I only had to support one device. However, I need to support lots of different desktop and laptop PCs, and hardware support is my immediate concern for my OS, so I chose Linux, with built-in drivers instead of modules.
My attempt at this and progress is at http://sechoes.homelinux.com/
Using my way, there is less software to cause a crash, and thus in my opinion less crashing.
Dan Dart
Self-healing is good, but...
The GNU-based microkernel operating system
The GNU Hurd is a collection of servers that run on the Mach microkernel; it is the GNU project's replacement for the Unix kernel.
http://www.gnu.org/software/hurd/
The stability of a microkernel is not disputed.
The simple fact is that, due to the way a microkernel operates, it is one of the most stable designs out there. Even Linus does not dispute that. What Linus disputes is the amount of performance you have to give up to use a microkernel design.
Performance vs. stability: the problem is picking the right point. The problem with Minix is that it goes after 100 percent stability no matter the performance cost.
Linux and most other monolithic kernels went after higher performance and paid the price of kernel crashes when drivers are wrong.
The problem is that microkernel stability only extends as far as kernel operations. Since drivers in a microkernel run in user space, a malfunctioning driver cannot cause a kernel crash, so the kernel can recover.
There is a big problem here, though. If I am an end user and my download or application crashes because a driver goes down, then even if the system does not crash, I will still hate the OS for the lost data.
All the arguments over microkernels vs. monolithic kernels forget the most important thing: the user. If the user's data is hurt, it really does not matter whether the OS crashes or not. A lot of work is needed on how to make applications recoverable from crashes, no matter what.
This is a little off the subject, but...
Who is this "we"? Does Andrew Tanenbaum have an Evil (or perhaps Good) Twin?
(I WORK in academia, and I still think this "'we' meaning 'I'" thing is a seriously obnoxious grammatical quirk of academics.)
MINIX3 distribution?
The point is, microkernels might be great in small (functionality-wise) environments. But like a lot of new things that come out claiming "we are small and fast and reliable," they tend to become big and slow as functionality is added.
I am willing to test Mr. Tanenbaum's claims (which he has been making for the last few decades) on my desktop, as long as I can e-mail, surf the web, run OOo and develop web applications.
uKernels might be great, but Linux has been delivering for more than 10 years now.
Or to put it differently: microkernels sound nice, but repeat after me "show me the desktop! Show me the desktop!" (thanks to the movie Jerry Maguire).
It's always more complex than that.
The NT design started off as a microkernel and ended up a hybrid for speed reasons. There is a balance between stability and speed.
Linux is currently going through a few interesting alterations. Ksplice is getting to the point where it can replace every single part of the kernel while it is running. So as long as the kernel has not crashed yet, any defect that is there can be removed. This really does put the Linux kernel slightly ahead of Minix. Yes, Minix has a small code base, but can it replace that code on the fly if it has to?
CUSE and FUSD are out to provide user-space drivers for Linux, like Minix uses. Older, unsupported drivers, where most of the defects in the Linux kernel turn up, are going to be migrated to user space so their stability issues are contained there.
Really, the best solution is a merge of the three existing operating system designs: monolithic like Linux, microkernel like Minix, and hybrid like Windows.
Users could then choose what they want. It's a bit like a high-speed BIOS option: it does not check your hardware completely, so it boots faster but is more likely to crash, whereas the turtle option is slower and more stable.
There is no generic answer for everyone; that is where Mr. Tanenbaum and many others hit the wall.
Until an OS provides that generic answer, we will be stuck without the options to set our own fate. If code can never get to zero errors, then a microkernel can never be perfect either.
No common OS out there supports updating all of its running code on the fly yet. Until that day, users will always have to restart their machines due to defects or run the risk of a virus strike bringing their systems down.
Mr. Tanenbaum is also forgetting that the OS does not have to crash for the user to be hurt. Only the application holding key information has to crash, with no form of recovery.
It's time to forget about the microkernel/monolithic arguments and move on to how we can make application crashes reversible, and the defects that cause them removable, without the user ever needing to find out.
Reliability
Odd definition of "failure"
Fact is, bugs are bugs whether they live in the kernel or in userspace. Users are users, and they don't even know what a kernel is unless it is popcorn. They do know what userspace is; that's where they sit.
The only time u-kernels are helpful to reliability is in multitasking systems where a task failure can be tolerated but a total system crash can not. These are few and far between. I note that there is no mention in the article of lost or garbled data or response latency after Dr T blew up his Ethernet driver. Again, do the airliner passengers care if the autopilot flies them into the ground because of an OS crash that caused garbled commands to be sent to the flight surface actuators or because of a process crash that caused garbled commands to be sent to the flight surface actuators? No. Neither is tolerable so you have to TEST, FIND the bug, and then FIX the bug. What is helping military and aerospace systems be reliable is engineering discipline that is applied with an iron fist and with relatively little concern for COST.
P.S. Speaking of TVs that DO NOT work perfectly, I have a DVR/cable box distributed by Comcast/Time Warner with Motorola's name on it that I'd like to demonstrate to you sometime. I'm certain that if I had written that code and said it was good to go, I'd have been fired for sure.
Andy forgot to mention a few things...
The other one is what's actually the killer argument for microkernels: they are small enough that you can mathematically prove them correct -- the ultimate assurance of reliability, safety and security. Google "L4.verified".
True But....
Computer users don't value reliability.
Reliability doesn't sell software either; network effects sell software.
If reliability were the goal, there is a great deal of low-hanging fruit which should be picked before the question of microkernel vs. monolithic kernel makes a substantial difference. Consider Windows, which on seeing a USB device plugged in automatically installs and runs software from it. How can genuine bugs, which developers really do work hard to avoid and fix, compare with such an enormous and obvious security hole, one that is actively exploited by organized criminals the world over? Consider the way the vast majority of non-free software competes to auto-run, auto-update, run background processes, and fight for ownership of file types and space on the screen. How reliable can a computer be when it is in effect a battlefield for software competing to use people's computers for sales opportunities or darker purposes?
Compartmentalization is a wonderful way to improve reliability. Far more profit is to be had in compartmentalizing applications, everything that is already outside the kernel, and beating down the core war inside our computers. Even this is hard, because the software vendors fight it. They want their programs to be able to take over instead of living in safe, protected, and of course enclosed little jails. The challenge to reliability is far more than mere bugs; it is different people's competing interests.
I'm glad there are people who care about microkernels and investigate ways to make software more reliable by confining potential bugs to specific components. Perhaps good things will come of this, though I do wonder whether the degree of sharing inherent in what we want the kernel to do makes it impractical. That judgment belongs to people who know the technical issues better than I do. What I am confident of is that the unreliability of the computers we use is dominated by issues more mundane and immediate than microkernel vs. monolithic. If the people signing the checks actually cared, things might improve.
I agree
Does Minix3 support Windows and Linux devices?
If it did, it could rule.