Virtualization Sneak Peek
Core Technology
Ever wondered what's happening inside a virtual machine? Join us for an exciting tour into the deep waters of virtualization.
Today most of us enjoy the benefits of virtualization. With hardware support that finally came to x86 a decade ago and powerful open source hypervisors (also called virtual machine managers), such as VirtualBox or KVM, it's pretty straightforward to run Windows alongside Linux or share a single physical server between a dozen tenants. In the early 2000s, to give a new Linux distro a try, you'd install it on a separate hard disk or perhaps run it directly from the CD. Now, you can just boot a virtual machine (VM) from the downloaded ISO image and have fun while reading your friends' tweets.
Yet virtualization internals are still a hairy topic that you may not have a complete picture of. In Linux, Qemu/KVM is the de facto standard tool – or rather tools (Figure 1). These two have quite different histories but are now seen together more often than not. What's the reason? How does Qemu interact with KVM (and vice versa) to keep your VMs running? In this Core Tech, we'll build a big picture of how an x86 hypervisor operates internally. Qemu, however, is a large and complex project, so we'll dissect a lighter alternative instead: kvmtool. It is simpler yet still real-world and functional, so you may find it useful in your own virtualization scenarios.
A Need for Qemu (or Something)
KVM stands for Kernel-based Virtual Machine, so why is there a userspace part in the first place? To answer this question, we need to learn a little bit about how virtualization works in x86.
To virtualize a CPU, you need a mechanism to intercept so-called "control-sensitive instructions," which may affect other VMs running on the same host. You don't want the guest to access arbitrary I/O ports (it could reboot the host that way) or read arbitrary memory pages, for obvious reasons. Historically, there was no easy way to do this in x86, but things changed around 2006. At that time, both Intel and AMD announced virtualization extensions to their instruction sets, known as VMX (marketed as VT-x) and SVM (AMD-V), respectively. Although technically incompatible (KVM supports both), they are very similar in spirit.
Before this change, x86 CPUs had four privilege rings. Operating systems such as Linux or Windows use only two: Ring 0 to run the kernel and Ring 3 for userspace code. Hardware-assisted virtualization adds another dimension: host (sometimes called "root") mode and guest (non-root) mode. Any instruction that may affect other guests (even if it is not privileged) causes an exit from guest mode to host mode, often called a "VM exit" or "the world switch." This way, a hypervisor can always evaluate the instruction and execute it or inject a fault into the guest. VM exits are expensive in terms of performance, so good hypervisors try to keep their number at a minimum.
The hypervisor's main loop is as follows. First, the hypervisor sets up control structures that tell the CPU which events to trap. These structures also store the current guest state, such as CPU registers. Then, the hypervisor executes a special machine instruction to switch into guest mode. This mode lasts until some event, such as an interrupt, switches control back to the hypervisor. The hypervisor then analyzes the exit reason, modifies the control structures to reflect changes to the guest state, and resumes the guest. That's basically what the KVM kernel module does.
However, Linux never executes guest code: It runs processes. Moreover, you need a way to launch new guests, specifying where their disk images are, how much memory they have, and so on. There is also device emulation: When a guest touches an I/O port that belongs to, say, a PS/2 controller, something should read the register and act accordingly.
The Qemu userspace process handles these tasks. It's the entity that holds the guest code at the OS level: When Linux chooses to execute the Qemu process, whatever guest you launched (maybe Windows or Mac OS X) really runs. This means KVM reuses the Linux scheduler, living up to its "Kernel-based Virtual Machine" name. When the guest touches an I/O port, the KVM kernel module forwards the request to the Qemu process to emulate. All of this works on top of the ioctl(2) interface, which is what we are going to examine in a moment (Figure 1).
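To make this handoff concrete, the following minimal sketch (not Qemu or kvmtool code, just an illustration built on the public KVM API from <linux/kvm.h>) shows what a userspace hypervisor sees: the KVM_RUN ioctl blocks while the guest executes in guest mode, and when the guest touches an I/O port, the call returns with the exit details filled into a shared kvm_run block (mapped as described later in this article) for userspace to emulate. The serve_guest() helper is hypothetical.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>

/* vcpu_fd comes from the KVM_CREATE_VCPU ioctl; run points to the
 * mmap'ed struct kvm_run of that vCPU (both covered later). */
static void serve_guest(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);  /* the guest runs until the next VM exit */

        if (run->exit_reason == KVM_EXIT_IO &&
            run->io.direction == KVM_EXIT_IO_OUT) {
            /* Data the guest wrote sits inside the shared kvm_run block */
            uint8_t *data = (uint8_t *)run + run->io.data_offset;
            printf("guest wrote 0x%x to port 0x%x\n", *data, run->io.port);
            /* A full hypervisor would dispatch to a device model here */
        }
    }
}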
Meet kvmtool
Qemu is a natural choice for a Linux-based hypervisor's userspace component. It can already run unmodified guest OSs, so all the hairy stuff the hypervisor needs to emulate is already there. Put simply, you just hook in where Qemu is about to execute a CPU instruction and call KVM instead.
The reality is of course much more complex, and Qemu is a complex piece of software, too. But if you agree not to run anything beyond Linux kernels preconfigured for virtual environments, much of this complexity goes away. Most importantly, you want guest kernels to use virtualized I/O devices instead of emulated hard disks or network cards. Emulating peripherals is hairy and slow; virtio [1] devices, by contrast, are essentially thin wrappers around ring buffers. This makes them faster to run and much simpler to implement.
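To see just how thin that wrapper is, here is the layout of one descriptor table entry of a virtqueue, as defined by the virtio specification (the split ring layout; the struct mirrors what <linux/virtio_ring.h> calls vring_desc). The guest fills these entries with pointers into its own memory, and the host-side device model merely walks them:

#include <stdint.h>

/* One virtqueue descriptor: a pointer to a guest buffer plus metadata.
 * A device model reads or fills the buffer a descriptor points to --
 * no registers, timing quirks, or other legacy hardware behavior. */
struct vring_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;  /* e.g., VRING_DESC_F_NEXT chains descriptors */
    uint16_t next;   /* index of the next descriptor in the chain */
};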
Kvmtool [2] is such a lightweight, Linux-native KVM tool. It supports Linux guests only, and they must be compiled for the same architecture as the host (so no ARM on x86-64 this time). It emulates a bare minimum of legacy devices (including a real-time clock, a serial port, and a keyboard controller), and that's it. Surprisingly enough, that's the configuration in which the majority of us run Qemu/KVM anyway.
Born as a hobby tool and an experiment, kvmtool is now slowly being adopted in production. For example, rkt, the CoreOS application container engine, includes an experimental kvmtool-based stage1 as an alternative to the traditional cgroups/namespaces-based approach [3]. Qemu-less KVM is in fact not exotic: Google also uses a homegrown tool (albeit not kvmtool) in Google Compute Engine (Figure 2) for security reasons [4].
The simplest way to get kvmtool is to clone the official Git repository:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git
Note that although it lives at kernel.org, kvmtool is not part of the official Linux kernel (and Linus Torvalds has repeatedly rejected the idea of merging it).
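Building kvmtool is a matter of running make in the checked-out directory, which produces the lkvm binary. A typical invocation then looks something like the lines below; treat the exact option names as an approximation and check lkvm help on your version:

make
./lkvm run -k /path/to/bzImage -d /path/to/rootfs.img -m 512 -c 2

Here, -k points to the guest kernel image, -d to a disk image, -m sets the guest RAM in megabytes, and -c the number of virtual CPUs.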
Today's kvmtool is somewhat larger than the initial prototype, which spanned around 5K lines of C code. It's still clean, well structured, and easy to understand, which supports its role as a learning tool. Let's follow the code path triggered when you run a VM and see what exactly the userspace part does and how it communicates with the kernel KVM module.
Under the Hood
The bottom-up approach seems a natural choice for this task. To run the VM, you issue a run command (see below). Kvmtool implements it in builtin-run.c. A great deal of this file parses command-line options and prepares the VM configuration, such as the guest RAM size; see the kvm_cmd_run_init() function. As part of this initialization, the kvm__init() function is called.
It begins by opening the /dev/kvm device file. This serves as a gateway between the userspace tool (be it kvmtool, Qemu, or anything else) and the KVM kernel part. Ioctls are used as the communication mechanism, and you see a couple of them right away in Listing 1.
Listing 1
kvm__init() Function Snippet
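The listing boils down to two ioctls against the freshly opened /dev/kvm file descriptor. The snippet below is a sketch along those lines rather than the listing verbatim, so field and helper names are approximations of kvmtool's own:

kvm->sys_fd = open("/dev/kvm", O_RDWR);

/* Refuse to run if the kernel speaks a different KVM API revision */
ret = ioctl(kvm->sys_fd, KVM_GET_API_VERSION, 0);
if (ret != KVM_API_VERSION)
    die("KVM API version mismatch");

/* Ask KVM to create a new, empty virtual machine; the returned file
 * descriptor represents this VM in all subsequent ioctls */
kvm->vm_fd = ioctl(kvm->sys_fd, KVM_CREATE_VM, 0);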
There, they check if the KVM API the kernel speaks is supported and create a VM for us. The kvm__init() function then triggers some architecture-specific initialization and creates the guest's RAM. Finally, it loads the kernel image into the guest memory. On bare metal, a bootloader such as GRUB does this, and you can see kvmtool emulating the boot protocol for bzImage kernels in the load_bzimage() function in x86/kvm.c. Note that load_bzimage() adjusts the instruction pointer (CS:IP) to point just where the real-mode initialization code is in the Linux kernel:

kvm->arch.boot_selector = BOOT_LOADER_SELECTOR;
kvm->arch.boot_ip = BOOT_LOADER_IP + 0x200;
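In real mode, the physical address is computed as (segment << 4) + offset, so these two values pin down exactly where the guest starts executing. Assuming the usual kvmtool values of 0x1000 for BOOT_LOADER_SELECTOR and 0x0000 for BOOT_LOADER_IP (an assumption, not quoted from the source), that works out as follows:

/* (0x1000 << 4) + 0x0200 = 0x10200: just past the 512-byte legacy
 * boot sector, where the bzImage real-mode setup code expects to be
 * entered. */
uint32_t entry = ((uint32_t)kvm->arch.boot_selector << 4) + kvm->arch.boot_ip;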
The next initialization function we encounter is kvm_cpu__init(). It creates and initializes all the virtual CPUs (vCPUs) the guest runs on. You can pass the exact number as a command-line parameter (see below); otherwise, KVM supplies a sane default. The function issues KVM_CREATE_VCPU to allocate and initialize the KVM vCPU kernel structures. Then, it maps a read-write memory block backed by the vCPU file descriptor. The size of this block is determined with the KVM_GET_VCPU_MMAP_SIZE ioctl, and the KVM_RUN ioctl uses it later to exchange data with userspace.
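Condensed into a sketch (error handling and kvmtool's bookkeeping stripped, names approximated), that sequence looks roughly like this:

/* Create the vCPU; the returned fd identifies it to KVM from now on */
vcpu->vcpu_fd = ioctl(kvm->vm_fd, KVM_CREATE_VCPU, cpu_id);

/* Ask how large the shared communication block has to be ... */
mmap_size = ioctl(kvm->sys_fd, KVM_GET_VCPU_MMAP_SIZE, 0);

/* ... and map it; KVM_RUN reports VM exit details through this block */
vcpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, vcpu->vcpu_fd, 0);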
Now, back to builtin-run.c. There, kvm_cmd_run_work() comes into play. It creates a thread per vCPU, with kvm_cpu_thread() as the thread function. The latter is a thin wrapper on top of kvm_cpu__start(), which implements the KVM "main loop" we discussed.
The first thing kvm_cpu__start() does is, unsurprisingly, reset the vCPU. On x86, the majority of registers get all-zeros default values. The instruction pointer and the stack pointer are notable exceptions: They get their "boot values" from kvm->arch.boot_*. Additionally, kvm_cpu__start() sets Unix signal handlers for the vCPU thread. Kvmtool employs real-time Unix signals for VM life-cycle management and SIGUSR1 for debugging, as shown below. The KVM "main loop" (heavily trimmed) is shown in Listing 2.
Listing 2
KVM Main Loop (Edited)
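In outline, the loop has the following shape (a condensed sketch rather than the listing verbatim; kvm_cpu__run() and the KVM_EXIT_* constants are real, while wait_until_resumed() and emulate_io() are hypothetical stand-ins for the code discussed below):

while (cpu->is_running) {
    if (cpu->paused)
        wait_until_resumed(cpu);        /* hypothetical: block until the flag clears */

    kvm_cpu__run(cpu);                  /* issues the KVM_RUN ioctl */

    switch (cpu->kvm_run->exit_reason) {
    case KVM_EXIT_DEBUG:                /* breakpoint or single-step event */
        break;
    case KVM_EXIT_IO:                   /* the guest touched an I/O port */
        emulate_io(cpu);                /* hypothetical stand-in for the hw/ dispatch */
        break;
    case KVM_EXIT_INTR:                 /* a pending signal interrupted KVM_RUN */
        break;
    case KVM_EXIT_SHUTDOWN:             /* the guest is shutting down */
        goto exit_kvm;
    }
}
exit_kvm:
    ;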
First, the loop checks cpu->paused and a few other flags. These are set in the signal handler in response to life-cycle management events. Next, it calls kvm_cpu__run(), which translates to the KVM_RUN ioctl. This is where the magic happens: In the kernel, KVM switches to guest mode and keeps executing as the guest until the next VM exit. When that happens, KVM fills the relevant parts of the mapped memory block and returns from the ioctl.
The large switch that follows handles VM exits. An exit could be due to a debug-related event, such as a breakpoint, or because of I/O. In the latter case, kvmtool analyzes the memory block contents to determine which port or memory address was accessed and whether it was a read or a write. Then it calls into various parts of the device emulation code (see hw/ in the sources). KVM_EXIT_INTR indicates that there was a signal pending; KVM_EXIT_SHUTDOWN means that the guest is shutting down and we want to break the main loop. As for system events, kvmtool implements reboots, so the last case is a virtual equivalent of a reset button.
Now you understand what KVM does to run your guest; it's time to see it in action.