A deep look at how Linux uses virtual memory
Don't Forget
Virtual memory makes your system safer and more efficient. But what is it really? We take a look inside this powerful feature that is built into Linux.
Like other modern operating systems, Linux is multitasking, meaning that it can manage multiple running processes at the same time. With that great capability, though, comes great responsibility. Linux must ensure that no process can meddle with the memory used by another process. Historically, a vast number of security vulnerabilities were caused by malicious code being executed from a memory area that was intended for ordinary data storage only and not for executable code. The operating system also must ensure that enough memory is available for the needs of all running processes and must take steps to make memory available if there is not enough. All these responsibilities must be fulfilled as quickly as possible; otherwise, performance will suffer.
Fortunately, Linux provides a way to manage the memory resources for many disparate processes simultaneously: virtual memory [1]. Essentially, when a process refers to a memory address, it does not refer directly to a physical memory location. Instead, the memory address is used as an index into one or more tables, which are then used to translate the memory address into a real, physical memory address.
The concept of virtual memory is so well established that modern computer hardware almost always has at least some basic facilities built-in to make virtual memory management easier for the operating system. However, the exact details of how virtual memory is implemented vary from one hardware platform to another. Most of the concepts outlined in this article apply equally to any platform, but I have chosen to use the 64-bit x86 (PC) architecture as the basis for examples.
This article describes how Linux implements virtual memory by inspecting the detailed virtual memory map of a very simple Linux program. This example program is a bit contrived – all it does is wait around, consuming resources until it is terminated – but it will demonstrate the important features of virtual memory.
General Concepts of Virtual Memory
All computer systems have some amount of relatively fast temporary storage, known as the computer's random-access memory (RAM), physical memory, or just memory for short. As far as the computer is concerned, the memory is a large space divided into a number of memory cells. Usually the smallest unit of storage that can be addressed is eight bits (a byte), though you can usually access data in blocks of several consecutive bytes.
Without virtual memory, a program accesses memory by addressing one or more memory cells, using a number that corresponds directly to the address of that memory cell. The numbers are almost always assigned sequentially, starting from 0; memory address 0 refers to the first byte of memory, and address 19 refers to the 20th byte. Accessing memory is thus a very quick and direct operation.
The main problem with such a scheme is that it is often too direct. If multiple programs are running on the same computer, nothing stops one program from accidentally altering the memory used by another program. Worse, one or more of the programs may be malicious or may have been co-opted by malicious code, and nothing stops such a program from reading the memory used by another program. Under such circumstances, a seemingly useful utility program you downloaded off the Internet could read, for instance, your private financial information as stored for online banking purposes.
Physical memory has other limitations. The programs you run cannot use more memory than is available in the computer system, even when not all of those programs are currently in use. Actually, it is possible for the programs themselves to write any memory they are currently not using to permanent storage and then free the memory for other uses. Such a procedure is known as swapping, and modern operating systems often perform swapping when memory runs low. But again, without virtual memory, programs must perform the swapping themselves; while technically possible, it is highly error-prone, and a mistake can affect not only the program performing the swapping but potentially any other running program.
Virtual addressing simply inserts one or more extra steps into the process of addressing memory. Instead of programs addressing physical memory directly, memory addresses are interpreted as indexes into a table in memory, known as a page table, which is set up by the operating system ahead of time. Each possible virtual address has its own entry in this table, and each entry contains the physical memory address associated with the virtual address. The page table acts as sort of a "map," and each virtual address maps to a physical address, which is then used to finally access a physical memory location. Figure 1 illustrates this concept.
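To make the mapping concrete, here is a toy model of single-table translation, written as a Python sketch. The page size and the table contents are invented for illustration; note that, like real hardware, the sketch maps fixed-size blocks rather than individual bytes:

```python
# Toy single-level page table: virtual page number -> physical frame number.
# The 4KiB page size and the mappings below are invented for illustration.
PAGE_SIZE = 4096
page_table = {0: 7, 1: 3, 2: 12}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # split address into page and offset
    if vpn not in page_table:
        # No valid entry: real hardware would raise a page fault here.
        raise LookupError("page fault at virtual address %#x" % vaddr)
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1234)))  # virtual page 1 maps to frame 3 -> 0x3234
```

An access to an unmapped address (say, 0x5000) triggers the model's "page fault" instead of returning a physical address.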
If the entry corresponding to a particular virtual address lacks a corresponding physical address, a page fault occurs, and the kernel performs additional processing. The kernel may take one of several actions, from terminating the program to loading the necessary data into memory from other storage. The kernel typically terminates the program if it attempts to access memory outside of what it has been allocated: the kernel fills the page table with valid entries for the memory locations the program is allowed to access and fills all the remaining entries with invalid physical addresses, placing the rest of memory off-limits to the program.
Empty entries in the page table can also be used to implement swapping. If the amount of free memory becomes very low, the kernel can move some lesser-used data from memory to other places, usually to some form of permanent storage such as hard disk space. Then, the kernel removes the physical addresses from the corresponding entries in the page table. Now, if the program tries to access the data that has been "swapped out," a page fault occurs and the kernel is called. When the kernel inspects the details of what caused the page fault, it identifies that the data is now stored outside of memory and then fetches the data back from permanent storage into memory again, first moving other data out of memory if necessary to free up some space. In this scenario, the kernel does not terminate the program but instead restarts the program where it left off, before the page fault occurred.
Besides the fact that the memory access took an unusually long time to complete (due to the extra overhead of the page fault and the disk access), the program otherwise is completely ignorant and unaware that anything unusual happened; the details of the swapping are handled completely transparently, without the program having to worry about any complicated memory management details.
Virtual Memory in Practice
Actually, the description of virtual address translation that I just gave is a bit oversimplified. In reality, no practical computer system implements address translation with just one table. The reason is one of practicality. On the x86 platform [2], virtual addresses are 48 bits long, whereas physical addresses vary in size from one computer hardware implementation to another but generally are between 36 and 40 bits long. Memory is managed in 4,096-byte (4KiB) blocks; the least-significant 12 bits of any virtual or physical address are thus interpreted directly as an index into the selected block of memory. (These blocks of memory are known as page frames or often simply pages.) Thus, only the remaining 36 bits of a 48-bit virtual address are interpreted as a page table index.
However, even with a 36-bit index, the page table would need to be of sufficient size to include 2³⁶ entries – nearly 69 billion page table entries. On x86, each page table entry occupies 8 bytes, so the page table would have to be 512 gibibytes (GiB) in size. Most computers today do not have sufficient physical memory to store such an enormous table in memory. On top of that, each running process on Linux usually gets its own page table, which means that each running process on a Linux system would need an additional 512GiB of memory! Obviously, using just a single page table is an unworkable solution.
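The arithmetic behind those figures is quick to check:

```python
# Size of a hypothetical single-level page table on x86-64.
entries = 2 ** 36        # one entry per 4KiB page of a 48-bit address space
entry_size = 8           # bytes per page table entry
total = entries * entry_size
print(total // 2 ** 30, "GiB")  # -> 512 GiB
```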
Instead, on the x86 architecture, the most-significant 36 bits of a virtual address are split into four parts. The most-significant 9-bit chunk is used as an index into a first-level page table, consisting of just 512 (2⁹) entries. But instead of the selected first-level page table entry pointing directly to the page frame containing the requested data, the page table entry points to the physical address of a second-level page table. The second-most-significant 9-bit chunk of the virtual address is then used as an index into this second-level page table, which in turn points to a third-level page table, and so on for the fourth-level page table. Finally, the selected entry in the fourth-level page table is what points to the page frame containing the requested data. Figure 2 illustrates this concept of multiple levels of page tables.
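The bit-slicing just described can be sketched in a few lines of Python; the example address below is constructed arbitrarily so that each index is easy to recognize:

```python
# Split a 48-bit x86-64 virtual address into four 9-bit page table
# indices (highest level first) and a 12-bit byte offset.
def split_vaddr(vaddr):
    offset = vaddr & 0xFFF                               # low 12 bits: byte within page
    indices = [(vaddr >> (12 + 9 * level)) & 0x1FF
               for level in range(4)]                    # 9 bits per table level
    return list(reversed(indices)), offset               # first-level index first

# Build an address whose four indices are 1, 2, 3, 4 and whose offset is 5.
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_vaddr(vaddr))  # -> ([1, 2, 3, 4], 5)
```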
The advantage of using multiple levels of page tables is that less memory needs to be set aside for the page tables. Most programs use nowhere near all 48 bits of the available virtual address space, which means that page table entries for the vacancies in the virtual address space are unneeded. If a selected entry in any level of page table contains an invalid physical address – that is, the entry points to neither the next level of page table nor the final page frame – a page fault is issued, giving the kernel an opportunity to either terminate the process or make the requested memory available. Entire swaths of page table entries therefore need not exist.
The Simplest Possible Example
The title of this section is no exaggeration. I want to start off by showing you the virtual address map of a real, running program. However, almost all existing programs on Linux (and other operating systems) contain excess code and data that they do not really need in order to function correctly.
For this reason, I have chosen to write an extremely simple program for the purposes of demonstrating virtual memory concepts. This program does nothing more than pause, doing nothing until terminated by pressing Ctrl+C or using the kill command at the command line. To reduce its memory and storage footprint, I have opted not to write it in a high-level language such as C, C++, Python, or Rust; I wrote it instead in x86 assembly language, which is essentially a marginally more convenient representation of the machine language that the computer actually executes.
Listing 1 provides the code, which I saved into a file called pause.s; you may call it whatever you want, so long as the filename ends in the .s suffix. (See the "Understanding Assembly Language Code" box for more about what exactly the assembly language code means.)
Listing 1
Extremely Simple Program
.section .note.GNU-stack, "", @progbits

.text
.global _start
.type _start, @function

_start:
        movl $34, %eax
        syscall

        movl $231, %eax
        movl $0, %edi
        syscall

        ud2
Understanding Assembly Language Code
Assembly language programs are translated into machine language using a program called an assembler. But the assembler by itself does not produce a ready-to-run executable file; the assembler only produces an intermediate object file. The object file must be passed through another program called the linker to produce a final, executable program. The linker's job is to scan through one or more object files, searching for references to program functions and variables in each object file. Once the location of each so-called symbol has been identified, the linker combines the contents of each object file into one executable program and then changes all symbol references to point to the new locations of each symbol in the resulting executable.
The simple example program in this article consists of only one source code file and contains no references to external symbols whatsoever; however, the linking process is still necessary to create the executable binary file in a file format suitable for execution.
The first line of the program code is not actually assembly language code; it is merely an instruction to the assembler to ensure that, when the resulting program is executed, the program's stack is not executable. (An executable stack can make a program vulnerable to buffer overflow attacks.)
The next four lines are also assembler directives. The .text directive instructs the assembler that program code follows, as opposed to data used by the program; the remaining three lines instruct the assembler to generate a symbol for a program function and to name the function _start. The linker will look for a function by this name when the linking process is performed, as this function contains the first code to be executed when the program runs. Because this program is so simple, all of the actual code is contained within this one function.
The program code begins with an instruction to place a particular integer value into a specific register. To perform a system call, the convention on Linux is to place a number corresponding to the desired operation in the CPU register eax [3]. Linux supports dozens of different system calls, each with its own system call number [4]. If any parameters need to be passed to the system call, they go into other specific registers, starting with edi for the first parameter. The system call itself can be invoked using one of several methods; I have chosen the syscall instruction because it is the fastest way to invoke a system call on Linux.
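You can watch the same convention from a high-level language, because the C library exports a syscall() wrapper that takes a system call number and its arguments. This Python sketch uses ctypes and assumes an x86-64 Linux system; it uses getpid (system call number 39) rather than pause, since pause would block indefinitely:

```python
import ctypes
import os

libc = ctypes.CDLL(None)   # load the C library of the running process

SYS_getpid = 39            # x86-64 system call number for getpid
pid = libc.syscall(SYS_getpid)

print(pid == os.getpid())  # -> True
```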
The first group of program code consists of two lines. The first (movl $34, %eax) loads the number associated with the pause system call into the eax register. The pause system call simply instructs the kernel to put the program to sleep indefinitely; the program will not continue until terminated (such as by using the kill command) or until it receives a signal that it has chosen ahead of time to handle. For our purposes, the only way the program can end is if it is terminated explicitly with the kill command. The pause system call accepts no parameters, so there is no need to fill any other registers with specific values; all that remains is to execute the syscall instruction to invoke pause.
To be on the safe side, in the event that the pause system call does return, the next code invokes system call number 231, which is the exit_group system call; it cleanly ends the program. The exit_group system call requires one parameter: the status code to return to the program's parent (the program that launched this one). The following code (movl $0, %edi) sets this parameter to zero, so the program returns a status code of zero, indicating a successful exit.
The final line of code (ud2) is another formality that is nevertheless good practice. Most CPUs today look ahead and start executing code even before the prior program instructions have completed. With a program as short as this one, there is a small chance that the CPU will attempt to execute random junk "code" after the end of the real program instructions. The ud2 opcode is guaranteed never to be assigned to any actual instruction; when an x86 CPU encounters it, the CPU immediately stops executing the program code by raising an invalid-opcode exception. Including this line ensures that nothing past the end of the program is ever executed.
Before you can run the program, you must first assemble and link it; in other words, it must first be converted into the machine code that the computer can execute. Most high-level languages do this for you behind the scenes when you compile or run your programs, but with assembly language you must launch these tasks manually. The following two commands will do the trick:
as pause.s -o pause.o
ld pause.o -o pause
The previous two commands will produce two new files – one called pause.o, the other simply pause. The pause file is the final, executable program. I recommend that when you run it, you run it in the background by appending an ampersand character (&) to the program's command line. By backgrounding the program, you keep your shell open for further commands to inspect the program, and the shell will also print the program's process ID:
$ ./pause &
[1] 20397
The second number printed (the one outside the square brackets) is the process ID of the pause program. The number will probably be different on your system; make sure to note down the process ID, because it will come in very handy for subsequent commands.
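The virtual memory map itself is exposed by the kernel through the /proc filesystem, as /proc/[pid]/maps. As a preview of the format (assuming a Linux system), this Python snippet prints the first few lines of the current process's own map:

```python
# Each line of /proc/self/maps describes one mapped region of the
# process: address range, permissions, file offset, device, inode, path.
with open("/proc/self/maps") as f:
    lines = f.read().splitlines()

for line in lines[:5]:
    print(line)
```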
Once you have started the program, it will continue to run, sitting in the background doing nothing. When you are done with the program, you can terminate it with the kill command:
kill [pid]
Replace [pid] with the process ID you wrote down earlier.