A deep look at how Linux uses virtual memory

Don't Forget

© Lead image © lukjonis, 123rf.com

© Lead image © lukjonis, 123rf.com

Article from Issue 281/2024

Virtual memory makes your system safer and more efficient. But what is it really? We take a look inside this powerful feature that is built into Linux.

Like other modern operating systems, Linux is multitasking, meaning that it can manage multiple running processes at the same time. With that great capability, though, comes great responsibility. Linux must ensure that no process can meddle with the memory used by another process. Historically, a vast number of security vulnerabilities were caused by malicious code being executed from a memory area that was intended for ordinary data storage only and not for executable code. The operating system also must ensure that enough memory is available for the needs of all running processes and must take steps to make memory available if there is not enough. All these responsibilities must be fulfilled as quickly as possible, or otherwise performance will suffer.

Fortunately, Linux provides a way to manage the memory resources for many disparate processes simultaneously: virtual memory [1]. Essentially, when a process refers to a memory address, it does not refer directly to a physical memory location. Instead, the memory address is used as an index into one or more tables, which are then used to translate the memory address into a real, physical memory address.

The concept of virtual memory is so well established that modern computer hardware almost always has at least some basic facilities built-in to make virtual memory management easier for the operating system. However, the exact details of how virtual memory is implemented varies from one hardware platform to another. Most of the concepts outlined in this article apply equally to any platform, but I have chosen to use the 64-bit x86 (PC) architecture as the basis for examples.

This article describes how Linux implements virtual memory by inspecting the detailed virtual memory map of a very simple Linux program. This example program is a bit contrived – all it does is wait around, consuming resources until it is terminated – but it will demonstrate the important features of virtual memory.

General Concepts of Virtual Memory

All computer systems have some amount of relatively fast temporary storage, known as the computer's random-access memory (RAM), physical memory, or just memory for short. As far as the computer is concerned, the memory is a large space divided into a number of memory cells. Usually the smallest unit of storage that can be addressed is eight bits (a byte), though usually memory you can access data in blocks of several consecutive bytes.

Without virtual memory, a program accesses memory by addressing one or more memory cells, using a number that corresponds directly to the address of that memory cell. The numbers are almost always assigned sequentially, starting from  ; memory address   refers to the first byte of memory, and 19 refers to the 20th byte. Accessing memory is thus a very quick and direct operation.

The main problem with such a scheme is that it is often too direct. If multiple programs are running on the same computer, nothing stops one program from accidentally altering the memory used by another program. Worse, one or more of the programs may be malicious or may have been co-opted by malicious code, and nothing stops such a program from reading the memory used by another program. Under such circumstances, a seemingly useful utility program you downloaded off the Internet could read, for instance, your private financial information as stored for online banking purposes.

Physical memory has other limitations. The programs you run cannot use more memory than is available in the computer system, even when not all of those programs are currently in use. Actually, it is possible for the programs themselves to write any memory they are currently not using to permanent storage and then free the memory for other uses. Such a procedure is known as swapping, and modern operating systems often perform swapping when memory runs low. But again, without virtual memory, programs must perform the swapping themselves; while technically possible, it is highly error-prone, and a mistake can affect not only the program performing the swapping but potentially any other running program.

Virtual addressing simply inserts one or more extra steps into the process of addressing memory. Instead of programs addressing physical memory directly, memory addresses are interpreted as indexes into a table in memory, known as a page table, which is set up by the operating system ahead of time. Each possible virtual address has its own entry in this table, and each entry contains the physical memory address associated with the virtual address. The page table acts as sort of a "map," and each virtual address maps to a physical address, which is then used to finally access a physical memory location. Figure 1 illustrates this concept.

Figure 1: A virtual address is translated into a physical address by interpreting the most-significant bits of the virtual address as an index into an in-memory table. Each entry in the table contains the physical address corresponding to the indexed virtual address.

If the entry corresponding to a particular virtual address lacks a corresponding physical address, a page fault occurs, and the kernel performs additional processing. The kernel may take one of several actions, from terminating the program to loading the necessary data into memory from other storage. The program typically terminates if the program attempts to access memory outside of what it has been allocated; the kernel can fill the page table with valid entries corresponding to the memory locations the kernel wants to allow the program to access and then fills all the remaining entries with invalid physical addresses to place the rest of memory off-limits to the program.

Empty entries in the page table can also be used to implement swapping. If the amount of free memory becomes very low, the kernel can move some lesser-used data from memory to other places, usually to some form of permanent storage such as hard disk space. Then, the kernel removes the physical addresses from the corresponding entries in the page table. Now, if the program tries to access the data that has been "swapped out," a page fault occurs and the kernel is called. When the kernel inspects the details of what caused the page fault, it identifies that the data is now stored outside of memory and then fetches the data back from permanent storage into memory again, first moving other data out of memory if necessary to free up some space. In this scenario, the kernel does not terminate the program but instead restarts the program where it left off, before the page fault occurred.

Besides the fact that the memory access took an unusually long time to complete (due to the extra overhead of the page fault and the disk access), the program otherwise is completely ignorant and unaware that anything unusual happened; the details of the swapping are handled completely transparently, without the program having to worry about any complicated memory management details.

Virtual Memory in Practice

Actually, the description of virtual address translation that I just gave is a bit oversimplified. In reality, no practical computer system implements address translation with just one table. The reason is one of practicality. On the x86 platform [2], virtual addresses are 48 bits long, whereas physical addresses vary in size from one computer hardware implementation to another but generally are between 36 and 40 bits long. Memory is managed in 4096 byte (4KiB) blocks at a time; the least-significant 12 bits of any virtual or physical address are thus interpreted directly as an index into the selected block of memory. (These blocks of memory are known as page frames or often simply pages.) Thus, only the remaining 36 bits of a 48-bit virtual address are interpreted as a page table index.

However, even with a 36-bit index, the page table would need to be of sufficient size to include 236 entries – about 64 billion page table entries. On x86, each page table entry occupies 8 bytes, so the page table would have to be about 512 gibibytes (GiB). Most computers today do not have sufficient physical memory to store such an enormous table in memory. On top of that, each running process on Linux usually gets its own page table, which means that each running process on a Linux system would need an additional 512 GiB of memory! Obviously, using just a single page table is an unworkable solution.

Instead, on the x86 architecture, the most-significant 36 bits of a virtual address are split into four parts. The most-significant 9-bit chunk is used as an index into a first-level page table, consisting of just 512 (29) entries. But instead of the selected first-level page table entry pointing directly to the page frame containing the requested data, the page table entry points to the physical address of a second-level page table. The second-most-significant 9-bit chunk of the virtual address is then used as an index into this second-level page table, which in turn points to a third-level page table, and so on for the fourth-level page table. Finally, the selected entry in the fourth-level page table is what points to the page frame containing the requested data. Figure 2 illustrates this concept of multiple levels of page tables.

Figure 2: The virtual address is split into four 9-bit indexes that reference a chain of four much smaller page tables; only the fourth-level (and final) page table contains entries that point to the data originally requested.

The advantage of using multiple levels of page tables is that less memory needs to be set aside for the page tables. Most programs use nowhere near all 48 bits of the available virtual address space, which means that page table entries for the vacancies in the virtual address space are unneeded. If a selected entry in any level of page table contains an invalid physical address – that is, the entry points to neither the next level of page table nor the final page frame – a page fault is issued, giving the kernel an opportunity to either terminate the process or make the requested memory available. Entire swaths of page table entries therefore need not exist.

The Simplest Possible Example

The title of this section is no exaggeration. I want to start off by showing you the virtual address map of a real, running program. However, almost all existing programs on Linux (and other operating systems) contain excess code and data that they do not really need in order to function correctly.

For this reason, I have chosen to write an extremely simple program for the purposes of demonstrating virtual memory concepts. This program does nothing more than pause, doing nothing until terminated by pressing Ctrl+C or using the kill command at the command line. To reduce its memory and storage footprint, I have opted not to write it in a high-level language such as C, C++, Python, or Rust; I wrote it instead in x86 assembly language, which is essentially a marginally more convenient representation of the machine language that the computer actually executes.

Listing 1 provides the code, which I saved into a file called pause.s; you may call it whatever you want, so long as the filename ends in the .s suffix. (See the "Understanding Assembly Language Code" box for more about what exactly the assembly language code means.)

Listing 1

Extremely Simple Program

.section .note.GNU-stack, "", @progbits
.global _start
.type   _start, @function
        movl    $34, %eax
        movl    $231, %eax
        movl    $0, %edi

Understanding Assembly Language Code

Assembly language programs are translated into machine language using a program called an assembler. But the assembler by itself does not produce a ready-to-run executable file; the assembler only produces an intermediate object file. The object file must be passed through another program called the linker to produce a final, executable program. The linker's job is to scan through one or more object files, searching for references to program functions and variables in each object file. Once the location of each so-called symbol has been identified, the linker combines the contents of each object file into one executable program and then changes all symbol references to point to the new locations of each symbol in the resulting executable.

The simple example program in this article consists of only one source code file and contains no references to external symbols whatsoever; however, the linking process is still necessary to create the executable binary file in a file format suitable for execution.

The first line of the program code is not actually assembly language code; it is merely an instruction to the assembler to ensure that, when the resulting program is executed, the program's stack is not executable. (An executable stack can make a program vulnerable to buffer overflow.)

The next four lines are also assembler directives. Respectively, the .text directive instructs the assembler that program code follows, as opposed to data used by the program; the remaining three lines instruct the assembler to generate a symbol for a program function and to name the function _start. The linker will look for a function by this name when the linking process is performed, as this function contains the first code to be executed when the program is executed. All of the actual code of the program is contained within this one function, since this program is so simple.

The program code begins with an instruction to place a particular integer value into a specific register. To perform a system call, the convention on Linux is to place a specific number corresponding to the desired operation in the CPU register eax [3]. Linux supports dozens of different system calls, each with its own system call number assigned to it [4]. If any parameters need to be passed to the system call, the parameters go into other, specific registers, starting with edi for the first parameter. The system call is actually invoked using one of several methods; I have chosen to use the syscall instruction for this purpose because it is the fastest way to invoke a system call on Linux.

The first group of program code consists of two lines of code. The first (movl $34, %eax) loads the number associated with the pause system call into the eax register. The pause system call simply instructs the kernel to put the program to sleep indefinitely; the program will not continue until terminated (such as by using the kill command) or until the program receives a signal which it has chosen ahead of time to handle. For our purposes, the only way the program can end is if terminated explicitly using the kill command. The pause system call accepts no parameters, so there is no need to fill any other registers with specific values; all that needs to be done now is to execute the syscall instruction to invoke pause.

To be on the safe side, in the event that the pause system call does terminate, the next code invokes the system call with number 231, which is the exit system call. This cleanly ends the program. The exit system call requires one parameter, specifically the return value of the program (what status code to return to the program's parent, the program that launched this program). The following code (movl $0, %edi) sets this parameter to zero, which will result in the program returning a status code of zero (indicating a successful exit).

The final line of code (ud2) is another unnecessary formality that is nevertheless good practice. Most CPUs today will look ahead and start executing code even before the prior program instructions have completed. With a program as short as this one, there is a small chance that the CPU will attempt to execute random junk "code" after the end of the real program instructions. The ud2 is guaranteed to never be assigned to any actual instruction. When any x86 CPU encounters a ud2, the CPU is guaranteed to immediately stop executing the program code. While unnecessary, it is good practice to include this line to ensure the system doesn't execute code after the end.

Before you can run the program, you must first assemble and link it; in other words, it must first be converted down into the machine code that the computer can execute. Most high-level languages do this for you behind-the-scenes when you compile or run your programs, but with assembly language you must launch these tasks manually. The following two commands will do the trick:

as pause.s -o pause.o
ld pause.o -o pause

The previous two commands will produce two new files – one called pause.o, the other simply pause. The pause file is the final, executable program. I recommend that when you run it, you run it in the background by appending an ampersand character (&) to the program's command line. By backgrounding the program, you will keep your shell open for further commands to inspect the program, and the shell will also print the program's process ID:

$ ./pause &
[1] 20397

The second number printed (the one outside the square brackets) is the process ID of the pause program. The number will probably be different on your system; make sure to note down the process ID, because it will come in very handy for subsequent commands.

Once you have started the program, it will continue to run, sitting in the background doing nothing. When you are done with the program, you can terminate it with the kill command:

kill [pid]

Replace [pid] with the process ID you wrote down earlier.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • maddog's Doghouse

    A look at the history of computer memory and a classic algorithm text.

  • Fresh Memory – Learning Aid

    When memorizing facts, learning, storing, and processing internalizes the subject matter. The Fresh Memory program offers various methods to help you anchor the facts in your memory.

  • Security Lessons: Rescue Tools

    When attackers strike your system, you need to determine exactly what damage has been done. Here are some tools to help.

  • Code Analysis

    Linux offers some sophisticated tools for understanding how malware can slip through the gaps in an unsuspecting application.

  • Hercules Mainframe Emulator

    Many enterprises still rely heavily on mainframes, which makes it all the more difficult to get your hands on one and install Linux on it. The Hercules emulator gives you a full-fledged alternative.

comments powered by Disqus