Core Technology

Article from Issue 198/2017

Everyone wants to be root, because root can do anything. But in fact, its powers are now split. Learn more in this overview of capability sets.

Today's Linux is somewhat like a famous sightseeing city you might have visited on your last vacation. There is a historic part that's of no practical use now, yet it is what keeps the city's identity. There are some well-known tourist spots that everybody seems to visit. And, finally, there are some secluded locations you never find in an advertisement in a travel agency. These are places a friend living there would show you, and they are essential for sensing a real spirit of the city, not its pamphlet picturesque image.

Okay, maybe I've taken the analogy a bit too far here. But if you agree to follow it for a second, capabilities would be one of these secluded locations. Introduced with Linux 2.2, they are what really tells whether process X can do Y. Yet they are often lost in the shadows of traditional Unix privileges, SELinux, eBPF, and many others. By the end of this Core Tech article, you'll know who really sets your limits in the city of Linux.

An All-Mighty Root (Actually, Not)

Back in ye olde days, the permission system of Linux was pretty simple. A user with UID 0 (often called "root") could do any privileged operation and wasn't subject to permission checks. Note that it is the UID, not the name, which is important. A user called "val" with UID 0 holds all the powers of the root user as well.

This "all-or-nothing" approach served well but wasn't very flexible. What if you do not want someone to install new packages or add new users, yet you want them to be able to create the raw sockets that the ping command uses? Granting someone permission to adjust the system date doesn't mean you would be happy if he or she reconfigured Nginx or MySQL on this server.

The sudo tool solves this, kind of. You can tell it which commands a given user can execute as root, so being able to run date doesn't imply permission to execute passwd. But command-level granularity is sometimes too coarse to be useful. If a single command writes files and adjusts dates, /etc/sudoers provides no way to restrict the former and grant the latter. This means that you possibly leave your system open to attack.

Let's look at how the kernel implements permission checks for privileged operations. Listings 1 and 2 show a relevant part of the inet_create() function, which is called in response to the socket(AF_INET, ...) system call. The listings also show functions that do actual permission checks; they are really defined in a separate file, linux/sched.h.

The code in Listing 1 comes from Linux 2.0. It's straightforward to see that it only evaluates whether the current process's effective UID is zero. However, Linux 2.2 is not concerned with the user ID anymore. Instead, it checks for specific flags in the process descriptor. These bit flags are essentially the capabilities of the process or, more precisely, of a thread. Any capability can be either set or reset. (See the "Secure Bits" box for more information.)
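To make the contrast concrete, here is a minimal Python sketch of the two styles of check. It is a toy model, not kernel code: the Task class is hypothetical, and while the names suser and capable are borrowed from the kernel helpers of those eras, their bodies here are simplified assumptions.

```python
from dataclasses import dataclass

CAP_NET_RAW = 13  # capability number as defined in linux/capability.h


@dataclass
class Task:
    """Hypothetical stand-in for the kernel's process descriptor."""
    euid: int
    cap_effective: int = 0  # bitmap, one bit per capability


def suser(task: Task) -> bool:
    """Pre-2.2 style: privileged means effective UID 0 and nothing else."""
    return task.euid == 0


def capable(task: Task, cap: int) -> bool:
    """2.2+ style: privileged means the capability's bit is raised."""
    return bool(task.cap_effective & (1 << cap))


# An unprivileged user whose thread holds only CAP_NET_RAW.
ping_like = Task(euid=1000, cap_effective=1 << CAP_NET_RAW)
print(suser(ping_like))                 # False
print(capable(ping_like, CAP_NET_RAW))  # True
```

In this model a Task with euid 0 and an empty bitmap fails the capable() check. The real kernel avoids that surprise by filling in root's capability sets when a program is executed.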

Secure Bits

Capabilities are a tricky yet flexible system, and you may be wondering why we still keep root users today. A short answer is it preserves backward compatibility and works well in many cases. A longer answer is that you really don't have to.

Starting with Linux 2.6.26, it is possible to establish a root-less, capabilities-only environment. In this environment, UID 0 is treated no differently from any other UID. As process permissions are really granted as per capabilities since Linux 2.2, establishing such an environment only needs some flags to disable special handling of the UID 0. These flags are commonly known as "secure bits."

Perhaps the most important secure bit is SECBIT_NOROOT, which disables setting permitted and inheritable file capabilities to all-ones, as I described. Two other flags, SECBIT_KEEP_CAPS and SECBIT_NO_SETUID_FIXUP, remove the effect of switching between zero and non-zero UIDs.

All these "base" flags also have companion "locked" flags. A locked flag forbids modifications to the corresponding base flag, and it can't be cleared. This means you can set up a secure bits environment the way you want, lock it, and be confident no process could ever change it. Secure bits are managed with a prctl(2) system call, and a PAM module would be an appropriate place to do so.
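As a rough illustration of the base and locked flags, the following sketch decodes a secure bits value into flag names. The bit positions and the PR_GET_SECUREBITS constant are taken from linux/securebits.h and linux/prctl.h as I understand them. The helper names decode_securebits and current_securebits are my own, and calling prctl through ctypes is just one assumed way to reach the system call from Python. Only the three flags discussed in this box are listed.

```python
import ctypes

PR_GET_SECUREBITS = 27  # from linux/prctl.h

# Bit positions from linux/securebits.h: each base flag is followed by its lock.
SECURE_BITS = {
    0: "SECBIT_NOROOT",
    1: "SECBIT_NOROOT_LOCKED",
    2: "SECBIT_NO_SETUID_FIXUP",
    3: "SECBIT_NO_SETUID_FIXUP_LOCKED",
    4: "SECBIT_KEEP_CAPS",
    5: "SECBIT_KEEP_CAPS_LOCKED",
}


def decode_securebits(mask: int) -> list:
    """Return the names of the flags raised in a secure bits value."""
    return [name for bit, name in SECURE_BITS.items() if mask & (1 << bit)]


def current_securebits() -> int:
    """Ask the kernel for the calling thread's secure bits (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    value = libc.prctl(PR_GET_SECUREBITS, 0, 0, 0, 0)
    if value < 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_SECUREBITS) failed")
    return value
```

On a Linux system, decode_securebits(current_securebits()) lists the flags active for the current thread; in an ordinary login session the list is usually empty.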

Listing 1

Permission Checks in Linux Kernels Before 2.2


Linux understands a few dozen capabilities now; see linux/capability.h [1] or the capabilities(7) man page [2]. The number of the highest capability the running kernel supports (counting from zero) is also available in /proc/sys/kernel/cap_last_cap:

cat /proc/sys/kernel/cap_last_cap

I'd be happy to say any privileged operation has a dedicated capability flag now, but it isn't the case. Some capabilities span several operations. For instance, CAP_NET_ADMIN permits one to configure network interfaces, manage firewall rules, and modify routing tables (besides other things). You see the grouping is natural, so when a capability feels coarser than you might expect, it's usually not a problem.

As you may have guessed by now, CAP_NET_RAW allows creating a raw (and packet) network socket which is useful for the ping command and for sniffing tools such as tcpdump.
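Because capabilities are just numbered bits, a bitmap is easy to translate back into names. The sketch below assumes the hexadecimal CapEff line that Linux prints in /proc/<pid>/status and uses a hand-picked subset of the numbers from linux/capability.h. The helper names decode_caps and effective_caps are hypothetical.

```python
# Subset of capability numbers from linux/capability.h (not the full list).
CAP_NAMES = {
    5: "CAP_KILL",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    25: "CAP_SYS_TIME",
    31: "CAP_SETFCAP",
}


def decode_caps(mask: int) -> list:
    """Translate a capability bitmap into names, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Bits outside the subset above get a generic placeholder name.
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def effective_caps(status_text: str) -> int:
    """Extract the effective set from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")
```

On a Linux system, reading /proc/self/status and passing its text through effective_caps() and then decode_caps() shows what the current Python process is allowed to do.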

Capability Sets

You may have noticed that the code in Listing 2 checks capabilities in the cap_effective member of the process descriptor. There are a few other cap_something members as well because each thread in Linux has several associated capability sets. Effective is, of course, what defines capabilities currently in action. Other sets are used, for example, when a thread does an execve(2) system call to execute some new code for which you may want different capabilities.

Listing 2

Permission Checks in Linux Kernels 2.2 and Newer


First, there is the permitted capability set. It contains all capabilities a thread may ever assume, that is, add to any other capability set. If a thread drops a capability from the permitted set, there is no way back, at least until the thread executes a program that grants the capability again (a setuid-root binary or a file with that capability attached).

This brings us to the inheritable capability set. As the name implies, these are capabilities that are preserved across the execve(2) system call. Inheritable capabilities are automatically added to the permitted set when a program is executed. However, this only applies to privileged processes, which either run as root or execute a setuid-root binary (the file's own inheritable set, covered under Capability Math, also has a say). For everything else, inheritable capabilities are simply ignored. So, if ping had CAP_NET_RAW in its inheritable set and you somehow tricked it into running a Python interpreter for you, you still wouldn't be able to create arbitrary raw network sockets. Only ping could do it, and it properly restricts the use of this powerful feature to innocent ICMP echo requests.

This raises a question: How do you execute a privileged helper then? This is, in fact, a common scenario: Consider a network management app. You don't need privileges to fill in stuff like an IP address or a gateway. Yet when you apply these settings, the app calls some helper script (often it is setuid-root) to put the configuration you want in effect.

Before Linux 4.3, there was no straightforward way to do this using capabilities. Now we have the ambient capabilities set. A capability in this set must be both permitted and inheritable (the kernel enforces it automatically), and these capabilities are preserved across execve(2) calls in unprivileged programs. When you execute a setuid or a setgid program, the kernel clears ambient capabilities to keep things safe.

A process can also directly change capabilities in the ambient set using the prctl(2) system call. Keep in mind, however, that everything I have described so far applies to execve(2) only. Forks are nothing special from the capabilities point of view: Both the parent and the child get a bitwise copy of all capability sets. It's execve(2) that matters, as it decides which code a thread will ultimately execute.
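The rules for the ambient set and for fork can be modeled in a few lines. This is a simplified sketch, not the kernel's logic: the ThreadCaps class and its method names are hypothetical, and the only behavior modeled is that an ambient capability must stay both permitted and inheritable, while a fork copies every set unchanged.

```python
from dataclasses import dataclass

CAP_NET_RAW = 13   # numbers as in linux/capability.h
CAP_SYS_TIME = 25


@dataclass
class ThreadCaps:
    """Toy model of three of a thread's capability sets (bitmaps)."""
    permitted: int = 0
    inheritable: int = 0
    ambient: int = 0

    def raise_ambient(self, cap: int) -> None:
        """Add a capability to the ambient set if the invariant allows it."""
        bit = 1 << cap
        if not (self.permitted & self.inheritable & bit):
            raise PermissionError("capability must be permitted and inheritable")
        self.ambient |= bit

    def drop_inheritable(self, cap: int) -> None:
        """Dropping an inheritable capability also lowers it in ambient."""
        self.inheritable &= ~(1 << cap)
        self.ambient &= self.permitted & self.inheritable

    def fork(self) -> "ThreadCaps":
        """A child starts with a bitwise copy of the parent's sets."""
        return ThreadCaps(self.permitted, self.inheritable, self.ambient)
```

A thread holding CAP_NET_RAW in both its permitted and inheritable sets can raise it to ambient. An attempt to raise CAP_SYS_TIME, which it does not hold, fails in this model with PermissionError.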

Capability Math

Now you have the idea of how the kernel implements thread capabilities, but where do these capabilities come from? Nowadays, they're usually attached to an executable file. Linux stores capabilities in a dedicated extended attribute within the security namespace [3]:

$ getfattr -m - -d /usr/bin/ping
# file: usr/bin/ping

Interestingly, there is a dedicated capability, CAP_SETFCAP, which grants permission to set file capabilities. This is a sort of chicken-and-egg problem, although the "all-mighty root" concept solves it easily.

As with thread capabilities, there are several file capability sets. Perhaps the most important one is the permitted set. Capabilities in this set are automatically granted when you execute a file, even if they aren't in the inheritable set of the thread doing the execve(2) call. So, if an executable file has CAP_KILL attached, the process will be able to send signals to arbitrary processes, even if it doesn't run as root. Note that adding a capability to the file's permitted set isn't enough. You should also set the so-called "effective bit" in the file's capabilities. This bit makes permitted capabilities effective, that is, raised in the effective capabilities set after execve(2).
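For the curious, the raw value of the extended attribute can be unpacked by hand. The sketch below assumes the on-disk layout of struct vfs_cap_data from linux/capability.h as I understand it: a little-endian 32-bit word holding the revision and the effective bit, followed by pairs of permitted and inheritable words. The function name parse_file_caps and the returned dictionary are my own choices.

```python
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
# Number of 32-bit (permitted, inheritable) pairs stored by each revision.
PAIRS_BY_REVISION = {0x01000000: 1, 0x02000000: 2, 0x03000000: 2}


def parse_file_caps(raw: bytes) -> dict:
    """Decode the raw bytes of a security.capability extended attribute."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    pairs = PAIRS_BY_REVISION[magic_etc & VFS_CAP_REVISION_MASK]
    permitted = inheritable = 0
    for i in range(pairs):
        # Each pair carries 32 bits of both sets; later pairs are higher bits.
        part_p, part_i = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= part_p << (32 * i)
        inheritable |= part_i << (32 * i)
    return {
        "permitted": permitted,
        "inheritable": inheritable,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
    }
```

On Linux, the raw bytes can be fetched with os.getxattr(path, "security.capability"), which raises OSError when the file has no capabilities attached.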

Files also have an inheritable capabilities set, which is ANDed with the thread inheritable capabilities at execve(2) time. This is a way of saying "a thread executing this code never should be granted CAP_X." If you know the program is going to adjust the system clock and nothing else, limiting the file's inheritable capabilities set to CAP_SYS_TIME would mean dropping any other capability a thread may have gained.

If a process calling execve(2) runs as root or the binary itself is setuid-root and has no capabilities attached (Figure 1), both permitted and inheritable file sets are assumed to be all ones (remember they are really just bitmaps). That's how the kernel preserves an all-mighty root illusion in 2017.

Figure 1: An empty capabilities set is not the same as no capabilities at all, as getfattr and capget show.

If the previous text was too verbose for you, the capabilities(7) man page neatly summarizes the rules in just four formulas. Consider a process calling execve(2), and let P(something) be the process's capabilities in the respective set before the call and F(something) the corresponding file capabilities. Then the new capabilities, P'(something), are defined as:

P'(ambient) = (file is privileged) ? 0 : P(ambient)

If the file is setuid/setgid-root or has capabilities attached, ambient capabilities are cleared.

P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)

This one is trickier. Thread inheritable capabilities are put in the permitted set if the file's inheritable capabilities don't disable them. Then, the file's permitted capabilities are dropped into the mix, subject to the capability bounding set (see the man page [2] for details). Finally, ambient capabilities are added for non-privileged files.

P'(effective) = F(effective) ? P'(permitted) : P'(ambient)

If the file's effective bit is set, the new permitted capabilities become the effective ones; cap_bset aside, this set always contains at least F(permitted). If the bit is not set, only ambient capabilities are in effect.

Inheritable capabilities remain unchanged during execve(2): P'(inheritable) = P(inheritable). If you are interested in (somewhat mind-bending) implementation details, refer to [4] (also Figure 2).
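The four formulas translate almost literally into code. The following sketch is a model for experimenting with the rules, not a description of the kernel implementation. The Proc and File classes, their field names, and the has_caps and setid flags (used to decide whether a file counts as privileged) are assumptions made for illustration.

```python
from dataclasses import dataclass, replace

ALL_CAPS = (1 << 64) - 1  # "all ones" bitmap


@dataclass(frozen=True)
class Proc:
    """Capability sets of the thread calling execve(2), as bitmaps."""
    permitted: int = 0
    inheritable: int = 0
    effective: int = 0
    ambient: int = 0
    bounding: int = ALL_CAPS  # cap_bset in the formulas


@dataclass(frozen=True)
class File:
    """Capability data attached to the executable."""
    permitted: int = 0
    inheritable: int = 0
    effective_bit: bool = False
    has_caps: bool = False  # a capability attribute exists, even if empty
    setid: bool = False     # setuid or setgid bit is set


def execve_caps(p: Proc, f: File) -> Proc:
    """Apply the four execve(2) formulas and return the new sets."""
    privileged = f.has_caps or f.setid
    ambient = 0 if privileged else p.ambient
    permitted = ((p.inheritable & f.inheritable)
                 | (f.permitted & p.bounding)
                 | ambient)
    effective = permitted if f.effective_bit else ambient
    # Inheritable and bounding sets are carried over unchanged.
    return replace(p, permitted=permitted, effective=effective, ambient=ambient)
```

For example, an ordinary thread with empty sets that executes a file carrying CAP_NET_RAW in its permitted set with the effective bit raised ends up with exactly that capability permitted and effective. Passing ALL_CAPS for both file sets models the root case.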

Figure 2: Even if you are a non-programmer, Linux kernel sources are the ultimate authority in how capabilities work.
