Watching activity in the kernel with the bpftrace tool
Programming Snapshot – bpftrace
Who is constantly creating the new processes that are paralyzing the system? Which process opens the most files and how many bytes is it reading or writing? Mike Schilli pokes inside the kernel to answer these questions with bpftrace and its code probes.
If you are tasked with discovering the cause of a performance problem on a Linux system that has slowed down to a crawl, you will typically turn to tools such as iostat, top, or mpstat to see exactly what is throwing a spanner in the works [1]. Not enough RAM? Lame hard disk? CPU overloaded? Or is network throughput the bottleneck?
Although a tool like top shows you the running processes, it cannot detect short-lived instances that start and end again immediately. Periodically querying the process list only makes sense for visualizing long-running processes.
Fortunately, the Linux kernel already contains thousands of test probes known as Kprobes and tracepoints. Users can inject code, log events, or create statistics there. One totally hot tool for doing this is bpftrace. With simple one-liners, it injects scripts into the kernel that determine metrics in real time, such as the number of bytes heading off into the network or onto the hard drive, or that list which processes open or close which files.
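As a minimal sketch of how such a script can catch the short-lived processes that top misses, the following logs every program start. It assumes that the kernel exposes the tracepoint:syscalls:sys_enter_execve tracepoint and that the installed bpftrace accepts the args->filename notation (newer releases also write this as args.filename); neither detail comes from the article itself.

```bpftrace
#!/usr/bin/bpftrace

// Sketch: print every program start, including very short-lived ones.
// Assumption: tracepoint:syscalls:sys_enter_execve exists on this kernel.
tracepoint:syscalls:sys_enter_execve
{
    // At this point comm is still the name of the calling process
    // (for example, the shell); args->filename is the program it starts.
    printf("%-16s %-6d %s\n", comm, pid, str(args->filename));
}
```

Run with sudo, the script prints one line per execve() call until you stop it with Ctrl+C.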
BPF stands for Berkeley Packet Filter, a name that reflects the tool's origin in the BSD world as a tracing tool for network packets. The practice of scattering probes throughout the code that are usually dormant, but run small snippets of code when triggered, proved so practical that it soon entered the Linux world as eBPF.
Once there, it lost its ties to network packets and conquered wide areas of the kernel code as a generic tracing concept. Good naming is hard work that engineers often shy away from, so the author of eBPF changed the name of his now popular work back to BPF. Of course, that complicates things for authors writing tutorials like this one, who are hard pressed to find an explanation as to why the BPF name has lost all meaning with respect to the product as it is today.
The approach of distributing dynamically deployable probes in kernel code came from the Sun world. For a long time, Solaris was the only operating system that allowed administrators to use DTrace to activate small pieces of D language code at strategic points, such as the system call entry point, and fire off counters or timers for performance analysis.
BPF on newer Linux kernels works in a similar way to DTrace, but has been rewritten (also for patent reasons). It executes instructions assigned to the probes in the BPF language, either in an interpreter or via a JIT compiler in native code, directly inside the kernel.
Status: Improving
The bpftrace programming language is very reminiscent of scripting with Unix veteran Awk, but it's still incomplete, and programmers sometimes struggle to complete even the simplest of tasks.
The bpftrace parser (implemented via the Unix veterans Lex and Yacc) is in a sorry state that doesn't even come close to the functionality of Awk – but maybe it will at some point. Netflix engineer Brendan Gregg and some open source friends are working on fixing it. Brendan's book on BPF [2] will be published in December 2019 (a preview is already available).
Back to the task at hand: How do you enable a probe in the kernel that outputs a message each time any userspace program calls the open() function to open a file? With such a probe, you can monitor in real time which processes are opening which files. It turns out this is really easy to do. Listing 1 [3] shows the program code; Figure 1 shows the program output.
Listing 1
sys-open.bt
01 #!/usr/bin/bpftrace
02
03 interval:s:5
04 {
05   exit();
06 }
07
08 kprobe:do_sys_open
09 {
10   printf("%s %s\n", comm, str(arg1));
11 }
Compact Code
The actual work starts in line 8 with the definition of the kprobe:do_sys_open probe; the following block contains the instructions to be executed when the probe triggers. When it triggers, the kernel hands the probe the arguments of the traced function, including the name of the file that the open() system call wants to open. In the block, the printf() instruction outputs the command name of the triggering process, stored in the comm variable, along with arg1, which carries the name of the file to be opened. Because bpftrace numbers the arguments starting at arg0, arg1 is the second argument of do_sys_open(); arg0 holds the directory file descriptor. Because printf() expects a string, but BPF saves arg1 as a character pointer, the built-in str() function converts the pointer appropriately.
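Positional arguments such as arg1 depend on the signature of the kernel function being probed. As a hedged alternative, the sketch below attaches to a syscall tracepoint, which exposes its arguments by name. It assumes that the tracepoint:syscalls:sys_enter_openat tracepoint is available and that open() calls on the system end up in the openat() system call, which is common on current systems but is not something the article states.

```bpftrace
#!/usr/bin/bpftrace

// Sketch: same idea as Listing 1, but on the openat() syscall tracepoint.
// Assumption: tracepoint:syscalls:sys_enter_openat exists on this kernel.
tracepoint:syscalls:sys_enter_openat
{
    // args->filename is a named field, so no argument counting is needed
    // (newer bpftrace releases also accept args.filename).
    printf("%s %s\n", comm, str(args->filename));
}
```

The output has the same shape as that of Listing 1: the process name followed by the file it tries to open.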
The code for the interval:s:5 event starting in line 3 is just an optional feature that cancels the program after five seconds. The event defines an interval of five seconds at which bpftrace jumps into the code block. The call to exit(), which shuts down the program, occurs here as soon as the block is entered for the first time. Tracing tools often use intervals like this to output consolidated statistics every few seconds. Once bpftrace has been installed on an Ubuntu system like this:
$ sudo apt-get update
$ sudo apt-get install bpftrace
all you need to do is run Listing 1 with sudo. It launches in the blink of an eye and keeps showing you which processes on the system are currently attempting to open which files. Before you get too excited, however, please note that bpftrace only works on relatively new kernels. Its creators recommend at least version 4.9, and preferably a series 5 kernel.
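The interval event can do more than end the program. The following sketch, which is an illustration and not one of the article's listings, uses it the way tracing tools typically do: It counts open attempts per process in a map and prints the consolidated figures every five seconds. The map name @opens is chosen freely for this example.

```bpftrace
#!/usr/bin/bpftrace

// Count open attempts per process name in a map keyed by comm.
kprobe:do_sys_open
{
    @opens[comm] = count();
}

// Every five seconds: print a timestamp and the counters, then reset them.
interval:s:5
{
    time("%H:%M:%S\n");
    print(@opens);
    clear(@opens);
}
```

Because the counting happens in the map, the script prints one summary per interval instead of one line per open() call.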
It is a very powerful tool. Astonished users will rub their eyes in amazement thinking about what just happened behind the scenes during the inconspicuous call: bpftrace activated the do_sys_open kprobe in the kernel and translated the printf() statement into an internal format. It then installed the compiled code on the probe, causing it to display a message every time the kernel passes the probe. When the bpftrace call terminates, it deactivates the probe in the kernel and removes the injected code.
Full Tilt While Idle
How does this work inside the kernel? It would obviously be devastating for kernel performance if the kernel had to check at every probe whether it is currently active, only to carry on normally in almost 100 percent of the cases because the probe is inactive. Of the thousands of possible probes, very few, if any, are active at any given time.
Instead, the BPF technology, just like DTrace under Solaris, uses a trick: Normally, when the probe is inactive, the probe site in the code holds a 5-byte no-op instruction, which the processor skips with practically no impact at run time. If the user activates the probe, for example by calling bpftrace, BPF replaces the no-op instruction in the kernel with a jump to the interpreter that executes the desired code.
No doubt, the CPU will consume time when executing the BPF instructions, which will slow down the kernel a bit. But since the processor stays in kernel mode and doesn't have to switch to user space every time, the probe can quickly refresh the desired statistics – then the flow continues with the actual kernel code.
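To illustrate statistics that are updated entirely inside the kernel, here is a minimal sketch that sums the bytes read per process. It assumes that the tracepoint:syscalls:sys_exit_read tracepoint is available and that the installed bpftrace accepts args->ret (newer releases also write args.ret); the map name @bytes is made up for this example.

```bpftrace
#!/usr/bin/bpftrace

// Sum the bytes returned by successful read() calls, per process name.
// Assumption: tracepoint:syscalls:sys_exit_read exists on this kernel.
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
    // The filter above skips failed reads (negative return values)
    // and reads that returned no data.
    @bytes[comm] = sum(args->ret);
}
```

The probe only updates the map while the script runs; when you stop it with Ctrl+C, bpftrace prints the accumulated totals.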
However, if the infiltrated code were to block the kernel, the result would be devastating: The entire system would stop, which is tantamount to a computer crash. That's why BPF verifies the code before it is introduced and only inserts it if the analysis shows that it will terminate relatively quickly. This is why the bpftrace language does not offer for loops or similar constructs for which it cannot predict with certainty whether they will stop running in the foreseeable future.