Watching activity in the kernel with the bpftrace tool

Huge Selection

The kernel offers plenty of probes to choose from. From vfs_read (the function that reads bytes from disk and can pass a count to a probe), through do_execve (for monitoring newly created Unix processes), to trace_pagefault_reg (which is triggered when a memory page is reloaded), users can inspect the kernel's workings at will and discover in real time what's going on and where the bottlenecks are.
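As a rough illustration of tapping into one of these functions, the following sketch attaches a kprobe to vfs_read and builds a histogram of the requested read sizes per command. It assumes that the byte count is the third argument of vfs_read (arg2, counting from zero), which holds for current kernels but is not guaranteed for an internal function; the map name @requested_bytes is made up for this example.

```bpftrace
#!/usr/bin/bpftrace

// Sketch: histogram of requested read sizes, grouped by command name.
// Assumption: the third argument of vfs_read() (arg2) is the byte count.
kprobe:vfs_read
{
  @requested_bytes[comm] = hist(arg2);
}
```

When the script is stopped with Ctrl-C, bpftrace prints the contents of the map, one histogram per command.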

Figure 2 lists the probes that bpftrace prints when called with the -l switch. BPF distinguishes between kprobes, which track important kernel functions by name, and tracepoint probes, which the kernel maintainers manually maintain at a slightly higher logical level and which are thus more resilient to changes in the kernel. In contrast to userspace-facing kernel APIs, the kernel's internal functions are by no means guaranteed to be stable.

Figure 2: Bpftrace can tap into a selection of tracepoints and kprobes.

Potential for More

How about a script that outputs all newly created processes on the system in real time, including the command that was used to start them and their parameters? Listing 2 shows a short script that activates the sys_enter_execve tracepoint and prints the argument list argv from its args structure.

Listing 2

01 #!/usr/bin/bpftrace
02
03 BEGIN
04 {
05   printf("New processes with arguments\n");
06 }
07
08 tracepoint:syscalls:sys_enter_execve
09 {
10   join(args->argv);
11 }

Here you can see that bpftrace's range of functions still leaves room for improvement. For example, the join() function uses spaces to join the elements of the command line in args->argv and prints the result. However, it cannot return the result as a string that you could then format with printf(). Hopefully, upcoming versions will resolve this issue.
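One possible workaround, sketched below, is to print a prefix with printf() and no trailing newline, and then let join() complete the line. Note that at this tracepoint, comm still holds the name of the process calling execve(), not the name of the new program.

```bpftrace
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_execve
{
  // No trailing newline here: join() prints the arguments
  // and terminates the line.
  printf("%-6d %-16s ", pid, comm);
  join(args->argv);
}
```

This keeps each event on one line, although the argument list itself still cannot be formatted or filtered as a string.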

The BEGIN block starting in line 3 merely prints a banner for the user. Following the Awk programming model, anything the script should do right at startup, such as displaying a message or initializing a variable, goes into the BEGIN block, as shown in Listing 2.
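The following sketch extends this idea: BEGIN prints a message and initializes a variable with the start time, and the matching END block, which bpftrace runs when the script terminates, uses that variable to report the runtime. The map names @start and @execs are made up for this example.

```bpftrace
#!/usr/bin/bpftrace

BEGIN
{
  printf("Counting execve() calls, press Ctrl-C to stop\n");
  @start = nsecs;  // remember the start timestamp in nanoseconds
}

tracepoint:syscalls:sys_enter_execve
{
  // Count calls per calling command.
  @execs[comm] = count();
}

END
{
  printf("Ran for %d seconds\n", (nsecs - @start) / 1000000000);
}
```

On exit, bpftrace also prints the maps that still hold data, so the per-command counts appear after the runtime message.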

In the Thick of It

Things become more complicated when the probe that detects a problem cannot output the desired data because that data is only available at a different probe. For example, to look at processes that try to open files that do not exist (or to which they have no access), Listing 3 taps into the sys_exit_openat tracepoint, which the kernel runs through when the openat() system call behind open() returns.

Listing 3

01 #!/usr/bin/bpftrace
02
03 tracepoint:syscalls:sys_enter_openat
04 {
05   @filename[tid] = args->filename
06 }
07
08 tracepoint:syscalls:sys_exit_openat
09   / @filename[tid] /
10 {
11   if ( args->ret < 0 ) {
12     printf("%s %s\n", comm, str(@filename[tid]));
13   };
14   delete(@filename[tid]);
15 }

Using the condition args->ret < 0, bpftrace checks whether the return code from the system call was negative, which indicates that the desired file could not be opened. If so, the code should output the name of the process in question and the file name. However, the exit tracepoint does not have access to the file name; it was only available earlier, when the kernel entered the call and passed through the sys_enter_openat tracepoint (note the subtle difference between enter and exit).

The solution is to have bpftrace store the data on entry to open() and carry it over to the exit probe, which then retrieves the file name and reports the error with the desired context. To do this, the script stores the name of every file being opened in a map when entering open() (i.e., in the sys_enter_openat tracepoint), under the key of the current kernel thread ID, which is available in the predefined tid variable. If the file later fails to open, the sys_exit_openat tracepoint looks up the file name in the map and reports it to the user, along with the name of the command that experienced the error, which is available in comm.

The filter in line 9 of Listing 3, / @filename[tid] /, ensures that the probe executes the code that follows only if the current kernel thread has previously stored a file name in the map. If the exit event has no matching entry (for example, because the call was entered before the script started), the map entry won't exist, and the filter makes bpftrace ignore the event.

After reporting the incident, the code proceeds to line 14, which calls delete() to remove the map entry. Without this step, the map would keep growing and eventually consume too much memory if the bpftrace script ran for an extended period.
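The same pattern of storing a value at entry, reading it at exit, and deleting it afterwards works for other measurements as well. As a sketch, the following script applies it to the vfs_read kernel function to measure how long each call takes; the map names @start and @latency_us are made up for this example, and the kprobes depend on vfs_read being present under that name in the running kernel.

```bpftrace
#!/usr/bin/bpftrace

kprobe:vfs_read
{
  // Entry: remember when this thread entered vfs_read().
  @start[tid] = nsecs;
}

kretprobe:vfs_read
/ @start[tid] /
{
  // Exit: compute the elapsed time in microseconds and add it
  // to a per-command histogram.
  @latency_us[comm] = hist((nsecs - @start[tid]) / 1000);
  // Remove the entry so the map does not grow without bound.
  delete(@start[tid]);
}
```

As in Listing 3, the filter skips return events for which no entry timestamp was recorded.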
