# Core Technology

Article from Issue 194/2017

If you were using Linux back in 2006, you will remember the desktop search tool Beagle (Figure 1), which was notified when you changed your files so it could re-index them. Modern file managers also rely on notifications to update their displays when files are created, deleted, or renamed (Figure 2), unlike earlier file managers that counted on the user to refresh the display (Figure 3). Now, think of the ClamAV open source antivirus software. If you try to open a file containing malware, you expect an on-access scanner to ban it. In this case, notifications aren't enough; ClamAV needs to be an active part of the process, allowing or denying certain operations. Happily, Linux can handle both cases. The downside is, it does so with two separate APIs, and you can't just choose one over another.

Figure 1: Beagle hasn't released a new version since 2009, yet it's remembered as a "textbook" example of inotify usage. (The figure [1] is reproduced under Apache License, Version 2.0 [2].)
Figure 2: Many file managers, including Dolphin, rely on notifications to keep directory content current.
Figure 3: The fabulous Midnight Commander is an exception – you reload contents yourself with Ctrl+R. Patches, anyone?

#### Inotify

The first API I'll look at, and the first one to appear in Linux 2.6, is inotify. It went mainline with Linux 2.6.13 in 2005. Inotify was the workhorse behind Beagle's indexer. It replaced an older filesystem monitoring technology (see the "dnotify" sidebar), improving it in several ways.

dnotify

Inotify was introduced with early 2.6 kernels, but filesystem monitoring in Linux is really much older. Its first incarnation, dnotify, appeared in Linux 2.4.0 back in 2001.

The dnotify API was a step forward: Earlier approaches to the problem involved polling directories for changes, which was very inefficient. However, dnotify's design made the API cumbersome and hard to use. Whereas later mechanisms introduced dedicated system calls, dnotify relied on fcntl(2).

Signals are used for notifications, and they are somewhat difficult to handle correctly because they convey little information (not even the name of the file that triggered the event), and you can't integrate them easily into event loops, although signalfd(2) file descriptors mitigate the issue to some extent. Dnotify forces you to keep an open descriptor for each filesystem object you monitor. Moreover, it has no notion of events that rename a file, leaving programmers to figure it out by comparing two directory trees. If I recall correctly, Dropbox once offered a similar puzzle to candidates seeking an engineering position within the company (i.e., it's not trivial).

Dnotify is still available in the latest kernels, but with inotify and fanotify, there is little reason to use it except in legacy code.

First, inotify replaced a cumbersome signal-based notification mechanism with a pollable file descriptor from which you just read events. This makes event loop integration a breeze. It also removed the need to keep an open file descriptor for each directory you monitor. To do so, inotify introduced three new system calls: inotify_init(2), inotify_add_watch(2), and inotify_rm_watch(2).

Your code starts with inotify_init(), which returns a file descriptor acting as a handle to an in-kernel event queue. A newer variant, inotify_init1(), accepts an extra flags argument. Passing IN_NONBLOCK here opens the descriptor in non-blocking mode, saving you an fcntl(2) call. The IN_CLOEXEC flag is a similar shortcut:

int fd = inotify_init();
if (fd < 0) { /* Handle the error */ }
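To make the shortcut concrete, here is a minimal sketch of both routes. The function names open_inotify() and open_inotify_legacy() are made up for this illustration; only the system calls and flags come from the inotify API.

```c
#include <fcntl.h>
#include <sys/inotify.h>
#include <unistd.h>

/* One call: the descriptor is non-blocking and close-on-exec from the start. */
int open_inotify(void)
{
    return inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
}

/* The longer route that the flags save you: inotify_init() followed by
 * fcntl(2) calls. Shown for comparison only. */
int open_inotify_legacy(void)
{
    int fd = inotify_init();
    if (fd < 0)
        return -1;

    int fl = fcntl(fd, F_GETFL);
    if (fl < 0 ||
        fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0 ||
        fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Both functions should hand back an equivalent descriptor; the first simply avoids the window in which the descriptor exists without its flags.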

Then, you add watches for filesystem objects of interest with the inotify_add_watch() system call. It accepts three arguments: an inotify file descriptor, a pathname, and a set of flags (or "mask") specifying the events in which you are interested. IN_CREATE fires when an entry (think a file or a subdirectory) is created in the directory you watch. IN_OPEN is reported when a file is opened, followed by IN_ACCESS or IN_MODIFY when the contents are read or changed. Later, when the file is closed, the kernel sends either IN_CLOSE_WRITE (if the file was opened for writing) or IN_CLOSE_NOWRITE. You can capture both with IN_CLOSE.

Some flags carry a _SELF suffix, like in IN_DELETE_SELF. They apply to the monitored directory itself, not its children. In particular, IN_DELETE_SELF is reported when you remove a watched directory. The kernel then reports an IN_IGNORED event for it. Moving a directory can also generate IN_DELETE_SELF if it occurs across filesystem boundaries, but normally, it produces a sequence of two events: IN_MOVED_FROM and IN_MOVED_TO.

The inotify(7) man page lists all supported flags. Here, I just pass IN_ALL_EVENTS, which – you guessed it – captures everything:

int wd = inotify_add_watch(fd, argv[1], IN_ALL_EVENTS);
if (wd < 0) { /* Handle this one as well */ }

The inotify_add_watch() call returns a so-called "watch descriptor." It matches events to watched filesystem objects and can be used to "unmonitor" them later with inotify_rm_watch(). If anything goes wrong, inotify_add_watch() returns -1 and errno is set appropriately.
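In practice you rarely want every event. The sketch below watches a directory for a narrower set and later removes the watch; watch_dir() and unwatch() are hypothetical helper names, and the choice of flags is only an example.

```c
#include <sys/inotify.h>

/* Watch only creations, deletions, and completed writes in `dir`.
 * IN_ONLYDIR makes the call fail if `dir` is not a directory.
 * Returns a watch descriptor, or -1 with errno set. */
int watch_dir(int fd, const char *dir)
{
    return inotify_add_watch(fd, dir,
                             IN_CREATE | IN_DELETE | IN_CLOSE_WRITE | IN_ONLYDIR);
}

/* Stop watching; the kernel then queues a final IN_IGNORED event for `wd`.
 * Returns 0 on success, or -1 with errno set. */
int unwatch(int fd, int wd)
{
    return inotify_rm_watch(fd, wd);
}
```

An application would typically store the returned watch descriptor next to the pathname so that later events can be mapped back to it.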

Now you wait for an inotify descriptor to become readable. In a real application, this happens in an event loop. In Listing 1, I just spin in read():

Listing 1

It makes sense to use large buffers capable of storing multiple events for performance reasons. A struct inotify_event represents a single inotify event. The wd field contains the watch descriptor, which you can map back to a pathname (see inotify(7) for details).

if (ev->mask & IN_OPEN)
    printf("IN_OPEN ");
/* Handle other events here */

The mask tells what exactly has happened. Besides the IN_* flags you supply to inotify_add_watch(), it may contain the aforementioned IN_IGNORED or IN_Q_OVERFLOW if the in-kernel queue has overflowed. The queue can store up to 16,384 events by default. Although the size is adjustable via /proc/sys/fs/inotify/max_queued_events, there is always a limit. Otherwise, you leave your system open to a local denial of service (DoS) attack. When a queue overflows, the kernel discards further events (keeping memory usage constrained) until an application empties the queue or destroys it.
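The two special flags call for different reactions, which a small helper can make explicit. This is only a sketch: the enum and the classify() function are invented for illustration. Note that an overflow event is not tied to any watch (its wd field is -1), so the only safe response is to rescan whatever state you cache.

```c
#include <stdint.h>
#include <sys/inotify.h>

/* Hypothetical reactions an application might choose between. */
enum next_step { HANDLE_EVENT, DROP_WATCH, RESCAN_ALL };

/* Decide how to react to a single event mask. */
enum next_step classify(uint32_t mask)
{
    if (mask & IN_Q_OVERFLOW)
        return RESCAN_ALL;   /* events were lost: cached state is stale */
    if (mask & IN_IGNORED)
        return DROP_WATCH;   /* the watch is gone: forget its descriptor */
    return HANDLE_EVENT;     /* an ordinary IN_* event */
}
```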

printf("%s ", ev->mask & IN_ISDIR ? "directory" : "file");
if (ev->len)
    printf("%s", ev->name);
printf("\n");

If the event pertains to an entry within the watched directory, name is its name, and len is the size of the name field, including any padding null bytes. For subdirectories, IN_ISDIR is also set in the mask. Finally, cookie is an arbitrary but unique integer that links IN_MOVED_FROM and IN_MOVED_TO events (not shown here) together.
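Because the listing itself is not reproduced here, the following sketch shows one way the pieces described above could fit together. It is not the author's original code: read_events() is a hypothetical name, and it reads a single batch rather than spinning forever.

```c
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Read one batch of events from `fd` and print each of them.
 * Returns the number of events seen, or -1 if read() failed. */
int read_events(int fd)
{
    /* Aligned so the buffer can safely be accessed as struct inotify_event. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));
    if (len < 0)
        return -1;

    int count = 0;
    for (char *p = buf; p < buf + len; ) {
        const struct inotify_event *ev = (const struct inotify_event *)p;

        printf("wd=%d mask=0x%08x %s %s\n", ev->wd, (unsigned)ev->mask,
               ev->mask & IN_ISDIR ? "directory" : "file",
               ev->len ? ev->name : "(watched object itself)");

        /* Events vary in size: a fixed header plus ev->len bytes of name. */
        p += sizeof(struct inotify_event) + ev->len;
        count++;
    }
    return count;
}
```

Calling read_events() in a loop on a blocking descriptor approximates the "spin in read()" behavior; an event loop would call it whenever the descriptor becomes readable.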

Despite all its goodies, the inotify API is somewhat limited: It doesn't support recursive operations, so to monitor a directory, including all its children, you'd have to add watches one by one. Keep in mind that the directory can change while you install watches, so your code should anticipate possible races.

Rename events pose another difficulty. Their twofold split is natural from the kernel's point of view, but it leaves your code guessing as to whether it will get matching IN_MOVED_FROM and IN_MOVED_TO events. What if you don't monitor the directory to which the object moved? A common solution is to wait for IN_MOVED_TO for a few milliseconds. If it's absent, you conclude it won't appear at all. Although not completely robust, this approach is reported to produce accurate results 95%-99% of the time.
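The wait-and-match heuristic can be sketched with poll(2) and the cookie field. Everything named here (struct move_result, pair_moves()) is hypothetical, and the code is deliberately simplified: it tracks only one pending rename at a time and ignores an IN_MOVED_TO that arrives without a preceding IN_MOVED_FROM.

```c
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Outcome of waiting for the second half of renames. */
struct move_result {
    int paired;    /* IN_MOVED_FROM events that found their IN_MOVED_TO */
    int unpaired;  /* IN_MOVED_FROM events still pending when we gave up */
};

/* Collect events until `fd` stays quiet for `timeout_ms` milliseconds,
 * matching IN_MOVED_FROM and IN_MOVED_TO by cookie. */
struct move_result pair_moves(int fd, int timeout_ms)
{
    struct move_result res = {0, 0};
    int have_pending = 0;      /* is a rename waiting for its second half? */
    uint32_t pending = 0;      /* cookie of that rename */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    while (poll(&pfd, 1, timeout_ms) > 0) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0)
            break;
        for (char *p = buf; p < buf + len; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if (ev->mask & IN_MOVED_FROM) {
                if (have_pending)
                    res.unpaired++;        /* previous one never completed */
                pending = ev->cookie;
                have_pending = 1;
            } else if ((ev->mask & IN_MOVED_TO) && have_pending &&
                       ev->cookie == pending) {
                printf("rename completed, new name: %s\n",
                       ev->len ? ev->name : "?");
                res.paired++;
                have_pending = 0;
            }
            p += sizeof(*ev) + ev->len;
        }
    }
    if (have_pending)
        res.unpaired++;   /* timed out: assume it moved out of sight */
    return res;
}
```

A production version would keep a table of pending cookies with individual deadlines instead of a single slot.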

Inotify doesn't convey a PID for the process doing changes and provides no mechanism for access permission decisions. This doesn't mean inotify is flawed or should be deprecated. Much the opposite, many Linux applications rely on it; yet, there is some room for a more advanced API that addresses at least some of these issues.

#### Fanotify

Fanotify is such an API. It made a debut in Linux 2.6.36, almost five years after inotify. Conceptually, fanotify is similar to inotify, yet somewhat closer to the low-level kernel API they both use. It introduces a set of system calls to obtain a file descriptor and "marks" filesystem objects as being watched. Unlike inotify, fanotify can monitor a mounted filesystem as a whole. Monitoring a single directory recursively is still impossible, though. Whereas the inotify file descriptor is read-only, its fanotify counterpart is writable, which is how you tell fanotify your access permission decisions.

To create a fanotify file descriptor, use the fanotify_init() system call:

int fd = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
if (fd < 0) { /* You know what to do */ }

Unlike inotify_init(), calling fanotify_init() usually requires root privileges. The call accepts two bitmask arguments. The first defines fanotify behavior; FAN_CLASS_CONTENT or FAN_CLASS_PRE_CONTENT are required to handle permission events. If a single file has multiple watchers, FAN_CLASS_PRE_CONTENT wins and gets a chance to modify the file's data; FAN_CLASS_CONTENT runs next, so it sees the contents in their final form (hence the name). The default, FAN_CLASS_NOTIF, runs last and can't be used with permission events.

If you want a fanotify descriptor to be non-blocking, add FAN_NONBLOCK. To waive limits for the in-kernel event queue size and the number of marks, use FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARKS, respectively. Keep DoS attack scenarios in mind if you use these flags.

When a file produces some event, fanotify opens a new file descriptor and hands it over to the userspace code. The second argument to fanotify_init() tells how exactly to do it. It's the same as flags in open(2). Here, you're not going to modify files, so read-only access is sufficient.

Next, you start adding marks. The fanotify_mark() system call multiplexes adding, removing, and flushing marks. An equivalent watch descriptor is not available in fanotify, so the call just returns zero if everything went okay:

int err = fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT, FAN_OPEN_PERM, AT_FDCWD, argv[1]);
if (err) { /* Sorry, this didn't work out this time */ }

The fanotify file descriptor is fd, and FAN_MARK_ADD adds a mark. The whole mounted filesystem (FAN_MARK_MOUNT) is being monitored for file open permission events (FAN_OPEN_PERM). The last two arguments define an object to monitor. You'll find quite a few more possibilities in the fanotify_mark(2) man page. AT_FDCWD is a special value for the current working directory's file descriptor. If argv[1] is not an absolute path, it is treated as being relative to the current directory (.). The exact path matters little when you install a mount mark, because any directory residing on the same filesystem has the same effect.

Compared with inotify, fanotify's assortment of events might feel limited. At present, events for creating, deleting, and moving files are not supported: You can watch files and directories being opened, accessed, and closed, and that's it. Moreover, mmap() generates no events. Fanotify isn't an inotify replacement; instead, it focuses on cases such as malware scanning and hierarchical storage management.

Now you can start looping for events again. Fanotify represents events as struct fanotify_event_metadata. In theory, it varies in size, so fanotify provides some macros to aid iteration (Listing 2).

Listing 2

Looping for Events
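The listing is not reproduced here, so the following is a sketch of the iteration pattern rather than the author's code. The hypothetical walk_events() takes a buffer that read() has already filled from the fanotify descriptor and steps through it with the FAN_EVENT_OK() and FAN_EVENT_NEXT() macros. It only closes each event's descriptor, so it suits notification events; a permission event would also need a reply first.

```c
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Walk one buffer filled by read() on a fanotify descriptor.
 * Returns the number of events processed, or -1 if the kernel's
 * metadata layout does not match the headers used at compile time. */
int walk_events(char *buf, ssize_t len)
{
    struct fanotify_event_metadata *metadata =
        (struct fanotify_event_metadata *)buf;
    int count = 0;

    while (FAN_EVENT_OK(metadata, len)) {
        if (metadata->vers != FANOTIFY_METADATA_VERSION)
            return -1;

        printf("mask=0x%llx pid=%d fd=%d\n",
               (unsigned long long)metadata->mask,
               (int)metadata->pid, (int)metadata->fd);

        /* Each event carries its own descriptor; overflow events do not. */
        if (metadata->fd >= 0)
            close(metadata->fd);

        count++;
        /* Advances the pointer and reduces len by the event's size. */
        metadata = FAN_EVENT_NEXT(metadata, len);
    }
    return count;
}
```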

Among the struct fanotify_event_metadata fields, three are most interesting: mask, fd, and pid. The mask bitfield stores FAN_* flags, as already discussed, plus FAN_Q_OVERFLOW, indicating an overflow event (unless the queue is unlimited). The pid field identifies the process that generated the event, and fd is a file descriptor for the object to which the event pertains. The second argument to fanotify_init() dictates whether it is writable or not. Because there is no metadata->name, you retrieve the pathname from metadata->fd by reading a symbolic link in /proc/self/fd:

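char path[PATH_MAX], real_path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", metadata->fd);
ssize_t path_len = readlink(path, real_path, sizeof(real_path) - 1);
if (path_len < 0) { /* That's an error */ }
real_path[path_len] = '\0';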
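The same lookup can be wrapped in a reusable function. The name fd_to_path() is hypothetical; the sketch assumes /proc is mounted, as the technique itself does.

```c
#include <stdio.h>
#include <unistd.h>

/* Resolve the pathname behind an open descriptor via /proc/self/fd.
 * Writes a NUL-terminated string into `out`.
 * Returns the string's length, or -1 on failure. */
ssize_t fd_to_path(int fd, char *out, size_t size)
{
    char link[64];

    if (size == 0)
        return -1;
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);

    /* readlink() does not terminate the result, so reserve one byte. */
    ssize_t len = readlink(link, out, size - 1);
    if (len < 0)
        return -1;
    out[len] = '\0';
    return len;
}
```

Keep in mind that the result is the path as seen at the moment of the call; the file may have been renamed or unlinked since it was opened.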

Imagine you want to ban access to files containing EvilSignature in their first 1KB of data. Real antivirus software is far more sophisticated than that, but the big picture stays the same (Listing 3).

Listing 3

Banning Files

Operations on metadata->fd don't generate further fanotify events (there would be an infinite loop otherwise). To convey a permission decision, you write an instance of struct fanotify_response to fanotify's fd. You must set r->fd to the file descriptor in question, and r->response is either FAN_ALLOW or FAN_DENY. Fanotify handles permission events in first-in, first-out fashion, and until you reply, the requesting process remains blocked. If access is denied, the process gets an EPERM error. However, when your fanotify application terminates, all unhandled permission events are granted implicitly. Keep this in mind if you design security software.
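With the listing missing from this copy, here is a sketch of what the scan-and-reply step might look like. It is an illustration under assumptions, not the author's code: is_infected() and judge() are hypothetical names, and the signature search is a naive byte comparison over the first 1KB.

```c
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>
#include <unistd.h>

#define SIGNATURE "EvilSignature"

/* Return 1 if the first 1KB of `fd` contains SIGNATURE, 0 otherwise. */
int is_infected(int fd)
{
    char buf[1024];
    size_t m = strlen(SIGNATURE);

    /* pread() reads from offset 0 regardless of the descriptor's position. */
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    if (n <= 0)
        return 0;

    for (size_t i = 0; i + m <= (size_t)n; i++)
        if (memcmp(buf + i, SIGNATURE, m) == 0)
            return 1;
    return 0;
}

/* Answer one permission event: scan, reply, release the descriptor.
 * Returns 0 on success, -1 if the reply could not be written. */
int judge(int fan_fd, const struct fanotify_event_metadata *metadata)
{
    struct fanotify_response r = {
        .fd = metadata->fd,
        .response = is_infected(metadata->fd) ? FAN_DENY : FAN_ALLOW,
    };

    ssize_t written = write(fan_fd, &r, sizeof(r));
    if (written != (ssize_t)sizeof(r))
        perror("write");

    close(metadata->fd);
    return written == (ssize_t)sizeof(r) ? 0 : -1;
}
```

A caller would invoke judge() only for events whose mask contains FAN_OPEN_PERM; other events need no reply.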

The above example was intentionally simple. Refer to the fanotify(7) man page for a more elaborate version.

#### Flying Higher

Having a native API for filesystem monitoring is good, but sometimes not exactly what you want. It might be too low-level for Python code, or it could be too Linux-specific for cross-platform development. For these situations, higher level libraries wrap platform specifics in an easy-to-use interface.

If getting filesystem notifications in Python code is all you need, consider pyinotify [3], best described as "inotify, the Python way." It supports both Python 2 and Python 3. To install watches with the add_watch() method, you use a pyinotify.WatchManager instance (usually a singleton). The pyinotify.Notifier class is a central dispatcher hub. Pyinotify provides a simple blocking notifier, a threaded notifier, and integrations for popular Python asynchronous frameworks, such as the asyncore/asyncio modules and Tornado. When an event fires, pyinotify runs a handler you define (a Python callable). If you want to chain handlers, consider using pyinotify.ProcessEvent instead of plain functions or lambdas.

On the other hand, python-fanotify [4] is best described as Python bindings to the native C API. This module comes from Google and has zero Python code inside except for setup.py. Documentation is missing as well (sans two examples), which is probably not a big deal. The API stays the same, except you prefix identifiers with fanotify plus a dot and rename them to match Python standards; so, fanotify_init() becomes fanotify.Init() and FAN_EVENT_NEXT() translates to fanotify.EventNext(), because Python has no notion of macros.

For dessert, try Watchdog [5]. As opposed to the first two libraries, which wrap a single API, this one abstracts several OS-dependent mechanisms, such as inotify on Linux and kqueue on FreeBSD. To use the library, you create a watchdog.observers.Observer thread object. Watchdog detects your target platform and chooses the appropriate notification mechanism automatically. Then, you implement an event handler, which is a class that inherits from watchdog.events.FileSystemEventHandler and overrides instance methods like on_moved() or on_created(), which are probably self-explanatory. Now, you "schedule" monitoring for a specific directory with observer.schedule(). Watchdog can monitor recursively if you pass the recursive=True keyword argument. Finally, spawn the monitoring thread with observer.start() and start receiving notifications.

Watchdog also provides the watchmedo command-line tool for your shell scripts. Watchmedo executes shell commands in response to various filesystem events and serves as a reference for how to use the library in a real-world project.

Command of the Month: inotifywait

Watchmedo isn't the only command to mate filesystem notifications with shell scripts. The inotifywait command, along with its cousin inotifywatch, is the de facto standard in Linux. Both come in a single package, often called inotify-tools.

The purpose of these tools is to wait for filesystem events in selected directories then dump some statistics. The difference is inotifywait's output is easy to parse (and is configurable with --format and --csv), whereas inotifywatch prints a human-readable table (Figure 4).

The command-line syntax is also rather similar. You supply a path you want to monitor (either a file or a directory). The -r switch enables recursive operation. To exclude certain pathnames, use the --exclude option, which accepts a regular expression. Events to monitor are specified with -e. By default, inotifywait captures a single event, but you can override this with --monitor/-m. In this mode, the command executes forever. To do the same thing, but dump events to a file rather than stdout, use --daemon/-d. This doesn't apply to inotifywatch, which lasts until you interrupt it with Ctrl+C or until a timeout, specified with -t and the number of seconds, expires.

Study this snippet from an inotifywait session:

$ inotifywait -rm -e create,access /tmp
/tmp/ CREATE tmpfBnccrk
...
/tmp/mc-val/ CREATE extfs1C1AQYMathJax.js
/tmp/mc-val/ ACCESS extfs1C1AQYMathJax.js
...

Here, I instructed Midnight Commander to open a ZIP archive and viewed a file in it. The output spans a few dozen lines: /tmp is a busy place on a live Linux system.

Figure 4: Inotifywatch can gather some quick stats on what's going on in your /home while you are out.

The Author

Valentine Sinitsyn develops high-loaded services and teaches students completely unrelated subjects. He also has a KDE developer account that he's never really used.
