Faster Finds with the xargs Command

Turbocharger

© Lead Image © rouslan, 123RF.com

Article from Issue 303/2026
Author(s): Thomas Reuß

If you're processing a group of files in a single command pipe, the xargs tool just might save you some precious execution time.

Everyone is familiar with the classic problem of finding a string of text in a mass of files. The naive approach is to call grep with a wildcard (Listing 1, first line). But be aware that this approach often ends with the error shown in the second line of Listing 1. Houston, we have a problem: The shell expands the wildcard to so many filenames that the resulting argument list exceeds the limit the operating system allows for a single command.

Listing 1

grep

$ grep -Hni needle haystack/*
bash: /usr/bin/grep: Argument list too long
$ find . -type f -exec grep -H -n -i needle {} \;

Another option is find. If you take a quick look at the man page for the find tool, you will soon encounter the -exec parameter. With this approach, find passes each search result to the command that follows -exec; a pair of curly braces marks the spot where the filename is inserted (Listing 1, last line). A final semicolon, which you need to escape with a backslash so that the shell does not interpret it, tells find where the -exec command ends.

This approach does solve the problem, but at a price. What happens here is slow and, above all, wasteful. For each search result, find has to start a separate grep process, irrespective of whether the search term that you want grep to find even exists in the file.

A far more efficient way to solve this problem is to combine find with the xargs utility. Instead of processing the matches individually with grep, delegate this task to xargs using a pipe.

The xargs tool reads a list of items from standard input and executes the command you specify on the items in the list. In other words, rather than invoking a separate process for each file, you can use one process for a series of files. If the list is too long to fit into a single command line, xargs will execute the command multiple times until it has processed the complete list.
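The batching behavior can be observed with a small experiment. The following sketch is an illustration only: seq stands in for a list of filenames and echo for the command to run; neither is taken from the listings in this article. Because each echo call prints exactly one line, counting the output lines counts the processes started.

```shell
# seq stands in for a list of 1,000 filenames; echo stands in for grep.
# Each echo invocation prints exactly one line, so wc -l counts invocations.

# One process per item, comparable to find -exec ... \;
per_item=$(seq 1 1000 | xargs -n 1 echo | wc -l)

# Let xargs pack as many items as fit into each command line
batched=$(seq 1 1000 | xargs echo | wc -l)

echo "one item per call: $per_item invocations"
echo "batched by xargs:  $batched invocation(s)"
```

With a list this short, the second count should be 1 on a typical Linux system. Much longer lists are split across several invocations.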

The first command in Listing 2 searches for files with a .txt suffix. Instead of launching grep for every single file, though, xargs handles the task of transferring the search results to grep.

Listing 2

grep and xargs

$ find . -type f -name "*.txt" | xargs grep -Hni needle
$ grep -Hni needle haystack/2025-10 Article Linux Magazine.txt
$ find . -type f -name "*.txt" -print0 | xargs -0 grep -Hni needle

This approach is considerably more efficient because xargs does not just forward one file as an argument, but several. Exactly how many is very much dependent on the operating system and security limits. Obviously, anyone who is allowed to pass unlimited parameters to a program might be able to use it to launch an attack on the operating system. Ultimately, a few variables determine xargs's scope, but more on that subject later.

Now that I've glued find and grep together with xargs as the man-in-the-middle, the setup runs like clockwork and is significantly faster than before. Unfortunately, I'm now seeing masses of errors. It looks like there are problems with folders and files that have names containing spaces. A quick check shows that, in this case, xargs sends the matches to grep word by word because it thinks the space is a separator. This causes errors because files with these names do not exist. The second line of Listing 2 shows what grep might be seeing.

The solution is to tell find to terminate each filename with a null byte instead of a newline before passing it to xargs. The -print0 switch does this; xargs has precisely the counterpart I need in the form of the -0 switch, which makes it split its input at null bytes only. The two tools complement each other perfectly. The command in the third line of Listing 2 does the trick.
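The difference between the two separators can be reproduced in a scratch directory. The sketch below is a minimal demonstration under assumptions: The temporary directory, the filename my notes.txt, and the use of grep -l (print matching filenames only) are made up for illustration.

```shell
# Scratch directory with one file whose name contains a space
# (directory and filename are made up for this demonstration).
tmp=$(mktemp -d)
printf 'a needle in here\n' > "$tmp/my notes.txt"

# Whitespace-separated: xargs splits the name at the space, so grep is
# asked to open two files that do not exist and finds nothing.
broken=$(find "$tmp" -type f -name "*.txt" | xargs grep -l needle 2>/dev/null || true)

# Null-separated: -print0 comes after -name so that only matching names
# are printed, and xargs -0 hands each name to grep in one piece.
fixed=$(find "$tmp" -type f -name "*.txt" -print0 | xargs -0 grep -l needle)

echo "without -0: ${broken:-<no match>}"
echo "with -0:    $fixed"

rm -rf "$tmp"
```

Note the position of -print0 in the find call: find evaluates its expression from left to right, so a -print0 placed before -name would print every file, not just the ones that match.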

Optimum Design

The xargs utility offers several options. For instance, --max-args (or -n for short) lets you limit the number of arguments passed to each command invocation. Another powerful feature, and my personal favorite, is --max-procs (or -P for short), which lets you control how many processes run in parallel. For example, xargs --max-procs 4 grep … searches through the find results with up to four grep instances at a time. On modern multicore CPUs, this is pretty much the Holy Grail for power users. However, other factors come into play when grepping in parallel. Parallelization does not necessarily make grep faster, especially on hard disks, where the read/write heads would have to move very frequently. Even on an NVMe SSD, the first command from Listing 3 is faster than the second.

Listing 3

Speed

$ find . -type f -name "*.txt" -print0 | \
  xargs -0 grep -Hni NEEDLE
$ find . -type f -name "*.txt" -print0 | \
  xargs -0 -P 4 -n 40 grep -Hni NEEDLE
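To see what -n and -P do in isolation, a toy example without find or grep may help. This is a sketch under assumptions: seq supplies made-up input items, echo shows how they are grouped, and four sleep 1 calls stand in for four independent jobs. Note that -P is an extension found in GNU and BSD xargs, not a POSIX option.

```shell
# -n 2 (--max-args 2): at most two items per command line.
grouped=$(seq 1 5 | xargs -n 2 echo)
echo "$grouped"

# -P 4 (--max-procs 4): up to four commands at once. Four "sleep 1" calls
# stand in for four independent jobs; run one after the other, they would
# need about four seconds.
start=$(date +%s)
printf '1\n1\n1\n1\n' | xargs -n 1 -P 4 sleep
end=$(date +%s)
echo "four 1-second jobs took about $((end - start))s"
```

The first command prints three lines (1 2, 3 4, and 5). The second part should finish in roughly one second, because the jobs only wait and do not compete for a shared resource, which is exactly what is not true for several grep processes reading from the same disk.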

The situation is completely different when it comes to parallelizing complex, CPU-intensive tasks, such as encoding or transcoding. This becomes particularly noticeable if you decide to encode a large number of WAV files to Ogg Vorbis. The individual tasks can easily be distributed across separate processes. There is no need for synchronization, because each process reads and writes its own files, so it is almost impossible for multiple processes to write to a shared object at the same time. The one exception is standard output, which all the processes share; their interleaved messages can be quite confusing.

On a multicore CPU, each core can run its own oggenc process. The number of CPU cores can easily be determined using the standard nproc tool. I chose two use cases for this benchmark, as shown in Listing 4. In the first case, the test involves having oggenc convert a series of WAV files sequentially. In the second case, I use find to pass the files to xargs; xargs is then supposed to pass the values sent to it by find, shown as %, to oggenc. In both cases, I record the time required for processing using the time utility.

Listing 4

oggenc vs. oggenc Parallel

$ time oggenc -q9 *.wav
$ time find . -type f -iname "*.wav" -print0 | xargs -0 -n 1 -I % -P $(nproc) /usr/bin/oggenc -q9 %

Before you jump to conclusions about the results, first take a look at the xargs -n 1 and -I % parameters. The first parameter makes xargs pass exactly one filename, and thus exactly one WAV file, to each oggenc call. This is essential, because otherwise a single oggenc process would receive a whole batch of files and encode them sequentially. The second parameter, -I %, defines % as a placeholder that xargs replaces with the current filename; you can see it at the very end of the command. Note that -I on its own already causes xargs to run one command per input item, so the two options overlap here. The -q9 option simply tells oggenc to encode at nearly the maximum quality.

Unsurprisingly, the difference is considerable. The time utility measures the command runtimes, showing a significant difference in real time (see the box entitled "About Time"). Thanks to parallel processing, the xargs variant completes the same task in about a third of the time taken by the sequential variant.
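If you want to try the pattern from Listing 4 without oggenc or a collection of WAV files at hand, a stand-in works just as well. The following sketch rests on assumptions: gzip takes the place of oggenc, four generated text files take the place of the WAV files, and all names are made up. The -n 1 option is left out because -I already runs one command per item.

```shell
# Stand-in for Listing 4: gzip replaces oggenc, and four small text files
# replace the WAV files (all names are made up for the demonstration).
tmp=$(mktemp -d)
for i in 1 2 3 4; do
  seq 1 20000 > "$tmp/track $i.txt"
done

# One gzip process per file (-I %), with as many processes in parallel
# as nproc reports CPU cores.
find "$tmp" -type f -name "*.txt" -print0 |
  xargs -0 -I % -P "$(nproc)" gzip %

compressed=$(find "$tmp" -type f -name "*.gz" | wc -l)
echo "compressed files: $compressed"

rm -rf "$tmp"
```

With files this small, the run is far too short to measure any speedup; the point is only to show how the options fit together.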

About Time

Even after many years of using Linux, I was never totally clear about the exact meaning of the three times provided by time (Listing 5). Thanks to xargs, I am now a little smarter, because the time man page provides the information I was looking for. The real time refers to the wall clock time (i.e., the actual time elapsed from the start to the end of the process). The term actual or real time would probably make Albert Einstein spin in his grave; Newtonian time might be a better choice of wording.

The user time, on the other hand, is the time that the CPU spends running in user mode, in other words, the time spent executing non-kernel code within the process. The sys time is the time the CPU spends executing kernel code within the process. Be warned, the user time added to the sys time does not necessarily equal the real time: The times for process switching and waits for resources such as disk I/O are not included. Conversely, when several processes run in parallel, the user time can exceed the real time because the CPU time of all cores is added up, as the parallel run in Listing 5 shows.

Listing 5

Runtimes

# sequential processing:
    real    0m53.998s
    user    0m52.653s
    sys     0m1.265s
# parallel processing:
    real    0m16.871s
    user    1m5.848s
    sys     0m1.581s
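The gap between real time and CPU time is easy to provoke with a command that does nothing but wait. The sketch below assumes that bash is installed, because it relies on the time keyword built into bash; the sleep 1 command is chosen purely for illustration.

```shell
# Requires bash, whose built-in time keyword writes its report to stderr.
# sleep only waits, so real is about one second while user and sys stay
# close to zero: waiting counts toward real time, not toward CPU time.
report=$(bash -c 'time sleep 1' 2>&1)
echo "$report"
```

The report contains the same three lines as Listing 5, but here real is far larger than the sum of user and sys.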

In addition to the parameters I have discussed so far, I would also like to mention the -t switch. Contrary to all the usual conventions, -t activates verbose mode, for example, for debugging purposes. The --show-limits option reports the actual sizes for parameters on the current system. For instance, you can discover the maximum total length of the arguments and the maximum command length.
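Both switches can be tried safely without touching any files. The following sketch makes a few assumptions: The input items a, b, and c are made up, echo serves as a harmless command, and --show-limits requires GNU xargs.

```shell
# -t: print each command line to stderr before running it.
# 2>&1 >/dev/null keeps the trace and discards echo's normal output.
trace=$(printf 'a\nb\nc\n' | xargs -t -n 2 echo 2>&1 >/dev/null)
echo "$trace"

# --show-limits (GNU xargs) reports the limits on stderr. Reading from
# /dev/null with -r (--no-run-if-empty) keeps xargs from running anything.
limits=$(xargs --show-limits -r < /dev/null 2>&1)
echo "$limits"
```

The trace shows the two echo command lines that xargs builds from the three items. The figures reported by --show-limits differ from system to system and also depend on the size of the current environment.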

Clever Collection

I often collect files with specific characteristics in a separate folder so that I can more easily correlate them. For this scenario, xargs has the right tool: placeholders.

Suppose I want to store the logfiles from the database, web server, operating system, and so on from the last three days in the LogAnalysis/ folder. I can use find to select the relevant logfiles and then let xargs take care of the copying (Listing 6).

Listing 6

Placeholders

$ find /var/log/ -type f -mtime -3 -print0 | \
  xargs -0 -I {} cp -av {} ~/temp/LogAnalysis/
$ find . -type f -name "*.txt" | \
  xargs -I {} sh -c 'ls -l {}; du -h {}'

In Listing 6, find uses -print0 to write null-terminated filenames to stdout; xargs reads them with -0 and inserts each one in place of the {} placeholder, which puts it in the source position of the cp command line. Please note that if you omitted this placeholder, cp would swap the source and destination. Without the placeholder, xargs would simply append the filenames to the end of the cp command line, so that the last one would be treated as the copy destination rather than a copy source.

Of course, xargs is not restricted to running a single command per item. You can also hand the string stored in the placeholder to multiple programs. Instead of starting a process directly, the second call in Listing 6 starts another shell, which then receives the placeholders from xargs. This means that you can use a single command to create a file listing with ls and then display the storage requirements directly afterwards with du.
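The role of the placeholder position can be checked without touching /var/log. The following sketch works under assumptions: It uses a scratch directory with two empty, made-up logfiles and copies them with the same pattern as the first call in Listing 6.

```shell
# Scratch source and destination directories (made-up names).
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/dest"
touch "$tmp/src/db.log" "$tmp/src/web server.log"

# {} marks where each filename goes: in the source position, before the
# destination directory.
find "$tmp/src" -type f -print0 |
  xargs -0 -I {} cp -a {} "$tmp/dest/"

copied=$(find "$tmp/dest" -type f | wc -l)
echo "files copied: $copied"

rm -rf "$tmp"
```

Because -I runs one cp process per file, this pattern trades some speed for control over the argument order.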
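One caveat applies to this technique: When the placeholder is written directly into the sh -c string, a filename containing spaces, quotes, or other shell metacharacters becomes part of the script text. A more defensive variant, sketched below under assumptions (the scratch directory and the filename are made up), passes the filename to the inner shell as a positional parameter instead.

```shell
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/my notes.txt"

# The filename arrives in the inner shell as $1 (the extra "sh" fills $0),
# so it is quoted like any other variable instead of being pasted into
# the script text.
listing=$(find "$tmp" -type f -name "*.txt" -print0 |
  xargs -0 -I {} sh -c 'ls -l "$1"; du -h "$1"' sh {})
echo "$listing"

rm -rf "$tmp"
```

The output consists of one ls line and one du line for the file, even though its name contains a space.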

Conclusion

The xargs tool is one of the unsung heroes in the Linux universe. It really comes into its own when used in conjunction with find. This is especially true when it comes to complex manipulation of existing files, such as encoding, modifying the Exif data, and many other uses.

The Author

Thomas Reuß is a passionate Linux admin who is hugely interested in security. He is currently working as a consultant in the SAP environment.
