Fuzzy text searches with agrep

Better Finds

© Lead Image © 3dalia, 123RF.com

© Lead Image © 3dalia, 123RF.com

Article from Issue 184/2016

The agrep tool expands on grep by adding fuzzy search capabilities to text string-matching operations.

The grep command, which allows users to find strings and patterns in text files, is probably familiar to Linux users who use the command line; however, its variants are less well known. For example, the following two commands are precisely equivalent:

egrep <term> <file(s)>
grep -e <term> <file(s)>

The tool interprets the <term> as an extended regular expression. By contrast, fgrep interprets this as being equivalent to grep -f, where all components in <term> are normal characters, thus ignoring their potential regex meanings. It thus works a little faster than a plain vanilla grep, and this is especially noticeable when searching large volumes of data. A third candidate, rgrep, works like grep -r, recursively parsing folder structures, which affects its speed. All these commands have one thing in common: They only find direct hits for <term>; a search for similar terms only works if you design a matching regex.


The grep variant agrep implements the methods provided by the Levenshtein algorithm (see the "Levenshtein Distance" box) to find strings or patterns in text files. In contrast to grep, agrep can only interpret relatively simple patterns, but it can also find similar terms and often works faster than the original.

Levenshtein Distance

Consider the strings "grep" and "gerp," which differ by two letters in different positions, whereas "grap" or "grip" change one letter, and "egrep" adds a letter. To define such deviations in a mathematically precise way, the Russian mathematician Vladimir Iosifovich Levenshtein defined the Levenshtein distance [1] in 1965. This value, also known as the edit distance, is used as a measure of the minimum number of insertions, deletions, and replacement operations for converting one string to another [2]. If you weight the algorithmic "cost" of the necessary operations, you arrive at the weighted Levenshtein distance (WLD). Evaluating the Levenshtein distance makes it possible to find similar words, compensate for spelling errors, and generally determine minor word differences and ignore them where appropriate.

Originally, agrep [3] was written for an index system named Glimpse [4] – a kind of predecessor of today's desktop search engine – at the University of Arizona. Today, Glimpse is hardly used; on the desktop, Recoll [5] or similar, much more powerful programs have taken over its role. The syntax of agrep closely follows that of grep and adheres to the pattern:

$ agrep [<options>] <pattern> [<file>] ...

Linux currently uses two versions of agrep: the classic version from the Glimpse suite [6] and a newer one based on the TRE library [7]. The former often works faster, but does not support Unicode characters and other multibyte encoding [8]. Therefore, most distributions now provide the TRE version of the program. The two versions differ not only in performance but also in terms of options (Table 1).

Table 1

agrep Options


TRE agrep

Glimpse agrep

Define patterns; useful for patterns that start with "-"

-e <pattern>

Use content of stated file as pattern

-f <filename>

Ignore case



Do not ignore case


Ignore case; replace numbers with numbers, letters with letters


Do not evaluate non-standard characters in the pattern



Pattern describes a whole word



Pattern describes a whole line


Recursive processing


Set details (preset: 1)

Cost(1) of deleting characters



Cost of inserting characters



Cost of replacing characters



Maximum cost of a match



Maximum permissible error count for match (independent of cost)



Manage Output

Show cost for a match


Show match with lowest cost(2)



Suppress prompt for -B


Color highlight matches


Show match count



Show matches without filename



Show matches with filename



Show filenames instead of matches



Always show filenames with matches


Enumerate matches



No output



Show position of rist character of a match(4)



Output line breaks beind matches instead of in front of matches


Line break characters

-d <pattern>(5)

Only show lines without matches



Output additional details


(1) The "cost" <value> weights the operation, influencing its effect on the Levenshtein distance.(2) Requires twice the time (two runs) and does not work for input from STDIN.(3) Exit with return code 0 for a match.(4) Important for developing patterns.(5) Default: \n.

Which variant of agrep you use depends on the purpose and availability. When processing Unicode characters, there is no alternative to the TRE variant; but, because this usually requires significantly more resources and computing time than the classic version, there are good reasons to use the original as well.

Hands On

In principle, you use agrep like the standard grep command and can thus search arbitrary data streams (i.e., data from files as well as the STDIN channel):

$ agrep-tre --col -s -E1 -I 2 Tst *

Like grep, agrep parses streams record by record, searching for patterns. The delimiter decides what constitutes a record. By default, this is a line break (\n), which means that agrep views each line as a record and applies the pattern defined by -e <pattern>.

However, you will see several significant differences between the (GNU) grep and agrep variants. For example, agrep does not support many of the options normal grep command has at its disposal (e.g., context control of the output -A, -B, -C) or the respective include or exclude options as for regular expressions.

The option -d <pattern> for separating the records also deserves special attention. It does not work as you would expect in all cases. For example, if you define -d '', the records should be delimited by spaces, but this does not work in all cases. For such problems with the record boundaries, the options --color, --show-position, and -n (TRE variant) and -b, -n, and -V (classic version) can help you discover the problem.

The tr (translate) tool is also a relatively simple workaround for these problems. You can use the simple command to wrap data streams quickly and reliably to create lines. The syntax is quite simple:

$ tr <options> <search_char> <replace_char>

To convert spaces to line breaks, for example, you would use the command,

tr ' ' '\n'

and to replace several <search_chars>, use:

$ echo "Hello World" | tr "A-Za-z" "a-zA-Z"

The advantage of this variant is that because agrep uses the new lines as record boundaries, the results then match your expectations.

Behind the Scenes

Because almost all distributions provide agrep as one of its standard tools, some shell scripts and GUIs use the command. A typical example is the build program ding [9]. Spellcheckers like Aspell also apply the agrep method implicitly. A phonetic transcription occurs first; the program uses the Levenshtein algorithm to find the best approximation. More complex search engines such as Elasticsearch (Recoll cannot do this) also use the Levenshtein algorithm to generate the best possible results.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Command Line – tre-agrep

    Tre-agrep has all of grep's functionality but can also do ambiguous or fuzzy searches without deep knowledge of regular expressions.

  • Tracked Down

    Searching for text in files or data streams is a common and important function. Ugrep tackles this task quickly, efficiently, and even interactively if needed.

  • Charly's Column: Biabam and Tre-agrep

    Most of the tools that show up in this column are small, smart, fast, and easily explained. This month is no exception; we feature a dynamic duo of tools.

  • ICgrep

    One of the most common tasks when working on computers involves browsing texts for search patterns. Here, ICgrep offers a modern, parallel, and Unicode-enabled alternative to the classic grep.

  • Command Line – Pgrep

    Pgrep is a valuable tool for tracking down processes.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95