Using fuzzy searches with tre-agrep

A Grep Replacement

Article from Issue 186/2016
Author(s):

Tre-agrep has all of grep's functionality but can also do ambiguous or fuzzy searches without deep knowledge of regular expressions.

Grep [1] is a standard command-line tool. It searches files for regular expressions, then displays any lines that include a match. In expert hands, grep can be a flexible tool, but gaining expertise can take years of practice. Nor do related commands like egrep [2] or fgrep [3] make grep any easier to use. For these reasons, those lacking expertise might want to check out TRE [4], which includes a reimplementation of agrep (approximate grep) [5] as a command-line utility. tre-agrep is a grep-like tool that has all of grep's functionality but can also do ambiguous or fuzzy searches that are much easier to learn.

Grep and tre-agrep share similar options, such as --ignore-case and --count. However, the logic of their searches can be different. (I say "can be" because often both commands have multiple ways of getting the same result.) To give a simple example, imagine that you are searching for files that contain "Linux" or "Linus." Using grep, you would probably use regular expressions one way or the other. Probably the simplest would be:

grep 'Linu.' *.txt

Here, the period in Linu. indicates that any single character can take its place, giving results with both "Linux" and "Linus." This use of a regular expression is relatively simple, but it must be entered and positioned accurately. If it were more complicated, newer users might be put off by a series of familiar and unfamiliar characters used with an unfamiliar syntax.

By contrast, with tre-agrep, the command is more likely to use an option for ambiguity:

tre-agrep -1 'Linux' *.txt

The option here means that the results should include those with one character different from the string "Linux." This command requires both less precision and less user knowledge, but perhaps at the price of more irrelevant results (Figure 1). Moreover, the entered command would find typos anywhere in the string, not just in the last letter. Notice, too, that both commands begin displaying the results with the name of the file and end with the current account and the file path.
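The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it compares whole words rather than searching inside lines as grep and tre-agrep do, and the helper within_one_edit and the word list are hypothetical, not part of either tool.

```python
import re

def within_one_edit(a, b):
    """True if a and b differ by at most one inserted, deleted, or substituted character."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]  # exactly one extra character in b

# Hypothetical sample words, including typos in different positions
words = ["Linux", "Linus", "Lunux", "Linx", "Linuxx", "Minix"]

# The regular expression only tolerates a change in the final position
regex_hits = [w for w in words if re.fullmatch(r"Linu.", w)]

# The one-error search tolerates a single change anywhere in the word
fuzzy_hits = [w for w in words if within_one_edit("Linux", w)]

print(regex_hits)
print(fuzzy_hits)
```

The regular expression finds only "Linux" and "Linus," whereas the one-error check also accepts "Lunux," "Linx," and "Linuxx," which is the broader (and noisier) behavior described above.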

Figure 1: Set to search for "Linux" or a word with a one-character difference, tre-agrep locates "Linus."

By default, tre-agrep looks only for exact matches. If you want approximate matches as well, you can specify the number of errors. According to the man page, this number is an upper limit: If you set the command to look for results with four errors, results with three or fewer errors are included as well. If that returns too many results, you might want to make several searches with different limits.

The original version of agrep was developed in 1988-2001 by Udi Manber and Sun Wu. Originally written for Unix, this version was widely ported to other operating systems, but it's rare in Linux distributions, because for years, it was released under a non-free license. Since 2014, it has been released under the ISC Open Source License [6], but either the new license is not recognized as free, or the change has gone unnoticed, because Debian still includes it in the non-free section of its repositories.

Today, the most common version is tre-agrep, written in 2002-2004 by Ville Laurikari. Tre-agrep uses a different library from the Manber and Wu version and is released under a BSD license. Most distributions include it in their repositories, although not as part of the default installation.

When used without any options, tre-agrep's output is identical to grep's. However, it is the options that make tre-agrep's results different. All of tre-agrep's options come under one of three categories: options for approximations, regular expressions, and output filtering and formatting.

Options for Setting Approximations

Approximations or fuzzy logic are at the heart of tre-agrep. The man page describes the number of differences as the cost (based on the Levenshtein distance [7]), which is a count of how many characters a result can differ from the precise string entered in the command. By default, a missing, an extra, or a substituted character each has a cost of 1, although you can change these costs with --delete-cost=NUMBER (-D NUMBER), --insert-cost=NUMBER (-I NUMBER), or --substitute-cost=NUMBER (-S NUMBER) to reflect your needs.
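To make the idea of cost concrete, here is a minimal Python sketch of a weighted Levenshtein distance. It is not tre-agrep's implementation (the TRE library works on regular expressions, not just fixed strings); the function edit_cost and its parameter names are hypothetical and simply mirror the three cost options mentioned above.

```python
def edit_cost(pattern, text, delete_cost=1, insert_cost=1, substitute_cost=1):
    """Cheapest total cost of turning pattern into text.

    delete_cost:     a pattern character is missing from the text
    insert_cost:     the text has an extra character
    substitute_cost: a pattern character is replaced by another one
    """
    # Row for the empty pattern: every text character is an extra one
    prev = [j * insert_cost for j in range(len(text) + 1)]
    for i, pc in enumerate(pattern, 1):
        cur = [i * delete_cost]  # empty text: every pattern character is missing
        for j, tc in enumerate(text, 1):
            cur.append(min(
                prev[j - 1] + (0 if pc == tc else substitute_cost),
                prev[j] + delete_cost,
                cur[j - 1] + insert_cost,
            ))
        prev = cur
    return prev[-1]

print(edit_cost("Linux", "Linus"))                     # one substitution
print(edit_cost("Linux", "Linx"))                      # one missing character
print(edit_cost("Linux", "Linus", substitute_cost=3))  # delete + insert is now cheaper
```

With the default costs, each single change counts as 1. Raising the substitution cost to 3 makes the pair "Linux"/"Linus" cost 2 instead, because deleting one character and inserting another becomes the cheaper route. This is why adjusting the weights changes which results count as close.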

The concept of cost is used without explanation in the command's help, but the usefulness of the concept soon becomes clear enough. Cost is a way to judge output records and sort through them. Most of the time, although not always, the lower the cost, the closer the result is likely to be to your intention. Conversely, the higher the cost, the greater the chance that an output record is irrelevant. However, if you know, for example, that relevant results are most likely to be a substitution, you can set the cost of substitutions to 0, lowering their cost and making them easy to find with an output option such as --best-match or --show-cost (see below).

If you are not interested in changing the cost of approximations, the concept of fuzzy results is straightforward. The most useful option for approximations is -#, in which # should be replaced in a command by a digit from 0 (an exact match) to 9 (up to nine errors), with "error" being the name for any deviation from the string entered as part of the command. You can also further filter output records via --max-errors=NUMBER (-E NUMBER). These are simple but powerful options, and they are easily remembered.

Options for Regular Expressions

Regular expressions are search patterns, in which characters stand for other groups of characters in files, the contents of files, or locations in a file [8]. Both grep and tre-agrep can use the same standard set of regular expressions (Table 1).

Table 1: Common Regular Expressions

Character Keys   Meaning
.                Any single character
*                Zero or more instances of the preceding regular expression
^                The following regular expression at the start of a line
$                The preceding regular expression at the end of a line
[]               Any one of the characters in the brackets
\                Turn off the next character's meaning as a regular expression
\<               Characters at the start of a word
\>               Characters at the end of a word
?                One or zero instances of the preceding regular expression
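The entries in Table 1 can be tried out with a short Python sketch. Two caveats apply: Python's re module is not the engine that grep or tre-agrep use, and it has no \< or \> anchors, so \b (a word boundary) stands in for the start-of-word case here. The helper grep_lines and the sample lines are hypothetical.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for pattern, like a bare grep."""
    return [line for line in lines if re.search(pattern, line)]

lines = [
    "Linux is a kernel",
    "Linus wrote the kernel",
    "GNU/Linux",
    "colour or color",
    "version 1.5",
    "version 105",
]

print(grep_lines(r"^Linu.", lines))    # ^ anchors to the start, . is any character
print(grep_lines(r"Linux$", lines))    # $ anchors to the end of the line
print(grep_lines(r"Linu[xs]", lines))  # [] matches any one listed character
print(grep_lines(r"colou?r", lines))   # ? makes the preceding u optional
print(grep_lines(r"10*5", lines))      # * allows zero or more of the preceding 0
print(grep_lines(r"1.5", lines))       # unescaped . matches both 1.5 and 105
print(grep_lines(r"1\.5", lines))      # \ turns the . into a literal period
print(grep_lines(r"\bkern", lines))    # \b here plays the role of \<
```

Comparing the last-but-two and last-but-one searches shows why escaping matters: without the backslash, the period in 1.5 also matches the 0 in 105.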

Regular expressions can be entered directly into the string part of the command. However, ambiguity sometimes can be reduced by using the option --regexp=PATTERN (-e PATTERN). In particular, this option can be useful if a search includes a hyphen (-), which might be misinterpreted as introducing an option, or a forward slash (/), which might be read as introducing a directory.

As in grep, a search for regular expressions can be refined in several ways. With --ignore-case (-i), a regular expression treats lower and upper case letters the same, both in the search pattern and in the contents of the input files. With --literal (-k), the search pattern is read as though it has no special characters in it. You can also use --word-regexp (-w) to match only whole words, or --invert-match (-v) to select records that do not match the regular expression you entered. These refinements can help filter results, but they can add another level of complexity; therefore, unless you have a special need, you might first prefer to focus only on using regular expressions until you are comfortable with basic patterns.

Output Options

Some of tre-agrep's output options are less well known than those for approximations, but some can be almost as useful. Some are identical to grep's, such as --quiet (-q), which suppresses output, letting you know only that a match has been found, or --files-with-matches (-l), which lists only the names of files with matching results. Still another option shared with grep is --count (-c), which only tells you the number of matches in each file, but does not display them (Figure 2).

Figure 2: The --count option lists the number of matches in each file in the directory.

However, by far the most useful option for filtering results is --best-match (-B), which displays only the records with the lowest cost, that is, those closest to the string you entered in the command. By using this option, especially with approximations, you can reduce the results through which to scroll, although possibly at the cost of missing serendipitous results.

Another way to judge results is to add --show-cost (-s), which displays the cost directly after the file name at the start of the result. By seeing how far a result differs from the string you enter, you might be able to judge each result's reliability and usefulness.
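The way a maximum error count, the best match, and the displayed cost relate to one another can be sketched in Python. This is an illustration only, not how tre-agrep is implemented: it handles plain strings with a cost of 1 per error, and the functions match_cost, fuzzy_search, and best_match, as well as the sample lines, are hypothetical.

```python
def match_cost(pattern, line):
    """Lowest number of errors between pattern and any substring of line."""
    # Row for the empty pattern: a match may start anywhere in the line for free
    prev = [0] * (len(line) + 1)
    for i, pc in enumerate(pattern, 1):
        cur = [i]  # nothing consumed from the line: i pattern characters are missing
        for j, lc in enumerate(line, 1):
            cur.append(min(
                prev[j - 1] + (pc != lc),  # same character, or a substitution
                prev[j] + 1,               # character missing from the line
                cur[j - 1] + 1,            # extra character in the line
            ))
        prev = cur
    return min(prev)  # a match may also end anywhere in the line

def fuzzy_search(pattern, lines, max_errors):
    """Keep (cost, line) pairs whose cost does not exceed max_errors."""
    scored = [(match_cost(pattern, line), line) for line in lines]
    return [(cost, line) for cost, line in scored if cost <= max_errors]

def best_match(results):
    """Keep only the records that share the lowest cost."""
    if not results:
        return []
    lowest = min(cost for cost, _ in results)
    return [(cost, line) for cost, line in results if cost == lowest]

lines = ["Linus Torvalds", "Minix came first", "Windows"]
results = fuzzy_search("Linux", lines, max_errors=2)
print(results)              # each record paired with its cost
print(best_match(results))  # only the closest record remains
```

In this sample, two lines fall within two errors of "Linux," but keeping only the lowest cost leaves just the line with "Linus." The line with "Minix," which some readers might have found interesting, is dropped, which is the trade-off described above.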

Other output options format rather than filter results. For example, --color (--colour) is almost always useful, because it highlights results in the output strings, using the GREP_COLOR environment variable. Similarly, you can use --show-position (Figure 3) to prefix each output record with the start and end positions of the first match within the record (the first character of the match and the first character after the match). You might also help organize results by prefacing each output record with the name of the file in which it is located, using --with-filename (-H). As you continue with your work, you might also find it useful to number each output record by adding the option --record-number (-n).

Figure 3: Here, the --show-position option shows two matches in nearly identical files, each of which starts three characters from the start of a file and ends 11 characters from the start.
