Search more efficiently with ugrep
Tutorial – ugrep
Searching for text in files or data streams is a common and important function. Ugrep tackles this task quickly, efficiently, and even interactively if needed.
Grep is one of the oldest Unix commands. The name stands for Global/Regular Expression/Print, or "globally search for a regular expression and print matching lines." It borrows the syntax of the original Unix editor, ed (a descendant of QED), which used g/re/p to search for patterns in text files. In addition to fixed search terms, grep can also search for patterns with wildcard characters. The GNU variant of grep is normally installed on Linux. It extends the original grep in some places, for example, by allowing recursive searches in directories.
Another variant of grep, agrep (approximate grep) [1], extends text searching to include fuzzy searches. It also finds near misses as long as the differences stay below a specified threshold, known as the edit distance. This is calculated from the number of substitutions, deletions, and insertions of letters needed to convert the search pattern into the actual data.
In addition, some grep variants also find search patterns in certain archive types, such as ZIP files. These programs are relatively slow, since they first need to unpack the archive. Alternatively, all grep variants used on Linux can read data from a pipe via standard input and write the results to standard output, which lets you search compressed data that another tool unpacks (Listing 1).
Listing 1
Archive Search
$ zcat archive.gz | grep <pattern>
ugrep
Ugrep can do all of this and more without explicitly unpacking the data streams. In addition, the program is known for its exceptionally fast processing speed. To speed up the search, it uses multiple threads if necessary.
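To illustrate the difference, the pipeline from Listing 1 can be placed next to ugrep's built-in decompression. The following is a minimal sketch, assuming ugrep is installed and using its -z option; the sample file and the search term are made up for the demonstration.

```shell
# Create a small gzip-compressed sample file (hypothetical content).
workdir=$(mktemp -d)
printf 'first line\nneedle in a haystack\nlast line\n' > "$workdir/notes.txt"
gzip "$workdir/notes.txt"    # produces notes.txt.gz and removes notes.txt

# Classic approach: unpack to a pipe, then search the stream.
gzip -dc "$workdir/notes.txt.gz" | grep 'needle'

# ugrep approach: -z decompresses while searching, no pipe required.
ugrep -z 'needle' "$workdir/notes.txt.gz"
```

Both commands should report the same line; the second one simply saves the extra process and the pipe.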
On Debian and Arch Linux, setting up ugrep is easy. Debian has the tool in its repositories; with Arch Linux, you can use the AUR. For all other distributions, you will have to install ugrep from the source code [2]. The commands required for this are shown in Listing 2.
Listing 2
Installing ugrep
$ git clone https://github.com/Genivia/ugrep
$ cd ugrep && ./build.sh
$ sudo make install
Ugrep is written in C++, has been around for several years, and is available not only on Linux, but also on other operating systems. Search patterns specified as regular expressions can span multiple consecutive lines, which many other grep variants cannot handle. By default, ugrep assumes Unicode as the encoding for the search data.
Ugrep supports archive types including CPIO, JAR, PAX, TAR, and ZIP, compressed with all common methods (BZIP, GZ, LZ, and XZ). In addition, you can use filters to prepare data in special formats in advance. For example, PDF documents can be converted to text with a filter, before ugrep performs the search.
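As a sketch of how such a filter might look: the --filter option takes a file extension and the command that converts matching files to text. The example below assumes that pdftotext (from poppler-utils) is installed; the function name search_pdfs is made up for illustration.

```shell
# Search all PDF files below the current directory for a pattern
# and list the files that contain it.
# -O pdf restricts the search to files with the .pdf extension.
# --filter maps the extension "pdf" to a conversion command;
# ugrep replaces % with the path of the file being searched and
# reads the converted text from the command's standard output.
search_pdfs() {
    ugrep -r -l -O pdf --filter='pdf:pdftotext % -' "$@"
}
```

Calling search_pdfs 'invoice' would then list the PDF files whose text contains the word.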
Like all grep variants, the program is largely controlled by options. As usual, most options have a short form (-<O>) and a long form (--<Option>). Table 1 summarizes the most important options.
Table 1
Important Options
Option | Function |
---|---|
-a | Interpret data as text |
-c | Match count |
-e <pattern> | Search for specified pattern (can specify multiple patterns) |
-E | Interpret search patterns as extended regular expressions (default) |
--encoding=<encoding> | Set encoding for data |
-f <file> | Load search pattern from specified file |
-F | Interpret search pattern as string (special characters are considered as text) |
--filter=<commands> | Pre-filter based on specified filter criteria |
-G | Interpret search patterns as simple regular expressions |
-i | Ignore case in pattern |
-N <pattern> | Define negative search pattern |
--not | Interpret all of the following search patterns as exclusion patterns |
-O <extension> | Search only files with the specified extension |
-P | Interpret search patterns as Perl expressions |
--pager[=<command>] | Set pager for terminal output |
-Q[<delay>] | Incremental search with optional delay |
-r | Recursive search |
-w | Word search |
-X | Output in hexadecimal form |
-z | Unpack compressed data streams in advance |
-Z[<n>] | Fuzzy search with set criteria for allowed deletions, insertions, or substitutions |
In addition, the developer suggests a number of alias constructs for the .bashrc, for example, to ensure compatibility with GNU grep (see Table 2). Some of these short forms rely on the ug command variant. In this form, ugrep reads a configuration file (by default $HOME/.ugrep) that can contain special settings. This means that important presets apply implicitly without having to specify them at the command line every time.
Table 2
Suggested Alias Constructs
Alias | Function |
---|---|
alias uq='ug -Q' | Interactive, incremental search |
alias ux='ug -UX' | Binary search |
alias uz='ug -z' | Search in (compressed) archives |
alias ugit='ug -R --ignore-files' | Grep for Git |
Compatibility with classic variants | |
alias grep='ugrep -G' | Search with simple regular expressions |
alias egrep='ugrep -E' | Search with extended regular expressions |
alias fgrep='ugrep -F' | Search without regular expressions |
alias pgrep='ugrep -P' | Search with Perl regular expressions |
Search in compressed data | |
alias zgrep='ugrep -zG' | Archive search with simple regular expressions |
alias zegrep='ugrep -zE' | Archive search with extended regular expressions |
alias zfgrep='ugrep -zF' | Archive search for strings |
alias zpgrep='ugrep -zP' | Archive search with Perl regular expressions |
Ugrep supports several search pattern variants, which you enable through the appropriate options (see the "Patterns" box). Besides simple and extended regular expressions like GNU grep, ugrep also supports Perl regexes and word patterns. In addition to these patterns, which always define positive matches, ugrep can also use negative patterns (exclusion patterns). They let you, for example, ignore matches that occur in comments. Files whose names match a certain pattern can also be excluded from the search. The --not option has a special effect: Ugrep treats all patterns to its right as exclusion patterns.
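A negative pattern can be tried out with a small experiment. The following sketch assumes ugrep is installed and uses the -e and -N options; the file contents are invented, and the pattern pair follows an example from the ugrep documentation.

```shell
# Two sample lines: one with "dispatch", one with "display".
sample=$(mktemp)
printf 'dispatch the job\ndisplay the result\n' > "$sample"

# Positive pattern (-e): words beginning with "disp".
# Negative pattern (-N): skip the match when the word is exactly "display".
ugrep -e '\<disp' -N '\<display\>' "$sample"
```

Only the line containing "dispatch" should be reported, because the negative pattern suppresses the match in the second line.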
Patterns
The term "pattern" appears in several contexts with different meanings in search programs like ugrep. File name patterns determine which files the program processes. File content patterns are the actual search patterns that it looks for in the processed files; with ugrep, these may also span multiple lines. Ugrep and some other search programs also support negative patterns, which exclude files or suppress the corresponding matches. Ugrep takes this quite far: The program's documentation has a separate section, "Search this but not that with -v, -e, -N, --not, -f, -L, -w, -x," that deals with the finer points of this subject.
Extensions
In many places, ugrep extends the classic grep variants. The new features for patterns in file names ("globbing") are particularly interesting. For example, **/ stands for any number of directories, including none. At the end of a path definition, /** stands for everything below that directory. A single ? matches exactly one character (other than a slash), whereas \? matches a literal question mark. The globbing section in the man page summarizes these features and gives numerous examples.
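The effect of a leading **/ can be checked on a throwaway directory tree. This is a minimal sketch, assuming ugrep is installed; it uses the -g option (short for --glob), which restricts the search to files matching a glob, and all directory and file names are made up.

```shell
# Build a small directory tree (hypothetical names).
tree=$(mktemp -d)
mkdir -p "$tree/docs" "$tree/src/docs" "$tree/src/lib"
echo 'TODO: write intro'  > "$tree/docs/intro.md"
echo 'TODO: describe API' > "$tree/src/docs/api.md"
echo 'TODO: refactor'     > "$tree/src/lib/notes.md"

# **/ matches zero or more leading directories, so Markdown files
# in both docs directories qualify, while src/lib/notes.md is skipped.
# -r searches recursively, -l lists only the names of matching files.
(cd "$tree" && ugrep -r -l -g '**/docs/*.md' 'TODO')
```

Because the glob contains a slash, ugrep matches it against the full path name rather than just the file name.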
Special environment variables give you additional control over ugrep's behavior. $GREP_PATH simplifies access to pattern files (i.e., files that define search patterns), which you load with the -f option. Keeping patterns in external files is a good way to preserve complex search patterns permanently.
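How a pattern file and $GREP_PATH work together can be sketched as follows. The example assumes ugrep is installed; the directory, the pattern file name problems, and the log contents are invented for the demonstration.

```shell
# Keep two search patterns in a file, one per line.
patdir=$(mktemp -d)
printf 'error\nwarn\n' > "$patdir/problems"
printf 'error: disk full\ninfo: all good\nwarn: high load\n' > "$patdir/app.log"

# -f reads the patterns from the given file.
ugrep -f "$patdir/problems" "$patdir/app.log"

# With GREP_PATH set, -f can locate the pattern file by name alone
# when no file of that name exists in the current directory.
GREP_PATH="$patdir" ugrep -f problems "$patdir/app.log"
```

Both calls should select the lines that match either pattern, so a collection of pattern files in one directory can be reused from anywhere.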
Some options, including -Q, can call an external editor, which you start with the key combination Ctrl+Y. If the $GREP_EDIT environment variable is set, ugrep uses the editor defined there; otherwise it uses the one defined in $EDITOR.
The $GREP_COLOR and $GREP_COLORS environment variables let you specify when and how ugrep highlights matches in color when you use the --color option. The GREP_COLORS section in the man page describes this in more detail.
But the really outstanding extensions in ugrep are the incremental search feature and the user interface.
User Interface
Grep programs are usually used interactively at the command line, in scripts, or in pipes; in many cases the results then act as input for further commands. This works without restrictions in ugrep, too. In addition, the developer has paid great attention to extended interactive usability. In particular, incremental searching sets ugrep apart from other grep variants. The user interface for this was modeled on editors such as Emacs and is otherwise typically found in GUI programs.
With this type of search, each additional letter you type refines the search and reduces the number of matches. Ugrep always displays all lines that match the input so far. For this form of search, ugrep provides a special interface that you enable with the -Q option. As an argument to -Q, you can specify a small delay that ugrep waits before evaluating the input.
The Q> prompt then appears in the upper left corner of the terminal. Ugrep interprets everything you type as a search pattern; each additional keystroke refines the search. You can correct typos with the Backspace key. In the example in Figure 1, we called ugrep with the -ZQ (fuzzy, interactive) options and searched for "alles" ("everything" in German). Thanks to the fuzzy search, ugrep also finds "alpes," "alls," "ales," and so on.
This feature is so powerful that, in this mode, ugrep can sometimes even replace a pager for displaying output. For example, man ugrep | ugrep -Q displays ugrep's man page and lets you narrow the display down to the lines that contain your search term. You can scroll the output vertically with the arrow keys; Esc exits the mode.
On top of that, you can combine this option with others. If you need to see more than just the line with the match, you can add two context lines before and after the match to the output with -C2. In this form, ugrep is extremely useful as an alias (alias q2='ug -C2 -G '), shell function, or script.
The ability to search archives is a similar case. Many modern documents come in complex formats such as EPUB or ODF. There, a plain search only reaches the metadata of the document container, which is often a ZIP archive. To search the actual contents, you have to unpack these archives, either with a filter (more on that later) or with the -z option, often combined with -r for a recursive search.
Ugrep supports fuzzy searching with the -Z option, which may be followed by a number appended directly without spaces. The number determines the degree of fuzziness, that is, the permissible number of errors (omitted, added, or substituted characters). The default is 1. Larger values quickly lead to many additional hits, which can make the results unusable.
However, the type of allowable errors can be restricted: A prefix of + or - limits the specification to insertions or deletions, respectively, and the tilde (~) stands for substitutions. The prefixes can be combined: -Z~-2 means that up to two deletions or substitutions are allowed. The --sort=best option sorts the output so that the files with the best matches appear first.
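The effect of the fuzziness settings can be observed on a short word list. This is a minimal sketch, assuming ugrep is installed; the word list is invented and borrows the search term from Figure 1.

```shell
# Sample word list, one word per line.
words=$(mktemp)
printf 'alles\nalpes\nalls\nales\nhouse\n' > "$words"

# Default fuzziness: at most one insertion, deletion, or substitution.
ugrep -Z 'alles' "$words"

# Deletions only, at most one: "alpes" would need a substitution
# and should no longer match.
ugrep -Z-1 'alles' "$words"
```

The first call should report every word except "house"; the second call should additionally drop "alpes".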
Ugrep uses some function keys for special tasks in interactive mode. For example, F1 activates the online help (Figure 2), where ugrep displays the current keyboard shortcuts. You can also enable additional options from within this mode. For example, after pressing F1, the key combination left Alt+Shift+Z activates fuzzy searching.
Invoked with the --save-config option, the program creates the $HOME/.ugrep configuration file. If necessary, you can create a different file with --save-config=/<path>/<file>. Similarly, --config reads configuration files. Calling ugrep as ug automatically parses the configuration.
Since configuration files are a powerful means of controlling ugrep, there is also the shorthand ---<file> for loading them. You can create configuration files with certain preset options with the following command:
$ ugrep -<option> [...] --save-config
The configuration files are well commented and can be easily customized with a text editor if needed.
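A round trip through a custom configuration file might look like the following sketch. It assumes ugrep is installed and that the chosen options (-i and -C2) are written to the saved file; the file names are hypothetical.

```shell
# Save a configuration with two presets (ignore case, two context
# lines) to a custom file instead of the default location.
cfgdir=$(mktemp -d)
ugrep -i -C2 --save-config="$cfgdir/context.ugrep"

# Inspect the generated, commented settings.
cat "$cfgdir/context.ugrep"

# Reuse the presets in a later search via --config.
printf 'one\ntwo\nHello\nfour\nfive\n' > "$cfgdir/sample.txt"
ugrep --config="$cfgdir/context.ugrep" 'hello' "$cfgdir/sample.txt"
```

Keeping several such files side by side makes it possible to switch between sets of presets without retyping the options.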