Faster search filter: ICgrep

Well Filtered

Article from Issue 189/2016
Author(s):

One of the most common tasks when working on computers involves browsing texts for search patterns. Here, ICgrep offers a modern, parallel, and Unicode-enabled alternative to the classic grep.

Grep is arguably the most import text-browsing tool in Unix. Although its origins are not entirely clear, likely, the name for this tool developed from the command g/re/p (global, regular expression, print) from the Unix standard editor Ed.

In any case, grep searches entries line by line for certain formulated patterns, as regular expressions. Depending on the options used, the tool can display the matches, their location, number, and so on.

With today's texts, this concept is stretched to its limits in several respects. For example, modern systems no longer use the antiquated ASCII character set with its 128 characters or its larger, 256-character siblings such as Latin 1 or increasingly Unicode [1] (see the "Using Unicode Characters" box).

Using Unicode Characters

Unicode aims to assign a digital code to each existing meaningful character or text element of all known writing cultures and character systems. The character range is equally extensive – encoding a single character requires several bytes.

Due to the large scope of the Unicode character range, it often isn't possible to enter characters on the keyboard directly because there simply aren't any keys for them. However, in Linux, Qt and GTK allow you to enter characters using key codes. To do so, you just need to precede the codes with the keyboard shortcut Ctrl+U or Ctrl+Shift+U – but not all programs support this.

It is often easier just to search for these – especially if you don't know the code for a particular character. There are two utilities for GTK-based desktops that are both used regularly: Gucharmap (Figure 1) and Gnome Characters (Figure 2). The latter is particularly useful for finding special characters and copying their codes.

Figure 1: The Gucharmap tool systematically lists the characters of many languages.
Figure 2: The Gnome Characters tool sorts special characters into different, often immediately insightful classes.

Another problem is the definition of input lines. A plethora of variants already existed for identifying or inducing a line break with the previous, non-Unicode-enabled systems [2]. The default in Unix is NL (New Line), and there are other options with Unicode, in particular NEL (New Line), but also LS (Line Separator) or PS (Paragraph Separator).

The relatively low operating speed of grep – with large input volumes and complex regular expressions – makes matters worse. Even in the best case scenario, regexes [3] are pretty complicated to enter. When editing complex expressions, the computer requires a corresponding amount of computing power. On top of that, there are also conditional, non-greedy, and greedy regexes – which attempt to generate matches for samples to include as much text as possible – and many other advanced variants. All grep varieties process regular expressions at run time so that the complexity is mapped directly in the calculation time.

An online study [4] shows just how quickly ICgrep [5] works in comparison to the standard grep. It describes a scenario in which the standard grep needs more than a minute to search for a pattern, whereas ICgrep handles the same tasks in just over a second. According to the ICgrep developers, there might be differences in speed with a factor of up to 100.

Installation

Installing ICgrep proved to be a little tricky and quite time consuming in the test; its prerequisites are Subversion, Cmake, and Clang (Listing 1, line 1).

Listing 1

Creating ICgrep

01 $ sudo apt-get install cmake clang subversion
02 [...]
03 $ svn co http://parabix.costar.sfu.ca/svn/tags/icgrep1.0
04 [...]
05 Checked out, revision 5012.
06 $ cd icgrep1.0/icgrep-devel/llvm-build/
07 $ cmake -DCMAKE_INSTALL_PREFIX=../libllvm -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_TOOLS=OFF -DLLVM_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/clang++ -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/clang ../llvm-3.5.0.src
08 [...]
09 -- Build files have been written to: /home/jluther/icgrep1.0/icgrep-devel/llvm-build
10 $ make && sudo make install
11 Scanning dependencies of target LLVMSupport
12 [  0%] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/APFloat.cpp.o
13 [...]
14 [100%] Built target LTO
15 [sudo] password for jluther:
16 -- Installing: [...]

To begin, load the latest ICgrep version onto your computer using Subversion. The message Checked out, revisionNumber marks the end of this action (line 5). Now switch to the icgrep1.0/icgrep-devel/ directory. There is an installation guide, among other things, in the README-icgrep-1.00a.txt file. It essentially consists of two parts: the interpretation and the installation of Low Level Virtual Machine (LLVM) and the subsequent construction of ICgrep itself.

To install LLVM, it is important to generate the required Makefiles. To do this, switch to the llvm-build/ directory (line 6) and let Cmake perform its role there (line 7). After a short time, you can initiate the build of the LLVM using make && make install (line 10). Depending on the computing power, the interpretation process takes about an hour, and the subsequent installation only takes a few seconds to complete.

The second step involves creating ICgrep (Listing 2). To do so, switch to the icgrep1.0/icgrep-devel/icgrep-build/ directory and create the Makefiles there using the command from the second line. The compiler automatically detects which CPU extensions the program can use to speed up processing. Alternatively, you can present one of the options SSE2, SSE3, SSE4_1, SSE4_2, AVX1, or AVX2 using the -DSIMD_SUPPORT button.

Listing 2

Generate the Executable File

01 $ cd ../icgrep-build
02 $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/clang++ -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/clang ../icgrep-1.00a
03 [... Tests ...]
04 -- Performing Test AVX2
05 -- Performing Test AVX2 - Failed
06 -- Performing Test AVX1
07 -- Performing Test AVX1 - Failed
08 -- Performing Test SSE4_2
09 -- Performing Test SSE4_2 - Failed
10 -- Performing Test SSE4_1
11 -- Performing Test SSE4_1 - Success
12 -- Configuring done
13 -- Generating done
14 -- Build files have been written to: /home/jluther/icgrep1.0/icgrep-devel/icgrep-build
15 $ make
16 Scanning dependencies of target RegExpADT
17 [  2%] Building CXX object CMakeFiles/RegExpADT.dir/re/re_re.cpp.o
18 [... Compilation ...]
19 [100%] Building CXX object CMakeFiles/icgrep.dir/compiler.cpp.o
20 Linking CXX executable icgrep
21 [100%] Built target icgrep

The next step is to generate the executable file (Listing 2, line 15) without options using make – this again takes a little longer. The generated program icgrep can be made smaller using strip. Last but not least, you can install the program manually in the /usr/local/bin/ directory.

In Practice

You can largely control ICgrep – and its cousins grep or agrep – using options. As this tool is a piece of research software, it contains two or three sets of options with different relevance for typical users. As usual, -help shows the most important ones; -help-list goes a step further: It generates a short, compact list of important buttons that, however, do require quite a bit of knowledge of regular expressions. The -help-list-hidden option ultimately gives you the full list of options; however, many of these are only important in special cases.

You can define patterns, that is, regular expressions, using the -e button ("expression"). If there is a risk that the shell will detect and evaluate or replace metacharacters in it, you should include the pattern between the single or double quotes. An equals sign can (but does not have to) follow the call, which may occur more than once:

$ icgrep -e Patt -e=ern

These types of expressions then stand for a logical OR: Condition 1 OR Condition 2 OR both must be met. If the number of patterns is too large and the commands get too confusing, it is advisable to outsource the pattern to an external file, one per line. You can specify this file using icgrep -f <SampleFile>. Table 1 summarizes the most important buttons.

Table 1

Important Options in ICgrep

Option

Function

-e=<Sample>

Defines a sample, can be specified multiple times

-f=<File>

Sample file

-c

Only counts matches, doesn't show them

-n

Displays the line numbers of matches

-H

Displays file name

-i

Ignores upper/lowercase

-disable-Unicode-linebreak

Only NL as line break

-normalize-line-breaks

Normalizes line breaks

-dump-generated-IR

Shows the regular expressions used

-print-REs

Shows named expressions

One of ICgrep's significant shortcomings is that there's no option for recursively editing, like GNU grep does with the -r button. Instead, you have to resort to the combination of Find, Xargs, and ICgrep:

$ find <Path> -type f | xargs icgrep <Sample>

Alternatively, today many shells, including Bash, support recursive directory searching using **. However, this functionality must be activated via shopt -s globstar. The shopt globstar command then shows globstar on. Caution: This process might lead to overly long argument lists and error messages like -bash: /usr/local/bin/icgrep: The argument list is too long.

The other important options currently missing in ICgrep include -v ("invert match"), -C, -A, -B (displays context lines), -m (defines maximum number of matches), -a, -b (interprets inputs as text or binary), and --color (highlights matches in color). The developers are planning to implement some of them soon [6].

Testing Regexes

With large data streams, it is advisable to test regular expressions first before applying them to all the data. Tools that visualize regular expressions make life easier for less experienced users. Two such tools are available in Ubuntu in the repositories: Kodos and Kiki [7].

Although Kodos can only be installed on older distributions with difficulty, Kiki is less demanding (Figure 3). It uses regular expressions defined and generated by Python, which it then applies to a previously entered text.

Figure 3: Kiki (above) and regexper (below) test regular expressions before using them on large amounts of data.

The regexp testing tool regexper [8] works somewhat differently. The Java program naturally uses Java-based regular expressions and applies them to a previously loaded test file (Listing 3). The test data may be relatively large, which sometimes proves advantageous.

Listing 3

Test File

$ java -Xmx1200M -Dfile.encoding=windows-1251 -jar bin/regexper.jar -f <TestFile>

All of these tools, however, have weaknesses when it comes to processing Unicode characters. Therefore, when constructing regular expressions, they are only suitable for initial tests.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Simple Regex Language

    Regular expressions are a powerful tool, but they can also be very hard to digest. The Simple Regex Language lets you write regular expressions in natural language.

  • agrep

    The agrep tool expands on grep by adding fuzzy search capabilities to text string-matching operations.

  • Command Line: Grep

    Once you understand the intricacies of grep, you can find just about anything.

  • Command Line – tre-agrep

    Tre-agrep has all of grep's functionality but can also do ambiguous or fuzzy searches without deep knowledge of regular expressions.

  • Go Programming

    The Go programming language combines type safety with manageable syntax and an extensive library. We take you through a programming example.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News