Needle in a Haystack

How odfgrep Works

If you thought that an odfgrep script would have to be very complicated, think again. The code in Listing 1 is the odfgrep that I use from time to time on my own GNU/Linux computers, and it is less than 30 lines.

Listing 1

odfgrep Script

  1  #! /bin/bash
  2
  3  OPTIONS=$@
  4  ODFOPTIONS=`echo $@ | sed -e 's/\.\(odt\|odp\|ods\)\b/\.\1\.odfgrep\.txt/g'`
  5
  6  for FF in $OPTIONS
  7  do
  8      if [ -f "$FF" ]
  9      then
 10           case "$FF" in
 11               *.odt|*.odp|*.ods)
 12                    odt2txt --width=-1 $FF > $FF.odfgrep.txt
 13                    FILES2REMOVE="$FILES2REMOVE $FF.odfgrep.txt"
 14                    ;;
 15               *) # non-ODF file found, nothing to do
 16                    ;;
 17           esac
 18      fi
 19  done
 20
 21  grep $ODFOPTIONS | sed -e 's/\.odfgrep\.txt//'
 22
 23  if [[ -n "${FILES2REMOVE// }" ]]
 24  then
 25      rm $FILES2REMOVE
 26  fi
 27  exit

There are surely many other ways to write an odfgrep script, and many of those ways may be faster than the example here or handle some weird combinations of search patterns and ODF files better. Personally, however, I have not had any problems yet, and the simple odfgrep discussed here should be enough for the needs of the great majority of Linux desktop users. Its high-level flow diagram (Figure 2) is easy to describe:

Figure 2: Main steps of the algorithm used to find text in ODF office documents, as if they were plain text files.
  1. Get the list of all the files to analyze from the user.
  2. Figure out which of those files are in ODF format.
  3. Make a temporary plain text version of each of those files, each in the same folder as the original.
  4. Pass to the standard grep the same options specified by the user, but a different list of files, in which plain text versions of the ODF files are used.
  5. "Massage" the output of grep on the fly so that the user sees the names of the original ODF files.
  6. Remove all the temporary plain text files created in step 3.

I look at the assumptions behind this algorithm, its limits, and some ways to expand it at the end of the tutorial. For now, I'll just look at how the code that implements it works, line by line.

Commands typed at a prompt are interpreted and executed on the spot, line by line, by special programs called "shells" in the Linux world. You can save long sequences of commands in files that a shell may then execute automatically, one line at a time. These special files are called scripts, and odfgrep is just that: a script.

The weird stuff on line 1 is the standard header of every shell script. The two initial characters ("shebang" in Unix slang) tell the system that the rest of the line names the program that must interpret the file: in this case, the bash program that is in the /bin folder. Bash (Bourne Again Shell) is the default shell on most GNU/Linux systems.

Each shell script can receive options, or switches, that modify its default behavior. In Bash, everything passed to a script on the command line (options, search patterns, and file names alike) is available in the special variable called $@. Lines 3 and 4 copy all those arguments, for readability, into two string variables called OPTIONS and ODFOPTIONS. The first will only be used to figure out which of the files that grep is to scan are in ODF format (line 6).
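
To see what lines 3 and 6 rely on, here is a minimal sketch that counts the items a loop like the one in line 6 would visit. The function name count_words is hypothetical and not part of odfgrep; only the assignment and the unquoted loop mirror Listing 1.

```bash
#!/bin/bash
# Hypothetical helper (not part of odfgrep): counts the items that a loop
# like the one in line 6 of Listing 1 would visit.
count_words() {
    OPTIONS=$@              # same assignment as line 3: all arguments in one string
    COUNT=0
    for FF in $OPTIONS      # unquoted on purpose, as in line 6: split on whitespace
    do
        COUNT=$((COUNT + 1))
    done
    echo "$COUNT"
}

count_words -i linux thesis.odt notes.txt    # four arguments, four loop items
count_words -i linux "my thesis.odt"         # three arguments, but four loop items
```

Both calls print 4. In the second one, the single argument "my thesis.odt" is split into two words, because the arguments are stored as one plain string.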

The $ODFOPTIONS variable is filled in line 4 with the customized file list mentioned above. In that line, in fact, sed (Stream EDitor) receives the original options and appends on the fly the string .odfgrep.txt to each occurrence of the .odt, .odp, and .ods file extensions.

In other words, if you asked odfgrep to find all the occurrences of "Linux" in two files called thesis.odt and thesis-slides.odp

#> odfgrep Linux thesis.odt thesis-slides.odp

then $ODFOPTIONS would assume the value Linux thesis.odt.odfgrep.txt thesis-slides.odp.odfgrep.txt.

sed achieves this by substituting each occurrence of the text pattern between the first two forward slashes with the other pattern between the second and third forward slashes. A complete sed tutorial would not fit (and would be off topic) here, but you need to understand two pieces of line 4. The \1 means "put here the string just found with the pattern in the set of parentheses to the left" (i.e., odt, odp, or ods; the period in front of the extension is matched, and written back, separately).

The \b pattern modifier makes sed only act on word boundaries: Without it, line 4 would modify a file name like notes.odtconference.odt to notes.odt.odfgrep.txtconference.odt.odfgrep.txt. This is not a 100% bulletproof solution, because a period also counts as a word boundary, so sed would still rewrite the .odt in the middle of a string like my.odt.notes.txt. In practice, that has never been a problem for me.
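
The effect of \1 and \b is easy to try out in isolation. The following sketch wraps the sed expression from line 4 in a function; the name rewrite_names is an assumption made for this example, and the \| and \b operators require GNU sed.

```bash
#!/bin/bash
# Hypothetical wrapper around the sed expression from line 4 of Listing 1.
# The \| and \b operators are GNU sed extensions.
rewrite_names() {
    printf '%s\n' "$*" | sed -e 's/\.\(odt\|odp\|ods\)\b/\.\1\.odfgrep\.txt/g'
}

rewrite_names Linux thesis.odt thesis-slides.odp
rewrite_names notes.odtconference.odt
rewrite_names my.odt.notes.txt
```

The three calls print Linux thesis.odt.odfgrep.txt thesis-slides.odp.odfgrep.txt, then notes.odtconference.odt.odfgrep.txt, then my.odt.odfgrep.txt.notes.txt. The last line is the corner case in which \b is not enough.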

The loop in lines 6 to 19 creates the plain text copies of each ODF file, saving their names (which are the same ones previously written in $ODFOPTIONS, remember?) in the $FILES2REMOVE variable. To do this, odfgrep copies the substrings inside $OPTIONS, one at a time, into the variable $FF (line 6) and looks at them. But nothing happens unless:

  1. $FF is the name of an actual file (line 8), and
  2. Its extension is .odt, .odp, or .ods (line 11).

In that case, and only in that case, the odt2txt utility (line 12) writes a plain text copy of $FF in a temporary file with the same suffix used in line 4 to build $ODFOPTIONS – that is, .odfgrep.txt.

Please note that, even if I just called it a "file name," $FF includes the path to a file (i.e., it may have values like work/essays/phd-thesis.odt). In this case, odt2txt would save the plain text copy as work/essays/phd-thesis.odt.odfgrep.txt, so it is in the same folder as the original. The same string is also appended, in line 13, to the variable $FILES2REMOVE, which is necessary for reasons that will be clear in a moment.

The --width option in line 12 tells odt2txt the width at which text lines should be wrapped. Its default value is 65 characters. Setting it to -1 means "do not wrap lines." This adjustment is necessary because grep works line by line. If you were searching for a sentence like Linux is great, but odt2txt split it across two consecutive lines, grep would not find it.
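
The following sketch shows the problem without needing odt2txt at all: It feeds grep the same sentence once wrapped and once on a single line. The sample sentence is made up for this demonstration.

```bash
#!/bin/bash
# Simulated converter output; the sample sentence is made up for this demo.
# "|| true" keeps the script going when grep finds nothing (exit status 1).
WRAPPED=$(printf 'Everybody knows that Linux is\ngreat and free\n' | grep -c 'Linux is great' || true)
UNWRAPPED=$(printf 'Everybody knows that Linux is great and free\n' | grep -c 'Linux is great' || true)

echo "matching lines in wrapped text:   $WRAPPED"
echo "matching lines in unwrapped text: $UNWRAPPED"
```

The wrapped version yields 0 matching lines, and the unwrapped one yields 1.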

Once the loop that started in line 6 ends, $ODFOPTIONS contains three types of "objects":

  • The options and search patterns that the ordinary grep should use.
  • The paths to all the non-ODF files passed by the users.
  • The paths to all the plain text copies of ODF files generated in line 12.

The objects of the first two types are not modified in any way, because they were not file names with ODF extensions; therefore, the loop did nothing to them!

At this point, you can finally run grep with the $ODFOPTIONS (line 21), but with one trick: Filter its output with sed in a way that makes all the .odfgrep.txt strings disappear. This will make odfgrep always return the names of the original ODF files, instead of their plain text copies, which are the only ones that grep sees. Without that sed command, the output of grep could be something like

phd-thesis.odt.odfgrep.txt: Linux is great and I love it..

and this would confuse the users. The sed part of line 21, instead, transforms the output line above in this way, pointing to the original ODF file:

phd-thesis.odt: Linux is great and I love it..
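
You can check this transformation by hand with a minimal sketch like the following, in which strip_suffix is just a hypothetical name for the sed filter at the end of line 21:

```bash
#!/bin/bash
# Hypothetical name for the filter at the end of line 21 of Listing 1.
strip_suffix() {
    sed -e 's/\.odfgrep\.txt//'
}

echo 'phd-thesis.odt.odfgrep.txt: Linux is great and I love it..' | strip_suffix
```

Because the expression has no g flag, sed removes only the first .odfgrep.txt on each output line, which is normally the one in the file name that grep prints at the start of the line.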

After this, the only thing left to do is clean up (lines 23 to 26). Line 23 means "check if, after removing all spaces from the FILES2REMOVE variable, it has a number of characters greater than 0." If that is true, at least one plain text file was created and its name was appended to $FILES2REMOVE (lines 12 and 13). In that case, line 25 runs and removes all the files listed therein. Done!
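
The test in line 23 relies on a Bash-specific substitution, ${VARIABLE// }, which returns the value of the variable with every space deleted. A minimal sketch that isolates it follows; the function name has_files_to_remove is an assumption made for this example.

```bash
#!/bin/bash
# Hypothetical helper that isolates the test in line 23 of Listing 1.
has_files_to_remove() {
    local LIST=$1
    [[ -n "${LIST// }" ]]   # true only if something other than spaces is left
}

FILES2REMOVE=""
FILES2REMOVE="$FILES2REMOVE thesis.odt.odfgrep.txt"   # what line 13 does

if has_files_to_remove "$FILES2REMOVE"
then
    echo "would remove:$FILES2REMOVE"
fi
```

With an empty string, or a string made only of spaces, the function fails and nothing would be removed.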

As an example, Listing 2 shows the output of odfgrep on a test directory that contains several files of different kinds in different subfolders. The command says "show me all the lines containing the word linux (case insensitive) in all the files inside testdir and all its subfolders" (some lines were truncated for better formatting).

Listing 2

odfgrep for "linux"

#> find testdir -type f -exec odfgrep -i linux {} /dev/null \;
testdir/references/mfioretti.odp:Writer for several Linux magazines
testdir/references/mfioretti.odp:any Gnu/Linux distribution is OK
testdir/references/open-business-models.odt:Yochai Benkler, Linux and the Nature
testdir/notes/go-linux.md:what trouble? Why not check your data table inside a spreadsheet or database? Because it's often...
testdir/notes/go-linux.txt:Linux(1) is the best kernel around
testdir/notes/go-linux.txt:Linux,1
testdir/notes/go-linux.txt:he actually said "Linux is the best kernel around"

As you can see, odfgrep works and generates output in the same way as the standard grep, always returning the right file names, both on actual plain text files like go-linux.txt and on ODF slide shows (mfioretti.odp) or document files (open-business-models.odt).
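
Listing 2 passes /dev/null as an extra file to every odfgrep call. The article does not say why, but a likely reason is that find, with this form of -exec, hands over one file at a time, and grep omits the file name when it searches a single file. The sketch below, which uses a made-up file in a throwaway directory, shows the difference.

```bash
#!/bin/bash
# Throwaway demo directory; the file name and its content are made up.
DEMO_DIR=$(mktemp -d)
echo "Linux is the best kernel around" > "$DEMO_DIR/go-linux.txt"

# One file argument: grep prints only the matching line.
ONE_FILE=$(grep -i linux "$DEMO_DIR/go-linux.txt")

# Two file arguments (the second one is always empty): grep adds the file name.
TWO_FILES=$(grep -i linux "$DEMO_DIR/go-linux.txt" /dev/null)

echo "$ONE_FILE"
echo "$TWO_FILES"
rm -r "$DEMO_DIR"
```

Without the file name prefix, the sed filter in line 21 of Listing 1 would have nothing to clean up, and you could not tell which document each match came from.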

In another example, I ask odfgrep to tell me how many times the word politics appears in the ODF text documents inside a certain folder:

#> odfgrep -c politics testing/references/*odt
testing/references/conference-proceedings.odt:2
testing/references/openness-essay.odt:3

Here, odfgrep found two matching documents in that folder; the word politics appeared two times in one file and three times in another file.

Installing odt2txt and odfgrep

The odt2txt program [2] is present in the repositories of the main GNU/Linux distributions. On Ubuntu, for example, you can install it by simply typing:

#> sudo apt-get install odt2txt

To install odfgrep, first save the code in Listing 1 (except the line numbers [3]) in a plain text file called odfgrep with the use of an editor like Gedit, Kate, or the venerable Vi or Emacs. Then, move that file to a directory where all users of your computer can access it (e.g., /usr/local/bin), and make it executable with the following commands:

#> sudo mv odfgrep /usr/local/bin
#> sudo chmod 755 /usr/local/bin/odfgrep
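
To verify that both programs ended up somewhere in your search path, you could run a quick check like the one below. This is only a sketch; the helper name is_installed is an assumption and not part of odfgrep.

```bash
#!/bin/bash
# Hypothetical helper: reports whether a command can be found in $PATH.
is_installed() {
    command -v "$1" > /dev/null 2>&1
}

for TOOL in odt2txt odfgrep
do
    if is_installed "$TOOL"
    then
        echo "$TOOL: found"
    else
        echo "$TOOL: not found in PATH"
    fi
done
```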

Caveats and Limits

The odfgrep script explained here is simple but very useful, provided you acknowledge some of its limitations or underlying assumptions.

The first things to know are about folder and file names. Depending on the language settings of your computer, this odfgrep may fail on names containing characters that are not ASCII alphanumerical characters, periods, underscores, or hyphens. It will surely fail if it comes across folder or file names containing spaces. At the same time, it will not detect, and therefore will not convert, ODF files that do not have their own default extensions (e.g., .odt, .ods, or .odp).

Personally, I consider these "sure failures" more of a feature than a bug for one simple reason: In my humble opinion, "limiting" yourself to files and folder names without spaces, apostrophes, and non-ASCII letters guarantees that any software or filesystem on the planet will deal with them without surprises, and it makes it much simpler to write all sorts of file-managing scripts for any purpose.
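
If you want to find out in advance which names fall outside that safe set, a check like the following could help. It is a minimal sketch under the assumption that slashes are also allowed as path separators; is_safe_name is a hypothetical helper, not something Listing 1 contains.

```bash
#!/bin/bash
# Hypothetical helper: succeeds only for paths made of ASCII letters, digits,
# periods, underscores, hyphens, and slashes.
is_safe_name() {
    local LC_ALL=C            # make the A-Z and a-z ranges strictly ASCII
    case "$1" in
        ""|*[!A-Za-z0-9._/-]*) return 1 ;;   # empty, or contains another character
        *) return 0 ;;
    esac
}

for NAME in "work/essays/phd-thesis.odt" "my notes.odt" "it's-final.ods"
do
    if is_safe_name "$NAME"
    then
        echo "ok:   $NAME"
    else
        echo "skip: $NAME"
    fi
done
```

Only the first of the three sample names passes; the other two contain a space and an apostrophe, respectively.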

A more substantial limitation is the inability to work as intended in folders where you have no permission to create new files. Running odfgrep with sudo or giving it special SUID powers, as explained in an article online [4], would solve this problem. However, even that will not be enough to work on non-writable media like DVD archives, in which the normal grep tool would work just fine.
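
A script could at least detect this situation before it tries to convert a file. The sketch below is an addition for illustration, not something Listing 1 does; can_write_beside is a hypothetical helper that tests whether the folder containing a given file is writable.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if a temporary copy could be created next to
# the file given as argument, that is, if the containing folder is writable.
can_write_beside() {
    local DIR
    DIR=$(dirname "$1")
    [ -d "$DIR" ] && [ -w "$DIR" ]
}

DEMO_DIR=$(mktemp -d)         # throwaway folder, writable by its creator
if can_write_beside "$DEMO_DIR/thesis.odt"
then
    echo "temporary copy can go in $DEMO_DIR"
fi
rmdir "$DEMO_DIR"
```

The file itself does not need to exist for the check to work, because only its folder is examined.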

You must also take into account that odt2txt cannot fix stuff that "disturbs" the main text flow, like footnotes. If, for example, your ODF text contains a sentence like Linux(1) is the best kernel around, and (1) is a footnote with the text a Unix-like kernel by Linus Torvalds, then odt2txt will split the original text over five lines:

Linux,1
 **
a Unix-like kernel by Linus Torvalds
 **
is the best kernel around

If you were looking for the exact phrase Linux is the best kernel around, odfgrep would miss it, exactly because it was spread over multiple lines with extra text in it.
