Source code browsers

Navigators

© Lead Image © alexandragl, 123RF.com

Article from Issue 194/2017

If you've ever struggled to get a sense of someone else's code, the right tool could save you hours of grepping.

Open source is all about code. Contributors read tons of code, of which they have written only a small fraction. Being able to comprehend a program is crucial to the contribution process, and free software is all about contribution. In other words, you need tools to read and understand code.

These tools are called "source code browsers" or "source navigators." Linux has many of them, and they generally fall into two broad categories. Older ones implement their own (simplified) parsers to recognize language symbols, such as function definitions, and record their locations in the source code. This approach is fast and works reasonably well, yet most programming languages have complex grammars that simplified parsers can't fully handle. Newer browsers rely on a compiler tool set to build an abstract syntax tree (AST) [1]. This makes the index more precise, but also slower and more cumbersome to generate. Which approach to choose depends on the situation, and I hope this article provides you with some guidance.

Ubiquitous Ctags

Ctags is the de facto standard for source code indexers in Linux. As the name suggests, it builds on the "tag" concept. Put simply, a tag is a syntax construct that has an index entry, such as a class, function, or macro definition. This index comes in a so-called "tags" file, and the main purpose of the non-interactive ctags command is to generate tags files from source-code trees. Tags files have a well-defined format, and virtually all code editors in Linux understand it.

Despite the "C" prefix, Ctags supports a wide range of programming languages. The output of

ctags --list-languages

varies between Ctags flavors, but C/C++/C#, Java, Perl, Python, and a few dozen others are usually there. Ctags guesses the programming language from the file's suffix. If it guesses wrong, you can override its choice with --language-force. For each programming language, Ctags recognizes different kinds of tags, and

ctags --list-kinds

displays known tags for each supported language.

To build an index, you supply ctags with a list of files to consider. For example, ctags -R descends into subdirectories recursively, and -f sets the output filename. This should be enough in most scenarios. However, because Ctags is "neither a preprocessor nor a compiler" [2], it may get certain things wrong. If that happens, use the -I switch to ignore or substitute identifiers that require special handling.
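As a rough illustration of these options, the sketch below indexes a throwaway source tree. The directory layout, the project.tags filename, and the MODULE_VERSION macro are made up for the example; the version check assumes an Exuberant or Universal Ctags binary, because other programs named ctags accept different options.

```shell
# Create a tiny sample tree to index (all names are illustrative).
workdir=$(mktemp -d)
mkdir -p "$workdir/src/util"
printf 'int add(int a, int b) { return a + b; }\n' > "$workdir/src/util/add.c"

# Proceed only if an Exuberant/Universal Ctags binary is on the PATH.
if ctags --version 2>/dev/null | grep -q 'Ctags'; then
    # -R recurses into src/, -f names the output file, and
    # -I MODULE_VERSION+ skips a hypothetical MODULE_VERSION(...) macro.
    (cd "$workdir" && ctags -R -f project.tags -I MODULE_VERSION+ src)
    # Show the generated entries, skipping the !_TAG_ header lines.
    grep -v '^!_TAG_' "$workdir/project.tags"
else
    echo "Exuberant or Universal Ctags not found; skipping" >&2
fi
```

A trailing + after the identifier tells ctags to skip the macro's argument list as well; the form NAME=replacement substitutes one identifier for another.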

A tags file is text, yet it was designed to be machine-readable. Ctags can also build a human-friendly tabular cross-reference if you call it as

ctags -x <other arguments>

For each tag, Ctags reports its name, kind, and line number in the source code. The original output is in fact quite similar, except that it stores not a location, but the EX editor command used to find the tag. Most often, it's a regular expression search, but Ctags provides --excmd and a few other command-line switches to adjust this behavior.
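To make the format concrete, the sketch below writes a two-entry tags file by hand and pulls the fields apart with awk. The entries mimic what Exuberant Ctags emits for C code (tag name, file, EX command, and kind, separated by tabs); the file and symbol names are invented.

```shell
# Hand-written sample in the Ctags format:
#   name<TAB>file<TAB>EX command;"<TAB>kind
tagsfile=$(mktemp)
printf 'add\tsrc/util/add.c\t/^int add(int a, int b)$/;"\tf\n' > "$tagsfile"
printf 'MAX_LEN\tsrc/util/limits.h\t/^#define MAX_LEN 64$/;"\td\n' >> "$tagsfile"

# Split each line on tabs and print the name, kind, and file of every tag.
summary=$(awk -F'\t' '{ printf "%s %s %s\n", $1, $4, $2 }' "$tagsfile")
echo "$summary"
```

Here f marks a function and d a macro definition, the same kind letters that ctags --list-kinds reports for C.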

The description above applies to the original Ctags. In so-called "Etags mode," the file format is different, and --excmd and friends are just ignored. "E" was for Emacs originally, but now many other programs (e.g., Midnight Commander's internal editor) recognize tags in this format (Figure 1). You can often tell the format by the filename: tags is for Ctags, while TAGS is for Etags.

Figure 1: With a tags file, even humble mcedit looks much more IDE-like. Just don't forget it wants an Etags format.

The most popular Ctags implementation to date is Exuberant Ctags [3], and it is likely what your distro ships. It provides both ctags and etags commands; you can also enable Etags mode with the -e switch to ctags. Universal Ctags [4] is also gaining momentum. As the Exuberant Ctags homepage suggests, that project isn't actively maintained now, whereas Universal Ctags continues the development and sports completely rewritten C/C++, Python, and HTML parsers, as well as many new ones (e.g., for Rust). The downside is that you'll probably need to compile the program yourself. Luckily, the homepage describes this process in detail.

Compared with the original Ctags, which indexes only where a tag was defined, Universal Ctags can also track where it was referenced: see the Reference tags section on the Docs page. This puts Universal Ctags on a par with the second nominee, Cscope.

Venerable Cscope

Can you imagine software written in the PDP-11 era that remains in use today? Can you imagine that the software was made free (as in speech) thanks to an infamous SCO Group predecessor? Meet Cscope [5]: a C code browser with some (limited) support for C++ and Java, born at Bell Labs, and open-sourced by Santa Cruz Operation in 2000. Cscope was briefly introduced in a Linux Voice cover feature last year [6], and now it is time to pay this tool the respect it truly deserves.

Cscope should be available in your package manager. Before you start using it, you'll need a cross-reference database for your source code, which you can build separately with cscope -b. This is not a must, because when you launch Cscope's curses-based interface, it automatically indexes all C, Bison/Yacc, and Lex files in the current directory. Add the -R switch to recurse into subdirectories, which is usually what you want to do. Some projects even provide dedicated makefile targets to generate a Cscope cross-reference database. For example, make cscope in Linux kernel source code produces a so-called inverted index that makes symbol lookup a bit faster. Should you want to achieve the same effect on your own, run cscope -bq. Also, consider -k to enable the "kernel mode." In this mode, Cscope doesn't look into standard locations like /usr/include, because kernels (and other low-level code) don't use them. On startup, Cscope detects changes to source code and rebuilds the cross-reference as necessary. This makes subsequent launches faster. Note you still need to tell Cscope where to look for source code, even if the database already exists. To trigger a rebuild from within Cscope, type Ctrl+R.
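For projects that carry files you don't want indexed, one common approach is to hand Cscope an explicit file list. The sketch below builds such a list with find and, if cscope is installed, generates the database with the switches discussed above. The sample tree is invented; cscope.files is the list name Cscope looks for by default.

```shell
# Create a small sample tree (names are illustrative).
srcdir=$(mktemp -d)
mkdir -p "$srcdir/lib"
printf 'void foo(void);\nint main(void) { foo(); return 0; }\n' > "$srcdir/main.c"
printf 'void foo(void) {}\n' > "$srcdir/lib/foo.c"
printf 'notes\n' > "$srcdir/README.txt"

# List only C sources and headers, one path per line.
(cd "$srcdir" && find . \( -name '*.c' -o -name '*.h' \) -type f | sort > cscope.files)

if command -v cscope >/dev/null 2>&1; then
    # -b: build only, -q: inverted index, -k: skip /usr/include,
    # -i: read the file list written above.
    (cd "$srcdir" && cscope -b -q -k -i cscope.files)
    ls "$srcdir"
fi
```

Launching cscope -d in the same directory afterward opens the interface on the existing database without rebuilding it.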

Cscope records not only where symbols were defined, but also how they were used, so you can find all expressions that involve a given variable, or functions calling a given function, or functions that the given function calls. Many other tools restrict your searches to C language identifiers. In Cscope, you can grep for arbitrary text strings and regular expressions (see the info box titled "Ack"). For dessert, Cscope can look up a file by name or find all files that #include the specific header.

Ack

Source-code files are text. Language tokens are pieces of text. The first tool that springs to mind when you think about searching for text is Grep.

You can surely use grep to navigate source code, but you have a better alternative: ack [7]. Ack is pure Perl (Are you scared yet?), and you can install it from your distro's package manager or via CPAN.

What makes Ack a better Grep? Two things: It's fast, and it's designed to search code, which means fewer keystrokes for common tasks. Ack ignores non-code directories (e.g., .git or .svn), backup files, and the like, and it doesn't need -R to recurse into subdirectories. In a multilanguage project, you can tell it to look only in Python sources with ack --python. Ack sports Perl regular expressions (guess why) and happily highlights matches it finds.

Ack can't do the semantic analysis that Clang can – nor can it brew your coffee. However, none of the tools I cover here can do a free-text search (except Cscope), so Ack certainly deserves being in your toolbox. Don't forget to share the ~/.ackrc snippets you found most useful in Linux Voice forums [8]!

Cscope's curses-based interface splits the screen into halves. You enter search terms in the lower half and get results in the upper (Figure 2). Cscope supports POSIX extended regular expression syntax. Filenames allow partial matches while C symbols don't. Putting foo in the Find this file field matches foo.c, foo.h, and foobar.c. Putting foo in the Find this C symbol field matches the first, but not the second, expression below:

void foo();
int foobar = 1;

Figure 2: Cscope is a powerful tool that easily copes with the Linux kernel. Here, it browses its own source.

The Tab key lets you switch halves, and you select fields with arrow keys. Some symbols, such as ^, are reserved (see cscope(1) [9]). To enter them, first, type \ as an escape character. For each search result, Cscope displays the location (file, function, and line number) and some context. It also assigns the result a single-letter hotkey you can type to open it in the editor ($CSCOPE_EDITOR). The spacebar switches search results pages. You can save the results in a file with > or >>. Should you need them later, load this file with < or cscope -F. To refine the results, type ^ or |. Both filter through an external shell command. Entering ^ replaces the original results, whereas | simply displays filtered lines and keeps the results untouched.

A few other hotkeys are available. Ctrl+C toggles case sensitivity. Ctrl+Y/Ctrl+A repeat your last search, and Ctrl+B/Ctrl+F do the same, but in the search field above or below the current one. This comes in handy if you typed your query in the wrong box. For those accustomed to GNU Readline, history support in Cscope may feel limited, and it probably is. Pressing ? brings up the help page, and Ctrl+D exits Cscope.

The man page [9] describes more hotkeys and command-line switches. I suggest you spend some time learning them, because doing so greatly improves your Cscope experience. Cscope also runs in line mode or as a Vim extension. I leave exploring those options as an exercise for the curious reader (i.e., you).
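As a starting point for the line-oriented mode, the sketch below runs a one-shot Find this C symbol query from the shell. It assumes cscope is installed and uses a two-line sample file that mirrors the earlier foo/foobar example; -L requests line-oriented output for a single search, and the digit selects the input field, counting from 0.

```shell
# Sample file with the two declarations from the partial-match example.
demodir=$(mktemp -d)
printf 'void foo();\nint foobar = 1;\n' > "$demodir/demo.c"

if command -v cscope >/dev/null 2>&1; then
    (
        cd "$demodir"
        # Build cscope.out for the sample file without opening the interface.
        cscope -b demo.c
        # -d: use the database as is; -L -0: one-shot "Find this C symbol".
        cscope -d -L -0 foo
    )
else
    echo "cscope not found; skipping" >&2
fi
```

Swapping -0 for -1 searches global definitions, and -3 lists the functions calling a given function. Each result line carries the file, the enclosing function, the line number, and the source text, which makes the output easy to feed into other scripts.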

Woboq Code Browser

Once you understand the traditional tools, you can compare them to Clang-based alternatives. Naturally, this limits support to C/C++, but Clang is a real C/C++ compiler, so it should have no problem handling even the most convoluted syntax constructs, provided they are correct.

On the other hand, if you call Clang to index your code, you should supply it all the information the build system (CMake, Autoconf) normally does. This is not the case with Ctags or Cscope, which can simply scan files one by one, looking for specific patterns, such as function declarations. For Clang, build information usually comes via a JSON compilation database (compile_commands.json). CMake introduced this format first, and in a nutshell, it contains the list of source files and exact commands used to build them.
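To give a feel for the format, the sketch below writes a two-entry compilation database by hand and counts the translation units in it. Each entry names the working directory, the exact compiler command, and the source file. The paths and flags are invented, and a real file would be generated by the build system rather than typed in.

```shell
# Write a minimal, hand-made compilation database (paths are illustrative).
dbdir=$(mktemp -d)
cat > "$dbdir/compile_commands.json" <<'EOF'
[
  {
    "directory": "/home/user/project/build",
    "command": "/usr/bin/cc -I../include -O2 -c ../src/main.c -o main.o",
    "file": "../src/main.c"
  },
  {
    "directory": "/home/user/project/build",
    "command": "/usr/bin/cc -I../include -O2 -c ../src/util.c -o util.o",
    "file": "../src/util.c"
  }
]
EOF

# One "file" key per entry, so counting them gives the number of sources.
entries=$(grep -c '"file":' "$dbdir/compile_commands.json")
echo "translation units: $entries"
```

Because every entry records the include paths and defines used for that file, a Clang-based indexer can parse each source exactly as the compiler saw it.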

Woboq Code Browser [10] builds on Clang and produces a set of annotated HTML pages showing a project's source code. A bit of JavaScript makes them interactive, and no code is required on the back end; still, you'll probably want to serve these pages with a web server, because most browsers don't allow Ajax requests to file:// URLs by default. (Allowing them would be a security risk.)

You'll want to compile Code Browser yourself, because it probably hasn't made its way into your distro's repositories. It uses CMake, which you'll need to tell where to find the llvm-config tool on your system. On Ubuntu, it's at /usr/bin/llvm-config; otherwise, the process is straightforward.

How you build compile_commands.json depends on the build system of the project you are trying to index. If it's CMake, the day just got better, because you only need to use

cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

For other build systems, say Autoconf or Qmake, Woboq Code Browser provides a fake_compiler.sh helper script. However, there is a better tool: Build EAR (Bear), which you'll probably find in your package manager and which is also available online [11]. Bear sets LD_PRELOAD to inject a dynamic library that traces calls to the compiler and collects command-line arguments. To use Bear, you just need to prefix a make invocation with bear:

bear make

The bear --help command lists a few available options. With the JSON compilation database ready, Code Browser can index your project (Listing 1).

Listing 1: Code Browser Indexing

This implies you did an in-tree build of Code Browser. The $BUILDDIRECTORY argument is where $PROJECTNAME (your project) is built, and $VERSION is the project's version. The $OUTPUTDIRECTORY argument should be set to wherever your web server looks for static HTML (e.g., ~/public_html/$PROJECTNAME). The -a switch tells codebrowser_generator to process all source code found in compile_commands.json. The second command builds index.html for each subdirectory in the project, and the last line copies scripts and stylesheets.
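The three steps described above can be sketched as follows. This is an approximation based on that description and on the option names in the Code Browser README, not a verbatim listing: the variable values, the $SOURCEDIRECTORY name, and the ../data destination are assumptions, so check the usage message of codebrowser_generator before relying on it.

```shell
# Placeholder values; adjust to your project (all of these are assumptions).
PROJECTNAME=myproject
SOURCEDIRECTORY=$HOME/src/$PROJECTNAME
BUILDDIRECTORY=$SOURCEDIRECTORY/build   # holds compile_commands.json
VERSION=1.0
OUTPUTDIRECTORY=$HOME/public_html/$PROJECTNAME

# Run from the top of an in-tree Code Browser build.
if [ -x ./generator/codebrowser_generator ]; then
    # 1. Annotate every source file listed in compile_commands.json (-a).
    ./generator/codebrowser_generator -b "$BUILDDIRECTORY" -a \
        -o "$OUTPUTDIRECTORY" -p "$PROJECTNAME:$SOURCEDIRECTORY:$VERSION"
    # 2. Build index.html for each subdirectory.
    ./indexgenerator/codebrowser_indexgenerator "$OUTPUTDIRECTORY"
    # 3. Copy scripts and stylesheets next to the generated pages.
    cp -rv ./data "$OUTPUTDIRECTORY/../data"
else
    echo "codebrowser_generator not found; build Code Browser first" >&2
fi
```

The generated pages load their scripts from a data directory beside the output directory by default, which is why the last step copies it there.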

The end result is worth the fuss. You may choose a theme of your liking (Qt Creator/KDevelop/Solarized) to feel at home. Mouse over a symbol to see a pop-up box containing the description and references. For global symbols, the reference kind (e.g., value read or address taken) is also shown. Click on a symbol to jump to the declaration. Location history is also supported, yet it is bare bones, with no way to clear the history and no indication as to which files history items belong. Similarly, the sidebar on the right contains definitions collected from the current source code file (Figure 3); however, you can't tell whether the definition is a function, variable, or type.

Figure 3: To showcase Code Browser, Woboq publishes indexes for several well-known software projects, including the Linux kernel.

Another downside is licensing. Code Browser is dual-licensed (CC BY-NC-SA 3.0 [12] and proprietary), so I feel it's okay to use the open source version to index open source code, as long as you keep the Woboq branding. For anything else, you'll probably need a commercial license. You should contact Woboq directly if you are serious about Code Browser deployment.
