Implementing fast queries for local files in Go

Closure as a Bridge

Listing 2 shows the indexer, which uses the Walk() function from the standard path/filepath package to traverse a file hierarchy, beginning at the start directory that the user specifies on the command line. Arguments passed to the program are found in the os.Args slice; as in C, the program name is the first element, and the call parameters follow.

Listing 2

index.go

01 package main
02
03 import (
04   "database/sql"
05   _ "github.com/mattn/go-sqlite3"
06   "os"
07   "path/filepath"
08 )
09
10 type Walker struct {
11   Db *sql.DB
12 }
13
14 func main() {
15   if len(os.Args) != 2 {
16     panic("usage: " + os.Args[0] +
17           " start_dir")
18   }
19   root := os.Args[1]
20
21   db, err :=
22       sql.Open("sqlite3", "./files.db")
23   checkErr(err)
24   w := &Walker{
25     Db: db,
26   }
27
28   err = filepath.Walk(root, w.Visit)
29   checkErr(err)
30
31   db.Close()
32 }
33
34 func (w *Walker) Visit(path string,
35           f os.FileInfo, err error) error {
36   stmt, err := w.Db.Prepare(
37       "INSERT INTO files VALUES(?,?,?)")
38   checkErr(err)
39   defer stmt.Close()
40   _, err = stmt.Exec(
41       path, f.ModTime().Unix(), f.Size())
42   checkErr(err)
43
44   return nil
45 }
46
47 func checkErr(err error) {
48   if err != nil {
49     panic(err)
50   }
51 }

Browsing a file tree isn't rocket science, but filepath.Walk() communicates with the program through the Visit() callback function passed to it in line 28. The problem is that the callback, which starts in line 34, has no database handle in its scope, although it needs one to write to the database. The solution to this dilemma is to bind the handle to Visit() so that the callback behaves like a closure.

To do this, line 34 declares a so-called receiver between the func keyword and the function name. It tells Go to attach Visit() as a method to the Walker data structure (line 10), which contains the database handle. The expression w.Visit in line 28 then hands Walk() a function that already has the w instance bound to it. Visit() can thus access the handle via the receiver variable w and use it to insert new records into the database.

The actual work of setting up a database query is done by the Prepare() method (line 36), which prepares an SQL command and returns a statement handle. Line 40 then calls the Exec() method on this handle and passes it the values to be stored by the SQL command: the path to the file, its last modification timestamp, and its size.

To avoid the need for the program to check after each function call whether the error variable err is nil, meaning everything is OK, line 47 defines a function named checkErr(), which does this and aborts the program with panic() if something unforeseen happens.

Finders, Keepers

After the indexer finished its work, the database table files on my computer had more than a million entries, as shown in Figure 3. The high number is probably due to numerous cloned Git repositories and more than 20 years of Snapshot articles. With this data in the files.db SQLite database, a SQLite client can now quickly fire off queries and determine, for example, which files in my home directory have recently changed.

Figure 3: After the indexer has finished, there are more than one million file entries in the flat file SQLite database.

To do this, Listing 3 connects to the SQLite database and issues a SELECT command that queries all rows in the table, sorts them in descending order of the timestamp in the modified column, and then outputs the first 10 matches.

Listing 3

latest.go

01 package main
02
03 import (
04   "database/sql"
05   "fmt"
06   _ "github.com/mattn/go-sqlite3"
07 )
08
09 func main() {
10   db, err :=
11     sql.Open("sqlite3", "./files.db")
12   checkErr(err)
13
14   rows, err := db.Query("SELECT path, " +
15     "modified FROM files " +
16     "ORDER BY modified DESC LIMIT 10")
17   checkErr(err)
18
19   var path string
20   var mtime string
21
22   for rows.Next() {
23     err = rows.Scan(&path, &mtime)
24     checkErr(err)
25     fmt.Printf("%s %s\n", path, mtime)
26   }
27 }
28
29 func checkErr(err error) {
30   if err != nil {
31     panic(err)
32   }
33 }

The rows.Next() call in line 22 works its way step by step through the matches, and rows.Scan() in line 23 retrieves the two column values of each match and assigns them to the path and mtime variables passed in as pointers; both were previously declared as strings. Go supports pointers, but it does not leave memory management up to the user, and it does not go up in smoke like C if an address is wrong because of a bug; instead, it quits with a helpful error message.
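Because the indexer stores the modification time as seconds since the epoch, Listing 3 prints a bare number. A small sketch of how the value could be made readable, assuming mtime were declared as int64 instead of string before the Scan() call; the formatMtime() helper is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// formatMtime converts a Unix epoch value, as stored in the modified
// column, into a readable UTC timestamp.
func formatMtime(epoch int64) string {
	return time.Unix(epoch, 0).UTC().Format("2006-01-02 15:04:05")
}

func main() {
	// In latest.go, this value would come from rows.Scan(&path, &mtime)
	// with mtime declared as int64 (an assumption, not the listing's code).
	var mtime int64 = 86400
	fmt.Println(formatMtime(mtime)) // prints "1970-01-02 00:00:00"
}
```

Dropping the UTC() call would display the timestamp in the local time zone instead.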

Which files in my home directory take up the most space? Listing 4 finds this out quickly: The SELECT query starting in line 25 sorts all entries in descending order of size (ORDER BY size DESC) and LIMITs the output to a maximum number of matches. The user defines this number with the --max-files parameter at the command line; Go's flag package provides a convenient interface for parsing a command's parameters.

Listing 4

max-size.go

01 package main
02
03 import (
04   "database/sql"
05   "fmt"
06   "flag"
07   "os"
08   "strconv"
09   _ "github.com/mattn/go-sqlite3"
10 )
11
12 func main() {
13   db, err :=
14     sql.Open("sqlite3", "./files.db")
15   checkErr(err)
16
17   max_files := flag.Int("max-files", 10,
18     "max number of files")
19
20   flag.Parse()
21   if len(flag.Args()) != 0 {
22     panic("usage: " + os.Args[0])
23   }
24
25   rows, err := db.Query("SELECT path," +
26     "size FROM files " +
27     "ORDER BY size DESC LIMIT " +
28     strconv.Itoa(*max_files))
29   checkErr(err)
30
31   var path string
32   var size string
33
34   for rows.Next() {
35     err = rows.Scan(&path, &size)
36     checkErr(err)
37     fmt.Printf("%s %s\n", path, size)
38   }
39 }
40
41 func checkErr(err error) {
42   if err != nil {
43     panic(err)
44   }
45 }

The flag package first expects a declaration of the variable that will hold the value passed in from the command line (max_files in line 17). The call to the flag.Int() function specifies that only integers are accepted as values and sets 10 as the default. Then flag.Parse() (line 20) analyzes the command-line parameters and, if the user has set --max-files, stores the value in the variable that the max_files pointer references.
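The same parsing steps can be wrapped in a function that takes the arguments as a slice, which makes the behavior easy to try out. This is a sketch under assumptions: the parseMaxFiles() helper is hypothetical, and it uses flag.NewFlagSet() instead of the package-level functions in Listing 4 so that it does not depend on os.Args.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseMaxFiles parses args (without the program name) and returns the
// value of --max-files, which defaults to 10.
func parseMaxFiles(args []string) (int, error) {
	fs := flag.NewFlagSet("max-size", flag.ContinueOnError)
	maxFiles := fs.Int("max-files", 10, "max number of files")
	if err := fs.Parse(args); err != nil {
		return 0, err // e.g. a non-integer value
	}
	if fs.NArg() != 0 {
		return 0, fmt.Errorf("unexpected arguments: %v", fs.Args())
	}
	// maxFiles is a pointer; dereference it to get the integer.
	return *maxFiles, nil
}

func main() {
	n, err := parseMaxFiles(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("limit:", n)
}
```

Without arguments, the sketch reports the default of 10; with --max-files 3 or --max-files=3, it reports 3, and a value such as abc is rejected with an error.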

The Itoa() function from the strconv package converts the integer behind the dereferenced *max_files pointer back into a string, and line 28 appends it to the LIMIT clause of the SQL command. The advantage of this approach is that only an integer can end up in the query and not an arbitrary character string that could be abused for SQL injection attacks.
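An alternative to concatenating the number into the SQL text is a ? placeholder, as Listing 2 already uses for its INSERT. The sketch below only builds the query text and its argument list, so it runs without a database driver; the largestFilesQuery() helper is hypothetical, and it assumes that the driver binds a parameter in the LIMIT clause, which SQLite permits.

```go
package main

import (
	"errors"
	"fmt"
)

// largestFilesQuery returns the SQL text and the values to bind to its
// placeholders. The limit travels as a bound parameter, not as text.
func largestFilesQuery(limit int) (string, []interface{}, error) {
	if limit < 1 {
		return "", nil, errors.New("limit must be at least 1")
	}
	const q = "SELECT path, size FROM files ORDER BY size DESC LIMIT ?"
	return q, []interface{}{limit}, nil
}

func main() {
	q, args, err := largestFilesQuery(10)
	if err != nil {
		panic(err)
	}
	// With an open *sql.DB, the call would be:
	//   rows, err := db.Query(q, args...)
	fmt.Println(q, args)
}
```

With this variant, the SQL text stays constant, and the database treats the value strictly as data.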

In comparison, Listing 5 shows that a database client in a scripting language like Python is easier to program. Because Python also comes with an SQLite driver, Listing 5 can use the database created by Go earlier without further ado. It digs out all database entries whose file paths contain a given string. It expects the search string at the command line, wraps it in % wildcards, passes it to an SQL LIKE query as a bound parameter, and outputs the matches.

Listing 5

like.py

01 #!/usr/bin/env python3
02 import sys
03 import sqlite3
04
05 try:
06   _, pattern = sys.argv
07 except ValueError:
08   raise SystemExit(
09      "usage: " + sys.argv[0] + " pattern")
10
11 conn = sqlite3.connect('files.db')
12 c = conn.cursor()
13 like = "%" + pattern + "%"
14 for row in c.execute('SELECT path,size FROM files WHERE path LIKE ?', [like]):
15   print(row)
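For comparison, the pattern handling can be sketched in Go as well. LIKE treats % and _ in the search string as wildcards, so a path fragment such as my_notes also matches myXnotes. The hypothetical likePattern() helper below escapes these characters with a backslash, assuming the query declares the backslash as its escape character with an ESCAPE clause; Listing 5 does not do this.

```go
package main

import (
	"fmt"
	"strings"
)

// likeEscaper puts a backslash in front of the characters that have a
// special meaning in a LIKE pattern, including the backslash itself.
var likeEscaper = strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)

// likePattern turns a literal search string into a LIKE pattern that
// matches the string anywhere in the column value.
func likePattern(s string) string {
	return "%" + likeEscaper.Replace(s) + "%"
}

func main() {
	// The ESCAPE clause tells SQLite which escape character is in use.
	const q = `SELECT path, size FROM files WHERE path LIKE ? ESCAPE '\'`
	fmt.Println(q)
	fmt.Println(likePattern("my_notes")) // prints "%my\_notes%"
}
```

The resulting pattern would then be passed as the bound parameter for the ? placeholder, just like the like variable in Listing 5.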

More Luxury, More Lines

Go's type checking and the fact that it runs as a compiled binary rather than inside a bytecode interpreter, with more elegant memory management than a C or C++ program, have their price: The language requires more detailed instructions and generally more lines of code. Go programs run faster than Python scripts, but, as is so often the case, the bottleneck in the use case at hand is not processing instructions but communicating with external systems. In this case, database calls consume most of the run time. Whether the program code itself runs 10 or 100 percent faster is largely irrelevant.

However, the compact binary format with embedded libraries and no dependency worries is a big advantage, and probably one of the reasons Go has become the first choice for all types of system programming tasks.

The Author

Mike Schilli works as a software engineer in the San Francisco Bay area, California. Each month in his column, which has been running since 1997, he researches practical applications of various programming languages. If you email him at mschilli@perlmeister.com, he will gladly answer any questions.
