Full-text search with Solr, Xapian, and Sphinx

Tracker Dogs

Article from Issue 153/2013
Author(s):

Full-text search engines like Solr, Xapian, and Sphinx make the daily data chaos on your hard disk searchable – and they even cooperate with relational databases.

Creating a list of 10 websites that discuss the latest Ubuntu release is simple: just use Google or another one of the popular web search engines. But if you host an information-packed website yourself and want to offer your own search function for it, you need a full-text search tool. Full-text search engines have other benefits for the user and developer. If you are building a custom application or DVD, for instance, you might want to include a full-text search tool to put important information at the user's fingertips. Full-text search delves the depths of random or systematically arranged data for one or more search terms. You will want the search results sorted by relevance, and you will want the results in a split second.

Luckily, admins and developers need not reinvent the wheel: Solr, Xapian, and Sphinx are open source projects that index and analyze data. But how do you define data? You can roughly distinguish two states in which the search engines find information: structured and unstructured.

Structured data has a fixed, predefined structure that allows it to be easily recognized, categorized, and processed with the help of applications. The most common form of structured data is a relational database, with data organized in rows and columns that, in turn, are connected in the form of tables. In contrast to this, unstructured data lacks a data model. Such data sets are often so ambiguous that a program cannot simply process them because the data, facts, and figures are totally mixed. Unstructured data is the domain of search engines that can at least arrange the chaotic data semantically.

Although both structured and unstructured data are both important, structured data offers several challenges for the search engine. For instance, the search engine must interact with a database engine to gain access to the data (see the box titled "No Connector?"). Also, websites and other custom web service infrastructures increasingly rely on back-end database systems for storing and organizing critical information.

No Connector?

Clearly, searching a database is more complicated with no database connector. In this case, the users themselves must ensure that the search engine treats each database entry as an individual document and indexes it as if it were unstructured data. Additionally, you need to create a new document for any subsequent entries in the database.

On top of this, you have to program the search manually to interact with the database. Although it is possible to work around the constraints, you have to consider a few factors – for example, the size of the database and whether it resides in a distributed environment.

In this article, I describe the Solr, Xapian, and Sphinx full-text search tools, paying particular attention to how to handle data stored in relational databases – with the spotlight on SQL variants.

Indexing and Search

Search engines in the wild exist in the form of libraries, which can be integrated into existing code or run as standalone services. These standalone services provide APIs for developers that let them tap into the search engine with their own programs.

Basically, search engines follow two aims: Indexing prepares the data so that it is easily accessible, and the search process compiles the information in a meaningful way. If you look up the term index in the dictionary, you will see that indexing means making a list of things. Making a list is the first step in the process.

Solr, Sphinx, and Xapian rely on indexers to sift through the full inventory of data and apply different algorithms to create a searchable form (see the box titled "Stop Words and Stem Formation"). Grossly simplified, the search engine isolates key terms from the mass of data and stores them in the form of trees, binary trees, or hash tables in index files that also remember the relative position of the terms in the original documents.

Stop Words and Stem Formation

Every language includes words that do not add any significance to the body of text in which they appear; the indexer usually discards these words. In English, these stop words include the definite article (the), indefinite articles (a, an), and conjunctions (and, or).

Stem formation is a method that reverts declined or conjugated words to their stems. The associated algorithm assigns, for example, both "went" and "gone" to the stem "go."

When it comes to indexing information stored in a database, the index file generated by the search engine usually contains at least two things: the fields that are searched by the user, and the ID of the entry in the database, which makes it possible to reconstruct the record acquired by the search engine if necessary.

If multiple tables in a database are linked, the search engines must first understand how the individual tables are constructed, how they are related, and what information the full-text search needs when indexing in the database. When it finally comes to the actual data search, the search engine looks in the index file for the search terms and returns the IDs of the matches and the data sets created for them.

Some full-text search engines, whether they be servers or libraries, come with native support for interacting with popular database services, which allows automated indexing.

Indexing alone is useless if no search function can interpret the generated indices in a meaningful way. If you specify one or more keywords for a search, a search function investigates the indices created earlier and returns relevant results [1]. Because indexing removes redundant and meaningless words, the search proceeds much more quickly than if a computer were to search the original documents.

Reorganizing the data allows the user to track down an article, titled Many ways to find a cat by entering, for example, I'm looking for my kitty.

Apache Solr

Solr [2] is an open source, full-text search server by the Apache Software Foundation (the current version is 4.3.0) based on the Lucene search library [3], which also belongs to the Apache Software Foundation. Solr thus automatically assimilates new features when the Lucene project updates its software. Companies and institutions that use Solr include Netflix and SourceForge, but also NASA and the White House. As early as May 2011, Solr and Lucene decided to merge.

Even though Lucene provides the search results, Solr comes with some interesting extra features. These features include not only match highlighting, which visually highlights the sites in the search results, but also near-real-time indexing, which ensures that recently indexed data is almost immediately available for searching. It also indexes and parses rich documents (e.g., DOC, HTML, and PDF files) using Apache Tika [4], integrates a geodetic data feature, and has an extensible plugin architecture.

Solr's API lets you communicate with the server in almost any programming language because it uses REST-like standards (see the "REST" box) and stores information in XML, JSON, and CSV files.

REST

The term Representational State Transfer (REST) was first introduced in a doctoral dissertation by Roy Fielding. Among other things, the REST concept specifies that all components of dynamically generated web pages should be accessible via URIs.

Because it is written in Java and already looks back on a few years of existence, Solr copes well with structured data [5]. The server puts users in a position to index almost any database if they do two things: install the correct JDBC driver and configure some XML files. The first step becomes complicated if you do not use a standard database engine. However, if you rely on the widespread MySQL database [6], you can simply install the libmysql-java package on Ubuntu 13.04 and make a few manual adjustments.

Solr itself can be installed using solr-common; in addition to this, you need solr-tomcat as a matching server. A data import handler then takes care of tapping into the database. To use the import handler in Ubuntu, the admin first needs to define it in the /etc/solr/conf/solrconfig.xml file (Listing 1, line 1). At the same time, this file expects a link to the configuration file (data-config.xml, line 3), which you can store, for example, in /home/username.

Listing 1

solrconfig.xml

01 <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
02 <lst name="defaults">
03   <str name="config">/home/<username>/data-config.xml</str>
04 </lst>
05 </requestHandler>

This configuration file (Listing 2) needs to explain how Solr can access the data in the database (line 5), which XML fields and columns it should read, and whether it should change, delete, or add fields (lines 9 to 12). If necessary, different databases can be tapped by using the name field [7]. In this example, the Solr schema file (/etc/solr/conf/schema.xml) additionally needs the id, name, and desc (line 11), which the database user can set up after reading the schema [8], before modifying the data-config.xml file.

Listing 2

data-config.xml

01 <dataConfig>
02   <dataSource type="JdbcDataSource"
03          name="<database1>"
04               driver="com.mysql.jdbc.Driver"
05               url="jdbc:mysql://<localhost>/<database_name>"
06               user="<username>"
07               password="<password>"/>
08   <document>
09     <entity name="id"
10         dataSource="<database1>"
11              query="select id,name,desc from <sample_table>">
12     </entity>
13   </document>
14 </dataConfig>

The next step is to ensure that the appropriate JDBC driver ends up in the /usr/share/solr/web/WEB-INF/lib directory in the form of a .jar file. In the case of MySQL, you can discover this by entering

sudo dpkg -L mysqldata-java

then linking the driver. In the last step, a command imports the data into Solr; a simple HTTP call to the appropriate port specification is all you need:

http://localhost:8080/solr/dataimport?command=full-import

The command can optionally include a =delta-import, if you simply need to update an existing database.

Xapian

Unlike Solr, Xapian [9] is not an independent server that runs separately from the applications that use its services. Instead, it is a C++ library that can be integrated into applications as a search function. The project includes bindings for other programming languages, including Perl, Python, PHP, Java, Tcl, C#, Ruby, Lua, and Erlang.

Xapian users include, for example, the weekly newspaper "Die Zeit" and the Debian project, which uses the search function for the mailing list. The packages for Ubuntu 13.04 are xapian-omega, xapian-tools, and libxapian22.

Xapian stores the data collected during indexing in a standard database that is used in the search step. Xapian also features a weighted probability search: Important words are assigned more weight than unimportant ones, so that more relevant documents are likely to move to the top of the results.

The user can search for exact terms (Figure 1), or for a specific number of words that either appear sorted or random in the search text. Xapian also corrects search keys entered by the user on the basis of its own search index and supports data indexing during a search by immediately integrating new documents into the search results.

Figure 1: The Xapian search page, running under Ubuntu on an Apache server, investigating some previously indexed test documents.

Xapian provides an interesting answer to the problem of searching databases. By default, it does not include support for database indexing. However, the Xapian project also includes a server engine named Omega [10] that uses the Xapian libraries as a basis. On Ubuntu, Omega resides in the xapian-omega package and installs an Apache server. A guide [11] explains how to search some test data.

The link between Xapian and Omega is similar to that between Solr and Lucene. Omega allows users to index databases, including all SQL databases and other database systems supported by the Perl DBI module. To be more specific, the Xapian/Omega team searches MySQL, PostgreSQL, SQLite, Oracle, DB2, MS SQL, LDAP, and ODBC.

For convenience, I'll assume you have a customer table with the fields id (primary key), customerName (varchar 255), and address (text), which belongs to the MySQL database myDB, and that you want to index this table. For Xapian, you need to convert the database format to text. To do so, export the database username and associated password at the command line:

export DBUSER=<DB-username>;
export DBPASSWORD='<DB-Password>';

In the next step, run a script named dbi2omega; Ubuntu 13.04 puts this script in /usr/share/doc/xapian-omega/examples/dbi2omega. The script transfers the contents of the table to a text file:

dbi2omega myDB customer >> customer.txt

In the next step, the Xapian user must create a script (Listing 3) that tells Xapian which fields to include – in this example, the script is named fields.script. In this way, the search software understands that the id field is a unique field, that it needs to index the customerName field, which has a five times greater relevance than the address field, which also ends up in the index  [12].

Listing 3

fields.script

01 id: field=id unique=Q boolean=Q
02 customerName: field=customerName lower index weight=5
03 address: field=address lower index

On Ubuntu, you then run scriptindex, which resides in the path /usr/bin/, to generate the example database:

scriptindex --stemmer=german <ExampleDB> fields.script customer.txt

The --stemmer option helps with stem formation; Xapian's algorithm assumes English as the default language. For another language (e.g., German), you would optionally specify german, as shown in the example.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Recoll

    Whether you’re looking for a letter to the Internal Revenue Service or an email from an online trader, the Recoll desktop search machine will help you find it with just a few mouse clicks.

  • Baloo: KDE Semantic Search

    Baloo replaces Nepomuk as the semantic search engine on the KDE desktop, but it gets off to a bumpy start.

  • Index Search with Lucene

    Even state-of-the-art computers need to use clever methods to process ever-increasing amounts of document data. The open source Lucene framework uses inverted indexing for fast searches of document collections.

  • Perl: Elasticsearch

    The Elasticsearch full-text search engine quickly finds expressions even in huge text collections. With a few tricks, you can even locate photos that have been shot in the vicinity of a reference image.

  • KTools: Kat

    The Kat desktop search tool turns up more than text strings.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News