Full-text search with Solr, Xapian, and Sphinx

Tracker Dogs

Full-text search engines like Solr, Xapian, and Sphinx make the daily data chaos on your hard disk searchable – and they even cooperate with relational databases.

Creating a list of 10 websites that discuss the latest Ubuntu release is simple: just use Google or another popular web search engine. But if you host an information-packed website yourself and want to offer your own search function for it, you need a full-text search tool. Full-text search engines are useful beyond websites, too. If you are building a custom application or DVD, for instance, you might want to include a full-text search tool to put important information at the user's fingertips. Full-text search delves into the depths of random or systematically arranged data for one or more search terms. You will want the search results sorted by relevance, and you will want them in a split second.

Luckily, admins and developers need not reinvent the wheel: Solr, Xapian, and Sphinx are open source projects that index and analyze data. But how do you define data? You can roughly distinguish two states in which the search engines find information: structured and unstructured.

Structured data has a fixed, predefined structure that allows applications to recognize, categorize, and process it easily. The most common form of structured data is a relational database, in which data is organized in rows and columns within tables that can, in turn, be related to one another. In contrast, unstructured data lacks a data model. Such data sets are often so ambiguous that a program cannot simply process them, because the data, facts, and figures are mixed together without any consistent pattern. Unstructured data is the domain of search engines, which can at least arrange the chaotic data semantically.
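To make the contrast concrete, here is a minimal sketch of how free-form text can be made searchable. It builds a tiny inverted index (a map from each word to the documents that contain it) and ranks hits by raw term counts. The sample documents, the function names, and the scoring rule are all assumptions made for illustration; this is not how Solr, Xapian, or Sphinx are actually implemented, and real engines use far more refined analysis and relevance models.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample documents: free-form text with no schema.
DOCS = {
    "note1": "Ubuntu release notes: the new Ubuntu kernel ships next week.",
    "mail7": "Invoice from the online trader, due next week.",
    "memo3": "Kernel panic: the kernel crashed after the release upgrade.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a Counter of {doc_id: number of occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return ids of documents containing every query term, best match first."""
    terms = tokenize(query)
    if not terms:
        return []
    # Only documents that contain all terms qualify (AND semantics).
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Naive relevance: sum of raw term counts; ties are broken by document id.
    scored = [(sum(index[t][d] for t in terms), d) for d in candidates]
    return [d for _, d in sorted(scored, key=lambda s: (-s[0], s[1]))]


INDEX = build_index(DOCS)
```

With these sample documents, `search(INDEX, "kernel")` lists `memo3` ahead of `note1` because the word occurs twice in the memo, while `search(INDEX, "ubuntu release")` returns only `note1`. The index is built once and then answers each query with a few dictionary lookups instead of rescanning every document.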

[...]

Related content

  • Recoll

    Whether you’re looking for a letter to the Internal Revenue Service or an email from an online trader, the Recoll desktop search engine will help you find it with just a few mouse clicks.

  • Index Search with Lucene

    Even state-of-the-art computers need to use clever methods to process ever-increasing amounts of document data. The open source Lucene framework uses inverted indexing for fast searches of document collections.

  • Perl: Elasticsearch

    The Elasticsearch full-text search engine quickly finds expressions even in huge text collections. With a few tricks, you can even locate photos that have been shot in the vicinity of a reference image.

  • KTools: Kat

    The Kat desktop search tool turns up more than text strings.

  • Wikia Search Approaching Launch

    Wikia Search, a search engine with free algorithms, was one of last year's most eagerly awaited projects. The wait is due to end January 7.
