An indexing search engine with Nutch and Solr

Content management systems, wikis, text files … modern companies store important data in many different places, and that data must be accessible down to the tiniest detail through a single search. Commercial software vendors such as Google [1] offer tools that will index the data and store the index on an external server. But many organizations prefer to keep control of the search capabilities – for security and privacy reasons, but also to add flexibility and promote innovation and customization.

A handy constellation of open source tools from the Apache project will help you build your own search index for the assorted documents and data on your network: Nutch, Solr, Apache, and Lucene.

Nutch [2] is a powerful web crawler, and Apache Solr [3] is a search engine based on Apache Lucene [4]. You can combine Nutch with Solr to create a complete search engine – a miniature Google, if you like.
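
To make the division of labor concrete, the following minimal sketch shows how a client might query a Solr index that Nutch has already filled. It relies on Solr's standard HTTP select handler; the server address, the core name nutch, and the field name content are assumptions made for illustration and will differ depending on your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions for illustration only: a local Solr server and a core
# named "nutch". Adjust both to match your own installation.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "nutch"


def build_query_url(terms, rows=10, base=SOLR_BASE, core=CORE):
    """Build a URL for Solr's select handler that returns JSON."""
    params = {"q": terms, "rows": rows, "wt": "json"}
    return f"{base}/{core}/select?{urlencode(params)}"


def search(terms, rows=10):
    """Send the query to Solr and return the list of matching documents."""
    with urlopen(build_query_url(terms, rows)) as response:
        data = json.load(response)
    return data["response"]["docs"]


if __name__ == "__main__":
    # Only builds and prints the URL, so no running Solr server is needed.
    # "content" is an assumed field name for the crawled page text.
    print(build_query_url("content:lucene", rows=5))
```

With a Solr server running and an index populated by a Nutch crawl, calling search("content:lucene") would return the matching documents as Python dictionaries. In this arrangement Nutch only fetches and parses pages, whereas Solr answers the queries.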

[...]


