Indexing and searching text with Lucene
SmartSearch
© adimas, Fotolia.com
Even state-of-the-art computers need to use clever methods to process ever-increasing amounts of document data. The open source Lucene framework uses inverted indexing for fast searches of document collections.
Nowadays, almost any commercially available hard drive can store more text than a whole library. In the digital world, a traditional system such as a card catalog or a knowledgeable librarian is no longer adequate to help find the right shelf. Even software equivalents such as find or zgrep are not always fast enough to track down a particular piece of information among gigabytes or terabytes of data.
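To illustrate why an index can beat scanning tools such as zgrep, here is a minimal Python sketch of the inverted indexing idea: each term maps to the set of documents that contain it, so a query only looks up its terms instead of reading every file. This is not Lucene's actual implementation or API; the function names, the tokenizer, and the sample documents are assumptions made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up each term's posting set without adding new keys to the index.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection, keyed by document ID.
docs = {
    1: "Lucene is a Java library for indexing text",
    2: "A card catalog helps find the right shelf",
    3: "Searching text with an inverted index is fast",
}
index = build_index(docs)
print(sorted(search(index, "text")))           # [1, 3]
print(sorted(search(index, "indexing text")))  # [1]
```

The cost of building the index is paid once up front; after that, each query touches only the posting sets of its own terms. A real search library adds much more on top of this, such as text analysis, relevance ranking, and compact on-disk storage.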
The science that deals with this type of search problem is called information retrieval. Computer scientists have developed sophisticated methods for tracking down files that users don't even know exist. The free Java library Lucene [1] implements some of these methods. Doug Cutting published an early version of Lucene in 1999. Two years later, the project, which carries the middle name of Cutting's wife, came under the auspices of the Apache Software Foundation when it joined the Apache Jakarta Project.
Version 4.0 of Lucene was released in October 2012. The index file structures are backward compatible, so the transition from 3.6 to 4.0 does not cause any problems. Over the years, Lucene has become one of the most widely used solutions for indexing and searching text. (See the box titled "Lucene In All Its Facets.")
[...]