Smart research using Elasticsearch

More, Please

Article from Issue 182/2016

Author(s): Mike Schilli

Websites often offer readers links to articles about similar topics. Using Elasticsearch, the free search engine, is one way to find related documents instantly and automatically.

When people rummage around on the StackOverflow website looking for advice on programming questions, they can find list of links to related topics in the Related section (Figure 1). This helps users if the first search result didn't show what they expected or the located resource is insufficient. According to Gormley and Tong [1], Elasticsearch [2] [3], the free search engine, generates these links on the website in real time from the growing and very impressive collection of 10 million StackOverflow contributions.

Figure 1: Using Elasticsearch, StackOverflow generates links to similar topics in the Related section.

Artificial Brain

This isn't actual intelligence at work, because computers still find it difficult to understand the content of a document, meaning they can't find documents with related content. In fact, the algorithm used is based on simple nitpicking – it combines values for word frequency and derives a score from those values.

Elasticsearch uses an inverted index for this task; this index is a complete list of individual words that appear in any of the documents that have been added to the search engine so far. It remembers which document each word has been found in and can therefore instantly output a list of documents for a search term.

For example, if a user is looking for the term perl, Elasticsearch will immediately find the doc-1 document in the inverted index from Figure 2 and present this (one hopes) accurate search result.

Figure 2: The inverted index maintains a list of documents in which the indexer found specific words.

When searching for two words (e.g., linux and cpan), two documents come into consideration, but because doc-2 only contains one term, whereas doc-3 contains both, the algorithm gives doc-3 a higher relevance score. In a match list sorted by descending score, doc-3 is then at the very top and is more likely to match the user's expectations.

What Is Relevant?

Not all words are equally important. For example, the word file understandably comes up in quite a large number of documents on computer topics. If the user searches for file linux cpan, doc-2 and doc-3 provide two matches each, but because linux is more significant for one document than file, the algorithm rates linux cpan higher than cpan file and gives doc-3 preference.

The Tf-idf score [4] determines how important a word is within a document. It gives a high value to those words that are prevalent in one document but do not occur too frequently in other documents competing for a high score (i.e., words that underpin the uniqueness of the document). A word's relevance value increases with the number of times the word appears in the document (this is known as term frequency, TF) and decreases if the word also appears in many other documents in the collection (IDF, inverse document frequency).

Searching for the Same

To find documents in the database that are similar to document x, Elasticsearch first extracts all relevant words from x, then forms a search query using these words and returns the results. Elasticsearch performs this search for similar documents using the more_like_this query [5] command with very little programming required. However, all the relevant documents must be added to the index beforehand. I'll be using the official Perl client Search::Elasticsearch from CPAN for this.

Listing 3 (described later) wades through a directory of text files. These are stories from my Usarundbrief.com blog that I extracted from the home-grown content management system via another Perl script. Each text file corresponds to a blog entry – 877 messages accumulated over almost 20 years, which is why the directory contains 877 files. Listing 1 shows the shell output [6].

Listing 1

Feed Data

$ ls idx | wc -l
877
$ mlt-index
Added 10-cents-for-a-grocery-bag.txt
Added a-job-for-angelika.txt
Added absurd-and-funny-american-tv-shows.txt
...

1 2 3 Next »

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU

Games , Hardware , laptop , Linux

This latest release is the first laptop to include the new CPU from Ryzen and Linux preinstalled.
XZ Gets the All-Clear

Arch Linux , Fedora , Linux , open source , Security , Ubuntu

The back door xz vulnerability has been officially reverted for Fedora 40 and versions 38 and 39 were never affected.
Canonical Collaborates with Qualcomm on New Venture

Artificial Inte... , Linux , open source , Security , Ubuntu

This new joint effort is geared toward bringing Ubuntu and Ubuntu Core to Qualcomm-powered devices.
Kodi 21.0 Open-Source Entertainment Hub Released

audio , Multimedia , Music , open source , streaming video , Video

After a year of development, the award-winning Kodi cross-platform, media center software is now available with many new additions and improvements.
Linux Usage Increases in Two Key Areas

Games , Linux , open source , Steam

If market share is your thing, you'll be happy to know that Linux is on the rise in two areas that, if they keep climbing, could have serious meaning for Linux's future.
Vulnerability Discovered in xz Libraries

Fedora , Linux , malware , Security

An urgent alert for Fedora 40 has been posted and users should pay attention.
Canonical Bumps LTS Support to 12 years

Linux , open source , Operating Systems , Ubuntu

If you're worried that your Ubuntu LTS release won't be supported long enough to last, Canonical has a surprise for you in the form of 12 years of security coverage.
Fedora 40 Beta Released Soon

Fedora , Gnome , open source , Plasma , Wayland

With the official release of Fedora 40 coming in April, it's almost time to download the beta and see what's new.
New Pentesting Distribution to Compete with Kali Linux

Linux , open source , Tools , Ubuntu

SnoopGod is now available for your testing needs
Juno Computers Launches Another Linux Laptop

Hardware , laptop , Linux , Ubuntu

If you're looking for a powerhouse laptop that runs Ubuntu, the Juno Computers Neptune 17 v6 should be on your radar.

Smart research using Elasticsearch