org.msync/lucene-clj

What is lucene-clj?

A simple Clojure wrapper for Apache Lucene (version 8.9.0).

Key usage scenarios

Search
The core use-case of Lucene.
Suggest
Prefix-queries for content in any field.

Both in-memory, and on-disk indexes can be used depending on the dataset size.

Note: UNSTABLE API. No releases yet.

Inspired by other example wrappers I’ve come across. Notably

Adding Dependency to a Project

[org.msync/lucene-clj "0.2.0-SNAPSHOT"]

Available via clojars.
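
For projects that use the Clojure CLI instead of Leiningen, the same coordinates would presumably go into deps.edn as sketched below. This assumes the default tools.deps repository setup, which includes Clojars; it is not taken from the repository itself.

```clojure
;; deps.edn (sketch; assumes Clojars is among the configured repositories,
;; as it is by default with the Clojure CLI)
{:deps {org.msync/lucene-clj {:mvn/version "0.2.0-SNAPSHOT"}}}
```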

Why Would You Want to Use lucene-clj?

The primary use-case is for in-process text search needs for read-only data-sets that can be managed on single-instance deployments. For multi-instance deployments, keeping modifications of data in sync is an effort.

Use this library when you need light-weight text-search support without the hassle of setting up something like Solr. You may update the index if you wish, but you have to take care of any race conditions and, since it is in-process, you will also need to take care of updating all instances in a multi-instance scenario.

The objectives are loosely as follows.

  • Stick to core Lucene. No script- or language-specific dependencies are part of the core library, but users can add them as needed.
  • Support for prefix based suggestions - a feature of Lucene I found quite undocumented, as well as lacking good examples for.
  • Track the latest Lucene versions.

I am thankful to the above library authors for their liberal licensing. I’ve used their ideas/code in places.

Usage - A Complete Scenario

There’s sample data in the repository that we use in our examples. A hand-created sample with fictional and non-fictional characters is here and one from Kaggle on music albums is here. These are also used in the tests.

A complete scenario from index creation to search actions is described below.

Sample Datasets Used

  1. Albums - Kaggle - [local]
  2. Hand-created, real + fictional characters here

Lucene’s Document Model

When dealing with Lucene and data it processes, key terms to note are

Document
A unit of related text. It has possibly many fields, and is a unit of consumption and also of each search result. A Document is a collection of Fields.
Field
Every field is a container of indexable content. They can range across many types, from simple text to latitude and longitude.
Analyzer
Analyzes the input documents and preprocesses terms appropriately. Decisions such as tokenizing, stemming, removing stop-words, or treating the input as-is depend on the context, and are controlled by choosing appropriate analyzers.

This is a pretty hand-wavy description, but useful enough for our purpose.

Some Background - Data Preparation

Lucene consumes documents, each of which is made up of fields having values. As is natural in Clojure, we represent all such things as maps.

{
 :title-field "This is a title"
 :abstract-field "This is an abstract of what is to follow"
 :author-field "Lekhak Sampaadak"
 :body-field "And here's the crux of the article with all the gory details"
 }

To prepare our content for ingestion and indexing, we do some straightforward CSV parsing and conversion of each row into a map. Each column has a name and is used as the key for the field name in the document-map. All the preparation code is in the msync.lucene.tests-common test namespace, which we’ll refer to as the common namespace where required for clarity. We use two CSV data-sets as our sources of documents to create two indexes, to demonstrate some distinct use-cases. All data files are in the test-resources subdirectory.
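
The row-to-map conversion described above can be sketched as follows. This is a minimal illustration, not the code from the common namespace: `rows->maps` is a hypothetical helper, and it assumes the CSV has already been parsed into a sequence of rows with the header row first.

```clojure
(defn rows->maps
  "Turns parsed CSV rows (a header row followed by data rows) into maps
  keyed by the keywordized column names."
  [[header & rows]]
  (let [ks (map keyword header)]
    (map #(zipmap ks %) rows)))

(rows->maps [["first-name" "last-name" "age"]
             ["Suppandi" "Varadarajan" "16"]
             ["Shikari" "Shambhu" "32"]])
;; => ({:first-name "Suppandi", :last-name "Varadarajan", :age "16"}
;;     {:first-name "Shikari", :last-name "Shambhu", :age "32"})
```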

A Glimpse of the Data

We use two simple datasets, stored as CSV. Loading is straightforward CSV parsing and converting to maps – the first rows in each file are the header rows, holding names of respective columns.

  • Sample, hand-coded documents. Plain, simple data.
;; In the common namespace
(take 5 (read-csv-resource-file sample-data-file))
| first-name | last-name   | age | real  | gender | bio |
|------------+-------------+-----+-------+--------+-----|
| Suppandi   | Varadarajan | 16  | false | m      | A wonderful, innocent soul. You’ll enjoy his antics. |
| Shikari    | Shambhu     | 32  | False | m      | Carries a gun. But no bullets. Animals love him. |
| Chacha     | Chaudhary   | 64  | FalSe | m      | The supercomputer. And then some more! |
| Sabu       | Jupiterwala | 2   | false | m      | Yes, of legal age. Just a different age-scale because of the planet he comes from. Strong, powerful, but kind. Because, not an earthling. Children love him. |
  • Albums data. From Kaggle.
    • The columns Genre and Subgenre are comma-separated values themselves
      • They are to be pre-processed before feeding to lucene-clj
      • These are multi-valued fields.
;; In the common namespace
(take 5 (read-csv-resource-file albums-file))
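| Number | Year | Album                                 | Artist         | Genre | Subgenre                      |
|--------+------+---------------------------------------+----------------+-------+-------------------------------|
| 1      | 1967 | Sgt. Pepper’s Lonely Hearts Club Band | The Beatles    | Rock  | Rock & Roll, Psychedelic Rock |
| 2      | 1966 | Pet Sounds                            | The Beach Boys | Rock  | Pop Rock, Psychedelic Rock    |
| 3      | 1966 | Revolver                              | The Beatles    | Rock  | Psychedelic Rock, Pop Rock    |
| 4      | 1965 | Highway 61 Revisited                  | Bob Dylan      | Rock  | Folk Rock, Blues Rock         |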
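
The pre-processing of the multi-valued Genre and Subgenre columns could look roughly like this. The helper names `split-multi` and `prepare-album` are hypothetical, and the code in the common namespace may well differ (for instance, it may produce lists rather than vectors).

```clojure
(require '[clojure.string :as str])

(defn split-multi
  "Splits a comma-separated cell into a vector of trimmed, non-blank values."
  [s]
  (->> (str/split s #",")
       (map str/trim)
       (remove str/blank?)
       vec))

(defn prepare-album
  "Replaces the Genre and Subgenre strings of an album map with
  collections of individual values."
  [album]
  (-> album
      (update :Genre split-multi)
      (update :Subgenre split-multi)))

(prepare-album {:Album "Revolver"
                :Genre "Rock"
                :Subgenre "Psychedelic Rock, Pop Rock"})
;; => {:Album "Revolver", :Genre ["Rock"], :Subgenre ["Psychedelic Rock" "Pop Rock"]}
```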

Creating Analyzers

Analyzers process each field’s content in a manner that is apt - according to what the programmer/domain-expert decides.

Some fields need to be tokenized and stemmed, while others are to be treated verbatim. Think of natural-language text on the one hand, and proper nouns like a company name or a music genre on the other.

In the albums dataset, the Year, Genre and Subgenre fields’ texts are not to be tokenized and stemmed, or filtered for stop-words. Hence, they are configured to be analyzed with the keyword analyzer. Other fields can be treated like normal text. So, in this case, we use a composed analyzer that can treat each field in its special way.

Note that the same analyzers we use while creating indexes should be used when querying the index for search and suggest to avoid surprises. This shouldn’t be surprising.

Here’s how we create analyzers.

;; In the common namespace
;; This is the default analyzer, an instance of the StandardAnalyzer
;; of Lucene
(defonce default-analyzer (analyzers/standard-analyzer))

;; This analyzer considers field values verbatim
;; Will not tokenize and stem
(defonce keyword-analyzer (analyzers/keyword-analyzer))

;; A per-field analyzer, which composes other kinds of analyzers
;; For album data, we have marked some fields as verbatim
;; Takes a default analyzer, and then a map of field to field-specific analyzer
(defonce album-data-analyzer
  (analyzers/per-field-analyzer default-analyzer
                                {:Year     keyword-analyzer
                                 :Genre    keyword-analyzer
                                 :Subgenre keyword-analyzer}))
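
To get a feel for why the per-field choice matters, here is a rough pure-Clojure imitation of the idea. It does not call Lucene and is not Lucene's tokenization algorithm; `standard-ish-tokens`, `keyword-ish-tokens` and `tokens-for` are made-up names that only mimic "split into lower-cased terms" versus "keep the whole value as one term", plus the per-field dispatch with a default.

```clojure
(require '[clojure.string :as str])

(defn standard-ish-tokens
  "Rough imitation of a tokenizing analyzer: lower-cases the input and
  splits on anything that is not a letter or digit. NOT Lucene's algorithm."
  [s]
  (->> (str/split (str/lower-case s) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       vec))

(defn keyword-ish-tokens
  "Imitation of a keyword analyzer: the whole value is a single token."
  [s]
  [s])

(defn tokens-for
  "Per-field dispatch: use the field-specific tokenizer if there is one,
  else fall back to the default."
  [per-field default field value]
  ((get per-field field default) value))

(def album-tokenizers
  {:Year     keyword-ish-tokens
   :Genre    keyword-ish-tokens
   :Subgenre keyword-ish-tokens})

(tokens-for album-tokenizers standard-ish-tokens :Album "Pet Sounds")
;; => ["pet" "sounds"]
(tokens-for album-tokenizers standard-ish-tokens :Subgenre "Psychedelic Rock")
;; => ["Psychedelic Rock"]
```

With the verbatim treatment, a value like "Psychedelic Rock" stays a single term, so it can only be matched as a whole.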

Some simple REPL-runs

With the background setup done and explained, let us move ahead to demonstrating indexing and searching. You may want to try the following in a REPL by requiring the namespace the prior code is in and then playing along. I’ve used the dev namespace below, the code for which can be found here.

Preamble

(ns dev
  (:require [msync.lucene :as lucene]
            [msync.lucene
             [document :as ld]
             [tests-common :as common]]))

Create an index

In memory

(defonce album-index (lucene/create-index! :type :memory
                                           :analyzer common/album-data-analyzer))

Or, on disk

(defonce album-index (lucene/create-index! :type :disk
                                           :path "/path/to/index/directory"
                                           :analyzer common/album-data-analyzer))

A sample of the album data for reference. The Genre and Subgenre columns are pre-processed, as mentioned above, and split further.

(drop 2 (take 5 common/album-data))
({:Number "3",
  :Year "1966",
  :Album "Revolver",
  :Artist "The Beatles",
  :Genre ("Rock"),
  :Subgenre ("Psychedelic Rock" "Pop Rock")}
 {:Number "4",
  :Year "1965",
  :Album "Highway 61 Revisited",
  :Artist "Bob Dylan",
  :Genre ("Rock"),
  :Subgenre ("Folk Rock" "Blues Rock")}
 {:Number "5",
  :Year "1965",
  :Album "Rubber Soul",
  :Artist "The Beatles",
  :Genre ("Rock" "Pop"),
  :Subgenre ("Pop Rock")})

Index documents

Documents are Clojure maps. Each key-value pair in the map represents one org.apache.lucene.document.Field. The options passed to the `index!` function control its behavior in various ways:

  • :stored-fields - Lucene can index for efficient searching, but to save space, it need not store all the field values. If you want Lucene to also store the contents, pass them as a collection to this argument. The alternative is to use Lucene to index without storing large fields, and
  • :suggest-fields - Fields that are treated specially during indexing, allowing Lucene to create internal structures for quick prefix matching.
  • :context-fn - Lucene allows for a list of contexts to associate with the suggest fields, which allow us to filter on them while querying for suggestions.

In the following, we instruct the `index!` function to

  • Store the mentioned fields
  • Use the :Album and :Artist fields to index for suggestions - this uses some special processing and storage in the index.
  • Use the :Genre field as context. Note that :Genre can itself hold multiple values for each document, and that works fine.
(lucene/index! album-index common/album-data
               {:stored-fields  [:Number :Year :Album :Artist :Genre :Subgenre]
                :suggest-fields [:Album :Artist]
                :context-fn     :Genre})
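
Since :Genre is a keyword, and keywords act as functions on maps in Clojure, the :context-fn option presumably accepts any function from a document map to a collection of contexts. Under that assumption (not verified against the library), a custom context function might look like the sketch below. `genre-and-subgenre` is a hypothetical name.

```clojure
(defn genre-and-subgenre
  "Hypothetical context function: combines both multi-valued fields of a
  document map into one collection of contexts."
  [doc]
  (distinct (concat (:Genre doc) (:Subgenre doc))))

(genre-and-subgenre {:Album    "Rubber Soul"
                     :Genre    ["Rock" "Pop"]
                     :Subgenre ["Pop Rock"]})
;; => ("Rock" "Pop" "Pop Rock")

;; Assumed usage, untested against lucene-clj:
;; (lucene/index! album-index common/album-data
;;                {:suggest-fields [:Album :Artist]
;;                 :context-fn     genre-and-subgenre})
```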

Now, we can search

A simple search example, in which we pass a map specifying the field, and the value we are looking for. The result includes the :hit, a :score for that :hit, and the :doc-id which is an identifier that Lucene manages. Notice that the result - :hit - is a Lucene Document object.

(lucene/search album-index {:Year "1977"}
               {:results-per-page 2})
[{:doc-id 25,
  :score 1.4994705,
  :hit
  #object[org.apache.lucene.document.Document 0x24750f97 "Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Number:26> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Year:1977> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Album:Rumours> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Artist:Fleetwood Mac> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Genre:Rock> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Subgenre:Pop Rock>>"]}
 {:doc-id 40,
  :score 1.4994705,
  :hit
  #object[org.apache.lucene.document.Document 0x6d6a6fe4 "Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Number:41> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Year:1977> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Album:Never Mind the Bollocks Here's the Sex Pistols> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Artist:Sex Pistols> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Genre:Rock> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Subgenre:Punk>>"]}]

For convenience, lucene-clj has a function that can be used to convert the Lucene Document into a Clojure map. For anything beyond basic use-cases, supply your own.

(lucene/search album-index {:Year "1977"}
               {:results-per-page 2
                :hit->doc ld/document->map})
[{:doc-id 25,
  :score 1.4994705,
  :hit
  {:Number "26",
   :Year "1977",
   :Album "Rumours",
   :Artist "Fleetwood Mac",
   :Genre "Rock",
   :Subgenre "Pop Rock"}}
 {:doc-id 40,
  :score 1.4994705,
  :hit
  {:Number "41",
   :Year "1977",
   :Album "Never Mind the Bollocks Here's the Sex Pistols",
   :Artist "Sex Pistols",
   :Genre "Rock",
   :Subgenre "Punk"}}]

Notice, though, that the :Genre and :Subgenre fields did not come back as collections. The document->map function isn’t smart enough to identify that, and needs a hint to make it happen. With the modified hit->doc argument, the two fields come back as vectors with possibly multiple values.

(lucene/search album-index
               {:Year "1977"}
               {:results-per-page 2
                :hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])})
[{:doc-id 25,
  :score 1.4994705,
  :hit
  {:Number "26",
   :Year "1977",
   :Album "Rumours",
   :Artist "Fleetwood Mac",
   :Genre ["Rock"],
   :Subgenre ["Pop Rock"]}}
 {:doc-id 40,
  :score 1.4994705,
  :hit
  {:Number "41",
   :Year "1977",
   :Album "Never Mind the Bollocks Here's the Sex Pistols",
   :Artist "Sex Pistols",
   :Genre ["Rock"],
   :Subgenre ["Punk"]}}]
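
The reason a hint is needed is that a Lucene Document holds a multi-valued field simply as several fields sharing one name. The following sketch shows the kind of grouping the hint enables, using plain [field value] pairs in place of a real Document. `pairs->map` is a made-up helper for illustration and not the library's implementation.

```clojure
(defn pairs->map
  "Builds a map from [field-name value] pairs. Fields listed in
  multi-fields collect all their values into a vector; for any other
  field, the last value seen wins."
  [pairs multi-fields]
  (let [multi? (set multi-fields)]
    (reduce (fn [m [k v]]
              (if (multi? k)
                (update m k (fnil conj []) v)
                (assoc m k v)))
            {}
            pairs)))

(pairs->map [[:Album "Rubber Soul"]
             [:Genre "Rock"]
             [:Genre "Pop"]
             [:Subgenre "Pop Rock"]]
            [:Genre :Subgenre])
;; => {:Album "Rubber Soul", :Genre ["Rock" "Pop"], :Subgenre ["Pop Rock"]}
```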

Paginated query results are supported via the :page option. Also, the following example projects a subset of the document fields by passing a modified function as the :hit->doc argument.

(lucene/search album-index
               {:Year "1968"} ;; Map of field-values to search with
               {:results-per-page 5 ;; Control the number of results returned
                :page 2             ;; Page number, starting 0 as default
                :hit->doc         #(-> %
                                       ld/document->map
                                       (select-keys [:Year :Album]))})
[{:doc-id 160,
  :score 1.4311604,
  :hit {:Year "1968", :Album "The Dock of the Bay"}}
 {:doc-id 170,
  :score 1.4311604,
  :hit {:Year "1968", :Album "The Notorious Byrd Brothers"}}
 {:doc-id 204,
  :score 1.4311604,
  :hit {:Year "1968", :Album "Wheels of Fire"}}
 {:doc-id 233,
  :score 1.4311604,
  :hit {:Year "1968", :Album "Bookends"}}
 {:doc-id 257,
  :score 1.4311604,
  :hit
  {:Year "1968",
   :Album "The Kinks Are The Village Green Preservation Society"}}]
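
As a mental model for the two paging options, assuming pages are zero-based as the comment above suggests, page p with n results per page covers the hits at offsets p*n through p*n + n - 1 in ranked order. The sketch below expresses that arithmetic over a plain sequence; `page-slice` is a hypothetical helper, not how the library fetches pages from Lucene.

```clojure
(defn page-slice
  "Returns the hits that fall on the given zero-based page."
  [hits results-per-page page]
  (->> hits
       (drop (* page results-per-page))
       (take results-per-page)))

;; 12 ranked hits, 5 per page: page 2 holds the last two.
(page-slice (range 12) 5 2)
;; => (10 11)
```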

Search variations

Simple search

Searching in a single field, for a single value

(lucene/search album-index {:Year "1967"} {:results-per-page 2 :hit->doc ld/document->map})

OR Search

Searching in a single field, where any of the values in the set are allowed

(lucene/search album-index {:Year #{"1960" "1965"}}
               {:results-per-page 5
                :hit->doc #(-> % ld/document->map (select-keys [:Year :Album]))})
[{:doc-id 118,
  :score 2.2562923,
  :hit {:Year "1960", :Album "At Last!"}}
 {:doc-id 347,
  :score 2.2562923,
  :hit {:Year "1960", :Album "Muddy Waters at Newport 1960"}}
 {:doc-id 357,
  :score 2.2562923,
  :hit {:Year "1960", :Album "Sketches of Spain"}}
 {:doc-id 3,
  :score 1.6102078,
  :hit {:Year "1965", :Album "Highway 61 Revisited"}}
 {:doc-id 4,
  :score 1.6102078,
  :hit {:Year "1965", :Album "Rubber Soul"}}]

AND Search

When looking for multiple terms in a single field, pass a vector.

(lucene/search album-index {:Album ["complete" "unbelievable"]} {:hit->doc ld/document->map})
[{:doc-id 253,
  :score 3.0571077,
  :hit
  {:Number "254",
   :Year "1966",
   :Album
   "Complete & Unbelievable: The Otis Redding Dictionary of Soul",
   :Artist "Otis Redding",
   :Genre "Funk / Soul",
   :Subgenre "Soul"}}]

Be sure that your queries are semantically right for the data-set. For example, AND-ing over two different years will lead to an empty result-set, obviously.

(lucene/search album-index {:Year ["1964" "1965"]})
[]

Phrase search

Spaces in the query string are inferred to mean a phrase search operation

(lucene/search album-index {:Album "the sun"} {:hit->doc ld/document->map})
[{:doc-id 10,
  :score 2.8861985,
  :hit
  {:Number "11",
   :Year "1976",
   :Album "The Sun Sessions",
   :Artist "Elvis Presley",
   :Genre "Rock",
   :Subgenre "Rock & Roll"}}
 {:doc-id 287,
  :score 2.544825,
  :hit
  {:Number "288",
   :Year "1968",
   :Album "Anthem of the Sun",
   :Artist "Grateful Dead",
   :Genre "Rock",
   :Subgenre "Psychedelic Rock"}}
 {:doc-id 310,
  :score 2.544825,
  :hit
  {:Number "311",
   :Year "1994",
   :Album "The Sun Records Collection",
   :Artist "Various",
   :Genre "& Country",
   :Subgenre "Rockabilly"}}]

Searching across fields

This is an AND operation

(lucene/search album-index {:Album "the sun" :Year "1976"} {:hit->doc ld/document->map})
[{:doc-id 10,
  :score 4.56387,
  :hit
  {:Number "11",
   :Year "1976",
   :Album "The Sun Sessions",
   :Artist "Elvis Presley",
   :Genre "Rock",
   :Subgenre "Rock & Roll"}}]
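
To summarize the conventions used in the variations above: a set means OR, a vector means AND, a string containing spaces means a phrase, and multiple keys in the query map are AND-ed together. The small sketch below classifies a query map accordingly. `clause-kind` and `describe-query` are hypothetical names that only restate these conventions; they are not part of lucene-clj.

```clojure
(require '[clojure.string :as str])

(defn clause-kind
  "Names the kind of clause a query value stands for, following the
  conventions described in this section."
  [v]
  (cond
    (set? v)                        :or
    (sequential? v)                 :and
    (and (string? v)
         (str/includes? v " "))     :phrase
    :else                           :term))

(defn describe-query
  "Maps each field of a query map to the kind of clause its value implies.
  The clauses for different fields are combined with AND."
  [q]
  (into {} (map (fn [[field v]] [field (clause-kind v)])) q))

(describe-query {:Year #{"1960" "1965"} :Album "the sun"})
;; => {:Year :or, :Album :phrase}
```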

Suggestions

Notice that in the suggest function call, the field and the suggestion prefix are not passed as a map. Unlike search, suggest calls are only supported over a single field.

Suggestions support for fields passed via :suggest-fields

From above, the fields Album and Artist have been marked to be indexed in a way so that we can ask for prefix-based suggestions.

(lucene/suggest album-index :Album "par"
                {:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
                 :contexts ["Electronic"]})
[{:hit
  {:Number "140",
   :Year "1978",
   :Album "Parallel Lines",
   :Artist "Blondie",
   :Genre ["Electronic" "Rock"],
   :Subgenre ["New Wave" "Pop Rock" "Punk" "Disco"]},
  :score 1.0,
  :doc-id 139}]
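
A toy model of what the call above asks for: documents whose field value starts with the prefix, restricted to those sharing at least one of the requested contexts. This is only a simplification for intuition. The real suggester works on analyzed input inside the index, and `suggest-model` is a hypothetical function that hard-codes :Genre as the context source.

```clojure
(require '[clojure.string :as str])

(defn suggest-model
  "Toy model of prefix suggestions with context filtering over plain maps.
  Matches case-insensitively on the start of the field value and keeps
  documents whose :Genre shares a value with contexts."
  [docs field prefix contexts]
  (let [p      (str/lower-case prefix)
        wanted (set contexts)]
    (filter (fn [doc]
              (and (str/starts-with? (str/lower-case (get doc field)) p)
                   (some wanted (:Genre doc))))
            docs)))

(def sample-docs
  [{:Album "Parallel Lines" :Genre ["Electronic" "Rock"]}
   {:Album "Purple Rain"    :Genre ["Electronic" "Rock" "Funk / Soul" "Stage & Screen"]}
   {:Album "Pet Sounds"     :Genre ["Rock"]}])

(map :Album (suggest-model sample-docs :Album "par" ["Electronic"]))
;; => ("Parallel Lines")
```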

We can ask for fuzzy matching when querying for suggestions.

(lucene/suggest album-index :Album "per"
                {:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
                 :fuzzy? true
                 :contexts ["Electronic"]})
[{:hit
  {:Number "140",
   :Year "1978",
   :Album "Parallel Lines",
   :Artist "Blondie",
   :Genre ["Electronic" "Rock"],
   :Subgenre ["New Wave" "Pop Rock" "Punk" "Disco"]},
  :score 2.0,
  :doc-id 139}
 {:hit
  {:Number "76",
   :Year "1984",
   :Album "Purple Rain",
   :Artist "Prince and the Revolution",
   :Genre ["Electronic" "Rock" "Funk / Soul" "Stage & Screen"],
   :Subgenre ["Pop Rock" "Funk" "Soundtrack" "Synth-pop"]},
  :score 2.0,
  :doc-id 75}]

Or, do a fuzzy search

Notice how forever matches fever too.

(lucene/search album-index {:Album "forever"}
               {:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
                :fuzzy? true})
[{:doc-id 39,
  :score 3.0850303,
  :hit
  {:Number "40",
   :Year "1967",
   :Album "Forever Changes",
   :Artist "Love",
   :Genre ["Rock"],
   :Subgenre ["Folk Rock" "Psychedelic Rock"]}}
 {:doc-id 131,
  :score 0.9592955,
  :hit
  {:Number "132",
   :Year "1977",
   :Album "Saturday Night Fever: The Original Movie Sound Track",
   :Artist "Various Artists",
   :Genre ["Electronic" "Stage & Screen"],
   :Subgenre ["Soundtrack" "Disco"]}}]
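
The match on "fever" makes sense in terms of edit distance: deleting the "o" and the "r" from "forever" yields "fever", which is two edits, and Lucene's FuzzyQuery allows at most two edits. Whether lucene-clj uses FuzzyQuery with exactly that default is an assumption here. The function below is a plain Levenshtein distance, written only to check the arithmetic. Lucene itself uses automata and also counts transpositions, which this sketch does not.

```clojure
(defn levenshtein
  "Edit distance between strings a and b, counting insertions, deletions
  and substitutions. Computed row by row over the classic DP table."
  [a b]
  (let [b (vec b)]
    (last
     (reduce (fn [prev ca]
               (reduce (fn [row [j cb]]
                         (conj row
                               (min (inc (peek row))           ; insertion
                                    (inc (nth prev (inc j)))   ; deletion
                                    (+ (nth prev j)            ; substitution or match
                                       (if (= ca cb) 0 1)))))
                       [(inc (first prev))]
                       (map-indexed vector b)))
             (vec (range (inc (count b))))
             a))))

(levenshtein "forever" "fever")
;; => 2
```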

Additional notes

  • Some minimal technical overview of Lucene internals for this project can be found here.

License

Copyright © 2018-2020 Ravindra R. Jaju

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.