Lucene Tutorial

By Maurizio Farina | Posted on August 2017 | Updated to Lucene 6

Lucene is a full-text search library in Java.

Lucene lets you run queries whose results are ranked by relevance or sorted by a field.

Lucene achieves fast search responses because it searches an index rather than the text itself. This is like finding the pages of a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
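To make the inversion concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class name `InvertedIndexSketch`, the `build` method, and the simple tokenization are assumptions made for illustration, and a real Lucene index also stores positions, frequencies, and other data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy in-memory inverted index, not Lucene's index format.
final class InvertedIndexSketch {

  // Maps each lower-cased word to the sorted set of page numbers (0-based) containing it.
  static Map<String, SortedSet<Integer>> build(List<String> pages) {
    Map<String, SortedSet<Integer>> index = new TreeMap<>();
    for (int page = 0; page < pages.size(); page++) {
      // Assumption: split on runs of non-word characters; real analyzers do much more.
      for (String word : pages.get(page).toLowerCase().split("\\W+")) {
        if (word.isEmpty()) {
          continue;
        }
        index.computeIfAbsent(word, w -> new TreeSet<>()).add(page);
      }
    }
    return index;
  }

  public static void main(String[] args) {
    List<String> pages = Arrays.asList("the quick fox", "the lazy dog", "quick quick dog");
    Map<String, SortedSet<Integer>> index = build(pages);
    System.out.println(index.get("quick")); // prints [0, 2]
    System.out.println(index.get("dog"));   // prints [1, 2]
  }
}
```

Looking up a word is now a single map access instead of a scan over every page, which is the property the book-index analogy describes.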

Basic concepts

Documents: In Lucene, a Document is the unit of search and index. An index consists of one or more Documents.
Fields: A Document consists of one or more Fields. A Field is simply a name-value pair.
Searching: Searching requires an index to have already been built. It involves creating a Query (usually via a QueryParser) and handing this Query to an IndexSearcher, which returns the top-ranked hits (as a TopDocs object in Lucene 6).
Queries: The Lucene query language lets you specify which field(s) to search, which fields to give more weight to (boosting), boolean combinations (AND, OR, NOT), and other functionality.

Query Syntax

Keyword matching

Search for word "foo" in the title field.
title:foo

Search for phrase "foo bar" in the title field.
title:"foo bar"

Search for phrase "foo bar" in the title field AND the phrase "quick fox" in the body field.
title:"foo bar" AND body:"quick fox"

Search for either the phrase "foo bar" in the title field AND the phrase "quick fox" in the body field, or the word "fox" in the title field.
(title:"foo bar" AND body:"quick fox") OR title:fox

Search for word "foo" and not "bar" in the title field.
title:foo -title:bar
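One way to picture the boolean operators is as set operations over the documents matched by each clause. The sketch below is a conceptual illustration only, not how Lucene executes queries internally; the class `BooleanQuerySketch`, its method names, and the document ids are all hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: boolean operators modeled as set operations on document ids.
final class BooleanQuerySketch {

  // AND: documents matched by both clauses.
  static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
    Set<Integer> result = new TreeSet<>(a);
    result.retainAll(b);
    return result;
  }

  // OR: documents matched by either clause.
  static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
    Set<Integer> result = new TreeSet<>(a);
    result.addAll(b);
    return result;
  }

  // NOT / "-": documents matched by the first clause but not the second.
  static Set<Integer> exclude(Set<Integer> a, Set<Integer> b) {
    Set<Integer> result = new TreeSet<>(a);
    result.removeAll(b);
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical per-clause matches (document ids).
    Set<Integer> titleFooBar = new TreeSet<>(Arrays.asList(1, 2, 5));
    Set<Integer> bodyQuickFox = new TreeSet<>(Arrays.asList(2, 5, 7));
    Set<Integer> titleFox = new TreeSet<>(Arrays.asList(3));

    // (title:"foo bar" AND body:"quick fox") OR title:fox
    System.out.println(union(intersect(titleFooBar, bodyQuickFox), titleFox)); // prints [2, 3, 5]

    // title:foo -title:bar
    Set<Integer> titleFoo = new TreeSet<>(Arrays.asList(1, 2, 5, 8));
    Set<Integer> titleBar = new TreeSet<>(Arrays.asList(2, 8));
    System.out.println(exclude(titleFoo, titleBar)); // prints [1, 5]
  }
}
```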

Wildcard matching

Search for any word that starts with "foo" in the title field.
title:foo*

Search for any word that starts with "foo" and ends with "bar" in the title field.
title:foo*bar
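To see which terms a wildcard pattern would accept, the following sketch translates a pattern into a Java regular expression, treating `*` as "zero or more characters" and `?` as "exactly one character". The class `WildcardSketch` and its `matches` method are hypothetical helpers for illustration; Lucene does not evaluate wildcards this way internally.

```java
import java.util.regex.Pattern;

// Illustrative only: wildcard semantics expressed through java.util.regex.
final class WildcardSketch {

  // Returns true if the whole term matches the wildcard pattern.
  static boolean matches(String wildcard, String term) {
    StringBuilder regex = new StringBuilder();
    for (char c : wildcard.toCharArray()) {
      if (c == '*') {
        regex.append(".*");       // any run of characters, possibly empty
      } else if (c == '?') {
        regex.append('.');        // exactly one character
      } else {
        regex.append(Pattern.quote(String.valueOf(c))); // literal character
      }
    }
    return Pattern.matches(regex.toString(), term);
  }

  public static void main(String[] args) {
    System.out.println(matches("foo*", "foobar"));     // prints true
    System.out.println(matches("foo*bar", "fooxbar")); // prints true
    System.out.println(matches("foo*bar", "foobarx")); // prints false
  }
}
```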

Proximity matching

Range searches

Range Queries allow one to match documents whose field(s) values are between the lower and upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically.

mod_date:[20020101 TO 20030101]
Solr's built-in field types are very convenient for performing range queries on numbers without requiring padding.
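The remark about padding follows from lexicographic ordering: strings are compared character by character, so "9" sorts after "10" unless the values are zero-padded to the same width. A small sketch in plain Java, with the hypothetical helper `inRange` standing in for an inclusive term range check:

```java
// Illustrative only: why lexicographic ranges need fixed-width (padded) values.
final class LexicographicRangeSketch {

  // Inclusive range check using String ordering, as a term-based range would.
  static boolean inRange(String value, String lower, String upper) {
    return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
  }

  public static void main(String[] args) {
    // Fixed-width dates compare correctly as strings.
    System.out.println(inRange("20020615", "20020101", "20030101")); // prints true

    // Unpadded numbers do not: "9" sorts after "10".
    System.out.println(inRange("9", "1", "10"));    // prints false

    // Zero-padding to a common width restores the numeric order.
    System.out.println(inRange("09", "01", "10"));  // prints true
  }
}
```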

Boosts

Query-time boosts allow one to specify which terms/clauses are "more important". The higher the boost factor, the more relevant the term will be, and therefore the higher the corresponding document scores.

(title:foo OR title:bar)^1.5 (body:foo OR body:bar)
You should carefully examine explain output to determine the appropriate boost weights.

Scoring

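Lucene's classic similarity implements a variant of the TF-IDF scoring model. (From Lucene 6 onward the default similarity is BM25; the TF-IDF implementation is still available as ClassicSimilarity.)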

The factors involved in Lucene's scoring algorithm are as follows:

  • tf = term frequency in document = measure of how often a term appears in the document
  • idf = inverse document frequency = measure of how rare the term is across the index (the fewer documents contain it, the higher the value)
  • coord = number of terms in the query that were found in the document
  • lengthNorm = measure of the importance of a term according to the total number of terms in the field
  • queryNorm = normalization factor so that queries can be compared
  • boost (index) = boost of the field at index-time
  • boost (query) = boost of the field at query-time
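As a rough illustration of how some of these factors combine, here is a simplified sketch in plain Java. The formulas are loosely modeled on the classic TF-IDF similarity (square-root term frequency, logarithmic idf, inverse-square-root length norm) and should be read as assumptions; coord and queryNorm are omitted, and the class `TfIdfSketch` and its methods are hypothetical rather than Lucene API.

```java
// Illustrative only: a simplified per-term TF-IDF score, not Lucene's exact formula.
final class TfIdfSketch {

  // Term frequency factor: grows with the count, but sub-linearly.
  static double tf(int freq) {
    return Math.sqrt(freq);
  }

  // Inverse document frequency: rarer terms (low docFreq) get a higher value.
  static double idf(long docFreq, long docCount) {
    return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
  }

  // Length norm: a match in a short field counts more than one in a long field.
  static double lengthNorm(int numTerms) {
    return 1.0 / Math.sqrt(numTerms);
  }

  // Assumed combination: tf * idf^2 * boost * lengthNorm.
  static double termScore(int freq, long docFreq, long docCount, int numTerms, double boost) {
    double weight = idf(docFreq, docCount);
    return tf(freq) * weight * weight * boost * lengthNorm(numTerms);
  }

  public static void main(String[] args) {
    // Same term frequency and field length; only the rarity of the term differs.
    System.out.println("rare term:   " + termScore(2, 3, 1000, 10, 1.0));
    System.out.println("common term: " + termScore(2, 900, 1000, 10, 1.0));
  }
}
```

With everything else equal, the rare term scores higher than the common one, and raising `boost` scales the score linearly.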

Customizing scoring

For example, to ignore how common a term is across the index:

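import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.search.similarities.Similarity;
// Lucene 6: ClassicSimilarity is the TF-IDF implementation (formerly DefaultSimilarity).
Similarity sim = new ClassicSimilarity() {
  @Override
  public float idf(long docFreq, long docCount) {
    return 1; // constant idf: ignore how common the term is
  }
};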

And if you think that, for the title field, more terms are better:

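import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.search.similarities.Similarity;
// Lucene 6 passes a FieldInvertState instead of (field, numTerms).
Similarity sim = new ClassicSimilarity() {
  @Override
  public float lengthNorm(FieldInvertState state) {
    // For "title", the norm grows with the number of terms in the field.
    if (state.getName().equals("title")) return (float) (0.1 * Math.log(state.getLength()));
    return super.lengthNorm(state);
  }
};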