Lucene 2.4 in 60 seconds

[Update: 8 Mar 2014. I've just written a quick introduction to Lucene 4:

The contents of this introduction are excerpted from the Lucene chapter in my new book:

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene's search query syntax.]

This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0. The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.

In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.

Here is buildIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
...
public static void buildIndex() throws IOException {
    IndexWriter indexWriter
        = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                          new StandardAnalyzer(),
                          IndexWriter.MaxFieldLength.LIMITED);
    String[] texts = new String[] {  "hello world",
                                     "hello sailor",
                                     "goodnight moon" };
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text",text,
                    Field.Store.YES,Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
}

A Lucene data store is called an Index. It is a collection of Document objects. A Document consists of one or more named Field objects. Documents are indexed on a per-field basis. An IndexWriter object is used to create this index. Let’s go over the call to its constructor method:

new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.LIMITED);

The first argument is the index that the IndexWriter operates on. This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index. The second argument is a Lucene Analyzer. This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts. It ignores punctuation, removes common function words such as “if”, “and”, and “but”, and converts words to lowercase. The third argument limits the amount of text that is indexed per field. The constant IndexWriter.MaxFieldLength.LIMITED sets this limit to 10,000 terms (tokens, not characters).
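To make the analysis step concrete, here is a minimal sketch in plain Java of the three things described above: dropping punctuation, lowercasing, and removing stop words. It is not Lucene's StandardAnalyzer; the class name AnalysisSketch and its short stop list are assumptions made up for this illustration, and the real tokenizer handles many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: a simplified stand-in for what an Analyzer does.
class AnalysisSketch {
    // Tiny hypothetical stop list; Lucene's own list is longer.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("if", "and", "but", "the", "a", "of"));

    // Turns raw text into the list of terms that would be indexed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        // Split on runs of characters that are neither letters nor digits,
        // which discards punctuation and whitespace.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase();
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello, World! And goodnight, if you must."));
    }
}
```

Running main prints [hello, world, goodnight, you, must]: the punctuation is gone, “And” and “if” are dropped, and everything is lowercase.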

for (String text : texts) {
    Document doc = new Document();
    doc.add(new Field("text",text,Field.Store.YES,Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
indexWriter.close();

The for loop maps texts into Document objects, which contain a single Field with name “text”. The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter’s Analyzer. The IndexWriter.addDocument() method adds each document to the index. After all texts have been processed the IndexWriter is closed.
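If it helps to picture what “documents indexed on a per-field basis” means, the following plain-Java sketch builds a toy in-memory index over the same 3 texts. It is only a mental model, not how Lucene stores anything: the class ToyIndex, its methods, and the whitespace-only tokenization are all assumptions invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: documents are maps of named fields, and each
// (field, term) pair points at the ids of the documents containing it.
class ToyIndex {
    // Stored documents: position in the list is the document id.
    private final List<Map<String, String>> docs =
        new ArrayList<Map<String, String>>();
    // Inverted index: "field:term" -> sorted ids of matching documents.
    private final Map<String, SortedSet<Integer>> postings =
        new HashMap<String, SortedSet<Integer>>();

    // Stores the document, indexes every field, and returns the new id.
    int addDocument(Map<String, String> fields) {
        int docId = docs.size();
        docs.add(new HashMap<String, String>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive analysis: lowercase, then split on whitespace.
            for (String term : field.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                String key = field.getKey() + ":" + term;
                SortedSet<Integer> ids = postings.get(key);
                if (ids == null) {
                    ids = new TreeSet<Integer>();
                    postings.put(key, ids);
                }
                ids.add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents whose given field contains the term.
    List<Integer> lookup(String field, String term) {
        SortedSet<Integer> ids = postings.get(field + ":" + term.toLowerCase());
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    // Returns the stored contents of a field, like a stored Lucene Field.
    String stored(int docId, String field) {
        return docs.get(docId).get(field);
    }
}
```

After adding the 3 texts under the field “text”, lookup("text", "hello") returns the ids of the first two documents, while a lookup on any other field name returns nothing, because terms are only ever matched within the field they were indexed under.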

Both indexing and search are operations over Document objects. Searches over the index are specified on a per-field basis (just like indexing). Lucene computes a similarity score between each search query and all documents in the index, and the search results consist of the set of best-scoring documents for that query.

Here is searchIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
...
public static void searchIndex(String[] queryStrings) 
    throws IOException, ParseException {
    Searcher searcher 
        = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    QueryParser parser = new QueryParser("text",new StandardAnalyzer());
    for (String queryString : queryStrings) {
        System.out.println("\nsearching for: " + queryString);
        Query query = parser.parse(queryString);
        TopDocs results = searcher.search(query,10);
        System.out.println("total hits: " + results.totalHits);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.printf("%5.3f %s\n",
                              hit.score, doc.get("text"));
        }
    }
    searcher.close();
}

Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates an IndexSearcher over the directory “helloLuceneIndex”. Search happens via a call to the Searcher.search() method. Before calling search(), the search string must be processed into a Lucene Query. To prepare the query we create a QueryParser and call its parse() method on the search string. The first argument to the QueryParser constructor, “text”, is the default field: the field that is searched when a query term doesn’t name one. The second argument is the same Analyzer that was used for indexing. The QueryParser processes the search string into a set of search terms which are matched against the terms in the index, so the indexed text and the search text must be tokenized in the same way.
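The requirement that index and query be tokenized the same way is easy to see in a small plain-Java experiment. The sketch below does not use Lucene at all; the class AnalyzerMismatchSketch and its three helper methods are hypothetical, and “analysis” is reduced to lowercasing plus whitespace splitting.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: what goes wrong when index-time and query-time
// processing disagree.
class AnalyzerMismatchSketch {
    // "Analyzed" terms: lowercased, then split on whitespace.
    static Set<String> lowercasedTerms(String text) {
        return new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    // "Unanalyzed" terms: split on whitespace only, case preserved.
    static Set<String> verbatimTerms(String text) {
        return new HashSet<String>(Arrays.asList(text.split("\\s+")));
    }

    // A query matches if at least one of its terms occurs in the index.
    static boolean matches(Set<String> indexTerms, Set<String> queryTerms) {
        for (String term : queryTerms) {
            if (indexTerms.contains(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> index = lowercasedTerms("Hello World");
        // Query processed differently from the index: no match.
        System.out.println(matches(index, verbatimTerms("Hello")));
        // Query processed the same way as the index: match.
        System.out.println(matches(index, lowercasedTerms("Hello")));
    }
}
```

The first call prints false and the second prints true: the index holds the term “hello”, so a query term “Hello” only finds it if the query goes through the same lowercasing step.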

Here are the statements which process the results:

TopDocs results = searcher.search(query,10);
System.out.println("total hits: " + results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.printf("%5.3f %s\n",
                      hit.score, doc.get("text"));
}

The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits. The latter is the total number of documents that matched the query; the former is the array of best-scoring results (at most 10 here, the limit passed to search()) in the form of ScoreDoc objects. A ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score, a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports TopDocs.totalHits, then iterates over the TopDocs.scoreDocs array. The ScoreDoc.doc id is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name “text”.

In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command:

javac -cp "lucene-core-2.4.0.jar" HelloLucene.java

Invoke the program with the following command (the classpath here uses the Windows separator “;”; on Linux or macOS use “:” instead):

java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" hello moon foo

This produces the following results:

built index

searching for: hello world
total hits: 2
1.078 hello world
0.181 hello sailor

searching for: hello
total hits: 2
0.625 hello world
0.625 hello sailor

searching for: moon
total hits: 1
0.878 goodnight moon

searching for: foo
total hits: 0

all done

The first query “hello world” matches exactly against our first text, but it also partially matches the text “hello sailor”. The second query “hello” matches both texts equally well. The third query “moon” matches against the only document which contains that word. The fourth query “foo” doesn’t match anything in the index.
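The ranking behaviour is easier to reason about with a toy scorer. The sketch below is plain Java and deliberately does not reproduce Lucene's similarity formula, so its scores will not equal the numbers above. It assumes a made-up rule: each query term found in a document adds log(1 + N/df), where N is the number of documents and df is the number of documents containing the term. The class ScoringSketch and its Hit type are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: rank documents by a simple rarity-weighted overlap.
class ScoringSketch {
    // A scored result: document id plus score.
    static class Hit {
        final int doc;
        final double score;
        Hit(int doc, double score) { this.doc = doc; this.score = score; }
    }

    static Set<String> terms(String text) {
        return new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    // Returns at most n hits, best score first; non-matching docs are omitted.
    static List<Hit> search(String[] texts, String query, int n) {
        List<Set<String>> docs = new ArrayList<Set<String>>();
        for (String text : texts) docs.add(terms(text));

        List<Hit> hits = new ArrayList<Hit>();
        for (int id = 0; id < docs.size(); id++) {
            double score = 0.0;
            for (String term : terms(query)) {
                if (!docs.get(id).contains(term)) continue;
                int df = 0; // number of documents containing this term
                for (Set<String> d : docs) if (d.contains(term)) df++;
                score += Math.log(1.0 + (double) texts.length / df);
            }
            if (score > 0.0) hits.add(new Hit(id, score));
        }
        // Highest score first; ties are broken by document id.
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byScore = Double.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
            }
        });
        return hits.size() > n ? new ArrayList<Hit>(hits.subList(0, n)) : hits;
    }
}
```

With the 3 example texts, the query “hello world” ranks “hello world” above “hello sailor” because the first document matches both terms and “world” is the rarer one; “hello” alone scores both documents equally; and “foo” returns no hits. That is the same ordering as in the real output, even though the scores themselves differ.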

Exercise for the reader

Refactor HelloLucene.java into two programs: BuildHelloLucene.java and SearchHelloLucene.java, where BuildHelloLucene uses its command line arguments to build the index, instead of the strings supplied by String[] texts.

Running BuildHelloLucene with inputs “hello world”, “hello sailor”, and “goodnight moon” and then running SearchHelloLucene with inputs “hello world”, “hello”, “moon” and “foo” should give the same results as above. Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.

11 Responses to “Lucene 2.4 in 60 seconds”

  1. Antonio Says:

    Hi.

    I’m using Lucene 2.4 to index a document with the following text:

    HELLO JAMES WELCOME

    [stored/uncompressed,indexed, stored/uncompressed,indexed, stored/uncompressed,indexed]

    HELLO FATHER GOODBYE JAMES

    [stored/uncompressed,indexed, stored/uncompressed,indexed]

    HELLO FATHER WELCOME FATHER

    [stored/uncompressed,indexed]

    GOODBYE JAMES GOODBYE FATHER

    [stored/uncompressed,indexed, stored/uncompressed,indexed, stored/uncompressed,indexed]

    For each line I created a new document with the instruction doc = new Document(); I saved the text in a Lucene index doc.add(new Field(“p”, line, Field.Store.YES, Field.Index.NOT_ANALYZED)); the number of the line of each phrase doc.add(new Field(“numLine”, numLine, Field.Store.YES, Field.Index.NOT_ANALYZED)); and finally for each people’s name I created a new Lucene index doc.add(new Field(“name”, name_person, Field.Store.YES, Field.Index.NOT_ANALYZED));

    I want to make a searcher that gives every files and fields with the searching name, for instance, searching for “JAMES” will create a result in documents 1,2 and 4; searching for “HELLO FATHER” will give documents 2 and 3 and searching “FATHER” will result in documents 2, 4 and 3 (twice)….but…

    If I search “hello” the result is the number 1.
    If I search “hello f*” the result is the number 1.
    If I search “father” the result is nothing.
    If I search “james” the result is the number 1,2 and 4. –> Is the only correct.

    Right now I just know how to look for one word, for example with “JAMES” I use query: name: james.

    How can I search for a phrase or more than one word? Will it be something similar to query: p:hello p:father ?

    Bye

  2. hari Says:

    Hi,

    I am working on the Lucene 2.4 for indexing the documents and searching. I have a requirement where the documents should be searched against a keyword among multiple fields (for each document). Can i search for the documents without mentioning the field names of the document?

    Thanks

  3. lingpipe Says:

    Everyone:

    This post and comment thread wasn’t intended to be a replacement for the:

    Lucene Mailing Lists

    They get a lot of traffic, but are good at answering questions.

    Having said that, these questions are easy:

    @Antonio: yes. try it.

    @Hari: yes, set the default field in the analyzer.

  4. Pankil Patel Says:

    [link to copy of new Lucene in Action Book removed].

    • lingpipe Says:

      While it’s nice that the new book is out, I don’t think the authors or Manning Press would appreciate the pirated PDF! So I’m erasing the link.

  5. Sundus Hassan Says:

    Is Lucene is a Knowledge base like Wikitology?
    That is Wikitology has own knowledge base and index developed on Wikipedia.
    But uptill now the examples I have gone through in that we are building index on text given by the user, not on any knowledge base.

    Please help me in this regards, to clear this concept.

    Will be looking forward for reply.

    Thanks in advance.

  6. Paul Says:

    Just a quick question since I am just checking on lucene configuration I found differing explainations of the “MaxFieldLength”-parameter.

    “The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 characters.”
    – is it realy characters or maybe words? on the lucene page it says “The maximum number of terms that will be indexed for a single field in a …”. Also I wonder if it only skips english fill words (and,but, if etc. or also other languages).

    • Bob Carpenter Says:

      That’s a bunch of different questions. It may be characters instead of tokens, because a field is constructed with a string or stream of characters.

      The doc for Lucene’s perhaps not the best place to look. Go and look at the source code, and you should be able to track down what it’s doing. Or write a test case.

      Lucene also has a very responsive user’s mailing list.

      As for what gets skipped, that depends on the Analyzer implementation. Lucene has analyzers for many languages in its extended distribution (beyond the core jar). You can also define your own stop words. Typically, punctuation also gets removed, too.

  7. TrulyYours Says:

    Very good tutorial. Simple and easy to understand. Do you have tutorials about BM25 algorithm?

  8. Java Bird Says:

    Nice tutorial. Why then Solr

    http://antguider.blogspot.com/2012/06/solr-search.html
