[Update: 8 Mar 2014. I’ve just written a quick introduction to Lucene 4:
The contents of this introduction are excerpted from the Lucene chapter in my new book:
This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]
This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. In particular, it shows how to process search results without using Hits objects, as that class is scheduled for removal in Lucene 3.0. The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.
In the best "Hello World" tradition I have written a class called HelloLucene which builds a Lucene index over a set of three texts: { "hello world", "hello sailor", "goodnight moon" }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these three texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.

Here is buildIndex and its required import statements:
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
...
public static void buildIndex() throws IOException {
    IndexWriter indexWriter =
        new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                        new StandardAnalyzer(),
                        IndexWriter.MaxFieldLength.LIMITED);
    String[] texts = new String[] { "hello world", "hello sailor", "goodnight moon" };
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
}
A Lucene data store is called an index. It is a collection of Document objects. A Document consists of one or more named Field objects. Documents are indexed on a per-field basis. An IndexWriter object is used to create this index. Let's go over the call to its constructor:
new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"), new StandardAnalyzer(), IndexWriter.MaxFieldLength.LIMITED);
The first argument is the index that the IndexWriter operates on. This example keeps the index on disk, so the FSDirectory class is used to get a handle to it. The second argument is a Lucene Analyzer. This object controls how the input text is tokenized into the terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English-language texts. It ignores punctuation, removes common function words such as "if", "and", and "but", and converts words to lowercase. The third argument determines how much text is indexed per field. The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 terms.
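What an analyzer does to input text can be sketched in plain Java. The class below is a hypothetical ToyAnalyzer, not Lucene's StandardAnalyzer implementation; it just illustrates the three behaviors described above: strip punctuation, lowercase, and drop a few stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of what an analyzer like StandardAnalyzer does:
// split on non-letters, lowercase, drop common English stop words.
// Illustration only; Lucene's real analyzers are far more sophisticated.
public class ToyAnalyzer {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("if", "and", "but", "the", "a", "an"));

    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z]+")) {  // strip punctuation/digits
            String term = token.toLowerCase();
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // "and" is dropped as a stop word; punctuation disappears
        System.out.println(tokenize("Goodnight, Moon -- and hello!"));
        // prints [goodnight, moon, hello]
    }
}
```

Both the texts fed to the IndexWriter and, later, the search strings pass through this kind of normalization.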
for (String text : texts) {
    Document doc = new Document();
    doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
indexWriter.close();
The for loop maps the texts into Document objects, each containing a single Field named "text". The last two arguments to the Field constructor specify that the contents of the field are stored in the index and that they are analyzed by the IndexWriter's Analyzer. The IndexWriter.addDocument() method adds each document to the index. After all texts have been processed, the IndexWriter is closed.
Both indexing and search are operations over Document objects. Searches over the index are specified on a per-field basis, just like indexing. Lucene computes a similarity score between each search query and all documents in the index; the search results consist of the set of best-scoring documents for that query.
Here is searchIndex and its required import statements:
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
...
public static void searchIndex(String[] queryStrings) throws IOException, ParseException {
    Searcher searcher = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    QueryParser parser = new QueryParser("text", new StandardAnalyzer());
    for (String queryString : queryStrings) {
        System.out.println("\nsearching for: " + queryString);
        Query query = parser.parse(queryString);
        TopDocs results = searcher.search(query, 10);
        System.out.println("total hits: " + results.totalHits);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.printf("%5.3f %s\n", hit.score, doc.get("text"));
        }
    }
    searcher.close();
}
Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates an IndexSearcher over the directory "helloLuceneIndex". Search happens via a call to the Searcher.search() method. Before calling search(), the search string must be processed into a Lucene Query. To prepare the query we create a QueryParser and call its parse() method on the search string. The QueryParser is constructed using the same Analyzer as was used for indexing, because the QueryParser processes the search string into a set of search terms which are matched against the terms in the index: both the indexed text and the search text must be tokenized in the same way.
Here are the statements which process the results:
TopDocs results = searcher.search(query, 10);
System.out.println("total hits: " + results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.printf("%5.3f %s\n", hit.score, doc.get("text"));
}
The search() method returns a TopDocs object, which has two public fields: scoreDocs and totalHits. The latter is the number of documents that matched the query; the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value), and score, a float value calculated by Lucene (using a similarity metric best described as unexplainable Greek kung fu). searchIndex reports the TopDocs.totalHits, then iterates over the TopDocs.scoreDocs array. The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field named "text".
In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. Either add this to your classpath, or simply put it in the same directory as HelloLucene.java, and compile with the following command:
javac -cp "lucene-core-2.4.0.jar" HelloLucene.java
Invoke the program with the following command (this uses the Windows classpath separator ";"; on Unix-like systems use ":" instead):

java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" hello moon foo
This produces the following results:

built index

searching for: hello world
total hits: 2
1.078 hello world
0.181 hello sailor

searching for: hello
total hits: 2
0.625 hello world
0.625 hello sailor

searching for: moon
total hits: 1
0.878 goodnight moon

searching for: foo
total hits: 0

all done
The first query “hello world” matches exactly against our first text, but it also partially matches the text “hello sailor”. The second query “hello” matches both texts equally well. The third query “moon” matches against the only document which contains that word. The fourth query “foo” doesn’t match anything in the index.
Exercise for the reader

Refactor HelloLucene.java into two programs: BuildHelloLucene.java and SearchHelloLucene.java, where BuildHelloLucene uses its command-line arguments to build the index, instead of the strings supplied by String[] texts.
Running BuildHelloLucene with inputs "hello world", "hello sailor", and "goodnight moon" and then running SearchHelloLucene with inputs "hello world", "hello", "moon" and "foo" should give the same results as above. Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.
March 10, 2009 at 4:41 am
Hi.
I’m using Lucene 2.4 to index a document with the following text:
HELLO JAMES WELCOME
[stored/uncompressed,indexed, stored/uncompressed,indexed, stored/uncompressed,indexed]
HELLO FATHER GOODBYE JAMES
[stored/uncompressed,indexed, stored/uncompressed,indexed]
HELLO FATHER WELCOME FATHER
[stored/uncompressed,indexed]
GOODBYE JAMES GOODBYE FATHER
[stored/uncompressed,indexed, stored/uncompressed,indexed, stored/uncompressed,indexed]
For each line I created a new document with the instruction doc = new Document(); I saved the text of the line with doc.add(new Field("p", line, Field.Store.YES, Field.Index.NOT_ANALYZED)); the line number of each phrase with doc.add(new Field("numLine", numLine, Field.Store.YES, Field.Index.NOT_ANALYZED)); and finally, for each person's name, I added a field with doc.add(new Field("name", name_person, Field.Store.YES, Field.Index.NOT_ANALYZED));
I want to make a searcher that returns every file and field containing the name searched for. For instance, searching for "JAMES" should give documents 1, 2 and 4; searching for "HELLO FATHER" should give documents 2 and 3; and searching for "FATHER" should give documents 2, 4 and 3 (twice)… but…
If I search “hello” the result is the number 1.
If I search “hello f*” the result is the number 1.
If I search “father” the result is nothing.
If I search "james" the result is numbers 1, 2 and 4. -> This is the only correct one.
Right now I just know how to look for one word; for example, for "JAMES" I use the query name:james.

How can I search for a phrase or more than one word? Will it be something similar to the query p:hello p:father ?
Bye
March 10, 2009 at 8:37 am
Hi,
I am working with Lucene 2.4 for indexing documents and searching. I have a requirement where documents should be searched against a keyword across multiple fields (for each document). Can I search the documents without mentioning the field names of the document?
Thanks
March 10, 2009 at 11:44 am
Everyone:
This post and comment thread wasn't intended to be a replacement for the Lucene Mailing Lists. They get a lot of traffic, but are good at answering questions.
Having said that, these questions are easy:
@Antonio: yes. try it.
@Hari: yes, set the default field in the analyzer.
May 10, 2010 at 10:12 am
[link to copy of new Lucene in Action Book removed].
May 10, 2010 at 4:56 pm
While it’s nice that the new book is out, I don’t think the authors or Manning Press would appreciate the pirated PDF! So I’m erasing the link.
March 2, 2011 at 2:29 am
Is Lucene a knowledge base like Wikitology? That is, Wikitology has its own knowledge base and an index built over Wikipedia. But in the examples I have gone through so far, we build an index on text given by the user, not on any knowledge base. Please help me in this regard, to clear up this concept. I will be looking forward to a reply. Thanks in advance.
June 27, 2011 at 9:55 am
Just a quick question: while checking on Lucene configuration, I found differing explanations of the "MaxFieldLength" parameter.

"The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 characters."

Is it really characters, or maybe words? On the Lucene page it says "The maximum number of terms that will be indexed for a single field in a …". Also I wonder if it only skips English fill words (and, but, if etc.) or also other languages.
June 27, 2011 at 12:47 pm
That’s a bunch of different questions. It may be characters instead of tokens, because a field is constructed with a string or stream of characters.
The doc for Lucene is perhaps not the best place to look. Go and look at the source code, and you should be able to track down what it's doing. Or write a test case.

Lucene also has a very responsive users' mailing list.

As for what gets skipped, that depends on the Analyzer implementation. Lucene has analyzers for many languages in its extended distribution (beyond the core jar). You can also define your own stop words. Typically, punctuation gets removed, too.
October 16, 2011 at 10:26 pm
Very good tutorial. Simple and easy to understand. Do you have tutorials about BM25 algorithm?
October 19, 2011 at 12:24 am
I’m afraid not — I’ve never gotten through the formulas. I believe there’s been discussion in the past on the Lucene mailing lists about BM25.
February 13, 2013 at 11:59 pm
Nice tutorial. Why Solr, then?
http://antguider.blogspot.com/2012/06/solr-search.html