Archive for the ‘Lucene’ Category

Using Luke the Lucene Index Browser to develop Search Queries

July 24, 2012

Luke is a GUI tool written in Java that allows you to browse the contents of a Lucene index, examine individual documents, and run queries over the index. Whether you’re developing with PyLucene, Lucene.NET, or Lucene Core, Luke is your friend.

Downloading, running Luke

Downloads are available from http://code.google.com/p/luke/ You can download the source or just the standalone binary .jar file. Right now we’re using the just-released Luke 4.0.0-ALPHA standalone binary. To run Luke, just fire up the jar file from the command line:

> java -jar lukeall-4.0.0-ALPHA.jar

Once launched, Luke opens with a dialog menu to open an index. We’re using an index called fed-papers.idx, which is an index over the Federalist Papers, a set of 85 op-ed pieces written by James Madison, Alexander Hamilton, and/or John Jay. This is taken from the example in our recently updated Lucene 3 tutorial. To build the index yourself, download the data and accompanying source code and follow the instructions in the tutorial. Each of the 85 papers is treated as a document. The text of the document has been indexed using Lucene’s StandardAnalyzer and is stored in a field named text.

The following is a screenshot of the Luke GUI upon first opening the index. We’ve circled the main controls on the top left corner of the GUI. Luke opens with the Overview tab pane.

All 85 documents have a similar structure. They start with an identifying number, e.g. FEDERALIST No. 1. The opening salutation is: To the People of the State of New York:. Looking at the top terms in the index, we see that there are 85 documents which contain the words {federalist, people, state, york} as well as common function words. The StandardAnalyzer has stop-listed other function words, including {to, of, the, no}. Otherwise, these too would be among the top terms with a document frequency of 85.

Exploring Document Indexing

The Document tab pane lets you examine individual documents in the index. In the screenshot below we have selected the 47th document in the index, FEDERALIST No. 48. Circled in red are the series of controls used to get to this point. First we chose the Document tab pane, and then used the “Browse by document number” control to choose document 47. The bottom half of the window shows the document fields.

Clicking on the “Reconstruct & Edit” control opens a new pop-up window which allows us to inspect the document contents on a field-by-field basis. Below we show side-by-side screenshots of the text field. On the left we see the raw text as stored; on the right we see the result of tokenization and indexing via the Lucene StandardAnalyzer. Lucene’s StandardAnalyzer includes a StandardTokenizer, StandardFilter, Tokenization happens first. This removes punctuation and assigns each token a position. Then the tokens are lower-cased and stop-listed. LowerCaseFilter and StopFilter.

We have circled the text of the opening salutation: To the People of the State of New York:. It consists of 9 words followed by 1 punctuation symbol. The tokenizer strips out the final colon. The stop-list filter removes the tokens at positions {1, 2, 4, 5, 7}. The text of tokens which have been indexed are displayed as is. Tokens which were stop-listed are missing from the index, so Luke displays the string null followed by the number of consecutive stop-listed tokens.

Exploring Search

The Search tab pane packs a lot of controls. The annotated screenshot below shows the results of running a search over the index.

In the top right quadrant is an embedded tab pane (tabs circled in red). On the Analysis tab, we have selected the StandardAnalyzer from a pull-down list of Analyzers and selected the default search field from a pull-down list of all fields in the index. Given this, Luke constructs the QueryParser used to parse queries entered into the text box in the upper left quadrant. We entered the words: “Powers of the Judiciary” into this text box, circled in red. Directly below is it the parsed query, also circled in red. Clicking on the “Search” control runs this search over the index. The bottom pane displays the ranked results.

Luke passes the input string from the search text box to the QueryParser which parses it using the specified analyzer and creates a set of search terms, one term per token, over the specified default field. In this example, the StandardAnalyzer tokenizes, lowercases, and stop-lists these words, resulting in two search terms text:powers and text:judiciary. The result of this search pulls back all documents which contain the words powers and/or judiciary.

To drill down on Lucene search, we use the Explain Structure control, circled in red on the following screenshot.

When clicked, this pops up another pane which shows all details of the query constructed from the search expression. The structure of the query used in this example is:

lucene.BooleanQuery
  clauses=2, maxClauses=1024
  Clause 0: SHOULD
    lucene.TermQuery
      Term: field='text' text='powers'
  Clause 1: SHOULD
    lucene.TermQuery
      Term: field='text' text='judiciary'

The search expression can be any legal search string according to the Lucene Query Parser Syntax. To search for the phrase powers of the judiciary we need to enclose it in double quotes. But this new search produces no results, which is clearly wrong, since this phrase occurs in the title of FEDERALIST No. 80 (see search results above).

In the query details we see that the input has been parsed into text:"powers judiciary". The Explain Structure feature returns the following:

lucene.PhraseQuery, slop=0
  pos: [0,1]
  Term 0: field='text' text='powers'
  Term 1: field='text' text='judiciary'

The problem here is that the token positions specified don’t allow for the intervening stop-listed words. To correct this, we need to adjust Luke’s default settings on the QueryParser. The screenshot below shows both the controls changed and the query results.

First we go to the QueryParser tab in the set of search controls in the top right quadrant, then we click the checkbox labeled Enable positionIncrements. Now the parsed query looks like this: text:"powers ? ? judiciary", which translates to the following programmatic query:

lucene.PhraseQuery, slop=0
  pos: [0,3]
  Term 0: field='text' text='powers'
  Term 1: field='text' text='judiciary'

Finally, we select the single search result, and click on the Explain control (circled in red). This pops up another window which explains the query score (outlined in red). Here is the text of the explanation:

0.0847  weight(text:"powers ? ? judiciary" in 79) [DefaultSimilarity], result of:
  0.0847  fieldWeight in 79, product of:
    1.0000  tf(freq=1.0), with freq of:
      1.0000  phraseFreq=1.0
    3.6129  idf(), sum of:
      1.3483  idf(docFreq=59, maxDocs=85)
      2.2646  idf(docFreq=23, maxDocs=85)
    0.0234  fieldNorm(doc=79)

Using the Lucene XML Query Parser

The Lucene API contains many kinds of queries beyond those generated by the QueryParser. You can use Luke to develop these queries as well, via the Lucene XML Query Parser. It is almost impossible to find any on-line documentation on this most excellent contribution to Lucene by Mark Harwood. Luckily, it is distributed with Lucene source tgz files. For Lucene 3.6 the path to this documentation is: lucene-3.6.0/contrib/xml-query-parser/docs/index.html.

Recast in as a Lucene XML Query, our original search term:powers term:judiciary becomes:

 <BooleanQuery fieldName="text">
         <Clause occurs="should">
             <TermQuery>powers</TermQuery>
         </Clause>
         <Clause occurs="should">
               <TermQuery>judiciary</TermQuery>
         </Clause>
 </BooleanQuery>

The screenshot below shows the results of pasting the above XML into the search expression textbox with the QueryParser control Use XML Query Parser turned on. (Note: currently this works in Luke 3.5 but not in Luke 4.0-ALPHA. We hate the bleeding edge).

The XML rewrites to exactly the same query as our first query.

Big deal, you might be saying right about now. That’s a whole lotta chopping for not much kindling. Au contraire! One of the reasons for developing the XML Query Syntax is this:

To bridge the growing gap between Lucene query/filtering functionality and the set of functionality accessible through the standard Lucene QueryParser syntax.

For example, a while back, Mark Miller blogged about Lucene SpanQueries. This is similar to but not the same as PhraseQuery. We can use Luke to compare and contrast the difference between the two. Here’s our earlier phrase query Powers of the Judiciary recast as a span query.

  <SpanNear slop="3" inOrder="true" fieldName="text">		
      <SpanTerm>powers</SpanTerm>
      <SpanTerm>judiciary</SpanTerm>
  </SpanNear>

The screenshot below shows the results of running this query.

FEDERALIST No. 48 scores slightly better on this query than does FEDERALIST No. 80. Understanding why is left as an exercise to the reader.

Lucene Tutorial updated for Lucene 3.6

July 5, 2012

The current Apache Lucene Java version is 3.6, released in April of 2012. We’ve updated the Lucene 3 tutorial and the accompanying source code to bring it in line with the current API so that it doesn’t use any deprecated methods and my, there are a lot of them. Bob blogged about this tutorial back in February 2011, shortly after Lucene Java rolled over to version 3.0.

Like other 3.x minor releases, Lucene 3.6 introduces performance enhancements, bug fixes, new analyzers, and changes that bring the Lucene API in line with Solr. In addition, Lucene 3.6 anticipates Lucene 4, billed as “the next major backwards-incompatible release.”

Significant changes since version 3.0

  • IndexReader delete methods are deprecated and will be removed entirely in Lucene 4. All deletes and updates are done via an IndexWriter.
  • There is a single IndexWriter constructor that takes two arguments: the index directory and an IndexWriterConfig object. The latter was introduced in Lucene 3.1. It holds configuration information that was previously specified directly as additional arguments to the constructor.
  • IndexWriter optimize methods are deprecated. The merge method(s) supply this functionality.

Building the Source

The ant build file is in the file src/applucene/build.xml and should be run from that directory. The book’s distribution is organized this way so that each chapter’s demo code is roughly standalone, but they are able to share libs. There are some minor dependencies on LingPipe in the example (jar included), but those are just for I/O and could be easily removed or replicated. As an added bonus, the source code now includes the data used in the examples throughout the tutorial, the venerable Federalist Papers from Project Gutenberg.

Call for PhD Ideas for Lucene/Solr Implementation

November 29, 2010

Otis over at the Sematech blog wants to get people identifying hard problems faced by people working in search that PhD students can select from and hopefully solve with a Lucene/Solr implementation. The students get a ‘real world problem’ and the world gets a concrete open source implementation of the solution. The call is: Lucene / Solr for Academia: PhD Thesis Ideas

My suggested PhD idea is tolerable precision at high recall dictionary matching of phrases. Mike Ross spent a good deal of time trying to get 100% matches of genes in MEDLINE abstracts given a dictionary of genes (Entrez Gene) and aliases. The core of the problem is that not all the mentions of genes are on the aliases set for the gene. Huge issues around efficiency in addition to getting it working at all.

Finite-State Queries in Lucene

March 25, 2010

Otis just posted Robert Muir‘s presentation to the NY Lucene Meetup:

What’s it Do?

Read the slide show. It’s really nicely presented.

The idea’s simple for us automata hackers. The set of terms making up the keys of the core term-document (reverse) index is used as suffix array (i.e. a compact representation of a trie). A query is then converted into a finite state automaton which is intersected with the trie implicitly represented by the suffix array.

This is much much faster than doing an edit distance comparison with every term, because the prefix computations are shared. If you put a bound on the maximum distance, you can reject most terms high up in the trie as you implicitly run the intersection.

It warms my heart to see this kind of nice algorithm so neatly implemented. The really groovy part is that the suffix array was already there!

Bonus Material

As an added bonus, there’s an extended bonus section including hints on how to do term expansion at query time instead of indexing time (though this does affect TF/IDF), and some hints about how to integrate morphological processing (e.g. stemming or lemmatization).

Similar Applications

LingPipe uses the same kind of algorithm for spell checking, only with an explicit trie with source probabilities and a customizable position-specific edit-distance model. The spell checker only handles single input strings rather than automata and provides ranked results.

I’m guessing Google’s using this kind of thing all over the place for speech rec, because they hired the guys who wrote the AT & T Finite State Machine Library and went on to build Google’s Open FST Library.

Do it Yourself

If you want to learn how to do this kind of thing yourself, you could do much worse than reading my perennial recommendation, Dan Gusfield’s Algorithms on Strings, Trees, and Sequences.

BWT Anyone?

Luckily, memory will probably keep up with our terms in language indexing.

In biological sequencing applications, where they store billions of different 50-grams, the Burrows-Wheeler transformed compressed index lets you do the same thing. The only problem for search is that it doesn’t let you update dynamically (that i know of).

Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebhoz-Schumann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval

August 28, 2009

I’m about to implement the technique for assigning Medical Subject Heading (MeSH) terms to text described in this paper from the TLAs EBI, HMI and TNO ICT:

The same technique could be applied to assign Gene Ontology (GO) terms to texts, tags to tweets or blog posts or consumer products, or keywords to scientific articles.

k-NN via Search

To assign MeSH terms to a text chunk, they:

  1. convert the text chunk to a search query,
  2. run the query against a relevance-based MEDLINE index, then
  3. rank MeSH terms by frequency in the top k (k=10) hits.

In other words, k-nearest-neighbors (k-NN) where “distance” is implemented by a relevance-based search.

Just the Text, Ma’am

Trieschnigg et al. concatenated the title and abstract of MEDLINE citations into a single field for both document indexing and query creation.

k-NN implemented as search could be faceted to include journals, years, authors, etc. For products, this could include all the facets seen on sites like Amazon or New Egg.

Language-Model-Based Search

They use language-model-based search, though I doubt that’s critical for success. Specifically, they estimate the maximum-likelihood unigram language model for the query and interpolated (with a model trained on all documents) model for each document, and then rank documents versus that query by cross-entropy of the query model given the document model (given the MLE estimate for the query, this is just the log probability of the query in the document’s LM.) Other LM-based search systems measure similarity by KL-divergence.

There weren’t any details on stemming, stoplisting, case normalization or tokenization in the paper or supplement; just a pointer to author Wessel Kraaij’s Ph.D. thesis on LM-based IR.

Application to MEDLINE

The text being assigned MeSH terms was another MEDLINE title-plus-abstract. This may seem redundant given that MEDLINE citations are already MeSH annotated, but it’s useful if you’re the one at NLM who has to assign the MeSH terms, or if you want a deeper set of terms (NLM only assigns a few per document).

It’s easy to apply the authors’ approach to arbitrary texts, such as paragraphs out of textbooks or full text articles or long queries of the form favored by TREC.

Efficient k-NN!

I did a double-take when I saw k-nearest-neighbors and efficiency together. As we all know, k-NN scales linearly with training set size and MEDLINE is huge. But, in this case, the search toolkit can do the heavy lifting. The advantage of doing k-NN here is that it reproduces the same kind of sparse assignment of MeSH terms as are found in MEDLINE itself.

Error Analysis

The authors did a nice job in the little space they devoted to error analysis, with more info in the supplements (PR curves and some more parameter evaluations and the top few hits for one example). They reported that k-NN was better than some other systems (e.g. thesaurus/dictionary-based tagging and direct search with MeSH descriptors as pseudocuments) at assigning the sparse set of MeSH terms found in actual MEDLINE citations.

Errors tended to be more general MeSH terms that just happened to show up in related documents. I’d also like to see how sensitive performance is to the parameter setting of k=10, as it was chosen to optimize F measure against the sparse terms in MEDLINE. (All results here are for optimal operating points (aka oracle results), which means the results are almost certainly over-optimistic.)

What You’ll Need for an Implementation

It should be particularly easy to reproduce for anyone with:

Look for a demo soon.

Discriminative Semi-Supervised Classifiers?

It’d be nice to see an evaluation with text generated from MeSH and articles referencing those terms that used any of the semi-supervisied or positive-only training algorithms (even just sampling negative training instances randomly) with some kind of discriminative classifier like logistic regression or SVMs.

Improving IR with MeSH (?)

I didn’t quite follow this part of the paper as I wasn’t sure what exactly they indexed and what exactly they used as queries. I think they’re assigning MeSH terms to the query and then adding them to the query. Presumably they also index the MeSH terms for this.

I did get that only the best MeSH-term assigner improved search performance (quantized to a single number using TREC’s mean average precision metric).

Alternatives like generating a 24,000-category multinomial classifier are possible, but won’t run very fast (though if it’s our tight naive Bayes, it might be as fast as the authors

Lucene 2.4 in 60 seconds

February 18, 2009

[Update: 17 Nov 2010. We now have an extensive (20 pages in textbook format) tutorial for Lucene 3.0 as an appendix for the LingPipe book. The PDF for the current draft and code samples in a tarball are available from the following site:

As a bonus, the tokenization chapter provides adapter implementations for converting LingPipe tokenizers to Lucene analyzers and vice-versa.]

 

This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0. The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.

In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.

Here is buildIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
...
public static void buildIndex() throws IOException {
    IndexWriter indexWriter
        = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                          new StandardAnalyzer(),
                          IndexWriter.MaxFieldLength.LIMITED);
    String[] texts = new String[] {  "hello world",
                                     "hello sailor",
                                     "goodnight moon" };
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text",text,
                    Field.Store.YES,Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
}

A Lucene data store is called an Index. It is a collection of Document objects. A Document consists of one or more named Field objects. Documents are indexed on a per-field basis. An IndexWriter object is used to create this index. Let’s go over the call to its constructor method:

new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.LIMITED);

The first argument is the index that the IndexWriter operates on. This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index. The second argument is a Lucene Analyzer. This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts. It ignores punctuation, removes common function words such as "if", "and", and “"but", and converts words to lowercase. The third argument determines the amount of text that is indexed. The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 characters.

for (String text : texts) {
    Document doc = new Document();
    doc.add(new Field("text",text,Field.Store.YES,Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
indexWriter.close();

The for loop maps texts into Document objects, which contain a single Field with name "text". The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter's Analyzer. The IndexWriter.addDocument() method adds each document to the index. After all texts have been processed the IndexWriter is closed.

Both indexing and search are operations over Document objects. Searches over the index are specified on a per-field basis, (just like indexing). Lucene computes a similarity score between each search query and all documents in the index and the search results consist of the set of best-scoring documents for that query.

Here is searchIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
...
public static void searchIndex(String[] queryStrings) 
    throws IOException, ParseException {
    Searcher searcher 
        = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    QueryParser parser = new QueryParser("text",new StandardAnalyzer());
    for (String queryString : queryStrings) {
        System.out.println("nsearching for: " + queryString);
        Query query = parser.parse(queryString);
        TopDocs results = searcher.search(query,10);
        System.out.println("total hits: " + results.totalHits);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.printf("%5.3f %sn",
                              hit.score, doc.get("text"));
        }
    }
    searcher.close();
}

Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates a IndexSearcher over the directory "helloLuceneIndex". Search happens via a call to the Searcher.search() method. Before calling the search() method the search string must be processed into a Lucene Query. To prepare the query we create a QueryParser and call its parse() method on the search string. The QueryParser is constructed using the same Analyzer as was used for indexing because the QueryParser processes the search string into a set of search terms which are matched against the terms in the index, therefore both the indexed text and the search text must be tokenized in the same way.

Here are the statements which process the results:

TopDocs results = searcher.search(query,10);
System.out.println("total hits: " + results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.printf("%5.3f %sn",
                      hit.score, doc.get("text"));
}

The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits. The latter is the number of documents that matched the query, and the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports the TopDoc.totalHits, then iterates over the TopDoc.scoreDocs array. The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name "text".

In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command:

javac -cp "lucene-core-2.4.0.jar" HelloLucene.java

Invoke the program with the following command:

java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" hello moon foo

Produces these results:

built index

searching for: hello world
total hits: 2
1.078 hello world
0.181 hello sailor

searching for: hello
total hits: 2
0.625 hello world
0.625 hello sailor

searching for: moon
total hits: 1
0.878 goodnight moon

searching for: foo
total hits: 0

all done

The first query "hello world" matches exactly against our first text, but it also partially matches the text "hello sailor". The second query "hello" matches both texts equally well. The third query "moon" matches against the only document which contains that word. The fourth query "foo" doesn’t match anything in the index.

Exercise for the reader

Refactor HelloLucene.java into two programs: BuildHelloLucene.java and SearchHelloLucene.java, where BuildHelloLucene uses its command line arguments to build the index, instead of the strings supplied by Sring[] texts.

Runing BuildHelloLucene with inputs "hello world", "hello sailor", and "goodnight moon" and then running SearchHelloLucene with inputs "hello world", "hello", "moon" and "foo" should give the same results as above. Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.

Lucene’s Missing Term/Field Query Structure

January 21, 2009

In a follow-up to an earlier blog post, Lucene’s Missing TokenStream factory, I’d like to discuss a problem with Lucene’s query structure. Not Lucene’s query syntax, which has its own problems, but gets the term/field distinction (mostly) right. But be careful — the query syntax has a different notion of term than the object representations in Lucene’s search package.

The underlying problem is the same as with the missing token stream factories: a missing abstraction at the plain text level that winds up conflating notions of plain text, like ranges, and notions of fielded queries, like ranges of plain text in a specified field.

At the root of the query structure problem is the index.Term class, which is constructed with a field and a textual value for that field. The simplest possible query is a search.TermQuery, which consists of a single term. So far, no problem. But what about a search.RangeQuery? It’s constructed using a pair of terms. There’s a note in the constructor that both terms must be from the same field. What’s the problem? Range queries shouldn’t be built out of two field/text pairs, but rather out of one field and two text objects.

Phrase queries (search.PhraseQuery) have the same problem as range queries, in that they should only apply to terms with the same field. Although the add term checks that added terms have the same field and throws an exception if they don’t, there is no warning in the javadoc.

The solution to the problem is simple. Split the structure of queries into two types, one for the text part of a query and one for restricting it to fields. It’s easiest to see in BNF, though what I’m really talking about is object structure, not the query syntax:

TextQuery ::= String                               // term query
                  | (String TO String)              // range query
                  | "String String ... String"      // phrase query
                  | (TextQuery AND TextQuery)
                  | (TextQuery OR TextQuery)

FieldedQuery ::= [Field : TextQuery]
                     | (FieldedQuery AND FieldedQuery)
                     | (FieldedQuery OR FieldedQuery)

The logical operations all distribute through fields, so a query like [AUTHOR: (Smith OR Jones)] is equivalent to ([AUTHOR:Smith] OR [AUTHOR:Jones]).

An advantage the query object representation has over the query syntax is that there are no default fields. Lucene’s query syntax, on the other hand, allows a fielded query to consist of a term query. The use of an analyzer in constructing an actual query parser then fills in the missing field.

The class search.WildcardQuery is equally problematic in that it takes a term as an argument. It’s overloading the textual value of the term to represent not a term, but a string with structure including a special character for the multiple character (*) and single character (?) wildcards. But what if I want a question mark or asterisk in my term itself? The query syntax handles this problem, but not the object representation. The classes search.PrefixQuery and search.FuzzyQuery have the same problem with conflating terms and a syntax for search.

Phrase queries and boolean queries have the additional problem of being mutable, which makes their use in collections problematic (just as using mutable collections within other collections is problematic). For instance, if you construct a phrase query, its notion of equality and hash code change as more terms are added, because equality is defined to require having an equal list of terms.

Lucene or a Database?   Yes!

November 22, 2008

This question comes up pretty often on the Lucene mailing lists. Assuming that’s an inclusive OR in the question, we say yes! to both. In many situations this is not just a clever answer it’s the correct one — while the two systems can both be used as a data store, there are some things that a database can do better than Lucene, and some things that Lucene can do that a database can’t do at all.

Lucene provides full-text search over a collection of data that returns results in order of relevancy. If search over (lots of) text data is a non-negotiable requirement, then a date with Lucene (or one of its cousins) is in your future, because this is something that most database systems do badly, if at all. Here’s a nice critique of this feature in MySQL (an otherwise fine RDBMS).

Lucene provides search and indexing over Document objects. A Document is a set of Fields. Indexing a Field processes its contents into one or more Term objects which contain the Field name and the Term string value. Lucene maintains an inverted index which maps terms into documents. Queries against the index are scored by TF-IDF, a measure of document relevance. A Lucene Analyzer defines how index terms are extracted from text. The developer can define their own Analyzer, or use one of the pre-existing analyzers in the Lucene API, which can do tokenization for many different languages, as well as stop-listing of common words, case normalization, stemming, and more. For a nice example of a custom analyzer, see Bob‘s chapter on “Orthographic variation with Lucene” in the Lucene in Action book.

Relational databases store data as rows and columns in tables. A table declaration specifies the columns, the type of data stored in each column, and constraints over columns. The DBMS enforces these constraints as rows are added to the tables. In Lucene there is no notion of document type. A document is a set of fields, and processing a document consists of processing those fields into fielded terms that are added to the index. Lucene was designed for rapid indexing of documents. Checking document completeness is the responsibility of the calling application, ditto enforcing cross-document constraints. In the latter case, trying to enforce a uniqueness constraint on a term in the index is likely to impair indexing speed.

Searches against the two are different both in the search operations and search results. Database queries are specified in SQL which allows for composition of data from different tables via JOIN, complex boolean selectional restrictions, and ordering of results. Search over a database returns a ResultSet object containing a table which is a new view on the data resulting from executing the search query, that is, it may contain data stored in several different tables underlyingly.

Lucene search is over the term index, and the results returned are a set of pairs of documents and scores. Lacking explicit document types, a query is over some number of fields. Lucene has its own query language, and the package org.apache.lucene.search provides classes to do this programmatically. Lucene uses the score to limit the number of documents returned. Overriding this behavoir requires getting all matching documents for a query, and this can be very expensive, and queries designed to find things like dates or numbers within some range can be inefficient.

So if you want to provide relevance-based text search over some heterogeneous data store that has both a transactional component, such as an accounting system, as well as a storing large amounts of text you’re going to have to use both Lucene and a database, and the questions to ask are:

  • What is the mapping between the database tables and views and Lucene documents?
  • What text, if any, needs to be stored in the index?
  • How do I keep my Lucene index in sync with the database?

The Scaling Web blog has a nice case study. They have some good tips about how to conflate information from multiple table columns into good fielded documents, and how best to use IndexReader and IndexWriter objects in order to keep the index up-to-date with the database.

Of course there are situations where just Lucene or just a database is a sufficient data store. In the LingMed sandbox project we use Lucene to store a local version of the MEDLINE citation index. The challenge there is making sure that updates to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid search by PubMed ID. The citation title and abstract are also indexed into their own fields, and we can use Lucene to find out how many different documents contain a word in the article title or abstract – that is, we can quickly and easily calculate the document frequency for an item.

LingMed uses a LingPipe Chunker and a LanguageModel to find all mentions of genes in MEDLINE citations and assign a score to each. A MySQL database is the appropriate data store because we need to keep track of a many-to-many relationship between genes and articles (each gene mention); as well as information about the article itself. We don’t need sophisticated text search over this data, nor do we want to rank our results by TF-IDF; we are using LingPipe to compute our own scores, and use SQL’s “ORDER BY” clause to retrieve the best-scoring items.

The correct answer to the question “Lucene or a Database?” always depends on the specifics of the situation. Furthermore, as more functionality is added to Lucene (and family), the question is worth revisiting, and is revisited fairly frequently on the Lucene mailing lists. These questions are usually accompanied by a specific use case, and the comments of the Lucene contributors provide good guidelines which should help you find the answer that is right for your application.

Updating and Deleting Documents in Lucene 2.4: LingMed Case Study

November 5, 2008

Lucene Java 2.4 was recently released, and this is a good excuse to review the cumulative changes to the org.apache.lucene.index IndexReader and IndexWriter classes. Before Lucene Java 2.1 documents were added to an index using IndexWriter.addDocument(), deleted using IndexReader.delete(). Document updates were written as a delete followed by an add. In 2.1 the method IndexWriter.deleteDocument() was added to the API, and in version 2.2 IndexWriter.updateDocument() was added as well. Most of the demos and tutorials out there predate this release, causing no end of confusion and even heartbreak for the newbie developer. Doug Cutting summarizes this nicely, in his comments on this blog post:

The presence of IndexReader.delete() traces back to the origins of Lucene. IndexReader and IndexWriter were written, then deletion was added. IndexWriter should really have been called IndexAppender. The only place deletion was possible was IndexReader, since one must read an index to figure out which document one intends to delete, and IndexWriter only knows how to append to indexes. This has big performance implications: IndexReader is a heavy-weight view of an index, while IndexWriter is lightweight. … The issue has been discussed at length for years, but no acceptable solution had been developed.
Recently Ning Li contributed to Lucene a fix for this, a version of IndexWriter that can efficiently intermix deletes with additions.

In the LingMed sandbox project, we use Lucene to maintain local version of the MEDLINE citation index, where each MEDLINE citation is stored as a Lucene document. Every weekday the NLM releases a set of updates which contains new citations, revised citations, and lists of citations that should be deleted from the index. This is an XML file, and we use classes from the com.aliasi.medline package to process the updates file. As an article goes through the publishing pipeline (e.g. accepted, pre-print, corrections added), the corresponding citation in the MEDLINE index is updated accordingly. We only want to keep the latest version of a citation in our Lucene index. The MEDLINE updates file encodes all citations as MedlineCitation entities, therefore our handler must figure out whether the citation is new or revised. Since all Medline citations have a unique PMID (PubMed identifier) element, we search the index for that PubMed id – if it exists then we call IndexWriter.updateDocument, else we call IndexWriter.addDocument.

Search over the index requires us to open an IndexReader. This raises the question: if the IndexWriter makes changes to the index how can the IndexReader see them? The answer is found in the IndexWriter's javadoc:

an IndexReader or IndexSearcher will only see the index as of the “point in time” that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.

Given this, processing a set of MEDLINE updates file(s) is pretty straightforward: before processing each file we open an IndexReader on the index, and after processing the entire file we call the IndexWriter.commit(), which flushes any pending changes out to the index. (The Lucene javadoc it says that after a commit “a reader will see changes” but only if the reader goes looking for them! See also this discussion on the Lucene mailing list).

We use a com.aliasi.medline.MedlineParser to parse the updates file. The parser takes a visitor in the form of a MedlineHandler, which processes the MEDLINE citations as they are extracted by the parser. The parser invokes the handler’s handle(MedlineCitation) method on each MedlineCitation element that it parses out of the updates file, and invokes the handler’s delete(String) method on each PubMed identifier in the DeleteCitation element.

Here is the calling code which processes the MEDLINE updates file(s), (calls to the log4j logger ommitted):

MedlineParser parser = new MedlineParser(true); // true = save raw XML
IndexWriter indexWriter
      = new IndexWriter(FSDirectory.getDirectory(mIndex),
               mCodec.getAnalyzer(),
               new IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
for (File file: files) {
    ...
    MedlineIndexer indexer = new MedlineIndexer(indexWriter,mCodec);
    parser.setHandler(indexer);
    parseFile(parser,file);
    indexer.close();
    ...
}
indexWriter.optimize();
indexWriter.close();

The class MedlineIndexer implements MedlineHandler and does the work of updating the index:

static class MedlineIndexer implements MedlineHandler {
    final IndexWriter mIndexWriter;
    final MedlineCodec mMedlineCodec;
    final IndexReader mReader;
    final IndexSearcher mSearcher;

    public MedlineIndexer(IndexWriter indexWriter, MedlineCodec codec) 
        throws IOException {
        mIndexWriter = indexWriter;
        mMedlineCodec = codec;
        mReader = IndexReader.open(indexWriter.getDirectory(),true); // read-only
        mSearcher = new IndexSearcher(mReader);
    }
    ...
    public void close() throws IOException { 
        mSearcher.close();
        mReader.close();
        mIndexWriter.commit();
    }

Instantiating a MedlineIndexer automatically opens a fresh IndexReader and IndexSearcher on the index. The MedlineIndexer.close() method closes these objects and commits updates on the index.

The MedlineCodec maps the MedlineCitation.pmid() to its own field in the Lucene Document, so that we can uniquely identify documents by PubMed identifier. To lookup documents by PubMed id we create a org.apache.lucene.index.Term object:

Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());

Here is the MedlineIndexer.handle() method:

public void handle(MedlineCitation citation) {
    Document doc = mMedlineCodec.toDocument(citation);
    try {
        ...
            Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());
            if (mSearcher.docFreq(idTerm) > 0) {
                mIndexWriter.updateDocument(idTerm,doc);
            } else {
                mIndexWriter.addDocument(doc);  
            }
         }
    } catch (IOException e) {
        mLogger.warn("handle citation: index access error, term: "+citation.pmid());
    }
}

We use the IndexSearcher's doqFreq() method to check if a document with this PubMed id is in the index. If so, then the handler updates the document, else it adds it.

To delete a citation from the index we use the deleteDocuments method. Here is the MedlineIndexer.delete() method:

public void delete(String pmid) {
    ...
    Term idTerm = new Term(Fields.ID_FIELD,pmid);
    mLogger.debug("delete citation: "+pmid);
    try {
        mIndexWriter.deleteDocuments(idTerm);
    } catch (IOException e) {
        mLogger.warn("delete citation: index access error, term: "+pmid);
    }
}

Indexing the MEDLINE daily updates file is a simple batch-oriented process. The IndexWriter holds a write-lock on the index (a file-system based lock, see the LockFactory javadoc), so only a single update process can ever be running. However there can be any number of IndexReaders open on the index. Their view of the index is whatever state the index was in when either their open() or reopen() method was called. Once you understand this about the Lucene IndexReader and IndexWriter classes, you’re ready to start building applications capable of handling a stream of search requests, interleaved with updates to the index, (or else you’re ready to use Solr, or something like it).

Chinese Information Retrieval with LingPipe and Lucene

October 17, 2008

The standard approach to information retrieval in English involves tokenizing text into a sequence of words, stemming the words to their morphological roots, then tossing out stop words that are too frequent or non-discriminative. This is how the standard settings work in Apache’s Java-based Lucene search engine.

Taking this approach for Chinese would require being able to find the words. But the state of the art in Chinese word segmentation is only 97-98% or so on a relatively well-edited and homogeneous test set making up the third SIGHan segmentation bakeoff. And keep in mind that this is the usual average performance figure, with rarer words being harder to find. Out-of-training-vocabulary performance is at best 85%. Unfortunately, it’s those rarer words in the long tail, likes names, that are often the key disrciminative component of a query.

It turns out that you can go a very long way indexing Chinese using character unigrams and bigrams. That is, instead of breaking text into tokens, pull out all the characters and index those as if they were words, then pull out all the two-character sequences and index those the same way. This is what we did for Technorati. The evaluation on real queries showed what you’d expect: very good recall, but often poor precision because of unintended cross-word and sub-word effects. Consider an English example of the word sequence “and add it to your”, which would look like “andaddittoyour” without spaces. The problem is that without spaces, it matches the words “dad”, “toy”, “ditto”, “you”, and “our”. (In swear word filtering, naive replacement in subwords leads to the “clbuttic” mistake).

By the way, subword problems come up in English at an alarming rate. How many of you know that “Spider-man” is two words while “Batman” is only one? In looking at customer query logs for TV data, we saw that users have no idea how many spaces to put in many examples. We just expect our search engines to have solved that one for us. They actually solve this problem roughly the same way as we suggest for Chinese. (This is also a problem in noun compounding languages like German and Danish, or generally highly compounding languages like Turkish.)

The cross-word effects would be completely mitigated if we could teach Chinese query writers to insert spaces between words. I’m told this is impossible given the conventions of the language, but my reply is “where there’s a will, there’s a way”, and I think the adoption of Pinyin provides some direct support for my conjecture. But let’s suppose we can’t get the Chinese searchers to insert spaces; it won’t take care of the subword problem anyway.

So how can we improve precision for unigram and bigram indexing? Add words when we can find them and let TF/IDF sort out the weights. This is the idea behind this paper, which evaluates a number of such combinations:

Jian-Yun Nie, Jiangfeng Gao, Jian Zhang, and Ming Zhou. 2001. On the Use of Words and N-grams for Chinese Information Retrieval. Proceedings of the 5th International Workshop Information Retrieval with Asian Languages.

The best indexing seems to arise from a mixture of character unigrams, character bigrams and words.

So how do we get the words? Two ways, for maximal precision and control. First, we use a dictionary. We can build an efficient tokenizer to find dictionary words using LingPipe’s Aho-Corasick implementation dict.ExactDictionaryChunker, the use of which is described in the javadoc and in the named entity tutorial. Just make sure you use the right tokenizer, which for Chinese would be the tokenizer.CharacterTokenizerFactory. A dedicated implementation would be faster than going through tokenizers, given that we’re treating each character as a token.

As to where to find a dictionary, that’s up to you. Training data is still available on the old SIGHan site or from LDC.

The second approach to word spotting can be approximate, and for that, we recommend our noisy channel Chinese word segmenter, as described in our tutorial on Chinese word segmentation.

The only fiddly bits of coding are wrapping up our dictionary chunker and chinese word segmenter as Lucene analyzers (see lucene.analysis.Analyzer. You can see how to adapt LingPipe tokenizer factories for use as Lucene analyzers in our new sandbox project LingMed; look for the top-level read-me.html file and the adapter implementation for LingPipe tokenizers in src/com/aliasi/lingmed/lucene/LuceneAnalyzer.java.

We’ve sidestepped the fact that Chinese has two distinct ways of rendering characters, traditional Chinese characters (used mainly in Taiwan and Hong Kong and among ex-pats) and simplifed Chinese characters (used primarily on the mainland). The unicode standard, of course, contains both.

It’s a simple matter to load a single dictionary on both types of characters, and a single word segmenter operating over both types of characters. We’ve had good luck in training multilingual (English/Hindi) named entity extractors. It works because it relies on local context to predict words, and local context is almost always homogeneous in character set. That way, you can handle docs with mixed character sets and don’t have to do a character set ID pass up front (which is tricky, because the code points overlap in unicode and queries are often short).

We should point out that this method can also work for Japanese. Though you might want to up the n-gram count for the syllabic characters in Hirigana and Katakana, while keeping 1-grams and 2-grams for the imported Chinese ideograms making up Kanji.


Follow

Get every new post delivered to your Inbox.

Join 290 other followers