Updating and Deleting Documents in Lucene 2.4: LingMed Case Study


[Update: 10 Feb 2014. Much has changed in Lucene since 2.4. An extensive tutorial for Lucene 4 is now available as a chapter in the book

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene's search query syntax.]

Lucene Java 2.4 was recently released, and this is a good excuse to review the cumulative changes to the org.apache.lucene.index IndexReader and IndexWriter classes. Before Lucene Java 2.1, documents were added to an index using IndexWriter.addDocument() and deleted using IndexReader.delete(). Document updates were written as a delete followed by an add. In 2.1 the method IndexWriter.deleteDocuments() was added to the API, and in version 2.2 IndexWriter.updateDocument() was added as well. Most of the demos and tutorials out there predate these releases, causing no end of confusion and even heartbreak for the newbie developer. Doug Cutting summarizes this nicely in his comments on this blog post:

The presence of IndexReader.delete() traces back to the origins of Lucene. IndexReader and IndexWriter were written, then deletion was added. IndexWriter should really have been called IndexAppender. The only place deletion was possible was IndexReader, since one must read an index to figure out which document one intends to delete, and IndexWriter only knows how to append to indexes. This has big performance implications: IndexReader is a heavy-weight view of an index, while IndexWriter is lightweight. … The issue has been discussed at length for years, but no acceptable solution had been developed.
Recently Ning Li contributed to Lucene a fix for this, a version of IndexWriter that can efficiently intermix deletes with additions.

In the LingMed sandbox project, we use Lucene to maintain a local version of the MEDLINE citation index, where each MEDLINE citation is stored as a Lucene document. Every weekday the NLM releases a set of updates which contains new citations, revised citations, and lists of citations that should be deleted from the index. This is an XML file, and we use classes from the com.aliasi.medline package to process the updates file. As an article goes through the publishing pipeline (e.g. accepted, pre-print, corrections added), the corresponding citation in the MEDLINE index is updated accordingly. We only want to keep the latest version of a citation in our Lucene index. The MEDLINE updates file encodes all citations as MedlineCitation entities, so our handler must figure out whether each citation is new or revised. Since all MEDLINE citations have a unique PMID (PubMed identifier) element, we search the index for that PubMed id – if it exists then we call IndexWriter.updateDocument, else we call IndexWriter.addDocument.

Search over the index requires us to open an IndexReader. This raises the question: if the IndexWriter makes changes to the index how can the IndexReader see them? The answer is found in the IndexWriter's javadoc:

an IndexReader or IndexSearcher will only see the index as of the “point in time” that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.

Given this, processing a set of MEDLINE updates file(s) is pretty straightforward: before processing each file we open an IndexReader on the index, and after processing the entire file we call IndexWriter.commit(), which flushes any pending changes out to the index. (The Lucene javadoc says that after a commit “a reader will see changes”, but only if the reader goes looking for them! See also this discussion on the Lucene mailing list.)
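To make the point-in-time behavior concrete, here is a toy model in plain Java. It is not Lucene code: the class SnapshotIndexSketch and its methods add(), commit(), and openReader() are hypothetical names invented for this illustration, and a "reader" is simply an immutable copy of whatever was committed when it was opened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of point-in-time readers; NOT Lucene code. All names are hypothetical. */
final class SnapshotIndexSketch {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    /** Buffer an addition; it stays invisible to every reader until commit(). */
    void add(String doc) {
        pending.add(doc);
    }

    /** Publish buffered additions so that readers opened afterwards can see them. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Open a "reader": an immutable copy of the committed state as of this call. */
    List<String> openReader() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }

    public static void main(String[] args) {
        SnapshotIndexSketch index = new SnapshotIndexSketch();
        index.add("pmid-1");
        index.commit();

        List<String> reader = index.openReader(); // opened after the first commit
        index.add("pmid-2");
        index.commit();                           // committed, but the old reader is stale

        System.out.println(reader.size());             // prints 1: the old view
        System.out.println(index.openReader().size()); // prints 2: a freshly opened view
    }
}
```

The reader opened before the second commit keeps reporting one document, and only a newly opened reader reports two. That is the reason for opening a fresh reader per updates file in the real indexer.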

We use a com.aliasi.medline.MedlineParser to parse the updates file. The parser takes a visitor in the form of a MedlineHandler, which processes the MEDLINE citations as they are extracted by the parser. The parser invokes the handler’s handle(MedlineCitation) method on each MedlineCitation element that it parses out of the updates file, and invokes the handler’s delete(String) method on each PubMed identifier in the DeleteCitation element.
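The callback flow can be sketched without the LingPipe classes. In the following illustration CitationHandler, RecordingHandler, and parse() are hypothetical stand-ins, not the com.aliasi.medline API, and a citation is reduced to a PubMed id plus a title. The order in which the stand-in parser issues callbacks is an arbitrary choice for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the parser-to-handler callback flow. All names are hypothetical stand-ins. */
final class HandlerFlowSketch {

    /** Stand-in handler contract: one callback per citation, one per deleted id. */
    interface CitationHandler {
        void handle(String pmid, String title);
        void delete(String pmid);
    }

    /** A handler that only records the callbacks it receives. */
    static final class RecordingHandler implements CitationHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void handle(String pmid, String title) {
            events.add("handle:" + pmid);
        }

        @Override
        public void delete(String pmid) {
            events.add("delete:" + pmid);
        }
    }

    /** Stand-in parser: walks already-extracted records and calls back into the handler. */
    static void parse(String[][] citations, String[] deletedPmids, CitationHandler handler) {
        for (String[] citation : citations) {
            handler.handle(citation[0], citation[1]);
        }
        for (String pmid : deletedPmids) {
            handler.delete(pmid);
        }
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        String[][] citations = { { "100", "A new citation" }, { "101", "A revised citation" } };
        String[] deleted = { "42" };
        parse(citations, deleted, handler);
        System.out.println(handler.events); // prints [handle:100, handle:101, delete:42]
    }
}
```

The parser knows nothing about indexing. All index-specific work lives in the handler, which is why the same parser can drive very different consumers.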

Here is the calling code which processes the MEDLINE updates file(s) (calls to the log4j logger omitted):

MedlineParser parser = new MedlineParser(true); // true = save raw XML
IndexWriter indexWriter
      = new IndexWriter(FSDirectory.getDirectory(mIndex),
               mCodec.getAnalyzer(),
               new IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
for (File file: files) {
    ...
    MedlineIndexer indexer = new MedlineIndexer(indexWriter,mCodec);
    parser.setHandler(indexer);
    parseFile(parser,file);
    indexer.close();
    ...
}
indexWriter.optimize();
indexWriter.close();

The class MedlineIndexer implements MedlineHandler and does the work of updating the index:

static class MedlineIndexer implements MedlineHandler {
    final IndexWriter mIndexWriter;
    final MedlineCodec mMedlineCodec;
    final IndexReader mReader;
    final IndexSearcher mSearcher;

    public MedlineIndexer(IndexWriter indexWriter, MedlineCodec codec) 
        throws IOException {
        mIndexWriter = indexWriter;
        mMedlineCodec = codec;
        mReader = IndexReader.open(indexWriter.getDirectory(),true); // read-only
        mSearcher = new IndexSearcher(mReader);
    }
    ...
    public void close() throws IOException { 
        mSearcher.close();
        mReader.close();
        mIndexWriter.commit();
    }

Instantiating a MedlineIndexer automatically opens a fresh IndexReader and IndexSearcher on the index. The MedlineIndexer.close() method closes these objects and commits updates on the index.

The MedlineCodec maps the MedlineCitation.pmid() to its own field in the Lucene Document, so that we can uniquely identify documents by PubMed identifier. To look up documents by PubMed id we create an org.apache.lucene.index.Term object:

Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());

Here is the MedlineIndexer.handle() method:

public void handle(MedlineCitation citation) {
    Document doc = mMedlineCodec.toDocument(citation);
    try {
        ...
            Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());
            if (mSearcher.docFreq(idTerm) > 0) {
                mIndexWriter.updateDocument(idTerm,doc);
            } else {
                mIndexWriter.addDocument(doc);  
            }
         }
    } catch (IOException e) {
        mLogger.warn("handle citation: index access error, term: "+citation.pmid());
    }
}

We use the IndexSearcher's docFreq() method to check if a document with this PubMed id is in the index. If so, then the handler updates the document, else it adds it.
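The IndexWriter javadoc describes updateDocument(Term, Document) as first deleting the document(s) containing the term and then adding the new document. The toy model below shows what that implies for our "keep only the latest version" requirement. It is plain Java, not Lucene: UpsertSketch is a hypothetical class whose method names merely mirror the Lucene ones, and the "index" is a list of {pmid, text} pairs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of update as delete-by-term then add; NOT Lucene code. Names are hypothetical. */
final class UpsertSketch {
    /** Each "document" is a {pmid, text} pair; the list plays the role of the index. */
    private final List<String[]> docs = new ArrayList<>();

    void addDocument(String pmid, String text) {
        docs.add(new String[] { pmid, text });
    }

    /** Remove every document whose id matches; a no-op when nothing matches. */
    void deleteDocuments(String pmid) {
        for (Iterator<String[]> it = docs.iterator(); it.hasNext(); ) {
            if (it.next()[0].equals(pmid)) {
                it.remove();
            }
        }
    }

    /** Update modeled as delete-then-add, so it also works for an id not seen before. */
    void updateDocument(String pmid, String text) {
        deleteDocuments(pmid);
        addDocument(pmid, text);
    }

    /** Number of documents carrying this id. */
    int docFreq(String pmid) {
        int count = 0;
        for (String[] doc : docs) {
            if (doc[0].equals(pmid)) {
                count++;
            }
        }
        return count;
    }

    /** Text of the first document with this id, or null if there is none. */
    String text(String pmid) {
        for (String[] doc : docs) {
            if (doc[0].equals(pmid)) {
                return doc[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UpsertSketch index = new UpsertSketch();
        index.updateDocument("100", "pre-print");  // id not present yet: acts like an add
        index.updateDocument("100", "corrected");  // id present: replaces the old version
        System.out.println(index.docFreq("100"));  // prints 1
        System.out.println(index.text("100"));     // prints corrected
    }
}
```

In this model, calling updateDocument() for an id that is not yet present behaves like addDocument(), and repeated updates leave exactly one document per id. At least under this model, the add-versus-update branch matters for bookkeeping rather than for the final contents of the index.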

To delete a citation from the index we use the deleteDocuments method. Here is the MedlineIndexer.delete() method:

public void delete(String pmid) {
    ...
    Term idTerm = new Term(Fields.ID_FIELD,pmid);
    mLogger.debug("delete citation: "+pmid);
    try {
        mIndexWriter.deleteDocuments(idTerm);
    } catch (IOException e) {
        mLogger.warn("delete citation: index access error, term: "+pmid);
    }
}

Indexing the MEDLINE daily updates file is a simple batch-oriented process. The IndexWriter holds a write-lock on the index (a file-system based lock, see the LockFactory javadoc), so only a single update process can ever be running. However, there can be any number of IndexReaders open on the index. Their view of the index is whatever state the index was in when either their open() or reopen() method was called. Once you understand this about the Lucene IndexReader and IndexWriter classes, you’re ready to start building applications capable of handling a stream of search requests interleaved with updates to the index (or else you’re ready to use Solr, or something like it).
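For readers who have not met file-system based locks before, here is a minimal sketch of the idea in plain Java. It is not Lucene's LockFactory implementation: WriteLockSketch, tryAcquire(), release(), and tempDir() are hypothetical names, and the sketch simply relies on the fact that java.nio.file.Files.createFile() atomically fails when the file already exists.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of a lock-file based write lock; NOT Lucene's LockFactory. Names are hypothetical. */
final class WriteLockSketch {
    private final Path lockFile;
    private boolean held;

    WriteLockSketch(Path indexDir) {
        this.lockFile = indexDir.resolve("write.lock");
    }

    /** Try to take the lock; file creation is atomic, so only one caller can win. */
    boolean tryAcquire() {
        try {
            Files.createFile(lockFile);
            held = true;
            return true;
        } catch (FileAlreadyExistsException alreadyHeld) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Release the lock by deleting the lock file, but only if this instance holds it. */
    void release() {
        if (!held) {
            return;
        }
        try {
            Files.deleteIfExists(lockFile);
            held = false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: a fresh temporary directory standing in for an index directory. */
    static Path tempDir() {
        try {
            return Files.createTempDirectory("index-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = tempDir();
        WriteLockSketch first = new WriteLockSketch(dir);
        WriteLockSketch second = new WriteLockSketch(dir);
        System.out.println(first.tryAcquire());  // prints true
        System.out.println(second.tryAcquire()); // prints false: the lock is taken
        first.release();
        System.out.println(second.tryAcquire()); // prints true
        second.release();
    }
}
```

A second would-be writer on the same directory is refused until the first one releases the lock. Readers never touch the lock file in this sketch, which matches the one-writer, many-readers arrangement described above.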

3 Responses to “Updating and Deleting Documents in Lucene 2.4: LingMed Case Study”

  1. BL Thursday Says:

    [...] * Updating and Deleting Documents in Lucene 2.4 [...]

  2. Lucene or a Database?   Yes! « LingPipe Blog Says:

    [...] to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid [...]

  3. Rakesh Says:

    is the following if condition really needed ?

    “if (mSearcher.docFreq(idTerm) > 0) ”

    Ideally while doing an update, if a matching document is not found, won’t lucene just add the new document ?
