## Archive for the ‘Mitzi’s Blog’ Category

### The Latin1 Transcoding Trick for Ant

April 21, 2013

A while back Bob blogged about The Latin1 Transcoding Trick for Java Servlets, etc.

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte.

Another situation where this trick comes in real handy is dealing with the way that Ant compiles its logfiles.

If, like me, you’re fond of debug-by-printf and you use Ant to compile and run your programs, then you might have run into a problem that has given rise to many StackOverflow queries: when you use an Ant task to run the program and instrument your code with print statements to standard out, Ant replaces non-ASCII characters with a question mark. When the problem you’re trying to debug is making sure that non-ASCII characters are being processed correctly, this is both misleading and maddening. The standard advice on StackOverflow is to set the shell environment variable ANT_OPTS using the following incantation (for bash shell):

export ANT_OPTS="-Dfile.encoding=UTF-8"

This works as long as you’re working with UTF-8 encoded character data and your terminal’s encoding is set to UTF-8 as well. Here is a solution that works no matter what character encoding is in play:

export ANT_OPTS="-Dfile.encoding=Latin1"

It’s the ol’ Latin1 transcoding trick!

Of course you already know about character encodings. But do you know about Ant’s IO system? Here’s what Ant contributor Conor MacNeill says:

The Ant IO system is designed to associate any output sent to System.out and System.err with the task that generated the output and to log the output accordingly.

Ant’s Main class installs its own output streams into System.out and System.err. These streams are instances of DemuxOutputStream.

Looking at the source code for Ant 1.9.0, in class org.apache.tools.ant.Main we see that System.in, System.out, and System.err are all reassigned: System.in to Ant’s DemuxInputStream, and System.out and System.err to PrintStreams wrapping DemuxOutputStream. These classes extend InputStream and OutputStream, respectively:

System.setIn(new DemuxInputStream(project));
System.setOut(new PrintStream(new DemuxOutputStream(project, false)));
System.setErr(new PrintStream(new DemuxOutputStream(project, true)));

The call to the PrintStream constructor is the one-arg constructor PrintStream(OutputStream out). Because no file encoding is specified, the encoding used is the default charset for the JVM that’s running Ant. This is specified by the system property file.encoding. This property varies depending on your platform and locale. To check this, try this on your machine:

public class GetDefaultEncoding {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding"));
    }
}

On my Mac running OS X the default is US-ASCII (because the default locale on this machine is en_US). On my Windows XP machine the default is Cp1252 (Windows Latin1, which differs from ISO-8859-1 just enough to be noticeable).

At the point where Ant’s IO system converts the bytes sent to System.out by a Java task into characters, any byte that isn’t valid in the default character set ends up replaced by a question mark. When Latin1 is the default encoding, every byte is a valid Latin1 character whose Unicode code point is the same as the byte value, so the bytes from the Java task pass through the Ant IO system unchanged.
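You can see the same effect outside of Ant in a couple of lines of Java. This is just an illustrative sketch of the decode/encode round trip, not Ant’s actual code path; the class name and the byte 0xE9 (Latin1 é) are arbitrary choices for the example:

public class TranscodeDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = { (byte) 0xE9 };   // Latin1 'é'

        // Decode then re-encode with US-ASCII: the byte is not valid
        // ASCII, so it comes back as '?' (0x3F).
        byte[] ascii = new String(input, "US-ASCII").getBytes("US-ASCII");
        System.out.println(Integer.toHexString(ascii[0] & 0xFF));   // 3f

        // Decode then re-encode with Latin1: every byte maps to the code
        // point with the same value, so the round trip is lossless.
        byte[] latin1 = new String(input, "ISO-8859-1").getBytes("ISO-8859-1");
        System.out.println(Integer.toHexString(latin1[0] & 0xFF));  // e9
    }
}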

As long as the next process in the chain (e.g. the terminal app) is configured to handle whatever encoding your text data is in, you’re good to go.

### Using Luke the Lucene Index Browser to develop Search Queries

July 24, 2012

Luke is a GUI tool written in Java that allows you to browse the contents of a Lucene index, examine individual documents, and run queries over the index. Whether you’re developing with PyLucene, Lucene.NET, or Lucene Core, Luke is your friend.

Downloads are available from http://code.google.com/p/luke/. You can download the source or just the standalone binary .jar file. Right now we’re using the just-released Luke 4.0.0-ALPHA standalone binary. To run Luke, just fire up the jar file from the command line:

> java -jar lukeall-4.0.0-ALPHA.jar

Once launched, Luke opens with a dialog menu to open an index. We’re using an index called fed-papers.idx, which is an index over the Federalist Papers, a set of 85 op-ed pieces written by James Madison, Alexander Hamilton, and/or John Jay. This is taken from the example in our recently updated Lucene 3 tutorial. To build the index yourself, download the data and accompanying source code and follow the instructions in the tutorial. Each of the 85 papers is treated as a document. The text of the document has been indexed using Lucene’s StandardAnalyzer and is stored in a field named text.

The following is a screenshot of the Luke GUI upon first opening the index. We’ve circled the main controls on the top left corner of the GUI. Luke opens with the Overview tab pane.

All 85 documents have a similar structure. They start with an identifying number, e.g. FEDERALIST No. 1. The opening salutation is: To the People of the State of New York:. Looking at the top terms in the index, we see that there are 85 documents which contain the words {federalist, people, state, york} as well as common function words. The StandardAnalyzer has stop-listed other function words, including {to, of, the, no}. Otherwise, these too would be among the top terms with a document frequency of 85.

### Exploring Document Indexing

The Document tab pane lets you examine individual documents in the index. In the screenshot below we have selected the 47th document in the index, FEDERALIST No. 48. Circled in red are the series of controls used to get to this point. First we chose the Document tab pane, and then used the “Browse by document number” control to choose document 47. The bottom half of the window shows the document fields.

Clicking on the “Reconstruct &amp; Edit” control opens a new pop-up window which allows us to inspect the document contents on a field-by-field basis. Below we show side-by-side screenshots of the text field. On the left we see the raw text as stored; on the right we see the result of tokenization and indexing via the Lucene StandardAnalyzer, which chains a StandardTokenizer and StandardFilter with a LowerCaseFilter and StopFilter. Tokenization happens first: it strips out punctuation and assigns each token a position. Then the tokens are lower-cased and stop-listed.

We have circled the text of the opening salutation: To the People of the State of New York:. It consists of 9 words followed by 1 punctuation symbol. The tokenizer strips out the final colon. The stop-list filter removes the tokens at positions {1, 2, 4, 5, 7}. The text of tokens which have been indexed are displayed as is. Tokens which were stop-listed are missing from the index, so Luke displays the string null followed by the number of consecutive stop-listed tokens.

### Exploring Search

The Search tab pane packs a lot of controls. The annotated screenshot below shows the results of running a search over the index.

In the top right quadrant is an embedded tab pane (tabs circled in red). On the Analysis tab, we have selected the StandardAnalyzer from a pull-down list of Analyzers and selected the default search field from a pull-down list of all fields in the index. Given this, Luke constructs the QueryParser used to parse queries entered into the text box in the upper left quadrant. We entered the words “Powers of the Judiciary” into this text box, circled in red. Directly below it is the parsed query, also circled in red. Clicking on the “Search” control runs this search over the index. The bottom pane displays the ranked results.

Luke passes the input string from the search text box to the QueryParser which parses it using the specified analyzer and creates a set of search terms, one term per token, over the specified default field. In this example, the StandardAnalyzer tokenizes, lowercases, and stop-lists these words, resulting in two search terms text:powers and text:judiciary. The result of this search pulls back all documents which contain the words powers and/or judiciary.
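Outside of Luke, the same parse is easy to reproduce against the Lucene 3.6 API used in the tutorial. Here is a rough sketch; the field name text matches the tutorial index, while the class name and output comment are mine, not Luke’s internals:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ParseDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_36, "text",
                                             new StandardAnalyzer(Version.LUCENE_36));
        // Stop words are dropped, the remaining tokens are lower-cased,
        // and each becomes a SHOULD clause on the default field.
        Query query = parser.parse("Powers of the Judiciary");
        System.out.println(query);   // prints: text:powers text:judiciary
    }
}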

To drill down on Lucene search, we use the Explain Structure control, circled in red on the following screenshot.

When clicked, this pops up another pane which shows all details of the query constructed from the search expression. The structure of the query used in this example is:

lucene.BooleanQuery
clauses=2, maxClauses=1024
Clause 0: SHOULD
lucene.TermQuery
Term: field='text' text='powers'
Clause 1: SHOULD
lucene.TermQuery
Term: field='text' text='judiciary'

The search expression can be any legal search string according to the Lucene Query Parser Syntax. To search for the phrase powers of the judiciary we need to enclose it in double quotes. But this new search produces no results, which is clearly wrong, since this phrase occurs in the title of FEDERALIST No. 80 (see search results above).

In the query details we see that the input has been parsed into text:"powers judiciary". The Explain Structure feature returns the following:

lucene.PhraseQuery, slop=0
pos: [0,1]
Term 0: field='text' text='powers'
Term 1: field='text' text='judiciary'

The problem here is that the token positions specified don’t allow for the intervening stop-listed words. To correct this, we need to adjust Luke’s default settings on the QueryParser. The screenshot below shows both the controls changed and the query results.

First we go to the QueryParser tab in the set of search controls in the top right quadrant, then we click the checkbox labeled Enable positionIncrements. Now the parsed query looks like this: text:"powers ? ? judiciary", which translates to the following programmatic query:

lucene.PhraseQuery, slop=0
pos: [0,3]
Term 0: field='text' text='powers'
Term 1: field='text' text='judiciary'
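In code, the fix amounts to a single switch on the query parser. The sketch below is again against the Lucene 3.6 API rather than Luke’s own source; the hand-built PhraseQuery is my reading of what the parsed query amounts to, with term positions 0 and 3 leaving room for the two stop-listed tokens:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.util.Version;

public class PositionIncrementDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser(Version.LUCENE_36, "text",
                                             new StandardAnalyzer(Version.LUCENE_36));
        // Honor the position increments left behind by the StopFilter.
        parser.setEnablePositionIncrements(true);
        System.out.println(parser.parse("\"Powers of the Judiciary\""));
        // prints: text:"powers ? ? judiciary"

        // Roughly the same query built by hand: the gap between positions
        // 0 and 3 accounts for the two stop-listed tokens.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("text", "powers"), 0);
        phrase.add(new Term("text", "judiciary"), 3);
        System.out.println(phrase);
    }
}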

Finally, we select the single search result, and click on the Explain control (circled in red). This pops up another window which explains the query score (outlined in red). Here is the text of the explanation:

0.0847  weight(text:"powers ? ? judiciary" in 79) [DefaultSimilarity], result of:
0.0847  fieldWeight in 79, product of:
1.0000  tf(freq=1.0), with freq of:
1.0000  phraseFreq=1.0
3.6129  idf(), sum of:
1.3483  idf(docFreq=59, maxDocs=85)
2.2646  idf(docFreq=23, maxDocs=85)
0.0234  fieldNorm(doc=79)
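If I’m reading the DefaultSimilarity formulas correctly, the numbers compose bottom-up: each idf term is $1 + \ln(\mathrm{maxDocs}/(\mathrm{docFreq}+1))$, so $1 + \ln(85/60) \approx 1.3483$ and $1 + \ln(85/24) \approx 2.2646$, which sum to $3.6129$; the fieldWeight is then the product $tf \times idf \times fieldNorm \approx 1.0 \times 3.6129 \times 0.0234 \approx 0.0847$ (the displayed fieldNorm is itself rounded, which is why the product doesn’t reproduce the last digit exactly).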

### Using the Lucene XML Query Parser

The Lucene API contains many kinds of queries beyond those generated by the QueryParser. You can use Luke to develop these queries as well, via the Lucene XML Query Parser. It is almost impossible to find any on-line documentation on this most excellent contribution to Lucene by Mark Harwood. Luckily, it is distributed with Lucene source tgz files. For Lucene 3.6 the path to this documentation is: lucene-3.6.0/contrib/xml-query-parser/docs/index.html.

Recast as a Lucene XML query, our original search text:powers text:judiciary becomes:

<BooleanQuery fieldName="text">
  <Clause occurs="should">
    <TermQuery>powers</TermQuery>
  </Clause>
  <Clause occurs="should">
    <TermQuery>judiciary</TermQuery>
  </Clause>
</BooleanQuery>

The screenshot below shows the results of pasting the above XML into the search expression textbox with the QueryParser control Use XML Query Parser turned on. (Note: currently this works in Luke 3.5 but not in Luke 4.0-ALPHA. We hate the bleeding edge).

The XML rewrites to exactly the same query as our first query.

Big deal, you might be saying right about now. That’s a whole lotta chopping for not much kindling. Au contraire! One of the reasons for developing the XML Query Syntax is this:

To bridge the growing gap between Lucene query/filtering functionality and the set of functionality accessible through the standard Lucene QueryParser syntax.

For example, a while back, Mark Miller blogged about Lucene SpanQueries. A span query is similar to, but not the same as, a PhraseQuery. We can use Luke to compare and contrast the two. Here’s our earlier phrase query Powers of the Judiciary recast as a span query.

<SpanNear slop="3" inOrder="true" fieldName="text">
  <SpanTerm>powers</SpanTerm>
  <SpanTerm>judiciary</SpanTerm>
</SpanNear>
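For comparison, the same query built programmatically against Lucene’s span package might look like the following fragment (a sketch; the field name follows the tutorial index):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// powers ... judiciary within a slop of 3, in order
SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("text", "powers")),
    new SpanTermQuery(new Term("text", "judiciary"))
};
SpanNearQuery spanNear = new SpanNearQuery(clauses, 3, true);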

The screenshot below shows the results of running this query.

FEDERALIST No. 48 scores slightly better on this query than does FEDERALIST No. 80. Understanding why is left as an exercise to the reader.

### Lucene Tutorial updated for Lucene 3.6

July 5, 2012

The current Apache Lucene Java version is 3.6, released in April of 2012. We’ve updated the Lucene 3 tutorial and the accompanying source code to bring it in line with the current API so that it doesn’t use any deprecated methods (and my, there are a lot of them). Bob blogged about this tutorial back in February 2011, shortly after Lucene Java rolled over to version 3.0.

Like other 3.x minor releases, Lucene 3.6 introduces performance enhancements, bug fixes, new analyzers, and changes that bring the Lucene API in line with Solr. In addition, Lucene 3.6 anticipates Lucene 4, billed as “the next major backwards-incompatible release.”

### Significant changes since version 3.0

• IndexReader delete methods are deprecated and will be removed entirely in Lucene 4. All deletes and updates are done via an IndexWriter.
• There is a single IndexWriter constructor that takes two arguments: the index directory and an IndexWriterConfig object. The latter was introduced in Lucene 3.1; it holds configuration information that was previously specified directly as additional arguments to the constructor (see the sketch after this list).
• IndexWriter optimize methods are deprecated. The merge method(s) supply this functionality.
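As a rough illustration of the 3.x-style calls, not code from the tutorial itself (the directory name and field values here are made up):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Lucene36Sketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("demo.idx"));
        // Two-argument constructor: the directory plus an IndexWriterConfig.
        IndexWriterConfig config
            = new IndexWriterConfig(Version.LUCENE_36,
                                    new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("text", "hello world", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // Deletes and updates now go through the IndexWriter, not IndexReader.
        writer.updateDocument(new Term("id", "42"), doc);
        writer.deleteDocuments(new Term("id", "42"));
        writer.close();
    }
}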

### Building the Source

The ant build file is in the file src/applucene/build.xml and should be run from that directory. The book’s distribution is organized this way so that each chapter’s demo code is roughly standalone, but they are able to share libs. There are some minor dependencies on LingPipe in the example (jar included), but those are just for I/O and could be easily removed or replicated. As an added bonus, the source code now includes the data used in the examples throughout the tutorial, the venerable Federalist Papers from Project Gutenberg.

### Expected Error Histograms from Illumina 36bp Reads

May 10, 2010

Mitzi and I have been working on reproducing the results reported in a recent (and hugely influential) paper by Marioni et al.

The paper analyzes human liver and kidney tissue using both an Affymetrix micro-array and the Illumina short-read sequencer. More about Affy vs. Illumina later; for now, I just want to describe the quality of the Illumina reads.

### Error Profile of Illumina

What’s nice about modern bioinformatics is that many authors submit their data to public repositories. So Mitzi was able to download and crunch Marioni et al.’s Illumina data.

For each 36bp read, Illumina provides a probability estimate for each base. For instance, the 17th position might assign a probability of 0.97 to A, 0.02 to C, 0.001 to G, and 0.009 to T. Marioni et al., like most researchers, simply take the first-best calls (see below for some exceptions), thus assigning position 17 in the above example to A. But according to Illumina, there’s only a 97% chance that call is correct!

### Calculating Expected Errors per Read

So how many errors do we expect in a 36bp read? If $q_i$ is the probability that the most likely base at position $i$ is correct, the expected number of errors over the first $I$ positions is $\sum_{i = 1}^{I} (1 - q_i)$.
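In code, that’s just a sum over per-position error probabilities. A minimal sketch (the per-base probabilities below are invented for illustration, not taken from the Marioni et al. data):

public class ExpectedErrors {
    // Expected number of errors over the first maxLen positions, where
    // q[i] is the probability the first-best call at position i is correct.
    static double expectedErrors(double[] q, int maxLen) {
        double expected = 0.0;
        for (int i = 0; i < maxLen && i < q.length; ++i)
            expected += 1.0 - q[i];
        return expected;
    }

    public static void main(String[] args) {
        double[] q = new double[36];
        java.util.Arrays.fill(q, 0.97);             // made-up per-base accuracy
        System.out.println(expectedErrors(q, 36));  // about 36 * 0.03 = 1.08
        System.out.println(expectedErrors(q, 32));  // about 32 * 0.03 = 0.96
    }
}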

With a little Java, Python and R (more about the package we’re working on later), Mitzi plotted a histogram for all 66.4M reads from Marioni’s 3pM kidney data:

But it turns out that like most researchers, Marioni et al. truncated the reads because they’re noisier toward the tail (3′ end) of the reads. Specifically, they only used the first 32bp. Here’s the same plot with the last 4bps truncated:

As you can see, this dramatically reduces the expected number of errors.

### Calculating Expected Number of Correct Reads

Mitzi also plotted the probability that the first-best calls up to length $I$ are all correct for a read, which is $\prod_{i=1}^{I} q_i$. The probabilities are so low it requires a log (base 10) scale when using all 36bps:

The startling conclusion is that there’s almost no chance at all that the first-best base calls lead to the correct sequence. This has unsettling ramifications for procedures like SNP calling.
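Since the per-read probabilities get so small, it’s more convenient to compute the product on the log scale as $\sum_{i=1}^{I} \log_{10} q_i$. A companion sketch to the one above (again with made-up probabilities in mind) might look like this:

// Probability that all first-best calls up to maxLen are correct,
// computed on the log10 scale to avoid working with tiny products.
static double log10ProbAllCorrect(double[] q, int maxLen) {
    double logProb = 0.0;
    for (int i = 0; i < maxLen && i < q.length; ++i)
        logProb += Math.log10(q[i]);
    return logProb;   // e.g. 36 positions at q = 0.97 gives about -0.476
}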

### The Done Thing

What Marioni et al. did is typical. They used a heuristic aligner (Bowtie) which ignores the read quality scores and allows only a small number of edits. They then censor (aka discard) any read from their data set that doesn’t align “uniquely”. A unique alignment is taken to be one that has a unique maximum quality score with fewer than the maximum number of allowed edits. That is, if there’s an alignment with one edit, and seven with two edits, the alignment is considered unique; but if there are two alignments with one edit, the read’s discarded.

Why throw away non-unique reads? There’s good reason to believe that the “unique” alignments are not truly unique. Not everyone discards non-unique reads; there are so-called “recovery” strategies, which I discussed in a previous post describing my joint mapping recovery and mRNA expression model.

Why throw away quality scores from the aligner? I only know of two aligners that use quality scores, Gnumap and BWA, but these systems are far less popular than simple edit counting systems like Bowtie.

Why only count edits? A proper channel model should be better. For instance, the SHRiMP system employs a more refined channel model (which my probabilistic alignment model refines with proper normalization and integrating out of the actual sequence of edits).

The answer’s simple really: it’s easier to code and easier to explain.

### Can We Do Better?

Like every newbie, I feel we can. Not surprisingly, I think we should use all the data we have, integrating a proper probabilistic channel model for alignment with a Bayesian mRNA expression model.

Mitzi’s busy doing much more liberal alignments with SHRiMP (which requires a Viterbi approximation to the channel model) and then we’ll run them through the scalable collapsed Gibbs sampler I wrote for the mRNA expression model. After that, we’ll work in the more refined channel edit model that can account for biases induced by the wet lab process and the sequencer, and the quality scores coming from the sequencer.

### Bioinformatics?

No, Alias-i has no plans to work on Bioinformatics; this is still just a hobby for me. Maybe I should choose a hobby that’s a little more dissimilar from my day job.

### Lucene 2.4 in 60 seconds

February 18, 2009

[Update: 17 Nov 2010. We now have an extensive (20 pages in textbook format) tutorial for Lucene 3.0 as an appendix for the LingPipe book. The PDF for the current draft and code samples in a tarball are available from the following site:

As a bonus, the tokenization chapter provides adapter implementations for converting LingPipe tokenizers to Lucene analyzers and vice-versa.]

This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0. The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.

In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.

Here is buildIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
...
public static void buildIndex() throws IOException {
    IndexWriter indexWriter
        = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                          new StandardAnalyzer(),
                          IndexWriter.MaxFieldLength.LIMITED);
    String[] texts = new String[] { "hello world",
                                    "hello sailor",
                                    "goodnight moon" };
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text", text,
                          Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
}


A Lucene data store is called an Index. It is a collection of Document objects. A Document consists of one or more named Field objects. Documents are indexed on a per-field basis. An IndexWriter object is used to create this index. Let’s go over the call to its constructor method:

new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.LIMITED);


The first argument is the index that the IndexWriter operates on. This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index. The second argument is a Lucene Analyzer. This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts. It ignores punctuation, removes common function words such as "if", "and", and "but", and converts words to lowercase. The third argument determines the amount of text that is indexed. The constant IndexWriter.MaxFieldLength.LIMITED limits the number of terms indexed per field to 10,000.

for (String text : texts) {
    Document doc = new Document();
    doc.add(new Field("text", text,
                      Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
indexWriter.close();


The for loop maps texts into Document objects, which contain a single Field with name "text". The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter's Analyzer. The IndexWriter.addDocument() method adds each document to the index. After all texts have been processed the IndexWriter is closed.

Both indexing and search are operations over Document objects. Searches over the index are specified on a per-field basis (just like indexing). Lucene computes a similarity score between each search query and all documents in the index, and the search results consist of the set of best-scoring documents for that query.

Here is searchIndex and its required import statements:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
...
public static void searchIndex(String[] queryStrings)
        throws IOException, ParseException {
    Searcher searcher
        = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    QueryParser parser = new QueryParser("text", new StandardAnalyzer());
    for (String queryString : queryStrings) {
        System.out.println("\nsearching for: " + queryString);
        Query query = parser.parse(queryString);
        TopDocs results = searcher.search(query, 10);
        System.out.println("total hits: " + results.totalHits);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.printf("%5.3f %s\n",
                              hit.score, doc.get("text"));
        }
    }
    searcher.close();
}


Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates an IndexSearcher over the directory "helloLuceneIndex". Search happens via a call to the Searcher.search() method. Before calling search() the search string must be processed into a Lucene Query. To prepare the query we create a QueryParser and call its parse() method on the search string. The QueryParser is constructed using the same Analyzer as was used for indexing: the QueryParser processes the search string into a set of search terms which are matched against the terms in the index, so the indexed text and the search text must be tokenized in the same way.

Here are the statements which process the results:

TopDocs results = searcher.search(query, 10);
System.out.println("total hits: " + results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.printf("%5.3f %s\n",
                      hit.score, doc.get("text"));
}


The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits. The latter is the number of documents that matched the query, and the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score, a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports the TopDocs.totalHits, then iterates over the TopDocs.scoreDocs array. The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name "text".

In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command:

javac -cp "lucene-core-2.4.0.jar" HelloLucene.java


Invoke the program with the following command:

java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" hello moon foo


This produces the following results:

built index

searching for: hello world
total hits: 2
1.078 hello world
0.181 hello sailor

searching for: hello
total hits: 2
0.625 hello world
0.625 hello sailor

searching for: moon
total hits: 1
0.878 goodnight moon

searching for: foo
total hits: 0

all done


The first query "hello world" matches exactly against our first text, but it also partially matches the text "hello sailor". The second query "hello" matches both texts equally well. The third query "moon" matches against the only document which contains that word. The fourth query "foo" doesn’t match anything in the index.

As an exercise, refactor HelloLucene.java into two programs: BuildHelloLucene.java and SearchHelloLucene.java, where BuildHelloLucene uses its command line arguments to build the index, instead of the strings supplied by String[] texts.

Running BuildHelloLucene with inputs "hello world", "hello sailor", and "goodnight moon" and then running SearchHelloLucene with inputs "hello world", "hello", "moon" and "foo" should give the same results as above. Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.
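One possible shape for the refactored indexer, offered as a sketch rather than the definitive answer to the exercise (it reuses buildIndex’s imports and logic, with args standing in for the texts array):

public class BuildHelloLucene {
    public static void main(String[] args) throws IOException {
        IndexWriter indexWriter
            = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                              new StandardAnalyzer(),
                              IndexWriter.MaxFieldLength.LIMITED);
        // Each command line argument becomes one document in the index.
        for (String text : args) {
            Document doc = new Document();
            doc.add(new Field("text", text,
                              Field.Store.YES, Field.Index.ANALYZED));
            indexWriter.addDocument(doc);
        }
        indexWriter.close();
        System.out.println("built index");
    }
}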

### Lucene or a Database? Yes!

November 22, 2008

This question comes up pretty often on the Lucene mailing lists. Assuming that’s an inclusive OR in the question, we say yes! to both. In many situations this is not just a clever answer, it’s the correct one — while the two systems can both be used as a data store, there are some things that a database can do better than Lucene, and some things that Lucene can do that a database can’t do at all.

Lucene provides full-text search over a collection of data that returns results in order of relevancy. If search over (lots of) text data is a non-negotiable requirement, then a date with Lucene (or one of its cousins) is in your future, because this is something that most database systems do badly, if at all. Here’s a nice critique of this feature in MySQL (an otherwise fine RDBMS).

Lucene provides search and indexing over Document objects. A Document is a set of Fields. Indexing a Field processes its contents into one or more Term objects which contain the Field name and the Term string value. Lucene maintains an inverted index which maps terms into documents. Queries against the index are scored by TF-IDF, a measure of document relevance. A Lucene Analyzer defines how index terms are extracted from text. The developer can define their own Analyzer, or use one of the pre-existing analyzers in the Lucene API, which can do tokenization for many different languages, as well as stop-listing of common words, case normalization, stemming, and more. For a nice example of a custom analyzer, see Bob‘s chapter on “Orthographic variation with Lucene” in the Lucene in Action book.
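As a rough illustration of how small a custom Analyzer can be, here is a hedged sketch against the Lucene 2.x analysis API; it just lower-cases StandardTokenizer output, and is nothing like the orthographic-variation analyzer in the book chapter:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LowerCasingAnalyzer extends Analyzer {
    // Each field's text is run through the tokenizer, then lower-cased.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new StandardTokenizer(reader));
    }
}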

Relational databases store data as rows and columns in tables. A table declaration specifies the columns, the type of data stored in each column, and constraints over columns. The DBMS enforces these constraints as rows are added to the tables. In Lucene there is no notion of document type. A document is a set of fields, and processing a document consists of processing those fields into fielded terms that are added to the index. Lucene was designed for rapid indexing of documents. Checking document completeness is the responsibility of the calling application, ditto enforcing cross-document constraints. In the latter case, trying to enforce a uniqueness constraint on a term in the index is likely to impair indexing speed.

Searches against the two are different both in the search operations and search results. Database queries are specified in SQL, which allows for composition of data from different tables via JOIN, complex boolean selectional restrictions, and ordering of results. Search over a database returns a ResultSet object containing a table which is a new view on the data resulting from executing the search query; that is, it may contain data stored across several different underlying tables.

Lucene search is over the term index, and the results returned are a set of pairs of documents and scores. Lacking explicit document types, a query is over some number of fields. Lucene has its own query language, and the package org.apache.lucene.search provides classes to do this programmatically. Lucene uses the score to limit the number of documents returned. Overriding this behavior requires getting all matching documents for a query, which can be very expensive, and queries designed to find things like dates or numbers within some range can be inefficient.

So if you want to provide relevance-based text search over a heterogeneous data store that has both a transactional component, such as an accounting system, and stores large amounts of text, you’re going to have to use both Lucene and a database, and the questions to ask are:

• What is the mapping between the database tables and views and Lucene documents?
• What text, if any, needs to be stored in the index?
• How do I keep my Lucene index in sync with the database?

The Scaling Web blog has a nice case study. They have some good tips about how to conflate information from multiple table columns into good fielded documents, and how best to use IndexReader and IndexWriter objects in order to keep the index up-to-date with the database.

Of course there are situations where just Lucene or just a database is a sufficient data store. In the LingMed sandbox project we use Lucene to store a local version of the MEDLINE citation index. The challenge there is making sure that updates to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid search by PubMed ID. The citation title and abstract are also indexed into their own fields, and we can use Lucene to find out how many different documents contain a word in the article title or abstract – that is, we can quickly and easily calculate the document frequency for an item.

LingMed uses a LingPipe Chunker and a LanguageModel to find all mentions of genes in MEDLINE citations and assign a score to each. A MySQL database is the appropriate data store because we need to keep track of a many-to-many relationship between genes and articles (each gene mention), as well as information about the article itself. We don’t need sophisticated text search over this data, nor do we want to rank our results by TF-IDF; we are using LingPipe to compute our own scores, and use SQL’s “ORDER BY” clause to retrieve the best-scoring items.

The correct answer to the question “Lucene or a Database?” always depends on the specifics of the situation. Furthermore, as more functionality is added to Lucene (and family), the question is worth revisiting, and is revisited fairly frequently on the Lucene mailing lists. These questions are usually accompanied by a specific use case, and the comments of the Lucene contributors provide good guidelines which should help you find the answer that is right for your application.

### Updating and Deleting Documents in Lucene 2.4: LingMed Case Study

November 5, 2008

Lucene Java 2.4 was recently released, and this is a good excuse to review the cumulative changes to the org.apache.lucene.index IndexReader and IndexWriter classes. Before Lucene Java 2.1, documents were added to an index using IndexWriter.addDocument() and deleted using IndexReader.delete(). Document updates were written as a delete followed by an add. In 2.1 the method IndexWriter.deleteDocuments() was added to the API, and in version 2.2 IndexWriter.updateDocument() was added as well. Most of the demos and tutorials out there predate these releases, causing no end of confusion and even heartbreak for the newbie developer. Doug Cutting summarizes this nicely in his comments on this blog post:

The presence of IndexReader.delete() traces back to the origins of Lucene. IndexReader and IndexWriter were written, then deletion was added. IndexWriter should really have been called IndexAppender. The only place deletion was possible was IndexReader, since one must read an index to figure out which document one intends to delete, and IndexWriter only knows how to append to indexes. This has big performance implications: IndexReader is a heavy-weight view of an index, while IndexWriter is lightweight. … The issue has been discussed at length for years, but no acceptable solution had been developed.
Recently Ning Li contributed to Lucene a fix for this, a version of IndexWriter that can efficiently intermix deletes with additions.

In the LingMed sandbox project, we use Lucene to maintain a local version of the MEDLINE citation index, where each MEDLINE citation is stored as a Lucene document. Every weekday the NLM releases a set of updates which contains new citations, revised citations, and lists of citations that should be deleted from the index. This is an XML file, and we use classes from the com.aliasi.medline package to process the updates file. As an article goes through the publishing pipeline (e.g. accepted, pre-print, corrections added), the corresponding citation in the MEDLINE index is updated accordingly. We only want to keep the latest version of a citation in our Lucene index. The MEDLINE updates file encodes all citations as MedlineCitation entities, therefore our handler must figure out whether the citation is new or revised. Since all MEDLINE citations have a unique PMID (PubMed identifier) element, we search the index for that PubMed id – if it exists then we call IndexWriter.updateDocument, else we call IndexWriter.addDocument.

Search over the index requires us to open an IndexReader. This raises the question: if the IndexWriter makes changes to the index, how can the IndexReader see them? The answer is found in the IndexWriter's javadoc:

an IndexReader or IndexSearcher will only see the index as of the “point in time” that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.

Given this, processing a set of MEDLINE updates file(s) is pretty straightforward: before processing each file we open an IndexReader on the index, and after processing the entire file we call IndexWriter.commit(), which flushes any pending changes out to the index. (The Lucene javadoc says that after a commit “a reader will see changes”, but only if the reader goes looking for them! See also this discussion on the Lucene mailing list).
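The “going looking” amounts to refreshing the reader after the commit. A minimal sketch of that pattern (not LingMed’s actual code; it assumes reader is the IndexReader currently serving searches):

// After IndexWriter.commit(), refresh the reader to pick up the changes.
IndexReader current = reader.reopen();   // returns the same instance if nothing changed
if (current != reader) {
    reader.close();
    reader = current;
}
IndexSearcher searcher = new IndexSearcher(reader);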

We use a com.aliasi.medline.MedlineParser to parse the updates file. The parser takes a visitor in the form of a MedlineHandler, which processes the MEDLINE citations as they are extracted by the parser. The parser invokes the handler’s handle(MedlineCitation) method on each MedlineCitation element that it parses out of the updates file, and invokes the handler’s delete(String) method on each PubMed identifier in the DeleteCitation element.

Here is the calling code which processes the MEDLINE updates file(s) (calls to the log4j logger omitted):

MedlineParser parser = new MedlineParser(true); // true = save raw XML
IndexWriter indexWriter
    = new IndexWriter(FSDirectory.getDirectory(mIndex),
                      mCodec.getAnalyzer(),
                      new IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
for (File file : files) {
    ...
    MedlineIndexer indexer = new MedlineIndexer(indexWriter, mCodec);
    parser.setHandler(indexer);
    parseFile(parser, file);
    indexer.close();
    ...
}
indexWriter.optimize();
indexWriter.close();


The class MedlineIndexer implements MedlineHandler and does the work of updating the index:

static class MedlineIndexer implements MedlineHandler {
    final IndexWriter mIndexWriter;
    final MedlineCodec mMedlineCodec;
    final IndexSearcher mSearcher;

    public MedlineIndexer(IndexWriter indexWriter, MedlineCodec codec)
            throws IOException {
        mIndexWriter = indexWriter;
        mMedlineCodec = codec;
    }
    ...
    public void close() throws IOException {
        mSearcher.close();
        mIndexWriter.commit();
    }


Instantiating a MedlineIndexer automatically opens a fresh IndexReader and IndexSearcher on the index. The MedlineIndexer.close() method closes these objects and commits updates on the index.

The MedlineCodec maps the MedlineCitation.pmid() to its own field in the Lucene Document, so that we can uniquely identify documents by PubMed identifier. To look up documents by PubMed id we create an org.apache.lucene.index.Term object:

Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());


Here is the MedlineIndexer.handle() method:

public void handle(MedlineCitation citation) {
    Document doc = mMedlineCodec.toDocument(citation);
    try {
        ...
        Term idTerm = new Term(Fields.ID_FIELD, citation.pmid());
        if (mSearcher.docFreq(idTerm) > 0) {
            mIndexWriter.updateDocument(idTerm, doc);
        } else {
            mIndexWriter.addDocument(doc);
        }
    } catch (IOException e) {
        mLogger.warn("handle citation: index access error, term: " + citation.pmid());
    }
}


We use the IndexSearcher's docFreq() method to check if a document with this PubMed id is in the index. If so, then the handler updates the document, else it adds it.

To delete a citation from the index we use the deleteDocuments method. Here is the MedlineIndexer.delete() method:

public void delete(String pmid) {
    ...
    Term idTerm = new Term(Fields.ID_FIELD, pmid);
    mLogger.debug("delete citation: " + pmid);
    try {
        mIndexWriter.deleteDocuments(idTerm);
    } catch (IOException e) {
        mLogger.warn("delete citation: index access error, term: " + pmid);
    }
}


Indexing the MEDLINE daily updates file is a simple batch-oriented process. The IndexWriter holds a write-lock on the index (a file-system based lock, see the LockFactory javadoc), so only a single update process can ever be running. However, there can be any number of IndexReaders open on the index. Their view of the index is whatever state the index was in when their open() or reopen() method was called. Once you understand this about the Lucene IndexReader and IndexWriter classes, you’re ready to start building applications capable of handling a stream of search requests interleaved with updates to the index (or else you’re ready to use Solr, or something like it).