Author Archive

Document Classification with Lucene

April 10, 2014

As promised in my last post, this post shows you how to use Lucene’s ranked search results and document store to build a simple classifier. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene. The data and source code for this example are contained in the source bundle distributed with this book, which can be downloaded from

A classifier takes some input, which could be just about anything, and returns a classification of the input over a finite number of discrete categories. The goal of text classification is to assign some text to some category. Given a set of texts that contain discrete category labels, we can use that can assign these labels to new documents based on the similarity of the new documents to the labeled documents.

Example Dataset: The 20 Newsgroups Corpus

For this example, we use the data from the 20 Newsgroups corpus, a set of roughly 20,000 messages posted to 20 different newsgroups.  The classification task is to build a system that can assign a newsgroup name (the category label) to a newsgroup message (the text). Since all the messages in the corpus are already labeled, we can use one set of newsgroup posts to train our classifier and use the other set to test the classifier. To test the system, we compare the newsgroup name generated by the classifier against the name of the newsgroup that the message was posted to.

To use Lucene to classify a newsgroup post from the test set, we tokenize the message subject and body and create a Lucene query over these tokens. We do a search on the index over the training data and use the newsgroup name of the top-scoring document as the classifier response.  The names of the newsgroups in this corpus are: sci.crypt sci.electronics
talk.politics.misc talk.religion.misc talk.politics.guns alt.atheism
talk.politics.mideast soc.religion.christian

This dataset is maintained by Jason Rennie. There are no licensing terms listed and the data may be downloaded directly from:  The bundle 20news-bydate.tar.gz is divided into two sets of posts, a training set comprising roughly 60% of the posts to each group and a test set containing the remaining 40% of the posts. This version of the corpus doesn’t include cross-posts (messages posted to more than one newsgroup) and the newsgroup-identifying header lines have been removed.  20news-bydate.tar.gz unpacks into two top-level directories: 20news-bydate-train and 20news-bydate-test. Both of these contain 20 subdirectories of newsgroup data, where the directory name is the same as the newsgroup name, and each of these contains a set of newsgroup messages, roughly 400—600 posts for the training sets and roughly 250—400 posts for the test sets.

The Program

The program constructs the Lucene index from the training data and then uses this index to do a first-best classification of the test data. The classification results are presented as a confusion matrix that shows the full response profile of the classifiers. This is a 20-by-20 matrix where both the rows and columns are newsgroup names. Each row represents the performance of the classifier on posts from a given newsgroup, and each column represents the classifier response, so that reading across a row, we can see the number of posts given a particular category label. The classifier also reports the percentage of posts in the test set that were correctly labeled.

The source bundle contains the source code and the Lucene jar files as well as the file 20news-bydate.tar.gz.  The source code is in directory src/applucene which contains an Ant build.xml file. The program is in subdirectory src/applucene/src/com/colloquial/applucene. The Ant target classify runs the classification program over the 20 newsgroups data.

Building the Index

We’ve already discussed the method buildIndex in previous post on Lucene 4. This method builds a Lucene index over all the documents in the training directory. It takes two arguments: the directory for the Lucene index and a the directory containing the training data newsgroup posts.

void buildIndex(File indexDir, File trainDir) 
    throws IOException, FileNotFoundException {
    Directory fsDir =;
    IndexWriterConfig iwConf 
        = new IndexWriterConfig(VERSION,mAnalyzer);

    IndexWriter indexWriter
        = new IndexWriter(fsDir,iwConf);

    File[] groupsDir = trainDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            Document d = new Document();
            d.add(new StringField("category",
            d.add(new TextField("text",
            d.add(new TextField("text",

The first four statements in buildIndex create a that stores the index as a file-system directory and an IndexWriter which does the work of analyzing documents and adding them to the index. The behavior of the IndexWriter is specified by an IndexWriterConfig object. Documents are indexed via the IndexWriter‘s default Analyzer. At the outset of the program we define a static final Lucene Version enum constant named VERSION. In order to ensure that the tokenization applied to the query matches the tokenization used to build the index, we use the member variable mAnalyzer to hold the reference to a LuceneAnalyzer. We supply these as arguments to the IndexWriterConfig constructor. We call the setOpenMode method with the enum constant IndexWriterConfig.OpenMode.CREATE, which causes the index writer to create a new index or overwrite an existing one.

For each newsgroup sub-directory, all posts in that directory are parsed into a NewsPost object, which is a simple domain model of a newsgroup post that holds the newsgroup name, subject, body, and filename (a string of numbers). The NewsPost object is processed into a Lucene Document that has two fields: a StringField named category for the newsgroup name and a TextField named text that holds the message subject and message body.

Using the Index for Classification

To use Lucene to classify a newsgroup post from the test set, we tokenize the message subject and body and create a Lucene query over these tokens. We do a search on the index over the training data and use the newsgroup name of the top-scoring document as the classifier response.

The method buildQuery creates a Lucene query from the subject and body text of the test message by tokenizing the message subject and body. The Lucene query ORs together all terms (up to a maximum of 1,024) into a BooleanQuery using the BooleanClause enum value SHOULD.

BooleanQuery buildQuery(String text) throws IOException {
    BooleanQuery termsQuery = new BooleanQuery();
    Reader textReader = new StringReader(text);
    TokenStream tokStream 
        = mAnalyzer.tokenStream("text",textReader);
    try {
        CharTermAttribute terms = 
        int ct = 0;
        while (tokStream.incrementToken() 
               && ct++ < 1024) {
                add(new TermQuery(new Term("text",
    } finally {
    return termsQuery;

As we process each newsgroup in the test set, we fill in the confusion matrix, one row at a time. To index the rows and columns of the matrix by newsgroup name, we define a String array of newsgroup names called NEWSGROUPS:

public static final String[] NEWSGROUPS
    = { "alt.atheism",

The method testIndex iterates over the sets of newsgroup posts in the test set and tabulates the results in a 20-by-20 confusion matrix.

void testIndex(File indexDir, File testDir)
    throws IOException, FileNotFoundException {
    Directory fsDir =;
    DirectoryReader reader =;
    IndexSearcher searcher = new IndexSearcher(reader);

    int[][] confusionMatrix
        = new int[NEWSGROUPS.length][NEWSGROUPS.length];
    File[] groupsDir = testDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        int rowIdx = Arrays.binarySearch(NEWSGROUPS,groupName);

        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            BooleanQuery termsQuery 
                = buildQuery(post.subject() 
                             + " " + post.body());
            // only get first-best result
            TopDocs hits =,1);
            ScoreDoc[] scoreDocs = hits.scoreDocs;
            for (int n = 0; n < scoreDocs.length; n++) {
                ScoreDoc sd = scoreDocs[n];
                int docId = sd.doc;
                Document d = searcher.doc(docId);
                String category = d.get("category");
                // record result in confusion matrix
                int colIdx 
                    = Arrays.binarySearch(NEWSGROUPS,category);
        for (int i=0; i<NEWSGROUPS.length; i++) 
             System.out.printf("| %4d ", confusionMatrix[rowIdx][i]);

Evaluating the Classifier and Comparing Tokenization Strategies

The ClassifyNewsgroups program allows the user to specify the analyzer used. This allows us to see how different tokenization strategies affect classification. The three analyzers available are: the Lucene StandardAnalyzer, which chains together a StandardTokenizer that breaks the text into a stream of words, discarding whitespace and punctuation, followed by a StandardFilter, a LowerCaseFilter, and then a StopFilter, which uses a list of English stop words; an analyzer that chains together just a StandardTokenizer and a LowerCaseFilter; and an Analyzer that uses only a Lucene NGramTokenizer (from package oal.analysis.ngram) to index the data using 4-grams.

Here is the top-left quadrant of the confusion matrix, with the results for the first 10 newsgroups of running the classifier using the Lucene StandardAnalyzer:

ath grphx oswin ibm mac win-x sale auto cycle bb
ath 247 1 0 2 0 1 0 1 2 0
grphx 1 301 14 4 4 28 2 1 2 1
oswin 3 54 182 28 43 42 3 0 1 1
ibm 1 25 27 249 36 9 9 1 3 0
mac 0 31 10 25 257 6 7 3 3 0
win-x 0 43 7 1 7 313 1 2 4 3
sale 0 17 5 24 40 9 230 12 13 1
auto 1 7 5 0 1 0 6 324 9 0
cycle 1 0 1 0 0 3 0 13 341 3
bb 0 3 1 0 0 1 0 2 2 336

The cells on the diagonal show that in general, this classifier performs well on the data. Posts to alt.atheism are correctly classified in 247 out of 319 cases, at a rate of roughly 75 percent, whereas a classifier which always guessed at random, choosing uniformly among the newsgroups, would be correct only 5 percent of the time. In comparing the classification rates for the posts to the different computer newsgroups, we see a certain amount of confusion between the different kinds of computers and operating systems, as well as confusion with the listings in the newsgroup however, there is little confusion between the posts to the computer newsgroup and alt.atheism or with

To evaluate how well this program is able to identify the newsgroup to which a post belongs, we did several runs over the 20 Newsgroups dataset using different tokenization strategies and tabulated the results. The table below compares the correct classification rates for the three different analyzers for all newsgroups.

Newsgroup name Standard Analyzer Lowercase only N-Grams length 4
alt.atheism 0.77 0.80 0.74 0.77 0.77 0.62 0.46 0.52 0.60 0.63 0.66 0.59
comp.sys.mac.hardware 0.67 0.66 0.57 0.79 0.79 0.57 0.59 0.60 0.58 0.82 0.82 0.70 0.86 0.86 0.83 0.85 0.86 0.78 0.91 0.92 0.86
sci.crypt 0.91 0.89 0.85
sci.electronics 0.61 0.64 0.61 0.67 0.70 0.66 0.79 0.81 0.74
soc.religion.christian 0.64 0.65 0.78
talk.politics.guns 0.74 0.75 0.76
talk.politics.mideast 0.79 0.81 0.74
talk.politics.misc 0.56 0.55 0.62
talk.religion.misc 0.59 0.59 0.66

All classifiers work pretty well. No one classifier works best for all categories, reflecting the difference in the kinds of vocabulary and writing styles used across the different newsgroups.

N-Grams can be very effective for both regular search and classification, however the storage requirements for n-grams are considerably greater than for word-based tokenization. The size of index over the 20 Newsgroups training data is 7 MB when the Lucene StandardAnalyzer is used, 8 MB when the lowercase only analyzer is used, and 24 MB for the length-4 n-grams.

Lucene 4 Essentials for Text Search and Indexing

March 8, 2014

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Lucene Overview

Apache Lucene is a search library written in Java. It’s popular in both academic and commercial settings due to its performance, configurability, and generous licensing terms. The Lucene home page is

Lucene provides search over documents. A document is essentially a collection of fields. A field consists of a field name that is a string and one or more field values. Lucene does not in any way constrain document structures. Fields are constrained to store only one kind of data, either binary, numeric, or text data. There are two ways to store text data: string fields store the entire item as one string; text fields store the data as a series of tokens. Lucene provides many ways to break a piece of text into tokens as well as hooks that allow you to write custom tokenizers. Lucene has a highly expressive search API that takes a search query and returns a set of documents ranked by relevancy with documents most similar to the query having the highest score.

The Lucene API consists of a core library and many contributed libraries. The top-level package is org.apache.lucene, which is abbreviated as oal in this article. As of Lucene 4, the Lucene distribution contains approximately two dozen package-specific jars, e.g.: lucene-core-4.7.0.jar, lucene-analyzers-common-4.7.0.jar, lucene-misc-4.7.0.jar. This cuts down on the size of an application at a small cost to the complexity of the build file.

A Lucene Index is an Inverted Index

Lucene manages an index over a dynamic collection of documents and provides very rapid updates to the index as documents are added to and deleted from the collection. An index may store a heterogeneous set of documents, with any number of different fields that may vary by document in arbitrary ways. Lucene indexes terms, which means that Lucene search is search over terms. A term combines a field name with a token. The terms created from the non-text fields in the document are pairs consisting of the field name and field value. The terms created from text fields are pairs of field name and token.

The Lucene index provides a mapping from terms to documents.  This is called an inverted index because it reverses the usual mapping of a document to the terms it contains.  The inverted index provides the mechanism for scoring search results: if a number of search terms all map to the same document, then that document is likely to be relevant.

Here are three entries from an index over part of the The Federalist Papers, a collection of 85 political essays which contains roughly 190,000 word instances over a vocabulary of about 8,000 words.  A field called text holds the contents of each essay, which have been tokenized into words, all lowercase, no punctuation.  The inverted index entries for the terms consisting of field name text and tokens abilities, able, and abolish are:

term document
text:abilities 2, 17, 21, 22, 35, 64, 76
text:able 1, 3, 4, 7, 8, 9, 10, 11, 12, 15, 16, 21, 23, 24,
25, 27, 29, 30, 33, 34, 35, 36, 37, 38, 41, 43, 46, 49,
51, 52, 58, 62, 63, 64, 67, 68, 70, 71, 75, 78, 85
text:abolish 10, 14, 26, 31, 39, 49, 45, 70, 71, 78, 81

Note that the document numbers here are Lucene’s internal references to the document.  These ids are not stable; Lucene manages the document id as it manages the index and the internal numbering may change as documents are added to and deleted from the index.

Lucene Versions and Version Numbers

The current Apache Lucene Java release is version 4.7, where 4 is the major version number and 7 is the minor version number.  The Apache odometer rolled over to 4.6.0 in November, 2013 and just hit 4.7.0 on February 26, 2014.  At the time that I wrote the Lucene chapter of Text Processing in Java, the current version was 4.5.  Minor version updates maintain backwards compatibility for the given major version therefore, all the example programs in the book compile and run under version 4.7 as well.

The behavior of many Lucene components has changed over time.  In particular, the index file format is subject to change from release to release as different methods of indexing and compressing the data are implemented.   To address this, the Enum class oal.util.Version was introduced in Lucene 3.  A Version instance identifies the major and minor versions of Lucene. For example, LUCENE_45 identifies version 4.5.  The Lucene version is supplied to the constructor of the components in an application.  As of Lucene 4.7, older versions have been deprecated, so the compiler issues a warning when older versions are specified.  This means that the examples from the book, which specify version 4.5, generate a compiler warning when compiled under version 4.7.

There’s no requirement that all components in an application be of the same version however, for components used for both search and indexing, it is critical that the Lucene version is the same in the code that is called at indexing time and the code that is called at search time.  A Lucene index knows what version of Lucene was used to create it (by using the Lucene Version enum constant).  Lucene is backward compatible with respect to searching and maintaining old index versions because it includes classes that can read and write all versions of the index up through the current release.

Lucene Indexes Fields

Conceptually, Lucene provides indexing and search over documents, but implementation-wise, all indexing and search is carried out over fields.  A document is a collection of fields.  Each field has three parts: name, type, and value.  At search time, the supplied field name restricts the search to particular fields.

For example, a MEDLINE citation can be represented as a series of fields: one field for the name of the article, another field for name of the journal in which it was published, another field for the authors of the article, a pub-date field for the date of publication, a field for the text of the article’s abstract, and another field for the list of topic keywords drawn from Medical Subject Headings (MeSH).  Each of these fields is given a different name, and at search time, the client could specify that it was searching for authors or titles or both, potentially restricting to a date range and set of journals by constructing search terms for the appropriate fields and values.

The Lucene API for fields has changed across the major versions of Lucene as the functionality and organization of the underlying Lucene index have evolved.  Lucene 4 introduces a new interface oal.index.IndexableField, which is implemented by class oal.document.Field.  Lucene 4 also introduces datatype-specific subclasses of Field that encapsulate indexing and storage details for common use cases.  For example, to index integer values, use class oal.document.IntField, and to index simple unanalyzed strings (keywords), use oal.document.StringField.  These so-called sugar subclasses are all final subclasses.

The field type is an object that implements oal.index.IndexableFieldType.  Values may be text, binary, or numeric.  The value of a field can be indexed for search or stored for retrieval or both.  The value of an indexed field is processed into terms that are stored in the inverted index.  The raw value of a stored field is stored in the index in a non-inverted manner.  Storing the raw values allows you to retrieve them at search time but may consume substantial space.

Indexing and storage options are specified via setter methods on oal.document.FieldType.  These include the method setIndexed(boolean), which specifies whether or not to index a field, and the method setTokenized(boolean), which specifies whether or not the value should be tokenized.  The method setOmitNorms(boolean) controls how Lucene computes term frequency.  Lucene’s default behavior is to represent term frequency as a proportion by computing the ratio of the number of times a term occurs to the total number of terms in the document, instead of storing a simple term frequency count.  To do this calculation it stores a normalizing factor for each field that is indexed.  Calling method setOmitNorms with value true turns this off and the raw term frequency is used instead.

Some indexing choices are interdependent.  Lucene checks the values on a FieldType object at Field construction time and throws an IllegalArgumentException if the FieldType has inconsistent values.  For example, a field must be either indexed or stored or both, so indexed and/or stored must be true.  If indexed is false, then stored must be true and all other indexing options should be set to false.  The following code fragment defines a custom FieldType and then creates a Field of this type:

FieldType myFieldType = new FieldType();
Field myField = new Field("field name",
                          "field value",

The source code of the subclasses of Field provides good examples of how to define a FieldType.

Analyzing Text into Tokens

Search and indexing over text fields require processing text data into tokens.  The package oal.analysis contains the base classes for tokenizing and indexing text.  Processing may consist of a sequence  of transformations , e.g., whitespace tokenization, case normalization, stop-listing, and stemming.

The abstract class oal.analysis.TokenStream breaks the incoming text into a sequence of tokens that are retrieved using an iterator-like pattern.  TokenStream has two subclasses: oal.analysis.Tokenizer and oal.analysis.TokenFilter.   A Tokenizer takes a as input whereas a TokenFilter takes another oal.analysis.TokenStream as input.  This allows us to chain together tokenizers such that the initial tokenizer gets its input from a reader and the others operate on tokens from the preceding TokenStream in the chain.

An oal.analysis.Analyzer supplies the indexing and searching processes with TokenStreams on a per-field basis.  It maps field names to tokenizers and may also supply a default analyzer for unknown field names.  Lucene includes many analysis modules that provide concrete implementations of different kinds of analyzers.  As of Lucene 4, these modules are bundled into separate jarfiles.  There are several dozen language-specific analysis packages, from for Arabic to for Turkish.  The package oal.analysis.core provides several general-purpose analyzers, tokenizers, and tokenizer factory classes.

The abstract class oal.analysis.Analyzer contains methods used to extract terms from input text.  Concrete subclasses of Analyzer must override the method createComponents, which returns an object of the nested class TokenStreamComponents that defines the tokenization process and provides access to initial and file components of the processing pipeline.  The initial component is a Tokenizer that handles the input source.  The final component is an instance of TokenFilter and it is the TokenStream returned by the method Analyzer.tokenStream(String,Reader).  Here is an example of a custom Analyzer that tokenizes its inputs into individual words with all letters lowercase.

Analyzer analyzer = new Analyzer() { 
    @Override protected TokenStreamComponents 
    createComponents(String fieldName, Reader reader) {
        Tokenizer source = 
            new StandardTokenizer(VERSION,reader);
        TokenStream filter = 
            new LowerCaseFilter(VERSION,source);
        return new TokenStreamComponents(source, filter);

Note that the constructors for the oal.analysis.standard.StandardTokenizer and oal.analysis.core.LowerCaseFilter objects require a Version argument.  Further note that package oal.analysis.standard is distributed in the jarfile lucene-analyzers-common-4.x.y.jar, where x and y are the minor version and release number.

Indexing Documents

Document indexing consists of first constructing a document that contains the fields to be indexed or stored, then adding that document to the index.  The key classes involved in indexing are oal.index.IndexWriter, which is responsible for adding documents to an index, and, which is the storage abstraction used for the index itself.  Directories provide an interface that’s similar to an operating system’s file system.  A Directory contains any number of sub-indexes called segments.  Maintaining the index as a set of segments allows Lucene to rapidly update and delete documents from the index.

The following example shows how to create a Lucene index given a directory containing a set of data files.  The data in this example is taking from the 20 Newsgroups corpus a set of roughly 20,000 messages posted to 20 different newsgroups.  This code is excerpted from the example program in section 7.11 of Text Processing in Java that shows how to do document classification using Lucene.  We’ll get to document classification in a later post.  Right now we just want to go over how to build the index.

void buildIndex(File indexDir, File dataDir) 
    throws IOException, FileNotFoundException {

    Directory fsDir =;
    IndexWriterConfig iwConf 
        = new IndexWriterConfig(VERSION,mAnalyzer);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,iwConf);

    File[] groupsDir = dataDir.listFiles();
    for (File group : groupsDir) {
        String groupName = group.getName();
        File[] posts = group.listFiles();
        for (File postFile : posts) {
            String number = postFile.getName();
            NewsPost post = parse(postFile, groupName, number);
            Document d = new Document();
            d.add(new StringField("category",
            d.add(new TextField("text",
            d.add(new TextField("text",


The method buildIndex walks over the training data directory and parses the newsgroup messages into a NewsPost object, which is a simple domain model of a newsgroup post, consisting of the newsgroup name, subject, body, and filename (a string of numbers).  We treat each post as a Lucene document consisting of two fields: a StringField named category for the newsgroup name and a TextField named text that holds the message subject and message body.  Method buildIndex takes two arguments: the directory for the Lucene index and a directory for one set of newsgroup posts where the directory name is the same as the Usenet newsgroup name, e.g.:

The first four statements in buildIndex create a that stores the index as a file-system directory and an IndexWriter which does the work of analyzing documents and adding them to the index.  The behavior of the IndexWriter is specified by an IndexWriterConfig object.  Documents are indexed via the IndexWriter‘s default Analyzer.  At the outset of the program we define a static final Lucene Version enum constant named VERSION. and a Lucene Analyzer named mAnalyzer.  We supply these as arguments to the IndexWriterConfig constructor.  We call the setOpenMode method with the enum constant IndexWriterConfig.OpenMode.CREATE, which causes the index writer to create a new index or overwrite an existing one.

A for loop iterates over all files in the data directory.  Each file is first parsed into a Java NewsPost and a corresponding Lucene document is created and fields are added to the document.  This document has two fields: category and text.  Newsgroup names are stored as simple unanalyzed strings.  The field named text stores both the message subject and message body.  A document may have multiple values for a given field.  Search over that field will be over all values for that field, however phrase search over that field will not match across different field values.  When theIndexWriter.addDocument(Document) method is called, the document is added to the index and the constituent fields are indexed and stored accordingly.

The final two statements invoke the IndexWriter‘s commit and close methods, respectively.  The commit method writes all pending changes to the directory and syncs all referenced index files so that the changes are visible to index readers.  Lucene now implements a two-phase commit so that if the commit succeeds, the changes to the index will survive a crash or power loss.  A commit may be expensive, and part of performance tuning is determining when to commit as well as how often and how aggressively to merge the segments of the index.

Document Search and Search Ranking

The Lucene search API takes a search query and returns a set of documents ranked by relevancy with documents most similar to the query having the highest score.  Lucene provides a highly configurable hybrid form of search that combines exact boolean searches with softer, more relevance-ranking-oriented vector-space search methods.  All searches are field specific, because Lucene indexes terms and a term is composed of a field name and a token.

Search Queries

Lucene specifies a language in which queries may be expressed.  For instance, computer NOT java produces a query that specifies the term computer must appear in the default field and the term java must not appear.  Queries may specify fields, as in text:java, which requires the term java to appear in the text field of a document.

The query syntax includes basic term and field specifications, modifiers for wildcard, fuzzy, proximity, or range searches, and boolean operators for requiring a term to be present, absent, or for combining queries with logical operators.  Finally, sub-queries may be boosted by providing numeric values to raise or lower their prominence relative to other parts of the query. The full syntax specification is in the package level javadoc for package oal.queryparser.classic.  A bonus feature of Text Processing in Java is a quick one-page reference guide to Lucene’s search query syntax (see Figure 7.2).

Queries may be constructed programmatically using the dozen or so built-in implementations of the abstract base class.  The most basic kind of query is a search for a single token on a single field, i.e., a single term.  This query is implemented in Lucene’s class.  A term query is constructed from a Term object, which is constructed from a field name and text for the term, both specified as strings.

Search Scoring

Lucene’s default search scoring algorithm weights results using TF—IDF, term frequency—inverse document frequency.  Term frequency means that high-frequency terms within a document have higher weight than do low-frequency terms.  Inverse document frequency means that terms that occur frequently across many documents in a collection of documents are less likely to be meaningful descriptors of any given document in a corpus and are therefore down-weighted.  This filters out common words.

As of Lucene 4, the API provides alternative scoring algorithms and a pluggable architecture that allows developers to build their own custom scoring models.

Search Example

The following program illustrates the basic sequence of search operations.  This program is a simplified version of the Lucene search example program included with the book.  When run from the command line, this program takes three arguments: the path to a Lucene index; a query string; the maximum number of results to return.  It runs the specified query against the specified index and prints out the rank, score, and internal document id for all documents that match the query.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.util.Version;


public class LuceneSearch {
    public static void main(String[] args) 
        throws ParseException, CorruptIndexException,
               IOException {

        File indexDir = new File(args[0]);
        String query = args[1];
        int maxHits = Integer.parseInt(args[2]);

        Directory fsDir =;
        DirectoryReader reader =;
        IndexSearcher searcher = new IndexSearcher(reader);

        Analyzer stdAn 
            = new StandardAnalyzer(Version.LUCENE_45);
        QueryParser parser 
            = new QueryParser(Version.LUCENE_45,"text",stdAn);
        Query q= parser.parse(query);

        TopDocs hits =,maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;
        System.out.println("hits=" + scoreDocs.length);
        System.out.println("Hits (rank,score,docId)");
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            float score = sd.score;
            int docId = sd.doc;
            System.out.printf("%3d %4.2f %d\n",
                              n, score, docId);

The main method is declared to throw a Lucene corrupt index exception if the index isn’t well formed, a Lucene parse exception if the query isn’t well formed, and a general Java I/O exception if there is a problem reading from or writing to the index directory.  It starts off by reading in command-line arguments, then it creates a Lucene directory, index reader, index searcher, and query parser, and then it uses the query parser to parse the query.

Lucene uses instances of the aptly named IndexReader to read data from an index, in this example, we use an instance of class oal.index.DirectoryReader.  The class performs the actual search.  Every index searcher wraps an index reader to get a handle on the indexed data.  Once we have an index searcher, we can supply queries to it and enumerate results in order of their score.

The class oal.queryparser.classic.QueryParserBase provides methods to parse a query string into a object or throw a oal.queryparser.classic.ParseException if it is not.  All tokens are analyzed using the specified Analyzer.  This example is designed to search an index for Lucene version 4.5 that was created using an analyzer of class oal.analysis.standard.StandardAnalyzer which contains text a field named text.

It is important to use the same analyzer in the query parser as is used in the creation of the index.  If they don’t match, queries that should succeed will fail should the different analyzers produce differing tokenizations given the same input.  For instance, if we apply stemming to the contents of text field text during indexing and reduce the word codes to code, but we don’t apply stemming to the query, then a search for the word codes in field text will fail. The search term is text:codes but the index contains only the term text:code.  Note that package oal.queryparser.classic is distributed in the jarfile lucene-queryparser-4.x.y.jar.

The actual search is done via a call to the IndexSearcher‘s search method, which takes two arguments: the query and an upper bound on the number of hits to return.  It returns an instance of the Lucene class  The TopDocs result provides access to an array of search results.  Each result is an instance of the Lucene class, which encapsulates a document reference with a floating point score.  The array of search results is sorted in decreasing order of score, with higher scores representing better matches.  For each ScoreDoc object, we get the score from the public member variable score.  We get the document reference number (Lucene’s internal identifier for the doc) from the public member variable doc.  The document identifier is used to retrieve the document from the searcher.  Here, we’re just using the id as a diagnostic to illustrate the basics of search and so we just print the number, even though it is only an internal reference.  Finally, we close the IndexReader.

Discussion and Further Reading

This post tries to cover the essentials of Lucene 4 in a very short amount of space.  In order to do this, this post contains minimal amounts of examples, and links to the Lucene javadocs have been substituted for detailed explanation of how classes and methods behave.  Putting these code fragments together into a full application is left as an exercise to the reader.

Back when Lucene was at version 2.4 I wrote my first blog post titled “Lucene 2.4 in 60 Seconds.”  A more accurate title would have been “Lucene 2.4 in 60 Seconds or More.”  Likewise, a more accurate title for this post would be “The Essential Essentials of Text Search and Indexing with Lucene 4” but that’s just not very snappy.  For more on the essentials of search and indexing with Lucene, please check out Chapter Seven of Text Processing in Java, (and all the other chapters as well).

To cover all of Lucene would take an entire book, and there are many good ones out there. My favorite is Lucene in Action, Second Edition, which, alas, only covers version 3.

New Book: Text Processing in Java

January 31, 2014

I’m pleased to announce the publication of Text Processing in Java !
Text Processing in Java

This book teaches you how to master the subtle art of multilingual text processing and prevent text data corruption.  Data corruption is the all-too-common problem of words that are garbled into strings of question marks, black diamonds, or random glyphs.  In Japanese this is called mojibake (“character change”), written 文字化け, but on your browser it might look like this: �����  or this: 文字化ã‘. When this happens, pinpointing the source of error can be surprisingly difficult and time-consuming. The information and example programs in this book make it easy.

This book also provides an introduction to natural language processing using Lucene and Solr. It covers the tools and techniques necessary for managing large collections of text data, whether they come from news feeds, databases, or legacy documents. Each chapter contains executable programs that can also be used for text data forensics.

Topics covered:

  • Unicode code points
  • Character encodings from ASCII and Big5 to UTF-8 and UTF-32LE
  • Character normalization using International Components for Unicode (ICU)
  • Java I/O, including working directly with zip, gzip, and tar files
  • Regular expressions in Java
  • Transporting text data via HTTP
  • Parsing and generating XML, HTML, and JSON
  • Using Lucene 4 for natural language search and text classification
  • Search, spelling correction, and clustering with Solr 4

Other books on text processing presuppose much of the material covered in this book. They gloss over the details of transforming text from one format to another and assume perfect input data. The messy reality of raw text will have you reaching for this book again and again.

Buy Text Processing in Java on Amazon

The Latin1 Transcoding Trick for Ant

April 21, 2013

A while back Bob blogged about The Latin1 Transcoding Trick for Java Servlets, etc.

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte.

Another situation where this trick comes in real handy is dealing with the way that Ant compiles its logfiles.

If, like me, you’re fond of debug-by-printf and you use Ant to compile and run your programs, then you might have run into the problem that has given rise to many StackOverflow queries, that is, when you use an Ant task to run the program and instrument your code with print statements to standard out, Ant replaces non-ASCII characters with a question mark. When the problem you’re trying to debug is making sure that non-ASCII characters are being processed correctly, this is both misleading and maddening. The standard advice on StackOverflow is to set the shell environment variable ANT_OPTS using the following incantation (for bash shell):

export ANT_OPTS="-Dfile.encoding=UTF-8"

This works as long as you’re working with UTF-8 encoded character data and your terminal’s encoding is set to UTF-8 as well. Here a solution that works no matter what character encoding is in play:

export ANT_OPTS="-Dfile.encoding=Latin1"

It’s the ol’ Latin1 transcoding trick!

Of course you already know about character encodings . Do you know about Ant’s IO System? Here’s what Ant contributor Conor MacNeill says:

The Ant IO system is designed to associate any output sent to System.out and System.err with the task that generated the output and to log the output accordingly.

Ant’s Main class installs its own output streams into System.out and System.err. These streams are instances of DemuxOutputStreams

Using the source code for Ant 1.9.0, in class we see that System.In, System.Out, and System.Err are all reassigned to Ant’s DemuxInputStream and DemuxOutputStream, which extend InputStream and OutputStream, respectively:

System.setIn(new DemuxInputStream(project));
System.setOut(new PrintStream(new DemuxOutputStream(project, false)));
System.setErr(new PrintStream(new DemuxOutputStream(project, true)));

The call to the PrintStream constructor is the one-arg constructor PrintStream(OutputStream out). Because no file encoding is specified, the encoding used is the default charset for the JVM that’s running Ant. This is specified by the system property file.encoding. This property varies depending on your platform and locale. To check this, try this on your machine:

public class GetDefaultEncoding {
    public static void main(String[] args) {

On my Mac running OS-X the default is US-ASCII (because the default locale on this machine is en_US). On my Windows XP machine the default is Cp1252 (Windows Latin1 which differs from ISO_8859-1 just enough to be noticeable).

At the point where Ant’s DemuxInputStream reads in the bytes sent to System.out by a Java task, any character outside of the default character set is replaced by a question mark. When Latin1 is the default encoding, all bytes are valid Latin1 characters and their Unicode code point value is the same as the byte value so the bytes from the Java task pass through the Ant IO system unchanged.

As long as the next process in the chain (e.g. the terminal app) is configured to handle whatever encoding your text data is in, you’re good to go.

Mystery Novel with Natural Language Processing

October 24, 2012

For those of you who like mystery novels, Mitzi’s just written one. The added bonus for readers of this blog is that there’s natural language processing involved in the detective work (I don’t want to give too much away, so I can’t tell you how).

Mitzi Morris, Poetic Justice Cover

Poetic Justice is in the cozy mystery sub-genre, where the focus is on the amateur sleuths and their milieu, not on grisly multiple homicides.

The Back-Cover Blurb

Once you’ve made it in Manhattan, why would you be caught dead in Staten Island? That’s what Jay Alfred, editor-in-chief of Ars Longa Press, can’t understand. Jay and his partner Ken live on the best block in Chelsea. They’re an attractive pair of opposites. Jay would never stoop to snoop. Ken exercises his right to know every chance he gets.

When Sheba Miller, literary agent and downtown doyenne, is found dead in a bar on Staten Island, Ken can’t wait to investigate. He hustles Jay onto the Staten Island Ferry and into adventure. Then Sheba’s tell-all memoir surfaces. It’s a catalog of white nights with hot artists and liquid lunches with idiot publishers. Jay’s the idiot-in-chief, but he’s not alone. It’s a good thing that Sheba’s dead, because half of literary New York is ready to kill her.

That’s not the only book in town and Ken’s not the only amateur detective. Sheba’s old friends and lovers and the junior members of Ars Longa are all ready and willing to explore New York City and beyond in search of authors, books, killers, and a killer martini.

Look Inside!

If you follow the Amazon link below, the first six chapters are available free online through Amazon’s “Look Inside!” feature.

Early Reviews

For what it’s worth, at least twenty people have read it and said they enjoyed it. Some (including me) are already clamoring for the second book in the series. Other mystery writers told Mitzi this would happen; luckily for us fans, she already has two follow-ons in the pipeline.


  • Publisher:  Colloquial Media
  • Language:  English
  • Pages:  324
  • ISBN-10: 0-9882087-0-9 (Paperback)
  • ISBN-13: 978-0-9882087-0-4 (Kindle)


On Amazon, it’s eligible for the 4-for-3 deal (order 4 books, get the cheapest one free).

It’s also available from Amazon UK.

If you want a review copy, add a comment to this post or send me e-mail at

Using Luke the Lucene Index Browser to develop Search Queries

July 24, 2012

Luke is a GUI tool written in Java that allows you to browse the contents of a Lucene index, examine individual documents, and run queries over the index. Whether you’re developing with PyLucene, Lucene.NET, or Lucene Core, Luke is your friend.

Downloading, running Luke

Downloads are available from You can download the source or just the standalone binary .jar file. Right now we’re using the just-released Luke 4.0.0-ALPHA standalone binary. To run Luke, just fire up the jar file from the command line:

> java -jar lukeall-4.0.0-ALPHA.jar

Once launched, Luke opens with a dialog menu to open an index. We’re using an index called fed-papers.idx, which is an index over the Federalist Papers, a set of 85 op-ed pieces written by James Madison, Alexander Hamilton, and/or John Jay. This is taken from the example in our recently updated Lucene 3 tutorial. To build the index yourself, download the data and accompanying source code and follow the instructions in the tutorial. Each of the 85 papers is treated as a document. The text of the document has been indexed using Lucene’s StandardAnalyzer and is stored in a field named text.

The following is a screenshot of the Luke GUI upon first opening the index. We’ve circled the main controls on the top left corner of the GUI. Luke opens with the Overview tab pane.

All 85 documents have a similar structure. They start with an identifying number, e.g. FEDERALIST No. 1. The opening salutation is: To the People of the State of New York:. Looking at the top terms in the index, we see that there are 85 documents which contain the words {federalist, people, state, york} as well as common function words. The StandardAnalyzer has stop-listed other function words, including {to, of, the, no}. Otherwise, these too would be among the top terms with a document frequency of 85.

Exploring Document Indexing

The Document tab pane lets you examine individual documents in the index. In the screenshot below we have selected the 47th document in the index, FEDERALIST No. 48. Circled in red are the series of controls used to get to this point. First we chose the Document tab pane, and then used the “Browse by document number” control to choose document 47. The bottom half of the window shows the document fields.

Clicking on the “Reconstruct & Edit” control opens a new pop-up window which allows us to inspect the document contents on a field-by-field basis. Below we show side-by-side screenshots of the text field. On the left we see the raw text as stored; on the right we see the result of tokenization and indexing via the Lucene StandardAnalyzer. Lucene’s StandardAnalyzer includes a StandardTokenizer, StandardFilter, Tokenization happens first. This removes punctuation and assigns each token a position. Then the tokens are lower-cased and stop-listed. LowerCaseFilter and StopFilter.

We have circled the text of the opening salutation: To the People of the State of New York:. It consists of 9 words followed by 1 punctuation symbol. The tokenizer strips out the final colon. The stop-list filter removes the tokens at positions {1, 2, 4, 5, 7}. The text of tokens which have been indexed are displayed as is. Tokens which were stop-listed are missing from the index, so Luke displays the string null followed by the number of consecutive stop-listed tokens.

Exploring Search

The Search tab pane packs a lot of controls. The annotated screenshot below shows the results of running a search over the index.

In the top right quadrant is an embedded tab pane (tabs circled in red). On the Analysis tab, we have selected the StandardAnalyzer from a pull-down list of Analyzers and selected the default search field from a pull-down list of all fields in the index. Given this, Luke constructs the QueryParser used to parse queries entered into the text box in the upper left quadrant. We entered the words: “Powers of the Judiciary” into this text box, circled in red. Directly below is it the parsed query, also circled in red. Clicking on the “Search” control runs this search over the index. The bottom pane displays the ranked results.

Luke passes the input string from the search text box to the QueryParser which parses it using the specified analyzer and creates a set of search terms, one term per token, over the specified default field. In this example, the StandardAnalyzer tokenizes, lowercases, and stop-lists these words, resulting in two search terms text:powers and text:judiciary. The result of this search pulls back all documents which contain the words powers and/or judiciary.

To drill down on Lucene search, we use the Explain Structure control, circled in red on the following screenshot.

When clicked, this pops up another pane which shows all details of the query constructed from the search expression. The structure of the query used in this example is:

  clauses=2, maxClauses=1024
  Clause 0: SHOULD
      Term: field='text' text='powers'
  Clause 1: SHOULD
      Term: field='text' text='judiciary'

The search expression can be any legal search string according to the Lucene Query Parser Syntax. To search for the phrase powers of the judiciary we need to enclose it in double quotes. But this new search produces no results, which is clearly wrong, since this phrase occurs in the title of FEDERALIST No. 80 (see search results above).

In the query details we see that the input has been parsed into text:"powers judiciary". The Explain Structure feature returns the following:

lucene.PhraseQuery, slop=0
  pos: [0,1]
  Term 0: field='text' text='powers'
  Term 1: field='text' text='judiciary'

The problem here is that the token positions specified don’t allow for the intervening stop-listed words. To correct this, we need to adjust Luke’s default settings on the QueryParser. The screenshot below shows both the controls changed and the query results.

First we go to the QueryParser tab in the set of search controls in the top right quadrant, then we click the checkbox labeled Enable positionIncrements. Now the parsed query looks like this: text:"powers ? ? judiciary", which translates to the following programmatic query:

lucene.PhraseQuery, slop=0
  pos: [0,3]
  Term 0: field='text' text='powers'
  Term 1: field='text' text='judiciary'

Finally, we select the single search result, and click on the Explain control (circled in red). This pops up another window which explains the query score (outlined in red). Here is the text of the explanation:

0.0847  weight(text:"powers ? ? judiciary" in 79) [DefaultSimilarity], result of:
  0.0847  fieldWeight in 79, product of:
    1.0000  tf(freq=1.0), with freq of:
      1.0000  phraseFreq=1.0
    3.6129  idf(), sum of:
      1.3483  idf(docFreq=59, maxDocs=85)
      2.2646  idf(docFreq=23, maxDocs=85)
    0.0234  fieldNorm(doc=79)

Using the Lucene XML Query Parser

The Lucene API contains many kinds of queries beyond those generated by the QueryParser. You can use Luke to develop these queries as well, via the Lucene XML Query Parser. It is almost impossible to find any on-line documentation on this most excellent contribution to Lucene by Mark Harwood. Luckily, it is distributed with Lucene source tgz files. For Lucene 3.6 the path to this documentation is: lucene-3.6.0/contrib/xml-query-parser/docs/index.html.

Recast in as a Lucene XML Query, our original search term:powers term:judiciary becomes:

 <BooleanQuery fieldName="text">
         <Clause occurs="should">
         <Clause occurs="should">

The screenshot below shows the results of pasting the above XML into the search expression textbox with the QueryParser control Use XML Query Parser turned on. (Note: currently this works in Luke 3.5 but not in Luke 4.0-ALPHA. We hate the bleeding edge).

The XML rewrites to exactly the same query as our first query.

Big deal, you might be saying right about now. That’s a whole lotta chopping for not much kindling. Au contraire! One of the reasons for developing the XML Query Syntax is this:

To bridge the growing gap between Lucene query/filtering functionality and the set of functionality accessible through the standard Lucene QueryParser syntax.

For example, a while back, Mark Miller blogged about Lucene SpanQueries. This is similar to but not the same as PhraseQuery. We can use Luke to compare and contrast the difference between the two. Here’s our earlier phrase query Powers of the Judiciary recast as a span query.

  <SpanNear slop="3" inOrder="true" fieldName="text">		

The screenshot below shows the results of running this query.

FEDERALIST No. 48 scores slightly better on this query than does FEDERALIST No. 80. Understanding why is left as an exercise to the reader.

Lucene Tutorial updated for Lucene 3.6

July 5, 2012

[Update: 8 Mar 2014. I’ve just written a quick introduction to Lucene 4:

The contents of this introduction are excerpted from the Lucene chapter in my new book:

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]

The current Apache Lucene Java version is 3.6, released in April of 2012. We’ve updated the Lucene 3 tutorial and the accompanying source code to bring it in line with the current API so that it doesn’t use any deprecated methods and my, there are a lot of them. Bob blogged about this tutorial back in February 2011, shortly after Lucene Java rolled over to version 3.0.

Like other 3.x minor releases, Lucene 3.6 introduces performance enhancements, bug fixes, new analyzers, and changes that bring the Lucene API in line with Solr. In addition, Lucene 3.6 anticipates Lucene 4, billed as “the next major backwards-incompatible release.”

Significant changes since version 3.0

  • IndexReader delete methods are deprecated and will be removed entirely in Lucene 4. All deletes and updates are done via an IndexWriter.
  • There is a single IndexWriter constructor that takes two arguments: the index directory and an IndexWriterConfig object. The latter was introduced in Lucene 3.1. It holds configuration information that was previously specified directly as additional arguments to the constructor.
  • IndexWriter optimize methods are deprecated. The merge method(s) supply this functionality.

Building the Source

The ant build file is in the file src/applucene/build.xml and should be run from that directory. The book’s distribution is organized this way so that each chapter’s demo code is roughly standalone, but they are able to share libs. There are some minor dependencies on LingPipe in the example (jar included), but those are just for I/O and could be easily removed or replicated. As an added bonus, the source code now includes the data used in the examples throughout the tutorial, the venerable Federalist Papers from Project Gutenberg.

0/1 Loss Meaningless for Predicting Rare Events such as Exploding Manholes

June 14, 2012

[Update: 19 June 2012: Becky just wrote me to clarify which tools they were using for what (quoted with permission, of course — thanks, Becky):

… we aren’t using BART to rank structures, we use an independently learned ranked list to bin the structures before we apply BART. We use BART to do a treatment analysis where the y values represent whether there was an event, then we compute the role that the treatment variable plays in the prediction. Here’s a journal paper that describes our initial ranking method

and the pre-publication version

The algorithm for doing the ranking was modified a few years ago, and now Cynthia is taking a new approach that uses survival analysis.]

Rare Events

Let’s suppose you’re building a model to predict rare events, like manhole explosions in the Con-Ed system in New York (this is the real case at hand — see below for more info). For a different example, consider modeling the probability of a driver getting into a traffic accident in the next week. The problem with both of these situations is that even with all the predictors in hand (last maintenance, number of cables, voltages, etc. in the Con-Ed case; driving record, miles driven, etc. in the driving case), the estimated probability for any given manhole exploding (any person getting into an accident next week) is less than 50%.

The Problem with 0/1 Loss

A typical approach in machine learning in general, and particularly in NLP, is to use 0/1 loss. This forces the system to make a simple yea/nay (aka 0/1) prediction for every manhole about whether it will explode in the next year or not. Then we compare those predictions to reality, assigning a loss of 1 if you predict the wrong outcome and 0 if you predict correctly, then summing these losses over all manholes.

The way to minimize expected loss is to predict 1 if the probability estimate of failure is greater than 0.5 and 0 otherwise. If all of the probability estimates are below 0.5, all predictions are 0 (no explosion) for every manhole. Consequently, the loss is always the number of explosions. Unfortunately, this is the best you can do if your loss is 0/1 and you have to make 0/1 predictions.

So we’ve minimized 0/1 loss and in so doing created a useless 0/1 classifier.

A Hacked Threshold?

There’s something fishy about a classifier that returns all 0 predictions. Maybe we can adjust the threshold for predicting explosions below 0.5. Equivalently, for 0/1 classification purposes, we could rescale the probability estimates.

Sure, it gives us some predicted explosions, but the result is a non-optimal 0/1 classifier. The reason it’s non-optimal in 0/1-loss terms is that each prediction of an explosion is likely to be wrong, but in aggregate some of them will be right.

It’s not a 0/1 Classification Problem

The problem in 0/1 classification arises from converting estimates of explosion of less than 50% per manhole to 0/1 predictions minimizing expected loss.

Suppose our probability estimates are close, at least in the sense that for any given manhole there’s only a very small chance it’ll explode no matter what its features are.

Some manholes do explode and the all-0 predictions are wrong for every exploding manhole.

What Con-Ed really cares about is finding the most at-risk properties in its network and supplying them maintenance (as well as understanding what the risk factors are). This is a very different problem.

A Better Idea

Take the probabilities seriously. If your model predicts a 10% chance of explosion for each of 100 manholes, you expect to see 10 explosions. You just don’t know which of the 100 manholes they’ll be. You can measure these marginal predictions (number of predicted explosions) to gauge how accurate your model’s probability estimates are.

We’d really like a general evaluation that will measure how good our probability estimates are, not how good our 0/1 predictions are. Log loss does just that. Suppose you have N outcomes y_1,\ldots,y_N with corresoponding predictors (aka features), x_1,\ldots,x_N, and your model has parameter \theta. The log loss for parameter (point) estimate \hat{\theta} is

      {\mathcal L}(\hat{\theta}) - \sum_{n=1}^N \, \log \, p(y_n|\hat{\theta};x_n)

That is, it’s the negative log probability (the negative turns gain into loss) of the actual outcomes given your model; the summation is called the log likelihood when viewed as a function of \theta, so log loss is really just the negative log likelihood. This is what you want to optimize if you don’t know anything else. And it’s exactly what most probabilistic estimators optimize for classifiers (e.g., logistic regression, BART [see below]).

Decision Theory

The right thing to do for the Con-Ed case is to break out some decision theory. We can assign weights to various prediction/outcome pairs (true positive, false positive, true negative, false negative), and then try to optimize weights. If there’s a huge penalty for a false negative (saying there won’t be an explosion when there is), then you are best served by acting on low-probability information, such as servicing even low-probability manholes. For example, if there is a $100 cost for a manhole blowing up and it costs $1 to service a manhole so it doesn’t blow up, then even a 1% chance of blowing up is enough to send out the service team.

We haven’t changed the model’s probability estimates at all, just how we act on them.

In Bayesian decision theory, you choose actions to minimize expected loss conditioned on the data (i.e., optimize expected outcomes based on the posterior predictions of the model).

Ranking-Based Evaluations

Suppose we sort the list of manholes in decreasing order of estimated probability of explosion. We can line this up with the actual outcomes. Good system performance is reflected in having the actual explosions ranked high on the list.

Information retrieval supplies a number of metrics for this kind of ranking. The thing I like to see for this kind of application is a precision-recall curve. I’m not a big fan of single-number evaluations like mean average precision, though precision-at-N makes sense in some cases, such as if Con-Ed had a fixed maintenance budget and wanted to know how many potentially exploding manholes it could service.

There’s a long description of these kind evaluations in

Just remember there’s noise in this received curves and that picking an optimal point on them is unlikely to produce such good behavior on held-out data.

With good probability estimates for the events you will get good rankings (there’s a ton of theory around this I’ve never studied).

About the Exploding Manholes Project

I’ve been hanging out at Columbia’s Center for Computational Learning Systems (CCLS) talking to Becky Passonneau, Haimanti Dutta, Ashish Tomar, and crew about their Con-Ed project of predicting certain kinds of events like exploding manholes. They built a non-parametric regression model using Bayesian additive regression trees with a fair amount of data and many features as predictors.

I just wrote a blog post on Andrew Gelman’s blog that’s related to issues they were having with diagnosing convergence:

But the real problem is that all the predictions are below 0.5 for manholes exploding and the like. So simple 0/1 loss just fails. I thought the histograms of residuals looked fishy until it dawned on me that it actually makes sense for all the predictions to be below 0.5 in this situation.

Moral of the Story

0/1 loss is not your real friend. Decision theory is.

The Lottery Paradox

This whole discussion reminds me of the lottery “paradox”. Each ticket holder is very unlikely to win a lottery, but one of them will win. The “paradox” arises from the inconsistency of the conjunction of beliefs that each person will lose and the belief that someone will win.

Oh, no! Henry Kyburg died in 2007. He was a great guy and decades ahead of his time. He was one of my department’s faculty review board members when I was at CMU. I have a paper in a book he edited from the 80s when we were both working on default logics.

Lucene 2.4 in 60 seconds

February 18, 2009

[Update: 8 Mar 2014. I’ve just written a quick introduction to Lucene 4:

The contents of this introduction are excerpted from the Lucene chapter in my new book:

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]

This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0. The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible.

In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal.

Here is buildIndex and its required import statements:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
public static void buildIndex() throws IOException {
    IndexWriter indexWriter
        = new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                          new StandardAnalyzer(),
    String[] texts = new String[] {  "hello world",
                                     "hello sailor",
                                     "goodnight moon" };
    for (String text : texts) {
        Document doc = new Document();
        doc.add(new Field("text",text,

A Lucene data store is called an Index. It is a collection of Document objects. A Document consists of one or more named Field objects. Documents are indexed on a per-field basis. An IndexWriter object is used to create this index. Let’s go over the call to its constructor method:

new IndexWriter(FSDirectory.getDirectory("helloLuceneIndex"),
                new StandardAnalyzer(),

The first argument is the index that the IndexWriter operates on. This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index. The second argument is a Lucene Analyzer. This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts. It ignores punctuation, removes common function words such as “if”, “and”, and “”but”, and converts words to lowercase. The third argument determines the amount of text that is indexed. The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 characters.

for (String text : texts) {
    Document doc = new Document();
    doc.add(new Field("text",text,Field.Store.YES,Field.Index.ANALYZED));

The for loop maps texts into Document objects, which contain a single Field with name “text”. The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter‘s Analyzer. The IndexWriter.addDocument() method adds each document to the index. After all texts have been processed the IndexWriter is closed.

Both indexing and search are operations over Document objects. Searches over the index are specified on a per-field basis, (just like indexing). Lucene computes a similarity score between each search query and all documents in the index and the search results consist of the set of best-scoring documents for that query.

Here is searchIndex and its required import statements:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
public static void searchIndex(String[] queryStrings) 
    throws IOException, ParseException {
    Searcher searcher 
        = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    QueryParser parser = new QueryParser("text",new StandardAnalyzer());
    for (String queryString : queryStrings) {
        System.out.println("nsearching for: " + queryString);
        Query query = parser.parse(queryString);
        TopDocs results =,10);
        System.out.println("total hits: " + results.totalHits);
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.printf("%5.3f %sn",
                              hit.score, doc.get("text"));

Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates a IndexSearcher over the directory “helloLuceneIndex”. Search happens via a call to the method. Before calling the search() method the search string must be processed into a Lucene Query. To prepare the query we create a QueryParser and call its parse() method on the search string. The QueryParser is constructed using the same Analyzer as was used for indexing because the QueryParser processes the search string into a set of search terms which are matched against the terms in the index, therefore both the indexed text and the search text must be tokenized in the same way.

Here are the statements which process the results:

TopDocs results =,10);
System.out.println("total hits: " + results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.printf("%5.3f %sn",
                      hit.score, doc.get("text"));

The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits. The latter is the number of documents that matched the query, and the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports the TopDoc.totalHits, then iterates over the TopDoc.scoreDocs array. The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name “text”.

In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command:

javac -cp "lucene-core-2.4.0.jar"

Invoke the program with the following command:

java -cp ".;lucene-core-2.4.0.jar" HelloLucene "hello world" hello moon foo

Produces these results:

built index

searching for: hello world
total hits: 2
1.078 hello world
0.181 hello sailor

searching for: hello
total hits: 2
0.625 hello world
0.625 hello sailor

searching for: moon
total hits: 1
0.878 goodnight moon

searching for: foo
total hits: 0

all done

The first query “hello world” matches exactly against our first text, but it also partially matches the text “hello sailor”. The second query “hello” matches both texts equally well. The third query “moon” matches against the only document which contains that word. The fourth query “foo” doesn’t match anything in the index.

Exercise for the reader

Refactor into two programs: and, where BuildHelloLucene uses its command line arguments to build the index, instead of the strings supplied by Sring[] texts.

Runing BuildHelloLucene with inputs “hello world”, “hello sailor”, and “goodnight moon” and then running SearchHelloLucene with inputs “hello world”, “hello”, “moon” and “foo” should give the same results as above. Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.

Lucene or a Database? Yes!

November 22, 2008

[Update: 10 Feb 2014. A newer discussion of databases and Lucene 4 is available in the chapter on Lucene in the book

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]

This question comes up pretty often on the Lucene mailing lists. Assuming that’s an inclusive OR in the question, we say yes! to both. In many situations this is not just a clever answer it’s the correct one — while the two systems can both be used as a data store, there are some things that a database can do better than Lucene, and some things that Lucene can do that a database can’t do at all.

Lucene provides full-text search over a collection of data that returns results in order of relevancy. If search over (lots of) text data is a non-negotiable requirement, then a date with Lucene (or one of its cousins) is in your future, because this is something that most database systems do badly, if at all. Here’s a nice critique of this feature in MySQL (an otherwise fine RDBMS).

Lucene provides search and indexing over Document objects. A Document is a set of Fields. Indexing a Field processes its contents into one or more Term objects which contain the Field name and the Term string value. Lucene maintains an inverted index which maps terms into documents. Queries against the index are scored by TF-IDF, a measure of document relevance. A Lucene Analyzer defines how index terms are extracted from text. The developer can define their own Analyzer, or use one of the pre-existing analyzers in the Lucene API, which can do tokenization for many different languages, as well as stop-listing of common words, case normalization, stemming, and more. For a nice example of a custom analyzer, see Bob‘s chapter on “Orthographic variation with Lucene” in the Lucene in Action book.

Relational databases store data as rows and columns in tables. A table declaration specifies the columns, the type of data stored in each column, and constraints over columns. The DBMS enforces these constraints as rows are added to the tables. In Lucene there is no notion of document type. A document is a set of fields, and processing a document consists of processing those fields into fielded terms that are added to the index. Lucene was designed for rapid indexing of documents. Checking document completeness is the responsibility of the calling application, ditto enforcing cross-document constraints. In the latter case, trying to enforce a uniqueness constraint on a term in the index is likely to impair indexing speed.

Searches against the two are different both in the search operations and search results. Database queries are specified in SQL which allows for composition of data from different tables via JOIN, complex boolean selectional restrictions, and ordering of results. Search over a database returns a ResultSet object containing a table which is a new view on the data resulting from executing the search query, that is, it may contain data stored in several different tables underlyingly.

Lucene search is over the term index, and the results returned are a set of pairs of documents and scores. Lacking explicit document types, a query is over some number of fields. Lucene has its own query language, and the package provides classes to do this programmatically. Lucene uses the score to limit the number of documents returned. Overriding this behavoir requires getting all matching documents for a query, and this can be very expensive, and queries designed to find things like dates or numbers within some range can be inefficient.

So if you want to provide relevance-based text search over some heterogeneous data store that has both a transactional component, such as an accounting system, as well as a storing large amounts of text you’re going to have to use both Lucene and a database, and the questions to ask are:

  • What is the mapping between the database tables and views and Lucene documents?
  • What text, if any, needs to be stored in the index?
  • How do I keep my Lucene index in sync with the database?

The Scaling Web blog has a nice case study. They have some good tips about how to conflate information from multiple table columns into good fielded documents, and how best to use IndexReader and IndexWriter objects in order to keep the index up-to-date with the database.

Of course there are situations where just Lucene or just a database is a sufficient data store. In the LingMed sandbox project we use Lucene to store a local version of the MEDLINE citation index. The challenge there is making sure that updates to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid search by PubMed ID. The citation title and abstract are also indexed into their own fields, and we can use Lucene to find out how many different documents contain a word in the article title or abstract – that is, we can quickly and easily calculate the document frequency for an item.

LingMed uses a LingPipe Chunker and a LanguageModel to find all mentions of genes in MEDLINE citations and assign a score to each. A MySQL database is the appropriate data store because we need to keep track of a many-to-many relationship between genes and articles (each gene mention); as well as information about the article itself. We don’t need sophisticated text search over this data, nor do we want to rank our results by TF-IDF; we are using LingPipe to compute our own scores, and use SQL’s “ORDER BY” clause to retrieve the best-scoring items.

The correct answer to the question “Lucene or a Database?” always depends on the specifics of the situation. Furthermore, as more functionality is added to Lucene (and family), the question is worth revisiting, and is revisited fairly frequently on the Lucene mailing lists. These questions are usually accompanied by a specific use case, and the comments of the Lucene contributors provide good guidelines which should help you find the answer that is right for your application.