Archive for the ‘Lucene’ Category

Lucene’s Missing Term/Field Query Structure

January 21, 2009

In a follow-up to an earlier blog post, Lucene’s Missing TokenStream factory, I’d like to discuss a problem with Lucene’s query structure. Not Lucene’s query syntax, which has its own problems, but gets the term/field distinction (mostly) right. But be careful — the query syntax has a different notion of term than the object representations in Lucene’s search package.

The underlying problem is the same as with the missing token stream factories: a missing abstraction at the plain text level that winds up conflating notions of plain text, like ranges, and notions of fielded queries, like ranges of plain text in a specified field.

At the root of the query structure problem is the index.Term class, which is constructed with a field and a textual value for that field. The simplest possible query is a search.TermQuery, which consists of a single term. So far, no problem. But what about a search.RangeQuery? It’s constructed using a pair of terms. There’s a note in the constructor that both terms must be from the same field. What’s the problem? Range queries shouldn’t be built out of two field/text pairs, but rather out of one field and two text objects.

Phrase queries (search.PhraseQuery) have the same problem as range queries, in that they should only apply to terms with the same field. Although the add term checks that added terms have the same field and throws an exception if they don’t, there is no warning in the javadoc.

The solution to the problem is simple. Split the structure of queries into two types, one for the text part of a query and one for restricting it to fields. It’s easiest to see in BNF, though what I’m really talking about is object structure, not the query syntax:

TextQuery ::= String                               // term query
                  | (String TO String)              // range query
                  | "String String ... String"      // phrase query
                  | (TextQuery AND TextQuery)
                  | (TextQuery OR TextQuery)

FieldedQuery ::= [Field : TextQuery]
                     | (FieldedQuery AND FieldedQuery)
                     | (FieldedQuery OR FieldedQuery)

The logical operations all distribute through fields, so a query like [AUTHOR: (Smith OR Jones)] is equivalent to ([AUTHOR:Smith] OR [AUTHOR:Jones]).

An advantage the query object representation has over the query syntax is that there are no default fields. Lucene’s query syntax, on the other hand, allows a fielded query to consist of a term query. The use of an analyzer in constructing an actual query parser then fills in the missing field.

The class search.WildcardQuery is equally problematic in that it takes a term as an argument. It’s overloading the textual value of the term to represent not a term, but a string with structure including a special character for the multiple character (*) and single character (?) wildcards. But what if I want a question mark or asterisk in my term itself? The query syntax handles this problem, but not the object representation. The classes search.PrefixQuery and search.FuzzyQuery have the same problem with conflating terms and a syntax for search.

Phrase queries and boolean queries have the additional problem of being mutable, which makes their use in collections problematic (just as using mutable collections within other collections is problematic). For instance, if you construct a phrase query, its notion of equality and hash code change as more terms are added, because equality is defined to require having an equal list of terms.

Lucene or a Database? Yes!

November 22, 2008

[Update: 10 Feb 2014. A newer discussion of databases and Lucene 4 is available in the chapter on Lucene in the book

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]

This question comes up pretty often on the Lucene mailing lists. Assuming that’s an inclusive OR in the question, we say yes! to both. In many situations this is not just a clever answer it’s the correct one — while the two systems can both be used as a data store, there are some things that a database can do better than Lucene, and some things that Lucene can do that a database can’t do at all.

Lucene provides full-text search over a collection of data that returns results in order of relevancy. If search over (lots of) text data is a non-negotiable requirement, then a date with Lucene (or one of its cousins) is in your future, because this is something that most database systems do badly, if at all. Here’s a nice critique of this feature in MySQL (an otherwise fine RDBMS).

Lucene provides search and indexing over Document objects. A Document is a set of Fields. Indexing a Field processes its contents into one or more Term objects which contain the Field name and the Term string value. Lucene maintains an inverted index which maps terms into documents. Queries against the index are scored by TF-IDF, a measure of document relevance. A Lucene Analyzer defines how index terms are extracted from text. The developer can define their own Analyzer, or use one of the pre-existing analyzers in the Lucene API, which can do tokenization for many different languages, as well as stop-listing of common words, case normalization, stemming, and more. For a nice example of a custom analyzer, see Bob‘s chapter on “Orthographic variation with Lucene” in the Lucene in Action book.

Relational databases store data as rows and columns in tables. A table declaration specifies the columns, the type of data stored in each column, and constraints over columns. The DBMS enforces these constraints as rows are added to the tables. In Lucene there is no notion of document type. A document is a set of fields, and processing a document consists of processing those fields into fielded terms that are added to the index. Lucene was designed for rapid indexing of documents. Checking document completeness is the responsibility of the calling application, ditto enforcing cross-document constraints. In the latter case, trying to enforce a uniqueness constraint on a term in the index is likely to impair indexing speed.

Searches against the two are different both in the search operations and search results. Database queries are specified in SQL which allows for composition of data from different tables via JOIN, complex boolean selectional restrictions, and ordering of results. Search over a database returns a ResultSet object containing a table which is a new view on the data resulting from executing the search query, that is, it may contain data stored in several different tables underlyingly.

Lucene search is over the term index, and the results returned are a set of pairs of documents and scores. Lacking explicit document types, a query is over some number of fields. Lucene has its own query language, and the package provides classes to do this programmatically. Lucene uses the score to limit the number of documents returned. Overriding this behavoir requires getting all matching documents for a query, and this can be very expensive, and queries designed to find things like dates or numbers within some range can be inefficient.

So if you want to provide relevance-based text search over some heterogeneous data store that has both a transactional component, such as an accounting system, as well as a storing large amounts of text you’re going to have to use both Lucene and a database, and the questions to ask are:

  • What is the mapping between the database tables and views and Lucene documents?
  • What text, if any, needs to be stored in the index?
  • How do I keep my Lucene index in sync with the database?

The Scaling Web blog has a nice case study. They have some good tips about how to conflate information from multiple table columns into good fielded documents, and how best to use IndexReader and IndexWriter objects in order to keep the index up-to-date with the database.

Of course there are situations where just Lucene or just a database is a sufficient data store. In the LingMed sandbox project we use Lucene to store a local version of the MEDLINE citation index. The challenge there is making sure that updates to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid search by PubMed ID. The citation title and abstract are also indexed into their own fields, and we can use Lucene to find out how many different documents contain a word in the article title or abstract – that is, we can quickly and easily calculate the document frequency for an item.

LingMed uses a LingPipe Chunker and a LanguageModel to find all mentions of genes in MEDLINE citations and assign a score to each. A MySQL database is the appropriate data store because we need to keep track of a many-to-many relationship between genes and articles (each gene mention); as well as information about the article itself. We don’t need sophisticated text search over this data, nor do we want to rank our results by TF-IDF; we are using LingPipe to compute our own scores, and use SQL’s “ORDER BY” clause to retrieve the best-scoring items.

The correct answer to the question “Lucene or a Database?” always depends on the specifics of the situation. Furthermore, as more functionality is added to Lucene (and family), the question is worth revisiting, and is revisited fairly frequently on the Lucene mailing lists. These questions are usually accompanied by a specific use case, and the comments of the Lucene contributors provide good guidelines which should help you find the answer that is right for your application.

Updating and Deleting Documents in Lucene 2.4: LingMed Case Study

November 5, 2008

[Update: 10 Feb 2014. Much has changed in Lucene since 2.4. An extensive tutorial for Lucene 4 is now available as a chapter in the book

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]

Lucene Java 2.4 was recently released, and this is a good excuse to review the cumulative changes to the org.apache.lucene.index IndexReader and IndexWriter classes. Before Lucene Java 2.1 documents were added to an index using IndexWriter.addDocument(), deleted using IndexReader.delete(). Document updates were written as a delete followed by an add. In 2.1 the method IndexWriter.deleteDocument() was added to the API, and in version 2.2 IndexWriter.updateDocument() was added as well. Most of the demos and tutorials out there predate this release, causing no end of confusion and even heartbreak for the newbie developer. Doug Cutting summarizes this nicely, in his comments on this blog post:

The presence of IndexReader.delete() traces back to the origins of Lucene. IndexReader and IndexWriter were written, then deletion was added. IndexWriter should really have been called IndexAppender. The only place deletion was possible was IndexReader, since one must read an index to figure out which document one intends to delete, and IndexWriter only knows how to append to indexes. This has big performance implications: IndexReader is a heavy-weight view of an index, while IndexWriter is lightweight. … The issue has been discussed at length for years, but no acceptable solution had been developed.
Recently Ning Li contributed to Lucene a fix for this, a version of IndexWriter that can efficiently intermix deletes with additions.

In the LingMed sandbox project, we use Lucene to maintain local version of the MEDLINE citation index, where each MEDLINE citation is stored as a Lucene document. Every weekday the NLM releases a set of updates which contains new citations, revised citations, and lists of citations that should be deleted from the index. This is an XML file, and we use classes from the com.aliasi.medline package to process the updates file. As an article goes through the publishing pipeline (e.g. accepted, pre-print, corrections added), the corresponding citation in the MEDLINE index is updated accordingly. We only want to keep the latest version of a citation in our Lucene index. The MEDLINE updates file encodes all citations as MedlineCitation entities, therefore our handler must figure out whether the citation is new or revised. Since all Medline citations have a unique PMID (PubMed identifier) element, we search the index for that PubMed id – if it exists then we call IndexWriter.updateDocument, else we call IndexWriter.addDocument.

Search over the index requires us to open an IndexReader. This raises the question: if the IndexWriter makes changes to the index how can the IndexReader see them? The answer is found in the IndexWriter's javadoc:

an IndexReader or IndexSearcher will only see the index as of the “point in time” that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.

Given this, processing a set of MEDLINE updates file(s) is pretty straightforward: before processing each file we open an IndexReader on the index, and after processing the entire file we call the IndexWriter.commit(), which flushes any pending changes out to the index. (The Lucene javadoc it says that after a commit “a reader will see changes” but only if the reader goes looking for them! See also this discussion on the Lucene mailing list).

We use a com.aliasi.medline.MedlineParser to parse the updates file. The parser takes a visitor in the form of a MedlineHandler, which processes the MEDLINE citations as they are extracted by the parser. The parser invokes the handler’s handle(MedlineCitation) method on each MedlineCitation element that it parses out of the updates file, and invokes the handler’s delete(String) method on each PubMed identifier in the DeleteCitation element.

Here is the calling code which processes the MEDLINE updates file(s), (calls to the log4j logger ommitted):

MedlineParser parser = new MedlineParser(true); // true = save raw XML
IndexWriter indexWriter
      = new IndexWriter(FSDirectory.getDirectory(mIndex),
               new IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
for (File file: files) {
    MedlineIndexer indexer = new MedlineIndexer(indexWriter,mCodec);

The class MedlineIndexer implements MedlineHandler and does the work of updating the index:

static class MedlineIndexer implements MedlineHandler {
    final IndexWriter mIndexWriter;
    final MedlineCodec mMedlineCodec;
    final IndexReader mReader;
    final IndexSearcher mSearcher;

    public MedlineIndexer(IndexWriter indexWriter, MedlineCodec codec) 
        throws IOException {
        mIndexWriter = indexWriter;
        mMedlineCodec = codec;
        mReader =,true); // read-only
        mSearcher = new IndexSearcher(mReader);
    public void close() throws IOException { 

Instantiating a MedlineIndexer automatically opens a fresh IndexReader and IndexSearcher on the index. The MedlineIndexer.close() method closes these objects and commits updates on the index.

The MedlineCodec maps the MedlineCitation.pmid() to its own field in the Lucene Document, so that we can uniquely identify documents by PubMed identifier. To lookup documents by PubMed id we create a org.apache.lucene.index.Term object:

Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());

Here is the MedlineIndexer.handle() method:

public void handle(MedlineCitation citation) {
    Document doc = mMedlineCodec.toDocument(citation);
    try {
            Term idTerm = new Term(Fields.ID_FIELD,citation.pmid());
            if (mSearcher.docFreq(idTerm) > 0) {
            } else {
    } catch (IOException e) {
        mLogger.warn("handle citation: index access error, term: "+citation.pmid());

We use the IndexSearcher's doqFreq() method to check if a document with this PubMed id is in the index. If so, then the handler updates the document, else it adds it.

To delete a citation from the index we use the deleteDocuments method. Here is the MedlineIndexer.delete() method:

public void delete(String pmid) {
    Term idTerm = new Term(Fields.ID_FIELD,pmid);
    mLogger.debug("delete citation: "+pmid);
    try {
    } catch (IOException e) {
        mLogger.warn("delete citation: index access error, term: "+pmid);

Indexing the MEDLINE daily updates file is a simple batch-oriented process. The IndexWriter holds a write-lock on the index (a file-system based lock, see the LockFactory javadoc), so only a single update process can ever be running. However there can be any number of IndexReaders open on the index. Their view of the index is whatever state the index was in when either their open() or reopen() method was called. Once you understand this about the Lucene IndexReader and IndexWriter classes, you’re ready to start building applications capable of handling a stream of search requests, interleaved with updates to the index, (or else you’re ready to use Solr, or something like it).

Chinese Information Retrieval with LingPipe and Lucene

October 17, 2008

The standard approach to information retrieval in English involves tokenizing text into a sequence of words, stemming the words to their morphological roots, then tossing out stop words that are too frequent or non-discriminative. This is how the standard settings work in Apache’s Java-based Lucene search engine.

Taking this approach for Chinese would require being able to find the words. But the state of the art in Chinese word segmentation is only 97-98% or so on a relatively well-edited and homogeneous test set making up the third SIGHan segmentation bakeoff. And keep in mind that this is the usual average performance figure, with rarer words being harder to find. Out-of-training-vocabulary performance is at best 85%. Unfortunately, it’s those rarer words in the long tail, likes names, that are often the key disrciminative component of a query.

It turns out that you can go a very long way indexing Chinese using character unigrams and bigrams. That is, instead of breaking text into tokens, pull out all the characters and index those as if they were words, then pull out all the two-character sequences and index those the same way. This is what we did for Technorati. The evaluation on real queries showed what you’d expect: very good recall, but often poor precision because of unintended cross-word and sub-word effects. Consider an English example of the word sequence “and add it to your”, which would look like “andaddittoyour” without spaces. The problem is that without spaces, it matches the words “dad”, “toy”, “ditto”, “you”, and “our”. (In swear word filtering, naive replacement in subwords leads to the “clbuttic” mistake).

By the way, subword problems come up in English at an alarming rate. How many of you know that “Spider-man” is two words while “Batman” is only one? In looking at customer query logs for TV data, we saw that users have no idea how many spaces to put in many examples. We just expect our search engines to have solved that one for us. They actually solve this problem roughly the same way as we suggest for Chinese. (This is also a problem in noun compounding languages like German and Danish, or generally highly compounding languages like Turkish.)

The cross-word effects would be completely mitigated if we could teach Chinese query writers to insert spaces between words. I’m told this is impossible given the conventions of the language, but my reply is “where there’s a will, there’s a way”, and I think the adoption of Pinyin provides some direct support for my conjecture. But let’s suppose we can’t get the Chinese searchers to insert spaces; it won’t take care of the subword problem anyway.

So how can we improve precision for unigram and bigram indexing? Add words when we can find them and let TF/IDF sort out the weights. This is the idea behind this paper, which evaluates a number of such combinations:

Jian-Yun Nie, Jiangfeng Gao, Jian Zhang, and Ming Zhou. 2001. On the Use of Words and N-grams for Chinese Information Retrieval. Proceedings of the 5th International Workshop Information Retrieval with Asian Languages.

The best indexing seems to arise from a mixture of character unigrams, character bigrams and words.

So how do we get the words? Two ways, for maximal precision and control. First, we use a dictionary. We can build an efficient tokenizer to find dictionary words using LingPipe’s Aho-Corasick implementation dict.ExactDictionaryChunker, the use of which is described in the javadoc and in the named entity tutorial. Just make sure you use the right tokenizer, which for Chinese would be the tokenizer.CharacterTokenizerFactory. A dedicated implementation would be faster than going through tokenizers, given that we’re treating each character as a token.

As to where to find a dictionary, that’s up to you. Training data is still available on the old SIGHan site or from LDC.

The second approach to word spotting can be approximate, and for that, we recommend our noisy channel Chinese word segmenter, as described in our tutorial on Chinese word segmentation.

The only fiddly bits of coding are wrapping up our dictionary chunker and chinese word segmenter as Lucene analyzers (see lucene.analysis.Analyzer. You can see how to adapt LingPipe tokenizer factories for use as Lucene analyzers in our new sandbox project LingMed; look for the top-level read-me.html file and the adapter implementation for LingPipe tokenizers in src/com/aliasi/lingmed/lucene/

We’ve sidestepped the fact that Chinese has two distinct ways of rendering characters, traditional Chinese characters (used mainly in Taiwan and Hong Kong and among ex-pats) and simplifed Chinese characters (used primarily on the mainland). The unicode standard, of course, contains both.

It’s a simple matter to load a single dictionary on both types of characters, and a single word segmenter operating over both types of characters. We’ve had good luck in training multilingual (English/Hindi) named entity extractors. It works because it relies on local context to predict words, and local context is almost always homogeneous in character set. That way, you can handle docs with mixed character sets and don’t have to do a character set ID pass up front (which is tricky, because the code points overlap in unicode and queries are often short).

We should point out that this method can also work for Japanese. Though you might want to up the n-gram count for the syllabic characters in Hirigana and Katakana, while keeping 1-grams and 2-grams for the imported Chinese ideograms making up Kanji.

Lucene’s Missing Token Stream Factory

July 6, 2008

While we’re on the subject of tokenization, every time I (Bob) use Lucene, I’m struck by its lack of a tokenizer factory abstraction.

Lucene’s abstract class TokenStream defines an iterator-style method next() that returns the next token or null if there are no more tokens.

Lucene uses the abstract class Token for tokens. A Lucene token contains a string representing its text, a start and end position, and a lexical type which I’ve never seen used. Because LingPipe has to handle many tokens quickly without taxing garbage collection, it doesn’t create objects for them beyond their string texts. But that’s the subject of another blog entry.

A Lucene Document is essentially a mapping from field names, represented as strings, to values, also represented as strings. Each field in a document may be treated differently with respect to tokenization. For instance, some might be dates, others ordinary text, and others keyword identifiers).

To handle this fielded document structure, the Lucene class analysis.Analyzer maps field names, represented as strings, and values, represented as instances of, to token streams. The choice of Reader for values is itself puzzling because it introduces I/O exceptions and the question of who’s responsible for closing the reader.

Lucene overloads the analyzer class itself to provide the functionality of LingPipe’s tokenizer factory. Lucene classes such as SimpleAnalyzer and CJKAnalyzer return the same token stream no matter which field is specified. In other words, the field name is ignored.

What would be useful would be a Lucene interface analysis.TokenStreamFactory with a simple method TokenStream tokenizer(CharSequence input) (note how we’ve replaced the analyzer’s reader input with a generic string). Then analyzers could be built by associating token stream factories with fields. This would be the natural place to implement Lucene’s simple analyzer, CJK analyzer, and so on. The current analyzer behavior is then easily derived with an analyzer which sets a default token stream factory for fields.

Book: Building Search Applications: Lucene, LingPipe and Gate

June 12, 2008

There’s a book about LingPipe!

The title is linked to the Amazon page; it’s also available as an inexpensive download from Lulu.

The Bottom Line

The subtitle “A practical guide to building search applications using open source software” pretty much sums it up (comment added June 22, 2008: please see Seth Grimes’s comment below about LingPipe’s royalty-free license not being compatible with other open-source licenses). It takes a reader that knows Java, but nothing at all about search or associated text processing algorithms, and provides a hands-on, step-by-step guide for building a state-of-the-art search engine.

I (Bob) gave Manu feedback on a draft, but there wasn’t much to correct on the LingPipe side, so I can vouch for the book’s technical accuracy. (Disclaimer: I didn’t actually try to run the code.)

Chapter by Chapter Overview

After (1) a brief discussion of application issues, the chapters include (2) tokenization in all three frameworks, (3) indexing with Lucene, (4) searching with Lucene, (5) sentence extraction, part-of-speech tagging, interesting/significant phrase extraction, and entity extraction with LingPipe and Gate (6) clustering with LingPipe, (7) topic and language classification with LingPipe, (8 ) enterprise and web search, page rank/authority calculation, and crawling with Nutch, (9) tracking news, sentiment analysis with LingPipe, detecting offensive content and plagiarism, and finally, (10) future directions including vertical search, tag-based search and question-answering.

For those wanting introductions to the LingPipe APIs mentioned above, Konchady’s book is a gentler starting point than our own tutorials.

That may sound like a whole lot of ground to cover in 400 pages, but Konchady pulls the reader along by illustrating everything with working code and not getting bogged down in theoretical boundary conditions. There are pointers to theory, and a bit of math where necessary, but the book never loses sight of its goal of providing a practical introduction. In that way, it’s like the Manning in Action series.

The book’s hot of the presses, so it’s up to date with Lucene 2.3 and LingPipe 3.3.

About the Author

Manu Konchady‘s an old hand at search and text processing. You may remember him from such books as Text Mining Application Programming and High Performance Parallel Algorithms for Scientific Computing with Application to a Coupled Ocean Model.

TF/IDF, Scaling and Symmetry

June 6, 2007

Search engines typically use TF/IDF scaling only on the query, because it’s too hard to compute dynamically on the documents, leading to an asymmetry in scoring.

I’m building some discriminitave classifiers for LingPipe. I just finished with voted kernel perceptrons, which are way cool, and though lighter than SVMs, still awfully heavy.

A simpler discriminative classifier is based on inverse-document frequency weighting, as is commonly used in search engines. In this situation, the weight applied to a search term is inversely related to the number of documents in which it appears. For instance, a term appearing in docFrequency documents, where there is a total number of numDocs documents, is typically weighted by a damped inverse ratio like:

  idf(docFrequency,numDocs) = log(numDocs/docFrequency)

For base 2 logs, and numDocs=1024, we have

    Doc Freq   IDF
    --------   ---
       1        10
       2         9
       4         8
     512         1
    1024         0

Basically, this approach upweights terms that don’t show up in many documents. If we are using TF/IDF ranking for classification, IDF weighting provides a parametric form of discriminitive feature weighting.

Term frequencies are typically dampened as well, often with a square root penalty:

    tf(term,doc) = sqrt(frequency(term,doc))

This has the effect of making score rise linearly with the number of overlapping counts, rather than quadratically, as it’d do in the un-dampened raw frequency counts.

Scoring is typically done by cosine. That is, a length-normalized dot product of query and document vectors.

      = cosine(docVec,queryVec)
      = docVec * queryVec / (length(docVec) * length(queryVec))

What I didn’t notice until I dug into the internals of the Apache Lucene search engine and really paid attention to the scoring section in Witten et al.’s Managing Gigabytes is that IDF normalization is only applied to the query vector. That is:

    docVec[term] = tf(term,doc)
    queryVec[term] = tf(term,query) * idf(term,numDocs)

The reason for this is that if doc frequencies and the number of docs continually changes, so would the length of a document vector if it were normalized by IDF. In this case, every document in the result set would have to be renormalized for every query, requiring multiplications and additions for each term in the document. Even if we could afford the arithmetic overhead, we find that reverse indexes don’t even support looking up all of the terms in a document easily, because they’re maps from terms to documents, not the other way around.

Instead, in dynamic search engines, IDF is not applied to doc vectors, allowing their lengths to be computed as soon as they are added to the index.

The upshot of only IDF-weighting queries is a surprising asymmetry. If we index a document by its text and then create a query out of the same text, the score will be less than 1 (recall that if all entries are positive counts, cosines range between 0 and 1 inclusive). To score maximally against a document, what is required is a query with weights that will invert the IDF weighting applied to queries so that the query vector actually matches the document vector. That is, the best scoring vector against a document is the vector created from the document term frequencies with inverse IDF weighting.

For the TF/IDF classifier I’m building into LingPipe, the IDFs are computed for document vectors at compile time, thus ensuring that the only query that will score 1.0 against a document is the query constructed out of the document itself.

Of course if you’re a major search engine, and thus matching by boolean match and sorting by some social network metric of influence with some fiddling, none of this really matters. I actually miss Excite, which did TF/IDF search. Perhaps that’s not surprising given that Doug Cutting, the founder and original developer of Lucene, was their lead search engine developer.

To Stem or Not to Stem?

March 21, 2007

I’ve been working for a customer on unsupervised and semi-supervised approaches to stemming for information retrieval. I’d never thought too hard about stemming per se, preferring to tackle the underlying problem in applications through character n-gram indexing (we solve the same problems facing tokenized classifiers using character language models). I have thought about computational morpho-phonology back in my logic programming/constraint grammar phase.
I found two very interesting evaluation papers in the same issue of JASIS, the first of which is:

Paice, C. D. 1996. Method for evaluation of stemming algorithms based on error counting. Journa l of the American Society for Information Science 47(8):632–649.
Wiley Pay Link (I couldn’t find a free copy online)

The Paice paper introduces a stemming-as-clustering evaluation, treating the clustering evaluation as one of equivalence recall and precision (just like LingPipe’s clustering precision/recall evaluation). Paice assumes, quite rightly for stemming, if I may say so, that the issue in question is whether two words word1 and word2 have the same stem, not the actual identity of the stem. This makes sense because reverse-indexed IR system don’t care about the form of a word, just its identity. Paice then draws a distinction between “light” and “heavy” stemming, but I found this unmotivated as a two-way classification.

Extending Paice’s example, here are the words I found in the English Gigaword corpus with etymological root author:

antiauthoritarian, antiauthoritarianism, antiauthority, author, authoratative, authoratatively, authordom, authored, authoress, authoresses, authorhood, authorial, authoring, authorisation, authorised, authorises, authoritarian, authoritarianism, authoritarians, authoritative, authoritatively, authoritativeness, authorities, authoritory, authority, authorization, authorizations, authorize, authorized, authorizer, authorizers, authorizes, authorizing, authorless, authorly, authors, authorship, authorships, coauthor, coauthored, coauthoring, coauthors, cyberauthor, deauthorized, multiauthored, nonauthor, nonauthoritarian, nonauthorized, preauthorization, preauthorizations, preauthorized, quasiauthoritarian, reauthorization, reauthorizations, reauthorize, reauthorized, reauthorizes, reauthorizing, semiauthoritarian, semiauthorized, superauthoritarian, unauthorised, unauthoritative, unauthorized

As you can see, it’s rather difficult to draw a line here. The shared root of author and authorize is lost to most native speakers, despite having a regular suffix -ize. In contrast, the relation between notary and notarize is regular, but the relation between note and notary feels more opaque. And we’re not even considering word sense ambiguity here, a closely related problem for search.

These uncertainties in derivational stem equivalence is why many people want to restrict stemming to inflectional morphology. Inflections are forms of words, like the verb forms and noun number (e.g. -s makes a noun plural), whereas derivational morphology often converts one syntactic type to another (e.g. -ly converts and adjective to an adverb). The problem with restricting to inflectional morphology is that it’s not the right cut at the problem. Sometimes derivational stemming helps and sometimes inflectional stemming hurts.

This brings us to the second interesting paper:

Hull, David A. 1996. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science. 47(1): 70–84. [CiteSeer link to paper]

Hull goes through a couple hundred TREC queries with a fine-toothed comb, finding that it’s almost impossible to declare a single winner among stemmers. He compared the simple Porter and Lovins stemmers with more linguistically sophisticated algorithms developed at Xerox (now part of Inxight‘s offerings), and less sophisticated algorithms like reducing every word to it’s first 5 characters or just removing final s. There was simply not a clear winner. The beauty of the paper is in the careful case studies in which he contrasted stemmers. For example, query 21 contained the word superconductivity, but matches documents containing only the form superconductors , clearly requiring derivational morphology for high recall . A similar problem occurs for surrogate mother and surrogate motherhood in query 70. Another case was genetic engineering vs. genetically engineered. Another example where stemming helped was failure versus fail in the context of bank failures.

An example where stemming hurt was in client server matching serve and client, leading to numerous false positive matches; this is really a frequency argument, as servers in the computer sense do serve data. Another example was reducing fishing to fish, causing recall problems even though it’s a simple inflectional change. Other problems arose in converting privatization to private, which hurt precision.

What’s a poor computational linguist to do? One thing I’d recommend is to index both the word and its stem. If you’re using a TF/IDF-based document ranker, such as the one underlying Apache Lucene, you can index both the raw word and any number of stemmed forms in the same position and hope IDF sorts them out in the long run. That seems to work for Chinese, where folks tend to index both words and unigrams and/or bigrams because n-grams provide high recall and words provide added precision.

If anyone knows of other evaluation’s like Hull’s that involve n-gram based stemmers, I’d love to hear about them.

What is NLP-based Search?

February 12, 2007

Matthew Hurst (Latest Post) and Fernando Pereira (Latest Post) have been having an interesting discussion of the hype surrounding Powerset. The reason I call it hype is because of the New York Times‘s breathless coverage (not worth reading) and the lack of details in Powerset’s own description:

Powerset is a Silicon Valley company building a transformative consumer search engine based on natural language processing.

Our unique innovations in search are rooted in breakthrough technologies that take advantage of the structure and nuances of natural language. Using these advanced techniques, Powerset is building a large-scale search engine that breaks the confines of keyword search.

By making search more natural and intuitive, Powerset is fundamentally changing how we search the web, and delivering higher quality results.

Powerset’s search engine is currently under development. Please check back in the near future for more information about our technology and for signing up for our alpha.

I call what’s going on “hype” because I can’t tell what’s being proposed.The last thing I remember being hyped to this extent without being described was the Segway scooter.

So my question is: What does this “transformative consumer search engine” look like from a user’s perspective?

I’ll leave you with a pointer to my two comments on Matthew Hurst’s latest blog entry, NLP and Search: Free Your Mind. Matthew was suggesting that we can leverage redundancy of information and present answers, not documents (or passages). I’ll repeat my entries here:

I couldn’t agree more about doing confidence-based collective data extraction. Those who’ve participated in the DARPA bakeoffs over time have been doing just what Matthew suggests.

That’s why we built LingPipe to handle confidence-based extraction for tagging, chunking and classification. But we prefer to use an expectation-based form of inference rather than a winner-take all, which only makes sense for bakeoff entries. Most social networking and clustering software works just fine with expectations.

In real life, at least for intelligence analysts and biologists, recall is the game. The reason is the long tail. Most relations are not mentioned dozens of times in various forms. In other words, you may have to deal with the equivalent of “youtube data mining”, because language is vague.

But it’s not just recall of relations. Intelligence analysts need to link back to sources to generate their reports. Biologists won’t trust current relation extraction programs to guide their research because even reasoning cumulatively, they’re not precise enough. Too much depends on context outside of the phrase, sentence or even paragraph/document in which a fact is stated.

Could anyone help someone like me with limited imagination? I’m trying to envision what the next generation of NLP enabled search is going to look like from a user’s perspective.

Let’s say my wife and I are trying to settle a bet that arose over dinner, such as whether the lead singer of Jethro Tull was Scottish or English. I actually went to Google and used their newish pattern based search:

Google QUERY: ian anderson was born in *

Here’s the first few “answers”:

1. Paiseley, Scotland
2. 1947
3. Scotland in 1947
4. Fife in 1947
5. Philadelphia
6. Croydon before England won the world cup
7. Williston, ND
8. Nottingham on June 29, 1948
9. 1981
10. digg

Now which of those is the “right” answer? Well, first it depends on which Ian Anderson we’re talking about. There are 10 with their own Wikipedia pages.

The voting thing doesn’t help much here unless you happen to know that Dunfermline is in West Fife, which is in Scotland, which is in the United Kingdom. I’m confused about answer 1, because it’s “Paiseley”, which is spelled wrong. Wikipedia claims the answer is Dunfermline. Clearly the source is important in providing answers.