Lucene’s Missing Term/Field Query Structure

In a follow-up to an earlier blog post, Lucene’s Missing TokenStream factory, I’d like to discuss a problem with Lucene’s query structure. The problem is not with Lucene’s query syntax, which has its own problems but gets the term/field distinction (mostly) right. Be careful, though: the query syntax has a different notion of term than the object representations in Lucene’s search package.

The underlying problem is the same as with the missing token stream factories: a missing abstraction at the plain text level that winds up conflating notions of plain text, like ranges, and notions of fielded queries, like ranges of plain text in a specified field.

At the root of the query structure problem is the index.Term class, which is constructed with a field and a textual value for that field. The simplest possible query is a search.TermQuery, which consists of a single term. So far, no problem. But what about a search.RangeQuery? It’s constructed using a pair of terms. There’s a note in the constructor that both terms must be from the same field. What’s the problem? Range queries shouldn’t be built out of two field/text pairs, but rather out of one field and two text objects.
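
To make the difference between the two shapes concrete, here is a minimal sketch in plain Java (assuming Java 16 or later for records). None of these types are Lucene classes: FieldText is a hypothetical stand-in for a field/text pair, TwoTermRange mimics a range built from two such pairs, and FieldRange is the shape argued for above.

```java
// Hypothetical stand-in for a field/text pair; not Lucene's index.Term.
record FieldText(String field, String text) {}

// Shape criticized above: two field/text pairs, so the fields can disagree
// and the constructor has to check at run time.
record TwoTermRange(FieldText lower, FieldText upper) {
    TwoTermRange {
        if (!lower.field().equals(upper.field())) {
            throw new IllegalArgumentException("both terms must be from the same field");
        }
    }
}

// Proposed shape: one field and two text values; a mismatch cannot be expressed.
record FieldRange(String field, String lowerText, String upperText) {}

final class RangeShapeDemo {
    // Returns true if the two-pair shape rejects mismatched fields.
    static boolean mismatchRejected() {
        try {
            new TwoTermRange(new FieldText("AUTHOR", "a"), new FieldText("TITLE", "m"));
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(mismatchRejected());
        System.out.println(new FieldRange("AUTHOR", "a", "m"));
    }
}
```

With the second shape there is nothing to document or check, because the type has only one place to put a field.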

Phrase queries (search.PhraseQuery) have the same problem as range queries, in that they should only apply to terms with the same field. Although the add(Term) method checks that added terms have the same field and throws an exception if they don’t, there is no warning in the javadoc.

The solution to the problem is simple. Split the structure of queries into two types, one for the text part of a query and one for restricting it to fields. It’s easiest to see in BNF, though what I’m really talking about is object structure, not the query syntax:

TextQuery ::= String                               // term query
                  | (String TO String)              // range query
                  | "String String ... String"      // phrase query
                  | (TextQuery AND TextQuery)
                  | (TextQuery OR TextQuery)

FieldedQuery ::= [Field : TextQuery]
                     | (FieldedQuery AND FieldedQuery)
                     | (FieldedQuery OR FieldedQuery)

The logical operations all distribute through fields, so a query like [AUTHOR: (Smith OR Jones)] is equivalent to ([AUTHOR:Smith] OR [AUTHOR:Jones]).
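
The grammar and the distribution rule can be sketched as an object structure. The following is one possible rendering in plain Java (assuming Java 16 or later for records and pattern matching); every type and method name here is made up for illustration and none of it is part of Lucene.

```java
import java.util.List;

// Text-level queries: no field appears anywhere in this hierarchy.
interface TextQuery {}
record TermText(String text) implements TextQuery {}
record RangeText(String lower, String upper) implements TextQuery {}
record PhraseText(List<String> words) implements TextQuery {
    PhraseText { words = List.copyOf(words); } // immutable defensive copy
}
record AndText(TextQuery left, TextQuery right) implements TextQuery {}
record OrText(TextQuery left, TextQuery right) implements TextQuery {}

// Fielded queries: a field is attached exactly once, around a text query.
interface FieldedQuery {}
record Fielded(String field, TextQuery text) implements FieldedQuery {}
record AndFielded(FieldedQuery left, FieldedQuery right) implements FieldedQuery {}
record OrFielded(FieldedQuery left, FieldedQuery right) implements FieldedQuery {}

final class FieldDistributionDemo {
    // Pushes the field inward through AND/OR, so that
    // [f: (a OR b)] becomes ([f:a] OR [f:b]).
    static FieldedQuery distribute(String field, TextQuery q) {
        if (q instanceof AndText a) {
            return new AndFielded(distribute(field, a.left()), distribute(field, a.right()));
        }
        if (q instanceof OrText o) {
            return new OrFielded(distribute(field, o.left()), distribute(field, o.right()));
        }
        return new Fielded(field, q); // term, range, or phrase: attach the field here
    }

    public static void main(String[] args) {
        TextQuery names = new OrText(new TermText("Smith"), new TermText("Jones"));
        System.out.println(distribute("AUTHOR", names));
    }
}
```

Because records compare structurally, the result of distributing AUTHOR over (Smith OR Jones) is equal to the hand-built ([AUTHOR:Smith] OR [AUTHOR:Jones]).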

An advantage the query object representation has over the query syntax is that there are no default fields. Lucene’s query syntax, on the other hand, allows a fielded query to consist of a bare term query. The default field given to the query parser when it is constructed, alongside the analyzer, then fills in the missing field.

The class search.WildcardQuery is equally problematic in that it takes a term as an argument. It’s overloading the textual value of the term to represent not a term, but a string with structure, including special characters for the multiple-character (*) and single-character (?) wildcards. But what if I want a question mark or asterisk in my term itself? The query syntax handles this problem, but not the object representation. The classes search.PrefixQuery and search.FuzzyQuery have the same problem of conflating terms with a syntax for search.
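
One way around the overloading is to represent a wildcard pattern as structure rather than as a string with reserved characters. The sketch below is only an illustration in plain Java (Java 16 or later): Literal, Wild and WildcardText are hypothetical names, and matching is delegated to java.util.regex rather than to anything in Lucene.

```java
import java.util.List;
import java.util.regex.Pattern;

// A pattern is a sequence of parts; literals carry no special characters.
interface WildPart {}
record Literal(String text) implements WildPart {}
enum Wild implements WildPart { ANY_CHAR, ANY_STRING }

final class WildcardText {
    private final List<WildPart> parts;

    WildcardText(WildPart... parts) {
        this.parts = List.of(parts);
    }

    // Translates the parts to a regex, quoting literals so that
    // '?' and '*' inside a Literal match themselves.
    Pattern toRegex() {
        StringBuilder sb = new StringBuilder();
        for (WildPart p : parts) {
            if (p instanceof Literal lit) {
                sb.append(Pattern.quote(lit.text()));
            } else if (p == Wild.ANY_CHAR) {
                sb.append('.');
            } else {
                sb.append(".*");
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    boolean matches(String term) {
        return toRegex().matcher(term).matches();
    }
}
```

For example, new WildcardText(new Literal("what?"), Wild.ANY_STRING) matches "what? now" but not "whatX now", because the question mark belongs to the literal and is never read as a wildcard.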

Phrase queries and boolean queries have the additional problem of being mutable, which makes their use in collections problematic (just as using mutable collections within other collections is problematic). For instance, if you construct a phrase query, its notion of equality and hash code change as more terms are added, because equality is defined to require having an equal list of terms.
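
The collection problem is easy to reproduce with any mutable object whose hash code depends on its contents. The class below is a hypothetical stand-in written for this example, not Lucene’s PhraseQuery; it only mimics the relevant behavior, namely equality and hash code defined over a growing list of words.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a mutable phrase query; not Lucene's PhraseQuery.
final class MutablePhrase {
    private final List<String> words = new ArrayList<>();

    void add(String word) {
        words.add(word);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MutablePhrase && words.equals(((MutablePhrase) o).words);
    }

    @Override
    public int hashCode() {
        return words.hashCode(); // changes whenever a word is added
    }
}

final class MutableKeyDemo {
    // Returns whether the set still finds the phrase after it has been mutated.
    static boolean foundAfterMutation() {
        Set<MutablePhrase> seen = new HashSet<>();
        MutablePhrase phrase = new MutablePhrase();
        phrase.add("new");
        seen.add(phrase);      // stored under the hash code of ["new"]
        phrase.add("york");    // hash code is now that of ["new", "york"]
        return seen.contains(phrase);
    }

    public static void main(String[] args) {
        // Prints false: the lookup uses the new hash code, the set recorded the old one.
        System.out.println(foundAfterMutation());
    }
}
```

The object is still in the set, but it can no longer be found there. An immutable query type, built once from its complete list of terms, avoids the problem.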
