[Update: 10 Feb 2014. A newer discussion of databases and Lucene 4 is available in the chapter on Lucene in the book
This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]
This question comes up pretty often on the Lucene mailing lists. Assuming that’s an inclusive OR in the question, we say yes! to both. In many situations this is not just a clever answer, it’s the correct one. While the two systems can both be used as a data store, there are some things that a database can do better than Lucene, and some things that Lucene can do that a database can’t do at all.
Lucene provides full-text search over a collection of data that returns results in order of relevancy. If search over (lots of) text data is a non-negotiable requirement, then a date with Lucene (or one of its cousins) is in your future, because this is something that most database systems do badly, if at all. Here’s a nice critique of this feature in MySQL (an otherwise fine RDBMS).
Lucene provides search and indexing over Document objects. A Document is a set of Fields. Indexing a Field processes its contents into one or more Term objects which contain the Field name and the Term string value. Lucene maintains an inverted index which maps terms into documents. Queries against the index are scored by TF-IDF, a measure of document relevance. A Lucene Analyzer defines how index terms are extracted from text. The developer can define their own Analyzer, or use one of the pre-existing analyzers in the Lucene API, which can do tokenization for many different languages, as well as stop-listing of common words, case normalization, stemming, and more. For a nice example of a custom analyzer, see Bob’s chapter on “Orthographic variation with Lucene” in the Lucene in Action book.
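To make the moving parts concrete, here is a minimal sketch in plain Python of an analyzer feeding a fielded inverted index with a TF-IDF-style score. It is not Lucene's API or its actual scoring formula: the names analyze, ToyIndex and STOP_WORDS, the stop list, and the weighting are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop list; real analyzers ship much larger ones.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "is"}


def analyze(text):
    """Toy analyzer: tokenize, lowercase, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


class ToyIndex:
    """Inverted index mapping (field, term) to the documents containing it."""

    def __init__(self):
        # (field name, term string) -> {doc_id: term frequency}
        self.postings = defaultdict(dict)
        self.num_docs = 0

    def add_document(self, doc):
        """Index a document given as a dict of field name -> text."""
        doc_id = self.num_docs
        self.num_docs += 1
        for field, text in doc.items():
            for term, tf in Counter(analyze(text)).items():
                self.postings[(field, term)][doc_id] = tf
        return doc_id

    def search(self, field, query):
        """Return (doc_id, score) pairs, best first, using a simplified TF-IDF."""
        scores = defaultdict(float)
        for term in analyze(query):
            hits = self.postings.get((field, term), {})
            if not hits:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log(self.num_docs / len(hits)) + 1.0
            for doc_id, tf in hits.items():
                scores[doc_id] += math.sqrt(tf) * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = ToyIndex()
index.add_document({"title": "The gene expression of mice"})
index.add_document({"title": "Gene therapy and gene editing"})
index.add_document({"title": "Protein folding in yeast"})
results = index.search("title", "gene editing")
```

The second document ranks first because it matches both query terms, and the rarer term "editing" carries more weight than "gene".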
Relational databases store data as rows and columns in tables. A table declaration specifies the columns, the type of data stored in each column, and constraints over columns. The DBMS enforces these constraints as rows are added to the tables. In Lucene there is no notion of document type. A document is a set of fields, and processing a document consists of processing those fields into fielded terms that are added to the index. Lucene was designed for rapid indexing of documents. Checking document completeness is the responsibility of the calling application, ditto enforcing cross-document constraints. In the latter case, trying to enforce a uniqueness constraint on a term in the index is likely to impair indexing speed.
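The contrast can be sketched with Python's built-in sqlite3 module standing in for the DBMS and a plain list of dictionaries standing in for a schemaless document store. The citation table, its columns, and the is_complete helper are hypothetical; the point is only where the checking happens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citation ("
    " pubmed_id INTEGER PRIMARY KEY,"  # uniqueness enforced by the DBMS
    " title TEXT NOT NULL)"            # completeness enforced by the DBMS
)
conn.execute("INSERT INTO citation VALUES (1, 'Gene therapy')")

# The database rejects a duplicate key and a missing title on insert.
rejected = []
for row in [(1, "Duplicate id"), (2, None)]:
    try:
        conn.execute("INSERT INTO citation VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

# A schemaless store accepts whatever fields it is handed.
documents = []
documents.append({"pubmed_id": "1", "title": "Gene therapy"})
documents.append({"pubmed_id": "1"})  # duplicate id, no title: still accepted


def is_complete(doc, required=("pubmed_id", "title")):
    """Completeness check that the calling application has to run itself."""
    return all(doc.get(field) for field in required)
```

Here both bad rows end up in rejected, while the document list grows to two entries and only is_complete notices that the second one lacks a title.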
Searches against the two differ in both the search operations and the search results. Database queries are specified in SQL, which allows for composition of data from different tables via JOIN, complex boolean selectional restrictions, and ordering of results. Search over a database returns a ResultSet object containing a table that is a new view on the data produced by executing the query; that is, it may contain data drawn from several different underlying tables.
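As an illustration of such a query, the sketch below uses Python's sqlite3 module with two made-up tables, customer and invoice. The result rows combine columns from both tables, filtered by a boolean condition and ordered by amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice VALUES (10, 1, 250.0), (11, 1, 75.0), (12, 2, 900.0);
""")

# Each result row is a new view assembled from two underlying tables.
rows = conn.execute("""
    SELECT c.name, i.id, i.amount
    FROM invoice AS i
    JOIN customer AS c ON c.id = i.customer_id
    WHERE i.amount > 100 AND c.name <> 'Initech'
    ORDER BY i.amount DESC
""").fetchall()
```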
Lucene search is over the term index, and the results returned are a set of pairs of documents and scores. Lacking explicit document types, a query is over some number of fields. Lucene has its own query language, and the package org.apache.lucene.search provides classes for building queries programmatically. Lucene uses the score to limit the number of documents returned. Overriding this behavior requires getting all matching documents for a query, which can be very expensive. Queries designed to find things like dates or numbers within some range can also be inefficient.
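The difference between keeping only the best-scoring hits and materializing every match can be sketched in a few lines of Python. This is a generic top-n selection using heapq, not a description of Lucene's internals; the function names and the synthetic scores are assumptions.

```python
import heapq


def top_hits(scored_docs, n):
    """Keep only the n best (doc_id, score) pairs without sorting everything."""
    return heapq.nlargest(n, scored_docs, key=lambda pair: pair[1])


def all_hits(scored_docs):
    """Rank every match: the expensive path when the hit list is large."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)


# Synthetic (doc_id, score) pairs with distinct scores.
scored = [("doc-%d" % i, (i * 37) % 1009 / 1009.0) for i in range(1000)]
best = top_hits(scored, 3)
```

Both functions agree on the first three results, but top_hits only ever holds n candidates at a time, while all_hits has to rank the complete match set.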
So if you want to provide relevance-based text search over a heterogeneous data store that has both a transactional component, such as an accounting system, and a large amount of stored text, you’re going to have to use both Lucene and a database. The questions to ask are:
- What is the mapping between the database tables and views and Lucene documents?
- What text, if any, needs to be stored in the index?
- How do I keep my Lucene index in sync with the database?
The Scaling Web blog has a nice case study. They have some good tips about how to conflate information from multiple table columns into good fielded documents, and how best to use IndexWriter objects in order to keep the index up-to-date with the database.
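One possible shape for the mapping and sync questions is sketched below in Python with sqlite3. It is not the approach from the case study: the schema, the version column used as a change counter, and the dictionary standing in for the search index are all assumptions. Columns from two tables are conflated into one fielded document, and each sync pass re-indexes only the rows changed since the previous pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; `version` is an assumed counter bumped on every write.
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE article (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES author(id),
        title TEXT NOT NULL,
        body TEXT NOT NULL,
        version INTEGER NOT NULL
    );
    INSERT INTO author VALUES (1, 'Ada');
    INSERT INTO article VALUES (100, 1, 'Lucene basics', 'Indexing text.', 1);
    INSERT INTO article VALUES (101, 1, 'SQL basics', 'Joining tables.', 2);
""")

index = {}  # stand-in for a search index: primary key -> fielded document


def row_to_document(row):
    """Conflate columns from several tables into one fielded document."""
    article_id, author, title, body = row
    return {"id": str(article_id), "author": author,
            "text": title + "\n" + body}


def sync(conn, index, last_version):
    """Re-index rows changed since last_version; return the new high-water mark."""
    rows = conn.execute("""
        SELECT a.id, au.name, a.title, a.body, a.version
        FROM article AS a JOIN author AS au ON au.id = a.author_id
        WHERE a.version > ?
        ORDER BY a.version
    """, (last_version,)).fetchall()
    for *fields, version in rows:
        doc = row_to_document(fields)
        index[doc["id"]] = doc  # replace any older copy with the same key
        last_version = version
    return last_version


seen = sync(conn, index, 0)
conn.execute(
    "UPDATE article SET body = 'Indexing and search.', version = 3 WHERE id = 100"
)
seen = sync(conn, index, seen)
```

Storing the database primary key as a field in each document is what makes the replace step possible. This sketch does not handle deleted rows, which would need their own bookkeeping.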
Of course there are situations where just Lucene or just a database is a sufficient data store. In the LingMed sandbox project we use Lucene to store a local version of the MEDLINE citation index. The challenge there is making sure that updates to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid search by PubMed ID. The citation title and abstract are also indexed into their own fields, and we can use Lucene to find out how many different documents contain a word in the article title or abstract – that is, we can quickly and easily calculate the document frequency for an item.
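The two lookups described here, fetching a citation by its unique ID and counting the documents that contain a word, can be sketched in plain Python. The citations, their IDs, and the whitespace tokenization are invented for illustration and say nothing about how LingMed stores its index.

```python
from collections import defaultdict

# Made-up citations with made-up IDs.
citations = [
    {"pmid": "1001", "title": "BRCA1 mutations in breast cancer",
     "abstract": "We study BRCA1 carriers."},
    {"pmid": "1002", "title": "Yeast protein folding",
     "abstract": "Folding of a BRCA1 homolog."},
    {"pmid": "1003", "title": "Mouse models of cancer",
     "abstract": "Tumor growth in mice."},
]

# Unique ID -> citation, for direct lookup.
by_pmid = {c["pmid"]: c for c in citations}

# Word -> set of citation IDs whose title or abstract contains it.
doc_ids_by_word = defaultdict(set)
for c in citations:
    for field in ("title", "abstract"):
        for word in c[field].lower().split():
            doc_ids_by_word[word.strip(".,;:")].add(c["pmid"])


def document_frequency(word):
    """Number of citations containing the word in title or abstract."""
    return len(doc_ids_by_word.get(word.lower(), ()))
```

Because each word maps to a set of IDs, a citation that uses the word in both its title and its abstract is still counted once.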
LingMed uses a LingPipe Chunker and a LanguageModel to find all mentions of genes in MEDLINE citations and assign a score to each. A MySQL database is the appropriate data store because we need to keep track of a many-to-many relationship between genes and articles (each gene mention), as well as information about the article itself. We don’t need sophisticated text search over this data, nor do we want to rank our results by TF-IDF; we are using LingPipe to compute our own scores, and use SQL’s “ORDER BY” clause to retrieve the best-scoring items.
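A many-to-many layout of this kind might look like the following sketch, again using Python's sqlite3 module in place of MySQL. The table names, columns, and scores are invented and are not LingMed's actual schema. The gene_mention table links genes to articles and carries the externally computed score that ORDER BY ranks on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.executescript("""
    CREATE TABLE article (pubmed_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT NOT NULL UNIQUE);
    CREATE TABLE gene_mention (
        gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
        pubmed_id INTEGER NOT NULL REFERENCES article(pubmed_id),
        score REAL NOT NULL
    );
    INSERT INTO article VALUES (1, 'Article one'), (2, 'Article two'),
                               (3, 'Article three');
    INSERT INTO gene VALUES (1, 'BRCA1'), (2, 'TP53');
    INSERT INTO gene_mention VALUES (1, 1, 0.91), (1, 2, 0.40),
                                    (2, 2, 0.77), (1, 3, 0.65);
""")

# Best-scoring articles for one gene, ranked by our own score, not TF-IDF.
best = conn.execute("""
    SELECT a.pubmed_id, a.title, m.score
    FROM gene_mention AS m
    JOIN gene AS g ON g.gene_id = m.gene_id
    JOIN article AS a ON a.pubmed_id = m.pubmed_id
    WHERE g.symbol = ?
    ORDER BY m.score DESC
    LIMIT 2
""", ("BRCA1",)).fetchall()
```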
The correct answer to the question “Lucene or a Database?” always depends on the specifics of the situation. Furthermore, as more functionality is added to Lucene (and family), the question is worth revisiting, and is revisited fairly frequently on the Lucene mailing lists. These questions are usually accompanied by a specific use case, and the comments of the Lucene contributors provide good guidelines which should help you find the answer that is right for your application.