I’m very happy to see that the 2nd Edition is out:
- McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Second Edition. Manning Press.
Manning’s offering 40% off until September 30, 2010. Follow the link to the book and use code
lingpipeluc40 when you check out.
You Need This Book
If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. As far as I know, this book is the only way to come up to speed on the intricacies of Lucene.
What if I have the First Edition?
Keep it for your projects that require 2.x versions of Lucene. So much has changed in version 3 that is not backward compatible with previous versions that the first edition of this book will be of almost no use. Every example has changed. It’s pretty much a whole new book (other than the first page where it describes Lucene as simple — that used to be true, but is not any longer).
I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. (In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!)
Is it a Good Book?
I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. Now let me show you why.
The authors clearly know their Lucene. Not surprising, given that they’re three of the core developers for Lucene. (Erik Hatcher was also involved in the Ant build system, and has another Manning … in Action book for that.) Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. As such, the book’s better than good — it’s really great set of code examples presented in a coherent style and order, with easy-to-follow explanations.
The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles.
They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative).
If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!).
They cover many more advanced topics than the last book, like the range of file-level locking options and their use in a more transactional setting than basic Lucene provides itself.
Needs a Revision
Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is — all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.)
The initial two lines of a paragraph are repeated on pages xxx and xxxi. There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii.
There are mistakes in code fragments, like a missing comma in the parse() example on p. 239.
There are also bugs in code. I only studied a few programs in the term vector section (which is great) and in the Tika section. Listing 5.18 has a bug in that it calls
searcher.search() with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3).
There are dangling references like “uncomment the output lines toward the end of the
docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193).
I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it.
There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. As a result, the authors face a daunting curatorial challenge.
From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. It’s more than just stripping accents from Western characters.
Don’t get me wrong. I love JUnit. We use it for unit testing in LingPipe. And I recommend reading our tests to help understand our classes. I just don’t think it’s a good way to present examples in an intro book. On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods.
What about the Code?
Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. I have to fight my own tendency to err on the latter side. The authors favor simplicity in most cases, as they explain up front. The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. (We do the same thing in our tutorials.)
Many (but not all) methods that throw exceptions are just declared to throw
Exception. I think this does readers a disservice in not knowing what’s actually throwing what kind of exception. Others might argue it makes the code cleaner and less cluttered. The authors explain that’s their motivation and point out proper exception etiquette. We do the same thing in some of our tutorial code.
The use of generics is spotty, with allocations like
new TreeMap() that should be
new TreeMap<String,Integer> and subsequent use of non-generic
Iterator instances. As a result, they have to cast their gets and can’t use auto-boxing. This just clutters up the code using the maps and makes their intended contents opaque.
I’d recommend autoboxing, or its explicit result,
Integer.valueOf(int), over the older style
new Integer(int) (the latter always creates a new instance, whereas
valueOf() being a factory can share instances).
I’d like to see more foreach and fewer explicit iterators.
They don’t like ternary operators or min/max operators, resulting in awkward fragments like:
int size = max; if (max > hits.scoreDocs.length) size = hits.scoreDocs.length;
rather than the simpler
int size = Math.min(max,hits.scoreDocs.length);
I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!).
There is a more serious issue for floating point precision they try to deal with in listing 5.22 in the last if/else block. They try to avoid having cosines take on values outside of (-1,1), which would be illegal inputs to
acos(). Their approach is insufficient. The only way to guarantee that your natural cosine computations fall in this range is to do:
double boundedX = Math.max(-1.0, Math.min(1.0, x));
That’s another problem we ran into for our k-means clustering implementation in LingPipe. Once the vectors get big, it’s easy to get results greater than 1 for close matches.
Also, if you’re worried about efficiency, replace
Math.sqrt(x) * Math.sqrt(y) with
Math.sqrt(x * y).
I’m not a big fan of Manning’s approach to code formatting and presentation. Whole programs are printed out, often across multiple pages, and a numbered push-pin like graphic in very heavy bold is used to pick them out and cross-reference them in the text. I prefer the interleaved style I’ve been using, and you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste.
The code formatting also bugged me. Rather than breaking before assignment equal signs, they break after. They use two characters of indent, which is hard to scan. And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense).
The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. A class name like
IndexWriter is apt to show up with
Index- at the end of one line and
Writer at the start of the next. The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers).
How Much Math?
For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (Morgan Kaufmann text, so it’s prohibitively expensive). Mannnig, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines.
In the cosine example in the “Leveraging Term Vectors” (section 5.10), they define cosine in a formula without ever defining the length or dot product functions. If someone needs cosine defined for them, they probably need the definitions for basic vector operations. In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. I would’ve just stuck with largest cosine for the sort, rather than smallest angle and skipped the call to
Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation costs, though it’ll only make a difference percentage-wise if the term vectors are short).
For spell checking, it’d have been nice to get up to something like the noisy channel model. Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there.
What about Solr?
Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. This book only mentions it in passing in a page-long section 10.8.
Other Books on Lucene
I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon.
Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4.
About those Cover Illustrations
On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. I just did a search for the terms they mention,
[ottoman empire clothing 1802 william miller] and found two copies of the book for sale at a Donald Heald Books (you may need to search again). Here’s the scoop
[DALVIMART, Octavien (illustrator) - William ALEXANDER (1767-1816)].
The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French
London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). Folio (14 1/8 x 10 1/2 inches). Parallel text in French and English. Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate).
A good copy of this classic work on the costume of the Ottoman Empire.
Starting with the "Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). “The Drawings, from which … [the] plates have been engraved, were made on the spot … [in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue.
Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337.