McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review)

I’m very happy to see that the 2nd Edition is out:

  • McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Second Edition. Manning Press.

Manning’s offering 40% off until September 30, 2010. Follow the link to the book and use code lingpipeluc40 when you check out.

You Need This Book

If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. As far as I know, this book is the only way to come up to speed on the intricacies of Lucene.

What if I have the First Edition?

Keep it for your projects that require 2.x versions of Lucene. So much of version 3 is not backward compatible with earlier versions that the first edition of this book will be of almost no use. Every example has changed. It’s pretty much a whole new book (other than the first page where it describes Lucene as simple — that used to be true, but is not any longer).

I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. (In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!)
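
For what it’s worth, here’s roughly what I mean by enumerating tokens, a minimal sketch of my own against the Lucene 3.0 attribute-based TokenStream API (the analyzer, field name, and text are just placeholders):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ListTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // the field name is a dummy; tokenStream() just needs one
        TokenStream stream = analyzer.tokenStream("text", new StringReader("The Quick Brown Fox"));
        TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken())          // advance to the next token
            System.out.println(termAtt.term());  // print its term text
        stream.end();
        stream.close();
    }
}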

Is it a Good Book?

I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. Now let me show you why.

The authors clearly know their Lucene. Not surprising, given that they’re three of the core developers for Lucene. (Erik Hatcher was also involved in the Ant build system, and has another Manning … in Action book for that.) Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. As such, the book’s better than good — it’s a really great set of code examples presented in a coherent style and order, with easy-to-follow explanations.

The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles.
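
For a flavor of the starting level, here’s a bare-bones index-and-search sketch of my own against the Lucene 3.0 API (not a listing from the book; the field name, document, and query are made up):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class HelloLucene {
    public static void main(String[] args) throws IOException {
        Directory dir = new RAMDirectory();  // in-memory index for the demo

        // index a single document with one analyzed, stored field
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Lucene in Action, Second Edition",
                          Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // search it with a simple single-term query
        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
        for (ScoreDoc sd : hits.scoreDocs)
            System.out.println(searcher.doc(sd.doc).get("title"));
        searcher.close();
    }
}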

They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview of the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative).
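
For readers who haven’t seen SAX before, the callback pattern they introduce boils down to something like this (a generic JDK-only sketch of mine, not the book’s or Tika’s actual code; the element name is arbitrary):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandler extends DefaultHandler {
    private final StringBuilder buf = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) inTitle = true;   // start buffering inside <title>
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) buf.append(ch, start, length);  // collect character data
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            System.out.println("title = " + buf);
            inTitle = false;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new java.io.File(args[0]), new TitleHandler());
    }
}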

If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!).

They cover many more advanced topics than the first edition, like the range of file-level locking options and their use in a more transactional setting than basic Lucene itself provides.

Needs a Revision

Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is — all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.)

The initial two lines of a paragraph are repeated on pages xxx and xxxi. There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii.

There are mistakes in code fragments, like a missing comma in the parse() example on p. 239.

There are also bugs in code. I only studied a few programs in the term vector section (which is great) and in the Tika section. Listing 5.18 has a bug in that it calls searcher.search() with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3).

There are dangling references like “uncomment the output lines toward the end of the docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193).

I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it.

What’s Missing?

There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. As a result, the authors face a daunting curatorial challenge.

From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. It’s more than just stripping accents from Western characters.
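
To make the gap concrete: the same word can be encoded as different code point sequences, and without normalization on both the indexing and query side they simply won’t match. A minimal JDK-only illustration (ICU offers much richer options):

import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        String composed   = "caf\u00E9";   // 'é' as a single precomposed code point
        String decomposed = "cafe\u0301";  // 'e' followed by a combining acute accent

        // without normalization, the two spellings of the same word don't match,
        // so a query in one form misses documents indexed in the other
        System.out.println(composed.equals(decomposed));  // false

        // normalizing both sides to the same form (here NFC) fixes the mismatch
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b));                  // true
    }
}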

JUnit Examples

Don’t get me wrong. I love JUnit. We use it for unit testing in LingPipe. And I recommend reading our tests to help understand our classes. I just don’t think it’s a good way to present examples in an intro book. On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods.
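
By the newer style I mean JUnit 4’s annotations; for comparison, a trivial test of my own in that style looks like this (the class under test is just a stand-in):

import static org.junit.Assert.assertEquals;
import org.junit.Before;
import org.junit.Test;

public class StringBuilderTest {
    private StringBuilder buf;

    @Before                 // replaces the JUnit 3 setUp() naming convention
    public void createBuffer() {
        buf = new StringBuilder();
    }

    @Test                   // any method name works; no "test" prefix or TestCase superclass
    public void appendThenLength() {
        buf.append("lucene");
        assertEquals(6, buf.length());
    }
}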

What about the Code?

Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. I have to fight my own tendency to err on the latter side. The authors favor simplicity in most cases, as they explain up front. The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. (We do the same thing in our tutorials.)

Many (but not all) methods that throw exceptions are just declared to throw Exception. I think this does readers a disservice, since it hides what’s actually throwing what kind of exception. Others might argue it makes the code cleaner and less cluttered. The authors explain that’s their motivation and point out proper exception etiquette. We do the same thing in some of our tutorial code.
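
For example, a little search helper usually only needs to declare the two checked exceptions Lucene can actually throw here, rather than the catch-all (a sketch of mine against the Lucene 3.0 query parser; the field name and method are made up):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

public class SearchHelper {
    // declares what can actually go wrong: bad query syntax or index I/O
    public static TopDocs search(IndexSearcher searcher, Analyzer analyzer,
                                 String queryString, int maxHits)
            throws ParseException, IOException {
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
        return searcher.search(parser.parse(queryString), maxHits);
    }
}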

The use of generics is spotty, with allocations like new TreeMap() that should be new TreeMap<String,Integer> and subsequent use of non-generic Iterator instances. As a result, they have to cast their gets and can’t use auto-boxing. This just clutters up the code using the maps and makes their intended contents opaque.

I’d recommend autoboxing, or its explicit equivalent Integer.valueOf(int), over the older style new Integer(int) (the latter always creates a new instance, whereas valueOf(), being a factory, can share instances).

I’d like to see more foreach and fewer explicit iterators.
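
Putting those three suggestions together, a term-counting map comes out something like this (my rewrite of the general pattern, not any particular listing):

import java.util.Map;
import java.util.TreeMap;

public class TermCounts {
    public static void main(String[] args) {
        // generics make the intended contents explicit: term -> count
        Map<String,Integer> counts = new TreeMap<String,Integer>();
        for (String term : new String[] { "lucene", "action", "lucene" }) {
            Integer n = counts.get(term);              // no cast needed
            counts.put(term, n == null ? 1 : n + 1);   // autoboxing calls Integer.valueOf()
        }
        // foreach over the entry set instead of an explicit Iterator
        for (Map.Entry<String,Integer> e : counts.entrySet())
            System.out.println(e.getKey() + "=" + e.getValue());
    }
}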

They don’t like ternary operators or min/max operators, resulting in awkward fragments like:

int size = max;
if (max > hits.scoreDocs.length) size = hits.scoreDocs.length;

rather than the simpler

int size = Math.min(max,hits.scoreDocs.length);

I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!).

There is a more serious floating-point precision issue that they try to deal with in listing 5.22, in the last if/else block. They try to avoid having cosines take on values outside of [-1,1], which would be illegal inputs to acos(). Their approach is insufficient. The only way to guarantee that your computed cosine values fall in this range is to do:

double boundedX = Math.max(-1.0, Math.min(1.0, x));

That’s another problem we ran into for our k-means clustering implementation in LingPipe. Once the vectors get big, it’s easy to get results greater than 1 for close matches.

Also, if you’re worried about efficiency, replace Math.sqrt(x) * Math.sqrt(y) with Math.sqrt(x * y).
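
Putting the two fixes together, a defensive cosine/angle computation looks something like this (my sketch of the general fix, not the book’s listing 5.22):

public class Cosine {
    // angle in radians between term vectors x and y (same length, neither all zeros)
    public static double angle(double[] x, double[] y) {
        double xy = 0.0, xx = 0.0, yy = 0.0;
        for (int i = 0; i < x.length; ++i) {
            xy += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        double cos = xy / Math.sqrt(xx * yy);                 // one sqrt instead of two
        double bounded = Math.max(-1.0, Math.min(1.0, cos));  // clamp away rounding error
        return Math.acos(bounded);                            // always a legal input now
    }
}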

Layout, Etc.

I’m not a big fan of Manning’s approach to code formatting and presentation. Whole programs are printed out, often across multiple pages, and a numbered push-pin-like graphic in very heavy bold is used to pick lines out and cross-reference them in the text. I prefer the interleaved style I’ve been using, which you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste.

The code formatting also bugged me. Rather than breaking before assignment equal signs, they break after. They use two characters of indent, which is hard to scan. And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense).

The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. A class name like IndexWriter is apt to show up with Index- at the end of one line and Writer at the start of the next. The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers).

How Much Math?

For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (a Morgan Kaufmann text, so it’s prohibitively expensive). Manning, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing, such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines.

In the cosine example in the “Leveraging Term Vectors” section (5.10), they define cosine in a formula without ever defining the length or dot product functions. If someone needs cosine defined for them, they probably need the definitions for basic vector operations. In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. I would’ve just stuck with largest cosine for the sort, rather than smallest angle, and skipped the call to Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation cost, though it’ll only make a difference percentage-wise if the term vectors are short).
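
For the record, the missing definitions are one line each; in LaTeX notation,

\cos(x,y) = \frac{x \cdot y}{\|x\| \, \|y\|},
\qquad x \cdot y = \sum_i x_i y_i,
\qquad \|x\| = \sqrt{x \cdot x}.

Sorting by largest cosine gives exactly the same order as sorting by smallest angle, because acos is monotonically decreasing on [-1,1].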

For spell checking, it’d have been nice to get up to something like the noisy channel model. Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there.
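
In a nutshell, the noisy channel model picks the correction for an observed (possibly misspelled) word w by, in LaTeX notation,

\hat{c} = \arg\max_c \, P(c) \, P(w \mid c),

where P(c) is a frequency-based language model over candidate corrections and P(w \mid c) is an error (channel) model, typically built from weighted edit distances. Roughly speaking, a purely word-based suggester keeps only the string-similarity side and ignores the frequency prior.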

What about Solr?

Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. This book only mentions it in passing in a page-long section 10.8.

Other Books on Lucene

I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon.

Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4.

About those Cover Illustrations

On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. I just did a search for the terms they mention, [ottoman empire clothing 1802 william miller], and found two copies of the book for sale at Donald Heald Books (you may need to search again). Here’s the scoop:

[DALVIMART, Octavien (illustrator) – William ALEXANDER (1767-1816)].

The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French

London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). Folio (14 1/8 x 10 1/2 inches). Parallel text in French and English. Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate).

A good copy of this classic work on the costume of the Ottoman Empire.

Starting with the “Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). “The Drawings, from which … [the] plates have been engraved, were made on the spot … [in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue.

Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337.

9 Responses to “McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review)”

  1. Andrew Clegg Says:

    Haven’t read the Lucene one, but a sidenote on Solr 1.4 Enterprise Search Server:

    I wasn’t massively impressed with it, to be honest. Not much there that’s not in LucidWorks’ free PDF manual, and it could have done with more real-world examples of tuning Solr for different kinds of application.

  2. Lucid Imagination » Get in on the Lucene action: “You need this book”. Says:

    […] Lucene in Action, Second Edition continues to harvest Kudos from people who ought to know. Bob Carpenter of Lingpipe found a lot to like: […]

  3. Robert Muir Says:

    Bob, I’m not sure the authors left anything out with regard to Unicode normalization etc support, because really Lucene 3.0 lacks this functionality.

    I’d like to think the situation is improving for the next release though (any improvements to docs are appreciated): http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/icu/src/java/overview.html?revision=941706&view=co

    • lingpipe Says:

      That’s great news. I love ICU. And it looks like you’re really going to town in integrating it.

      In the end, I found the general transliteration toolkit, with its specification language allowing filtering and chaining, to be the most useful. With that, you don’t really need the built-in normalizers.

      I provide an example of using ICU transliteration-based normalization for tokenization in the book draft for LingPipe I posted, and show how to adapt LingPipe tokenizer factories to Lucene analyzers and vice versa.
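
      For concreteness, the kind of chained, filtered spec I mean is something like the standard accent-stripping idiom (a hedged ICU4J sketch of mine, not LingPipe or Lucene code):

      import com.ibm.icu.text.Transliterator;

      public class FoldDemo {
          public static void main(String[] args) {
              // decompose, drop combining marks, recompose, then lower-case
              Transliterator fold = Transliterator.getInstance(
                  "NFD; [:Nonspacing Mark:] Remove; NFC; Lower");
              System.out.println(fold.transliterate("Gospodnetić"));  // prints gospodnetic
          }
      }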

      As to the Lucene book, it covers lots of stuff that’s not part of Lucene itself, so that’s not necessarily an issue. It’s just more an issue of priorities. I’m sure everyone they asked has a pet issue they wished got more coverage. And there’s an advantage to a shorter book over a huge one.

      We have the same problem for LingPipe — there’s really not anything Unicode-specific built in other than being based on Java char values. We don’t want a dependency on ICU, but it’s just so darn useful!

      • rcmuir Says:

        (Sorry, it’s a bit off-topic), but I must disagree that Transliterator makes the built-in normalization support obsolete.

        At least for search, in most cases you want filtering/folding to give a ‘stable’ result: fold(fold(x)) = fold(x). For non-trivial foldings, this can be tricky to achieve with Transliterator, and might require many redundant passes to ::NFC()/::NFD() etc, because your rules might denormalize the intermediate result before the next chain.

        Additionally with many rules and chaining, Transliterator can be very slow.

        This is why, for performance and simplicity, defining your folding as a custom normalization can be substantially better if you can get away with it.

        We support Transliterator filtering too in Lucene, but the “fuzzy search term folding” (diacritics removal etc) is instead implemented as a custom normalization form.

        This way all foldings are instead actually integrated into the normalization process, computed recursively when compiling the data file (not at runtime: it’s a single processing step), supporting .quickCheck (realizing that in general, lots of data will already be “folded”), and works on CharSequence and Appendable.

        You can read more about the new normalization API here: http://userguide.icu-project.org/transforms/normalization
        http://site.icu-project.org/design/normalization/custom

      • lingpipe Says:

        Thanks for the detailed report.

        I hadn’t gotten as far as ICU’s custom normalization, which looks as nicely designed as the rest of ICU. I can see where caching would be helpful if normalization’s slow. I hadn’t timed the transliterators versus normalizers.

        I completely agree that stability (i.e. idempotency) of the fold transform is critical. In the exact same way as matching analyzers in index building and querying.

        I had thought the basic decompositions (and recompositions) were stable in just this way.

        If the composition with some other operation is a problem (say stoplisting and stemming working differently in different orders in terms of stability), then it seems like using the normalizer will just push that problem down into its implementation.

        There’s really no way to automatically check that something is stable. If you do a really antagonistic transform, say converting “0” to “1” and “1” to “0”, you’ll never stabilize by repeated application. You could even let the string being transformed correspond to a Turing machine tape and define the transform to be one step of evaluating the Turing machine — then you get the halting problem trying to determine if the transform will stabilize!

  4. rcmuir Says:

    Bob, you are right that the basic compositions and decompositions are stable. It’s just that you end up with, at the least, ::NFD() at the beginning of every rule set and ::NFC() at the end, or similar.

    Separately, I think the custom normalization does a pretty good job at detecting problems. For example I created a rule file with your example “0” to “1” and “1” to “0” and it refused to compile:

    gennorm2 error: U+0030 maps to itself directly or indirectly

    • lingpipe Says:

      That’s very cool. I always have to remind myself that just because the general problem’s undecidable, you can often go a long way toward addressing real problems people have.

      The error message is helpful as is, but would be even more helpful if it indicated the true problem of flip-flopping rather than mapping to itself.

      I thought the problem would be running NFD (decomposing), then something like an Arabic to Roman transliteration, which may reintroduce accented characters and require NFD to be run again (this is a silly example; you’d transliterate first then decompose).

      • rcmuir Says:

        well, not too silly of an example. you are gonna want to decompose for your transliteration before too, unless you want to duplicate rules (e.g. 0623 versus 0627+0654).

        but my only point was, if what you are doing is more like a normalization process, the normalizer API has many advantages (performance just being one of them) and is probably a much better choice.

        however, if you are actually doing transliteration, the transliterator API is best.
