Book: Building Search Applications: Lucene, LingPipe and Gate

There’s a book about LingPipe!

The title is linked to the Amazon page; it’s also available as an inexpensive download from Lulu.

The Bottom Line

The subtitle “A practical guide to building search applications using open source software” pretty much sums it up (comment added June 22, 2008: please see Seth Grimes’s comment below about LingPipe’s royalty-free license not being compatible with other open-source licenses). It takes a reader who knows Java, but nothing at all about search or the associated text processing algorithms, and provides a hands-on, step-by-step guide to building a state-of-the-art search engine.

I (Bob) gave Manu feedback on a draft, but there wasn’t much to correct on the LingPipe side, so I can vouch for the book’s technical accuracy. (Disclaimer: I didn’t actually try to run the code.)

Chapter by Chapter Overview

After (1) a brief discussion of application issues, the chapters cover (2) tokenization in all three frameworks, (3) indexing with Lucene, (4) searching with Lucene, (5) sentence extraction, part-of-speech tagging, interesting/significant phrase extraction, and entity extraction with LingPipe and Gate, (6) clustering with LingPipe, (7) topic and language classification with LingPipe, (8) enterprise and web search, page rank/authority calculation, and crawling with Nutch, (9) tracking news, sentiment analysis with LingPipe, and detecting offensive content and plagiarism, and finally, (10) future directions including vertical search, tag-based search, and question answering.

For those wanting introductions to the LingPipe APIs mentioned above, Konchady’s book is a gentler starting point than our own tutorials.

That may sound like a whole lot of ground to cover in 400 pages, but Konchady pulls the reader along by illustrating everything with working code and not getting bogged down in theoretical boundary conditions. There are pointers to theory, and a bit of math where necessary, but the book never loses sight of its goal of providing a practical introduction. In that way, it’s like the Manning in Action series.

The book’s hot off the presses, so it’s up to date with Lucene 2.3 and LingPipe 3.3.

About the Author

Manu Konchady’s an old hand at search and text processing. You may remember him from such books as Text Mining Application Programming and High Performance Parallel Algorithms for Scientific Computing with Application to a Coupled Ocean Model.

3 Responses to “Book: Building Search Applications: Lucene, LingPipe and Gate”

  1. Seth Grimes Says:

    Breck, Bob, do you consider Lingpipe to be open source? You provide all the Java source code (correct?), but the convention is that “open source” means more than that. It seems to me that the license (http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt) does not meet the OSI’s criteria as an open-source license, and it certainly doesn’t fit GNU criteria. Review the OSI criteria at http://www.opensource.org/docs/osd

    “1. Free Redistribution

    “The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.”

    Lingpipe doesn’t meet this criterion. Note your license clause 2: “You may copy or modify the Software or use any output of the Software (i) for internal non-production trial, testing and evaluation of the Software, or (ii) in connection with any product or service you provide to third parties for free.”

    The OSI criteria also include:

    “9. License Must Not Restrict Other Software

    “The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software.”

    Paragraph 4 of the Lingpipe license is contrary to this:

    “4. Whether you distribute the Software or not, if you distribute any computer program that is not the Software, but that (a) is distributed in connection with the Software or contains any part of the Software, (b) causes the Software to be copied or modified (i.e., ran, used, or executed), such as through an API call, or (c) uses any output of the Software, then you must distribute that other computer program under a license defined as a Free Software License by the Free Software Foundation or an Approved Open Source License by the Open Source Initiative.”

    If someone were distributing other software with a license that included this latter language, that person wouldn’t be allowed to distribute royalty-free Lingpipe along with it!

    How about simply adopting a recognized OS license for Lingpipe?

    Seth

  2. Bob Carpenter Says:

    Seth:

    (I just fact-checked myself and had to edit this entry again. I (Bob) don’t think much about the licensing issues.)

    You’re right. Our royalty-free license isn’t accepted by any of the open-source license clearing houses.

    There’s a discussion of “open source” on our FAQ. It’s definitely been a point of confusion, and perhaps we should just remove any reference to the term “open source”.

    We don’t have any immediate plans to release LingPipe under a recognized open-source license. We realize that our license is very restrictive compared even to GNU. With an “approved” open source license, many more people would use LingPipe, and perhaps even more importantly, contribute to it.

    Like MySQL, we offer commercial licenses under various terms for customers. We also offer specialized royalty-free licenses for academic research projects that do not own their own data. Maybe Breck will blog about our business model at some point.

    We didn’t write the subtitle of the book to which this blog post referred. I edited the main blog entry with a pointer to your comment where I quote the subtitle so that it’s crystal clear we don’t claim to fit any “official” definitions of “open source”.

  3. Breck Says:

    Seth,
    I went and removed the “open source” phrase from our FAQ. Apologies for the confusion–we really try to be clear about it when we communicate about LingPipe licensing but I can believe that I have slipped up.
    As for going with an Open Source license, I would do it if I could come up with a viable business model that didn’t depend on license fees from those not willing to comply with the Royalty Free license.

    Breck
