The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.
I happen to still think that part of the book is incredibly useful. It covers all of unicode, UCI for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).
O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSF approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!
A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)
Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.
If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players all of whom play the only position that fill the agent roll in a very popular play known as a “pass”.] If you know why Yarowsky‘s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.