Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.
The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.
I still happen to think that part of the book is incredibly useful. It covers all of Unicode, ICU for normalization and detection, all of the streaming I/O interfaces, encodings in HTML, XML and JSON, as well as in-depth coverage of regexes, Lucene, and Solr. It's all of the stuff that is continually misunderstood and misconfigured, so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer.)
O’Reilly isn’t interested in LingPipe because LingPipe’s license isn’t OSI-approved. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!
A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)
Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.
Manning also passed on TPiJ, so Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.
If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players, all of whom play the only position that fills the agent role in a very popular play known as a “pass”.] If you know why Yarowsky’s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.
February 22, 2013 at 5:58 am
OpenNLP moved to Apache; here is the new link: http://opennlp.apache.org/
February 25, 2013 at 1:06 pm
Cool.
I really like the way they laid out their doc. And their choice of tools — there’s a strong overlap with LingPipe.
I wonder what Apache requires before they’re out of the “incubator”.
April 8, 2013 at 2:41 pm
Hello Bob,
We already left the Incubator in February 2012 and became a Top Level Project (TLP) at Apache.
The Incubation Policy document has a section about minimum graduation requirements; have a look here:
http://incubator.apache.org/incubation/Incubation_Policy.html#Minimum+Graduation+Requirements
Jörn
April 8, 2013 at 6:33 pm
Congrats and thanks for the correction. I don’t know why I thought you were still in the incubator.
February 22, 2013 at 2:25 pm
[…] Anyone Want to Write an O’Reilly Book on NLP with Java? by Bob Carpenter. […]