We’ve started work on a proper LingPipe book. You can download the current partial draft of the book and sample code from:
In case you were wondering why the blog’s been quieter these days, this is it!
Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.
We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials.
Progress so Far
So far, there’s a getting started chapter with sections on all the tools we use, a hello world example and an introduction to Ant. The second chapter is all about character encodings and characters and strings in Java, as well as an introduction to the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. The third chapter covers regular expressions, focusing on their interaction with Unicode. The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy. The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own.
The first appendix is an introduction to Java, including the primitives, objects and arrays. The second appendix contains suggested further reading in areas related to natural language processing.
I hope to keep churning out chapters at around one per week. As I complete chapters, I’ll release new versions.
Comments Most Appreciated
C’mon — you know you want to be listed in the front of the book for sending us feedback. Any comments, suggestions, etc., should go to
The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing.
I would love to get feedback on the general level of description and the tone, as well as suggestions for demos or for improving the existing code or descriptions.
We’ll be publishing it ourselves, probably through CreateSpace. That’ll let us sell through Amazon.
If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$20 (in the US anyway).
We plan to continue distributing PDF versions for free.
It’s about Time
I’m psyched, because it’s been over ten years since my last book.
I’m also working on a book on Bayesian categorical modeling — the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. Don’t hold your breath; I’m trying to finish the LingPipe book first.