LingPipe Book Draft Available


We’ve started work on a proper LingPipe book. You can download the current partial draft of the book and sample code from:

In case you were wondering why the blog’s been quieter these days, this is it!

Our Goals

Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.

We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials.

Progress so Far

So far, there’s a getting started chapter with sections on all the tools we use, a hello world example, and an introduction to Ant. The second chapter covers character encodings, characters, and strings in Java, and introduces the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. The third chapter covers regular expressions, focusing on their interaction with Unicode. The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy pattern. The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own.

The first appendix is an introduction to Java, including the primitives, objects and arrays. The second appendix contains suggested further reading in areas related to natural language processing.

I hope to keep churning out chapters at around one per week. As I complete chapters, I’ll release new versions.

Comments Most Appreciated

C’mon — you know you want to be listed in the front of the book for sending us feedback. Any comments, suggestions, etc., should go to

The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing.

I would love to get feedback about the general level of description and the tone, as well as suggestions for demos or for improving the existing code or descriptions.

Eventual Publication

We’ll be publishing it ourselves, probably through CreateSpace. That’ll let us sell through Amazon.

If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$ 20 (in the US anyway).

We plan to continue distributing PDF versions for free.

It’s about Time

I’m psyched, because it’s been over ten years since my last book.

I’m also working on a book on Bayesian categorical modeling — the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. Don’t hold your breath; I’m trying to finish the LingPipe book first.

5 Responses to “LingPipe Book Draft Available”

  1. Alex Ott Says:

    At first glance, there is too much information about basic Java stuff – readers/streams/archives/etc. I think it’s better to point to several very good Java books (Thinking in Java, Core Java, …) and concentrate on the main topics – this will make the book smaller, and it will contain only relevant information

    • lingpipe Says:

      I’m afraid you may be right. In my last book, I wound up chucking a 200 page introduction to propositional and predicate logic, figuring there were better intros out there.

      Here, I’m trying to focus on just the aspects of Java related to scalable statistical and heuristic text processing. And do so in such a way that experienced users can skip the parts they already know. That way, I can provide references in more interesting chapters.

      We find that our users often get stuck on the fine points of serialization, regexes, strings, and character encodings which aren’t treated particularly deeply in the introductory books (though I don’t know either Thinking in Java or Core Java).

      I was aiming for something like Effective Java for text, but I’m afraid what I already have has too much introductory material.

      • Alex Ott Says:

        I read Core Java not long ago – just to refresh my Java knowledge – and I found that it provides pretty good coverage of the corresponding topics

  2. Jessica Says:

    I’m looking forward to seeing more NLP/LingPipe-specific chapters in the near future. I agree with Alex that it seems like quite a lot of introductory material, but you are the one most aware of your audience’s needs.

  3. Emre Sevinç Says:

    I’ll definitely be looking forward to buying the finished version. For now, I’m downloading the online draft version.
