LingPipe Book, Draft 2

I’m not quite releasing drafts as frequently as I’d like, though I do have the procedure automated now.

Where to Get the Latest Draft

You can download the book and the code tarball from:

What’s in It (So Far)

Draft 2 is up to 350 or so pages, and most of what I’ve added since the first draft is LingPipe-related. There’s no more general background in the works, but at this pace, it’ll be another year and another 1000 pages before it’s done. We’ll almost certainly have to break it into two volumes if we want to print it.

The current draft opens with a chapter on getting started with Java and LingPipe, including an overview of the tools we use. The second chapter covers character encodings and how to work with them in Java. The third chapter covers regexes, including all the quantifiers, again focusing on how to get the most out of Unicode. The fourth chapter covers I/O, including files, readers, writers and streams, compressed formats like GZIP, ZIP, and tar, resources on the classpath, URIs, URLs, standard I/O, object I/O and serialization, and LingPipe’s I/O utilities.
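Just to give a flavor (this snippet isn’t from the book, and the class name is made up), the encoding and I/O chapters cover enough to make sense of code like the following, which reads a gzipped text file line by line with an explicit UTF-8 decoder:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    public class GzipLineReader {
        public static void main(String[] args) throws IOException {
            // decode a gzipped text file as UTF-8, one line at a time
            FileInputStream fileIn = new FileInputStream(args[0]);
            GZIPInputStream gzipIn = new GZIPInputStream(fileIn);
            InputStreamReader reader = new InputStreamReader(gzipIn, "UTF-8");
            BufferedReader bufReader = new BufferedReader(reader);
            String line;
            while ((line = bufReader.readLine()) != null)
                System.out.println(line);
            bufReader.close();
        }
    }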

The fifth chapter gets more into LingPipe proper, covering the handler, parser, and corpus abstractions in the package com.aliasi.corpus, as well as support for cross-validation.

The sixth chapter is on classifier evaluations, including K-ary classifiers, reductions to binary classifiers, all the usual statistics, and how to use them in LingPipe. There’s also an extensive section on scored/ranked evaluations.
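To give a sense of the statistics involved, here’s the textbook arithmetic for precision, recall, and F-measure for a single category, computed from raw counts (illustrative only, not LingPipe’s evaluation classes, and the counts are made up):

    public class PrecisionRecallDemo {
        public static void main(String[] args) {
            // hypothetical counts for one category
            double tp = 90.0;  // true positives
            double fp = 10.0;  // false positives
            double fn = 30.0;  // false negatives

            double precision = tp / (tp + fp);  // 90/100 = 0.90
            double recall = tp / (tp + fn);     // 90/120 = 0.75
            double f1 = 2.0 * precision * recall / (precision + recall);  // approx 0.82

            System.out.println("precision=" + precision
                               + " recall=" + recall
                               + " F1=" + f1);
        }
    }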

I’ll probably rearrange and move tokenization before classifier evals, but it’s currently after. I cover just about all aspects of tokenization, including stemming/lemmatization, soundex, character normalization with ICU, and so on. There’s complete coverage of LingPipe’s tokenizers and tokenizer factories, as well as the underlying tokenization abstractions. I also detail interoperability with Lucene’s Analyzer class, with examples in Arabic.
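As a very rough sketch of what basic tokenization looks like in code (written from memory against the com.aliasi.tokenizer API in LingPipe 4, so check the javadoc for the exact names and signatures):

    import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
    import com.aliasi.tokenizer.Tokenizer;
    import com.aliasi.tokenizer.TokenizerFactory;

    public class TokenizeDemo {
        public static void main(String[] args) {
            TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
            char[] cs = "John ran to Smith's bank.".toCharArray();
            Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
            String token;
            // nextToken() returns null once the input is exhausted
            while ((token = tokenizer.nextToken()) != null)
                System.out.println(token);
        }
    }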

Chapter 9, which will also move earlier, is on symbol tables.

Chapter 11 is a fairly complete overview of latent Dirichlet allocation (LDA) and LingPipe’s implementations.
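For readers who haven’t seen LDA before, the standard generative story (in the usual notation, which may differ from the book’s) is:

    \theta_d \sim \mathrm{Dirichlet}(\alpha)           % topic proportions for document d
    \phi_k \sim \mathrm{Dirichlet}(\beta)              % word distribution for topic k
    z_{d,n} \sim \mathrm{Discrete}(\theta_d)           % topic assignment for word n of document d
    w_{d,n} \sim \mathrm{Discrete}(\phi_{z_{d,n}})     % the observed word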

There are currently almost 100 pages of appendices, including basic math, stats, information theory, an overview of corpora, and an overview of the relevant data types in Java.

Appendix E is a roughly 20-page intro to Lucene 3.0, going over all you need to know to get search up and running.
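To give a rough sense of scale, getting an in-memory index and a query working against the Lucene 3.0 API takes about this much code (a minimal sketch, not the appendix’s actual example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

            // index one document with a single analyzed, stored field
            IndexWriter writer
                = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field("text", "the quick brown fox",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // parse a query and search the index
            IndexSearcher searcher = new IndexSearcher(dir);
            QueryParser parser = new QueryParser(Version.LUCENE_30, "text", analyzer);
            Query query = parser.parse("fox");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits=" + hits.totalHits);
            searcher.close();
        }
    }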

What’s Next

The next thing I’ll address is chapter 7, on naive Bayes classifiers. Then I’ll turn to logistic regression classifiers, which will require an auxiliary chapter on feature extraction and another on vectors. I may also write chapters on KNN, perceptrons, and our language-model-based classifiers, though the latter depend on a chapter on character language models.

After that, I’ll probably turn to tagging and chunking, though we’ll have to see. That’ll require sentence detection, as well as some more stats and interfaces.

Comments Welcome

So far, no one’s sent any comments on the first draft. I’d love to hear what you think, be it in the form of comments, corrections, suggestions, or whatever.

11 Responses to “LingPipe Book, Draft 2”

  1. Will Thompson Says:

    I’m looking forward to getting the next draft! I’m teaching myself text processing algorithms, and your book is perfect for what I need. I’m more than happy to send you feedback as I go through it.

  2. Rich W Says:

    I do keep meaning to read this… and I’d be happy to provide feedback. Is there a section where you think my (or our) help would be most useful?

  3. lingpipe Says:

    I can say what I don’t need: help on grammar, spelling, etc. I’ll catch all that myself and with proofreaders.

    What’s most helpful is pointing out things that are unclear, especially if you can give me a hint as to how I might be able to make them more clear.

    One thing I’m worried about now is writing at the right level for the audience. Some of the tutorials were too hard mathematically, and others too trivial.

    What I’d like to know about the book is: Is there too much or too little math?

    Are there too many trivial code samples or not enough? I’m trying to simplify heavily from the tutorials, while at the same time increasing API coverage. Do we need easier “getting started” example code templates?

    Is the way I’m interspersing code and text confusing or easier to understand than putting it in a big printout and then referring back to it?

    Am I trying to be too exhaustive in my coverage of LingPipe? Of Java?

    Will the text work as a reference book for LingPipe?

    As usual, I’m writing what I would’ve liked to have read myself before I knew all this stuff. But I’m probably the wrong audience, because I love clean mathematical descriptions and tend to program from empty editor buffers rather than by cutting-and-pasting code.

  4. Mark J Says:

    Hi Bob,

    Even in its current state, this looks like a very useful book!

    I have a general question not specifically related to the book, but I suspect it’s something you must have come across too: are there good tools for scheduling complex computational experiments (that involve a series of interdependent steps) on a cluster? What I’m basically after is a build system for complex experiments on clusters.

    For example, training and evaluating Eugene’s and my reranking parser involves splitting the training data into folds, training and running the first stage generative parser on each fold, then collecting all the folds together to produce the input for the next stage, etc., etc.

    I’ve been using “make” to run these kinds of experiments. It works fairly well on a multi-core box (e.g., “make -j 8” will figure out how to use all your cores at once), but I don’t see how to generalise this to a cluster environment.

    What I’d like is something that could schedule jobs the way “make” does, but can run them on a cluster. And it should play nice with PBS or whatever cluster scheduling software is being used.

    And finally (while I’m dreaming) it would be great if it were a bit more expressive. “make” is basically a propositional system (i.e., because jobs are associated with file names, they are atomic). It would be nice if it were first-order, i.e., if it were possible to pass arguments to jobs that correspond to parameters in the job. I get around this restriction in my Makefiles by encoding parameters in the filename; e.g., grammar_t1e-7.txt is a grammar where the program that constructed it was called with the flag “-t 1e-7”. But this is a kludge, and the Makefiles become basically incomprehensible jumbles of make macros.
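    A minimal sketch of that filename-encoding kludge, with a made-up train_grammar command standing in for the real trainer, looks something like this pattern rule:

        # grammar_t1e-7.txt gets rebuilt by running the trainer with -t 1e-7;
        # $* is the stem matched by %, $@ the target, $< the first prerequisite
        # (note: the recipe line must start with a hard tab)
        grammar_t%.txt: train.data
        	train_grammar -t $* -o $@ $<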

    Any suggestions?

    Best,

    Mark

    • lingpipe Says:

      I take it you’ve already considered something like Hadoop running Map-Reduce? The nice part about that is you can specify the reduce operation to put everything back together.

      The cluster schedulers Mitzi’s using for DNA sequence alignment are very simple “distribute a batch job, make sure it completes” kinds of things. From what I’ve heard at the dinner table, the main hassle seems to be software installation on the compute nodes and the slow file system shared by the whole thing.

    • Brendan O'Connor Says:

      Hi, I’ve had this need too. I think “make” might be the best open-source solution that has full-out dependency graphs and the like. I think so because I’ve seen a few different efforts to make things like this…

      Here’s one. http://www.cs.cmu.edu/~jhclark/loonybin/files/doc/README-txt.html

  5. Heather Says:

    I read the first draft in its entirety and plan on using at least ch. 1,3,5 in my class this Spring. I found those 3 chapters to be great technical companions to Jurafsky and Martin’s first 3 chapters. I love the examples so far and appreciate how much simpler they are than the tutorials – which are great but are definitely advanced for beginners.

    After reading draft 1 I wanted to see info specifically on Ngrams, HMMs, classification and regression, clustering, and using Lingpipe with Lucene. And maybe a chapter on developing a custom classifier with Lingpipe, for example if I wanted to use a neural network model other than perceptron (I cover many in my class).
    It sounds like a number of these topics have been covered in the new draft, looking forward to reading it!

  6. jwp Says:

    Mark: I don’t know of a tool that works across all cluster environments, but if you’re using SGE or LSF, you might try their make-based schedulers:

    http://wikis.sun.com/display/GridEngine/Transparent+Remote+Execution#TransparentRemoteExecution-ParallelMakefileProcessingWithqmake

    http://www.ualberta.ca/CNS/RESEARCH/LSF/doc/users/10-lsmak.htm

  7. Heather Says:

    Love the Lucene Appendix!

  8. Mark J Says:

    Thanks everyone for the suggestions.

    Bob, I haven’t used Hadoop or MapReduce, but I think they’d be mainly useful for parallelising a single step in the kind of processes I’m talking about. I’m thinking now of problems that involve a dependency graph of heterogeneous steps. Make solves this problem fine on a single machine.

    I’ll look more closely at Brendan and jwp’s suggestions. LoonyBin looks especially interesting — it seems designed to solve the kinds of problems I’m interested in — but Brendan’s message didn’t sound like the ringing endorsement I was hoping for.

    Tell Mitzi hi! Yes, clusters are a pain. Running on a cluster means you’re at least one more degree removed from the actual computation. But in my experience most of the problems come from my own stupid mistakes. For example, a couple of days ago I was wondering why all my MCMC sampling jobs were running really slowly. It turns out I’d added code to log certain types of events back to the master node, and one of the samplers produced these events really often (as in, millions of times every second). This clogged the network connecting the slaves to the master and as a result everything ran slow (we have a diskless cluster — with memory so cheap it seemed like a good idea at the time).

    Mark

  9. Nitish Sinha Says:

    I have started reading your book. I like your appendix B on information theory. I am considering using a few chapters for my class on “Text processing in Finance”. Will keep you posted as I read more chapters.
