I’m not quite releasing drafts as frequently as I’d like, though I do have the procedure automated now.
Where to Get the Latest Draft
You can download the book and the code tarball from:
What’s in It (So Far)
Draft 2 is up to 350 or so pages, and most of what I’ve added since the first draft is LingPipe-related. There’s no more general background in the works, but at this pace, it’ll be another year and another 1000 pages before it’s done. We’ll almost certainly have to break it into two volumes if we want to print it.
The current draft opens with a chapter on getting started with Java and LingPipe, including an overview of the tools we use. The second chapter's on character encodings and how to use them in Java. The third chapter covers regexes, including all the quantifiers, again focusing on how to get the most out of Unicode. The fourth chapter covers I/O, including files, readers, writers, and streams; compressed archives like GZIP, ZIP, and tar; resources on the classpath; URIs and URLs; standard I/O; object I/O and serialization; and LingPipe's I/O utilities.
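To give a taste of the encodings chapter's territory, here's a little sketch of my own (not code from the book): the same bytes decode to different strings under different charsets, so decoding with the wrong one silently mangles non-ASCII text.

```java
import java.nio.charset.Charset;

public class CharsetDemo {

    // Encode a string to bytes under one charset, then decode the
    // bytes under another (possibly different) charset.
    static String roundTrip(String s, String encodeAs, String decodeAs) {
        byte[] bytes = s.getBytes(Charset.forName(encodeAs));
        return new String(bytes, Charset.forName(decodeAs));
    }

    public static void main(String[] args) {
        String s = "r\u00e9sum\u00e9"; // "résumé"
        // UTF-8 uses two bytes for é; decoding those bytes as
        // Latin-1 turns each byte into its own character.
        System.out.println(roundTrip(s, "UTF-8", "UTF-8"));      // résumé
        System.out.println(roundTrip(s, "UTF-8", "ISO-8859-1")); // rÃ©sumÃ©
    }
}
```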
The fifth chapter gets more into LingPipe proper, covering the handler, parser, and corpus abstractions in the package com.aliasi.corpus, as well as support for cross-validation.
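As a toy illustration of what cross-validation support has to do (this is my own sketch, not LingPipe's com.aliasi.corpus API), here's a round-robin split of a data set into a training portion and a held-out fold:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Folds {

    // Assign items to k cross-validation folds round-robin by index.
    // Returns a two-element list: element 0 is the training data,
    // element 1 is the held-out test fold.
    static <T> List<List<T>> split(List<T> items, int k, int fold) {
        List<T> train = new ArrayList<T>();
        List<T> test = new ArrayList<T>();
        for (int i = 0; i < items.size(); i++)
            (i % k == fold ? test : train).add(items.get(i));
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<List<Integer>> r = split(Arrays.asList(1, 2, 3, 4, 5), 2, 0);
        System.out.println("train=" + r.get(0) + " test=" + r.get(1));
    }
}
```

Running each fold in turn as the test set and averaging the results is what gives cross-validated evaluations their stability.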
The sixth chapter is on classifier evaluations, including K-ary classifiers, reductions to binary classifiers, all the usual statistics, and how to use them in LingPipe. There’s also an extensive section on scored/ranked evaluations.
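Among "the usual statistics" are precision, recall, and F measure, which reduce to simple arithmetic on true-positive, false-positive, and false-negative counts. Here's a minimal sketch of my own (not LingPipe's evaluation classes):

```java
public class BinaryEval {

    // Precision: fraction of positive predictions that were correct.
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    // Recall: fraction of actual positives that were found.
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    // F1: harmonic mean of precision and recall.
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // With tp=8, fp=2, fn=4: precision 0.8, recall 2/3, F1 8/11.
        System.out.println(f1(8, 2, 4));
    }
}
```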
I’ll probably move tokenization before classifier evals, but for now it comes after. I cover just about all aspects of tokenization, including stemming/lemmatization, Soundex, character normalization with ICU, and so on. There’s complete coverage of LingPipe’s tokenizers and tokenizer factories, as well as the tokenization abstraction itself. I also detail interoperability with Lucene’s Analyzer class, with examples in Arabic.
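The tokenization abstraction in miniature looks something like this sketch of mine (it's not LingPipe's Tokenizer/TokenizerFactory API, just the idea): a factory holds the configuration, here a regex, and hands back token streams over text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {

    // The compiled pattern is the factory's configuration; each call
    // to tokenize() produces the tokens for one text.
    private final Pattern pattern;

    RegexTokenizer(String regex) {
        pattern = Pattern.compile(regex);
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = pattern.matcher(text);
        while (m.find())
            tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode-aware classes: runs of letters, runs of digits,
        // or any other single non-space character.
        RegexTokenizer t = new RegexTokenizer("\\p{L}+|\\p{N}+|\\S");
        System.out.println(t.tokenize("LingPipe 4.0, draft!"));
    }
}
```

Using \p{L} and \p{N} rather than [a-zA-Z0-9] is what keeps a tokenizer like this from falling over on Arabic or accented Latin text.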
Chapter 9, which will also move earlier, is on symbol tables.
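A symbol table is a small thing at heart; here's a stripped-down sketch of my own (not LingPipe's com.aliasi.symbol API) mapping symbols to dense integer ids and back:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SymbolTable {

    // Ids are assigned densely from 0 in order of first appearance,
    // so they can index arrays of counts or parameters directly.
    private final Map<String, Integer> toId = new HashMap<String, Integer>();
    private final List<String> toSymbol = new ArrayList<String>();

    int getOrAdd(String symbol) {
        Integer id = toId.get(symbol);
        if (id != null)
            return id;
        toId.put(symbol, toSymbol.size());
        toSymbol.add(symbol);
        return toSymbol.size() - 1;
    }

    String symbol(int id) {
        return toSymbol.get(id);
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        System.out.println(table.getOrAdd("the"));  // 0
        System.out.println(table.getOrAdd("cat"));  // 1
        System.out.println(table.getOrAdd("the"));  // 0 again
        System.out.println(table.symbol(1));        // cat
    }
}
```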
Chapter 11 is a fairly complete overview of latent Dirichlet allocation (LDA) and LingPipe’s implementations.
There’s currently almost 100 pages of appendices, including basic math, stats, information theory, an overview of corpora, and an overview of the relevant data types in Java.
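As a one-liner from the information theory appendix's territory, Shannon entropy is just H(p) = -Σ p_i log2 p_i; here's a quick sketch of mine, not the appendix's code:

```java
public class Entropy {

    // Shannon entropy in bits of a discrete distribution p,
    // with the convention 0 * log 0 = 0.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p)
            if (pi > 0.0)
                h -= pi * Math.log(pi) / Math.log(2.0);
        return h;
    }

    public static void main(String[] args) {
        // A fair coin carries exactly one bit of uncertainty.
        System.out.println(entropy(new double[] { 0.5, 0.5 })); // 1.0
    }
}
```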
Appendix E is a roughly 20-page intro to Lucene 3.0, covering all you need to know to get search up and running.
The next thing I’ll address is chapter 7, on naive Bayes classifiers. Then I’ll turn to logistic regression classifiers, which will require an auxiliary chapter on feature extraction and another on vectors. I may also write chapters on K-nearest neighbors, perceptrons, and our language-model-based classifiers, though the latter depend on a chapter on character language models.
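Since the naive Bayes chapter isn't written yet, here's a bare-bones sketch of my own of the multinomial model with add-one smoothing, just to show the shape of the thing (none of this is LingPipe's API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyNaiveBayes {

    // Per-category token counts, document counts, and token totals.
    private final Map<String, Map<String, Integer>> tokenCounts
        = new HashMap<String, Map<String, Integer>>();
    private final Map<String, Integer> catDocCounts = new HashMap<String, Integer>();
    private final Map<String, Integer> catTokenTotals = new HashMap<String, Integer>();
    private final Set<String> vocab = new HashSet<String>();
    private int totalDocs = 0;

    void train(String cat, String[] tokens) {
        totalDocs++;
        catDocCounts.put(cat, get(catDocCounts, cat) + 1);
        Map<String, Integer> counts = tokenCounts.get(cat);
        if (counts == null)
            tokenCounts.put(cat, counts = new HashMap<String, Integer>());
        for (String tok : tokens) {
            vocab.add(tok);
            counts.put(tok, get(counts, tok) + 1);
            catTokenTotals.put(cat, get(catTokenTotals, cat) + 1);
        }
    }

    // Pick the category maximizing log p(cat) + sum_t log p(t|cat),
    // with add-one smoothing over the observed vocabulary.
    String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : catDocCounts.keySet()) {
            double score = Math.log(get(catDocCounts, cat) / (double) totalDocs);
            int total = get(catTokenTotals, cat);
            for (String tok : tokens)
                score += Math.log((get(tokenCounts.get(cat), tok) + 1.0)
                                  / (total + vocab.size()));
            if (score > bestScore) {
                bestScore = score;
                best = cat;
            }
        }
        return best;
    }

    private static int get(Map<String, Integer> m, String k) {
        Integer v = m.get(k);
        return v == null ? 0 : v;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("ham", new String[] { "hi", "mom" });
        nb.train("spam", new String[] { "buy", "pills" });
        System.out.println(nb.classify(new String[] { "buy" })); // spam
    }
}
```

The real chapter will of course deal with the things this sketch punts on: length normalization, priors, proper smoothing parameters, and evaluation.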
After that, I’ll probably turn to tagging and chunking, though we’ll have to see. That’ll require sentence detection, as well as some more stats and interfaces.
So far, no one’s sent any comments on the first draft. I’d love to hear what you think, be it in the form of comments, corrections, suggestions, or whatever.