* Your Mileage May Vary

August 25, 2010 by lingpipe

Alex Smola just introduced a blog, Adventures in Data, that’s focusing on many of the same issues as this one. His first few posts have been on coding tweaks and theorems for gradient descent for linear classifiers, a problem Yahoo!’s clearly put a lot of effort into.

Lazy Updates for SGD Regularization, Again

In the wide world of independently discovering algorithmic tricks, Smola has a post on Lazy Updates for Generic Regularization in SGD. It presents the same technique as I summarized in my pedantically named white paper Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. It turns out to be a commonly used, but not often discussed technique.

Blocking’s Faster

I wrote a long comment on that post, explaining that I originally implemented it that way in LingPipe, but eventually found it faster in real large-scale applications to block the prior updates.

Your Mileage May Vary

To finally get to the point of the post (blame my love of rambling New Yorker articles), your mileage may vary.

The reason blocking works for us is that we have feature sets like character n-grams where each text being classified produces on the order of 1/1000 of all features. So if you update by block every 1000 epochs, it’s a win. Not even counting removing the memory and time overhead of keeping the record of sparseness and increased memory locality.

Horses for Courses

The only possible conclusion is that it’s horses for courses. Each application needs its own implementation for optimal results.

As another example, I can estimate my annotation models using Gibbs sampling in BUGS in 20 or 30 minutes. I can estimate them in my custom Java program in a second or two. I only wrote the program because I needed to scale to a named-entity corpus and BUGS fell over. As yet another example, I’m finding coding up genomic data very similar to text data (edit distance, Bayesian models) and also very different (4-character languages, super-long strings).

Barry Richards, the department head of Cog Sci when I was a grad student, gave a career retrospective talk about all the work he did in optimizing planning systems. Turns out each application required a whole new kind of software. As a former logician, he found the lack of generality very dispiriting.

Not being able to resist one further idiom, this sounds like a probletunity to me.

Threading Blog Discussions

This post is a nice illustration of Fernando Pereira’s comment on Hal Daumé III’s blog about the problematic nature of threading discussions across blogs.

That Zeitgeist Again

I need to generalize my earlier post about the scientific zeitgeist to include coding tricks. The zeitgeist post didn’t even discuss my rediscoveries in the context of SGD!

McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review)

August 16, 2010 by lingpipe

I’m very happy to see that the 2nd Edition is out:

  • McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Second Edition. Manning Press.

Manning’s offering 40% off until September 30, 2010. Follow the link to the book and use code lingpipeluc40 when you check out.

You Need This Book

If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. As far as I know, this book is the only way to come up to speed on the intricacies of Lucene.

What if I have the First Edition?

Keep it for your projects that require 2.x versions of Lucene. So much has changed in version 3 that is not backward compatible with previous versions that the first edition of this book will be of almost no use. Every example has changed. It’s pretty much a whole new book (other than the first page where it describes Lucene as simple — that used to be true, but is not any longer).

I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. (In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!)

Is it a Good Book?

I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. Now let me show you why.

The authors clearly know their Lucene. Not surprising, given that they’re three of the core developers for Lucene. (Erik Hatcher was also involved in the Ant build system, and has another Manning … in Action book for that.) Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. As such, the book’s better than good — it’s really great set of code examples presented in a coherent style and order, with easy-to-follow explanations.

The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles.

They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative).

If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!).

They cover many more advanced topics than the last book, like the range of file-level locking options and their use in a more transactional setting than basic Lucene provides itself.

Needs a Revision

Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is — all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.)

The initial two lines of a paragraph are repeated on pages xxx and xxxi. There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii.

There are mistakes in code fragments, like a missing comma in the parse() example on p. 239.

There are also bugs in code. I only studied a few programs in the term vector section (which is great) and in the Tika section. Listing 5.18 has a bug in that it calls searcher.search() with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3).

There are dangling references like “uncomment the output lines toward the end of the docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193).

I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it.

What’s Missing?

There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. As a result, the authors face a daunting curatorial challenge.

From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. It’s more than just stripping accents from Western characters.

JUnit Examples

Don’t get me wrong. I love JUnit. We use it for unit testing in LingPipe. And I recommend reading our tests to help understand our classes. I just don’t think it’s a good way to present examples in an intro book. On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods.

What about the Code?

Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. I have to fight my own tendency to err on the latter side. The authors favor simplicity in most cases, as they explain up front. The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. (We do the same thing in our tutorials.)

Many (but not all) methods that throw exceptions are just declared to throw Exception. I think this does readers a disservice in not knowing what’s actually throwing what kind of exception. Others might argue it makes the code cleaner and less cluttered. The authors explain that’s their motivation and point out proper exception etiquette. We do the same thing in some of our tutorial code.

The use of generics is spotty, with allocations like new TreeMap() that should be new TreeMap<String,Integer> and subsequent use of non-generic Iterator instances. As a result, they have to cast their gets and can’t use auto-boxing. This just clutters up the code using the maps and makes their intended contents opaque.

I’d recommend autoboxing, or its explicit result, Integer.valueOf(int), over the older style new Integer(int) (the latter always creates a new instance, whereas valueOf() being a factory can share instances).

I’d like to see more foreach and fewer explicit iterators.

They don’t like ternary operators or min/max operators, resulting in awkward fragments like:

int size = max;
if (max > hits.scoreDocs.length) size = hits.scoreDocs.length;

rather than the simpler

int size = Math.min(max,hits.scoreDocs.length);

I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!).

There is a more serious issue for floating point precision they try to deal with in listing 5.22 in the last if/else block. They try to avoid having cosines take on values outside of (-1,1), which would be illegal inputs to acos(). Their approach is insufficient. The only way to guarantee that your natural cosine computations fall in this range is to do:

double boundedX = Math.max(-1.0, Math.min(1.0, x));

That’s another problem we ran into for our k-means clustering implementation in LingPipe. Once the vectors get big, it’s easy to get results greater than 1 for close matches.

Also, if you’re worried about efficiency, replace Math.sqrt(x) * Math.sqrt(y) with Math.sqrt(x * y).

Layout, Etc.

I’m not a big fan of Manning’s approach to code formatting and presentation. Whole programs are printed out, often across multiple pages, and a numbered push-pin like graphic in very heavy bold is used to pick them out and cross-reference them in the text. I prefer the interleaved style I’ve been using, and you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste.

The code formatting also bugged me. Rather than breaking before assignment equal signs, they break after. They use two characters of indent, which is hard to scan. And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense).

The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. A class name like IndexWriter is apt to show up with Index- at the end of one line and Writer at the start of the next. The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers).

How Much Math?

For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (Morgan Kaufmann text, so it’s prohibitively expensive). Mannnig, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines.

In the cosine example in the “Leveraging Term Vectors” (section 5.10), they define cosine in a formula without ever defining the length or dot product functions. If someone needs cosine defined for them, they probably need the definitions for basic vector operations. In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. I would’ve just stuck with largest cosine for the sort, rather than smallest angle and skipped the call to Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation costs, though it’ll only make a difference percentage-wise if the term vectors are short).

For spell checking, it’d have been nice to get up to something like the noisy channel model. Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there.

What about Solr?

Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. This book only mentions it in passing in a page-long section 10.8.

Other Books on Lucene

I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon.

Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4.

About those Cover Illustrations

On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. I just did a search for the terms they mention, [ottoman empire clothing 1802 william miller] and found two copies of the book for sale at a Donald Heald Books (you may need to search again). Here’s the scoop

[DALVIMART, Octavien (illustrator) - William ALEXANDER (1767-1816)].

The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French

London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). Folio (14 1/8 x 10 1/2 inches). Parallel text in French and English. Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate).

A good copy of this classic work on the costume of the Ottoman Empire.

Starting with the "Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). “The Drawings, from which … [the] plates have been engraved, were made on the spot … [in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue.

Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337.

Bing Translate Has a Great UI, and Some Nice NLP, too

August 10, 2010 by lingpipe

I’m really digging the user interfaces Bing has put up. Google’s now copying their image display and some of their results refinement. I’ve been working on tokenization in Arabic for the LingPipe book and was using the text from an Arabic Wikipedia page as an example.

Here are links to Bing’s and Google’s offerings:

Yahoo!’s still using Babel Fish in last year’s UI; it doesn’t do Arabic.

Language Detection

First, it uses language detection to figure out what language you’re translating from. Obvious, but oh so much nicer than fiddling with a drop-down menu.

Side-by-Side Results

Second, it pops up the results side-by-side.

Sentence-Level Alignments

Even cooler, if you mouse over a region, it does sentence detection and shows you the corresponding region in the translation. Awesome.

Back at Google

I went back and looked at Google translate and see that they’ve added an auto-detect language feature since my last visit. Google only displays the translated page, but as you mouse over it, there’s a pop-up showing the original text and asking for a correction.

I don’t know who did what first, but these are both way better interfaces than I remember from the last time I tried Google translation. And the results are pretty good NLP-wise, too.

Sierra & Bates. Head First Java 2nd Ed. (Review)

August 9, 2010 by lingpipe

After the comments on the last post about relying more heavily on Java textbooks, I ordered the most popular ones from Amazon. I’ll start the reviews with the number one most popular book on Java (ranked by Amazon sales; it also has 269 reviews; I’m happy to see Effective Java is number 4):

  • Sierra, Kathy and Bert Bates. 2005. Head First Java. 2nd Edition. O’Reilly.

Mashup: Elementary School Workbook + Kitschy Humor

Image from Title Page.

I grew up with texts like The Little LISPer, but this book’s even stranger. It strikes me as a mashup of an elementary school math problem workbook and one of those cutesy humor books you see near the checkout counters of bookstores around the holidays.

Here’s an example of one of their quizzes, where you fill in missing pieces of a Java program; just like school writing exercises:

The Design and Layout

I think they were striving for a sense of the enthusiastic amateur. The “fun” fonts and askew alignments make it look non-linear. Code comments are added in a handwritten post-it note style.

You really need to head over to Amazon and “look inside”.

Who’s it For?

It’s aimed at Java novices. Other than that, I’m not sure. Maybe for potential Java developers who like crossword puzzles or true/false quizzes with titles like “Be the Compiler”. I’m more of a Java Puzzlers guy myself (thank my other childhood friend, Martin Gardner).

Unless you laughed at the photo included above, don’t buy it for the humor.

How Much Does it Cover?

It covers a startling amount of Java, including advanced topics like threads, sockets, RMI, Swing, audio, …

Of course, there can’t be much depth given this range. But you do get a wide range of hello-world-type programs.

Image from Page 101.

Is it Sound Java?

The authors clearly know what they’re doing. The advice and coverage are too basic if you want to program Java seriously, but it’ll get you started on the right track.

There’s even good high-level advice like telling you to write unit tests (though not using anything as complex as JUnit).

It’s an Industry

Like other popular series (i.e., …for Dummies, … in Action, … in a Nutshell), this one’s turned into an industry, with dozens of titles covering everything from HTML to statistics.

As such, they’ve created all the trappings of such an enterprise, such as “Head First learning principles”:

  • make it visual — put the words within or near graphics
  • use a conversational and personalized style
  • get the learner to think more deeply
  • get — and keep — the reader’s attention
  • touch their emotions

And they break out all sorts of “metacognitive” tips explaining why they wrote such a whacky book and how you should go about learning with it (like read it before bed and mix right-brain/left-brain activities for learning).

Many reviewers like the approach well enough to blurb about it.

LingPipe Book Draft Available

August 5, 2010 by lingpipe

We’ve started work on a proper LingPipe book. You can download the current partial draft of the book and sample code from:

In case you were wondering why the blog’s been quieter these days, this is it!

Our Goals

Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.

We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials.

Progress so Far

So far, there’s a getting started chapter with sections on all the tools we use, a hello world example and an introduction to Ant. The second chapter is all about character encodings and characters and strings in Java, as well as an introduction to the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. The third chapter covers regular expressions, focusing on their interaction with Unicode. The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy. The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own.

The first appendiex is an introduction to Java, including the primitives, objects and arrays. The second appendix contains suggested further reading in areas related to natural language processing.

I hope to keep churning out chapters at around one per week. As I complete chapters, I’ll release new versions.

Comments Most Appreciated

C’mon — you know you want to be listed in the front of the book for sending us feedback. Any comments, suggestions, etc., should go to carp@alias-i.com.

The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing.

I would love to get feedback about the general level of description, the tone, or get suggestions for demos or how to improve the existing code or descriptions.

Eventual Publication

We’ll be publishing it ourselves, probably through CreateSpace. That’ll let us sell through Amazon.

If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$ 20 (in the US anyway).

We plan to continue distributing PDF versions for free.

It’s about Time

I’m psyched, because it’s been over ten years since my last book.

I’m also working on a book on Bayesian categorical modeling — the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. Don’t hold your breath; I’m trying to finish the LingPipe book first.

Big Bit-Packed Array Abstraction (for Java, C, etc.)

July 28, 2010 by lingpipe

One limitation of Java is that arrays can only be up to Integer.MAX_VALUE entries long. That means you can only get 2 billion entries into an array. That’s only 2GB if the arrays contain bytes, and we have way more memory than that these days. That’s not enough.

A second limitation, shared by C, is that each entry is a round number of bits, usually a power of two bytes (8, 16, 32, or 64 bits). Often 32 is too little and 64 too much.

I’ve been spending some time lately working on bioinformatics applications. The size and shape of the data’s a bit unusual for me, and very awkwardly sized for Java or even C-based arrays.

Packing the Human Genome

For instance, if you want to represent the bases in both strands of the human genome, it’s a sequence 6 billion bases long with each base being 1 of 4 values (A, C, G, or T). OK, no big, just pack four values into each byte and you can squeeze under the limit with a 1.5 billion long array of bytes. It’s even pretty easy to write the bit fiddling because everything aligns on byte boundaries.

Indexing the Human Genome

Now let’s suppose we want to index the genome so we can find occurrences of sequences of arbitrary length in time proportional to their length. Easy. Just implement a suffix array over the positions. (A suffix array is a packed form of a suffix tree, which is a trie structure that represents every suffix in a string. Technically, it’s an array consisting of all the positions, sorted based on the suffix defined from the position to the end of the string.)

Because there are 6G positions, the positions themselves won’t fit into an int, even using an unsigned representation. There are only 32 bits in an int, and thus a max of about 4 billion values. But a long is overkill. We need 33 bits, not 64. Using 8 bytes per entry, the suffix array would fill 48GB. We really only need a little over half that amount of space if we use 33 bits per pointer. But now things aren’t so easy. First, the bit fiddling’s more of a pain because 33-bit values can’t be both packed and byte aligned. Second, you can’t create an array that’ll hold 24GB of values, even if it’s a maximally sized array of longs (that’ll get you to 16GB).

The Abstraction

What I’ve defined in the package I’m working on with Mitzi is a BigArray interface. The interface is super simple.

interface BigArray {
    long length();
    long get(long n);
    void set(long n, long val);
}

That’s the basics. We might want to make the set operations optional so we could have immutable big arrays. As is, there’s no way to define immutable arrays in Java or C (though Java provides immutable wrappers for lists in the collections framework and C probably has something similar in its standard template lib).

I followed the idiom of Java’s Number interface and define extra getters for bytes, shorts and integers for convenience, but don’t show them here.

On top of a big array, we could implement a variety of the new I/O buffer classes, using something like the random access file repositioning pattern, but it’d be clunky. It’s really a shame the buffers are also sized to integers.

Implementations

What’s groovy about interfaces is that we can implement them any way we want. The simplest implementation would just be backed by an array of long values. An implementation specifically for two-bit values might be used for the genome, with an efficient byte-aligned implementation.

What I did was create an implementation that allows you to specify the array length and bits per value. I then pack the bits into an array of longs and fiddle them out for getting and setting. This is fine for the genome sequence, because it’ll fit, although it’s not as efficient as the byte-aligned implementation, so I’ll probably add extra special-case implementations.

But it’s still not enough for the suffix array. I can’t fit all the bits into an array of longs. Although I haven’t implemented it yet, the obvious thing to do here is to build up the big array out of an array of arrays. I’d only need two arrays of longs for the human genome suffix array. This is pretty easy to implement hierarchically with an array of smaller big array implementations. First figure out which subarray, then do the same thing as before.

Yes, I Know about Burrows-Wheeler Indexing

If you’re just interested in the suffix array packing problem, the Burrows-Wheeler compression algorithm may be combined with indexing to get the entire genome suffix array into 4GB of memory in a way that’s still usable at run time. This was first explored by Ferragini and Manzini in their 2000 paper, Opportunistic data structures with applications. Very very cool stuff. And it works even better in the genomics setting than for text documents because of the small alphabet size (just four bases for DNA/RNA or 20 amino acids for proteins). Several projects have taken this approach (e.g. BWA and Bowtie).

LaTeX Problem’s Not Our Fault

July 14, 2010 by lingpipe

Thanks to those who’ve sent me e-mail saying that the LaTeX is broken on the site. It’s not our fault. Really. It’s a problem with WordPress’s server.

Are lots of people having problems seeing LaTeX on our site?

What’s Going On

WordPress uses a web service to decode the LaTeX expressions to PNGs. When I write something like

$latex f(x)$

in a blog post, it gets rendered as f(x).

What’s really going on is that WordPress is preprocessing my doc and replacing the LaTeX escapes with links to a web service that returns PNG images. For instance, the example above translates to the following link:

http://l.wordpress.com/latex.php?latex=f%28x%29&bg=ffffff&fg=000000&s=0

If you click on the link, you should see just the formula.

(Note: Given that they run the preprocessor, WordPress might change the path and the link might go stale. Or their server might be down. If that happens, right click on the image in your browser and select “View Image Info” or whatever the variant for doing that on your browser is.)

What Happens When it Fails

When WordPress’s web server goes down, it fills in something like “latex path not specified”. Not sure whether that’s postprocessing or just the result from the server not being configured properly.

Brendan O’Connor was kind enough to send me a screen shot:

screen shot of effect of WordPress LaTeX server outage

The Latin1 Transcoding Trick (for Java Servlets, etc.)

July 14, 2010 by lingpipe

This post uses Java servlets as an example, but applies more broadly to a situation where a program automatically converts streams of bytes to streams of Unicode characters.

The Trick

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte. These char values may be translated back into bytes with a simple cast:

Converter.setEncoding("ISO-8859-1");
char[] cs = Converter.getChars();
byte[] bs = new byte[cs.length];
for (int i = 0; i < cs.length; ++i)
    bs[i] = (byte) cs[i];

Now you have the original bytes back again in the array bs. In Java, char values act as unsigned 16-bit values, whereas byte values are signed 8-bit values. The casting preserves values through the magic of integer arithmetic overflow and twos-complement notation. (I attach a program that’ll verify this works at the end.)

You can now use your own character encoding detection or pull out a general solution like the International Components for Unicode (which I highly recommend — it tracks the Unicode standard very closely, performing character encoding detection, fully general and configurable Unicode normalization, and even transliteration).

Use in Servlets for Forms

I learned this trick from Jason Hunter’s excellent book, Java Servlet Programming, (2nd Edition, O’Reilly). Hunter uses the trick for decoding form data. The problem is that there’s no way in HTML to declare what character encoding is used on a form. What Hunter does is add a hidden field for the value of the char encoding followed by the Latin1 transcoding trick to recover the actual string.

Here’s an illustrative code snippet, copied more or less directly from Hunter’s book:

public void doGet(HttpServletRequest req, ...) {
   ...
    String encoding = req.getParameter("charset");
    String text = req.getParameter("text");
    text = new String(text.getBytes("ISO-8859-1"), encoding);
    ...

Of course, this assumes that the getParameter() will use Latin1 to do the decoding so that the getBytes("ISO-8859-1") returns the original bytes. According to Hunter, this is typically what happens because browsers insist on submitting forms using ISO-8859-1, no matter what the user has chosen as an encoding.

We borrowed this solution for our demos (all the servlet code is in the distribution under $LINGPIPE/demos/generic), though we let the user choose the encoding (which is itself problematic because of the way browsers work, but that’s another story).

Testing Transcoding

public class Test {
    public static void main(String[] args) throws Exception {
        byte[] bs = new byte[256];
        for (int i = -128; i < 128; ++i)
            bs[i+128] = (byte) i;
        String s = new String(bs,"ISO-8859-1");
        char[] cs = s.toCharArray();
        byte[] bs2 = s.getBytes("ISO-8859-1");
        for (int i = 0; i < 256; ++i)
            System.out.printf("%d %d %d\n",
                             (int)bs[i],(int)cs[i],(int)bs2[i]);
    }
}

which prints out

c:\carp\temp>javac Test.java

c:\carp\temp>java Test
-128 128 -128
-127 129 -127
...
-2 254 -2
-1 255 -1
0 0 0
1 1 1
...
126 126 126
127 127 127

Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

July 13, 2010 by lingpipe

I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led to me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

Thanks to Wikipedia

I couldn’t have done the derivation for LDA (or NB) without the help of

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia’s Derivation is Wrong!

At least I’m pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what’s going on is in the text above which says it drops terms that don’t depend on k (the currently considered topic assignment for the n-th word of the m-th document). The Wikipedia’s calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don’t need to keep (as large) a symbol table in your head.

I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

LaTeX Rocks

I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

Unicode Shell for Java: Windows PowerShell ISE

July 6, 2010 by lingpipe

I finally figured out how to have my Java programs write Unicode to a Windows shell and show up looking the right way. Here’s the final result (click to enlarge):

DOS vs Powershell for Unicode

And here’s how I did it.

1. Fire Up PowerShell ISE

My new Windows 7 Professional notebook comes with an application called PowerShell ISE (make sure to run the ISE version; the unmarked one is more like DOS and has the same problems noted below). The “ISE” is for “integrated scripting environment”.

It defaults to Consolas font, which looks a lot like Lucida console.

2. Set Character Encoding to UTF-8

This’ll set the DOS character encoding to be UTF-8.

> chcp 65001
Active code page: 65001

3. Set Java System.out to UTF-8

Before writing to System.out, Java’s constant for the standard output, you’ll need to set it up to use UTF-8. It’s a one-liner.

System.setOut(new PrintStream(System.out,true,"UTF-8"));

The true value enables auto-flushing, which is a good idea for standard output.

Demo Program

Here’s a simple test program (the backslash escapes are Java literals for Unicode code points).

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Test {

    public static void main(String[] args)
        throws UnsupportedEncodingException {

        PrintStream utf8Out
            = new PrintStream(System.out,false,"UTF-8");
        System.setOut(utf8Out);

        System.out.println("English: Hello");
        System.out.println("French: D\u00E9j\u00E0 vu");
        System.out.println("Tamil: \u0B92");
        System.out.println("Han: \u4E52");
    }

}

You can see the output in action in the screen dumps at the top of this post.

Problems with DOS

The DOS shell will let you set the character encoding and change the font to Lucida console. But it oddly duplicates characters at the end of lines if the lines contain non-ASCII code points. You can see this on the example. And it can’t handle the Tamil or Han characters.