Hal, of the NLPers blog, was just in town for a talk at Columbia on hierarchical Bayesian models (which was awesome — more on the zeitgeist of hierarchical Bayesian models in NLP in future posts). So I thought I’d catch up on his recent work, which includes this paper for the upcoming NAACL meeting in Colorado this summer:
- Goyal, Amit, Hal Daumé III, and Suresh Venkatasubramanian. 2009. Streaming for Large Scale NLP: Language Modeling.. NAACL.
This paper imports a nice, simple algorithm from
- Manku, Gurmeet Singh. and Rajeev Motwani. 2002. Aproximate frequency counts over data streams. VLDB.
and evaluates it on some NLP data. Goyal et al.’s a prime instance of a least publishable unit (LPU). I’m quite surprised it was accepted; I would’ve expected to see the contribution of this paper as an aside in a machine translation paper. But then I wouldn’t have found it. I really like small publishable units, as long as they’re not redundant (in a field; I also don’t mind importing results from other fields and do it all the time myself). Another advantage is that you can go into depth on one thing, which is really critical if you’re trying to implement (or decide to implement) something yourself.
The motivation is streaming data that requires (approximate) word n-gram counts because of its size. Having said that, they evaluated only on relatively small data sets (i.e. word 5-grams over fractions of English Gigaword under 1G words total). But there are lots of ways to deal with data this size. We managed to build character 8-grams in memory on even more data without pruning. You could also perform multiple passes, each storing a shard of the data, as is common with search engines and then merge it — that’s what LingPipe does to scale beyond what’ll fit in memory, following the Lucene search index model. You can speed this up with pre-passes for short n-grams to prune out any extensions of low-count low-order n-grams.
[Revised:] Goyal et al. cite Alex Franz and Thorsten Brants’s Google n-gram effort, and it’d be interesting to see how close you could get to that effort on a single machine. Franz and Brants ran over 1000 times more data (over 1T words) and collected many more n-grams (min count of 40). You simply couldn’t have produced the Google scale corpus using the Goyal et al. methods as they described their implementation on a regular desktop workstation (Goyal et al. restricted themselves to 8GB of RAM), because it produced hundreds of millions of n-grams. Goyal et al. took a threshold of 100 occurrences in 1G instances, which is a whole lot smaller (i.e. requiring 2000 times the relative frequency to keep a word). Presumably they could’ve collected that corpus (at frequency cutoff 80,000), but how long would it’ve taken to run? Is it even the right question to ask of an experimental paper?
In a nutshell, the Manku and Motwani algorithm, as presented by Goyal et al., keeps counts, error estimates, and periodically prunes the counts. It assumes you know the number N of tokens in the corpus ahead of time and have set thresholds ε and s such that ε << s. It then processes data in blocks of size 1/ε, pruning after each block.
INPUTS: words[1..N], ε in (0,1), s in (0,1), s >> &epsilon counter, error = new string to counter maps; for (t in 1:εN) for (w in words[(t-1)/ε:t/&epsilon]) if (w not stored) error(w) = t-1; ++count(w); for (w in words[1..N]) if (count(w) + error(w) <= t) remove w return (w,count(w)) if count(w) > (s-ε)N
The error(w) bounds how many counts could’ve been thrown away for a word before the current epoch t, and the count(w) is the count that hasn’t been thrown away. Pruning just removes those words that’ve appeared fewer than the number of epochs times.
It’s pretty easy to see why the following count results hold:
- the algorithm returns all words with actual counts > sN,
- the algorithm returns no word with actual counts < (s – ε)N, and
- returned counts are less than or equal to actual counts by an amount at most εN.
The harder bit’s the (presumably amortized; I didn’t read the original paper) analysis required to establish its space bound:
- The algorithm requires at most O(1/ε log (εN)) space.
Goyal et al. actually retain all counts (that is, they effectively set s=ε). This admits large relative frequency errors, because a word with actual count T = εN could have a returned count of 1 and another with actual count T could have a returned count of T. In NLP, the appearance of a term is often much more important than the counts, so I appreciate their motivation. When s >> ε as the algorithm developers intended, the relative frequency errors are bounded much more tightly because the min count sN is much greater than the max error εN.
You could actually run blocks of 1/ε tokens and find your N at the end if you didn’t know N ahead of time, as would be the case in a truly streaming application.
The empirical evals show exactly what you’d expect — the counts derived using this approach have slightly inferior performance compared to storing full counts and pruning.
They did crude implementations with hash tables for the n-grams rather than tries or PAT-tries, which is suboptimal in space and in time. It’s what Joel Spolsky called Shlemiel the painter’s algorithm, the usual source of which is the use of += on strings in Java. In technical terms, it’s a quadratic algorithm that keeps repeating work (1+2+3+…+N = O(N2) steps for problem of size N) where there’s a simple linear algorithm. LingPipe updates n-gram counts using the standard linear algorithms which are well known from the suffix tree literature. But I think the point of Goyal et al. was to evaluate the technique, not build a super-scalable implementation.
It’s not quite possible to use our sequence counters to implement this algorithm with LingPipe. We don’t store the error bounds. If you just compute each block separately and prune to minimum count 2 in each block, you can easily establish the same bounds on accuracy (though clearly not on size, though the fact that things are going to disk should make this easier to handle). You could store the error bounds separately and prune by hand, but you’d need to compute the error bounds yourself by finding all the new n-grams in a block, so it’s really not practical as LingPipe’s written now.