I’ve been thinking about linear classifiers, and while it’s well known how to convert naive Bayes (multinomial) classifiers into linear models (just treat log p(cat) as the intercept and log p(word|cat) as the coefficient on each word-count dimension), it’s less well known how to do this with higher-order n-grams. One use is an n-gram language-model based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring is flexible enough for this).
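For the record, here’s the familiar naive Bayes case as a few lines of Python; this is just a minimal sketch, and the names (nb_score, log_prior, log_p_word_given_cat) are made up stand-ins for whatever a trained model actually stores:

def nb_score(token_counts, log_prior, log_p_word_given_cat):
    # log p(cat) is the intercept; each word dimension carries weight log p(word|cat)
    score = log_prior
    for word, n in token_counts.items():
        score += n * log_p_word_given_cat[word]
    return score

# e.g. nb_score({"John": 2, "likes": 1}, -0.7, {"John": -1.2, "likes": -2.3})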
Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell. In the bigram case, the smoothed estimate is:
p(a|b) = lam(b) * p'(a|b) + (1 - lam(b)) * p(a)
p(a) = lam() * p'(a) + (1 - lam()) * unif
where lam(b) in [0,1] is the interpolation coefficient for context b, lam() is the interpolation coefficient for the empty n-gram, p'(a|b) and p'(a) are maximum likelihood estimates, and unif is the uniform probability of a token.
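As a sketch of those two estimates in Python (the tables p_ml_uni and p_ml_bi, the coefficients lam and lam0, and unif are hypothetical stand-ins for trained statistics):

def p_uni(a, lam0, p_ml_uni, unif):
    # p(a) = lam() * p'(a) + (1 - lam()) * unif
    return lam0 * p_ml_uni.get(a, 0.0) + (1.0 - lam0) * unif

def p_bi(a, b, lam, lam0, p_ml_bi, p_ml_uni, unif):
    # p(a|b) = lam(b) * p'(a|b) + (1 - lam(b)) * p(a)
    lam_b = lam.get(b, 0.0)
    return lam_b * p_ml_bi.get((b, a), 0.0) + (1.0 - lam_b) * p_uni(a, lam0, p_ml_uni, unif)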
For Witten-Bell smoothing, interpolation is defined as:
lam(b) = count(b,_) / (count(b,_) + T(b))

where count(b,_) is the number of times the token b is seen before another token in training and T(b) is the number of distinct tokens seen following b. This makes lam(b) grow toward 1 as the context b is observed more often in training.
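In code, assuming we already have the context count and its number of distinct followers on hand (both argument names are hypothetical), that’s just:

def wb_lambda(count_b, distinct_followers_b):
    # lam(b) = count(b,_) / (count(b,_) + T(b)); 0 for a context never seen in training
    if count_b == 0:
        return 0.0
    return count_b / (count_b + distinct_followers_b)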
Let count[X] be the count for n-gram X (where the length may be 0, 1, or 2). For instance, the vector for “John likes John” has counts on the following dimensions:
[] : 3
[John] : 2
[likes] : 1
[John,likes] : 1
[likes,John] : 1
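Extracting those counts is a few lines of Python; a minimal sketch, where the function name and the tuple representation of n-grams are my own choices rather than anything from a particular toolkit:

from collections import Counter

def ngram_counts(tokens, max_n=2):
    counts = Counter()
    for i, tok in enumerate(tokens):
        counts[()] += 1       # the empty n-gram [] fires once per token
        counts[(tok,)] += 1   # unigram
        if max_n >= 2 and i > 0:
            counts[(tokens[i - 1], tok)] += 1   # bigram with the preceding token
    return counts

# ngram_counts(["John", "likes", "John"]) gives
# {(): 3, ("John",): 2, ("likes",): 1, ("John", "likes"): 1, ("likes", "John"): 1}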
Then define the coefficient vector beta, with dimensions for all of the n-grams seen in training (after pruning):
beta[] = log (1 - lam()) + log unif
beta[a] = log p(a) - beta[]
beta[a,b] = log p(b|a) - beta[b] - beta[]

Here beta[] is the log estimate for a token never seen in training, and the bigram weight subtracts both its token’s unigram weight and the intercept, so the dimensions firing at any one position sum (telescope) to the smoothed log estimate for that position.
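A sketch of building beta from the smoothed estimates; log_p_uni and log_p_bi are hypothetical tables of smoothed log estimates for the n-grams that survive pruning, and it’s assumed every retained bigram’s final token also has a unigram entry:

import math

def build_beta(log_p_uni, log_p_bi, lam0, unif):
    beta = {(): math.log(1.0 - lam0) + math.log(unif)}   # beta[]
    for a, lp in log_p_uni.items():
        beta[(a,)] = lp - beta[()]                        # beta[a]
    for (a, b), lp in log_p_bi.items():
        # subtract the token's unigram weight and the intercept so the features
        # firing at a single position sum to log p(b|a)
        beta[(a, b)] = lp - beta[(b,)] - beta[()]
    return beta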
If we look at the contribution from each term in our example, and sum, we see how it works, assuming all bigrams other than [likes,John] have been seen in training.
[] : 3            3 * beta[] = 3 * (log (1 - lam()) + log unif)
[John] : 2        2 * beta[John] = 2 * (log p(John) - beta[])
[likes] : 1       1 * beta[likes] = 1 * (log p(likes) - beta[])
[John,likes] : 1  1 * beta[John,likes] = 1 * (log p(likes|John) - beta[likes] - beta[])
[likes,John] : 1  1 * beta[likes,John] = 0 (no dimension for this unseen bigram)
Doing the sums, the beta[] terms cancel and we are left with the log estimate:

log p([John,likes,John]) ≈ log p(John) + log p(likes|John) + log p(John)

Each position’s contribution is exactly the model’s smoothed estimate except the last: the linear score drops the back-off weight log (1 - lam(likes)) for the unseen bigram [likes,John], since there is no feature dimension to carry it.
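To check the telescoping numerically, the score is just a sparse dot product over the dimensions that exist; a minimal sketch:

def linear_score(beta, counts):
    # n-grams with no dimension (unseen in training, or pruned) contribute nothing
    return sum(n * beta[ng] for ng, n in counts.items() if ng in beta)

# With the earlier sketches, linear_score(build_beta(...), ngram_counts(["John", "likes", "John"]))
# comes out to log p(John) + log p(likes|John) + log p(John), since [likes,John] has no dimension.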
It’s no problem extending this to higher-order n-grams: each n-gram’s weight is its smoothed log estimate minus the sum of the weights of its shorter suffixes, e.g. beta[a,b,c] = log p(c|a,b) - beta[b,c] - beta[c] - beta[]. Many other kinds of interpolated smoothing will work, too.