I’ve been thinking about linear classifiers, and while it’s well known how to convert naive Bayes (multinomial) classifiers into linear models (just treat log p(cat) as intercept, and log p(word|cat) as dimensions), it’s less well known how to do this with higher-order n-grams. One use is an n-gram language-model based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring’s flexible enough for this).

Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell. In the bigram case, the smoothed estimate is:

p(a|b) = lam(b) * p'(a|b) + (1-lam(b)) p(a) p(a) = lam() * p'(a) + (1 - lam()) unif

where `lam(b)`

in `[0,1]`

is the interpolation coefficient for context `b`

and `lam()`

is the interpolation coefficient for the empty n-gram. `p'(a|b)`

and `p'(a)`

are maximum likelihood estimates, and `unif`

is the uniform probability of a token.

For Witten-Bell smoothing, interpolation is defined as:

lam(b) = count(b,_) / (count(b,_) + 1)

where `count(b,_)`

is the number of times the token `b`

is seen before another token in training, making `lam(b)`

increase with the number of training instances.

Let `count[X]`

be the count for n-gram `X`

(where the length may be 0, 1 or 2). For instance, the vector “John likes John” has counts for the following dimensions:

[] : 3 [John] : 2 [likes] : 1 [John,likes] : 1 [likes,John] : 1

Then define coefficient vector `beta`

, with dimensions for all of the n-grams seen in training (after pruning):

beta[] = log lam() + log unif beta[a] = log p(a) - beta[] beta[a,b] = log p(b|a) - beta[a]

If we look at the contribution from terms in our example, and sum, we see how it works, assuming all bigrams have been seen in training other than `[likes,John]`

.

[] : 3 3 * beta[] = 3 * (log lam() + log unif) [John] : 2 2 * beta[John] = 2 * (log p(john) - beta[]) [likes] : 1 1 * beta[likes] = 1 * (log p(likes) - beta[]) [John,likes] : 1 1 * beta[John,likes] = 1 * (log p(likes|John) - beta[John]) [likes,John] : 1 1 * beta[likes,John] = 0

Doing the sums leaves us with the log estimate:

log p([John,likes,John]) = log p(John) + log p(likes|John) + log lam(likes) + log p(John)

It’s no problem extending this to higher order n-grams. Many other kinds of interpolated smoothing will work, too.

## Leave a Reply