## Linear Models for N-Gram Probabilities

I’ve been thinking about linear classifiers. It’s well known how to convert a multinomial naive Bayes classifier into a linear model: treat log p(cat) as the intercept and the log p(word|cat) terms as coefficients. It’s less well known how to do the same thing with higher-order n-grams. One use is an n-gram language-model based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring is flexible enough for this).
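For reference, here is the naive Bayes case as code, a minimal sketch with made-up parameters (the category prior and word likelihoods are arbitrary):

```python
import math

# Made-up multinomial naive Bayes parameters for a single category.
log_prior = math.log(0.3)    # intercept: log p(cat)
log_like = {                 # coefficients: log p(word|cat)
    "win": math.log(0.2),
    "team": math.log(0.1),
}

def nb_score(token_counts):
    """Linear score: intercept plus count-weighted coefficients."""
    return log_prior + sum(n * log_like[w] for w, n in token_counts.items())

score = nb_score({"win": 2, "team": 1})   # log p(cat) + log p(doc|cat)
```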

Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell. In the bigram case, the smoothed estimate is:

```
p(a|b) = lam(b) * p'(a|b) + (1 - lam(b)) * p(a)
p(a)   = lam()  * p'(a)   + (1 - lam())  * unif
```

where `lam(b)` in `[0,1]` is the interpolation coefficient for context `b` and `lam()` is the interpolation coefficient for the empty n-gram. `p'(a|b)` and `p'(a)` are maximum likelihood estimates, and `unif` is the uniform probability of a token.
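As code, with the interpolation weights and maximum-likelihood estimates passed in as plain numbers (the values in the usage lines are arbitrary):

```python
def smoothed_unigram(p_ml_a, lam0, unif):
    # p(a) = lam() * p'(a) + (1 - lam()) * unif
    return lam0 * p_ml_a + (1.0 - lam0) * unif

def smoothed_bigram(p_ml_ab, lam_b, p_a):
    # p(a|b) = lam(b) * p'(a|b) + (1 - lam(b)) * p(a)
    return lam_b * p_ml_ab + (1.0 - lam_b) * p_a

p_a = smoothed_unigram(0.5, 0.8, 0.01)    # arbitrary numbers
p_ab = smoothed_bigram(0.6, 0.75, p_a)
```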

For Witten-Bell smoothing, the interpolation coefficient is defined as:

```
lam(b) = count(b,_) / (count(b,_) + T(b))
```

where `count(b,_)` is the number of times the token `b` is seen before another token in training and `T(b)` is the number of distinct tokens seen following `b`. This makes `lam(b)` increase toward 1 as context `b` accumulates training instances.
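Computing these weights from a token stream is straightforward; here is a sketch, with `T(b)` taken as the number of distinct continuation types observed after `b`:

```python
from collections import Counter

def witten_bell_lambdas(tokens):
    """lam(b) = count(b,_) / (count(b,_) + T(b)) for each context b,
    where T(b) is the number of distinct tokens seen after b."""
    follow = {}  # context -> Counter of continuation tokens
    for b, a in zip(tokens, tokens[1:]):
        follow.setdefault(b, Counter())[a] += 1
    return {b: sum(c.values()) / (sum(c.values()) + len(c))
            for b, c in follow.items()}

lams = witten_bell_lambdas(["John", "likes", "John", "likes", "Mary"])
```

Here `count(John,_) = 2` with one continuation type, so `lam(John) = 2/3`, while `likes` has two continuation types, so `lam(likes) = 2/4`.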

Let `count[X]` be the count for n-gram `X` (of length 0, 1, or 2). For instance, the text “John likes John” yields counts for the following dimensions:

```
[] : 3
[John] : 2
[likes] : 1
[John,likes] : 1
[likes,John] : 1
```
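Extracting those dimensions from a token list takes only a few lines (a sketch; the empty n-gram’s count is just the token count):

```python
from collections import Counter

def ngram_counts(tokens, max_order=2):
    """Counts for all n-grams of length 0..max_order, keyed by tuple;
    the empty tuple () is the empty n-gram."""
    counts = Counter({(): len(tokens)})
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["John", "likes", "John"])
```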

Then define coefficient vector `beta`, with dimensions for all of the n-grams seen in training (after pruning):

```
beta[]    = log (1 - lam()) + log unif
beta[a]   = log p(a) - beta[]
beta[a,b] = log p(b|a) - beta[b] - beta[]
```

Here `beta[]` is the score of a token never seen in training, whose smoothed probability is `(1 - lam()) * unif`. The definitions are arranged so that `beta[] + beta[a] = log p(a)` and `beta[] + beta[b] + beta[a,b] = log p(b|a)`: each token’s contribution collapses to its smoothed log probability.

If we look at the contribution from each dimension in our example and sum, we can see how it works. Assume every n-gram in the example was seen in training except the bigram `[likes,John]`, whose coefficient is therefore 0:

```
[] : 3             3 * beta[]
[John] : 2         2 * beta[John]       =  2 * (log p(John) - beta[])
[likes] : 1        1 * beta[likes]      =  1 * (log p(likes) - beta[])
[John,likes] : 1   1 * beta[John,likes] =  1 * (log p(likes|John) - log p(likes))
[likes,John] : 1   1 * beta[likes,John] =  0
```

Doing the sums, the `beta[]` and `log p(likes)` terms cancel, leaving the log estimate:

```
log p([John,likes,John])
~= log p(John) + log p(likes|John) + log p(John)
```

This matches the true smoothed log probability term by term, except that the unseen bigram `[likes,John]` contributes only `log p(John)`: its backoff weight `log (1 - lam(likes))` is dropped, which is the price of keeping the score linear in the n-gram counts.
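To sanity-check the algebra numerically, here is a toy end-to-end sketch. All probabilities are made up, `beta` is built so that `beta[] + beta[b] + beta[a,b] = log p(b|a)`, and the bigram `[likes,John]` is deliberately left unseen:

```python
import math
from collections import Counter

# Made-up smoothed probabilities (not fit to any corpus; just for the algebra).
lam0, unif = 0.9, 0.01
p_uni = {"John": 0.4, "likes": 0.2}     # smoothed p(a)
p_bi = {("John", "likes"): 0.5}         # smoothed p(b|a); [likes,John] unseen

# Coefficients: beta[()] scores a never-seen token, and the bigram entry is
# log p(b|a) - log p(b), so beta[()] + beta[(b,)] + beta[(a, b)] = log p(b|a).
beta = {(): math.log((1.0 - lam0) * unif)}
for a, p in p_uni.items():
    beta[(a,)] = math.log(p) - beta[()]
for (a, b), p in p_bi.items():
    beta[(a, b)] = math.log(p) - math.log(p_uni[b])

tokens = ["John", "likes", "John"]
counts = Counter({(): len(tokens)})
for i, t in enumerate(tokens):
    counts[(t,)] += 1
    if i > 0:
        counts[(tokens[i - 1], t)] += 1

# Dot product; n-grams without a trained coefficient contribute 0.
score = sum(n * beta[x] for x, n in counts.items() if x in beta)

# Per-token sum: p(John) * p(likes|John) * p(John), with the backoff weight
# (1 - lam(likes)) for the unseen bigram dropped.
direct = math.log(0.4) + math.log(0.5) + math.log(0.4)
```

The two totals agree exactly, confirming that the count-times-coefficient sum reproduces the per-token log probabilities.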

It’s no problem extending this to higher order n-grams. Many other kinds of interpolated smoothing will work, too.