Linear Models for N-Gram Probabilities


I’ve been thinking about linear classifiers, and while it’s well known how to convert naive Bayes (multinomial) classifiers into linear models (just treat log p(cat) as intercept, and log p(word|cat) as dimensions), it’s less well known how to do this with higher-order n-grams. One use is an n-gram language-model based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring’s flexible enough for this).

Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell. In the bigram case, the smoothed estimate is:

  p(a|b) = lam(b) * p'(a|b) + (1-lam(b)) p(a)
    p(a) = lam() * p'(a) + (1 - lam()) unif

where lam(b) in [0,1] is the interpolation coefficient for context b and lam() is the interpolation coefficient for the empty n-gram. p'(a|b) and p'(a) are maximum likelihood estimates, and unif is the uniform probability of a token.

For Witten-Bell smoothing, interpolation is defined as:

  lam(b) = count(b,_) / (count(b,_) + 1)

where count(b,_) is the number of times the token b is seen before another token in training, making lam(b) increase with the number of training instances.

Let count[X] be the count for n-gram X (where the length may be 0, 1 or 2). For instance, the vector “John likes John” has counts for the following dimensions:

  [] : 3
  [John] : 2
  [likes] : 1
  [John,likes] : 1
  [likes,John] : 1

Then define coefficient vector beta, with dimensions for all of the n-grams seen in training (after pruning):

  beta[] = log lam() + log unif
  beta[a] = log p(a) - beta[]
  beta[a,b] = log p(b|a) - beta[a]

If we look at the contribution from terms in our example, and sum, we see how it works, assuming all bigrams have been seen in training other than [likes,John].

  [] : 3             3 * beta[]           =  3 * (log lam() + log unif)
  [John] : 2         2 * beta[John]       =  2 * (log p(john) - beta[])
  [likes] : 1        1 * beta[likes]      =  1 * (log p(likes) - beta[])
  [John,likes] : 1   1 * beta[John,likes] =  1 * (log p(likes|John) - beta[John])
  [likes,John] : 1   1 * beta[likes,John] =  0

Doing the sums leaves us with the log estimate:

  log p([John,likes,John]) 
    = log p(John) + log p(likes|John) + log lam(likes) + log p(John)

It’s no problem extending this to higher order n-grams. Many other kinds of interpolated smoothing will work, too.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: