Pearson’s chi-squared independence test may be used to compute collocations in a text corpus; that is, sequences of words that occur together more than they might by chance. This typically finds names, idioms, set-phrases and the like in text. There’s a whole chapter on collocations in Manning and Schütze’s Stat NLP book, and there’s a Lingpipe Interesting Phrases Tutorial.
In the case of bigrams, we have two words A and B, and want to know if they occur as a bigram (A,B) more often than chance. The simple binary chi-squared independence test is based on a confusion matrix which computes bigram counts for (+A,+B), (+A,-B), c(-A,-B), and (-A,-B), where (+A,+B) is the count of the bigram (A,B), (+A,-B) is the count of all bigrams (A,D) where D != B, and (-A,-B) is the count of all bigrams (C,D) where C != A and D != B. The formula for the test statistic itself is defined in the javadoc for method
The LingPipe class
lm.TokenizedLM provides a convenient wrapper with which to count sequences of token n-grams. The tokenized LM class has a
train(CharSequence) method which allows texts to be given. This method stores the n-gram counts after tokenizing. So let’s see what happens if we give it two sentences as inputs:
- John ran
- John ran home
This produces the following counts:
This causes a subtle problem for chi-squared statistics: the count for a final word may be higher than its count as the first element of a bigram. The same problem happens for initial words and their counts as the second element of a bigram. So what are our exact bigram counts for (ran,home)? They’re (+ran,+home)=1, (+ran,-home)=0, (-ran,+home)=0, and (-ran,-home)=2. (Leave aside for now the fact that this isn’t high enough counts for the significance test to be accurate; it’s just an example of a point.)
The trie data structure used to represent counts (
lm.TrieIntSeqCounter) can easily compute the number of words following a given word with a local lookup of a word’s daughters in the trie. It can’t easily compute the number of words preceding a given word; that would require iterating over all characters. So would computing all bigrams.
So the chi-squared implementation takes a shortcut. It approximates count(+ran,-home)=0 by count(+ran)-count(+ran,+home) = 2-1. We’re in effect overcounting the negative cases. Thus our chi-squared statistic is wrong.
But what if we implicitly assume there are boundary tokens, so we model the following sequences:
- BOS John ran EOS
- BOS John ran home EOS
giving us the following additional unigram and bigram counts:
Our approximate counts now become exact:
- count(+ran,-home) = count(+ran)-count(+ran,+home) = 1
- count(-ran,+home) = count(home)-count(+ran,+home) = 0
- count(-ran,-home) = totalCount-count(+ran,+home)-count(-ran,+home)-count(+ran,-home) = 9-0-1-1 = 7
Et voilà. An ex post facto rationalization. What LingPipe is counting in chi-squared collocation tests includes the implicit boundaries.