## Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions

Pearson’s chi-squared independence test may be used to find collocations in a text corpus; that is, sequences of words that occur together more often than they would by chance. This typically finds names, idioms, set phrases and the like in text. There’s a whole chapter on collocations in Manning and Schütze’s Stat NLP book, and there’s a LingPipe Interesting Phrases Tutorial.

In the case of bigrams, we have two words A and B, and want to know if they occur as a bigram (A,B) more often than chance would predict. The simple binary chi-squared independence test is based on a contingency matrix of bigram counts for (+A,+B), (+A,-B), (-A,+B), and (-A,-B), where (+A,+B) is the count of the bigram (A,B), (+A,-B) is the count of all bigrams (A,D) where D != B, (-A,+B) is the count of all bigrams (C,B) where C != A, and (-A,-B) is the count of all bigrams (C,D) where C != A and D != B. The formula for the test statistic itself is defined in the javadoc for the method `Statistics.chiSquaredIndependence(double,double,double,double)`.
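
For a 2×2 table with cells a=(+A,+B), b=(+A,-B), c=(-A,+B), d=(-A,-B), the statistic reduces to the familiar closed form n(ad−bc)²/((a+b)(c+d)(a+c)(b+d)) with n=a+b+c+d. Here's a minimal sketch; the method name just mirrors the LingPipe signature for readability, and this is not the library's code:

```java
// Sketch of the standard chi-squared independence statistic for a
// 2x2 contingency table of bigram counts. The name mirrors LingPipe's
// Statistics.chiSquaredIndependence signature; this is an illustration,
// not the library implementation.
public class ChiSquared {
    // both = c(+A,+B), oneOnly = c(+A,-B), twoOnly = c(-A,+B), neither = c(-A,-B)
    public static double chiSquaredIndependence(double both, double oneOnly,
                                                double twoOnly, double neither) {
        double n = both + oneOnly + twoOnly + neither;
        double diff = both * neither - oneOnly * twoOnly;
        return n * diff * diff
            / ((both + oneOnly) * (twoOnly + neither)
               * (both + twoOnly) * (oneOnly + neither));
    }
}
```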

The LingPipe class `lm.TokenizedLM` provides a convenient wrapper for counting sequences of token n-grams. The tokenized LM class has a `train(CharSequence)` method that accepts a text, tokenizes it, and stores the resulting n-gram counts. So let’s see what happens if we give it two sentences as inputs:

• John ran
• John ran home

This produces the following counts:

| n-gram | Count |
|---|---|
| John | 2 |
| ran | 2 |
| home | 1 |
| (John,ran) | 2 |
| (ran,home) | 1 |
| (John,ran,home) | 1 |
This causes a subtle problem for chi-squared statistics: the count for a text-final word may be higher than its count as the first element of a bigram. The same problem arises for text-initial words and their counts as the second element of a bigram. So what are our exact bigram counts for (ran,home)? They’re count(+ran,+home)=1, count(+ran,-home)=0, count(-ran,+home)=0, and count(-ran,-home)=2. (Leave aside for now the fact that these counts aren’t high enough for the significance test to be accurate; this is just an example to illustrate the point.)
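
Those exact cells can be checked by brute force over the bigrams of the two example sentences. A small sketch (the class and method names here are illustrative, not part of LingPipe):

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force the four contingency cells for a target bigram (a,b)
// directly from the bigrams of a tokenized corpus, with no boundary tokens.
public class BigramCells {
    public static int[] cells(List<String[]> sentences, String a, String b) {
        int pp = 0, pm = 0, mp = 0, mm = 0;
        for (String[] toks : sentences) {
            for (int i = 0; i + 1 < toks.length; ++i) {
                boolean first = toks[i].equals(a);
                boolean second = toks[i + 1].equals(b);
                if (first && second) ++pp;       // (+a,+b)
                else if (first) ++pm;            // (+a,-b)
                else if (second) ++mp;           // (-a,+b)
                else ++mm;                       // (-a,-b)
            }
        }
        return new int[] { pp, pm, mp, mm };
    }

    // The two example sentences from the text.
    public static List<String[]> corpus() {
        List<String[]> sents = new ArrayList<>();
        sents.add(new String[] { "John", "ran" });
        sents.add(new String[] { "John", "ran", "home" });
        return sents;
    }
}
```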

The trie data structure used to represent counts (`lm.TrieIntSeqCounter`) can easily compute the number of words following a given word with a local lookup of the word’s daughters in the trie. It can’t easily compute the number of words preceding a given word; that would require iterating over every word at the first level of the trie. So would computing the total number of bigrams.
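
The asymmetry shows up even in a toy two-level map standing in for the real trie (this is an illustration of the access pattern, not `TrieIntSeqCounter`'s implementation): followers of a word are one local lookup, while predecessors force a scan of every root.

```java
import java.util.HashMap;
import java.util.Map;

// Toy two-level "trie" of bigram counts: first word -> (follower -> count).
// Followers of a word are a single local lookup; predecessors require
// scanning every first-level entry. Illustrative only, not LingPipe code.
public class BigramTrie {
    private final Map<String, Map<String, Integer>> root = new HashMap<>();

    public void increment(String first, String second) {
        root.computeIfAbsent(first, k -> new HashMap<>())
            .merge(second, 1, Integer::sum);
    }

    // Cheap: one lookup in the first word's daughter map.
    public int numFollowers(String first) {
        return root.getOrDefault(first, new HashMap<>()).size();
    }

    // Expensive: must visit every first-level word's daughters.
    public int numPredecessors(String second) {
        int n = 0;
        for (Map<String, Integer> daughters : root.values()) {
            if (daughters.containsKey(second)) ++n;
        }
        return n;
    }
}
```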

So the chi-squared implementation takes a shortcut. It approximates count(+ran,-home)=0 by count(+ran)-count(+ran,+home) = 2-1 = 1. We’re in effect overcounting the negative cases, so our chi-squared statistic comes out wrong.
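
Spelled out as code, the shortcut uses only counts the trie can deliver cheaply. A sketch, with an illustrative name (the exact 0 would require the predecessor scan the trie avoids):

```java
// The shortcut: approximate count(+A,-B) from a word's unigram count and
// the bigram count. Without boundary tokens this overcounts for (ran,home):
// count(ran)=2 minus count(ran,home)=1 gives 1, but the exact cell is 0,
// because the final "ran" of "John ran" starts no bigram at all.
public class Shortcut {
    public static double approxFirstOnly(double countFirst, double countBigram) {
        return countFirst - countBigram;  // count(+A) - count(+A,+B)
    }
}
```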

But what if we implicitly assume there are boundary tokens, so we model the following sequences:

• BOS John ran EOS
• BOS John ran home EOS

giving us the following additional unigram and bigram counts:

| n-gram | Count |
|---|---|
| BOS | 2 |
| EOS | 2 |
| (BOS,John) | 2 |
| (ran,EOS) | 1 |
| (home,EOS) | 1 |
Our approximate counts now become exact:

• count(+ran,+home)=1
• count(+ran,-home) = count(+ran)-count(+ran,+home) = 1
• count(-ran,+home) = count(home)-count(+ran,+home) = 0
• count(-ran,-home) = totalCount-count(+ran,+home)-count(-ran,+home)-count(+ran,-home) = 9-1-0-1 = 7
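
The arithmetic above can be checked mechanically. Assuming the BOS/EOS-padded counts from the tables (count(ran)=2, count(home)=1, count(ran,home)=1, and a total count of 9 including boundaries), the same subtraction formulas reproduce the four cells; the names here are illustrative:

```java
// Check the boundary-token arithmetic for the (ran, home) table: with
// BOS/EOS padding, the subtraction-based cells come out to 1, 1, 0, 7,
// matching the bullet list above. Illustrative sketch, not LingPipe code.
public class BoundaryCheck {
    public static double[] cells(double countA, double countB,
                                 double countAB, double totalCount) {
        double pp = countAB;                    // count(+ran,+home)
        double pm = countA - countAB;           // count(+ran,-home)
        double mp = countB - countAB;           // count(-ran,+home)
        double mm = totalCount - pp - mp - pm;  // count(-ran,-home)
        return new double[] { pp, pm, mp, mm };
    }
}
```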

Et voilà. An ex post facto rationalization. What LingPipe is counting in chi-squared collocation tests includes the implicit boundaries.
