Pearson’s chi-squared independence test may be used to compute collocations in a text corpus; that is, sequences of words that occur together more than they might by chance. This typically finds names, idioms, set-phrases and the like in text. There’s a whole chapter on collocations in Manning and Schütze’s Stat NLP book, and there’s a Lingpipe Interesting Phrases Tutorial.

In the case of bigrams, we have two words A and B, and want to know if they occur as a bigram (A,B) more often than chance. The simple binary chi-squared independence test is based on a confusion matrix which computes bigram counts for (+A,+B), (+A,-B), c(-A,-B), and (-A,-B), where (+A,+B) is the count of the bigram (A,B), (+A,-B) is the count of all bigrams (A,D) where D != B, and (-A,-B) is the count of all bigrams (C,D) where C != A and D != B. The formula for the test statistic itself is defined in the javadoc for method `Statistics.chiSquaredIndependence(double,double,double,double`

).

The LingPipe class `lm.TokenizedLM`

provides a convenient wrapper with which to count sequences of token n-grams. The tokenized LM class has a `train(CharSequence)`

method which allows texts to be given. This method stores the n-gram counts after tokenizing. So let’s see what happens if we give it two sentences as inputs:

- John ran
- John ran home

This produces the following counts:

n-gram | Count |
---|---|

John | 2 |

ran | 2 |

home | 1 |

(John,ran) | 2 |

(ran,home) | 1 |

(John,ran,home) | 1 |

This causes a subtle problem for chi-squared statistics: the count for a final word may be higher than its count as the first element of a bigram. The same problem happens for initial words and their counts as the second element of a bigram. So what are our exact bigram counts for (ran,home)? They’re (+ran,+home)=1, (+ran,-home)=0, (-ran,+home)=0, and (-ran,-home)=2. (Leave aside for now the fact that this isn’t high enough counts for the significance test to be accurate; it’s just an example of a point.)

The trie data structure used to represent counts (`lm.TrieIntSeqCounter`

) can easily compute the number of words following a given word with a local lookup of a word’s daughters in the trie. It can’t easily compute the number of words preceding a given word; that would require iterating over all characters. So would computing all bigrams.

So the chi-squared implementation takes a shortcut. It approximates count(+ran,-home)=0 by count(+ran)-count(+ran,+home) = 2-1. We’re in effect overcounting the negative cases. Thus our chi-squared statistic is wrong.

But what if we implicitly assume there are boundary tokens, so we model the following sequences:

- BOS John ran EOS
- BOS John ran home EOS

giving us the following additional unigram and bigram counts:

n-gram | Count |
---|---|

BOS | 2 |

EOS | 2 |

(BOS,John) | 2 |

(ran,EOS) | 1 |

(home,EOS) | 1 |

Our approximate counts now become exact:

- count(+ran,+home)=1
- count(+ran,-home) = count(+ran)-count(+ran,+home) = 1
- count(-ran,+home) = count(home)-count(+ran,+home) = 0
- count(-ran,-home) = totalCount-count(+ran,+home)-count(-ran,+home)-count(+ran,-home) = 9-0-1-1 = 7

*Et voilà*. An *ex post facto* rationalization. What LingPipe is counting in chi-squared collocation tests includes the implicit boundaries.

May 24, 2010 at 8:40 am |

[…] mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", […]