You can think of a binomial hypothesis test as a single-cell chi-squared test. Wikipedia’s Chi-Square Test entry is nice in that it shows how each table entry in the chi-squared test is a binomial hypothesis test, and how those binomials are approximated by normals for high-count data. With multiple cells in the chi-squared test, the final statistic is a sum of squared normal approximations, which is approximately chi-squared distributed. So really, chi-squared tests are just a series of binomial tests combined by summation.
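To make the single-cell equivalence concrete, here is a minimal sketch (with made-up counts) showing that the squared z-score from the normal approximation to a binomial test is exactly the chi-squared statistic computed over the two cells (success, failure) of the corresponding table:

```python
import math

# Made-up counts: n trials, k successes, null success probability p0.
n, k, p0 = 1000, 540, 0.5

# Normal approximation to the binomial test: z-score of the observed count.
z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))

# Chi-squared statistic over the two cells (success, failure).
observed = [k, n - k]
expected = [n * p0, n * (1 - p0)]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(z * z, chi_sq)  # the two statistics are identical: 6.4 and 6.4
```

With more than two cells, the per-cell terms no longer pair up into a single z-score, but the statistic is still a sum of the same squared-deviation terms.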

In our application, we compare a binomial model derived under independence assumptions against a binomial model based on our higher-order n-gram models. If you read the collocation chapter of Chris Manning and Hinrich Schuetze’s book on statistical NLP, they go over both chi-squared and t-tests for collocation data.
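The independence-assumption side of that comparison can be sketched as a binomial z-test: under independence, the probability of the bigram is the product of the unigram probabilities, and we ask whether the observed bigram count is surprising under that null. All counts here are invented for illustration:

```python
import math

# Invented corpus counts for a bigram "w1 w2".
n = 100000              # total bigrams in the corpus
c_w1, c_w2 = 500, 400   # unigram counts of w1 and w2
c_w1w2 = 40             # observed count of the bigram

# Null hypothesis: independence, so P(w1 w2) = P(w1) * P(w2).
p0 = (c_w1 / n) * (c_w2 / n)

# z-score of the observed bigram count under the binomial null.
z = (c_w1w2 - n * p0) / math.sqrt(n * p0 * (1 - p0))
# A large z means the bigram occurs far more often than independence
# predicts, i.e. it's a collocation candidate.
```

Note that for rare bigrams the expected count under the null is tiny (here 2), which is exactly the regime where the normal approximation gets shaky and Manning and Schuetze discuss the alternatives.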

LingPipe uses the chi-squared test for independence to generate collocations through the collocation method for token language models. You can also do this with a binomial test by building a unigram background model. Our interesting-terms method uses the binomial hypothesis test to measure two distributions against each other. If you read the code or doc carefully enough, you’ll see that I don’t take the variance of both estimates into account, which I should.
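For reference, here is a minimal sketch (not LingPipe’s code, and with invented counts) of the 2-by-2 chi-squared independence test that underlies collocation extraction, where the four cells count bigrams by whether each position matches the candidate words:

```python
# Invented 2x2 contingency table for a candidate collocation (w1, w2).
n11, n12 = 20, 30     # count(w1 w2), count(w1 followed by not-w2)
n21, n22 = 50, 900    # count(not-w1 w2), count(not-w1 not-w2)
n = n11 + n12 + n21 + n22

# Expected cell counts under independence: row total * column total / n.
table = [[n11, n12], [n21, n22]]
row_totals = [n11 + n12, n21 + n22]
col_totals = [n11 + n21, n12 + n22]

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (table[i][j] - expected) ** 2 / expected
# A large chi_sq (vs. the 1-df chi-squared distribution) rejects
# independence, flagging (w1, w2) as a collocation.
```

Each of the four summands is itself a squared normal approximation to a binomial deviation, which is the point made above: the chi-squared statistic is a sum of binomial tests.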
