add:

sxy += (x - mx) * (y - next_my);

remove:

sxy -= (x - mx) * (y - old_my);

You just need to compute the cross product sum

add:

sxy += (x - mM) * (y - nextM);

remove:

sxy -= (x - mM) * (y - mMOld);

And then `sxy / (n - 1)` is the covariance, and so on…
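Putting the add and remove updates together, a sliding-window covariance accumulator might look like the following. This is a minimal sketch assuming the Welford-style updates quoted above; the class and field names (`CovarianceWindow`, `mx`, `my`, `sxy`) are illustrative, not from any existing library.

```java
// Sliding-window covariance: add uses the old mean of x and the
// updated mean of y; remove uses the updated mean of x and the
// old mean of y, mirroring the add/remove snippets above.
public class CovarianceWindow {
    private long n = 0;
    private double mx = 0.0, my = 0.0; // running means of x and y
    private double sxy = 0.0;          // sum of cross deviations

    public void add(double x, double y) {
        n++;
        double nextMy = my + (y - my) / n;  // mean of y after this point
        sxy += (x - mx) * (y - nextMy);     // old mx, new my
        mx += (x - mx) / n;
        my = nextMy;
    }

    // Assumes (x, y) was previously added and n > 1 afterward.
    public void remove(double x, double y) {
        double oldMy = my;                  // mean of y before removal
        n--;
        mx = (mx * (n + 1) - x) / n;        // mean of x after removal
        my = (my * (n + 1) - y) / n;
        sxy -= (x - mx) * (y - oldMy);      // new mx, old my
    }

    public double covariance() {            // sample covariance
        return sxy / (n - 1);
    }
}
```

Adding a point and removing the oldest one gives the covariance over a moving window without rescanning it.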

Checking `mN == 1` is a cost you only pay once per accumulation.
The usual thing to do is to treat the word-to-count maps as vectors, with the words as the dimensions, and then use standard vector cosine to compare them. This is all implemented in LingPipe, though it has nothing to do with this post. Often there's a TF/IDF rescaling of the counts. Check out LingPipe's class `spell.TfIdfDistance`

for details and an implementation.
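The raw-count version of that comparison is easy to sketch. The following is a plain cosine over word-count maps, without the TF/IDF rescaling; the class and method names are made up for illustration and are not the `spell.TfIdfDistance` API.

```java
import java.util.Map;

// Cosine similarity between two documents represented as
// word -> count maps, treating each distinct word as a dimension.
public class WordVectorCosine {
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bCount = b.get(e.getKey());
            if (bCount != null)
                dot += e.getValue() * (double) bCount; // shared dimensions only
        }
        double normA = 0.0, normB = 0.0;
        for (int c : a.values()) normA += (double) c * c;
        for (int c : b.values()) normB += (double) c * c;
        if (normA == 0.0 || normB == 0.0)
            return 0.0; // empty document: define similarity as 0
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A TF/IDF variant would reweight each count before taking the dot product, downweighting words that appear in most documents.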

The idea is that I should be able to compare these two arrays and identify if there are enough keywords in common to deem these two articles related.

Any suggestions on stats models or formulas to use? Point me in the right direction?

Thanks!
