Disregard the above, I have

mN = mN + 1

in there twice. Silly me.

Example in python:

>>> mN = 0
>>> mM = 0
>>> mS = 0
>>>
>>> def handle(x):
...     global mN
...     global mM
...     global mS
...     mN = mN + 1
...     nextM = mM + (x - mM) / mN
...     mS = mS + (x - mM) * (x - nextM)
...     mM = nextM
...     mN = mN + 1
...     print(mM)
...
>>> handle(1)
1.0
>>> handle(1)
1.0
>>> handle(2)
1.2
>>> handle(3)
1.457142857142857
>>>
>>> realMean = (1 + 1 + 2 + 3)/4
>>> print(realMean)
1.75
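With the duplicate `mN = mN + 1` removed, the same accumulator gives the true running mean (and `mS / (mN - 1)` gives the sample variance); a minimal sketch:

```python
# Welford-style running mean and variance, with the count
# bumped only once per value (the duplicate increment removed).
mN = 0    # number of values seen
mM = 0.0  # running mean
mS = 0.0  # running sum of squared deviations from the mean

def handle(x):
    global mN, mM, mS
    mN = mN + 1
    nextM = mM + (x - mM) / mN
    mS = mS + (x - mM) * (x - nextM)
    mM = nextM
    print(mM)

for x in (1, 1, 2, 3):
    handle(x)
# final mean is (1 + 1 + 2 + 3) / 4 = 1.75
# sample variance is mS / (mN - 1)
```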

Although it’s C++, there’s a nice implementation of all of this in the Boost accumulators library (though I don’t know whether they give you the MLE divide-by-N estimate or the unbiased divide-by-(N-1) estimate). You can also accumulate Y^2 and use the definition of variance; it avoids all those potentially imprecise subtractions.
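That Y^2 approach can be sketched in a few lines (the variable names here are mine); it uses the identity var(Y) = E[Y^2] - (E[Y])^2:

```python
# One-pass variance from accumulated sums: keep n, sum of y,
# and sum of y^2, then apply var(Y) = E[Y^2] - (E[Y])^2.
n = 0
sum_y = 0.0
sum_y2 = 0.0

for y in (1, 1, 2, 3):
    n += 1
    sum_y += y
    sum_y2 += y * y

mean = sum_y / n
var_mle = sum_y2 / n - mean * mean    # divide-by-N (MLE) estimate
var_unbiased = var_mle * n / (n - 1)  # divide-by-(N-1) estimate
```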

Sorry, typos in my previous post:

add:

sxy += (x - mx) * (y - next_my);

remove:

sxy -= (x - mx) * (y - old_my);

You just need to compute the cross-product sum:

add:

sxy += (x - mM) * (y - nextM);

remove:

sxy -= (x - mM) * (y - mMOld);

And then sxy / (n-1) is the covariance and so on…
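A runnable sketch of that cross-product update, paired with the same running-mean recurrence as before (the pair-handling function and the sample data here are mine, for illustration):

```python
# Online covariance: alongside the running means of x and y,
# accumulate sxy += (x - old mean of x) * (y - new mean of y).
n = 0
mx = 0.0   # running mean of x
my = 0.0   # running mean of y
sxy = 0.0  # running sum of cross products

def handle_pair(x, y):
    global n, mx, my, sxy
    n += 1
    next_mx = mx + (x - mx) / n
    next_my = my + (y - my) / n
    sxy += (x - mx) * (y - next_my)
    mx = next_mx
    my = next_my

for x, y in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    handle_pair(x, y)
# sample covariance is sxy / (n - 1); here y = 2x exactly,
# so it equals twice the sample variance of x.
```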

That’s just an efficiency thing, and I’m skeptical about whether it would actually be faster. Branch mispredictions are costly, whereas the `mN == 1` test is a cost you only pay once per accumulation.

This is not the place to ask a general question! We have a mailing list and e-mail (see the LingPipe home page).

The usual thing to do is to treat the word-to-count maps as vectors with the words as the dimensions and then use standard vector cosine to compare them. This is all implemented in LingPipe, though it has nothing to do with this post. Often there’s a TF/IDF rescaling of the counts. Check out LingPipe’s class `spell.TfIdfDistance` for details and an implementation.
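A minimal sketch of that count-vector cosine, without the TF/IDF rescaling (plain Python dicts standing in for the word-to-count maps; the example documents are made up):

```python
import math

# Cosine between two word-count maps, treating each word as a
# dimension: dot product divided by the product of the norms.
def cosine(counts1, counts2):
    dot = sum(c * counts2.get(w, 0) for w, c in counts1.items())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

doc1 = {"the": 2, "cat": 1, "sat": 1}
doc2 = {"the": 1, "cat": 1, "ran": 1}
print(cosine(doc1, doc2))  # about 0.707 for these two documents
```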