To take a Gibbs sample, the previously sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the new value is added back into the estimators. What if our variable’s normal? Well, it turns out we can generalize Welford’s algorithm, which I describe in a comment to:
- LingPipe Blog: Computing Sample Mean and Variance Online in One Pass
Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method, unHandle(double), added in for deletes:
private long mN = 0L;    // number of samples handled
private double mM = 0.0; // running mean
private double mS = 0.0; // running sum of squared deviations

public void handle(double x) {
    ++mN;
    double nextM = mM + (x - mM) / mN;
    mS += (x - mM) * (x - nextM);
    mM = nextM;
}

public void unHandle(double x) {
    if (mN == 0) {
        throw new IllegalStateException();
    } else if (mN == 1) {
        mN = 0; mM = 0.0; mS = 0.0;
    } else {
        // recover the previous mean by solving mM = mMOld + (x - mMOld)/mN
        double mMOld = (mN * mM - x) / (mN - 1);
        mS -= (x - mM) * (x - mMOld);
        mM = mMOld;
        --mN;
    }
}

public double mean() {
    return mM;
}

public double varianceUnbiased() {
    return mN > 1 ? mS / (mN - 1) : 0.0;
}
Works like a charm. I’m sure someone’s done this already, but I didn’t find any references in a quick check.
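To make the Gibbs connection concrete, here’s a hypothetical sketch of the delete/resample/re-add step described above. OnlineStats is an assumed wrapper class around the accumulator, and the plain normal draw just stands in for whatever conditional the model actually defines:

// Sketch only: OnlineStats is a hypothetical wrapper around handle()/unHandle() above.
static double gibbsStep(OnlineStats stats, double[] values, int i, java.util.Random rng) {
    // delete the current value from the running sufficient statistics
    stats.unHandle(values[i]);
    // resample; this normal draw is illustrative, not a model-specific conditional
    double mu = stats.mean();
    double sd = Math.sqrt(stats.varianceUnbiased());
    values[i] = mu + sd * rng.nextGaussian();
    // add the new value back into the estimator
    stats.handle(values[i]);
    return values[i];
}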
The same technique can obviously be applied to covariance and correlation calculations.
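For instance, a bivariate accumulator can track the cross-product sum with the same add/delete symmetry. This is just a rough, untested sketch along the same lines:

private long n = 0L;
private double meanX = 0.0, meanY = 0.0;
private double sXY = 0.0; // running sum of cross-deviations

public void handle(double x, double y) {
    ++n;
    double nextMeanX = meanX + (x - meanX) / n;
    double nextMeanY = meanY + (y - meanY) / n;
    sXY += (x - meanX) * (y - nextMeanY); // old x-mean, new y-mean
    meanX = nextMeanX;
    meanY = nextMeanY;
}

public void unHandle(double x, double y) {
    if (n == 0) throw new IllegalStateException();
    if (n == 1) { n = 0; meanX = 0.0; meanY = 0.0; sXY = 0.0; return; }
    double oldMeanX = (n * meanX - x) / (n - 1);
    double oldMeanY = (n * meanY - y) / (n - 1);
    sXY -= (x - oldMeanX) * (y - meanY); // undo the forward update
    meanX = oldMeanX;
    meanY = oldMeanY;
    --n;
}

public double covarianceUnbiased() {
    return n > 1 ? sXY / (n - 1) : 0.0;
}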
This’ll be in the next release of LingPipe.
References
- Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.
- Cook, John. Accurately computing running variance. Web page: http://www.johndcook.com/blog/standard_deviation/
August 9, 2010 at 5:30 am
I’m a student at the University of Manitoba and had a question I thought you might be able to answer. I’m trying to identify an algorithm that would be best suited to help me identify the relationship between two arrays (each array generated from a separate archival story/article). Each array contains one column of keyword entries paired with a second column of frequencies with which each keyword occurs in the original text.
The idea is that I should be able to compare these two arrays and identify if there are enough keywords in common to deem these two articles related.
Any suggestions on stats models or formulas to use? Can you point me in the right direction?
Thanks!
August 9, 2010 at 12:32 pm
This is not the place to ask a general question! We have a mailing list and e-mail (see the LingPipe home page).
The usual thing to do is to treat the word-to-count maps as vectors with the words as the dimensions and then use standard vector cosine to compare them. This is all implemented in LingPipe, though it has nothing to do with this post. Often there’s a TF/IDF rescaling of the counts. Check out LingPipe’s class
spell.TfIdfDistance
for details and an implementation.
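For what it’s worth, plain vector cosine over two word-count maps is only a few lines. Here’s a rough sketch without the TF/IDF rescaling (the method names are just for illustration, not LingPipe’s API):

import java.util.Map;

static double cosine(Map<String,Integer> a, Map<String,Integer> b) {
    double dot = 0.0;
    for (Map.Entry<String,Integer> e : a.entrySet()) {
        Integer bCount = b.get(e.getKey()); // only shared words contribute
        if (bCount != null)
            dot += e.getValue() * (double) bCount;
    }
    return dot / (length(a) * length(b)); // 1.0 = same direction, 0.0 = no overlap
}

static double length(Map<String,Integer> v) {
    double sumSq = 0.0;
    for (int c : v.values())
        sumSq += (double) c * c;
    return Math.sqrt(sumSq);
}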
November 18, 2014 at 4:59 pm
Maybe I’m missing something, but according to http://www.johndcook.com/blog/standard_deviation/ you should have a case when mN == 1 in your handle() method. The part of the code I’m referring to is marked at that link with the comment // See Knuth TAOCP vol 2, 3rd edition, page 232.
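For reference, that special case, translated roughly into the variable names used in the post, looks like this:

public void handle(double x) {
    ++mN;
    if (mN == 1) {
        // first sample: the mean is x, the sum of squared deviations is 0
        mM = x;
        mS = 0.0;
    } else {
        double nextM = mM + (x - mM) / mN;
        mS += (x - mM) * (x - nextM);
        mM = nextM;
    }
}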
December 10, 2014 at 4:01 pm
That’s just an efficiency thing, but I’m skeptical whether it would be faster or not. Branch mispredictions are costly, whereas not having the mN == 1 special case is a cost you only pay once per accumulation.
August 6, 2015 at 6:43 pm
I wish I had found this post before spending time coming up with my own implementation. I ran some simulations with mine and compared it to yours. The difference is negligible for anything that I care about, but if you’re curious, check out: http://blog.manycupsofcoffee.com/2015/08/streaming-standard-deviation-and.html
August 20, 2015 at 11:48 am
You can easily extend this algorithm to compute correlation, covariance, and all of the linear regression coefficients.
You just need to compute the cross-product sum.
add:
sxy += (x - mM) * (y - nextM);
remove:
sxy -= (x - mM) * (y - mMOld);
And then sxy / (n-1) is the covariance, and so on…
August 20, 2015 at 11:50 am
Sorry, typos in my previous post:
add:
sxy += (x - mx) * (y - next_my);
remove:
sxy -= (x - mx) * (y - old_my);
August 20, 2015 at 1:21 pm
Although it’s C++, there’s a nice implementation of all of this in the Boost accumulators (though I don’t know if they give you the MLE divide-by-N estimate or the unbiased divide-by-(N-1) estimate). You can also accumulate Y^2 and use the definition of variance. It avoids all those potentially imprecise subtractions.
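That alternative is only a few lines; here’s a rough, untested sketch, with the usual caveat that the definition-based formula can lose precision when the mean is large relative to the standard deviation:

private long n = 0L;
private double sum = 0.0;   // running sum of y
private double sumSq = 0.0; // running sum of y^2

public void handle(double y)   { ++n; sum += y; sumSq += y * y; }
public void unHandle(double y) { --n; sum -= y; sumSq -= y * y; } // deletes are exact

public double varianceUnbiased() {
    if (n < 2) return 0.0;
    double mean = sum / n;
    return (sumSq - n * mean * mean) / (n - 1); // variance by definition
}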
April 27, 2017 at 3:48 am
Thanks! I tried doing this myself but ran into trouble. This works great!
March 24, 2020 at 1:16 pm
This code is incorrect and will not yield the mean of the input values.
Example in Python:
>>> mN = 0
>>> mM = 0
>>> mS = 0
>>>
>>> def handle(x):
...     global mN
...     global mM
...     global mS
...     mN = mN + 1
...     nextM = mM + (x - mM) / mN
...     mS = mS + (x - mM) * (x - nextM)
...     mM = nextM
...     mN = mN + 1
...     print(mM)
...
>>> handle(1)
1.0
>>> handle(1)
1.0
>>> handle(2)
1.2
>>> handle(3)
1.457142857142857
>>>
>>> realMean = (1 + 1 + 2 + 3)/4
>>> print(realMean)
1.75
March 24, 2020 at 1:29 pm
Disregard the above, I have
mN = mN + 1
in there twice. Silly me.