To take a Gibbs sample, the previous sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the the variable is added back into the estimators. What if our variable’s normal? Well, it turns out we can generalize Welford’s algorithm, which I describe in a comment to:
- LingPipe Blog: Computing Sample Mean and Variance Online in One Pass
Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method (unHandle(double)) added in for deletes:
private long mN = 0L;
private double mM = 0.0;
private double mS = 0.0;
public void handle(double x) {
++mN;
double nextM = mM + (x – mM) / mN;
mS += (x – mM) * (x – nextM);
mM = nextM;
}
public void unHandle(double x) {
if (mN == 0) {
throw new IllegalStateException();
} else if (mN == 1) {
mN = 0; mM = 0.0; mS = 0.0;
} else {
double mMOld = (mN * mM - x)/(mN-1);
mS -= (x - mM) * (x - mMOld);
mM = mMOld;
--mN;
}
}
public double mean() {
return mM;
}
public double varianceUnbiased() {
return mN > 1 ? mS/(mN-1) : 0.0;
}
Works like a charm. I’m sure someone’s done this already, but I didn’t find any references in a quick check.
The same technique can obviously be applied to covariance and correlation calculations.
This’ll be in the next release of LingPipe.
References
- Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.
- Cook, John. Accurately computing running variance. Web page.
August 9, 2010 at 5:30 am |
I’m a student at University of Manitiba and had a question I thought you might be able to answer. I’m trying to identify and algorith that would be best suited to help me identify relationship between two arrays (each array generated froma separate archival story/article). Each array contains 1 column of keyword entries paired to a second column of frequencies that the keyword occurs within the original text.
The idea is that I should be able to compare these two arrays and identify if there are enough keywords in common to deem these two articles related.
Any suggestions on stats models or formulas to use? Point me in the right direction?
Thanks!
August 9, 2010 at 12:32 pm |
This is not the place to ask a general question! We have a mailing list and e-mail (see the LingPipe home page).
The usual thing to do is to treat the word to count maps as vectors with the words as the dimensions and then use standard vector cosine to compare them. This is all implemented in LingPipe, though has nothing to do with this post. Often, there’s a TF/IDF rescaling of the counts. Check out LingPipe’s class
spell.TfIdfDistancefor details and an implementation.