Deleting Values in Welford’s Algorithm for Online Mean and Variance

by

To take a Gibbs sample, the previous sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the the variable is added back into the estimators. What if our variable’s normal? Well, it turns out we can generalize Welford’s algorithm, which I describe in a comment to:

Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method (unHandle(double)) added in for deletes:

private long mN = 0L;
private double mM = 0.0;
private double mS = 0.0;
public void handle(double x) {
     ++mN;
    double nextM = mM + (x – mM) / mN;
    mS += (x – mM) * (x – nextM);
    mM = nextM;
}
public void unHandle(double x) {
    if (mN == 0) {
        throw new IllegalStateException();
    } else if (mN == 1) { 
        mN = 0; mM = 0.0; mS = 0.0;
    } else {
        double mMOld = (mN * mM - x)/(mN-1);
        mS -= (x -  mM) * (x - mMOld);
        mM = mMOld;
        --mN;
    }
}
public double mean() {
    return mM;
}
public double varianceUnbiased() {
    return mN > 1 ? mS/(mN-1) : 0.0;
}

Works like a charm. I’m sure someone’s done this already, but I didn’t find any references in a quick check.

The same technique can obviously be applied to covariance and correlation calculations.

This’ll be in the next release of LingPipe.

References

  • Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.
  • Cook, John. Accurately computing running variance. Web page.

2 Responses to “Deleting Values in Welford’s Algorithm for Online Mean and Variance”

  1. Joseph Says:

    I’m a student at University of Manitiba and had a question I thought you might be able to answer. I’m trying to identify and algorith that would be best suited to help me identify relationship between two arrays (each array generated froma separate archival story/article). Each array contains 1 column of keyword entries paired to a second column of frequencies that the keyword occurs within the original text.

    The idea is that I should be able to compare these two arrays and identify if there are enough keywords in common to deem these two articles related.

    Any suggestions on stats models or formulas to use? Point me in the right direction?

    Thanks!

    • lingpipe Says:

      This is not the place to ask a general question! We have a mailing list and e-mail (see the LingPipe home page).

      The usual thing to do is to treat the word to count maps as vectors with the words as the dimensions and then use standard vector cosine to compare them. This is all implemented in LingPipe, though has nothing to do with this post. Often, there’s a TF/IDF rescaling of the counts. Check out LingPipe’s class spell.TfIdfDistance for details and an implementation.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 797 other followers