Deleting Values in Welford’s Algorithm for Online Mean and Variance


To take a Gibbs sample, the previous sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the the variable is added back into the estimators. What if our variable’s normal? Well, it turns out we can generalize Welford’s algorithm, which I describe in a comment to:

Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method (unHandle(double)) added in for deletes:

private long mN = 0L;
private double mM = 0.0;
private double mS = 0.0;
public void handle(double x) {
    double nextM = mM + (x – mM) / mN;
    mS += (x – mM) * (x – nextM);
    mM = nextM;
public void unHandle(double x) {
    if (mN == 0) {
        throw new IllegalStateException();
    } else if (mN == 1) { 
        mN = 0; mM = 0.0; mS = 0.0;
    } else {
        double mMOld = (mN * mM - x)/(mN-1);
        mS -= (x -  mM) * (x - mMOld);
        mM = mMOld;
public double mean() {
    return mM;
public double varianceUnbiased() {
    return mN > 1 ? mS/(mN-1) : 0.0;

Works like a charm. I’m sure someone’s done this already, but I didn’t find any references in a quick check.

The same technique can obviously be applied to covariance and correlation calculations.

This’ll be in the next release of LingPipe.


  • Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.
  • Cook, John. Accurately computing running variance. Web page.

8 Responses to “Deleting Values in Welford’s Algorithm for Online Mean and Variance”

  1. Joseph Says:

    I’m a student at University of Manitiba and had a question I thought you might be able to answer. I’m trying to identify and algorith that would be best suited to help me identify relationship between two arrays (each array generated froma separate archival story/article). Each array contains 1 column of keyword entries paired to a second column of frequencies that the keyword occurs within the original text.

    The idea is that I should be able to compare these two arrays and identify if there are enough keywords in common to deem these two articles related.

    Any suggestions on stats models or formulas to use? Point me in the right direction?


    • lingpipe Says:

      This is not the place to ask a general question! We have a mailing list and e-mail (see the LingPipe home page).

      The usual thing to do is to treat the word to count maps as vectors with the words as the dimensions and then use standard vector cosine to compare them. This is all implemented in LingPipe, though has nothing to do with this post. Often, there’s a TF/IDF rescaling of the counts. Check out LingPipe’s class spell.TfIdfDistance for details and an implementation.

  2. amocanu2013 Says:

    Maybe I’m missing sth, but according to you should have a case then mean == 1 in your handle() method. The part of the code I’m referring to is marked with this comment at that link // See Knuth TAOCP vol 2, 3rd edition, page 232

    • Bob Carpenter Says:

      That’s just an efficiency thing, but I’m skeptical whether it would be faster or not. Branch mispredictions are costly, whereas not having the mN == 1 is a cost you only pay once per accumulation.

  3. Steve Ash (@steve_ash) Says:

    I wish I had found this post before spending time coming up with my own implementation. I ran some simulations with mine and compared it to yours. The difference is negligible for anything that I care about, but if you’re curious, check out:

  4. Matthew Fioravante Says:

    You can easily extend this algorithm to compute correlation, covariance and all of the linear regression coefficients.

    You just need to compute the cross product sum
    sxy += (x – mM) * (y – nextM);
    sxy -= (x – mM) * (y – mMOld);

    And then sxy / (n-1) is the covariance and so on…

    • Matthew Fioravante Says:

      Sorry types in my previous post:
      sxy += (x – mx) * (y – next_my);
      sxy -= (x – mx) * (y – old_my);

      • Bob Carpenter Says:

        Although it’s C++, there’s a nice implementation of all of this in the Boost accumulators (though I don’t know if they give you the MLE divide-by-N estimate or the unbiased divide-by-(N-1) estimate. You can also accumulate Y^2 and use the definition of variance. It avoids all those potentially imprecise subtractions.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

Join 847 other followers