Update, 30 April 2009:
OK, the good method mentioned in the links below, and in the final comment of mine below, is now implemented in LingPipe. Here’s the Javadoc and code:
- Javadoc: stats.OnlineNormalEstimator
- Code: stats/OnlineNormalEstimator.java
Update, 6 April 2009:
Just check out:
- Wikipedia: Algorithms for Calculating Variance
In particular, see Welford’s algorithm, which is both online and fairly numerically stable. There’s a nice evaluation and description in:
- John D. Cook: Accurately Computing Running Variance
It shows the problem with computing variances for a set of large values that are close together.
A two-pass algorithm that first estimates the mean is even more robust; it combines Welford’s algorithm with Andrew’s suggestion in the comments about subtracting out the mean.
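Here’s one way to read that in Java (just a sketch, not LingPipe code; the method name is made up): use a first pass to get the mean, then run Welford’s update on the mean-centered values.

// Sketch of the two-pass variant: pass one estimates the mean,
// pass two runs Welford's update on the centered values.
static double varianceTwoPassUnbiased(double[] xs) {
    double sum = 0.0;
    for (double x : xs) {
        sum += x;
    }
    double center = sum / xs.length;  // first-pass mean estimate

    long n = 0;
    double m = 0.0;  // running mean of the centered values
    double s = 0.0;  // running sum of squared deviations
    for (double x : xs) {
        double z = x - center;  // centering keeps the terms small
        ++n;
        double nextM = m + (z - m) / n;
        s += (z - m) * (z - nextM);
        m = nextM;
    }
    return n > 1 ? s / (n - 1) : 0.0;  // variance is shift-invariant
}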
Original Post:
Algebra week continues here at Alias-i with this piece of advice on how to compute sample means and variances in one pass through the data. Here’s the standard textbook definition of the sample mean of an array x of length n:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

Here’s the formula for sample variance:

\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2

From the way they’re written, it looks like you need to make two passes through the array x, first to compute the mean and then again to compute the variance. But it turns out with a little algebra, it’s easy to see we only need a single pass. The trick is to keep a running total of the sum of the x values:

\sum_{i=1}^{n} x_i

and the sum of the squares of the x values:

\sum_{i=1}^{n} x_i^2
Of course, you need to keep track of n, the number of examples seen.
It’s easy to compute the mean: you just divide the running sum by the length, as in the first equation above. With the mean in hand, the variance is easy to compute, too. We just expand out the square:

\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i^2 - 2\, x_i\, \bar{x} + \bar{x}^2 \right)

and then rearrange:

\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 \;-\; 2\,\bar{x}\, \frac{1}{n} \sum_{i=1}^{n} x_i \;+\; \bar{x}^2

to derive:

\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 \;-\; \bar{x}^2

and the final formulas for mean (repeated from above) and variance in terms of the running sums:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 \;-\; \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2

The sample mean and sample variance are maximum likelihood estimates of the true mean and variance. But the sample variance is biased. The unbiased variance estimate is larger by a factor of n/(n-1):

\mathrm{var}'(x) = \frac{n}{n-1}\, \mathrm{var}(x) = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 \;-\; \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right)
In conclusion, the sum of values, sum of squares, and number of samples are collectively sufficient statistics for estimating a normal distribution. The other nice thing about this is that you can also do it online by updating the sum of values and sum of squares as each value arrives. The Wikipedia has a nice discussion of unbiased estimators covering sample mean and variance.
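In code, the whole thing is just three running totals. Here’s a minimal Java sketch (the method name is mine, just for illustration):

// Sketch of the one-pass approach: accumulate n, the sum of values,
// and the sum of squares, then plug into the formulas above.
static double[] meanAndVariance(double[] xs) {
    long n = 0;
    double sum = 0.0;    // running sum of the x values
    double sumSq = 0.0;  // running sum of the squared x values
    for (double x : xs) {
        ++n;
        sum += x;
        sumSq += x * x;
    }
    double mean = sum / n;
    double variance = sumSq / n - mean * mean;  // biased (MLE) estimate
    return new double[] { mean, variance };
}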
March 19, 2009 at 6:58 pm |
This is not completely safe, as it can suffer numerical problems. In fact, it is a recurring example in textbooks of catastrophic cancellation.
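A toy illustration of the failure mode in Java (the specific values are arbitrary): when the values are large relative to their spread, sumSq/n and mean^2 are nearly equal, and their difference loses most of its precision.

// Large values, tiny spread: the true (biased) variance is 2/3, but the
// one-pass formula subtracts two nearly equal huge numbers and can come
// out badly wrong, even negative.
double[] xs = { 1e9, 1e9 + 1, 1e9 + 2 };
double sum = 0.0, sumSq = 0.0;
for (double x : xs) {
    sum += x;
    sumSq += x * x;
}
double mean = sum / xs.length;
double onePass = sumSq / xs.length - mean * mean;  // suffers cancellation

double devSq = 0.0;  // two-pass computation for comparison
for (double x : xs) {
    devSq += (x - mean) * (x - mean);
}
double twoPass = devSq / xs.length;  // close to the true value, 2/3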
March 19, 2009 at 9:25 pm |
I’ve used this one-pass formulation without consequence, but santi’s right that it’s a danger.
http://en.wikipedia.org/wiki/Variance#Computational_formula_for_variance
March 20, 2009 at 11:57 am |
Ouch. That’s particularly worrisome because near-zero variance estimates are problematic for downstream inference in general.
The place I want(ed?) to use this is in scaling feature values to z-scores to improve the numerical stability of stochastic gradient descent:
z(x) = (x - mean) / sqrt(variance)
As variance approaches zero, the z-scores diverge. We’re just tossing out zero-variance dimensions (no way to estimate coefficients in that case anyway).
That involves walking over the corpus of vectors once, collecting the sum of x and x**2, or twice, collecting first the sum of x, then the sum of (x - mean)**2.
Don’t we get the same problem with the terms in the usual definition that uses (x-mean)**2? In fact, the only way we’d seem to get catastrophic cancellation (that is, the difference between two close numbers losing all or most of its arithmetic precision) is when these terms were all near zero.
Is there a better way to compute this? How do packages like R compute this kind of thing (probably lots of ways, but I just meant the built-in)?
More generally, is there some kind of textbook presentation of this kind of thing I can read?
March 28, 2009 at 5:21 am |
Hi, Bob. This isn’t something I’ve ever looked at in detail, but my quick thought is that instead of keeping running totals of x and x^2, you can keep totals of z and z^2, where z is a linear transformation of x: z = (x-A)/B, and A and B are estimates of the mean and sd of x (as computed, for example, based on the first 100 numbers in your series, maybe with some small value such as .01 added to B to deal with potential zero variance estimates when your first data points are identical).
I actually seem to recall that there are simple recursive formulas you can use for the mean and variance, but the above approach should work ok, I think.
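A rough sketch of that suggestion in Java (class and method names made up; the warm-up over the first 100 values is left out, and A and B are simply passed in):

// Sketch of the shifted/scaled accumulator: pick rough estimates A (mean)
// and B (sd) up front, then accumulate z = (x - A) / B and z^2.
class ShiftedAccumulator {
    private final double A;   // rough mean estimate
    private final double B;   // rough sd estimate, kept away from zero
    private long n = 0;
    private double sumZ = 0.0;
    private double sumZSq = 0.0;

    ShiftedAccumulator(double meanEstimate, double sdEstimate) {
        A = meanEstimate;
        B = sdEstimate + 0.01;  // guard against a zero-variance warm-up batch
    }

    void add(double x) {
        double z = (x - A) / B;
        ++n;
        sumZ += z;
        sumZSq += z * z;
    }

    double mean() {  // transform back: x = A + B * z
        return A + B * (sumZ / n);
    }

    double variance() {  // biased (MLE) estimate on x's original scale
        double meanZ = sumZ / n;
        return B * B * (sumZSq / n - meanZ * meanZ);
    }
}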
April 13, 2009 at 3:49 am |
I did a C++ implementation of running variance based on someone else’s C# implementation that uses those recursive formulas. I didn’t look at the near-zero issues, but for what I was doing at the time it was fast & accurate.
There’s also a link in the post to John Cook’s blog, which talks about the recursive formulas some more.
http://anyall.org/blog/2008/11/calculating-running-variance-in-python-and-c/
April 13, 2009 at 12:16 pm |
Thanks for the link — I’d found it tracing the Wikipedia article linked by Dan.
I just implemented a class OnlineNormalEstimator for the next version of LingPipe. Here’s the Java, which uses fewer member variables than John D. Cook’s implementation:
public class OnlineNormalEstimator {

    private long mN = 0L;    // number of values seen so far
    private double mM = 0.0; // running mean
    private double mS = 0.0; // running sum of squared deviations from the mean

    // Welford's update: move the mean toward x, then add the squared
    // deviation computed with both the old and new means.
    public void handle(double x) {
        ++mN;
        double nextM = mM + (x - mM) / mN;
        mS += (x - mM) * (x - nextM);
        mM = nextM;
    }

    public double mean() {
        return mM;
    }

    public double varianceUnbiased() {
        return mN > 1 ? mS / (mN - 1) : 0.0;
    }
}