Update, 30 April 2009:
OK, the good method mentioned in the links below, and in the final comment of mine below, is now implemented in LingPipe. Here’s the Javadoc and code:
Update, 6 April 2009:
Just check out:
- Wikipedia: Algorithms for Calculating Variance
In particular, Welford’s algorithm is both online and fairly numerically stable. There’s a nice evaluation and description in:
- John D. Cook: Accurately Computing Running Variance
It shows the problem with computing variances for a set of large values that are close together.
A two-pass algorithm that first estimates the mean is even more robust; that combines Welford’s algorithm with Andrew’s suggestion in the comments about subtracting out the mean.
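To make the update rule concrete, here’s a minimal Java sketch of Welford’s algorithm (an illustration only, not the LingPipe implementation; the class and method names are made up). It maintains a running mean and a running sum of squared deviations, so it never subtracts two large, nearly equal quantities:

```java
// Sketch of Welford's online algorithm for mean and variance.
// Each update adjusts the running mean, then accumulates the squared
// deviation using both the old and new mean, which keeps m2 non-negative.
public class WelfordEstimator {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;  // running sum of squared deviations from the mean

    public void handle(double x) {
        n++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / n;         // update the running mean
        m2 += delta * (x - mean);  // uses the updated mean
    }

    public double mean() { return mean; }

    // maximum likelihood (biased) variance: divide by n
    public double variance() { return n > 0 ? m2 / n : 0.0; }

    // unbiased variance estimate: divide by n - 1
    public double varianceUnbiased() { return n > 1 ? m2 / (n - 1) : 0.0; }

    public static void main(String[] args) {
        // large values close together: the case that breaks the naive method
        WelfordEstimator est = new WelfordEstimator();
        for (double x : new double[] { 1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16 })
            est.handle(x);
        System.out.println(est.varianceUnbiased()); // 30.0
    }
}
```

On these inputs the naive sum-of-squares formula loses most of its precision, while Welford’s update recovers the exact answer.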
Algebra week continues here at Alias-i with this piece of advice on how to compute sample means and variances in one pass through the data. Here’s the standard textbook definition of the sample mean of an array x of length n:
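Writing the entries of x as x_1 through x_n:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```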
Here’s the formula for sample variance:
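In the same notation, the (maximum likelihood) sample variance is:

```latex
\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```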
From the way they’re written, it looks like you need to make two passes through the array x, first to compute the mean and then again to compute the variance. But it turns out with a little algebra, it’s easy to see we only need a single pass. The trick is to keep a running total of the sums of the x values:
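Writing S for this running total (the symbol is just for exposition):

```latex
S = \sum_{i=1}^{n} x_i
```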
and the sum of the squares of the x values:
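Writing Q for this sum of squares (again, the symbol is just for exposition):

```latex
Q = \sum_{i=1}^{n} x_i^2
```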
Of course, you need to keep track of n, the number of examples seen.
The mean is easy to compute: just divide the sum of values by the length n, as in the first equation above. With the mean in hand, the variance is easy to compute, too. We just expand out the square:
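Expanding the squared term inside the sum:

```latex
\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
  = \frac{1}{n} \sum_{i=1}^{n} \left( x_i^2 - 2 \bar{x}\, x_i + \bar{x}^2 \right)
```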
and then rearrange:
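Distributing the sum and using the fact that the x_i sum to n times the mean:

```latex
\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2
    - 2 \bar{x} \cdot \frac{1}{n} \sum_{i=1}^{n} x_i
    + \bar{x}^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2
```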
and the final formulas for mean (repeated from above) and variance:
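Writing S for the sum of values and Q for the sum of squares:

```latex
\bar{x} = \frac{S}{n},
\qquad
\mathrm{var}(x) = \frac{Q}{n} - \left( \frac{S}{n} \right)^2,
\qquad \text{where } S = \sum_{i=1}^{n} x_i,\; Q = \sum_{i=1}^{n} x_i^2
```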
The sample mean and sample variance are maximum likelihood estimates of the true mean and variance. But the sample variance is biased. The unbiased variance estimate is larger by a factor of n/(n-1):
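In symbols:

```latex
\widehat{\sigma}^2
  = \frac{n}{n-1} \, \mathrm{var}(x)
  = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```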
In conclusion, the number of samples, the sum of values, and the sum of squares are collectively sufficient statistics for estimating a normal distribution. The other nice thing about this approach is that it works online: just update the sum of values and the sum of squares as each value arrives. Wikipedia has a nice discussion of unbiased estimators covering the sample mean and variance.
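Accumulating these sufficient statistics might be sketched in Java as follows (hypothetical class and method names, not LingPipe’s API); note, per the update at the top of the post, that this formulation can lose precision badly when the values are large relative to their spread:

```java
// Accumulates the sufficient statistics n, sum, and sum of squares;
// mean and variance are recovered in O(1) at any point.
// Caveat: Q/n - mean^2 can cancel catastrophically for large, close values.
public class MomentAccumulator {
    private long n = 0;
    private double sum = 0.0;    // running sum of values
    private double sumSq = 0.0;  // running sum of squared values

    public void handle(double x) {
        n++;
        sum += x;
        sumSq += x * x;
    }

    public double mean() { return sum / n; }

    // maximum likelihood (biased) variance: E[x^2] - (E[x])^2
    public double variance() {
        double m = mean();
        return sumSq / n - m * m;
    }

    // unbiased estimate: scale up by n / (n - 1)
    public double varianceUnbiased() {
        return variance() * n / (n - 1);
    }

    public static void main(String[] args) {
        MomentAccumulator acc = new MomentAccumulator();
        for (double x : new double[] { 1.0, 2.0, 3.0, 4.0 })
            acc.handle(x);
        System.out.println(acc.mean());     // 2.5
        System.out.println(acc.variance()); // 1.25
    }
}
```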