34

I'd like to calculate a standard deviation for a very large (but known) number of sample values, with the highest accuracy possible. There are more samples than can be efficiently stored in memory.

The basic variance formula is:

$\sigma^2 = \frac{1}{N}\sum (x - \mu)^2$

... but this formulation depends on knowing the value of $\mu$ already.

$\mu$ can be calculated cumulatively -- that is, you can calculate the mean without storing every sample value. You just have to keep a running sum (and a count of the samples seen so far).
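
For illustration, a minimal Python sketch of that idea (the function name is mine):

```python
def streaming_mean(samples):
    """Mean of a stream: only a running sum and a count are kept."""
    total = 0.0
    count = 0
    for x in samples:
        total += x
        count += 1
    return total / count
```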

But to calculate the variance, is it necessary to store every sample value? Given a stream of samples, can I accumulate the variance as I go, without keeping each sample in memory? Put another way, is there a formulation of the variance which doesn't depend on knowing the exact value of $\mu$ before the whole sample set has been seen?

user6677
  • 443
  • This very same issue was discussed on dsp.SE, and the answers were very similar, with numerically unstable methods proposed in one answer, and more stable methods described by others. – Dilip Sarwate Mar 04 '12 at 17:04
  • https://math.stackexchange.com/questions/198336/how-to-calculate-standard-deviation-with-streaming-inputs – sanmai Apr 12 '23 at 04:51

2 Answers

60

I'm a little late to the party, but it appears that the two-running-sums method (see the other answer) is pretty unstable numerically, and that there is an alternative that allows streaming computation of the variance without sacrificing stability.

Cook describes a method from Knuth, the punchline of which is to initialize $m_1 = x_1$ and $v_1 = 0$, where $m_k$ is the mean of the first $k$ values and $v_k$ tracks their sum of squared deviations from the running mean. From there,

$$ \begin{align*} m_k & = m_{k-1} + \frac{x_k - m_{k-1}}k \\ v_k & = v_{k-1} + (x_k - m_{k-1})(x_k - m_k) \end{align*} $$

The mean at this point is simply extracted as $m_k$, and the variance is the unbiased estimate $\sigma^2 = \frac{v_k}{k-1}$ (see the comments below on dividing by $k$ instead). It's easy to verify that it works for the mean, but I'm still working on grokking the variance.
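
Here is a minimal Python sketch of these recurrences (class and method names are my own):

```python
class RunningVariance:
    """Streaming mean and variance via the recurrences above."""

    def __init__(self):
        self.k = 0    # number of samples seen so far
        self.m = 0.0  # running mean, m_k
        self.v = 0.0  # running sum of squared deviations, v_k

    def push(self, x):
        self.k += 1
        delta = x - self.m              # x_k - m_{k-1}
        self.m += delta / self.k        # m_k = m_{k-1} + (x_k - m_{k-1}) / k
        self.v += delta * (x - self.m)  # v_k = v_{k-1} + (x_k - m_{k-1})(x_k - m_k)

    def mean(self):
        return self.m

    def variance(self):
        # Unbiased (sample) variance v_k / (k - 1); see the comments
        # below about dividing by k instead.
        return self.v / (self.k - 1) if self.k > 1 else 0.0
```

Note that $m_1 = x_1$ falls out automatically: the first `push` computes `m = 0 + (x - 0) / 1 = x`.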

  • 4
    +1, excellent! I didn't feel a simple upvote would be sufficient to express my appreciation of this answer, so there's an extra 50 rep bounty coming your way in 24 hours. – Ilmari Karonen Mar 04 '12 at 20:12
  • 1
    Shouldn't the final variance calculation be $v_k / k$ (not $v_k / (k-1)$)? – Eyal Roth Oct 10 '13 at 15:49
  • @errr: it has to do with using a biased vs. an unbiased estimator – alexey Mar 04 '16 at 02:03
  • 1
    $v_k = \sum_{i=1}^k (x_i-m_k)^2$, so $v_k = v_{k-1} + x_k^2 - k\,m_k^2 + (k-1)\,m_{k-1}^2$ – GuLearn May 05 '17 at 02:02
  • To test whether it's stable or not, refer to the examples in https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance. It goes through finding the variance of (4, 7, 13, 16), then of (1e8 + 4, 1e8 + 7, 1e8 + 13, 1e8 + 16), and then of (1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16). In theory they should all return the same variance, but because of the imprecision of double arithmetic the results deviate from the expected variance (a sketch reproducing this test follows these comments). – thomas.han Jun 30 '18 at 16:58
  • This gave me the exact same result (to displayed accuracy, e.g. ~7 decimals) when tested against numpy.mean and numpy.var. Very nice! – SuaveSouris Nov 27 '19 at 21:11
  • As far as I know, the algorithm is from Welford (1962). – Acorn Jul 08 '20 at 06:38
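
For reference, a small script (my own code, not from either answer) along the lines of the Wikipedia examples cited in the comments, comparing the two-running-sums formula from the answer below with the recurrence from this answer:

```python
# True population variance is 22.5 for all three inputs: only the
# offset changes, not the spread of the values.
def naive_var(xs):
    n = len(xs)
    s = sum(xs)                   # running sum of x_i
    ss = sum(x * x for x in xs)   # running sum of x_i^2
    return (ss - s * s / n) / n

def welford_var(xs):
    m = v = 0.0
    for k, x in enumerate(xs, 1):
        delta = x - m
        m += delta / k
        v += delta * (x - m)
    return v / len(xs)            # divide by k for the population variance

for shift in (0.0, 1e8, 1e9):
    xs = [x + shift for x in (4.0, 7.0, 13.0, 16.0)]
    print(shift, naive_var(xs), welford_var(xs))
```

With 64-bit floats the naive result drifts as the offset grows (at 1e9 the sum of squares is around 4e18, where adjacent doubles are hundreds apart, so the subtraction cancels catastrophically), while the recurrence stays essentially at 22.5.
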
21

You can keep two running sums: one for $\sum_i x_i$ and another for $\sum_i x_i^2$. Since the variance can be written as $$ \sigma^2 = \frac{1}{N} \left[ \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{N} \right] $$

you can compute the variance of the data you have seen thus far with just these two sums. Note that the $N$ here is not the total number of samples but only the number of samples observed so far.
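
A minimal Python sketch of this two-sum approach (the function name is mine); as the other answer and its comments note, it can lose precision when the mean is large relative to the standard deviation:

```python
def naive_variance(samples):
    """Population variance of a stream from two running sums."""
    n = 0
    s = 0.0   # accumulates sum of x_i
    ss = 0.0  # accumulates sum of x_i^2
    for x in samples:
        n += 1
        s += x
        ss += x * x
    return (ss - s * s / n) / n
```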

Dinesh
  • 3,049