Given a set of independent samples $x_{i,j}$ where $i$ ranges from 1 to $m$ and $j$ ranges from 1 to $n$, it is easy to estimate the variance of the underlying distribution using a formula like \begin{equation} \widehat{\text{Var}}(x) = \frac{1}{mn}\sum_{i,j} x_{i,j}^2 - \Big(\frac{1}{mn}\sum_{i,j} x_{i,j}\Big)^2. \end{equation} Unfortunately, I need to estimate this variance without directly measuring the $x_{i,j}$. Instead, I only have access to the set of "sub-population means" \begin{equation} y_i = \frac1n\sum_{j=1}^nx_{i,j}. \end{equation} It's easy to see that $\text{Var}(y)$ is an underestimate of $\text{Var}(x)$ using arguments based on the law of large numbers or Jensen's inequality. So how can I get an unbiased estimate of the variance of the $x$s given only the $y$s?
We can easily compute the variance of $y$ as $$\operatorname{Var}(y_i) = \operatorname{Var}(\frac{1}{n}\sum_{j=1}^n x_{ij}) = \frac{1}{n^2} \sum_{j=1}^n\operatorname{Var}(x_{ij})= \frac{1}{n}\operatorname{Var}(x_{ij}).$$ Thus if $\hat{y}$ is any unbiased estimator of $\operatorname{Var}(y)$, then $n\cdot \hat{y}$ is an unbiased estimator of $\operatorname{Var}(x)$.
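A quick Monte Carlo sketch can check this claim. The setup below is purely illustrative (normal data with true variance 4, and an arbitrary choice of $m=200$ sub-populations of size $n=4$): averaging the corrected estimator $n\cdot\hat{y}$ over many trials should recover the true variance of the $x_{i,j}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4      # illustrative sub-population layout
true_var = 4.0     # x ~ N(0, 4), so Var(x) = 4

trials = 5000
ests = []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=(m, n))
    y = x.mean(axis=1)                 # sub-population means y_i
    ests.append(n * y.var(ddof=1))     # n times the unbiased sample variance of y

# The average of the corrected estimator should be close to true_var = 4.
print(np.mean(ests))
```

The `ddof=1` argument makes NumPy use the $\frac{1}{m-1}$ normalization, which is what makes the estimator of $\operatorname{Var}(y)$ unbiased before scaling by $n$.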
-
Your final formula is intuitively basically what I think the answer should be, but I don't follow where your middle equality comes from. For the particular case of $n=1$ and $m$ large, I don't think it's correct. In this case, we have $\text{Var}(y_i) = 0$; because there's only one value, there's no variance. But then your estimator for $\text{Var}(x_{i,j})$ is also always 0, which is clearly incorrect. – Mike Izbicki Aug 26 '21 at 05:49
-
From basic probability theory we have for random variables $X$ and $Y$, that $Var(a\cdot X) = a^2\cdot Var(X)$ (for any constant $a$) and that $Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)$, where $Cov(X,Y)=0$ when $X$ and $Y$ are independent. This should explain the middle equality. – Leander Tilsted Kristensen Aug 26 '21 at 10:57
-
A single observation does not necessarily have $0$ variance. For instance a single observation from a $N(0,4)$ distribution would have variance $4$; however, there is (usually) no way to estimate the variance from a single observation. My computation holds regardless of what $m$ and $n$ are, but it requires that an unbiased estimator of $Var(y)$ exists. The usual estimator would be $$\hat{y} = \frac{1}{m-1} \sum_{i=1}^m (y_i - \frac{1}{m} \sum_{i=1}^m y_i)^2,$$ which of course fails to exist when $m=1$. – Leander Tilsted Kristensen Aug 26 '21 at 11:02
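As a small sanity check on the estimator in the comment above, the explicit $\frac{1}{m-1}$ formula can be written out and compared against NumPy's built-in sample variance (the data here are arbitrary illustrative draws):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
y = rng.normal(size=m)

# The usual unbiased estimator, written out term by term:
y_bar = y.sum() / m
var_hat = ((y - y_bar) ** 2).sum() / (m - 1)

# It matches NumPy's sample variance with ddof=1.
print(np.isclose(var_hat, y.var(ddof=1)))  # True
```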