Working with normal distributions, how large can noise be before data becomes inaccurate?

Question

I'm measuring a characteristic of a device that has a normal distribution ($0$ mean and std dev of $\sigma_M$).

There is, however, noise in the measurement process, which also has a normal distribution ($0$ mean and std dev of $\sigma_N$). I can measure this noise independently.

I can estimate the device's true characteristic (without noise) as $\sigma_D = \sqrt{\sigma_M^2 - \sigma_N^2}$. To be compliant with a specific spec, $\sigma_D$ must be less than $L$.

If the noise is small compared to the measured value, I have high confidence in my data. But my confidence drops as the noise approaches the measured value. In the extreme case, if $\sigma_M = \sigma_N$, my estimation returns $\sigma_D = 0$, indicating that I've reached the noise floor of my equipment (I think that's the correct interpretation, but let me know if not).

My question is, how close can $\sigma_N$ be to $\sigma_M$ such that $\sigma_D$ is still "accurate"?

I don't want to report a value of $\sigma_D$ that contains too much error. Rather, I'd like to report some lower bound for $\sigma_D$ once $\sigma_N$ becomes too close to $\sigma_M$. Any light you can shed to help me define that lower bound would be much appreciated.

UPDATE 1

To clarify, I measure $(1)$ the device with noise and, separately at a later time, $(2)$ the noise (without the device).
The noise distribution measured directly (without the device) can be assumed to be present, unchanged, when the device is measured, and to be the ONLY noise present in that measurement.

UPDATE 2

Is the following statistically meaningful as a condition where $\sigma_D$ is inaccurate: $$ \sigma_M - \sigma_N < \dfrac{1.96}{\sqrt{n}}(\sigma_M + \sigma_N) \;, $$ where $n$ is the number of samples used to compute the sigmas?

Are you saying you can measure what the contribution of the noise was in a given measurement of the device, or that you can only separately measure the device with noise and the noise? The former problem is very easy, the latter problem seems subtle. — Ian, Jul 01 '17 at 15:40
It's an issue of when the measurements happen. You can measure the noise, but presumably you can't measure the actual contribution of the noise at the same time as you measure the device. Instead you can measure the distribution of the noise but not how it actually contributed to your device experiments. Is that correct? Again I ask because the other interpretation is a trivial problem (since you can simply subtract off the noise measurements). — Ian, Jul 01 '17 at 15:54
In symbols, I'm asking whether you have $\{ X_i \}_{i=1}^n$ and $\{ X_i + Y_i \}_{i=1}^n$ vs. $\{ X_i \}_{i=1}^n$ and $\{ X_i + Y_i \}_{i=n+1}^{2n}$, where $X$ is the noise and $Y$ is the signal. I think you have the second one. — Ian, Jul 01 '17 at 15:59
What you can really find is actually a probability and a confidence interval. The notation in Update 2 should be changed, as you do not know the true values of the variances, but only their estimates. Please, denote estimated values with a hat, e.g. $\hat{\sigma}_D$. The bound you are looking for is the Cramer-Rao Lower Bound. If you knew the true value of $\sigma_D$, the problem would be much, much easier. But you don't and you have to take that into account. — PseudoRandom, Jul 02 '17 at 09:02
Imagine you know the true value of $\sigma_M$. Your original problem says to establish if $X < L$, where $X \sim \mathcal{N}(0, \sigma^2_M)$. But since $X$ is a random variable, we can only calculate the probability $\mathbb{P}(X < L)$. Now, if $Z$ is standard normal, $X = \sigma_M Z$, so $\mathbb{P}(X < L) = \Phi(\frac{L}{\sigma_M})$ and the problem is over. In that case, you can say "with a probability of N%, the device is compliant". But the true $\sigma_M$ is unknown, so everything gets complicated. Are you with me up to this point? — PseudoRandom, Jul 02 '17 at 18:32
I edited the original posting to clarify that the original problem is to determine with some confidence that the true value of $\sigma_D$ is < $L$. Other than that, I think I'm with you. — user46688, Jul 02 '17 at 23:48
Ok, it's a bit different now. If $\sigma_D$ was known, there would be no problem at all. For unknown $\sigma_D$, we actually want to evaluate $\mathbb{P}(\hat{\sigma}_D < L)$, where $\hat{\sigma}_D$ is an estimate of $\sigma_D$. So we only need to know the CDF of $\hat{\sigma}_D$. Now, to do that, I need you to tell me which formula you are using to estimate all the $\sigma$s, because there are several options available (some formulas are biased, others are unbiased). — PseudoRandom, Jul 03 '17 at 07:05
Thanks @PseudoRandom, I measure n samples, compute their mean, subtract this mean from each sample, then compute $\sigma_M$ and $\sigma_N$ using $\sigma=\sqrt{ \frac{1}{n} \sum_1^nx_i^2}$. — user46688, Jul 03 '17 at 14:33
So the formula actually is: $\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}$, where $\bar{x}$ is the sample mean. — PseudoRandom, Jul 05 '17 at 07:51
But you can compute only $\sigma_N$ that way, since you have noise-only data. Let $z_i = x_i + w_i$ denote signal+noise data ($w_i$ is noise, $x_i$ is device, $z_i$ measures). If you try to use that formula, you get an estimate of $\sqrt{\sigma^2_M + \sigma^2_N}$, since $\sigma^2_z = \mathrm{Var}[z_i] = \mathrm{Var}[x_i] + \mathrm{Var}[w_i] = \sigma^2_M + \sigma^2_N$. So, how do you compute $\hat{\sigma}_M$? Do you subtract $\sigma^2_N$? I suspect that what you are calling $\sigma_D$ is just $\hat{\sigma}_M$ and that you abused notation, with $\sigma_M$ meaning $\hat{\sigma}_z$. — PseudoRandom, Jul 05 '17 at 15:05
This would make sense if I read D = Device, M = Measures, N = Noise. Which would imply that the device is $X \sim \mathcal{N}(0, \sigma^2_D)$, while noise is $W \sim \mathcal{N}(0, \sigma^2_N)$ and measures, $z_i = x_i + w_i$, are actually $Z \sim \mathcal{N}(0, \sigma^2_M)$. This would also make sense since the spec limit regards the device, that's why you have $\sigma_D < L$. — PseudoRandom, Jul 05 '17 at 15:11
Thanks @PseudoRandom. I compute (perhaps incorrectly?) $\hat{\sigma_M}$ using the above std dev formula using $z_i$, because I measure $z_i$ with a population of $n$. So I just insert all those $z_i$ values to compute its std dev. The measurable quantities are $z_i$ and $w_i$, and I need to extract from them $\hat{\sigma_D}$. Could you clarify where I've gone wrong? Is it the way I've defined the problem, or how I attempted to solve it? Thanks — user46688, Jul 05 '17 at 16:46
It is just a notational problem in the question, don't worry too much about it. Check my full answer for the detail regarding the sample mean, but observe that for $n \rightarrow +\infty$, there is no difference since $\bar{x} \rightarrow \mu$, where $\mu$ is the true value of the mean, which is zero in our case. — PseudoRandom, Jul 05 '17 at 18:41

PseudoRandom · Accepted Answer · 2017-07-05T20:10:23.260

The question can be reduced to evaluating the probability $\mathbb{P}(\hat{\sigma}_D < L)$, where $L>0$ is known. For that, it is sufficient to calculate the CDF of $\hat{\sigma}_D$. Unfortunately, no simple closed form appears to exist.

Part 1. We call the noise-only measurements secondary data. They form a sequence $\{ y_1, \dots, y_n \}$ with $y_i = n_i$ for $i=1,\dots,n$, where the $n_i \sim \mathcal{N}(0, \sigma^2_N)$ are i.i.d. Gaussian random variables (RVs), i.e. noise. From standard estimation theory (see the note below), $$ \hat{\sigma}^2_N = \frac{1}{n} \sum_{i=1}^n y_i^2 $$ is the MLE (Maximum Likelihood Estimator) of $\sigma^2_N$ and, after scaling, it follows a $\chi^2$ distribution with $n$ degrees of freedom: $$ n \frac{\hat{\sigma}^2_N}{\sigma^2_N} \sim \chi^2_{n} $$

Let $z_i$ denote the primary data (signal+noise). We have $z_i = x_i + w_i$ for $i=1,\dots,n$, where $x_i \sim \mathcal{N}(0, \sigma^2_D)$ represents the device and $w_i \sim \mathcal{N}(0, \sigma^2_N)$ is noise, independent of $x_i$ and $n_i$, so that $z_i \sim \mathcal{N}(0, \sigma^2_M)$ with $\sigma^2_M = \sigma^2_D + \sigma^2_N$. Again, $$ \hat{\sigma}^2_M = \frac{1}{n} \sum_{i=1}^n z_i^2 $$ from which $$ n \frac{\hat{\sigma}^2_M}{\sigma^2_M} \sim \chi^2_n $$

Note: There is no need to subtract the sample mean $\bar{y}$, since we know that $\mathbb{E}[y_i] = 0$. If you actually use $$ \tilde{\sigma}^2_N = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 $$ this is no longer the ML estimator for the known-mean model. After scaling, it is still $\chi^2$-distributed, but with $(n-1)$ degrees of freedom: $n \tilde{\sigma}^2_N / \sigma^2_N \sim \chi^2_{n-1}$. Equivalently, $(n-1)s^2 / \sigma^2 \sim \chi^2_{n-1}$, where $s^2$ is the unbiased sample variance of a generic sample $x_1, \dots, x_n$, defined as $s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$. The precise mathematical statement follows from Cochran's theorem.
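As a quick numerical check of the scaling claim in Part 1, the following minimal sketch simulates zero-mean Gaussian noise and averages $n\hat{\sigma}^2_N/\sigma^2_N$ over many trials. A $\chi^2_n$ variable has mean $n$, so the average should land close to $n$. The function names, the seed, and the chosen values of $\sigma_N$ and $n$ are illustrative assumptions, not part of the original answer.

```python
import random

def mle_variance_zero_mean(samples):
    """Variance MLE when the mean is known to be zero: (1/n) * sum(y_i^2)."""
    return sum(y * y for y in samples) / len(samples)

def scaled_statistic_mean(sigma, n, trials, seed=0):
    """Average of n * sigma_hat^2 / sigma^2 over simulated noise-only data sets."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    total = 0.0
    for _ in range(trials):
        samples = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += n * mle_variance_zero_mean(samples) / sigma**2
    return total / trials

# Illustrative values: sigma_N = 0.5, n = 20 samples per data set.
avg = scaled_statistic_mean(sigma=0.5, n=20, trials=5000)
print(f"average of n * var_hat / var: {avg:.3f} (chi-squared mean would be 20)")
```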

Part 2. We know that $\rm Var[z_i] = Var[x_i] + Var[w_i]$, so we can compute $$ \hat{\sigma}^2_D = \hat{\sigma}^2_M - \hat{\sigma}^2_N $$ Essentially, we now need to compute the CDF of the difference between two independent $\chi^2$ RVs, which is not trivial. It is further complicated by the fact that the two estimators are scaled $\chi^2$ RVs with different scale factors ($\sigma^2_M/n$ and $\sigma^2_N/n$). We need to use the following result.
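The plug-in estimate above can be sketched in a few lines. Note that $\hat{\sigma}^2_M - \hat{\sigma}^2_N$ can come out zero or negative when the device contribution is at or below the noise floor. Returning `None` in that case is a choice made for this sketch, not something prescribed by the answer, and the function names are hypothetical.

```python
import math

def zero_mean_variance(samples):
    """(1/n) * sum(v_i^2), the variance estimate for zero-mean data."""
    return sum(v * v for v in samples) / len(samples)

def estimate_device_variance(primary, secondary):
    """var_D_hat = var_M_hat - var_N_hat (may be zero or negative)."""
    return zero_mean_variance(primary) - zero_mean_variance(secondary)

def estimate_device_sigma(primary, secondary):
    """Square root of the variance difference, or None if it is not positive."""
    var_d = estimate_device_variance(primary, secondary)
    return math.sqrt(var_d) if var_d > 0 else None

# Toy data: primary (device + noise) has variance 9, secondary (noise) has variance 4.
primary = [3.0, -3.0, 3.0, -3.0]
secondary = [2.0, -2.0, 2.0, -2.0]
print(estimate_device_sigma(primary, secondary))   # sqrt(9 - 4)
print(estimate_device_sigma(secondary, primary))   # difference is negative
```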

Lemma. Let $X,Y$ be two independent $\chi^2_n$ RVs. The PDF of $Z=X-Y$ is given by $$ f_Z(z) = \frac{1}{\sqrt{\pi}\, 2^{n}} \frac{1}{\Gamma \Big( \frac{n}{2} \Big)} |z|^{(n-1)/2} K_{\frac{n-1}{2}}\Big( \frac{|z|}{2} \Big) $$ where $K_\nu(\cdot)$ is the modified Bessel function of the second kind and $\Gamma(\cdot)$ is the Gamma function. (For $n=2$ this reduces to the Laplace density $\frac{1}{4} e^{-|z|/2}$, as expected for the difference of two exponentials with mean 2.) Proof. See here.
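A convenient sanity check for any closed-form expression of $f_Z$ is the special case $n=2$. A $\chi^2_2$ variable is exponential with mean 2, and the difference of two such independent variables follows a Laplace distribution with scale 2, with density $\frac{1}{4}e^{-|z|/2}$. The sketch below compares the Laplace CDF with a Monte Carlo estimate. The function names, seed, and trial count are assumptions made for illustration.

```python
import math
import random

def laplace_cdf(t, scale=2.0):
    """CDF of a zero-centred Laplace distribution with the given scale."""
    if t < 0:
        return 0.5 * math.exp(t / scale)
    return 1.0 - 0.5 * math.exp(-t / scale)

def empirical_cdf_of_difference(t, trials=20000, seed=0):
    """Monte Carlo estimate of P(X - Y <= t) for independent chi-squared(2) X, Y."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.expovariate(0.5)  # rate 0.5, i.e. mean 2: chi-squared with 2 dof
        y = rng.expovariate(0.5)
        if x - y <= t:
            hits += 1
    return hits / trials

print(laplace_cdf(1.0), empirical_cdf_of_difference(1.0))
```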

Denoting the PDF of $\hat{\sigma}^2_D$ with $f_Z(z)$, the CDF is given by $$ \mathbb{P}(\hat{\sigma}^2_D \leq t) = F_Z(t) = \int_{-\infty}^t f_Z(z) dz $$ Since $\hat{\sigma}_D = \sqrt{\hat{\sigma}^2_D}$, your solution is $$ \mathbb{P}(\sqrt{\hat{\sigma}^2_D} < L) = \mathbb{P}(\hat{\sigma}^2_D < L^2) = F_Z(L^2) = \int_{-\infty}^{L^2} f_Z(z) dz $$ which is the probability that the device is compliant.
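Because the integral is awkward to evaluate in closed form, the same probability can be approximated by simulation. The sketch below draws the two scaled $\chi^2_n$ estimators directly and counts how often their difference falls below $L^2$. It assumes that $\sigma_M$ and $\sigma_N$ are replaced by their estimates (a plug-in approximation). The function name, seed, and trial count are hypothetical.

```python
import random

def prob_estimate_below_limit(sigma_m, sigma_n, n, limit, trials=20000, seed=0):
    """Monte Carlo estimate of P(var_M_hat - var_N_hat < limit^2).

    Each variance estimate is simulated as sigma^2 * chi2_n / n, where
    chi2_n is drawn as a Gamma variable with shape n/2 and scale 2.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        var_m_hat = sigma_m**2 * rng.gammavariate(n / 2, 2.0) / n
        var_n_hat = sigma_n**2 * rng.gammavariate(n / 2, 2.0) / n
        if var_m_hat - var_n_hat < limit**2:
            hits += 1
    return hits / trials

# Illustrative numbers only: sigma_M = 1.0, sigma_N = 0.6, n = 50, L = 0.9.
p = prob_estimate_below_limit(sigma_m=1.0, sigma_n=0.6, n=50, limit=0.9)
print(f"P(sigma_D_hat^2 < L^2) is roughly {p:.3f}")
```

When $\sigma_M = \sigma_N$ and $L = 0$, the difference is symmetric about zero, so the estimate should be close to 0.5, which is a simple way to check the simulation.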

ADDENDUM. To answer the accuracy question, define the Signal-to-Noise Ratio (SNR) as follows $$ SNR = \frac{\sigma^2_D}{\sigma^2_N} $$ which you can compute using estimated values (use big values of $n$, since, ideally, you would like to have $n \rightarrow +\infty$). SNR is a useful measure. First, $SNR \geq 0$ always. Second, in the limit $\sigma^2_N \rightarrow +\infty$ (infinitely powerful noise), we have $SNR=0$, while $\sigma^2_D \rightarrow +\infty$ (infinitely powerful signal) implies $SNR=+\infty$. In other words, the bigger the SNR, the better.

SNR is a quantitative metric tied to the accuracy of your measurements. Sometimes, you will see a threshold-based approach to define "accuracy": if $SNR \geq \gamma$, where $\gamma>0$ is arbitrarily decided (e.g. $\gamma = 10^3$), then you label the results as "accurate", and as inaccurate otherwise. But this approach is flawed, since accuracy is treated as a binary value, which is too simplistic.

A better approach is to compute $$ \eta = 1 - \frac{1}{SNR +1} $$ Why and how does this work? For $SNR=0$ (infinitely powerful noise or zero signal), $\eta=0$. For $SNR=+\infty$ (zero noise or infinitely powerful signal), $\eta=1$. So, clearly, $\eta \in [0,1]$, with extreme values taken only under limiting conditions. If you now use $a_{[\%]} = 100\eta$, you can interpret $a_{[\%]}$ directly as accuracy itself expressed in percentage. So, for example, $\eta=0.9$ implies 90% accurate measures, while $\eta=0.1$ implies rather inaccurate measures. This gives us a quantitative measure of the accuracy of our measures, which is also simple to calculate and intuitively appealing.
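The two quantities are simple to compute from the estimated standard deviations. Note that $\eta = 1 - \frac{1}{SNR+1} = \frac{\sigma^2_D}{\sigma^2_D + \sigma^2_N} = \frac{\sigma^2_D}{\sigma^2_M}$, i.e. the fraction of the measured variance attributable to the device. The sketch below is a minimal illustration. The function names and the example values are assumptions.

```python
def snr_from_sigmas(sigma_m_hat, sigma_n_hat):
    """SNR = sigma_D^2 / sigma_N^2, with sigma_D^2 = sigma_M^2 - sigma_N^2."""
    if sigma_n_hat <= 0:
        raise ValueError("noise standard deviation must be positive")
    var_d = sigma_m_hat**2 - sigma_n_hat**2
    return var_d / sigma_n_hat**2

def eta_from_snr(snr):
    """eta = 1 - 1 / (SNR + 1), which lies in [0, 1] for SNR >= 0."""
    return 1.0 - 1.0 / (snr + 1.0)

# Illustrative values: measured sigma_M = 1.0, noise sigma_N = 0.5.
snr = snr_from_sigmas(1.0, 0.5)   # (1.0 - 0.25) / 0.25
eta = eta_from_snr(snr)
print(f"SNR = {snr:.2f}, eta = {eta:.2f}, accuracy = {100 * eta:.0f}%")
```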

Thanks @PseudoRandom. What is $f_z(z)$ (besides the PDF of $\hat{\sigma_d}^2$)? Does it relate to $\hat{\sigma_M}$ or $\hat{\sigma_N}$? Can I, for example, create a model for $f_z(z)$ empirically from the population of $z_i$ measurements, then integrate it to L to solve for the probability that $\sqrt{\hat{\sigma_d}^2}<L$? Not sure how to proceed. — user46688, Jul 05 '17 at 18:56
I edited my answer, adding an answer to the accuracy question you asked, which you may find particularly useful since it is easy to calculate. Regarding $f_Z(z)$, it is not easy to find its expression, since we don't have the "simple" case $Z=X-Y$, where $X,Y$ are independent $\chi^2_n$, but rather $Z = c_1X - c_2 Y$, with $c_1 = \sigma^2_M / n$ and $c_2 = \sigma^2_N / n$. For $c_1 = c_2 =1$, $f_Z(z)$ reduces to the PDF in the given link, which is still quite complicated, as it involves modified Bessel functions of the second kind. — PseudoRandom, Jul 05 '17 at 19:27
After you find $f_Z(z)$, you just integrate it from $-\infty$ to $L^2$. That's it. Most probably, you will need to approximate it numerically. It will give you the desired $\mathbb{P}(\hat{\sigma}_D < L)$. The parameters $\sigma^2_M$ and $\sigma^2_N$ will be inside $f_Z(z)$. Since you don't actually know the true values, just use estimates. At the end of the day, you will have a big integral, from $-\infty$ to $L^2$, of a function which depends on $n, \hat{\sigma}^2_M, \hat{\sigma}^2_N$ in complicated ways. At this point, I think SNR is better. — PseudoRandom, Jul 05 '17 at 19:41
In fact, I strongly suggest that you: (1) use SNR to evaluate accuracy (really, SNR is a very good idea!); (2) if you really want that probability, look for a paper along the lines of "Linear Combination of two independent chi-squared random variables". Or just use the $f_Z(z)$ given in the link as an approximation; it will still work. — PseudoRandom, Jul 05 '17 at 19:49
Thanks @PseudoRandom, really appreciate all your work here. The SNR is a nice way to formulate some measure of accuracy. — user46688, Jul 05 '17 at 21:10
