
Very often we would like to estimate a continuous variable Y (e.g. mean weight, mean length) in a sampling design. However, most of the literature on sampling theory seems to treat the sampled variable as discrete. The most common expression for the mean of y from a sample of size n is:

$E(Y)=\sum_{i=1}^{n} y_{i} p(y_{i})$

Now consider a slightly more complicated stratified sampling design with strata. The stratum is usually a discrete random variable (e.g. site, district). We often also estimate the mean of Y within a given stratum. This involves the conditional distribution of a continuous variable y, conditioned on a discrete variable.
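As a concrete (entirely hypothetical) instance of this stratified setting: if the stratum weights are known, the overall mean can be estimated via the law of total expectation, $E[Y] = \sum_h P(\text{stratum}=h)\, E[Y \mid \text{stratum}=h]$, plugging in per-stratum sample means. The site names, weights, and distributions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stratified design: two sites (strata) with known weights,
# and a continuous measurement Y (e.g. weight) within each stratum.
strata = {
    "site_A": {"weight": 0.6, "mean": 10.0, "sd": 2.0, "n": 5000},
    "site_B": {"weight": 0.4, "mean": 14.0, "sd": 3.0, "n": 5000},
}

# Law of total expectation: E[Y] = sum_h P(stratum=h) * E[Y | stratum=h],
# estimated by plugging in the per-stratum sample means.
estimate = 0.0
for info in strata.values():
    sample = rng.normal(info["mean"], info["sd"], info["n"])
    estimate += info["weight"] * sample.mean()

true_mean = sum(s["weight"] * s["mean"] for s in strata.values())
print(estimate, true_mean)  # estimate ≈ 11.6 = 0.6*10 + 0.4*14
```

Note that the stratum enters only through a discrete weighted sum, while Y itself stays continuous within each stratum.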

Can someone give me more insight into the following:

1) On what rationale is the Y variable treated as discrete in sampling theory? Is it because the selection probabilities of samples are discrete (and that is what we have), or is it just to reduce computational complexity?

2) Is it possible to write the estimation formula with Y as a continuous random variable?

3) If 2) is true, in what situations does the discrete formula approach the continuous formula?

My key puzzle is how to link classical estimation in sampling theory to a probability-theory-based formula when I want to emphasize that my sampling variable is continuous.

thanks!

  • Are you asking about conditional expectations of continuous variables, conditioned on discrete variables? Or just about the estimation formula for the mean in the continuous case, and why it seems discrete, when does it converge, etc...? – user3658307 Jun 02 '17 at 02:30
  • @user3658307. I am wondering why the formula with y as a continuous variable is not seen in text book, because the continuous formula is the correct formula. I guess the discrete formula only converges to the continuous one under a certain condition, and I want to know how. Thanks – tiantianchen Jun 02 '17 at 07:34

1 Answer


I think there are three distinct concepts of mean you are talking about.

(1) For a continuous random variable $Y$ supported on $\Omega$ with density $f_Y(y)$, the mean is written: $$ \mathbb{E}[Y] = \int_\Omega y f_Y(y)dy $$

(2) For a discrete random variable $X$ with pmf $p(x)$, taking values $\{x_i\}$ the mean is given by $$ \mathbb{E}[X] = \sum_i x_i\, p(x_i) $$

(3) Given a set of samples $S=\{s_i\}$, realized from a random variable $Z$, which may be either continuous or discrete, the sample mean is given by $$ \hat{\mu}(S) = \frac{1}{|S|}\sum_{i} s_i $$

Note that (3) is used regardless of whether $S$ comes from a discrete or a continuous RV, and in both cases $\hat{\mu}$ will converge to $\mathbb{E}[Z]$. The important distinction is that $\mathbb{E}[Z]$ is the true, theoretical mean of the random variable, whereas $\hat{\mu}$ is an estimate of it.


How do we know that $\hat{\mu}$ converges to $\mathbb{E}[Z]$? By the Law of Large Numbers, which says that (under mild conditions): $$ P\left( \lim_{|S|\rightarrow\infty} \hat{\mu}(S) = \mathbb{E}[Z] \right)=1 $$ In words, as the number of samples increases, the sample mean converges to the true mean with probability 1. It doesn't matter whether $Z$ is continuous or discrete. This means that (3) converges to (1) and (2) as $|S|$ increases. See also here and here.
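A quick numerical sketch of this point (with made-up distributions): the same estimator (3) is applied unchanged to samples from a continuous RV and from a discrete RV, and in both cases it lands near the theoretical mean from (1) or (2):

```python
import numpy as np

rng = np.random.default_rng(42)

# Continuous case: Z ~ Exponential(1), so E[Z] = 1 by formula (1).
continuous_samples = rng.exponential(scale=1.0, size=100_000)

# Discrete case: Z takes values {1, 2, 3} with pmf (0.5, 0.3, 0.2),
# so E[Z] = 1*0.5 + 2*0.3 + 3*0.2 = 1.7 by formula (2).
values = np.array([1, 2, 3])
pmf = np.array([0.5, 0.3, 0.2])
discrete_samples = rng.choice(values, size=100_000, p=pmf)

# The same sample-mean estimator (3) approximates both true means.
print(continuous_samples.mean())  # ≈ 1.0
print(discrete_samples.mean())    # ≈ 1.7
```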


The other question embedded in your question seems to be how (2) can converge to (1). A great discussion of that is here.

I'll give a more heuristic (less rigorous) argument, just to show that we can consider a discrete random variable $V$, taking values in $S_v=\{v_i\}$, to be a continuous random variable. Let $U$ be a continuous random variable with density: $$ f_U(u) = \sum_i P(V=v_i)\, \delta(u-v_i) $$ where $\delta$ is the Dirac delta. Notice that $f_U(u)=0$ if $u\notin S_v$, and $f_U$ places mass $P(V=v_i)$ at each $v_i$. Further, $$ \int_{-\infty}^\infty f_U(u)\,du = \int_{-\infty}^\infty \sum_i P(V=v_i)\,\delta(u-v_i)\,du = \sum_i P(V=v_i) = 1 $$ as we would expect. Then, for the mean, we get: \begin{align} \mathbb{E}[U] &= \int_{-\infty}^\infty uf_U(u)\, du \\ &= \int_{-\infty}^\infty u \sum_i P(V=v_i)\,\delta(u-v_i)\,du \\ &= \sum_i \int_{-\infty}^\infty u\, P(V=v_i)\,\delta(u-v_i)\,du \\ &= \sum_i v_i\, P(V=v_i) \\ &= \mathbb{E}[V] \end{align} Notice that we have used the definition of $\mathbb{E}[U]$ for a continuous RV and the definition of $\mathbb{E}[V]$ for a discrete RV.

So if you simply redefine any discrete RV as a continuous RV as above, the definitions of means coincide.
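This also suggests a numerical answer to question 3): if each Dirac delta is smoothed into a narrow Gaussian bump of width $\varepsilon$, the continuous-formula mean $\int u f_U(u)\,du$ approaches the discrete sum $\sum_i v_i P(V=v_i)$ as $\varepsilon \to 0$. A sketch with made-up values $v_i$ and probabilities $p_i$:

```python
import numpy as np

# Discrete RV V with (hypothetical) values v_i and pmf p_i.
v = np.array([1.0, 2.0, 4.0])
p = np.array([0.2, 0.5, 0.3])
discrete_mean = float(np.sum(v * p))  # E[V] = 0.2 + 1.0 + 1.2 = 2.4

# "Continuous" density f_U: replace each Dirac delta by a narrow
# Gaussian bump of width eps; eps -> 0 recovers the delta construction.
eps = 0.01
u = np.linspace(-2.0, 8.0, 200_001)
f_U = sum(p_i * np.exp(-(u - v_i) ** 2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))
          for v_i, p_i in zip(v, p))

# Continuous-style mean: trapezoidal approximation of ∫ u f_U(u) du.
integrand = u * f_U
continuous_mean = float(np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(u)))

print(discrete_mean, continuous_mean)  # both ≈ 2.4
```

The integral is computed with a hand-rolled trapezoidal rule to stay self-contained; the two means agree to well within the discretization error.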

user3658307
  • Thanks very much @user3658307. Your answer disentangles my puzzle, I think I confused the pdf/pmf in (1)- (2) with the probability of selecting sample s_i in the population (rather than S) in sampling theory. I will digest your answer. :-) – tiantianchen Jun 02 '17 at 20:30
  • @tiantianchen Glad to help :-) Yes, probability (and especially statistics) has the unfortunate tendency to conflate many distinct concepts with very similar (or even ambiguous) notation, and I think we as humans (or at least I) tend to confuse certain concepts, especially when it comes to continuous RVs (where the density is no longer a probability) and how sample statistics relate to their theoretical quantities. :P – user3658307 Jun 02 '17 at 21:44