
To begin with, the standard law of iterated expectations is as follows.

$$ \mathbb E X = \mathbb E [\mathbb E(X|Y)]. \tag{1} $$

I am perfectly happy with $(1)$, and there is also some quite good discussion of the intuition here. However, the extension of this property is more troublesome to me. It states that

$$ \mathbb E (X|Y) = \mathbb E [\mathbb E(X|Y, Z)|Y]. \tag{2} $$

I found a proof here which basically just restated the definition. I still do not get it mathematically or intuitively.

  • Could anyone provide a concrete or intuitive example to explain how $(2)$ works, please? How is $(2)$ an extension of $(1)$? Is there a more straightforward way to prove it? Thank you! (A rough numerical check of $(2)$ is sketched right after this list.)
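
To make the question concrete, here is the sort of numerical check I have in mind; the setup is just an arbitrary toy example of mine: $Y$ and $Z$ are independent fair dice, $W$ is an independent fair coin, and $X = YZ + W$, so that $\mathbb E(X|Y,Z) = YZ + \tfrac12$ exactly and both sides of $(2)$ should come out near $3.5\,y + 0.5$ on each event $\{Y=y\}$.

```python
# Monte Carlo sketch of (2): E(X|Y) = E[E(X|Y,Z)|Y].
# Toy setup: Y, Z independent fair dice, W an independent fair coin, X = Y*Z + W,
# so E(X|Y,Z) = Y*Z + 0.5 exactly and both sides of (2) should be near 3.5*y + 0.5.
import random

random.seed(0)
n = 300_000
direct = {y: [] for y in range(1, 7)}  # samples of X on each slice {Y = y}
inner = {y: [] for y in range(1, 7)}   # samples of E(X|Y,Z) on each slice {Y = y}

for _ in range(n):
    y = random.randint(1, 6)
    z = random.randint(1, 6)
    w = random.randint(0, 1)
    direct[y].append(y * z + w)    # a draw of X, used to estimate E(X|Y=y)
    inner[y].append(y * z + 0.5)   # exact E(X|Y,Z) for this draw; its average
                                   # over {Y = y} estimates E[E(X|Y,Z)|Y=y]

for y in range(1, 7):
    lhs = sum(direct[y]) / len(direct[y])
    rhs = sum(inner[y]) / len(inner[y])
    print(f"y={y}: E(X|Y=y) ~ {lhs:.3f}, E[E(X|Y,Z)|Y=y] ~ {rhs:.3f}, exact {3.5 * y + 0.5}")
```

Both estimates agree with the exact value up to Monte Carlo noise, so at least numerically $(2)$ checks out in this toy case, but I still do not see why it holds in general.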

In addition, by letting $X$ equal $X|Y$ and $Y$ equal $Z$ in $(1)$, one has

$$ \mathbb E(X|Y) = \mathbb E\{\mathbb E[(X|Y)|Z]\}. \tag{3} $$

  • What does $(3)$ mean exactly, please? Is $(3)$ the same as $(2)$? Thank you!
LaTeXFan
  • This has been asked on the site before. The answer is that the object $X|Y$ mentioned in "by letting $X$ equal $X|Y$ in $(1)$" does not exist. Hence (1) and (2) use standard notations while (3) is undefined (but (2) should be corrected to $\mathbb E[X|Y] = \mathbb E [\mathbb E(X|Y, Z)|Y]$). – Did Aug 18 '14 at 08:37
  • @Did Could you explain why $X|Y$ does not exist in more detail, please? Thank you! – LaTeXFan Aug 18 '14 at 08:46
  • How would you define it? Where did you find it mentioned (preferably with a definition)? – Did Aug 18 '14 at 08:52
  • @Did Well, in Bayesian statistics we frequently use something like $(X|\mu) \sim N(\mu, 1)$ and $\mu \sim N(0, \infty)$. Both $X$ and $\mu$ are random variables here. – LaTeXFan Aug 18 '14 at 08:57
  • The first statement refers to the fact that the conditional distribution of $X$ conditionally on $\mu=m$ is $N(m,1)$, for every real number $m$; the second to the fact that the distribution of $\mu$ is $N(0,1)$ (rather than $N(0,\infty)$, I guess). Indeed both $X$ and $\mu$ are random variables here, but nowhere does one refer to some (nonexistent) random variable $X|\mu$. – Did Aug 18 '14 at 09:02
  • @Did I suppose so. By the way, $N(0, \infty)$ is something called non-informative prior in Bayesian context. – LaTeXFan Aug 18 '14 at 09:15
  • Then you have still another beast to tame, which is that μ itself cannot be defined as a random variable since there is no such thing as a uniform distribution on the whole real line... :-) (Note that I am well aware that these problems can be, and are routinely, circumvented in Bayesian statistics.) – Did Aug 18 '14 at 09:17

1 Answer


You should view $E[X|Y]$ as $E[X|\sigma_{Y}]$ where $\sigma_{Y}$ is the $\sigma$-algebra generated by the random variable $Y$. It stands to reason that $\sigma_{Y}$ is a sub-$\sigma$-algebra of the $\sigma$-algebra generated by $Y$ and $Z$. Loosely speaking, the information that you get from knowing $Y$ alone should be a "subset" of the information that you get from knowing $Y$ and $Z$ together.
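
For a concrete toy case (the particular variables here are just for illustration): take two independent coin tosses $Y, Z \in \{0,1\}$, so $\Omega = \{0,1\}^2$. Then

$$\sigma_{Y} = \big\{\emptyset,\ \{Y=0\},\ \{Y=1\},\ \Omega\big\} \subseteq \sigma_{Y,Z},$$

where $\sigma_{Y,Z}$ is the full power set of the four outcomes ($2^4 = 16$ sets): knowing $Y$ alone only tells you which half of $\Omega$ occurred, while knowing $(Y,Z)$ pins down the exact outcome.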

The iterated conditioning law says that if $F$ is a sub-$\sigma$-algebra of $G$, then $$ E[E[X|G]|F] = E[X|F] $$
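
As a quick sanity check (again, the specific choices are only illustrative): take three independent fair coin tosses $Y, Z, W \in \{0,1\}$, let $X = Y + Z + W$, and set $G = \sigma(Y,Z)$, $F = \sigma(Y)$. Computing both sides directly,

$$E[X\mid G] = Y + Z + \tfrac12, \qquad E\big[E[X\mid G]\,\big|\,F\big] = Y + \tfrac12 + \tfrac12 = Y + 1 = E[X\mid F].$$

Conditioning on the finer $G$ first and then on the coarser $F$ lands in the same place as conditioning on $F$ directly.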

You can take it from here on I think.

Calculon
  • "It stands to reason that the $\sigma$-algebra generated by $Y$ and $Z$ is a sub-$\sigma$-algebra of $\sigma_{Y}$." Do you mean the intersection of $\sigma(Y)$ and $\sigma(Z)$ is a sub-$\sigma$-algebra of $\sigma (Y)$. – LaTeXFan Aug 18 '14 at 08:51
  • @20824 I swapped the order in my statement by mistake. It should be the other way around. Sorry for the confusion. – Calculon Aug 18 '14 at 08:57
  • Thanks for clarification. Is the following conjecture right, please? The larger $\sigma$-algebra we condition on, the less specific information we get. And the smaller $\sigma$-algebra we condition on, the more specific information we get. Hence, if we condition on a larger set first and then condition on its subset, then it is the same as conditioning on the subset in the first place since the subset provides actually more relevant information. – LaTeXFan Aug 18 '14 at 09:10
  • If by larger you mean "coarser", i.e. containing less fine sets, then yes. – Calculon Aug 18 '14 at 09:17
  • ...Which would be the opposite of "larger" in the usual sense of set-inclusion. – Did Aug 18 '14 at 09:20
  • @L'universo No, that is not what I meant. Now I am more confused. "$\sigma_Y$ is a sub-$\sigma$-algebra of the $\sigma$-algebra generated by $Y$ and $Z$." Then in the context of $(2)$, I suppose $G=\sigma_{Y, Z}$ and $F=\sigma_Y$. It is clear that $\sigma_{Y, Z}$ is finer, right? Then $(2)$ basically says that the expectation conditioning on $\sigma_{Y, Z}$, which is finer, equals the expectation conditioning on $\sigma_Y$, which is the coarser set. Isn't this a paradox, please? – LaTeXFan Aug 18 '14 at 09:23
  • @Did Then which of the two is coarser: $\sigma_Y$ and $\sigma_{Y, Z}$? – LaTeXFan Aug 18 '14 at 09:34
  • @20824 you got it right but I don't see the paradox here. Could you explain why you think this is a paradox? – Calculon Aug 18 '14 at 09:34
  • @L'universo On the right hand side of $(2)$, we take expectation first conditioning on $\sigma_{Y, Z}$ which is finer. Then take expectation conditioning on $\sigma_Y$ which is coarser. This is like moving backwards. That is, from having more information to having less information. Therefore, the overall effect is having less information, i.e., equal to the left hand side. Now to see the paradox, imagine I am making decisions based on some information $\sigma_Y$. But you give me extra information $\sigma_{Y,Z}$. The problem is how do I forget them after knowing them? – LaTeXFan Aug 18 '14 at 09:46
  • @L'universo In addition, I am wondering whether $(2)$ is still true when I replace all $\mathbb E$ by $\mathbb P$, please? Or is there any equivalent result for probability, please? I suppose so. Thank you! – LaTeXFan Aug 18 '14 at 09:53
  • "Then which of the two is coarser" For every random variables $Y$ and $Z$, $\sigma(Y)\subseteq\sigma(Y,Z)$. – Did Aug 18 '14 at 10:01
  • @Did Thanks. Just check. – LaTeXFan Aug 18 '14 at 10:07
  • @20824 If you replace $E$ by $P$ in (2) you get a statement like $P(P(..))$, which makes no sense to me. – Calculon Aug 18 '14 at 10:50
  • @20824 it is good to have an intuitive understanding of these concepts but you shouldn't carry that intuition too far and let the mathematics take over at a certain point I think. There is no "forgetting" information in iterated conditioning. It is about what source of information you base your expectation on. – Calculon Aug 18 '14 at 10:57
  • @L'universo Thanks. I think I finally had some idea about this property. – LaTeXFan Aug 18 '14 at 12:08
  • A version of (2) for events reads $P(A|Y) = E [P(A|Y, Z)|Y]$, for every event $A$ and all random variables $(Y,Z)$. – Did Aug 18 '14 at 13:40
  • @Did Nice result. Would its proof follow from using an indicator function $$I_A = \left\{ \begin{array}{lr} 1 & : \omega \in A \\ 0 & : \omega \notin A \end{array} \right.$$? – Calculon Aug 18 '14 at 13:50
  • ?? Yes this is the special case $X=\mathbf 1_A$. – Did Aug 18 '14 at 15:51
  • @Did Thanks. I just wanted to make sure (I am new to measure theoretic probability). – Calculon Aug 18 '14 at 15:53