
It would be extremely helpful if anyone could give me the formal definition of conditional probability and conditional expectation in the following setting: a probability space $ (\Omega, \mathscr{A}, \mu ) $ with $\mu(\Omega) = 1 $, and random variables $ X, Y : \Omega \rightarrow \mathbb{R}^n $, where for any Borel set $ A \in \mathscr{B}(\mathbb{R}^n) $ we define $$ \mathbb{P}(X \in A) = (X_*\mu)(A) = \mu(X^{-1}(A))= \mu(\{\omega\in \Omega \mid X(\omega) \in A\})\quad \text{and}\quad \mathbb{E}(X) = \int_\Omega X\,d\mu. $$ Regardless of whether $X, Y$ are discrete or continuous (with densities $f_X, f_Y $ and joint density $f_{X,Y} $ w.r.t. some measure $\nu$ on $\mathbb{R}^n $), I am asking for the definition of $ \mathbb{P}(Y\in B\mid X \in A) $ and $ \mathbb{E}(Y\mid X) $ for all Borel sets $ A, B \in \mathscr{B}(\mathbb{R}^n) $, keeping in mind that $ \mathbb{P}(X \in A) $ may well be zero.

In our probability class something of the following sort was mentioned, where $\delta_x$ is the Dirac distribution at $ x $: $$ \mathbb{E}(Y\mid X = x) = \frac{\mathbb{E}(\delta_x(X)Y)}{\mathbb{P}(X=x)}, $$ of which I can't make any sense. Any appropriate reference for these is also very much welcome.

Thank you.

smiley06
  • You can find a formal definition for conditional expectation (with some nice discussion regarding what it represents) here and (with its main properties) here. Maybe you can review these and answer your own question? – jkn Sep 17 '13 at 20:24
  • $E(Y|X=x)$ that you mention denotes the conditional expectation of the random variable $Y$ with respect to the sigma algebra $\mathcal{G}$ generated by the sets of the form $\{\omega\in\Omega:X(\omega)=x\}$. It turns out that for sigma algebras generated by disjoint sets $A_i$, $E(Y|\mathcal{G})$ evaluated on $A_i$ is equal to $$\frac{E(1_{A_i}Y)}{P(A_i)},$$ where $1_A$ denotes the indicator function of $A$. See page 220 of this. – jkn Sep 17 '13 at 20:39
  • I can't remember a definition for the probability of an event $\{Y\in B\}$ conditioned on a second event $\{X\in A\}$ apart from the standard undergrad definition, valid if $P(\{X\in A\})\neq 0$: $$P(\{Y\in B\}|\{X\in A\}):=\frac{P(\{Y\in B\}\cap \{X\in A\})}{P(\{X\in A\})}.$$ I forgot to add that the sets $A_i$ above must each have positive probability for the equality to hold. – jkn Sep 17 '13 at 20:47
  • Sorry I should have said page 191 - confusion with editions ... – jkn Sep 17 '13 at 20:54

1 Answer


Throughout this post, let $(\Omega,\mathcal{F},P)$ be a probability space, and let us first define the conditional expectation ${\rm E}[X\mid\mathcal{G}]$ for integrable random variables $X:\Omega\to\mathbb{R}$, i.e. $X\in L^1(P)$, and sub-sigma-algebras $\mathcal{G}\subseteq\mathcal{F}$.

Definition: The conditional expectation ${\rm E}[X\mid\mathcal{G}]$ of $X$ given $\mathcal{G}$ is a random variable $Z$ with the following properties:

(i) $Z$ is integrable, i.e. $Z\in L^1(P)$.

(ii) $Z$ is $(\mathcal{G},\mathcal{B}(\mathbb{R}))$-measurable.

(iii) For any $A\in\mathcal{G}$ we have $$ \int_A Z\,\mathrm dP=\int_A X\,\mathrm dP. $$

Note: It makes sense to talk about the conditional expectation since if $U$ is another random variable satisfying (i)-(iii) then $U=Z$ $P$-a.s.

Definition: If $X\in L^1(P)$ and $Y:\Omega\to\mathbb{R}$ is any random variable, then the conditional expectation of $X$ given $Y$ is defined as $$ {\rm E}[X\mid Y]:={\rm E}[X\mid\sigma(Y)], $$ where $\sigma(Y)=\{Y^{-1}(B)\mid B\in\mathcal{B}(\mathbb{R})\}$ is the sigma-algebra generated by $Y$.
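To see how this abstract definition relates to the formula from your class, consider the special case of a discrete conditioning variable (this is a standard sanity check, not needed for what follows). Suppose $Y$ takes only the values $y_1,y_2,\dots$ with $P(Y=y_i)>0$, and set $A_i=\{Y=y_i\}$. Then $\sigma(Y)$ consists of unions of the atoms $A_i$, and one verifies (i)-(iii) directly for $$ {\rm E}[X\mid Y]=\sum_i \frac{{\rm E}[1_{A_i}X]}{P(A_i)}\,1_{A_i}\quad P\text{-a.s.}, $$ i.e. on the event $\{Y=y_i\}$ the conditional expectation equals the constant ${\rm E}[1_{\{Y=y_i\}}X]/P(Y=y_i)$. If one reads $\delta_x(X)$ as the indicator $1_{\{X=x\}}$, this is exactly the expression mentioned in your class (with the roles of $X$ and $Y$ interchanged), and it only makes sense when the conditioning event has positive probability.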

I'm not aware of any other definition of $P(Y\in B\mid X\in A)$ than the obvious one, i.e. $$ P(Y\in B\mid X\in A)=\frac{P(Y\in B,X\in A)}{P(X\in A)}, $$ provided that $P(X\in A)>0$. The only exception is when $A$ contains a single point, i.e. $A=\{x\}$ for some $x\in\mathbb{R}$. In this case, the object $P(Y\in B\mid X=x)$ is defined in terms of a regular conditional distribution.

Let us first define regular conditional probabilities. Let $X:\Omega\to\mathbb{R}$ be a random variable.

Definition: A regular conditional probability for $P$ given $X$ is a function $$ \mathcal{F}\times \mathbb{R} \ni(A,x)\mapsto P^X(A\mid x) $$ satisfying the following three conditions:

(i) The mapping $A\mapsto P^X(A\mid x)$ is a probability measure on $(\Omega,\mathcal{F})$ for all $x\in \mathbb{R}$.

(ii) The mapping $x\mapsto P^X(A\mid x)$ is $(\mathcal{B}(\mathbb{R}),\mathcal{B}(\mathbb{R}))$-measurable for all $A\in\mathcal{F}$.

(iii) The defining equation holds: For any $A\in\mathcal{F}$ and $B\in\mathcal{B}(\mathbb{R})$ we have $$ \int_B P^X(A\mid x)\,P_X(\mathrm dx)=P(A\cap\{X\in B\}). $$
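One way to read (iii) (this is just a rephrasing, not an extra assumption): if $P(X\in B)>0$, dividing the defining equation by $P(X\in B)$ gives $$ P(A\mid X\in B)=\frac{P(A\cap\{X\in B\})}{P(X\in B)}=\frac{1}{P_X(B)}\int_B P^X(A\mid x)\,P_X(\mathrm dx), $$ so the elementary conditional probability of $A$ given $\{X\in B\}$ is the $P_X$-average of $x\mapsto P^X(A\mid x)$ over $B$. The regular conditional probability refines this by making sense of the integrand at a single point $x$, even though $\{X=x\}$ may have probability zero.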

Note: A mapping satisfying (i) and (ii) is often called a Markov kernel. Furthermore, since $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ is a nice (standard Borel) space, the regular conditional probability is unique in the sense that if $\tilde{P}^X(\cdot\mid\cdot)$ is another regular conditional probability of $P$ given $X$, then we have that $P^X(\cdot\mid x)=\tilde{P}^X(\cdot\mid x)$ for $P_X$-a.a. $x$. Here $P_X=P\circ X^{-1}$ is the distribution of $X$.

Connection: Let $P^X(\cdot\mid\cdot)$ be a regular conditional probability of $P$ given $X$. Then for any $A\in\mathcal{F}$ we have $$ {\rm E}[1_A\mid X]=\varphi(X), $$ where $\varphi(x)=P^X(A\mid x)$. In short we write ${\rm E}[1_A\mid X]=P^X(A\mid X)$.
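A sketch of why this connection holds, using only the definitions above: $\varphi$ is Borel measurable by (ii), so $\varphi(X)$ is $\sigma(X)$-measurable, and it is bounded by $1$, hence integrable. For any $G\in\sigma(X)$, write $G=\{X\in B\}$ with $B\in\mathcal{B}(\mathbb{R})$; then the change-of-variables formula and the defining equation give $$ \int_G \varphi(X)\,\mathrm dP=\int_B \varphi(x)\,P_X(\mathrm dx)=\int_B P^X(A\mid x)\,P_X(\mathrm dx)=P(A\cap\{X\in B\})=\int_G 1_A\,\mathrm dP, $$ which is exactly property (iii) in the definition of ${\rm E}[1_A\mid X]$.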

Now let us introduce another random variable $Y:\Omega\to\mathbb{R}$, and $P^X(\cdot\mid \cdot)$ still denotes a regular conditional probability of $P$ given $X$.

Definition: For $B\in\mathcal{B}(\mathbb{R})$ and $x\in\mathbb{R}$ we define the regular conditional distribution of $Y$ given $X$ by $$ P_{Y\mid X}(B\mid x):=P^X(Y\in B\mid x). $$

Instead of $P_{Y\mid X}(B\mid x)$ one often writes $P(Y\in B\mid X=x)$.

An easy consequence of this definition is that $(B,x)\mapsto P_{Y\mid X}(B\mid x)$ is a Markov kernel and for any $A,B\in\mathcal{B}(\mathbb{R})$ we have $$ \int_A P_{Y\mid X}(B\mid x)\,P_X(\mathrm dx)=P(\{X\in A\}\cap\{Y\in B\}). \tag{1} $$

In fact, $P_{Y\mid X}(\cdot \mid \cdot)$ is a regular conditional distribution of $Y$ given $X$ if and only if $P_{Y\mid X}(\cdot\mid\cdot)$ is a Markov kernel and satisfies $(1)$. Again $(1)$ is often referred to as the defining equation.
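In the absolutely continuous setting of the question, a concrete candidate is easy to write down (the formula below is the standard one; the choice on the null set $\{f_X=0\}$ is an arbitrary convention). Suppose $(X,Y)$ has joint density $f_{X,Y}$ with respect to Lebesgue measure on $\mathbb{R}^2$, and let $f_X(x)=\int f_{X,Y}(x,y)\,\mathrm dy$. Define $$ P_{Y\mid X}(B\mid x)=\int_B \frac{f_{X,Y}(x,y)}{f_X(x)}\,\mathrm dy\quad\text{if }f_X(x)>0, $$ and let $P_{Y\mid X}(\cdot\mid x)$ be any fixed probability measure when $f_X(x)=0$. This is a Markov kernel, and it satisfies the defining equation $(1)$ by Tonelli's theorem: $$ \int_A P_{Y\mid X}(B\mid x)\,P_X(\mathrm dx)=\int_A\Big(\int_B \frac{f_{X,Y}(x,y)}{f_X(x)}\,\mathrm dy\Big)f_X(x)\,\mathrm dx=\int_A\int_B f_{X,Y}(x,y)\,\mathrm dy\,\mathrm dx=P(X\in A,Y\in B), $$ since the set where $f_X$ vanishes contributes nothing to the outer integral. So in this case $P(Y\in B\mid X=x)=\int_B f_{Y\mid X}(y\mid x)\,\mathrm dy$ with $f_{Y\mid X}(y\mid x)=f_{X,Y}(x,y)/f_X(x)$.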

Definition: Let $P^X(\cdot\mid\cdot)$ be a regular conditional probability of $P$ given $X$. Furthermore, let $U:\Omega\to\mathbb{R}$ be another random variable that is assumed bounded (to ensure the following expectations exist). Then we define the (regular) conditional mean of $U$ given $X=x$ by $$ {\rm E}[U\mid X=x]:=\int_\Omega U(\omega)\, P^X(\mathrm d\omega\mid x). $$

Let us denote $\psi(x)={\rm E}[U\mid X=x]$. Then we have the following:

Connection: The mapping $\mathbb{R}\ni x\mapsto \psi(x)$ is $(\mathcal{B}(\mathbb{R}),\mathcal{B}(\mathbb{R}))$-measurable, and $$ {\rm E}[U\mid X]=\psi(X). $$
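Combining this with the density example above: if $(X,U)$ has joint density $f_{X,U}$, then, since $P_{U\mid X}(\cdot\mid x)$ is the image of $P^X(\cdot\mid x)$ under $U$, the change-of-variables formula gives, for $P_X$-a.a. $x$ with $f_X(x)>0$, $$ {\rm E}[U\mid X=x]=\int_\Omega U(\omega)\,P^X(\mathrm d\omega\mid x)=\int_{\mathbb{R}} u\,P_{U\mid X}(\mathrm du\mid x)=\int_{\mathbb{R}} u\,\frac{f_{X,U}(x,u)}{f_X(x)}\,\mathrm du, $$ which is how the conditional mean is usually computed in practice (tacitly assuming enough integrability, e.g. $U$ bounded as in the definition above).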

The following is an extremely useful rule when calculating with conditional distributions:

Rule: Let $X$ and $Y$ be as above, and let $\xi:\mathbb{R}^2\to\mathbb{R}$ be $(\mathcal{B}(\mathbb{R}^2),\mathcal{B}(\mathbb{R}))$-measurable. Then $$ P(\xi(X,Y)\in D\mid X=x)=P(\xi(x,Y)\in D\mid X=x),\quad D\in\mathcal{B}(\mathbb{R}), $$ holds for $P_X$-a.a. $x$. This is saying that "conditional on $X=x$ we may replace $X$ by $x$".

The following example shows how this rule can be useful: Let $X$ and $Y$ be independent $\mathcal{N}(0,1)$ random variables, and let $U=X+Y$. Then we claim that $U\mid X=x\sim \mathcal{N}(x,1)$ for $P_X$-a.a. $x$. To see this, note that by the rule above, the distributions of $U\mid X=x$ and of $Y+x\mid X=x$ are the same. But since $Y$ is independent of $X$ we have that $Y+x\mid X=x$ is distributed as $Y+x$. We can write it as follows: $$ U\mid X=x\sim Y+x\mid X=x\sim Y+x\sim\mathcal{N}(x,1). $$
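For completeness, the claim in this example can also be checked directly against the defining equation $(1)$, without invoking the substitution rule. Writing $\varphi$ for the standard normal density, independence and Tonelli's theorem give, for $A,D\in\mathcal{B}(\mathbb{R})$, $$ P(X\in A,\,U\in D)=\int_A\Big(\int 1_D(x+y)\,\varphi(y)\,\mathrm dy\Big)\varphi(x)\,\mathrm dx=\int_A \mathcal{N}(x,1)(D)\,P_X(\mathrm dx), $$ so the Markov kernel $(D,x)\mapsto\mathcal{N}(x,1)(D)$ satisfies $(1)$ and is therefore (a version of) the regular conditional distribution of $U$ given $X$.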

Stefan Hansen
  • Thank you for the answer, my incentive in asking for $ P(Y \in B \mid X \in A) $ was to understand conditioning on measure-0 sets, e.g. for i.i.d. $X_1,X_2\sim \mathrm{Exp}(\lambda)$, finding $ P(X_1+X_2 \in A\mid X_1 = X_2) $ gives two different answers, by conditioning on $ \{X_1-X_2= 0\} $ and $\{X_1/X_2 = 1\} $ respectively.... – smiley06 Sep 19 '13 at 08:26
  • Could you update your question with the derivation of these two answers? – Stefan Hansen Sep 19 '13 at 08:46
  • Routine calculations involving a change of variables in the two cases above give $ P(X_1 +X_2 \in A\mid X_1 = X_2) $ as $ \int_A\lambda e^{-\lambda y}\,dy $ for $ P(X_1 + X_2 \in A \mid X_1-X_2 = 0) $ and $ \int_A\lambda^2 ye^{-\lambda y}\,dy $ for $ P(X_1 + X_2 \in A \mid X_1/X_2 = 1)$ respectively. – smiley06 Sep 23 '13 at 13:58
  • Could you please show these routine calculations? – Stefan Hansen Sep 23 '13 at 14:10
  • amazing! Most complete explanation I have found on understanding conditional probabilities. Thanks. – tlamadon Apr 07 '14 at 16:46
  • @StefanHansen Do you know or have a reference to a proof for the replacement rule? I couldn't prove it myself. – simonzack May 06 '15 at 04:49
  • @simonzack: Take a look at this. – Stefan Hansen May 06 '15 at 09:31
  • @smiley06 I've asked the same question myself. Basically, if $P(A)=0$, $P(B|A)$ can be meaningfully defined as a standalone quantity only if $B\cap A=\emptyset$ or $B\cap A^c=\emptyset$ (with values of $P(B|A)$ respectively $0$ and $1$) - otherwise different transformations/limiting procedures would yield different values. Similarly, $E(Y|A)$ makes sense only if $Y$ is constant on $A$ - otherwise you can obtain any $y\in[\inf_A Y,\sup_A Y]$ as a value. And to finish, $Y|A$ can be any distribution on range of $Y$ restricted to $A$. I just posted a similar question: http://tiny.cc/935x6x – A.S. Dec 01 '15 at 09:31
  • @A.S. exactly so......that's why I wanted to know if there is a unique definition in such a situation. – smiley06 Dec 02 '15 at 10:11
  • @smiley There isn't. "The concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." IIRC, there are developments of probability that somehow allow conditioning on sets of measure $0$. – A.S. Dec 02 '15 at 10:33
  • @StefanHansen: A question concerning the definition of r.c.p.: All the other people do it like this: $\nu: \Omega \times \mathcal{B}(\mathbb{R}) \to \mathbb{R}$ and demand that $\nu(\cdot, B) = E[1_{X \in B}|\mathcal{F}]$ i.e. we should suspect that $\nu(\omega, B) = P^X(X^{-1}(B), X(\omega))$ but this does not work: when integrating over an arbitrary set $A$ then $1_A$ is not factorizable... Could you elaborate? Are these two different definitions of r.c.p.? – Fabian Werner Dec 17 '15 at 15:20
  • @FabianWerner: I don't follow your construction. First, why do you condition on $\mathcal{F}$? Since $X$ is a random variable, $\mathrm{E}[\mathbf{1}_{X\in B}\mid \mathcal{F}]=\mathbf{1}_{X\in B}$, no? Second, how do you make sense of $P^X(X^{-1}(B),X(\omega))$ since $P^X$ is a measure on $\mathcal{B}(\mathbb{R})$ and does not take two arguments? – Stefan Hansen Dec 18 '15 at 06:13
  • @StefanHansen: sorry, $\mathcal{F}$ was an unclever special case. Generally, for example, in http://www.math.duke.edu/~rtd/PTE/PTE4_1.pdf 5.1.3 on p. 197 they define the r.c.p. w.r.t. any sub-sigma-algebra $\mathcal{G}$ as a map $\mu : \Omega \times \mathcal{B}(\mathbb{R}) \to [0,1]$, i.e. going from the target sigma algebra times the original space to the reals while in your case, $P^X$ goes from the original sigma algebra times the target space... see the twist [original <-> target]? Does your definition coincide somehow with 5.1.3 in the reference? – Fabian Werner Dec 18 '15 at 07:07
  • @FabianWerner: Sorry, I misread your $P^X$ as $P_X$, so nevermind that previous comment of mine. This is a good question though. So, in this post we do not define r.c.p's with respect to a general sub-sigma algebra $\mathcal{G}$ but only with respect to $\mathcal{G}$'s of the form $\mathcal{G}=\sigma(X)$. The r.c.p. can then be defined as $P^X(A\mid x):=\varphi_A(x)$, where $\varphi_A$ is the function satisfying $\mathrm{E}[\mathbf{1}_A\mid X]=\varphi_A(X)$ (Connection 1). I believe the relationship is $\mu(\omega,A)=\varphi(X(\omega))=P^X(A\mid X(\omega))$. – Stefan Hansen Dec 18 '15 at 07:40
  • So I actually believe that Durrett's definition is more general but is equivalent to this definition when $\mathcal{G}=\sigma(X)$ (Note that I use $X$ to condition on, whereas he uses it differently). – Stefan Hansen Dec 18 '15 at 07:41
  • @StefanHansen Careful: There is an additional twist: $\mu$ does not accept sets from $\mathcal{F}$ but rather from the target sigma algebra $\mathcal{B}(\mathbb{R})$. That's why I inserted that $X^{-1}(...)$ in the twist. Hmm... But if we can easily construct their $\mu$ from your $P^X$ then I have the feeling that your defn. should be more general (although as said, I can only show that $\mu = P^X(X^{-1}, X)$ if $\mathcal{G} = \sigma(X)$)... weird... – Fabian Werner Dec 18 '15 at 07:51
  • The thing is that the case $\mathcal{G} = \sigma(X)$ is boring as one uses this often with $\mathcal{G} = \sigma(Y)$ from another random variable $Y$. Anyhow, thanks for the answer. – Fabian Werner Dec 18 '15 at 07:52
  • @FabianWerner: But r.c.p.'s in Durrett is defined by letting $(S,\mathcal{S})=(\Omega,\mathcal{F})$ and $X:\Omega\to\Omega$ be the identity in which case $\mu$ does accept sets from $\mathcal{F}$. You seem to have misunderstood me. For r.c.p.'s there is only one random variable (the one we condition on). Just replace $\sigma(X)$ with $\sigma(Y)$ in my previous comment (I just used $X$ to condition on since this is what I did in my initial post, although Durrett uses $X$ differently). – Stefan Hansen Dec 18 '15 at 07:58
  • Yes, sorry, whenever I wrote r.c.p. I actually meant the more general r.c.d. (without fixing $\mathcal{G}$). Now it becomes at least clear to me that $\mu$ (if defined on $\sigma(X)$) and $P^X$ if only evaluated on sets $A \in \sigma(X)$ coincide in the sense that $\mu = P^X(X^{-1}, X)$, and for $P^X$ on $(A, x)$: if $A = X^{-1}(B) = X^{-1}(B')$ then $X \in B == X \in B'$ and thus $E[1_{X \in B}|\sigma(X)] = E[1_{X \in B'}|\sigma(X)]$ and factorizing this into $f_A \circ X$ yields $P^X = f_A(x)$. I.e. they coincide 'on' $\sigma(X)$. – Fabian Werner Dec 18 '15 at 11:33
  • @FabianWerner: So whenever $\mathcal{G}$ is of the form $\sigma(Y)$ for any random variable $Y$ and if $\varphi_A$ is the function satisfying $\varphi_A(Y)=\mathrm{E}[\mathbf{1}_{X\in A}\mid Y]$, then $\mu(\omega,A)=\varphi_A(Y(\omega))$ whereas $P_{X\mid Y}(A\mid y)=\varphi_A(y)$. So the two ways of defining r.c.d's are essentially the same, however Durrett's definition allows for conditioning on an arbitrary sigma-algebra, whereas this definition only allows conditioning on a sigma-algebra generated by another random variable. – Stefan Hansen Dec 22 '15 at 07:19
  • @StefanHansen Thanks for such a great answer. May I ask you from which source or textbook you got these definitions? – le4m Jan 31 '17 at 10:54
  • @julypraise: These are from lecture notes used at my local university at the time I studied. I think they are pretty much consistent with the notation used in the series Probability With a View Towards Statistics by Hoffmann-Jørgensen. – Stefan Hansen Jan 31 '17 at 10:57
  • @Stefan Hansen Hi, this is a very good answer. Could you give me a reference for the things you wrote here, so that I can learn more about it? Thanks. – Hana May 29 '20 at 22:02
  • @Hana: See my comment just above yours. – Stefan Hansen Jun 01 '20 at 17:45