How is the Bayes rule for density functions formulated in probability theory?

Question

Given a probability space $\left( \Omega, \mathcal{F}, \mathbb{P} \right)$ and two $\mathcal{F}$-measurable real-valued random variables $X,Y$, the joint random variable $\left( X,Y \right)$ can be defined on a product space $\left( \Omega^{2},\sigma\left( \mathcal{F}^{2} \right),\mathbb{P} \times \mathbb{P} \right)$, where $\mathbb{P} \times \mathbb{P}$ is the product measure of $\mathbb{P}$ with itself. Let $f\left( x,y \right),f_{X}\left( x \right),f_{Y}\left( y \right)$ be the density functions (Radon-Nikodym derivatives) of $\left( X,Y \right),X,Y$ respectively, and let $f_{X|Y}\left( x|y \right)$ be the density function of $X$ conditioned on $Y$.

Can anyone help with a construction, a proof, or related materials about the Bayes rule $f_{X|Y}\left( x|y \right) = \frac{f\left( x,y \right)}{f_{Y}\left( y \right)}$? We may instead consider the other version $f_{X|Y}\left( x|y \right) = \frac{f_{Y|X}\left( y|x \right)f_{X}\left( x \right)}{f_{Y}\left( y \right)}$, which does not involve the joint random variable. I do not understand how this Bayes rule is formulated in measure theory. It is a widely used formula, yet I cannot find any construction or proof in my probability books.

I can find a related definition of "conditional density" along the following lines. There could be other definitions.

We denote integration w.r.t. the distribution $\mathbb{P} \circ X^{-1}$ of a RV $X$ by $\int_{B} dX := \int_{B} d\left( \mathbb{P} \circ X^{-1} \right)$ for simplicity. Define the conditional probability measures $\mathbb{P}_{y}, y \in Y\left( \Omega \right)$, as a family of probability measures on $\left( \Omega, \mathcal{F} \right)$ s.t. two axioms hold: 1) $y \mapsto \mathbb{P}_{y}\left( A \right)$ is $\left( \mathbb{R}, \mathcal{B}\left( \mathbb{R} \right) \right)$-measurable for any $A \in \mathcal{F}$ (given a fixed $A \in \mathcal{F}$, $\mathbb{P}_{y}\left( A \right)$ is an $\mathbb{R} \rightarrow \left\lbrack 0,1 \right\rbrack$ function of the index $y$); and 2) the general version of the law of total probability holds:

$$\int_{B} \mathbb{P}_{y}\left( A \right) dY = \mathbb{P}\left( A \cap Y^{-1}\left( B \right) \right), \quad \forall A \in \mathcal{F},\ B \in \mathcal{B}\left( \mathbb{R} \right).$$

We then denote $\mathbb{P}\left( A|Y = y \right) = \mathbb{P}_{y}\left( A \right),\forall A \in \mathcal{F}$, as the conditional probability measure given the event $Y = y$. Then for any RV $X$, the conditional probability density function $f_{X|Y}\left( x|y \right)$ is the Radon-Nikodym derivative of the distribution $\mathbb{P}_{y} \circ X^{-1}$ with respect to Lebesgue measure.
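To make the two axioms concrete, here is a minimal sketch on a finite sample space, where the integral against $\mathbb{P} \circ Y^{-1}$ reduces to a finite sum. The two-dice setup, the choice of $Y$ as the sum of the dice, and all function names are assumptions made for illustration only; they are not part of the definition quoted above.

```python
from fractions import Fraction

# Hypothetical finite sample space: two fair dice with the uniform measure P.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}

def Y(w):
    # Illustrative choice: Y is the sum of the two dice.
    return w[0] + w[1]

def prob(event):
    # P(event) for a collection of sample points.
    return sum(P[w] for w in event)

def cond_measure(y, A):
    # P_y(A) = P(A ∩ {Y = y}) / P(Y = y); only used when P(Y = y) > 0.
    level = [w for w in omega if Y(w) == y]
    return prob([w for w in level if w in A]) / prob(level)

def lhs(A, B):
    # Integral of P_y(A) over B w.r.t. the law of Y (a finite sum here).
    total = Fraction(0)
    for y in B:
        p_y = prob([w for w in omega if Y(w) == y])
        if p_y > 0:
            total += cond_measure(y, A) * p_y
    return total

def rhs(A, B):
    # P(A ∩ Y^{-1}(B)).
    return prob([w for w in A if Y(w) in B])

A = {w for w in omega if w[0] == 6}  # first die shows 6
B = {10, 11, 12}                     # sum is at least 10
print(lhs(A, B), rhs(A, B))          # both sides agree: 1/12 1/12
```

In the finite case the family $\mathbb{P}_{y}$ is just elementary conditioning on $\{Y = y\}$; the measure-theoretic definition is what replaces it when $\mathbb{P}(Y = y) = 0$ for every $y$.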

Based on the above definition, I list all the relations I can think of:

$$\int_{B} \mathbb{P}_{y}\left( A \right) dY = \mathbb{P}\left( A \cap Y^{-1}\left( B \right) \right), \quad \forall A \in \mathcal{F},\ B \in \mathcal{B}\left( \mathbb{R} \right)$$

$$\int_{B} \mathbb{P}_{x}\left( A \right) dX = \mathbb{P}\left( A \cap X^{-1}\left( B \right) \right), \quad \forall A \in \mathcal{F},\ B \in \mathcal{B}\left( \mathbb{R} \right)$$

$$\int_{B} f_{X|Y}\left( x|y \right) dx = \mathbb{P}_{y}\left\{ X^{-1}\left( B \right) \right\}, \quad \forall B \in \mathcal{B}\left( \mathbb{R} \right)$$

$$\int_{B} f_{Y|X}\left( y|x \right) dy = \mathbb{P}_{x}\left\{ Y^{-1}\left( B \right) \right\}, \quad \forall B \in \mathcal{B}\left( \mathbb{R} \right)$$

$$\int_{B} f_{Y}\left( y \right) dy = \mathbb{P}\left\{ Y^{-1}\left( B \right) \right\}, \quad \forall B \in \mathcal{B}\left( \mathbb{R} \right)$$

$$\int_{B} f_{X}\left( x \right) dx = \mathbb{P}\left\{ X^{-1}\left( B \right) \right\}, \quad \forall B \in \mathcal{B}\left( \mathbb{R} \right)$$

Out of curiosity, where is this definition of conditional probability density function quoted from? — littleO, Jul 23 '18 at 02:46
@littleO It is from lecture notes. I am not aware of its original source. This answer seems to use the same definition. https://math.stackexchange.com/questions/496608/formal-definition-of-conditional-probability — Tony, Jul 23 '18 at 04:29
I would be very glad to know if you have any other definition of the conditional density function. Maybe some others are friendlier to the Bayes rule. — Tony, Jul 23 '18 at 04:30
See this, strictly related: "A Measure Theoretic formulation of Bayes' Theorem" https://math.stackexchange.com/q/3503320/532409 and also https://math.stackexchange.com/q/496608/532409 — Quillo, Mar 12 '24 at 11:35

Bob · Accepted Answer · 2018-07-23T05:29:52.557

Here is an outline of how you can get to the result you're looking for.

Given a sub-$\sigma$-algebra $\mathcal{G}$ of $\mathcal{F}$ and given $F\in\mathcal{F}$, define $$\mathbb{P}(F|\mathcal{G}) : \left(\Omega,\mathcal{G}\right) \rightarrow \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$$ as the (essentially) unique $\left(\Omega,\mathcal{G}\right) - \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$ -measurable function such that $$\forall G\in \mathcal{G}, \mathbb{P}(F\cap G)=\int_G \mathbb{P}(F|\mathcal{G})\operatorname{d}\mathbb{P},$$ via Radon-Nikodym theorem.
If $\mathcal{G}$ is a sub-$\sigma$-algebra of $\mathcal{F}$ and $X : \left(\Omega,\mathcal{F}\right) \rightarrow \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$ and $A\in\mathcal{B}_{\mathbb{R}}$, define $$\mathbb{P}_{X|\mathcal{G}}(A):=\mathbb{P}(\{X\in A\} | \mathcal{G}).$$
If $Y : \left(\Omega,\mathcal{F}\right) \rightarrow \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$ and $F\in\mathcal{F}$, the map $\mathbb{P}(F|\sigma(Y))$ is $\left(\Omega,\sigma(Y)\right)-\left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$-measurable, and so there exists a measurable $\varphi :(\mathbb{R},\mathcal{B}_\mathbb{R})\rightarrow (\mathbb{R},\mathcal{B}_\mathbb{R})$ such that $\varphi \circ Y = \mathbb{P}(F|\sigma(Y))$. Notice that if $\psi$ is another map that does the same job, then $$\varphi=\psi$$ $\mathbb{P}_Y$-a.e., where $\mathbb{P}_Y := \mathbb{P}\circ Y^{-1}$ is the distribution of $Y$. So, define $\mathbb{P}(F|Y):=\varphi$.
If $X,Y : \left(\Omega,\mathcal{F}\right) \rightarrow \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$ and $A\in\mathcal{B}_{\mathbb{R}}$, define $\mathbb{P}_{X|Y}(A):=\mathbb{P}(\{X\in A\}|Y)$. Then we have $$\mathbb{P}_{X|Y}(A)\circ Y=\mathbb{P}_{X|\sigma(Y)}(A).$$ If $y\in\mathbb{R}$, let's denote $\mathbb{P}_{X|Y}(A)(y)$ with the less clumsy notation $\mathbb{P}_{X|Y=y}(A)$.
If $X,Y : \left(\Omega,\mathcal{F}\right) \rightarrow \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$ and $A,B\in\mathcal{B}_{\mathbb{R}}$, then: $$\mathbb{P}(\{X\in A\}\cap \{Y\in B\})= \int_{Y^{-1}(B)}\mathbb{P}\left(\{X\in A\}|\sigma(Y)\right) \operatorname{d}\mathbb{P} = \int_{Y^{-1}(B)}\mathbb{P}_{X|{\sigma(Y)}}(A) \operatorname{d}\mathbb{P} \\ = \int_{Y^{-1}(B)}\mathbb{P}_{X|Y}(A)\circ Y \operatorname{d}\mathbb{P} = \int_B\mathbb{P}_{X|Y}(A) \operatorname{d}\mathbb{P}_Y = \int_B\mathbb{P}_{X|Y=y}(A) \operatorname{d}\mathbb{P}_Y(y).$$
If $X,Y : \left(\Omega,\mathcal{F}\right) \rightarrow \left(\mathbb{R},\mathcal{B}_{\mathbb{R}}\right)$ and $\mathbb{P}_{(X,Y)}$ has a density w.r.t. the Lebesgue measure on $\mathbb{R}^2$, say $f_{(X,Y)}$, then $Y$ also has one with respect to the Lebesgue measure on $\mathbb{R}$, namely $f_Y(y)=\int_{\mathbb{R}} f_{(X,Y)}(x,y)\operatorname{d}x$, and: $$\forall A\in\mathcal{B}_\mathbb{R}, \ \text{for} \ \mathbb{P}_Y\text{-a.e.} \ y\in\mathbb{R}, \ \mathbb{P}_{X|Y=y}(A)=\int_A \frac{f_{(X,Y)}(x,y)}{f_{Y}(y)}\operatorname {d}x.$$ The ratio is well defined for $\mathbb{P}_Y$-a.e. $y$, because $\mathbb{P}_Y(\{f_Y=0\})=\int_{\{f_Y=0\}} f_Y(y)\operatorname{d}y=0$. In order to prove the claim, fix $A\in\mathcal{B}_{\mathbb{R}}$, and notice that, using Tonelli's theorem for the iterated integral, $$\forall B\in\mathcal{B}_\mathbb{R}, \int_B \left( \int_A \frac{f_{(X,Y)}(x,y)}{f_{Y}(y)}\operatorname {d}x \right)\operatorname{d}\mathbb{P}_Y(y) = \int_B \left( \int_A \frac{f_{(X,Y)}(x,y)}{f_{Y}(y)}\operatorname {d}x \right) f_{Y}(y) \operatorname{d}y \\ = \int_B \left( \int_A f_{(X,Y)}(x,y)\operatorname {d}x \right) \operatorname{d}y = \int_{A\times B} f_{(X,Y)}(x,y)\operatorname{d}x\operatorname{d}y = \mathbb{P}_{(X,Y)}(A\times B) = \mathbb{P}(\{X\in A\}\cap \{Y\in B\})= \int_B\mathbb{P}_{X|Y=y}(A) \operatorname{d}\mathbb{P}_Y(y)$$ and so $$\forall B\in\mathcal{B}_\mathbb{R}, \int_B \left( \int_A \frac{f_{(X,Y)}(x,y)}{f_{Y}(y)}\operatorname {d}x-\mathbb{P}_{X|Y=y}(A) \right)\operatorname{d}\mathbb{P}_Y(y)=0,$$ and then $$\int_A \frac{f_{(X,Y)}(x,y)}{f_{Y}(y)}\operatorname {d}x-\mathbb{P}_{X|Y=y}(A)=0$$ for $\mathbb{P}_Y$-a.e. $y\in\mathbb{R}$.
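As a numerical sanity check of the density formula, the sketch below uses a standard bivariate normal pair with correlation $\rho$, for which the conditional law of $X$ given $Y=y$ is known in closed form to be $N(\rho y, 1-\rho^2)$. The choice of distribution, the value of `RHO`, the evaluation point, and the function names are all assumptions made for illustration. It also checks the second form of the rule from the question, which follows by applying the same result with the roles of $X$ and $Y$ swapped, so that $f_{(X,Y)}(x,y)=f_{Y|X}(y|x)f_X(x)$.

```python
import math

RHO = 0.6  # hypothetical correlation, chosen only for this illustration

def normal_pdf(x, mean=0.0, var=1.0):
    # Density of N(mean, var).
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def joint_pdf(x, y, rho=RHO):
    # Standard bivariate normal density with correlation rho.
    q = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(1 - rho * rho))

def marginal_y(y, lo=-12.0, hi=12.0, n=4000):
    # f_Y(y) = integral of f(x, y) dx, via composite Simpson's rule (n even).
    h = (hi - lo) / n
    s = joint_pdf(lo, y) + joint_pdf(hi, y)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * joint_pdf(lo + k * h, y)
    return s * h / 3

def cond_x_given_y(x, y):
    # First form: f_{X|Y}(x|y) = f(x, y) / f_Y(y).
    return joint_pdf(x, y) / marginal_y(y)

x0, y0 = 0.3, -1.1
first = cond_x_given_y(x0, y0)
# Second form: f_{Y|X}(y|x) f_X(x) / f_Y(y), with Y | X = x ~ N(rho x, 1 - rho^2).
second = normal_pdf(y0, RHO * x0, 1 - RHO ** 2) * normal_pdf(x0) / normal_pdf(y0)
# Known closed form of the conditional density of X given Y = y0.
closed = normal_pdf(x0, RHO * y0, 1 - RHO ** 2)
print(first, second, closed)  # the three values should agree up to quadrature error
```

The check only evaluates the formula at one point for one distribution; it illustrates the statement and does not replace the a.e. argument above.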

Thanks a lot! Could I ask why the existence and a.s. uniqueness of the third step holds? — Tony, Jul 25 '18 at 04:26
About the existence, you can find a proof in the book Probability with Martingales by David Williams (the only lemma in the appendix to chapter 3). It is a beautiful result, because it basically states that what you intuitively expect is true: if a random variable is known once you know all the information in $Y$, then this random variable is a function of $Y$. It is a factorization theorem; however, I think of such a result as a "Radon-Nikodym theorem for information valued measures". — Bob, Jul 25 '18 at 04:56
About the uniqueness, suppose $\varphi, \psi$ satisfy that relation. Then $\varphi\circ Y$ and $\psi\circ Y$ are both versions of $\mathbb{P}(F|\sigma(Y))$, so they differ at most on a set of $\mathbb{P}$-measure zero. Suppose, to get a contradiction, that $\varphi$ and $\psi$ differ on a set of non-null $\mathbb{P}_Y$-measure, say $A$. Then $0\neq\mathbb{P}_Y(A)= \mathbb{P}(Y^{-1}(A))$ and $\forall\omega\in Y^{-1}(A), \varphi\circ Y (\omega) = \varphi( Y (\omega))\neq\psi( Y (\omega)) = \psi\circ Y (\omega)$, and so they differ on a set of positive $\mathbb{P}$-measure, which is absurd. — Bob, Jul 25 '18 at 04:57

BruceET · Answer 2 · 2018-07-23T02:44:31.663

Here is a very elementary practical problem about election polling that illustrates how to get a posterior probability interval (credible interval) from a prior and data.

Prior. Expert's prior on proportion $\theta$ in favor of Candidate A is $\mathsf{Beta}(\alpha_0=330, \beta_0=270),$ which has $P(0.51 < \theta < 0.59) \approx 0.95.$ Mean, median, mode 0.55. (Expert thinks candidate will win, but not by much.)

Data. $x = 620$ of $n = 1000$ prospective voters polled favor the candidate.

Bayes' Theorem gives Posterior. $$ p(\theta | x) \propto p(\theta) \times p(x|\theta) \propto \theta^{\alpha_0-1}(1-\theta)^{\beta_0 - 1} \times \theta^x(1-\theta)^{n-x} \\ = \theta^{\alpha_0 + x -1}(1-\theta)^{\beta_0 + n - x -1} = \theta^{\alpha_n - 1}(1-\theta)^{\beta_n -1}.$$ Notice that normalizing constants are omitted, hence the use of $\propto$ ('proportional to') instead of $=.$

Because prior and likelihood are 'conjugate' (mathematically compatible), we can notice that the posterior has the kernel of $\mathsf{Beta}(\alpha_n=\alpha_0 + x, \beta_n = \beta_0 + n - x),$ so we can identify the exact posterior distribution $\mathsf{Beta}(\alpha_n = 950, \beta_n = 650)$ without having to evaluate the integral in the denominator of the right-hand side of Bayes' Theorem.

Posterior probability interval. One way to get a 95% credible interval is to take quantiles 0.025 and 0.975 of $\mathsf{Beta}(950, 650)$ to obtain $(0.570, 0.618),$ using R statistical software.

qbeta(c(.025, .975), 950, 650)
[1] 0.5695848 0.6176932
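The conjugate update and the interval reported above can be cross-checked without a statistics package. The sketch below is a standard-library-only illustration, not the method used in this answer: it recomputes the posterior parameters and integrates the $\mathsf{Beta}(950, 650)$ density between the two quantiles printed by R, using Simpson's rule. The helper names `beta_pdf` and `beta_prob` are hypothetical.

```python
import math

def beta_pdf(t, a, b):
    # Beta(a, b) density, computed in log space to avoid overflow for large a, b.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

def beta_prob(lo, hi, a, b, n=2000):
    # P(lo < theta < hi) under Beta(a, b), via composite Simpson's rule (n even).
    h = (hi - lo) / n
    s = beta_pdf(lo, a, b) + beta_pdf(hi, a, b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * beta_pdf(lo + k * h, a, b)
    return s * h / 3

# Prior parameters and polling data taken from the example above.
alpha0, beta0 = 330, 270
x, n_polled = 620, 1000

# Conjugate update: add successes to alpha and failures to beta.
alpha_n = alpha0 + x
beta_n = beta0 + n_polled - x
posterior_mean = alpha_n / (alpha_n + beta_n)

# Posterior mass between the two quantiles printed by R; should be close to 0.95.
coverage = beta_prob(0.5695848, 0.6176932, alpha_n, beta_n)
print(alpha_n, beta_n, posterior_mean, coverage)
```

The posterior mean $950/1600 = 0.59375$ sits between the prior mean $0.55$ and the sample proportion $0.62$, which is the usual behavior of a conjugate Beta update.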
