
I am learning about the Bayes optimal classifier, and there is a step in a proof I struggle with. One can find this proof also on the Wikipedia page: https://en.wikipedia.org/wiki/Bayes_classifier#Proof_of_Optimality

The question arises already in part a). Let me give some definitions first:

Let $(X,Y)$ be a pair of random variables with values in $\mathbb{R}^d \times \{0,1\}$.

For $x$ in the support of $X$, let $\eta(x) = \mathbb{P}(Y=1|X=x)$, and let $h: \mathbb{R}^d \to \{0,1\}$ be a classifier. We further define the risk of a classifier as $R(h) := \mathbb{P}(h(X) \neq Y)$. Let me now state the proof (the same as on Wikipedia) and the part where I am struggling. For any classifier $h$ we have:

$$R(h) = P(h(X) \neq Y) = \mathbb{E}_{XY}[\Bbb{1}_{h(X)\neq Y}] = \Bbb{E}_X\Bbb{E}_{Y|X}[\Bbb{1}_{h(X)\neq Y}|X=x],$$

where the last equality is the law of iterated expectations. Continuing, we get

$$\begin{align} \Bbb{E}_X\Bbb{E}_{Y|X}[\Bbb{1}_{h(X)\neq Y}|X=x] &= \Bbb{E}_X[\Bbb{P}(Y\neq h(X)|X=x)] \\ &= \Bbb{E}_X[\Bbb{1}_{h(X) = 0}\cdot\Bbb{P}(Y=1|X=x) + \Bbb{1}_{h(X) = 1}\cdot\Bbb{P}(Y=0|X=x)] \\ &= \Bbb{E}_X[\Bbb{1}_{h(X) = 0}\cdot \eta(x) + \Bbb{1}_{h(X) = 1}\cdot(1-\eta(x))] \end{align}$$

Now the last line is pretty much what is written in the proof on Wikipedia (and also in the proof I have seen in class), except that the argument of the function $\eta$ is not $x$ but $X$; in other words, it is not a point $x$ in the support of $X$ but a random variable. Now I wonder how this exchange is justified. Since I have not yet taken a class that deals with conditional expectations, there might be a somewhat straightforward justification for this that I'm not aware of. It's also possible that I have made a mistake in the above computations.

I found a very thorough explanation of things that look similar in a thread here (provided by @Stefan Hansen):

https://math.stackexchange.com/a/498338/874549,

but to me this is very advanced, so it's hard to say if that is actually what I'm looking for.

If anyone sees a mistake, or has a somewhat elementary explanation, it would be very much appreciated!

noam.szyfer

2 Answers


As demonstrated by the Borel–Kolmogorov paradox, the term "$\mathsf P(Y=1\mid X=x)$" cannot be defined via the event $\{X=x\}$ whenever $\mathsf P(\{X=x\})=0$. Instead, the term "$\mathsf P(Y=1\mid X=x)$" is intimately tied to the random variable $X$. Here is the usual definition:

For any Lebesgue-integrable (real) random variable $Y$ and any (real) random variable $X$ on the same probability space, one can define the conditional expectation $\mathsf E(Y\mid X)$ as the, informally speaking, "best approximation to $Y$ if $X$ is known". See for instance [1; Definition 8.11] for a formal definition.

Now, it follows from the definition that $\mathsf E(Y\mid X)$ is $\sigma(X)$-measurable, so by [1; Korollar 1.97], there exists a measurable function $\eta: \mathbb R\to\mathbb R$ such that $$\mathsf E(Y\mid X) = \eta\circ X$$ $\mathsf P$-almost everywhere. The function $\eta$ is uniquely determined $X_\#\mathsf P$-almost everywhere. (TODO: Prove this.)

(Here, $X_\#\mathsf P$ denotes the pushforward of $\mathsf P$ under $X$, i.e. $X_\#\mathsf P(A)\overset{\text{Def.}}=\mathsf P(X^{-1}(A))$ for all measurable $A\subset\mathbb R$.)

For example, suppose that we have a random variable $Y$ and a random variable $X$ such that $\mathsf E(Y\mid X)=X^2$. Then we have $\eta(x)=x^2$ $X_\#\mathsf P$-almost everywhere.
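This factorization can be checked numerically. The following Python sketch is a toy simulation of my own (the support, the noise level, and all names are illustrative assumptions, not from the answer): it draws samples with $\mathsf E(Y\mid X)=X^2$ and compares the empirical conditional means against $\eta(x)=x^2$.

```python
import random

random.seed(0)

# Toy model (illustrative assumption): X is uniform on a finite support and
# Y = X^2 + noise with mean zero, so E(Y | X) = X^2, i.e. eta(x) = x^2.
support = [-2, -1, 0, 1, 2]
samples = []
for _ in range(100_000):
    x = random.choice(support)
    samples.append((x, x**2 + random.gauss(0, 0.1)))

# The empirical mean of Y over the samples with X = x0 approximates eta(x0).
for x0 in support:
    ys = [y for x, y in samples if x == x0]
    print(x0, sum(ys) / len(ys))
```

Since every point of the support has positive probability here, conditioning on $\{X=x\}$ is unproblematic, and the printed averages should lie close to $x^2$.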

Therefore, one can now define, with an abuse of notation, (and using Iverson brackets, i.e. if $A$ is an event then $[A]$ shall denote the random variable that is the indicator function of $A$, it is often also denoted by $\mathbf 1_A$) $$\mathsf P(Y=1\mid X=x)=\eta(x)$$ where $\eta$ is a function satisfying $$\mathsf E([Y=1]\mid X) = \eta\circ X$$ $\mathsf P$-almost everywhere.


So, the Wikipedia article (with very confusing notation in my opinion) just says this: $$R(h)=\mathsf P(h(X)\neq Y)=\mathsf E([h(X)\neq Y]).$$ Since $h(X), Y$ only take the values $0$ and $1$, $$\mathsf E([h(X)\neq Y]) = \mathsf E([h(X)=0][Y=1])+\mathsf E([h(X)=1][Y=0]).$$

By the tower property for the conditional expectation [1; Satz 8.14 (iv)], we have $$\mathsf E([h(X)=0][Y=1]) = \mathsf E(\mathsf E([h(X)=0][Y=1]\mid X)).$$ By [1; Satz 8.14 (iii)], since $h(X)$ is $\sigma(X)$-measurable (assuming $h$ is measurable), we have $$\mathsf E([h(X)=0][Y=1]\mid X) = [h(X)=0]\mathsf E([Y=1]\mid X).$$

But we chose the notation $\mathsf E([Y=1]\mid X)=\eta(X)$, so we get $$\mathsf E([h(X)=0][Y=1]) = \mathsf E([h(X)=0] \eta(X)).$$ Analogously (exercise), we have $$\mathsf E([h(X)=1][Y=0]) = \mathsf E([h(X)=1] (1-\eta(X))),$$ and this is enough to conclude the proof of what you wanted to show.
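On a discrete space, the identity $\mathsf E([h(X)=0][Y=1]) = \mathsf E([h(X)=0]\,\eta(X))$ can be verified by direct summation. Here is a minimal Python sketch; the joint table, the classifier `h`, and all names are hypothetical choices of mine, not from the answer:

```python
# Hypothetical finite joint law (a made-up toy example):
# joint[(x, y)] = P(X = x, Y = y).
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.25,
}
marg = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1, 2)}  # P(X = x)

def eta(x):
    return joint[(x, 1)] / marg[x]  # P(Y = 1 | X = x)

def h(x):
    return 1 if x >= 1 else 0  # an arbitrary classifier

# E([h(X)=0][Y=1]): sum P(X = x, Y = y) over pairs with h(x) = 0 and y = 1.
lhs = sum(p for (x, y), p in joint.items() if h(x) == 0 and y == 1)

# E([h(X)=0] * eta(X)): sum P(X = x) * eta(x) over x with h(x) = 0.
rhs = sum(marg[x] * eta(x) for x in marg if h(x) == 0)

print(lhs, rhs)
```

Both sides agree because, for each fixed $x$, $[h(x)=0]$ is a constant that can be pulled out of the inner sum over $y$, which is exactly what the pull-out property [1; Satz 8.14 (iii)] does in general.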

Literature

[1] Achim Klenke: Wahrscheinlichkeitstheorie. 3. Auflage (2012/2013). Springer Spektrum.

    Thank you Maximilian! Back when I posted the question, I also played around with Marc Romani's suggestion of writing the integrals, or rather sums explicitly (assuming that I work on a discrete space where every $x$ has positive measure, to avoid the paradox you mention). That's what helped me see what's going on back then. Nonetheless, your answer provides an excellent explanation of the more general case. +1 – noam.szyfer Nov 05 '21 at 17:36
  • @noam I'm glad that you like my answer. – Maximilian Janisch Nov 05 '21 at 18:04

I think the confusion arises from the notation itself. Note that \begin{align} &\,\mathbb{E}_X[\mathbb{P}(Y \neq h(X)|X=x)]\\ =&\,\mathbb{E}_X[\mathbb{P}(Y=0,h(X)=1 | X=x) + \mathbb{P}(Y=1,h(X)=0 | X=x)]\\ =&\,\mathbb{E}_X[\mathbb{P}(Y=0|X=x)\mathbb{P}(h(X)=1|X=x) + \mathbb{P}(Y=1|X=x)\mathbb{P}(h(X)=0|X=x)]\\ =&\,\mathbb{E}_X[\mathbb{P}(Y=0|X=x)\boldsymbol{1}_{h(x)=1} + \mathbb{P}(Y=1|X=x)\boldsymbol{1}_{h(x)=0}]\\ =&\,\mathbb{E}_X[(1-\eta(x))\boldsymbol{1}_{h(x)=1} + \eta(x)\boldsymbol{1}_{h(x)=0}] \end{align} Inside the expectation you have the function $g(x) = (1-\eta(x))\boldsymbol{1}_{h(x)=1} + \eta(x)\boldsymbol{1}_{h(x)=0}$, defined on $\mathbb{R}^d$. But of course, you want to take the expectation with respect to the random variable $g(X)$. It's just how the notation plays out, once you condition on a particular value $x$ for $X$. If you wrote the integrals explicitly instead of the expectation operators it would become clear.
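To make the last remark concrete: on a discrete space, the outer expectation is a plain sum over the support of $X$. The following Python sketch (with a made-up joint distribution and classifier, chosen only for illustration) computes $R(h)$ both directly from its definition and via $\eta$:

```python
# Made-up finite joint law: joint[(x, y)] = P(X = x, Y = y); entries sum to 1.
joint = {
    (0, 0): 0.20, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.30,
    (2, 0): 0.15, (2, 1): 0.05,
}
marg = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1, 2)}  # P(X = x)

def eta(x):
    return joint[(x, 1)] / marg[x]  # P(Y = 1 | X = x)

def h(x):
    return 1 if x >= 1 else 0  # an arbitrary classifier

# Direct definition: R(h) = P(h(X) != Y), summed over the joint table.
direct = sum(p for (x, y), p in joint.items() if h(x) != y)

# Explicit sum form of E_X[(1 - eta(x)) 1_{h(x)=1} + eta(x) 1_{h(x)=0}].
via_eta = sum(marg[x] * ((1 - eta(x)) if h(x) == 1 else eta(x)) for x in marg)

print(direct, via_eta)
```

Writing the expectation as this explicit sum shows that the integrand really is the function $g(x)$ above, evaluated point by point against the law of $X$.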

  • Thanks for your answer! Can you give me an explicit example of how to write one of these expectations as an integral? I know the definition, and tried, for instance, to write $\Bbb{E}_X[\Bbb{P}(Y=0|X=x)\Bbb{1}_{h(x)=1} + \Bbb{P}(Y=1|X=x)\Bbb{1}_{h(x) = 0}]$ as an integral, but unfortunately this does not clear things up for me... – noam.szyfer Jun 04 '21 at 19:31