4

This is exercise 3.8(a) from Understanding Machine Learning: from Theory to Algorithms by Shalev-Shwartz and Ben-David. I am trying to figure the exercise out for a course, but it is not homework. (I have a decent mathematical background but less so in stochastics.)

Consider a set $\mathcal X$ and a pair of random variables $(x,y)$ distributed according to a probability distribution $\mathcal D$ on $\mathcal X \times \{0,1\}$. We think informally of the probability of $(x, 0)$ as that of finding an $x$ without a certain property, and of the probability of $(x, 1)$ as that of finding an $x$ with that property. Note that I'm not requiring the distribution to be discrete.

We define $f: \mathcal X \to \{0,1\}$, the "Bayes optimal predictor", by $$ f(x) = 0 \iff \mathbb P(y = 0\mid x ) \geq \frac12. $$ (I'm not 100% sure this definition even makes sense in the general case. Do we need a well-behaved distribution to speak about $\mathbb P(y = 0 \mid x)$?)
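To make the definition concrete, with numbers chosen purely for illustration: if $\mathbb P(y = 0 \mid x) = 0.7$ for some particular $x$, then $$ f(x) = 0 \qquad\text{and}\qquad \mathbb P(f(x) \neq y \mid x) = \mathbb P(y = 1 \mid x) = 0.3, $$ i.e. $f$ guesses the more likely label and errs on that $x$ with conditional probability $0.3$.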

Now I want to prove that $f$ is "the best at predicting whether or not an $x$ has the property". That is, for any $h : \mathcal X \to [0, 1]$ (note the codomain) we have that $$ \mathbb P(f(x) \neq y) \leq \mathbb E[|h(x) - y|]. $$ Intuitively, this is totally obvious to me: $f$ always makes the "best guess", so even in the best case $h$ incurs at least as much loss. I've even proved it in the case where $h$ maps into $\{0,1\}$. However, I can't seem to get the general case. I've tried conditioning and splitting the expectation and probability into the right cases, but I keep getting stuck, partly on notation and partly on insight. I hope someone can point me in the right direction.

Mees de Vries
  • 26,947

1 Answer

4

First the inequality: temporarily writing $p_x$ for $\mathbb P(y=0\mid x)$, note that $$\mathbb E[|h(x)-y| \mid x] = h(x)p_x+(1-h(x))(1-p_x)$$ is a convex combination of $p_x$ and $1-p_x$, so it must be at least $\min(p_x,1-p_x)=\mathbb P(f(x)\neq y\mid x)$. Taking expectations over $x$ on both sides then gives $\mathbb P(f(x)\neq y)\leq\mathbb E[|h(x)-y|]$.
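As a sanity check with numbers chosen arbitrarily: if $p_x = 0.3$ and $h(x) = 0.9$, then $$\mathbb E[|h(x)-y|\mid x] = 0.9\cdot 0.3 + 0.1\cdot 0.7 = 0.34 \;\geq\; 0.3 = \min(p_x, 1-p_x) = \mathbb P(f(x)\neq y\mid x),$$ since here $p_x < \tfrac12$, so $f(x) = 1$ and $f$ errs exactly when $y = 0$.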

As for conditional probabilities: there is an abstract definition of "a conditional expectation," and these can often be shown to exist as a Radon–Nikodym derivative. In your case, $\mathbb P(y=0\mid x)$ is the conditional expectation of the indicator of the event $y=0$ given $x$, and it is a measurable function $\mathcal X\to [0,1]$ (defined up to a set of measure zero). It can be constructed as the Radon–Nikodym derivative of the measure on $\mathcal X$ coming from the event $y=0$ with respect to the marginal probability measure on $\mathcal X$.
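Concretely (writing $\mathcal D_{\mathcal X}$ for that marginal; the notation is mine, but the property is the standard one): $\mathbb P(y=0\mid x)$ is a measurable function $g:\mathcal X\to[0,1]$ characterized by $$\mathbb P\big(y = 0,\ x \in A\big) = \int_A g(x)\,\mathrm d\mathcal D_{\mathcal X}(x) \quad\text{for every measurable } A \subseteq \mathcal X,$$ and the Radon–Nikodym theorem gives existence and uniqueness up to $\mathcal D_{\mathcal X}$-null sets, since the measure $A \mapsto \mathbb P(y=0,\ x\in A)$ is absolutely continuous with respect to $\mathcal D_{\mathcal X}$.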

Dap
  • 25,286
  • Thank you for your answer; it certainly sounds good, but I'll have to dig up my measure theory book to see what this Radon-Nikodym stuff is about again. Once I feel confident I understand your answer I'll upvote/accept. – Mees de Vries Oct 11 '17 at 09:58