This is exercise 3.8(a) from *Understanding Machine Learning: From Theory to Algorithms* by Shalev-Shwartz and Ben-David. I am trying to figure the exercise out for a course, but it is not homework. (I have a decent mathematical background, but less so in probability theory.)
Consider a set $\mathcal X$, and a pair of random variables $(x,y)$ jointly distributed according to a probability distribution $\mathcal D$ on $\mathcal X \times \{0,1\}$. We think informally of the probability of $(x, 0)$ as that of finding an $x$ without a certain property, and of the probability of $(x, 1)$ as that of finding an $x$ with that property. Note that I am not requiring the distribution to be discrete.
We define $f: \mathcal X \to \{0,1\}$, the "Bayes optimal predictor", by $$ f(x) = 0 \iff \mathbb P(y = 0\mid x ) \geq \frac12. $$ (I'm not 100% sure this definition even makes sense in the general case. Do we need a well-behaved distribution to speak about $\mathbb P(y = 0 \mid x)$?)
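To fix ideas, here is a small discrete example of my own (the particular joint table is just something I made up), computing $f$ directly from the conditional probabilities:

```python
import numpy as np

# Toy discrete distribution on X = {0, 1, 2} with labels in {0, 1}.
# joint[i, j] = P(x = i, y = j); each row sums to the marginal P(x = i).
joint = np.array([
    [0.30, 0.10],   # P(y=0 | x=0) = 0.75      -> f(0) = 0
    [0.05, 0.25],   # P(y=0 | x=1) = 1/6       -> f(1) = 1
    [0.15, 0.15],   # P(y=0 | x=2) = 0.50 (tie -> f(2) = 0)
])

p_y0_given_x = joint[:, 0] / joint.sum(axis=1)

# Bayes optimal predictor: f(x) = 0 iff P(y = 0 | x) >= 1/2.
f = np.where(p_y0_given_x >= 0.5, 0, 1)
print(f)  # [0 1 0]
```

In the discrete case this is unambiguous; my worry above is only about what $\mathbb P(y = 0 \mid x)$ means when $x$ is, say, continuous.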
Now I want to prove that $f$ is "the best at predicting whether or not an $x$ has the property". That is, for any $h : \mathcal X \to [0, 1]$ (note the codomain) we have that $$ \mathbb P(f(x) \neq y) \leq \mathbb E[|h(x) - y|]. $$ Intuitively, this is totally obvious to me: $f$ always makes the "best guess", so even in the optimal scenario, $h$ should incur more loss. I have even proved it in the case that $h$ maps into $\{0,1\}$. However, I can't seem to get the general case. I have tried to condition and split the expectation and probability into the right cases, but I keep getting stuck, partially on notation and partially on insight. I hope someone can point me in the right direction.
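As a sanity check, I verified the inequality numerically on a toy discrete distribution (my own made-up table), computing both sides exactly rather than by sampling: conditioning on $x$ gives $\mathbb E[|h(x) - y| \mid x] = h(x)\,\mathbb P(y=0 \mid x) + (1 - h(x))\,\mathbb P(y=1 \mid x)$, and no randomly chosen $h$ into $[0,1]$ ever beats the Bayes loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution: joint[i, j] = P(x = i, y = j) on X = {0, 1, 2}.
joint = np.array([
    [0.30, 0.10],
    [0.05, 0.25],
    [0.15, 0.15],
])
p_x = joint.sum(axis=1)     # marginal P(x)
p1 = joint[:, 1] / p_x      # P(y = 1 | x)

# Bayes optimal predictor: f(x) = 0 iff P(y = 0 | x) >= 1/2.
f = np.where(1 - p1 >= 0.5, 0, 1)

# P(f(x) != y): at each x the predictor errs with probability P(y != f(x) | x).
bayes_loss = np.sum(p_x * np.where(f == 1, 1 - p1, p1))

# E|h(x) - y| = sum_x P(x) * [h(x) P(y=0|x) + (1 - h(x)) P(y=1|x)].
for _ in range(1000):
    h = rng.uniform(0.0, 1.0, size=len(p_x))  # a random h into [0, 1]
    loss_h = np.sum(p_x * (h * (1 - p1) + (1 - h) * p1))
    assert bayes_loss <= loss_h + 1e-12

print(f"Bayes loss P(f(x) != y) = {bayes_loss:.2f}")
```

This supports the intuition, and also suggests that the conditional expectation per $x$ is linear in $h(x)$, so it should be minimized at an endpoint of $[0,1]$; but I do not see how to turn that observation into a clean proof in the general (non-discrete) case.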