
I'm a bit confused about the universality of this statement:

Suppose we have real-valued random variables $Y$ and $X$, and a differentiable function $f$ (perhaps some model). Do not assume that $f$ is convex.

$$ \mathbb{E}[Y \mid X] = \text{argmin}_f \mathbb{E}[(Y - f(X))^2] $$

Is this always true? And if so, why? Most proofs of this statement rely on reducing it to (e.g., here):

$$ \text{argmin}_f \mathbb{E}[(\mathbb{E}[Y \mid X] - f(X))^2]$$

Then they take a derivative to compute the minimum and show the result, but this would seem to require $f$ to be convex, so does the statement above always hold?
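To make this concrete, here is what I understand the derivative step to look like in the simplest (constant-predictor) version, sketched in my own notation: minimizing $\mathbb{E}[(Y-a)^2]$ over constants $a$,

$$ \frac{d}{da}\,\mathbb{E}[(Y-a)^2] = \frac{d}{da}\left(\mathbb{E}[Y^2] - 2a\,\mathbb{E}[Y] + a^2\right) = 2a - 2\,\mathbb{E}[Y] = 0 \iff a = \mathbb{E}[Y]. $$

My worry is whether this kind of argument still works when we minimize over an arbitrary, not necessarily convex, $f$.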

  • At which point do they take a derivative in the proof that you linked? They add/subtract $E[Y|X]$ and show the cross term goes to zero – WeakLearner Sep 27 '22 at 18:33
  • They have a note about convexity in the first proof and use it to show that we choose $a = \mathbb{E}[X]$ – user2793618 Sep 27 '22 at 18:39
  • convexity of the function $x \mapsto x^2$, not convexity of the function $f$ as you claim in your question – WeakLearner Sep 27 '22 at 18:42
  • Ah ok, this makes sense then – user2793618 Sep 27 '22 at 18:42
  • That's the first proof. In the second they use that ${\cal G}\ni g\mapsto \mathbb E[(\mathbb E[Y|X]-g(X))^2]$ is a convex functional that is minimized (obviously) at $g^*(X)=\mathbb E[Y|X]$. A nice reference BTW. – Kurt G. Sep 27 '22 at 18:45
  • Hmm, actually now that I look at it again, I don't seem to understand why that function would necessarily be convex, as it is a function of $g(X)$. Not too sure why he mentioned convexity here – user2793618 Sep 27 '22 at 19:04
  • That was in fact a good observation. – Kurt G. Sep 28 '22 at 04:25
  • It doesn't have to be necessarily convex right? – user2793618 Sep 28 '22 at 04:47
  • Nope (as the accepted answer also says). Once we know that we only have to minimize $\mathbb E[(\mathbb E[Y|X]-g(X))^2]$ it becomes completely trivial. I find it nonetheless interesting in itself that the functional is convex in $g$. – Kurt G. Sep 28 '22 at 09:06
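For the record, the convexity noted in the last two comments can be checked directly (a quick sketch): for any functions $g_0, g_1$ and $t \in [0,1]$ we have $\mathbb E[Y|X] - \big((1-t)g_0(X) + t\,g_1(X)\big) = (1-t)\big(\mathbb E[Y|X]-g_0(X)\big) + t\big(\mathbb E[Y|X]-g_1(X)\big)$, so pointwise convexity of $u \mapsto u^2$ gives

$$ \Big(\mathbb E[Y|X] - (1-t)g_0(X) - t\,g_1(X)\Big)^2 \le (1-t)\Big(\mathbb E[Y|X]-g_0(X)\Big)^2 + t\,\Big(\mathbb E[Y|X]-g_1(X)\Big)^2, $$

and taking expectations preserves the inequality. Hence $g \mapsto \mathbb E[(\mathbb E[Y|X]-g(X))^2]$ is indeed convex in $g$.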

1 Answer

There's no need to take derivatives or invoke convexity. Having established that $$ E\left[ (Y-f(X))^2\right]=E\left[(Y-E(Y\mid X))^2\right]+E\left[(E(Y\mid X)-f(X))^2\right]=a+b,\tag{$\ast$} $$ we observe that the LHS of $(\ast)$ is minimized when $b$ is minimized, since $a$ does not depend on $f$. Because $b$ is the expectation of a non-negative random variable, it is clear that $b\ge0$. But taking $\hat f(X):=E(Y\mid X)$ gives $b=0$, hence $\hat f(X)$ is a choice of $f(X)$ that minimizes the LHS.
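For completeness, identity $(\ast)$ follows by adding and subtracting $E(Y\mid X)$ inside the square; the cross term vanishes because $E(Y\mid X)-f(X)$ is a function of $X$ and pulls out of the conditional expectation:

$$ E\big[(Y-E(Y\mid X))(E(Y\mid X)-f(X))\big] = E\Big[(E(Y\mid X)-f(X))\underbrace{E\big[Y-E(Y\mid X)\mid X\big]}_{=0}\Big] = 0. $$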

Now suppose $h(X)$ also minimizes the LHS of $(\ast)$. Then we must have $$E\left[(E(Y\mid X)-h(X))^2\right]=0$$ as well. But $(E(Y\mid X)-h(X))^2$ is a non-negative random variable, so it must equal zero almost surely, which implies $h(X)=E(Y\mid X)$ almost surely.

– grand_chat
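As an illustrative numerical sanity check of the above (a sketch only; the joint distribution of $(X,Y)$ below is an arbitrary assumption, not from the question), a quick Monte Carlo comparison shows the conditional mean achieving the smallest mean squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Assumed toy model for this check: X ~ Uniform(0, 1) and
# Y = X**2 + eps with eps ~ N(0, 0.5**2), so E[Y | X] = X**2.
x = rng.uniform(0.0, 1.0, size=n)
y = x**2 + rng.normal(0.0, 0.5, size=n)

def mse(f):
    """Monte Carlo estimate of E[(Y - f(X))^2]."""
    return np.mean((y - f(x))**2)

print(mse(lambda t: t**2))                        # conditional mean: ~0.25 (the noise variance)
print(mse(lambda t: t))                           # another predictor: ~0.28, strictly larger
print(mse(lambda t: np.full_like(t, y.mean())))   # best constant E[Y]: ~0.34
```

The first value is close to the irreducible noise variance $0.25$, which is the term $a$ in $(\ast)$; any other predictor, including the best constant $E[Y]$, incurs a strictly larger error, matching the decomposition.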