
I'm a bit confused about the universality of this statement:

Suppose we have real-valued random variables $Y$ and $X$, and a differentiable function $f$ (perhaps some model). Do not assume that $f$ is convex.

$$ \mathbb{E}[Y \mid X] = \text{argmin}_f \mathbb{E}[(Y - f(X))^2] $$

Is this always true? And if so, why? Most proofs of this statement rely on reducing it to (e.g., here):

$$ \text{argmin}_f \mathbb{E}[(\mathbb{E}[Y \mid X] - f(X))^2]$$

Then they take a derivative to compute the minimum and show the result, but this would seem to require $f$ to be convex, so does the statement above always hold?
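To make this concrete, here is what I understand the derivative step to look like in the simplest (constant-predictor) version, sketched in my own notation: minimizing $\mathbb{E}[(Y-a)^2]$ over constants $a$,

$$ \frac{d}{da}\,\mathbb{E}[(Y-a)^2] = \frac{d}{da}\left(\mathbb{E}[Y^2] - 2a\,\mathbb{E}[Y] + a^2\right) = 2a - 2\,\mathbb{E}[Y] = 0 \iff a = \mathbb{E}[Y]. $$

My worry is whether this kind of argument still works when we minimize over an arbitrary, not necessarily convex, $f$.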

  • At which point do they take a derivative in the proof that you linked? They add/subtract $E[Y|X]$ and show the cross term goes to zero – WeakLearner Sep 27 '22 at 18:33
  • They have a note about convexity in the first proof and use it to show that we choose $a = \mathbb{E}[X]$ – user2793618 Sep 27 '22 at 18:39
  • convexity of the function $x \mapsto x^2$, not convexity of the function $f$ as you claim in your question – WeakLearner Sep 27 '22 at 18:42
  • Ah ok, this makes sense then – user2793618 Sep 27 '22 at 18:42
  • That's the first proof. In the second they use that ${\cal G}\ni g\mapsto \mathbb E[(\mathbb E[Y|X]-g(X))^2]$ is a convex functional that is minimized (obviously) at $g^*(X)=\mathbb E[Y|X]$. A nice reference BTW. – Kurt G. Sep 27 '22 at 18:45
  • Hmm, actually now that I look at it again, I don't seem to understand why that function would necessarily be convex, as it is a function of $g(X)$. Not too sure why he mentioned convexity here – user2793618 Sep 27 '22 at 19:04
  • That was in fact a good observation. – Kurt G. Sep 28 '22 at 04:25
  • It doesn't have to be necessarily convex right? – user2793618 Sep 28 '22 at 04:47
  • Nope (as the accepted answer also says). Once we know that we only have to minimize $\mathbb E[(\mathbb E[Y|X]-g(X))^2]$ it becomes completely trivial. I find it nonetheless interesting in itself that the functional is convex in $g$. – Kurt G. Sep 28 '22 at 09:06
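For the record, the convexity noted in the last two comments can be checked directly (a quick sketch): for any functions $g_0, g_1$ and $t \in [0,1]$ we have $\mathbb E[Y|X] - \big((1-t)g_0(X) + t\,g_1(X)\big) = (1-t)\big(\mathbb E[Y|X]-g_0(X)\big) + t\big(\mathbb E[Y|X]-g_1(X)\big)$, so pointwise convexity of $u \mapsto u^2$ gives

$$ \Big(\mathbb E[Y|X] - (1-t)g_0(X) - t\,g_1(X)\Big)^2 \le (1-t)\Big(\mathbb E[Y|X]-g_0(X)\Big)^2 + t\,\Big(\mathbb E[Y|X]-g_1(X)\Big)^2, $$

and taking expectations preserves the inequality. Hence $g \mapsto \mathbb E[(\mathbb E[Y|X]-g(X))^2]$ is indeed convex in $g$.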

1 Answer

There's no need to take derivatives or invoke convexity. Having established that $$ E\left[ (Y-f(X))^2\right]=E\left[(Y-E(Y\mid X))^2\right]+E\left[(E(Y\mid X)-f(X))^2\right]=a+b,\tag{$\ast$} $$ we observe that the LHS of $(\ast)$ is minimized when $b$ is minimized, since $a$ does not depend on $f$. Because $b$ is the expectation of a non-negative random variable, it is clear that $b\ge0$. But taking $\hat f(X):=E(Y\mid X)$ gives $b=0$, hence $\hat f(X)$ is a choice of $f(X)$ that minimizes the LHS.
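For completeness, identity $(\ast)$ follows by adding and subtracting $E(Y\mid X)$ inside the square; the cross term vanishes because $E(Y\mid X)-f(X)$ is a function of $X$ and pulls out of the conditional expectation:

$$ E\big[(Y-E(Y\mid X))(E(Y\mid X)-f(X))\big] = E\Big[(E(Y\mid X)-f(X))\underbrace{E\big[Y-E(Y\mid X)\mid X\big]}_{=0}\Big] = 0. $$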

Now suppose $h(X)$ also minimizes the LHS of $(\ast)$. Then we must have $$E\left[(E(Y\mid X)-h(X))^2\right]=0$$ as well. But $(E(Y\mid X)-h(X))^2$ is a non-negative random variable, so it must equal zero almost surely, which implies $h(X)=E(Y\mid X)$ almost surely.

– grand_chat
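As an illustrative numerical sanity check of the above (a sketch only; the joint distribution of $(X,Y)$ below is an arbitrary assumption, not from the question), a quick Monte Carlo comparison shows the conditional mean achieving the smallest mean squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Assumed toy model for this check: X ~ Uniform(0, 1) and
# Y = X**2 + eps with eps ~ N(0, 0.5**2), so E[Y | X] = X**2.
x = rng.uniform(0.0, 1.0, size=n)
y = x**2 + rng.normal(0.0, 0.5, size=n)

def mse(f):
    """Monte Carlo estimate of E[(Y - f(X))^2]."""
    return np.mean((y - f(x))**2)

print(mse(lambda t: t**2))                        # conditional mean: ~0.25 (the noise variance)
print(mse(lambda t: t))                           # another predictor: ~0.28, strictly larger
print(mse(lambda t: np.full_like(t, y.mean())))   # best constant E[Y]: ~0.34
```

The first value is close to the irreducible noise variance $0.25$, which is the term $a$ in $(\ast)$; any other predictor, including the best constant $E[Y]$, incurs a strictly larger error, matching the decomposition.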