
Sorry if this wastes your time, but I am not a mathematician. I do like numbers, though.

After reading a problem where you are shown a set of X numbers and asked how you would predict the next one, I learned about the average.

By plotting, I realized that the average is a number that always lies between the max and the min, but also that it is the number whose differences from all the sample values are the smallest possible (at least in the case of picking a single number).

I tried to write some math representing this, but I have no real idea how to prove it.

I imagine that this could be proved by contradiction: any other number gives rise to a larger difference. But I couldn't get anywhere.

Would you give me some hints, or pointers as to what kind of ideas are needed to prove it?

RobPratt
Mah Neh
  • Hi: $\bar{x}$ has this nice property, but only when the loss function is the squared distance. In that case, $\bar{x}$ minimizes the sum of all of the squared distances over the sample. But if you used, say, the sum of absolute distances as the loss function, then the estimator with this property would be the median. So the estimator depends on the loss function used. – mark leeds Aug 07 '22 at 16:59
  • Oh, and to prove it (since that is what you asked): take $\sum_{i=1}^{n} (x_{i} - c)^2$ and solve for the $c$ that minimizes the sum by taking the first derivative and setting it to zero. You should get $\bar{x}$. – mark leeds Aug 07 '22 at 17:00

1 Answer


Suppose we have some numbers $x_1, x_2, \ldots, x_n$. By shuffling them around, we can assume that they are ordered such that $x_1 \leq x_2 \leq \ldots \leq x_n$. The mean of these numbers is given by
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i. $$
It is easy to prove that $\bar{x}$ lies between the smallest number $x_1$ and the largest $x_n$, since
$$ x_1 = \frac{1}{n} \sum_{i=1}^n x_1 \leq \frac{1}{n}\sum_{i=1}^n x_i \leq \frac{1}{n}\sum_{i=1}^n x_n = x_n. $$

As you mention, $\bar{x}$ is a number with small distance to each individual point. The precise way in which this is true is that $c = \bar{x}$ minimizes the expression
$$ \sum_{i=1}^n (x_i - c)^2. $$
One way to prove this is to note that this expression is a quadratic in the variable $c$ with positive leading coefficient $n$, so its unique minimum is at the point where its derivative with respect to $c$ is zero. Setting the derivative to zero gives
$$ -2\sum_{i=1}^n x_i + 2nc = 0, $$
so that we indeed find
$$ c = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}. $$
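If it helps to see this numerically, here is a minimal sketch in Python (assuming NumPy; the sample and the grid resolution are arbitrary choices, not part of the answer above). It scans a fine grid of candidate values $c$ and checks that the loss is smallest at the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)  # an arbitrary sample

def sq_loss(c):
    # Sum of squared differences between the sample and candidate c.
    return np.sum((x - c) ** 2)

# Brute-force search over a fine grid of candidates between min and max.
grid = np.linspace(x.min(), x.max(), 10_001)
best = grid[np.argmin([sq_loss(c) for c in grid])]

print(best, x.mean())  # the grid minimizer agrees with the mean
```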

Edit: In response to your comment: indeed, it turns out that $\bar{x}$ minimizes the squared differences $\sum_{i=1}^n (x_i - c)^2$ rather than the absolute differences $\sum_{i=1}^n |x_i - c|$, which you might have initially expected. The minimizer of $\sum_{i=1}^n |x_i - c|$ is the median, which you have probably heard of. After some searching, I found this answer, which gives a precise argument that the median minimizes this expression. The median, however, is often non-unique. For example, in the case of two numbers $x_1 < x_2$, any $c \in [x_1, x_2]$ minimizes $\sum_{i=1}^2 |x_i - c|$. The fact that the squared differences always have a unique, easily expressible minimizer is convenient, but which measure is appropriate depends on the application.
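Both points can be illustrated with a small sketch (again assuming NumPy; the sample values are made up): the median beats the mean for the absolute-value loss, and for two points every value in between gives the same loss.

```python
import numpy as np

def abs_loss(x, c):
    # Sum of absolute differences between the sample x and candidate c.
    return np.sum(np.abs(x - c))

x = np.array([1.0, 2.0, 3.0, 10.0])
# The median (2.5) gives a smaller loss than the mean (4.0): 10.0 vs 12.0.
print(abs_loss(x, np.median(x)), abs_loss(x, np.mean(x)))

# Non-uniqueness: for two points, every c in [x_1, x_2] gives the same loss.
two = np.array([0.0, 1.0])
print([abs_loss(two, c) for c in (0.0, 0.25, 0.5, 1.0)])  # all equal 1.0
```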

It might also be helpful to know that in the context of machine learning, the sum of the squared differences is called the $\ell^2$-loss and the sum of the absolute value differences the $\ell^1$-loss.

Steven
  • Why did you pick that expression to minimize? The distance to a point is $d = x - x'$, isn't it? Why not minimize that sum? Oh, that cannot be minimized, I think? But why the square, just simplicity? – Mah Neh Aug 08 '22 at 12:21
  • By the way, thanks for the answer, it is clean, creative and understandable. – Mah Neh Aug 08 '22 at 12:29
  • I wonder if there is any other way, more visual/geometric/common-sense, rather than purely analytical. – Mah Neh Aug 08 '22 at 12:46
  • @MahNeh Thank you! Expanded my answer a little bit. It might also be helpful to think of examples, that is, sequences of numbers $x_1, \ldots, x_n$ where the mean doesn't minimize the absolute value differences and where the median doesn't minimize the squared differences. – Steven Aug 08 '22 at 13:09
  • Dang, this is good. I love maths sometimes. I still don't fully understand, but I'll keep coming back to the answer. Hope you don't mind that I asked a follow-up question to check whether there is any geometric way, something that does not involve "minimizing", which I find difficult to imagine. – Mah Neh Aug 08 '22 at 13:12
  • @MahNeh To get some geometric intuition, I would just try to think of some examples: for instance, why the median for two points $x_1 < x_2$ is not unique but the mean is (it has to do with the fact that squaring the differences 'punishes' large differences), or what happens when we have a couple of points fairly close together and then one far outlier (see the sketch below). – Steven Aug 08 '22 at 13:17
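To make the outlier example from the last comment concrete, here is a tiny sketch (assuming NumPy; the numbers are made up):

```python
import numpy as np

# A cluster of nearby points plus one far outlier.
x = np.array([1.0, 2.0, 3.0, 100.0])

print(np.mean(x))    # 26.5 -- pulled toward the outlier, since squaring
                     # punishes large differences heavily
print(np.median(x))  # 2.5  -- stays with the cluster
```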