Given a collection of points $P \subset \mathbb R^3$, a crude characterization of the "shape" of $P$ is sometimes given by the principal components. We construct a covariance matrix, e.g., if $P$ is discrete, $C = \displaystyle\sum_{p\in P} (p - \mu)(p - \mu)^\intercal$, where $\mu = \displaystyle\frac1{|P|}\sum_{p\in P}\, p$ is the centre of mass. This defines an ellipsoid whose semi-axes are given by the unit eigenvectors of $C$, scaled by the associated eigenvalues.
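For concreteness, the construction above can be sketched numerically. This is only a minimal illustration, assuming NumPy; the function name `pca_ellipsoid` and the sample points are hypothetical, chosen here for the example, and $C$ is left unnormalized exactly as in the formula above.

```python
import numpy as np

def pca_ellipsoid(points):
    """Centre of mass, matrix C, and eigen-decomposition of C for a point set.

    Hypothetical helper for illustration: C is the unnormalized sum of
    outer products (p - mu)(p - mu)^T, as defined in the question.
    """
    P = np.asarray(points, dtype=float)   # shape (n, 3)
    mu = P.mean(axis=0)                   # centre of mass
    V = P - mu                            # centred points v_p
    C = V.T @ V                           # sum over p of v_p v_p^T
    # C is symmetric, so eigh applies: eigenvalues come back in ascending
    # order and the columns of evecs are orthonormal eigenvectors.
    evals, evecs = np.linalg.eigh(C)
    return mu, C, evals, evecs

# Sample data (an assumption for the example): six points on the coordinate axes.
pts = [(2, 0, 0), (-2, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 0.5), (0, 0, -0.5)]
mu, C, evals, evecs = pca_ellipsoid(pts)

# Semi-axes as described above: unit eigenvectors scaled by their eigenvalues.
semi_axes = evecs * evals                 # column i is evals[i] * evecs[:, i]
```

Each column of `semi_axes` is one semi-axis vector of the ellipsoid under discussion.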
My question is concerned with the following statement:
The ellipsoid described by the principal components is the best fit ellipsoid for $P$.
Unfortunately, I don't know of any author or resource that I can accuse of explicitly making such a claim$^*$. Anyway, my question is:
Is there a natural geometric definition of "best fit ellipsoid" for which the above statement becomes true?
For example, some kind of least squares or other variational characterization of this same ellipsoid is what I am looking for. I would also accept an answer that convinces me that this is the wrong way to be looking at the principal components, but that will be a tough sell.
If we do a coordinate translation, so that $v_p = (p - \mu)$, and let $\hat{v}_p = \frac{v_p}{\left\|v_p\right\|}$, and look at $C$ as a linear transformation which is a sum of rank-one operators in this coordinate system, $C = \displaystyle\sum_{p \in P} \left\|v_p\right\|^2 \hat{v}_p\hat{v}_p^\intercal$, then the ellipsoid in question is the image of the unit ball under $C$. From this characterization I gain some intuition as to why this particular ellipsoid is a good one. I am looking for a better understanding, preferably from a geometric perspective.
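To spell out why the image of the unit ball is this ellipsoid, here is a short derivation. It assumes that $C$ is invertible (i.e. the points do not all lie in one plane), and the symbols $\lambda_i$, $e_i$ for the eigenvalues and orthonormal eigenvectors of $C$ are notation introduced only for this illustration. First, $\left\|v_p\right\|^2 \hat{v}_p\hat{v}_p^\intercal = v_p v_p^\intercal$, so the sum of rank-one operators is the same matrix $C$ as before. Then, with $C = \displaystyle\sum_{i=1}^{3} \lambda_i\, e_i e_i^\intercal$,

$$\{\,Cw : \|w\| \le 1\,\} = \{\,x : \|C^{-1}x\| \le 1\,\} = \{\,x : x^\intercal C^{-2} x \le 1\,\} = \Big\{\,\sum_{i=1}^{3} x_i e_i \;:\; \sum_{i=1}^{3} \frac{x_i^2}{\lambda_i^2} \le 1 \Big\},$$

which is an ellipsoid with semi-axes $\lambda_i e_i$.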
$^*$ Wikipedia comes close to such a claim in the description of the moments: "The 'second moment', ... in higher dimensions measures the shape of a cloud of points as it could be fit by an ellipsoid."
Edit: Although I suspect that the ellipsoid reflects the covariance of the Gaussian distribution that has maximal likelihood to produce $P$ (I haven't rolled up my sleeves and checked), this is not the kind of answer I am looking for. Perhaps I should remove all tags that refer to probability or regression.
I will make the question very specific then. From stuff I've seen elsewhere on the web, I get the feeling that this ellipsoid is different from the one that minimizes the sum of squared distances to the points, but I don't know for sure.
How about this then: the radial distance from a point $p$ to the ellipsoid is the distance as measured along the line that contains $p$ and $\mu$ (the latter being the origin in our new coordinate system). So let this be my question:
Does the ellipsoid defined above minimize the sum of the squared radial distances?
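To make the objective precise, here is a minimal sketch of how the sum of squared radial distances could be evaluated numerically. It assumes NumPy, an invertible matrix, and that no point coincides with $\mu$; the function name `radial_sse` and its argument `M` are hypothetical, introduced only for this illustration. The candidate ellipsoid is represented as the image of the unit ball under `M`, centred at $\mu$.

```python
import numpy as np

def radial_sse(points, M):
    """Sum of squared radial distances from the points to the boundary of
    the ellipsoid {M w : |w| <= 1}, centred at the points' centre of mass.

    Hypothetical helper: assumes M is invertible and no point equals mu.
    """
    P = np.asarray(points, dtype=float)   # shape (n, 3)
    V = P - P.mean(axis=0)                # centred points v_p
    r = np.linalg.norm(V, axis=1)         # |v_p|
    U = V / r[:, None]                    # unit directions from mu to each point
    # The ray t*u (t > 0) meets the boundary where |M^{-1}(t u)| = 1,
    # that is, at t = 1 / |M^{-1} u|.
    t = 1.0 / np.linalg.norm(np.linalg.solve(M, U.T), axis=0)
    # Radial distance of p is the gap between |v_p| and that boundary radius.
    return float(np.sum((r - t) ** 2))
```

Calling `radial_sse(P, C)` with the matrix $C$ from above gives the value of the objective for the ellipsoid in question, which can then be compared with the value for any other candidate matrix `M`.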