
When studying "Moments" in Probability and Statistics, we are told the following. Given a random variable $X$ with probability density function $f(x)$:

  • The theoretical $k$-th moment is given by: $\int_{x} x^k \cdot f(x) \,\mathrm{d}x$

Now, suppose we have some measurements $x_1, x_2, \dots, x_n$. For these observations (regardless of our assumption on their probability distribution):

  • The sample $k$-th moment is given by: $\frac{1}{n} \sum_{i=1}^n x_i^k$

For me, the definition of the theoretical moment seems reasonable because it depends on the probability distribution. For example, the $k$-th moment of a Normal (Gaussian) distribution is completely different from the $k$-th moment of an Exponential distribution.

However, I find the idea of the sample moment more confusing: this time, regardless of the probability distribution being assumed, the formula for the $k$-th sample moment is identical for every distribution.
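To see this concretely, here is a minimal Python sketch (the Normal and Exponential choices are just examples) that applies the exact same sample-moment formula to data from two different distributions:

```python
import random

def sample_moment(xs, k):
    """k-th sample moment: (1/n) * sum of x_i^k, the same formula for any data."""
    return sum(x ** k for x in xs) / len(xs)

random.seed(0)
normal_data = [random.gauss(0, 1) for _ in range(10_000)]     # N(0, 1)
expo_data = [random.expovariate(1.0) for _ in range(10_000)]  # Exp(1)

# Identical formula, different data: E[X^2] is 1 for N(0,1) but 2 for Exp(1).
print(sample_moment(normal_data, 2))
print(sample_moment(expo_data, 2))
```

The formula never changes; the two outputs differ only because the data differ.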

This brings me to my question: why don't sample moments depend on the underlying probability distribution, while theoretical moments do? Is it possible to prove that, regardless of the underlying probability distribution, the corresponding sample moment will always be the same?

Thanks!

References:

stats_noob
  • The samples generate an "empirical" distribution. With respect to this distribution, the sample mean, variance, etc. can be treated the way theoretical quantities are. – Mittens Oct 23 '22 at 03:45
  • @ Oliver Diaz: Thank you for your reply! I would be interested in hearing more if you had time! – stats_noob Oct 23 '22 at 03:50
  • Do a search about empirical distributions. Also, under some regularity assumptions, the law of large numbers states that the empirical moments converge to the theoretical ones. – Mittens Oct 23 '22 at 03:52
  • In this setting, the "measurements" used to compute sample moments are actually random variables. So sample moments are also random variables (as they are functions of random variables). These sample moments are point estimators of their theoretical counterparts. – BGM Oct 23 '22 at 05:12

2 Answers


I hope the following answers your question well.

Theoretical moments stem directly from the probability distribution: if one knows the specific analytic distribution at hand, one can calculate its moments exactly.

Sample moments, in contrast, are computed from the observations alone. They still rest on the existence of some underlying probability distribution, but we can calculate them without knowing anything about that distribution's specific form, properties, or restrictions.

To make the relation between the two precise, and to describe the transition from sample moments to theoretical moments, we apply the Law of Large Numbers. It is this principle that bridges the gap you refer to, between sample moments (based on finite samples) and theoretical moments (based on the entire probability distribution).

Let $X$ be a random variable with probability density function $f(x)$. The theoretical first moment of $X$, called its expected value, is defined as:

$$E[X] = \int x \cdot f(x)\; \mathrm{d}x \qquad \text{(over the entire domain of $x$)}$$

Trivially, this moment depends on the actual probability distribution of the random variable $X$ over its entire domain.

Now, let $x_1, x_2, \dots, x_n$ be a sample of size $n$ from the distribution $f(x)$. The sample's first moment, the sample mean, is given by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

That is, we sum all the sample values and divide by the sample size.

According to the (strong) Law of Large Numbers, as $n$ (the number of samples) approaches infinity, the sample mean converges to the expected value:

$$\lim_{n\rightarrow\infty} \frac{1}{n} \sum_{i=1}^{n} x_i = E[X] \qquad \text{with probability } 1$$

The same idea applies to higher-order moments (variance, etc.); the statements just involve higher-order terms such as $(x_i - \bar{x})^k$ with $k > 1$. For the second central moment (the variance), for instance, the Law of Large Numbers ensures that the sample variance (which involves the squared differences from the mean) converges to the theoretical variance, again provided the relevant moments exist and the sample is large enough.
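The convergence described above is easy to check numerically. A minimal sketch, assuming draws from an Exponential(1) distribution, whose true mean is $1$:

```python
import random

random.seed(1)

def sample_mean_gap(n):
    """|sample mean - true mean| for n draws from Exp(1), whose true mean is 1."""
    xs = [random.expovariate(1.0) for _ in range(n)]
    return abs(sum(xs) / n - 1.0)

# The gap shrinks as n grows, exactly as the Law of Large Numbers predicts.
for n in (100, 10_000, 1_000_000):
    print(n, sample_mean_gap(n))
```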

al-Hwarizmi
  • Proving that $\frac{1}{n} \sum_i X_i^k$ converges (with probability 1) to $\mathbb E[X^k]$ is in fact extremely easy for $k>1$ once you have it for $k=1$: instead of $X_i$, consider the random variables $Y_i := X_i^k$, and voilà, you get the convergence (with probability 1) of $\frac 1 n \sum_i Y_i = \frac 1 n \sum_i X_i^k$ to $\mathbb E[Y] = \mathbb E[X^k]$. – Thomas Lehéricy May 23 '23 at 19:27

In summary, I think the answer is: we simply do not need the underlying distribution to make use of the sample moments as conventionally defined. One might argue that the sample average is too naive, since it assigns an equal weight of $1/n$ to each observation. Yes, it does seem a little naive, and yet the LLN guarantees it will converge under a pretty wide range of conditions and settings.

So one might ask next: okay, but can we speed up this convergence and get a better estimate using knowledge of $f$? Perhaps. Indeed, we could try to estimate the integral $$\int xf(x)dx$$ using the trapezoid rule. But this requires an equally spaced grid, which our observations don't necessarily come in, so what if we try a weighted average like $$\sum_{i=1}^n x_i f(x_i)?$$ Well, this may or may not improve the quality, but it is a big ask: we usually are not afforded the privilege of knowing $f$ exactly. So we are left with dropping it anyway, and asking about weighted averages like $$Y = \sum_{i=1}^n w_i X_i,$$ where $w_1+\dotsb+w_n=1$ and $w_i\geq0$. This is certainly more like the theoretical definition, but once again we have to start asking questions like: which weights should we use, and why? This can yield real improvements, especially in specific applications, but ultimately the standard sample moments are simpler and require less subjective input.
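A small sketch of the weighted-average idea above (the non-uniform weights here are random, purely to illustrate that any nonnegative weights summing to $1$ define a valid estimator):

```python
import random

random.seed(2)
xs = [random.gauss(5, 2) for _ in range(1000)]  # sample from N(5, 4)
n = len(xs)

# Equal weights 1/n reproduce the ordinary sample mean...
mean_equal = sum(x / n for x in xs)

# ...while any other nonnegative weights summing to 1 give a different,
# equally "legal" weighted estimator; the question is what justifies the choice.
raw = [random.random() for _ in xs]
total = sum(raw)
weights = [r / total for r in raw]
mean_weighted = sum(w * x for w, x in zip(weights, xs))

print(mean_equal, mean_weighted)
```

Both estimators land near the true mean of $5$ here, which is the point: without knowledge of $f$, nothing singles out a weighting scheme that beats the plain $1/n$ average.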


Here are some perspectives to consider that might help you get used to the fact that sample moments are defined this way. These, of course, should not be taken too literally; they are intended to be broad, and perhaps even a little stereotypical.

In probability theory, a typical problem is to be given a probability distribution and asked about the chance of some event, or about some "statistic" (like the mean, median, mode, maximum, etc.). So, if we don't have a distribution, there is not much we can say, since we cannot compute probabilities, expectations, and so on; that is, if we also do not have data...

...but in statistics, we are given observations $x_1,\dotsc, x_n$ and want to ask things like: what is the most likely distribution to have produced this data? If we have a guess about a family $\{f(x;\theta):\theta\in \Theta\}$ of distributions, then we can fit $f$ using methods like MLE to find the "best" estimate for $\theta$. But what if we lack such a guess?

Well, luckily, we can get a lot of information about a sample without knowing anything about its distribution $f(x;\theta)$. For example, suppose you run a trading desk or a betting team. You want to evaluate the performance of your traders, so you look at their winnings and losses over a long time period. You compute the means and see that one trader has a very large positive mean while another has a very large negative mean. Since, by the SLLN, the RV $$\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i \to \mathbb{E}(X) \quad \text{a.s.},$$ we can be confident that these sample estimates are close to the true first moment, provided it exists and the data set is large enough. Note the, perhaps pedantic but important, distinction between the RVs $X_1,\dotsc, X_n$ and $\bar{X}_n$ and the observed variates $x_1,\dotsc, x_n$ and $\bar{x}_n$ (the latter are just plain real numbers).
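A minimal simulation of this trading-desk scenario, with made-up Gaussian P&L streams (the Gaussian choice is an assumption used only to generate data; the sample mean itself never uses that knowledge):

```python
import random

random.seed(3)

# Hypothetical daily P&L for two traders; we, the analysts, never see
# these distribution parameters, only the resulting data.
trader_a = [random.gauss(0.5, 5.0) for _ in range(2000)]   # true mean +0.5
trader_b = [random.gauss(-0.5, 5.0) for _ in range(2000)]  # true mean -0.5

mean_a = sum(trader_a) / len(trader_a)
mean_b = sum(trader_b) / len(trader_b)

# By the SLLN these are close to +0.5 and -0.5, so we can rank the traders
# without ever specifying f.
print(mean_a, mean_b)
```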

All of this can be done without knowledge of what $f$ is, and already you have gained some information about your team. If we needed $f$ to compute sample moments, very little could be done when we are starved for information. It is a great benefit to descriptive statistics to be able to compute sample estimates independent of the underlying distribution of the given observations.


Is it possible to prove that regardless of the underlying probability distribution, the corresponding sample moment will always be the same?

We define the sample $k$-th moment as $$\bar{X^k}_n := \frac{1}{n}\sum_{i=1}^n X_i^k.$$ This expression, as a formula, does not depend on the density of $X_i$. There is no need to prove this because it is simply a definition. However, the distribution of $\bar{X^k}_n$ does depend on the distribution of $X_i$. Indeed, in the simplest case, for $k=1$ and $n=2$, the distribution of $\bar{X}_2$ is related to the convolution of the densities of $X_1$ and $X_2$.

So: 1) the formula is always the same, by definition, but 2) the distribution that a sample moment has, as an RV in its own right, will depend on the underlying distribution of the data we are computing the moment of (and this will matter for confidence intervals, error estimates, lower bounds, etc.). Finally, it is important to realize that distinct samples with different distributions may have the same sample moments, so these quantities do not entirely determine the distribution of an RV. There are many visual examples in machine learning books nowadays, but here is a more toy probability-theoretic example: let $X\sim \mathcal{U}(0, 1)$, which has mean $1/2$, and let $Y\sim \mathcal{N}(1/2, 1)$, which has mean $1/2$ too. But $X$ is almost surely positive while $Y$ can be negative, so they cannot represent the same data; note that the higher moments differ. If all of the theoretical moments match, then, under suitable conditions, there are theorems (e.g. results on the moment problem and the MGF correspondence) showing that $X=Y$ in distribution, i.e. $F_X(a)=F_Y(a)$ for all $a$.
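A quick numerical check of this last example: samples from $\mathcal{U}(0,1)$ and $\mathcal{N}(1/2,1)$ have nearly identical first sample moments but very different second sample moments:

```python
import random

random.seed(4)
n = 100_000
u = [random.random() for _ in range(n)]         # U(0, 1): true mean 1/2
g = [random.gauss(0.5, 1.0) for _ in range(n)]  # N(1/2, 1): true mean 1/2

m1_u = sum(u) / n
m1_g = sum(g) / n
m2_u = sum(x * x for x in u) / n  # true E[U^2] = 1/3
m2_g = sum(x * x for x in g) / n  # true E[G^2] = Var + mean^2 = 5/4

print(m1_u, m1_g)  # both near 0.5
print(m2_u, m2_g)  # clearly different: the first moment alone cannot
                   # distinguish the two distributions
```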

Nap D. Lover