It is frequently stated (in textbooks, on Wikipedia) that the "Law of large numbers" in mathematical probability theory is a statement about relative frequencies of occurrence of an event in a finite number of trials, or that it "relates the axiomatic concept of probability to the statistical concept of frequency". Isn't this a methodological mistake of ascribing to a mathematical term an interpretation that does not at all follow from how the term is mathematically defined, perhaps from relying too much on its colorful name? Recall the typical derivation of the WLLN:
Let $X_1, X_2, \ldots, X_n$ be $n$ independent and identically distributed random variables with finite mean $\mu$ and finite variance $\sigma^2$, and let:
$\overline{X}=\tfrac1n(X_1+\cdots+X_n)$
We have:
$E[\overline{X}] = \frac{E[X_1+\cdots+X_n]}{n} = \frac{E[X_1]+\cdots+E[X_n]}{n} = \frac{n\mu}{n} = \mu$

$Var[\overline{X}] = \frac{Var[X_1+\cdots+X_n]}{n^2} = \frac{Var[X_1]+\cdots+Var[X_n]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$

(the variance computation uses the independence of the $X_i$ in its second step).
And from Chebyshev's inequality:
$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$
And so $\overline{X}$ is said to converge in probability to $\mu$.
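Before questioning the interpretation, here is a minimal simulation sketch (Python with NumPy; all the names and parameters are my own choices) of what the inequality looks like numerically for a fair coin. Note that reading the simulated frequency as an estimate of $P()$ already presupposes the very frequency interpretation questioned below:

```python
import numpy as np

# A minimal sketch (names are mine): estimate P(|X-bar - mu| > eps)
# for a fair coin by Monte Carlo and compare it with the Chebyshev
# bound sigma^2 / (n eps^2).  Reading the simulated frequency as an
# estimate of P() already presupposes the frequency interpretation
# questioned below.

rng = np.random.default_rng(0)
mu, sigma2 = 0.5, 0.25            # mean and variance of a fair-coin indicator
n, eps, trials = 1000, 0.05, 20000

# each trial: the average of n independent fair-coin flips
xbar = rng.binomial(n, 0.5, size=trials) / n
freq = np.mean(np.abs(xbar - mu) > eps)

print(f"simulated frequency of the event: {freq:.4f}")               # ~0.0016
print(f"Chebyshev bound sigma^2/(n eps^2): {sigma2/(n*eps**2):.4f}")  # 0.1000
```

As expected under the usual reading, the Chebyshev bound is loose but correct.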
Now consider what, strictly speaking, this expression means within the axiomatic framework it is derived in:
$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$
$P()$, everywhere it occurs in the derivation, is known only to be a number satisfying Kolmogorov's axioms (a number between 0 and 1, and so forth), but none of the axioms introduce any theoretical counterpart of the intuitive notion of frequency. Without additional assumptions about $P()$, the statement obviously cannot be interpreted at all; just as importantly, the theoretical mean $\mu$ is not necessarily the mean value in an infinite number of trials, $\overline{X}$ is not necessarily the mean value from $n$ trials, and so forth. Consider an experiment of tossing a fair coin repeatedly: nothing in Kolmogorov's axioms forces us to use $1/2$ for the probability of heads; we could just as well use $1/\sqrt{\pi}$, and the derivation would continue to "work", except that the meanings of the various variables would no longer agree with their intuitive interpretations. The $P()$ might still mean something, say a quantification of some absurd belief of mine, and the mathematical derivation remains true regardless, in the sense that as long as the initial $P()$'s satisfy the axioms, theorems about other $P()$'s follow. With Kolmogorov's axioms providing only weak constraints on, and not a definition of, $P()$, it is essentially just symbol manipulation.
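To illustrate, here is a rough sketch (Python with NumPy; the setup is mine, not taken from any source) of what happens if we exercise that freedom and assign $P(\text{heads}) = 1/\sqrt{\pi}$ while the coin itself remains physically fair:

```python
import numpy as np

# Sketch of the point above: the axioms permit assigning
# P(heads) = 1/sqrt(pi) to a coin.  The WLLN derivation then yields a
# perfectly valid Chebyshev bound with mu = 1/sqrt(pi), valid inside
# the model.  A physically fair coin is under no obligation to obey it.

p_model = 1 / np.sqrt(np.pi)          # assigned probability, ~0.5642
mu, sigma2 = p_model, p_model * (1 - p_model)
n, eps = 10_000, 0.01

bound = sigma2 / (n * eps**2)         # the bound produced by the derivation
print(f"model bound on P(|X-bar - {mu:.4f}| > {eps}): {bound:.3f}")

rng = np.random.default_rng(1)
xbar = rng.binomial(n, 0.5) / n       # one run of n physically fair flips
print(f"observed relative frequency: {xbar:.4f}")
# The observed frequency sits near 0.5, far from mu ~ 0.5642: the
# theorem constrains the model's P()'s, not the coin.
```

The derivation is internally consistent either way; only the link to observed frequencies is missing.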
This "relative frequency" interpretation frequently given seems to rest on an additional assumption, and this assumption seems to be a form of the law of large numbers itself. Consider this fragment from Kolmogorov's Grundbegriffe on applying the results of probability theory to the real world:
We apply the theory of probability to the actual world of experiment in the following manner:
...
4) Under certain conditions, which we shall not discuss here, we may assume that the event A which may or may not occur under conditions S, is assigned a real number P(A) which has the following characteristics:
a) One can be practically certain that if the complex of conditions S is repeated a large number of times, n, then if m be the number of occurrences of event A, the ratio m/n will differ very slightly from P(A).
This seems equivalent to introducing the weak law of large numbers, in a particular and slightly different form, as an additional axiom.
Meanwhile, many reputable sources contain statements that seem completely in opposition to the above reasoning, for example Wikipedia:
It follows from the law of large numbers that the empirical probability of success in a series of Bernoulli trials will converge to the theoretical probability. For a Bernoulli random variable, the expected value is the theoretical probability of success, and the average of n such variables (assuming they are independent and identically distributed (i.i.d.)) is precisely the relative frequency.
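(Within the formalism, the identity invoked in the last sentence is indeed trivially true: for Bernoulli variables $X_i \in \{0,1\}$, if $m$ denotes the number of successes in $n$ trials, then $\overline{X} = \tfrac1n(X_1+\cdots+X_n) = m/n$, and $E[X_i] = p$. The contested step is not this identity but the identification of the model's $p$ and $m/n$ with frequencies in actual experiments.)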
The quoted passage seems mistaken already in claiming that anything about empirical probability (whose Wikipedia page defines it as the relative frequency in an actual experiment) can follow from a mathematical theorem, but there are more subtle claims that also seem technically erroneous in light of the above considerations:
The LLN is important because it "guarantees" stable long-term results for the averages of random events.
Note that the Wikipedia article about the LLN claims to be about the mathematical theorem, not about the empirical observation, which historically was also sometimes called the LLN. It seems to me that the LLN does nothing to "guarantee stable long-term results", for, as stated above, those stable long-term results have to be assumed in the first place for the terms occurring in the derivation to have the intuitive meaning we typically ascribe to them, not to mention that something has to be done to interpret $P()$ at all in the first place. Another instance from Wikipedia:
According to the law of large numbers, if a large number of six-sided die are rolled, the average of their values (sometimes called the sample mean) is likely to be close to 3.5, with the precision increasing as more dice are rolled.
Does this really follow from the mathematical theorem? In my opinion, the interpretation of the theorem used here rests on assuming this very fact. There is a particularly vivid example in the "Treatise on Probability" by Keynes of what happens when one follows the WLLN with even a slight deviation from the initial assumption that the p's are the relative frequencies in the limit of an infinite number of trials:
The following example from Czuber will be sufficient for the purpose of illustration. Czuber’s argument is as follows: In the period 1866–1877 there were registered in Austria
m = 4,311,076 male births
n = 4,052,193 female births
s = 8,363,269
for the succeeding period, 1877–1899, we are given only
m' = 6,533,961 male births;
what conclusion can we draw as to the number n of female births? We can conclude, according to Czuber, that the most probable value
n' = nm'/m = 6,141,587
and that there is a probability P = .9999779 that n will lie between the limits 6,118,361 and 6,164,813. It seems in plain opposition to good sense that on such evidence we should be able with practical certainty P = .9999779 = 1 − 1/45250 to estimate the number of female births within such narrow limits. And we see that the conditions laid down in § 11 have been flagrantly neglected. The number of cases, over which the prediction based on Bernoulli’s Theorem is to extend, actually exceeds the number of cases upon which the à priori probability has been based. It may be added that for the period, 1877–1894, the actual value of n did lie between the estimated limits, but that for the period, 1895–1905, it lay outside limits to which the same method had attributed practical certainty.
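For what it is worth, Czuber's point estimate is just proportional scaling and can be re-checked in a few lines (a quick sketch; the small discrepancy with Keynes's printed figure presumably comes from rounding at intermediate steps in the original computation):

```python
# Czuber's point estimate for the number of female births,
# recomputed from the figures quoted by Keynes above.
m, n = 4_311_076, 4_052_193   # male and female births, 1866-1877
m_next = 6_533_961            # male births in the succeeding period

n_next = n * m_next / m       # the "most probable value" n' = n m'/m
print(round(n_next))          # 6141592; Keynes prints 6,141,587
```

The arithmetic is unremarkable; the interesting part is the near-certainty the method attaches to it, which the 1895–1905 data then contradicted.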
Am I mistaken in my reasoning above, or are all of those really mistakes in Wikipedia? I have seen similar statements all over the place in textbooks, and I am honestly wondering what I am missing.