
It is frequently stated (in textbooks, on Wikipedia) that the "Law of large numbers" in mathematical probability theory is a statement about relative frequencies of occurrence of an event in a finite number of trials, or that it "relates the axiomatic concept of probability to the statistical concept of frequency". Isn't this a methodological mistake of ascribing to a mathematical term an interpretation that does not at all follow from how that term is mathematically defined, perhaps relying too much on the colorful language? Recall the typical derivation of the WLLN:

Let $X_1, X_2, \ldots, X_n$ be a sequence of $n$ independent and identically distributed random variables with the same finite mean $\mu$ and finite variance $\sigma^2$, and let:

$\overline{X}=\tfrac1n(X_1+\cdots+X_n)$

We have:

$E[\overline{X}] = \frac{E[X_1+\cdots+X_n]}{n} = \frac{E[X_1]+\cdots+E[X_n]}{n} = \frac{n\mu}{n} = \mu$

$Var[\overline{X}] = \frac{Var[X_1+\cdots+X_n]}{n^2} = \frac{Var[X_1]+\cdots+Var[X_n]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$

And from Chebyshev's inequality:

$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$

And so $\overline{X}$ is said to converge in probability to $\mu$.
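As a quick numerical illustration of the bound, here is a minimal sketch in Python/NumPy, using i.i.d. Exponential(1) draws purely for concreteness; note that reading its output as frequencies already presupposes that the pseudo-random generator behaves like i.i.d. sampling with its nominal distribution, which is exactly the interpretive issue discussed below:

```python
import numpy as np

# Sketch: empirical frequency of {|Xbar - mu| > eps} versus the Chebyshev
# bound sigma^2 / (n * eps^2), for i.i.d. Exponential(1) draws
# (mu = 1, sigma^2 = 1). Any distribution with finite variance would do.
rng = np.random.default_rng(0)
mu, sigma2, eps, reps = 1.0, 1.0, 0.1, 1000
for n in (100, 1000, 10000):
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    freq = np.mean(np.abs(xbar - mu) > eps)
    print(n, freq, sigma2 / (n * eps**2))  # the bound dominates freq
```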

Now consider what is strictly speaking the meaning of this expression in the axiomatic framework it is derived in:

$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$

$P()$, everywhere it occurs in the derivation, is known only to be a number satisfying Kolmogorov's axioms: a number between 0 and 1, and so forth. None of the axioms introduce any theoretical equivalent of the intuitive notion of frequency. If additional assumptions about $P()$ are not made, the sentence obviously cannot be interpreted at all; but, just as importantly, the theoretical mean $\mu$ is not necessarily the mean value in an infinite number of trials, $\overline{X}$ is not necessarily the mean value from $n$ trials, and so forth. Consider an experiment of tossing a fair coin repeatedly: quite obviously, nothing in Kolmogorov's axioms enforces using $1/2$ for the probability of heads, you could just as well use $1/\sqrt{\pi}$, yet the derivation continues to "work", except that the meaning of the various variables is no longer in agreement with their intuitive interpretations. The $P()$ might still mean something (it might be a quantification of an absurd belief of mine), and the mathematical derivation remains true regardless, in the sense that as long as the initial $P()$'s satisfy the axioms, theorems about other $P()$'s follow. With Kolmogorov's axioms providing only weak constraints on, and not a definition of, $P()$, it is basically only symbol manipulation.
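To make the point concrete, here is a minimal sketch (assuming NumPy, and assuming the generator simulates a fair coin): the axioms are equally happy with $P(\text{heads}) = 1/\sqrt{\pi}$, but the sample means of the simulated coin settle near $1/2$, so the $\mu$ of that alternative measure simply is not the mean value from many trials:

```python
import numpy as np

# Sketch: nothing in the axioms forces P(heads) = 1/2. Take the legal but
# absurd measure with mu = 1/sqrt(pi); the derivation above still "works"
# for that mu, yet the running means of a simulated fair coin track 0.5.
rng = np.random.default_rng(1)
mu_belief = 1 / np.sqrt(np.pi)             # ~0.5642, satisfies the axioms
tosses = rng.integers(0, 2, size=100_000)  # simulated fair coin, heads = 1
for n in (100, 10_000, 100_000):
    print(n, tosses[:n].mean(), mu_belief)  # sample mean vs. "belief" mean
```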

This "relative frequency" interpretation frequently given seems to rest on an additional assumption, and this assumption seems to be a form of the law of large numbers itself. Consider this fragment from Kolmogorov's Grundbegriffe on applying the results of probability theory to the real world:

We apply the theory of probability to the actual world of experiment in the following manner:

...

4) Under certain conditions, which we shall not discuss here, we may assume that the event A which may or may not occur under conditions S, is assigned a real number P(A) which has the following characteristics:

a) One can be practically certain that if the complex of conditions S is repeated a large number of times, n, then if m be the number of occurrences of event A, the ratio m/n will differ very slightly from P(A).

This seems equivalent to introducing the weak law of large numbers, in a particular and slightly different form, as an additional axiom.

Meanwhile, many reputable sources contain statements that seem completely in opposition to the above reasoning, for example Wikipedia:

It follows from the law of large numbers that the empirical probability of success in a series of Bernoulli trials will converge to the theoretical probability. For a Bernoulli random variable, the expected value is the theoretical probability of success, and the average of n such variables (assuming they are independent and identically distributed (i.i.d.)) is precisely the relative frequency.

This seems to be mistaken already in claiming that anything about empirical probability (the page for which defines it as the relative frequency in actual experiments) can follow from a mathematical theorem, but there are many more subtle claims that technically also seem erroneous in light of the above considerations:

The LLN is important because it "guarantees" stable long-term results for the averages of random events.

Note that the Wikipedia article about the LLN claims to be about the mathematical theorem, not about the empirical observation, which historically has also sometimes been called the LLN. It seems to me that the LLN does nothing to "guarantee stable long-term results", for as stated above those stable long-term results have to be assumed in the first place for the terms occurring in the derivation to have the intuitive meaning we typically ascribe to them, not to mention that something has to be done to interpret $P()$ at all in the first place. Another instance from Wikipedia:

According to the law of large numbers, if a large number of six-sided die are rolled, the average of their values (sometimes called the sample mean) is likely to be close to 3.5, with the precision increasing as more dice are rolled.
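(For what it is worth, the claimed behaviour is trivial to reproduce in a simulation; a minimal sketch follows. But note that reading its output as evidence already presupposes that the generator's long-run frequencies match its nominal probabilities, which is precisely the link in question.)

```python
import numpy as np

# Sketch: sample means of n simulated fair-die rolls for growing n.
rng = np.random.default_rng(2)
for n in (10, 100, 10_000, 1_000_000):
    print(n, rng.integers(1, 7, size=n).mean())  # tends toward 3.5
```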

Does this really follow from the mathematical theorem? In my opinion, the interpretation of the theorem that is used here rests on assuming this fact. There is a particularly vivid example in the "Treatise on probability" by Keynes of what happens when one follows the WLLN with even a slight deviation from the initial assumption of the $p$'s being the relative frequencies in the limit of an infinite number of trials:

The following example from Czuber will be sufficient for the purpose of illustration. Czuber’s argument is as follows: In the period 1866–1877 there were registered in Austria

m = 4,311,076 male births

n = 4,052,193 female births

s = 8,363,269

for the succeeding period, 1877–1899, we are given only

m' = 6,533,961 male births;

what conclusion can we draw as to the number n of female births? We can conclude, according to Czuber, that the most probable value

n' = nm'/m = 6,141,587

and that there is a probability P = .9999779 that n will lie between the limits 6,118,361 and 6,164,813. It seems in plain opposition to good sense that on such evidence we should be able with practical certainty P = .9999779 = 1 − 1/45250 to estimate the number of female births within such narrow limits. And we see that the conditions laid down in § 11 have been flagrantly neglected. The number of cases, over which the prediction based on Bernoulli’s Theorem is to extend, actually exceeds the number of cases upon which the à priori probability has been based. It may be added that for the period, 1877–1894, the actual value of n did lie between the estimated limits, but that for the period, 1895–1905, it lay outside limits to which the same method had attributed practical certainty.
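(Czuber's point estimate, at least, is plain arithmetic and easy to check; a minimal sketch is below. No attempt is made to reconstruct the probability bound, since the exact method behind it is not given in the excerpt.)

```python
# Sketch: verify Czuber's point estimate n' = n * m' / m from the quoted
# figures. The probability P = .9999779 is not reconstructed here.
m, n = 4_311_076, 4_052_193   # male and female births, 1866-1877
m_prime = 6_533_961           # male births in the succeeding period
n_prime = n * m_prime / m
print(round(n_prime))         # 6141592; Czuber quotes 6,141,587, a tiny
                              # discrepancy present in the original figures
```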

Am I mistaken in my reasoning above, or are all of those really mistakes in Wikipedia? I have seen similar statements all over the place in textbooks, and I am honestly wondering what I am missing.

  • This is a much more concrete version of the question I asked earlier http://math.stackexchange.com/questions/775788/logical-issues-with-the-weak-law-of-large-numbers-and-its-interpretation, and that I would ask the dear moderators to delete, as it was too vague to be useful. Please forgive me partially reposting something, I hope you will understand making a complicated reasoning clear does not always come easy or quickly. I will not post anything similar again. – Jarosław Rzeszótko May 01 '14 at 20:07
  • The law of large numbers is a red herring, I think: you're stuck on the idea of expressing "physical" quantities (such as the result of a frequency-measuring experiment) as random variables. –  May 01 '14 at 21:18
  • You can express a frequency-measuring experiment as $\overline{X}$ as defined above regardless of what $P()$ is, but the moment you take expectations, and multiply the $P()$'s of particular values of the random variable by the actual values, you end up with a statement about what we intuitively think of as the mean value from repetitions of the experiment only with additional assumptions about the $P()$'s that are not in the axioms of Kolmogorov. That is indeed where my disagreement with Wikipedia and its interpretation of the LLN has its roots, but you seem to claim I am simply misunderstanding something here, right? – Jarosław Rzeszótko May 01 '14 at 21:53
  • Better work the example out: what is really the variance? – Willemien May 02 '14 at 06:43
  • @Willemien I am not sure what you mean? – Jarosław Rzeszótko May 02 '14 at 06:58
  • Was studying your question a bit more: your formula for the variance looks wrong; where did you get this formula from? The same applies to Chebyshev's inequality. The correct formula is $P(|\overline{X}-\mu| \ge k\sigma ) \le \frac{1}{k^2}$

    That's all for now; I was thinking maybe your question would be better at the stats SE http://stats.stackexchange.com

    – Willemien May 03 '14 at 00:17
  • Also, a large part of the reason Kolmogorov's axioms came about, and one reason why probability was initially hard to study and attracted so much disagreement, is that there is no formal definition or mode of thought of what determines a probability. The Bayesian view is that it is subjective; the frequentist view is that it is a relative frequency. But no matter which side you take, the axioms are devised to be general attributes that must be true for any probability measure (notice that nowhere in the theory does it say how to determine $P()$). – Kamster Aug 29 '14 at 16:50
  • Defining $P()$ is determined by what mode of thought you follow, so if you are a frequentist, $P()$ would be a function that takes elements of the sample space and returns their relative frequency (i.e. probability in the eyes of a frequentist) – Kamster Aug 29 '14 at 16:51
  • Relative frequency that $X_i=7$ is usually treated with indicator functions $$Y_i = \begin{cases} 1 & \text{if } X_i=7 \\ 0 & \text{else}\end{cases}$$ and the LLN is applied to the i.i.d. random variables $\{Y_i\}_{i=1}^{\infty}$ to get that the empirical frequency that $X_i=7$ converges to $E[Y_i]=P[X_i=7]$ with prob 1. Similarly, if you want the empirical fraction of time that $X_i \in A$ for some set $A$, you define indicator functions $Z_i=1_{\{X_i \in A\}}$ and apply the LLN to those (a minimal sketch follows below). – Michael Mar 28 '22 at 21:13
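A minimal numerical sketch of the indicator construction from the last comment, with the $X_i$ taken uniform on $\{2,\dots,12\}$ purely for illustration:

```python
import numpy as np

# Sketch: Y_i = 1 if X_i = 7 else 0; the sample mean of the Y_i is then
# the empirical frequency of the event {X_i = 7}, to which the LLN applies.
rng = np.random.default_rng(3)
x = rng.integers(2, 13, size=100_000)  # X_i uniform on {2,...,12}, illustrative
y = (x == 7).astype(int)               # indicator variables Y_i
print(y.mean(), 1 / 11)                # empirical frequency vs. P[X_i = 7]
```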

5 Answers

5

I. I agree with you that no version of the Law of Large Numbers tells us anything about real life frequencies, already for the reason that no purely mathematical statement tells us anything about real life at all, without first giving the mathematical objects in it a "real life interpretation" (which can never be stated, let alone "proven", within mathematics itself).

Rather, I think of the LLN as something which, within any useful mathematical model of probabilities and statistical experiments, should hold true! In the sense that: If you show me a new set of axioms for probability theory, which you claim have some use as a model for real life dice rolling etc.; and those axioms do not imply some version of the Law of Large Numbers -- then I would dismiss your axiom system, and I think so should you.


II. Most people would agree there is a real life experiment which we can call "tossing a fair coin" (or "rolling a fair die", "spinning a fair roulette wheel" ...), where we have a clearly defined finite set of outcomes, none of the outcomes is more likely than any other, we can repeat the experiment as many times as we want, and the outcome of the next experiment has nothing to do with any outcome we have so far.

And we could be interested in questions like: Should I play this game where I win/lose this much money in case ... happens? Is it more likely that after a hundred rolls, the added number on the dice is between 370 and 380, or between 345 and 350? Etc.

To gather quantitative insight into answering these questions, we need to model the real life experiment with a mathematical theory. One can debate (but again, such a debate happens outside of mathematics) what such a model could tell us, whether it could tell us something with certainty, whatever that might mean; but most people would agree that it seems we can get some insight here by doing some kind of math.

Indeed, we are looking for two things which only together have any chance to be of use for real life: namely, a "purely" mathematical theory, together with a real life interpretation (like a translation table) thereof, which allows us to perform the routine we (should) always do:

Step 1: Translate our real life question into a question in the mathematical model.

Step 2: Use our math skills to answer the question within the model.

Step 3: Translate that answer back into the real life interpretation.

The axioms of probability, as for example Kolmogorov's, do that: They provide us with a mathematical model which will give out very concrete answers. As with every mathematical model, those concrete answers -- say, $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$ -- are absolutely true within the mathematical theory (foundational issues a la Gödel aside for now). They also come with a standard interpretation (or maybe, a standard set of interpretations, one for each philosophical school). None of these interpretations are justifiable by mathematics itself; and what any result of the theory (like $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$) tells us about our real life experiment is not a mathematical question. It is philosophical, and very much up to debate. Maybe a frequentist would say, this means that if you roll 100 dice again and again (i.e. performing kind of a meta-experiment, where each individual experiment is already 100 "atomic experiments" averaged), then the relative frequency of ... is greater than the relative frequency of ... . Maybe a Bayesian would say, well it means that if you have some money to spare, and somebody gives you the alternative to bet on this or that outcome, you should bet on this, and not that. Etc.
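(Under the usual translation, both numbers are easy to estimate by simulation; here is a minimal sketch, which of course itself leans on a frequentist reading of its own output:)

```python
import numpy as np

# Sketch: Monte Carlo estimates of the two model probabilities compared
# in the text, for the mean of 100 simulated fair-die rolls.
rng = np.random.default_rng(4)
xbar = rng.integers(1, 7, size=(100_000, 100)).mean(axis=1)
p1 = np.mean((xbar >= 3.45) & (xbar <= 3.5))
p2 = np.mean((xbar >= 3.7) & (xbar <= 3.8))
print(p1, p2)  # p1 > p2, as the model asserts
```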


III. Now consider the following statement, which I claim would be accepted by almost everyone:

( $\ast$ ) "If you repeat a real life experiment of the above kind many times, then the sample means should converge to (become a better and better approximation of) the ideal mean".

A frequentist might smirkingly accept ($\ast$), but quip that it is true by definition, because he might claim that any definition of such an "ideal mean" beyond "what the sample means converge to" is meaningless. A Bayesian might explain the "ideal mean" as, well you know, the average -- like if you put it in a histogram, see, here is the centre of weight -- the outcome you would bet on -- you know! And she might be content with that. And she would say, yes, of course that is related to relative frequencies exactly in the sense of ($\ast$).

I want to stress that ($\ast$) is not a mathematical statement. It is a statement about real life experiments, which we claim to be true, although we might not agree on why we do so: depending on your philosophical background, you can see it as a tautology or not, but even if you do it is not a mathematical tautology (it's not a mathematical statement at all), just maybe a philosophical one.

And now let's say we do want a model-plus-translation-table for our experiments from paragraph II. Such a model should contain an object which models [i.e. whose "real life translation" is] one "atomic" experiment: that is the random variable $X$, or to be precise, an infinite collection of i.i.d. random variables $X_1, X_2, ...$.

It contains something which models "the actual sample mean after $100,1000, ..., n$ trials": that is $\bar X_n := \frac{1}{n}\sum_1^n X_i$.

And it contains something which models "an ideal mean": that is $\mu=EX$.

So with that model-plus-translation, we can now formulate, within the model, a statement (or set of related statements) which, under the standard translation, appears to say something akin to ($\ast$).

And that is the (or are the various forms of the) Law of Large Numbers. And they are true within the model, and they can be derived from the axioms of that model.

So I would say: The fact that they hold true e.g. in Kolmogorov's axioms means that these axioms pass one of the most basic tests they should pass: We have a philosophical statement about the real world, ($\ast$), which we believe to be true, and of the various ways we can translate it into the mathematical model, those translations are true in the model. The LLN is not a surprising statement on a meta-mathematical level for the following reason: Any kind of model for probability which, when used as a model for the above real life experiment, would not give out a result which is the mathematical analogue of statement ($\ast$), should be thrown out!

In other words: Of course good probability axioms give out the Law of Large Numbers. They are made so that they give them out. If somebody proposed a set of mathematical axioms, and a real-life-translation-guideline for the objects in there, and any model-internal version of ($\ast$) would be wrong -- then that model should be deemed useless (both by frequentists and Bayesians, just for different reasons) to model the above real life experiments.


IV. I want to finish by pointing out one instance where your argument seems contradictory, which, when exposed, might make what I write above more plausible to you.

Let me simplify an argument of yours like this:

(A) A mathematical statement like the LLN in itself can never make any statement about real life frequencies.

(B) Many sources claim that LLN does make statements about real life frequencies. So they must be implicitly assuming more.

(C) As an example, you exhibit a Kolmogorov quote about applying probability theory to the real world, and say that it "seems equivalent to introducing the weak law of large numbers in a particular, slightly different form, as an additional axiom."

I agree with (A) and (B). But (C) is where I want you to pause and think: Were we not in agreement, cf. (A), that no mathematical statement can ever tell us something about real life frequencies? Then what kind of "additional axiom" would say that? Whatever the otherwise mistaken sources in (B) are implicitly assuming, and Kolmogorov himself talks about in (C), it cannot just be an "additional axiom", at least not a mathematical one: Because one can throw in as many mathematical axioms as one wants, they will never bridge the fundamental gap in (A).

I claim the thing that all the sources in (B) are implicitly assuming, and what Kolmogorov talks about in (C), is not an additional axiom within the mathematical theory. It is the meta-mathematical translation / interpretation that I talk about above, which in itself is not mathematical, and in particular cannot be introduced as an additional axiom within the theory.

I claim, indeed, most sources are very careless, in that they totally forget the translation / interpretation part between real life and mathematical model, i.e. the bridge we need to cross the gap in (A); i.e. steps 1 and 3 in the routine explained in paragraph II. Of course it is taught in any beginner's class that any model in itself (i.e. without a translation, without steps 1 and 3) is useless, but it is commonly forgotten already in the non-statistical sciences, and more so in statistics, which leads to all kinds of confusion. We spend so much time and effort on step 2 that we often forget steps 1 and 3; also, step 2 can be taught and learned and put on exams, but steps 1 and 3 not so well: they go beyond mathematics and seem to fit better into a science or philosophy class (although I doubt they get a good enough treatment there either). However, if we forget them, we are left with a bunch of axioms linking together almost meaningless symbols; and the remnants of meaning which we, as humans, cannot help applying to these symbols, quickly seem to be nothing but circular arguments.

  • your distinction between "mathematical model" and "physical object / process" is the key point! (+1) – G Cab Mar 28 '22 at 18:58
3

I understand OP's concern and I want to illustrate it by an example from geometry.

Pythagorean theorem. Let $V$ be a two-dimensional real vector space with inner product $\langle \cdot, \cdot \rangle$ and induced norm $\| \cdot \|$. In linear algebra we learn that $\langle a,b \rangle = 0$ implies $\| a \|^2 + \| b \|^2 = \| a + b \|^2$, and this result is called the Pythagorean theorem. This is Step 2 of the routine in Torsten's answer [1]. Clearly, we would like to know what the connection of this Pythagorean theorem is with right-angled triangles drawn on a real sheet of paper. So we need to think about Steps 1 and 3 of the routine. Generations of students have drawn right-angled triangles and measured the lengths of the legs $a,b$ and the hypotenuse $c$, arriving at the identity $a^2+b^2=c^2$ to some degree of precision. Based on the amount of data, it is plausible to assume that there exists an empirical Pythagorean theorem in the real world. Now we can use the empirical Pythagorean theorem and the standard interpretation (using a rectangular coordinate system) to 'identify' $\| a \|$ with the length of $a$. In this way we obtain an interpretation of the Pythagorean theorem (in $V$) in terms of lengths. Doesn't it then feel wrong to say that the empirical Pythagorean theorem is a consequence of the Pythagorean theorem (in $V$) under the above interpretation?
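(For concreteness, the Step 2 statement is easy to instantiate numerically; a trivial sketch, which of course says nothing yet about triangles on paper:)

```python
import numpy as np

# Sketch: a numerical instance of the linear-algebra Pythagorean theorem,
# with the standard inner product on R^2.
a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])
assert np.dot(a, b) == 0            # <a, b> = 0
print(np.dot(a, a) + np.dot(b, b))  # ||a||^2 + ||b||^2 = 25.0
print(np.dot(a + b, a + b))         # ||a + b||^2       = 25.0
```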

There is an empirical result often called the empirical law of large numbers or the stability of frequencies, which states, for example, that the relative frequencies of heads in a long sequence of coin tosses 'converge' to some value $p$. In my opinion it is this empirical law which Kolmogorov refers to in the excerpt cited by OP. Afterwards OP argues that since we are using stability of frequencies to interpret probabilities and thereby the law of large numbers (LLN), it feels wrong to say that LLN 'guarantees' stability of frequencies.

I agree that it seems unnecessary to say that LLN is responsible for stability of frequencies whenever stability of frequencies is used to provide an interpretation.


However, stability of frequencies only looks good on paper. Since we cannot make infinitely many observations, this empirical law isn't of much help for determining probabilities in practice. On top of that many problems of practical interest are not as reproducible as a coin toss. I am not a probabilist or statistician, so from here on I have to rely on the opinion of other people.

First of all, let me quote Mark Kac. OP has provided a short excerpt in [4]. Here is how Kac continues.

The applicability of such a theory [probability theory] to natural sciences must ultimately be tested by an experiment. But this is true of all mathematical theories when applied outside the realm of mathematics, and the vague feeling of discomfort one encounters (mostly among philosophers!) when first subjected to statistical reasoning must be attributed to the relative novelty of the ideas.

To me there is no methodological distinction between the applicability of differential equations to astronomy and of probability theory to thermodynamics or quantum mechanics.

It works! And brutally pragmatic as this point of view is, no better substitute has been found. ([2], p.5)

What Kac suggests is a more pragmatic point of view, that of a physicist. Let me quote Krzysztof Burdzy.

[Compared with mathematicians,] physicists have a different idea of a 'proof' – you start with a large number of unrelated assumptions, you combine them into a single prediction, and you check if the prediction agrees with the observed data. If the agreement is within 20%, you call the assumptions proved. ([3], p.41)

I think that Burdzy has made up the figure of 20%, but I am not a physicist. More importantly, we can apply probability theory (including LLN) with any assumption that we deem fit, which makes stability of frequencies in some sense obsolete. As long as we can produce predictions that can be tested, we don't have to worry about the 'vague' link between the model and the real world. Over time and by doing a lot of experiments, we acquire a certain confidence in our claims (if they agree with the observations) and then they become accepted by the math/science community.

All of this is rather difficult to comprehend for a beginner in probability/statistics. In a first probability course, the statistical tools that are needed for the predictions only enter very late or not at all, which may be the reason why students don't see this 'pragmatic' approach to applied probability/statistics. On the other hand, stability of frequencies may still be useful for gaining intuition.

A layman (like myself) gets lost easily in the big philosophical frequentist vs. Bayesian debate(s). As a mathematician, I can accept that the only definition of probability appears in the Kolmogorov axioms and I don't need to know its 'true' meaning in order to learn and apply the theory. My goal in writing this was to provide some consolation for a specific group of people (including myself), i.e. those who have gone through a similar thought process as OP.


[1] Torsten Schoeneberg (Aug 19, 2021)

[2] Mark Kac, "Probability and related topics in physical sciences"

[3] Krzysztof Burdzy, "The search for certainty: on the clash of science and philosophy of probability" (suggested by Bjørn Kjos-Hanssen in [4])

[4] Logical issues with the weak law of large numbers and its interpretation

[5] Is probability and the Law of Large Numbers a huge circular argument?

kingddd
1

You are correct. The Law of Large Numbers does not actually say as much as we would like to believe. Confusion arises because we try to ascribe too much philosophical importance to it. There is a reason that the Wikipedia article puts quotes around 'guarantees' because nobody actually believes that some formal theory (on its own) guarantees anything about the real world. All LLN says is that some notion of probability, without interpretation, approaches 1 -- nothing more, nothing less. It certainly doesn't prove for a fact that relative frequency approaches some probability (what probability?). The key to understanding this is to note that the LLN, as you pointed out, actually uses the term P() in its own statement. I will use this version of the LLN:

"The probability of a particular sampling's frequency distribution resembling the actual probability distribution (to a degree) as it gets large approaches 1."

Interpreting "probability" in the frequentist sense, it becomes this:

Interpret "actual probability distribution": "Suppose that as we take larger samples, they converge to a particular relative frequency distribution..."

Interpret the statement: "... Now if we were given enough instances of n-numbered samplings, the ratio of those that closely resemble (within $\epsilon$) the original frequency distribution vs. those that don't approaches 1 to 0. That is, the relative frequency of the 'correct' instances converges to 1 as you raise both n and the number of instances."

You can imagine it like a table. Suppose for example that our coin has T-H with 50-50 relative frequency. Each row is a sequence of coin tosses (a sampling), and there are several rows -- you're kind of doing several samples in parallel. Now add more columns, i.e. add more tosses to each sequence, and add more rows, increasing the amount of sequences themselves. As we do so, count the number of rows which have a near 50-50 frequency distribution (within some $\epsilon$), and divide by the total number of rows. This number should certainly approach 1, according to the theorem.
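A minimal simulation of this table (a sketch assuming NumPy; the row count is held fixed here since it only sharpens the estimate, while the growing column count is what drives the fraction toward 1):

```python
import numpy as np

# Sketch of the "table": each row is a sequence of simulated fair-coin
# tosses; count the fraction of rows whose heads-frequency lies within
# eps of the assumed 50-50 split.
rng = np.random.default_rng(5)
eps, n_rows = 0.02, 1000
for n_cols in (100, 1000, 10_000):
    table = rng.integers(0, 2, size=(n_rows, n_cols))
    near_half = np.abs(table.mean(axis=1) - 0.5) < eps
    print(n_cols, near_half.mean())  # fraction of 'correct' rows -> 1
```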

Now some might find this fact very surprising or insightful, and that's pretty much what's causing the whole confusion in the first place. It shouldn't be surprising, because if you look closely at our frequentist interpretation example, we assumed "Suppose for example that our coin has T-H with 50-50 relative frequency." In other words, we have already assumed that any particular sequence of tossings will, with logical certainty, approach a 50-50 frequency split. So it should not be surprising when we say with logical certainty that a progressively larger proportion of these tossing-sequences will resemble 50-50 splits if we toss more in each and recruit more tossers. It's almost a rephrasing of the original assumption, but at a meta-level (we're talking about samples of samples).

So this certainty about the real world (interpreted LLN) only comes from another, assumed certainty about the real world (interpretation of probability).

First of all, with a frequentist interpretation, it is not the LLN that states that a sample will approach the relative frequency distribution -- it's the frequentist interpretation/definition of $P()$ that says this. It sure is easy to think that, though, if we interpret the whole thing inconsistently -- i.e. if we lazily interpret the outer "probability that ... approaches 1" to mean "... approaches certainty" in LLN but leave the inner statement "relative frequency dist. resembles probability dist." up to (different) interpretation. Then of course you get "relative frequency dist. resembles probability dist. in the limit". It's kind of like if you have a limit of an integral of an integral, but you delete the outer integral and apply the limit to the inner integral.

Interestingly, if you interpret probability as a measure of belief, you might get something that sounds less trivial than the frequentist's version: "The degree of belief in 'any sample reflects actual belief measures in its relative frequencies within $\epsilon$ error' approaches certainty as we choose bigger samples." However this is still different from "Samples, as they get larger, approach actual belief measures in their relative frequencies." As an illustration, imagine if you have two sequences $f_n$ and $p_n$. I am sure you can appreciate the difference between $\lim_{n \to \infty} P(|f_n - p_n| < \epsilon) = 1$ and $\lim_{n \to \infty} |f_n - p_n| = 0$. The latter implies $\lim_{n \to \infty} f_n = \lim_{n \to \infty} p_n$ (or $=p$, taking $p_n$ to be a constant for simplicity), whereas this is not true for the former. The latter is a very powerful statement, and probability theory cannot prove it, as you suspected.

In fact, you were on the right track with the "absurd belief" argument. Suppose that probability theory were indeed capable of proving this amazing theorem, that "a sample's relative frequency approaches the probability distribution". However, as you've found, there are several interpretations for probability which conflict with each other. To borrow terminology from mathematical logic: you've essentially found two models of probability theory; one satisfies the statement "the rel. frequency distribution approaches $1/2 : 1/2$", and another satisfies the statement "the rel. frequency distribution approaches $1/\pi : (1-1/\pi)$". So the statement "frequency approaches probability" is neither true nor false: it is independent as either one is consistent with the theory. Thus, Kolmogorov's probability theory is not powerful enough to prove a statement in the form "frequency approaches probability". (Now, if you were to force the issue by saying "probability should equal relative frequency" you've essentially trivialized the issue by baking frequentism into the theory. The only possible model for this probability theory would be frequentism or something isomorphic to it, and the statement becomes obvious.)

Luke
1

Kolmogorov's axioms, if one were to make an assumption about the distribution of the random variable $X_i$, could be used to derive the distribution of the random variable $\bar{X}$. Notice in the last statement that since $X_i$ is a random variable, $\bar{X}$ is also a random variable. The fact that $\bar{X}$ is a random variable means that there is a probability measure for $\bar{X}$. The beauty of the WLLN is that so long as both $\mu$ and $\sigma^2$ are finite, no further assumptions about the measure $P()$ must be made in order to derive that $\bar{X}_n$ converges in probability to $\mu$. I agree with Hurkyl. Perhaps this post will help with the concept of a random variable https://stats.stackexchange.com/questions/50/what-is-meant-by-a-random-variable

You do make a good point, however, that the assumption that the $X$'s are independent and identically distributed random variables may not be true in practice, which is the problem alluded to in the Keynes example.

The example regarding dice appears to rely on the assumption that the die is fair, which may or may not be reasonable depending on how the die is constructed and rolled. However, it seems reasonable to assume that there exist appropriate setups of dice-rolling experiments for which the rolls are $i.i.d.$ random variables with a probability measure $P$. In such a case, it does follow from the WLLN that $\bar{X}$ would indeed converge in probability to $\mu$.

jsk
  • I have no doubts that $\overline{X}$ converges to $\mu$ in the framework of Kolmogorov's axioms, but the question is whether this allows one to draw any interpretable conclusions. Based only on the axioms, $\mu$ and $\overline{X}$ are not interpretable as the average value from a large number of trials; they are simply weighted averages of some set of values using the (to some extent arbitrary) measure $P()$. Similarly, when "relative frequency" is mentioned in the context of the theory, I think it does not really translate into real world relative frequency, unless the WLLN is assumed as true a priori. – Jarosław Rzeszótko May 03 '14 at 13:04
  • In other words, it seems to me people widely fail to notice that when "relative frequency" is spoken of in the context of probability theory, it only corresponds to our intuitive notion of "relative frequency" if one makes assumptions additional to Kolmogorov's axioms, and the assumption needed is the WLLN itself. Hence no conclusions about real world situations follow purely from the WLLN as derived from the axioms. – Jarosław Rzeszótko May 03 '14 at 13:05
  • By the way, the trials in the Keynes examples are independent and identically distributed, the problem is that the probabilities are slightly off from the ideal theoretical relative frequency in an infinite limit of trials. While such P()'s satisfy the axioms, and the formal mathematics stays "true", you see that the result does not seem to be true anymore, and that is because the intuitive interpretation of the various terms in the derivation does not hold anymore. This example shows the WLLN has to be assumed a priori for the usual real world interpretation to hold. – Jarosław Rzeszótko May 03 '14 at 13:19
  • Based on the axioms, $\mu$, the expected value, is calculated mathematically and is not based on a relative frequency argument. You are correct that the large-trial interpretation of $\mu$, that it is the average of a large number of trials, would make the WLLN circular, but I would argue that rests solely on that particular interpretation of $\mu$. – jsk May 03 '14 at 16:23
  • In regards to the relative frequency interpretation of probability, that is again only the frequentist interpretation of probability. There is nothing in Kolmogorov's axioms which states that you should invoke the relative frequency interpretation of probability. – jsk May 03 '14 at 16:27
  • In the Keynes example, one is making an assumption that the actual births are independent and identically distributed. We have no evidence this is true. Also note that there is no way to know the actual number of male and female births. You can only know the reported number of births. http://en.wikipedia.org/wiki/Human_sex_ratio – jsk May 03 '14 at 16:38
  • @JarosławRzeszótko The assumption needed is not the WLLN itself, but that each random variable is independent and identically distributed. In practice, it is difficult to come up with situations in which variables can be constructed to be iid. Rolling dice and flipping coins are two examples for which under real world circumstances the individual rolls or flips may be treated as iid. – jsk May 03 '14 at 16:43
  • I think we are pretty much in agreement in the end. Kolmogorov's axioms are "OK", and there is no circularity in the theorems, but they also do not have any real world interpretation until more assumptions are made. I agree that it is only under the frequency interpretation that the assumption to be made overlaps with the WLLN. The problem is that people invoke Kolmogorov's axioms, but then use frequentist notions to give an interpretation to the derivations, and then they forget about those frequentist assumptions and claim things follow directly from the axioms. – Jarosław Rzeszótko May 03 '14 at 16:59
  • This then results in misinterpretation of what follows from the theorem, have another read of those Wikipedia quotations with an eye on what we have just discussed. – Jarosław Rzeszótko May 03 '14 at 17:01
  • @JarosławRzeszótko The bernoulli wikipedia example makes the assumption clear. The die example does not explicitly state the rolls are treated as iid, though I would argue that 3.5 is meant to be the theoretical quantity calculated for $E(X)$ assuming a fair die. – jsk May 03 '14 at 17:29
  • It's not only the i.i.d. assumption; that one still leaves you in the Kolmogorov framework, where the statements derived do not have any real world interpretation. The moment you want to interpret what is called "relative frequencies" in the purely mathematical framework as relative frequencies in the real world, you have to assume the LLN as an axiom; otherwise the expected values of random variables in the derivations are products of interpretation-less $P()$'s and values of the variable, which cannot be translated into meaningful statements about relative frequencies. – Jarosław Rzeszótko May 03 '14 at 18:24
  • More clearly: for the very notion of frequency to appear in the mathematical framework, you have in the first place to interpret the probabilities as relative frequencies, and this forces you to adopt the LLN as an axiom, for if the relative frequencies do not tend to a fixed limit, there is no possibility to assign fixed real numbers as probabilities. I do not see how "relative frequencies" could be a meaningful concept with only Kolmogorov's axioms given; it is then a name that bears no relation to what we normally think of as relative frequencies. – Jarosław Rzeszótko May 03 '14 at 18:28
  • Note the wikipedia speaks about empirical probabilities, not theoretical ones, so we are discussing what assumptions are necessary to get an interpretable "version" of the LLN. – Jarosław Rzeszótko May 03 '14 at 18:31
  • The wikipedia entry claims the empirical probability converges to the theoretical one.... This is saying that the empirical probability $\hat{p} = \sum X_i/n = \bar{X}$ converges to $p$, where $p$ is the assumed probability of success of each $X_i$. – jsk May 03 '14 at 19:39
  • The word "empirical probability" in this fragment is linked to the Wikipedia page of empirical probability, which defines it to be the relative frequency as measured in physical experiment. While I know it is common practice to call the purely theoretical relative frequency also the "empirical probability", it is also a very confusing one, even more so when stated in an encyclopedia, without indication that there is no intrinsic relation to reality present. – Jarosław Rzeszótko May 03 '14 at 19:46
  • The mathematical framework is pure and correct as written. Why sully its beauty by mixing it with the ugliness of the real world? The LLN is interpretable, it just may not be applicable to situations that are not iid. There is however a form for the non-iid case http://www.stat.duke.edu/courses/Fall11/sta205/lec/wk-07.pdf – jsk May 03 '14 at 19:47
  • Basically, using names with very suggestive meaning for concepts that are logically devoid of this meaning in the mathematical theory, makes a huge fraction of people assume that those mathematical statements itself say something about reality, which they do not. – Jarosław Rzeszótko May 03 '14 at 19:48
  • And for getting a real world interpretation of the LLN, you must assume the LLN itself. I mean, it only has an intuitive interpretation in the frequentist framework, and there you have to assume all relative frequencies tend to a limit, or you cannot assign real numbers to probabilities. – Jarosław Rzeszótko May 03 '14 at 19:49
  • I'm not following. $\sum X_i$ is the total number of times the outcomes occurred in $n$ experiments. Thus, $\sum X_i/n$ is the sample proportion, also known as the empirical probability. – jsk May 03 '14 at 19:49
  • I know it is defined this way sometimes, but Wikipedia defines it as follows: http://en.wikipedia.org/wiki/Empirical_probability – Jarosław Rzeszótko May 03 '14 at 19:50
  • Their definition is the same as mine... Look at the first line of the second paragraph... "In statistical terms, the empirical probability is an estimate or estimator of a probability." – jsk May 03 '14 at 19:51
  • Either you stay in Kolmogorov's world, where the LLN is derivable, and the "theoretical empirical mean" (oh my god) tends to $\mu$, but neither $\mu$ nor the "mean" can be known to have the meaning we ordinarily ascribe to them; or, to interpret it meaningfully, you choose a frequentist interpretation, which requires the $P()$'s to be real world limits of relative frequencies in a potentially infinite number of trials, so that you are forced to make the LLN an additional axiom. – Jarosław Rzeszótko May 03 '14 at 19:55
  • I'm letting $X_i=1$ if the event occurs, $X_i=0$ otherwise, thus $\sum X_i$ is the number of times the outcome occurred and my definition is the exact same as "the ratio of the number of outcomes in which a specified event occurs to the total number of trials" – jsk May 03 '14 at 19:55
  • But what you write formally as $\sum X_i/n$ can only be interpreted as the "really empirical" frequency if you choose a frequentist interpretation, i.e. add the LLN as an axiom. – Jarosław Rzeszótko May 03 '14 at 19:58
  • Otherwise you multiply P()s of unknown correspondence to the real world by 1's or 0's (when taking expectation), how does that count the number of occurrences? – Jarosław Rzeszótko May 03 '14 at 19:59
  • No, what I wrote is the sum of random variables. These $X_i$'s are random and have not yet occurred. They are not the realized values from the experiment until the experiment has been performed. This is analogous to the difference between an estimator and an estimate. $\bar{X}$ is a random variable, $\bar{x}$ is the realized value, the estimate from a series of n experiments. – jsk May 03 '14 at 20:06
  • I do now have a reference for this too:

    "It is also odd, if we begin with frequency as the definition of probability, that we should then expend great effort to prove the law of large numbers the theorem that the probability of an event will almost certainly be approximated by the event's relative frequency. This was seen as a real problem by the frequentists of the nineteenth century (Porter 1986)."

    (The following paragraphs are also very interesting, too little space here)

    http://www.glennshafer.com/assets/downloads/articles/article46.pdf

    – Jarosław Rzeszótko May 03 '14 at 20:08
  • Yes, but when you take expectations of this random variable, you use the assumption that probabilities are relative frequencies, to count the mean number of occurrences over the whole sample space, for example. – Jarosław Rzeszótko May 03 '14 at 20:10
  • I guess this means you are like the frequentists of the nineteenth century? Why don't you take a more relaxed attitude as the author suggests that most frequentists do these days? – jsk May 03 '14 at 20:18
  • In other words: the method of indicator random variables does not do what one would think it does if P()'s are not relative frequencies. What is, for example, the real world meaning of my subjective belief of heads in the first trial multiplied by 1, added to my subjective belief of heads in the second trial multiplied by 1? – Jarosław Rzeszótko May 03 '14 at 20:20
  • Did you miss this line "Frequency is the definition of probability in practice, they say, but it is convenient in the purely mathematical theory to take probability as a primitive idea and to prove the law of large numbers as a theorem." – jsk May 03 '14 at 20:21
  • Hey, I am not criticizing the foundations of probability theory here, I just think that due to the suggestive terminology people fail to see those subtle distinctions and make logically erroneous conclusions about what follows from what. This is important, for the axioms / assumptions have to be checked against the real world, while the theorems follow by logic alone from those axioms. – Jarosław Rzeszótko May 03 '14 at 20:23
  • I did not miss it, but the theorem derived this way can not be interpreted in any way. – Jarosław Rzeszótko May 03 '14 at 20:23
  • It has as much bearing on the real world as a theorem from group theory. – Jarosław Rzeszótko May 03 '14 at 20:24
  • I agree that checking that the axioms and assumptions coincide with the real world is important in practice. I do not agree that people misunderstanding or misinterpreting results established from a mathematical framework as reason to abandon the foundations. – jsk May 03 '14 at 20:28
  • What is the value of a theorem that for all practical purposes has to be assumed? It shows consistency of the theorem with the axioms, which is nice, it gives quantitative bounds, which is also nice, but many of the other real world conclusions that are said to hold in a great many places are not in fact conclusions from the axioms, but facts to be empirically verified. This is the point I try to stress. – Jarosław Rzeszótko May 03 '14 at 20:28
  • It's sad to hear you take this stance. There are so many useful results from probability theory and statistics. – jsk May 03 '14 at 20:30
  • I am not asking anyone to abandon the foundations. I read somewhere that probability theory in almost its modern axiomatic form was supposed to be called "valence theory", for it is more general than probability theory (so it was written). Maybe that would be nice, it would avoid a lot of misunderstanding. Speaking of theoretical entities as "empirical", also not so nice. But in the end I am not proposing anything, it just took me weeks literally to get to the conclusions we are now discussing. I wish it was all presented more clearly in textbooks, that is all. – Jarosław Rzeszótko May 03 '14 at 20:31
  • I am not denying the usefulness of probability theory! It's about delineating what follows purely from mathematical theorem, and what has additionally to be verified empirically. – Jarosław Rzeszótko May 03 '14 at 20:32
  • Then you should perhaps look into statistics. – jsk May 03 '14 at 20:40
  • I don't think we really disagree much about anything in the end. Thank you for the discussion. – Jarosław Rzeszótko May 03 '14 at 20:44
  • Indeed. Thank you as well. – jsk May 03 '14 at 20:52
-1

What you're missing is that the derivation of the WLLN is allowed to use, not only the Kolmogorov axioms, but also the assumption stated in the theorem: "The $X_1,X_2,\dots,X_n$ are a sequence of $n$ independent and identically distributed random variables with the same finite mean μ, and with variance $σ^2$". So, for example, if we are tossing a fair coin, we know that μ=1/2 (this is what "fair coin" means in probability theory), not $1/\sqrt\pi$. And likewise, in a Bernoulli trial, we are given the actual mean to which the observed probabilities are supposed to converge. And Keynes/Czuber's example isn't a valid application of the LLN because we are not given the actual mean and standard deviation.

So the first two claims in the Wikipedia article are basically correct (except that "will converge to the theoretical probability" should read "will converge in probability to the theoretical probability"; the probability that the observed values do not converge to the theoretical value is 0; but it might happen anyway).

However, the third claim, "According to the law of large numbers, if a large number of six-sided die are rolled, the average of their values (sometimes called the sample mean) is likely to be close to 3.5, with the precision increasing as more dice are rolled." doesn't follow, since we don't know a priori that rolling a six-sided die constitutes a Bernoulli trial. Looking at the context, it seems that the fairness of the die is meant as an ambient assumption, since one of the preceding sentences is "For example, a single roll of a six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability."