
I was trying to understand why it is mathematically justified to multiply edge probabilities in a tree diagram, and I came across the following question:

Why do we multiply in tree diagrams?

The second answer is nearly exactly what I needed; however, there are some things that I still don't understand about it, or perhaps some things about the answer that are not mathematically rigorous enough for me (or trivial enough for the poster that the details were omitted), especially in the context of viewing probability through set theory. That is what I wish to address: how to treat probability tree diagrams rigorously with set theory, and thus calculate probabilities on trees in a mathematically justified way.

The main issue that I have is that leaves/outcomes are specified with set notation in a sloppy way, which has led to weird justifications for calculating probabilities in tree diagrams. I will try to address what I think the issues are in detail, within the framework of set theory, to make sure that everything is precisely and clearly defined.

The exact issue that I am having is with the notation $\cup$ and $\cap$ being used to describe probabilistic statements. In high school we are taught to think of these as ANDs and ORs. I wish to abandon that mentality (since I think it's one of the reasons for my confusion) and be extremely precise about the use of $\cap$ and $\cup$. Intersection and union are operations that apply only to sets. I will use them in that way and wish to address their correct use in probability theory.

First, let's try to define "outcome" and "event" precisely and see how they relate to tree diagrams.

An outcome normally means a specific way of specifying the result of an experiment. For example, in the Monty Hall problem we can specify the outcome of the experiment by the following triplet:

  • outcome = (car location, player's initial guess, door revealed by the host).

i.e. an outcome is fully specified when we specify the location where the car actually is, what the player's initial guess is, and the door that was revealed by the host. This results in the following tree diagram:

[tree diagram: car location → player's initial guess → door revealed, with branches labeled $A$, $B$, $C$]

(which I got from MIT's course Mathematics for Computer Science, 6.042). As can be seen, the leaves of the tree are the outcomes, and together the leaves form the whole sample space $S$. In these terms the sample space is the set of triples:

[image: the sample space $S$ written out as a set of triples]

Now, an event is a subset of this sample space, i.e. a choice of a subset of the leaves.

The issue that I have is that I have seen the leaves of such a tree diagram denoted as $(A \cap A \cap B)$ (for the first one in my example) instead of $(A, A, B)$. For me, these two are not the same. The second is just a triplet, a sequence that acts as an "index" specifying a particular outcome in the sample space (an element of the set $S$). The notation with intersection (i.e. $A \cap A \cap B$) tries to specify a leaf, but it seems plain wrong and confusing to me (or a horrible abuse of notation? not sure...). Let me justify why I think it's an incorrect way to specify a leaf:

  1. Firstly, it is not clear to me what $(A \cap A \cap B)$ even means. For me, it just denotes the empty set, because intersection should only be applied to sets, and it is not clear which sets $A$ and $B$ are supposed to be, let alone what their intersection is.
  2. Secondly, even if you try to "repair" the first issue by insisting that the first, second, and third positions are simply events and that taking intersections of them is valid, problems remain. That is, reading $(A \cap A \cap B)$ as an intersection of "events" is still wrong, I believe; that solution only brings further problems/questions. First, if $A$ is now an event, then what exactly is it a subset of? (Since that's what an event is. If you are trying to use set notation to denote things, you had better specify what the sets/subsets are.) How do we re-define the sample space so that this notation for a leaf is justified? If we could do this, then (maybe) the justification explained in the question I posted might be valid (with further justification).
  3. If you try to use set notation to specify a leaf, it seems to me that the correct way to do it is with unions, not intersections. The reason is that this would actually capture the correct meaning of what a triplet specifies (and avoid the empty-set issue from my first point). However, since the order of the elements of a set "doesn't matter," and because the triplets are sequences (where the order does matter), the way to fix this new problem I have introduced by using unions is to put a subscript indicating the position in the triplet (in effect defining a bijection), i.e. the outcome $(A, A, B)$ corresponds to $A_1 \cup A_2 \cup B_3$. Anyway, adopting this definition doesn't help much, because it's not clear to me how to use the general chain rule of probability to justify the probability of a leaf.

Basically, how do you rigorously justify using the chain rule of probability to calculate the probability of a single outcome in a probability tree diagram?

2 Answers


You are right that $(A\cap A\cap B)$ is a nonsensical way to write the event $(A,A,B)$ in the tree diagram you presented. After all, $A\cap A\cap B = B\cap A\cap A = A\cap B,$ so it's really unclear which of the diagram's outcomes is being expressed.

In your proposed notation $(A_1\cup A_2\cup B_3),$ you fix one of the problems in the previous notation, namely that we cannot tell whether $A$ indicates taking branch $A$ at the first node or at the second node. But before we jump to a notation combining $A_1,$ $A_2,$ and $B_3,$ it pays to examine what each of the symbols $A_1,$ $A_2,$ and $B_3$ means.

A good model of probability is that all events are sets. In order to not get confused by the multiple ways the letter $A$ was used in the diagram, I'm just going to number the leaves of the tree with the numbers $1,2,3,\ldots, 12$ in sequence from the top to the bottom of the diagram, so that for example leaf $4$ is the one labeled $(A,C,B)$ in the diagram. Then $\{4\}$ is an event (namely, the event that that unique outcome occurs), but $\{1,2\}$ is also an event and so is $\{2,5,6,11\}$ or any other arbitrary subset of the twelve unique outcomes.

I would then understand $A_1$ to be the event in which the car's location is $A,$ with no other restriction on the player's choice or the door that is opened. That is, $A_1 = \{1,2,3,4\},$ all the outcomes you can get to by following the first branch. But $B_3$ is the event in which the door revealed is $B,$ namely, $B_3 = \{1,4,9,12\},$ that is, the set containing any outcome you can get to by following a path whose third step is labeled $B.$ And $A_2$ is the event in which the player chooses $A,$ that is, $A_2 = \{1,2,5,9\}.$

A good interpretation of the notation $A_1\cup A_2\cup B_3$ then is the union of those three sets I just described, namely, $A_1\cup A_2\cup B_3 = \{1,2,3,4,5,9,12\}.$ This is a legitimate event, but I doubt it is what you were looking for.

But let's try taking the intersection of $A_1 = \{1,2,3,4\},$ $A_2 = \{1,2,5,9\},$ and $B_3 = \{1,4,9,12\}.$ There is only one element that is in all three of those sets; $A_1\cap A_2\cap B_3 = \{1\}.$

That is why it makes sense to use set intersection to denote unique outcomes of this tree. Intersections reduce the size of the event, eventually narrowing it down to a very precise outcome.
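The leaf-numbering argument above is easy to check mechanically. Here is a small Python sketch using the exact sets from the preceding paragraphs:

```python
# Leaves numbered 1..12 from top to bottom of the diagram, as above.
A1 = {1, 2, 3, 4}    # car's location is A
A2 = {1, 2, 5, 9}    # player's initial guess is A
B3 = {1, 4, 9, 12}   # door revealed is B

# The union is a legitimate event, but a large one:
union = A1 | A2 | B3
print(union)         # {1, 2, 3, 4, 5, 9, 12}

# The intersection narrows all the way down to a single outcome:
intersection = A1 & A2 & B3
print(intersection)  # {1}
```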


As to why we would take the product of the weights of edges of the tree, if each edge leading from the root node is correctly labeled with a numeric weight then that weight is the probability that this particular edge leads to the outcome that occurred. In the diagram in the question, for example, we could (and should) assign weight $\frac13$ to edge $A$ under "car location," since at the start of the game we have no reason to think the car is more likely to be placed behind one door than any other; that is, we set $P(A_1) = \frac13.$

Now consider the topmost edge under "player's initial guess." We can traverse this edge only if event $A_1$ occurs (allowing us to arrive at the node where this edge starts) and event $A_2$ occurs. If the weight of this edge also is $\frac13,$ that signifies that we will take this edge in $\frac13$ of all cases where we arrive at the node where this edge starts (as measured by our probability measure), which in turn happens in $\frac13$ of all cases of the game (by our probability measure). One-third of one-third of a total is one-ninth of the total, that is, the probability of traversing both edges $A_1$ and $A_2$ is $\frac13\times\frac13=\frac19.$

(The second $\frac13$ in that product derives from a questionable assumption, but let's finish at least one probability analysis before tackling that point.)

As you traverse more edges, each edge you traverse keeps only some proportion $p$ of the probability with which you arrived at the start of that edge, so we multiply the probability of reaching the start of that edge by the probability of taking that edge, and we get the probability of arriving at the end of that edge. Continuing this procedure all the way to a leaf, we end up with a probability of reaching that leaf which is the product of all the probabilities assigned to the edges we traversed.

This does not mean that $P(A_1 \cap A_2) = P(A_1) P(A_2),$ by the way. That equation will only be true if $A_2$ is independent of $A_1.$ It's reasonable in this example to assume independence, because the player's decision is not influenced by the actual location of the car, but it's not true for every path in every probability tree. What is always true is that $$P(A_1 \cap A_2) = P(A_1) P(A_2 \mid A_1),$$ where $P(A_2 \mid A_1)$ is the probability that $A_2$ occurs given that $A_1$ occurs. So the weight of that topmost edge really needs to be $P(A_2 \mid A_1),$ but we tend to use probability trees for problems where it's fairly easy to figure out what that probability should be, so this is usually not difficult.
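As a numerical sketch of this multiplication rule, using the edge weights assigned above (including the uniform-guess assumption):

```python
from fractions import Fraction

# Edge weights along the topmost path of the tree:
p_A1 = Fraction(1, 3)              # P(A1): the car is behind door A
p_A2_given_A1 = Fraction(1, 3)     # P(A2 | A1): player guesses A (the uniform assumption)
p_B3_given_A1_A2 = Fraction(1, 2)  # P(B3 | A1 ∩ A2): host opens B or C with equal chance

# Chain rule: P(A1 ∩ A2) = P(A1) P(A2 | A1), and so on down the path.
p_A1_and_A2 = p_A1 * p_A2_given_A1
p_leaf = p_A1_and_A2 * p_B3_given_A1_A2

print(p_A1_and_A2)  # 1/9
print(p_leaf)       # 1/18
```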

Consider the topmost edge under "door revealed," for example. We can only get to the start of that edge if $A_1 \cap A_2$ has already occurred. In the standard form of Monty Hall, the host can only reveal door $B$ or door $C$ at this point, and chooses either with equal probability, so we assign this edge the weight $P(B_3 \mid A_1 \cap A_2) = \frac12.$ But on the other hand, consider the fourth edge from the top, which also is labeled $B.$ That edge can be traversed only if we first traverse $A_1$ and $C_2,$ at which point the host is required to reveal door $B.$ We therefore assign to this edge the weight $P(B_3 \mid A_1 \cap C_2) = 1.$
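Those two contrasting edge weights fall out of a single rule. The helper below is a sketch (the function name is mine) encoding the standard host behaviour: never open the car's door or the guessed door, and choose uniformly among whatever doors remain:

```python
from fractions import Fraction

def p_host_reveals(reveal, car, guess):
    """P(host reveals `reveal` | car location, player's guess) under the
    standard Monty Hall rules: the host never opens the car door or the
    guessed door, choosing uniformly among the remaining doors."""
    allowed = [d for d in "ABC" if d != car and d != guess]
    return Fraction(1, len(allowed)) if reveal in allowed else Fraction(0)

print(p_host_reveals("B", car="A", guess="A"))  # 1/2 = P(B3 | A1 ∩ A2)
print(p_host_reveals("B", car="A", guess="C"))  # 1   = P(B3 | A1 ∩ C2)
```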

Now this is all fine and well, except that in my opinion there is some question about how we should assign weights to the edges under "player's initial guess." Is it really plausible to assign a random distribution to the player's guess? After all, the Monty Hall problem is about using probability to make intelligent guesses. The player could choose to guess any door with $\frac13$ probability, but he or she could instead choose to guess $A$ with probability $1$. What we can reasonably say is that the player must have some method of deciding which door to guess, and that method selects doors $A$, $B$, and $C$ with probability $p_A$, $p_B$, and $p_C,$ where $p_A+p_B+p_C=1.$ We can still take products of probabilities along edges; these products will just have an unknown factor now. If we then find the outcomes belonging to the event "the player's initial guess was correct," and compute the probability that at least one of those outcomes occurred, we will find that that probability is $$\frac13(p_A+p_B+p_C)=\frac13,$$ the same as in the MIT 6.042 course notes.
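That final computation can be replayed for any guess distribution whatsoever; the sketch below keeps $p_A,$ $p_B,$ $p_C$ as exact fractions and shows the sum collapsing to $\frac13$:

```python
from fractions import Fraction

def p_initial_guess_correct(p_guess):
    """P(player's initial guess is correct) when the car is placed uniformly
    at random and the player guesses door d with probability p_guess[d]."""
    third = Fraction(1, 3)  # P(car behind any given door)
    # Sum over doors: P(car at d) * P(guess = d) = (1/3)(pA + pB + pC)
    return sum(third * p_guess[d] for d in "ABC")

# A deterministic guesser (always picks A) ...
det = p_initial_guess_correct({"A": Fraction(1), "B": Fraction(0), "C": Fraction(0)})
# ... and a uniform guesser give the same answer.
uni = p_initial_guess_correct({"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)})
print(det, uni)  # 1/3 1/3
```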

David K
  • On a related concept that I have been confused about: does your reasoning also extend to joint random variables? Say $X_1 = x_1$ is basically the collection of outcomes $w$ that make $X_1(w) = x_1$, therefore an event. So does the joint $(X_1 = x_1, X_2 = x_2)$ similarly mean the intersection of the events that make $X_1 = x_1$ and $X_2 = x_2$ true? i.e. if $\mathcal{X}_1$ denotes the subset that makes $X_1 = x_1$, then does $(X_1 = x_1, X_2 = x_2)$ correspond to $\mathcal{X}_1 \cap \mathcal{X}_2$? I am asking because I was told once that joint r.v.s are unions, and that confused me. Is it really unions? – Charlie Parker Dec 28 '14 at 05:37
  • Your description of $\mathcal X_1 \cap \mathcal X_2$ makes sense to me. The set representing all possible outcomes (of which $\mathcal X_1 \cap \mathcal X_2$ is a subset) might be constructed as a product of sets. Of course you can always take a union of events to mean that at least one of those events occurred; but that applies to single r.v.s as well as joint r.v.s. – David K Dec 28 '14 at 14:28
  • David, last comment to make your answer perfect (and self-contained): since my question was originally addressed at the issue of multiplying tree edges to calculate probabilities, do you mind mentioning how it relates to your answer? I know it's obvious, but I think it could benefit future readers. Thanks so much btw! I was very impressed with your answer. :) – Charlie Parker Dec 29 '14 at 00:43
  • @CharlieParker Done, though it took more words than I would have liked. This was partly because I'm not completely happy with the MIT analysis, which is partly because I think this is a poor choice of problem with which to show how to use probability trees. – David K Dec 29 '14 at 02:59
  • I think the motivation for MIT to choose that example for trees (if you read the intro of that chapter) seems to be that even famous professors at other institutions made a mistake in analyzing this problem, because they were not careful and didn't use the tree method. I think the reason for using it as an example is to show that even a famously "tricky" problem can be handled with a tree diagram. Though I am just guessing why they did it. The first time I read the notes I did find it weird to understand with that example... but oh well, now it makes sense I guess. Thnx anyway :) – Charlie Parker Dec 29 '14 at 04:22

I agree that probability trees are typically treated quite shoddily or hand-wavily. My answer (7 years late) is consistent with, and complements, David K's, as well as the OP's own suggestions.

I disagree, though, that the Monty Hall problem is a poor choice of problem to illustrate probability trees; further down, I present my probability-tree solution of the Monty Hall problem, which is clearer or better than MIT's version.


A probability tree represents a probability experiment with $n$ trials which are not generally independent:

  • each column represents a trial;
  • each node (except the starting node) represents a conditional trial outcome, and each branch represents the corresponding conditional event;
  • each leaf represents an experiment outcome.

N.B. the sample space comprises the experiment outcomes; an event is simply some subset of the sample space; in particular, an elementary event contains just one experiment outcome.

For example,

[generic three-trial probability tree, with branches labeled $A$, $B$, $C$ at each node]

  • the fourth branch in the second column represents the conditional event $(A_2|B_1)$ of outcome $A$ in the second trial given outcome $B$ in the first trial;
  • the last branch in the third column represents the conditional event $(B_3|C_1\cap C_2)$ of outcome $B$ in the third trial given outcomes $C$ in the preceding trials;
  • the fifth leaf corresponds to the elementary event $B_1\cap A_2\cap C_3=\{BAC\},$ i.e.,
    the experiment having outcomes $B,A,C$ in trials one, two, three, respectively;
  • the event $A_2$ is the event of outcome $A$ in the second trial, i.e.,
    $A_2=\{AAB\}\cup\{AAC\}\cup\{BAC\}\cup\{CAB\}=\{AAB, AAC, BAC, CAB\}.$

As pointed out above, branches represent conditional events; this takes into account any dependence among trials, so the probability of an outcome simply equals the product of its branch probabilities. One might think of event $B_1$ and conditional events $(A_2|B_1)$ and $(C_3|B_1\cap A_2)$ as being mutually independent, so that $P(\{BAC\})=P(B_1)P(A_2|B_1)P(C_3|B_1\cap A_2).$
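The rule just stated — a leaf's probability is the product of the conditional branch probabilities along its path — can be sketched generically. The tree below is a toy example (the weights are illustrative, not those of the diagram):

```python
from fractions import Fraction

# A probability tree as nested dicts: keys are branch labels, values are
# (conditional probability of taking that branch, subtree).
tree = {
    "A": (Fraction(1, 3), {"B": (Fraction(1, 2), {}), "C": (Fraction(1, 2), {})}),
    "B": (Fraction(1, 3), {"C": (Fraction(1, 1), {})}),
    "C": (Fraction(1, 3), {"B": (Fraction(1, 1), {})}),
}

def leaf_probabilities(tree, path=(), acc=Fraction(1)):
    """Yield (path, probability), multiplying the conditional branch
    probabilities along each root-to-leaf path."""
    for label, (p, subtree) in tree.items():
        if subtree:
            yield from leaf_probabilities(subtree, path + (label,), acc * p)
        else:
            yield path + (label,), acc * p

leaves = dict(leaf_probabilities(tree))
print(leaves[("A", "B")])    # 1/6
print(sum(leaves.values()))  # 1 — the leaves partition the sample space
```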


[probability tree for the Monty Hall game: car location → door revealed → stick-or-switch decision]

Here's my visual representation of the classic Monty Hall game. It extends the traditional probability tree by framing the contestant's winning decision as the third trial of the probability experiment.

Here—without loss of generality—the contestant has chosen Door $1.$ From the diagram:

  • the sample space is $\{12c,121,13c,131,23c,231,32c,321\};$
  • $P(\text{wins by sticking to Door $1$})=P(\{121,131\})=\frac13;$
  • $P(\text{wins by switching to Door $2$ or $3$})=P(\{23c,32c\})=\frac23.$

Alternatively, frame the game as a four-trial experiment, with the first trial being the contestant's initial door choice and the subsequent trials as above. Then there are $24$ outcomes, but $$P(\text{wins by sticking to initial door choice})$$ and $$P(\text{wins by switching door})$$ remain the same as above.
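These probabilities can be verified by enumerating the experiment directly. The sketch below fixes the contestant's initial choice as Door 1, applies the standard host rules, and sums the leaf probabilities along each strategy's branches:

```python
from fractions import Fraction

def outcomes():
    """Enumerate (car, decision, final door, probability) for the game
    in which the contestant has initially chosen Door 1."""
    for car in (1, 2, 3):
        p_car = Fraction(1, 3)  # car placed uniformly at random
        # Host opens a door that is neither Door 1 nor the car's door,
        # choosing uniformly among whatever remains.
        reveals = [d for d in (2, 3) if d != car]
        for reveal in reveals:
            p_reveal = Fraction(1, len(reveals))
            for decision in ("stick", "switch"):
                final = 1 if decision == "stick" else ({1, 2, 3} - {1, reveal}).pop()
                yield car, decision, final, p_car * p_reveal

p_win_stick = sum(p for car, d, final, p in outcomes() if d == "stick" and final == car)
p_win_switch = sum(p for car, d, final, p in outcomes() if d == "switch" and final == car)
print(p_win_stick, p_win_switch)  # 1/3 2/3
```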

ryang
  • Trying to recall my thinking from almost seven years ago, I don't think I meant that a probability tree could not be used for this exercise; I think my objection was more along the lines of, "Is this really a good way to introduce people to probability trees?" And that opinion was no doubt influenced by the strange choice of "probability tree" chosen by the MIT course notes. Your probability tree, on the other hand, is fine. We could make the MIT tree work too with appropriate probability assignments (including some variables) on each branch; it is just more complicated. – David K Sep 24 '21 at 13:49
  • @DavidK Thanks for reading! Yes I understood what you meant, and because I recently had a good class discussion using this probability tree (albeit not as an introduction to tree diagrams), my thinking when I wrote that I disagree was along the lines of "it's not ideal, but maybe it's not that poor a choice." :D – ryang Sep 24 '21 at 14:28