5

I have a question about how to understand Bayes' theorem: in

$$p(z|x) = \frac{p(x|z)p(z)}{p(x)},$$

the term $p(z)$ is usually interpreted as the prior probability distribution of a hypothesis $z$ before observing any data $x$.

However, if we write $p(z)$ as the marginal

$$p(z) = \int p(z, x)\, dx = \int p(z|x)\, p(x)\, dx = \mathbb{E}_{x\sim p(x)}\left[p(z|x)\right],$$

then the term $p(z)$ seems to contain the knowledge about all data $x$.

  • Therefore, does the prior really represent the hypothesis with no data, or with all data?

  • Does that mean we are no smarter with all the data than with no data?

  • Or is it a question of perspective?

  • How should I understand the prior correctly?

Thank you!

Jacob
  • Just a remark: the equation $p(z) = \int p(z|x)\,p(x)\,dx$ does not imply that $p(z)$ contains information about $p(x)$. In fact, the formula always holds (as long as all the densities exist). Think for instance about the case where $x$ and $z$ are independent: then the formula reduces to $p(z) = p(z) \int p(x)\,dx$, which is trivially true but doesn't tell you anything about $p(x)$ (other than $\int p(x)\,dx = 1$). – Leander Tilsted Kristensen May 21 '21 at 17:37
  • Thanks! This is true, but in this context we would always be interested in the situation where $z$ and $x$ are not independent, wouldn't we, since we want to explain $x$ with $z$? – Jacob May 22 '21 at 11:56
  • Yes, of course, I just mentioned independence as an extreme case. I just wanted to point out the logical fallacy in the conclusion "$p(z)$ seems to contain the knowledge about all data $x$". – Leander Tilsted Kristensen May 22 '21 at 14:24
  • I see. Thank you very much! Now I see the problem in my understanding. – Jacob May 22 '21 at 15:05

2 Answers

3

The prior encodes the asker's existing belief about the state of the world. This may arise in the context of prior knowledge that is given to you (if the question rests on certain assumptions), or as an entire philosophy.

For the former, suppose you've been told that, before you flip a coin, it has previously been observed to come up heads 99% of the time. Depending on how strongly you decide to weight this evidence, you may decide it should count "as if" you had already seen several extra flips. This leads to the concept of conjugate priors, which are mathematically convenient choices for which the posterior has the same form as the prior, and which make explicit that Bayesian inference amounts to updating your prior with additional evidence.
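
As a concrete sketch of the pseudo-flip idea (the specific numbers here are illustrative, not part of the original answer): the conjugate prior for a coin's heads probability $\theta$ is the Beta distribution, whose parameters behave like counts of previously seen heads and tails. Treating the reported 99% as worth, say, 100 pseudo-flips gives a $\mathrm{Beta}(99, 1)$ prior, and after observing $k$ heads in $n$ new flips the posterior is

$$p(\theta \mid k, n) = \mathrm{Beta}(99 + k,\; 1 + n - k).$$

With $k = 3$ heads in $n = 10$ flips, the posterior mean moves from $99/100 = 0.99$ to $102/110 \approx 0.93$: the update behaves exactly as if the pseudo-flips had really been observed alongside the new ones.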

You may, at one extreme, decide this is conclusive information and assign it infinite weight; you will then in effect ignore any and all evidence to the contrary. A frequentist, by contrast, would completely ignore the pre-existing information and estimate the coin's bias purely from what was observed in the experiment.
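
Continuing the illustrative numbers above: an infinite-weight prior corresponds to a point mass at $\theta = 0.99$, and the posterior then remains a point mass at $0.99$ no matter how many tails are observed. The frequentist maximum-likelihood estimate, by contrast, uses only the experiment: $\hat{\theta} = k/n = 3/10 = 0.3$.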

As a philosophy, Bayesian statistics fundamentally rejects the frequentist assumption that there is a fully objective description of the world independent of the asker's experience and belief. See for example this XKCD for a humorous comparison.

There are particular choices that may be better suited to expressing complete ignorance, for example the Jeffreys prior, but even for these it is arguable whether "minimizing information" truly embodies "ignorance".

obscurans
2

It should be interpreted as representing the uncertainty in the hypothesis with no data.

The marginalization computation that you've written should not be viewed as "using information from more data," but rather as averaging over all possible outcomes of what the data $x$ could be. This is less informative than if you have a particular instance of the data $x$, which gives you extra information about $z$.
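
To make the averaging picture concrete, here is a toy discrete example (the numbers are purely illustrative, not from the answer above). Let $z \in \{0,1\}$ with prior $p(z=1) = 0.5$, and let the likelihood be $p(x=1 \mid z=1) = 0.9$ and $p(x=1 \mid z=0) = 0.2$. Then $p(x=1) = 0.9 \cdot 0.5 + 0.2 \cdot 0.5 = 0.55$, and Bayes' theorem gives

$$p(z=1 \mid x=1) = \frac{0.45}{0.55} \approx 0.82, \qquad p(z=1 \mid x=0) = \frac{0.05}{0.45} \approx 0.11.$$

Each particular observation moves the belief well away from $0.5$, but averaging the posteriors over all possible data recovers exactly the prior:

$$\mathbb{E}_{x \sim p(x)}\left[p(z=1 \mid x)\right] = \frac{0.45}{0.55} \cdot 0.55 + \frac{0.05}{0.45} \cdot 0.45 = 0.45 + 0.05 = 0.5 = p(z=1).$$

So the marginal identity in the question restates the prior; it does not sneak in the information that a specific observation would provide.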

angryavian
  • Thank you! I really like your formulation that the average of "all possible outcomes" of data observation is actually not as informative as "a particular instance of the data $x$". – Jacob May 23 '21 at 10:01