Solving binomial equations with huge (and tiny) numbers.

Question

I was reading a paper called “Why Most Published Research Findings Are False” by John Ioannidis, and it got me thinking about the social sciences. Their sample sizes felt small relative to the population of the planet, so I decided to test my thinking by working out whether such samples (tens to hundreds of participants) could be used to show a true relationship on a planet-wide scale. I decided to play God and create a trait called T that occurs with 65% probability in a world of a billion people. I wanted to figure out:

1. For a sample of $n$ people, what is the probability that exactly 65% of them ($n \cdot 0.65$) have T by chance alone? (Remember, we are randomly choosing a sample of $n$ people from the billion on the planet, 65% of whom we know have T.)
2. What would $n$ have to be for that probability to be below 0.005?

For the first question I assumed a sample size of 10, so I set n = 10. 65% of that is 6.5, which I rounded up to 7 as you can’t have half a person. I realised that we would need to know how many ways you can select those 7 people from a sample of 10, and multiply that by the probability of getting 7 people who have the trait followed by 3 who don’t (the binomial formula):

$${10\choose 7} \cdot 0.65^{7} \cdot 0.35^{3} = 0.2522$$

So just by randomly selecting 10 people, there is a 25.22% chance that exactly 7 of the 10 have T, regardless of the tests I do on them. So if I wanted to prove that T has a 65% prevalence in the population, I would need a larger n such that the resulting probability is small enough that it couldn't be chance alone causing it (I’m calling 0.005 small here).

So let's try a larger sample of 100. 65% of 100 is 65.

$${100\choose 65} \cdot 0.65^{65} \cdot 0.35^{35} = 0.0834$$

8.34%… an improvement, but let's keep going.

Now n = 200; 65% of 200 is 130.

$${200\choose 130} \cdot 0.65^{130} \cdot 0.35^{70} = 0.059$$
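The three values above can be reproduced directly. This is a minimal Python sketch, assuming the standard-library `math.comb` for the binomial coefficient; the helper name `binom_pmf` is made up for illustration.

```python
import math

def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    # math.comb returns an exact integer; multiplying by a float converts it.
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

for n, k in [(10, 7), (100, 65), (200, 130)]:
    print(n, k, round(binom_pmf(n, k, 0.65), 4))
```

This direct form only works for modest n: for very large samples the coefficient no longer fits in a float and the powers underflow.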

As you can see, we are getting less and less bang for our buck as the sample grows. The difference between 10 and 100 is 0.1688, whereas between 100 and 200 it is 0.0244. I’m not a smart man, and I wasn’t too sure how to use math to figure out what the value for n should be, but I can approximate using Excel. When I plotted a trendline through these points it gave me the equation $y = 0.7683x^{-0.482}$, $R^{2} = 0.9999$. I set y = 0.005 and solved for x, which is roughly 34,390.
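Solving the trendline for x is a one-line rearrangement, $x = (y/0.7683)^{1/(-0.482)}$. A small Python sketch of that step, where the constant names `A` and `B` and the function name are illustrative only:

```python
# Power-law trendline reported above: y = A * x**B
A, B = 0.7683, -0.482

def sample_size_from_trendline(y: float) -> float:
    """Invert y = A * x**B for x."""
    return (y / A) ** (1 / B)

print(round(sample_size_from_trendline(0.005)))
```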

65% of a sample of 34,390 is 22,353.5, which rounds up to 22,354.

As you can imagine, checking this seems… difficult:

$${34390\choose 22354} \cdot 0.65^{22354} \cdot 0.35^{34390 - 22354}$$

This should give around 0.005, but I have no idea how to compute it. Is my value for n accurate? How could I compute such a result, given that ${34390\choose 22354}$ is so massively big and $0.65^{22354} \cdot 0.35^{34390 - 22354}$ is so incredibly small? Is there a way around this? And finally, am I being completely insane and doing this all wrong? (I'm self-taught.) Thanks for the help!

The logarithm of your "incredibly small" number is $22354\log(0.65)+12036\log(0.35)$, which is a pretty reasonable number. So is the log of that binomial coefficient, and I bet if you hunt around the net for approximating binomial coefficients you'll find some good ways to calculate it to some precision. — Gerry Myerson, Mar 16 '23 at 02:50
Possibly https://math.stackexchange.com/questions/202554/how-do-i-compute-binomial-coefficients-efficiently has some good advice. Also, some of the questions listed there under "Linked" or "Related". — Gerry Myerson, Mar 16 '23 at 02:55
See also https://stackoverflow.com/questions/9619743/how-to-calculate-binomial-coefficents-for-large-numbers and https://stackoverflow.com/questions/55552775/approximating-the-lograrithm-of-binomial-coefficients-for-very-large-numbers — Gerry Myerson, Mar 16 '23 at 03:01

Claude Leibovici · Answer 1 · 2023-03-18T02:23:47.340

You can have a very good approximation for $$P(n)=\binom{n}{\frac{13 }{20}n}\,\,\left(\frac {13}{20}\right)^{\frac{13 }{20}n}\,\,\left(\frac {7}{20}\right)^{\frac{7 }{20}n}$$

Use the gamma function instead of the binomial coefficient, take logarithms, apply Stirling's approximation, and exponentiate again to obtain $$P(n)=\sqrt{\frac{200}{91 \pi }} \frac 1{\sqrt n}\,\left(1-\frac{103}{364\, n}+\frac{10609}{264992\, n^2}+O\left(\frac{1}{n^3}\right) \right)$$

This gives a relative error of at most $0.10\%$ if $n\geq 19$ and at most $0.01\%$ if $n\geq 41$.

Notice that the coefficients of your trendline, $y = 0.7683x^{-0.482}$, are very close to the leading term $\sqrt{\frac{200}{91 \pi }}\,n^{-1/2}$, since $\sqrt{\frac{200}{91 \pi }}\approx 0.836$.

Now, if you want to know $n$ for a given value of $P(n)$, we can invert the series and obtain $$n=\frac{200}{91 \pi P^2}-\frac{103}{182}-\frac{10609 \pi }{145600}P^2+O\left(P^4\right)$$ Since, in your case, $P$ is very small, just use $$n\sim \frac{200}{91 \pi P^2}\sim \frac{25}{36 P^2}$$

For $P=0.005$, this gives $n\sim 28000$; recomputing with the exact formula at that $n$ gives $0.00499846$, which is more than accurate enough.
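As a rough check of the numbers above, here is a Python sketch that evaluates $P(n)$ through the log-gamma function and compares it with the quoted series. The function names are invented for illustration, and `math.lgamma` stands in for "the gamma function, then logarithms"; it is not necessarily how the figures in this answer were produced.

```python
import math

P65 = 13 / 20  # trait prevalence used in the question

def p_exact(n: float, a: float = P65) -> float:
    """P(n) with the binomial coefficient written via gamma functions, in log space."""
    k = a * n
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(a) + (n - k) * math.log(1 - a))
    return math.exp(log_p)

def p_series(n: float) -> float:
    """Asymptotic series quoted above (valid for a = 13/20 only)."""
    lead = math.sqrt(200 / (91 * math.pi)) / math.sqrt(n)
    return lead * (1 - 103 / (364 * n) + 10609 / (264992 * n**2))

def n_from_p(p: float) -> float:
    """First two terms of the inverse series."""
    return 200 / (91 * math.pi * p**2) - 103 / 182

for n in (20, 100, 28000):
    print(n, p_exact(n), p_series(n))
print(n_from_p(0.005))
```

Working with the logarithm of each factor keeps every intermediate value in a comfortable floating-point range, and only the final sum is exponentiated.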

Edit

If you want to replace $\frac {65}{100}$ by $a$, you will have $$P(n)=\frac 1{\sqrt{2\pi a(1-a)}}\frac 1{\sqrt n}\,\left(1-\frac{a^2-a+1}{12\, a(1-a) \,n} +O\left(\frac{1}{n^2}\right)\right)$$ and its inverse would be $$n=\frac{1}{2 \pi a(1-a) \, P^2}-\frac{a^2-a+1}{6 a(1-a) }+O\left(P^2\right)$$

If you want a very accurate inverse, use $$n=-\frac{1-a(1-a)}{6\, a(1-a)\,\, W\left(-\frac{\pi}{3} (1-a(1-a)) P^2\right)}$$ where $W(\cdot)$ is the Lambert $W$ function (it is available in some versions of Excel).

For the worked case, it would give, as a real, $n=\color{red}{27982.720757687}29$ while the exact value is $n=27982.72075768747$
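The Lambert-function inverse can be tried without special software. The sketch below uses a hand-rolled Newton iteration for the principal branch of $W$, which is an assumption made to keep the example dependency-free; it is only intended for the small negative arguments that arise here, and the function names are illustrative.

```python
import math

def lambert_w0(x: float, tol: float = 1e-15) -> float:
    """Principal branch of Lambert W via Newton's method (small |x| only)."""
    w = x  # W(x) is close to x when |x| is small, so x is a good start
    for _ in range(50):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))
        w -= step
        if abs(step) <= tol * abs(w):
            break
    return w

def n_from_p(p: float, a: float = 0.65) -> float:
    """Sample size n from the Lambert-W inverse formula above."""
    c = 1 - a * (1 - a)
    return -c / (6 * a * (1 - a) * lambert_w0(-math.pi / 3 * c * p**2))

print(n_from_p(0.005))
```

A library routine for $W$ would do the same job, if one is available.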

score 1 · Accepted Answer · answered Mar 16 '23 at 03:43

Your trendline may have been distorted a bit by the rounding you had to do for the case of $10$ people. This could have been avoided by starting with $20$ people. Nevertheless, you got close to the desired probability; according to Wolfram Alpha,

$$ {34390\choose 22354} \cdot 0.65^{22354} \cdot 0.35^{34390 - 22354} \approx 0.004510. $$

I don't know exactly which methods Wolfram Alpha used for this result, but they might include Stirling's approximation, logarithms, and other techniques mentioned in the posts linked in the comments under the question.
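One way to reproduce a value like this with ordinary floating point is to add logarithms instead of multiplying the huge and tiny factors. The Python sketch below uses `math.lgamma` for the log of the binomial coefficient; it illustrates the logarithm technique only and makes no claim about what Wolfram Alpha does internally. The helper name `log_binom_pmf` is hypothetical.

```python
import math

def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Natural log of C(n, k) * p**k * (1-p)**(n-k), using log-gamma."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

n, k, p = 34390, 22354, 0.65
print(p**k)                              # the naive power underflows to 0.0
print(math.exp(log_binom_pmf(n, k, p)))  # close to the value quoted above
```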

But here's the rub: you have calculated how unlikely it is to get the result $0.65$ exactly (or as near exactly as possible) if the estimate of $65\%$ probability is correct. The purpose of statistical tests such as a $0.005$ $p$-value is to show that you were unlikely to get the observed result if your estimate was incorrect. In other words, we want to make it hard to get a wrong answer, not hard to get the right answer. So your statistical working, while impressive, accomplishes exactly the opposite of what is wanted.

In order to measure something like the prevalence of a trait in a population, we accept the fact that our measurement will always be subject to some uncertainty and is unlikely to be exactly the true rate of prevalence. There are techniques for deciding how uncertain we should be about such a result if we want a certain level of "confidence" (corresponding to a low $p$-value). This is why political polls, for example, usually are stated as a percentage plus or minus some margin of error, for example $65\%$ plus or minus $4\%.$ In the case of your sample of $34390$ individuals, a plus or minus $4\%$ margin of error would include thousands of possibilities for the number of times the trait is observed.
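To put rough numbers on the margin-of-error idea, the sketch below counts how many outcomes fall within plus or minus $4\%$ for a sample of $34390$, and estimates a margin of error from the usual normal approximation $z\sqrt{p(1-p)/n}$. The 95% confidence level and the function names are assumptions chosen for illustration, not something stated above.

```python
import math
from statistics import NormalDist

def outcomes_within(n: int, p: float, margin: float) -> int:
    """Count integer outcomes k with |k/n - p| <= margin."""
    lo = math.ceil((p - margin) * n)
    hi = math.floor((p + margin) * n)
    return hi - lo + 1

def normal_margin(n: int, p: float, confidence: float = 0.95) -> float:
    """Approximate margin of error z * sqrt(p(1-p)/n) (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)

print(outcomes_within(34390, 0.65, 0.04))  # number of counts inside the band
print(normal_margin(34390, 0.65))          # margin as a fraction, not a percent
```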

Ioannidis is concerned with other sources of error, such as the fictional study conducted in an XKCD comic. The joke there is that the scientists found an effect that has only a $1$ in $20$ chance to be "observed" by chance if it is not real, but they conducted $20$ similar studies in order to find this effect once.
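As a worked figure for that joke, assuming the $20$ studies are independent and that there is no real effect in any of them, the chance that at least one of them crosses the $1$ in $20$ threshold is $$1-\left(1-\frac{1}{20}\right)^{20} = 1 - 0.95^{20} \approx 0.64,$$ so a spurious "finding" is more likely than not.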

"So your statistical working, while impressive, accomplishes exactly the opposite of what is wanted." - Thats typical of me. Thank you! — James, Mar 16 '23 at 04:17
@James Don't feel bad. Our brains were not designed to work this way. This is pretty much the normal experience. — David K, Mar 16 '23 at 13:30
