I was reading a paper called “Why Most Published Research Findings Are False” by John Ioannidis, and it got me thinking about the social sciences. Their sample sizes felt small relative to the population of the planet, so I decided to test my thinking by trying to figure out whether those sample sizes (tens to hundreds of participants) could be used to show a true relationship on a planet-wide scale. I decided to play God and create a trait called T with a 65% prevalence in a world of a billion people. I wanted to figure out:
- For a sample of n people, what is the probability that exactly 65% of them ($n \cdot 0.65$ people) have T by chance alone? (Remember, we are randomly choosing a sample of n people from the billion on the planet, 65% of whom we know have T.)
- What would n have to be for the probability of seeing that result by chance alone to be below 0.005?
For the first question I assumed a sample size of 10, so I set n = 10. 65% of that is 6.5, which I rounded up to 7 as you can’t have half a person. I realised that we need to know how many ways you can select those 7 people from a sample of 10, and multiply that by the probability of getting 7 people who have the trait followed by 3 who don’t (the binomial formula):
$${10\choose 7} \cdot 0.65^{7} \cdot 0.35^{3} = 0.2522$$
So just by randomly selecting 10 people, there is a 25.22% chance that exactly 7 of the 10 have T, regardless of the tests I do on them. So if I wanted to prove that T has a 65% prevalence in the population, I would need a larger n such that the resulting probability is small enough that chance alone couldn’t be causing it (I’m calling 0.005 small here).
So let’s try a larger sample of 100. 65% of 100 is 65.
$${100\choose 65} \cdot 0.65^{65} \cdot 0.35^{35} = 0.0834$$
8.34% … an improvement, but let’s keep going.
Now n = 200. 65% of 200 is 130.
$${200\choose 130} \cdot 0.65^{130} \cdot 0.35^{70} = 0.059$$
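The three hand calculations above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `binom_pmf` is made up for illustration, and it relies on `math.comb` from the standard library (Python 3.8+) for the "n choose k" part.

```python
from math import comb

def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k people with the trait in a sample of n,
    when each sampled person has the trait with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Same (n, k) pairs as the worked examples: k is 65% of n, rounded up.
for n, k in [(10, 7), (100, 65), (200, 130)]:
    print(f"n = {n:3d}, k = {k:3d}: {binom_pmf(n, k, 0.65):.4f}")
```

This direct form is fine for samples of a few hundred, but for much larger n the power terms get too small for a float to hold.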
As you can see, we get less and less bang for our buck as the sample gets larger. The difference between 10 and 100 is 0.1688, whereas between 100 and 200 it is 0.0244. I’m not a smart man, and I wasn’t sure how to use math to figure out what n should be, but I can approximate using Excel. When I plotted a trendline through these points it gave me the equation $y = 0.7683x^{-0.482}$, $R^{2} = 0.9999$. I set y = 0.005 and solved for x, which is roughly 34,390.
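The "solve for x" step just inverts the power law: $y = a x^{b}$ gives $x = (y/a)^{1/b}$. A short sketch of that arithmetic, using the trendline coefficients quoted above (the variable names are only illustrative):

```python
# Trendline reported by Excel: y = a * x**b
a, b = 0.7683, -0.482
target = 0.005  # the probability being called "small"

# Invert y = a * x**b  ->  x = (y / a) ** (1 / b)
n_est = (target / a) ** (1 / b)
print(f"estimated n: {n_est:.0f}")
```

Since the exponent is close to -0.5, small changes in the fitted coefficients move the estimate a lot, so this is best read as a rough figure.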
65% of a sample of 34,390 is 22,353.5, which rounds up to 22,354.
As you can imagine, checking this seems… difficult:
$${34{,}390\choose 22{,}354} \cdot 0.65^{22{,}354} \cdot 0.35^{(34{,}390 - 22{,}354)}.$$
This should give around 0.005, but I have no idea how to compute it. Is my value for n accurate? How could I compute such a result, given that ${34{,}390\choose 22{,}354}$ is so massively big and $0.65^{22{,}354} \cdot 0.35^{(34{,}390 - 22{,}354)}$ is so incredibly small? Is there a way around this? And finally, am I being completely insane and doing this all wrong? (I’m self-taught.) Thanks for the help!
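One common way around the "huge times tiny" problem is to work with logarithms, so the product becomes a sum of moderate-sized numbers and only the final result is exponentiated. The sketch below assumes Python's standard `math.lgamma`, which returns $\log\Gamma(x)$, together with the identity $\log(m!) = \log\Gamma(m+1)$. The helper name `log_binom_pmf` is hypothetical.

```python
from math import exp, lgamma, log

def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Natural log of C(n, k) * p**k * (1 - p)**(n - k)."""
    # lgamma(m + 1) equals log(m!), so this is log C(n, k) computed
    # without ever forming the enormous factorials themselves.
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

n, k, p = 34_390, 22_354, 0.65
prob = exp(log_binom_pmf(n, k, p))
print(f"P(exactly {k} of {n} have T) = {prob:.6f}")
```

The printed value can be compared directly against the 0.005 target to see how well the trendline estimate holds up. If SciPy is available, `scipy.stats.binom.pmf(k, n, p)` is another way to get the same quantity.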