First, I took some time to verify that the z-test does not
work well when the success probability in the control group
is as small as 10%.
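If you want to check that sort of thing yourself, a simulation
along the following lines (my sketch, not the code behind the
claim above; the seed and settings are arbitrary) estimates the
attained Type I error rate of the pooled one-sided z-test when
$\pi_C = \pi_T = 0.10,$ for comparison with the nominal 5%:

set.seed(2024)                 # arbitrary seed, for reproducibility
n = 50; p = .10; m = 10^5      # equal groups, pi_C = pi_T = .10 (Ho true)
xc = rbinom(m, n, p); xt = rbinom(m, n, p)
phat = (xc + xt)/(2*n)         # pooled estimate of the common proportion
z = (xt/n - xc/n)/sqrt(phat*(1-phat)*(2/n))   # one-sided z statistic
mean(z > qnorm(.95), na.rm=T)  # attained size; na.rm drops rare 0/0 cases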
Second, here are some results using a one-sided Fisher's exact test
that rejects the null hypothesis that success probabilities in
the two groups are equal when there are significantly
more successes in the treatment group than in the control group.
(This means that you would disregard as a fluke any result
with significantly more successes in the control group.)
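As a concrete illustration of that one-sided test in R (the
counts below are invented for the example, not data from any
tabled scenario):

TBL = matrix(c(15, 35, 5, 45), nrow=2, byrow=T)   # Trt row, then Ctl row
rownames(TBL) = c("Trt", "Ctl"); colnames(TBL) = c("Succ", "Fail")
fisher.test(TBL, alternative="greater")$p.value   # small if Trt does better

With alternative="greater", fisher.test counts only outcomes with
disproportionately many treatment successes as evidence against
the null hypothesis, which is exactly the one-sided rule just
described.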
All of the results below are for Fisher's exact test, and
sample sizes are equal in the two groups. I looked at
cases for $n = n_T = n_C = 50, 100,$ and $200.$
$n = 50.$ Suppose the success probability in the control group is
$\pi_C = 0.02$: If $\pi_T = 0.15,$ then the P-value
averages $.07.$ If $\pi_T = 0.2,$
the average P-value decreases to $.022.$ And if $\pi_T = 0.25,$
the average P-value decreases further to $.007.$ This is summarized
in the first cluster of the table below; the second cluster is for $\pi_C = 0.1.$
  n    ppc   ppt   avg P-value
 50    .02   .15    .07
             .20    .022
             .25    .007
       .10   .25    .11      # Scenario (b) below
             .30    .05
             .35    .021     # Scenario (a) below
             .40    .008
100    .02   .10    .06
             .15    .009
             .20    .001
       .10   .20    .10
             .25    .03
             .30    .007
200    .02   .05    .16
             .10    .009
             .15    .0003
       .10   .15    .17
             .20    .028
             .25    .003
             .30    .0002
I hope you can see that this gives you a rough idea of what
differences between $\pi_C$ and $\pi_T$ can be reliably detected,
and at what level of significance, for each of the three sample sizes.
All average P-value results are based on simulation and are
subject to small simulation errors.
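If you want a feel for the size of those simulation errors, the
standard error of an average of m simulated P-values is
sd(pv)/sqrt(m). A minimal sketch (my addition, reusing the
hypergeometric computation from the Addendum code below, with an
arbitrary seed and one tabled scenario):

set.seed(7)
m = 10^5; n = 100
xc = rbinom(m, n, .10); xt = rbinom(m, n, .25)   # n=100, ppc=.10, ppt=.25
pv = phyper(xt-1, n, n, xt+xc, lower.tail=F)     # 1-sided P-values
mean(pv); 2*sd(pv)/sqrt(m)    # avg P-value and approx 95% margin of error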
Examples with $n = 100$ and control group with population proportion
of successes $\pi_C = .10$: At the 5% significance level, you will
seldom be able to detect that $\pi_T = .20$ is an improvement,
usually be able to detect that $\pi_T = .25$ is an improvement,
and seldom overlook that $\pi_T = .30$ is an improvement.
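If you want to turn 'seldom' and 'usually' into numbers, the power
at the 5% level can be simulated directly. This sketch (my
addition, reusing the hypergeometric computation from the Addendum
code below; the seed is arbitrary) runs all three alternatives:

set.seed(101)
n = 100; ppc = .10; m = 10^5
for (ppt in c(.20, .25, .30)) {
  xc = rbinom(m, n, ppc); xt = rbinom(m, n, ppt)
  pv = phyper(xt-1, n, n, xt+xc, lower.tail=F)   # 1-sided P-values
  cat("ppt =", ppt, " power approx", mean(pv <= .05), "\n")
}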
If you like, I can show you the R code I used to get these
results; then you could investigate other scenarios. R is
available free at www.r-project.org, and no particular
knowledge of R would be necessary to change the numbers in my
program and run additional scenarios.
Finally, I would not trust even Fisher's exact test (at any
sample size) unless the number of successes in the treatment
group is at least 5.
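A quick way to screen a scenario against that rule of thumb (my
addition; the values $n_T = 50$ and $\pi_T = .15$ are just
illustrative) is to compute the chance of fewer than 5 treatment
successes:

pbinom(4, size=50, prob=.15)   # P(fewer than 5 successes among 50 at pi=.15)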
Addendum: R code for Fisher's exact tests. As requested, here is the R code
used to obtain the information tabled above; output for one of the
specific tabled situations is shown. Constants in the first two lines
of code may be changed to investigate other situations. (Values for
power, included here, are not tabled above.)
nc = 50; nt = 50 # sample sizes
ppc = .1; ppt = .35 # population proportions of Success--Scenario (a)
m = 10^6 # iterations for simulation (adjustable >= 10^4)
xc = rbinom(m, nc, ppc) # m-vector of numbers of control Successes
xt = rbinom(m, nt, ppt) # m-vector of numbers of treatment Successes
pv = phyper(xt-1, nt, nc, xt+xc, lower.tail=F) # m-vect of 1-sided P-vals
mean(pv) # avg of 1-sided P-vals
## 0.02102584
mean(pv <= .05) # P(Rej Ho | Ho False as specif) = Power against alt. specif
## 0.887290
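As a sanity check (my addition), the phyper line above agrees with
R's fisher.test for any single 2x2 table; for example, with one
hypothetical outcome of 17 treatment and 5 control successes:

nt = 50; nc = 50; x.t = 17; x.c = 5   # one hypothetical outcome
phyper(x.t-1, nt, nc, x.t+x.c, lower.tail=F)
fisher.test(matrix(c(x.t, nt-x.t, x.c, nc-x.c), nrow=2, byrow=T),
  alternative="greater")$p.value      # same 1-sided P-value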
Plots of simulated P-values are shown in the histograms below.
Scenario (a) is for $n_C = n_T = 50;\,
\pi_C = .1, \pi_T = .35,$ and Scenario (b) has $\pi_T = .25.$
The vertical dotted red lines are at $0.05,$ so the bar to the
left of each line represents the power of the test, the probability
of rejecting $H_0: \pi_T = \pi_C$ in favor of the alternative
$H_a: \pi_T > \pi_C$ (as specified), at level $\alpha = 5\%.$
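If you want to reproduce plots like those, a sketch along these
lines (my reconstruction, not the original plotting code; the seed
and histogram details are arbitrary) draws the two histograms:

set.seed(405)
m = 10^5; n = 50; ppc = .10
par(mfrow=c(1,2))               # Scenario (a), then Scenario (b)
for (ppt in c(.35, .25)) {
  xc = rbinom(m, n, ppc); xt = rbinom(m, n, ppt)
  pv = phyper(xt-1, n, n, xt+xc, lower.tail=F)
  hist(pv, breaks=seq(0, 1, by=.05), main=paste("pi_T =", ppt),
    xlab="P-value")
  abline(v=.05, col="red", lty="dotted")   # line at alpha = .05
}
par(mfrow=c(1,1))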

Perhaps the first use of this code should be to verify the values
in the table above to make sure there are no misprints.