Sample size and power with the completely randomized design


Read the sections on power and sample size (pages 37-39 and page 85) in Carl Schwarz's notes, Chapter 6.

Some basic ideas:  The significance level of a test protects against incorrect rejection of the null hypothesis of no treatment effect.  If the null hypothesis is true, we want the probability of rejecting it to be small, specifically no larger than the specified significance level.  The significance level of the test is the probability of rejecting the null hypothesis when it is true.  But if the null hypothesis is false and there really is some effect of the treatments, then we want to reject the null hypothesis in favor of the alternative.  The probability of rejecting the null hypothesis when it is false is called the power of the test.  To increase the power, one may use larger sample sizes or find a more effective design.
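To make the definition of power concrete, one can estimate it by simulation: generate many data sets under a specific alternative and record how often the .05-level F test rejects.  Here is a minimal sketch in R; the true means, standard deviation, and sample sizes are hypothetical, chosen only for illustration.

set.seed(1)
k <- 3; n <- 8                         # hypothetical: 3 treatments, 8 units per treatment
mu <- c(10, 12, 14); sigma <- 4        # assumed true means and within-group SD
reject <- replicate(5000, {
  y <- rnorm(k * n, mean = rep(mu, each = n), sd = sigma)  # simulate one experiment
  g <- factor(rep(1:k, each = n))                          # treatment labels
  anova(lm(y ~ g))[["Pr(>F)"]][1] < 0.05                   # did the F test reject at level .05?
})
mean(reject)                           # proportion of rejections = estimated power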

If the null hypothesis is true, the test statistic has one distribution, such as the t or F distribution or a randomization distribution.  If the null hypothesis is false, however, then one or more of the treatments has an effect and the same test statistic has a different distribution.  In the case of the t and F statistics under the standard model assumptions, the distribution of the test statistic when the alternative hypothesis is true is the non-central t or F distribution, respectively.  The distribution of the test statistic when there is some treatment effect depends on how big the effect is, or how much the effects of the different treatments differ from one another.
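As an illustration, the power of the F test can be computed directly from the central and non-central F distributions.  The numbers below are hypothetical, and the noncentrality parameter is written in the convention that R's power.anova.test uses internally (groups * n * between.var / within.var).

k <- 4; n <- 10                  # hypothetical: 4 treatments, 10 units each
between.var <- 2                 # hypothetical variance of the true group means
within.var <- 10                 # hypothetical within-group variance
df1 <- k - 1                     # numerator degrees of freedom
df2 <- k * (n - 1)               # denominator degrees of freedom
ncp <- k * n * between.var / within.var      # noncentrality under the alternative
crit <- qf(0.95, df1, df2)       # .05-level critical value from the central F
pf(crit, df1, df2, ncp = ncp, lower.tail = FALSE)  # power: P(F > crit | alternative)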

With the completely randomized design, the power is maximized by assigning the same number of units to each treatment.  To calculate the power of a test for the completely randomized design with given sample sizes, one must specify (1) the required significance level, (2) the effect size that needs to be detected, (3) the common variance of the response variables, and (4) the sample size for each group.

To calculate the sample sizes needed to achieve a given power, one must specify (1) the required significance level, (2) the effect size that needs to be detected, (3) the common variance of the response variables, and (4) the required power.
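Both calculations are available in R through power.anova.test: supply all but one of the quantities and the function solves for the one left out (the significance level defaults to .05).  A minimal sketch, reusing the hypothetical numbers above:

> power.anova.test(groups = 4, n = 10, between.var = 2, within.var = 10)       # solves for power
> power.anova.test(groups = 4, between.var = 2, within.var = 10, power = 0.90) # solves for n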

Learning goals:  Understand the concept of power and how it is related to sample size and design.  Understand the difference between the distribution that a test statistic would have if the null hypothesis is true and the distribution it would have if the alternative is true, and how this depends on the size or pattern of the treatment effect.   Know what must be specified and what must be estimated to compute power or required sample size for an experiment.  Know how to obtain these values with standard statistical software. 

Example:  Insect Spray Experiment

With the insect spray experiment, the analysis of variance rejected the null hypothesis of no treatment effect very strongly.  Now we'll ask: with differences so large among the treatment means, would it have been possible to use smaller sample sizes and still have a power (probability of rejecting) of .90, while holding the significance level at .05?  Then we'll ask: with the sample sizes we had and the treatment effects so large, what was the power?

> names(InsectSprays)  # print the names of the columns of the insect spray data
[1] "count" "spray"
> attach(InsectSprays)  # make the variables "count" and "spray" available
> meancounts <- tapply(count,spray,mean)  # get the mean count for each spray
> meancounts  # print the sample means of counts for each treatment group
        A         B         C         D         E         F
14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
> var(meancounts)  # the variance of the group means (the between-group variance)
[1] 44.48056
> anova(lm(count~spray)) # use the anova table to get mean square residual
                                              # as an estimate of within-group variance
Analysis of Variance Table

Response: count
          Df  Sum Sq Mean Sq F value    Pr(>F)
spray      5 2668.83  533.77  34.702 < 2.2e-16 ***
Residuals 66 1015.17   15.38
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that the mean square for spray, if divided by the within-group sample size 12, equals 44.48, the between-group variance.
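This identity can be checked directly from the fitted model (the indexing below is just one way to pull the mean square out of the ANOVA table):

> anova(lm(count ~ spray))[["Mean Sq"]][1] / 12   # spray mean square divided by n
[1] 44.48056
> var(meancounts)                                 # the between-group variance again
[1] 44.48056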

# what sample sizes would actually have been sufficient to detect these differences
# between groups with power .90 (if the estimated effects were the actual effects)?

> power.anova.test(groups=6,between.var=44.48,within.var=15.38,power=.90)

     Balanced one-way analysis of variance power calculation

         groups = 6
              n = 2.28168
    between.var = 44.48
     within.var = 15.38
      sig.level = 0.05
          power = 0.9

 NOTE: n is number in each group

So because the treatment effects in this experiment are so pronounced, we really only needed between two and three experimental units in each treatment group to achieve 90 percent power.  That is, with the significance level set at .05, 2-3 units per group would provide a 90 percent chance of rejecting the null hypothesis of no treatment effect.  This calculation uses our estimate of within-group variance from the previous experiment.  The larger sample sizes actually used provide additional power for making multiple comparisons between treatments, however.

# With 12 units assigned to each treatment, what is the power to detect effects as large as the ones we had?  (But note, the reasoning is shaky in interpreting this retroactively as the power we actually had.)

> power.anova.test(groups=6,n=12,between.var=44.48,within.var=15.38)

     Balanced one-way analysis of variance power calculation

         groups = 6
              n = 12
    between.var = 44.48
     within.var = 15.38
      sig.level = 0.05
          power = 1

 NOTE: n is number in each group

# With such big differences between some of the treatments, the power was
# virtually 1. 

If there are only two treatment groups, one can alternatively use the R command power.t.test.  There, it is only necessary to specify the difference between means that needs to be detected (not a variance among means).  Check "?power.t.test" for the exact calling sequence.
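For instance, to find the per-group sample size needed to detect a difference of 5 insects between two sprays, with power .90 at significance level .05, and taking sqrt(15.38) from the ANOVA table above as an assumed within-group standard deviation:

> power.t.test(delta = 5, sd = sqrt(15.38), sig.level = 0.05, power = 0.90)

As with power.anova.test, R solves for whichever of n, delta, sd, sig.level, and power is left unspecified, so the same command pattern computes power for a given n instead.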