Department of Biostatistics, University of Alabama at Birmingham
David B. Allison, Department of Biostatistics, Section on Statistical Genetics, Ryals Bldg, Suite 327, 1665 University Blvd, Birmingham, AL 35294 E-mail: Dallison{at}ms.soph.uab.edu.
Decision Editor: John A. Faulkner, PhD
| Abstract |
|---|
The advent of microarray technology for gene expression measurements opens many exciting opportunities and challenges in aging research (1)(2). One of the major challenges involves determining whether a numerical difference in gene expression observed in a sample among two or more groups, conditions, or tissues represents a "statistically significant" difference (3). This is challenging in part because microarrays allow one to test simultaneously for differences in thousands of genes, thereby creating a problem of multiple inference if one stays in the frequentist (4) null hypothesis testing paradigm (5). For example, if the expression of each gene is compared at the .05 significance level on a microarray containing 10,000 genes and the null hypothesis of no difference in gene expression is true for all genes, we would expect to observe approximately 500 false positives, that is, genes for which a statistically significant difference is observed when there is, in truth, no difference. This issue of multiple comparisons has long been a thorn in the side of researchers. There are many accepted strategies for adjusting the significance level to ensure that the probability of making any false positive is equal to or below the desired significance level. For example, a Bonferroni correction (6) divides the desired experimentwise alpha level by the number of tests. Each individual test is then conducted at this Bonferroni-adjusted alpha level. A criticism of this and similar approaches is that, by controlling the false-positive rate so strictly, one becomes more likely to observe false negatives, or differences that fail to reach statistical significance when an actual difference exists.
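The arithmetic in this paragraph can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are ours, and the inputs are the 10,000 genes and the .05 level used in the example above.

```python
# Multiple-testing arithmetic for the 10,000-gene example.
k = 10_000      # number of genes (hypothesis tests)
alpha = 0.05    # unadjusted per-test level, also the desired experimentwise level

# Expected number of false positives when every null hypothesis is true
# and each gene is tested at the unadjusted level.
expected_false_positives = k * alpha

# Bonferroni correction: divide the experimentwise level by the number of tests.
bonferroni_alpha = alpha / k

print(expected_false_positives)
print(bonferroni_alpha)
```

The first value is the roughly 500 false positives mentioned in the text; the second is the much stricter per-test level that a Bonferroni correction would impose.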
An alternative approach for meeting this challenge was suggested in a recent paper (7) which "advocates a two-stage design in which significance testing applied to exploratory data is used to guide a second round of hypothesis-testing experiments conducted in a separate set of experimental studies" (p. B52). Ideally, this method would control the experimentwise alpha level or type 1 error rate (the probability of making any false positives in the study) while making fewer type 2 errors than the single-stage design. However, as previously noted (8), evidence has yet to be offered that the two-stage design procedure either controls the experimentwise type 1 error rate or reduces the risk of type 2 errors. The purpose of this brief paper is to examine the type 1 error rate and power of the two-stage design and evaluate how these statistical properties compare with those of a single-stage design.
| The Two-Stage Approach |
|---|
At stage 1, each of the k genes is tested at a per-test alpha level ($\alpha_1$) that is greater than the alpha level that would be required by a Bonferroni correction. No specific guidance is given as to how to choose $\alpha_1$, but Miller and colleagues use $\alpha_1 = .001$ for an example with k = 10,000, suggesting perhaps that they mean for $\alpha_1$ to be set somewhat below the more conventional .05. The k hypothesis tests are then performed, yielding some number of genes, m, with significant effects ($0 \le m \le k$). Then, at stage 2, a second independent set of data is gathered and only those m genes found to be significant at stage 1 are tested at level $\alpha_2 = .05/m$. Any gene significant at level $\alpha_2$ "can be accepted as age sensitive" ((7); p. B55). Presumably, although not explicitly stated, a two-tailed hypothesis test is conducted at stage 1, whereas a one-tailed test is conducted at stage 2. The stage 2 test should test only in the direction that the apparent effect was observed at stage 1 because it would make little sense to conclude that there is an effect on the basis of two random samples producing significant results in opposite directions.
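The decision rule just described can be sketched as a small Python function. This is a minimal illustration under our own assumptions, not code from Miller and colleagues: the function name, argument names, and the toy p values are hypothetical, and the stage 2 p values are assumed to be one-tailed in the direction observed at stage 1.

```python
def two_stage_select(p_stage1, p_stage2_one_sided, alpha1=0.001, alpha_total=0.05):
    """Return the indices of genes declared significant by the two-stage rule.

    p_stage1: two-tailed p values for all k genes from the stage 1 sample.
    p_stage2_one_sided: one-tailed p values from the independent stage 2 sample
        (only the entries for genes that pass stage 1 are consulted).
    """
    # Stage 1: retain the genes whose p value falls below alpha1.
    passed = [i for i, p in enumerate(p_stage1) if p < alpha1]
    m = len(passed)
    if m == 0:
        return []
    # Stage 2: Bonferroni-adjust for the m retained genes only.
    alpha2 = alpha_total / m
    return [i for i in passed if p_stage2_one_sided[i] < alpha2]


# Hypothetical p values for four genes.
p1 = [0.0004, 0.0200, 0.0002, 0.5000]
p2 = [0.0100, 0.0001, 0.0400, 0.3000]
selected = two_stage_select(p1, p2)
print(selected)
```

In this toy input, genes 0 and 2 pass stage 1, so stage 2 is run at level .05/2; only gene 0 survives. Gene 1 is never examined at stage 2, however small its stage 2 p value.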
This two-stage testing procedure was offered as one way of dealing with the problem of false positives (type 1 errors) that would result from multiple significance testing without correction and false negatives (type 2 errors) that would result from the use of a Bonferroni correction. However, concrete information indicating that this approach will achieve these goals has yet to be presented. First, although Miller and colleagues do not state exactly to what overall alpha level this procedure holds the entire experiment, their use of .05 in determining
$\alpha_2$ suggests that perhaps they intend to achieve an overall experimentwise alpha level of .05. No information has been offered as to whether the proposed design actually controls the experimentwise alpha level. Second, no information has been offered to indicate that this procedure is more powerful than simply testing all data together in a single-stage design with a method that controls the experimentwise alpha level. These questions are further evaluated in the paragraphs that follow. It may be worth pointing out that the two-stage procedure under discussion is aimed at reducing purely stochastic threats to statistical inference and should be seen as distinct from constructive-type replications that have independent value in helping to eliminate nonstochastic threats to valid inference (9).
| Type 1 Error (False-Positive) Rate |
|---|
Testing the m genes retained for stage 2 at level $\alpha_2 = .05/m$ would, taken by itself, hold the experimentwise alpha level ($\alpha_{ew}$) to a value close to (though just slightly less than) .05. However, because we are testing in two stages, a type 1 error will be made only if genes for which no difference truly exists show significant differences (false positives) at both stages. To obtain the experimentwise alpha level, first compute the alpha level at stage 2 given that m tests were significant at stage 1. Then, because the number of genes significant at stage 1 (M) is a random variable, the experimentwise alpha level is obtained by computing the weighted sum over all possible outcomes, m, with the weights representing the probability of that outcome:

$$\alpha_{ew} = \sum_{m=1}^{k} P(\text{at least one type 1 error at stage 2} \mid M = m)\, P(M = m),$$

$$\alpha_{ew} = \sum_{m=1}^{k} \left[1 - \left(1 - \frac{.05}{m}\right)^{m}\right] P(M = m).$$

Finally, if the probability of rejecting each test equals $\alpha_1$, the probability of observing m significant tests out of k independent tests can be described by the binomial distribution with parameters k and $\alpha_1$. When this is taken into account, the experimentwise alpha level for the two-stage design can be written as

$$\alpha_{ew} = \sum_{m=1}^{k} \left[1 - \left(1 - \frac{.05}{m}\right)^{m}\right] \binom{k}{m} \alpha_1^{m} (1 - \alpha_1)^{k-m}.$$
Clearly, $\alpha_{ew}$ is affected by the choice of $\alpha_1$. To demonstrate this dependence, consider the example used by Miller and colleagues (7) where k = 10,000, $\alpha_1 = .001$, and $\alpha_2 = .05/m$. Under this circumstance, $\alpha_{ew} = .049$, which is very close to the level of .05 that might be desired. However, if $\alpha_1$ were switched to .0001, a value still greater than the Bonferroni-corrected value (.05/10,000), then the overall type 1 error rate becomes only .0314. Furthermore, because the stage 2 experimentwise alpha level will be less than or equal to .05 whenever at least one gene reaches stage 2, a simple bound on the experimentwise alpha level for the two-stage design is as follows:

$$\alpha_{ew} \le .05\, P(M \ge 1) = .05\left[1 - (1 - \alpha_1)^{k}\right].$$

This demonstrates that the two-stage design is often conservative, leading to an experimentwise alpha level that is lower than that desired. Furthermore, as the choice of $\alpha_1$ becomes smaller, the experimentwise alpha level of the two-stage design becomes more conservative.
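The experimentwise alpha level for the two-stage design can be evaluated numerically. The sketch below assumes, as the derivation does, that all k null hypotheses are true and the tests are independent; it sums the stage 2 error rate $1 - (1 - .05/m)^m$ over the binomial distribution of M. The function names are ours, and the binomial probabilities are computed on the log scale only to avoid overflow for large k.

```python
import math


def binom_pmf(m, k, p):
    """Binomial probability P(M = m), computed on the log scale."""
    log_pmf = (math.lgamma(k + 1) - math.lgamma(m + 1) - math.lgamma(k - m + 1)
               + m * math.log(p) + (k - m) * math.log1p(-p))
    return math.exp(log_pmf)


def two_stage_alpha(k, alpha1, alpha_total=0.05):
    """Experimentwise type 1 error rate when every null hypothesis is true."""
    total = 0.0
    for m in range(1, k + 1):
        # Chance of at least one false positive among m independent
        # stage 2 tests, each conducted at level alpha_total / m.
        stage2_error = 1.0 - (1.0 - alpha_total / m) ** m
        total += stage2_error * binom_pmf(m, k, alpha1)
    return total


def upper_bound(k, alpha1, alpha_total=0.05):
    """Bound obtained by replacing the stage 2 error rate with alpha_total."""
    return alpha_total * (1.0 - (1.0 - alpha1) ** k)


alpha_ew_001 = two_stage_alpha(10_000, 0.001)
alpha_ew_0001 = two_stage_alpha(10_000, 0.0001)
print(round(alpha_ew_001, 4), round(alpha_ew_0001, 4))
print(round(upper_bound(10_000, 0.0001), 4))
```

For k = 10,000 the two calls give approximately .049 and .031, in line with the values quoted in the text, and the bound for $\alpha_1 = .0001$ sits just above the exact value.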
As the formula above shows and the example illustrates, the two-stage procedure fails to consistently hold the overall alpha level at .05. In contrast, one can more easily achieve the goal of holding $\alpha_{ew} = .05$ in a single-stage design by simply setting $\alpha_1 = 1 - (1 - .05)^{1/k}$ and conducting only a stage 1 analysis. This provides a correction that is nearly equivalent to the Bonferroni correction, but slightly less conservative. Nevertheless, perhaps the two-stage procedure will reduce the type 2 error rate (i.e., increase power) relative to a one-stage procedure with $\alpha_1 = 1 - (1 - .05)^{1/k}$.
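As a quick numerical check of the single-stage correction, the following sketch compares $1 - (1 - .05)^{1/k}$ (often called the Šidák correction) with the Bonferroni level for k = 10,000 independent tests. The variable names are illustrative only.

```python
import math

k = 10_000
alpha_ew = 0.05

# Per-test level that gives an experimentwise level of exactly alpha_ew
# for k independent tests: 1 - (1 - alpha_ew) ** (1 / k).
# expm1/log1p keep precision for such a small result.
per_test_alpha = -math.expm1(math.log1p(-alpha_ew) / k)

# Bonferroni per-test level, for comparison.
bonferroni_alpha = alpha_ew / k

# Experimentwise level actually achieved by each choice.
achieved_exact = 1.0 - (1.0 - per_test_alpha) ** k
achieved_bonferroni = 1.0 - (1.0 - bonferroni_alpha) ** k

print(per_test_alpha, bonferroni_alpha)
print(achieved_exact, achieved_bonferroni)
```

The two per-test levels differ only in the third significant digit, which is why the correction is described as nearly equivalent to Bonferroni but slightly less conservative.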
| Power and False Negatives |
|---|
Let the effect size be the standardized mean difference

$$\delta = \frac{\mu_1 - \mu_2}{\sigma},$$

where $\mu_1$ is the population (not sample) mean level of gene expression for one group of subjects (group 1), $\mu_2$ is the population mean for the second group (group 2), and $\sigma$ is the common within-group standard deviation. Then, assuming equal numbers of subjects per group, and denoting the total number of subjects by 2n, one finds that the noncentrality parameter for the t distribution with 2n - 2 degrees of freedom (df) for testing the between-group difference is

$$\lambda_1 = \delta \sqrt{n/2}.$$

Let $t_{\alpha,\nu}$ represent the value that cuts off the upper $100\alpha$ percentile of the central t distribution with $\nu$ degrees of freedom and let $F(x, \nu, \lambda)$ denote the cumulative distribution function at the point x of a noncentral t distribution with $\nu$ degrees of freedom and noncentrality parameter $\lambda$. The power for single-stage testing can be written as $P_1 = 1 - F(t_{\alpha_1/2,\,2n-2},\; 2n-2,\; \lambda_1)$.
If one splits one's sample into two nonoverlapping subsamples to be used in the two stages, then the power calculation is somewhat more complex. Miller and colleagues (7) did not state how the subjects should be divided between the two stages, but for our subsequent calculations we will assume subjects are divided equally between the two stages. Because a significant result will occur only if the gene showed statistically significant differences at both stages, the power for two-stage testing will equal the product of the probability (power) of obtaining a significant result at stage 1 and the probability (power) of getting a significant result at stage 2 given a significant result at stage 1. The probability of getting a significant result at stage 1 can be derived just as above for single-stage testing, with the exception that the sample size will now be half of that used in the single-stage design. As a consequence, the noncentral t distribution will now have
$$\lambda_2 = \delta \sqrt{n/4}$$

and df = n - 2. Once a significant result is obtained at stage 1, the conditional probability of getting a significant result at stage 2 depends on how many other genes (C) were declared significant at stage 1. Because C is a random variable, we must then sum these conditional powers over all possible outcomes for C, weighting by the probability of that outcome. Then we can write the power for two-stage testing as

$$P_2 = \sum_{c=0}^{k-1} \Big[ P(C = c)\, \big\{1 - F(t_{\alpha_1/2,\,n-2},\; n-2,\; \lambda_2)\big\}\, \big\{1 - F(t_{.05/(c+1),\,n-2},\; n-2,\; \lambda_2)\big\} \Big].$$
Note that the third term within the square brackets on the right side of the equation is for the one-tailed test at stage 2. For any given value of c, P(C = c) depends on the power of the tests for the other genes that may vary from one data set to the next and will be unknown. Therefore, in order for us to calculate P(C = c) we need to assume some model. For simplicity, we assume that the null hypothesis is true for all genes except the one for which we are calculating power. Were we to assume the null hypothesis is not true for other genes as well, we would increase the probability of declaring a larger number of genes significant at stage 1. As more genes are declared significant at stage 1, the stage 2 alpha level used for each individual test will become smaller, hence reducing the stage 2 power for that test. As a consequence, the overall power will be smaller than it would be under the assumption that the null hypothesis is true for all genes except the one of interest. That is, by assuming that the null hypothesis is true for all genes except the one for which we are calculating power, we are, for any particular effect size and sample size, deriving the maximum possible power for the two-stage procedure given by Miller and colleagues (7). Using the same reasoning as for M above, we find that C will follow a binomial distribution with parameters k - 1 and
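$\alpha_1$, and we can write the power for the two-stage design as

$$P_2 = \sum_{c=0}^{k-1} \left[ \binom{k-1}{c} \alpha_1^{c} (1 - \alpha_1)^{k-1-c}\, \big\{1 - F(t_{\alpha_1/2,\,n-2},\; n-2,\; \lambda_2)\big\}\, \big\{1 - F(t_{.05/(c+1),\,n-2},\; n-2,\; \lambda_2)\big\} \right].$$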
| Quantitative Results |
|---|
Holding $\alpha_{ew} = .05$ with single-stage testing of 10,000 genes requires a per-test alpha of $\alpha_1 = 5.13 \times 10^{-6}$. Fig. 1 and Fig. 2 compare the power for single-stage and two-stage testing with two different levels of $\alpha_1$ (.001 and .0001) across a range of effect sizes. Fig. 1 corresponds to the example of 10 subjects per group (7), whereas Fig. 2 shows the same comparison for a study with 20 subjects per group. The figures clearly demonstrate that the two-stage procedure of Miller and colleagues does not achieve the goal of providing a method that reduces false negatives (i.e., increases power). In fact, it can even exacerbate the very problem it is intended to alleviate. For example, under the scenario considered in that paper ($\alpha_1 = .001$, $\delta = 3.0$, and n = 10), the power for a single-stage test is .614 (calculations were conducted by using SAS/IML [SAS Institute, Cary, NC]) but the power is only .409 for the two-stage approach.
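The power comparison can be approximated in Python with SciPy's noncentral t distribution. This sketch is not the authors' SAS/IML code; it is our reading of the power expressions, and it rests on several assumptions: n subjects per group in the single-stage design, an equal split of subjects between stages, the null hypothesis true for the other k - 1 genes, and only the upper tail counted for the two-tailed tests (the lower tail is negligible here). Function and variable names are hypothetical.

```python
import numpy as np
from scipy import stats


def single_stage_power(delta, n, alpha):
    """Power of a two-tailed t test with n subjects per group (upper tail only)."""
    df = 2 * n - 2
    ncp = delta * np.sqrt(n / 2)
    t_crit = stats.t.isf(alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp)


def two_stage_power(delta, n, alpha1, k, alpha_total=0.05):
    """Power when the 2n subjects are split equally between the two stages."""
    df = n - 2                      # each stage uses n subjects, n / 2 per group
    ncp = delta * np.sqrt(n / 4)
    power_stage1 = stats.nct.sf(stats.t.isf(alpha1 / 2, df), df, ncp)

    # C = number of other genes passing stage 1, Binomial(k - 1, alpha1).
    # The sum is truncated where the remaining binomial tail is negligible.
    c_max = int(stats.binom.isf(1e-12, k - 1, alpha1))
    c = np.arange(c_max + 1)
    weights = stats.binom.pmf(c, k - 1, alpha1)

    # One-tailed stage 2 test at level alpha_total / (c + 1).
    t_crit2 = stats.t.isf(alpha_total / (c + 1), df)
    power_stage2 = stats.nct.sf(t_crit2, df, ncp)
    return power_stage1 * np.sum(weights * power_stage2)


k = 10_000
alpha_single = 1 - (1 - 0.05) ** (1 / k)
p_single = single_stage_power(3.0, 10, alpha_single)
p_two = two_stage_power(3.0, 10, 0.001, k)
print(round(float(p_single), 3), round(float(p_two), 3))
```

If these assumptions match the original calculation, the two printed values should land close to the .614 and .409 reported above; in any case the single-stage power should exceed the two-stage power for this scenario.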
Suppose that $\alpha_1 < \alpha_2$ and that the p value obtained for testing a gene in the first stage is less than $\alpha_1$ whereas the p value obtained at the second stage is between $\alpha_1$ and $\alpha_2$. With the use of the two-stage approach, differences for this gene would be declared significant. In contrast, a gene for which the p value at stage 1 is between $\alpha_1$ and $\alpha_2$ and the p value at stage 2 is less than $\alpha_1$ would not be declared significant. Yet the two situations offer equivalent evidence against the null hypothesis. The fact that evidence against the null hypothesis in this second situation is ignored shows that information is being discarded, and it is therefore not surprising that power is lost.

| Conclusions |
|---|
Under the two-stage procedure of Miller and colleagues (7), choosing a small value of $\alpha_1$, the per-test alpha level at stage 1, will result in an overall experimentwise type 1 error rate that is overly conservative. Less conservative, single-stage methods of holding the overall type 1 error rate to any desired level exist (10). Moreover, this two-stage method can also exacerbate the false-negative (type 2 error) rate; that is, it can decrease power compared with a single-stage method. Although the two-stage design does not fare well when compared statistically with the single-stage design, there may be nonstatistical concerns that increase the attractiveness of the two-stage design. For example, it is possible that such a two-stage approach could improve power per dollar spent on a study if, at stage 2, one needed only to assay a subset of all genes on the array used in stage 1 and the cost for assaying a subset was less than the cost for assaying the entire set. In such situations, both the costs and required resources per subject in stage 2 (and hence the entire study) might be substantially reduced. Furthermore, the two-stage design proposed by Miller and colleagues represents only one possible type of two-stage (or more generally multistage) design. It is possible that other two-stage designs could be proposed that have better statistical properties and compare more favorably with the single-stage design.
In attempting to interpret these results, one question that may be of primary interest to researchers concerns the size of the mean differences represented by the effect sizes shown in Fig. 1 and Fig. 2. To address this issue, we can offer the following information. Writing from the social science perspective, Cohen (11) defined "small," "medium," and "large" effect sizes as values of 0.20, 0.50, and 0.80, respectively. By this standard, an effect size of 3.0, as in the example considered by Miller and colleagues (7), is extremely large. However, in basic laboratory research, effect sizes are often much larger. For example, consider a study of a knock-out mouse model of hereditary hemochromatosis (12). In that study, when knock-out mice were compared with wild-type mice, liver iron concentrations were 170 ± 15 µg/g (mean ± standard deviation) in controls and 1010 ± 50 µg/g in beta 2m (-/-) mice. This represents an effect size of 22.6. Unfortunately, because fold change is not a statistic that takes within-group variability into account, there is no way to directly translate an effect size expressed as a standardized mean difference into a specific fold-change value.
Finally, the two-stage procedure discussed operates from a strictly frequentist point of view under the seemingly implausible assumption of the null hypothesis being true for all genes studied. Alternatives to a strict frequentist approach exist (e.g., (3)(13)(14)) and are seen by many (e.g., (15)) to be preferable when many tests are conducted and a global null hypothesis seems untenable.
| Acknowledgments |
|---|
Received July 9, 2001
Accepted January 1, 2002
| References |
|---|