a Department of Pathology, Geriatrics Center, University of Michigan, Ann Arbor
b Institute of Gerontology, Geriatrics Center, University of Michigan, Ann Arbor
c Ann Arbor DVA Medical Center, Michigan
d Department of Geriatrics, University of Arkansas for Medical Sciences, Little Rock
Richard A. Miller, Box 0940, University of Michigan, 5316 CCGCB, 1500 E. Medical Center Drive, Ann Arbor, MI 48109-0940 E-mail: millerr{at}umich.edu.
Decision Editor: Edward J. Masoro, PhD
| Abstract |
|---|
New methods for the simultaneous assessment of the expression levels of hundreds or thousands of mRNAs in individual cell or tissue samples have caught the attention of cell and molecular biologists, gerontologists among them. The lure is obvious: Instead of laborious, one-at-a-time assays for a handful of cytokines, cell cycle regulators, surface proteins, or transcription factors, high-throughput approaches seem to promise a cornucopia of quantitative gene expression data from which to select the most promising candidate genes for further analysis, as well as "expression fingerprints" that are as informative, and as detailed, as real fingerprints or DNA restriction fragment length polymorphism patterns. Articles presenting lists of mRNAs allegedly over- or underexpressed in the tissues of aged rodents (1) or in tissues derived from skin biopsies of young or aged human donors (2) have appeared in prominent peer-reviewed journals and are sure to be merely the vanguard of a flood of articles reporting the effects of age, species, mutations, diets, antioxidants, and disease states on patterns of gene expression in multiple tissue and cell sources. These articles have included, and will continue to include, lists of specific mRNAs found to be altered to a specific degree (e.g., 2-fold or 10-fold changes) by the factor of interest, accompanied by a discussion of patterns perceived within the data set: arguments that the list of altered genes includes many genes involved in antioxidant defenses, or cell cycle control, or responses to specific hormones, etc.
This essay presents the viewpoint that the design and interpretation of the most prominent gene expression studies published to date, as well as the majority of those now being presented at meetings or making their way through the review queue, are seriously flawed, that the data sets are filled with false-positive results, and that conclusions made on the basis of such fragile foundations are likely to prove misleading and premature.
The goal of such studies is usually to produce a listing of the genes whose expression distinguishes two samples of interest, for example, the muscle of young mice from that of old mice. For these lists to be useful as tests of specific ideas about aging and as guides for further work, it is important that most of the findings be reproducible (i.e., likely to produce equally large effects in replicate sets of samples). What criteria, then, should be used to ensure that such lists of age-sensitive genes contain only a small proportion of false positives (i.e., nonreproducible findings)? We will consider two sorts of criteria: (i) that the list should include all genes with a young/old ratio over some arbitrary value and (ii) that the list should include all genes where the two age groups meet some statistical significance test that compares effect size with its variance among subjects, such as the Student's t test. A recent review (3) provides a more comprehensive analysis of the statistical problems and opportunities involved in extracting biological insights from expression data sets and cites many useful articles describing alternate approaches to microarray-based data mining.
| Ratio-Based Criteria (Young/Old [or Test/Control] Without Formal Significance Testing) |
|---|
Table 1 shows the results of this simulation study. For a data set in which, for example, all of the genes show a coefficient of variation (CV = 100 × SD / mean) of 30%, the table shows that 58 genes per 10,000 would produce, by chance, a mean young/old ratio of twofold or higher. The criterion used in the study by Lee and colleagues (2) would increase the number of false positives from 58 to 212 (i.e., 2.1% of the genes examined). At this CV, only 1 gene per 10,000 would produce a fourfold change by chance alone, although the calculation adopted by Lee and colleagues would produce 16 false positives with fourfold changes. A less restrictive threshold (i.e., a 1.5-fold change) would produce false positives for 6.9% of the genes tested, or 10.1% using Lee and colleagues' criterion.
One violation of the normality assumption deserves special mention: instances in which the levels of expression of a specific gene turn out to be bimodal among individual subjects. Documentation of genuine bimodality requires fairly large sample sizes, but even in small samples such genes are associated with very large CVs. Of the 153 genes shown in Fig. 1, 15 have CVs > 100. (Actually, Table 2 produces underestimates of the false-positive rate because it assigns CVs = 100 to all genes where CV > 100.) Most of these genes show high-level expression in only one or two of the four mice tested, with zero or near-zero expression in the other animals. Genes like these whose expression is sporadic among similar mice are particularly likely to give very high ratios in small series of this kind. If for a particular gene the distribution of expression among mice is truly bimodal or contains an occasional outlier (assumptions that cannot be tested without much higher numbers of animals), then assessment of the effects of age, treatment, or genotype on expression may be particularly difficult. Demonstration that genes with high ratios appear in two independent short series does not provide an adequate test against type I errors because genes with high variance will indeed frequently appear to differ, by ratio, between small groups of subjects even if there is no real effect of the diet or genotype under study.
The number of expected false-positive results depends on the number of mice (or other samples) tested in the study. Table 3 shows simulation results for varying SD (from 10% to 50% of the mean) for study designs utilizing two, three, four, or five animals in each of the two test groups. Using a criterion of a twofold change, for example, experiments in which CVs = 30% will yield 223 false positives if only two mice are used per group but will yield a mere six false positives if five mice are used per group. For CVs of 50%, and for designs with n = 2 mice per group, as many as 143 genes per 10,000 would by chance produce false-positive findings even when using a fourfold change as the criterion for acceptance; however, this number falls to near zero when n = 5 mice per group. Calculations similar to those shown in Table 1, Table 2, and Table 3 can be used to estimate the number of false positives expected for any given empirical distribution of CVs. We recommend that those groups wishing to report gene array results without formal statistical evaluation of significance should accompany their reports of two- and threefold changes with a comparison table showing the numbers of false-positive results to be expected from their experimental design and observed distribution of CVs.
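A calculation of this kind can be sketched in a few lines of Python. The sketch below is illustrative only and is not the simulation used to build the tables: it assumes that expression values are normally distributed around a common mean of 1.0 with no true age effect, that a gene is flagged when the ratio of group means reaches the fold threshold in either direction, and that genes with a non-positive group mean are skipped. The function name `count_ratio_false_positives` and all of its parameters are hypothetical, and the counts it produces need not match the tabulated values.

```python
import random


def count_ratio_false_positives(n_per_group, cv_percent, fold=2.0,
                                n_genes=10_000, seed=0):
    """Count null genes whose group-mean ratio reaches `fold` by chance.

    Assumptions (not taken from the article): normal expression values,
    identical true mean (1.0) in both groups, ratio tested in both directions.
    """
    rng = random.Random(seed)
    sd = cv_percent / 100.0  # mean is fixed at 1.0, so SD equals CV / 100
    hits = 0
    for _ in range(n_genes):
        young = sum(rng.gauss(1.0, sd) for _ in range(n_per_group)) / n_per_group
        old = sum(rng.gauss(1.0, sd) for _ in range(n_per_group)) / n_per_group
        # A non-positive mean has no meaningful ratio; skip such genes.
        if young <= 0.0 or old <= 0.0:
            continue
        ratio = max(young / old, old / young)
        if ratio >= fold:
            hits += 1
    return hits


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        fp = count_ratio_false_positives(n_per_group=n, cv_percent=30.0)
        print(f"n = {n} per group: {fp} chance twofold 'changes' per 10,000 genes")
```

Replacing the single `cv_percent` value with CVs drawn from an observed distribution would give the kind of comparison table recommended above for a particular experimental design.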
| Criteria Based Upon Formal Significance Testing |
|---|
A key problem with a t-test-based approach in the context of gene expression screening is that it ignores multiple comparison artifacts. Consider a hypothetical situation in which a postdoctoral scientist decides to measure expression levels of 10,000 genes in each of 20 young and 20 old mice and to make her biological interpretations on the basis of those genes where the age effect is large and consistent enough to reach p(t) < .05. Alas, unbeknownst to this researcher, a disgruntled technician has switched the identification codes on all the mice at random, so that the nominally "young" group actually contains an equal number of young and old animals. Among 10,000 genes, however, 1 in every 20 will, entirely by chance, reach p(t) = .05; the postdoc, not knowing of the deception, is pleased to find 500 genes that show "significant" age effects, and she makes her interpretation and conducts years of follow-up analyses on the basis of these entirely spurious and unreproducible findings. The problem, well described in most elementary statistics texts, is that a significance criterion of .05 does not protect against false-positive conclusions in a large series of tests.
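The hypothetical experiment is easy to imitate numerically. The following minimal sketch assumes that every gene is drawn from the same normal distribution in both groups (so that no true age effect exists), applies an equal-variance two-sample t test, and compares the statistic with the tabulated two-sided 5% critical value for 38 degrees of freedom. The names `count_null_hits`, `sample_mean_var`, and `T_CRIT_DF38` are illustrative and do not come from any published analysis.

```python
import random

# Tabulated two-sided 5% critical value of Student's t with 38 degrees of
# freedom; it applies only to the 20-versus-20 design simulated here.
T_CRIT_DF38 = 2.024


def sample_mean_var(values):
    """Return the mean and the unbiased sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


def count_null_hits(n_genes=10_000, n_per_group=20, seed=1):
    """Count genes reaching p(t) < .05 when the group labels carry no information."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        mean_a, var_a = sample_mean_var(group_a)
        mean_b, var_b = sample_mean_var(group_b)
        pooled_var = (var_a + var_b) / 2.0  # equal group sizes
        t_stat = (mean_a - mean_b) / (pooled_var * 2.0 / n_per_group) ** 0.5
        if abs(t_stat) > T_CRIT_DF38:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_null_hits(), "of 10,000 null genes reach p(t) < .05")
```

Under these assumptions the count should fall close to the 500 spurious "age effects" of the example, although the exact number varies with the random seed.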
The Bonferroni procedure is the accepted way to adjust significance criteria in such a situation. When testing 1000 hypotheses simultaneously, for example, one would use as criterion a p value of .05/1000 = .00005. Such a criterion is very conservative in the sense that it tends to produce large numbers of false negative conclusions; it tends, in other words, to make it hard to accept as proven hypotheses that are in fact true. If an experiment testing 1000 genes produces p values < .00005 for, say, 8 genes, one could confidently conclude that all eight genes are likely to distinguish old from young mice; there would be only 1 chance in 20 that any of the eight effects is due to chance alone. Producing such a low p value requires either very large numbers of animals or very small interanimal SDs, much smaller than are seen in practical cases. (Evidence that the experimental system in question gives very reproducible values for replicate aliquots of the same sample is not germane; the variation in weight among a set of laboratory members, for example, is not diminished by weighing them on a scale accurate at the microgram level.) If a survey of 10,000 genes shows that 20 of them reach p(t) = .001, it is likely that some of these 20 will prove reproducible in subsequent tests, but it is not possible to know which ones without further experimental data.
One way of dealing with this problem is to use a two-stage experimental design. The first stage is used for hypothesis generation: all genes are tested and ranked in order of statistical probability. In a typical case, few if any of the genes will show a sufficiently large age effect, with sufficiently low interanimal variance, to meet the Bonferroni criterion (p = .000005 for a set of 10,000 genes), but some are likely to provide suggestive evidence of a real effect, say p < .001. The second stage, then, involves testing a separate set of animals, using either the array method or some other convenient test (RT-PCR or RNase protection assays, for example) for each of the genes that show the most extreme probabilities in the initial survey. If, for example, the initial screen generates a list of 25 genes where p < .001, the second, hypothesis-testing phase of the study can employ a value of p = .05/25 = .002 as its criterion for hypothesis confirmation; any genes that reach this level in the second stage can be accepted as age-sensitive, at least in this organ, genotype, and age range.
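The bookkeeping for such a two-stage design can be sketched as follows. This is an illustration under stated assumptions and not a prescribed procedure: p values are supplied as plain dictionaries keyed by gene name, a candidate that was not retested in the second stage is treated as unconfirmed, and the function names `bonferroni_threshold` and `two_stage_confirm` are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance criterion that holds the familywise error at alpha."""
    return alpha / n_tests


def two_stage_confirm(stage1_p, stage2_p, screen_cutoff=0.001, alpha=0.05):
    """Return genes that pass the first-stage screen and are confirmed in stage two.

    stage1_p: gene -> p value from the initial survey (hypothesis generation)
    stage2_p: gene -> p value from fresh animals (hypothesis testing)
    """
    candidates = [gene for gene, p in stage1_p.items() if p < screen_cutoff]
    if not candidates:
        return []
    # Bonferroni correction over the short candidate list only,
    # not over every gene on the array.
    threshold = bonferroni_threshold(alpha, len(candidates))
    # Candidates missing from stage two default to p = 1.0 (not confirmed).
    return [gene for gene in candidates if stage2_p.get(gene, 1.0) < threshold]


if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 10_000))  # criterion for a 10,000-gene survey
    print(bonferroni_threshold(0.05, 25))      # criterion for 25 retested candidates
```

The point of the design is visible in the two thresholds: correcting over 25 candidates rather than 10,000 genes relaxes the confirmation criterion by a factor of 400.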
This method, like any method using small numbers of animals to examine traits with high variance, is likely to suffer from a high false negative rate: those genes that show above-average interanimal variance will not produce significant p values at any stage of the analysis in tests that use only 5 to 10 mice per group. Investigators who have invested considerable effort in large-scale gene scanning surveys may therefore wish to make public (either in a formal report or in an associated electronic archive) lists of genes that show relatively large effects (say two- or threefold changes) even if these do not approach statistical significance; genes that show large effects, even with high interanimal variation, may still deserve further attention if the patterns of expression suggest or refute specific biological theories of interest.
If the cost of the animals (or human samples or cell lines) is relatively small compared with the overall cost of the testing program, it may be useful to carry out the initial first-stage survey using pools instead of individuals. If, for example, a group of 24 young mice can be tested as six pools of 4 animals, the statistical analysis must treat this as n = 6 replicates, but the variation among the six pools is likely to be a good deal less than the variation expected from among six individual animals. Comparing six pools of young samples with six pools of old samples should increase the number of genes that achieve some arbitrary p value (e.g., p < .001) in the initial screen, and genes that appear promising in this initial survey can then be retested using fresh samples from individual animals of the age or treatment groups of interest. It may be possible to develop specialized methods for determining the optimal pooling strategy for the initial screening step, but it will be difficult to reduce these to simple rules because the decision will depend on the cost of each assay relative to the cost of preparing each sample and because the optimal pooling plan will differ for genes with different CVs.
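The expected reduction in variation from pooling can be illustrated with a short simulation. The sketch assumes an idealized case in which a pool's measured expression is the simple average of its members' values, expression is normally distributed among animals, and assay (technical) error is ignored; the function name `pooled_vs_individual_sd` and its parameters are hypothetical.

```python
import random
import statistics


def pooled_vs_individual_sd(pool_size=4, n_units=6, cv_percent=30.0,
                            n_trials=2000, seed=0):
    """Compare the average among-unit SD for individuals and for pools.

    Each trial draws `n_units` individual animals and `n_units` pools of
    `pool_size` animals, then records the sample SD within each set.
    """
    rng = random.Random(seed)
    sd = cv_percent / 100.0  # true mean fixed at 1.0
    individual_sds = []
    pool_sds = []
    for _ in range(n_trials):
        individuals = [rng.gauss(1.0, sd) for _ in range(n_units)]
        pools = [
            sum(rng.gauss(1.0, sd) for _ in range(pool_size)) / pool_size
            for _ in range(n_units)
        ]
        individual_sds.append(statistics.stdev(individuals))
        pool_sds.append(statistics.stdev(pools))
    return statistics.fmean(individual_sds), statistics.fmean(pool_sds)


if __name__ == "__main__":
    individual_sd, pool_sd = pooled_vs_individual_sd()
    print(f"mean SD among 6 individuals: {individual_sd:.3f}")
    print(f"mean SD among 6 pools of 4:  {pool_sd:.3f}")
```

Under these idealized assumptions the SD among pools of four is roughly half that among individuals (a factor of 1 divided by the square root of the pool size); any assay error, which pooling does not reduce, would make the real gain smaller.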
| Alternate Approaches to Data Interpretation |
|---|
In addition, there is a repertoire of methods for defining clusters of genes based on similar (or, more generally, correlated) patterns of responses, for example after antigenic or nutrient stimulation, across different tissue types, or among sets of individual tumors. There is at present, however, no consensus as to which of the many alternative procedures are optimal for extracting biological information from these correlation matrices. Claverie (3) includes an excellent introduction to these approaches with an outline of the problems involved. At their best (7), clustering methods can reveal previously unsuspected relationships among genes not known to exhibit coordinated regulation and can provide new diagnostic tools for sorting individual tumors or individuals on the basis of expression patterns. Effective application of these clustering methods, however, requires not merely sophisticated selection among alternate clustering algorithms, but also very large numbers of tested individuals, numbers well beyond the small sample sizes so far tackled by experimental gerontologists.
| Summary |
|---|
| Acknowledgments |
|---|
Received June 20, 2000
Accepted August 9, 2000
| References |
|---|