Here, we review briefly the sources of experimental and biological variance that affect the interpretation of high-dimensional DNA microarray experiments. We discuss methods using a regularized t-test based on a Bayesian statistical framework that allow the identification of differentially regulated genes with a higher level of confidence than a simple t-test when only a few experimental replicates are available. We also describe a computational method for calculating the global false-positive and false-negative levels inherent in a DNA microarray data set. This method provides a probability of differential expression for each gene based on experiment-wide false-positive and -negative levels driven by experimental error and biological variance.
Gene expression array data can be analysed on at least three levels of increasing complexity. The first level is that of single genes, where one seeks to establish whether each gene in isolation behaves differently in a control versus an experimental or treatment situation. Here, experimental/treatment is to be taken, of course, in a very broad sense – essentially any situation different from the control. The second level is multiple genes, where clusters of genes are analysed in terms of common functionalities, interactions, co-regulation, etc. Gene co-expression can provide a simple means of gaining leads to the functions of many genes for which information is not currently available. This level includes dimensionality reduction and visualization techniques, such as hierarchical and k-means clustering and principal component analysis. It also includes methods that leverage DNA array data information to analyse DNA sequences, for example to find regulatory regions and motifs. Finally, the third level is the level of systems biology, where the goal is to infer and understand the underlying gene and protein networks that are ultimately responsible for the patterns observed at the systems level. The focus of this review is on the first and most basic level of analysis, differential analysis.
Although differential analysis tends to look at each gene in isolation and, as such, is still part of the old paradigm of ‘one gene at a time’, DNA microarray technology allows differential analysis to be conducted on a new genomic scale. Furthermore, there is a natural transition from differential analysis to clustering. In fact, differential analysis can be viewed as clustering into two categories, ‘changed’ or ‘not changed’, whereas differential analysis applied to more than two conditions results in multiple clusters. Differential analysis is an important tool for the identification of all the genes directly or indirectly regulated by a given protein or RNA molecule. Indeed, it is the first step in trying to define all the players in regulatory circuits and networks.
DNA microarray data variability and noise
Differential analysis of microarray data is difficult because of the variability inherent in these data. This variability results from a large number of disparate factors operating at different times and levels during the course of a typical experiment. These numerous factors are often inter-related in complex ways but, for the purpose of simplicity, they can be broken down into two major categories, biological variability and experimental variability. Other sources of variability involve DNA microarray fabrication methods as well as differences in imaging technology, signal extraction and data processing. Here, we restrict our discussion to the sources of variability for which the experimenter has immediate responsibility and control, experimental and biological variables.
In our experience, the largest source of error in the analysis of DNA microarray data comes from biological variations in individual mRNA expression levels in different cells or cell populations. For example, in the case of multiple human samples obtained from biopsy materials, biological variation can be extreme; samples will differ not only in genotype but also in cell types. Nevertheless, care should be taken to minimize such variables as much as possible. The ability to control these sources of biological variation in a model organism such as Escherichia coli with an easily manipulated genetic system presents an obvious advantage. However, even when genetically ‘identical’ cells cultured under ‘identical’ conditions are compared with one another, non-trivial and sometimes substantial differences in gene expression levels are observed (Arfin et al., 2000; Baldi and Hatfield, 2002). This variance results from a variety of influences including differences in the cells’ microenvironments (e.g. nutrient and temperature gradients), growth phase differences between cells in the culture, phase variations, periods of rapid change in gene expression and multiple additional stochastic effects that cannot be controlled. Furthermore, even though regulatory mechanisms appear to be deterministic and the short-term behaviour of cells and organisms ought to be somewhat predictable, there remains stochasticity in nanoscale regulatory chemistry as a result of thermal fluctuations and their consequences, such as the random distribution of a small number of transcription factor molecules between daughter cells during the mechanics of cell division. In some cases, expression of a gene in a cell can literally depend on a few molecules. These fluctuations influence the initiation and level of transcription. In fact, there are examples during development where cells or organisms seem to make use of molecular noise (McAdams and Arkin, 1999). 
Suggestions and methods such as the use of laser-capture techniques for the isolation of small numbers of cells of a common type (Best and Emmert-Buck, 2001) to reduce biological variance and experimental noise, as much as possible, are discussed elsewhere (Baldi and Hatfield, 2002).
Experimental variability comes from many sources, including methods by which samples are obtained or cells are cultured, methods for mRNA isolation, extraction and amplification for either non-polyadenylated RNA from bacteria or polyadenylated mRNA from other organisms, the size of experimental samples, hybridization conditions and labelling efficiencies, and contamination by genomic DNA or ribosomal RNA and other RNA species. Likewise, the growth conditions of cultured cells queried by DNA array experiments ought to be standardized if comparisons across different experiments in the same or different laboratories are to be carried out. These problems are further exacerbated if extreme care in the treatment and handling of the RNA is not taken during its extraction from the cell and its subsequent processing. For example, it is often reported that the cells to be analysed are harvested by centrifugation and frozen for RNA extraction at a later time. It is important to consider the effects of these experimental manipulations on gene expression and mRNA stability. If the cells encounter a temperature shift during the centrifugation step, even for a short time, this could cause a change in gene expression profiles resulting from the consequences of temperature stress (Gross, 1996). If the cells are centrifuged in a buffer with even small differences in osmolarity from the growth medium, this could cause a change in the gene expression profiles as a result of the consequences of osmotic stress (Higgins et al., 1988). Also, removal of essential nutrients during the centrifugation period could cause significant metabolic perturbations that would result in changes in gene expression profiles (Ninfa, 1996). Each of these and other experimentally caused gene expression changes will confound the interpretation of the experiment. These are not easy variables to control.
Therefore, the best strategy is to harvest the RNA as quickly as possible under conditions that ‘freeze’ it at the same levels that it occurs in the cell population at the time of sampling and inhibit RNase activities (Arfin et al., 2000). In a similar vein, if polyadenylation levels are differentially affected by different treatment conditions, then erroneous conclusions concerning gene expression levels from DNA array experiments performed with poly(A)-derived targets might be reached (Olivas and Parker, 2000; Baldi and Hatfield, 2002).
Other experimental considerations unique to bacteria include problems concerned with the isolation of non-polyadenylated mRNA from bacteria and problems associated with the rapid turnover of mRNA in bacterial cells (from a few seconds to several minutes versus several hours to days in eukaryotes). Originally, it was feared that targets produced from total RNA preparations would produce unacceptable backgrounds because ribosomal RNA and other non-mRNA cDNA products would cross-hybridize to array probes. An early solution was to synthesize a collection of oligonucleotide primers specific for the 3′ end of each bacterial open reading frame (ORF). However, it was discovered at an early stage that, in fact, quantitative measurements of relative mRNA levels could not be attained with 3′-specific primers. This is because rapid mRNA decay is initiated in bacteria by endonucleolytic cleavages followed by 3′ to 5′ exonucleolytic degradation. Therefore, if the initial endonucleolytic site is adjacent to the 3′ ORF-specific primer binding site, this region is rapidly degraded, and little or no steady-state message is extracted for primer extension labelling. Furthermore, if the 3′ ORF-specific primer binding site is located in a portion of the mRNA stabilized by secondary structure, it will be present in the cell at a high steady-state level. Thus, varying amounts of message will be extracted for primer extension labelling of each gene-specific transcript depending on the location and degradation rate of the primer site. These problems are overcome by the high-stringency conditions used for the hybridization of targets prepared from total RNA preparations to presynthesized arrays containing full-length ORF cDNA probes (Arfin et al., 2000). An additional advantage is that the random hexamer-labelling procedure produces RNA–DNA duplexes for primer extension from all the partial degradation products of each message.
As the exonucleolytic clearance of all mRNA degradation products to free nucleotides is constant, the steady-state level of these degradation intermediates is proportional to their rates of synthesis (Arfin et al., 2000; Baldi and Hatfield, 2002). Thus, more quantitative measurements of relative gene expression levels are obtained. However, with in situ-synthesized arrays containing short oligonucleotide probes, such as Affymetrix GeneChips, high-stringency hybridization conditions are not possible. In this case, alternative mRNA enrichment procedures coupled with direct biotinylation of mRNA products are used (Baldi and Hatfield, 2002).
Differential analysis of DNA microarray data
Data normalization procedures
Before we can determine the differential gene expression profiles between two conditions obtained from the data of two DNA array experiments, we must first ascertain that the data sets are comparable. That is, we must develop methods to normalize data sets in a way that accounts for sources of experimental and biological variations, such as those discussed above, that might obscure the underlying variation in gene expression levels attributable to biological effects. However, with few exceptions, the sources of these variations have not been measured and characterized. As a consequence, many array studies are reported without statistical definitions of their significance. This problem is exacerbated even further by the presence of many different array formats and experimental designs and methods. Although some theoretical studies that address this important issue have appeared in the literature, the normalization methods currently in common use are based on more pragmatic biological considerations (Zien et al., 2001). Basically, these methods attempt to correct for the following variables:
• number of cells in the sample;
• total RNA isolation efficiency;
• mRNA isolation and labelling efficiency;
• hybridization efficiency;
• signal measurement accuracy and sensitivity.
These methods, discussed in a recent book by Baldi and Hatfield (2002), include normalization to total or ribosomal RNA, normalization to housekeeping genes, normalization to a reference RNA and normalization by global scaling.
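As an illustration of the last of these, global scaling can be sketched as follows: each array is rescaled so that its total signal matches a common reference, under the (strong) assumption that overall mRNA output is comparable across conditions. The function name and intensity values below are hypothetical, not taken from any particular data set.

```python
def global_scale(arrays, target=None):
    """Rescale each array (a list of intensities) to a common total signal."""
    totals = [sum(a) for a in arrays]
    if target is None:
        target = sum(totals) / len(totals)   # mean total signal as reference
    return [[x * target / t for x in a] for a, t in zip(arrays, totals)]

# Hypothetical raw intensities; the treatment array is ~2x brighter overall,
# e.g. because of a labelling-efficiency difference rather than biology.
control   = [100.0, 250.0, 50.0]
treatment = [220.0, 480.0, 100.0]
norm_c, norm_t = global_scale([control, treatment])
# after scaling, both arrays have the same total signal, while the relative
# expression levels within each array are preserved
```

Note that global scaling corrects only multiplicative, array-wide effects; within-array relative levels are untouched, which is why it is usually combined with background subtraction and log transformation.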
Data analysis using a simple t-test
To begin with, we assume for simplicity that DNA microarray data consist of a set of replicate measurements for each gene, representing expression levels (or rather their logarithms) in both a control and a treatment situation. For each gene, the fundamental question we wish to address is whether the level of expression is significantly different in the two situations.
One approach commonly used in the literature, at least in the first wave of DNA microarray publications, was a simple fold approach, in which a gene is declared to have changed significantly if its average expression level varies by more than a constant factor, typically two, between the treatment and control conditions (Schena et al., 1995). Inspection of gene expression data suggests, however, that such a simple ‘twofold rule’ is unlikely to yield optimal results, as a factor of two can have quite different significance and meaning in different regions of the spectrum of expression levels, in particular at the very high and very low ends (Fig. 1). In a noisy environment, 2000/1000 or 2/1 can have a quite different significance. Small random fluctuations are much more likely to produce a change from 1 to 2 than from 1000 to 2000.
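This point can be illustrated with a small simulation. The noise model below is hypothetical (additive Gaussian noise of fixed absolute size, which roughly matches behaviour at the low end of the intensity scale), as are the function name and parameter values:

```python
import random

random.seed(0)  # deterministic for reproducibility

def fold_false_positives(true_level, noise_sd, trials=10000):
    """Fraction of noise-only replicate pairs that cross the twofold line."""
    hits = 0
    for _ in range(trials):
        a = true_level + random.gauss(0, noise_sd)
        b = true_level + random.gauss(0, noise_sd)
        if a > 0 and b > 0 and max(a / b, b / a) >= 2:
            hits += 1
    return hits / trials

low  = fold_false_positives(true_level=2,    noise_sd=1)  # low-expressed gene
high = fold_false_positives(true_level=2000, noise_sd=1)  # high-expressed gene
# the low-expressed gene crosses the twofold threshold from noise alone far
# more often than the high-expressed gene, even though neither truly changed
```

This is exactly why a fixed fold threshold yields many false positives at low expression levels and misses real but modest changes at high levels.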
Another method of identifying differentially expressed genes is to use a t-test, for instance on the logarithm of the expression levels. In a t-test, the empirical means mc and mt of the control and treatment populations, respectively, as well as their variances sc² and st², are used to compute a normalized distance between the two populations in the form:

t = (mc − mt) / √(sc²/nc + st²/nt)

where, for each population, m = Σi xi/n and s² = Σi (xi − m)²/(n − 1), and nc and nt are the numbers of control and treatment replicates. These are the well-known estimates of mean and variance.
When t exceeds a certain threshold, depending on the confidence level selected, the two populations are considered to be different. Because the distance between the population means is normalized by the empirical standard deviations, the t-test has the potential to address some of the shortcomings of the simple fixed-fold threshold approach. The fundamental problem with the t-test for array data, however, is that the repetition numbers nc and/or nt are often small because experiments remain costly or tedious to repeat. Small populations of size n = 1, 2 or 3 are still very common and lead to poor estimates of variance. Thus, a better framework is needed to address these shortcomings.
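As a sketch (not the authors' exact implementation), the per-gene statistic can be computed directly from the formulas above. The expression values below are hypothetical and are log-transformed before testing:

```python
import math

def t_statistic(control, treatment):
    """Unpaired (Welch) t-statistic between two replicate sets for one gene."""
    def mean_var(xs):
        n = len(xs)
        m = sum(xs) / n                                # empirical mean
        s2 = sum((x - m) ** 2 for x in xs) / (n - 1)   # unbiased variance
        return m, s2, n
    mc, sc2, nc = mean_var(control)
    mt, st2, nt = mean_var(treatment)
    return (mc - mt) / math.sqrt(sc2 / nc + st2 / nt)

# Hypothetical log-transformed expression levels for a single gene:
control   = [math.log(x) for x in (980.0, 1020.0, 1010.0)]
treatment = [math.log(x) for x in (1950.0, 2100.0, 2050.0)]
t = t_statistic(control, treatment)   # negative: expression is up in treatment
```

With only two or three replicates per condition, the variance estimates inside this computation are exactly the weak point that the regularized approach below addresses.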
Data analysis using a regularized t-test
The Bayesian approach to probability and statistics interprets probabilities as degrees of belief or confidence in hypotheses, rather than measured frequencies, and specifies how to update these probabilities as data are gathered. More specifically, Bayes' theorem allows one to compute the posterior probability P of a hypothesis H given data D in the form:

P(H|D) = P(D|H) P(H) / P(D)
where P(H) is the prior probability of the hypothesis, before gathering the data.
In the case of DNA microarray data, one of the problems in performing a t-test is to obtain accurate estimates of the standard deviation of individual gene measurements based on only a few measurements. However, it has been observed that an overall reciprocal relationship exists between variance and gene expression levels, and that genes expressed at similar levels exhibit similar variance (Fig. 1). Therefore, it is possible to use this prior knowledge in a Bayesian statistical framework to obtain more robust estimates of variance for any gene by examining the expression levels of other genes in the same expression neighbourhood within a single experiment. That is, to supplement the weak empirical estimates of single-gene variances across a small number of replicates with more robust estimates of variance obtained by pooling genes with similar expression levels. The alternative is to use a strict frequentist approach, in which the estimate of the standard deviation is compromised by the limited number of measurements of each gene that is typical of DNA microarray experiments.
The Bayesian statistical framework that we have developed (Baldi and Long, 2001; Long et al., 2001) incorporates this logical inference to derive a more robust estimate of variance in the form:

σ² = (ν0σ0² + (n − 1)s²) / (ν0 + n − 2)
In this formula, the n − 1 empirical observations with variance s² are complemented by an additional set of ν0 'background' observations with background variance σ0². In a flexible implementation, the background variance is computed by taking into account a neighbourhood of genes around the gene under consideration, for instance a symmetric window of the 100 genes whose average expression levels are closest to that of the gene under consideration.
This regularized t-test approach has been implemented in a program called CyberT, available for online use at http://www.igb.uci.edu. Although users can set the value of ν0, a reasonable default is to have ν0 + n set to a constant value (e.g. 10), which provides an automatic adjustment to the number of measurements and to missing data. In controlled experiments, this approach has been shown to give better estimates of standard deviations of expression than the simple t-test (Fig. 1) or the fixed-fold approach. In particular, although there is no substitute for replication, this method has allowed us to reduce false-positive rates in the case of experiments with low replication (Baldi and Long, 2001; Long et al., 2001).
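A minimal sketch of the regularization step follows the formula above. The window-based background estimate is one possible implementation; the default below mirrors the ν0 + n = 10 heuristic, and all names and values are illustrative rather than CyberT's actual code:

```python
def regularized_variance(s2, n, sigma0_sq, nu0):
    """Pooled variance estimate: (n - 1) empirical observations combined
    with nu0 pseudo-observations at the background variance sigma0_sq."""
    return (nu0 * sigma0_sq + (n - 1) * s2) / (nu0 + n - 2)

def background_variance(means, variances, gene_idx, window=100):
    """Background variance sigma0^2: average empirical variance of the
    `window` genes whose mean expression is closest to the gene of interest."""
    order = sorted(range(len(means)),
                   key=lambda j: abs(means[j] - means[gene_idx]))
    neighbours = order[:window]
    return sum(variances[j] for j in neighbours) / len(neighbours)

# With n = 2 replicates and the nu0 + n = 10 default, nu0 = 8, so the
# background variance dominates the weak two-replicate estimate:
v = regularized_variance(s2=1.0, n=2, sigma0_sq=4.0, nu0=8)
```

The pooled value then simply replaces s² in the t-statistic; as n grows, the empirical term (n − 1)s² dominates and the estimate converges to the ordinary one.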
Other approaches, including mixture modelling and a more complete Bayesian treatment, are possible, but it is not clear that at this stage the resulting increase in computational complexity would be worth the effort. It should also be noted that the above framework works best when applied to Gaussian data and that it is not unreasonable to use a Gaussian approximation for the logarithm of the gene expression levels obtained from DNA microarray experiments. Nevertheless, other models for data that deviate strongly from the Gaussian approximation may also become useful. In either case, because of the large number of measurements from a single experiment, high levels of noise and experimental and biological variabilities, array data are best modelled and analysed using a probabilistic framework of the type described here.
Estimating a probability of differential gene expression
Because of noise and variability of array data, no statistical method can be expected to provide perfect answers, and a number of false-positive and false-negative results must be expected. What matters is to be able to minimize these rates and to have an idea of the level of confidence that can be placed on an individual differential gene expression measurement. An intuitive method to estimate the false-positive level inherent in any set of DNA microarray experiments can be obtained by comparing control data with control data where, under perfect circumstances, no gene ought to have changed. This ad hoc method has been used to estimate the false-positive level in experiments reported by Arfin et al. (2000) and Hung et al. (2002).
Recently, a more formal computational implementation of this ad hoc method has been described by Allison et al. (2002). This method uses the Bayesian framework described above. However, this time, our prior knowledge is based on the observation that, in the case of no expected change in expression levels (e.g. control versus control), the p-values for each gene measurement ought to have a uniform distribution between 0 and 1 (the fraction of genes below any value p is equal to p and the fraction above is equal to 1 − p). In contrast, when there is change (e.g. control versus experimental), the distribution of p-values will tend to cluster more closely to zero than to one; i.e. there will be a subset of differentially expressed genes with 'significant' p-values. This distribution can be modelled as a weighted combination of a uniform and a non-uniform distribution. For each gene, we can then compute the probability that it may be associated with the non-uniform distribution, resulting in a posterior probability of differential expression (PPDE) value ranging from 0 to 1. This method is implemented by considering the p-values obtained from the t-test distribution as a new data set and building a probabilistic model for these new data. One can use a mixture of beta distributions (Allison et al., 2002) to model this distribution of p-values in the form:

f(p) = Σ(i = 0 to K) λi B(p; ri, si)
For i = 0, we use r0 = s0 = 1 to implement the uniform distribution as a special case of a beta distribution. Thus, K + 1 is the number of components in the mixture, and the mixture coefficients λ represent the prior probability of each component. In many cases, two components (K = 1) are sufficient, but sometimes additional components are needed. In general, the mixture model can be fitted to the p-values using the EM algorithm or other iterative optimization methods to determine the values of the λ, r and s parameters (Titterington et al., 1985; McLachlan and Peel, 2000).
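A sketch of the EM fit for the simplest case (K = 1: a uniform component plus one beta component) is given below. For brevity it uses a weighted method-of-moments update for the beta parameters rather than full maximum likelihood; this is an illustrative simplification, not the implementation of Allison et al. (2002), and the synthetic data are hypothetical:

```python
import math
import random

def beta_pdf(p, r, s):
    """Beta density B(p; r, s), computed via log-gamma for stability."""
    p = min(max(p, 1e-12), 1 - 1e-12)          # guard the log terms
    return math.exp(math.lgamma(r + s) - math.lgamma(r) - math.lgamma(s)
                    + (r - 1) * math.log(p) + (s - 1) * math.log(1 - p))

def fit_mixture(pvals, iters=50):
    """EM for f(p) = lam0 * 1 + lam1 * B(p; r, s), returning (lam0, lam1, r, s)."""
    lam1, r, s = 0.5, 0.5, 2.0                 # crude starting point
    for _ in range(iters):
        # E-step: responsibility of the beta component for each p-value
        w = [lam1 * beta_pdf(p, r, s) /
             ((1 - lam1) + lam1 * beta_pdf(p, r, s)) for p in pvals]
        # M-step: mixture weight, then beta parameters by weighted moments
        lam1 = sum(w) / len(w)
        wsum = sum(w)
        mu = sum(wi * p for wi, p in zip(w, pvals)) / wsum
        var = sum(wi * (p - mu) ** 2 for wi, p in zip(w, pvals)) / wsum
        common = max(mu * (1 - mu) / var - 1, 1e-3)
        r, s = max(mu * common, 1e-3), max((1 - mu) * common, 1e-3)
    return 1 - lam1, lam1, r, s

# Synthetic p-values: 70% 'unchanged' genes (uniform) and 30% 'changed'
# genes (clustered near zero, drawn here from a beta distribution):
random.seed(1)
pvals = ([random.random() for _ in range(700)] +
         [random.betavariate(0.5, 3.0) for _ in range(300)])
lam0, lam1, r, s = fit_mixture(pvals)
```

In practice a proper maximum-likelihood M-step, multiple restarts and a convergence criterion would replace the fixed iteration count used here.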
From the mixture model, given n genes, the estimate of the number of genes for which there is a true difference is n(1 − λ0). Similarly, if we set a threshold T, below which p-values are considered significant and representative of change, we can estimate the rates of false positives and false negatives. For instance, the false-positive rate is given by:

FP(T) = λ0T / ∫[0,T] f(p) dp

that is, the expected fraction of p-values below T that come from the uniform ('unchanged') component.
A posterior probability of differential expression (PPDE) can then be calculated for each gene in the experiment with p-value p as:

PPDE(p) = 1 − λ0 / f(p)

since the uniform ('unchanged') component has density 1 and contributes λ0 to the total mixture density f(p).
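Given fitted mixture parameters (the values below are hypothetical, standing in for EM output), the PPDE for each gene follows directly from the mixture density, since the uniform component has density 1 on (0, 1):

```python
import math

def beta_pdf(p, r, s):
    """Beta density B(p; r, s), computed via log-gamma for stability."""
    return math.exp(math.lgamma(r + s) - math.lgamma(r) - math.lgamma(s)
                    + (r - 1) * math.log(p) + (s - 1) * math.log(1 - p))

def ppde(p, lam0, lam1, r1, s1):
    """Posterior probability that a gene with p-value p belongs to the
    non-uniform ('changed') component of a K = 1 mixture."""
    f = lam0 + lam1 * beta_pdf(p, r1, s1)   # uniform density is 1 on (0, 1)
    return lam1 * beta_pdf(p, r1, s1) / f

# Hypothetical fitted parameters (as would be returned by the EM fit):
lam0, lam1, r1, s1 = 0.7, 0.3, 0.5, 3.0
strong = ppde(0.0001, lam0, lam1, r1, s1)   # small p-value: high PPDE
weak   = ppde(0.9,    lam0, lam1, r1, s1)   # large p-value: low PPDE
```

Sorting genes by PPDE thus gives the experimenter an explicit, experiment-wide confidence level for each candidate, rather than a bare p-value.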
The distribution of p-values from four replicate experiments comparing two genotypes of otherwise isogenic lrp+ versus lrp− E. coli strains is shown in Fig. 2 (Hung et al., 2002). A plot of PPDE values against p-values is shown in Fig. 3.
We have tested the data analysis methods described above on carefully conducted and replicated experiments in E. coli to determine the target genes of the global regulatory proteins IHF (Arfin et al., 2000), Lrp (Hung et al., 2002), Fnr, Arc and Nar (R. Gunsalus and G. W. Hatfield, unpublished data) that allow E. coli cells to respond to their nutritional and physical environments. These methods have allowed us to identify genes differentially regulated by these global regulatory proteins with a high level of confidence. We are currently implementing the computational method for estimating the PPDE value for each gene of a DNA microarray experiment in the CyberT program, available for online use at http://www.genomics.uci.edu. This will allow a user to submit a text file containing background-subtracted, normalized, raw or modelled data from a gene expression profiling experiment and to obtain immediate results for every gene expressed at an above-background level, sorted on the p-values from a simple or a regularized t-test, together with global PPDE confidence values based on the experiment-wide false-positive and false-negative levels. With this information, the experimenter can make an informed decision about differentially expressed genes worthy of further investigation.
We acknowledge our many collaborators who have contributed to and tested the data analysis methods reviewed here including Denis Heck, Dennis Kibler, Richard Lathrop, Anthony Long, Harry Mangalam, Cal McLaughlin and Suzanne Sandmeyer (UC Irvine), Rob Gunsalus (UC Los Angeles) and David Low (UC Santa Barbara). This work was supported in part by the UCI Institute of Genomics and Bioinformatics and grants from the NIH (GM-55073) to G.W.H. and a Laurel Wilkening Faculty Innovation Award and a Sun Microsystems Award to P.B. S.H. was supported by a UC Biotechnology Research and Education Program predoctoral fellowship. We are also grateful to Cambridge University Press for permission to reproduce materials that appear in a recent book by P.B. and G.W.H. entitled DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modelling.