Discovery and annotation of disease-associated microarray experiments
To evaluate the hypothesis of disease concordance across microarray experiments, we first assembled a large data set of disease-associated microarray experiments from NCBI GEO and systematically annotated each experiment to codify the disease and tissue conditions that were measured. This process identified 429 disease-associated experiments that measured both a disease and normal control state, representing 238 unique diseases measured in 122 distinct tissues. In total, these experiments yielded 429 disease-versus-control comparisons associated with 8435 microarray samples spanning more than 161 distinct microarray platforms. Interestingly, 95 diseases were found to have two or more representative experiments in public data, even given our constraint that each experiment was required to have samples for both the disease and normal control conditions. Although diseases important to public health, such as type 2 diabetes, are among the diseases with the most experiments, it was also possible to find replicate experiments for rare disorders, such as essential thrombocythemia, in public data.
Quantifying and comparing disease conditions
For each of the 429 disease-associated experiments, we computed a disease state vector, which represented the change in expression in the disease condition relative to the normal control condition for all measured genes. To evaluate the effects of various data normalization and disease state quantification methods, we created parallel sets of disease state vectors using many different combinations of commonly used normalization and quantification methods (see Materials and methods for details), which we will refer to as pipeline routings. For each of the 84 possible pipeline routings, we computed all possible within-species, pair-wise correlations between disease state vectors, which resulted in 36 417 distinct correlation measures per pipeline routing (3 059 028 total correlations). As a control, we also calculated these pair-wise correlations after randomly shuffling tissue and disease annotation labels.
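As an illustration of the pair-wise comparison step, the following minimal sketch correlates every pair of disease state vectors that share a species. It is not the code used in this study: the choice of the Pearson coefficient, the restriction to genes measured in both experiments, the minimum of three shared genes, and all experiment IDs and values are assumptions made for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def within_species_correlations(experiments):
    """Correlate every pair of disease state vectors from the same species.

    `experiments` maps an experiment ID to (species, {gene: value}).
    Only genes present in both vectors enter a given correlation.
    """
    results = {}
    for (id_a, (sp_a, vec_a)), (id_b, (sp_b, vec_b)) in combinations(
        experiments.items(), 2
    ):
        if sp_a != sp_b:
            continue  # cross-species pairs are not compared
        shared = sorted(set(vec_a) & set(vec_b))
        if len(shared) < 3:
            continue  # assumed cut-off: too few shared genes to correlate
        results[(id_a, id_b)] = pearson(
            [vec_a[g] for g in shared], [vec_b[g] for g in shared]
        )
    return results


# Toy disease state vectors (hypothetical IDs, genes, and values).
experiments = {
    "exp1": ("human", {"G1": 1.0, "G2": -0.5, "G3": 0.2, "G4": 2.0}),
    "exp2": ("human", {"G1": 0.8, "G2": -0.4, "G3": 0.1, "G4": 1.5}),
    "exp3": ("mouse", {"G1": -1.0, "G2": 0.5, "G3": 0.0, "G4": -2.0}),
}
correlations = within_species_correlations(experiments)

# Arithmetic check of the totals quoted in the text:
# 36 417 correlations per routing x 84 routings.
total_correlations = 36417 * 84
```

In this toy set only the two human experiments are compared, and the final line reproduces the total of 3 059 028 correlations reported above.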
We find that the ability to establish a statistical concordance between microarray experiments depends on the normalization and disease state quantification methods chosen. In our analysis, the subtractive approach to disease state quantification, in which the gene expression values from the normal state are simply subtracted from those in the respective disease state, outperforms fold-change and t-test methods in capturing disease concordance within and across tissues (Supplementary Table S1). It is surprising that t-test methods performed poorly in capturing disease concordance, as t-test-based methods are among the most commonly used in microarray data analysis. However, t-test-based methods are strongly influenced by estimates of gene-specific variance (Breitling et al, 2004). It is therefore likely that the t-test approach suffered from the small sample sizes characteristic of a number of disease experiments in the public data. The prominence of the subtractive methods may be explained by the use of correlation as our concordance measure, and by the possibility that the magnitudes of critical differential gene expression changes in the disease state are dampened by fold-change and t-test-based approaches. As an alternative measure of disease signature robustness, we computed ROC AUC distributions for each disease/tissue category using the best performing method under correlation, and we found that concordant experiments are also significantly predictive of each other (Supplementary Figure S1).
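The contrast between the three quantification approaches can be sketched for a single gene as follows. This is an illustrative simplification rather than the pipeline used here: summarizing each condition by its sample mean, the use of Welch's form of the t statistic, and the expression values themselves are assumptions for the example.

```python
import math
from statistics import mean, variance


def subtractive(disease, control):
    """Mean disease expression minus mean control expression."""
    return mean(disease) - mean(control)


def fold_change(disease, control):
    """Ratio of mean disease to mean control expression (positive values assumed)."""
    return mean(disease) / mean(control)


def welch_t(disease, control):
    """Welch's t statistic; the gene-specific variance enters the denominator."""
    se = math.sqrt(
        variance(disease) / len(disease) + variance(control) / len(control)
    )
    return (mean(disease) - mean(control)) / se


# Two hypothetical genes with the same mean shift but different spread,
# measured in three disease and three control samples each.
gene_a = ([10.0, 10.2, 9.8], [8.0, 8.1, 7.9])   # low variance
gene_b = ([10.0, 13.0, 7.0], [8.0, 9.5, 6.5])   # high variance

diff_a, diff_b = subtractive(*gene_a), subtractive(*gene_b)
fc_a, fc_b = fold_change(*gene_a), fold_change(*gene_b)
t_a, t_b = welch_t(*gene_a), welch_t(*gene_b)
```

Both genes have the same subtractive difference and the same fold change, yet their t statistics differ by more than an order of magnitude. This is one way in which variance estimates from only a few samples could reshape a disease state vector.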
Disease concordance versus tissue concordance
To gain a comprehensive picture of disease concordance across microarray experiments, we evaluated whether correlations between disease-associated experiments were driven by tissue-specific gene expression. For each pipeline routing, the resulting 36 417 pair-wise correlation coefficients were assigned to one of four categories according to their disease and tissue annotations. Under this scheme, we could evaluate the distributions of correlations between experiments in which both vectors measured the same disease from the same tissue (D+/T+), the same disease from different tissues (D+/T−), different diseases from the same tissue (D−/T+), or different diseases from different tissues (D−/T−). Analysis of the Fisher's z transformed correlation coefficient distributions between these categories revealed a significant degree of variance among pipeline routings with regard to the strength of the disease signal over the tissue signal. Figure 1 contrasts the results between two pipeline routings. Figure 1a shows a pipeline routing in which correlation coefficients between disease state vectors measuring the same disease from the same tissue (D+/T+) were significantly greater than correlations between different diseases in the same tissue (D−/T+) (Tukey's HSD P-value=1.15 × 10−14). In contrast, the pipeline routing in Figure 1b shows correlation coefficients with no differences between D+/T+ and D−/T+.
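The categorization and transformation described above can be sketched as follows. The annotation records and correlation values are invented for illustration, the ASCII hyphen stands in for the minus sign in the category labels, and the Tukey's HSD comparison is not reproduced. Only the category assignment, the Fisher's z transform, and a per-category median are shown.

```python
import math
from collections import defaultdict
from statistics import median


def category(a, b):
    """Label a pair of annotated experiments as D+/T+, D+/T-, D-/T+, or D-/T-."""
    d = "D+" if a["disease"] == b["disease"] else "D-"
    t = "T+" if a["tissue"] == b["tissue"] else "T-"
    return f"{d}/{t}"


def fisher_z(r):
    """Fisher's z transform of a correlation coefficient with |r| < 1."""
    return math.atanh(r)


# Hypothetical annotations for four experiments.
annotations = {
    "e1": {"disease": "Huntington's disease", "tissue": "brain"},
    "e2": {"disease": "Huntington's disease", "tissue": "blood"},
    "e3": {"disease": "type 2 diabetes", "tissue": "skeletal muscle"},
    "e4": {"disease": "Duchenne muscular dystrophy", "tissue": "skeletal muscle"},
}

# Hypothetical pair-wise correlation coefficients (not values from this study).
pair_r = {
    ("e1", "e2"): 0.60,
    ("e3", "e4"): 0.10,
    ("e1", "e3"): 0.05,
    ("e2", "e4"): -0.05,
}

# Group the z-transformed correlations by disease/tissue category.
by_category = defaultdict(list)
for (id_a, id_b), r in pair_r.items():
    label = category(annotations[id_a], annotations[id_b])
    by_category[label].append(fisher_z(r))

medians = {label: median(z) for label, z in by_category.items()}
```

The transform is applied before the categories are summarized because the sampling distribution of z is closer to normal than that of the raw coefficient, which makes the category distributions easier to compare.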
Figure 1. Boxplots comparing the distributions of the Fisher's z transformed correlation coefficients across the four disease/tissue categories. Boxplot in (A) shows a pipeline routing using parameters (NoNorm/NoCollapse/NoAggregate/SubtractiveDiff), which resulted in a significant separation of the same disease, different tissue category (D+/T−) from the different disease, same tissue category (D−/T+). Boxplot in (B) shows a pipeline routing using parameters (NoNorm/NoCollapse/NoAggregate/TtestDiff), which resulted in a distribution similar to that produced by the randomized data using the same pipeline routing parameters.
For many pipeline routings, we found that correlations between disease state vectors measuring the same disease in a different tissue (D+/T−) were often higher than correlations between disease state vectors measuring a different disease in the same tissue (D−/T+). Such a case is illustrated by the pipeline routing in Figure 1a. These cases seem to imply that the signal of disease concordance across microarray experiments is stronger than the signal of tissue concordance.
To determine whether this observation could be generalized across all pipeline routings, we plotted the median disease/tissue category correlation coefficients for each of the 84 pipeline routings, along with the medians computed by randomly shuffling annotation labels for each pipeline routing (Figure 2). We find support for a general trend in which the disease concordance signal rises above the level of the tissue concordance signal irrespective of the data processing techniques applied.
Figure 2. An aggregate view of the median correlation across the four disease/tissue categories for all 84 possible pipeline routings. Vertical black bars represent the s.e.m. of the correlation. The colored lines connect disease/tissue category medians computed using the same pipeline routing. Although certain pipeline routings perform better than others at establishing disease concordance, we observe a general trend indicating that the disease signal is stronger than the tissue signal regardless of the analytical methods used.
The relative strength of the disease concordance signal over the tissue concordance signal is a compelling finding with substantial implications for the general practice of microarray meta-analysis. One might have expected a relatively strong degree of concordance between disease experiments sampled from the same tissue, given the number of genes likely to be involved in tissue-specific biology (Kilpinen et al, 2008; Shyamsundar et al, 2005). However, we have shown here that disease conditions seem to have synchronized gene expression changes across different tissues. Figure 3 illustrates the symmetry in gene expression that can be observed for the same disease across tissues. In Figure 3, we observe a significant concordance between two experiments measuring Huntington's disease from different tissues, whereas there is relatively minimal concordance observed between two experiments measuring distinct diseases from the same tissue. This could occur as a result of the systemic nature of the disease pathogenesis. For example, a localized gene expression signature involving IFN-γ, TNF-α, IL-2, IL-12, and IL-18 genes might signify the formation of noncaseating granulomatous lesions across multiple tissue types in systemic sarcoidosis (Kettritz et al, 2006; Nunes et al, 2007). It is also possible that there are limited channels through which disparate tissues communicate, and that disease conditions maximize the amplitude of communication in one type of channel, thereby synchronizing the genes that mediate the communication. For example, hyperglycemia in diabetes might maximize the amplitude of signaling and pathways involved in the regulation of insulin, glucagon, and other hormones across muscle, hepatic, and pancreatic tissues (Bansal and Wang, 2008; Yano et al, 2008).
Recently, Dobrin et al (2009) discovered that tissue-to-tissue co-expression sub-networks in mouse models for obesity were more highly connected than within-tissue networks, lending credence to this assertion. Perhaps another explanation for the observed lack of tissue concordance is greater variation in tissue-specific gene expression than previously acknowledged between and among populations represented in public data (Whitehead and Crawford, 2005).
Figure 3. Symmetry of disease-state gene expression for the same disease in different tissues (D+/T−) versus different diseases in the same tissue (D−/T+). The colors indicate the direction of change in the expression of a gene in the disease state relative to the normal control state, in which green indicates upregulation in the disease state, and red indicates downregulation in the disease state. Here we observe that the differential expression concordance between Huntington's disease in the brain (GDS2169) and blood (GDS1331) is much more extensive than that observed between type 2 diabetes (GDS162) and Duchenne's muscular dystrophy (GDS214) in skeletal muscle.
We acknowledge several limitations to the approach taken by this study. Foremost, we acknowledge that experimental investigators will generally draw samples from tissues that are relevant to the disease condition under study. Therefore, we cannot assert that disease concordance would be maintained in samples drawn from tissues that would not commonly be chosen in the study of a disease. Nonetheless, the primary purpose of this investigation was to make observations from the data currently available in public repositories. We also recognize that the results are dependent on the quality and accuracy of the vocabulary annotations attributed to the experiments, though here, we manually validated our annotations. We also acknowledge that these vocabularies are dynamic, in which a term describing a single tissue might be split into two different concepts in the future, and that the vocabulary structure may have a bearing on the interpretation of the results. However, we determined that there was no significant relationship between vocabulary structure and observed correlation values (Supplementary Figure S2).
The findings of this study raise several important implications for the study of human disease and the role of public data in translational research. With the understanding of a general, trans-tissue disease concordance across the public microarray data, it is now reasonable to undertake efforts to incorporate these data in new systems models for disease pathology across multiple tissues and organ systems. One possible utility would be in biomarker discovery, in which the traditional practice begins with a disease condition of interest and applies molecular quantification techniques to discover putative molecular markers that signify some aspect of the molecular pathology. Instead, a broader systems view of disease derived from public data would serve as a filter to restrict costly efforts in biomarker discovery and validation to the space of molecular components and phenomena that are unique to the disease condition under study (Dudley and Butte, 2009). Furthermore, the trans-tissue nature of disease concordance suggests that it is reasonable to leverage public data to search for biomarkers in more peripheral cells and fluids, such as those found in blood and urine. This potential is illustrated in Supplementary Figure S3, in which a microarray experiment measuring type 2 diabetes (T2D) in the blood is clustered with a core set of experiments measuring various diseases from skeletal muscle. Not only do the diseases cluster consistently within tissue, but the experiment measuring T2D in peripheral blood is also clearly matched with the experiment measuring T2D in muscle. Future study should seek to model relationships between primary affected tissues and peripheral fluids within the public data to determine whether the potential demonstrated in Supplementary Figure S3 can be generalized across a broad range of human diseases.
These findings also suggest support for experimental designs that are inclusive of both newly generated data and relevant data available from public data repositories. We previously showed that the integration of 49 obesity-related, genome-wide experiments significantly improved the predictive capability for discovering obesity-associated genes (English and Butte, 2007), and the results from the study detailed here validate a similar inclusive approach for every disease represented in public data. Furthermore, major research is presently underway by others to characterize sub-types of clinically heterogeneous diseases such as breast cancers, which are observed to show a great deal of variance with regard to response to therapeutics and patient outcomes (Weigelt et al, 2008; Wirapati et al, 2008). We argue that when investigating the drivers of molecular concordance between diseases, public experiments should be seen as opportunities to allow for new directions in research into the shared molecular pathophysiology of disease, which might offer a more concise molecular characterization of the heterogeneity observed within diseases or disease categories.
With the growing set of publicly available molecular measurement data, biological and clinical investigators can now ask new questions about the global properties of human disease and build multi-tissue systems models for disease pathophysiology. Future studies in this area are likely to impact our fundamental understanding of the molecular bases of human disease and the repurposing of therapeutics across disease conditions, and may even lead to a completely new system of human disease classification founded on molecular characteristics, rather than symptoms and anatomy.