Global gene expression analysis of cancer and healthy tissues typically yields large numbers of genes that are significantly altered in cancer. Such data, however, have been difficult to interpret because gene lists vary widely across laboratories and individual studies use small sample sizes. In this investigation, we compiled microarray data obtained from the same platform family from 84 laboratories, resulting in a database containing 1,043 healthy tissue samples and 4,900 cancer samples for 13 different tissue types. The primary cancers considered included adrenal gland, brain, breast, cervix, colon, kidney, liver, lung, ovary, pancreas, prostate and skin tissues. We normalized the data together and analyzed subsets to discover genes involved in normal to cancer transformation. Our integrated significance analysis of microarrays approach produced top 400 gene lists for each of the 13 cancer types. These lists were highly statistically enriched with genes already associated with cancer in research publications excluding microarray studies (p < 1.31E-12). The genes MT1M and RRM2 appeared in nine and TOP2A in eight lists of significantly altered genes in cancer. In total, there were 132 genes present in at least four gene lists, 11 of which were not previously associated with cancer. The list contains 17 metal ion binding proteins, 15 adenyl ribonucleotide binding proteins, six kinases and six transcription factors. Our results point to the value of integrating microarray data in the study of combination drug therapies targeting metastasis.
Tens of thousands of microarray samples have accumulated in public access databases in the last decade.1–3 A large portion of these data is cancer-specific and therefore holds the promise of cancer-associated gene discovery based on thousands of samples rather than tens or hundreds. Much of the cancer-associated microarray data in the public domain comes without control samples. In fact, the data in Gene Expression Omnibus (GEO) are highly asymmetric: some datasets contain cancer microarray samples only, whereas others contain samples for healthy tissue but not cancer tissue. Conventional meta-scale approaches to integrating data, in which laboratory results are combined after each dataset has been analyzed separately, would not drastically increase the sample sizes available for the microarray analysis of cancer, because such analyses require both cancer and normal tissue samples in the same microarray dataset.
In our study, we used a large-scale approach to integrate microarray data from multiple laboratories by normalizing them together and then using the significance analysis of microarrays (SAM) method4 to identify, for each of 13 distinct tissues, the genes that are significantly altered in cancer compared to normal tissue (SAM genes). Our methodology builds on our previous study, which revealed the predictive potential of integrated microarray data.5 Large-scale meta-analysis techniques applied to cancer have already been adopted by a few groups,6–9 each focusing on a single tissue type. Other studies merged all cancer microarray data regardless of tissue type into one group and controls into another10, 11 to identify gene sets associated with common cancer mechanisms. Our approach is unusual compared with typical meta-analysis methods, but it allows the integration of asymmetric microarray data for global gene expression.5 We asked to what extent the currently available microarray data can replicate the research literature on the molecular mechanisms of cancer. The automated text search algorithms we used point to a high level of coincidence between our gene lists and cancer-associated genes determined from the non-microarray research literature.
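To make the gene-ranking step concrete, the following is a minimal sketch of a SAM-style relative difference score computed per gene from pooled cancer and normal samples. It is an illustration only: the gene names and expression values are invented, the fudge constant `s0` is fixed by hand here (SAM chooses it from the data), and the permutation-based false discovery rate estimation that SAM uses to call significance is omitted.

```python
from math import sqrt
from statistics import mean, variance


def relative_difference(cancer, normal, s0=0.1):
    """SAM-style score d = (mean_cancer - mean_normal) / (s + s0).

    s is the pooled standard error of the difference in means;
    s0 is a small positive constant (hand-picked here, an assumption).
    """
    n_c, n_n = len(cancer), len(normal)
    pooled_var = ((n_c - 1) * variance(cancer) + (n_n - 1) * variance(normal)) / (
        n_c + n_n - 2
    )
    s = sqrt(pooled_var * (1.0 / n_c + 1.0 / n_n))
    return (mean(cancer) - mean(normal)) / (s + s0)


# Hypothetical, already jointly normalized expression values:
# gene -> (cancer samples, normal samples)
expression = {
    "GENE_A": ([8.1, 8.4, 8.0, 8.3], [5.0, 5.2, 4.9, 5.1]),  # up in cancer
    "GENE_B": ([6.0, 6.1, 5.9, 6.0], [6.0, 5.9, 6.1, 6.0]),  # unchanged
    "GENE_C": ([3.0, 3.2, 2.9, 3.1], [6.0, 6.3, 5.9, 6.2]),  # down in cancer
}

scores = {g: relative_difference(c, n) for g, (c, n) in expression.items()}

# Rank genes by the magnitude of the score, most altered first.
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

A positive score marks a gene as higher in cancer and a negative score as lower; a top-N list of the kind discussed in the text would be the first N entries of `ranked` after significance filtering.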
Using nearly 6,000 microarray samples, our study identifies 132 genes that are significantly altered in at least four distinct cancer types. Our study also presents a set of 270 genes that appear to be highly significant in comparisons of datasets consisting of cancer and normal tissues independent of tissue type. These sets have 74 genes in common and will potentially contribute to more detailed annotation of genes in cancer bioinformatics databases. Our study shows the value of large-scale compilation of microarray data in cancer research: including large amounts of microarray data from different laboratories helps eliminate the effects of laboratory-specific noise and thereby increases the reliability of the results.12
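The "present in at least four gene lists" criterion amounts to counting, for each gene, the number of tissue-specific lists that contain it. A minimal sketch follows, with invented tissue lists and placeholder gene names; only the threshold of four comes from the text.

```python
from collections import Counter


def recurrent_genes(gene_lists, min_lists=4):
    """Return {gene: number of lists containing it} for genes in >= min_lists lists."""
    # set() guards against a gene being counted twice within one tissue list.
    counts = Counter(g for genes in gene_lists.values() for g in set(genes))
    return {g: n for g, n in counts.items() if n >= min_lists}


# Hypothetical top lists per tissue (real lists would hold 400 genes each).
toy_lists = {
    "breast": ["GENE_A", "GENE_B", "GENE_C"],
    "colon": ["GENE_A", "GENE_B", "GENE_D"],
    "kidney": ["GENE_A", "GENE_C", "GENE_D"],
    "liver": ["GENE_A", "GENE_B", "GENE_E"],
    "lung": ["GENE_A", "GENE_B", "GENE_C"],
}

recurrent = recurrent_genes(toy_lists, min_lists=4)
```

In this toy input only `GENE_A` (five lists) and `GENE_B` (four lists) pass the threshold.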
In our study, nearly 6,000 microarray samples were obtained from comparable Affymetrix platforms to investigate the commonalities as well as the tissue-specific components of normal to cancer transformation in 13 distinct tissue types. Such a large sample size was achieved by adding highly asymmetric datasets to our microarray sample pool. In addition to symmetric data, following our recent method evaluation study,5 we also considered datasets with large numbers of cancer samples and small numbers (including zero) of control samples, and vice versa. Of the 13 tissue types under study, only the breast, colon, kidney and pancreas tissues had three or more different datasets that each included at least 10 cancer and 10 control samples.
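The distinction between symmetric and asymmetric datasets can be illustrated with a short sketch. The dataset identifiers and sample counts below are invented; the thresholds (at least 10 cancer and 10 control samples, in three or more datasets) are the ones stated above. The sketch also shows how pooling lets cancer-only and normal-only datasets contribute to a tissue's total sample size.

```python
from collections import defaultdict


def tissues_with_symmetric_data(datasets, min_samples=10, min_datasets=3):
    """Tissues having >= min_datasets datasets with enough cancer AND normal samples."""
    per_tissue = defaultdict(int)
    for d in datasets:
        if d["cancer"] >= min_samples and d["normal"] >= min_samples:
            per_tissue[d["tissue"]] += 1
    return sorted(t for t, n in per_tissue.items() if n >= min_datasets)


def pooled_sizes(datasets):
    """Total (cancer, normal) sample counts per tissue after pooling all datasets."""
    totals = defaultdict(lambda: [0, 0])
    for d in datasets:
        totals[d["tissue"]][0] += d["cancer"]
        totals[d["tissue"]][1] += d["normal"]
    return {t: tuple(v) for t, v in totals.items()}


# Hypothetical datasets; DS4 and DS5 are fully asymmetric.
toy_datasets = [
    {"id": "DS1", "tissue": "colon", "cancer": 40, "normal": 12},
    {"id": "DS2", "tissue": "colon", "cancer": 25, "normal": 10},
    {"id": "DS3", "tissue": "colon", "cancer": 15, "normal": 30},
    {"id": "DS4", "tissue": "liver", "cancer": 90, "normal": 0},
    {"id": "DS5", "tissue": "liver", "cancer": 0, "normal": 35},
    {"id": "DS6", "tissue": "liver", "cancer": 12, "normal": 11},
]

symmetric_tissues = tissues_with_symmetric_data(toy_datasets)
pooled = pooled_sizes(toy_datasets)
```

Here the toy liver tissue has only one symmetric dataset, yet pooling gives it 102 cancer and 46 normal samples, most of which a per-dataset analysis could not use.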
Our approach is unusual in the sense that it does not fit typical meta-analysis techniques,7, 11, 24–27 in which each dataset needs to have both disease and control samples in sufficient numbers, and datasets are normalized and analyzed separately for significant genes. However, we have recently shown in a number of test cases that our approach yields gene lists that match the cancer literature better than those from meta-analysis techniques.5 Using the meta-analysis approach, Ramasamy et al.25 analyzed 21 distinct microarray datasets from 14 different cancer types comprising 419 control and 973 cancer samples. The minimum sample size for cancer and control in their study was seven, and some tissue types, such as renal tissue, appeared in only one dataset in their collection. The advantage of their method is its flexibility: multiple platforms can be incorporated, which increases the sample size. Because we focused on a set of comparable platforms, our results are not directly comparable with theirs. Nevertheless, Ramasamy et al.25 published five upregulated and five downregulated genes as most significantly associated with cancer. Of these 10 genes, four (TMEM136, RBM15, FGD4 and KIAA1881) are not part of the minimal platform considered in our study, suggesting that as the data in public-access microarray repositories grow, datasets used in our approach will be restricted to the latest versions of platforms containing many more probes. Of the remaining six genes, our top 400 lists confirmed the downregulation of PRKAR2B and GPM6B in four different tissues. The genes MYOM2 and RBCK1 in their 10-gene list were SAM genes in multiple lists in our study but were in the top 400 only in the liver gene list. Similarly, ALG3 did not appear in any of our top 400 gene lists but was significantly upregulated in six of the 13 tissues in our complete SAM lists.
The last gene in their list, IRAK1, ranked in the top 10 of our pancreas SAM gene list. However, this gene was downregulated in the pancreas and in five other tissues in our study, in contrast to the upregulation reported for it by Ramasamy et al.25 Note that our study contained 106 cancer and 71 normal pancreatic tissue microarray samples, as opposed to the 12 tumor and seven normal microarray samples in Ref. 25. It is not feasible to summarize the comparison with a p value because the gene list presented in Ref. 25 contains only 10 genes, whereas our various gene lists contain hundreds of genes. Nevertheless, it is clear that the two approaches could potentially produce gene lists whose intersection is highly unlikely to be a random event.
Our approach takes advantage of the rapid increase of asymmetric datasets in public-access microarray repositories. Moreover, gene lists predicted using these large asymmetric data reproduce much of the research literature on cancer-associated genes obtained by experimental methods other than microarray. Our analysis predicts 132 genes as significantly altered in normal to cancer transformation in at least four tissue types; of these, 121 were previously annotated in the literature as cancer associated. The remaining 11 genes are potential targets for further studies in cancer research. Note also that 74 of the 132 genes also appear in 70% of the SAM gene lists generated by comparing normal and cancer datasets comprising 10 randomly chosen samples from each tissue type. The two gene lists presented in our study for cancer-associated genes with multiple tissue specificity will further contribute to the annotation of cancer pathways. Recently emerging annotation-based microarray data tools such as A-MADMAN28 will help in compiling large-scale microarray data for studying complex diseases and for biomarker discovery and drug development.
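Agreement between a predicted gene list and a set of literature-annotated genes is often quantified with an enrichment p value. The sketch below uses a one-sided hypergeometric test as an assumed choice; the text does not state which test produced the reported p value, and all counts in the example are invented.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` are literature-associated."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 20 genes on the platform, 5 annotated as cancer associated,
# a predicted list of 4 genes of which 3 are annotated.
p_value = hypergeom_upper_tail(population=20, annotated=5, drawn=4, overlap=3)
```

With these toy counts the tail probability is 155/4845, about 0.032; an overlap such as 121 annotated genes in a list of 132 would be evaluated the same way once the platform size and the number of annotated genes are known.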
In conclusion, we used nearly 6,000 microarray samples and identified a total of 329 genes that appeared as highly significant in normal to cancer transformation with regard to multiple cancer types. The gene list consists largely of genes that have already been associated with cancer in the research literature excluding microarray studies. The list can be used in the detailed annotation of cancer pathways. In addition, because the data include numerous subtypes and cancer grades, the genes in this list can serve as potential targets for new drugs against metastasis.