Recent re-examination of the logic behind the biological interpretation of gene expression profiles in cancer studies casts some doubt on their validity. Since the advent of DNA microarrays in the late 1990s, expression profiles obtained from resected tumours or biopsies have been widely used to characterise tumour subtypes 1, 2. They have also been promoted as tools for prognosis, indicating the probable evolution of a cancer, and, in some cases, for predicting the efficacy of a given therapy. Early technical problems 3, 4 were progressively brought under control 5. At the same time, the somewhat questionable statistical procedures initially used to derive clinically oriented ‘signatures’ were analysed critically 6, 7 and largely replaced by more rigorous methods. Indeed, several tests based on transcriptome analysis eventually achieved regulatory approval and are currently used in the clinic. However, a recent paper 8 strongly challenges the scientific validity of these profiles as applied to studies on breast (and other) cancers, claiming that ‘Most published signatures are not significantly better outcome predictors than random signatures of identical size’¹. In this commentary I address this paper and some of the relevant literature, showing that – even though the claim quoted above is somewhat overstated – it contains a significant component of truth, and that important lessons can be drawn from this work.
An iconoclastic paper
An expression signature is based on a set of genes, derived from retrospective studies of cancer samples, whose expression levels are used to assess the prognosis of patients (Fig. 1A). To do this, tissue from a resected tumour or from a biopsy is subjected to expression profiling. A specific algorithm applied to the resulting values provides either a classification into ‘good’ or ‘bad’ prognosis categories 9 or, alternatively, a continuous ‘recurrence index’ 10. The paper discussed here strongly challenges the significance of these classifications. It is based on creative use of existing (and available) data, and required no additional ‘wet lab’ work. In outline, the authors took nearly 50 published expression signatures claiming to predict outcome in breast cancer and applied them to the extensive set of expression data on which the landmark paper of van de Vijver et al. 11 is based². The data consist of expression values, obtained using 25,000-feature oligonucleotide arrays, from frozen tumour samples of 295 patients of the Netherlands Cancer Institute (the ‘NKI cohort’). Clinical follow-up information over an average of eight years is available for these patients, and the clinical end-point used in the study is overall survival. The question addressed is whether or not the published signatures³ provide a significant classification of patients into long-term and short-term survivors – and whether or not this categorisation is more reliable than one obtained using a signature containing the same number of randomly chosen genes.
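The scoring rules behind the published tests differ from one signature to the next and are often proprietary; purely as an illustration of the general scheme (score each patient on a gene set, then threshold the score into prognosis classes), here is a minimal Python sketch on synthetic data. The mean-expression score, the fixed threshold and all numbers are illustrative assumptions, not the published algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix standing in for real data:
# rows = patients, columns = genes (e.g. log-ratios).
n_patients, n_genes = 295, 25000
expr = rng.normal(size=(n_patients, n_genes))

def classify(expr, signature_idx, threshold=0.0):
    """Score each patient as the mean expression of the signature genes,
    then split into 'good' / 'bad' prognosis at a fixed threshold
    (a stand-in for the signature-specific scoring rules)."""
    scores = expr[:, signature_idx].mean(axis=1)
    return np.where(scores > threshold, "bad", "good"), scores

# A hypothetical 70-gene signature (random here, purely for illustration).
signature = rng.choice(n_genes, size=70, replace=False)
labels, scores = classify(expr, signature)
```

A continuous ‘recurrence index’ would correspond to reporting `scores` directly instead of thresholding them.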
As an appetizer, the paper tests several published signatures that are a priori completely irrelevant to breast cancer. Figure 1B shows the result for one of them, a gene signature from a high-profile publication 12 studying ‘sensitivity to social defeat in mice’. Surprisingly, the set of genes found to be significant for this outcome turns out to provide a passable prediction for the overall survival of breast cancer patients!
As seen in Fig. 1, the ‘social defeat’ signature (Fig. 1B) separates the patients into two classes with strongly different outcomes (relative risk 2.4) and high apparent significance (p = 0.00014) – as well as, or better than, the ‘real’ cancer signature (Fig. 1A).
Scanning through published signatures
Of course, this provocative result calls for a thorough and critical study of published breast cancer prognostic signatures. Indeed, the authors then use the same set of patients and data to test each signature against 1,000 random signatures of the same size (i.e. based on the same number of genes). The aim is to see whether the published signatures predict the survival of breast cancer patients more accurately than random signatures – as would, of course, be expected. The results are presented in Fig. 2, an annotated version of the corresponding figure in the paper of Venet et al. 8. It shows the p value of the survival prediction for each published signature applied to the NKI cohort (red dot) against the background of the predictions from 1,000 random signatures containing the same number of genes. Naturally, each instance of a random signature gives a different result, so the distribution is indicated by a tapering yellow shape, the most probable outcome corresponding to its thickest part. The lower 5% of each distribution is shaded in green, and the median is shown by a black vertical bar.
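The comparison amounts to an empirical test against a size-matched null distribution, and its logic can be sketched in a few lines of Python. Everything here is a placeholder: the data are random, and the simple correlation statistic stands in for the Cox/log-rank survival statistics actually used on censored follow-up data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the NKI cohort: expression matrix plus survival times.
n_patients, n_genes = 295, 25000
expr = rng.normal(size=(n_patients, n_genes))
survival = rng.exponential(scale=8.0, size=n_patients)  # years of follow-up

def association(expr, survival, signature_idx):
    """Stand-in outcome statistic: absolute correlation between the mean
    signature score and survival time (real analyses use log-rank or
    Cox-model statistics on censored data)."""
    scores = expr[:, signature_idx].mean(axis=1)
    return abs(np.corrcoef(scores, survival)[0, 1])

def empirical_p(expr, survival, signature_idx, n_random=1000):
    """Fraction of same-size random signatures scoring at least as well
    as the tested signature."""
    observed = association(expr, survival, signature_idx)
    k = len(signature_idx)
    null = [association(expr, survival,
                        rng.choice(n_genes, size=k, replace=False))
            for _ in range(n_random)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_random)

# A hypothetical published 70-gene signature (random here).
published = rng.choice(n_genes, size=70, replace=False)
p = empirical_p(expr, survival, published, n_random=200)
```

A published signature that genuinely outperforms random gene sets of the same size would yield a small `empirical_p`; a value near 0.5 means it sits at the median of the null distribution, as many of the signatures in Fig. 2 do.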
This very interesting figure allows several conclusions. First, it is reassuring to notice that some of the best-known signatures, which have led to the development of commercial diagnostic tests, do give a better prediction than almost all of the corresponding random signatures; in this respect, the authors have somewhat overstated their conclusions. As highlighted in Fig. 2, this is true for the 16-gene signature of Paik et al. 13, which forms the basis of the Genomic Health (USA) Oncotype test 10; for the 70-gene ‘Amsterdam’ set 11, which is the foundation of the Agendia (The Netherlands) MammaPrint test 9; and for the Sotiriou 14 genomic grade (MapQuant Dx, Ipsogen, France). However, many of the other published instances perform no better than the median of the random tests (e.g. the Korkola et al. 15 ‘robust signature’), or even worse than almost all of them (the Taube et al. 16 202-gene signature). It is also clear that sets comprising 100 or more genes are passable predictors of breast cancer survival, with p values in the range of 10⁻⁴–10⁻⁵, whatever the choice of genes…
Clinical significance is still present… but mechanistic interpretation is not warranted, and proliferation is the key
These surprising and disturbing results do not invalidate the clinical usefulness of the signatures (at least the better ones): they are significantly correlated with clinical outcome (overall survival in this case), often with quite strong statistical support, and can indeed provide helpful guidance to the clinician. What is questioned is the significance of the choice of genes that are assessed: many random collections of the same size give a prediction as good as, or better than, the set chosen and published after many experiments and detailed analysis of the results. In fact, a recent reanalysis of the wealth of existing data concludes that essentially the whole breast cancer classification can be achieved with the expression levels of just three key genes 17. Thus, discussions of pathogenic pathways based on the nature of the genes found to be ‘implicated’ have no firm foundation, as the gene set could be replaced by almost any other assortment without lowering the correlation value. This is an important caveat, as many biological hypotheses have been published on this assumption.
So what is going on? Why does almost any signature based on a randomly chosen assortment of 200 genes provide a reasonable prognosis for survival after therapy for early-stage breast cancer, as seen in Fig. 2⁴? The authors' explanation is that any multigene signature is likely to be correlated with cellular proliferation in the tumour cells, which itself correlates (negatively) with patient survival. In support of this hypothesis, they construct a ‘metagene’ from the 1% of genes whose expression is most strongly correlated with that of the PCNA (proliferating cell nuclear antigen) gene, whose protein product is a widely used target for immunohistochemical determination of the proliferation index in tumour samples 18. The expression level of this metagene (i.e. the average expression level of the underlying genes) is then used to correct the data, i.e. to remove the proliferation component from the signatures. These, of course, become much less significant in terms of prognosis; however, they should be enriched in changes reflecting the real ‘driver’ events in cancer genesis, rather than the wide-ranging but secondary effects linked to proliferation. Indeed, most (91%) of the genes that are individually associated with outcome at the p = 0.05 level (almost 20% of the genes represented on the array) are also correlated with the PCNA metagene.
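The metagene construction, and one plausible way of removing the proliferation component, can be sketched as follows. The simple linear residualisation against the metagene used here is an assumption for illustration; the authors' actual correction procedure is more elaborate, and the data are again random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 295 patients x 2,000 genes; column 0 plays the role of PCNA.
expr = rng.normal(size=(295, 2000))
pcna = expr[:, 0]

# Metagene: average expression of the 1% of genes whose expression is
# most strongly correlated with that of PCNA.
corr = np.array([abs(np.corrcoef(expr[:, j], pcna)[0, 1])
                 for j in range(expr.shape[1])])
top = np.argsort(corr)[-expr.shape[1] // 100:]   # top 1% of genes
metagene = expr[:, top].mean(axis=1)

# Illustrative 'correction': regress each gene on the metagene and keep
# the residuals, stripping out the shared proliferation component.
m = (metagene - metagene.mean()) / metagene.std()
beta = expr.T @ m / len(m)            # per-gene slope on the metagene
corrected = expr - np.outer(m, beta)  # residual expression matrix
```

After this step, each gene in `corrected` is uncorrelated with the metagene, so any remaining association of a signature with survival cannot be driven by the proliferation axis.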
The result of this data manipulation is to reduce the prognostic power of the published signatures. Those that survive the treatment must contain information independent of proliferation, i.e. they may provide insight into the genes and pathways that play a causal role in the aggressiveness of breast cancer 19. Again, it is reassuring to see that the best-established signatures mentioned above remain valid under these conditions, with p values ranging from 10⁻⁴ to 10⁻³ and hazard ratios (bad to good prognosis) close to 2. Most of the other signatures end up with hazard ratios very close to 1 and low to nonexistent statistical significance. Some of them (without metagene correction) may still be useful clinically, since they are correlated with overall survival – but that correlation is essentially based on proliferation, and the gene set chosen has no deep biological significance.
Fighting an entrenched paradigm?
As one of the authors recounts in a science magazine for the general public 20, publishing this analysis was not easy: it took four years and six rejections for the paper to finally appear in an open-access computational biology journal whose impact in the genomics community is probably rather limited. Of course, there could be something wrong with the computational methods… Even though the article is supported by ample supplements with much additional data, and includes a number of controls not described here in the interest of brevity, the procedures are quite intricate and could conceivably generate some bias. However, at a time when so much published work brings very little new understanding or insight to the field, such a provocative finding deserved to be highlighted; that would at least have stimulated efforts to verify it by other analyses. It is difficult not to interpret this resistance as an expression of reluctance to question the significance of a widely practised (and widely published) approach. Indeed, dissatisfaction is beginning to emerge about the low success rate of the thousands of biomarkers described in recent years: very few of them have actually proven useful in clinical practice or in drug development 21, 22. Even though statistical methods are much improved compared with the initial work in this field 7, the design of experiments and the methods for error estimation still leave much to be desired, as discussed recently in this Journal 23. Again, the analysis presented in the paper by Venet et al. 8 does not negate the clinical worth of expression signatures. However, it does raise important questions about their mechanistic significance and about the validity of many interpretations derived from them. The paper is, therefore, a very useful addition to the genomics literature.
² These data were until recently available on the Rosetta web site, but can now only be accessed through the paper discussed in this article 8.
³ For the original references, see ref. 8, supplemental text.
⁴ In Fig. 2, at the level corresponding to 200-gene signatures (where the Taube et al. signature is indicated), essentially every random signature gives a prediction that is valid at the p = 0.05 level.