The expression level of small non-coding RNAs derived from the first exon of protein-coding genes is predictive of cancer status



Small non-coding RNAs (smRNAs) are known to be significantly enriched near the transcriptional start sites of genes. However, the functional relevance of these smRNAs remains unclear, and they have not been associated with human disease. Within The Cancer Genome Atlas (TCGA) project, we have generated small RNA datasets for many tumor types. In prior cancer studies, these RNAs have been regarded as transcriptional “noise,” due to their apparently chaotic distribution. In contrast, we demonstrate their striking potential to distinguish efficiently between cancer and normal tissues and to classify patients with cancer into subgroups with distinct survival outcomes. This potential to predict cancer status is restricted to a subset of these smRNAs, which is encoded within the first exon of genes, highly enriched within CpG islands, and negatively correlated with DNA methylation levels. Thus, our data show that genome-wide changes in the expression levels of small non-coding RNAs within first exons are associated with cancer.



The expression of small non-coding RNAs encoded within the first exon of genes can be used to efficiently identify cancer samples and classify patients into subgroups with different survival outcomes. This pan-cancer association is the first link between these RNAs and disease.

  • Exon 1 small non-coding RNAs (smRNAs) can distinguish between cancer and normal tissues.
  • The prediction potential of exon 1 smRNAs differs from that of other smRNAs around transcriptional start sites (TSS).
  • smRNA locations around TSS are conserved between different individuals.
  • smRNA locations are enriched within CpG islands, and their levels are negatively correlated with DNA methylation.