In 1996, the human genome was estimated to include 50 000–100 000 genes. Results from the human sequencing projects [2, 3] have since shown these preliminary estimates to be far too high: the human genome is currently thought to include only approximately 20 300 protein-coding genes (reviewed in  and updated from hupo.org/research/hpp). It is now widely accepted that a number of molecular mechanisms expand the repertoire of proteins present in a given organism.
Protein diversity originates through three main processes: first, at the DNA level (i.e. gene polymorphisms), which generates diverse proteins with single amino acid changes or shorter polypeptides; second, at the precursor messenger RNA and messenger RNA (pre-mRNA and mRNA) level (i.e. alternative splicing, also termed differential splicing), which generates proteins with alternate functional modules; and finally, at the protein level (i.e. PTM and specific proteolytic cleavages), which modulates protein processing, function, degradation, and turnover .
Alternative splicing of mRNA molecules has been shown to be a widespread process in most eukaryotes and a major source of novel protein species (for review, see Nilsen and Graveley ). According to Blakeley et al. , alternative splicing is one of the sources of the discrepancy between the number of genes and the complexity of the proteome in multicellular eukaryotic organisms. These authors estimated that more than 33% of the genes susceptible to alternative splicing actually undergo it. To support their statement, they analyzed protein tryptic digests from five model organisms (human, mouse, chicken, Drosophila melanogaster, and Caenorhabditis elegans) using bottom-up LC-MS/MS. In humans, high-throughput genetic sequencing studies also indicate that alternative splicing occurs in 40–60% of all genes [8, 9] and affects more than 95% of all human genes containing two or more exons [10-12], yielding multiple mRNAs and, therefore, multiple protein products.
Protein PTM is another key process promoting protein diversity and modulating protein function. Protein PTM seems more widespread in nature than alternative splicing because it occurs not only in eukaryotes, but also in prokaryotes and Archaea . Hundreds of different PTMs have been referenced and compiled in the UNIMOD database (www.unimod.org). Protein PTM is a dynamic process. Each protein may carry multiple PTMs at the same time in vivo, which contributes further to the generation of diversity.
There is increasing evidence of a fourth general mechanism consisting of the generation of chimeric RNA transcripts, produced by joining exons from two or more different gene loci. In humans, the generation of chimeric RNA transcripts was proposed in 2006 . Since then, several ways of producing chimeric RNA transcripts have been proposed (Fig. 1), including: (i) gene products derived from gene rearrangement (inter- or intrachromosomal translocation, deletion, or inversion) [15-17], (ii) trans-splicing of pre-mRNAs [15, 18], and (iii) RNA tandem chimerism, formed by the transcription of two consecutive genes [14, 19]. Thus, although only a small number of chimeric transcripts have been characterized to date, RNA chimerism is attracting growing attention.
Gene rearrangements occurring in lymphocytes (B and T lymphocytes) and in cancer cells are two well-known examples of processes giving rise to protein chimerism. Gene rearrangement is an inherent characteristic of B and T lymphocytes and constitutes the basis for the diversity of immunoglobulins and T-cell antigen receptors and, hence, for antigen recognition. These two protein groups result from the expression of the same gene family, share a similar structural organization, and undergo similar gene rearrangements . Since immunoglobulins and T-cell antigen receptors play crucial roles in the immune response, gene rearrangement in B and T lymphocytes constitutes a significant biological advantage. Conversely, there is ample evidence that chimeric mRNAs derived from gene rearrangement may play a causal role in tumorigenesis (reviewed in ), suggesting that such chimeric transcripts and their products result from the deleterious accumulation of errors inside cells. A well-known example of deleterious gene rearrangement is the Bcr-Abl fusion gene, the characteristic and causative molecular event in chronic myeloid leukemia [22-24] (Fig. 2). Briefly, the bulk of the Abelson tyrosine-protein kinase 1 (Abl) gene is translocated from chromosome 9 onto the breakpoint cluster region (Bcr) gene in chromosome 22. The resulting derivative chromosome 22, which carries the fusion gene, is termed the “Philadelphia (Ph) chromosome.” The specific positions of translocation (typically termed breakpoints) within the Bcr gene vary, however, leading to up to eleven different gene rearrangements (e1/a2, e1/a3, b2/a2, b3/a2, b2/a3, b3/a3, e2/a1a, e6/a2, e8/a2, e13/a2, and e19/a2). Transcription of these rearranged genes may lead to different chimeric Bcr-Abl protein products [22-25].
Four Bcr-Abl chimeric protein products, termed b2/a2 (p210 Bcr-Abl), b3/a2 (p210 Bcr-Abl), e1/a2 (p190 Bcr-Abl), and e19/a2 (p230 Bcr-Abl), have been described and detected at the protein level, at least by gel electrophoresis [23, 25]. As depicted in Fig. 2, specific and unambiguous identification of these four chimeric protein products by bottom-up proteomic approaches would require the detection of junction peptides, which are found exclusively in each chimera.
The occurrence of chimeric mRNA transcripts in healthy/normal cells was also reported by Li et al. . Notably, their study did not address the translation of chimeric mRNA transcripts into proteins in normal cells . A review of the literature makes it evident that chimeric transcripts are typically identified at the nucleic acid level, mainly by PCR or RT-PCR, while the identification of the corresponding protein products and the assessment of their functionality have been neglected. Against this background, the report by Frenkel-Morgenstern et al.  made two significant contributions: first, the authors detected 12 different chimeric proteins in humans by bottom-up proteomic approaches using LC-MS/MS, which confirms that chimeric RNAs are translated into proteins; second, such chimeric proteins may also be found in normal cells, showing that chimeric proteins are not restricted to cells of the immune system or cancer cells and may constitute a general mechanism promoting protein diversity.
Bottom-up analysis of protein mixtures is a widespread approach to protein identification, in which the proteins are digested by adding a protease (typically trypsin) to the protein preparation (reviewed in ). The resulting peptides are analyzed by LC-MS/MS. Assigning peptides unambiguously to their cognate proteins is challenging, since the link between each peptide and its protein of origin is lost after endoprotease digestion. A range of bioinformatic tools piece together the identified peptide sequences and build a list of candidate proteins present in the sample under analysis. This list is inferred by comparing the data acquired in the mass spectrometer against a protein sequence database using one search engine or a combination of them [28-30]. Put very simply, if one or more peptides detected in the mass spectrometer match a peptide in the database, the database protein bearing that peptide is assumed to be present in the sample of interest. In bottom-up LC-MS/MS experiments, the unambiguous identification of a single protein species relies on identifying at least one peptide sequence that is found only in that protein species. Such peptides are termed “proteotypic” (also “unique” or “discriminant” peptides in the literature). Proteins resulting from the translation of chimeric RNA transcripts therefore complicate the unambiguous identification of cognate proteins. For example, a tryptic peptide present in two different protein species (a chimeric protein and a nonchimeric protein) is ambiguous, since its source cannot be determined. Conversely, detection of tryptic peptide ions spanning the junction site of a chimera (junction peptides or chimerotypic peptides) points unambiguously to the chimeric protein.
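The distinction between shared and chimerotypic peptides can be illustrated with a minimal in silico digestion sketch. The sequences below are invented toy sequences, not real Bcr or Abl sequences, and the cleavage rule (after K or R, but not before P) is a common simplification of trypsin specificity. The function names are hypothetical. Note also that a peptide absent from both parent proteins is only a candidate chimerotypic peptide: in a real analysis it would have to be checked against the whole proteome.

```python
def tryptic_digest(sequence, min_length=5):
    """Cleave after K or R unless the next residue is P (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    # Very short peptides are rarely informative, so they are dropped here.
    return [p for p in peptides if len(p) >= min_length]


def classify_chimera_peptides(chimera, parents):
    """Split the chimera's tryptic peptides into shared and candidate chimerotypic."""
    parent_peptides = set()
    for parent in parents:
        parent_peptides.update(tryptic_digest(parent))
    result = {"chimerotypic": [], "shared": []}
    for peptide in tryptic_digest(chimera):
        key = "shared" if peptide in parent_peptides else "chimerotypic"
        result[key].append(peptide)
    return result


# Toy parent proteins and a toy chimera joining the start of A to the end of B.
PARENT_A = "MSTAGKLLEEWRQPDFSIK"
PARENT_B = "MAVNDLKGHTYEARSWLQK"
CHIMERA = PARENT_A[:9] + PARENT_B[10:]

if __name__ == "__main__":
    print(classify_chimera_peptides(CHIMERA, [PARENT_A, PARENT_B]))
```

In this toy case only the peptide spanning the junction is absent from both parents. The remaining peptides of the chimera are also produced by one of the parent proteins and therefore cannot, on their own, distinguish the chimera from its nonchimeric counterparts.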
According to the recent report by Frenkel-Morgenstern et al. , most chimeric RNAs are characterized by low expression levels. Nevertheless, the same report demonstrated the successful and confident (≤1% false discovery rate) identification of chimerotypic peptides and, therefore, of chimeric proteins. From a proteomic perspective, the chimeric peptides were identified by combining two different bottom-up mass spectrometric analyses: data-dependent shotgun and selected reaction monitoring (SRM) targeted mass spectrometry (see  for review). To date, the occurrence of chimeric proteins has been confirmed in human tissues and cell lines. Consequently, chimeric proteins are worth taking into account not only in future cancer-related comprehensive proteomic analyses but also in proteomic studies of normal cells. Moreover, it would be interesting to revisit the publicly available LC-MS/MS data repositories with new eyes, searching for chimeric protein products.
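To make the ≤1% false discovery rate criterion concrete, the sketch below estimates a score cutoff with the target-decoy approach. This is an assumption: the text above does not state how the false discovery rate was estimated in the cited report, and target-decoy is used here only because it is a common choice in shotgun proteomics. The function name and the scores are hypothetical.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays within max_fdr.

    psms is a list of (score, is_decoy) pairs, one per peptide-spectrum match.
    FDR is estimated as decoy hits / target hits above the cutoff
    (a simple target-decoy estimate; other estimators exist).
    Returns None if no cutoff satisfies the criterion.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score  # accepting everything down to here is still within max_fdr
    return threshold


if __name__ == "__main__":
    # Invented example: 200 target matches and 3 decoy matches.
    example = [(score, False) for score in range(300, 100, -1)]
    example += [(150.5, True), (120.5, True), (110.5, True)]
    print(score_threshold_at_fdr(example))
```

With these invented scores, the cutoff falls just above the second decoy hit, because accepting that hit would push the estimated rate above 1%.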
Conclusion: Since the link between peptides and their protein of origin is lost after endoprotease digestion, the correct assignment of protein species after the analysis may be challenging. To add to this complexity, the biological mechanisms able to expand the repertoire of proteins found in humans (i.e. the human proteome), and probably in other higher eukaryotes, are proving to be more and more intricate. Bottom-up shotgun proteomic analyses combined with targeted approaches have proved to be valid tools for identifying chimeric proteins and may provide strong evidence of their occurrence even in complex protein mixtures. Nevertheless, bottom-up analyses should be interpreted with caution, since confident identification of chimeric proteins relies on the identification of chimeric peptides. Moreover, the concomitant occurrence of chimeric and nonchimeric proteins in a single sample may bias the protein identification and quantitation results derived from bottom-up analyses.
At present, our knowledge about protein chimerism and its potential biological role is limited. Despite those limitations, as exemplified in this manuscript, detection of chimeric proteins such as Bcr-Abl in clinical samples may be of paramount importance, since such proteins may act as biomarkers for early tumor detection and may also point to appropriate therapeutic strategies and case-suited treatments.
Future steps toward the implementation of methods for detecting and characterizing chimeric proteins using bottom-up analyses will probably require: (i) inclusion of chimeric DNA and protein sequences in databases, (ii) enrichment of chimeric proteins from complex protein mixtures to facilitate their detection, and (iii) improvement of bioinformatic tools and search algorithms so that chimeric proteins can be identified from mass spectrometric data (mass-to-charge ratios and MS/MS data corresponding to junction peptides, which are found exclusively in chimeric proteins).
DNA and protein sequences constitute the basis of bottom-up proteomic workflows. For that reason, the availability of dedicated databases including detailed sequence information on chimeric proteins is key. A number of dedicated databases compiling information on chimeric fusion genes are already publicly available and continuously updated, such as the Atlas of Genetics and Cytogenetics in Oncology and Haematology , ChimerDB , the Decipher database , HYBRIDdb , TICdb , and the Mitelman database (http://cgap.nci.nih.gov/Chromosomes/Mitelman). At the protein sequence level, the number of bona fide chimeric protein sequences in dedicated databases is still limited, and the evidence supporting their occurrence is frequently restricted to the identification of mRNA transcripts. To exemplify this, Table 1 shows forty-six different protein entries included in the Swiss-Prot database, corresponding to different Bcr/Abl chimeric transcripts and their proposed protein sequences. Since ad hoc chimeric protein sequence databases can be constructed, it is foreseeable that high-throughput bottom-up proteomics could also help to confirm the occurrence of chimeric proteins.
Table 1. Protein entries corresponding to Bcr/Abl chimeric transcripts and their proposed protein sequences

| Accession | Protein name/description | Length (amino acids) | Status |
|---|---|---|---|
| A1Z199_HUMAN | BCR/ABL p210 fusion protein (fragment) | 97 | Evidence at transcript level |
| A2RQD3_HUMAN | Bcr-abl1 e13a3 chimeric protein (fragment) | 235 | Evidence at transcript level |
| A2RQD4_HUMAN | Bcr-abl1 e14a3 chimeric protein (fragment) | 260 | Evidence at transcript level |
| A2RQD5_HUMAN | Bcr-abl1 e1a3 chimeric protein (fragment) | 313 | Evidence at transcript level |
| A2RQD6_HUMAN | Bcr-abl1 e6a2 chimeric protein (fragment) | 585 | Evidence at transcript level |
| A2RQD7_HUMAN | Bcr-abl1 e19a2 chimeric protein (fragment) | 498 | Evidence at transcript level |
| A3RL30_HUMAN | BCR/ABL fusion protein (fragment) | 107 | Evidence at transcript level |
| A6MF66_HUMAN | BCR/ABL fusion protein (fragment) | 79 | Evidence at transcript level |
| A6MF67_HUMAN | BCR/ABL fusion protein (fragment) | 47 | Evidence at transcript level |
| A6MF68_HUMAN | BCR/ABL fusion protein (fragment) | 72 | Evidence at transcript level |
| A6MFJ7_HUMAN | BCR/ABL fusion protein e1a5 (fragment) | 77 | Evidence at transcript level |
| A6MFJ8_HUMAN | BCR/ABL fusion protein e13a5 (fragment) | 58 | Evidence at transcript level |
| A6MFJ9_HUMAN | BCR/ABL fusion protein e14a5 (fragment) | 83 | Evidence at transcript level |
| A8E194_HUMAN | Bcr-abl1 fusion protein (fragment) | 31 | Predicted protein sequence |
| A8WE93_HUMAN | BCR/ABL b3a3 fusion protein (fragment) | 99 | Evidence at transcript level |
| A9UEZ4_HUMAN | BCR/ABL fusion protein isoform X1 | 429 | Evidence at transcript level |
| A9UEZ5_HUMAN | BCR/ABL fusion protein isoform X2 | 557 | Evidence at transcript level |
| A9UEZ6_HUMAN | BCR/ABL fusion protein isoform X3 | 1633 | Evidence at transcript level |
| A9UEZ7_HUMAN | BCR/ABL fusion protein isoform X4 | 554 | Evidence at transcript level |
| A9UEZ8_HUMAN | BCR/ABL fusion protein isoform X5 | 514 | Evidence at transcript level |
| A9UEZ9_HUMAN | BCR/ABL fusion protein isoform X6 | 441 | Evidence at transcript level |
| A9UF00_HUMAN | BCR/ABL fusion protein isoform X7 | 524 | Evidence at transcript level |
| A9UF01_HUMAN | BCR/ABL fusion protein isoform X8 | 465 | Evidence at transcript level |
| A9UF02_HUMAN | BCR/ABL fusion protein isoform X9 | 1644 | Evidence at transcript level |
| A9UF03_HUMAN | BCR/ABL fusion protein isoform Y1 | 458 | Evidence at transcript level |
| A9UF04_HUMAN | BCR/ABL fusion protein isoform Y2 | 454 | Evidence at transcript level |
| A9UF05_HUMAN | BCR/ABL fusion protein isoform Y3 | 467 | Evidence at transcript level |
| A9UF06_HUMAN | BCR/ABL fusion protein isoform Y4 | 513 | Evidence at transcript level |
| A9UF07_HUMAN | BCR/ABL fusion protein isoform Y5 | 1790 | Evidence at transcript level |
| A9UF08_HUMAN | BCR/ABL fusion protein isoform Y6 | 406 | Evidence at transcript level |
| A9YD18_HUMAN | BCR/ABL e8a2 fusion protein (fragment) | 130 | Evidence at transcript level |
| B0ZRQ9_HUMAN | BCR/ABL e18-int1b-a2 fusion protein (fragment) | 54 | Evidence at transcript level |
| B0ZRR0_HUMAN | BCR/ABL e8a2 fusion protein (fragment) | 79 | Evidence at transcript level |
| B0ZRR1_HUMAN | BCR/ABL e14a2 fusion protein (fragment) | 162 | Evidence at transcript level |
| B1PL85_HUMAN | BCR/ABL fusion protein (fragment) | 49 | Evidence at transcript level |
| C0LYZ4_HUMAN | Mutant BCR/ABL fusion protein (fragment) | 215 | Evidence at transcript level |
| E7E8T7_HUMAN | BCR-ABL1 e8a2 variant (fragment) | 448 | Evidence at transcript level |
| F1JU33_HUMAN | Mutant BCR/ABL fusion protein (fragment) | 256 | Evidence at transcript level |
| Q13745_HUMAN | BCR-ABL mRNA encoding P185-ALL-ABL protein | 163 | Evidence at transcript level |
| Q13746_HUMAN | Bcr-abl mRNA of acute lymphocytic leukaemia (ALL) patients (fragment) | 386 | Evidence at transcript level |
| Q13846_HUMAN | Bcr-abl mRNA 5' (clone 3c) (fragment) | 77 | Evidence at transcript level |
| Q16189_HUMAN | BCR/ABL protein (fragment) | 46 | Evidence at transcript level |
| Q16190_HUMAN | BCR/ABL protein (fragment) | 43 | Evidence at transcript level |
| Q8NEY0_HUMAN | BCR-ABL fusion protein (fragment) | 69 | Evidence at transcript level |
| Q8NF93_HUMAN | BCRE3/ABL1A11 fusion protein (fragment) | 148 | Evidence at transcript level |
| Q8TDA2_HUMAN | BCRe18/ABL1e3 fusion protein (fragment) | 142 | Evidence at transcript level |
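As a rough illustration of how an ad hoc chimeric protein sequence database might be assembled, the sketch below joins two parent sequences at given breakpoints and writes the region around each junction as a FASTA entry. Everything here is an assumption made for illustration: the function names, the header layout, the flank length, and the toy sequences and breakpoints are invented and do not correspond to any existing database or to real Bcr-Abl coordinates.

```python
def junction_entry(name_a, seq_a, break_a, name_b, seq_b, break_b, flank=30):
    """Build a (header, sequence) pair for the junction region of a chimera.

    The hypothetical chimera is seq_a[:break_a] + seq_b[break_b:]; only up to
    `flank` residues on each side of the junction are kept, which is enough
    to cover peptides spanning the junction.
    """
    left = seq_a[max(0, break_a - flank):break_a]
    right = seq_b[break_b:break_b + flank]
    # Hypothetical header layout; junction_at is the junction's offset in the entry.
    header = f">chimera|{name_a}_{break_a}-{name_b}_{break_b}|junction_at={len(left)}"
    return header, left + right


def to_fasta(entries, line_width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in entries:
        lines.append(header)
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy parent sequences and breakpoints, invented for this example.
    entry = junction_entry("geneA", "MSTAGKLLEEWRQPDFSIK", 9,
                           "geneB", "MAVNDLKGHTYEARSWLQK", 10, flank=5)
    print(to_fasta([entry]), end="")
```

Restricting each entry to the junction region keeps the ad hoc database small and avoids duplicating peptides already covered by the parent proteins in the reference database, although complete chimeric sequences could be stored just as easily.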
Recently, Frank et al.  pointed out that up to 75–85% of spectra in a typical MS/MS experiment remain unidentified after database searches. Public MS/MS datasets therefore hold a wealth of information awaiting further interpretation, which may require either improved search algorithms or alternative ways of analyzing mass spectrometry data (such as two-step database search strategies including searches against ad hoc databases  or de novo peptide sequencing [37, 38]). The strategies proposed above make it possible to identify chimeric proteins not only in future experiments but also retrospectively, since public mass spectrometric data repositories are growing continuously and may contain as yet unidentified evidence for the occurrence of chimeric proteins. Notably, Frank et al.  proposed that publicly available MS/MS meta-data should be organized into spectral archives including identified and unidentified spectra detected in multiple experiments from different labs using similar technologies. These authors argue that such a strategy would synergize data interpretation across labs and could help to unravel previously hidden information, including the identification of chimerotypic peptides.
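The logic of a two-step database search can be sketched as follows: observed masses are first matched against a reference peptide list, and only the unmatched ones are then searched against an ad hoc list of junction peptides. This is a deliberately crude illustration that matches on peptide mass alone, whereas real search engines score the MS/MS fragment ions of each spectrum. The function names, the 10 ppm tolerance, and the toy peptides are assumptions. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER_MASS


def match_masses(observed, peptides, tol_ppm=10.0):
    """Map each observed mass to the first peptide within the ppm tolerance."""
    hits = {}
    for mass in observed:
        for peptide in peptides:
            if abs(peptide_mass(peptide) - mass) / mass * 1e6 <= tol_ppm:
                hits[mass] = peptide
                break
    return hits


def two_step_search(observed, reference_peptides, adhoc_peptides, tol_ppm=10.0):
    """Search the reference list first, then only the leftovers against the ad hoc list."""
    first = match_masses(observed, reference_peptides, tol_ppm)
    leftover = [mass for mass in observed if mass not in first]
    second = match_masses(leftover, adhoc_peptides, tol_ppm)
    unassigned = [mass for mass in leftover if mass not in second]
    return first, second, unassigned


if __name__ == "__main__":
    # Toy data: two explainable masses and one that matches nothing.
    observed = [peptide_mass("MSTAGK"), peptide_mass("LLEYEAR"), 1234.5]
    print(two_step_search(observed, ["MSTAGK", "SWLQK"], ["LLEYEAR"]))
```

Searching the ad hoc list only after the reference list keeps the second search space small, which is one reason two-step strategies are attractive for re-examining unidentified spectra in public repositories.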
Combining novel technologies such as deep DNA sequencing with bottom-up proteomic analyses may open new avenues for the identification of chimeric proteins. Continuous improvements in nucleic acid and amino acid sequence databases will probably enable a deeper understanding of protein chimerism not only in humans, but also in other higher eukaryotes. The characterization of such chimeric proteins holds great potential for understanding their biological implications.