Protein chimerism: Novel source of protein diversity in humans adds complexity to bottom-up proteomics


  • Juan Casado-Vela,

    Corresponding author
    • Centro Nacional de Biotecnología. Lab 115. Dpt. Biología Molecular y Celular, Spanish National Research Council (CSIC), Madrid, Spain
  • Juan Carlos Lacal,

    Corresponding author
    • Translational Oncology Unit, Instituto de Investigaciones Biomédicas ‘Alberto Sols’, Spanish National Research Council (CSIC-UAM), Madrid, Spain
  • Felix Elortza

    Corresponding author
    • Proteomics Platform, CIC bioGUNE, CIBERehd, ProteoRed-ISCIII, Technology Park of Bizkaia, Derio, Spain

  • Colour Online: See the article online to view Figs. 1 and 2 in colour.

Correspondence: Dr. Juan Casado-Vela, Centro Nacional de Biotecnología, Lab 115, Dpt. Biología Molecular y Celular, Spanish National Research Council (CSIC), 28049 Madrid, Spain.


Fax: +34-915854401

Additional corresponding authors: Dr. Felix Elortza; Dr. Juan Carlos Lacal


Three main molecular mechanisms are considered to contribute to expanding the repertoire and diversity of proteins present in living organisms: first, at the DNA level (gene polymorphisms and single nucleotide polymorphisms); second, at the messenger RNA (pre-mRNA and mRNA) level, including alternative splicing (also termed differential splicing or cis-splicing); finally, at the protein level, mainly driven by PTM and specific proteolytic cleavages. Chimeric mRNAs constitute an alternative source of protein diversity, and can be generated either by chromosomal translocations or by trans-splicing events. The occurrence of chimeric mRNAs and proteins is a frequent event in cells from the immune system and in cancer cells, mainly as a consequence of gene rearrangements. Recent reports indicate that chimeric proteins may also be expressed at low levels under normal physiological circumstances, thus representing a novel source of protein diversity. Notably, recent publications demonstrate that chimeric protein products can be successfully identified through bottom-up proteomic analyses. Several questions remain unsolved, such as the physiological role and impact of such chimeric proteins, or their potential occurrence in higher eukaryotic organisms other than humans. The occurrence of chimeric proteins certainly seems to be another unforeseen source of complexity for the proteome. It is a process to keep in mind not only when performing bottom-up proteomic analyses in cancer studies but also in general bottom-up proteomics experiments.


Abbreviation: SNP, single nucleotide polymorphism

In 1996, it was estimated that the human genome included 50 000–100 000 genes [1]. In contrast to these preliminary estimates, the results of human genome sequencing projects [2, 3] indicate that the human genome includes only approximately 20 300 protein-coding genes (reviewed in [4]). It is now widely accepted that a number of molecular mechanisms contribute to expanding the repertoire of proteins present in a given organism.


Protein diversity arises through three main processes: first, at the DNA level (i.e. gene polymorphisms), which generates proteins with single amino acid changes or shorter polypeptides; second, at the precursor messenger RNA (pre-mRNA) and messenger RNA (mRNA) level (i.e. alternative splicing, also termed differential splicing), which generates proteins with alternate functional modules; and finally, at the protein level (i.e. PTM and specific proteolytic cleavages), which modulates protein processing, function, degradation, and turnover [5].

It has been demonstrated that alternative splicing of mRNA molecules is a widespread process in most eukaryotes and a major source of novel protein species (for review, see Nilsen and Graveley [6]). According to Blakeley et al. [7], alternative splicing is one of the sources of the discrepancy between the number of genes and the complexity of the proteome in multicellular eukaryotic organisms. These authors estimated that more than 33% of the genes susceptible to alternative splicing actually undergo it. To support their statement, they analyzed protein tryptic digests from five different model organisms (human, mouse, chicken, Drosophila melanogaster, and Caenorhabditis elegans) using bottom-up LC-MS/MS. In humans, high-throughput genetic sequencing studies also indicate that alternative splicing occurs in 40–60% of all genes [8, 9] and affects more than 95% of all human genes containing two or more exons [10-12], yielding multiple mRNAs and, therefore, multiple protein products.

Protein PTM is another key process promoting protein diversity and modulating protein function. Protein PTM seems more widespread in nature than alternative splicing because it occurs not only in eukaryotes, but also in prokaryotes and Archaea [13]. Hundreds of different PTMs have been referenced and compiled in the UNIMOD database. Protein PTM is a dynamic process. Each protein may undergo multiple PTMs at the same time in vivo, thus contributing to the generation of diversity.

There is increasing evidence of a fourth general mechanism consisting of the generation of chimeric RNA transcripts, produced by joining exons from two or more different gene loci. In humans, the generation of chimeric RNA transcripts was proposed in 2006 [14]. Since then, several ways of producing chimeric RNA transcripts have been proposed (Fig. 1), including: (i) gene products derived by gene rearrangement (inter- or intrachromosomal translocation, deletion, or inversion) [15-17], (ii) trans-splicing of pre-mRNAs [15, 18], and (iii) RNA tandem chimerism, formed by the transcription of two consecutive genes [14, 19]. Thus, although only a small number of chimeric transcripts have been characterized to date, the relevance of RNA chimerism is receiving increasing attention.

Figure 1.

Schematic view of the main mechanisms generating protein diversity. Panel 1: differences occurring at single gene level. Gene A harbors one single nucleotide polymorphism (SNP) inside the sequence of allele 2 that is not present inside allele 1. Panel 2: depending on the alleles transcribed, two different types of mRNA molecules may be generated: mRNA transcribed from Gene A, allele 1 (mRNA A.1) or mRNA transcribed from Gene A, allele 2 (mRNA A.2). Cis-splicing of mRNAs contributes to the generation of different mRNA molecules (A.1.1–A.1.3 and A.2.1–A.2.3). Panel 3: chimeric mRNAs can be synthesised by two main mechanisms: trans-splicing (produced by joining exons from at least two different mRNA molecules) or gene rearrangements (genetic inter- or intrachromosomal translocations, deletions, or inversions). Panel 4: protein PTM is a well-known mechanism promoting protein diversity. Panel 5: the complexity of the human proteome still needs to be determined.

Gene rearrangement processes occurring in lymphocytes (B- and T-lymphocytes) and in cancer cells constitute two well-known examples of protein chimerism. Gene rearrangement is inherently characteristic of B- and T-lymphocytes, where it constitutes the basis for the diversity of immunoglobulins and T-cell antigen receptors and, therefore, for antigen recognition. These two protein groups result from the expression of the same gene family, share a similar structural organization, and undergo similar gene rearrangements [20]. Since immunoglobulins and T-cell antigen receptors play crucial roles in the immune response, gene rearrangement in B- and T-lymphocytes constitutes a significant biological advantage. Conversely, there is ample evidence that chimeric mRNAs derived from gene rearrangement may play a causal role in tumorigenesis (reviewed in [21]), suggesting that such chimeric transcripts and their products are the consequence of the deleterious accumulation of errors inside cells. A well-known example of deleterious gene rearrangement is the Bcr-Abl fusion gene, the characteristic and causative molecular event in chronic myeloid leukemia [22-24] (Fig. 2). Briefly, the bulk of the Abelson tyrosine-protein kinase 1 (Abl) gene is translocated from chromosome 9 onto the breakpoint cluster region (Bcr) gene in chromosome 22. The resulting shortened chromosome 22, which carries the fusion gene, is termed the Philadelphia (Ph) chromosome. Nevertheless, the specific positions of translocation (typically termed breakpoints) within the Bcr gene are variable, leading to up to eleven different gene rearrangements (e1/a2, e1/a3, b2/a2, b3/a2, b2/a3, b3/a3, e2/a1a, e6/a2, e8/a2, e13/a2, and e19/a2). The potential transcription of these genetic rearrangements may lead to different chimeric Bcr-Abl protein products [22-25]. Four Bcr-Abl chimeric protein products, termed b2a2 (p210Bcr-Abl), b3a2 (p210Bcr-Abl), e1/a2 (p190Bcr-Abl), and e19/a2 (p230Bcr-Abl), have been described and detected at the protein level using, at least, gel electrophoresis [23, 25]. As depicted in Fig. 2, specific and unambiguous identification of these four chimeric protein products by bottom-up proteomic approaches would require the detection of junction peptides, which are found exclusively in each chimera.
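The idea of a junction peptide can be illustrated with a minimal in silico digestion sketch. The sequences below are short invented toy strings, not real Bcr or Abl sequences, and the function names are hypothetical. The digestion rule assumed here is the usual simplified one for trypsin (cleavage C-terminal to K or R, except when the next residue is P), with no missed cleavages.

```python
def tryptic_digest(seq):
    """Return (start, end, peptide) tuples for a simplified tryptic digest.

    Cleaves after K or R unless the next residue is P; no missed cleavages.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(seq):
        is_last = i == len(seq) - 1
        if is_last or (aa in "KR" and seq[i + 1] != "P"):
            peptides.append((start, i + 1, seq[start:i + 1]))
            start = i + 1
    return peptides


def junction_peptides(n_part, c_part):
    """Return tryptic peptides of the fusion n_part + c_part that span the junction."""
    fusion = n_part + c_part
    junction = len(n_part)  # index of the first residue contributed by c_part
    return [pep for start, end, pep in tryptic_digest(fusion)
            if start < junction < end]


# Toy fusion: invented N-terminal and C-terminal fragments (illustration only).
print(junction_peptides("MASEKLDGTRAVLE", "ALQRSVASDFK"))
```

If the N-terminal partner happens to end in K or R, trypsin cleaves exactly at the junction and this sketch returns an empty list; in that situation a different protease, or peptides with missed cleavages, would be needed to obtain a junction-spanning peptide.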

Figure 2.

(A) Schematic representation of Bcr and Abl genes and corresponding protein products (p160Bcr and p145Abl). Bcr-Abl fusion gene rearrangements and theoretical chimeric protein products are also depicted. The variability of genetic translocation of the Abl gene within the Bcr gene leads to up to eleven different gene rearrangements (e1/a2, e1/a3, b2/a2, b3/a2, b2/a3, b3/a3, e2/a1a, e6/a2, e8/a2, e13/a2, and e19/a2). The potential transcription of the novel fusion genes may lead to different chimeric Bcr-Abl protein products. Four Bcr-Abl chimeric protein products, termed b2a2 (p210Bcr-Abl), b3a2 (p210Bcr-Abl), e1/a2 (p190Bcr-Abl), and e19/a2 (p230Bcr-Abl), have been described and detected at the protein level using, at least, gel electrophoresis. Arabic numbers 1–23 on Bcr and 1–11 on Abl refer to known exons included inside each gene. 1' and 2' correspond to two different Bcr alternative exons contained within the first intron. Similarly, 1a and 1b correspond to two alternative exons described in the Abl gene. Potential junction tryptic peptides (a, b, c, and d) are indicated with red lines. (B) Schematic workflow corresponding to bottom-up identification of chimeric protein products. Specific and unambiguous identification of the four chimeric Bcr/Abl proteins could be inferred after tryptic digestion by mass spectrometric techniques (MALDI-TOF/TOF or LC-MS/MS). Detection of precursor tryptic peptide ions spanning the junction sites for each chimera (junction peptides or chimerotypic peptides) points to the identification of chimeric proteins. Subsequent MS/MS fragmentation of junction peptides provides sequence information and serves to validate the identification of protein chimeras. Potential junction peptides/chimerotypic peptides (a, b, c, and d) are indicated in red. Panel A is adapted from [25] by permission from the American Association for Cancer Research.

The occurrence of chimeric mRNA transcripts in healthy/normal cells was also reported by Li et al. [26]. It should be noted that the translation of chimeric mRNA transcripts into proteins in normal cells was not addressed in their study [26]. A review of the literature shows that the identification of chimeric transcripts is typically addressed at the nucleic acid level, mainly using PCR or RT-PCR, while the identification of the corresponding protein products and the assessment of their functionality have been neglected. Taking that into consideration, the report by Frenkel-Morgenstern et al. [27] provided two significant contributions: first, they confirmed the expression of 12 different chimeric proteins in humans by bottom-up proteomic approaches using LC-MS/MS, which shows that chimeric RNAs are translated into proteins; second, they showed that such chimeric proteins may also be found in normal cells, demonstrating that the occurrence of chimeric proteins is not restricted to cells from the immune system or cancer cells and may constitute a general mechanism promoting protein diversity.

Bottom-up analysis of protein mixtures is a widespread approach to protein identification, in which the proteins are digested by adding a protease (typically trypsin) to the protein preparation (reviewed in [28]). The resulting peptides are subjected to analysis by LC-MS/MS. The unambiguous assignment of peptides to their cognate proteins is a challenging task, since the link between each peptide and its protein of origin is lost after endoprotease digestion. A range of bioinformatic tools enable the reconstruction of the puzzle of identified peptide sequences and make it possible to build a list of candidate proteins present in the sample under analysis. The list of proteins is inferred by comparing the data acquired in a mass spectrometer against a protein sequence database using one or a combination of search engines [28-30]. From a very simplistic perspective, if one (or more) peptide detected in the mass spectrometer matches a peptide in the database, then it can be assumed that the protein bearing that peptide and included in the database is also present in the sample of interest. In bottom-up LC-MS/MS experiments, the unambiguous identification of a single protein species relies on the identification of at least one peptide sequence that is uniquely found in that protein species. The peptides found only in a certain protein species are termed “proteotypic” (also termed “unique” or “discriminant” peptides in the literature). Therefore, the occurrence of proteins resulting from the translation of chimeric RNA transcripts could add complexity to the unambiguous identification of their cognate proteins. For example, the identification of tryptic peptides present in two different protein species (a chimeric protein and a nonchimeric protein) could lead to ambiguities, since it is not possible to determine the source of the tryptic peptide. Conversely, detection of tryptic peptide ions spanning the junction sites for each chimera (junction peptides or chimerotypic peptides) unambiguously points to the identification of chimeric proteins.
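The distinction between shared and proteotypic (here, chimerotypic) peptides can be sketched with a toy database. Everything in this example is illustrative: the three sequences are invented, the protein names are hypothetical, and the digestion uses the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages).

```python
from collections import defaultdict


def tryptic_peptides(seq):
    """Simplified tryptic digest: cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        is_last = i == len(seq) - 1
        if is_last or (aa in "KR" and seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    return peptides


def peptide_index(database):
    """Map each peptide to the set of database proteins that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for pep in tryptic_peptides(seq):
            index[pep].add(name)
    return index


def proteotypic(index):
    """Map each protein to the sorted peptides found in that protein only."""
    unique = defaultdict(list)
    for pep, proteins in index.items():
        if len(proteins) == 1:
            unique[next(iter(proteins))].append(pep)
    return {name: sorted(peps) for name, peps in unique.items()}


# Invented toy database: two parent proteins and one chimera built from them.
toy_db = {
    "parentA": "MASEKLDGTRAVLEGGHK",
    "parentB": "MTTPKWNALQRSVASDFK",
    "chimeraAB": "MASEKLDGTRAVLEALQRSVASDFK",
}
index = peptide_index(toy_db)
print(proteotypic(index)["chimeraAB"])  # peptides seen only in the chimera
print(sorted(index["MASEK"]))           # proteins sharing this peptide
```

In this toy case, every peptide of the chimera except the one spanning the junction is shared with a parent protein, which is the ambiguity described above: observing only the shared peptides cannot distinguish the chimera from its parents.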

According to the recent report by Frenkel-Morgenstern et al. [27], most chimeric RNAs are characterized by low expression levels. Nevertheless, the same report demonstrated the successful and confident (≤1% false discovery rate) identification of chimerotypic peptides and, therefore, chimeric proteins. From a proteomic perspective, the identification of chimeric peptides was achieved by combining two different bottom-up mass spectrometric analyses: data-dependent shotgun and selected reaction monitoring targeted mass spectrometry (see [28] for review). To date, the occurrence of chimeric proteins has been confirmed in human tissues and cell lines. Consequently, it may be worth taking this fact into account not only in future cancer-related comprehensive proteomic analyses but also in proteomic studies of normal cells. Moreover, it could also be worthwhile to re-examine the publicly available LC-MS/MS data repositories, looking specifically for the presence of chimeric protein products.
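The text above quotes a false discovery rate without describing how it was estimated. One common approach in database searching, assumed here purely for illustration, is target-decoy estimation: spectra are searched against real (target) and reversed or shuffled (decoy) sequences, and the FDR above a score cutoff is approximated as decoy hits divided by target hits. The function name and the toy scores below are hypothetical.

```python
def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) peptide-spectrum matches, where a
    higher score means a better match. FDR is estimated as decoys / targets
    among all matches scoring at or above the cutoff. Returns None if no
    cutoff satisfies the threshold. Tied scores are not treated specially.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Invented toy matches: (search engine score, matched a decoy sequence?)
toy_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
print(score_cutoff_at_fdr(toy_psms, max_fdr=0.01))
```

Because junction peptides are few compared with the rest of the database, a stricter or separately estimated FDR for that peptide class may be preferable; that choice is left open here.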

Conclusion: Since the link between peptides and their original protein is lost after endoprotease digestion, the correct assignment of protein species after the analysis may be challenging. To add further complexity, the number of biological mechanisms able to expand the repertoire of proteins found in humans (i.e. the human proteome), and probably in other higher eukaryotes, is proving to be increasingly intricate. Bottom-up shotgun proteomic analyses combined with targeted approaches have proved to be valid tools for the identification of chimeric proteins and may provide strong evidence of their occurrence even in complex protein mixtures. Nevertheless, bottom-up analyses should be considered with caution, since confident identification of chimeric proteins relies on the identification of chimeric peptides. Moreover, the concomitant occurrence of chimeric and nonchimeric proteins in a single sample may bias the protein identification and quantitation results derived from bottom-up analyses.

A priori, our knowledge about protein chimerism and its potential biological role is limited. Despite those limitations, as exemplified in this manuscript, detection of chimeric proteins such as Bcr-Abl in clinical samples may be of paramount importance, since such proteins may act as biomarkers for early tumor detection and may also point to appropriate therapeutic strategies and case-suited treatments.

Future steps toward the implementation of methods aimed at chimeric protein detection and characterization using bottom-up analyses will probably require: (i) inclusion of DNA and protein chimeric sequences in databases, (ii) chimeric protein enrichment from complex protein mixtures to facilitate their detection, and (iii) improvement of bioinformatic tools and search algorithms enabling the identification of chimeric proteins based on mass spectrometric data (mass-to-charge ratios and MS/MS data corresponding to junction peptides, exclusively found in chimeric proteins).
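As an illustration of the mass-to-charge ratios mentioned in point (iii), the following sketch computes the theoretical monoisotopic m/z of an unmodified peptide from standard residue masses. The peptide used is an invented toy junction peptide, not a real Bcr-Abl sequence, and the function name is hypothetical. Modifications (for example carbamidomethylated cysteine) are not handled.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added to the residue sum for the intact peptide
PROTON = 1.007276   # mass of one proton


def peptide_mz(peptide, charge):
    """Theoretical monoisotopic m/z of an unmodified peptide at a given charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


# Toy junction peptide (invented sequence), doubly protonated precursor.
print(round(peptide_mz("AVLEALQR", 2), 4))
```

Precursor values of this kind are what a targeted method would monitor for each candidate junction peptide, together with the fragment ions obtained from MS/MS.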

DNA and protein sequences constitute the basis of bottom-up proteomic workflows. For that reason, the availability of dedicated databases including detailed sequence information on chimeric proteins is key. A number of dedicated databases compiling information on chimeric fusion genes are already publicly available and continuously updated, such as the Atlas of Genetics and Cytogenetics in Oncology and Haematology [31], ChimerDB [32], the Decipher database [33], HYBRIDdb [34], TICdb [35], or the Mitelman database. At the protein sequence level, the number of bona fide chimeric protein sequences in dedicated databases is still limited, and evidence supporting their occurrence is frequently restricted to the identification of mRNA transcripts. To exemplify this, Table 1 shows forty-six different protein entries included in the Swiss-Prot database, corresponding to different Bcr/Abl chimeric transcripts and their proposed protein sequences. Since the ad hoc construction of chimeric protein sequence databases is possible, it is foreseeable that high-throughput bottom-up proteomics could also contribute to confirming the occurrence of chimeric proteins.
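A minimal sketch of how an ad hoc chimeric sequence database might be assembled is shown below. It assumes that the parent protein sequences and the breakpoint positions (expressed as residue indices) are already known; the sequences, breakpoints, header text, and function names are all invented for illustration and do not correspond to real database entries or to any specific search engine's requirements.

```python
def fuse(n_seq, n_break, c_seq, c_start):
    """Build a chimera from residues [0, n_break) of the N-terminal partner
    followed by residues [c_start, end) of the C-terminal partner."""
    return n_seq[:n_break] + c_seq[c_start:]


def to_fasta(entries, width=60):
    """Format (header, sequence) pairs as FASTA text with fixed line width."""
    lines = []
    for header, seq in entries:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Invented parent sequences and breakpoints (illustration only).
parent_a = "MASEKLDGTRAVLEGGHK"
parent_b = "MTTPKWNALQRSVASDFK"
chimera = fuse(parent_a, 14, parent_b, 7)

# Hypothetical header; the resulting text could be appended to a search database.
print(to_fasta([("chimera_AB toy fusion A(1-14)/B(8-18)", chimera)], width=10))
```

Searching spectra against the regular database plus such entries is one way to implement the two-step, ad hoc database strategy; the larger search space should be taken into account when estimating error rates.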

Table 1. List of BCR/ABL chimeric protein sequences available in Swiss-Prot database. Accession numbers, protein descriptions, deduced number of amino acids, and status of the proteins are included
Accession | Protein names/protein description | Amino acids | Status
A1Z199_HUMAN | BCR/ABL p210 fusion protein (fragment) | 97 | Evidence at transcript level
A2RQD3_HUMAN | Bcr-abl1 e13a3 chimeric protein (fragment) | 235 | Evidence at transcript level
A2RQD4_HUMAN | Bcr-abl1 e14a3 chimeric protein (fragment) | 260 | Evidence at transcript level
A2RQD5_HUMAN | Bcr-abl1 e1a3 chimeric protein (fragment) | 313 | Evidence at transcript level
A2RQD6_HUMAN | Bcr-abl1 e6a2 chimeric protein (fragment) | 585 | Evidence at transcript level
A2RQD7_HUMAN | Bcr-abl1 e19a2 chimeric protein (fragment) | 498 | Evidence at transcript level
A3RL30_HUMAN | BCR/ABL fusion protein (fragment) | 107 | Evidence at transcript level
A6MF66_HUMAN | BCR/ABL fusion protein (fragment) | 79 | Evidence at transcript level
A6MF67_HUMAN | BCR/ABL fusion protein (fragment) | 47 | Evidence at transcript level
A6MF68_HUMAN | BCR/ABL fusion protein (fragment) | 72 | Evidence at transcript level
A6MFJ7_HUMAN | BCR/ABL fusion protein e1a5 (fragment) | 77 | Evidence at transcript level
A6MFJ8_HUMAN | BCR/ABL fusion protein e13a5 (fragment) | 58 | Evidence at transcript level
A6MFJ9_HUMAN | BCR/ABL fusion protein e14a5 (fragment) | 83 | Evidence at transcript level
A8E194_HUMAN | Bcr-abl1 fusion protein (fragment) | 31 | Predicted protein sequence
A8WE93_HUMAN | BCR/ABL b3a3 fusion protein (fragment) | 99 | Evidence at transcript level
A9UEZ4_HUMAN | BCR/ABL fusion protein isoform X1 | 429 | Evidence at transcript level
A9UEZ5_HUMAN | BCR/ABL fusion protein isoform X2 | 557 | Evidence at transcript level
A9UEZ6_HUMAN | BCR/ABL fusion protein isoform X3 | 1633 | Evidence at transcript level
A9UEZ7_HUMAN | BCR/ABL fusion protein isoform X4 | 554 | Evidence at transcript level
A9UEZ8_HUMAN | BCR/ABL fusion protein isoform X5 | 514 | Evidence at transcript level
A9UEZ9_HUMAN | BCR/ABL fusion protein isoform X6 | 441 | Evidence at transcript level
A9UF00_HUMAN | BCR/ABL fusion protein isoform X7 | 524 | Evidence at transcript level
A9UF01_HUMAN | BCR/ABL fusion protein isoform X8 | 465 | Evidence at transcript level
A9UF02_HUMAN | BCR/ABL fusion protein isoform X9 | 1644 | Evidence at transcript level
A9UF03_HUMAN | BCR/ABL fusion protein isoform Y1 | 458 | Evidence at transcript level
A9UF04_HUMAN | BCR/ABL fusion protein isoform Y2 | 454 | Evidence at transcript level
A9UF05_HUMAN | BCR/ABL fusion protein isoform Y3 | 467 | Evidence at transcript level
A9UF06_HUMAN | BCR/ABL fusion protein isoform Y4 | 513 | Evidence at transcript level
A9UF07_HUMAN | BCR/ABL fusion protein isoform Y5 | 1790 | Evidence at transcript level
A9UF08_HUMAN | BCR/ABL fusion protein isoform Y6 | 406 | Evidence at transcript level
A9YD18_HUMAN | BCR/ABL e8a2 fusion protein (fragment) | 130 | Evidence at transcript level
B0ZRQ9_HUMAN | BCR/ABL e18-int1b-a2 fusion protein (fragment) | 54 | Evidence at transcript level
B0ZRR0_HUMAN | BCR/ABL e8a2 fusion protein (fragment) | 79 | Evidence at transcript level
B0ZRR1_HUMAN | BCR/ABL e14a2 fusion protein (fragment) | 162 | Evidence at transcript level
B1PL85_HUMAN | BCR/ABL fusion protein (fragment) | 49 | Evidence at transcript level
C0LYZ4_HUMAN | Mutant BCR/ABL fusion protein (fragment) | 215 | Evidence at transcript level
E7E8T7_HUMAN | BCR-ABL1 e8a2 variant (fragment) | 448 | Evidence at transcript level
F1JU33_HUMAN | Mutant BCR/ABL fusion protein (fragment) | 256 | Evidence at transcript level
Q13745_HUMAN | BCR-ABL mRNA encoding P185-ALL-ABL protein | 163 | Evidence at transcript level
Q13746_HUMAN | Bcr-abl mRNA of acute lymphocytic leukaemia (ALL) patients (fragment) | 386 | Evidence at transcript level
Q13846_HUMAN | Bcr-abl mRNA 5' (clone 3c) (fragment) | 77 | Evidence at transcript level
Q16189_HUMAN | BCR/ABL protein (fragment) | 46 | Evidence at transcript level
Q16190_HUMAN | BCR/ABL protein (fragment) | 43 | Evidence at transcript level
Q8NEY0_HUMAN | BCR-ABL fusion protein (fragment) | 69 | Evidence at transcript level
Q8NF93_HUMAN | BCRE3/ABL1A11 fusion protein (fragment) | 148 | Evidence at transcript level
Q8TDA2_HUMAN | BCRe18/ABL1e3 fusion protein (fragment) | 142 | Evidence at transcript level

Recently, Frank et al. [36] pointed out that up to 75–85% of spectra in a typical MS/MS experiment remain unidentified after database searches. Therefore, public MS/MS datasets include a wealth of information awaiting further interpretation, which may require either improved search algorithms or alternative ways of analyzing mass spectrometry data (such as two-step database search strategies including searches against ad hoc databases [28] or de novo peptide sequencing [37, 38]). The strategies proposed above make it possible to identify chimeric proteins not only in future experiments but also retrospectively, since public mass spectrometric data repositories are continuously growing and may contain as yet unidentified information supporting the occurrence of chimeric proteins. Notably, Frank et al. [36] proposed that publicly available MS/MS meta-data should be organized into spectral archives including identified and unidentified spectra detected in multiple experiments from different labs using similar technologies. These authors argue that such a strategy would synergize data interpretation across labs and could help to unravel previously hidden information, including the identification of chimerotypic peptides.

The combination of novel technologies such as deep DNA sequencing and bottom-up proteomic analyses may open new avenues for the identification of chimeric proteins. Continuous improvements in nucleic acid and amino acid sequence databases will probably enable a deeper understanding of protein chimerism not only in humans, but also in other higher eukaryotes. The characterization of such chimeric proteins holds great potential for understanding their biological implications.


J. Casado-Vela is a Jae-DOC grant holder funded by the Spanish Research Council (CSIC, Consejo Superior de Investigaciones Científicas) and Ministerio de Economía y Competitividad (Spain). This work has been supported by Ministerio de Economía y Competitividad, Red Temática de Investigación Cooperativa de Cancer (RTICC, RD06–002-0016), SAF2011–29699, and CAM S-2010 BMD-2326 to JCL.

The authors have declared no conflict of interest.