Keywords:

  • Bioinformatics;
  • Diversity;
  • Homology-driven proteomics;
  • Quantitative analysis;
  • Sequence similarity search

Abstract

  1. Top of page
  2. Abstract
  3. 1 Introduction
  4. 2 The optimal toolkit for exploring the Brazilian proteomosphere
  5. 3 Next generation strategies
  6. 4 Call to the proteomics adventure
  7. Acknowledgments
  8. 5 References

Our current knowledge of biology has been derived mostly from studying model organisms and cell lines; only a small fraction of all described species has been studied extensively. Although these model organisms are amenable to genetic manipulation, this focus blinds researchers to the true variability of life. Groundbreaking discoveries are often achieved by analyzing “noncanonical” species; for example, the characterization of Taq polymerase from Thermus aquaticus ultimately led to a revolution in the field of molecular biology. Brazil possesses a rich biodiversity and a considerable fraction of Brazilian groups use current proteomic techniques to explore this natural treasure-trove. However, in our opinion, much more than the widely adopted peptide spectrum match approach is required to explore this rich “proteomosphere.” Here, we provide a critical overview of the available strategies for the analysis of proteomic data from “noncanonical” biological samples (e.g. proteins from unsequenced genomes or genomes with high levels of polymorphisms), and demonstrate some limitations of existing approaches for large-scale protein identification and quantitation. An understanding of the premises behind these computational tools is necessary to properly deal with their limitations and draw accurate conclusions.

Abbreviation

PSM: peptide spectrum matching

1 Introduction

Brazil is a leading country in all levels of biodiversity: genetic, species, and ecosystems. The Amazon rainforest accounts for approximately one-tenth of all species in the world and the estimated Brazilian biota is about 10–20% of all species described so far [1]. The Brazilian floodplain (i.e. Pantanal) is home to more than 174 mammal species, several of which are endangered [2]. Among Brazilian species, a considerable number are endemic and many still need cataloging. According to the Brazilian Fund for Biodiversity (FUNBIO), Brazil's fauna and flora are among the most diverse, accounting for about 17% of the total bird species and approximately 10% of all known amphibians and mammals. Pharmacological use of Brazilian species is continuously growing and its economic potential is virtually unlimited. Significant efforts in Brazilian research are currently devoted to exposing the potential of the country's biodiversity in multiple areas. To this end, proteomics stands as a key discovery-driven tool for revealing new paths for molecular diversity assessment.

The history of science shows several successful cases where assessing the variability of life has revolutionized biology. For example, in 1969, Thomas D. Brock and Hudson Freeze characterized Thermus aquaticus, found in Yellowstone National Park, and their findings led to the discovery of Taq polymerase, which paved the way to modern molecular biology [3]. In Brazil, Dr. Sérgio Ferreira's group provided a groundbreaking contribution by characterizing bradykinin-potentiating peptides from the venom of the Brazilian viper Bothrops jararaca [4], which enabled the development of the first angiotensin-converting enzyme inhibitor (Captopril), used in the treatment of hypertension and congestive heart failure [5]. Recently, a Nobel Prize was awarded to Osamu Shimomura, Martin Chalfie, and Roger Y. Tsien for the discovery and characterization of GFP from Aequorea victoria, which revolutionized cellular biology.

2 The optimal toolkit for exploring the Brazilian proteomosphere

There are several well-established computational approaches for analyzing bottom-up proteomic data. To date, even the most sophisticated strategies have their roots in three cornerstone techniques: peptide spectrum matching (PSM) [6], tag searching [7], and de novo sequencing [8]. Each of these has advantages and disadvantages; understanding these foundations is therefore necessary if one wishes to embark on a “jungle proteomics” project.

2.1 PSM

PSM is certainly the most widely adopted method for protein identification. Its hallmark is the comparison of experimental spectra to theoretical spectra generated from a protein sequence database. Examples of PSM tools are SEQUEST [6], Mascot [9], and X! Tandem [10]. Confident matches are then statistically pinpointed according to an acceptable false discovery rate [11, 12], which can be accomplished using several available tools [13, 14]. The PSM approach is certainly the most conservative because it is fully dependent on a sequence database, which makes it blind to unexpected variability (e.g. polymorphisms and mutations). Existing algorithms are also limited to handling a few missed cleavages and only a small number of simultaneous variable modifications. Conventional proteomics search algorithms based on PSM require the genome of the organism of interest to be accurately sequenced and all ORFs to be correctly annotated. These conditions are often not met because of challenges in genome assembly, alternative splicing, and polymorphisms that are not represented in the databases. As a result, the fraction of confidently matched MS/MS spectra is typically limited (20–50%) [15], and this percentage becomes even lower when studying “noncanonical” proteomes.
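The database dependence described above can be illustrated with a minimal sketch. It is not the scoring scheme of SEQUEST, Mascot, or X! Tandem; it assumes a simplified tryptic digestion (no missed cleavages, no proline rule), singly charged b and y ions only, and a plain shared-peak count as the score. All function names, the toy protein, and the tolerances are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_peptides(protein, min_len=5):
    """Cleave after K or R (simplified: no missed cleavages, no proline rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def fragment_mzs(peptide):
    """Theoretical singly charged b and y ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions


def shared_peak_count(observed, theoretical, tol=0.02):
    """Number of theoretical peaks that have an observed peak within tol."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def best_match(observed, precursor_mass, database, precursor_tol=0.02):
    """Best-scoring database peptide whose mass matches the precursor."""
    best = (None, 0)
    for protein in database:
        for pep in tryptic_peptides(protein):
            if abs(peptide_mass(pep) - precursor_mass) > precursor_tol:
                continue  # candidate filtered out by precursor mass
            score = shared_peak_count(observed, fragment_mzs(pep))
            if score > best[1]:
                best = (pep, score)
    return best


database = ["AAGELVISLKPEPTIDER"]  # hypothetical one-protein database
reference = best_match(fragment_mzs("PEPTIDER"), peptide_mass("PEPTIDER"), database)
variant = best_match(fragment_mzs("PEPTVDER"), peptide_mass("PEPTVDER"), database)
print(reference)  # ('PEPTIDER', 14)
print(variant)    # (None, 0)
```

In this toy case a single I-to-V substitution shifts the precursor mass, so the variant peptide is never even considered as a candidate, which is the blindness toward polymorphisms noted above.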

2.2 Tag searching

The “Sequence Tag” peptide identification strategy combines information from short sequence stretches of, typically, two to four amino acid residues deduced from the MS/MS spectrum with the masses flanking each end of the tag and the precursor mass [7, 16]. Sequence tag searching is effective for error-tolerant searches because it assumes that one of these regions (and consequently the precursor mass) can mismatch, whereas the rest of the sequence is identical to the database peptide; thus, unexpected posttranslational modifications and sequence polymorphisms are tolerated to some extent. In this way, sequence tag searching is complementary to PSM and can, therefore, be used to validate or improve on conventional searches of large MS/MS data sets [17] and to help with genome annotation [18]. Sequence tag searching has been successfully applied to the identification of proteins not found in sequence databases and of proteins from unsequenced organisms via cross-species matching to related proteins in the database [19]. Increasing error tolerance results in a dramatic loss of search specificity; however, Sunyaev et al. [20] describe a strategy for improving confidence by combining information from several tags in different MS/MS spectra. Recent applications of sequence tag searching have also been described in top-down proteomics for the confident identification of proteins using higher-energy collisional dissociation or electron-transfer dissociation fragmentation without enzymatic digestion [21].
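The error-tolerant logic of a sequence tag can be sketched as follows. This is a simplified illustration, not the published algorithm [7]: the flanking masses are taken here as plain sums of residue masses on each side of the tag (rather than fragment ion masses), and the function names and tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_sum(seq):
    """Sum of residue masses of a sequence stretch."""
    return sum(RESIDUE_MASS[aa] for aa in seq)


def match_tag(peptide, n_mass, tag, c_mass, tol=0.02):
    """Classify how a (n_mass, tag, c_mass) triple fits a database peptide.

    Returns "full" if both flanking masses agree, "partial" if only one
    agrees (an error-tolerant hit), or None if the tag is absent or
    neither flank agrees.
    """
    result = None
    start = peptide.find(tag)
    while start != -1:
        n_ok = abs(residue_sum(peptide[:start]) - n_mass) <= tol
        c_ok = abs(residue_sum(peptide[start + len(tag):]) - c_mass) <= tol
        if n_ok and c_ok:
            return "full"
        if n_ok or c_ok:
            result = "partial"
        start = peptide.find(tag, start + 1)
    return result


db_peptide = "PEPTIDER"
# Tag "TID" read from a spectrum of the database peptide itself.
exact = match_tag(db_peptide, residue_sum("PEP"), "TID", residue_sum("ER"))
# Same tag, but the C-terminal region carries an E-to-Q substitution.
variant = match_tag(db_peptide, residue_sum("PEP"), "TID", residue_sum("QR"))
# Same tag, but both flanking regions disagree with the database peptide.
unrelated = match_tag(db_peptide, residue_sum("AAP"), "TID", residue_sum("QR"))
print(exact, variant, unrelated)  # full partial None
```

Accepting "partial" hits is what buys tolerance to polymorphisms and unexpected modifications, and it is also where the loss of specificity mentioned above comes from.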

2.3 De novo peptide sequencing

The hallmark of de novo sequencing is that it does not require a sequence database. It attempts to assign a sequence by relying exclusively on the precursor mass and fragmentation data [22]. Examples of de novo sequencing software are pNovo [23], PepNovo [24], and NovoHMM [25]. Although this strategy is very tempting, there are several limitations that must be considered in order to avoid misleading conclusions. De novo sequencing has been described as the most error-prone strategy [26]: because there is no sequence database to impose restrictions, the algorithm must work within a huge hypothesis space, which is aggravated by the exponential growth in the number of possible sequences as the mass of the parent ion increases. Indeed, de novo sequencing is rarely used exclusively in routine experiments, as it remains challenging to obtain confident results from shotgun experiments, although efforts by the scientific community have continuously pushed its limits. Recent algorithms take advantage of the improvements in resolution and mass accuracy provided by state-of-the-art mass spectrometers, resulting in dramatic improvements over previous-generation algorithms [27]. Additional efforts have successfully combined information from different fragmentation techniques (e.g. CID, electron-transfer dissociation, electron-capture dissociation) for de novo sequencing of the same precursor [28].
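The growth of the hypothesis space with precursor mass can be made concrete with a small counting exercise. The sketch below assumes nominal (integer) residue masses and unmodified peptides, and counts every ordered amino acid sequence whose residues sum to a given mass; the function name is hypothetical, and real de novo tools work with accurate masses and fragment evidence rather than this enumeration.

```python
# Nominal (integer) residue masses for the 20 standard amino acids.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def count_sequences(total_mass):
    """Number of ordered residue sequences whose nominal masses sum to total_mass."""
    ways = [0] * (total_mass + 1)
    ways[0] = 1  # the empty sequence
    for mass in range(1, total_mass + 1):
        # A sequence of this mass ends in some residue r; count the prefixes.
        ways[mass] = sum(
            ways[mass - r] for r in NOMINAL_MASS.values() if r <= mass
        )
    return ways[total_mass]


print(count_sequences(114))  # 2: "GG" and "N"
print(count_sequences(128))  # 4: "Q", "K", "GA", "AG"
for residue_mass_total in (500, 1000, 1500):
    print(residue_mass_total, count_sequences(residue_mass_total))
```

Even the two small cases show the familiar ambiguities (GG versus N, GA versus Q or K) that a de novo algorithm has to resolve from fragment ions alone.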

3 Next generation strategies

Several strategies that stem from the methods described above have recently emerged and have been demonstrated to be reliable, especially for characterizing exotic proteomes. Some examples are described below.

3.1 Large-scale sequence similarity searching

Despite the rapid progress in genomic sequencing, less than 0.01% of the approximately 1.7 million described species have their genomes fully sequenced (Fig. 1), which restricts the PSM approach to a minor fraction of the overall proteome. In these cases, identification of unsequenced proteins has been accomplished by a combination of de novo sequencing and similarity searching. Briefly, peptide sequences are predicted from the fragmentation spectra and then mapped to known conserved proteins from phylogenetically related species according to a similarity measure (e.g. MS-BLAST searching [29]). In these approaches, the hypothesis space of de novo interpretations is considerably reduced by limiting the predicted sequences to similar sequences in the database. Since the sequence similarity search tolerates multiple amino acid mismatches between the predicted and reference peptide sequences, it considerably improves the identification of proteins from organisms for which no sequence information is available [29]. Nevertheless, the efficacy of protein identification from unsequenced organisms is directly related to the degree of similarity to available protein sequences. Moreover, sequence similarity searching is still limited by the inaccuracy of de novo predictions, and these limitations are extremely challenging to overcome when peptides contain many possible or unexpected posttranslational modifications.
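As a rough illustration of mismatch-tolerant mapping, the sketch below scores each de novo peptide candidate by its best ungapped identity against proteins from a related species. It is not MS-BLAST [29], which uses substitution matrices and its own statistics; the identity threshold, the treatment of the isobaric residues I and L as equivalent, and all names and sequences are assumptions made for the example.

```python
def normalize(seq):
    """Treat the isobaric residues I and L as the same letter."""
    return seq.replace("I", "L")


def best_ungapped_identity(query, subject):
    """Highest fraction of identical positions over all ungapped placements."""
    q, s = normalize(query), normalize(subject)
    best = 0
    for offset in range(len(s) - len(q) + 1):
        window = s[offset:offset + len(q)]
        best = max(best, sum(a == b for a, b in zip(q, window)))
    return best / len(q) if q else 0.0


def similarity_hits(queries, database, min_identity=0.7):
    """Count, per database protein, the de novo peptides that map to it."""
    hits = {}
    for name, protein in database.items():
        count = sum(
            1 for q in queries
            if best_ungapped_identity(q, protein) >= min_identity
        )
        if count:
            hits[name] = count
    return hits


# Hypothetical proteins from a sequenced, related species.
database = {
    "homolog_A": "MKTLLVAGSPEPTIDERAAGK",
    "unrelated": "GGGGGGGGGGGGGGGG",
}
# Hypothetical de novo predictions from the unsequenced organism.
de_novo_peptides = ["PEPTVDER", "TLIVAGS"]

print(best_ungapped_identity("PEPTVDER", database["homolog_A"]))  # 0.875
print(similarity_hits(de_novo_peptides, database))  # {'homolog_A': 2}
```

Requiring several independent peptides to map to the same protein, as the per-protein count does here, is one simple way to compensate for the low specificity of any single mismatch-tolerant hit.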

Figure 1. Distribution of known species (The World Conservation Union, 2010); the proportion of species that have at least one Reference Sequence (RefSeq) in the NCBI database (release number 51, 01/09/2011; http://www.ncbi.nlm.nih.gov/RefSeq); and the percentage of organisms with sequenced genomes (The Genome News Network, GNN; http://www.genomenewsnetwork.org). Mitochondrion, plasmid, and plastid RefSeqs were excluded.

In recent years, new pipelines have been designed to improve cross-species identification of proteins by combining conventional PSM with sequence similarity searches via de novo sequencing, identifying approximately 25% more proteins [29]. Although the concept of joining de novo sequencing interpretations with sequence similarity searching is not new, only recently have efforts enabled the development of pipelines that allow large-scale proteomics studies of the large data sets generated by LC-MS/MS. This homology-driven strategy has proven to be a reliable way of improving the success rate of exploring the “unknome” [30, 31].

3.2 Spectral networks

To overcome some of the limitations of de novo sequencing, spectral networks [32] capitalize on sets of spectra from overlapping peptides. This relies on assembling mass spectra into spectral pairs by either joining overlapping spectra resulting from digestion using different enzymes, or from modified and unmodified versions of the same peptide. This procedure significantly reduces noise, and improves coverage, de novo sequencing predictions and identification of posttranslational modifications.
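The notion of a spectral pair formed by modified and unmodified versions of the same peptide can be sketched as follows. This is a deliberately reduced illustration rather than the spectral networks algorithm [32]: it assumes centroided peak lists, ignores intensities, and simply counts peaks that agree either directly or after shifting by the precursor mass difference. The peak values and the function names are hypothetical.

```python
def has_peak(peaks, mz, tol):
    """True if any peak lies within tol of mz."""
    return any(abs(p - mz) <= tol for p in peaks)


def spectral_pair_score(peaks_a, peaks_b, delta, tol=0.02):
    """Count peaks of A found in B either unshifted or shifted by delta."""
    return sum(
        has_peak(peaks_b, mz, tol) or has_peak(peaks_b, mz + delta, tol)
        for mz in peaks_a
    )


# Hypothetical fragment peaks of an unmodified peptide...
unmodified = [175.12, 304.16, 419.19, 532.27]
# ...and of a modified form in which the two heaviest fragments carry +15.99.
modified = [175.12, 304.16, 435.18, 548.26]

with_shift = spectral_pair_score(unmodified, modified, delta=15.99)
without_shift = spectral_pair_score(unmodified, modified, delta=0.0)
print(with_shift, without_shift)  # 4 2
```

Allowing the shift recovers the fragments that contain the modified residue, which is why paired spectra carry more sequence information than either spectrum alone.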

3.3 Spectral archives

In a complementary approach to search engines, the spectral archives strategy takes advantage of the ability to efficiently cluster billions of MS/MS spectra and then build a library of consensus spectra to represent each cluster [33]. The major advantage over traditional search engines is that both identifiable spectra (i.e. those that can be explained by a sequence database) and unidentifiable spectra can be matched against high-quality consensus spectra. Eventually, spectra that were not identified will be explained by new sequence information from the respective organism. Moreover, this approach is applicable to broader studies such as identifying spectra (and consequently sequences) that are common across species and experimental conditions.
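The clustering and consensus steps can be sketched in a few lines. The example below is a toy version under stated assumptions, not the method of [33]: spectra are binned at 1 m/z, compared by cosine similarity, assigned greedily to the first cluster above a hypothetical threshold, and summarized by a running average. Scaling to billions of spectra requires far more efficient indexing than this.

```python
import math
from collections import defaultdict


def binned(peaks, width=1.0):
    """Convert (m/z, intensity) pairs into a sparse vector of m/z bins."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / width)] += intensity
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_spectra(spectra, threshold=0.8):
    """Greedy clustering; each cluster keeps a running-average consensus."""
    clusters = []
    for index, peaks in enumerate(spectra):
        vec = binned(peaks)
        for cluster in clusters:
            if cosine(vec, cluster["consensus"]) >= threshold:
                cluster["members"].append(index)
                n = len(cluster["members"])
                old = cluster["consensus"]
                cluster["consensus"] = {
                    k: (old.get(k, 0.0) * (n - 1) + vec.get(k, 0.0)) / n
                    for k in old.keys() | vec.keys()
                }
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append({"members": [index], "consensus": vec})
    return clusters


# Hypothetical spectra: the first two are replicates, the third is unrelated.
spectra = [
    [(100.2, 10.0), (250.4, 5.0)],
    [(100.3, 9.0), (250.1, 6.0)],
    [(400.5, 7.0), (512.3, 3.0)],
]
clusters = cluster_spectra(spectra)
print([c["members"] for c in clusters])  # [[0, 1], [2]]
print(clusters[0]["consensus"][100])     # 9.5
```

Note that nothing in this procedure requires a peptide identification, which is why unidentified spectra can be archived and revisited once new sequence information becomes available.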

3.4 Note on quantitative proteomics

In any quantitative biological experiment, several factors should be considered to be as accurate as possible so that data from different labs can be shared and ultimately joined toward a common cause (e.g. discovery of new pharmaceutical drugs). There are numerous methods for performing large-scale protein quantitation, including metabolic labeling [34], chemical derivatization [35, 36], and label-free approaches [37-39]. While the advantages and disadvantages of each are beyond the scope of this paper, it is worth noting that most Brazilian publications and computational tools adopt label-free strategies [40-42], mostly because these provide results comparable (i.e. almost as sensitive) to labeled techniques [43] and are less expensive and laborious. Recent improvements in label-free methods have also come from acquiring data using data-independent strategies, such as all-ion fragmentation (MSE) [44], fragmenting windows of approximately 20 m/z [45], and correlating these windows with high-resolution precursor information [46]. However, confident quantification can only be achieved when the complete sequence of the protein is present in the database, and misleading conclusions can be drawn by quantifying proteins from unsequenced or partially sequenced organisms. When relying on databases from similar species, quantification by strategies that rely on PSMs (e.g. spectral counting) is of limited use since it depends on the identity between the unknown proteins and related proteins in the database. Alternatively, extracted ion chromatograms can be used as a quantification strategy for unsequenced organisms; they tend to be more accurate than spectral counts because the quantification is based on the relative intensities of identified precursors [39]. Yet, it is still not possible to predict the contribution of signal from the same peptide in another unknown protein within the unsequenced organism. Due to these unsolved limitations, it is clear that new methods need to be developed to overcome the challenges in quantifying proteins without primary structure information. Nonetheless, relative quantification within the same species using ion chromatograms remains a reasonable option to overcome the limitations described above [30].
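Relative quantification from extracted ion chromatograms can be sketched as follows. The example assumes that each run is available as a list of (retention time, peak list) MS1 scans, extracts the signal of one precursor within a ppm tolerance, and integrates it by the trapezoidal rule. The data layout, function names, tolerance, and values are hypothetical, and real pipelines additionally handle retention time alignment, charge states, and isotopes.

```python
def extract_xic(scans, target_mz, ppm=10.0):
    """Summed intensity near target_mz for each (retention time, peaks) scan."""
    tol = target_mz * ppm / 1e6
    return [
        (rt, sum(i for mz, i in peaks if abs(mz - target_mz) <= tol))
        for rt, peaks in scans
    ]


def xic_area(xic):
    """Trapezoidal area under an extracted ion chromatogram."""
    return sum(
        (t2 - t1) * (i1 + i2) / 2.0
        for (t1, i1), (t2, i2) in zip(xic, xic[1:])
    )


# Hypothetical MS1 scans (retention time in min, then m/z-intensity pairs)
# of the same precursor in two conditions of the same species.
run_a = [
    (0.0, [(500.25, 0.0)]),
    (1.0, [(500.25, 100.0), (600.00, 50.0)]),
    (2.0, [(500.25, 0.0)]),
]
run_b = [
    (0.0, [(500.25, 0.0)]),
    (1.0, [(500.25, 200.0), (600.00, 50.0)]),
    (2.0, [(500.25, 0.0)]),
]

area_a = xic_area(extract_xic(run_a, 500.25))
area_b = xic_area(extract_xic(run_b, 500.25))
print(area_a, area_b, area_b / area_a)  # 100.0 200.0 2.0
```

Because the ratio is computed for the same precursor across runs, it does not depend on how well the peptide maps to a protein in a related species' database, although the caveat about shared peptides from unknown proteins still applies.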

4 Call to the proteomics adventure

Current demand in the biological sciences means that natural and biological resources should be better explored. Here, we have described several complementary approaches that, when combined, help to pave the way toward exploration of a broader proteomosphere with unprecedented potential. We argue that this new biology should place its emphasis beyond cell lines and model organisms, and that we need to become better equipped for this adventure, where the exploratory space is virtually unlimited. Advances in mass spectrometry-based proteomics combined with new developments in large-scale deep sequencing can strengthen research toward a broader view of life and can increase the impact of science on society. We hope that this Viewpoint will inspire scientists in the proteomics field to embark on exciting new adventures in diversity exploration.

Acknowledgments

Because of size limitations, we could not cover all essential work in the field and we apologize for not being able to discuss many other important contributions. The authors thank Dr. Gilberto Barbosa Domont, Dr. Livia Goto-Silva, Dr. Juliana Fischer, and Dr. Charles Bradshaw for fruitful discussions and critical reading of the manuscript. M. J. was financially supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), grant no. 478066/2010–4 MCT/CNPq – Universal. PCC was financially supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES-Fiocruz 30/2006).

The authors have declared no conflicts of interest.

5 References
