A. Historical Perspectives: Then and Now
In 2015, it will be 40 years since the advent of proteomics, which was revolutionized by the introduction of two-dimensional gel electrophoresis (2-DGE) (Klose, 1975; O'Farrell, 1975; Scheele, 1975) and later refined with immobilized pH gradients (IPGs) (Bjellqvist et al., 1982; Righetti et al., 2008; Gianazza & Righetti, 2009; Görg et al., 2009) and MS (Aebersold & Mann, 2003; Yates, Ruse, & Nakorchevsky, 2009). Forty years may seem a long time, but the study of proteins under the term proteomics (Wilkins et al., 1995) is still quite young, fluid, and diversifying as a technology. Proteomics is one of three young high-throughput omics technologies, alongside genomics (transcriptomics) and metabolomics, which are now allied to high-throughput phenotyping (phenomics) and are being amalgamated into the field of systems biology (Ward & White, 2002; Bradshaw & Burlingame, 2005; Bradshaw, 2008; Souchelnytskyi, 2008; Coruzzi, Rodrigo, & Guttierrez, 2009). The relative youth of plant proteomics becomes apparent when we consider its widespread application in the isolation, identification, and cataloguing of proteins, and in addressing biological questions, from 2000 to the present, more than a decade of research in plant proteomics (for reviews and books, see Finnie, 2006; Samaj & Thelen, 2007; Thiellement, 2007; Agrawal & Rakwal, 2008a; Ranjithakumari, 2008; Weckwerth et al., 2008; Agrawal et al., 2011) (Fig. 1).
Based on publications on plant proteomics in PubMed, progress in the field can be divided into three phases: pre, initial, and progressive (Fig. 1). The pre-phase can be considered the beginning of proteomics, when one-dimensional gel electrophoresis (1-DGE) and 2-DGE were applied to separate proteins, which were then identified by N-terminal Edman sequencing. The initial phase started with the genome revolution from the year 2000 onwards. Since the publication of the draft genome sequences of two plants, Arabidopsis thaliana (weed and dicot model) (The Arabidopsis 2000) and rice (Oryza sativa L., cereal crop and monocot model: Goff et al., 2002; Yu et al., 2002) in 2000 and 2002, respectively, plant proteomics research has grown rapidly. This initial phase also saw an effort by the Arabidopsis scientific community to work toward the proteome of this model plant through the establishment of the Multinational Arabidopsis Steering Committee Proteomics subcommittee (MASCP, www.masc-proteomics.org). Since then, plant proteomics has moved into the progressive phase, in which researchers have enriched the scientific community through concerted efforts to publish review series on rice, plants, and protein phosphorylation, along with five books on plant proteomics. The initial years of this decade also saw the development of the idea of a global initiative on plant proteomics, which led to the establishment of the International Plant Proteomics Organization (INPPO, www.inppo.com). With more plant genomes being sequenced, from models to non-models (Feuillet et al., 2010; Agrawal et al., 2011), there is no turning back from the use of proteomics approaches in various aspects of plant biology research.
The biggest challenges facing the scientific community and the population in general are food security, human health, and our changing environment. Dealing with these issues is one of the visions behind the global movement on plant proteomics, from Arabidopsis (Jones et al., 2008; Wienkoop, Baginsky, & Weckwerth, 2010) and MASCP in the early 2000s to INPPO in 2011. At INPPO, we have defined ten initiatives that we hope to advance with the support of plant biologists around the world (Agrawal et al., 2011). We also refer to this as the Global Action Plan on Plant Proteomics in the 21st century (GAPs-21). As the acronym suggests, there is indeed a gap to be bridged among plant proteomics researchers worldwide: engaging in more cooperative research, breaking boundaries, and adopting an open-door policy to meet the pressing need for translational proteomics, that is, from the lab to the field (Agrawal et al., 2012a, 2012b). With this background, we discuss below some recent advances that are relevant to shaping future plant proteomics research and to tackling the issues raised in this review, namely food security and safety.
C. Proteogenomics and Genomic/Proteomic Databases
Facilitated by the speed and decreased cost of third-generation DNA sequencing, genome-wide sequencing of plant species, in particular the main food crops, is on the rise after a decade of sequencing of A. thaliana and the Indica and Japonica rice subspecies. In 2005, the first map-based sequence of the annotated rice genome was also completed (International Rice Genome Sequencing Project, 2005). Entire genomes are now becoming available for some of the major crops, such as maize (Schnable et al., 2009), sorghum (Paterson et al., 2009), potato (Xu et al., 2011), tomato, soybean, domesticated apple (Velasco et al., 2010), and banana (D'Hont et al., 2012), as recently reviewed (Feuillet et al., 2010; Miller, Eberinin, & Gianazza, 2010; Sonah et al., 2011) and listed at http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes#cite_note-42. With the explosion in the amount of available data, it is increasingly difficult to provide a complete and updated picture of genome availability. This review is therefore restricted to the main model and crop species and to some basic aspects of their genomic and proteomic sequence acquisition and availability.
In general, genome sequence assembly and annotation of crops are challenging tasks because the genomes are large and repetitive transposable elements typically constitute over 80% of the genome, as is the case for the 2.3-billion-base maize genome (Schnable et al., 2009), barley, and wheat. Polyploidy is another challenge for many cultivated crops, for example, wheat, potato, tomato, oilseed rape (Brassica napus), and even fruit crops (such as banana or strawberry), and requires independent sequencing of the various wild-type haplotypes (Shulaev et al., 2011).
Genome annotation, which primarily provides information on gene function, relies on the prediction of protein-encoding genes by sequence comparison or in silico gene prediction. However, validating predicted open reading frames (ORFs) depends on extensive transcriptomic sequence data, such as the tens of thousands of unique cDNAs that were recently sequenced and assembled for barley (Matsumoto et al., 2011) and maize (Soderlund et al., 2009). Alternatively, or in combination, a proteogenomics approach using large-scale shotgun proteomics has proven extremely powerful in discovering unpredicted ORFs in the extensively and intensively annotated genomes of model organisms, such as fly, human, and Arabidopsis (Castellana et al., 2008; Castellana & Bafna, 2010). For Arabidopsis, this is illustrated by the 13% new ORFs that were identified in an in-depth proteo(geno)mics study (Castellana et al., 2008).
Improved MS-based proteomic workflows now allow proteogenomics to become the method of choice for validating the exon–intron structures of ORFs by mapping the identified peptides to the genome and grouping these peptides into proteins (Ansong et al., 2008; Armengaud, 2010). Such an approach has already been described extensively not only for Arabidopsis (Castellana et al., 2008) but also for rice (Helmy, Tomita, & Ishihama, 2011) and for fungal pathogens of wheat and barley (Bringans et al., 2009; Bindschedler et al., 2011). Proteogenomics can use imperfect genomic databases to identify proteins by proteomic means (Ansong et al., 2008; Castellana et al., 2010; Agrawal et al., 2011; Bindschedler et al., 2011) and can help to annotate short or species-specific ORFs. Therefore, newly assembled and (poorly) annotated crop genomes still enable proteomic investigations. This is important because plant entries in protein sequence databases such as UniProtKB lag well behind those for species of other kingdoms. For instance, Viridiplantae is underrepresented in UniProtKB, with only 32,666 of 536,789 total entries, or about 6% of all protein entries (http://www.uniprot.org/program/plants/statistics - accessed in July 2012). About one third of these plant entries are from Arabidopsis (10,617), and over 2,000 are from rice.
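The proportions quoted above can be checked with a few lines of arithmetic. The short Python sketch below uses only the three entry counts given in the text; the variable names are ours and purely illustrative.

```python
# Entry counts as quoted in the text (UniProtKB statistics, accessed July 2012).
total_entries = 536_789        # all UniProtKB entries
plant_entries = 32_666         # Viridiplantae entries
arabidopsis_entries = 10_617   # Arabidopsis entries among the plant entries

# Share of plant entries in the whole database.
plant_share = plant_entries / total_entries
# Share of Arabidopsis entries among the plant entries.
arabidopsis_share = arabidopsis_entries / plant_entries

print(f"Viridiplantae share of all entries: {plant_share:.1%}")
print(f"Arabidopsis share of plant entries: {arabidopsis_share:.1%}")
```

Run as written, this reports a plant share of about 6.1% and an Arabidopsis share of about 32.5%, consistent with "less than 10%" and "one third".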
Concomitant with the emerging plant genomic and proteomic information (Armengaud, 2010; Renuse, Chaerkady, & Pandey, 2011), new bioinformatic tools are being developed to automatically map identified peptides on whole genomes (Sanders et al., 2011; Specht et al., 2011) and assign function to unknown proteins (Bindschedler et al., 2011). Such technological developments will make proteogenomic approaches even more popular and suitable for plant proteomics and complement quantitative plant proteomic measurements (Bindschedler & Cramer, 2011a) by providing the necessary data on protein identity and function for the investigation of plant proteomes.
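To make the idea of mapping identified peptides onto a genome more concrete, the following minimal Python sketch translates a DNA sequence in all six reading frames and reports where each peptide matches. It is an illustration under simplifying assumptions, not the method of any of the tools cited above: it uses the standard genetic code, exact string matching, and a contiguous (intron-free) sequence, so peptides spanning splice junctions would be missed. The function names, the toy sequence, and the peptide are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map (strand, frame offset) to the translated sequence."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def map_peptides(dna, peptides):
    """Return (peptide, strand, frame, start) hits.

    'start' is the 0-based position, on the forward strand, of the first
    base of the genomic span that encodes the peptide.
    """
    hits = []
    frames = six_frame_translations(dna)
    for peptide in peptides:
        span = 3 * len(peptide)
        for (strand, offset), protein in frames.items():
            pos = protein.find(peptide)
            while pos != -1:
                start = offset + 3 * pos           # on the translated strand
                if strand == "-":                  # convert to forward coordinates
                    start = len(dna) - (start + span)
                hits.append((peptide, strand, offset, start))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy example: a short hypothetical sequence and one hypothetical peptide.
genome = "CCATGAAAACCGCTTATATTGCGAAATAA"
for hit in map_peptides(genome, ["TAYIAK"]):
    print(hit)
```

Grouping the mapped peptides into proteins and resolving exon–intron boundaries, as the dedicated tools do, would require additional steps that are not shown here.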
Resources for crop proteomics such as genomic and proteomic databases are still in their infancy. Most resources, with the exception of some maize and rice databases (Tables 1 and 2), have been developed for the model plant system Arabidopsis as reviewed by Weckwerth et al. (2008).
Table 1. A non-exhaustive list of some genomic and proteomic databases and websites for food crops
Table 2. Proteomics resources
D. Preparing the Stage for Systems Biology: Data Integration at Systems Level
Messenger RNA (mRNA)-based approaches are extremely powerful and highly automated, allowing massive screening of many genes at once. However, it is important to recognize that there may be a discrepancy between the messenger (transcript) and its final effector (mature protein). As most biological functions in a cell are executed by proteins and metabolites rather than by mRNA, transcript profiling does not always provide pertinent information for the description of a biological system. Expression studies on prokaryotic as well as lower and higher eukaryotic organisms revealed in certain cases a poor correlation between mRNA transcript level and protein abundance (Gygi et al., 1999; Griffin et al., 2002; Corbin et al., 2003; Greenbaum et al., 2003; MacKay et al., 2004; Tian et al., 2004; Carpentier et al., 2008b) or enzyme activity (Gibon et al., 2004). If transcripts are only an intermediate on the way to functional proteins, and proteins in turn regulate metabolite abundances, why measure mRNA? A correlation between mRNA and protein abundance clearly does exist, several studies have found a correlation between mRNA and metabolites (Goossens et al., 2003; Hirai et al., 2004), and all networks in the cell are connected. Furthermore, each approach has its own biases and drawbacks. Hence, biological variables from transcripts, proteins, and metabolites need to be integrated to understand systems biology, and this integration will lead to new insights. Saito, Hirai, and Yonekura-Sakakibara (2008) review the strategy of combining transcriptome and metabolome studies as a powerful tool for helping the annotation of plant genomes.
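As a minimal illustration of how the agreement between transcript and protein levels might be quantified, the Python sketch below computes Pearson and Spearman (rank) correlation coefficients for paired measurements. The abundance values are invented for illustration and do not come from any of the studies cited above; the choice of these two coefficients is also our assumption, since the cited studies used a variety of measures.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical paired measurements for six genes (arbitrary units).
mrna_level = [120.0, 85.0, 300.0, 40.0, 210.0, 15.0]
protein_level = [2.1, 3.5, 2.8, 0.9, 1.2, 1.0]

print(f"Pearson r    = {pearson(mrna_level, protein_level):.2f}")
print(f"Spearman rho = {spearman(mrna_level, protein_level):.2f}")
```

A rank-based coefficient is often preferred for abundance data because it is insensitive to the very different dynamic ranges of transcript and protein measurements.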
But integrating data from different biological variables to understand the dynamic phenotype of a plant is even more challenging: (i) good algorithms and statistics are needed to extract significant information and to cope with the high dimensionality of the data, (ii) the data have to be of good quality, and (iii) the experimental set-up with the plants should not be too complicated. Wienkoop et al. (2008) present an approach to investigate the combined covariance structure of metabolite and protein dynamics in a systemic response to abiotic temperature stress in Arabidopsis wild-type plants. The concept of high-dimensional data profiling, followed by multivariate statistics for dimensionality reduction and covariance structure analysis, is a powerful strategy for studying the systems biology of a plant under particular conditions. The systematic integration of transcript, protein, and metabolite profiling needs to be modeled over time to find the correlations between the different levels, and this is a challenge for bioinformaticians and statisticians (Weckwerth, 2011).
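The general idea of analysing the combined covariance structure of protein and metabolite data can be sketched as follows. This Python example is not the procedure of Wienkoop et al. (2008); it is a generic illustration that assumes principal component analysis as the dimensionality-reduction step, implemented with a simple power iteration. The sample values, variable layout, and function names are hypothetical.

```python
import random
from math import sqrt


def zscore_columns(rows):
    """Centre each variable (column) and scale it to unit sample variance."""
    n = len(rows)
    scaled = []
    for column in zip(*rows):
        mean = sum(column) / n
        sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
        scaled.append([(v - mean) / sd for v in column])  # assumes sd > 0
    return [list(row) for row in zip(*scaled)]


def covariance_matrix(rows):
    """Sample covariance matrix of columns that are already centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=500, seed=0):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    rng = random.Random(seed)
    p = len(cov)
    vec = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                     for i in range(p) for j in range(p))
    return vec, eigenvalue


# Hypothetical data: rows are samples, columns are two proteins followed by
# two metabolites (arbitrary units).
labels = ["control", "control", "control", "cold", "cold", "cold"]
samples = [
    [1.0, 5.0, 0.8, 2.0],
    [1.2, 4.8, 1.0, 2.1],
    [0.9, 5.2, 0.9, 1.9],
    [2.5, 3.0, 2.2, 2.0],
    [2.8, 2.7, 2.5, 2.2],
    [2.6, 3.1, 2.4, 1.8],
]

scaled = zscore_columns(samples)
loadings, variance = first_component(covariance_matrix(scaled))
scores = [sum(v * w for v, w in zip(row, loadings)) for row in scaled]

for label, score in zip(labels, scores):
    print(f"{label:8s} score on component 1: {score:+.2f}")
print(f"Share of total variance on component 1: {variance / len(loadings):.2f}")
```

Because proteins and metabolites are scaled and analysed in one matrix, the loadings of the first component indicate which variables of either type co-vary across the samples. Modeling such data over time, as called for above, would need methods beyond this sketch.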