These authors contributed equally to this work.
Prediction of dual protein targeting to plant organelles
Article first published online: 7 APR 2009
© The Authors (2009). Journal compilation © New Phytologist (2009)
Volume 183, Issue 1, pages 224–236, July 2009
How to Cite
Mitschke, J., Fuss, J., Blum, T., Höglund, A., Reski, R., Kohlbacher, O. and Rensing, S. A. (2009), Prediction of dual protein targeting to plant organelles. New Phytologist, 183: 224–236. doi: 10.1111/j.1469-8137.2009.02832.x
¹ Sensitivity = TP/(TP + FN), the proportion of actual positives that are correctly identified. Specificity = TN/(TN + FP), the proportion of actual negatives that are correctly identified. (TP, true positives; TN, true negatives; FP, false positives; FN, false negatives.)
- Issue published online: 3 JUN 2009
- Received: 27 November 2008; Accepted: 15 February 2009
- ambiguous targeting;
- genome annotation;
- intracellular sorting;
- Dual targeting of proteins to more than one subcellular localization has been found in animals, in fungi and in plants. In the latter, ambiguous N-terminal targeting signals have been described that result in the protein being located in both mitochondria and plastids. We have developed ambiguous targeting predictor (ATP), a machine-learning implementation that classifies such ambiguous targeting signals.
- Ambiguous targeting predictor is based on a support vector machine implementation that makes use of 12 different amino acid features. Prediction results were validated using fluorescent protein fusion.
- Both in silico and in vivo evaluations demonstrate that ambiguous targeting predictor is useful for predicting dual targeting to mitochondria and plastids. Proteins that are targeted to both organelles by tandemly arrayed signals (so-called twin targeting) can be predicted by both ambiguous targeting predictor and a combination of single targeting prediction tools. Comparison of ambiguous targeting predictor with previous experimental approaches, as well as in silico approaches, shows good congruence.
- Based on the prediction results, land plant genomes are expected to encode, on average, > 400 proteins that are located in mitochondria and plastids. Ambiguous targeting predictor is helpful for functional genome annotation and can be used as a tool to further our understanding about dual protein targeting and its evolution.
N-terminal targeting signals that target proteins to mitochondria, plastids and the secretory pathway are not conserved at the level of the primary sequence. Therefore, various machine learning approaches have been employed to identify typical features of such signals and to predict a protein's subcellular localization. Numerous approaches have been implemented and are available as prediction services (Nakai & Horton, 1999; Emanuelsson et al., 2000; Guda et al., 2004; Small et al., 2004; Boden & Hawkins, 2005; Höglund et al., 2006). Prediction accuracy is often high, yet, even the best chloroplast predictor, TargetP, has a false positive rate of approx. 69% and a true positive rate of approx. 86% (Zybailov et al., 2008) and there are sets of consistently misclassified proteins, some of which we address in this study. Existing tools usually assume that each protein is targeted to a single location (i.e. that the targeting signals unambiguously determine the final location of the mature protein). However, this is not always the case. In the last decade, dual targeting of a multitude of proteins has been described for native plant proteins (Peeters & Small, 2001; Silva-Filho, 2003; Mackenzie, 2005). For certain gene families, such as Arabidopsis thaliana aminoacyl-transfer RNA (tRNA) synthetases, dual mitochondrial/plastidal targeting is the rule (17/24 proteins) rather than the exception (Duchene et al., 2005). The previously unexpected high rate of dual targeting has even led to higher estimates for the size of the plastid (2700) and the mitochondrial (2000) proteomes (Millar et al., 2006), the plastid proteome having been estimated to be larger still (> 3400) in other studies (van Wijk, 2004). Protein targeting can depend on the developmental stage (e.g. the tissue type), as has been demonstrated for secretory pathway targeting in seeds and leaves of Nicotiana tabacum (Petruccelli et al., 2006). 
Also, protein folding, post-translational modification and protein–protein interaction can be involved in determining the targeting of proteins with multiple sites of action (Karniely & Pines, 2005). The importance of cis-elements, especially of the 5′ untranslated region (UTR), for determining the subcellular localization of dually targeted proteins has been demonstrated in several cases (Christensen et al., 2005; Kabeya & Sato, 2005; Sunderland et al., 2006; Puyaubert et al., 2008). Organelles might ‘compete’ for dually targeted proteins. In such a ‘tug-of-war’ scenario, highly efficient transport to one organelle might occur, obscuring the localization of the protein to the second target (Karniely & Pines, 2005). Also, a high abundance of a protein at one localization might render its detection in alternative localizations all but impossible (Duchene et al., 2005; Karniely & Pines, 2005). These factors make the straightforward detection of dual targeting using experimental approaches difficult.
During evolution, after transfer of an organellar gene to the nucleus, the gene needs to acquire a targeting signal in order for the encoded protein to be imported into the gene's organelle of origin. Acquisition and subsequent evolution of a hydrophobic stretch at the N-terminus might be a prerequisite for this (Michl et al., 1999). Such signals might come into being by exon shuffling (i.e. by the acquisition of a pre-existing exon), by integration into a gene already carrying a targeting signal (Adams et al., 2000) or by a random process involving transcription and translation of the 5′ stretch of DNA. In terms of evolution of dual-targeting capability, alteration of dual-targeting signals in response to dietary necessities has been observed in Herbivora and Carnivora (Birdsey et al., 2004). In a biotype of the weed Amaranthus tuberculatus, herbicide resistance evolved via a codon deletion conferring dual targeting to mitochondria and plastids (Patzoldt et al., 2006). Once two functionally redundant genes are encoded in the nuclear genome, the evolution of a dual targeting signal and the subsequent deletion of one of the gene copies follows the principle of parsimony in evolution. In fact, deletion of both organellar and nuclear gene copies has been demonstrated recently in the case of the dually targeted plant ribosomal protein, S16 (Ueda et al., 2008). On the other hand, establishment of dual targeting for nonredundant proteins might enable neo-functionalization of organelles. Dual targeting might also serve to couple cytoplasmic processes with organellar processes (e.g. division, signaling, stress tolerance). A substantial fraction of A. thaliana (56) and Oryza sativa (103) transcription factors were predicted to be dually targeted to the nucleus and plastids or mitochondria (Schwacke et al., 2007), and therefore, dually targeted proteins might enable enforcement of nuclear control upon organelles (e.g. through RNA polymerases, transcription factors and tRNA synthetases).
Two principally different dual targeting mechanisms have been suggested: twin signals and ambiguous signals (Mackenzie, 2005). Whereas twin signals rely on two forms of the preprotein being translated upon different transcriptional or translational initiation or alternative splicing, ambiguous targeting signals have been shown to guide the same preprotein into two different compartments. In addition, protein isoforms generated by a twin mechanism can also be subject to ambiguous targeting, increasing the combinatorial complexity (von Braun et al., 2007; Puyaubert et al., 2008). N-terminal targeting sequences conferring targeting to mitochondria or plastids have a similar overall composition. Ambiguous targeting signals are similar to both signals; they are enriched in serine and arginine, and deficient in asparagine, glutamic acid and glycine, by comparison with mature proteins (Pujol et al., 2007). Investigation of the leading 20 residues showed that arginine is abundant in mitochondrial targeting sequences compared with those of chloroplasts, and ambiguous targeting sequences represent an intermediate situation.
It appears as if dual protein targeting is an iceberg of which we know only the tip, as only around 12 twin targeted proteins are known from plants to date, while c. 50 proteins exhibiting ambiguous targeting have been described. Dually targeted proteins are often misclassified by current prediction tools because a potential second localization is neglected. Among the scarce amount of data available to date, a total of 40 plant proteins have been described to contain an ambiguous targeting signal that directs them to both mitochondria and plastids. Therefore, we aimed to develop a tool for the accurate prediction of such ambiguous dual targeting. We tested the prediction results using the model plant Physcomitrella patens because dual targeting has been described to occur in this organism (Richter et al., 2002; Kiessling et al., 2004; Kabeya & Sato, 2005), detection of fluorescent protein fusions using transient protoplast transfection assays is a standard technique (Frank et al., 2005; Quatrano et al., 2007) and the genome sequence is available (Rensing et al., 2008).
Materials and Methods
Cultivation of plant material, RNA isolation and cDNA synthesis
P. patens (Hedw.) Bruch & Schimp. ssp. patens ‘Gransden 2004’ (Rensing et al., 2008) was cultivated as described previously (Bierfreund et al., 2003). To isolate RNA, protonema was harvested, frozen in liquid nitrogen and disrupted with a ball mill for 1 min at 30 Hz. The frozen material was mixed with 1 ml of Trizol reagent (Invitrogen) per 100 mg of plant material. After 5 min of incubation at 20°C and 20 min of centrifugation at 5000 g and 4°C, chloroform extraction (0.2 ml ml−1 of Trizol) and isopropanol precipitation (0.5 ml ml−1 of Trizol) of the supernatant were carried out. Reverse primers were situated well downstream of the end of the putative signal peptide and contained an EcoRV restriction site at their 5′ end; the forward primers were situated at the first ATG codon (early response to dehydration 4 homolog (ERD4)) or at the beginning of the 5′-UTR (pectin methylesterase (PME), phosphatidylinositol-dependent phospholipase C (PLC), a plastid division protein (FtsZ), fasciclin-like protein (FLP) and delta-aminolevulinic acid dehydratase 2 (Hem2)) and contained, at their 5′ end, a BamHI restriction site as well as two additional bases in front of it to improve restriction efficiency. For complementary DNA (cDNA) synthesis, the RNA was treated with DNAse I (2.5 U per 10 µg of RNA) and ethanol precipitation was carried out to remove remnants of enzyme and buffer. Per reaction, 1–2 µg of DNAse I-treated RNA was used. The first-strand synthesis was performed using M-MuLV reverse transcriptase (Fermentas, St Leon-Rot, Germany), according to the manufacturer's protocol.
Cloning of PCR products was performed using the ‘TOPO TA cloning kit for sequencing’ (Invitrogen), according to the provided protocol, and clones were verified by sequencing with T3/T7 primers. After digestion with BamHI and EcoRV, the DNA fragments obtained were ligated into a modified reporter vector, mAV4 (Kircher et al., 1999), containing a cyan fluorescent protein (CFP) gene instead of a green fluorescent protein (GFP) gene, yielding N-terminal fusions of the targeting signals to the fluorescent protein. The following oligonucleotide primers (Biomers, Ulm, Germany or Operon, Cologne, Germany) were used for cDNA synthesis and reverse transcription (RT)-PCR:
PME forward, ATGGATCCTCGTTCCTCGCTGGGATCAG;
PME reverse, GATATCAGGAATGTAGATCACAATGCG;
FLP forward, ATGGATCCGCACCGCAAATTTCAAACTG;
FLP reverse, GATATCATCTGGGGCAATTACGGTGAC;
FtsZ forward, ATGGATCCGCCGTGTTGCGTAGCCTTTG;
FtsZ reverse, GATATCCCGCTTCTGTAGATGCACAAG;
PLC forward, ATGGATCCATGGTGTCTATTGCGCGATTG;
PLC reverse, GATATCTACTCGGTGACCGTTAAATTC;
Hem2 forward, ATGGATCCATGGTAGGTGTGATGATGGC;
Hem2 reverse, GACATCTGGGAGGATGAAATTTGCAGG;
ERD4 forward, ATGGATCCATGACGGCTACAGCAGCGTTC;
ERD4 reverse, GATATCGAAGTTGTTATTCTCCGTCGC.
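The primer design rules stated above (forward primers carry a BamHI site preceded by two additional bases, reverse primers carry an EcoRV site at the 5′ end) can be checked mechanically against the listed sequences. The following is a minimal sketch; the recognition sequences GGATCC (BamHI) and GATATC (EcoRV) are the standard ones, while the function name `check_primers` and the assumption that the site must sit at exactly these positions are illustrative choices, not part of the original protocol.

```python
# Standard recognition sequences of the two restriction enzymes.
BAMHI = "GGATCC"
ECORV = "GATATC"

# Primer sequences exactly as listed in the text above.
PRIMERS = {
    "PME forward": "ATGGATCCTCGTTCCTCGCTGGGATCAG",
    "PME reverse": "GATATCAGGAATGTAGATCACAATGCG",
    "FLP forward": "ATGGATCCGCACCGCAAATTTCAAACTG",
    "FLP reverse": "GATATCATCTGGGGCAATTACGGTGAC",
    "FtsZ forward": "ATGGATCCGCCGTGTTGCGTAGCCTTTG",
    "FtsZ reverse": "GATATCCCGCTTCTGTAGATGCACAAG",
    "PLC forward": "ATGGATCCATGGTGTCTATTGCGCGATTG",
    "PLC reverse": "GATATCTACTCGGTGACCGTTAAATTC",
    "Hem2 forward": "ATGGATCCATGGTAGGTGTGATGATGGC",
    "Hem2 reverse": "GACATCTGGGAGGATGAAATTTGCAGG",
    "ERD4 forward": "ATGGATCCATGACGGCTACAGCAGCGTTC",
    "ERD4 reverse": "GATATCGAAGTTGTTATTCTCCGTCGC",
}


def check_primers(primers):
    """Return the names of primers that do not match the stated design."""
    problems = []
    for name, seq in primers.items():
        if name.endswith("forward"):
            # Two additional bases, then the BamHI site (assumed position).
            ok = seq[2:8] == BAMHI
        else:
            # EcoRV site directly at the 5' end (assumed position).
            ok = seq.startswith(ECORV)
        if not ok:
            problems.append(name)
    return problems


print(check_primers(PRIMERS))
```

As transcribed here, the only primer flagged is the Hem2 reverse primer, which begins with GACATC rather than GATATC. Whether this reflects the oligonucleotide actually used or a transcription error cannot be decided from the text alone.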
Transient transfection of P. patens protoplasts and confocal laser scanning microscopy
Protoplast transfection was performed as previously described (Frank et al., 2005). After at least 3 d of regeneration, protoplasts were analyzed using an LSM 510-i confocal laser scanning microscope (Carl Zeiss, Jena, Germany). To avoid false-positive detection of chloroplast signals, linear unmixing was carried out to separate the CFP spectrum from plastid autofluorescence. MitoTracker green FM (Invitrogen), mAV4–CFP and the signal peptide of FtsZ1-2 in-frame with GFP, which has been described as chloroplast localized (Kiessling et al., 2004), were used as controls.
Implementation of the ambiguous targeting predictor
Training and test data sets
All negative and positive examples are available as fasta files on the ambiguous targeting predictor website. The ambiguous targeting predictor training data set consists of 43 proteins that have been described in the literature to be ambiguously targeted to mitochondria and plastids (Table S1). Another 44 proteins were used as negative examples (10–12 proteins each that were described to be exclusively targeted to the cytoplasm, plastids, mitochondria and the secretory pathway) in order to achieve a balanced training set. The negative examples were mostly derived from the TargetP (Emanuelsson et al., 2000) data set. In order to keep the approximate species distribution of the positive examples, some more recent sequence entries from SwissProt were also included.
For testing, 27 additional (independent) single targeted proteins were added to the 44 negative examples mentioned above. The resulting 71 sequences (none of which share > 48% identical positions within the N-terminal 70 amino acids) are represented by triangles in Fig. 1(b). Together with the 43 positive examples mentioned in the previous paragraph (squares in Fig. 1b), these sequences were used to generate the receiver operating characteristic (ROC) plot (Fig. 1a). In addition, seven independent positives from several species (A. thaliana, N. tabacum, Zea mays), none of them sharing more than 31% sequence identity within the N-terminal 70 amino acids with any of the 43 positive proteins of the training data set, were used for testing (circles in Fig. 1b).
We used support vector machines (SVMs) to analyze the N-terminal part of the amino acid sequences. To this end, the leading 70 amino acids were scanned using a sliding window approach with a step size of one in order to generate the feature vectors used as SVM input (Fig. S1). The window size was left variable so that it could be optimized. The primary sequence within each window was disregarded; instead, the amino acid composition was derived. The following 12 amino acid features were used: hydrophobicity; random coil; alpha helix; beta sheet; beta turn; negative residues; positive residues; small residues; tiny residues; arginine; alanine; and leucine/phenylalanine. The amino acid composition was evaluated for each amino acid feature based on the AAindex (Kawashima & Kanehisa, 2000). The feature values provided by this database were normalized using the normalize command, which sets the smallest value to 0.0 and the highest value to 1.0. These normalized values provided (in addition to the window size) a second variable for each amino acid feature: a cut-off that determines whether or not this particular feature is considered present at a given position of the sequence. For each amino acid feature a single SVM was trained and both variables (window size and feature cut-off) were optimized in a grid search approach (Table 1, Fig. S1). For the three secondary-structure features (alpha helix, beta sheet and beta turn), all three were calculated for every position; if one of the other two yielded a better result than the feature under consideration at a given position, that feature was not taken into account there, even if it was above the optimized cut-off.
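The sliding-window encoding described above can be illustrated for a single amino acid feature. This is a minimal sketch under stated assumptions: the Kyte-Doolittle hydropathy scale stands in for the (unspecified) AAindex entry, the per-window value is taken to be the mean normalized feature value, and the feature is called present when that mean reaches the cut-off. The function names are hypothetical and the original implementation may differ in each of these points.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale a per-residue scale so the smallest value is 0.0 and the largest 1.0."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (v - lo) / (hi - lo) for aa, v in scale.items()}


def window_features(seq, scale, window, cutoff, n_term=70):
    """Encode the N-terminal region as one 0/1 value per window position.

    The window moves with a step size of one; residue order inside a window
    is ignored, only the mean normalized feature value is used (assumption).
    Residues missing from the scale (e.g. X) count as 0.0.
    """
    norm = min_max_normalize(scale)
    region = seq[:n_term]
    features = []
    for start in range(len(region) - window + 1):
        win = region[start:start + window]
        mean = sum(norm.get(aa, 0.0) for aa in win) / window
        features.append(1 if mean >= cutoff else 0)
    return features


print(window_features("IIIIRRRR", KD, window=4, cutoff=0.5))
```

For a sequence of at least 70 residues and a window of 10, this yields 61 values per feature; window size and cut-off are the two variables that the grid search tunes.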
| Feature | Short name | Window size | Cut-off | c | γ | MCC |
Based on the data of each SVM, a five-fold cross-validation was carried out and the Matthews correlation coefficient (MCC) (Eqn 1) was calculated as a measure that takes sensitivity, as well as specificity, into account:
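$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad \text{(Eqn 1)}$$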
(TP, true positive; TN, true negative; FP, false positive; and FN, false negative.) As kernel for the SVMs, the radial-basis function was used, which has been shown to be very efficient for this type of biological targeting prediction (Höglund et al., 2006). The optimal SVM parameters c and γ were identified in a grid search (Table 1, Fig. S1).
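For reference, the MCC in its standard definition can be computed directly from the four confusion-matrix counts. The sketch below is a generic helper, not the authors' code; returning 0.0 when the denominator vanishes is a common convention and an assumption here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined case (an empty row or column); 0.0 by convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(mcc(tp=10, tn=10, fp=0, fn=0))   # perfect prediction
print(mcc(tp=5, tn=5, fp=5, fn=5))     # no better than chance
```

The value ranges from −1 (complete disagreement) through 0 (chance level) to 1 (perfect prediction), which is why it can serve as a weight for the individual classifiers.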
As mentioned in the previous section, a single SVM was trained for each feature using the radial basis kernel. For the first training, c and γ were set to estimated default values (c, 0.03125; γ, 0.5) and the AAindex sliding window size and feature cut-off were optimized using a grid search based on five-fold cross-validation (Fig. S1). Using these optimized variables, the kernel variables c and γ were optimized in the second grid search (Fig. S1). In the third and final grid search, a second optimization of the AAindex sliding window size and feature cut-off was performed, yielding the final variable sets (optimized parameters, Fig. S1). Using these sets, training of each SVM was carried out individually on the positive and negative data sets.
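The three consecutive grid searches can be sketched as follows. The function `evaluate(window, cutoff, c, gamma)` is hypothetical: it stands for training one feature-specific SVM and returning its five-fold cross-validated MCC, and must be supplied by the caller. The grids are likewise placeholders; only the default values c = 0.03125 and γ = 0.5 are taken from the text.

```python
from itertools import product


def best_pair(grid_a, grid_b, score):
    """Return the (a, b) combination from the grid with the highest score."""
    return max(product(grid_a, grid_b), key=lambda ab: score(*ab))


def three_stage_search(evaluate, windows, cutoffs, cs, gammas,
                       c0=0.03125, gamma0=0.5):
    """Alternate between feature variables and kernel variables.

    Stage 1: optimize window size and cut-off with default c and gamma.
    Stage 2: optimize c and gamma with the stage 1 window size and cut-off.
    Stage 3: re-optimize window size and cut-off with the stage 2 kernel values.
    """
    window, cutoff = best_pair(
        windows, cutoffs, lambda w, t: evaluate(w, t, c0, gamma0))
    c, gamma = best_pair(
        cs, gammas, lambda cc, gg: evaluate(window, cutoff, cc, gg))
    window, cutoff = best_pair(
        windows, cutoffs, lambda w, t: evaluate(w, t, c, gamma))
    return {"window": window, "cutoff": cutoff, "c": c, "gamma": gamma}


# Toy objective with a known optimum, standing in for cross-validated MCC.
def toy_evaluate(window, cutoff, c, gamma):
    return -((window - 10) ** 2 + (cutoff - 0.5) ** 2
             + (c - 1) ** 2 + (gamma - 0.1) ** 2)


print(three_stage_search(toy_evaluate,
                         windows=[5, 10, 15], cutoffs=[0.3, 0.5, 0.7],
                         cs=[0.03125, 1, 32], gammas=[0.001, 0.1, 0.5]))
```

Alternating between the two pairs of variables keeps each search two-dimensional, at the cost of not guaranteeing the global optimum of the full four-dimensional grid.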
Weighting and normalization
The individual SVM prediction results were weighted based on their MCC:
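$$wScore_i = fScore_i \times \frac{MCC_i}{\sum_{j=1}^{12} MCC_j} \qquad \text{(Eqn 2)}$$

$$cScore = \sum_{i=1}^{12} wScore_i \qquad \text{(Eqn 3)}$$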
The score of each SVM (fScore) was weighted and normalized to the percentage it contributed to the total sum of all MCCs (wScore); the sum of all wScores is the resulting score (cScore), which is therefore (Eqn 3) normalized to [0.0–1.0] (Fig. S1).
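The weighting scheme described in this paragraph can be written down in a few lines. This is a sketch assuming that each individual SVM score (fScore) already lies in [0.0, 1.0]; the feature names and numbers in the example are made up for illustration and are not the values from Table 1.

```python
def combined_score(f_scores, mccs):
    """Weight each SVM score by its share of the summed MCCs and add them up.

    f_scores: per-feature SVM scores (fScore), assumed to lie in [0.0, 1.0].
    mccs:     per-feature MCC values from cross-validation.
    Returns (cScore, dict of per-feature wScores).
    """
    total_mcc = sum(mccs.values())
    w_scores = {name: f_scores[name] * mccs[name] / total_mcc
                for name in mccs}
    return sum(w_scores.values()), w_scores


# Hypothetical example with two features only.
c_score, w_scores = combined_score(
    f_scores={"random coil": 1.0, "alanine": 0.0},
    mccs={"random coil": 0.6, "alanine": 0.2},
)
print(c_score, w_scores)
```

Because the weights sum to one, the combined score stays within [0.0, 1.0] whenever every individual score does, and classifiers with a higher MCC move it more.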
Twin targeting analysis using existing tools
Eight examples of proteins previously described to be dually targeted by the twin mechanism (Table S2) were analyzed. The protein sequences were modified: the altered sequences simulated a second, shorter isoform, which might be generated by alternative transcription or translation initiation. The original sequence and the modified sequence were both tested using existing targeting prediction tools (as described later in this paragraph). In the best case both sequences should yield high values for different compartments other than cytoplasm (the latter would hint at the protein not being subject to dual targeting mechanisms). The original sequence was truncated at the N-terminal end just before the second methionine unless this methionine was within the first 25 amino acids. In that case, the first methionine beyond amino acid 25 was used. An internal ribosome entry site (IRES) motif search was also carried out, but no potential IRES motifs were found in the training data set. As prediction tools for twin targeting, MultiLoc/TargetLoc (Höglund et al., 2006), WoLF PSORT (Nakai & Horton, 1999) and TargetP (Emanuelsson et al., 2000) were used. The results were normalized to the respective highest possible value. To increase the informative value and to ensure that distinct results were favoured, the value of the second-best hit was subtracted from the value of the best hit. This quality measure was compared for the original sequence and the modified sequences.
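The truncation rule and the quality measure described above can be sketched as follows. The function names, the treatment of position 25 as a 0-based index boundary, and the example scores are assumptions made for illustration; the prediction tools themselves are not called here.

```python
def truncate_at_secondary_met(seq, min_pos=25):
    """Simulate a shorter isoform starting at a secondary methionine.

    The sequence is cut just before the second methionine, unless that
    methionine lies within the first `min_pos` residues; in that case the
    first methionine beyond residue `min_pos` is used instead.
    Returns None if no suitable methionine exists.
    """
    second = seq.find("M", 1)          # skip the initiator Met at index 0
    if second == -1:
        return None
    if second < min_pos:
        second = seq.find("M", min_pos)
        if second == -1:
            return None
    return seq[second:]


def quality(scores, max_value=1.0):
    """Best minus second-best score after scaling to the tool's maximum."""
    ranked = sorted((v / max_value for v in scores.values()), reverse=True)
    return ranked[0] - ranked[1]


# Hypothetical sequence: Met at positions 0, 10 and 30 (0-based).
seq = "M" + "A" * 9 + "M" + "A" * 19 + "M" + "AAA"
print(truncate_at_secondary_met(seq))
# Hypothetical per-compartment scores from one prediction tool.
print(quality({"plastid": 0.8, "mitochondrion": 0.3, "other": 0.1}))
```

A large difference between best and second-best hit indicates a distinct prediction; comparing this value for the full-length and the truncated sequence shows whether the two isoforms are assigned to different compartments with similar confidence.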
Results and Discussion
Ambiguous targeting predictor architecture
The ambiguous targeting predictor uses SVMs (Vapnik, 1998) for the prediction of ambiguous targeting. Support vector machines have already been successfully used in several localization prediction tools (Park & Kanehisa, 2003; Höglund et al., 2006; Shatkay et al., 2007) and have shown very good performance. Typical chloroplast targeting signals are 30–80 amino acids long (average 58 amino acids) and typical mitochondrial signals are 20–60 amino acids long (average 42 amino acids) (Zhang & Glaser, 2002). The input features of the SVM-based prediction engine ambiguous targeting predictor are therefore constructed from the 70 N-terminal amino acids using a sliding window approach (Fig. S1). A total of 12 different amino acid properties were used, which were selected based on previous results (Peeters & Small, 2001) and textbook knowledge (Lodish et al., 2007). Certain features of ambiguous targeting signals have recently been analyzed in a mutational approach, revealing the importance of arginine residues and of the second N-terminal amino acid, often an alanine (Pujol et al., 2007). Ambiguous targeting predictor therefore includes the presence of arginine and alanine as amino acid feature vectors. For each of the 12 feature vectors, a distinct support vector classifier was trained and these classifiers were combined into a joint prediction using a simple weighted voting scheme. The individual SVM prediction results (one for each amino acid feature) are weighted based on their MCC value on the training set (Table 1). This weighted score is then normalized to yield a score between 0.0 and 1.0, the latter being the best achievable score (Fig. S1). Support vector machine classifiers with a better prediction performance (a high MCC) will thus contribute more to the final result than less reliable classifiers (with a low MCC). The combination of 12 independent classifiers yielded superior results compared with the standard approach (i.e. 
a combined feature vector for all 12 feature sets). The ambiguous targeting predictor web tool is available online at http://www.cosmoss.org/bm/ATP.
Importance of individual amino acid features
The influence of each amino acid feature on the ambiguous targeting predictor score can be derived from its MCC. The three top scoring features are random coil, alpha helix and negative residues, closely followed by hydrophobicity and beta turn (Table 1). Arginine and beta sheet also contribute well, while the other five features are of lesser importance. While for some of the features a qualitative difference exists for the full 70 amino acids (e.g. alpha helix, random coil, negative residues, arginine; Fig. S2), others exhibit regional differences (e.g. hydrophobicity, beta turn, beta sheet). Some of the more prominent differences that can be seen in the distribution plots (Fig. S2) are the lack of a hydrophobic stretch in the first 20 amino acids, a lower abundance of negatively charged amino acids and a higher abundance of arginine (Pujol et al., 2007) in the ambiguous targeting signals. The results from the feature optimization can be used to inform mutational research that aims to clarify the mechanism of ambiguous targeting.
Evaluation of the ambiguous targeting predictor prediction accuracy
For testing, the initial 43 positive examples (Table S1) were used and the negative examples were increased from 44 to 71 proteins by including 27 single localization proteins that were not part of the training data set (see the Materials and Methods for details on the training and test data sets). Different score cut-offs were evaluated based on their specificity (true negative rate) and sensitivity (true positive rate)¹. At a threshold of 0.7, which is the best performing cut-off in the ROC plot (Fig. 1a), all single targeted proteins were detected as true negatives, while 98% of the ambiguous signal sequences were detected as true positives. The ambiguous targeting predictor score is clearly negatively correlated with sensitivity (correlation coefficient −0.84). Based on seven additional positive examples from several species that were not part of the training data set (Table S2), the accuracy of the method was further evaluated; at a score cut-off of 0.7, the method showed a moderate sensitivity (43%). At a score cut-off of 0.6, the sensitivity was 57%; the lowest score achieved among the seven true positives was 0.39. Therefore, scores of 0.8 and higher are expected to yield a very low rate of false positives while missing some of the true positives. Scores below 0.8 recover an increasing number of the true positives with a rising rate of false positives. A score cut-off of 0.7 seems to represent a good trade-off for practical application (Fig. 1). As an additional negative control, scores for the Saccharomyces cerevisiae proteome (all proteins considered negatives) were predicted. This approach led to 49 out of 5784 proteins (0.85%, equaling 99% specificity) being predicted as false positives using a score cut-off of 0.7. Because some S. cerevisiae proteins might contain functional dual targeting signals (Huang et al., 1990), the actual specificity might even be slightly higher. The score distribution for the A. thaliana proteome (Fig. 1b) is spread around a score of 0.4 (i.e. the majority of (single targeted) proteins achieves ambiguous targeting predictor scores of c. 0.4). Comparison of these score values with those for the training and test data sets (Fig. 1b) demonstrates that scores of < 0.4 yield a high number of false positives, that the score range 0.4–0.7 should be taken into consideration with caution and that scores of > 0.7 usually represent true positives.
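The cut-off evaluation described above amounts to computing sensitivity and specificity for a list of scored positives and negatives at each candidate threshold. The following is a minimal sketch; the score lists are invented for illustration, and counting a score equal to the cut-off as a positive prediction is an assumption.

```python
def sens_spec(pos_scores, neg_scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(s >= cutoff for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= cutoff for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(pos_scores, neg_scores, cutoffs):
    """One (cutoff, sensitivity, specificity) triple per candidate cut-off."""
    return [(c, *sens_spec(pos_scores, neg_scores, c)) for c in cutoffs]


# Invented scores: four known dual-targeted and four single-targeted proteins.
positives = [0.9, 0.8, 0.65, 0.39]
negatives = [0.1, 0.4, 0.72, 0.3]
for cutoff, sens, spec in roc_points(positives, negatives, [0.6, 0.7, 0.8]):
    print(cutoff, sens, spec)

# Specificity of the yeast negative control reported in the text.
print(1 - 49 / 5784)
```

Raising the cut-off trades sensitivity for specificity; plotting sensitivity against 1 − specificity over all cut-offs gives the ROC curve.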
Comparison with other in silico approaches and databases
Recently, a combined approach using existing tools revealed a multitude of A. thaliana and rice transcription factors that are predicted to be targeted to either plastids or mitochondria in addition to being present in the nucleo-cytoplasm (Schwacke et al., 2007). Several of these proteins would be predicted to be dually targeted to both organelles by ambiguous targeting predictor, namely six out of 78 A. thaliana transcription factors predicted for plastid targeting (AT5G52020, AT2G22200, AT1G77640, AT2G44940, AT5G29000 and AT1G14410) and one out of 12 predicted for mitochondrial targeting (AT1G68180). Such proteins might thus exert nuclear transcriptional control in both semi-autonomous organelles.
The 39 A. thaliana proteins present in the training and test data set were compared with the A. thaliana subcellular database, SUBA v2.2 (Heazlewood et al., 2007). A total of 32 proteins (82%) are present among those 189 entries in the database for which dual mitochondrial and plastidal localization has been inferred by fluorescent protein fusion (16 proteins), mass spectrometry (three proteins), annotation based on The Arabidopsis Information Resource, TAIR (two proteins), AmiGO (19 proteins), Swissprot (two proteins) or a combination thereof. When applying ambiguous targeting predictor (with a score cut-off of 0.7) to the A. thaliana proteome, 523 proteins were predicted to be ambiguously targeted (Fig. 2). Of those, 37 overlapped with the 189 aforementioned SUBA entries (average ambiguous targeting predictor score for the entries: 0.5). A total of 35 proteins (90%) were present among those SUBA database entries predicted by computational tools. The individual tools predicted the proteins to be present either in plastids or mitochondria to a very different extent (TargetP 71/27%, Mitoprot2 0/96%, Subloc 0/24%, Ipsort 38/51%, Predotar 47/40%, Mitopred 0/44%, Wolf PSort 89/4%, Multi-Loc 69/31%, Loctree 49/29%, respectively). While those 8741 database entries that were predicted to be present in mitochondria and plastids based on a combination of computational tools contained 90% of the true positives checked, the number of false positives generated using this method is probably vast, given > 8700 entries compared with 523 predicted by ambiguous targeting predictor (Fig. 2).
We also compared the ambiguous targeting predictor prediction with proteomics data. For this purpose, all 690 nuclear-encoded A. thaliana plastid proteins present in the plastid protein (plprot) database (Kleffmann et al., 2006) were retrieved. In addition, all 457 A. thaliana mitochondrial proteins from the Arabidopsis mitochondrial protein database (AMPDB) (Heazlewood & Millar, 2005) that were determined by both gel-based and gel-free procedures, were selected. The intersection of both data sets, 66 proteins, was subjected to ambiguous targeting predictor. The average score was 0.43, which is significantly higher than the score of 0.34 (P = 6.09E-07, one-tailed t-test) achieved on all A. thaliana proteins. Still, this average score is probably negatively biased because of the fact that the databases are not manually curated and therefore might contain a certain number of false positives. A comparison with the manually curated plant proteome database (PPDB) (Sun et al., 2008) revealed an average ambiguous targeting predictor score for A. thaliana plastid proteins of 0.51, clearly demonstrating again that an intermediate ambiguous targeting predictor score is no clear indication of dual targeting. Yet, the average ambiguous targeting predictor score for those 53 manually curated A. thaliana TAIR7 PPDB proteins that are annotated as present in both plastids and mitochondria, was found to be 0.73 (i.e. above the suggested confidence cut-off of 0.7).
The recently published ‘Database of proteins with multiple subcellular localizations’ (DBMLoc) (Zhang et al., 2008) contains a total of 29 proteins for which both mitochondria and plastids are listed as subcellular compartments. Of those, only three were present in the ambiguous targeting predictor training data set. A close inspection of the remaining 26 proteins revealed that the majority (20 cytochrome c6, two apocytochrome f and two voltage-dependent anion-selective channels) were selected as a result of a combination of experimental data and homology evidence, probably leading to a false-positive dual-targeting prediction. The remaining two proteins, Spinacia oleracea protoporphyrinogen oxidase and N. tabacum DNA-directed RNA polymerase 2, are proteins dually targeted by the twin mechanism.
Twin targeting prediction
Some of the proteins selected for experimental validation (Table 2) were considered as candidates for targeting using a twin mechanism, based on the presence of a secondary methionine within the putative targeting signal. In order to be able to analyze putative twin targeting of these proteins in greater detail, existing tools for the prediction of subcellular localization were applied to eight examples of proteins previously described in the literature to be dually targeted using the twin mechanism (Table S2). Subsequently, the prediction method described in the remainder of this section was applied to the P. patens proteins selected for experimental validation (Table 2).
| Protein | Gene model | ATP score | Localization | Twin prediction |
By predicting the localization for the full-length protein as well as for a truncated form (starting at the putative secondary methionine), in conjunction with score normalization, the prediction of dual targeting based on tandemly arrayed signal sequences is possible. The normalized score cut-offs yielding the best combination of specificity and sensitivity were 0.3 (WoLF PSORT), 0.4 (TargetP; Emanuelsson et al., 2000), 0.8 (TargetLoc; Höglund et al., 2006) and 0.5 (MultiLoc; Höglund et al., 2006). None of the tools clearly outperformed any of the other tools. As it turned out, the difference between the normalized scores for the best localization and the second best localization predicted for a given protein isoform can be used as a quality measure to assess the probability of the prediction result. However, not all tools will always yield the correct result, suggesting that several tools should be used and a consensus approach applied.
Validation of the prediction results using fluorescent protein fusion
To evaluate the in silico results in vivo, CFP fusion constructs of P. patens protein-coding genes were generated and their localization in transfected P. patens cells was checked by confocal laser scanning microscopy (CLSM). As candidates, we chose P. patens proteins with ambiguous targeting predictor scores between c. 0.5 and 0.7 from the whole-proteome prediction (3619 proteins compared with 296 proteins with scores ≥ 0.7; Fig. 2) because this range is critical concerning the true positive/negative rate, as mentioned earlier. Moreover, we chose three proteins below this range to investigate whether proteins with lower scores may also be dually targeted. The chosen candidates (Table 2) were PME (score 0.09), FLP (score 0.13), PLC (score 0.4), ERD4 (score 0.49), Hem2 (score 0.56) and FtsZ (score 0.73). The analyzed constructs contained the 5′ part of the coding sequence, encompassing the signal peptide, in-frame with the CFP. For PLC, PME, FLP, FtsZ and Hem2, it is possible that dual targeting is regulated via the twin mechanism (Table 2). Therefore, the 5′-UTR was included in those constructs in case it is important for regulating the twin mechanism (Christensen et al., 2005; Sunderland et al., 2006; Puyaubert et al., 2008).
The localization of most of the fusion proteins confirmed the expectations. The FtsZ protein is localized in both mitochondria and plastids, confirming the prediction result of ambiguous targeting predictor (score 0.73, Fig. 3e, Table 2). In the case of Hem2 (score 0.56), the annotated localization in the plastid was confirmed but no additional fluorescence in the mitochondria could be found (Fig. 3g), making it a probable false-positive result. This confirms that the chosen score cut-off of 0.7 is reasonable if one wants to exclude false positives. However, for some of the proteins with lower scores, dual targeting could also be demonstrated. An interesting case is the ERD4 homolog (score 0.49), which is localized in the mitochondria during the first days of protoplast regeneration, whereas after 10 d of regeneration localization switches to the chloroplast (Fig. 3b/c). Therefore, this protein seems to be another example of targeting being dependent on environmental/developmental conditions (Karniely & Pines, 2005; Petruccelli et al., 2006). The observed dual localization of PLC (Table 2) was surprising because the ambiguous targeting predictor score for this protein is rather low (0.4). However, the dual targeting to mitochondria and plastids in this case might also be a result of the twin mechanism. The results for FtsZ and PLC suggest that prediction using ambiguous targeting predictor might, in some cases, correlate with twin prediction if the protein is targeted to mitochondria and plastids (Table 2). A possible explanation is that ambiguous signals resemble both plastid and mitochondrial signals, so that an ambiguous signal resembles the tandem array of targeting signals found in twin proteins and vice versa.
The two chosen proteins at the lower end of the score range (FLP and PME with scores of 0.13 and 0.09, respectively) are clearly localized to only one compartment (Table 2), which suggests that at this low score the number of true positives is indeed probably very low. Taken together, three out of four proteins with an ambiguous targeting predictor score of between 0.4 and 0.73 could be shown to be dually targeted and thus are considered as true positive predictions (Table 2).
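The tally in the preceding paragraph can be reproduced with a short Python sketch. The scores and outcomes are those stated in the text; the data structure and the function name are illustrative, and ERD4 is counted as dually targeted because the text treats its switch from mitochondria to chloroplasts as a true positive.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    atp_score: float
    dual_observed: bool  # dual targeting seen in the CFP fusion experiments


# Scores and outcomes as reported in the text (Table 2).
CANDIDATES = [
    Candidate("PME", 0.09, False),
    Candidate("FLP", 0.13, False),
    Candidate("PLC", 0.40, True),
    Candidate("ERD4", 0.49, True),   # mitochondria first, chloroplast after 10 d
    Candidate("Hem2", 0.56, False),  # plastid only, probable false positive
    Candidate("FtsZ", 0.73, True),
]


def tally(candidates: List[Candidate], low: float, high: float) -> Tuple[int, int]:
    """Return (dually targeted, total) for candidates with low <= score <= high."""
    in_range = [c for c in candidates if low <= c.atp_score <= high]
    return sum(c.dual_observed for c in in_range), len(in_range)


print(tally(CANDIDATES, 0.4, 0.73))  # (3, 4)
print(tally(CANDIDATES, 0.0, 0.39))  # (0, 2)
```

Six proteins are far too few to estimate a true-positive rate, so the counts only restate the experimental outcome.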
Comparison with experimental data
The protoporphyrinogen oxidase from A. tuberculatus, in which ambiguous targeting to plastids and mitochondria evolved by a codon deletion leading to a 30-amino acid extension of the N-terminus (Patzoldt et al., 2006), yields an ambiguous targeting predictor score of 0.57. The short form of the protein, from the herbicide-susceptible biotype, yields a distinctly lower score of 0.49. In vitro evidence suggests that the A. thaliana Whirly 2 protein might be dually targeted to mitochondria and chloroplasts; the ambiguous targeting predictor score for this protein is 0.61. Recently, dual targeting of the Z. mays seryl-tRNA synthetase has been demonstrated (Rokov-Plavec et al., 2008); this protein achieves an ambiguous targeting predictor score of 0.51.
The A. thaliana holocarboxylase synthetase 1 (HCS1) gene is essential for biotin metabolism. Alternative splicing of the 5′-UTR has recently been shown to remove a small upstream open-reading frame (ORF), which represents a switch for the selection of the translation initiation site among two in-frame AUG codons (Puyaubert et al., 2008). The resulting proteins have been shown to be localized in the cytoplasm or chloroplasts, respectively. However, enzymatic activity of the protein in mitochondria has been shown and thus suggests ambiguous targeting, which might be obscured by a more efficient transport to chloroplasts (Puyaubert et al., 2008). Targeting of HCS1 might be regulated in response to metabolic requirements, comparable to expression control by metabolite-binding riboswitches (Cheah et al., 2007). The ambiguous targeting predictor score for the A. thaliana HCS1 is 0.47, making an ambiguous targeting mechanism possible. The P. patens homolog, Phypa_143161 (http://www.cosmoss.org), even generates an ambiguous targeting predictor score of 0.62.
It has been demonstrated that multiple in-frame start codons alter the localization of A. thaliana tRNA nucleotidyltransferase by differing transcriptional initiation (von Braun et al., 2007). Fluorescent protein fusion experiments performed in Allium cepa and N. tabacum cells suggested that the targeting signal starting at the very first methionine (ambiguous targeting predictor score 0.65) leads to localization in both mitochondria and plastids, whereas the targeting peptide lacking the first five amino acids (ambiguous targeting predictor score 0.91) was targeted to plastids. The protein starting at methionine 69 remained in the cytosol. Therefore, the proposed ambiguous targeting is also suggested by ambiguous targeting predictor, although the protein lacking the first five amino acids generates an even higher score than the longest one. It should be noted, however, that the fusion constructs did not contain the 5′-UTR, which might influence the initiation site, and the localization experiments carried out in heterologous systems might be misleading. Interestingly, a probable involvement of the protein proper in shifting the protein localization to mitochondria was shown, implicating the involvement of cytosolic factors (von Braun et al., 2007). The gene model describing the P. patens homolog, Phypa_21288 (ambiguous targeting predictor score 0.37), is obviously truncated. Manual inspection using the http://www.cosmoss.org genome browser revealed a gene model with a longer and putatively complete N-terminus, all_Phypa_159494, yielding an ambiguous targeting predictor score of 0.46, which might be ambiguously targeted.
In a recent study, it could be shown that in Medicago truncatula and Populus alba, in which the rps16 gene has been lost from the plastid genome, the plastid gene was substituted with a nuclear-encoded rps16 gene of mitochondrial origin through the capability of the encoded protein to dually target both organelles (Ueda et al., 2008). Interestingly, dual targeting of RPS16 to mitochondria and chloroplasts seems to have evolved before the Liliopsida/eudicotyledon split. Moreover, RPS16 proteins of plants that still harbor the plastid copy of the gene (e.g. A. thaliana, Lycopersicon esculentum, O. sativa) also possess dual targeting ability. Those proteins from the latter organisms for which dual targeting to plastids and mitochondria could be shown by fluorescent protein fusion generate ambiguous targeting predictor scores of 0.46, 0.46, 0.49 and 0.51, respectively (Table S2). For the A. thaliana RPS16-1, which was found to be exclusively targeted to chloroplasts in the assay, the ambiguous targeting predictor score is 0.41.
It has been shown that differences in chloroplast targeting signals exist between O. sativa and A. thaliana (Kleffmann et al., 2007; Zybailov et al., 2008). Yet, dual targeting could be validated for several P. patens proteins predicted by ambiguous targeting predictor, and the ambiguous targeting predictor score generally correlates well with the examples from different plants, as discussed earlier. Therefore, either the amino acid properties for dual plastid/mitochondrial targeting sequences are conserved throughout land plants, or the ambiguous targeting predictor approach (taking 12 different amino acid features into account) enables the prediction across a diverse set of organisms.
By applying ambiguous targeting predictor to the A. thaliana proteome, 523 proteins were predicted to exhibit dual mitochondrion/plastid targeting. A comparison with other plant proteomes (Populus trichocarpa, O. sativa, Vitis vinifera, P. patens; Table S3) showed that, in general, c. 450 (1.27 ± 0.4%) of the proteins carry potential ambiguous targeting signals (Fig. 2). By contrast, the proteomes of several algae (Chlamydomonas reinhardtii, Ostreococcus tauri, Ostreococcus lucimarinus and Cyanidioschyzon merolae) encode significantly fewer (P = 0.0016, Fisher's exact test; on average c. 100) ambiguously targeted proteins. While this observation might be a result of different coding (and thus erroneous prediction) of targeting signals in the algae, it might also represent a correlation of dual targeting with increasing organismal complexity (especially if one considers the low number of predicted proteins in the highly reduced prasinophytes). Interestingly, of the A. thaliana proteins that exhibit putative ambiguous targeting, only 30% have their best BLAST hit among the lineages representing the ancestors of plastids and mitochondria (Cyanobacteria and alpha-Proteobacteria; cut-off 30% identity, 80 amino acids alignment length). Although phylogenetic analysis will be needed to reveal the details, this might suggest that a plethora of eukaryotic genes has evolved dual targeting capabilities during plant evolution. This would also suggest that neofunctionalization of the endosymbiotic organelles has taken place and that control by the nucleus (i.e. the host) is exerted using this mechanism.
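The best-hit filter behind the 30% figure can be illustrated with a small Python sketch. Only the cut-offs (30% identity, 80 amino acids alignment length) and the two lineages come from the text. The `Hit` record, the ranking of hits by bit score, the prior assignment of a lineage label to each hit and the use of all query proteins as the denominator are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    subject_lineage: str     # hypothetical: lineage label assigned beforehand
    percent_identity: float
    alignment_length: int    # in amino acids
    bit_score: float


ENDOSYMBIONT_LINEAGES = {"Cyanobacteria", "alpha-Proteobacteria"}


def best_hit(hits: List[Hit], min_identity: float = 30.0,
             min_length: int = 80) -> Optional[Hit]:
    """Best-scoring hit that passes both cut-offs, or None if none passes."""
    passing = [h for h in hits
               if h.percent_identity >= min_identity
               and h.alignment_length >= min_length]
    return max(passing, key=lambda h: h.bit_score, default=None)


def fraction_endosymbiont(hits_by_query: Dict[str, List[Hit]]) -> float:
    """Fraction of queries whose best passing hit lies in an ancestor lineage."""
    if not hits_by_query:
        return 0.0
    count = 0
    for hits in hits_by_query.values():
        top = best_hit(hits)
        if top is not None and top.subject_lineage in ENDOSYMBIONT_LINEAGES:
            count += 1
    return count / len(hits_by_query)
```

Queries without any hit passing the cut-offs count towards the denominator here; excluding them would raise the fraction, so the choice should be stated whenever such a number is reported.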
Functional genome annotation requires accurate prediction of protein localization. It is therefore necessary to expand our knowledge further regarding dual targeting and to develop tools that enable prediction of dual targeting. In this study, we demonstrated that dual protein targeting can accurately be predicted by applying machine learning. We implemented a tool, ambiguous targeting predictor, for the prediction of ambiguous targeting signals. Our results demonstrated that land plant genomes encode, in general, > 400 proteins that are putatively targeted to mitochondria and plastids based on ambiguous N-terminal presequences. Evaluation of the prediction results using protoplast transfection demonstrates that proteins with ambiguous targeting predictor scores of > 0.3 might be ambiguously targeted to mitochondria and chloroplasts, while ambiguous targeting predictor scores of > 0.7 indicate high specificity. In terms of amino acid features that significantly contribute to the targeting predictability of the ambiguous targeting predictor, alpha helix, random coil, negative residues and arginine are important over the whole length of the N-terminal 70 characters, while hydrophobicity, beta sheet and beta turn exhibit regional bias. Ambiguous targeting predictor has been made available online via a web interface, allowing the user to check proteins of interest.
We are grateful to Simon Zimmer for assistance with implementation of the ambiguous targeting predictor (ATP) web tool and to Kirsten Krause for helpful comments on the manuscript. Funding by DFG (S.A.R. and R.R., grant Re 837/10-2) and BMBF (S.A.R. and R.R., grant 0313921, Freiburg Initiative in Systems Biology) is gratefully acknowledged.
- 2000. Repeated, recent and diverse transfers of a mitochondrial gene to the nucleus in flowering plants. Nature 408: 354–357.
- 2003. Use of an inducible reporter gene system for the analysis of auxin distribution in the moss Physcomitrella patens. Plant Cell Reports 21: 1143–1152.
- 2004. Differential enzyme targeting as an evolutionary adaptation to herbivory in Carnivora. Molecular Biology and Evolution 21: 632–646.
- 2005. Prediction of subcellular localization using sequence-biased recurrent networks. Bioinformatics 21: 2279–2286.
- 2007. Dual targeting of the tRNA nucleotidyltransferase in plants: not just the signal. Journal of Experimental Botany 58: 4083–4093.
- 2007. Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 447: 497–500.
- 2005. Dual-domain, dual-targeting organellar protein presequences in Arabidopsis can use non-AUG start codons. Plant Cell 17: 2805–2816.
- 2005. Dual targeting is the rule for organellar aminoacyl-tRNA synthetases in Arabidopsis thaliana. Proceedings of the National Academy of Sciences, USA 102: 16484–16489.
- 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology 300: 1005–1016.
- 2005. Molecular tools to study Physcomitrella patens. Plant Biology (Stuttg) 7: 220–227.
- 2004. MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics 20: 1785–1794.
- 2005. AMPDB: the Arabidopsis mitochondrial protein database. Nucleic Acids Research 33(Database issue): D605–610.
- 2007. SUBA: the Arabidopsis subcellular database. Nucleic Acids Research 35(Database issue): D213–218.
- 2006. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics 22: 1158–1165.
- 1990. A yeast mitochondrial leader peptide functions in vivo as a dual targeting signal for both chloroplasts and mitochondria. Plant Cell 2: 1249–1260.
- 2005. Unique translation initiation at the second AUG codon determines mitochondrial localization of the phage-type RNA polymerases in the moss Physcomitrella patens. Plant Physiology 138: 369–382.
- 2005. Single translation–dual destination: mechanisms of dual protein targeting in eukaryotes. EMBO Reports 6: 420–425.
- 2000. AAindex: amino acid index database. Nucleic Acids Research 28: 374.
- 2004. Dual targeting of plastid division protein FtsZ to chloroplasts and the cytoplasm. EMBO Reports 5: 889–894.
- 1999. Nuclear import of the parsley bZIP transcription factor CPRF2 is regulated by phytochrome photoreceptors. Journal of Cell Biology 144: 201–211.
- 2006. Plprot: a comprehensive proteome database for different plastid types. Plant & Cell Physiology 47: 432–436.
- 2007. Proteome dynamics during plastid differentiation in rice. Plant Physiology 143: 912–923.
- 2007. Molecular cell biology. Houndmills, UK: Palgrave Macmillan.
- 2005. Plant organellar protein targeting: a traffic plan still under construction. Trends in Cell Biology 15: 548–554.
- 1999. Phylogenetic transfer of organelle genes to the nucleus can lead to new mechanisms of protein integration into membranes. Plant Journal 17: 31–40.
- 2006. Recent surprises in protein targeting to mitochondria and plastids. Current Opinion in Plant Biology 9: 610–615.
- 1999. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends in Biochemical Sciences 24: 34–36.
- 2003. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19: 1656–1663.
- 2006. A codon deletion confers resistance to herbicides inhibiting protoporphyrinogen oxidase. Proceedings of the National Academy of Sciences, USA 103: 12329–12334.
- 2001. Dual targeting to mitochondria and chloroplasts. Biochimica et Biophysica Acta 1541: 54–63.
- 2006. A KDEL-tagged monoclonal antibody is efficiently retained in the endoplasmic reticulum in leaves, but is both partially secreted and sorted to protein storage vacuoles in seeds. Plant Biotechnology Journal 4: 511–527.
- 2007. How can organellar protein N-terminal sequences be dual targeting signals? In silico analysis and mutagenesis approach. Journal of Molecular Biology 369: 356–367.
- 2008. Dual targeting of Arabidopsis holocarboxylase synthetase1: a small upstream open reading frame regulates translation initiation and protein targeting. Plant Physiology 146: 478–491.
- 2007. Physcomitrella patens: mosses enter the genomic age. Current Opinion in Plant Biology 10: 182–189.
- 2008. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319: 64–69.
- 2002. Two RpoT genes of Physcomitrella patens encode phage-type RNA polymerases with dual targeting to mitochondria and plastids. Gene 290: 95–105.
- 2008. Dual targeting of organellar seryl-tRNA synthetase to maize mitochondria and chloroplasts. Plant Cell Reports 5: 5.
- 2007. Comparative survey of plastid and mitochondrial targeting properties of transcription factors in Arabidopsis and rice. Molecular Genetics and Genomics 13: 13.
- 2007. SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23: 1410–1417.
- 2003. One ticket for multiple destinations: dual targeting of proteins to distinct subcellular locations. Current Opinion in Plant Biology 6: 589–595.
- 2004. Predotar: a tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 4: 1581–1590.
- 2008. PPDB, the Plant Proteomics Database at Cornell. Nucleic Acids Research 2: 2.
- 2006. An evolutionarily conserved translation initiation mechanism regulates nuclear or mitochondrial targeting of DNA ligase 1 in Arabidopsis thaliana. Plant Journal 47: 356–367.
- 2008. Substitution of the gene for chloroplast rps16 was assisted by generation of a dual targeting signal. Molecular Biology and Evolution 2: 2.
- 1998. Statistical learning theory. Weinheim, Germany: Wiley-VCH.
- 2004. Plastid proteomics. Plant Physiology and Biochemistry 42: 963–977.
- 2008. DBMLoc: a database of proteins with multiple subcellular localizations. BMC Bioinformatics 9: 127.
- 2002. Interaction of plant mitochondrial and chloroplast signal peptides with the Hsp70 molecular chaperone. Trends in Plant Science 7: 14–21.
- 2008. Sorting signals, N-terminal modifications and abundance of the chloroplast proteome. PLoS ONE 3: e1994.
Fig. S1 A diagram (Figure_S1.ppt) explaining the ATP architecture, including the sliding window approach for amino acid feature extraction and the training procedure.
Fig. S2 A figure (Figure_S2.png) showing distribution plots of the amino acid features used by ATP along the first 70 amino acids of the positive (blue) and negative example proteins (green).
Table S1 An Excel spreadsheet (Table_S1.xls) describing the ATP training dataset (positive samples).
Table S2 An Excel spreadsheet (Table_S2.xls) describing the additional (independent) ATP positive examples used for testing, details for some of the proteins described in Results and Discussion and the twin targeting test data.
Table S3 An Excel spreadsheet (Table_S3.xls) containing the ATP prediction scores for the proteomes shown in Fig. 2.