Green Targeting Predictor and Ambiguous Targeting Predictor 2: the pitfalls of plant protein targeting prediction and of transient protein expression in heterologous systems



  • The challenges of plant protein targeting prediction are the existence of dual subcellular targets and the bias of experimentally confirmed data towards few and mostly nonplant model species.
  • To assess whether training with proteins from evolutionarily distant species has a negative impact on prediction accuracy, we developed the Green Targeting Predictor tool, which was trained with a species-specific data set for Physcomitrella patens. Its performance was compared with that of the same tool trained with a mixed data set. In addition, we updated the Ambiguous Targeting Predictor.
  • We found that predictions deviated from in vivo observations predominantly for proteins diverging within the green lineage, as well as for dual targeted proteins. To evaluate the usefulness of heterologous expression systems, selected proteins were subjected to localization studies in P. patens, Arabidopsis thaliana and Nicotiana tabacum. Four out of six proteins that show dual targeting in the original plant system were located only in a single compartment in one or both heterologous systems.
  • We conclude that targeting signals of divergent plant species exhibit differences, calling for custom in silico and in vivo approaches when aiming to unravel the actual distribution patterns of proteins within a plant cell.


Eukaryotic cells are characterized by compartmentalization of metabolic reactions allowing potentially incompatible processes to take place simultaneously within the same cell. This leads to distinct sets of proteins within different compartments. Each of these separate sets has evolved over time and acquired targeting signals that ensure their precise localization to the target compartment inside the cell (Balsera et al., 2009). Information on the subcellular localization of proteins is therefore crucial for assessing their actual function and has, in some cases, revoked previous expectations (Carrie et al., 2008; Millar et al., 2009). The rapid generation of sequence data from different organisms has stimulated the development of computational tools to analyze such data. New methods considering the training of predictors or algorithms that differentially weight the strengths and weaknesses of different predictions have been published (e.g. Schwacke et al., 2007; Briesemeister et al., 2010). Such approaches have improved the accuracy of predictions remarkably but have not yet solved some of the aggravating aspects. One of these aspects is the occurrence of ambiguous targeting signals that lead to dual targeting, that is, targeting to two or more compartments. These ambiguous targeting signals are a frequent source of misinterpretation by standard predictors (Silva-Filho, 2003; Mackenzie, 2005; Mitschke et al., 2009).

The evolutionary conservation of the organellar targeting apparatuses (Voos et al., 1999; Price et al., 2012) has prompted the use of heterologous expression systems to determine subcellular localization. Yet the degree of conservation of signal peptide recognition and the respective import machineries, and their coevolution with signal peptides in distantly related organisms, is largely unexplored. In a recent study, Carrie et al. (2010) showed that in the course of evolution, major changes occurred in the composition of the mitochondrial outer membrane transport systems, some even within the green lineage. Concurrent with those changes, the signal peptides necessary for mitochondrial localization probably evolved in different directions as well. Plastid targeting signals and import complexes, on the other hand, seem to diverge less (Rosenbaum Hofmann & Theg, 2005; Patron & Waller, 2007). One reason for this might be that the plastid compartment is evolutionarily younger than mitochondria. The plastid endosymbiosis, however, might have increased the need to distinguish between the ‘old’ and the ‘new’ organelles and to ensure that mitochondrial proteins were not mistargeted to the evolving chloroplast compartment. Interestingly, plastid import signals are recognized in vitro as mitochondrial by fungi (Hurt et al., 1986a,b; Brink et al., 1994), suggesting that the mitochondrial import apparatus of nonplant species remains more permissive, because of a lack of interorganellar competition. This lack of competition is also evident in the composition of the targeting signals (Staiger et al., 2009). Strikingly, mitochondrial and plastid targeting sequences within a given species seem to differ for some characteristics to a similar degree as targeting signals for the same compartment among more distantly related species (e.g. yeast, rice, and Arabidopsis; Huang et al., 2009).

The current growth in sequence information from a wide variety of organisms contrasts sharply with the paucity of confirmed protein localization data, which are biased towards a few model species, most of which are not plants. Studies addressing the validity of expression data gained from heterologous systems are scarce and the reliability of current predictors for phylogenetic groups not covered by the training data is uncertain. In the plant field, approaches for creating species-specific prediction tools were started with predictors specific for two model organisms, one dicotyledonous plant, Arabidopsis thaliana (AtSubP, Kaundal et al., 2010) and one monocotyledonous plant, Oryza sativa (RSLpred, Kaundal & Raghava, 2009). Both studies demonstrate that the species-specific tools perform better on proteins from the same species but worse on proteins from other eukaryotic species, including other plant species, than do generalist predictors. To estimate to what extent subcellular targeting predictions and localization studies are affected by phylogenetic distance within the plant kingdom, we conducted a comparative case study using two evolutionarily distant species of plants, namely the seed plant A. thaliana and the moss Physcomitrella patens. The divergence between the seed plant and moss lineages took place c. 500 million yr ago (Ma), representing the largest evolutionary distance available within land plants (Lang et al., 2010). Despite this distance, generalist predictors have been used for the prediction of P. patens protein localization (e.g. Polyakov et al., 2010). Furthermore, P. patens is being used for heterologous expression (Thevenin et al., 2012) and complementation studies with A. thaliana proteins (e.g. Mosquna et al., 2009). To tackle the question of how reliable data from heterologous systems are, we also included Nicotiana tabacum for localization studies, as its evolutionary distance to A. thaliana is lower, c. 125 Ma (Kumar & Hedges, 2011), and it is frequently used for heterologous expression experiments with A. thaliana proteins.

Here, we combined in silico and in vivo approaches to address the question as to how much the targeting specificity to chloroplasts and mitochondria, respectively, has been influenced by divergent evolution of plant lineages. We introduce a novel predictor (Green Targeting Predictor, GTP), comparing a species-specific training approach for proteins of P. patens with a generalist approach. To address the predictions of dually targeted proteins, an updated Ambiguous Targeting Predictor (ATP2) is presented.

Materials and Methods

Generation of the ATP2 data set

The new data set for ATP2 (Supporting Information, Table S1) is based on the data set used for ATP (Mitschke et al., 2009). To optimize the data set, two modifications were introduced. First, the positive data set was increased by the addition of newly published experimentally proven sequences (Pujol et al., 2007; Berglund et al., 2009; Carrie et al., 2009). Secondly, the negative data set was remodeled to remove sequences potentially biasing prediction towards false negatives. This was achieved by training a model consisting of all sequences, leaving out a single sequence (leave-one-out cross-validation approach). Each single sequence was predicted using the respective all-minus-one model. Sequences that reached a high score for dual targeting in this prediction were removed as potentially dual targeted sequences. To compensate for the numeric changes and to achieve a balanced size of the positive and negative data sets, additional experimentally validated single targeted sequences were added (see Table S1).
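The leave-one-out screening of the negative data set can be illustrated with a minimal sketch. It is not the ATP2 implementation: the classifier is replaced by a hypothetical nearest-centroid scorer on a single toy feature, and the names `train_centroids`, `dual_score`, `screen_negatives` and the cutoff of 0.5 are assumptions made for illustration only.

```python
from typing import List, Tuple

# (sequence id, toy one-dimensional feature, label: 1 = dual targeted, 0 = single targeted)
Example = Tuple[str, float, int]


def train_centroids(examples: List[Example]) -> Tuple[float, float]:
    """Hypothetical stand-in for model training: mean feature value per class."""
    pos = [x for _, x, label in examples if label == 1]
    neg = [x for _, x, label in examples if label == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def dual_score(model: Tuple[float, float], x: float) -> float:
    """Score in [0, 1]; values near the positive centroid score close to 1."""
    pos_c, neg_c = model
    d_pos, d_neg = abs(x - pos_c), abs(x - neg_c)
    total = d_pos + d_neg
    return 0.5 if total == 0 else d_neg / total


def screen_negatives(examples: List[Example], cutoff: float) -> List[str]:
    """Return ids of negatives that score high for dual targeting when held out."""
    flagged = []
    for i, (seq_id, x, label) in enumerate(examples):
        if label != 0:
            continue  # only the negative set is screened
        model = train_centroids(examples[:i] + examples[i + 1:])  # all-minus-one model
        if dual_score(model, x) >= cutoff:
            flagged.append(seq_id)
    return flagged


DATA = [
    ("pos1", 0.90, 1), ("pos2", 0.80, 1), ("pos3", 0.85, 1),
    ("neg1", 0.10, 0), ("neg2", 0.20, 0), ("neg3", 0.15, 0),
    ("neg4", 0.82, 0),  # labeled single targeted, but resembles the positives
]
print(screen_negatives(DATA, cutoff=0.5))
```

In this toy example only `neg4` is flagged, mirroring the removal of potentially dual targeted sequences from the negative set.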

Generation of data sets specific for the moss P. patens

Separate data sets for four distinct locations (classes), comprising 64 nucleocytoplasmic proteins without N-terminal targeting signals (NUCCYT), 45 chloroplast proteins (CHL), 30 mitochondrial proteins (MIT), and 34 proteins of the secretory pathway (SEC), were compiled (Table S2). Confirmed experimental localization data were our foremost criteria for selection of the data set. As this yielded only seven mitochondrial P. patens proteins, we focused on two alternative ways to compile a larger set of mitochondrial proteins from P. patens. These searches yielded an additional 23 proteins with a high confidence for mitochondrial localization. The first approach consisted of a manual search for plant proteins that are part of a conserved mitochondrial pathway in the KEGG database (Ogata et al., 1999). The mitochondrial localization of each of these proteins was reconfirmed using the BRENDA database (Scheer et al., 2011). This set was then BLASTed against the P. patens V1.6 proteome to identify the corresponding mitochondrial proteins in this moss. For the second approach, we searched the P. patens proteome for proteins returning the highest BLAST hits with alpha-proteobacterial species. For this approach, 107 fully sequenced genomes from all kingdoms were used. The filtering parameters were an E-value cutoff of ≤ 1E–4, a minimum 30% alignment identity, and an alignment length of at least 80 amino acids. All new hits from both approaches were filtered for redundancy (similarity ≥ 94.5% or identity ≥ 93% in the N-terminal 70 amino acids) against each other and against the original seven mitochondrial proteins. The proteins that remained were then subjected to targeting predictions with YLocHighRes (Briesemeister et al., 2010), Sherloc2 (Briesemeister et al., 2009) and WoLF PSORT (Horton et al., 2007). Twenty-three sequences exhibited high mitochondrial scores within all three tools and were added to our mitochondrial data set. 
Finally, each of the four data sets was randomly divided into two groups. Eighty per cent of the sequences were used for training while the rest of each class was included in a test data set for validation of the prediction results (Table S2).
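As a rough sketch of two steps described above, the following shows the BLAST hit filter (E-value ≤ 1E–4, identity ≥ 30%, alignment length ≥ 80 amino acids) and an 80 : 20 split of one class. The `Hit` record, the function names, the fixed random seed and the rounding of the training set size are assumptions for illustration; the original pipeline is not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST hit of a P. patens protein."""
    query: str
    evalue: float
    identity_pct: float
    aln_length: int


def passes_filter(hit: Hit, max_evalue: float = 1e-4,
                  min_identity: float = 30.0, min_length: int = 80) -> bool:
    """Apply the three filtering parameters stated in the text."""
    return (hit.evalue <= max_evalue
            and hit.identity_pct >= min_identity
            and hit.aln_length >= min_length)


def split_class(ids: Sequence[str], train_fraction: float = 0.8,
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly divide one class into training and test sequences."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


hits = [
    Hit("p1", 1e-10, 45.0, 120),  # passes all three criteria
    Hit("p2", 1e-3, 60.0, 200),   # E-value too high
    Hit("p3", 1e-20, 25.0, 300),  # identity too low
    Hit("p4", 1e-6, 35.0, 60),    # alignment too short
]
kept = [h.query for h in hits if passes_filter(h)]

mit_ids = [f"mit_{i}" for i in range(30)]  # 30 mitochondrial proteins, as in the text
train_ids, test_ids = split_class(mit_ids)
print(kept, len(train_ids), len(test_ids))
```

With 30 mitochondrial proteins this yields 24 training and 6 test sequences.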

Other data sets

The data set used for the training of the TargetP plant (Emanuelsson et al., 2000) was utilized as a reference. This data set contains mitochondrial sequences from animals, yeast, and seed plants, but none from bryophytes or algae. The division into test and training data sets was done as described earlier; however, as a mitochondrial test data set, an extra ‘plant only’ data set (mitoplant) was created out of the original mixed-test data set and comprised 37 sequences from eudicotyledons and Liliopsida.

Development of ATP2 and GTP

The Ambiguous Targeting Predictor (ATP) prediction tool (Mitschke et al., 2009) was modified by rewriting the codebase. Also, the enlarged data set described earlier was used with the goal of increasing prediction sensitivity (see Methods S1 for details).

As the validation of ATP2 (see the 'Results' section, Fig. 1) showed that it has a comparable accuracy to ATP, but represents a more sensitive and faster approach, its codebase was used to develop the new species-specific prediction tool Green Targeting Predictor (GTP; see Methods S1 for details). The prediction process is comparable to ATP2, with a total of four prediction steps in which one class is tested against the other three classes. The class with the highest score value reflects the actual prediction result. The features used to distinguish between the different classes are described in Table S3. As an estimate of the reliability of the prediction, a confidence value was calculated from the difference between the best and second best scores (see Methods S1 and Notes S1). A review of confidence values on the test data showed that single targeted proteins and correctly predicted proteins typically have a higher confidence value, suggesting its usefulness for suppressing false-positive predictions (see the 'Results' section).
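The final prediction step and the confidence value can be sketched as follows. The four one-against-rest scores are taken as given; the function name `predict_with_confidence` is hypothetical, and the example scores for SEC and NUCCYT are invented for illustration (only the top score of 0.58 and the confidence of 0.07 correspond to values reported for Pp_PPR with GTP_Pp).

```python
from typing import Dict, Tuple


def predict_with_confidence(scores: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (predicted class, top score, confidence).

    The confidence is the difference between the best and the second best score.
    """
    if len(scores) < 2:
        raise ValueError("at least two class scores are required")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_class, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_class, best_score, best_score - second_score


# One score per class, each from a one-against-the-other-three prediction step.
example_scores = {"MIT": 0.58, "CHL": 0.51, "SEC": 0.05, "NUCCYT": 0.10}
predicted, score, confidence = predict_with_confidence(example_scores)
print(predicted, score, round(confidence, 2))
```

A small confidence value, as here, means that the runner-up class scored almost as high as the predicted one.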

Figure 1.

Comparison of ATP and ATP2. The x-axes depict the score range of the prediction on the ATP2 positive test data set. The y-axis depicts the sensitivity (a), the specificity (b) and the combined sensitivity and specificity (c). ATP2 is shown in blue, the original ATP in red. The peaks labeled with stars in (c) represent the best possible compromise between specificity and sensitivity and therefore the suggested score cutoffs (ATP 0.4 and ATP2 0.5).

Generation of fluorescent protein-tagged fusion proteins

RNA was isolated from A. thaliana or P. patens and transcribed into cDNA using commercially available enzymes (M-MuLV Reverse Transcriptase, Thermo Fisher, Oslo, Norway) according to the manufacturer's protocol. cDNA sequences for eight A. thaliana proteins and three P. patens proteins (Table 1) were amplified using specific primers (Table S4). Forward primers contained a BamHI or SalI restriction site while reverse primers featured either an HpaI, EcoRV or StuI restriction site. PCR products were subcloned into the TOPO TA vector (Invitrogen). The gene-specific fragments (encoding either the full-length protein or the targeting signal with its 5′UTR) were then excised using the introduced restriction sites and ligated in frame into the mAV4-vector (Kircher et al., 1999) to obtain C-terminal fusion constructs with a green fluorescent protein (GFP).

Table 1. In vivo test set
| Gene model | Short name | Description | Species | Reference |
|---|---|---|---|---|
| AT4G17300.1 | At_ASN-tRNA | Asparagine-tRNA ligase | Arabidopsis thaliana | Peeters et al. (2000) |
| AT1G24040.1 | At_GNAT | GCN5-related N-acetyltransferase family protein | A. thaliana | |
| AT3G48250.1 | At_PPR | Pentatricopeptide repeat containing protein | A. thaliana | |
| AT5G50250.1 | At_CP31B | RNA-binding protein CP31B | A. thaliana | Tillich et al. (2010) |
| AT1G14410.1 | At_WHY1 | Whirly transcription factor family protein 1 | A. thaliana | Krause et al. (2005) |
| AT1G71260.1 | At_WHY2 | Whirly transcription factor family protein 2 | A. thaliana | Krause et al. (2005) |
| AT2G02740.1 | At_WHY3 | Whirly transcription factor family protein 3 | A. thaliana | Krause et al. (2005) |
| AT2G03050.1 | At_mTERF | Mitochondrial termination factor like protein | A. thaliana | |
| Pp1s446_7V6.1 | Pp_PPR | Pentatricopeptide repeat containing protein | Physcomitrella patens | |
| Pp1s13_9V6.2 | Pp_PLC | Phosphatidylinositol-specific phospholipase C | P. patens | Mitschke et al. (2009) |
| Pp1s219_94V2.1 | Pp_Hem2 | δ-Aminolevulinic acid dehydratase 2 | P. patens | Mitschke et al. (2009) |

Localization of fusion proteins by transient transformation of protoplasts

A suspension culture of A. thaliana mesophyll cells was grown in MS medium (Murashige & Skoog, 1962) supplemented with vitamin B5, 1-naphthaleneacetic acid (NAA, 0.5 mg l−1), kinetin (0.1 mg l−1) and 87 mM sucrose under constant shaking (150 rpm) at 22°C in a light : dark regime of 16 : 8 h (130 μmol m−2 s−1). Polyethylene glycol-mediated protoplast transformation was performed as described previously (Schwacke et al., 2007). Protoplasts from N. tabacum were isolated from sterile-grown, 4- to 6-wk-old plants as described for potato protoplasts elsewhere (Krause et al., 2005). Isolation and transformation of P. patens protoplasts was also conducted as described before (Frank et al., 2005). MitoTracker Orange (Invitrogen) staining was conducted according to the manufacturer's protocol. Microscopy was done using a Zeiss 510 Meta confocal laser scanning microscope (excitation: GFP 488 nm, MitoTracker 554 nm; detection: GFP BP550-550, MitoTracker BP565-615, Chl ≥ 620 nm) or a Leica AOBS TCS SP5 (the same excitation and detection settings as for the Zeiss microscope).

Results


In silico validation of ATP2

We compared the performance of ATP with that of the modified predictor ATP2 using the optimized ATP2 test data set (see Notes S1 for details). The cutoff representing the optimal compromise between high specificity (= true negative rate; i.e. how polluted the predictions are by false-positive predictions) and high sensitivity (= true positive rate; a measure of how many sequences will end up as not being predicted) is higher for ATP2 (0.5) than for ATP (0.4; Fig. 1c). Therefore, the new implementation is an improvement over the older version and can be used for more accurate dual targeting predictions. The tool can be accessed at
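How a score cutoff of this kind can be derived is sketched below. The sketch assumes that the 'combined' measure in Fig. 1(c) is the sum of sensitivity and specificity, which the text does not state explicitly; the function names and the toy scores are hypothetical and are not the ATP2 test data.

```python
from typing import List, Sequence, Tuple


def sensitivity_specificity(pos_scores: Sequence[float],
                            neg_scores: Sequence[float],
                            cutoff: float) -> Tuple[float, float]:
    """Sensitivity (true positive rate) and specificity (true negative rate).

    A sequence counts as predicted dual targeted if its score is >= cutoff.
    """
    true_pos = sum(s >= cutoff for s in pos_scores)
    true_neg = sum(s < cutoff for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


def best_cutoff(pos_scores: Sequence[float], neg_scores: Sequence[float],
                cutoffs: Sequence[float]) -> float:
    """Cutoff maximizing sensitivity + specificity (assumed combination)."""
    return max(cutoffs,
               key=lambda c: sum(sensitivity_specificity(pos_scores, neg_scores, c)))


# Toy scores for dual targeted (positive) and single targeted (negative) sequences.
positives: List[float] = [0.90, 0.80, 0.70, 0.55, 0.52]
negatives: List[float] = [0.10, 0.20, 0.35, 0.45, 0.60]
candidate_cutoffs = [i / 10 for i in range(1, 10)]

chosen = best_cutoff(positives, negatives, candidate_cutoffs)
print(chosen, sensitivity_specificity(positives, negatives, chosen))
```

For these toy scores the sum peaks at a cutoff of 0.5, where all positives are recovered and four of five negatives are rejected.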

In silico validation of GTP

We developed a new tool for the prediction of single targeting to four plant cell compartments, the GTP. For the validation of this tool we performed a series of in silico tests for comparing its performance against other prediction tools that are summarized in Table 2 for ease of reference. In addition, we used in vivo localizations to confirm the in silico results. In order to determine to what degree predictions for different proteins are influenced by the composition of the training data set, two subversions were made that differ with respect to their training data sets: GTP_Pp was trained with the 173 proteins derived from P. patens, while GTP_Ref was trained with the TargetP data set that contains sequences from a variety of organisms, including nonplant sequences for mitochondria. Both predictors are publicly available at

Table 2. Overview of development and validation of GTP_Pp

In silico comparisons of GTP_Pp with
| Predictor | Data set tested | Results |
|---|---|---|
| 10 generalist predictors | Physcomitrella patens test data set | Table 3, Supporting Information Table S5 |
| GTP_Ref | In vivo test set (Table 1) | Fig. 2 |
| AtSubP | P. patens test data set + AtSubP data set II | Fig. 3 |
| ATP2 | In vivo test set (Table 1) | Fig. 4 |

In vivo validation of GTP_Pp predictions
| System | Data set tested | Reference |
|---|---|---|
| Homologous and heterologous | In vivo test set (Table 1) | Table 4, Fig. 5 |

The table summarizes the in silico and in vivo comparison and shows which combinations of predictors and data sets were used.

To compare the performance of GTP_Pp with other prediction tools, a P. patens test data set (Table S2) was subjected to predictions with this and 10 other prediction tools (Table 3). To facilitate the comparisons, precision (respectively average precision) was used as a measure, which sets the correctly predicted (true positive) sequences in relation to those that gain a high score but are incorrectly predicted (false positives; see Notes S1 for details). We conclude that the average prediction precision of GTP_Pp for P. patens proteins is similar or even superior to that of generalist predictors. Furthermore, the low standard deviation indicates that all compartments were predicted with approximately the same precision, allowing the use of the same cutoff value for reliable predictions for all classes (Tables 3, S5).
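The precision measure can be written out as a short sketch: for each class, correctly predicted sequences (true positives) are divided by all sequences predicted for that class (true plus false positives), and the class values are then averaged. The function name, the toy labels and the use of the sample standard deviation are assumptions for illustration; Notes S1 gives the definition actually used.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Dict, Sequence

CLASSES = ("CHL", "MIT", "SEC", "NUCCYT")


def per_class_precision(true_labels: Sequence[str],
                        predicted_labels: Sequence[str],
                        classes: Sequence[str] = CLASSES) -> Dict[str, float]:
    """Precision per class: TP / (TP + FP); 0.0 if the class was never predicted."""
    true_pos: Counter = Counter()
    predicted: Counter = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        predicted[guess] += 1
        if truth == guess:
            true_pos[guess] += 1
    return {c: (true_pos[c] / predicted[c] if predicted[c] else 0.0) for c in classes}


# Toy test set: known localization versus the top-scoring predicted class.
truth = ["CHL", "CHL", "MIT", "MIT", "SEC", "NUCCYT", "NUCCYT", "NUCCYT"]
guess = ["CHL", "MIT", "MIT", "MIT", "SEC", "NUCCYT", "NUCCYT", "CHL"]

precision = per_class_precision(truth, guess)
average = mean(precision.values())
spread = stdev(precision.values())  # sample SD; the choice of estimator is an assumption
print(precision, round(average, 2), round(spread, 2))
```

A low spread across the four classes corresponds to the situation described for GTP_Pp, in which one cutoff can serve all compartments.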

Table 3. Comparison of the performance of different tools on the Physcomitrella patens test data set
| Class | WoLF PSORT* (a) | YLoc* | SherLoc2* | TargetP | GTP_Pp | Protein Prowler | Predotar | Cello | MultiLoc2 | ESLPred2 | BaCelLo |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CHL | 0.89 | 0.67 | 0.44 | 0.89 | 0.89 | 0.56 | 0.56 | 0.33 | 0.44 | 1 | 0.78 |
| MIT | 1 | 1 | 1 | 0.83 | 0.67 | 1 | 1 | 0.67 | 0.83 | 0.67 | 0 |
| SEC | 0.57 | 0.86 | 0.71 | 0.43 | 0.71 | 0.57 | 0.43 | 1 | 0.71 | 0.14 | 0.29 |
| NUCCYT | 0.92 | 0.77 | 1 | 0.85 | 0.69 | 0.85 | 0.85 | 0.85 | 0.77 | 0.54 | 0.77 |

(a) References for prediction tools: WoLF PSORT (Horton et al., 2007); YLoc (Briesemeister et al., 2010); SherLoc2 (Briesemeister et al., 2009); TargetP (Emanuelsson et al., 2000); ProteinProwler (Boden & Hawkins, 2005); Predotar (Small et al., 2004); Cello (Yu et al., 2006); MultiLoc2 (Blum et al., 2009); ESLPred2 (Garg & Raghava, 2008); BaCelLo (Pierleoni et al., 2006).

Prediction performances based on P. patens sequences of the test set (cf. Materials and Methods/Supporting Information Notes S1). All values represent precision and are on a scale from 0 (no protein predicted) to 1 (100% proteins correctly predicted). The precision for each class (CHL, plastids; MIT, mitochondria; SEC, secretory pathway; NUCCYT, no N-terminal targeting signal) as well as the average precision (Avg) over all classes are given. The results are ranked according to the highest average precision (left to right). The SD is shown to illustrate the respective variance. For predictors with selection between high and low resolution, only the more sensitive high-resolution version is listed. The highest precisions for each class are marked in bold, while the second best precisions are underlined. The tools marked with an asterisk were used for creating the mitochondrial data set (see the 'Materials and Methods' section and Table S5).

Next, we tested the amino acid sequences of our in vivo test set (Table 1) with both GTP versions (Fig. 2). In general, GTP_Pp lacks sensitivity (Fig. 2a) but scores higher on the specificity than GTP_Ref (Fig. 2b). Based on this analysis, we recommend a GTP_Pp cutoff of ≥ 0.5 that combines a medium specificity of 50–70% and a medium to high sensitivity of 50–100% for predicting unknown protein localizations (Fig. 2c). In the case of GTP_Ref, the suggested cutoff of ≥ 0.7 (Fig. 2c) results in a lower sensitivity at a higher specificity than is the case with GTP_Pp (Fig. 2a–c). Differences of performance on the P. patens, the A. thaliana or the combined protein set are relatively minor (Fig. 2). It should be noted, however, that because of the small size of the in vivo data set (three and eight proteins, respectively) these results have to be interpreted cautiously.

Figure 2.

Comparison of GTP_Pp and GTP_Ref. The x-axes depict the score range of the prediction on the in vivo test data set. The y-axis depicts the sensitivity (a), the specificity (b) and the combined sensitivity and specificity (c). GTP_Pp, blue; GTP_Ref, red. The different shades of blue and red correspond to the different components of the in vivo test data set. The brightest shade corresponds to the whole data set, the darkest shade to the Physcomitrella patens proteins only and the intermediate color to the Arabidopsis thaliana proteins only. Note that, for GTP_Ref, no specificity for the P. patens proteins could be assigned as none of the proteins was predicted incorrectly. The peaks labeled with stars in (c) are the best possible compromises between specificity and sensitivity and therefore represent the suggested score cutoffs (GTP_Pp 0.5 and GTP_Ref 0.7).

To compare the performance of GTP_Pp with another species-specific prediction tool, we used the independent test set II from AtSubP (Kaundal et al., 2010) and tested GTP_Pp on it, while we reciprocally challenged AtSubP with our P. patens-specific test set. Both tools performed significantly better on their respective data set than on the phylogenetically distant one (Fig. 3). GTP_Ref, which was also tested with both data sets, performed with a lower precision than the ‘correct’ species-specific tools, but better than or comparable to the ‘wrong’ tool (Fig. 3).

Figure 3.

Comparison of two species-specific predictors, GTP_Pp and AtSubP. The average precision of the predictions of two species-specific prediction tools (GTP_Pp, blue; AtSubP, green) is shown on two data sets, on the left on the Physcomitrella patens test data set and on the right on the Arabidopsis thaliana test set II published by Kaundal et al. (2010). For comparison, the average precision of GTP_Ref (red) on both test sets is shown.

GTP confidence filtering

The average prediction precision of GTP_Pp was further increased by including a confidence value filter. Top scores that differ significantly from the second best hit were found to be more likely to indicate the correct compartment (conveying a higher confidence) than if the two top scores differed by only a small margin (see 'Materials and Methods'). Thus, the confidence value was calculated as the score distance between the top result and the second best result. The more stringently the applied confidence value filter was chosen, the higher was the precision (see Notes S1 and Fig. S1). The improvements that were achieved by applying confidence filtering on the P. patens test set were impressive, especially for mitochondria (increase of precision by 33%) and plastids (increase of precision by 11%). The precision for those two classes already increased to 100% correctly predicted P. patens proteins with a confidence filter of ≥ 0.1. For the proteins without an N-terminal targeting signal, confidence values higher than 0.3 returned mostly correct predictions. When using GTP_Pp and confidence filtering on a TargetP-derived test set, a generally lower precision (except for the secretory pathway) was observed (see Notes S1, Fig. S1). When all nonplant sequences were removed from the test set (mitoplant), the performance was found to be intermediate between the complete TargetP-derived and the P. patens test set. We conclude that species-specific/narrower training sets in combination with a confidence filter are generally superior to a generalist approach.
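The effect of the confidence filter can be sketched as follows: predictions whose confidence value falls below a threshold are discarded, and precision is computed on the remainder. The function name and the toy predictions are hypothetical and do not reproduce the values in Fig. S1.

```python
from typing import List, Optional, Tuple

# (known localization, predicted localization, confidence value)
Prediction = Tuple[str, str, float]


def precision_after_filter(predictions: List[Prediction],
                           min_confidence: float) -> Tuple[Optional[float], int]:
    """Return (precision, number of retained predictions) for a confidence filter.

    Precision is None when no prediction passes the filter.
    """
    kept = [(truth, guess) for truth, guess, conf in predictions
            if conf >= min_confidence]
    if not kept:
        return None, 0
    correct = sum(truth == guess for truth, guess in kept)
    return correct / len(kept), len(kept)


toy_predictions: List[Prediction] = [
    ("MIT", "MIT", 0.35),
    ("MIT", "CHL", 0.04),     # wrong call with a narrow margin
    ("CHL", "CHL", 0.22),
    ("CHL", "CHL", 0.12),
    ("SEC", "NUCCYT", 0.08),  # wrong call with a narrow margin
    ("NUCCYT", "NUCCYT", 0.40),
]

for threshold in (0.0, 0.1, 0.3):
    print(threshold, precision_after_filter(toy_predictions, threshold))
```

The sketch also makes the trade-off visible: a stricter filter raises precision but leaves fewer proteins with a retained prediction.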

Dual targeting prediction with confidence filtering and ATP2

The confidence score not only filtered out incorrect single targeting predictions, but also seemed useful as an indication for dual targeting, as evidenced by the prediction results on the in vivo test set (Tables 4, S6). Of the five proteins targeted to mitochondria and plastids, none had a confidence value above 0.2, and most of them were even below 0.1 (Table S6). To test whether the confidence value can be used as a measure of potential dual targeting, we plotted the relations among dual targeting, GTP score, confidence value and ATP2 score (Fig. 4). When GTP and ATP2 scores are compared, it is evident that high ATP2 and high GTP scores are mutually exclusive; dual targeting was only observed with GTP scores below 0.7 (Fig. 4a). When the confidence value is compared with the ATP2 score, the majority of the proteins with high ATP2 scores (≥ 0.6) possess confidence values below 0.2 (Fig. 4b). Therefore, prediction with GTP can potentially exclude dual targeting for proteins with high confidence (> 0.2) and prediction scores (> 0.7). For prediction results with low confidence and score values, on the other hand, the combination of the GTP prediction with a subsequent ATP2 prediction can give a good hint at potential dual targeting for those proteins.
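One possible reading of this two-step procedure is sketched below as a simple decision rule. The thresholds for the GTP score (> 0.7) and the confidence value (> 0.2) come from the text, and the ATP2 cutoff of 0.5 is the one suggested in Fig. 1; the function name, the wording of the returned labels and the handling of a missing ATP2 score are assumptions.

```python
from typing import Optional


def triage_dual_targeting(gtp_score: float, confidence: float,
                          atp2_score: Optional[float] = None,
                          score_cutoff: float = 0.7,
                          confidence_cutoff: float = 0.2,
                          atp2_cutoff: float = 0.5) -> str:
    """Combine GTP score, confidence value and (optionally) ATP2 score."""
    if gtp_score > score_cutoff and confidence > confidence_cutoff:
        return "dual targeting unlikely"
    if atp2_score is None:
        return "run ATP2"  # low score or confidence: a second prediction is needed
    if atp2_score >= atp2_cutoff:
        return "dual targeting indicated"
    return "dual targeting not excluded"


# Pp_PPR with GTP_Pp: score 0.58, confidence 0.07, ATP2 score 0.46 (values from the text).
print(triage_dual_targeting(0.58, 0.07, 0.46))
# Hypothetical protein with a clear single-compartment prediction.
print(triage_dual_targeting(0.85, 0.40))
```

The rule is deliberately conservative: a protein with a low confidence value is never labeled as single targeted on the basis of GTP alone.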

Table 4. Summary of localization and prediction results on the in vivo test set
| Protein | Compartment | Prediction tool: GTP_Ref | Prediction tool: GTP_Pp | Protoplast transformation: Arabidopsis thaliana | Protoplast transformation: Nicotiana tabacum | Protoplast transformation: Physcomitrella patens |
|---|---|---|---|---|---|---|

Compartments for which the protein was predicted, or detected by protoplast transformation, are indicated by black tick marks (✓). Underlined tick marks indicate the second predicted compartment, in cases where prediction score and confidence value pointed to a putative dual localization (see Supporting Information Table S5 for details). Boxes shaded in gray represent results obtained with transformations in the homologous systems, or predictions in agreement with the localization in the homologous system. CHL, plastid; MIT, mitochondria; NUCCYT, no N-terminal targeting signal. Secretory pathway localization (SEC, see text) was neither predicted nor observed in vivo, with the exception of one secondary, incorrect localization of Pp_HEM2 predicted by GTP_Ref (marked with an asterisk).

Figure 4.

Correlation among ATP2 score, GTP-score, confidence value and dual targeting on the in vivo test set. (a) Scatter plot depicting scores for GTP_Pp (circles) and GTP_Ref (triangles) on the x-axis in correlation to ATP2-scores on the y-axis. (b) Scatter plot showing the correlation between the confidence values of GTP_Ref (triangles) and GTP_Pp (circles) and the ATP2 score (x-axis). The proteins that show dual targeting in vivo in the homologous system are shown in red.

In silico analysis of targeting signal properties

Green Targeting Predictor uses support vectors, which are applied during the prediction to distinguish between the different locations by analyzing the N-terminal 70 or the C-terminal 30 amino acids of a protein for certain characteristics. These characteristics are referred to as features (Table S3). During training, the 15 features (out of 66 possible) that best discriminate each class from the other three classes were selected. Information about which features allow an optimal distinction between different classes after training with different data sets can reveal whether the structure of the targeting signals is similar or different. In the present case, we found that only six out of the 15 optimal features were used by both GTP_Pp and GTP_Ref for mitochondrial prediction, followed by eight common features out of 15 for prediction of proteins of the secretory pathway. Nucleocytoplasmic and chloroplast predictions showed the greatest conformity, with 10 and 11 common top features, respectively (Table S3). The nonoverlapping features hint at evolutionary divergence of targeting signals and their recognition machinery between different lineages. As an example, negatively charged residues (feature NEGR) are important to recognize P. patens chloroplast targeting signals, but not mitochondrial ones, although in general (GTP_Ref) such negative residues can be used to identify mitochondrial signals.

In vivo validation and interspecies comparison

To validate the predictions experimentally and to compare their in vivo localization in different species, we selected a set of proteins from A. thaliana and P. patens (Table 1, and the 'Materials and Methods' section). Those proteins were fused to a C-terminal GFP and transfected into P. patens, A. thaliana and N. tabacum protoplasts. The candidates that were chosen were single or dual targeted to chloroplasts and mitochondria, respectively, and had a GTP_Pp/GTP_Ref score higher than 0.4 for the respective top result. The overall prediction results of both GTP versions were comparable for this set of proteins.

Validation of P. patens proteins in P. patens

The three representatives for in vivo validation of protein predictions from P. patens encompassed a pentatricopeptide repeat-containing protein (Pp_PPR) localized to mitochondria (MIT), a phosphatidylinositol-specific phospholipase C (Pp_PLC – CHL/MIT) and a δ-aminolevulinic acid dehydratase 2 (Pp_Hem2 – CHL). Pp_Hem2 was predicted to localize to the plastids with a score of > 0.81 by both GTP predictors (Table S6) and this localization was in accordance with the previously observed expression in P. patens protoplasts (Mitschke et al., 2009). Pp_PPR, in turn, was predicted to be mitochondrial with scores of 0.58 and 0.54 for GTP_Pp and GTP_Ref, respectively (Table S6). Potential secondary plastid localization was implied by GTP_Pp only. The low confidence score (0.07) suggested dual plastid/mitochondrial targeting, an assumption that was not excluded by ATP2, which returned a value of 0.46, and that was supported by ATP (score 0.67) as well as by the in vivo localization in isolated protoplasts (Fig. 5a). The third protein, Pp_PLC, was predicted by GTP_Pp to be without N-terminal localization signal (score 0.51, confidence value 0.04) while it was predicted by GTP_Ref to be plastid-localized (score 0.6, confidence value 0.01) (Table S6). In both cases, a mitochondrial localization was predicted in second place. The published localization of this protein in P. patens is mitochondrial and plastidic (Mitschke et al., 2009), confirming the GTP_Ref prediction, while the prediction of GTP_Pp was incorrect. The summary of predictions and in vivo findings for the P. patens proteins is given in Table 4 (cf. Table S6a for details).

Figure 5.

Exemplary in vivo localization experiments in Arabidopsis thaliana, Nicotiana tabacum and Physcomitrella patens (from left to right). Each large image shows an overlay of two fluorescence channels: the GFP fusion protein (a, Pp_PPR; b, At_CP31B; c, At_PPR) in green and the plastid autofluorescence in gray. The three small images below each large image show the channels separately (autofluorescence in gray on the left, GFP in green in the middle, and the brightfield image on the right). For At_CP31B in A. thaliana, the localization of mitochondria (MitoTracker in red) is shown relative to the plastids to clarify the situation in A. thaliana cell culture protoplasts (see the 'Results' section). Squares mark the position of the enlargement(s) shown in the upper right or left corner of each large image. The enlargements indicate the localization (MIT, mitochondria; CHL, chloroplast; NUCCYT, nucleocytoplasmic) observed for each protein in the respective organism, and arrows point to the exact position of the described localization. Bars, 10 μm.

Validation of A. thaliana proteins in A. thaliana

Arabidopsis thaliana protoplasts were transfected to verify the localization of eight proteins in vivo. These encompassed the Whirly transcription factors At_WHY1, At_WHY2 and At_WHY3, a mitochondrial translation termination factor-like protein At_mTERF, a GCN5-related N-acetyltransferase family protein At_GNAT, a PPR protein At_PPR, an RNA-binding protein of the cpRNP family At_CP31B and an asparagine-tRNA ligase At_ASN-tRNA (for more information on these proteins, see Table 1). Transfection of A. thaliana protoplasts showed that three of these candidates (At_WHY1, At_WHY3 and At_CP31B) were located exclusively in the chloroplasts and one (At_WHY2) was found exclusively in the mitochondria. As the protoplasts for the localization experiments in A. thaliana were isolated from sugar-containing cell culture, thylakoid autofluorescence is often observed in only a minor part of the plastid, while the stroma is comparatively extensive. This can be seen clearly with the stromal protein At_CP31B in Fig. 5b. The localization of mitochondria as indicated by MitoTracker staining is shown for comparison and demonstrates no overlap with the GFP signal (Fig. 5b). Three further proteins were found to be dually targeted to chloroplasts and mitochondria (At_ASN-tRNA; At_PPR, Fig. 5c; At_mTERF), while the last protein (At_GNAT) showed a nucleocytoplasmic/plastidic localization. All localization results of A. thaliana proteins that are not shown in Fig. 5 are summarized in Table 4 (cf. Table S6b for details).

The best score of both versions of GTP corresponded to the in vivo localization for five of these proteins (Table 4; At_ASN-tRNA, At_CP31B, At_GNAT, At_WHY2, At_PPR; shaded in grey). At_WHY1 and At_mTERF were each predicted correctly by only one of the two predictors (Table 4). The dual localization pattern of four out of those eight proteins was predicted correctly with both tools in all cases except one (GTP_Ref prediction of At_mTERF). The scores for those predictions ranged between 0.42 and 0.67. Confidence scores were generally low (maximum 0.18), except for At_GNAT (0.41, Table S6b).

Localization studies using heterologous expression systems

All localization experiments were also performed in heterologous systems (N. tabacum and A. thaliana for P. patens proteins, and N. tabacum and P. patens for A. thaliana proteins). The localization results varied to a large extent between homologous and heterologous systems. Comparison between N. tabacum and A. thaliana showed that a secondary localization was clearly lacking in N. tabacum for At_PPR, At_mTERF and At_GNAT. These proteins were either also incorrectly predicted by at least one of the predictors (e.g. At_mTERF) or predicted differently by the two GTP versions (e.g. At_PPR; Table 4 and Table S6). Comparison between A. thaliana and P. patens also revealed mistargeting in four cases. Between those two organisms, the mistargeting comprised an additional second localization for At_WHY2 (plastids), a lacking second localization for At_mTERF, corresponding to the result in N. tabacum, and the lack of detection of an N-terminal targeting signal altogether for At_PPR and Pp_PLC. At_PPR (Fig. 5c) showed dual targeting in the homologous system, single targeting to plastids in N. tabacum protoplasts and cytoplasmic localization in P. patens. At_mTERF also showed dual targeting to mitochondria and plastids in the homologous system, but did not show any plastid localization in the two heterologous systems (Table 4). For At_GNAT, nucleocytoplasmic localization in addition to the chloroplast localization was visible in A. thaliana and P. patens, but not in N. tabacum.

Of the P. patens proteins tested here, Pp_PPR showed the same dual localization in each system (Fig. 5a). This is in strong contrast to Pp_PLC, which showed dual localization in P. patens to mitochondria and plastids, but could not be detected in any of these organelles in protoplasts of the seed plants.
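The comparison between homologous and heterologous systems can be expressed as a set comparison. The following is a minimal sketch under stated assumptions, not the procedure used in this study: the function names and category labels are hypothetical, the compartment sets are transcribed from the text above for three proteins only, and "CYT" is an assumed label for the cytoplasmic signal of At_PPR in P. patens.

```python
from typing import Dict, FrozenSet

# Compartment labels follow Fig. 5 (CHL, MIT, NUCCYT); "CYT" is an assumed
# label for the purely cytoplasmic signal reported for At_PPR in P. patens.
Loc = FrozenSet[str]


def compare_to_homologous(homologous: Loc, heterologous: Loc) -> str:
    """Classify a heterologous result relative to the homologous one."""
    if heterologous == homologous:
        return "match"
    if heterologous and heterologous < homologous:
        return "compartment missing"      # non-empty proper subset
    if heterologous > homologous:
        return "additional compartment"   # proper superset
    return "different localization"


def homologous_system(protein: str) -> str:
    """Infer the source organism from the At_/Pp_ prefix used in the text."""
    return "A. thaliana" if protein.startswith("At_") else "P. patens"


def classify_all(observed: Dict[str, Dict[str, Loc]]) -> Dict[str, Dict[str, str]]:
    """Compare every heterologous observation with the homologous reference."""
    result = {}
    for protein, by_system in observed.items():
        home = homologous_system(protein)
        reference = by_system[home]
        result[protein] = {
            system: compare_to_homologous(reference, loc)
            for system, loc in by_system.items()
            if system != home
        }
    return result


# Observations transcribed from the Results text (illustrative subset only).
observed = {
    "At_PPR": {
        "A. thaliana": frozenset({"CHL", "MIT"}),
        "N. tabacum": frozenset({"CHL"}),
        "P. patens": frozenset({"CYT"}),
    },
    "At_GNAT": {
        "A. thaliana": frozenset({"NUCCYT", "CHL"}),
        "N. tabacum": frozenset({"CHL"}),
        "P. patens": frozenset({"NUCCYT", "CHL"}),
    },
    "Pp_PPR": {
        "P. patens": frozenset({"CHL", "MIT"}),
        "A. thaliana": frozenset({"CHL", "MIT"}),
        "N. tabacum": frozenset({"CHL", "MIT"}),
    },
}

for protein, calls in classify_all(observed).items():
    print(protein, calls)
```

Representing each observation as a set keeps the distinction between a lost compartment, a gained compartment and a complete relocalization explicit, which is the distinction drawn in the text.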

Only one of the three Whirly proteins from A. thaliana showed an aberrant localization pattern in P. patens, although no homologs of these proteins were found in the moss. The predictor GTP_Pp, on the other hand, did not detect any N-terminal targeting signal for two of them, At_WHY1 and At_WHY3, while the third protein, At_WHY2, was predicted to be localized to the plastids and mitochondria, which proved to be correct in the moss (Table 4).


With GTP_Pp, we introduce a P. patens-specific prediction tool that is able, despite a relatively small training set, to predict the localization of P. patens proteins with a precision comparable to or better than that of prediction tools based on a large, but multispecies, data set. One of the caveats of our small data set, however, is that not all determinants of targeting signals may have been represented during training of the tool, leading to a possible bias towards the characteristics present in the training data set. This drawback can be eliminated as soon as more localization data on proteins for this species become available. The comparison with an A. thaliana-specific prediction tool as well as with an A. thaliana-specific test data set showed a clear decrease in prediction accuracy when the nonspecific data set/tool was used. For optimal performance, predictors should therefore be species-specific. This confirms the advantage of the species-specific predictors described for A. thaliana by Kaundal et al. (2010) and for O. sativa by Kaundal & Raghava (2009). The comparison of the performance of GTP_Pp on different data sets showed some differences depending on the composition of the test set. The main differences were detectable for mitochondrial proteins, which indicates not only a difference between mitochondrial targeting sequences of all organisms represented in the TargetP-derived test set, but also differences of the P. patens mitochondrial sequences, which were predicted with higher precision at a lower confidence filtering step than the mitochondrial sequences from seed plants.

The training of two tools with data sets of different composition (GTP_Pp and GTP_Ref) made it possible to compare the setup of N-terminal targeting signals of P. patens and of a mixed eukaryotic data set by contrasting the predominant features used for distinction between classes. From this analysis we can infer that training with data sets of different composition can give clues as to whether the prevalent characteristics of targeting signals have changed over evolutionary time and by how much. For the situation described here we can conclude that mitochondrial signals changed considerably more than plastid signals, while signals targeting the protein through the secretory pathway are intermediate (Table S3). Our observation that mitochondrial targeting signals share the fewest common features corroborates the observations made in a study on mitochondrial targeting peptides some years ago, which showed that mitochondrial targeting signals of yeast and flowering plants share very few features (Huang et al., 2009). Surprisingly, the features that are used for distinction of yeast mitochondrial targeting signals in this previous analysis partially coincide with the ones regarded as critical for plastid targeting in another study (Patron & Waller, 2007). Accordingly, one of the observed differences between feature usage by GTP_Pp and GTP_Ref was the use of β-sheets as a distinctive feature by GTP_Pp. This feature is not used by GTP_Ref to distinguish between plastids and mitochondria (Table S3). In contrast to plant N-terminal sequences, where the occurrence of β-sheets is characteristic for plastid targeting signals, but is extremely rare in mitochondrial presequences, yeast mitochondrial presequences often exhibit β-sheet elements in the first 10 amino acids (Huang et al., 2009). Thus, β-sheets are most likely unsuitable as a distinguishing feature for GTP_Ref as a result of the presence of yeast sequences in the TargetP data set.

When the performance of both tools was analyzed on the proteins used for localization studies, they performed similarly, with some minor differences: targeting characteristics to the organelles are similar enough that a moss-specific predictor can predict targeting in seed plants satisfactorily. However, our data also imply that a tool trained with a mixed training data set is blind towards some characteristics of N-terminal targeting signals responsible for dual targeting, which makes the species-specific tool superior in that regard. This is most obvious when analyzing and comparing confidence values. We observed that low confidence values can hint at a secondary (dual) localization to plastids or mitochondria. A low confidence value is no guarantee of a secondary localization (which should be tested using ATP2), whereas a high confidence value all but excludes that possibility. This proved to be more pronounced for the species-specific tool than for GTP_Ref and could not be extended to non-N-terminally encoded localizations.
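The confidence-based reasoning described above can be written as a simple triage rule. The sketch below is illustrative only and is not part of GTP or ATP2: the class and function names are hypothetical, the threshold of 0.2 is the value given in the summary points of this paper, and the ATP2 cutoff is left as a caller-supplied parameter because no value is stated in this section. Confidence values at or above the threshold are reported as giving no hint, since the text does not define where "high" confidence begins.

```python
from dataclasses import dataclass
from typing import Optional

LOW_CONFIDENCE = 0.2        # threshold quoted in the summary points
ORGANELLES = {"CHL", "MIT"}  # the rule applies to plastids and mitochondria


@dataclass
class GtpPrediction:
    """Top GTP result for one protein (hypothetical container)."""
    top_location: str
    score: float
    confidence: float


def dual_targeting_hint(
    pred: GtpPrediction,
    atp2_score: Optional[float] = None,
    atp2_cutoff: Optional[float] = None,  # assumption: supplied by the user
) -> str:
    """Translate a GTP confidence value into a dual-targeting hint."""
    if pred.top_location not in ORGANELLES:
        # The rule could not be extended to other localizations.
        return "not applicable"
    if pred.confidence >= LOW_CONFIDENCE:
        return "no low-confidence hint"
    if atp2_score is None or atp2_cutoff is None:
        return "possible dual targeting: check with ATP2"
    if atp2_score >= atp2_cutoff:
        return "dual targeting supported by ATP2"
    return "dual targeting not supported by ATP2"


# Pp_PPR as predicted by GTP_Pp: mitochondrial, score 0.58, confidence 0.07.
pp_ppr = GtpPrediction(top_location="MIT", score=0.58, confidence=0.07)
print(dual_targeting_hint(pp_ppr))
```

For Pp_PPR the rule flags a possible secondary localization, which agrees with the dual plastid/mitochondrial targeting observed in vivo.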

Overall, only six of the 11 tested proteins had matching localization patterns in all three organisms. One explanation for this surprisingly low number is probably the high proportion of proteins exhibiting dual targeting in our set. These proteins were more often than not either entirely mistargeted or lacked targeting to one compartment in the heterologous system. In addition, mitochondrial targeting seemed to be slightly more prone to mistargeting (Tables 4, S6; At_WHY2 is also localized in plastids in P. patens and the secondary mitochondrial localization of At_mTERF is not observed in either heterologous system). These observed differences confirm our in silico analysis and also support previous observations. While plastid transit peptides within the green lineage seem to be reasonably conserved, despite some differences between green algae and seed plants (Franzen et al., 1990; Patron & Waller, 2007), an in silico study of the receptors of the mitochondrial outer membrane (Carrie et al., 2010) revealed differences in composition between P. patens and angiosperms. Those differences are possibly one of the reasons for the rate of mistargeting of the proteins tested in vivo. It seems that the modifications necessary in presequences for dual targeting are more lineage-specific than the ones for single targeting. However, this might also reflect the evolutionary age of the individual dual targeting signal. In line with this reasoning, the only two proteins that showed correct dual targeting in all three organisms were Pp_PPR and At_ASN-tRNA. The latter protein belongs to a family of proteins described as dually targeted even in chlorarachniophytes (Hirakawa et al., 2012), while the Pp_PPR protein belongs to a family that expanded greatly during the evolution of land plants (summarized in Schmitz-Linneweber & Small, 2008) and that might have already possessed dual targeting in the last common ancestor of the organisms studied here.

As heterologous systems are widely used for protein localization studies, our results raise the question as to how many dual targeted proteins have so far evaded discovery. Not only tissue- and development-specific influences (Zhang et al., 2010; Faraco et al., 2011), but also eclipsed distribution (Regev-Rudzki & Pines, 2007), have to be considered when re-investigating protein targeting, as well as the choice of the proper (i.e. homologous) experimental system. As the impact of secondary localizations of proteins is undisputed (Balsemao-Pires et al., 2011), we feel that reliable localization results – and with that information on possible functions – can only be obtained with the respective homologous expression system, eliminating a biased result by as yet unknown cellular cofactors and regulatory mechanisms.

Taken together, our data suggest that predictions with tools trained on mixed data sets increase the risk of missing out on a potential secondary localization. Lineage-specific tools, on the other hand, seem to be better suited to give hints on an additional localization, at least in conjunction with dual targeting predictors such as ATP2.


  • ATP2 is superior to ATP (higher cutoff, higher sensitivity/specificity) in predicting dual targeting.
  • GTP_Pp and GTP_Ref predictions for mitochondria and plastids with a low confidence value (< 0.2) can indicate dual targeting (to be checked by ATP2), while high confidence values all but exclude this possibility.
  • Species-specific prediction tools are superior to prediction tools based on mixed data sets even when using much smaller training data sets.
  • Localization experiments should be performed in the homologous system whenever possible to gain as complete a picture as possible of subcellular targeting.


We thank Dr Ullrich Herrmann for helpful suggestions regarding cloning problems and for critical reading of the manuscript. Financial support in the form of a mutual exchange grant from the DAAD (Germany) and the NFR (Norway) within the program DAADppp to S.A.R. and K.K., respectively, and by NFR grant 180662/V40 to K.K. is gratefully acknowledged. We are grateful to Karol Buchta and Faezeh Donges for aid in implementing the webtools and to the ‘Bioimaging Platform’ at the University of Tromsø for granting access to the CLSMs.