Fast liquid chromatography coupled to electrospray tandem mass spectrometry peptide sequencing for cross-species protein identification

Authors

  • Erwin Witters,

    Corresponding author
    1. Universiteit Antwerpen, UIA, dept. Biologie, Laboratorium voor plantenbiochemie, Universiteitsplein 1, B-2610 Antwerpen, Belgium
  • Kris Laukens,

    1. Universiteit Antwerpen, UIA, dept. Biologie, Laboratorium voor plantenbiochemie, Universiteitsplein 1, B-2610 Antwerpen, Belgium
  • Peter Deckers,

    1. Universiteit Antwerpen, UIA, dept. Biologie, Laboratorium voor plantenbiochemie, Universiteitsplein 1, B-2610 Antwerpen, Belgium
  • Walter Van Dongen,

    1. Universiteit Antwerpen, RUCA, dept. Scheikunde, Nucleoside Research and Mass Spectrometry Unit, Groenenborgerlaan 171, B-2020 Antwerpen, Belgium
  • Eddy Esmans,

    1. Universiteit Antwerpen, RUCA, dept. Scheikunde, Nucleoside Research and Mass Spectrometry Unit, Groenenborgerlaan 171, B-2020 Antwerpen, Belgium
  • Harry Van Onckelen

    1. Universiteit Antwerpen, UIA, dept. Biologie, Laboratorium voor plantenbiochemie, Universiteitsplein 1, B-2610 Antwerpen, Belgium

Abstract

A rapid liquid chromatography/mass spectrometric (LC/MS) protein identification method is presented, based on a parallel microcolumn switching liquid chromatography set-up coupled to a quadrupole time-of-flight mass spectrometer. Without prior sample clean-up, up to 300 protein digest samples a day can be processed. Using data-directed acquisition, up to 10 fragmentation analyses per protein sample can be acquired in a single chromatographic run and used for database searching. Using internal peptide sequence information, both protein databases and the various nucleic acid databases can be queried for cross-species identification of the protein sample. The method was evaluated and applied to generate data for a tobacco cell culture protein database. Copyright © 2003 John Wiley & Sons, Ltd.

At present, two mainstream schools exist within mass spectrometry-based proteomics concerning the separation and identification of complex protein mixtures: gel-based and non-gel-based analyses. In most cases the latter is based on a proteolytic digest of the protein mixture followed by LC separation and MS analysis of a restricted selection of peptides, representing the proteome pars pro toto. Gel-based methods use at least two different gel-supported electrophoretic protein separations prior to sequence-specific cleavage of the individual proteins. Although an in-depth analysis of the pros and cons is beyond the scope of this paper, each school has advantages over the other and, as is often the case with opposing disciplines, the two are complementary. In this paper an LC/MS method is developed and optimized for fast identification of proteins obtained from gel-based separation procedures.

Once the proteins are isolated, by far the fastest protein identification method is matrix-assisted laser desorption/ionization time-of-flight mass spectrometric (MALDI-TOFMS) peptide mass fingerprinting (PMF). Because PMF identification is restricted to those proteins for which complete amino acid or gene sequences are present in databases, its principal use is limited to well-sequenced model systems. When annotation relies on cross-species identification, peptide sequence information is required.1,2 While post-source decay (PSD) may add sequence information to PMF, the quality, accuracy and sequence coverage remain inferior to true fragmentation data. Depending on the sequence conservation of a protein or gene across different species, protein identification might rely completely on peptide sequence information. In this study we present data showing that the amount of peptide sequence information needed for protein identification depends on the genetic relationship between the model used and the species that provided the cross-species identification. With the recent introduction of MALDI tandem mass spectrometers, a great deal of the workload of LC-coupled electrospray tandem mass spectrometry will no doubt be taken away. This will leave LC-coupled ESI-MS/MS as a complementary identification technique for acquiring additional sequence coverage and post-translational modification information, and for analyzing those specimens that generate poor MALDI-MS/MS spectra.

This contribution demonstrates an automated, rapid, on-line LC/MS/MS-based protein identification method suited to cross-species identification, illustrated by its use in generating data for a tobacco BY-2 cell culture protein database.3–5

EXPERIMENTAL

Chemicals

HPLC-grade water, methanol and acetonitrile were obtained from Sigma (Pestanal grade; Riedel-de Haën, Belgium). Buffers and chemicals (PlusOne grade) used for electrophoresis were obtained from Amersham Biosciences UK. [Glu1]-fibrinopeptide B was obtained from Sigma (Sigma-Aldrich, Belgium). Trypsin (sequencing grade), reducing agent and alkylating agent were obtained from Roche (Roche, Germany). SYPRO Ruby was purchased from Perkin Elmer (Belgium).

Sample preparation

Sample preparation and analysis are described in detail elsewhere.3 Briefly, BY-2 cells were filtered from the medium and frozen in liquid nitrogen. Proteins were extracted by grinding the cells in acetone (−20°C), including 10% (v/v) trichloroacetic acid and 0.07% β-mercaptoethanol, followed by overnight incubation at −20°C. The suspension was centrifuged (15000 g, 10 min), the pellet was washed once with acetone + 0.07% β-mercaptoethanol (−20°C) and air-dried. The pellet was dispensed in resolubilization buffer (2 M thiourea, 7 M urea, 0.5% v/v IPG buffer pH 3–10, 1% DTT, 2% CHAPS). Insoluble particles were removed by centrifugation. The protein extract (100 μg) was separated by two-dimensional electrophoresis according to Görg6 using immobilized pH gradient strips (18 cm) ranging from pH 3 to 10 and 12% SDS PAGE gels. Gels were contrasted using an MS-compatible SYPRO Ruby fluorescent staining protocol. Randomly selected faint and dense spots covering basic and acidic pH as well as high and low molecular weight were isolated and submitted to tryptic digestion.7

Liquid chromatography

A capillary liquid chromatograph with an integrated autosampler (CapLC; Waters, MI, USA) was retrofitted with an additional actuated 10-port switching valve (VICI, USA) enabling us to perform parallel column switching (Fig. 1). Capillary C18 reversed-phase columns were obtained from Dionex (5 × 0.3 mm i.d. stationary phase: PepMap, Dionex, USA). Samples were dissolved in 50 μL of mobile phase A (2% CH3OH/98% CHOOH 2 mM) and 20 μL were injected into the first dimension of the LC system (mobile phase A, 20 μL min−1). Peptides were captured on the microcapillary column and desalted during the 5 min time-frame. After desalting had taken place, the loaded column was switched in-line with the second LC dimension and eluted using mobile phase B (80% CH3OH/20% CHOOH 2 mM) at a flow rate of 800 nL min−1.

Figure 1.

Parallel column switching set-up consisting of two microcapillary C18 columns. Both columns are alternately loaded with sample and desalted for 5 min at a flow rate of 20 μL min−1 in the first dimension and eluted for 5 min using a flow rate of 800 nL min−1 in the second dimension.

Mass spectrometry

An electrospray quadrupole time-of-flight tandem mass spectrometer (QTOF-2; Waters, Manchester, UK) equipped with a picotip sprayer (coated picotip needle 8 μm i.d.; New Objective, USA, capillary voltage 1500 V, cone voltage 46 V) was used to record peptide fragmentation spectra that were interpreted with the accompanying software (MassLynx 3.5; Waters). By means of data-directed acquisition, peptide ions within an m/z 320–1500 survey scan mass range were analyzed for subsequent fragmentation. Doubly and triply charged ions exceeding a threshold abundance (TIC value 10 s−1, scan time 3 s, inter-scan delay 100 ms) were matched against a mass exclusion list (Table 1). When the criteria were met, up to eight of the most abundant peptide ions were selected for collision-induced dissociation (CID). The selected precursor ions were subjected to three different collision energy states (33, 22, 28 eV, using Ar as the collision gas, P = 0.3 Pa, scan time 3.25 s, inter-scan delay 50 ms).

Table 1. Exclusion list containing frequently observed trypsin peptides (T), the internal standard fibrinopeptide (FPB) and unidentified system peaks (S)

m/z      z   Origin
1137.0   2   T
1092.9   2   T
1081.9   2   T
785.9    2   FPB
777.3    3   T
763.6    3   T
758.4    3   T
752.7    2   T
734.3    3   T
728.9    3   T
721.6    3   T
604.3    2   S
607.3    2   S
599.3    2   T
596.3    3   T
577.2    2   T
547.8    3   T
421.8    2   T
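
The precursor selection logic described above can be sketched in a few lines of Python. This is only an illustration of the stated criteria (m/z 320–1500, charge state 2 or 3, abundance above a threshold, not on the exclusion list of Table 1, at most eight ions), not the instrument software. The `Ion` type, the function names, the 0.5 m/z matching tolerance and the requirement that the charge state also matches the exclusion entry are assumptions made for the example; the text does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ion:
    """A survey-scan ion (hypothetical container for this sketch)."""
    mz: float
    charge: int
    intensity: float


# (m/z, charge) pairs taken from Table 1.
EXCLUSION_LIST = [
    (1137.0, 2), (1092.9, 2), (1081.9, 2), (785.9, 2), (777.3, 3),
    (763.6, 3), (758.4, 3), (752.7, 2), (734.3, 3), (728.9, 3),
    (721.6, 3), (604.3, 2), (607.3, 2), (599.3, 2), (596.3, 3),
    (577.2, 2), (547.8, 3), (421.8, 2),
]

MZ_RANGE = (320.0, 1500.0)   # survey scan range from the text
ALLOWED_CHARGES = {2, 3}     # doubly and triply charged ions
MAX_PRECURSORS = 8           # "up to eight of the most abundant peptide ions"
MZ_TOLERANCE = 0.5           # assumption: matching window is not given in the text


def is_excluded(ion, tolerance=MZ_TOLERANCE):
    """True if the ion matches an exclusion entry in m/z and charge (assumed rule)."""
    return any(
        ion.charge == z and abs(ion.mz - mz) <= tolerance
        for mz, z in EXCLUSION_LIST
    )


def select_precursors(ions, threshold=10.0, max_precursors=MAX_PRECURSORS):
    """Return the most abundant eligible ions, most intense first."""
    candidates = [
        ion for ion in ions
        if MZ_RANGE[0] <= ion.mz <= MZ_RANGE[1]
        and ion.charge in ALLOWED_CHARGES
        and ion.intensity > threshold
        and not is_excluded(ion)
    ]
    candidates.sort(key=lambda ion: ion.intensity, reverse=True)
    return candidates[:max_precursors]
```

With a survey scan containing the spiked fibrinopeptide B ion (m/z 785.9, 2+) next to two sample peptides, only the sample peptides would be passed on for CID under these assumptions.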

Protein identification

Raw files containing accumulated fragmentation spectra obtained at different collision energies were processed using a smoothing and centroid function and exported as .pkl files. The files were trimmed using a Perl script to enhance the hit score. Merged .pkl files were submitted to Mascot (Matrix Science)8 for identification using, sequentially, the SwissProt, NCBInr and EST databases. Identification was assumed for those queries with a significant score (p < 0.05). Additionally, when available, de novo sequence information of peptides not positively identified by the previous method was submitted for MSBLAST identification.9

RESULTS AND DISCUSSION

The LC/MS set-up was optimized for an automated and sample-independent protein identification method for species with an unknown or incomplete genome; hence the procedure was tuned for throughput, sequence coverage and robustness rather than maximum sensitivity. A compromise on sensitivity was made, based on the premise that a 10 ng spot of a 50 kDa protein (i.e., 200 fmol of protein) should result in an unambiguous identification. An autolytic trypsin digest was diluted to 10 and 2.5 fmol μL−1 and used for a 200 and 50 fmol injection, respectively, demonstrating the sensitivity (200 fmol, signal-to-noise (S/N) ratio of 70:1 for the base peak) and the limit of detection (50 fmol, S/N ratio of 5:1 for the base peak) (Fig. 2(a)). In order to obtain an increased hit rate for cross-species identification, multiple diagnostic peptides of every protein sample were selected and fragmented in the same chromatographic run. As collision-induced fragmentation is dependent on the charge state and the amino acid composition, each of the precursor ions was subjected to three different collision energies to obtain at least a single good spectrum. By way of demonstration, two peptide fragmentation spectra, representing a protein obtained from an estimated 20 ng 2D spot, are depicted (Figs. 2(b) and 2(c)). Each of the sequence spectra (scan time 3.2 s, 22 eV collision energy) led to the identification of the protein as a membrane-bound porin.
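
The amounts quoted above follow from simple unit conversions, which the following minimal Python sketch reproduces. The function names are hypothetical, and the 20 μL injection volume is taken from the liquid chromatography section. Since 1 ng of a 1 kDa protein corresponds to 1 pmol (1000 fmol), the amount in fmol is the mass in ng times 1000 divided by the molecular weight in kDa.

```python
def mass_to_fmol(mass_ng, mw_kda):
    """Convert a protein mass (ng) to an amount (fmol) for a given molecular weight (kDa)."""
    return mass_ng * 1000.0 / mw_kda


def injected_fmol(conc_fmol_per_ul, volume_ul):
    """Amount loaded on the column for a given concentration and injection volume."""
    return conc_fmol_per_ul * volume_ul


# 10 ng spot of a 50 kDa protein
spot_amount = mass_to_fmol(10.0, 50.0)

# Autolytic trypsin digest at 10 and 2.5 fmol/uL, 20 uL injected
sensitivity_load = injected_fmol(10.0, 20.0)
detection_limit_load = injected_fmol(2.5, 20.0)

print(spot_amount, sensitivity_load, detection_limit_load)
```

The three values correspond to the 200 fmol premise, the 200 fmol sensitivity test and the 50 fmol limit-of-detection test, respectively.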

Figure 2.

Survey scan representing the limit of detection of a 50 fmol trypsin injection (S/N ratio 5:1, upper panel) and the sensitivity for a 200 fmol autodigest injection (S/N ratio 70:1, lower panel) (a). Sequence spectrum from two peptides m/z 504.37 (b) and m/z 662.96 (c) taken at a single collision energy regime (22 eV), each leading to the protein identity as porin.

As system pressures may rise with increasing numbers of injected samples, or since occasional inhomogeneities in the sample matrix may temporarily clog the column or electrospray needle, a splitless capillary LC system was chosen to ensure a constant flow rate. The robustness of the system ensured the continuity of automated analyses and could overcome differences in pressure of several hundred psi (Fig. 3(a), upper trace), without significant change in retention time (Fig. 3(b)). Chromatographic reproducibility was tested by monitoring the retention time of co-eluting peptides (SD < 0.06 min, n = 7). Sample carry-over was typically less than 0.25%. Temporal stability with respect to sensitivity was monitored based on the intensity of spiked fibrinopeptide B (200 fmol/sample; Fig. 3(c)). The observed small fluctuations are well within acceptable bounds and reflect differences in matrix ion suppression rather than a cumulative reduction of sensitivity.

Figure 3.

Pressure versus time profile of the second LC dimension illustrating the benefit of a constant flow LC system enabling us to overcome severe disturbances in system pressure during batch analysis (a) and assuring an accurate retention time of the co-eluting peptides (b). Ion spectra of spiked fibrinopeptide B showing the observed variation in sensitivity at different time points of the batch analyses (c).

To increase the duty cycle and to allow for an extensive sample clean-up, a parallel column switching system was mounted on a 10-port switching valve (Fig. 1). Whilst a sample was being desalted, a previously loaded sample could be analyzed. The chromatographic analysis time was optimized to allow at least five peptide fragmentations. Empirical experiments with respect to ion statistics and different collision energy regimes for a maximum identification rate indicated an optimal 3 s survey scan time and a fragmentation time of around 3 s per collision energy state. These settings resulted in a 10-s analysis time for a single precursor ion. Accordingly, an adjusted flow rate of 800 nL min−1 generated a peak elution profile of around 50 s (FWHM) within a 5-min time-frame.

Depending on the requirements for sequence analyses of multiple peptides both the scan time and the flow rate of the set-up can be altered, either to increase throughput or to increase the cross-species identification rate. By doubling the flow rates of both dimensions a 3-min duty cycle is reached. Consequently, less time for fragmentation analysis and sequence coverage is available. When enhanced sequence coverage is required, a reduction of the flow rate will allow for a higher number of peptides that can be selected for CID. Likewise, the scan time can be altered. If high peptide concentrations are expected, both the survey scan time and the MS/MS scan time can be reduced accordingly. On the other hand, when required, prolonged scan times in agreement with adjusted reduced flow rates result in an increased sensitivity. The large degrees of freedom make this LC/MS set-up a versatile platform suitable for various screening methods.
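
The trade-off between throughput and sequence coverage can be illustrated with a back-of-the-envelope calculation. The sketch below assumes uninterrupted 24 h operation with no dead time between cycles and that, thanks to parallel column switching, one sample elutes per duty cycle; the function names are hypothetical. The 10 s analysis time per precursor and the 50 s peak width are the values quoted in the text.

```python
MINUTES_PER_DAY = 24 * 60


def samples_per_day(cycle_min):
    """Samples processed per day if one sample elutes every duty cycle (assumed ideal case)."""
    return int(MINUTES_PER_DAY // cycle_min)


def precursors_per_peak(peak_width_s, time_per_precursor_s):
    """Number of complete precursor analyses that fit within one elution profile."""
    return int(peak_width_s // time_per_precursor_s)


print(samples_per_day(5))           # 5-min duty cycle
print(samples_per_day(3))           # 3-min duty cycle
print(precursors_per_peak(50, 10))  # 50 s peak, 10 s per precursor
```

Under these idealized assumptions the 5-min cycle gives 288 samples a day, close to the figure of up to 300 stated in the abstract, and the 50 s elution profile leaves room for five precursor analyses. Shortening the cycle to 3 min raises the ceiling to 480 samples a day at the cost of time available for fragmentation.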

The overall performance of the complete system was tested using protein digests obtained from a SYPRO Ruby stained 2D gel representing the proteome of a Nicotiana tabacum cell suspension culture.4 This model system is widely used to study the plant cell cycle. Unlike Arabidopsis, Nicotiana has a poorly described proteome and genome (Table 2). Whereas Arabidopsis represents more than 20% and 35% of the plant kingdom entries of the SwissProt and translated EMBL nucleotide sequence database, respectively, Nicotiana represents less than 5% and 1.5%. Taking into account that the Nicotiana genome is an estimated 50 times larger, protein identification relies largely on cross-species identification. Out of 80 randomly selected proteins yielding sequence data, 58 (72%) were identified; 22 samples (28%) were not identified using the described identification method, either due to insufficient sequence coverage (see below) or absence of database information. Only nine proteins (11%) were identified with protein entries at the species level (Nicotiana tabacum). Sixteen more proteins (20%) were identified with protein entries at the family level (Solanaceae). All other proteins (69%) were identified with entries of phylogenetically more distant species. It was observed that 80% of the submitted peptide sequences resulted in a significant hit when the protein at the species level was present in the database, so in most cases a single peptide fragmentation spectrum already resulted in protein identification. Only 40% of the submitted spectra resulted in an identification with a protein entry at the family level, meaning that on average more than two sequence spectra had to be submitted for positive identification. In cases where identification relied on even more distantly related species, 25% (or only one out of four spectra) resulted in unambiguous identification. A limited set of examples illustrating the different situations encountered is presented in Table 3. Evidently, due to the small sample population, these figures are a mere approximation and are restricted to tobacco. In a particular case, even a set of six peptide fragmentation spectra did not result in a protein identification (Table 3). In another case, however, reflecting conservation throughout species, nine out of ten submitted CID spectra resulted in an identification obtained from a protein entry of a species belonging to a different super order (Table 3). However, it is clear that the greater the genetic distance, the more peptide fragmentation spectra (i.e. sequence coverage) are necessary for unambiguous protein identification. A direct implication of these findings is the extra caution that has to be taken when performing pars pro toto proteomics. Since in this case the number of peptides representing a protein is drastically reduced, identification relying on cross-species proteins will become increasingly difficult with the increased genetic distance between the subject and the species with a known proteome or genome.
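
The relation between the per-spectrum hit rate and the number of spectra needed can be made explicit with a small calculation. The sketch below is an illustration only: it treats each submitted spectrum as an independent trial with a fixed hit rate, which is a simplifying assumption not made in the text, and the function names and the 95% target are hypothetical choices for the example.

```python
import math


def expected_spectra_per_hit(hit_rate):
    """Mean number of spectra submitted per significant hit (1 / hit rate)."""
    return 1.0 / hit_rate


def prob_at_least_one_hit(hit_rate, n_spectra):
    """Chance that at least one of n independent spectra gives a significant hit."""
    return 1.0 - (1.0 - hit_rate) ** n_spectra


def spectra_for_confidence(hit_rate, confidence=0.95):
    """Smallest number of spectra giving at least one hit with the requested probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))


# Hit rates reported in the text for the three levels of relatedness
for level, rate in [("species", 0.80), ("family", 0.40), ("more distant", 0.25)]:
    print(level, expected_spectra_per_hit(rate), spectra_for_confidence(rate))
```

The expected values of 1.25, 2.5 and 4 spectra per hit match the statements that a single spectrum usually suffices at the species level, that more than two are needed at the family level, and that only one out of four spectra succeeds for more distant species. Under the independence assumption, the number of spectra required for a high chance of at least one hit rises steeply as the hit rate falls.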

Table 2. Number of taxon entries present in the SwissProt and TrEMBL databases for tobacco and three other leading species

Taxon           SwissProt entries   TrEMBL entries
Viridiplantae   10027               109791
Solanales       1149                5045
Solanaceae      1090                4222
Nicotiana       461                 1596
Brassicales     2525                39665
Brassicaceae    2503                39546
Arabidopsis°    2269                38428
Fabales         1203                4613
Fabaceae        1201                4599
Medicago°       117                 506
Poales          1617                27397
Poaceae         1611                26826
Oryza°          421                 20251

  • ° Selected model species with a large-scale sequencing project. SwissProt release 41.15 of 03-07-2003: 130006 entries; TrEMBL release 24.0 of 27-06-2003: 944142 entries.

Table 3. Limited list of samples illustrating the different identification situations and peptide hit rates

Each sample ID is followed by the number of submitted vs. identified spectra. Each hit then lists the database (SwissProt/NCBInr or dbEST), the hit rate with the Mascot score in parentheses, the protein identification with the number of other proteins matching the same set of peptides in parentheses, the species, the top database entry, and the family and order. The peptide sequence information is given on the indented lines below each hit.

BlSpo112 (submitted/identified spectra: 5/4)
  SwissProt/NCBInr, 80% (241): Glutamate-ammonia ligase (0); Nicotiana plumbaginifolia; JN0041; Solanaceae, Solanales
    GNNILVM°CDAYTPAGEPIPTNKR
    TLSGPVTDPAKLPK
    TLSGPVTDPAK
    EDGGYEVILK
  SwissProt/NCBInr, 40% (79): Glutamate-ammonia ligase (0); Populus x canescens; Q94L36; Salicaceae, Malpighiales
    GNNILVMCDAYTPQGEPIPTNKR
    TLSGPVTDPAQLPK

BlSpo121 (submitted/identified spectra: 8/7)
  SwissProt/NCBInr, 60% (225): Monodehydroascorbate reductase (NADH2) (0); Lycopersicon esculentum; T06407; Solanaceae, Solanales
    GTVAVGFDTHPN+GEVK
    LSDFGVQGADSK
    EAVAPYERPALSK
    QGVKPGELAIISK
    NIFYLR
  SwissProt/NCBInr, 40% (111): Monodehydroascorbate reductase (0); Mesembryanthemum crystallinum; Q93YG1; Aizoaceae, Caryophyllales
    EAVAPYERPALSK
    QGVQPGELAIISK
    NIFYLR
  dbEST, 40% (104): Monodehydroascorbate reductase (0); Gossypium arboreum; 13247881; Malvaceae, Malvales
    EAVAPYERPALSK
    QGVQPGELAIISK
    NIFYLR
  dbEST, 10% (79): Monodehydroascorbate reductase (1); Citrus unshiu; 30465566; Rutaceae, Sapindales
    VVGVFLESGTPEENK

BlSpo118B (submitted/identified spectra: 10/10)
  SwissProt/NCBInr, 90% (443): Tubulin beta-1 chain (0); Medicago falcata; Q949G6; Fabaceae, Fabales
    M°ASTFIGN+STSIQEM°FR
    GHYTEGAELIDSVLDVVR
    AVLM°DLEPGTM°DSLR
    LHFFM°VGFAPLTSR
    NSSYFVEWIPNNVK
    M°M°LTFSVFPSPK
    VSEQFTAM°FR
    FPGQLNSDLR
    KLAVNLIPFPR
  SwissProt/NCBInr, 90% (428): Tubulin beta-1 chain (0); Lupinus albus; 586076; Fabaceae, Fabales
    M°ASTFIGN+STSIQEM°FR
    GHYTEGAELIDSVLDVVR
    LHFFM°VGFAPLTSR
    NSSYFVEWIPNNVK
    M°M°LTFSVFPSPK
    YGGDN+ELQLER
    VSEQFTAM°FR
    FPGQLNSDLR
    KLAVNLIPFPR

BlSpo120 (submitted/identified spectra: 7/3)
  SwissProt/NCBInr, 30% (131): 60S acidic ribosomal protein P0-C (4); Arabidopsis thaliana; Q8LAM3; Brassicaceae, Brassicales
    QGDKVGSSEAALLAK
    GTVEIITPVELIK
  SwissProt/NCBInr, 30% (128): Putative 60S acidic ribosomal protein P0 (2); Zinnia elegans; Q8LNX8; Asteroideae, Asteraceae
    GDKVGSSEAALLAK
    GTVEIITPVELIK

BlSpo141 (submitted/identified spectra: 4/3)
  SwissProt/NCBInr, 25% (62): Fructose-bisphosphate aldolase (16); Oryza sativa; ADRZY; Oryzeae, Ehrhartoideae
    GILAADESTGTIGK
  dbEST, 50% (112): Fructose-bisphosphate aldolase (112); Gossypium hirsutum; 5044236; Malvoideae, Malvaceae
    GILAADESTGTIGK
    TASGKPFVDVLK
  dbEST, 50% (102): Fructose-bisphosphate aldolase (2); Solanum tuberosum; gi|13179497; Solanaceae, Solanales
    VAPEVIAEYTIR
    GILAADESTGTIGK

BlSpo142 (submitted/identified spectra: 4/2)
  SwissProt/NCBInr, 0
  dbEST, 25% (71): Stress-induced protein (3); Malus × domestica; 3087933; Rosaceae, Rosales
    QVLTDFQENPK
  dbEST, 50% (55): Stress-induced protein (2); Mesembryanthemum crystallinum; 18022002; Aizoaceae, Caryophyllales
    Q+VLVDFQ+ENPK
    KGAVQFFM°K

BlSpo153 (submitted/identified spectra: 6/0)
  SwissProt/NCBInr, 0%; dbEST, 0%: no identification
    [1594.7] (276.2)VGNVHSV (547.3)
    [1134.5] (564.4) DPVLK
    [1074.5] DPYLLEGLR
    [827.4] ASLVD(VP)K
    [972.5] LLP(650.3)
    [846.4] LLEPFTK

  • Redundant spectra are omitted; [identified spectra]; C = carbamidomethylcysteine; M° = oxidized methionine; N+, Q+ = deamidation.

Acknowledgements

K. L. is a Research Assistant and Dr. E. W. is a Postdoctoral Fellow, both of the Fund for Scientific Research—Flanders (Belgium) (F.W.O.—Vlaanderen). This research was supported by the “Interuniversity Attraction Poles Programme—Belgian State—Federal Office for Scientific, Technical and Cultural Affairs”. The F.W.O.—Flanders and the University of Antwerp are also acknowledged for financial support.
