Intrinsically disordered (ID) proteins function in the absence of a unique stable structure and appear to challenge the classic structure-function paradigm. The extent to which ID proteins take advantage of subtle conformational biases to perform functions, and whether signals for such mechanisms can be identified in proteome-wide studies, is not well understood. Of particular interest is the polyproline II (PII) conformation, suggested to be highly populated in unfolded proteins. We experimentally determine a complete calorimetric propensity scale for the PII conformation. Projection of the scale into representative eukaryotic proteomes reveals significant PII bias in regions coding for ID proteins. Importantly, enrichment of PII in ID proteins, or protein segments, is also captured by other PII scales, indicating that this enrichment is robustly encoded and universally detectable regardless of the method of PII propensity determination. Gene ontology (GO) terms obtained using our PII scale and other scales demonstrate a consensus for molecular functions performed by high PII proteins across the proteome. Perhaps the most striking result of the GO analysis is conserved enrichment (P < 10⁻⁸) of phosphorylation sites in high PII regions found by all PII scales. Subsequent conformational analysis reveals a phosphorylation-dependent modulation of PII, suggestive of a conserved “tunability” within these regions. In summary, the application of an experimentally determined polyproline II (PII) propensity scale to proteome-wide sequence analysis and gene ontology reveals an enrichment of PII bias near disordered phosphorylation sites that is conserved throughout eukaryotes.
Nearly 30% of the human genome encodes intrinsically disordered (ID) protein sequences that assume no unique, stable structure under native conditions.1 The prevalence of ID segments presents a challenge to the structure–function paradigm because ID sequences lack stable structure yet perform crucial functions. ID proteins differ in amino acid composition from structured proteins,2 resulting in higher charge-to-hydrophobicity ratios.3 Presently, whether or how compositional differences manifest as conformational propensities that may be tunable, and thus utilized for functions such as signaling, is not well understood. Recent studies examined ID protein sequences to identify molecular recognition features, proposed to be local sequence regions prone to adopt structures important for protein-protein interactions.4, 5
One conformation suggested to be highly populated in the unfolded states of proteins6, 7 and peptide sequences with high net charge8–11 is the polyproline II (PII) conformation. Binding of proline-rich regions that are biased to the PII conformation is vital for cell signaling related to growth and differentiation.12, 13 Presently, it is estimated that the human genome may encode over 500 copies of domains (including SH3, SH2, WW, EVH1, and GYF) that interact with proline-rich regions.14, 15 In fact, genomic analysis has identified proline-rich regions as one of the most commonly encoded motifs in eukaryotes.16 However, concerning the PII propensity of all amino acids (not just proline-rich regions) at a proteome-wide scale, little is known of the distribution of PII bias among protein sequences, the evolutionary conservation of PII bias within protein sequences, or the possible utilization of PII for functions outside of cell signaling.
To address whether sequences select for regions of high PII propensity and potentially utilize PII propensity functionally, a complete, calorimetrically-determined amino acid propensity scale is developed for the PII conformation. Amino acid PII propensity has been the focus of several experimental17–21 and computational22–25 studies. We employ our calorimetrically-determined PII scale and scales of others17–25 to investigate whether PII is enriched in ID proteins, and if so what functional roles such proteins play. Our approach maps amino acid PII propensities onto protein sequences in order to characterize the functional roles of PII at a proteome-wide level.
Results and Discussion
PII propensities for all amino acids were determined using a peptide host-guest system and isothermal titration calorimetry.26–28 The C. elegans Sem-5 (sex muscle five) SH3 (Src-homology 3) domain binds a peptide corresponding to the recognition sequence of its binding partner, Sos (son of sevenless). Importantly, in the Sem-5 SH3-Sos complex, the Sos ligand is bound in the PII conformation29 [Fig. 1(A)]. Substitution at a noninteracting position results in a decrease in binding affinity (Kapp) relative to the wild-type peptide. Comparison of Kapp from isotherms [Supporting Information Fig. 1(A)] reports on the change in binding affinity upon substitution. Because the substitution is made at a site that is surface exposed in the bound complex and does not perturb the binding interface [Supporting Information Fig. 1(B)],26, 29 we can infer that the observed change in binding affinity reflects a change in the conformational equilibrium (Kconf) between binding incompetent and binding-competent (i.e., PII) states of the Sos ligand [Fig. 1(A)]. After correction for effects of the guest residue on cis/trans isomerization of the preceding proline residue using 1H-NMR experiments,30, 31 the measured change in Kapp can be interpreted as a PII propensity26 (Supporting Information). Cyclic sidechains (HIS, PHE, TRP, TYR) and GLY have low PII propensities, while long, charged sidechains (GLN, GLU, LYS) have higher PII propensities [Fig. 1(B)]. However, there is no apparent correlation of PII propensity with any single physico-chemical property.32 Being charged, for example, does not necessarily correspond to high PII propensity, as ASN, ASP, and HIS are average or below [Fig. 1(B)].
Differences in free energy of binding calculated for ALA and GLY correspond to previous measurements,26 and agree with PII propensities determined by others.22, 33 A study using a hard sphere collision model recently evaluated the ability of conformational bias to different regions of Ramachandran space to reproduce experimentally measured binding of ALA and GLY substituted Sos peptide to SH3.34 The results of the simulations indicated that only bias to the PII conformation at rates similar to those measured previously26 and in this study were capable of quantitatively reproducing experimentally determined binding energies.34 In fact, the same hard sphere model used by Whitten et al.34 can quantitatively reproduce the PII propensities for ALA and GLY relative to proline.32 Further, the rank order of PII propensities determined by our calorimetric scheme can be reproduced by CD spectroscopy of representative Sos peptides [Supporting Information Fig. 1(C)]. Together, these data support the validity of our calorimetrically determined PII propensity scale.
The experimental PII propensity scale reported here [Fig. 1(B) and Supporting Information Table I] was used to investigate the PII content of representative proteomes. An algorithm was developed to calculate PII propensity along a given sequence by determining the average PII bias within a sliding window [Fig. 2(A)]. The effectiveness of this approach was determined by examining sequences previously reported to be high in PII [Fig. 2(B–E)]. The high PII regions of human tau,35 the PEVK domain of the enormous human titin protein,36 and the periplasm-spanning domain of the bacterial TonB protein,37 are all detected by our algorithm as being significantly above the average PII propensity. In contrast, the PII propensity calculated along the sequence of an outer membrane protein, known to have a β-barrel structure,38 appears as noise about the average PII propensity [Fig. 2(E)]. The ability of the algorithm to (1) detect high PII regions in systems investigated by others and (2) discriminate between these regions and proteins known not to have significant PII structure, suggests that our algorithm can reasonably detect the level of PII bias in protein sequences.
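The sliding-window calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `window_pii`, the `TOY_SCALE` values, and the `DEFAULT` fallback are hypothetical, and the toy propensities are not the measured values of Fig. 1(B). Only the 60% value for proline follows the Methods.

```python
# Hypothetical propensities (fraction PII), for illustration only.
# These are NOT the measured scale; only P = 0.60 follows the Methods.
TOY_SCALE = {"P": 0.60, "K": 0.45, "Q": 0.45, "E": 0.42, "A": 0.37,
             "S": 0.33, "T": 0.32, "Y": 0.25, "W": 0.25, "G": 0.13}
DEFAULT = 0.35  # assumed value for residues missing from the toy scale


def window_pii(sequence, scale, window=32, default=DEFAULT):
    """Return the mean PII propensity of every full window along a sequence."""
    values = [scale.get(res, default) for res in sequence]
    if len(values) < window:
        return []  # sequence too short for a single full window
    total = sum(values[:window])
    averages = [total / window]
    # Slide one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


if __name__ == "__main__":
    profile = window_pii("GGGGPPPPKKKK", TOY_SCALE, window=4)
    print([round(v, 3) for v in profile])
```

Plotting the returned profile against the window index gives a trace of the kind shown in Fig. 2, with high PII regions appearing as stretches above the dataset average.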
The distributions of average PII propensities for structured protein segments (extracted from the PDB39) and ID regions from DisProt40 (which have been experimentally verified to be disordered or contain disordered segments) reveal important differences (Fig. 3). Although there is considerable overlap in the distributions, ID regions show enrichment of high PII propensity. Specifically, 92% of all windows with an average PII propensity of 40% or more occur in ID regions. All of the most extreme PII segments (i.e., >47% PII) are in disordered sequences. Importantly, this is not to claim that all ID segments are high in PII bias. Notably, ID sequences often contain GLY-rich regions that are on the low end of the PII distribution.
To determine whether the computed enrichment of PII in ID sequences is dominated by a small number of residues, or whether a broad repertoire of amino acids contribute to the signal, PII propensities were recomputed with PRO, LYS, and GLN artificially set to mean values [Supporting Information Fig. 2(A)]. ID segments nonetheless maintained their relative enrichment, indicating that enhancement of PII content is robustly encoded and not an artifact of selecting only PRO, LYS, or GLN rich segments. The observed enrichment was also insensitive to the window size employed to calculate the PII bias [Supporting Information Fig. 2(B,C)]. The enrichment of PII in ID sequences was also observed with PII propensities measured by others17, 19, 22 [Fig. 4(A–C)]. However, the enrichment could not be captured by randomly generated PII scales [Supporting Information Fig. 2(D)]. In summary, the fact that the enrichment observed in Figure 3 could also be captured using other experimental or computational (coil library or molecular dynamics) scales, indicates that the enrichment is strongly encoded and not dependent on the method of PII determination.
We note that Avbelj and Baldwin report that neighboring β-branched or aromatic residues may promote β-strand conformations41 in a coil library. Another study reports that aromatic amino acids may disfavor the PII conformation.21 Our calorimetrically-determined PII propensities are consistent with these data,21, 41 as most cyclic amino acids in our scale have low PII propensities, even with cis/trans isomerization corrections [Fig. 1(B) and Supporting Information Table I]. Differences in amino acid composition between our datasets [Fig. 4(D)] directly support our expectation that aromatic residues contribute only modestly to our analysis.
Importantly, in our sequence analysis, we assume that context dependence of PII propensity is minimal, which is supported by recent studies of blocked dipeptides.20 Correlation of PII propensities reported by Pappu and coworkers22 also suggests that nearest neighbor context has negligible impact on the rank order of PII propensities (Supporting Information Table II). PII propensities calculated in PRO, ALA, GLY, VAL, and PHE contexts are all statistically correlated (P < 0.05), with Spearman coefficients ranging from 0.800 to 0.979 (Supporting Information Table II). While the numerical PII propensities differ in these contexts,22 it is clear that host context does not significantly change the rank order of a PII scale (Supporting Information Table II). These observations support our assertion that low and high PII sequences can be differentiated in the proteome. Most importantly, PII scales derived from host systems of different contexts including biological peptides (this study), proline-rich peptides,17, 22 and dipeptides19 all detect the enrichment of PII bias within ID proteins [Figs. 3 and 4(A–C)], suggesting the coding of PII in these regions is robust despite possible near neighbor effects.41, 42
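To make the rank-order comparison between scales concrete, the following sketch computes a Spearman coefficient between two propensity scales using only the standard library. The function names and the two toy scales are hypothetical and carry no measured values; the significance test reported in Supporting Information Table II is not reproduced here.

```python
from math import sqrt
from statistics import mean


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(scale_a, scale_b):
    """Spearman coefficient over the residues present in both scales."""
    shared = sorted(set(scale_a) & set(scale_b))
    return pearson(ranks([scale_a[r] for r in shared]),
                   ranks([scale_b[r] for r in shared]))


if __name__ == "__main__":
    # Toy scales with different numbers but the same rank order.
    host_1 = {"P": 0.60, "K": 0.45, "A": 0.37, "G": 0.13}
    host_2 = {"P": 0.90, "K": 0.50, "A": 0.40, "G": 0.10}
    print(spearman(host_1, host_2))
```

Two scales can differ numerically, as the text notes for different host contexts, and still return a coefficient near 1 as long as the ordering of residues is preserved.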
Analysis of the amino acid composition of sequence datasets elucidates the ability of multiple PII scales to detect the enrichment of PII in ID segments because, even though the numerical amino acid PII propensities may differ between scales, there is a general consensus regarding which amino acids have high (PRO, LYS, GLN, GLU) and low (HIS, TRP, TYR, PHE) PII propensity. The consensus can be clarified by examining the average rank order of all PII scales, which shows that amino acids with high average PII rank from all scales also tend to be enriched in amino acid composition within ID sequences [Fig. 4(E)]. Further, the PII rank order averaged from all scales correlates (P < 0.05) with the rank order of the TOP-IDP scale, a scale of amino acids proposed to promote disorder43 [Fig. 4(F)], lending further evidence to the hidden correlation between PII scales and their ability to detect enrichment in ID segments.
To investigate whether PII distributions in ID and structured proteins arise from enrichment within specific local sequence stretches, the distributions were compared to artificial sequences constructed by shuffling sequences within each group. PII propensities of shuffled sequences of structured proteins showed no change from the original distribution [Fig. 5(A)], indicating that PII bias is randomly encoded (i.e., not enriched) along sequences of structured proteins, a result consistent with proteome-wide sequence correlations reporting nearly random site-to-site correlations between amino acids in sequences of structured proteins.44 Shuffling of ID sequences, in contrast, reveals a dramatic change (P < 0.05) in the distribution [Fig. 5(B)], suggesting that PII is selectively enriched within particular segments of the ID sequence.
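A shuffling control of this kind might be sketched as follows. Several choices here are assumptions rather than details from the paper: residues are shuffled within each individual sequence, the helper names are hypothetical, and the two distributions are compared with a Welch t statistic (the Methods state only that a t-test was used).

```python
import random
from math import sqrt
from statistics import mean, variance


def window_means(values, window):
    """Mean of every full window of per-residue propensities."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def shuffled_copy(sequence, rng):
    """Same amino acid composition, random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_test(sequences, scale, window=32, default=0.35, seed=0):
    """Return window PII values for native and shuffled sequences."""
    rng = random.Random(seed)
    native, shuffled = [], []
    for seq in sequences:
        native += window_means([scale.get(r, default) for r in seq], window)
        mixed = shuffled_copy(seq, rng)
        shuffled += window_means([scale.get(r, default) for r in mixed], window)
    return native, shuffled


def welch_t(a, b):
    """Welch t statistic for two samples with at least two points each."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))
```

Because shuffling preserves composition, any shift in the window distribution reflects how residues are arranged along the sequence. A native distribution with a heavier high-PII tail than its shuffled counterpart would be consistent with local enrichment.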
To determine the extent to which evolution selected for high PII sequences, in silico evolution was performed to monitor robustness of sequence PII propensities to amino acid substitution. Substitutions were performed by two methods: (1) substituting randomly, but maintaining the dataset amino acid composition, and (2) using the well-established BLOSUM62 matrix,45 which generally preserves physico-chemical properties of substituted amino acids. Regardless of substitution method, the PII propensities of structured protein sequences were maintained [Fig. 6(A)], further supporting the conclusion from sequence shuffling [Fig. 5(A)] that PII in structured proteins is not an evolutionarily selected trait.
Unlike the sequences of structured proteins, substitution of ID sequences, either randomly or by BLOSUM62, resulted in a significant decrease in the PII propensity [Fig. 6(B,C)]. These results are consistent with bioinformatics analyses suggesting that, relative to sequences encoding structured proteins, conservation of ID within a sequence is more difficult in in silico evolution experiments.46 Sensitivity to substitution, even when the physico-chemical properties are maintained, indicates that the high PII segments occupy a small, highly specialized sequence space that has evolved specifically to preserve PII propensity.
What function do these specialized, high PII proteins perform? To address this question in a global, systematic, and unbiased way, the PII content of six eukaryotic proteomes was calculated [Fig. 7(A)], and gene ontology (GO) analysis was performed for the top 1% of PII proteins from each proteome. The proteins in the top 1% of each eukaryotic proteome exhibited strong conservation of features and functions. Of the top GO terms returned in order of statistical enrichment, many were identical in all proteomes [Fig. 7(B) and Supporting Information Table III]. High PII proteins are associated with a diverse array of functions (Supporting Information Table III) rather than one specialized purpose. One GO term of note that was reproducibly enriched was “collagen,” an archetypal PII triple helix.47 In addition, transcription regulation also appears to employ high PII proteins, consistent with the observation that transcription factors are enriched in ID.48 Similar GO terms were obtained with other PII propensity scales that utilize experimental or computational (coil library and molecular dynamics)17–25 methods [Fig. 7(C)]. A consensus on the GO features of high PII proteins was evident from comparison of results obtained using different PII scales, in stark contrast to the results obtained using randomly selected protein sets.
A striking feature of the analysis is that high PII proteins have a remarkable propensity for phosphorylation (P < 10⁻⁸) [Fig. 7(B)], adding clarity to the comparatively weak correlation observed between phosphorylation sites and disorder.49 Statistical enrichment of the “phosphoprotein” GO term was robust, being observed independent of including PRO, SER, THR, or TYR in the calculation of the top PII proteins in the proteome [Fig. 7(D)]. Enrichment of the “phosphoprotein” GO term among the highest PII sequences in the proteome was also detectable using other PII scales17–25 [Fig. 7(D)]. Calorimetric determination of the impact of phosphorylation of SER, THR, and TYR [Fig. 8(A)] revealed amino acid specific effects. While the PII propensities of SER and TYR are not affected by phosphorylation (within error), a dramatic increase in the PII propensity of phospho-THR was observed, reaching a value comparable to the high PII propensity seen for PRO residues.
Investigation of the origin of this effect reveals that the steric consequences of phosphorylation at THR are significantly higher than with SER, a result that is not altogether unexpected given that THR phosphorylation introduces additional bulkiness to the β-carbon (Supporting Information Fig. 3). Such changes are qualitatively similar to mutational strategies that modulate accessible conformations by introduction or removal of β-branched amino acids.50 Phosphorylation of SER or TYR, on the other hand, produces no such effect. We note that in the context of the Sos peptide system utilized here, SER, THR, and TYR are observed to behave differently upon phosphorylation. In other contexts, however, phosphorylation or other post-translational modification may have other effects on conformational propensity. In any case, our results [Fig. 8(A)] indicate that post-translational modifications have a capacity to dramatically change local bias toward the PII conformation.51 Biologically, this result provides a compelling mechanism for local “tuning” of backbone conformational bias, which may be exploited for multiple functions including molecular recognition, targeting for degradation, or allosteric regulation.
Although gene ontology reveals that high PII proteins are often phosphorylated, it is not clear whether these proteins are phosphorylated within high PII regions or at other locations in the protein. Analysis of the PII propensities of known, experimentally validated phosphorylation sites52, 53 indicates that phosphorylation sites are indeed coincident with higher PII segments [Fig. 8(B)]. The contexts of phosphorylation sites are diverse, consisting of regions composed of charged residues as well as PRO-rich regions containing few charged residues. Analysis of phosphorylation site density as a function of PII propensity reveals a striking enrichment of phosphorylation sites in both high and low PII contexts [Fig. 8(C)]. The increased density at high PII is surprising given that the PII propensities of phosphorylation-competent residues (SER, THR, and TYR) have average or low PII propensity [Fig. 1(B)]. Of note is that the enrichment in phosphorylation site density is common throughout many eukaryotic proteomes, including human [Fig. 8(C)], mouse, fly, and yeast (Supporting Information Fig. 4). Because the majority of phosphorylation sites occur at SER, these sites dominate the enhancement distributions [Fig. 8(C) and Supporting Information Fig. 4]. Inspection of THR phosphorylation site density, however, shows a distinct preference for localization in high PII contexts. The enrichment of THR in the high PII contexts, which is the only residue to show a phosphorylation dependent increase in PII propensity (i.e., tunability) in our host guest system, strongly suggests that nature has specifically selected THR for tuning of local conformational bias in high PII regions. It is remarkable that THR exhibits alternate distribution relative to SER, particularly in light of a recent study suggesting THR sites evolve at different rates than SER or TYR sites.54
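The site-density analysis can be illustrated with a short sketch that counts phosphorylation sites per PII bin and normalizes by the number of sequence windows in that bin, following the normalization described in the Methods for Fig. 8(C). The function name, the bin width, and the representation of a site by the PII percentage of its surrounding window are assumptions made for illustration.

```python
from collections import Counter


def site_density(window_pii, site_pii, bin_width=1.0):
    """Phosphorylation sites per window in each PII bin.

    window_pii: PII percentage of every window in the proteome.
    site_pii:   PII percentage of the window around each known site.
    Returns {lower bin edge: sites / windows} for bins containing windows.
    """
    windows = Counter(int(v // bin_width) for v in window_pii)
    sites = Counter(int(v // bin_width) for v in site_pii)
    return {b * bin_width: sites[b] / n for b, n in sorted(windows.items())}


if __name__ == "__main__":
    # Toy numbers only, not proteome data.
    all_windows = [30.2, 30.7, 35.1, 35.4, 35.9, 40.1]
    site_windows = [30.5, 40.3]
    print(site_density(all_windows, site_windows))
```

Running the function separately for SER, THR, and TYR sites would give one density curve per residue type, which is the comparison the text draws between THR and SER.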
The distributions of site densities observed in Figure 8(C) (also observed in other eukaryotes, Supporting Information Fig. 4) prompted investigation of the correlations of PII propensity with phosphoprotein functions at a proteome-wide level. Figure 9 reports a condensed representation of the different biological processes and molecular functions associated with phosphoproteins whose experimentally validated phosphorylation sites reside in either low or high PII contexts. The low PII phosphoproteins are typically proteins involved in kinase or transferase activity [Fig. 9(A,B)]. These proteins employ SER, THR, and TYR phosphorylation sites. The relative enrichment of TYR sites is expected from the site density [Fig. 8(C)], and the identities of the low PII phosphoproteins (kinases and transferases) are consistent with lower PII sequences tending to have globular structures [Figs. 3 and 8(B)]. TYR enrichment is also in agreement with amino acid composition bias we have noted for structured proteins [Fig. 4(D)]. In contrast, the high PII phosphoproteins rarely employ TYR sites in the human proteome [Figs. 8(C) and 9(C,D)]. Close examination of Figure 9(C,D) reveals that SER and THR sites exhibited very similar GO term enrichment, with “macromolecular complex assembly” [Fig. 9(C)] being the only GO term exclusively enriched in high PII THR sites. High PII phosphoproteins are enriched in biological processes and molecular functions associated with mitosis, chromosome organization, cell cycle regulation, and transcription (consistent with previous results suggesting transcription factors are prone to be ID48) [Fig. 9(C,D)]. Comparing the difference in amino acid utilization (color) and the GO terms listed in Figure 9, it is immediately evident that low and high PII phosphoproteins perform different cellular functions. Perhaps even more interesting, we note that the GO terms enriched for high PII phosphoproteins [Fig. 9(C,D)] differ slightly from those observed for all high PII proteins from the proteome at-large (Fig. 8 and Supporting Information Table III), suggesting that modification (phosphorylation, in this case) within high PII context may be selected for a specialized functional utility.
Here we demonstrate for the first time a proteome-wide correlation between an experimentally determined conformational bias for PII and the propensity to be intrinsically disordered and thus unfolded. The functional importance of this relationship is revealed through a dramatic enrichment of phosphorylation sites within high PII segments. Proline-directed phosphorylation sites contribute to the enrichment of phosphorylation sites within high PII segments. Yet, there are hundreds of experimentally validated phosphorylation sites that also contribute to the enrichment but contain no nearby proline residues. We speculate that the proteome-wide bias of phosphorylation sites to high PII (and therefore likely disordered) segments may be a result of evolutionary pressures to facilitate kinase accessibility to these disordered regions. The conformational biases within these disordered regions may be a thus far unappreciated means of regulating kinase or phosphatase accessibility and, as a consequence, their activity in signaling or other functions.
The conservation of these trends across multiple proteomes and the differential sensitivity of THR and SER to phosphorylation provide a compelling argument for their differential usage. We have endeavored to elucidate the different biological processes and molecular functions for which THR and SER phosphorylation sites have been selected, finding that the functions of these sites in low and high PII contexts are completely different. Not explored, but equally plausible, is the possibility that other post-translational modifications (acetylation, methylation, etc.) may also utilize ID segments and perhaps even differentially impact PII conformational propensity, as is the case with phosphorylation.
Our results reveal a potentially new way that ID proteins can modulate activity. Instead of using post-translational modification to induce a conformational change between two ostensibly discrete conformations (i.e., T and R states) in the context of a folded protein, ID proteins can potentially effect functional changes by tuning the distributions of otherwise disordered states specifically with respect to the PII conformation, as is the case for THR in this study. Whether and how this can be functionally utilized awaits further investigation.
The Sos peptide (Ac-VPPPVPPRRRY) and variants of the peptide carrying a guest residue “X” (Ac-VPPXVPPRRRY) were acquired commercially from GenScript USA or Neo BioSci. Purity of the peptide samples (>98%) was estimated using reverse-phase high-performance liquid chromatography and mass spectrometry. Sem-5 C-terminal SH3 domain from C. elegans was purified as described previously.29 The SH3 used in this study is a pseudo-wild type protein in which CYS 55 has been mutated to ALA to prevent possible oxidation and intermolecular cross-linking.
A Microcal VP-ITC system was used to perform all titration experiments.55 SH3 was dialyzed against phosphate buffer, pH 7.5 (20 mM sodium phosphate, 200 mM sodium chloride (Fisher)). Lyophilized peptides were dissolved in buffer from the final protein dialysis. Protein and peptide sample concentrations were determined using the Edelhoch method.56 Protein samples for titration experiments ranged in concentration from 0.5 to 0.65 mM, and peptide concentrations were approximately 10 times the protein concentration. All Sos peptide variants exhibited similar solubility. At 25°C, a series of 8 µL injections was made (34–35 total injections), with a spacing of 280 s between injections for equilibration. An initial injection of 2 µL was made and the data discarded for each titration to account for heat anomalies caused by instrument equilibration and pre-titration mixing by diffusion. Data were corrected for ligand heat of dilution by performing a titration of peptide into buffer and directly subtracting the resulting heats. These corrected data were fit in Origin 7 (OriginLab) using a nonlinear least squares regression varying the stoichiometry (n), binding constant (Kapp), and the molar heat of binding (ΔH). The apparent free energy of binding (ΔGapp) and the entropy (ΔS) were calculated using the best-fit binding constant (Kapp) and the thermodynamic relation below:

$$-RT \ln K_{app} = \Delta G_{app} = \Delta H - T\Delta S \qquad (1)$$

where R is the gas constant (1.985 cal/K/mol) and T is the temperature (298 K). PII propensities were determined from the ΔΔGapp of each mutant relative to wild type (proline) as described previously.26–28 Error propagation for the PII propensity scale was calculated in Python, where the error is the standard deviation in the difference in ΔGapp from the fit Kapp.
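As a worked illustration of Eq. (1), the sketch below converts fitted binding constants into ΔGapp, TΔS, and ΔΔGapp, using the values of R and T stated above. The function names and the example binding constants are hypothetical, and the first-order error propagation shown is a standard textbook form assumed here, not a reproduction of the authors' script. The further conversion of ΔΔGapp into a PII propensity is described in the Supporting Information and is not attempted.

```python
from math import log, sqrt

R = 1.985e-3  # kcal/(K·mol), the gas constant value quoted in the Methods
T = 298.0     # K


def delta_g(k_app):
    """Apparent free energy of binding (kcal/mol) from Kapp (1/M)."""
    return -R * T * log(k_app)


def t_delta_s(k_app, delta_h):
    """Entropic term TΔS (kcal/mol) from Eq. (1): TΔS = ΔH − ΔGapp."""
    return delta_h - delta_g(k_app)


def delta_delta_g(k_variant, k_wild_type):
    """ΔΔGapp of a guest variant relative to the wild-type (PRO) peptide."""
    return delta_g(k_variant) - delta_g(k_wild_type)


def sigma_delta_g(k_app, sigma_k):
    """First-order propagated error in ΔGapp (assumed form: RT·σK/K)."""
    return R * T * sigma_k / k_app


def sigma_delta_delta_g(k_variant, s_variant, k_wild_type, s_wild_type):
    """Errors of the two fits combined in quadrature (assumes independence)."""
    return sqrt(sigma_delta_g(k_variant, s_variant) ** 2
                + sigma_delta_g(k_wild_type, s_wild_type) ** 2)


if __name__ == "__main__":
    # Hypothetical binding constants, not measured values.
    print(delta_delta_g(k_variant=1.0e4, k_wild_type=1.0e5))
```

A tenfold drop in Kapp corresponds to ΔΔGapp = RT ln 10, so a positive ΔΔGapp indicates weaker binding of the guest variant than of the wild-type peptide.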
All CD scans were performed on a Jasco J-720 spectropolarimeter. Sos peptide samples were prepared by diluting ITC ligand solutions (with identical buffer) to concentrations suitable for CD, ∼0.1 mg/mL. As such, the buffer conditions were identical to ITC (20 mM sodium phosphate, 200 mM sodium chloride, pH 7.5). Spectra were measured at 298 K from 200 to 250 nm, at a scan rate of 10 s/nm. Data were collected in nanometer increments and represent an average of three scans.
A detailed derivation is provided in the Supporting Information to explain how PII propensities are determined.
Calculation of PII propensity in sequences and in silico evolution of PII propensity
Several protein sequence datasets were employed for the analysis of the PII content of the proteome, including a nonredundant set of human protein sequences extracted from the PDB39 consisting of proteins from each SCOP family,57 an ID protein dataset, DisProt 5.5,40 and the complete proteomes of six eukaryotes obtained from the Integr8 project:52 H. sapiens (human), M. musculus (mouse), D. melanogaster (fly), C. elegans (worm), A. thaliana (plant), and S. cerevisiae (yeast). Algorithms for calculating the PII propensities of amino acid sequences were written in C++ and Python, with additional data processing in Perl, the R Project, and Microsoft Excel. The PII bias at a specific position was computed as an average of the PII propensities for a given window. The process is customizable: different propensity scales, inclusion or exclusion of amino acids, and window sizes (1–100) have been tested (Fig. 2 and Supporting Information Fig. 2). The calculations can be performed ignoring the contributions of specific amino acids (such as proline) and with any window size. Statistical differences between PII propensity distributions were assessed by a t-test. Calculations shown in Figures 2 and 5 were performed over a window size of 32 residues, conservatively assuming 60% PII propensity for proline. Each distribution comprises over 40,000 data points. In Figure 4(A–C), a window size of 10 residues was used. Statistical significance of correlations shown in Figure 4(F) was assessed by mathematical standards.58
An algorithm was developed to compare the PII content of a biological “parent” sequence to those created by random substitution (maintaining background amino acid frequencies of the datasets) or by substitution that conserves physico-chemical properties, using the BLOSUM62 or PAM substitution matrices.45, 59 To quantify the effect of mutation on PII content of a sequence, the average PII propensities of “parent” and the in silico evolved “daughter” sequences were calculated as described above. Substitution within the algorithm is completely adjustable, and can involve any part of the sequence, maintaining any arbitrary level of identity to the parent sequence, and with any input substitution frequencies. The means and standard deviations of the PII distributions for in silico evolved “daughter” sequences (n = 10,000) were computed, from which the z-score of the average PII propensity of the “parent” sequences was calculated. The z-score was converted into a p-value, the logs of which are plotted in Figure 6. In all cases except the PII distribution of high PII “daughter” sequences generated with BLOSUM62,45 the distributions were normal. Code and scripts for shuffling tests, in silico evolution, and analyzing the PII content of the sequences were written in Python.
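The random-substitution variant of this procedure might look like the sketch below. It is a simplified illustration under several assumptions: the function names, the `identity` parameter, and the default propensity are hypothetical; the p-value is taken as the one-sided upper tail of a normal distribution (the text does not state sidedness); and substitution by BLOSUM62 or PAM is not implemented.

```python
import random
from statistics import NormalDist, mean, stdev


def mean_pii(sequence, scale, default=0.35):
    """Average PII propensity over a whole sequence."""
    return mean(scale.get(res, default) for res in sequence)


def mutate(parent, identity, background, rng):
    """Substitute a fraction (1 - identity) of positions, drawing replacements
    from background frequencies. A draw may return the original residue, so
    the realized identity can be slightly higher than requested."""
    n_sub = round(len(parent) * (1.0 - identity))
    residues = list(background)
    weights = [background[res] for res in residues]
    daughter = list(parent)
    for pos in rng.sample(range(len(parent)), n_sub):
        daughter[pos] = rng.choices(residues, weights=weights)[0]
    return "".join(daughter)


def parent_vs_daughters(parent, scale, background, identity=0.7,
                        n_daughters=10000, seed=0):
    """z-score of the parent's mean PII against the daughter distribution,
    with a one-sided p-value (assumes the daughter scores are normal)."""
    rng = random.Random(seed)
    scores = [mean_pii(mutate(parent, identity, background, rng), scale)
              for _ in range(n_daughters)]
    z = (mean_pii(parent, scale) - mean(scores)) / stdev(scores)
    p = 1.0 - NormalDist().cdf(z)
    return z, p
```

A large positive z-score means substitution tends to lower the PII propensity relative to the parent, which is the behavior reported for ID sequences in Fig. 6. For very large z the computed p-value can underflow to zero, so its logarithm should be guarded before plotting.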
Gene ontology of high PII proteins
The database for annotation, visualization, and integrated discovery (DAVID) was used to identify enriched features and functions of the top 1% of high PII proteins in eukaryotic proteomes obtained from Integr8.52 To assess GO term enrichment in the top 1% of proteins selected for GO analysis (n = 200), a one-tailed Fisher exact test was used,60 and P-values of enrichment reported by DAVID61, 62 were normalized to P-values that could be obtained by submission of random protein datasets (10⁻³). Figure 7(B) reports the number of GO terms returned with significance (P ≤ 10⁻⁶), with “phosphoprotein” enriched (P ≤ 10⁻⁸) in all three species. The exact normalized p-values for “phosphoprotein” enrichment in H. sapiens, M. musculus, and S. cerevisiae [Fig. 4(B)] were P = 7.6 × 10⁻¹¹, P = 5.4 × 10⁻⁸, and P = 8.7 × 10⁻¹⁴, respectively. Code and scripts for mining the proteomes and analyzing the sequences were written in Python and Perl. DAVID output was processed manually in Microsoft Excel. Calculations for Figures 7–9 were performed over a window size of 50 residues, assuming 60% PII propensity for proline. Each distribution comprises over 10,000 data points. Gene lists of proteins extracted from the proteome based on their PII propensities were nonredundant. High PII proteins (top 1% for GO analysis) had the longest continuous segments of high PII bias. Shown in Figure 7(D) is the log of the p-value for “phosphoprotein” divided by log 10⁻³ (noise). In Figure 8(C), the number of phosphorylation sites of each type (SER, THR, or TYR) is normalized by the number of peptides in each PII bin from the H. sapiens proteome. Phosphoproteins were classified as low PII (PII% < 31%, one standard deviation below the mean) or high PII (PII% > 38%, one standard deviation above the mean) based upon the sequence PII context of their experimentally validated phosphorylation sites.
The sizes of the pie pieces in Figure 9 correspond to the relative statistical enrichment of the GO term obtained from DAVID for phosphoserine, phosphothreonine, and phosphotyrosine containing proteins that were then grouped in the pie representation to show relative enrichment. GO terms grouped as “Other” (Fig. 9) were individually statistically enriched (P < 0.05), yet relative to other GO terms comprised less than 1% (individually) of the pie and were grouped for clarity.
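The enrichment statistics described in this section can be illustrated with a plain one-tailed Fisher exact test, the noise normalization used for Figure 7(D), and the low/high PII thresholds quoted above. This is a minimal sketch with hypothetical function names and toy counts; it does not reproduce DAVID's own scoring, only the textbook hypergeometric tail.

```python
from math import comb, log10


def fisher_enrichment_p(hits_in_list, list_size, hits_in_background,
                        background_size):
    """One-tailed P: chance of seeing at least `hits_in_list` annotated
    proteins in a list of `list_size` drawn from the background."""
    upper = min(list_size, hits_in_background)
    tail = sum(comb(hits_in_background, k)
               * comb(background_size - hits_in_background, list_size - k)
               for k in range(hits_in_list, upper + 1))
    return tail / comb(background_size, list_size)


def noise_normalized(p_value, noise=1e-3):
    """log(P) divided by log(noise), as plotted in Figure 7(D)."""
    return log10(p_value) / log10(noise)


def classify_site(pii_percent, low=31.0, high=38.0):
    """Label a phosphorylation site by the PII context thresholds above."""
    if pii_percent < low:
        return "low"
    if pii_percent > high:
        return "high"
    return "intermediate"


if __name__ == "__main__":
    # Toy counts only: 120 of 200 listed proteins annotated, versus
    # 6,000 of 20,000 proteins in a hypothetical background.
    p = fisher_enrichment_p(120, 200, 6000, 20000)
    print(p, noise_normalized(p))
```

A normalized value above 1 indicates enrichment stronger than the 10⁻³ level attributed in the text to random protein sets.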
The authors acknowledge James O. Wrabl and Jiin-Yu Chen for reviewing the manuscript and George D. Rose for insightful discussions. WAE, TPS, and VJH designed the research. WAE, TPS, and AJC performed the work and analyzed the data. WAE and VJH wrote the paper.