Deep Learning Empowers the Discovery of Self‐Assembling Peptides with Over 10 Trillion Sequences

Abstract Self‐assembling of peptides is essential for a variety of biological and medical applications. However, it is challenging to investigate the self‐assembling properties of peptides within the complete sequence space due to the enormous sequence quantities. Here, it is demonstrated that a transformer‐based deep learning model is effective in predicting the aggregation propensity (AP) of peptide systems, even for decapeptide and mixed‐pentapeptide systems with over 10 trillion sequence quantities. Based on the predicted AP values, not only the aggregation laws for designing self‐assembling peptides are derived, but the transferability relation among the APs of pentapeptides, decapeptides, and mixed pentapeptides is also revealed, leading to discoveries of self‐assembling peptides by concatenating or mixing, as consolidated by experiments. This deep learning approach enables speedy, accurate, and thorough search and design of self‐assembling peptides within the complete sequence space of oligopeptides, advancing peptide science by inspiring new biological and medical applications.

Both pentapeptides and decapeptides with α-helix conformation have lower total energy in simulated conditions (Figure SD2a, b).Although most peptides with α-helix conformation carry higher bonded energies (Figure SD2c, d), the non-bonded interactions (Figure SD2e, f) dominate the self-assembling and final morphology of peptides in solvent 5 .Therefore, α-helix should be a reasonable starting conformation in CGMD simulations (Figure 1a).suggesting that α-helix is a reasonable starting conformation in aggregation simulation.
To further identify the effect of conformation on aggregation, we perform simulations on randomly selected 620 pentapeptides with both β-strand and α-helix conformations for 125 ns.
The mean absolute difference of 620 AP values between the two conformations is 0.08 (Figure SD3a), evidencing that most pentapeptides with different conformations generate similar AP.mean absolute difference of 0.08 is achieved.The AP difference is divided into 40 bins with a bin size of 0.01878, and 70% of peptides are found to have an absolute AP difference of 0.10, 86% are within the absolute difference of 0.15.b.AP difference of 184 peptides (selected from a with .AP α -AP β .> 0.098) calculated at 500 ns.A mean absolute difference of 0.12 is achieved.c.AP of both β-strand and α-helix conformation calculated at 125 and 500 ns.APs' remain relatively constant, with 500-ns AP just slightly larger than 125-ns AP due to longer simulation time promoting further aggregation.It is thus reasonable to choose 125 ns as simulation duration in the initial phase of generating training data, as a good compromise between computational cost and simulation accuracy.
To eliminate the non-convergence effect on AP difference due to simulation time, we pick 184 pentapeptides with AP difference between different conformations larger than 0.098 at 125 ns, then perform aggregation simulations for another 375 ns to a total of 500 ns.The mean absolute difference between the two conformations of 184 pentapeptides is 0.12 at 500 ns (Figure SD3b).We thus conclude that setting different conformations has negligible effect on the AP of short peptides calculated by CGMD simulations.Comparing 125-ns and 500-ns AP of the 184 peptides, they remain relatively constant with 500-ns AP slightly larger than 125-ns AP (Figure SD3c), but no peptides are observed to change from "no aggregating" to "strongly aggregating" or vice versa, therefore it is appropriate to choose 125 ns for generating training data, as a reasonable compromise between computational cost and simulation accuracy.
In addition to initial energy of a single peptide in water solvent, energy of peptide with two conformations after aggregation is also examined, as an alternative approach to investigate possible secondary structure transition, which is calculated by averaging the energies over the last 150 ns duration (from 350 ns to 500 ns, Figure SD4a).It is found that 59 peptides out of the total 184 peptides with b-strand conformation have lower total energy than that of -helix conformation after aggregation (Figure SD4b), although all peptides have higher initial energy with b-strand conformation.This could indicate a transition of secondary structure from -helix to b-sheet or the formation of intermolecular b-sheet conformation during self-assembling process 6 .Investigation of secondary structure transition/formation during self-assembly falls out of scope of this research and will be conducted in future work.
Examining the difference of Lennard-Jones potential with respect to AP difference (Figure SD4c), a clear linear relation can be observed as marked by red line, indicating that from the energy perspective, part of peptides with b-sheet conformation could bear stronger propensity for aggregation.The inset of Supplementary Figure SD4c illustrates no discernable discrepancy between Lennard-Jones potential difference and total energy difference, demonstrating that the non-bonded interactions of Lennard-Jones potential dominates the self-assembly process of studied pentapeptides, which coincides well with available literature 5 .
It should be acknowledged that self-assembly of peptides can be driven synergistically by various interactions, such as hydrogen bonding, - stacking, hydrophobicity, Coulombic interaction, etc., and this Lennard-Jones interaction can be different combinations of these interactions (exclusive of Coulombic interaction).Regarding peptides with charged amino acids, the Coulombic interaction is found more than one order of magnitude smaller than the Lennard-Jones potential, and the bonded interaction (bond stretching, bond angle, proper dihedrals, and improper dihedrals interactions) is generally one order of magnitude smaller than the non-bonded interactions (Lennard-Jones and Coulombic) for all peptides.

SD2. Improvement of score function AP H (to AP HC )
Since AP values generated by CGMD cannot provide sufficient details for distinguishing between aggregation, precipitation, crystallization, or self-assembly to ordered structure, a hydrophilicity-corrected score AP H 7 has been developed as AP H = AP' α  (logP)' β for selecting appropriate peptides for more practical use.The prime symbol denotes normalization between 0 to 1 and logP is hydrophilicity calculated directly from the Wimley-White whole-residue scale 8 .We list the rank of AP H 2-1 (α = 2, β = 1) and AP H 2-0.5 (α = 2, β = 0.5) scores in Table SD1 SD5a).The original score function then assigns high scores to peptides with both high AP' and logP', leading to overestimation of scores of soluble peptides.
It can be deduced that a high logP' indicates high solubility and the peptides will thus possibly not aggregate in real world, while the CGMD simulation with Martini force field is incapable of describing intermolecular interactions with full precision and could generate high AP scores for peptides with high logP', inducing bias on selecting peptides with high solubility.
However, it should be noted that peptide self-assembling can be promoted by various interactions other than hydrophobicity, such as hydrogen bonding, - stacking, electrostatic interactions, etc., thus, it is possible that a high logP' incurs relatively high AP values (such as KFAFD and VKVEV).It is also worth being noted that for aggregated peptides in experiment, the ratio of AP prd '/logP' is 1.4085, while for non-aggregated peptides, the ratio is 1.0980, which is quite reasonable since with a relatively small logP' (i.e., neither soluble nor insoluble) and relatively large AP (i.e., medium aggregation without precipitation), the peptides have the highest possibility of self-assembling.The ratio can be a reference value in selecting self-assembling peptides and has been listed in Table SD1 as well.
To correct the bias of the original score functions to highly soluble peptides, we introduce a penalty factor for hydrophilicity, i.e., , to the original score function: Here, α = 2 and β = 0.5.µ (= 0.4113) and σ 2 (= 0.0657) are the average and variance of logP' of aggregated peptides among the 26 pentapeptides aforementioned, respectively.
In the selection of aggregated peptides or the exclusion of non-aggregated peptides, we want the rankings (the score is ranked in descending order, with the highest score having a rank of 1, while the lowest score having a rank of the last) of the scores of possibly aggregated peptides to increase while those of non-aggregated peptides to decrease as much as possible.
Comparing the original and corrected AP H (Figure SD5b, c and undesired bias can be introduced 9 .Another possible approach is to select a mass of peptides with great diversity based on the current score function, and experimentally synthesize them for examination of aggregation or not, and subsequently train AI models bearing the underlying physical laws of aggregation for prediction.With increasing amount of data, the performance of AI models could be iteratively improved and capable of suggesting peptide sequences with greater aggregation propensity or predicting the aggregation propensity of interested peptides.
= 1 (0) denotes aggregation (non-aggregation) in experiment; logP' is rescaled hydrophilicity between 0 to 1; AP prd , AP sim , AP prd ', and AP rep are predicted AP, simulated AP, rescaled predicted AP, and reported AP values in reference 9, respectively; Ratio is calculated by AP prd '/logP', which can assist in selecting aggregating peptides; r-AP H 2-1 and r-AP H 2-0.5 are the ranking of AP H value calculated by original score functions with parameters of α = 2, β = 1 and α = 2, β = 0.5, while r-AP HC is the ranking of AP HC score calculated by the revised score function with parameters in Equation 1, and the percentage is the improvement of r-AP HC compared to r-AP H 2-0.5 .For aggregated peptides, it is calculated by (r-AP H 2-0.5r-AP HC )/r-AP H 2-0.5 , while for non-aggregated peptides, it is calculated by (r-AP HCr-AP H 2-0.5 )/r-AP H 2-0.5 , the minus sign denotes performance degradation of r-AP HC compared with r-AP H 2-0.5 .

Figure SD1 .
Figure SD1.Evaluation of total energy of a single pentapeptide/decapeptide with either α or β conformation in water at 300 K (example: YNWAT with β-strand conformation).

Figure
Figure SD2.a-b.Total energy of a single pentapeptide/decapeptide with either α or β conformation in water at 300 K.Both pentapeptides and decapeptides with α-helix exhibit lower total energy.c-d.Bonded energy.Most pentapeptide and decapeptides have higher bonded energy with α-helix conformation.e-f.Non-bonded energy.All pentapeptides and decapeptides have lower non-bonded energy with α-helix conformation in water at 300 K,

Figure
Figure SD3.a. AP difference (AP α -AP β , filled circles) between β-strand and α-helix conformation and number distribution (blue-edged diamonds) of 620 pentapeptides within each AP difference bin, calculated at 125 ns.A maximum of 0.40, a minimum of -0.35, and a

Figure
Figure SD4.a. Energy evolution during the self-assembling process of 150 individual PYYAL  (α-helix conformation) peptides in water for 500 ns.b.Final energy of 184 pentapeptides after aggregation, averaged by both the number of peptides in aggregation simulation (i.e., 150) and simulation duration from 350 ns to 500 ns.59 pentapeptides with b-strand conformation has lower energy after aggregation, indicating possible transitions of secondary structure during self-assembly process.c.Lennard-Jones potential energy difference with respect to AP difference, with the inset showing total energy difference versus Lennard-Jones potential energy difference.The Lennard-Jones potential dominates the self-assembly process and part of peptides exhibit stronger aggregation propensity with b-strand conformation.
), introducing the penalty factor improves the ranking of most aggregated peptides, while lowering that of non-aggregated peptides, leading to more accurate selection.Although the rankings of aggregated pentapeptides VVVVV, VKVFF, VKVEV and non-aggregated pentapeptides VKVKV, WKPYY, PTPCY, and PPPHY (indicated by symbol  in FigureSD5b, c) have been lowered due to the large deviation of logP' from the mean value, the ranking is still close to the original one.Care thus should be taken that penalty imposed here may degrade the ranking of peptides with relatively high AP values containing V, P, and Y amino acids with logP' values deviating from mean value.In general, the penalty factor improves the performance of the original score function (19 over 26 peptides) by decreasing the weight of logP in the more hydrophilic range.There is no doubt that the score function can be further improved by either optimizing the weighting factor or including more parameters denoting other interactions or the formation of β-sheet structures or ordered morphologies after aggregation.For example, Rohit Batra has tentatively included a β-sheet propensity parameter (reported by Chou and Fasman10 ) with an associated weight in the score function and found the performance of score function is improved, but selection of associated weight of β-sheet propensity is still empirical

Figure
Figure SD5.a. Distribution of logP' of aggregated and non-aggregated peptides.A mean value of 0.4113 and 0.5346 is obtained for aggregated and non-aggregated peptides.b, c.Comparison of original and corrected AP H (AP H 2-0.5 vs AP HC ) score for aggregated and non-aggregated peptides, respectively.AP HC score exhibits performance improvement over 19 peptides out of the total 26.The color bar represents the performance improvement, calculated by (r-AP H 2-0.5r-AP HC )/r-AP H 2-0.5 for aggregated peptides, and by (r-AP HCr-AP H 2-0.5 )/r-AP H 2-0.5 for non-aggregated peptides.

Figure 5 '
Figure S1.a. Relation between AP H 2-0.5 ' and logP'.Four ranges of AP H 2-0.5 ' is indicated by low-E, medium-F, medium-high-G, and high-H.The number of peptides in each range is 633202, 1129893, 1386777, 50128.b.Distribution of AP H 2-0.5 ' within four AP H 2-0.5 ' ranges.c.Distribution of logP' within four AP H 2-0.5 ' ranges.d.Distribution of 20 amino acids in within four AP H 2-0.5 ' ranges (E, F, G, and H).The amino acids have been divided into five groups according to their contribution to aggregation, in terms of AP prd '.

Figure
Figure S2.a. Relation between AP HC ' and logP'.Four ranges of AP HC ' is indicated by J, K, L, and M. The number of peptides in each range is 708051, 1232471, 1224516, 34962.b.Distribution of AP HC ' within four AP HC ' ranges.c.Distribution of logP' within four AP HC ' ranges.d.Distribution of 20 amino acids within four AP HC ' ranges (J, K, L, and M).The amino acids have been divided into five groups according to their contribution to aggregation, in terms of AP prd '.

Figure S3 .
Figure S3.Distribution of each amino acid at each position, calculated based on AP prd '.

Figure S4 .Figure S6 .
Figure S4.Distribution of each amino acid at each position, calculated based on AP H 2-0.5 '.

Figure S7 .
Figure S7.Photographs and TEM images of WWWWW, FFFFF, and YYYYY peptides in water.WWWWW forms suspensions with amorphous precipitates while FFFFF and YYYYY form hydrogels with nanosheet and nanofiber structures, respectively.All results are obtained at pH = 7 after 8 hours of standing at room temperature (R.T.).

Figure S8. a .
Figure S8.a. Transferability relation between AP of decapeptides (AP deca ) and averaged AP of two pentapeptides [AP avepen = (AP penta1 + AP penta2 )/2].b.Transferability relation between AP of mixed pentapeptides (AP mixpen ) and AP avepen .c-d.Distribution of AP difference, i.e., AP deca -AP avepen and AP mixpen -AP avepen .e. Relation between AP deca -AP avepen and AP mixpen -AP avepen .Data of AP deca and AP penta employed here are predicted by the combo model and data of AP mixpen is predicted by the mixed model.Totally 1 million groups of data of AP deca , AP mixpen , and AP avepen are studied.

Figure S9. a .
Figure S9.a. Transferability relation between AP of decapeptides (AP deca ) and averaged AP of two pentapeptides [AP avepen = (AP penta1 + AP penta2 )/2].b.Transferability relation between AP of mixed pentapeptides (AP mixpen ) and AP avepen .c-d.Distribution of AP difference, i.e., AP deca -AP avepen and AP mixpen -AP avepen .e. Relation between AP deca -AP avepen and AP mixpen -AP avepen .Data of AP deca and AP penta employed here are generated by coarse-grained molecular dynamics.Totally 2253 groups of data of AP deca , AP mixpen , and AP avepen are studied, for comparing the results of predicted data in Supplementary Figure 8.
, calculated based on the original score function and the AP prd obtained in this work.The

Table S1 .
Bead representation of 20 natural amino acids.Currently eighteen particle types are defined, divided into four main categories: P, polar; N, intermediate polar; C, apolar; Q,

Table S3 .
AP information of decapeptides, corresponding pentapeptides, averaged AP of two pentapeptides, and mixed pentapeptides, with top 200 AP mixpen -AP avepen values.All AP data presented here are predicted by AI models.

Table S4 .
AP information of decapeptides, corresponding pentapeptides, averaged AP of two pentapeptides, and mixed pentapeptides, with top 100 AP mixpen -AP avepen values.All AP data presented here are generated by CGMD simulations.