HydrogelFinder: A Foundation Model for Efficient Self‐Assembling Peptide Discovery Guided by Non‐Peptidal Small Molecules

Abstract
Self‐assembling peptides have numerous applications in medicine, food chemistry, and nanotechnology. However, their discovery has traditionally been serendipitous rather than driven by rational design. Here, HydrogelFinder, a foundation model, is developed for the rational design of self‐assembling peptides from scratch. The model explores self‐assembly properties through molecular structure, leveraging 1,377 self‐assembling non‐peptidal small molecules to navigate chemical space and improve structural diversity. Using HydrogelFinder, 111 peptide candidates are generated and 17 peptides are synthesized; the self‐assembly and biophysical characteristics of nine peptides, ranging from 1 to 10 amino acids, are then validated experimentally, all within a 19‐day workflow. Notably, the two de novo‐designed self‐assembling peptides demonstrated low cytotoxicity and good biocompatibility, as confirmed by live/dead assays. This work highlights the capacity of HydrogelFinder to diversify the design of self‐assembling peptides through non‐peptidal small molecules, offering a powerful toolkit and paradigm for future peptide discovery endeavors.


Random Sampling Method
The random sampling mentioned in this article is implemented using the "shuffle" function provided by the "random" module of the Python standard library. Specifically, random.shuffle() uses a random number generator to shuffle the elements of a sequence in place, so each call normally produces a different ordering. We therefore set a fixed random seed to ensure that the random sampling in our experiments is reproducible.
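The following is a minimal sketch of seeded shuffling for reproducible sampling. The helper name sample_by_shuffle, the seed value 42, and the toy molecule list are illustrative assumptions, not taken from the source; the sketch also uses a local random.Random instance rather than the module-level random.shuffle() so that the seed does not affect other code.

```python
import random


def sample_by_shuffle(items, k, seed=42):
    """Return k items taken from a seeded shuffle of `items`.

    The seed value is an assumption; the source only states that a seed was set.
    """
    rng = random.Random(seed)   # local generator with a fixed seed
    shuffled = list(items)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)       # in-place shuffle of the copy
    return shuffled[:k]


molecules = [f"mol_{i}" for i in range(10)]  # hypothetical identifiers
first_draw = sample_by_shuffle(molecules, 3)
second_draw = sample_by_shuffle(molecules, 3)
print(first_draw == second_draw)  # same seed, same draw
```

Because the generator is re-seeded on every call, repeated runs draw the same subset, which is the reproducibility property described above.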

RDKit Data Filtering
Filtering molecular data and checking whether a molecule can be converted to a graph using RDKit involves the following three steps:
Step 1: Importing the RDKit library. We began by importing the RDKit library, a widely used toolkit for cheminformatics.
Step 2: Loading molecular data. We loaded the molecular data into RDKit. The data can be represented in SMILES (Simplified Molecular Input Line Entry System) notation or in other supported molecular file formats.
Step 3: Checking molecular validity. After loading each molecule using Chem.MolFromSmiles() or a similar RDKit function, we checked whether the molecule was valid and could be converted into a graph.
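The three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the function names rdkit_parse and filter_valid_smiles are hypothetical, and the parser is passed in as an argument so the filtering logic can be exercised without RDKit installed. The only RDKit call used is Chem.MolFromSmiles(), which returns None when a SMILES string cannot be parsed.

```python
def rdkit_parse(smiles):
    """Steps 1 and 2: import RDKit and load one molecule from a SMILES string."""
    from rdkit import Chem  # imported lazily so this module loads without RDKit
    return Chem.MolFromSmiles(smiles)  # None signals a parsing failure


def filter_valid_smiles(smiles_list, parse=rdkit_parse):
    """Step 3: keep only the SMILES strings that `parse` converts to a molecule."""
    valid = []
    for smiles in smiles_list:
        mol = parse(smiles)
        if mol is not None:  # molecule object exists, so a graph can be built
            valid.append(smiles)
    return valid


# Example call (requires RDKit); the input strings are illustrative:
# filter_valid_smiles(["CCO", "c1ccccc1", "C1CC"])
```

With RDKit available, strings that fail to parse, such as one with an unclosed ring, are dropped from the returned list.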

High-throughput Prediction HydrogelFinder-predict Model
The support vector machine (SVM) is a supervised learning method that is widely used for binary classification tasks. In our experiments, we used the radial basis function (RBF) kernel. Parameter optimization for the SVM models was performed using grid search. The model with C = 10 and γ = 0.01 achieved the highest AUROC (0.9862) on the test set of the HYDROGEL dataset. We carefully selected relevant molecular features and descriptors as input to the SVM model. The model was trained to discriminate active compounds that could self-assemble to form hydrogels from inactive ones according to their 2,048-bit-radius extended connectivity fingerprint (ECFP) representations. We split the dataset into training and test sets at a ratio of 9:1 to train the model (Supplementary Table 2).
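A minimal sketch of an RBF-kernel SVM tuned by grid search with a 9:1 split is shown below, using scikit-learn. Everything specific in it is an assumption made for illustration: the synthetic 64-bit binary features stand in for real fingerprints, the toy labelling rule, the parameter grid, the three-fold cross-validation, and the random seeds are not taken from the source.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for fingerprint bit vectors (real ones would be 2,048 bits).
n_samples, n_bits = 200, 64
X = rng.integers(0, 2, size=(n_samples, n_bits))
y = (X[:, :8].sum(axis=1) > 4).astype(int)  # toy rule for "active" vs "inactive"

# 9:1 split into training and test sets, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Hypothetical grid; the source reports only the selected C and gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Score the held-out test set with the refitted best model.
test_scores = search.decision_function(X_test)
test_auroc = roc_auc_score(y_test, test_scores)
print(search.best_params_, round(test_auroc, 4))
```

Selecting by cross-validated AUROC on the training portion keeps the test set untouched until the final evaluation; whether the source tuned this way is not stated.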
The model was evaluated using the receiver operating characteristic (ROC) curve. We plotted the ROC curve and calculated the area under it (AUROC) to assess the model's discriminatory power. The AUROC is calculated as follows:

$$\mathrm{AUROC} = \frac{1}{M N} \sum_{i=1}^{M} \sum_{j=1}^{N} I(p_i, n_j), \qquad I(p_i, n_j) = \begin{cases} 1, & p_i > n_j \\ 0.5, & p_i = n_j \\ 0, & p_i < n_j \end{cases}$$

where $M$ is the number of positive samples, $N$ is the number of negative samples, $p_i$ is the prediction score of the $i$-th positive sample, and $n_j$ is the prediction score of the $j$-th negative sample. We set the classification threshold to 0.5, a common default for binary classification tasks in machine learning. At this threshold, the accuracy of the model is 99.56%. Accuracy is calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ is the number of samples predicted to be positive that are actually positive, $TN$ is the number predicted to be negative that are actually negative, $FN$ is the number predicted to be negative that are actually positive, and $FP$ is the number predicted to be positive that are actually negative.
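Both metrics can be computed directly from their definitions, as in the sketch below. AUROC is taken as the fraction of (positive, negative) score pairs in which the positive sample scores higher, and accuracy as the fraction of correct predictions at a threshold. The function names, the toy labels and scores, counting ties as 0.5, and treating a score equal to the threshold as positive are assumptions made for illustration.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5  # tie convention, an assumption
    return total / (len(pos_scores) * len(neg_scores))


def accuracy(y_true, scores, threshold=0.5):
    """(TP + TN) / (TP + TN + FP + FN) at the given score threshold."""
    tp = tn = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)


# Toy data: three positives and two negatives with made-up scores.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
print(auroc(pos, neg))           # 5 of 6 pairs ranked correctly
print(accuracy(y_true, scores))  # 3 of 5 predictions correct
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; rank-based implementations are preferable for large datasets.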

Calculate the area of overlap of kernel density maps
We used Simpson's rule (the simps function) to calculate the overlap area of the kernel density maps. Specifically, for two kernel density estimation curves, $f(x)$ and $g(x)$, we determine their overlap area using the formula:

$$\mathrm{Area}_{\mathrm{overlap}} = \int_{a}^{b} \min\big(f(x), g(x)\big)\,dx$$

where $[a, b]$ represents the region of intersection of the two curves, and $\min(f(x), g(x))$ selects the smaller of the two curve values at each point $x$.
Simpson's rule estimates this overlap area by discretizing the integral. First, the interval $[a, b]$ is divided into small subintervals; $\min(f(x), g(x))$ is then evaluated on each of them; finally, the areas over these subintervals are accumulated.
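The discretization can be sketched in plain Python as follows. Older SciPy releases exposed Simpson's rule as scipy.integrate.simps (now scipy.integrate.simpson); to stay dependency-free, this sketch implements the composite rule by hand instead. The Gaussian kernel, the bandwidth of 0.5, the number of subintervals, the integration limits, and the sample values are all assumptions for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> Gaussian kernel density estimate of `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def simpson(values, h):
    """Composite Simpson's rule for an odd number of equally spaced values."""
    if len(values) < 3 or len(values) % 2 == 0:
        raise ValueError("need an odd number of points (even number of intervals)")
    odd_sum = sum(values[1:-1:2])    # points with weight 4
    even_sum = sum(values[2:-1:2])   # interior points with weight 2
    return h / 3.0 * (values[0] + values[-1] + 4.0 * odd_sum + 2.0 * even_sum)


def overlap_area(f, g, a, b, n_intervals=200):
    """Integrate min(f(x), g(x)) over [a, b] with Simpson's rule."""
    h = (b - a) / n_intervals
    xs = [a + i * h for i in range(n_intervals + 1)]
    return simpson([min(f(x), g(x)) for x in xs], h)


# Two made-up property samples, one shifted relative to the other.
f = gaussian_kde([0.0, 0.5, 1.0], bandwidth=0.5)
g = gaussian_kde([1.0, 1.5, 2.0], bandwidth=0.5)
self_overlap = overlap_area(f, f, -5.0, 7.0)  # a density overlaps itself fully
partial_overlap = overlap_area(f, g, -5.0, 7.0)
print(round(self_overlap, 4), round(partial_overlap, 4))
```

Since each density integrates to one, the overlap area lies between 0 (disjoint distributions) and 1 (identical distributions), which makes it easy to compare across properties.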

Up-sampling strategy
To address the imbalance between positive and negative samples in the dataset, we employed an upsampling strategy. Balancing the class distribution in this way is important for the performance of machine learning models. The Python pandas library was used for data manipulation, and the sklearn.utils library was used to perform the upsampling. Specifically, the resample function in sklearn.utils allows us to adjust the number of samples in a class by repeating instances. The final ratio of positive to negative samples used to train HydrogelFinder-predict was 15728:15497.
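A minimal sketch of upsampling with sklearn.utils.resample is given below. It operates on plain Python lists for brevity, whereas the source manipulated the data with pandas (resample also accepts DataFrames). The helper name upsample_to_match, the toy class sizes, the seed, and the choice to match the majority class size exactly are assumptions; the source reports a near-balanced rather than exactly balanced ratio.

```python
from sklearn.utils import resample


def upsample_to_match(minority, majority, seed=0):
    """Draw from `minority` with replacement until it matches `majority` in size."""
    return resample(
        minority,
        replace=True,              # repeat instances of the smaller class
        n_samples=len(majority),   # target size, an assumption
        random_state=seed,         # fixed seed for reproducibility
    )


# Hypothetical identifiers: 3 minority-class and 10 majority-class samples.
minority = ["pos_0", "pos_1", "pos_2"]
majority = [f"neg_{i}" for i in range(10)]

balanced_minority = upsample_to_match(minority, majority)
print(len(balanced_minority), len(majority))
```

Because instances are only repeated, no new molecules are created; the classifier simply sees the smaller class more often during training.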

Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
6 Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
7 School of Software, Shandong University, Jinan, China
8 Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
# These authors contributed equally.

Figure S1. The property distributions of these three clusters. Quantitative analysis of

Figure S2. Chemical structures and identification numbers (IDN) of peptides selected from

Figure S21. Live/dead assays of SHED cells seeded on standard tissue-culture

Figure S22. UMAP visualization of chemical space distribution of small molecules

Table S1. Chemical properties of selected molecules.

Table S3. Number of modifiers in datasets.

Table S4. Quantifying the area of overlap of logp properties.

Table S5. Quantifying the area of overlap of Hba properties.

Table S6. Quantifying the area of overlap of Hbd properties.

Table S7. Quantifying the area of overlap of Nbase properties.

Table S8. Quantifying the area of overlap of Tpsa properties.

Table S9. Quantifying the area of overlap of Mol.wt properties.