A Multimodal Protein Representation Framework for Quantifying Transferability Across Biochemical Downstream Tasks

Abstract Proteins are the building blocks of life, carrying out fundamental functions in biology. In computational biology, an effective protein representation facilitates many important biological quantifications. Most existing protein representation methods are derived from self-supervised language models designed for text analysis. Proteins, however, are more than linear sequences of amino acids. Here, a multimodal deep learning framework incorporating the sequences, structures, and functional annotations of ≈1 million proteins (MASSA) is proposed. A multitask learning process with five specific pretraining objectives is presented to extract fine-grained protein-domain features. Through pretraining, the multimodal protein representation achieves state-of-the-art performance on specific downstream tasks such as protein property prediction (stability and fluorescence), protein-protein interactions (SHS27k/SHS148k/STRING/SKEMPI), and protein-ligand interactions (kinase, DUD-E), while achieving competitive results on secondary structure and remote homology tasks. Moreover, a novel optimal-transport-based metric with rich geometry awareness is introduced to quantify the dynamic transferability from the pretrained representation to the related downstream tasks, providing a panoramic view of the step-by-step learning process. The pairwise distances between these downstream tasks are also calculated, and a strong correlation between inter-task feature-space distributions and adaptability is observed.

For each predicted structure, AlphaFold2 produces a per-residue confidence metric called the predicted local distance difference test (pLDDT), on a scale from 0 to 100. pLDDT estimates how well the prediction would agree with an experimental structure based on the local distance difference test on Cα atoms (lDDT-Cα). A cutoff of pLDDT > 70 corresponds to a generally correct backbone prediction [1].
We have examined the accuracy of the AlphaFold2-predicted structures used. As shown in Re Fig. 1a and b, 79.5% of the predicted structures used in the pretraining phase are accurate (pLDDT > 70), while 72% of the predicted structures for the downstream PPI tasks (SHS27k/SHS148k/STRING) are accurate (Re Fig. 1c). We then investigated whether structural accuracy impacts the performance of PPI tasks. As depicted in Re Fig. 1d, the experimental group in which low-accuracy structures were masked achieves results comparable to the original group under both BFS and DFS splitting settings, indicating that the inclusion of low-confidence structures may not significantly affect model performance. The other modalities and the PPI graph networks may be able to correct the bias caused by low-accuracy structures.
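For reference, mean pLDDT can be read directly from AlphaFold2 output files, where the per-residue score is stored in the B-factor column. The sketch below shows one way to screen models by the pLDDT > 70 cutoff, assuming Biopython; the file paths and function name are illustrative, not part of the MASSA pipeline.

```python
# A minimal sketch for screening AlphaFold2 models by mean pLDDT, assuming Biopython.
# AlphaFold2 stores per-residue pLDDT in the B-factor column of its PDB output.
from Bio.PDB import PDBParser
import numpy as np

def mean_plddt(pdb_path):
    structure = PDBParser(QUIET=True).get_structure("model", pdb_path)
    # one pLDDT value per residue, read from the CA atom's B-factor field
    scores = [res["CA"].get_bfactor() for res in structure.get_residues() if "CA" in res]
    return float(np.mean(scores))

pdb_paths = ["model_1.pdb", "model_2.pdb"]  # illustrative paths to predicted structures
accurate = [p for p in pdb_paths if mean_plddt(p) > 70]  # keep generally correct backbones
```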

There are two primary linear transformation matrices in the calculation described above: $W_m$ for the scalar feature transformation and $W_h$ for the geometric feature transformation. The two subsequent activation functions, $\sigma$ and $\sigma^{+}$, act on the scalar outputs and on the L2 norms of the geometric outputs, respectively. In addition, before the scalar feature transformation, GVP concatenates the scalar features with the L2 norm of the geometrically transformed feature vector, which extracts rotation-invariant information from the vector features $V$. GVP also inserts an additional linear transformation matrix $W_{\mu}$ prior to the nonlinear transformation of the geometric features; because $W_{\mu}$ is applied separately from the extracted norm, it can control the dimension of the vector output.
GVP-GNN is a composite module that combines GVP layers with a graph neural network (GNN). Its main computation is as follows:

$$h_m^{(j \to i)} = g\left(\mathrm{concat}\left(h_v^{(j)}, h_e^{(j \to i)}\right)\right) \quad (1)$$

$$h_v^{(i)} \leftarrow \mathrm{LayerNorm}\left(h_v^{(i)} + \frac{1}{k} \sum_{j \in \mathcal{N}(i)} \mathrm{Dropout}\left(h_m^{(j \to i)}\right)\right) \quad (2)$$

where $h_v^{(i)}$ and $h_e^{(j \to i)}$ denote the node and edge feature sets of the graph and $g$ is a stack of GVP layers. Formula (1) outputs the intermediate message $h_m^{(j \to i)}$, which combines information from neighboring nodes and edges. In formula (2), each node embedding is updated by averaging the messages from its $k$ neighbors and applying a residual connection with layer normalization.
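To make the above concrete, here is a minimal sketch of a single GVP layer in PyTorch, following the published GVP formulation; it is not the MASSA implementation, and all dimension names are illustrative.

```python
# A minimal sketch of a geometric vector perceptron (GVP) layer, assuming PyTorch.
import torch
import torch.nn as nn

class GVP(nn.Module):
    """Maps (scalar, vector) features to new (scalar, vector) features."""
    def __init__(self, s_in, v_in, s_out, v_out, h_dim=None):
        super().__init__()
        h_dim = h_dim or max(v_in, v_out)
        self.W_h = nn.Linear(v_in, h_dim, bias=False)    # geometric transform W_h
        self.W_mu = nn.Linear(h_dim, v_out, bias=False)  # extra transform W_mu sets vector output dim
        self.W_m = nn.Linear(s_in + h_dim, s_out)        # scalar transform W_m

    def forward(self, s, V):
        # s: (..., s_in) scalar features; V: (..., v_in, 3) geometric (vector) features
        Vh = self.W_h(V.transpose(-1, -2)).transpose(-1, -2)     # (..., h_dim, 3)
        Vmu = self.W_mu(Vh.transpose(-1, -2)).transpose(-1, -2)  # (..., v_out, 3)
        norms = torch.norm(Vh, dim=-1)                           # rotation-invariant L2 norms of V_h
        s_new = torch.relu(self.W_m(torch.cat([s, norms], dim=-1)))         # sigma on scalars
        V_new = torch.sigmoid(torch.norm(Vmu, dim=-1, keepdim=True)) * Vmu  # sigma^+ gates vectors by norm
        return s_new, V_new

# illustrative usage on random node features
gvp = GVP(s_in=6, v_in=3, s_out=16, v_out=4)
s_out, V_out = gvp(torch.randn(10, 6), torch.randn(10, 3, 3))
```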
Structure and GO term mask technique
(1) For proteins lacking structure information, we assign a unified placeholder structure by setting the coordinates of all atoms to "NaN". After encoding by the structure encoder and prior to alignment with the sequence embedding, the resulting structure features are masked: we generate a mask of the same shape as the structure features with all values set to zero, and use it to replace the structure features with a very small constant before they are passed to the softmax function. This is equivalent to generating a feature embedding of the same shape as the structure features but with all values set to a small constant, such as $-1 \times 10^{-9}$ (a minimal sketch of this masking is given after this list).
(2) For proteins lacking GO annotations, we uniformly annotate them as "No goterm", i.e., we construct a graph with a single "No goterm" node. After encoding with the GraphGO encoder and the GO encoder, we apply a mask operation to the resulting GO features before aligning and fusing them with the protein features. The alignment module is a Transformer decoder with the protein features as the source input and the GO features as the target input. Within the attention mechanism, the mask operation on the GO features eliminates the GO information by querying the K and V of the protein features with a constant Q.
With the above operations for structure and GO annotations, our model does not require structure or GO annotations to generate embeddings, allowing it to perform tasks that require only sequence inputs, such as protein property prediction.
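As referenced above, the following is a minimal sketch of the structure-feature masking, assuming PyTorch; tensor names and shapes are illustrative and not taken from the MASSA code.

```python
# A minimal sketch of the structure-feature masking described above, assuming PyTorch.
import torch

MASK_VALUE = -1e-9  # the small constant used in place of missing structure features

def mask_structure_features(struct_feats: torch.Tensor, has_structure: torch.Tensor):
    """Replace features of structure-less proteins with a constant before attention/softmax.

    struct_feats: (batch, length, dim) encoded structure features
    has_structure: (batch,) boolean flag, False for proteins whose coordinates were NaN
    """
    mask = torch.zeros_like(struct_feats)                       # same shape, all zeros
    keep = has_structure.view(-1, 1, 1).to(struct_feats.dtype)  # 1 where a real structure exists
    # zero out masked proteins' features, then fill them with the small constant
    return struct_feats * keep + (1.0 - keep) * (mask + MASK_VALUE)
```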

Ablation study on homologous proteins
To make fair comparisons with other landmark methods, the train/valid/test sets of all protein property datasets were obtained from their original sources (TAPE [2]: github.com/songlab-cal/tape; ProteinBERT [3]: github.com/nadavbra/protein_bert). The samples within the train, valid, and test sets were identical to those used by the other methods.
Supplementary Fig. 6 Protein sequence homology ablation experiments (stability and secondary structure)
We further tested whether the difference between the two models was statistically significant. As shown in Supplementary Fig. 7, the p value is less than 0.05 for both the stability and fluorescence tasks, indicating that the difference between MASSA and ProteinBERT is statistically significant. In contrast, the p value is greater than 0.05 for the secondary structure and remote homology tasks, indicating that the difference between MASSA and ProteinBERT is not statistically significant.
Supplementary Fig. 7 Test for statistical significance (MASSA and ProteinBERT)
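For illustration, a significance test of this kind can be run as below, assuming SciPy; the exact test behind the figure is not restated here, so an independent two-sample t-test over hypothetical repeated-run scores is shown purely as an example.

```python
# A minimal sketch of a two-sample significance test, assuming SciPy.
# The scores below are hypothetical placeholders, not results from the paper.
from scipy import stats

massa_scores = [0.78, 0.80, 0.79, 0.81, 0.80]        # hypothetical repeated-run metrics
proteinbert_scores = [0.74, 0.75, 0.73, 0.76, 0.74]  # hypothetical repeated-run metrics

t_stat, p_value = stats.ttest_ind(massa_scores, proteinbert_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> difference is significant
```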

Pretraining details
A server with four NVIDIA GeForce RTX 3090 GPUs was used to pretrain MASSA for 150 epochs over the course of 32 days. On the same server, fine-tuning took one to three days, depending on the downstream task. The following hyperparameters were used: the RAdam [4] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, a learning rate of $10^{-4}$, and a weight decay of $10^{-4}$. In addition, we employed the Lookahead [5] optimization approach with $k = 5$ and $\alpha = 0.5$.
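The stated optimizer configuration can be reproduced as in the sketch below, assuming PyTorch >= 1.10 (which provides torch.optim.RAdam) and the third-party torch_optimizer package for Lookahead; the model here is a placeholder.

```python
# A minimal sketch of the RAdam + Lookahead configuration described above.
# Assumes torch.optim.RAdam (PyTorch >= 1.10) and the torch_optimizer package.
import torch
from torch_optimizer import Lookahead

model = torch.nn.Linear(128, 128)  # placeholder for the MASSA network
base = torch.optim.RAdam(model.parameters(), lr=1e-4,
                         betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)
optimizer = Lookahead(base, k=5, alpha=0.5)  # slow-weight sync every k steps
```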
The comparison of transferability metrics
A significant advantage of optimal transport (OT) over other approaches is that it permits efficient and accurate comparison of distributions with little or no intersection, which is a common scenario in bioinformatics involving heterogeneous tasks.
In accordance with your suggestion, we compare OT's trade-off between time cost and performance against four approaches: Euclidean distance, cosine distance, H-score [6], and LogME [7]. As depicted in Re Fig. 8, the computation time of all approaches increases quadratically with the size of the set, whereas OT achieves significant performance improvements at less than double the time cost. Specifically, the matrix-comparison-based methods, Euclidean distance and cosine distance, require approximately half the time of OT, but their performance is significantly diminished. H-score and LogME, which were proposed to handle inter-task transferability in the computer vision field, fail in the complex scenario of heterogeneous tasks in bioinformatics. OT strikes a better balance between time cost and performance, making it the preferred choice for quantifying transferability across heterogeneous biochemical tasks.
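As an illustration of the OT computation itself, the sketch below compares two feature distributions with exact optimal transport, assuming the POT library (pip install pot); the embeddings are random placeholders.

```python
# A minimal sketch of an OT distance between two tasks' feature distributions,
# assuming the POT library; the embeddings here are random placeholders.
import numpy as np
import ot

feats_a = np.random.randn(200, 64)  # embeddings extracted for task A
feats_b = np.random.randn(300, 64)  # embeddings extracted for task B

M = ot.dist(feats_a, feats_b, metric="euclidean")  # pairwise ground-cost matrix
wa = np.full(200, 1 / 200)                         # uniform weights over task A samples
wb = np.full(300, 1 / 300)                         # uniform weights over task B samples
distance = ot.emd2(wa, wb, M)                      # exact OT (Wasserstein) cost
print(f"OT distance between task feature spaces: {distance:.4f}")
```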