State-of-the-art computational methods to predict protein–protein interactions with high accuracy and coverage

Prediction of protein–protein interactions (PPIs) commonly involves a significant computational component. Rapid recent advances in the power of computational methods for protein interaction prediction motivate a review of the state-of-the-art. We review the major approaches, organized according to the primary source of data utilized: protein sequence, protein structure, and protein co-abundance. The advent of deep learning (DL) has brought with it significant advances in interaction prediction, and we show how DL is used for each source data type. We review the literature taxonomically, present example case studies in each category, and conclude with observations about the strengths and weaknesses of machine learning methods in the context of the principal sources of data for protein interaction prediction.


INTRODUCTION
Proteins are the basic building blocks of organisms, but a protein does not function solely on its own. Rather, proteins interact physically and specifically with one another to perform particular cellular processes. These interactions occur through either transient or stable non-covalent bonds between amino acid side chains, which guide the quaternary superstructure of macromolecular complexes and enable functional properties [1].
Because of their biological complexity, identifying protein-protein interactions (PPIs) remains a major challenge for researchers.
Understanding which proteins interact with each other, either in a pairwise fashion or as components in a multi-subunit complex, is an important task because these interactions reveal basic functional mechanisms and suggest the potential druggability surfaces of molecules for pharmacological modulation.
Traditionally, as for 3D protein structure determination, PPIs have been mapped using a diversity of experimental techniques [1]. There exist many experimental methods, such as yeast two-hybrid and protein microarrays, to find PPIs [2,3]. Some of these methods are high-throughput, able to test many PPIs at the same time, but experimental elucidation of PPIs requires dedicated time and resources, and these strategies are subject to diverse sources of technical error [4]. As a result of these drawbacks, recent effort has gone into the development of computational methods to predict PPIs. In concert with advances in 3D structure prediction [5,6], computational methods have now emerged as a viable technique to infer PPIs because of their advantage of being scalable with less resource dedication.
Computational methods, in this context, refer to the prediction methods used to convert a source of biological data to a PPI prediction.
The field has mainly been using machine learning (ML) algorithms to convert the biological data to predictions, so this paper covers the ML methods used and the advances that have been made in the PPI prediction field.
Because of the recent increase in the use of computational methods to predict PPIs, there exists a need for a review of current approaches to the problem. This paper reviews the recent progress in computational methods used to predict PPIs. We categorize the work in the PPI prediction field based on the data inputs to the prediction models. There are three main data inputs used in the PPI prediction field: protein sequences, protein structure, and co-fractionation mass spectrometry data. In the body of this paper, we review work using each of these data sources, and we present case studies exploring methods that use each of them. Table 1 presents an overview of the computational methods we survey.
From Table 1, it can be seen that the datasets used to train and benchmark models are generally consistent. Saccharomyces cerevisiae, Mus musculus, Homo sapiens, Caenorhabditis elegans, Escherichia coli, and Helicobacter pylori are all well-studied organisms and are common benchmark datasets for PPI prediction models. Some researchers have made their own datasets using PPIs from the organisms mentioned above to create more balanced datasets for training and testing.
Therefore, the Dsets we see in the table are also prevalent in the PPI prediction field. The datasets mentioned are curated from publicly available sources; descriptions and statistics on how each dataset was optimized to allow for good performance are available through the corresponding citation in Table 1. It is important to note that datasets taken from the same organism can differ between models depending on choices made by the different authors, so statistics and performance can vary with those choices. The Dsets mentioned in the table are curated datasets from specific organisms assembled by other researchers, so a Dset refers to the same dataset wherever it appears in the table.

METHODOLOGICAL BACKGROUND
PPI prediction models aim to use some source of protein data to predict known interactors. Before reviewing these computational models for predicting protein interactions, we provide a brief overview of concepts to give background on the methods discussed.

Machine learning
ML is a subfield of artificial intelligence that focuses on using data to learn associations and make predictions [68]. In this review, we focus on ML methods that use supervised learning for classification, since these are the most relevant to PPI prediction. Supervised ML refers to the use of labeled training data, in which each data item has associated features and an accompanying label. For PPI prediction, these labels are strictly binary: whether or not the two input proteins are interactors. In this setting, the goal of classification is to construct a model that infers labels for test data, that is, data coming from the same distribution as the training data, but without labels. In the context of PPI prediction, the input to an ML model is two protein representations, and the label associated with the input is whether the two proteins are known interactors. Essentially, the ML model tries to extract the relevant features from each of the protein representations and then use these features for a PPI prediction.
In supervised ML, training data is used to adjust model parameters toward settings that tend to predict known interactors accurately.
Once trained, we expect the model to predict labels in the training data accurately. Test data, which serve as unseen data, indicate how well the ML model performs on data it has not seen, testing the model's ability to generalize.

Classical machine learning
We use the term classical ML to refer to algorithms developed before the advent of deep learning (DL) techniques (which are discussed further below). Classical ML methods include decision trees, support vector machines (SVMs), and random forest classifiers (RFCs). Decision trees construct a tree-structured partition of the data's feature space [69].
Each node in the tree represents a single logical test of a single data feature's value (such as the presence of an alpha helix in a protein) and so splits the data space based on that feature. Leaves in the tree are associated with a label and correspond to predictions. In contrast, SVMs non-linearly map input feature vectors to a high-dimensional space and construct linear decision boundaries in that space to assign labels for classification [70]. Finally, RFCs use an ensemble method comprising many decision trees constructed from random samples of the training data and use a voting system among these decision trees to determine the predicted classification [71].
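To make the ensemble idea concrete, here is a minimal sketch of random-forest-style majority voting, using hypothetical one-node "stump" classifiers in place of fully grown trees (real RFCs learn each tree from a bootstrap sample of the training data):

```python
from collections import Counter

def stump(feature_index, threshold):
    """A hypothetical one-node decision 'tree': one logical test of one feature."""
    return lambda x: 1 if x[feature_index] > threshold else 0

def forest_predict(trees, x):
    """Majority vote among the ensemble's individual tree predictions."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [stump(0, 0.5), stump(1, 0.3), stump(0, 0.8)]
pred = forest_predict(trees, [0.7, 0.9])  # two of three stumps vote class 1
```

The aggregation step is the essence of the method: individually weak trees become a robust classifier once their votes are pooled.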

Deep learning
DL is a subset of the ML field that refers to the use of artificial neural networks (ANNs) for tasks such as classification [72]. The basic building block of a neural network architecture is the artificial neuron, which typically takes a weighted sum of its inputs and then applies a nonlinear transformation to produce an output that is shared with other neurons downstream. Neurons are organized into layers, and each layer consists of multiple neurons that take their input either from data or from a previous layer's output. Learning occurs by adjustment of the weight parameters in a manner that seeks to achieve good classification performance. In the context of PPI prediction, weights would be optimized to maximize the number of correct predictions of known interactors and non-interactors. A neural architecture refers to the particular number of layers and the interconnection pattern linking those layers. A number of well-studied neural architectures are in common use; in the following subsections, we discuss the architectures prominently used in protein interaction prediction.
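The weighted-sum-plus-nonlinearity behavior of a single artificial neuron, and its organization into layers, can be sketched as follows (a toy forward pass with made-up weights, not any particular published architecture):

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a nonlinear (sigmoid) transform."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # output shared with downstream neurons

def layer(inputs, weight_rows, biases):
    """A layer: several neurons reading the same inputs (data or a prior layer's output)."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Toy forward pass; training would adjust these weights toward good classification.
hidden = layer([0.2, -1.0], [[0.5, 0.1], [-0.3, 0.8]], [0.0, 0.1])
```

Stacking several such layers, with the output of one feeding the next, yields the deep networks discussed below.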

Convolutional neural networks
The convolutional neural network (CNN) is one of the most popular architectures used in DL [72]. A CNN architecture uses convolutional layers, which slide learned filters across the input to extract local patterns that are pooled into progressively higher-level features.

TABLE 1 Overview of computational methods: a comprehensive list of PPI prediction pipelines.

Recurrent neural networks
Early attempts to handle variable-length inputs led to the recurrent neural network (RNN) architecture. In an RNN, feedback loops exist between layers, so that data can be input in sequence and previously seen data can affect the classification of subsequent data [72].

FIGURE 2 Example architecture of an unrolled recurrent neural network (RNN) for PPI prediction. Adapted from Alzubaidi et al. [72].

RNNs are capable of identifying relationships between the different inputs, making them well suited to applications in video processing and natural language processing. Because the protein sequence-to-PPI prediction problem can be seen as a derivative of problems in the natural language processing field, protein sequence models have a solid foundation of text models and research to adopt for PPI predictions. Just as words in a sentence need context to be understood, amino acid residues need context from other amino acid residues to understand their roles in the protein sequence. Therefore, text-based architectures are a good fit and have been used for the protein sequence-to-PPI prediction problem.
Figure 2 is a general pipeline for an RNN architecture for PPI prediction. In this example, the RNN is displayed in its "unrolled" state.
Because RNNs are cyclical, RNN diagrams are displayed in an unrolled state over the input sequence (the protein's amino acid sequence) to display their architecture. This example RNN is referred to as a many-to-one RNN, as it takes an input sequence and generates one output (a PPI prediction value).
Unfortunately, due to the presence of feedback loops, RNNs are subject to the vanishing gradient problem, meaning that the network can sometimes fail to propagate useful information from its early layers during training. The long short-term memory (LSTM) architecture was developed to overcome this issue; it introduces gates within the architecture to better control the flow of information from layer to layer [72].
Nonetheless, training an RNN or LSTM with variable-length protein sequence data is computationally expensive since the network requires long training time to capture long-range amino acid residue interactions in the input sequence.
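A many-to-one RNN of the kind shown in Figure 2 can be sketched in a few lines; the scalar weights here are illustrative placeholders, and a real network would use weight matrices learned from data:

```python
import math

def rnn_many_to_one(sequence, w_in, w_rec, w_out):
    """Many-to-one RNN sketch: one hidden state updated per input element,
    with a single prediction emitted after the final element."""
    h = 0.0  # hidden state carried across time steps (the feedback loop)
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)  # earlier inputs influence later steps
    return 1.0 / (1.0 + math.exp(-w_out * h))  # squash to a binary PPI-style score

score = rnn_many_to_one([0.1, 0.4, 0.9], w_in=1.0, w_rec=0.5, w_out=2.0)
```

The loop makes the variable-length handling explicit: the same weights are reused at every position, so sequences of any length produce one fixed-size output.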

Transformers and attention
The attention mechanism is the key advance leading to the transformer neural architecture [73]. Figure 3 is a general example of an attention mechanism. The input amino acid sequence is converted to a vector through some embedding.
The weight matrix is then constructed by taking the dot products of these vectors. The weight matrix can then be multiplied by the output vectors from the embedding to create an amino acid representation that is context aware.
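This dot-product weighting can be sketched directly; here queries, keys, and values all reuse one toy per-residue embedding, whereas real transformers apply separate learned projections for each role:

```python
import math

def softmax(row):
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Simplified dot-product self-attention over per-residue embedding vectors."""
    # Weight matrix: softmax-normalized pairwise dot products between residues.
    weights = [softmax([sum(a * b for a, b in zip(u, v)) for v in embeddings])
               for u in embeddings]
    # Each output row is a weighted mix of all residues: a context-aware representation.
    dim = len(embeddings[0])
    return [[sum(w * vec[d] for w, vec in zip(row, embeddings)) for d in range(dim)]
            for row in weights]

context = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each residue's output vector now blends information from every other residue, weighted by pairwise similarity, which is exactly the "context awareness" described above.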

Embeddings and encodings
Embeddings are representations of feature-rich inputs that are scaled down to reduce the dimensions of the input. This scaling-down process highlights important features within the network. Embeddings come from pre-trained neural networks that generate feature representations as an output. For example, Word2Vec is a neural network architecture that represents words in a sentence as vectors that capture their semantic and syntactic attributes [74]. It is important to note that embeddings are of a fixed length; therefore, the method that produces the embedding can take a variable-length input and yet still produce a fixed-length output. Word2Vec and other relevant embedding architectures that have been adapted for protein representations are explained further below.
In this review, we use the terms embeddings and encodings interchangeably. Embeddings and encodings refer to data transformations that either approximate or provide a direct one-to-one transformation of an input. We briefly describe some examples of encodings and embeddings that will be referenced in later sections. An example of an encoding is the one-hot encoding. The one-hot encoding converts categorical data, like amino acids in a protein sequence, into an n × L dimensional vector, where n is the number of categories and L is the length of the input. For example, a one-hot encoding of a protein's amino acid sequence would be a 20 × L dimensional vector, where 20 is the number of categories (amino acids) and L is the length of the protein's amino acid sequence. This encoding has a direct map allowing conversion back to the input.
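A one-hot encoding of an amino acid sequence, as described above, can be implemented in a few lines (the residue alphabet ordering here is arbitrary):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(sequence):
    """Encode a protein sequence as a 20 x L matrix: one column per residue,
    with a single 1 marking that residue's amino acid category."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(sequence) for _ in range(len(AMINO_ACIDS))]
    for pos, aa in enumerate(sequence):
        matrix[index[aa]][pos] = 1
    return matrix

m = one_hot("MKV")  # 20 rows, 3 columns; directly invertible back to "MKV"
```

Because exactly one entry per column is set, the original sequence can be recovered exactly, which is the "direct map" property distinguishing encodings from learned embeddings.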
Word2Vec: Word2Vec is a neural network architecture that, as the name suggests, converts words into a vector.The network learns word associations by training on a body of text and represents these word associations in vectors.The goal of this vectorization is to be able to perform simple algebraic operations to combine the associations of different words.For example, with Word2Vec, it has been shown that vector(King) − vector(Man) + vector(Woman) ≈ vector(Queen) [74].
FastText: FastText is similar to Word2Vec in that its goal is to produce a vectorization of words.However, FastText's main contribution lies in its architecture: words are represented as a combination of vector representations of characters.This approach makes FastText, as suggested by its name, a faster method for training and producing these vector representations [75].
BERT: BERT is a masked-language model that aims to predict a missing word in a sentence.BERT is trained using sentences with a masked token on some words.BERT will then try to predict these masked words given context from the other words in the sentence [76].

Autoencoders
Autoencoders are different from the previously mentioned architectures because they use unsupervised learning. Simply put, autoencoders try to project their input into a different-dimensional feature vector (encoding) and then use this feature vector to recreate the input (decoding). The goal of this architecture is to create an encoder that is capable of projecting an input to a different dimension. Most commonly, autoencoders are used to project data into a lower dimension in order to enhance signals, highlight features, and reduce noise. This attribute of autoencoders is important for mass spectrometry data, which will be discussed in a later section.
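The encode/decode round trip can be sketched with toy linear maps; the hand-picked weights below stand in for parameters that an autoencoder would learn by minimizing reconstruction error:

```python
def encode(x, w_enc):
    """Project the input to a lower-dimensional feature vector (the bottleneck)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w_enc]

def decode(z, w_dec):
    """Attempt to recreate the original input from the encoding."""
    return [sum(wi * zi for wi, zi in zip(row, z)) for row in w_dec]

# Training (unsupervised: no labels) would adjust w_enc/w_dec so that
# decode(encode(x)) approximates x as closely as possible.
x = [1.0, 2.0, 3.0, 4.0]
w_enc = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # 4 -> 2 bottleneck
w_dec = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # 2 -> 4
x_hat = decode(encode(x, w_enc), w_dec)
```

The 2-dimensional bottleneck forces the network to keep only the dominant structure of the input, which is why autoencoders are useful for denoising noisy data such as mass spectrometry profiles.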

Training and testing data partitioning
Since the goal of ML is to make predictions on unseen data, care must be taken when partitioning PPI data that the test set is truly unseen. For example, if a highly connected protein appears in both testing and training, the ML model may learn to simply predict that the protein has further interaction partners without taking into account the properties of the partners. This can lead to unrealistically high accuracies. To address this issue, the best practice is to partition the available data based on the proteins themselves and not on PPIs. This split means that proteins in the training set will not be seen in the test set, giving more confidence in the model's ability to generalize [77].
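A protein-level partition, as opposed to a PPI-level one, can be sketched as follows; the pair list and split fraction are illustrative:

```python
import random

def split_by_protein(ppi_pairs, test_fraction=0.2, seed=0):
    """Partition PPI data so that no protein appears in both train and test,
    avoiding leakage from highly connected hub proteins."""
    proteins = sorted({p for pair in ppi_pairs for p in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    test_proteins = set(proteins[: int(len(proteins) * test_fraction)])
    train = [p for p in ppi_pairs if not (set(p) & test_proteins)]
    test = [p for p in ppi_pairs if set(p) <= test_proteins]
    # Pairs mixing a train protein and a test protein are discarded entirely.
    return train, test

pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")]
train, test = split_by_protein(pairs, test_fraction=0.4)
```

The cost of this stricter split is that mixed pairs are dropped, shrinking the usable data; the benefit is an honest estimate of generalization to entirely new proteins.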

PROTEIN SEQUENCES FOR PPI PREDICTIONS
Having covered the necessary ML background, we now turn to the use of protein sequences and the computational methods used to integrate these sequences into a PPI prediction pipeline. PPI prediction models that use protein sequences are plentiful, mainly because of how readily available protein sequences are. Below, we discuss the current general pipeline and give examples of existing pipelines in the field.
Protein sequence refers to the primary structure of proteins: the amino acid sequence. The amino acid sequence is a 1D chain of twenty possible amino acids. This specific arrangement of amino acids is referred to as the protein's primary structure [1]. While this 1D sequence's relationship to the protein's 3D structure is not easily apparent, there exist patterns, such as certain amino acid combinations, within the protein sequence that define folding motifs (e.g., alpha-helix, beta-sheet) within the protein's structure. The goal of using computational methods with protein sequences for PPI prediction is to capture these patterns from the interfaces of known interacting polypeptides and to predict PPIs based on the similarity between other proteins' sequences. Here, we discuss the computational work that has been done to assess the interaction between two candidate proteins based on their sequences alone.
Much of the work that focuses on computational methods for PPI predictions uses protein sequences as its sole input. These protein sequences are far more readily available than 3D protein structural models: there are around 189 million protein sequences available on UniProt, while there are only a little more than 200,000 protein structures deposited in the Protein Data Bank (PDB) [78,79].
The plethora of protein sequences and the ease of accessing them have contributed to an abundance of computational models that predict PPIs based on protein sequences alone.
When it comes to training data for PPI prediction based on reference protein sequences, researchers can access a diversity of experimentally derived PPIs from well-studied organisms [4,11,20,33,34,80]. They can split these into training and test (hold-out) sets to demonstrate that a particular model performs well, achieving high accuracy without overfitting to one particular reference study.

FIGURE 4 General pipeline of PPI prediction models using protein sequences. The input is the primary amino acid sequence. A feature extraction step is used to distill important features from the amino acid sequence. This step includes metric representations, embeddings, encodings, or some use of the original protein sequence. A network architecture then integrates the two separate protein feature vectors and outputs a binary prediction.

Metric representations
Distilling protein sequences to metric representations not only reduces training time but also avoids the deeper networks usually necessary to capture interactions. It is worth mentioning that extracting these features from protein sequences is a relatively simple task. Metric representations involve distilling the raw protein sequence into statistical, physical, or chemical properties, and this seems to be the most popular method in the field currently. These representations include properties such as hydrophobicity, secondary structures, backbone angles, amino acid composition (AAC), and many other features [11, 20, 28, 33, 34, 42-44, 80, 84-87]. Other commonly used features, including AAC and the conjoint triad (CT), provide a global representation of the protein while also fixing the length of the input feature [4,34]. While a fixed-length input is important for most neural network architectures, RNNs do not require one. When one considers the higher payoff and lower effort needed to generate these metrics, it becomes evident why this approach is among the most popular in the field.

Case example of metric representations and attention: SDNN-PPI
As an example of systems that use metric representations, we will review the design of SDNN-PPI [4]. SDNN-PPI is a neural network-based method that pre-processes protein sequences into metric representations before their input to the neural network, producing a binary PPI prediction output. The three derived features are an AAC, a CT distribution, and an autocovariance (AC), which considers the proximity effects of nearby amino acids. To obtain the CT, SDNN-PPI first clusters the amino acids into seven different clusters based on the biophysical properties of their side chains and dipoles: if two amino acids are in the same cluster, they are treated the same. The CT score is then calculated as the distribution of amino acid triplets with respect to these clusters. The AC is determined by replacing the amino acids in an input protein sequence with a number corresponding to some basic biophysical property (i.e., hydrophobicity, hydrophilicity, net charge index, polarity, polarizability, solvent-accessible surface area, and side chains).
The AC of each such sequence is calculated and used as the third input feature to SDNN-PPI.
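As an illustration of the CT feature, the sketch below maps residues to seven side-chain clusters and tallies cluster triplets; the particular grouping shown is one commonly used seven-class scheme and is an assumption here, since SDNN-PPI's exact clustering is defined in the original paper [4]:

```python
from collections import Counter

# Seven side-chain clusters (one common grouping; illustrative only).
CLUSTERS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: i for i, group in enumerate(CLUSTERS) for aa in group}

def conjoint_triad(sequence):
    """Frequency of residue triplets after mapping each amino acid to its
    cluster: a fixed-length (7 x 7 x 7 = 343 possible triplets) global feature."""
    counts = Counter(
        tuple(CLASS_OF[aa] for aa in sequence[i : i + 3])
        for i in range(len(sequence) - 2)
    )
    total = sum(counts.values())
    return {triplet: n / total for triplet, n in counts.items()}

ct = conjoint_triad("MKVLAG")  # 4 overlapping triplets from a 6-residue toy sequence
```

Note how the output dimensionality no longer depends on sequence length, which is what makes CT-style features convenient inputs for fixed-size network layers.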
The SDNN-PPI model architecture is based on a feedforward network with six fully connected layers. The architecture is simple in that it uses one of the most basic layers in DL, the fully connected layer, together with an attention mechanism. The authors tested multiple different encodings of protein sequences and achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.986 when the algorithm was tested on multiple high-confidence PPI datasets. The ROC curve is a classification metric that plots the true positive rate as a function of the false positive rate; AUC, the area under this curve, measures the model's performance across all classification thresholds. The highest AUC score possible is 1.00, so the 0.986 achieved by SDNN-PPI is high. The authors used the S. cerevisiae dataset to evaluate SDNN-PPI's performance. While the authors evaluated different combinations of metric features, no comparison was made to other encoding methods such as neural network embeddings or raw protein sequences [4]. SDNN-PPI highlights the power of the attention mechanism, namely achieving excellent performance in PPI prediction based purely on sequences and a simple DL architecture. Advances in attention networks are influencing the field heavily: over the span of a few years, the most recent papers have largely shifted to architectures with the self-attention mechanism. SDNN-PPI also highlights the trajectory in which protein sequence pipelines are moving: pipelines that leverage attention mechanisms along with readily available protein sequences can produce remarkably accurate results using only the primary protein structure. However, as the authors point out, metric representations do not provide comprehensive protein characterization such as structural, evolutionary, and protein-residue relationship information [4]. In addition, how the data were partitioned into training and test sets was not specified, which means that information leakage, as described in the Methodological Background, could have occurred.

Embeddings
The next strategy for protein sequence representation is the use of embeddings and encoders. Encodings such as the one-hot encoding can be used to represent protein sequences, while embeddings such as Word2Vec, FastText, and BERT [7,34,43,44,48,67,88] are commonly used to transform the protein sequence into an interpretable feature for a neural network. The main advantage of this strategy is that it retains some resemblance to the entire protein sequence while also providing an interpretable feature representation for a neural network [43]. In addition, neural network embeddings can project an input sequence to a lower dimension; the advantage of this projection is that it can reduce noise and extract important features from the input [82,83]. Conversely, the main drawback of this approach is interpretability: while metric-derived features have meaning (since they represent statistical or biophysical properties of the protein), neural network embeddings produce feature representations that have no direct interpretation.
While this disadvantage may not affect how well a model predicts PPIs using these embeddings, it causes issues when trying to interpret the features the model uses to make predictions (the black-box phenomenon).

Case example of embeddings and encodings: GOSeqPPI
An example of embeddings and encodings in a PPI pipeline is GOSeqPPI [43]. GOSeqPPI incorporates both a neural network embedding and a text encoding to feed the protein sequence and the protein's associated gene ontology (GO) annotations [89,90] into a neural network architecture. To represent the protein sequence, GOSeqPPI uses a one-hot encoding, distilling the protein sequence into a (20 × L)-dimensional feature vector: twenty for the number of possible amino acids and L for the length of the protein sequence. A pre-trained NCBI-blueBERT model [91], a version of the BERT model trained specifically on biomedical text data, was used to distill the GO annotations into a (768 × N)-dimensional feature vector. The one-hot encoding and the BERT embedding are passed into a CNN layer and an LSTM layer to combine the features from the protein sequence and GO annotations into an embedded representation of pairs of proteins. Finally, these feature representations are passed into a fully connected network layer with an attention mechanism for a PPI prediction.
Compared to extracted protein features, GOSeqPPI's embeddings and encodings have a distinct advantage: the ability to incorporate additional information. The BERT model extracted important features from the GO annotations and was able to use them for PPI predictions.
Because neural network embeddings project the input data into a lower dimension, they reduce noise and capture important features. Indeed, compared against other models such as PIPR, GOSeqPPI shows improved accuracy across multiple datasets. However, this method makes the extracted features uninterpretable to the human eye.

Raw protein sequences
The research in PPI prediction seems focused primarily on global interactions: a binary output classification of whether two proteins interact. However, some of the research looks at local interactions: determining which residues from a pair of proteins are interacting. The literature on predicting residue interaction interfaces is not as rich as that on inferring global PPIs. Most models that predict global interactions tend to use statistical and physicochemical representations for PPI predictions. This decision makes sense since global interaction models are not concerned with the specific residues of the two interacting proteins. However, models that predict local interactions tend to use a combination of statistical and physicochemical representations as well as some representation of the overall protein sequence that captures local features of the protein (e.g., overall fold or domains). As described in the previous sections, protein sequence representations encompass encoding methods such as metric representations, text embeddings, and neural network feature embeddings, but some groups have also leveraged raw protein sequences [39,42,44,67,86,87]. Using unprocessed protein sequences for PPI prediction creates an issue for neural network architectures, since most models depend on an input of fixed length. This means that the architectures used in these models must either use RNNs, which can handle variable-length inputs, or else somehow fix the input lengths of a given protein sequence. Given the limitations of RNNs noted before, efforts in this area opt to fix the input length of the proteins using neural network embedding models, text embeddings, or a pre-determined length [39,44,87]. Overall, however, these strategies have seen less success than approaches that alter the protein sequence.

Case example of using protein sequences: ctP2ISP

ctP2ISP [39] is a pipeline that interprets protein sequences to predict protein interaction sites between a pair of interacting proteins at the individual amino acid residue level. To achieve this precision, ctP2ISP inputs six different features representing the physicochemical properties of each protein to the neural network. The first input, covering the first 500 residues of a protein, consists of numbers from 1 to 20 representing the different amino acids. If the protein is shorter than 500 residues, the input feature is padded with zeros. The algorithm then uses SPIDER3 [92], a protein structure prediction method, to generate information about a protein's secondary structure, including its solvent-accessible surface area and peptide backbone angles. This information is presented as a 1D vector to the neural network pipeline. The other two features are a position-specific scoring matrix and basic biophysical properties (such as charge, volume, and hydrophobicity).
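ctP2ISP's first input feature can be sketched as a fixed-length integer encoding; the truncation of sequences longer than 500 residues is our assumption, as is the particular 1-20 code assignment:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # codes 1..20

def encode_fixed_length(sequence, max_len=500):
    """Map residues to integers 1-20 and pad with zeros to a fixed length.
    Handling of sequences longer than max_len (truncation here) is assumed."""
    codes = [AA_CODE[aa] for aa in sequence[:max_len]]
    return codes + [0] * (max_len - len(codes))

vec = encode_fixed_length("MKV")  # length 500, zero-padded after position 3
```

Zero-padding like this is what lets a fixed-input network consume variable-length proteins without resorting to an RNN.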

PROTEIN STRUCTURE FOR PPI PREDICTIONS
A protein's 3D structure is a result of biophysical interactions among the residues of the primary amino acid sequence. The side chains of the amino acid sequence make energetically favorable contacts with other residues in the sequence. The types of interactions that drive protein folding include hydrogen bonds, electrostatic interactions, and van der Waals forces [1]. Experimentally determined protein structures curated in the PDB can represent physically interacting residues as a graph structure.
However, deducing protein structures is more complicated than determining protein sequences, which is why protein sequences are more readily available than protein structures [78,79]. Nevertheless, innovative computational methods based on DL for determining protein structure have recently become a popular field.
In contrast to physics-based methods like ClusPro [93], which depend on free energy minimizations for determining how a protein folds, ground-breaking methods like AlphaFold (AF) use multiple sequence alignments (MSAs) to gain insight into protein structure [6].
Because of AF's recent prominent success in the protein structure prediction field, the AF Protein Structure Database was produced, consisting of around 200 million protein structure predictions [94].
Recently, the PPI prediction field has started to integrate protein structures, known ones from the PDB or predicted ones from the AF Protein Structure Database, into computational pipelines to see if they improve prediction performance. The general idea behind using protein structure for PPI prediction is that a protein's structure contains key features that can be extracted and then compared for complementarity (i.e., lock-and-key fit) against another protein's structure.
Below, we discuss how protein structure has been leveraged for PPI prediction. The field has so far evolved two main strategies for structure-based PPI prediction: AF-inspired models and graphical representations based on known protein structures. Figure 5 shows the general pipeline of models in this part of the field.

AlphaFold
Historically, the implementation of protein structure has been more challenging compared to the use of protein sequences. However, the pioneering AF tool [6] has increased activity around structure for PPI predictions. By demonstrating the power of attention-based DL, the work in the PPI prediction field has now shifted to using this innovative DL mechanism.

FIGURE 5 General pipeline of PPI prediction models using protein structure models. The input is the 3D structure of a pair of proteins. Protein structures consist of experimentally verified or computationally predicted structures. These 3D structures are converted to residue contact graphs to make them interpretable for the graph neural network architecture. The graph neural network synthesizes the two graphs into a feature vector, which is then used for a PPI prediction.

AF has moved the PPI field forward by both leveraging the power of DL and making use of accurate large-scale protein structure models for PPI prediction. Notably, use of protein structure for PPI prediction is no longer restricted to proteins with experimentally derived structures; rather, researchers can use computational protein structure models for PPI predictions, leveraging the AF Protein Structure Database's collection of about 200 million predicted protein structures [6,94].
While the advent of the AF Protein Structure Database and AF-Multimer is inspiring, protein structure-based PPI prediction is still not as extensive as work using protein sequences. It is also important to note that protein structure models lack the vast starting foundation that protein sequence models inherited from extensive previous research on text-based tools, so much of the field is currently working with off-the-shelf models. Indeed, existing structure-based methods still depend on protein sequence, using it in conjunction with protein structure information.

Multiple sequence alignments
AF's success comes partly from the power of MSAs: alignments of three or more homologous protein sequences from a diverse set of species that both define similar functions in different organisms and identify boundaries on sequence variation, which reflect the protein's 3D structure [95,96]. Because of this detectable relationship between homologous protein sequences, MSAs have been widely used to identify the evolutionary relationships needed for protein structure prediction. More specifically, MSAs can reveal both global and local structural features, including secondary structures, backbone angles, and even residue–residue interactions [96]. MSAs and other evolutionary information have been popular for protein structure prediction for some time [97-99], but the emergence of the attention mechanism in DL tools such as AF truly allows this information to be extrapolated for PPI predictions.
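To make the conservation signal in an MSA concrete, here is a minimal sketch (standard-library Python on a toy four-sequence alignment; a real MSA would come from a homology search tool such as jackhmmer or HHblits) that computes per-column Shannon entropy, where low entropy marks the conserved, structurally constrained positions described above:

```python
import math
from collections import Counter

def column_entropies(msa):
    """Shannon entropy per alignment column; low entropy = high conservation.

    `msa` is a list of equal-length aligned sequences (a toy stand-in for a
    real multiple sequence alignment).
    """
    n_cols = len(msa[0])
    entropies = []
    for i in range(n_cols):
        counts = Counter(seq[i] for seq in msa)
        total = sum(counts.values())
        # H = -sum(p * log2 p) over the residues observed in this column
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Toy alignment: columns 0 and 1 are fully conserved, columns 2 and 3 vary.
msa = ["MKVL", "MKIL", "MKAL", "MKVF"]
h = column_entropies(msa)
assert h[0] == 0.0       # 'M' in every sequence -> zero entropy
assert h[2] > h[1]       # column 2 is more variable than column 1
```

Covariation between two variable columns (rather than single-column conservation) is what hints at residue–residue contacts, but the entropy sketch above captures the basic idea of reading structural constraint from sequence variation.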
The success of AF inspired the AF-Multimer model, which produces medium-quality protein docking structures at scale [5]. Because of its limited reliability, using AF features directly for PPI prediction was initially somewhat disappointing. Nevertheless, the AF-Multimer model has been used as a high-throughput method to generate candidate PPIs and even multi-protein complexes [100]. Researchers have found that using the AF model in conjunction with paired MSAs yields the best results, achieving an AUC value of up to 0.87 when predicting PPIs for E. coli [101]. However, AF has a long run time because of the search for sequences needed to build the MSA. This drawback means that modifying the AF architecture to enhance PPI predictions requires not only resources to handle AF's computational intensity but also long run times, making this strategy less feasible than using pure protein sequences for PPI prediction. Techniques to speed up AF's MSA generation have been reported [101,102], but the long run time still poses an issue.
In theory, long run times can be circumvented by using the precomputed structure predictions in the AF Protein Database, but this strategy means that a model cannot use AF's intermediate computed features, only its output. Hence, with work in this part of the field still in its early stages, literature on AF-inspired PPI prediction models remains sparse.

4.2.1
Case example of MSAs: PITHIA
An example of a neural network architecture incorporating MSAs is PITHIA [67], which uses not only MSAs but also side features such as embeddings from protein language models and other representations such as a PSSM. While PITHIA does not make PPI predictions per se, it can find protein interaction sites on a single protein.
The authors tested the performance of multiple ML architectures, including a multilayer perceptron, an RNN, a CNN, and a transformer with self-attention, with the latter performing best.
Surprisingly, however, when testing the impact of additional features such as a PSSM, physicochemical characteristics, or evolutionary conservation, the use of MSAs alone yielded the best results [67]. PITHIA demonstrates the power of MSAs within the attention mechanism compared with other architectures and features.

4.2.2
Case example of AF output for PPI prediction: TAGPPI
Taking a slightly different approach, some recent work in the field has used parts of AF's output rather than MSAs. TAGPPI, for example, incorporates AF's protein structure prediction by creating a residue-level contact map, which is passed to a graph neural network with an attention mechanism (referred to as a graph attention network, or GAT), alongside a CNN that interprets the pre-processed protein sequence. The authors tested multiple variations of their method, including versions that used the CNN and GAT layers independently versus together. They found that the addition of the structure feature provided by AF yielded the highest accuracy [18].
It is important to note that this model incorporates both sequence and structure, and compared against other models such as PIPR, a neural network model that only uses protein sequences, TAGPPI improved PPI prediction performance albeit marginally [18].
The combined use of protein sequence and protein structure in TAGPPI somewhat obscures the contribution of protein structure to PPI prediction accuracy. To assess the role of protein structure on its own, it is helpful to examine a pipeline that does not use direct sequence information.

4.2.3
Case example of pure protein structure for PPI prediction: SGPPI
SGPPI uses protein structures from the AF Protein Structure Database along with additional information on protein secondary structure, but does not use a direct metric representation of the protein sequence.
The protein structure is converted into a contact map and passed to a graph convolutional neural network (GCN) for PPI prediction. While the authors did not report the overall accuracy of their model, they did report an F1-score of 0.375 by 10-fold cross-validation on Pan's dataset [14]. The F1-score is the harmonic mean of the model's precision and recall; the closer the score is to 1, the better the model's performance. Pan's dataset is derived from the Human Protein Reference Database and is commonly used for benchmarking [14]. While each model was tested on a different benchmark, these results suggest that protein sequence currently has a greater impact on performance than protein structure, again stemming from the solid foundation provided by previous natural language processing research. Since protein structure models have lagged behind protein sequence models, the difference in performance could reflect one subfield simply being able to move forward more quickly and is not a measure of the inherent utility of each data type for PPI predictions.
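As a reminder of how the F1-score behaves, a minimal sketch (the counts below are illustrative only, not numbers from the SGPPI paper):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall.

    tp/fp/fn are the counts of true-positive, false-positive, and
    false-negative predictions.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: recovers 30 of 80 true PPIs, raises 20 false alarms.
assert round(f1_score(tp=30, fp=20, fn=50), 3) == 0.462
```

Because F1 ignores true negatives, it is a more informative summary than raw accuracy on PPI benchmarks, where non-interacting pairs vastly outnumber interacting ones.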

Non-AF methods
Multiple models have used known protein structures, rather than putative structures computationally derived by AF, for PPI predictions. The general methodology is to convert the structures into a graph, usually a residue contact map, and apply a graph neural network to extract key features and predict PPIs [22,57,103].
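The structure-to-graph conversion can be sketched in a few lines, assuming a toy list of C-alpha coordinates and the common convention of an ~8 Å contact cutoff (real pipelines would parse the coordinates from PDB or AF model files):

```python
import math

def contact_map(ca_coords, cutoff=8.0):
    """Binary residue-residue contact map from C-alpha coordinates.

    Two residues are treated as "in contact" when their C-alpha atoms lie
    within `cutoff` Angstroms; the resulting adjacency matrix is the
    residue contact graph consumed by graph neural networks.
    """
    n = len(ca_coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.dist(ca_coords[i], ca_coords[j])
            cmap[i][j] = 1 if d <= cutoff else 0  # diagonal is trivially 1
    return cmap

# Three toy residues: the first two are close, the third is far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (30.0, 0.0, 0.0)]
cmap = contact_map(coords)
assert cmap[0][1] == 1 and cmap[0][2] == 0
```

The cutoff and the choice of C-alpha (versus C-beta or all-atom minimum distance) vary between published methods; the adjacency matrix, optionally paired with per-residue feature vectors, is what the graph layers operate on.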

Case example of using experimentally derived protein structures: Struct2Graph
Struct2Graph is a network model that converts experimentally derived PDB structure files into graphs. These graphs are then passed into a GAT to identify similarities between them. The authors report that Struct2Graph outperforms other state-of-the-art models using an attention network with protein structures, achieving a claimed accuracy of ≥98.89% by five-fold cross-validation [57]. These results indicate that introducing known structures into a PPI prediction model can increase prediction performance. However, the AF Protein Database currently holds about 200 million proteins, while the PDB contains only around 200,000 [78,94]. Hence, models that work with predicted protein structures face a harder problem, since they must scale to roughly 1000-fold more proteins. Because the PDB is much smaller (fewer proteins to be trained or tested on), there is inherently less variability for neural network architectures to capture, making prediction from known protein structures an easier task. On the other hand, working only with known protein structures is intrinsically limiting. Therefore, for the field to move forward and incorporate inferred protein structure, improvements to AF-inspired models are needed to obtain a highly accurate yet scalable method for PPI predictions.
While this subfield is still evolving, a few papers have successfully combined attention networks with protein structure information [57,103]. So far, models that incorporate known protein structures seem to generate better results than those using predicted structures [57,103]. Yet the AF model is an important step forward for the PPI prediction field, as it allows structural information to be extracted for proteins with only a sequence available. The attention mechanism seems to provide the best performance: one recent study found that a GAT outperformed a GCN [103]. However, obstacles hindering faster progress include MSA construction, because the process of finding appropriate homologous sequences is computationally expensive [102].

F I G U R E 6
General pipeline of PPI prediction models using co-fractionation coupled to mass spectrometry (CF/MS) data. The input is a pair of protein profiles. A feature extraction step processes the data to reduce the influence of experimental noise. The processed profiles are converted to correlations and then used as input to a PPI prediction step, such as a machine learning (ML) model, which outputs a PPI prediction.

CF/MS DATA FOR PPI PREDICTIONS
Biochemical co-fractionation coupled to mass spectrometry (CF/MS) is a powerful experimental method for mapping PPIs at large scale that critically depends on extensive computational modeling. In CF/MS experiments, intact soluble protein complexes in a cellular lysate are fractionated by high-performance liquid chromatography (HPLC) prior to proteolysis and standard denaturing liquid chromatography mass spectrometry (LC-MS) [104]. Since subunits of stable multi-subunit complexes are expected to co-fractionate, bioinformatics pipelines can compare pairwise sets of protein profiles, with highly correlated pairs used to predict PPIs. One key advantage of CF/MS is that it can be used to examine different experimental contexts.
However, a major challenge is chance co-elution, which can lead to incorrectly predicting interactions among functionally unrelated proteins.Multiple pipelines have been derived to interpret CF/MS data to predict PPIs by addressing the issue of chance co-elution [49,60,65].
From a computational perspective, the individual protein profiles recorded by CF/MS can be represented along three dimensions: peptides (detected proteolytic sequences), fractions (HPLC retention times), and conditions (experimental variables) [65]. Two key steps are data processing, to remove inherent sources of experimental noise, and protein correlation analysis. ML models such as SVMs, RFCs, and naive Bayes classifiers have been introduced to interpret CF/MS data [46,49,60,65]. The inputs to these models are pre-processed MS data used to predict PPIs. Some pipelines use additional information, such as functional annotations, to help eliminate spurious correlations [65]. Below, we discuss some current strategies for CF/MS data processing and PPI prediction, noting how the field can leverage recent advances in neural networks. Figure 6 illustrates the general methodology of models that use CF/MS protein profiles for PPI prediction.

Data processing
Data processing is an important step in CF/MS analysis, since it aims to enhance the signal-to-noise ratio during subsequent correlation analysis and thus better detect PPIs. A multitude of strategies have been introduced, including data normalization, correlation analysis, and signal processing. For data normalization, the data are scaled to account for spurious measurement variations; Z-score scaling and fitting Gaussian models are examples of ways to reduce experimental variance [60,65,105].
Because interacting proteins co-elute, a common strategy is to compute correlations between protein elution profiles. Correlation metrics include Jaccard, Euclidean distance, and Bayes correlation [65].
Determining coordinated changes in protein abundance (e.g., HPLC peak height, width, retention time) is another strategy for characterizing CF/MS profiles for PPI predictions [46,60].
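As an illustration of these two steps, here is a minimal sketch of Z-score scaling followed by Pearson correlation (one common profile-similarity choice alongside the metrics listed above) applied to toy fraction-by-fraction abundance profiles:

```python
import statistics

def zscore(profile):
    """Z-score scale an elution profile to dampen run-to-run abundance shifts."""
    mu = statistics.fmean(profile)
    sd = statistics.pstdev(profile)
    return [(x - mu) / sd for x in profile]

def pearson(a, b):
    """Pearson correlation between two elution profiles (via their z-scores)."""
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

# Toy abundances across 7 HPLC fractions: p1 and p2 co-elute (same peak
# shape, different scale), while p3 peaks in different fractions.
p1 = [0, 1, 5, 9, 5, 1, 0]
p2 = [0, 2, 10, 18, 10, 2, 0]
p3 = [9, 5, 1, 0, 0, 1, 5]
assert pearson(p1, p2) > 0.99   # co-eluting pair -> candidate PPI
assert pearson(p1, p3) < 0      # divergent elution behavior
```

In a real pipeline, high pairwise correlation is only evidence, not proof, of interaction: the chance co-elution problem discussed below is exactly why such scores are fed into a downstream classifier rather than thresholded directly.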

Case example of CF/MS pipeline: EPIC
One of the first tools to process CF/MS data, EPIC (elution profile-based inference of complexes) [65] calculates eight different correlation scores between all proteins profiled in an experiment. These values are then used as input to an ML engine (RFC and SVM) for PPI prediction. The classifiers are trained using annotated protein complexes obtained from the CORUM database: if two proteins are curated to the same complex, they are deemed an expected PPI. When fully optimized, EPIC achieves an overall accuracy of 0.65 when applied to C. elegans data [65]. Although EPIC's accuracy is not as high as that of some purely computational methods, partly because of chance co-elution, CF/MS data can reveal dynamic rewiring of PPI networks, which is hard to infer computationally.

Improving performance in CF/MS field
Notably, CF/MS-based models have yet to exploit advanced neural network architectures. There is therefore ample opportunity to improve the accuracy of CF/MS-based predictions by using neural networks and incorporating other sources of information. The first point to consider is that neural networks are capable of increasing the signal-to-noise ratio in CF/MS data. Autoencoders and transformers (see Section 2) are unsupervised and semi-supervised architectures, respectively, that reduce noise present in their inputs. This capability makes them ideal candidate architectures for a CF/MS prediction pipeline, as chance co-elution is one of the leading issues hindering CF/MS bioinformatics pipelines.
Transformers also incorporate attention into their architecture [73].
Another point to consider is the use of additional information to aid CF/MS predictions. There are multiple examples of PPI prediction pipelines that use additional information, such as EPIC's use of functional annotations [65] and GOSeqPPI's use of GO annotations [43]. CF/MS data in particular can benefit from additional information to reduce chance co-elution. Neural networks are useful here because they can seamlessly integrate multiple sources of information into a PPI prediction. As an example, a text-based neural network model was used to interpret GO annotations in the GOSeqPPI pipeline [43], and a similar approach could be taken with CF/MS data.
Protein sequence models have shown success, and integrating protein sequences with CF/MS data is a feasible strategy to improve accuracy in CF/MS predictions.

DISCUSSION
Overall, attention networks have been impactful in the PPI prediction field, and most models the field is currently publishing involve attention. PITHIA demonstrated that the transformer architecture with attention yielded the best results among the architectures tested [67]. Therefore, for the foreseeable future, models will need to incorporate an attention mechanism in order to compete with the current state of the art.
Reviewing the current state of the PPI prediction field, AF has clearly had an important impact on the work being done. Neural networks that use protein sequence and/or protein structure began adopting attention in their architectures after AF's publication. While AF has been influential for PPI prediction models using protein sequences and structures, the CF/MS field has barely started to move toward neural networks, leaving an opportunity to advance it through attention networks. Because of chance co-elution, it may be beneficial for the CF/MS field to build on PPI prediction architectures that use protein sequences or structures. Protein sequence models in particular have seen success, so adapting such an architecture and incorporating CF/MS data through an autoencoder, to enhance the signal-to-noise ratio, could be an important first step for the CF/MS PPI prediction field.
When comparing PPI prediction models that use protein sequences with those that use protein structure, sequence-based models seem to yield better results. This is not a reflection of their inherent superiority over structure-based models, however, but possibly an artifact.
Protein sequence models have a more consistent set of benchmarking datasets based on well-studied organisms. With these datasets and the wealth of models from the natural language processing field, the sequence-based PPI prediction field has the tools needed to create well-performing models. While protein structure models have not advanced as far, the AF Protein Database [94] paves the way for models applicable to many more proteins.
Overall, the PPI prediction field has advanced considerably within the past few years and has the potential to advance further in the years ahead. Protein sequence PPI prediction models have leveraged natural language processing models and created a solid foundation for the field to build on. It may be best for future models to start incorporating multiple sources of information, leveraging the advantages of each data type: amino acid residue information from protein sequences, protein characterization from protein structures, and different experimental contexts from CF/MS.

F I G U R E 3
Example attention mechanism. Each element in the weight matrix is the result of the dot product of the corresponding vectors. Adapted from Vaswani et al. [73].

Similar to RNNs, transformers are an encoder-decoder architecture, but they use semi-supervised learning, unlike the previously discussed methods. Transformers were developed to address the problems with RNNs and LSTMs: while RNNs process the input sequentially, leading to long training times, transformers process all elements of the input simultaneously, allowing faster computations. The attention mechanism was popularized by the transformer architecture, introduced by researchers at Google [73]. Simply put, the attention mechanism represents each element in the input as a numerical function of the other elements. This vectorization assigns relationships among the inputs, giving the neural network context for each element. Transformers use these contextualized vectors to decode into a new representation. Because of this, transformers are popular in language translation models, as they can capture the individual meaning and context of each word in a sentence and translate it into a different language. Both the transformer architecture and the attention mechanism have applications for PPI prediction. The transformer architecture takes in sequential data, like a protein's primary sequence, and produces an output representation. As discussed in the coming sections, the attention mechanism has become ubiquitous in the PPI prediction field, and for good reason.
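The dot-product weighting described above can be sketched in a few lines of plain Python (a didactic single-head version, without the learned query/key/value projection matrices of a full transformer):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q.K^T / sqrt(d)) applied to V.

    Each output vector is a weighted sum of the value vectors, with weights
    given by how strongly the query matches each key.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two toy 2-d tokens attending over themselves (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0]]
out = attention(tokens, tokens, tokens)
assert len(out) == 2 and len(out[0]) == 2
```

Because the softmax weights for each query sum to 1, every output is a convex combination of the value vectors: this is the "numerical representation of the other elements" that gives the network context.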
To assess how well a model performs on previously unseen data (i.e., generalization), one commonly splits the available data into training and testing sets, with model training taking place on the training set and model testing performed on the (previously unseen) test set. However, performing the train/test split requires care in the context of PPI prediction: one must avoid "leaking" information from the training data into the testing data. If the set of interactions is simply partitioned, many proteins will typically appear in both the training and testing sets, thereby leaking information from training to testing.
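A minimal sketch of a leakage-free split partitions the proteins, not the pairs (toy pair list below; pairs straddling the split are simply discarded, which is the price of avoiding leakage):

```python
import random

def protein_level_split(pairs, test_frac=0.3, seed=0):
    """Split PPI pairs so that no protein appears in both train and test.

    Proteins are partitioned first; a pair is kept only if both of its
    proteins fall on the same side of the split.
    """
    proteins = sorted({p for pair in pairs for p in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = int(len(proteins) * test_frac)
    test_proteins = set(proteins[:n_test])
    train = [p for p in pairs
             if p[0] not in test_proteins and p[1] not in test_proteins]
    test = [p for p in pairs
            if p[0] in test_proteins and p[1] in test_proteins]
    return train, test

pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
train, test = protein_level_split(pairs)
train_proteins = {p for pair in train for p in pair}
test_proteins = {p for pair in test for p in pair}
assert not (train_proteins & test_proteins)  # no protein leaks across the split
```

A naive split of the pairs themselves would let the model memorize protein-specific features during training and score inflated results on "unseen" pairs at test time.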
Sequence-based PPI prediction techniques have long used well-known ML tools such as SVMs and RFCs, but the field has shifted toward DL within the past few years because of the performance advantages neural networks offer. Coupled with recent advances in the attention mechanism, neural networks have become significantly more powerful than classical ML algorithms. Indeed, the vast number of architectures developed for text-based problems such as text mining and language translation can be leveraged for PPI prediction. Strategies such as transfer learning [81] and model adaptation allow an easy transition from text-based models to PPI prediction using pairs of protein sequences as input [42,48]. In addition, research has shown that data processing such as metric representations or embeddings (see Sections 3.1 and 3.2) can reduce experimental noise, suppress unimportant features, and highlight important ones, leading to reduced training time [82,83]. Below, we discuss the shift toward neural network-based models in the sequence-to-PPI prediction field and the strategies taken by different authors to improve prediction performance. Sequence-based models are distinguished mainly by the representation of the protein sequences used as input to the neural networks. Since input features constrain the neural network architectures that can be used, careful consideration must be given to how best to represent interacting polypeptides. The field has developed three encoding strategies for representing protein sequences: metric representations, text or neural network embeddings, or simply raw protein sequences as input. Figure 4 illustrates the general pipeline of models that use protein sequences alone.
ctP2ISP is divided into a local and a global block for semantic feature mining. The global block uses global metrics of the protein sequence along with the 500-amino-acid sequence of the protein, while the local block uses a 30-amino-acid fragment for interaction site prediction. Although both blocks contain transformer layers, the input to the global block is the entire 500-residue protein sequence, whereas the input to the local block is a 30-residue subsequence obtained with a sliding window. The sliding window allows for data augmentation, generating more data to train and test the model without accessing additional datasets. Full protein sequences are not strictly necessary for global PPI predictions but are most likely needed for determining local protein interactions. The results of the two blocks are concatenated and sent into fully connected layers for target classification. The advantage of ctP2ISP's architecture is that all protein sequence inputs are of fixed length, allowing the use of a CNN instead of an RNN, though, as noted above, other workarounds may better represent an entire protein sequence without sacrificing the ability to use other architectures.
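The sliding-window augmentation can be sketched as follows (toy sequence; the window and step values are illustrative, with ctP2ISP's local block reportedly using 30-residue fragments):

```python
def sliding_fragments(sequence, window=30, step=1):
    """Generate fixed-length subsequences with a sliding window.

    One protein sequence yields many fixed-length training examples, which
    both augments the data and lets a CNN (rather than an RNN) consume the
    input, since every fragment has the same shape.
    """
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, step)]

seq = "M" * 35  # toy 35-residue sequence
frags = sliding_fragments(seq, window=30, step=1)
assert len(frags) == 6                  # start positions 0..5
assert all(len(f) == 30 for f in frags)
```

One caveat of this augmentation is that overlapping fragments from the same protein are highly correlated, so they should be assigned to the same side of any train/test split to avoid leakage.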