Recent developments in deep learning applied to protein structure prediction

Abstract Although many structural bioinformatics tools have been using neural network models for a long time, deep neural network (DNN) models have attracted considerable interest in recent years. Methods employing DNNs have had a significant impact in recent CASP experiments, notably in CASP12 and especially CASP13. In this article, we offer a brief introduction to some of the key principles and properties of DNN models and discuss why they are naturally suited to certain problems in structural bioinformatics. We also briefly discuss methodological improvements that have enabled these successes. Using the contact prediction task as an example, we also speculate why DNN models are able to produce reasonably accurate predictions even in the absence of many homologues for a given target sequence, a result that can at first glance appear surprising given the lack of input information. We end on some thoughts about how and why these types of models can be so effective, as well as a discussion on potential pitfalls.

[…] 3 and JPred. 4 The field has recently seen a surge of interest relating to the use of deep neural network (DNN) models. DNN models have shown excellent performance in image- and language-based problems, to name a few. 5 Very recently, this excellent performance has been seen to extend to some specific CASP areas. The first application area where DNNs have had a major impact on CASP was arguably residue-residue contact prediction, which saw a particularly marked improvement in accuracy in CASP12 and 13.
In CASP13, a few groups extended these techniques further to the prediction of interatomic distances, which in some cases could then be used directly for accurate tertiary structure generation. [6][7][8][9][10] Although not currently an area of direct interest in CASP, deep learning is also starting to show a lot of promise in the area of protein design. 11,12 This article is not intended to be a detailed exposition of every key deep learning concept; our aim is instead to provide CASP participants and observers with a working understanding of the most important DNN architectures that have been successfully applied to the core problem areas in recent CASP experiments.
We then discuss the advantages such models may have over those more traditionally used in various areas, and end with some thoughts on why and how these models work, their limitations, potential pitfalls, and their correct application. All discussion will be limited to supervised learning models, 13 as the most performant DNN models used in CASP so far have been of this type.

| DEEP NEURAL NETWORKS
Artificial neural networks have proven to be valuable in data modeling as they are known to be universal function approximators. This means that when configured and trained correctly, they can approximate any arbitrary continuous function to any desired accuracy. 14,15 In fact, the first universal approximation theorems were proved for NNs composed of just a single hidden layer (although the theory allowed for arbitrarily many hidden units in that layer). The theorems say nothing, however, about how one might discover the network parameters that achieve a particular level of approximation, and finding good parameters for a given model architecture is achieved by the process of training. To train a NN in a supervised fashion, besides the model itself, one requires a set of training examples (a paired collection of inputs and corresponding outputs), and a cost or loss function that measures how far away from the "correct" answer a given model is. Training a NN is achieved by random initialization of the network parameters, followed by an iterative process comprising: (a) a forward pass of the NN with the current parameters to arrive at its predicted output for a training example; (b) calculation of the loss or cost for the example in question; (c) backpropagation of the loss to determine its gradient with respect to each network parameter; and (d) updating of the network parameters in proportion to the negative of the gradients. The backpropagation algorithm was popularized by a seminal paper by Rumelhart et al, 16 although the underlying ideas are much older. 17 In general, having more artificial neurons in a model, organized in multiple layers, provides a model with a large number of adjustable parameters, and allows the model to express ever more complex functions. DNNs are, as the name suggests, composed of many layers of artificial neurons.
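The four training steps (a) to (d) can be made concrete with a small sketch. The code below is an illustration only, not taken from any CASP method: it fits the toy function y = x² with a one-hidden-layer tanh network in pure Python. The function name `train_tiny_network`, the toy data, the layer width, the learning rate, and the choice to average gradients over the whole training set before each update are all assumptions made for brevity.

```python
import math
import random


def train_tiny_network(n_hidden=8, epochs=500, lr=0.02, seed=0):
    """Fit y = x**2 on [-1, 1] with a one-hidden-layer tanh network.

    Returns the mean squared error recorded at the start of every epoch.
    """
    rng = random.Random(seed)
    xs = [i / 10.0 for i in range(-10, 11)]
    ys = [x * x for x in xs]
    n = len(xs)

    # Random initialization of the network parameters.
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b2 = 0.0

    history = []
    for _ in range(epochs):
        g_w1 = [0.0] * n_hidden
        g_b1 = [0.0] * n_hidden
        g_w2 = [0.0] * n_hidden
        g_b2 = 0.0
        total_loss = 0.0
        for x, y in zip(xs, ys):
            # (a) Forward pass with the current parameters.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(n_hidden)]
            pred = sum(w2[j] * h[j] for j in range(n_hidden)) + b2
            # (b) Squared-error loss for this example.
            err = pred - y
            total_loss += err * err / n
            # (c) Backpropagation: chain rule from the loss to each parameter.
            d_pred = 2.0 * err / n
            g_b2 += d_pred
            for j in range(n_hidden):
                g_w2[j] += d_pred * h[j]
                # tanh'(z) = 1 - tanh(z)**2
                d_pre = d_pred * w2[j] * (1.0 - h[j] ** 2)
                g_w1[j] += d_pre * x
                g_b1[j] += d_pre
        history.append(total_loss)
        # (d) Move every parameter a small step against its gradient.
        for j in range(n_hidden):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2
    return history
```

Calling `train_tiny_network()` returns the loss curve; comparing its first and last entries shows whether the iterative updates reduced the training error.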
There appears to be no consensus for how many hidden layers a network needs to have before it can be termed "deep"; 17 a rule of thumb is that two or more hidden layers are sufficient. Of course, practical DNN models usually have many more than two hidden layers. Although an effective procedure for training these multilayer networks was developed quite early on, 18 DNNs were rarely used in practice due to difficulties in training them; training a network of more than two hidden layers with the conventional sigmoid activation function frequently suffers from the so-called vanishing gradient problem. Hochreiter et al 19 describe this problem in the context of recurrent architectures, but the underlying problem is the same for deep feedforward architectures: as training the network parameters depends on the gradient of the loss function with respect to these parameters, the gradient in the earlier (nearer to the input) layers is the product of the gradients of all intermediate activations leading up to the output. This means that when these intermediate gradients are consistently small or large, the resultant gradient in early layers can vanish (approach zero) or even sometimes explode (approach infinity) if the network weights are not properly tuned.
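The multiplicative nature of the problem can be seen in a deliberately simplified sketch: a chain of layers with one unit each, all sharing the same weight. The function `chain_gradient` and its arguments are hypothetical names chosen for this illustration. Because the derivative of the sigmoid never exceeds 0.25, every sigmoid layer with a weight of 1 shrinks the gradient by at least a factor of four, whereas a ReLU unit with a positive input passes the gradient through unchanged, and weights larger than 1 can make the product grow instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(depth, activation, weight=1.0, x=0.5):
    """Return d(output)/d(input) for a chain of `depth` one-unit layers."""
    grad = 1.0
    a = x
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)  # derivative of the sigmoid, at most 0.25
        elif activation == "relu":
            a = max(0.0, z)
            local = 1.0 if z > 0 else 0.0
        else:
            raise ValueError("unknown activation: " + activation)
        # Chain rule: multiply by this layer's local derivative.
        grad *= weight * local
    return grad
```

For example, `chain_gradient(10, "sigmoid")` is bounded above by 0.25 raised to the tenth power, while `chain_gradient(10, "relu", weight=2.0)` doubles at every layer.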
An early solution to this problem was proposed by Hinton, 20 where deep networks were trained layer-by-layer using a mixture of supervised and unsupervised learning. Ultimately, this difficulty was addressed more easily by a series of seminal works that introduced new activation functions such as rectified linear units (ReLU), 21,22 new weight initialization schemes, 23 and other innovations such as batch normalization 24 and residual architectures 25 to better enable the training and use of increasingly deep NN models. These advances occurred side-by-side with advances in computing hardware, specifically the availability of affordable, fast graphics processing units (GPUs), which can also perform the mathematical operations used by NNs in a massively parallel fashion. 26 […] 30 Caffe, 31 TensorFlow, 32 Keras, 33 Lasagne, 34 Torch, 35 and PyTorch. 36 Most of these frameworks also implement reverse-mode automatic differentiation, 37 a feature that hugely accelerates the application development cycle. The training of NN models by the backpropagation algorithm requires the calculation of the derivative of the loss function with respect to each parameter in each layer, and this is managed automatically by the automatic differentiation framework, using the same declarations used to build the NN model in a program. Thus, there is no need to write separate expressions for the forward and reverse passes of the network, as the latter is computed from the former. This allows one to quickly experiment with different architectures for a model. It has reached the point where, within reason, as long as the NN architecture can be expressed in code (Python most usually), the network can be simulated and trained.
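To illustrate how the reverse pass can be derived from the forward declarations alone, here is a minimal sketch of reverse-mode automatic differentiation for scalar addition and multiplication. It is a toy written for this article, not the implementation used by any of the frameworks named above; the class name `Value` and its attributes are assumptions.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the graph so each node is processed after all of its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node._parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                # Chain rule, accumulated over every path to the output.
                parent.grad += local * node.grad
```

Only the forward expression is written by the user, for instance `f = x * y + x` with `x = Value(2.0)` and `y = Value(3.0)`; calling `f.backward()` then fills in `x.grad` and `y.grad` without any hand-written derivative code.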

| CONVOLUTIONAL NEURAL NETWORK (CNN) MODELS
In the most basic implementation of a NN, all layers of artificial neurons are fully connected, that is, the output of any neuron in a prior layer is fed to the input of every neuron in the next layer. Convolutional nets act on 2D image-like inputs (but can also be applied to 1D and 3D data) by applying small filters or kernels to colocated groups of pixels in the image (see Figure 1). Each filter can actually be thought of as a small single-layer NN (perceptron), where the values in the filter are trainable weights. Although the filter is applied to every pixel in the input, the weights are shared across the whole image, and so it is equivalent to defining one single NN and applying it at every row and column position. Functionally, there is no difference between doing this and using a single sliding-window approach, but there are clear efficiency gains from the use of convolution as it allows massive parallelism. In CASP11 and CASP12, for example, MetaPSICOV 38,39 made use of two NNs, each of which used a shallow fully connected NN and a traditional sliding-window approach to produce competitive performance in contact prediction. With the exception that convolutional filters typically use just a single hidden layer, algorithmically the sliding window approach in MetaPSICOV is equivalent to a convolutional operation, just far less computationally efficient.
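The equivalence between a convolution and a sliding window with shared weights can be sketched directly. The function below is an unoptimized illustration with no padding, a stride of 1 and no activation function; the name `conv2d_valid` and the toy image are assumptions. As in most deep learning libraries, the filter is applied without flipping (strictly a cross-correlation).

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide one filter over a 2D input, reusing the same weights everywhere."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Sum of pointwise products between the filter and the window.
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output


# A toy image with a vertical edge and a filter that responds to such edges.
image = [[0, 0, 0, 1, 1] for _ in range(4)]
edge_filter = [[-1, 0, 1] for _ in range(3)]
response = conv2d_valid(image, edge_filter)
```

The filter has nine weights regardless of the image size, whereas a fully connected layer mapping the same 4 × 5 input to a 2 × 3 output would need 120 weights.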
After training, the outputs from a convolutional layer operating on image-like inputs are also image-like, but now convey extra information, for example, the presence of an edge between two objects in the image. Convolutional architectures are suited to data that exhibit some form of spatial structure, such as images or covariance matrices.
The filter weights are the same for each output pixel, meaning the network can recognize local features regardless of their spatial location in the input. As the same filter is moved across the input image to generate the output, fewer adjustable parameters are needed (as compared to a fully connected layer). Multiple filters (channels) can be learned in a single convolutional layer, each recognizing a different pattern within the data.
An important parameter of CNNs is the receptive field. This simply refers to the area of the input image (or more generally, the input feature set) that can be "seen" at any one time. Concretely, the receptive field is the spatial extent of the inputs that are used in the calculation of a single output value, and is typically calculated for a single neuron in a given convolutional layer in the network (most commonly the last). Output neurons in a network comprising a single layer of 3 × 3 filters would have a 3 × 3 receptive field, as the final calculation carried out by the network for each output pixel only considers a central pixel and its immediate neighbors in the input (Figure 1). Composing a model with successive convolutional layers, however, can grow the receptive field, that is, the area around each input pixel that can be included in calculating an output in the final layer (see Figure 2A). A caveat is that the size of the receptive field is bounded by the size of the input; a CNN can be configured to have a large receptive field by adding more convolutional layers, but if it only ever operates on inputs with spatial dimensions of, say 32 × 32, then the receptive field can only grow to a maximum size of 32 × 32, regardless of the number of layers, even though its "theoretical" receptive field may be much larger. In practice, the maximum receptive field needs to be large enough to capture the relevant structures in the input data.
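The growth of the receptive field with depth follows simple arithmetic, sketched here for stacks of ordinary convolutional layers with a stride of 1 (an assumption; strides or pooling would change the formula). Each layer with filter size k extends the field by k − 1. The function names are hypothetical.

```python
def receptive_field(kernel_sizes):
    """Theoretical receptive field of stacked stride-1 convolutional layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def usable_receptive_field(kernel_sizes, input_size):
    """The field can never exceed the spatial extent of the input itself."""
    return min(receptive_field(kernel_sizes), input_size)
```

One 3 × 3 layer gives a field of 3, two give 5 and three give 7; a stack of twenty such layers has a theoretical field of 41, but `usable_receptive_field([3] * 20, 32)` is capped at 32 for a 32 × 32 input.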
Dilated convolutions 40 can also be used to increase the receptive field with far fewer layers. In a dilated convolution, each filter is "stretched" by including spaces between each pixel (Figure 2B). A 3 × 3 filter with a dilation rate of 2 would actually cover the same area as a 5 × 5 filter, but with only nine learnable parameters rather than 25.
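The footprint of a dilated filter can be worked out in a few lines. This sketch assumes odd filter sizes and a stride of 1, and the function names are hypothetical. It lists the input positions a dilated filter samples, its effective size, and the receptive field of a stack of dilated layers; the doubling schedule of dilation rates used in the example is one common choice, not a description of any specific CASP method.

```python
def dilated_offsets(kernel_size, dilation):
    """Input positions (relative to the centre) sampled by one dilated filter."""
    half = kernel_size // 2
    steps = range(-half, half + 1)
    return [(dr * dilation, dc * dilation) for dr in steps for dc in steps]


def effective_kernel_size(kernel_size, dilation):
    """Width of the area spanned by the stretched filter."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 layers with the given dilation rates."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf
```

With a 3 × 3 filter and a dilation rate of 2, `effective_kernel_size` gives 5 while `dilated_offsets` returns only nine positions. Four layers with rates 1, 2, 4 and 8 reach a field of 31, compared with 9 for four undilated layers with the same number of weights.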
The downside would be that the dilated filter will only be able to sample nine out of the 25 pixels and so will have "gaps." However, these gaps can be filled by later dilated layers, so a network built with a mixture of dilated filters can cover an arbitrarily large receptive field without requiring an exponentially growing number of learnable parameters. In CASP13, dilated convolutions were used in a number of the top-performing CNN models. 7,41,42
FIGURE 1 A 2D convolutional filter (orange) is applied to an input layer (blue) to obtain the values for an output layer (green). The output value (−8 in this example) is the sum of the pointwise products of the filter weights and the corresponding elements in the input (the bias is zero in this example and no nonlinear activation function is used). The same set of filter weights is used to generate the output values at every placement of the filter on the input.
FIGURE 2 A, Illustration of the growth of the receptive field of a 2D CNN as convolutional layers are added. The 6 × 6 grids represent the output from three consecutive convolutional layers with filter sizes of 3 × 3, and information flows from layer 1 to layer 3. A single output at layer 3 (yellow cell) is obtained using a 3 × 3 window of inputs from layer 2. Each of these nine cells in layer 2 uses a 3 × 3 window of values from layer 1. These windows overlap, and the set union of the cells used by the highlighted cells in layer 2 is marked on layer 1 (5 × 5 grid). Thus, from the point of view of each output cell in layer 3, the receptive field is 5 × 5 cells in layer 1. B, A single dilated convolutional filter is shown, with a 3 × 3 filter and a dilation rate of 2. This layer has a receptive field of 5 × 5 despite having only nine adjustable weights. Stacking dilated convolutional layers allows the receptive field to grow exponentially using a linearly increasing number of parameters. In contrast, both the receptive field and the number of adjustable parameters grow linearly when using regular convolutional layers, as shown in A.
Typical CNN models (eg, for image classification) take the output of one or more convolutional layers and usually downscale them with a "max pooling" operation. Max pooling simply looks for the maximum value within an area, but this operation reduces the size/resolution of the image. Ultimately, the final max pooling output is used as input to one or more fully connected layers. The output (using a softmax function usually) of the last fully connected layer will represent the output of the network, which is typically a classification of the input image into a fixed number of predefined categories ("cat" or "tree" for example). It is, however, also possible to have CNN models that take in image-like inputs and produce image-like outputs. This is achieved using fully convolutional networks (FCNs; not to be confused with fully connected networks), 43 which are simply composed of a stack of convolutional layers all the way up to the output, omitting max pooling or fully connected layers that either change the image resolution or lose the image structure. Thus, an attractive property of FCNs is that they can be configured to produce output images of the exact same dimensions as the input. An example application of such a setup is to take in an image and produce an identically sized output image that highlights particular objects in the input image, which is known as image segmentation. In structural bioinformatics, this type of architecture has been used to great effect in contact prediction by a number of groups, 42,[44][45][46][47][48] where the inputs to the network are one or more features dependent on the (squared) length of the target sequence (eg, amino acid covariance matrices), and produce outputs (contact maps) of the same shape.
[…] short-cuts can be used that can bypass some layers and provide information from earlier layers directly to later layers 25 (see Figure 3).
These so-called residual NNs (ResNets) are becoming the standard architecture for training very deep CNNs, and have been used in the best-performing methods for contact and distance prediction in CASP, 9,10,41,42,47,48 including those by the Zhang and A7D groups in CASP13.
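The short-cut idea can be sketched as a block whose output is its input plus a learned correction, y = x + F(x). For brevity this illustration uses two small fully connected layers for F rather than the convolutional layers used in practice, and omits batch normalization; the function names and weight layout are assumptions.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(values, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Return x + F(x), where F is a small two-layer transformation."""
    hidden = relu(dense(x, w1, b1))
    correction = dense(hidden, w2, b2)
    # The short-cut adds the unmodified input back onto the layer output.
    return [a + c for a, c in zip(x, correction)]
```

If all weights in F are zero, the block reduces to the identity, so stacking many such blocks does not by itself degrade the signal, and the gradient always has a direct path back to earlier layers through the addition.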
As we briefly mentioned earlier, CNNs are not just limited to 2D problems. Also of use recently have been 3D CNNs, where the input is generally a representation of the protein tertiary structure and the output is a measure such as an estimation of model accuracy 50

| CNNs in contact prediction
In contact prediction, CNNs have proven themselves to be significantly more effective at the problem than the global statistical models that created a lot of excitement in the field just a few years ago.
Examples of such global models include direct-coupling analysis (DCA), 53-55 pseudolikelihood maximization, [56][57][58][59][60] and sparse inverse covariance estimation (SICE). 61 As mentioned previously, given the correspondence between residue covariance matrices and contact maps, it is natural to treat these as image-like inputs in order to derive a mapping. CNNs are ideally suited to such prediction problems, as the key idea in convolutional layers is to recognize local patterns regardless of their spatial position in the input. Taking this idea into the realm of contact prediction, applying convolutional filters to an amino acid covariance matrix, say, allows the model to detect interactions between local sequence motifs that are separated by an arbitrary number of residues, 45 which corresponds nicely to observed structural patterns (eg, variable length loops or even entire domain insertions can be accommodated with no changes needed to the model).
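As an illustration of the kind of image-like input discussed here, the sketch below computes the empirical covariance between one amino acid at alignment column i and another at column j, that is, the joint frequency minus the product of the marginal frequencies. It is a minimal sketch that ignores sequence weighting, gaps and pseudocounts, and it is not claimed to match the feature calculation of any particular method; the function name and the toy alignment are made up for the example.

```python
def pair_covariance(msa, i, j, a, b):
    """Cov of indicator variables: f_ij(a, b) - f_i(a) * f_j(b)."""
    n = len(msa)
    f_i = sum(1 for seq in msa if seq[i] == a) / n
    f_j = sum(1 for seq in msa if seq[j] == b) / n
    f_ij = sum(1 for seq in msa if seq[i] == a and seq[j] == b) / n
    return f_ij - f_i * f_j


# Toy alignment: column 0 is conserved, columns 1 and 2 change together.
toy_msa = ["AKD", "AKD", "AEK", "AEK"]
```

For the toy alignment, `pair_covariance(toy_msa, 1, 2, "K", "D")` is positive because K and D always occur together, whereas any pair involving the conserved column 0 has zero covariance. Evaluating this for every pair of columns and every pair of residue types yields an L × L map with many channels, which is the image-like input a 2D CNN can operate on.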
The fact that CNN models, in which the key functional units are designed to only use local subsets of the data, can outperform global models in which all residue covariation data are considered simultaneously, can at first glance seem surprising. On the other hand, the ability to stack successive convolutional layers to increase the overall receptive field of the model theoretically allows the model to use as much of the covariation data for a target protein as necessary when predicting individual contacts. In a recent work, 45 we created CNN models with varying sized receptive fields in order to assess whether a completely global view of the covariation data is necessary in order to achieve high precision when predicting contacts. We found that increasing the receptive field of the network led to increased precision, as might be expected, but significant gains were realized only up to a maximum receptive field size of 15 residues or so; further increases in the receptive field size (up to the evaluated maximum of 49) led to little or no gain in mean precision. The model, which we termed DeepCov, was also found to be significantly more precise than

| WHY IS DEEP LEARNING EFFECTIVE?
We now give some personal thoughts as to why we think DNN-based models are effective at various problems, as well as some potential pitfalls. In general, DNN models work well on data that exhibit structure such that some form of hierarchical parsing is both possible and meaningful. For this reason, DNN models are generally not particularly effective in modeling unstructured data, which is unfortunately very common in many areas of biology and medicine. By unstructured data, we mean data where there is no geometric relationship between the inputs, for example, features such as "cost," "height," "molecular weight," "radius of gyration," and so forth. In bioinformatics, an illustrative example of this is the DeepBipolar method 64 that used CNNs to tackle the Bipolar exome task at the fourth Critical Assessment of Genome Interpretation (CAGI) experiment. Despite using a complex CNN architecture, it performed only slightly better than traditional classifiers such as random forests, likely due to the unstructured nature of the inputs (presence or absence of particular gene variant calls), but also, to be fair, possibly due to the limited amount of available training data. One useful take-home message from this is that deep learning has only proven to be effective in a fairly narrow (but still important) range of problems, and is not going to be the best approach in every machine learning application area.

| Deep learning as a neighborhood density estimation method
In order to find better ways of using deep learning in future CASP […] In that scenario, the training data are effectively stored within the weights of the network and during inference the network simply assigns a weight to these stored patterns and produces as output a weighted average of the original training outputs. A network like this is essentially useless for making predictions because unless the input is really close to one of the training examples, it will simply produce as output an average of all of the original training outputs, which is unlikely to be informative. This is what we mean by overfitting in the context of NN training.
It is easy to see that an overtrained model would exhibit very low to zero training error and could be obtained by a relatively straightforward optimization of a loss function. A number of regularization techniques can be used to avoid overfitting, such as adding penalty terms to the loss function (eg, L1 or L2 penalties, which are commonly used in regression models [65][66][67] ), Dropout, 68,69 or early stopping, 70 to name a few. Besides regularization, the main key to avoiding overfitting is always to collect a lot more data, but some benefit can be gained from simply reducing the complexity of the network. Reducing the complexity of the network basically means reducing the number of adjustable weights or parameters, and so typically means either using fewer layers, and so producing a shallower network, or making each layer narrower, that is, reducing the number of weights in each layer. This model complexity reduction forces the NN to be more "creative" in the way it stores the training input-output pairings. It still tries to memorize the training data, but because it no longer has enough memory capacity to simply store the information, it is forced to make use of more complex representations of the data to store the same amount of data in less memory. This of course is recognizable as data compression, and so another useful way of looking at a NN is that it is learning how to compress the input training data. The better this compression is, the more we assume the network has learned about the underlying shape or structure of the data. Of course there must be limits to this, otherwise we would opt to use a NN with just a single adjustable weight to learn a data set of any size. As with any compression method, data can only be compressed so much until it is simply not possible to reconstruct the original input to any kind of useful accuracy. This is what we refer to as underfitting a NN model.
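Two of the regularization techniques mentioned above are simple enough to sketch in a few lines. The first function adds an L2 penalty to an already computed data loss; the second picks the epoch at which early stopping would halt, given a list of validation losses. The function names, the penalty strength `lam` and the `patience` rule are assumptions chosen for illustration; real training code would evaluate the validation loss on the fly rather than from a precomputed list.

```python
def l2_penalized_loss(data_loss, weights, lam):
    """Data loss plus lam times the sum of squared weights."""
    return data_loss + lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience):
    """Epoch with the best validation loss before patience runs out."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            break
    return best_epoch
```

With validation losses `[1.0, 0.8, 0.6, 0.65, 0.7, 0.75, 0.5]` and a patience of 3, training stops before the final improvement is seen and the parameters from epoch 2 are kept; a patience of 5 would instead continue to epoch 6. This shows that the patience value is itself a hyperparameter that trades training time against the risk of stopping too early.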
Considering the NN models used in A7D (DeepMind's AlphaFold), 10 RaptorX, 8 or our own DMPfold, 9 it is important to understand the limitations of what these models are capable of doing.
The first thing we notice about these models is that they employ a variety of sequence-derived input features. The majority of these features encode evolutionary information of various kinds, particularly direct coupling features, which have also had an impact on the last few CASP experiments. Indeed, DCA and SICE predictions have on their own been used successfully to fold proteins. 53,60,[76][77][78][79] It should therefore come as little surprise that models that take these features as input are successful at predicting contacts, as in these cases the model does not have to achieve much more than "clean up" the initial predictions by recognizing local contact patterns across the training set. However, some DNN-based contact prediction methods do not use direct coupling features, and instead operate on raw residue covariance matrices 45 or even the input multiple sequence alignment (MSA) directly. 80 Regardless of whether direct couplings, covariances, or the MSA itself is used, these features are arguably a very long way away from the simplest case of encoding a single amino acid sequence as the sole input. This means that a network using these features is almost certainly not learning anything specifically about the actual target sequence, but instead is learning statistical features of the family to which the target sequence belongs. This immediately creates a limitation on what the network can ultimately learn. If the inputs comprise information summarized from hundreds or perhaps thousands of different proteins, then it is clearly not ideal to train the network to associate this input with just a single target set of distances. Each member of the family will have a slightly different structure from every other member, and so this creates an inherent accuracy limit to deep learning-based modeling, at least in its current form. 
At best, the network can learn the structure of an "average" member of the family, but at worst the known structure used in training could turn out to be something of an outlier. In that case the network will likely model that family fairly inaccurately, as it has been provided with a highly biased sample of the ensemble of structures represented in the family. This bias will vary from protein family to protein family, but it is reasonable to guess that the average of these biases is going to be somewhere in the region of 3-5 Å RMSD. So, without many more samples of known structures, which would require a huge increase in the number of structures in PDB, covariation-based deep learning models will likely only ever directly produce predictions of around this accuracy. This perhaps emphasizes the increased importance of structure refinement to the future of protein structure prediction, as "fold level" modeling may well effectively become a solved problem within a few years.

| Robustness to missing or noisy inputs
An aspect of DNN-based predictors that can be surprising is that they seem able to produce reasonable predictions even when the inputs are noisy or sparse (incomplete). An example in contact prediction is when the input MSA only has very few sequences in it. 45 In contrast, when properly trained, DNN models can be quite robust to missing data. Figure 4 illustrates this idea using an image classification model that is available online. Given an input image, this model is trained to predict scores for a predefined set of concepts that describe the content of the image. Figure 4A shows the 10 highest-scoring predictions for the original image. In Figure 4B,C, a significant fraction of the image has been "censored" or more technically "ablated," mimicking a situation where one has missing data. The model clearly still returns reasonable predictions, although the overall accuracy is clearly reduced as more data get ablated. If we consider the task of contact prediction from just raw covariance data, when an input MSA has only a few sequences, the covariance estimates can only take on a limited set of values. In the realm of digital images, this is (very loosely) similar to using only a limited number of colors for an image. Once again, properly trained DNN models for image recognition can be surprisingly robust to this effect ( Figure 4D).
Robustness to missing or noisy data is not necessarily a new aspect of deep learning (DL) models; instead, the lack of such robustness can be considered an obvious failure mode of global models such as DCA, pseudolikelihood methods, and SICE. When the inputs to such models are sparse, one is forced to use techniques such as shrinking the covariance matrix or adding pseudocounts so that the model has complete information to work with. DL models can simply be trained to naturally deal with such missing data without needing it to be explicitly filled in.
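The two workarounds mentioned above can be sketched generically. The first function smooths an observed count with a pseudocount so that unseen residue types do not get a frequency of exactly zero; the second shrinks a covariance matrix toward a scaled identity matrix. Both are textbook forms written for illustration, and the exact schemes used by DCA, pseudolikelihood or SICE implementations may differ; the function names and the parameters `alpha`, `n_states` and `lam` are assumptions.

```python
def pseudocount_frequency(count, n_seqs, alpha, n_states):
    """Smoothed frequency: (count + alpha) / (n_seqs + alpha * n_states)."""
    return (count + alpha) / (n_seqs + alpha * n_states)


def shrink_covariance(cov, lam):
    """Blend a covariance matrix with its mean variance on the diagonal."""
    n = len(cov)
    mean_var = sum(cov[i][i] for i in range(n)) / n
    return [[(1.0 - lam) * cov[i][j] + (lam * mean_var if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]
```

With `lam = 0` the matrix is unchanged, and with `lam = 1` all covariation signal is discarded; the point made in the text is that a trained DL model does not need this kind of hand-tuned preprocessing.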
To better understand this, imagine an MSA with two distinct regions (this could be a two-domain protein). One region has many aligned sequences and few long gaps; the other region has few aligned amino acids and very many long gaps. An unsupervised covariation method such as PSICOV will struggle to produce an accurate contact map in this situation. This is because every decoupling equation in the calculation will include pairwise covariance terms from both the good and bad parts of the alignment. Thus, the whole predicted map will be of low accuracy because of the missing data in the bad section. A convolutional network method such as DeepCov, on the other hand, will have no such problems. The learned filters in the convolutional layers will still detect local patterns corresponding to real contacts in the good parts of the alignment, but will be unhindered by the noisy data from the bad part of the alignment. In the bad part of the alignment, the filters will simply detect few or no contacts. This is analogous to the way that convolutional nets outperform fully connected nets in image labelling tasks (see Figure 4, wherein a confident prediction for "breakfast" is still returned after much of the input image has been ablated). In the case of shallow alignments with more or less uniform coverage across the whole query sequence, the network would simply look at the data for those columns that do show some covariation, and possibly detect contacts in these regions. Additional benefits may be gained from the fact that the network has seen multiple proteins and MSAs during training, which is in contrast to the way in which global statistical models operate.
In fact, a convolutional network can be made even more robust by […] be sufficient in cases where structural information is not relevant; the problem is that sequences in the same family and with similar structures can have sequence identity well below 30%, and sometimes even 0%. In this case, a correct prediction by the network is likely to be repeating what it has already seen during training rather than making a true prediction of the task at hand. This problem affects prediction tasks including secondary structure prediction and contact or distance prediction, to name but a few. We can think of vanishingly few problems for which a split based on sequence identity is sufficient; one example is function prediction, in which two highly sequence-similar proteins can have very different functions.
While it is difficult to encourage researchers to adopt a more stringent procedure that is both harder to implement and makes their benchmarking results look worse, a shift to using rigorous training and testing splits is essential if we are to accurately assess the impact of deep learning-based methods, and for practitioners to trust the predictions provided by such models in the future.
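One way to implement a more stringent split is to assign whole groups of related proteins to either the training or the test set, so that no group contributes to both. The sketch below assumes each example already carries a group label; `family_id` is a hypothetical placeholder for whatever structural or evolutionary grouping is adopted, and the function name, test fraction and seed are likewise assumptions.

```python
import random


def split_by_family(examples, test_fraction=0.2, seed=0):
    """Split (protein_id, family_id) pairs so no family spans both sets."""
    families = sorted({family for _, family in examples})
    rng = random.Random(seed)
    rng.shuffle(families)
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [e for e in examples if e[1] not in test_families]
    test = [e for e in examples if e[1] in test_families]
    return train, test


# Hypothetical identifiers, used only to show the call.
examples = [("p1", "famA"), ("p2", "famA"), ("p3", "famB"),
            ("p4", "famC"), ("p5", "famC"), ("p6", "famD")]
train_set, test_set = split_by_family(examples, test_fraction=0.25)
```

The split is made over families rather than individual proteins, so the resulting test set may be larger or smaller than the requested fraction of examples; that imbalance is the price of guaranteeing that related proteins never straddle the two sets.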

| CONCLUSIONS AND OUTLOOK
Research in NNs and deep learning continues to develop at a rapid pace, with ideas for new architectures, training tricks, weight optimization algorithms and other tools appearing on a weekly and sometimes daily basis. Indeed, there are a number of very important ideas that we have not covered in this article. Perhaps the most important is that of recurrent architectures, which map sequences of data to other sequences. Recurrent NNs are widely used in the prediction of secondary structure, solvent accessibility, disorder and backbone torsion angles. [86][87][88] Recurrent architectures have also been used in contact prediction. 48,89 More recently, a recurrent architecture has been used to model tertiary structure. 90 This latter method has the attractive property of being end-to-end differentiable, meaning that all parts of the process from taking in the input features to predicting 3D coordinates (via predicted torsion angles) can be simultaneously optimized during the NN training process. Other methods such as deep reinforcement learning 91 and generative models (such as generative adversarial networks 92 and variational autoencoders 93 ) have not yet had a clear impact in CASP, but perhaps will in the future.
Deep learning is clearly taking bioinformatics by storm. As reviewed here, this is due to the ability of deep learning models to take into account different levels of structure in data, to deal with noisy data, to take in raw features without the need for feature engineering, and to interpolate sensibly to make reasonable predictions for data not used in training. This trend looks likely to continue for at least a few years for several reasons: constant improvements in hardware, architectures and algorithms; the ever-increasing amount of experimental data collected; and the increasing crossover between the machine learning and bioinformatics communities. As deep learning for bioinformatics moves into a more mature phase, it is essential that rigorous benchmarking and evaluation become more common in the published literature. Of course, the ultimate aim of bioinformatics is not just accuracy on prediction tasks but understanding of the underlying biological processes at work. As research into the interpretability of NNs improves, it would be beneficial for successful networks in bioinformatics to be interrogated to see which features and signals are important. Such understanding could even be used to help the networks themselves become more robust and accurate.

ACKNOWLEDGMENTS
We are grateful to members of the group for helpful comments and discussions.