Predicting new protein conformations from molecular dynamics simulation conformational landscapes and machine learning

Molecular dynamics (MD) simulations are a popular method of studying protein structure and function, but are unable to reliably sample all relevant conformational space in reasonable computational timescales. A range of enhanced sampling methods are available that can improve conformational sampling, but these do not offer a complete solution. We present here a proof‐of‐principle method of combining MD simulation with machine learning to explore protein conformational space. An autoencoder is used to map snapshots from MD simulations onto a user‐defined conformational landscape defined by principal components analysis or specific structural features, and we show that we can predict, with useful accuracy, conformations that are not present in the training data. This method offers a new approach to the prediction of new low energy/physically realistic structures of conformationally dynamic proteins and allows an alternative approach to enhanced sampling of MD simulations.


| INTRODUCTION
Molecular dynamics (MD) simulations of proteins are a popular method of studying aspects of protein function and dynamics. 1 They require input structure(s), which are preferably experimentally determined, usually by X-ray crystallography. However, as proteins are often highly flexible, they adopt multiple conformations, which interconvert over a wide range of timescales, 2,3 which can be predominantly longer than the feasible MD simulation length of ns-μs.
Enhanced sampling methods have been developed to improve the sampling of MD simulations, 4,5 but these do not offer a complete solution to the MD sampling problem, partly because some knowledge of the system is necessary to define the coordinates (eg, collective variables) along which sampling should be performed. Machine learning offers an alternative approach.
Machine learning (ML) has been successfully applied to the analysis of the high-dimensional data produced by MD simulations 6 and in structure prediction where an experimentally derived structure or homology model is not available. 7 Enhanced sampling techniques that use ML to guide the MD simulations (eg, by identifying collective variables and imposing biasing potentials) have also been developed [8][9][10][11][12][13][14] ; a conceptually simpler and more flexible approach is to utilize ML for the prediction of new protein conformations based on existing MD simulations, as has been recently demonstrated. This approach has recently been demonstrated using an autoencoder to encode the structural data into a low-dimensional representation, either onto the autoencoder's default latent vector 15 or using the sketch-map algorithm 16 to improve the interpretability of the low-dimensional representation. 17,18 New structures were then predicted by decoding points on the resulting low-dimensional surface.
Here we employ a related but different approach, to use a simple, pre-defined low-dimensional conformational landscape to guide the search rather than use the machine learning algorithm define the lowdimensional representation. The aim is not to create a more robust machine learning algorithm than those discussed above, but to explore whether a very simple representation of a MD-derived conformational landscape can successfully be used to predict new, physically plausible conformations. In principle, this approach could then be used with an arbitrary representation of the conformational landscape, which can consist of structural parameters of choice such as contact matrix, backbone dihedrals (as used in ref 17,18 ) or a combination of specific parameters. For this proof-of-concept study, an autoencoder was trained to map the structures onto two simple conformational landscapes and trained to decode points within this landscape into new structures. Two test cases are used, a short homoalanine peptide and the calcium-binding protein calmodulin (CaM). Two conformational landscapes descriptors were also used, the first two principal components of a 2D-RMSD matrix and two dihedral angles that describe the relative orientation of the two CaM globular domains. We show that it is possible to predict physically plausible conformations which were not sampled during the MD simulation(s).

| Molecular dynamics simulations
All simulations were performed in Gromacs 2016.4 19 using the Amber FF14SB 20 force field. Each system was solvated with a water box at least 13 Å larger than the peptide/protein on each axis with counterions (if required) generated in AmberTools 16. 21 All calculations used a periodic boundary condition and LINCS constraints on all bonds involving hydrogen atoms, the Verlet cut-off scheme with 10 Å cutoffs. Energy minimisation was followed by 100 ps of constant volume (NVT ensemble) and 100 ps of constant pressure (NPT ensemble; 1 bar) solvent equilibration, using the Parrinello-Rahman pressure coupling with a time constant of 2 ps, and positional restraints with a force constant of 10 kJ mol −1 nm −2 applied to the protein/peptide. Constant pressure was also used for the subsequent unrestrained production run, and all simulations were run at 300 K.

| Conformational landscape
Our machine learning algorithm takes a conformational landscape in the form of a series of vectors, as input. For initial development and testing, a simple conformational landscape was defined based on the 2D-RMSD matrix, a square matrix of RMSD values for every structure relative to every other structure (ie, each cell is the pairwise RMSD between structures and the diagonal elements are therefore 0). For m total structures, the 2D-RMSD matrix is an m × m matrix, and principal components analysis the results in m eigenvectors, or principal components (PCs). We then used the top two PCs (those with the largest eigenvalues) to define a 2-dimensional conformational landscape, although the input is not limited to 2-dimensional vectors, so a more complex, multidimensional landscape can be used by using additional PCs. For further validation of the method we used a conformational landscape defined by a pair of dihedrals, which describe the relative conformations of each CaM globular domain.

| Machine learning
Our code and data for model 1 are available at https://github.com/ Imay-King/MDMachineLearning. The protein structures (Cartesian coordinates) were first extracted from MD simulations using the MDanalysis package. 22,23 The structures were then aligned to the starting structure by minimizing the RMSD for the same atom selection (model 1: all atoms; model 2: heavy backbone atoms) subsequently used for the ML, and the Cartesian coordinates were normalized using MinMax scaling. For the complete set of protein where n is the number of atoms per structure and m is the total number of structures, the normalized coordinates for atom i in structure k are given by: Note that we also tried using z-score normalization, which is suitable for Gaussian distributions, but this performed poorly (in terms of the final predictions) as the 2D-RMSD matrix projected onto principal components eigenvectors are not normally distributed.
Our modified autoencoder was built in Python 3.6 using Keras

| Model 2
As a more biochemically relevant example we turned to CaM, which is known to adopt several distinct conformational states. 28 the testing set, and for the 800 structures in the testing set the average RMSD between the predicted structure and the target was 0.90 ± 0.71 Å, which is similar to that observed for the L-Ala 13 peptide. We also tested the effect of different numbers of layers in the autoencoder (SI Figure S4), with very similar results, although the loss convergence was significantly less smooth with five layers.
As before, we then tested whether our algorithm can predict structures that are distinct from those in the training set by repeating the prediction for seven structures in different regions of the PCA plot (Figure 3(A)-(G)). For each of the predicted structures, the training set consisted of the MD simulation from which that particular target structure did not originate; that is, when predicting structures taken from the MD1 simulation the model was trained only on structures from MD2, and vice versa. We again experimented with the effect of different numbers of layers in the autoencoder, and found that overall the 3-layer model performed best (SI Figure S5).
For the seven predicted structures, the average RMSD relative to the target structures is 1.89 ± 0.81 Å, and even for the worst predictions (b and f) the overall gross structural features were successfully predicted.
The target CaM structures in Figure 3 are compared with the most similar structure from the training set in Table 1. The predicted structures have conformations that are not found in the training set, and in each case the RMSD to the target structure is smaller than the minimum RMSD to the structures in the training set. The two structures with the biggest improvement (a and e) are shown in Figure 4.  2 dimensions) which would allow more complex conformational space to be mapped in higher dimensions.
It is important to note that by necessity (so that target structures can be defined for comparison), the PC space for each prediction analyzed so far ( Figure 3 and Table 1) originated from the PCA of the 2D RMSD matrix for the entire simulation (training + test data). This means that some sampling information in the test data are retained in the principal components of the training data. We will address this point below. Firstly, we address the issue that model 2 does not include the sidechains in the machine learning, because this is more computationally efficient and also forces the PCA to describe gross tertiary structure/conformational space without the added complication of multiple side chain conformations. In principle there are several ways to use the predicted backbone structures for additional modeling: input geometries can be generated by building in the sidechains using rotamer libraries, [30][31][32][33] through partial structural alignments with the original MD simulation data, or by using techniques such as steered MD 34,35 to rapidly drive the MD simulation to new predicted conformations. We chose to rebuild the sidechains of the predicted CaM structures using the protein sidechain prediction algorithm in SCWRL4. 36 To benchmark this approach, we rebuilt the sidechains for structures a-g in Figure 3, which resulted in an average RMSD between the rebuilt and original structures of 2.99 ± 0.02 Å (SI Table S1). However, since structures taken from an MD simulation are typically high-energy structures with non-optimal sidechainsidechain interactions (at 300 K only a small minority of conformations sit at the bottom of the potential energy well) that SCWRL4 is not designed to reproduce, we then energy minimized the sidechains of both the original and rebuild structures using the FF14SB force field in Amber using implicit solvation (5000 steps of steepest descent with a harmonic constraint of 500 kcal mol −1 A −2 on the backbone atoms). This decreased the average RMSD to 1.25 ± 0.80 Å, suggesting that this approach is able to rebuild the sidechains and generate structures that are physically realistic, with a strong correspondence between the original and rebuilt structures.
From the NMR structure of apo CaM (PDB ID: 1LKJ) we can see that the ensemble of structures covers more conformational space than is sampled during MD1 and MD2 simulations, due to extensive domain motion (SI Figure S6). However, the first two PCs of the 2D RMSD matrix does not adequately capture this sampling (SI Figure S7A), suggesting that in this case 2D RMSD captures more intra-domain structural changes than domain motion. In order to predict new structures, we therefore chose to employ a different conformational landscape defined by two dihedral angles, θ 1 and θ 2 , which F I G U R E 4 Overlay of the predicted (red) and target (green) structures a and e from Figure 3, with the most similar structure from the corresponding training set (blue), and a range of structures from the training set (the structrues in Table 1 Figure S7). We defined a regular grid of (θ 1 ,θ 2 ) values and mapped this over the conformational space described by (θ 1 ,θ 2 ) for the combined MD1 and MD2 simulations (SI Figure S8). The combined MD1 and MD2 simulation data were used for training using this new (θ 1 ,θ 2 ) conformational landscape descriptor and the regular grid was used for subsequent prediction. The (θ 1 ,θ 2 ) values of the predicted structures were often observed to differ from their target values, so that the majority of predicted structures subsequently lie near or within the conformational space of MD1 and MD2 ( Figure 5(A)). This suggests that the autoencoder will not arbitrarily predict a structure in a region of conformational space for which there is insufficient data for successful extrapolation. There is, however, a large region of predicted structures with distinct (θ 1 ,θ 2 ) values. Only some of these could be successfully energy minimized after sidechain reconstruction using SCWRL4, with others failing due to steric clashes. Clearly there is room for improvement here, for example, using structural costfunctions based on C α -distances and dihedral angles as employed by the EncoderMap algorithm, 18 or possibly by performing the ML with the entire protein (without sidechains removed). Nevertheless, using this method we were able to identify a predicted structure, indicated by a red circle in Figure 5(A), which is more similar to a structure from the apo CaM NMR ensemble (PDB 1LKJ) than to any of the structures from MD1 or MD2 (SI Figure S9). Starting from this predicted structure (with side chains reconstructed) as the input geometry, we ran an additional 100 ns MD simulation, which results in a much greater coverage of conformational space compared to the initial MD simulations ( Figure 5(C)).

| CONCLUSIONS
In summary, we have demonstrated a proof-of-principle method of combining MD simulation with machine learning to explore a userdefined, arbitrary conformational landscape. An autoencoder maps snapshots from MD simulations onto the conformational landscape, and we show that we can predict, with useful accuracy, conformations that are not present in the training data. This method allows the prediction of new physically realistic structures of conformationally dynamic proteins that can be used for enhanced sampling of MD simulations, by rapidly generating new structures from which additional MD simulations can be initiated for a more efficient search through conformational space.