Haruspex: A Neural Network for the Automatic Identification of Oligonucleotides and Protein Secondary Structure in Cryo‐Electron Microscopy Maps

Abstract In recent years, three‐dimensional density maps reconstructed from single particle images obtained by electron cryo‐microscopy (cryo‐EM) have reached unprecedented resolution. However, map interpretation can be challenging, in particular if the constituting structures require de‐novo model building or are very mobile. Herein, we demonstrate the potential of convolutional neural networks for the annotation of cryo‐EM maps: our network Haruspex has been trained on a carefully curated set of 293 experimentally derived reconstruction maps to automatically annotate RNA/DNA as well as protein secondary structure elements. It can be straightforwardly applied to newly reconstructed maps in order to support domain placement or as a starting point for main‐chain placement. Due to its high recall and precision rates of 95.1 % and 80.3 %, respectively, on an independent test set of 122 maps, it can also be used for validation during model building. The trained network will be available as part of the CCP‐EM suite.


Introduction
The resolution revolution in single particle electron cryomicroscopy (cryo-EM) yields macromolecular structures of unprecedented resolution. These structures allow us to identify new drug targets, for example in the Zika virus, [1] to fight tuberculosis [2] or to understand the fundamental processes of life, such as the process of translation by ribosomes. [3] However, modelling an atomic structure to these maps remains difficult as researchers mostly rely on algorithms developed for the interpretation of crystallographic electron density maps. In X-ray crystallography, the measured diffraction corresponds to the amplitudes of the Fourier transform of the electron density, as the X-rays interact with the electrons in the molecular assemblies in a crystal and the phases are reconstructed only during refinement. In cryo-EM, on the other hand, the measured micrographs already contain phase information, but are very noisy, which is overcome by 3D reconstruction and averaging. The individual micrographs show the interaction of the electron beam with the entire electrostatic potential of a single molecular assembly. Hence, cryo-EM reconstruction maps differ in both their nature and error distribution [4][5][6] from X-ray crystallographic electron density maps. Consequently, their modelling might be improved greatly by tools that consider these specific properties of the data at hand. Such modelling tools should not only provide good functionality, but also be easy to use and freely available to academic users worldwide.
Parallel to the advances in cryo-EM during the last decade, deep neural networks have achieved remarkable image segmentation capabilities, [7] making them the most powerful machine-learning approach currently available. Convolutional neural networks (CNN) combine traditional image analysis with machine learning by cascading layers of trainable convolution filters and are exceptionally well-suited for volume annotation. They have been successfully applied to biological problems, such as breast cancer mitosis recognition [8] and, in conjunction with encoder-decoder architectures, to volumetric data segmentation. [9,10] Given that a cryo-EM reconstruction map is essentially a three-dimensional image volume, CNNs seem a good choice for their annotation if good "ground truth" data to train the network could be provided.
In this work, we demonstrate that deep neural networks are not only capable of annotating protein secondary structure, but also oligonucleotides (RNA/DNA) in cryo-EM maps, and provide a pre-trained network, named Haruspex. Assigning a fold to regions in a cryo-EM map is the first step in modelling a structural region. This can be a major challenge, in particular for novice users, in low resolution regions, or when little is known about the composition of the macromolecular complex in question. Haruspex can be readily used to annotate cryo-EM maps, which will prove useful in model building and supporting the placement of known domain folds, thus accelerating the modelling process and improving the accuracy of cryo-EM-derived molecular structures.

Network Architecture and Implementation
In low-resolution cryo-EM maps, α-helices can often be discerned as long cylindrical elements. This has been exploited by the program helixhunter, [11] which searches for prototypical helices in reconstruction maps using a cross-correlation strategy. β-Strands are more difficult to identify as they are more variable in shape and therefore require morphological analysis. [12] A combination of these approaches led to the development of SSEHunter, [13] which uses a density skeleton to detect secondary structures. Deep learning offers an alternative approach: Fully convolutional networks [9,14] allow a swift generation of segmentation maps for objects of variable size. Here, we employ a state of the art U-Net-style architecture [9] to demonstrate that at an average map resolution of 4 Å or better, experimentally derived reconstruction maps allow the training of a well-performing network that can be used for a wide range of specimens with no re-training necessary. The network was implemented with TensorFlow [15] and processes 40³ voxel segments with a voxel size of (1.0-1.2 Å)³ (covering a secondary structure element and its immediate surroundings) to annotate 20³ voxel cubes (corresponding to the center of the input volume). The output volume has four channels containing the probabilities that the voxel is part of an α-helical or β-strand protein secondary structure element, nucleotide, or unassigned. 40³ voxel segments were chosen as a compromise between computational power and network complexity on one hand and covering the secondary structure including surrounding interaction partners on the other hand. A 40³ voxel segment covers (40-48 Å)³; an average α-helix with 10 residues, for example, is 15 Å in length. [16]
The input is a single channel containing the reconstruction density. During prediction, this three-dimensional volume is passed through multiple convolutional layers (image filters) that extract learned image features relevant for structure detection, and through pooling layers, which select the most significant of the detected features. In the second ("upconvolutional") part of the network, these activations are combined with higher-level activations of the network to recover spatial detail. The output has four channels representing the probabilities for the four classes (helix, sheet, nucleotide, unassigned) and represents the annotation of the central 20³ voxel cube of the input volume.
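Because each 40³ voxel input yields an annotation only for its central 20³ voxel cube, a whole map has to be covered by overlapping input windows. The following is a minimal sketch of one way to do this, not the Haruspex implementation: the zero padding, the stride of 20 voxels, and the names `annotate_map` and `centre_density_predictor` are assumptions made for illustration. The trained network is replaced by a stand-in function.

```python
import numpy as np

IN_SIZE, OUT_SIZE, N_CLASSES = 40, 20, 4   # sizes and channel count from the text
MARGIN = (IN_SIZE - OUT_SIZE) // 2          # 10 voxels of context on each side


def centre_density_predictor(window):
    """Hypothetical stand-in for the trained network.

    Copies the central 20^3 density into channel 0 so the tiling can be checked.
    """
    centre = window[MARGIN:-MARGIN, MARGIN:-MARGIN, MARGIN:-MARGIN]
    out = np.zeros(centre.shape + (N_CLASSES,), dtype=np.float32)
    out[..., 0] = centre
    return out


def annotate_map(density, predict=centre_density_predictor):
    """Tile a 3D map with 40^3 windows and collect the central 20^3 outputs."""
    nx, ny, nz = density.shape
    tiles = [-(-n // OUT_SIZE) for n in (nx, ny, nz)]   # ceiling division
    # Zero-pad so that every output cube has a full 40^3 input window (assumption).
    padded = np.zeros([t * OUT_SIZE + 2 * MARGIN for t in tiles], dtype=np.float32)
    padded[MARGIN:MARGIN + nx, MARGIN:MARGIN + ny, MARGIN:MARGIN + nz] = density
    probs = np.zeros([t * OUT_SIZE for t in tiles] + [N_CLASSES], dtype=np.float32)
    for i in range(tiles[0]):
        for j in range(tiles[1]):
            for k in range(tiles[2]):
                x, y, z = i * OUT_SIZE, j * OUT_SIZE, k * OUT_SIZE
                window = padded[x:x + IN_SIZE, y:y + IN_SIZE, z:z + IN_SIZE]
                probs[x:x + OUT_SIZE, y:y + OUT_SIZE, z:z + OUT_SIZE] = predict(window)
    return probs[:nx, :ny, :nz]             # crop back to the original map size
```

In this sketch, any callable that maps a (40, 40, 40) array to a (20, 20, 20, 4) array can be passed as `predict`.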

Training Data Selection
For network training, we pre-selected EMDB (Electron Microscopy Data Bank [17]) reconstruction maps with an average resolution of 4 Å or better as stated in the EMDB entry. From 576 entries (as of 15/2/2018), we picked 293 EMDB/PDB (Protein Data Bank [18]) pairs (Supporting Information, Table S1) by three criteria: 1) map and model represent the same structure and fit visually well to each other; 2) the presence of at least one annotated α-helix or β-sheet in the PDB model; 3) preference of higher resolution maps in case the same authors deposited several instances of the same macromolecular complex (as the model was most likely fitted to the highest resolution map). Maps with severe misfits, misalignments, or models without corresponding reconstruction density (and vice versa) were omitted. Visual evaluation was supplemented with a comparison between the entire map and the part which is occupied by the model using histograms, mean and median values; this provided an additional test of how well map and model fit each other. Furthermore, the training data were filtered by map root mean square deviation (r.m.s.d.) values (see below).
Cryo-EM maps are often post-processed, stitched or otherwise filtered, but it can be difficult to determine how exactly a given map has been altered. Hence, we did not apply any additional criteria pertaining to map modification and instead decided to train the network with all possible representations of the features in question. It is worth mentioning that some types of post-processing, such as map sharpening, are in principle equivalent to linear convolution filters. Convolutional neural networks (CNNs) can learn to apply or compensate for these during training (if they are relevant for predicting the annotated structure) and hence, can become insensitive to these procedures.
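The comparison between the entire map and the model-occupied part can be illustrated with a few summary statistics. This is a minimal sketch under stated assumptions: the function name `map_model_stats`, the boolean `model_mask` marking voxels occupied by the model, and the bin count are hypothetical, and the text does not specify how the values were judged.

```python
import numpy as np


def map_model_stats(density, model_mask, bins=50):
    """Mean, median and histogram for the whole map and for the modelled part."""
    inside = density[model_mask]                          # voxels occupied by the model
    edges = np.histogram_bin_edges(density, bins=bins)    # shared bins for both histograms
    return {
        "map_mean": float(density.mean()),
        "map_median": float(np.median(density)),
        "model_mean": float(inside.mean()),
        "model_median": float(np.median(inside)),
        "map_hist": np.histogram(density, bins=edges)[0],
        "model_hist": np.histogram(inside, bins=edges)[0],
    }
```

For a well-fitting pair, one would expect the model-occupied voxels to sit clearly above the map-wide mean and median.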

Training Data Annotation
To generate ground truth data for network training, a Python script was implemented to automatically annotate the reconstruction map according to the deposited structural model as α-helical, β-strand, nucleotide or unassigned. The script extracts the original annotations from PDBML format [19] files using a custom parser. To obtain suitable training data, additional secondary structure information was necessary. We implemented a variant of the DSSP algorithm [20] omitting strand direction, and a torsion-angle-based secondary structure detection inspired by STRIDE: [21] annotated or DSSP-detected secondary structures were extended by neighbouring amino acids if they matched the same Ramachandran profile. Before usage, the voxel size of the reconstruction map was re-scaled to 1 Å.
Following that, if a secondary structure was identified, and if the average main chain atom map r.m.s.d. (root mean square of the map density distribution) was above 2, all voxels within 3 Å of backbone atoms were annotated accordingly. Secondary structure residues below 2 but ≥ 1.0 r.m.s.d. were masked and excluded from error calculation during training. All voxels not within 5 Å of model atoms, but with density ≥ 1 r.m.s.d. were masked and excluded from training, as they had high density, but were not modelled. The remaining voxels were marked as "unassigned".
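The torsion-angle-based extension step can be sketched as follows. This is an illustration only: the rectangular φ/ψ regions in `rama_class` are rough assumed boundaries, not those of STRIDE or of the annotation script, and the one-letter codes (`H`, `E`, `-`) and function names are hypothetical.

```python
def rama_class(phi, psi):
    """Coarse Ramachandran class from backbone torsions in degrees (assumed boxes)."""
    if -100 <= phi <= -30 and -80 <= psi <= -10:
        return "H"   # alpha-helical region
    if -180 <= phi <= -45 and 90 <= psi <= 180:
        return "E"   # beta-strand region
    return "-"


def extend_by_torsion(labels, torsions):
    """Grow annotated elements into unlabelled neighbours with a matching profile.

    labels:   string with one code per residue ('H', 'E' or '-')
    torsions: list of (phi, psi) tuples, one per residue
    """
    rama = [rama_class(phi, psi) for phi, psi in torsions]
    out = list(labels)
    # Forward pass: extend each element towards the C-terminus.
    for i in range(1, len(out)):
        if out[i] == "-" and out[i - 1] != "-" and rama[i] == out[i - 1]:
            out[i] = out[i - 1]
    # Backward pass: extend each element towards the N-terminus.
    for i in range(len(out) - 2, -1, -1):
        if out[i] == "-" and out[i + 1] != "-" and rama[i] == out[i + 1]:
            out[i] = out[i + 1]
    return "".join(out)
```

A helix annotated for residues 2 to 4 would, for example, be extended to flanking residues whose torsion angles also fall into the helical region, and stop at the first residue that does not.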
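The labelling rules above can be written down compactly. The sketch below assumes a map already re-scaled to 1 Å voxels and expressed in units of its r.m.s.d., so that atom coordinates in Å coincide with voxel indices. The class codes, the function names, the brute-force distance search, and the order in which overlapping rules are applied are assumptions made for illustration.

```python
import numpy as np

HELIX, STRAND, NUCLEOTIDE, UNASSIGNED, MASKED = 0, 1, 2, 3, -1


def within(shape, atoms, radius):
    """Boolean grid: True where a voxel centre lies within `radius` of any atom."""
    grid = np.stack(np.indices(shape), axis=-1).astype(float)   # (X, Y, Z, 3)
    hit = np.zeros(shape, dtype=bool)
    for atom in np.atleast_2d(atoms):
        hit |= np.linalg.norm(grid - atom, axis=-1) <= radius
    return hit


def annotate(density_sigma, residues, all_atoms):
    """Label voxels from a model.

    density_sigma: 3D map in units of its r.m.s.d.
    residues:      list of (class_id, backbone_xyz, mean_backbone_density)
    all_atoms:     (n, 3) array with every model atom
    """
    shape = density_sigma.shape
    labels = np.full(shape, UNASSIGNED, dtype=np.int8)
    # Strong density far from any model atom: present but not modelled, so masked.
    labels[~within(shape, all_atoms, 5.0) & (density_sigma >= 1.0)] = MASKED
    # Weakly supported secondary structure residues are masked (assumed: within 3 A).
    for class_id, backbone, mean_density in residues:
        if 1.0 <= mean_density <= 2.0:
            labels[within(shape, backbone, 3.0)] = MASKED
    # Well-supported residues are labelled last so they take precedence (assumption).
    for class_id, backbone, mean_density in residues:
        if mean_density > 2.0:
            labels[within(shape, backbone, 3.0)] = class_id
    return labels
```

Masked voxels carry the code `-1` here so that they can be excluded when the training error is computed.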

Network Training
The maps were split into a total of 2183 segments of 70³ voxels, of which 110 segments (5 %) were set aside for evaluation during network training. Each segment had to contain at least 100 atoms ≥ 1.0 r.m.s.d., a backbone mean density of ≥ 3 r.m.s.d., and at least 5 % of the total segment volume annotated. The training data were augmented through on-GPU 90° rotations (24 possibilities), and by randomly selecting a 40³ voxel sub-segment (translational augmentation).
The network was trained for 40 000 steps with 100 segments employed per step. In training data generation, the average EMDB map had roughly 95 % unassigned voxels after annotation with the PDB model. From this, we estimated that non-true negatives needed to be weighted approximately 16-fold stronger than true negatives. This was necessary as the majority of the space within a reconstruction map is not made up of secondary-structure/oligonucleotide-associated voxels and thus the network can reach approximately 70-90 % accuracy by predicting "unassigned" (not α-helical, β-sheet or oligonucleotide) structure only.
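The two augmentations can be illustrated on the CPU with NumPy (the text describes an on-GPU implementation, which is not reproduced here). The 24 possibilities are the proper rotations of a cube: axis permutations combined with axis flips whose overall determinant is +1, so that mirror images, which would invert the handedness of helices, are excluded. The function names are hypothetical.

```python
from itertools import permutations, product

import numpy as np


def rotations24(volume):
    """Return the 24 axis-aligned rotations of a cubic volume (no mirror images)."""
    rotated = []
    for perm in permutations(range(3)):
        # Parity of the axis permutation: +1 if even, -1 if odd.
        inversions = sum(perm[i] > perm[j] for i in range(3) for j in range(i + 1, 3))
        perm_sign = -1 if inversions % 2 else 1
        for flips in product((False, True), repeat=3):
            flip_sign = -1 if sum(flips) % 2 else 1
            if perm_sign * flip_sign != 1:
                continue                      # determinant -1: a reflection, skip it
            vol = np.transpose(volume, perm)
            axes = tuple(a for a, f in enumerate(flips) if f)
            if axes:
                vol = np.flip(vol, axis=axes)
            rotated.append(vol)
    return rotated


def random_crop(volume, size=40, rng=None):
    """Pick a random size^3 sub-segment (translational augmentation)."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, z = (int(rng.integers(0, s - size + 1)) for s in volume.shape)
    return volume[x:x + size, y:y + size, z:z + size]
```

In a real training pipeline, the same rotation and crop would have to be applied to the density segment and to its label volume.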
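One way to realise such a weighting is a per-voxel weighted cross-entropy in which every voxel that is not a true negative (labelled "unassigned" and also predicted as "unassigned") counts 16 times as much. The sketch below is an assumption about how this could be done, written in NumPy rather than TensorFlow. The normalisation by the weight sum, the `-1` code for masked voxels, and the function name are illustrative choices, not taken from the text.

```python
import numpy as np

UNASSIGNED, MASKED = 3, -1


def weighted_cross_entropy(probs, labels, weight=16.0, eps=1e-7):
    """Cross-entropy with non-true-negative voxels up-weighted.

    probs:  (..., 4) class probabilities per voxel
    labels: (...) integer class per voxel, MASKED voxels are ignored
    """
    valid = labels != MASKED
    safe_labels = np.where(valid, labels, 0)            # dummy index for masked voxels
    p_true = np.take_along_axis(probs, safe_labels[..., None], axis=-1)[..., 0]
    predicted = np.argmax(probs, axis=-1)
    true_negative = (labels == UNASSIGNED) & (predicted == UNASSIGNED)
    weights = np.where(true_negative, 1.0, weight) * valid
    return float(np.sum(-weights * np.log(p_true + eps)) / np.sum(weights))
```

With roughly 95 % unassigned voxels the class ratio is about 19:1, so a factor of the order of 16 brings the rare classes to a comparable total weight.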

Network Performance Test
After training, the network was tested on an independent set of 122 EMDB maps (selected by the same criteria as training data and deposited after February 2018, for the complete list, see Supporting Information, Table S2). For evaluation, we investigated residues with mean backbone densities ≥ 1.0 r.m.s.d. and compared the predicted secondary structure on a per-residue basis with the one derived from the deposited PDB model. For this analysis, the r.m.s.d. value given in the header of each map file was used. Using this criterion, the network achieved similar performance on training, evaluation, and test data. Over all test maps, there were 75.4 % true positives $t_p$ (correctly predicted residues), 18.8 % false positives $f_p$ (wrongly predicted residues) and 4.0 % false negatives $f_n$ (non-predicted residues), resulting in a median recall rate $100 \cdot t_p (t_p + f_n)^{-1}$ of 95.1 % and a precision $100 \cdot t_p (t_p + f_p)^{-1}$ of 80.3 %. Precision and recall did not correlate significantly with average resolution (as given in the EMDB entry), Molprobity [22] score or deposition date.
The corresponding residue-level F1 score (harmonic mean of precision and recall) on the test set for Haruspex (87.05 %) is the highest reported so far on a per-residue level when compared to other recent work. [23][24][25] Direct comparison of these values is, however, difficult since these other networks were tested on small test sets of lower-resolution simulated and experimental maps, whereas we used a large set of exclusively experimentally derived higher-resolution maps. Moreover, these networks did not annotate oligonucleotides, which affects the composition of the F1 score. In a recent preprint, [26] the authors use deep learning for atom-level prediction and report 88.5 % correctly predicted Cα atoms on 50 pre-cleaned experimental maps at 4.4 Å or better, which suggests similar performance for their intermediate secondary structure prediction.
As a typical example, human ribonuclease P holoenzyme (EMDB entry 9627) illustrates the power of our approach (see Figure 1). Haruspex is not only able to accurately predict the RNA vs. protein distribution in this complex, but also correctly assigns secondary structure elements in the protein areas with only a few exceptions. These notably include a stem-loop element in the RNA (upper left in the structure), regions that resemble β-sheets but do not follow the characteristic hydrogen bonding pattern, as well as secondary structure elements currently not covered by Haruspex, such as polyproline type II (PII) helices (Figure 2C,D). Additional examples are shown in Figure 3.
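The recall, precision and F1 definitions used above translate directly into code. This is a small sketch with hypothetical function names. Applied to the pooled percentages (75.4, 18.8, 4.0), it gives values close to, but not identical with, the reported figures, since those are medians over the individual maps.

```python
def recall_precision(tp, fp, fn):
    """Recall and precision in percent from true/false positive and false negative counts."""
    recall = 100.0 * tp / (tp + fn)
    precision = 100.0 * tp / (tp + fp)
    return recall, precision


def f1_score(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2.0 / (1.0 / recall + 1.0 / precision)


pooled_recall, pooled_precision = recall_precision(tp=75.4, fp=18.8, fn=4.0)
print(f"pooled recall    {pooled_recall:.1f} %")
print(f"pooled precision {pooled_precision:.1f} %")
print(f"F1 of the reported medians {f1_score(95.1, 80.3):.1f} %")
```

The harmonic mean of the two reported medians comes out at about 87.1 %, in line with the residue-level F1 score of 87.05 % quoted above.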

Haruspex Usage
Haruspex can be used as a command line tool, which reads in an MRC format reconstruction map. No further parameters are needed and a prediction for a single map takes approximately 30 seconds to a few minutes on a normal workstation, depending on the available hardware (it can be used with or without GPU); on an older laptop, the annotation may take as long as 45 minutes for a very large structure. The output consists of four MRC format maps corresponding to the α-helical and β-strand protein, nucleotide, and "unassigned" portion of the input map. These maps can be displayed in Coot, [27] Pymol [28] or Chimera [29] and together represent the entire input map.
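How four class maps can together represent the entire input map is illustrated below. This is a sketch under assumptions, not a description of how Haruspex writes its output: each voxel's density is assigned to the map of its most probable class, so that the four parts add up to the input. The names `CLASSES` and `split_by_class` are hypothetical, and reading and writing MRC files (for example with the `mrcfile` package) is left out.

```python
import numpy as np

CLASSES = ("helix", "strand", "nucleotide", "unassigned")


def split_by_class(density, probs):
    """Split a density map into one map per class using the most probable class.

    density: (X, Y, Z) array
    probs:   (X, Y, Z, 4) array of class probabilities in the order of CLASSES
    """
    winner = np.argmax(probs, axis=-1)
    return {name: np.where(winner == idx, density, 0.0)
            for idx, name in enumerate(CLASSES)}
```

Loaded side by side in a viewer, such maps can be coloured individually to show the annotation at a glance.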

Network Performance
Herein, we have described the development of the neural network Haruspex for the annotation of protein secondary structure and RNA/DNA in cryo-EM reconstruction maps in order to facilitate the modelling of such maps. We trained Haruspex on 293 experimentally derived reconstruction maps with a resolution of 4 Å or better and obtained recall and precision rates of 95.1 % and 80.3 %, respectively, on an independent test set of 122 maps. The pre-trained network can be readily applied to annotate newly reconstructed maps to support domain placement or to supply a starting point for main-chain placement.
When considering the 18.8 % false positives and 4.0 % false negatives, two fundamental limitations in the annotation of EMDB maps should be mentioned: firstly, the map can be wrongly modelled (see Figure 2C), which biases our annotation towards human modelling errors. Secondly, the deposited model may have been built employing additional information, such as structure-specific information from an external source, for example backbone folds established prior by crystallographic means, [30] NMR or structure prediction, or more than one map generated from different particle alignments. [31] This would in particular introduce higher rates of false negatives at the outer edges of the map, where the model covers secondary structure that was established by other means, but the map does not provide enough information to make this assignment.
[Figure caption fragment: the regions shown in Figure 2C and 2D are marked # and *, respectively.]
Closer inspection reveals that false positives are often elements closely resembling helices, sheets or RNA/DNA (see Figures 1, 2, and 3). In particular, semi-helical structures, β-hairpin turns, and residues belonging to polyproline type II (PII) helices [32] are misclassified as α-helical, and loosely parallel structures without the typical hydrogen-bond pattern are frequently misclassified as β-strands. In the case of PII helices, this is partly due to the STRIDE-like annotation. It would be very desirable to quantify the false positives in this respect, but this was not possible within the scope of this work, as no automatic annotation algorithms seem to exist for such cases. For the future development of Haruspex, predicting additional classes, such as β-turns, polyproline helices, and perhaps even membrane detergent regions would be very desirable, as this would potentially lower the number of incorrectly identified secondary structure elements, while at the same time supplying additional information to users.

Resolution Range and Comparison to Similar Algorithms
Haruspex was trained for average resolutions as low as 4 Å, and the median resolution of published cryo-EM maps is improving every year, and has been better than 4 Å since 2017 (see Figure S5 in the Supporting Information). Irrespective of this, we will extend our approach to lower resolution data in the future, where our automated annotations should be even more advantageous for users. Still, low resolution experimental maps with a well-matching model for training and testing such a network are difficult to obtain. This obstacle has previously been faced by Si et al. [33] (SSELearner), Li et al. [23] and Subramaniya et al. [25] (Emap2Sec) who developed machine learning approaches for protein secondary structure prediction in cryo-EM maps, but not oligonucleotides, [23] and consequently resorted partly to simulated maps generated with pdb2mrc. [34]
These simulated maps lack the error structure and processing artefacts found in experimentally derived reconstruction densities, [4][5][6] as they assume a perfectly processed data set of a homogenous sample where all atoms interact with the electron beam as if they were uncharged and unbound. Si et al. tested their support vector machine on 10 simulated maps of relatively small structures (less than 40 kDa) and, as available data were still very limited in 2012, only 13 experimental maps paired with individually selected training maps. Haslam et al. [24] used a 3D U-Net, which was trained on 25 simulated and 42 experimental maps between 3-9 Å resolution to predict helices and sheets obtaining an F1 score $2(\mathrm{recall}^{-1} + \mathrm{precision}^{-1})^{-1}$ between 0.79 and 0.88. However, the network was only tested on six simulated maps and one experimentally derived map. We, on the other hand, used a total of 293 experimentally derived maps in a semi-automated workflow to provide a more realistic training environment. Furthermore, the amount of newly released high-resolution structures in conjunction with our processing infrastructure permitted us to test our network performance on a representative set of 122 unique depositions. The semi-automated workflow for the selection and annotation of training data (see the Methods section of the Supporting Information) allows for an easy expansion of ground truth data and re-training. However, given that Haruspex has already been trained on a diverse range of macromolecular structures, the network can be used to interpret any map at 4 Å or better without any additional (re-)training necessary.
Figure 2. Network performance. A) Network precision vs. recall rates, with one marker per EMDB entry (training set entries are shown as orange, test set entries as blue markers). Both perform similarly well, with the training set producing a few more outliers. B) Frequency vs. map r.m.s.d. level for EMDB 9627 on a per-residue basis: True positives (green), false positives (orange), and false negatives (blue). This plot is typical: false negatives often occur in low-density map regions. C) α-Helical false positives (PDB 6AHU, residues 131-139 in chain J): The model partly occupies the conformational space of a polyproline type II helix (PII), which is often misinterpreted as α-helical and may have been modelled incorrectly (given that the model does not completely fit the density). D) False positives in a β-sheet (6AHU, residues 215-221 in chain B). The deposited model does not maintain the hydrogen bonding that defines a regular β-sheet; to the network, however, the fold still "looks" like a β-sheet and a third segment (top) is also assumed to be part of it.

Augmentation of Automatic Model Building
Haruspex ideally complements tools for automatic map-based structure building, such as MAINMAST, [36] RosettaES, [37] ARP/wARP, [38] phenix.map_to_model [39] or Buccaneer [40] by providing an independent method to locate secondary structure elements of proteins to assist the validation of an automatically built protein main-chain. Haruspex may even be employed in the future to serve as starting point for such methods. The ability of Haruspex to automatically recognize RNA/DNA is of particular interest for the analysis of ribosomes, spliceosomes, and polymerases, which all contain substantial amounts of oligonucleotides. As these and similar structures are among the most common specimens studied by single-particle cryo-EM, Haruspex, which, to our knowledge, is the first to use machine learning for the identification of nucleotides in cryo-EM reconstruction maps, offers a unique advantage for the analyses of these structures.

Conclusion
We demonstrate that a neural network can be used to automatically distinguish between nucleic acids and protein and to assign the two main protein secondary structure elements in experimentally derived cryo-EM maps. This technique will render the process of protein structure determination faster and easier. Haruspex was trained on a carefully curated ground truth dataset based entirely on experimental data from the EMDB. The pre-trained network can be straightforwardly applied to annotate newly reconstructed cryo-EM density maps. Besides guidance for domain placements, the network also proves useful for model validation during building due to its high median recall and precision rates of 95.1 % and 80.3 %, respectively, as has been demonstrated by early users at our institute, for example in the modelling of the mycobacterial type VII secretion system. [2] The newest version of Haruspex is available online at https://github.com/thorn-lab/haruspex and will be distributed as part of CCP-EM. [35] We plan to refine and adapt the network as new data become available, and extend the approach to lower resolution and more structural classes in the future.