Assessment of three-dimensional RNA structure prediction in CASP15

The prediction of RNA three-dimensional structures remains an unsolved problem. Here, we report assessments of RNA structure predictions in CASP15, the first CASP exercise that involved RNA structure modeling. Forty-two predictor groups submitted models for at least one of twelve RNA-containing targets. These models were evaluated by the RNA-Puzzles organizers and, separately, by a CASP-recruited team using metrics (GDT, lDDT) and approaches (Z-score rankings) initially developed for assessment of proteins and generalized here for RNA assessment. The two assessments independently ranked the same predictor groups as first (AIchemy_RNA2), second (Chen), and third (RNAPolis and GeneSilico, tied); predictions from deep learning approaches were significantly worse than those from these top-ranked groups, which did not use deep learning. Further analyses based on direct comparison of predicted models to cryogenic electron microscopy (cryo-EM) maps and x-ray diffraction data support these rankings. With the exception of two RNA-protein complexes, models submitted by CASP15


| INTRODUCTION
Soon after the establishment of the cloverleaf structure of transfer RNA, 1,2 three-dimensional models of RNA structures appeared. 3,4 However, it took more than 10 years before the first refined experimental structures of the 76-nucleotide yeast tRNA Phe were published. 5,6 For many years, x-ray crystallographic structures of RNA nucleosides and nucleotides allowed us to grasp the fundamentals of RNA stereochemistry. After 1995, following progress in chemistry and x-ray technology, a steady stream of RNA structures with sizes equivalent to or larger than tRNAs, culminating with fully functional ribosome structures, revealed the many intricacies of RNA architectures.
In parallel, computer programs for RNA modeling appeared (for overview, see Reference 7). However, it was not until 2011 that a regular assessment of models, called RNA-Puzzles, was set up. 7,8 The models for the RNA sequence of each RNA-Puzzle were collected prior to publications of the x-ray structures. Since not enough targets were available for a short CASP-like season, the Puzzles were organized to occur right as the structures were solved (for those structures for which an agreement between the structural biologist and RNA-Puzzles organizers was made). 11 In 2021, it became clear that accelerations in RNA structure determination 12 would allow enough targets for a single CASP season. Here we report on the first collaborative effort between CASP and RNA-Puzzles teams on a set of RNA targets. Following the success of AI-based tools in protein structure prediction 13 and a surge of interest in RNA during the COVID pandemic, 14 the hope of the organizers and assessors was to generate motivation and attention from protein modeling groups to develop and evaluate methods for RNA.
Between April and July of 2022, sequences of 12 RNA targets were received from experimental contributors and disseminated on the CASP website. Models were submitted by over 40 groups, and a double-blind assessment was carried out. Inspired by prior joint assessments by CAPRI and CASP for protein complexes (see References 15-19), two assessments were carried out for RNA: one assessment was performed by the RNA-Puzzles team (Z. Miao & E. Westhof) and a completely independent analysis was performed by assessors nominated by the CASP organizers (R. Das and team). During a dedicated assessors' meeting in October 2022, the two assessments' results were critically compared, revealing a striking consensus in rankings and choice of top predictors, despite the use of distinct metrics and ranking schemes. Further analysis based on visual inspection of RNA-protein targets, direct comparison to cryogenic electron microscopy (cryo-EM) maps, and molecular replacement trials for targets solved by x-ray diffraction (catalyzed by the general CASP15 conference in December 2022) revealed additional insights into the limitations and potential of current RNA 3D modeling, which are described here. The identification of accurate models also led to insights by CASP15 RNA experimental contributors and development of novel methods for cryo-EM model refinement, described in two separate papers co-submitted to the CASP15 special issue. 20,21

| METHODS

| Computation of RNA-Puzzles-style metrics
The RNA-Puzzles-style assessment relied mainly on the Root Mean Square Deviation (RMSD) measure complemented by the Deformation Index (DI). 22 The RMSD is the usual measure of distance between all atoms (excluding H atoms) of the two superimposed structures.
The DI score complements the RMSD values by introducing features specific to RNA in the metric in the following way. The pairs formed by the nucleotides are identified, counted, and annotated in the experimental structure. They are broadly classified as either of the Watson-Crick complementary type (WC, comprising AU, GC, or GU pairs whose geometries are compatible with the standard Watson-Crick-Franklin double helix) or of the non-Watson-Crick type (NWC).
The base-base network (that is, WC, NWC, and stacking interactions) in both reference and predicted models is extracted using the MC-Annotate 23 tool. We then compute, for each of the three types of base-base interactions, the number of correctly predicted pairs, the true positives (TP), the number of predicted pairs with no correspondence in the reference model, the false positives (FP), and the number of pairs in the reference model that are not present in the predicted model, the false negatives (FN). The Interaction Network Fidelity (INF) is then computed as the Matthews Correlation Coefficient, the geometric mean of the positive predictive value and sensitivity as in Gorodkin 24,25 :

$$\mathrm{INF} = \sqrt{\frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}$$

[...] posing each nucleotide of the predicted model over the corresponding nucleotide of the reference model one at a time. It is computed using the "dp.py" command from the "SIMINDEX" package. 22 For simplification, we also calculate the sum, mean and median of the deformation profile to account for the general accuracy of the prediction. The stereochemical correctness of the predicted models was evaluated with MolProbity, 26 which provides quality validation for 3D structures of proteins and nucleic acids. For the latter, MolProbity performs several automatic analyses, from checking the lengths of H-bonds present in the model to validating the compliance with the rotameric nature of the RNA backbone. 26,27 As a single measure of stereochemical correctness, we chose the clash score, that is, the number of all types of steric clashes per thousand residues. 28 The assessment also considered the coordinate comparison metric TM-score as computed in RNA-Align 29 and the Mean of Circular Quantities 30 to assess accuracy in the torsion angle space. All the source codes and an example notebook are available at: https://github.com/RNA-Puzzles/RNA_assessment.
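The INF definition above (geometric mean of positive predictive value and sensitivity over TP, FP, and FN counts) can be illustrated with a minimal sketch. This is not the RNA_assessment implementation: the function name, the tuple encoding of interactions, and the example interactions are hypothetical, and the convention of returning 0 when no interaction is shared is an assumption.

```python
from math import sqrt


def interaction_network_fidelity(reference_pairs, predicted_pairs):
    """Return sqrt(PPV * sensitivity) for two sets of annotated interactions."""
    reference = set(reference_pairs)
    predicted = set(predicted_pairs)
    tp = len(reference & predicted)  # interactions found in both models
    fp = len(predicted - reference)  # predicted, but absent from the reference
    fn = len(reference - predicted)  # in the reference, but not predicted
    if tp == 0:
        return 0.0  # assumed convention when nothing is shared
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return sqrt(ppv * sensitivity)


# Hypothetical interactions encoded as (residue_i, residue_j, type).
reference = {(1, 20, "WC"), (2, 19, "WC"), (3, 18, "WC"), (5, 12, "NWC")}
predicted = {(1, 20, "WC"), (2, 19, "WC"), (4, 17, "WC"), (5, 12, "NWC")}

inf_all = interaction_network_fidelity(reference, predicted)

# Restricting both sets to one interaction type gives a per-type score.
inf_wc = interaction_network_fidelity(
    {p for p in reference if p[2] == "WC"},
    {p for p in predicted if p[2] == "WC"},
)
```

In this toy case three of the four interactions are shared, so both the positive predictive value and the sensitivity are 0.75, as is their geometric mean.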

| Computation of CASP-style metrics
Independently from the RNA-Puzzles-style computations, we assessed the accuracy of the submitted models in a manner closer to recent CASP assessments for protein structure prediction through Z_RNA, a weighted Z-score average of several different assessment metrics. To perform the Z_RNA evaluation, we developed the casp-rna pipeline, which encompasses our workflow for data wrangling, job parallelization, and ranking visualizations. In consideration of RNA as a flexible molecule in which irregular loops may affect RMSD measures, Z_RNA explored additional metrics beyond RMSD to capture the global accuracy, local accuracy, and geometries of RNA. We selected the following tools for our ranking scheme: (1) US-align, 31 which was used to compute TM-score through a heuristic alignment approach improving on the original RNA-align 29 ; (2) Local-Global Alignment, 32 which yielded GDT_TS, the average percentage of aligned C4′ atoms (rather than that of Cα in proteins) at cutoffs of 1 Å, 2 Å, 4 Å, and 8 Å; (3) RNA-tools, 33 a toolkit used to determine the accuracy of contact classifications among base stackings, Watson-Crick interactions, and noncanonical interactions. INF scores were calculated from interaction predictions dependent on ClaRNA 34 ; (4) OpenStructure, 35,36 a framework used to find lDDT, a metric that measures structural similarity (unlike for proteins, our implementation of lDDT for RNA did not penalize for stereochemical violations); and (5) PHENIX, which reports a clashscore metric for all non-hydrogen bonded atom pairs that overlap worse than 0.4 Å. 26,37
For TM-score and GDT_TS, superpositions of predicted and experimental models were calculated with default atoms for those packages, C3′ and C4′, respectively (repeating GDT_TS calculations with different atoms P, C3′, and C4′ gave negligible differences). Two alignment modes were considered for GDT_TS: a fixed residue-residue correspondence approach and an automated search for the best superposition, ignoring sequence; these gave nearly identical group rankings, so we opted for the former approach.
INF scores were computed with ClaRNA to help increase robustness of base pair assignment for low resolution models; these values were slightly different from, but highly correlated with, INF scores computed with MC-Annotate, the tool normally used by RNA-Puzzles.
Similar to the assessment of protein models in past CASP assessments, we employed a two-pass procedure for Z-scores. 13,38 For each target and for each of the considered metrics, the Z-score (difference from the mean, normalized by the standard deviation) was calculated by taking the mean and standard deviation for the best model from each group with respect to each considered metric. To prevent distortion from very poor outlier predictions, models with initial Z-scores that fell under a tolerance threshold of −2 were discarded, and the Z-scores were recomputed with the new mean and standard deviation. After this second pass, models with Z < −2 were re-assigned Z = −2.
For Z-scores that involved linear combinations of multiple components (e.g., Z_RNA), the Z-score values for individual components were then summed. To prevent penalization of novel methods that might give poor models for some targets, the sums of just the positive Z_RNA over all targets were used to make final rankings. For targets where experimentalists provided multiple conformations to either represent experimental uncertainty or bona fide conformational diversity (e.g., different copies in the crystallographic asymmetric unit or multiple conformations captured by cryo-EM 39 ), predictor models were compared to all available experimental models. Groups were rewarded based on their best score. Code for the analysis of submitted models, assessment tools, and documentation using casp-rna are available as an open-source repository at https://github.com/DasLab/casp-rna.
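The two-pass Z-score procedure and the positive-only summation described above can be sketched as follows. This is an illustrative sketch and not the casp-rna code: the function names and the example scores are hypothetical, the metric is assumed to be oriented so that higher is better, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def two_pass_z(scores, floor=-2.0):
    """Two-pass Z-scores for one target and one metric.

    scores maps each group to the score of its best model (higher is better).
    """
    def z_against(values):
        mu = mean(values)
        sigma = pstdev(values)  # population SD; an assumption of this sketch
        if sigma == 0:
            return {group: 0.0 for group in scores}
        return {group: (s - mu) / sigma for group, s in scores.items()}

    # Pass 1: Z-scores over all groups, used only to flag poor outliers.
    first = z_against(list(scores.values()))
    kept = [scores[group] for group, z in first.items() if z >= floor]
    # Pass 2: recompute mean and SD without the outliers, then clip at the floor.
    second = z_against(kept)
    return {group: max(z, floor) for group, z in second.items()}


def sum_positive(z_by_target):
    """Sum only the positive Z-scores of each group over all targets."""
    totals = {}
    for z in z_by_target:
        for group, value in z.items():
            totals[group] = totals.get(group, 0.0) + max(value, 0.0)
    return totals


# Hypothetical best-model scores for seven groups on one target.
example = {"A": 0.80, "B": 0.75, "C": 0.70, "D": 0.72,
           "E": 0.78, "F": 0.74, "G": 0.05}
z_example = two_pass_z(example)
totals = sum_positive([z_example, z_example])
```

Here group G is discarded in the first pass, so the second-pass mean and standard deviation reflect only the six comparable groups, and G is then assigned the floor value of −2. Summing only positive values means that G contributes nothing to its total rather than being penalized.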

| Generation of simple template-based structures as comparison models
As baselines for the accuracy of predicted models, we prepared template-based structures generated using homology models with the rna_thread application in Rosetta 3 (version tag v2019.27-dev60818-134-g04678680f9c). 40 For the CPEB3 ribozymes (R1107 and R1108), we generated template-based structures using the HDV ribozyme structure (PDB ID: 3NKB). We used residues

| Computation of map-to-model metrics for cryo-EM targets
All models for the 6 targets determined by cryo-EM (R1126, R1128, R1136, R1138, R1149, and R1156) were assessed directly against the experimental maps. The RNA-protein targets (R1189 and R1190) were excluded from this analysis because none of the predicted models for these targets fit sufficiently well into the density to give robust alignments, but in principle, this analysis is compatible with RNA-protein targets. First, models were fit into maps using two approaches. Models were aligned to the reference models (built by experimentalists into density maps) using US-align 31 and then fit locally using the command fitmap in ChimeraX. 41 We also tested an iterative phenix.dock_in_map 37 procedure. For the well-fitting models, there was very little difference between these two methods and thus the fitmap method was selected. The following programs were used to measure the listed metrics, in all cases using default parameters: (1) Phenix, 37 [...] 43 and density occupancy; and (4) MapQ, 44 for Q-score. An RMSD filter was selected for each target based on visual inspection.
Ranking of all the models was carried out by Z-score, following the two-pass procedure described in Section 2.2. Code for the analysis can be found at https://github.com/DasLab/CASP15_RNA_EM.

| Scoring against x-ray data and molecular replacement (MR)
All models for the four targets determined by x-ray crystallography (R1107, R1108, R1116, and R1117) were assessed directly against the x-ray data by superimposing them on the target structure with RNAalign 29 and calculating the Log Likelihood Gain (LLG) with respect to the diffraction data using Phaser. 45 For R1108 and R1117, with two RNA molecules in the asymmetric unit, the LLG was calculated for a single copy of the model ideally placed on chain A. A ranking of groups was derived from Z-scores computed from equal weighting of LLG, TFZ (translation-function Z-score from the model search), and CC (correlation coefficient of the map based on phases from the ideally placed model compared to the map computed by the experimentalists with their final phases). These ranking Z-scores were based on the same two-pass procedure as described in Section 2.2.
Molecular Replacement was carried out using the CCP4 package 46 via CCP4 Cloud 47 and specifically the programs Phaser 45 and MOLREP. 48 Map correlation coefficients were calculated with the phenix.get_cc_mtz_pdb tool. 37
[...] into three structural segments using the Birch algorithm from the Sci-Kit toolbox. 53

| RESULTS

| Classification of the difficulties and qualities of the targets
In Table 1, the 12 targets are gathered along with notes on protein and ligand binding, evidence for multiple conformations, and experimental technique and resolution. The difficulty was considered as "easy" when homologous structures were present in the PDB and as of "medium" difficulty when the structural similarity could be deduced due to similar functions (e.g., the CPEB3 ribozymes self-cleave like a ribozyme of known structure from hepatitis delta virus). Two targets were ranked as "difficult" since no homologous structures had been published and the number of nucleotides was larger than 120. Finally, a fourth "non-natural" category was considered for targets that were human-designed and not found in nature (and thus without homologous sequences), since it was not clear a priori whether these cases would be easy or difficult to model. The majority of targets (8) were solved by cryo-EM, with the rest (4) by x-ray crystallography.
TABLE 1 Summary and descriptions of the 12 RNA targets in CASP15.
e R1117, R1126, and R1136 were noted as RNA/ligand targets during prediction season. A K+ ion in a G-quadruplex in R1126 and small molecules bound to aptamers displayed in R1136 were not well-resolved in their respective cryo-EM maps and not assessed. Assessment of the pre queuosine ligand in R1117 is included in the overall CASP15 assessment of ligand binding, described separately (Xavier Robin, Gabriel Studer, Janani Durairaj, Jerome Eberhardt, Torsten Schwede, and W. Patrick Walters, "Assessment of Protein-Ligand Complexes in CASP15," under revision).
f The model of the mature conformation has a clashscore of 0.09 and the top 10 CASP15 predictions matched this model better than the early conformation (clashscore 63.7).
g Similar RMSD predictions came from TS229, TS239, and TS439, all submitted by the same laboratory (Yang).

| Assessment and ranking based on RNA-Puzzles metrics
The RNA-Puzzles assessment recognizes that RNA architecture results from a set of coherent interaction networks stabilizing a given fold. There are several interaction networks: the set formed by all Watson-Crick pairs, the set of contacts formed by the stacking between the bases, and finally the set formed by the non-Watson-Crick pairs, the interactions characteristic of tertiary folding.
In a 3D structure, the set of Watson-Crick pairs is not always the one predicted because, in the folded structure, pairs at the extremities of the helical segments can disappear or new ones can form.
The correct choice of stacking between nucleotides or helices is critical for the overall global fold of the RNA. A wrong choice in the helices of the core can lead to very different folds from the native one.
Finally, the appropriate positions and orientations of several elements allow for specific non-Watson-Crick pairs to form and lock in the native structure. An approximate association of helices may yield a molecular shape or envelope roughly similar to the native structure, but generally more open and much less compact than the native fold.
In such cases, the key sequence conservations that maintain the actual native RNA fold are neither observed nor understood from the modeled structure. Therefore, in addition to using RMSD as a major metric for assessment, the analyses also included distinct metrics that are more sensitive to the interaction networks that comprise RNA.
[...] 1), the overall folding shapes are reproduced, as can be seen in Figure 1, where all targets are superimposed on the best predicted model as ranked by RMSD.
Table S1 presents the number of times that each of the modeling groups produced the 1st, 2nd, or 3rd best model as scored by the various metrics. Separate analyses are shown, based on the best of all five models from each predictor group and based solely on each group's model 1. Taking a weighted sum of these placements (with weights of 3, 2, and 1 assigned for placing 1st, 2nd, or 3rd) enables ranking of the groups. Whatever the way of counting or of scoring, even with methods that used metrics besides RMSD, two groups consistently reached the first and second ranks, TS232 (AIchemy_RNA2) and TS287 (Chen), respectively. The groups TS081 (RNApolis) and TS128 (GeneSilico) both appear in third position. Considering those predictions with best RMSD that were ranked first among a set of all models submitted (up to five from each group), the groups TS232, TS287, TS081, and TS128 are the top four, with the other groups having weighted sums 50% lower. Among the latter, considering only best-RMSD rankings, TS229 (Yang-Server), TS416 (AIchemy_RNA), TS239 (Yang-Multimer), and TS439 (Yang) occupy the middle range.

FIGURE 1 Display of all CASP15 RNA targets (green) with the best-ranked model (blue) superimposed for each, chosen based on RMSD comparison of all five predicted models from all predictor groups compared to all available experimental structures. For ease of visualization of RNA global folds, protein binding and small molecule ligands (see Table 1) are not shown.

| Assessment based on CASP-style metrics
In a second assessment fully independent of the assessment based on RNA-Puzzles above, we explored the use of distinct metrics, largely drawn from assessment methods developed for proteins in previous CASP events and expanded here to RNA. For evaluating the global fold of predicted RNA structures, we computed the template modeling score (TM-score 29,31 ) and the global distance test (GDT 32 ). For the latter, we focused on the GDT score for tertiary structure (GDT_TS) rather than the high-accuracy GDT score (GDT_HA 57 ) since the RNA models lacked nucleotide-level, much less atomic accuracy. To evaluate models' local quality, to complement the RNA-specific INF score described in Section 3.2, we used the Local Distance Difference Test (lDDT 35 ) score, which compares distances between atoms that are nearby in the experimental structure to the distances between those atoms in the predicted structure and may generalize well between proteins and nucleic acids.
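The idea behind lDDT, comparing distances between atoms that are close in the experimental structure with the same distances in the prediction, can be illustrated with a simplified sketch. This is not the OpenStructure implementation: it assumes one representative atom per nucleotide, and the 15 Å inclusion radius and the 0.5, 1, 2, and 4 Å tolerance thresholds are the commonly used lDDT defaults, taken here as assumptions. The function name and coordinates are hypothetical.

```python
from math import dist


def lddt_one_atom(reference_xyz, model_xyz, inclusion_radius=15.0,
                  thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of local reference distances preserved in the model.

    Both inputs are lists of (x, y, z), one point per residue, in the same order.
    """
    preserved = 0
    total = 0
    n = len(reference_xyz)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = dist(reference_xyz[i], reference_xyz[j])
            if d_ref >= inclusion_radius:
                continue  # only pairs that are local in the reference count
            d_model = dist(model_xyz[i], model_xyz[j])
            diff = abs(d_ref - d_model)
            # Each pair is tested against every tolerance threshold.
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
    return preserved / total if total else 0.0


# Hypothetical three-residue example.
reference = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
shifted = [(3.0, 3.0, 3.0), (8.0, 3.0, 3.0), (13.0, 3.0, 3.0)]    # rigid translation
stretched = [(0.0, 0.0, 0.0), (6.5, 0.0, 0.0), (13.0, 0.0, 0.0)]  # distorted

score_shifted = lddt_one_atom(reference, shifted)
score_stretched = lddt_one_atom(reference, stretched)
```

Because only internal distances are compared, the rigidly translated model scores 1.0 without any superposition, whereas the stretched model is penalized.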
The global fold accuracy metrics (TM-score and GDT_TS) suggest that all targets, aside from the two RNA-protein complexes R1189 and R1190, elicited some predicted models that recovered correct global folds, based on criteria that have been previously discussed in the context of RNA template identification (TM-score > 0.45, 29,31 Figure 2A) or protein global fold assessment (GDT_TS > 45, 58 Figure 2B). We note that these criteria for "correct fold" may not apply at the extremes of lengths for our RNA targets. On one hand, the "easy" PreQ1 riboswitch target (R1117) is small with only [...] well, but the relationship between the two varied across different targets (Figure 3A). The difference between GDT and TM-score is due to the distance cutoffs that the two metrics use. For example, TM-score applies a soft distance threshold d_0 that depends on RNA length, which helps account for the flexibility of larger RNAs. 29,31 For R1138 (720 nt), d_0 = 13.59 Å and most of the residues in a visually good model like R1138TS232_4 align within this threshold in the TM-score calculation. In contrast, GDT_TS uses fixed distance cutoffs of 1 Å, 2 Å, 4 Å, and 8 Å, and most of the RNA residues for the large molecule R1138TS232_4 do not align to the cryo-EM structure within these thresholds (Figure S2). These comparisons suggest that TM-score and GDT_TS are useful for ranking models for a given target but thresholds for "good" TM-score and GDT_TS may need recalibration for very small and very large RNA molecules, respectively.
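The contrast between the length-dependent TM-score threshold and the fixed GDT_TS cutoffs can be made concrete with a small sketch. Two simplifications are assumed: per-residue deviations come from a single fixed superposition (the actual programs search over superpositions), and the threshold d0 is taken to be 0.6·sqrt(L − 0.5) − 2.5 Å for RNAs of 30 nt or more, a form assumed here that reproduces the 13.59 Å quoted above for 720 nt. Function names and the uniform 10 Å deviations are hypothetical.

```python
from math import sqrt


def d0_rna(length):
    """Length-dependent distance scale in Å (form assumed valid for >= 30 nt)."""
    return 0.6 * sqrt(length - 0.5) - 2.5


def tm_score(distances, length):
    """TM-score-like average for per-residue deviations after one superposition."""
    d0 = d0_rna(length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length


def gdt_ts(distances, length, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean percentage of residues within each fixed distance cutoff."""
    fractions = [sum(d <= c for d in distances) / length for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical 720-nt model in which every residue deviates by 10 Å.
length = 720
deviations = [10.0] * length

d0_720 = d0_rna(length)
tm_720 = tm_score(deviations, length)
gdt_720 = gdt_ts(deviations, length)
```

In this toy case every residue lies inside the soft TM-score threshold but outside the largest GDT_TS cutoff, so the TM-score exceeds 0.45 while GDT_TS is zero, which mirrors the size dependence discussed for R1138.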
As a metric for model quality that might generalize between protein and RNA, we considered lDDT. While not measuring global shape upon superposition, lDDT has been used as a primary accuracy indicator in numerous prediction contexts, including CAMEO, where a threshold of lDDT > 0.75 is used to denote a good match when comparing templates to target structures and to assign difficulty. 59,60 Across all targets, lDDT values for best predictions ranged from 0.5 to 0.9, again with the lowest performance in RNA-protein complexes (Figure 2C). Interestingly, for the 10 RNA-only targets, CASP15 predictors achieved models with lDDT close to 0.75, and visually excellent models for the small, "easy" target R1117, the "medium" target R1108, and the "non-natural" and larger targets (R1128, R1136) achieved the 0.75 threshold. For future CASP, CAMEO, and other modeling challenges, lDDT may provide the most cleanly interpretable measure of accuracy, with a cutoff of 0.75 applicable across nucleic acids and proteins.

FIGURE 3 [...] between each pair of scores labeled on each row and column, colored by high correlation (dark blue), no correlation (white). RMSD and clashscore were multiplied by −1 before calculating the correlation so that higher scores correspond to better accuracy for all metrics.
These CASP-inspired metrics correlated well with the RNA-Puzzles-based metrics described in Section 3.2. For global fold metrics, while RMSD and GDT_TS are not linearly correlated (Figure 3A), they have positive rank-based correlation (Spearman correlation coefficient 0.61, Figure 3B). The local interaction metrics, INF and lDDT, correlate excellently (Spearman correlation coefficient 0.91, Figure 3B) in what seems to be a near-linear and size-independent relationship (Figure 3A).
To provide a more quantitative threshold for good model accuracy for each target, we sought to estimate the deviation between experimentally determined structures. Where possible, we measured the deviation in TM-score, GDT_TS, INF, INF_WC, and lDDT between distinct experimentally captured conformations (red lines in Figure 2).
More specifically, we compared the following structure pairs in targets with multiple conformations (see also Table 1): the point-mutations for the CPEB3 ribozyme 61 (R1107 vs. R1108), the apo and holo structures of the aptamer Apta-FRET 62 [...] scores were based on clashscore, 28 which has been used widely for both protein and RNA structural assessment. We used the following weighted sum of scores: [...] Because we did not expect atomically accurate models in this first RNA round of CASP, we chose to reward models that recover the global fold (high weight for TM-score and GDT_TS terms) compared to those that recover local details (low weight for local environment scores) or produce correct nucleotide geometries (low weight for clashscore). Each group's Z-score for a given target was computed using their best predicted model, and groups' total scores were calculated as the sum of all positive Z-scores across all targets (Figure 4A).
The top performing predictor groups based on this combined Z-score ranking were AIchemy_RNA2 (TS232), Chen (TS287), RNAPolis (TS081), and GeneSilico (TS128). These were the same groups as the top four highlighted by the independent RNA-Puzzles-style assessment.
Interestingly, the top four groups did not include any server submissions; the top-ranked servers (Ultrafold-server, TS125; and Yang-Server, TS229) placed at positions 8 and 9, and gave Z-scores that were more than three-fold lower than the top two predictor groups.
We note that these top server submissions additionally exhibited secondary structures (Watson-Crick base-pairing) with lower accuracy than some other top predictors, as measured by INF_WC (orange and cyan points, Figure 2), suggesting that there is room for improvement in automated prediction of secondary structure. Furthermore, based on abstracts collected for the CASP15 conference, while the majority of CASP15 RNA predictor groups tested deep learning methods (orange highlights in Figure 4A), the top 4 RNA groups did not use deep learning approaches (see also articles by RNA predictor groups co-submitted for the CASP15 special issue 65-68 ; and https://predictioncenter.org/casp15/doc/presentations/Day3/).
To better understand uncertainties in the rankings, we repeated the Z-score analysis using sub-components of the Z-score. Ranking groups by the two "global fold" terms (GDT_TS and TM-score) alone or in combination, or using RMSD, gave rankings with the same top four groups, up to some switching of third and fourth place (Figure 4B (A) and Table 2). Use of the more local accuracy terms (lDDT and INF) retained the same top three predictor groups, with some switching in the ranks of the groups after the top three. After the top four, the rankings are less consistent, which is not surprising given the small numerical score differences in these placements (Figure 4A and [...] not told a priori that they would be assessed on clashscore. Overall, the ranking of the top four groups in CASP RNA structure modeling was robust to changes in metrics used and across two independent assessments.

| Detailed assessment for RNA-protein complexes
The poorer predictions and the presence of RNA-protein contacts for the two RNA-protein complexes RT1189 and RT1190 largely precluded useful accuracy rankings from the metrics described above, so we carried out a detailed visual assessment for these targets. This assessment involved checking whether predictions had the right nucleotide-amino acid contacts and then visually assessing whether the fold was correct. For the contact-based analysis, a contact was defined as any nucleotide-amino acid pair containing atoms within 5 Å of one another. The Matthews Correlation Coefficient (MCC) was used to score the contacts made by the predictions against those of the targets. The distribution of scores is shown in Figure 5A.
The highest-scoring models from each group with MCC scores above 0.1 (roughly the beginning of the non-zero peak in the distribution) were then visually assessed.
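The contact-based score described above can be sketched as follows, using the 5 Å contact definition and the standard MCC formula. This is a toy illustration rather than the assessors' code: the function names, residue labels, and coordinates are hypothetical, and counting every possible nucleotide-amino acid pair when deriving true negatives is an assumption.

```python
from math import dist, sqrt


def contacts(rna_atoms, protein_atoms, cutoff=5.0):
    """Return the set of (nucleotide, amino acid) pairs with any atoms within cutoff.

    Each argument maps a residue identifier to a list of (x, y, z) atom positions.
    """
    found = set()
    for nt, nt_xyz in rna_atoms.items():
        for aa, aa_xyz in protein_atoms.items():
            if any(dist(a, b) <= cutoff for a in nt_xyz for b in aa_xyz):
                found.add((nt, aa))
    return found


def contact_mcc(reference, predicted, n_nucleotides, n_amino_acids):
    """Matthews Correlation Coefficient of predicted versus reference contacts."""
    tp = len(reference & predicted)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    # Assumption: all remaining nucleotide-amino acid pairs are true negatives.
    tn = n_nucleotides * n_amino_acids - tp - fp - fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical two-nucleotide, two-amino-acid complex (one atom per residue).
rna = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]}
protein_reference = {"A1": [(0.0, 3.0, 0.0)], "A2": [(10.0, 4.0, 0.0)]}
protein_model = {"A1": [(0.0, 3.0, 0.0)], "A2": [(10.0, 9.0, 0.0)]}

reference_contacts = contacts(rna, protein_reference)
model_contacts = contacts(rna, protein_model)
mcc = contact_mcc(reference_contacts, model_contacts, len(rna), len(protein_model))
```

In this example the model recovers one of the two reference contacts and introduces no false contacts, giving an MCC of about 0.58.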
For the RNA folding pattern analysis, we needed to establish a well-defined descriptor for the RNA-protein binding arrangement that was not dependent on superposition (which was difficult for all the models). This was achieved by coloring each protein by the regions of interaction in the RNA with the lowest order. Region order was determined by RNA sequence position (where 5′ is low). Using this scheme, the colors blue (B), then red (R), then green (G) were assigned to the three RsmA homodimers in RT1189, and this pattern was compared for each model against the experimental structure (folding pattern: BRGRGB). In the case of RT1190, which involved only two RsmA homodimers, not all six regions of the RNA were bound; in particular, the regions of the RNA at approximately nucleotides 25 and 50 should not interact with a dimer. For RT1189, no models exhibited the correct folding pattern for interacting with the 6 RsmA proteins (Table 3).
For RT1190 (folding pattern string: B-R-RB), the best model according to the MCC score (MCC = 0.39) predicted the non-interacting RNA regions correctly ("-" in Table 3) but the RNA-protein contacts were [...] 3). In contrast, ranking based purely on RNA RMSD highlighted models from TS229 and other models from the Yang laboratory (Table 1); these models were less satisfactory from the point of view of protein-RNA contacts, showing the importance of complementary analyses in ranking these very difficult targets.

| Ranking based on direct comparison to cryo-EM maps
The "native" experimental models built from RNA cryo-EM maps may be particularly susceptible to biases from computational procedures or biases in human interpretation due to the generally low resolution of these maps (see, e.g., experimental model clashscores higher than 10 in Table 1, which typically arise from fitting errors). In particular, for RNA, when the cryo-EM map has resolution worse than ~3 Å, the separation between bases cannot be resolved and thus base placement can be highly dependent on the modeling approach used by the experimentalists. We therefore sought to rank CASP predictions based not on comparison to the reference coordinates provided by the experimenters ("model-to-model") but by comparison directly to the experimental maps ("map-to-model"). The feasibility of refining these predictions to model the cryo-EM maps is discussed elsewhere in this issue. 21
For all six RNA-only cryo-EM targets, there were models that could visually fit well into the maps (Figure S3). To determine a quantitative ranking of predictor groups, previously available map-to-model metrics were computed (Section 2; Figure S4). These map-to-model metrics were developed to assess goodness of fit for models prepared with knowledge of maps; many were not designed to account for very poorly fitted models, with unmodeled density and atoms outside density, as we have here. For example, atomic inclusion 43 penalizes predicted atoms that appear outside of density, and correlation coefficient at peaks (CC_peaks) 37 penalizes density that is not accounted for by a prediction. We attempted to find a combination of scores to balance these problems; however, in the end, we decided that no weighted combination of metrics was sufficient to enable ranking of all available models and predictors. Although overall correlation of map-to-model metrics to model-to-model metrics was high (Figure S5), there were outliers receiving high map scores for poor models by, for example, condensing all atoms into a single small area, most notably group 238 (Figure S6C). Thus, as in previous CASP evaluation for cryo-EM of protein targets, 69 we used a filter (Figure S6B), only ranking models that exhibited sufficiently high model-to-model scores. Due to the size dependence of TM-score and GDT_TS noted above, we decided to set this cutoff based on RMSD. The correlation between metrics was generally improved after this filtering (Figures S6A and S5B).
For ranking, we selected a set of metrics that correlated well with visual inspections of fit: the standard measures of cross-correlation, accounting for modeled (CC_mask) and unmodeled (CC_peaks) regions, and scores developed or shown to be most discriminatory for medium-resolution maps, namely atomic inclusion (AI), mutual information (MI), and Segment-based Manders' Overlap Coefficient (SMOC). 70,71 We note that none of the metrics tested is RNA specific; all can be used to assess any macromolecular complex. We used the previously described Z-score-based ranking, with uniform weighting of the selected metrics, to obtain Z_EM. AIchemy_RNA2 (TS232) achieved the highest Z_EM score, followed by Chen (TS287), GeneSilico (TS128), and RNApolis (TS081), and then others (Figure 6A). This ranking matched the model-to-model assessment (orange bars in Figure 6A). The overall ranking was also maintained, barring group 238, without filtering out poor models (Figure S5A); however, the filter should be retained until Z_EM is made robust to the problematic high scores of condensed models, for example by including clashscore.
FIGURE 6 (caption, continued) Only a subset of models with clear alignments to maps were included in the comparison; see Figure S5 for analysis over all models. (B) Group ranking for x-ray crystal structure targets based on Z-scores for metrics that directly compare the models to the crystallographic data (Z_MX).
Overall, the results show that assessing models by direct comparison to cryo-EM maps appears feasible and that the results are consistent with rankings based on model-to-model comparisons.
Direct map-to-model assessments may be particularly important in future CASP events as prediction accuracy increases and approaches the level of detail obtained at typical cryo-EM map resolutions.
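The uniform-weight Z-score ranking used in this section can be illustrated with a minimal Python sketch. It is not the assessors' code: the function names `two_pass_z` and `combined_z`, the toy scores, and the outlier cutoff of -2 are assumptions made for illustration, and the actual procedure is the one described in Section 2. The sketch assumes every metric is oriented so that higher is better, standardizes each metric across models with one round of low-outlier removal, and averages the per-metric Z-scores with equal weights.

```python
from statistics import mean, pstdev


def two_pass_z(values, outlier_cut=-2.0):
    """Z-scores after one round of low-outlier removal.

    The cutoff of -2 is an assumed value, not taken from the text.
    """
    def zscores(vals, reference):
        mu, sd = mean(reference), pstdev(reference)
        # With no spread in the reference set, no model can be ranked above another.
        return [0.0 if sd == 0 else (v - mu) / sd for v in vals]

    first_pass = zscores(values, values)
    kept = [v for v, z in zip(values, first_pass) if z >= outlier_cut]
    # Second pass: mean and standard deviation come from the retained models only.
    return zscores(values, kept)


def combined_z(metric_table):
    """Equal-weight average of per-metric Z-scores.

    metric_table maps a metric name to a list of scores (one per model,
    same model order in every list); higher must mean better.
    """
    per_metric = [two_pass_z(scores) for scores in metric_table.values()]
    return [sum(column) / len(per_metric) for column in zip(*per_metric)]


# Toy scores for three hypothetical models on two map-to-model metrics.
toy = {"CC_mask": [0.30, 0.55, 0.70], "AI": [0.40, 0.60, 0.80]}
toy_z = combined_z(toy)
ranking = sorted(range(3), key=lambda i: toy_z[i], reverse=True)
```

Here `ranking` lists model indices from best to worst under the combined score. Any further aggregation over targets, such as summing only the positive Z-scores per group, would be a separate step and is left out of this sketch.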

| Ranking based on direct comparison to crystallographic data
In analogy to the map-based assessment of cryo-EM targets in the previous section, we investigated whether similar comparisons to the experimental data might enable ranking of the four RNA targets solved by x-ray macromolecular crystallography (MX). As above, the experimentally derived model was used only to align predictor models. All predictor models were compared directly to the crystallographic data by first placing the model ideally using RNAalign 29 and then calculating a Log Likelihood Gain (LLG) and translation-function Z-score (TFZ) with Phaser's RNP search 45 and a global map CC with phenix.get_cc_mtz_pdb. 37 We used a Z-score-based ranking after a round of outlier removal (see Section 2), with uniform weighting of these three metrics, to obtain Z_MX. The rankings are most strongly influenced by performance on R1117, since Z_MX scores for the other targets were relatively uniform and comparatively poor (Figure 6B). The top-ranking groups by this metric were TS232, TS287, and TS128 (AIchemy_RNA2, Chen, and GeneSilico, respectively), which were also the three groups that succeeded in follow-up molecular replacement trials for R1117; see Section 3.8.

| CASP15 RNA models with accurate global folds miss detailed features and aspects of conformational heterogeneity
Ranking CASP15 RNA predictions based on the quantitative comparisons above highlighted several models for more detailed visual inspection, which revealed their potential and limitations. One example, the chimpanzee CPEB3 ribozyme R1108 (Figure 7), illustrates the use of the Deformation Profile and variable accuracy in targets of "medium" difficulty (Table 1). In Figure 7A, the superimposition of the experimental structure with the best model (TS232_4, from AIchemy_RNA2) is shown with the large deviations at the apical loops. The positions of these loops on the Deformation Profile (Figure 7A,B) are indicated, highlighting the restricted regions with high discrepancies.
One of the highly successful models is that of the paranemic crossover triangle (PXT) R1128, a molecule with no natural homologs whose difficulty for modeling was unclear before the CASP15 results. 73 It is a designed sequence made of four 4-way junctions and a co-axial stack between terminal helices (Figure 7C-F). The modeling success can be partly explained by the folding constraints of the design and the use of known structural modules. The helices are regular, with known GU pairs and capping UNCG loops, and without unpaired or bulging residues (Figure 7C). The tight junctions and the bulky RNA helices impose strong constraints on the fold and prevent knot formation (Figure 7D). The good accuracy of the modeling (TS232_1), with an RMSD of 4.3 Å and an INF of 0.88, is apparent in the deformation profile, which shows a rather uniform deformation throughout (Figure 7E). The main errors originate in the twist angles between stacked helices in the 4-way junctions, which propagate maximally toward the apical loops (Figure 7F). In the experimentally determined structure, at those 4-way junctions, there are H-bonds linking one hydroxyl O2′ atom to an anionic phosphate oxygen of a residue on the crossing strand, maintaining a tight packing. These H-bonds are not present in the modeled structure, leading to a looser packing and a slightly larger twist angle (Figure 7G). Despite these errors in fine details, the CASP15 blind model TS232_1 was closer to the cryo-EM-derived structure than the original model of the PXT structure designed by Andersen and colleagues (see the paper co-submitted to the CASP15 special issue 20).
Indeed, for all four non-natural RNA targets in CASP15 (Table 1), the AIchemy_RNA2 group (TS232) submitted models that were visually accurate (Figure 1). Furthermore, this group, along with Chen (TS287) and RNAPolis (TS081), was notably separated from other groups, including all automated servers, for these non-natural targets, suggesting that these predictors benefitted from their human intuition to recognize the secondary structures and overall tertiary folds intended by the nanostructures' human designers. Interestingly, in all four cases, the predictor groups were able to blindly predict structures that agreed better with the cryo-EM maps than the original models made by Andersen and colleagues when they designed the nanostructures. As another example, for R1138 (six-helix bundle, Figure 7G,H), the original design and the cryo-EM structure of the "mature" form of the RNA agree in overall global fold, as reflected by a TM-score of 0.623, well above the 0.45 threshold (Figure 7G,H). Nevertheless, the AIchemy_RNA2 model TS232_4 achieves an even higher TM-score of 0.800 (Figure 7I). These results suggest that, despite the lack of natural sequence homologs, "non-natural" RNA targets could be considered "easy" for 3D RNA structure prediction, as long as they are composed of readily identifiable helices and noncanonical motifs.
Interestingly, for the same R1138 six-helix bundle, cryo-EM also captured a distinct "young" structure for the RNA (Figure 7J) that is dominant immediately after transcription of the RNA and requires hours to resolve into the "mature" form. 62 The "young" and "mature" structures do not differ in their Watson-Crick-Franklin helices but, to interconvert, would require breaking of a kissing-loop interaction, twisting of the two kissing elements about their helical axes, and then reformation of the kissing loop. 63 None of the CASP groups produced models close to the "young" structure. Other natural and designed RNA systems are known to display similar kinetic traps and topological isomers, 74,75 and it will be interesting to see whether such conformations can be blindly predicted in future CASPs.
A common theme was that the model ordering as submitted by the predictor groups generally did not correspond to the ranking based on RMSDs (or other metrics) between experimental and model structures. This was the case for the R1108 and R1138 targets noted above, where the fourth models from group TS232, and not the first models, were most accurate. Overall, in 63% of the sets of CASP15 predictor submissions across all 12 RNA targets, a model submitted as 2-5 was better than model 1 by GDT_TS, and the difference in GDT_TS between model 1 and the top-scoring model for each group was no lower than if model 1 had been randomly selected (Figure S7). The models from group TS110 (DF_RNA) for the "difficult" target R1149 (the SL5 domain from SARS-CoV-2) provide an additional example.
The best RMSD among all CASP15 submissions for this target is that of model #2 from TS110, as depicted in Figure 8A-D. The RMSD between the experimental structure and TS110_2 is 6.9 Å (superposition shown in Figure 8A, with the respective Deformation Profile in Figure 8B). On the other hand, the RMSD between the experimental structure and the first model, TS110_1, is 21.7 Å. The superposition (Figure 8C) and the corresponding Deformation Profile (Figure 8D) confirm that the global fold of TS110_1 is inaccurate despite its submission as model 1. In particular, the reddish regions indicate where the discrepancies are largest; they concentrate at the 4-way junctions, where the experimental structure is more compact, and with H-bonding contacts between the strands, than the model structure, as shown in Figures 7C and 8A.
Further inspection of TS110_2 illustrates the need to pay attention to non-Watson-Crick pairs, beyond the standard Watson-Crick pairs of the secondary structure, both in prediction and in assessment of RNA targets resolved by cryo-EM. Figure 8E,F shows the 2D structures for R1149 as derived from the cryo-EM map (Figure 8E) and from the best-RMSD model TS110_2 (Figure 8F).
The region within the black ellipse (Figure 8G) contains a GU and a UU pair, but in the modeled structure, only the GU pair is reproduced and, while the right Us face each other, they do not form a pair (Figure 8H). In the region circled in red, the fold of the single-stranded loop is missed, and in the one circled in green, the fold leads to several bad contacts between residues, which may explain the rather high clashscore of 31 for TS110_2, despite the overall good fit in the relative orientations between the helices (Figure 8A). It is important to note that, for these regions, alternative structures in the experimentalists' 10-model cryo-EM ensemble show breaking of the features, similar to the prediction TS110_2, and so it is possible that the conformations modeled in TS110_2 occur in the actual cryo-ensemble for the target R1149. Nevertheless, these model discrepancies lead to deviations of the strands in the four-way junction that, in turn, lead to variations in the arms at the junction (Figure 8G). Indeed, all 10 members of the experimental cryo-EM ensemble show complete base pairing at the molecule's central four-way junction, which is inconsistent with the incomplete junction base pairing in TS110_2 (Figure 8F).
The presence of alternative structures, noted above for the non-natural six-helix bundle R1138, was a common theme in RNA targets in CASP (Table 1), and was particularly interesting in one target with continuous heterogeneity. R1156 is a homolog of the same SARS-CoV-2 SL5 domain as R1149 and showed flexibility in one helix (blue, Figure 8H,I), which was represented in the cryo-EM analysis as four subclassified maps. Comparing models directly to these experimental maps highlighted models of particularly excellent quality that fit into the maps nearly as well as the reference models prepared by experimentalists using the maps (Figure S3). In particular, the model TS128_5 from GeneSilico fits into the experimental map with excellent scores (Figure 8H,I). Fitting this model into the highest-resolution map of the four, conformation 1, we can see visually and numerically that the model fits well with respect to three helices but poorly with respect to the flexible helix (Figure 8H). However, the model fits better in the second conformation, obtaining map-to-model atomic inclusion scores comparable to those achieved by models derived with knowledge of the map (Figure 8I). This comparison revealed the importance of representing the ensemble of structures the RNA can form, so as not to penalize prediction of structures that do form but cannot be captured by a single experimental structure.
In summary, inspection of top-ranked CASP15 RNA models confirms, in each case, good prediction of global fold but also reveals fine details and/or aspects of conformational heterogeneity that have not been captured by the models. The ordering of each set of five models by the predictor groups also typically did not correlate with the models' accuracy. Similar conclusions for R1108, R1128, R1138, R1149, and R1156, based on alternative analyses by RNA experimental groups, are described in a separate paper prepared for the CASP issue. 20
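The model-ordering statistics quoted in this section (the fraction of submission sets in which a later model beats model 1, and the comparison with random selection) can be computed along the following lines. This is a hedged sketch with made-up GDT_TS values; the function name `model1_selection_stats` and the input layout are assumptions for illustration, not the assessors' implementation.

```python
def model1_selection_stats(submissions):
    """Summarize how well groups picked their model 1.

    submissions: list of score lists, one list per (group, target) set,
    ordered as submitted (index 0 is model 1); higher score is better.
    Returns (fraction of sets where a later model beats model 1,
             mean loss of model 1 relative to the best model,
             mean expected loss if model 1 were picked uniformly at random).
    """
    beaten = 0
    model1_loss = []
    random_loss = []
    for scores in submissions:
        best = max(scores)
        if best > scores[0]:
            beaten += 1
        model1_loss.append(best - scores[0])
        # Expected score of a uniformly random pick is the mean of the set.
        random_loss.append(best - sum(scores) / len(scores))
    n = len(submissions)
    return beaten / n, sum(model1_loss) / n, sum(random_loss) / n


# Made-up GDT_TS values for four hypothetical submission sets.
toy_sets = [[40, 55, 30], [60, 50], [20, 20, 25], [70]]
fraction, loss_model1, loss_random = model1_selection_stats(toy_sets)
```

If `loss_model1` is not clearly smaller than `loss_random`, the submitted ordering carries little information about accuracy, which is the situation the text reports for CASP15.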

| Potential utility of RNA models for molecular replacement
The general global fold accuracy of the CASP15 RNA tertiary structure models motivated us to explore their potential utility for phasing x-ray diffraction data by molecular replacement (MR), which has previously been carried out in very few cases. 76 While we began these explorations in the studies described above to rank models based on agreement with x-ray data (Section 3.6), such scores based on optimal placements do not necessarily reflect the models' value as search models for real-world MR. For example, a largely accurate model may prove unsuccessful if inaccurately modeled portions lead to severe crystal lattice packing clashes.
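We therefore carried out more realistic MR runs, first on all unmodified models of R1117. This initial analysis was restricted to R1117 since visual examination and LLG calculations suggested that models of other targets would require some kind of editing to succeed (see below). Across the up to five models submitted by each group, we found that 3 out of 34 groups succeeded with at least one model, using a global map correlation coefficient CC > 0.2 as the criterion of success (Figure S8). Among these successful groups, however, the quality of MR solutions varied significantly. Figure 9 shows the successful solution with model 287_3. For the other three CASP targets solved by x-ray diffraction, visual inspection and the Z_MX values in Figure 6B made clear that editing of the predictions would be required for successful MR, and, to focus resources, models from TS232 (AIchemy_RNA2) were subjected to various editing procedures. For solution of the two CPEB3 ribozymes, R1107 (one protein chain, one RNA chain) and R1108 (two protein chains, two RNA chains), the structural variance observed in group TS232 models after structural alignment with Theseus 78 was used as an indication of local prediction reliability, and divergent regions were removed before the edited model_1 was used as a search model. This approach borrows from that taken for proteins by the MR pipeline AMPLE. 79 R1107 was successfully solved by first placing the protein chain and then the edited RNA search model, both with Phaser. The result (Figure 9) has an R_free of 26%, and visible density for the missing part of the RNA molecule confirms that it could be readily refined and completed. R1108, a close homolog of R1107, proved much more difficult to solve, perhaps owing to the different conformations observed between the two RNA chains in the asymmetric unit. When attempting to solve this structure similarly (protein first, then RNA), we could place the protein component, but the RNA component was reversed, providing only a partial solution. The truncated group TS232 models for R1108 were of sufficient quality to solve R1107, and the resulting protein/RNA complex could then be used to solve R1108 with an R_free of 41%.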
Inspection of the group TS232 models for R1116 showed that more extensive model editing would be required. A modified version of Slice'N'Dice 52 was therefore used to split model_1 into three structural units. A portion comprising nucleotides 1-24;125-157 could then be placed with MOLREP, which indicated a partial solution after refinement (R_fact 48%, R_free 52%). Three copies of a second fragment comprising nucleotides 38-63 could then be placed to largely complete the structure, with Phaser scores of LLG: 1324 and TFZ: 9.6. These values are unambiguously indicative of successful MR: for example, TFZ > 8 corresponds to "Definitely" solved according to Phaser software guidance. 80 The result (Figure 9) has an R_free of 43%, an acceptable value for a model immediately after MR. These results demonstrate that all of the RNA crystal structure targets in CASP15 could, one way or another, be solved by MR, although it is recognized that further refinement and completion (not attempted here) could be challenging, especially at 3.0 Å or worse resolution.
FIGURE 9 Molecular replacement (MR) of x-ray crystallographic data using CASP15 models (and AlphaFold 2 models of U1ABD in the cases of R1107 and R1108). Group TS232 models formed the basis of all successful search models shown except R1117 (group TS287).
Nevertheless, fine details such as noncanonical pairs and hydrogen bonding at junctions were inaccurate in these models, even when taking into account sources of uncertainty for the experimental structures. Conformational heterogeneity in some targets, R1136 and R1156, was indicated by the presence of multiple structures captured by cryo-EM but was not captured by any group in their range of submitted models (Figure S9). Despite these caveats, the general global fold accuracy for RNA-only targets, even those without homologs of known structure, and the ability of models, with some curation, to enable molecular replacement of all 4 x-ray diffraction data sets suggest reason for optimism.
Has there been improvement in RNA modeling in CASP15 compared to prior RNA-Puzzles? Achieving accurate positioning of helices with respect to each other by modeling is often feasible when the single-stranded segments are short and unpaired, because RNA helices are bulky and the interconnecting strands each have a polarity, leading to a reduced search space for modeling helix arrangements. 10,11 During this CASP15 experiment, some research groups tried prediction approaches similar to the AI-based methods used for predicting protein structures. For example, AIchemy_RNA 81 uses an end-to-end differentiable network inspired by AlphaFold 2. 50 However, these AI-based predictions did not perform as well as expected and did not surpass prediction methods previously tested in RNA-Puzzles (SimRNA, Chen, RNApolis), which have been continuously improving for the past decade. The AI-based approaches 81,82 also failed to demonstrate the accuracy claimed in their preprint papers, perhaps because of the limited amount of training data.
In addition to not using deep learning, the top four RNA predictors shared the property that they were not servers and, based on their own accounts (see papers co-submitted for this CASP issue 65-68), they appeared to still make use of human intuition. While there were cases where server models were more accurate than "human" models from the same laboratory (e.g., Yang), server models were generally worse in quality than those of the top four human predictor groups. Going forward, an important frontier for the RNA structure prediction field will be automation, so that methods can be more widely used and applied at the genomic scale, as is now the case for protein structure prediction methods. While the sparser data available for RNA structure, compared to protein structure, have complicated development of robust deep learning algorithms, recent accelerations in RNA structure determination, particularly from cryo-EM, 12 and the availability of high-throughput sequencing-based methods sensitive to RNA structure 83 may help close the gap between RNA and protein computational methods. Interestingly, secondary structures from even the top server predictions were poorer than those from "human" groups, highlighting an area of potentially immediate improvement.
In addition to being the first CASP experiment for RNA structure prediction, CASP15 was also the first CASP experiment for RNA structure assessment, and future CASP RNA trials can benefit from some lessons learned by the assessors, three of which we discuss here. First, CASP15 included few truly difficult RNA targets, and these were solved by cryo-EM at resolutions worse than 3 Å. It will be important for upcoming CASP competitions to bring in experimental groups solving natural RNA targets without previously solved homologs at near-atomic resolution. Such molecules are being discovered and structurally characterized at increasing frequency, particularly for biologically interesting RNA-protein complexes. It may also be useful to develop a fully automated classification scheme for easy, medium, and difficult RNA targets and to assess targets from these categories separately, as was traditional in CASP before the success of deep learning approaches rendered these categories less useful for proteins.
Second, while only 2 of 12 targets in CASP15 were RNA-protein complexes, it seems feasible that CASP16 will involve more RNA-protein complexes, given their biological importance and amenability to cryo-EM. For assessment, it will therefore be increasingly important to develop quantitative metrics that make sense across RNA, protein, and RNA-protein interfaces. We found here that standard metrics for protein structure accuracy assessment, GDT_TS and TM-score, were useful in ranking RNA models, but their values for visually excellent RNA models seemed anomalously low for large and small targets, respectively. More local measures of accuracy, like lDDT, and assessments of contact accuracy appeared useful here for both RNA and RNA-protein targets. These more local measures may be less affected by length variation and also more robust to dynamic fluctuations that appear common in large, extended RNA structures. The recent availability of lDDT for RNA may allow more testing of this metric in continuous trials like CAMEO and RNA-Puzzles before the next CASP.
Third, many and perhaps most of the CASP15 RNA targets showed conformational flexibility, as evidenced, for example, by differences in conformations of different monomers in crystallographic asymmetric units or, in cryo-ensembles captured by electron microscopy, by classes of conformations separable by automated subclassification and/or 3D variability analysis. 39 In the current assessment, predictor groups were scored based on the best observed agreement of all their submitted models vs. all available experimental models, effectively assuming that modelers were predicting single structures.
Modeling of the full ensemble nature of these RNA systems was neither incentivized nor assessed. In future CASPs, acceptance of multi-model ensembles (with, e.g., 100s or 1000s of models within each of 5 ensembles), rather than separate single-structure models, would better incentivize development of methods for predicting conformational ensembles of macromolecules, including molecular dynamics methods that have previously been difficult to assess. Furthermore, scoring of these ensembles directly against data should be feasible; for example, log-likelihood frameworks and GPU-enabled software 84 might enable predicted multi-model cryo-ensembles to be compared to the entire collection of electron micrographs collected for a target.
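The scoring convention described above, in which a group is credited with the best agreement found between any of its submitted models and any available experimental model, can be sketched as follows. The sketch uses a plain coordinate RMSD without superposition on toy coordinates purely to keep it short; `rmsd` and `best_pair_rmsd` are hypothetical names, and the actual assessment used the metrics described elsewhere in the paper.

```python
import math


def rmsd(a, b):
    """Coordinate RMSD between two equal-length lists of (x, y, z) points.

    No superposition is performed; this is a simplification for illustration.
    """
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def best_pair_rmsd(models, references):
    """Lowest RMSD over every (submitted model, experimental model) pair."""
    return min(rmsd(m, r) for m in models for r in references)


# Two toy "submitted models" and two toy "experimental conformations".
toy_models = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (3, 0, 0)]]
toy_references = [[(0, 0, 0), (2, 0, 0)], [(0, 0, 0), (5, 0, 0)]]
best = best_pair_rmsd(toy_models, toy_references)
```

Because only the single best pair counts, a group gains nothing under this scheme from covering several experimental conformations, which is the limitation the text raises.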
The DI is then computed as RMSD/INF. Several partial INF values (and the respective DI values) can be computed considering only the Watson-Crick (WC) base pairs (INF_WC), the non-Watson-Crick (NWC) base pairs (INF_NWC), both WC and NWC base pairs (INF_BPS), or the stacking interactions (INF_STACK). Finally, the Deformation Profile is a distance matrix computed as the average RMSD between the individual bases of the predicted and the reference models while superim-
score, GDT_TS, lDDT, INF, and INF_WC values for all targets. Scores for all models submitted for all targets are depicted (points are randomly jittered horizontally to aid visualization). Models from the four top performing groups and the top two server groups are highlighted as colored points, and all other groups' models are shown as gray points. Red lines indicate the median deviation between experimentally determined models for alternate conformations, black lines indicate the deviation between alternate models derived from experimental data for the same conformation, and blue lines indicate the deviation between homologous structures (see main text).
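As a worked illustration of the DI defined in this passage, the sketch below computes an INF value from two sets of annotated interactions and divides the RMSD by it. The INF definition is cut off in the text above, so the form used here, the geometric mean of precision and sensitivity over interactions, is an assumption based on common usage of the metric; the function names and the toy base-pair lists are likewise hypothetical.

```python
import math


def inf_score(reference_pairs, model_pairs):
    """Interaction network fidelity, assumed here to be sqrt(precision * sensitivity)."""
    reference, model = set(reference_pairs), set(model_pairs)
    true_pos = len(reference & model)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(model)
    sensitivity = true_pos / len(reference)
    return math.sqrt(precision * sensitivity)


def deformation_index(rmsd, inf):
    """DI = RMSD / INF; lower is better. Returned as infinity when INF is zero."""
    return math.inf if inf == 0 else rmsd / inf


# Toy base-pair lists as (i, j) residue-index tuples.
ref_pairs = [(1, 10), (2, 9), (3, 8), (4, 7)]
mod_pairs = [(1, 10), (2, 9), (3, 8), (5, 6)]
toy_inf = inf_score(ref_pairs, mod_pairs)  # three of four pairs shared in each set
toy_di = deformation_index(4.5, toy_inf)
```

With the values quoted in the text for TS232_1 on R1128 (RMSD 4.3 Å, INF 0.88), `deformation_index(4.3, 0.88)` gives a DI of about 4.9 Å. Restricting the input lists to WC pairs, NWC pairs, or stacking contacts would give the partial INF values in the same way.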
30 nucleotides, and the TM-score values, which involve a length-dependent distance parameter, are much lower than GDT_TS values (Figure 2A,B). The accuracies reflected by GDT_TS match expected accuracies gauged by visual examination. On the other hand, models that visually captured correct folds for large designed RNAs (R1126, R1128, R1136, R1138) were properly assigned high TM-scores, while GDT_TS scores were mostly lower than 45 (Figure 2A,B). For predictor models for a given target, the TM-score and GDT_TS correlated
Comparison of assessment metrics for RNA targets. (A) Scores for all models for representative short target R1107 (blue) and long target R1136 (orange): top-left, TM-score vs. GDT_TS, and top-right, RMSD vs. GDT_TS, to compare across global fold metrics; bottom-left, lDDT vs. INF, compares the two local metrics; and bottom-right, lDDT vs. GDT_TS, compares global fold to local metrics. (B) Average Spearman rank correlation coefficient (calculated separately per target, then averaged over all targets)

Figure 2), though in no case were there models whose accuracies exceeded the experimental precision expected for a single captured conformation (black lines, Figure 2). To rank the performance of predictors, we developed a Z-score metric that enabled combined evaluation of models' global fold, local accuracy, and stereochemical correctness. Our global fold accuracy scores included the TM-score and GDT_TS, our more local accuracy scores consisted of INF and lDDT, and our stereochemical correctness

FIGURE 4 CASP-style Z-score-based rankings. (A) Heatmap of groups ranked by Z_RNA. Groups that used deep learning, as reported in the participants' abstracts to CASP15, are indicated in orange. The summation of positive two-pass Z-scores for each of the 12 targets is summarized in the bar plot (right). Groups are ordered by their Z_RNA rankings. (B) Robustness of ranking to different choices in assessment. Columns show group rankings based on subsets of the Z_RNA score or individual metrics; coloring reflects rankings under each metric.

FIGURE 5 Folding pattern analysis of RNA-protein complexes. (A) Histograms of Matthews Correlation Coefficients (MCC) for RNA-protein contact accuracy in the two RNA-protein targets RT1189 and RT1190 (RsmZ-RsmA RNA-protein complexes). (B) Scheme for classifying the folding pattern of RNA based on the order of protein contacts to RNA. Each dimer is assigned a color based on the order in which it was visited. Experimental cryo-EM structures are shown at top, with positions of binding on RNA diagrammed below.
made in the wrong order (B-R-BR). Many of the lower-scoring models (MCC = 0.21-0.29) did contain interacting regions in the correct order but misplaced the non-interacting regions. As judged by this MCC contact-based score supplemented by protein-binding folding pattern analysis, TS119 (Kiharalab) and TS329 (LCBio) produced top-3 models for both targets (Table

FIGURE 6 Ranking of CASP RNA predictions based on direct comparison to experimental data. (A) Ranking of six RNA-only cryo-EM targets based on Z-scores for map-to-model metrics (Z_EM).
, the superimposition of the experimental structure with the best model (TS232_4, from AIChemy_RNA2) is shown with the large deviations at the apical loops.The positions of these loops on the Deformation Profile (Figure 7A,B) are indicated highlighting the restricted regions with high discrepancies.
), the original design and the cryo-EM structure of the "mature" form of the RNA agree in overall global fold, as reflected by a TM-score
FIGURE 7 Detailed inspection of "medium" and "non-natural" targets. (A) For R1108 (chimpanzee CPEB3 ribozyme), superimposition of the experimental structure (green) with the best model (TS232_4 from AIchemy_RNA2, blue, RMSD 4.5 Å) is shown. Note the large deviations at the apical loops (red, yellow, and pink) and their positions on (B), the Deformation Profile. (C) Diagram of the secondary structure (2D) of target R1128, a designed paranemic crossover triangle. The helices are numbered from H1 to H12. The secondary structure contains four 4-way junctions. In the two 4-way junctions drawn as "open," helix H1 stacks with H2 and H3 with H7 for one 4-way junction and, for the second one, helix H8 stacks with H9 and H10 with H12. Helices H1 and H8 are stacked together. The pairs between G and U are marked by a dark dot (G•U pair). The Leontis-Westhof 72 symbols are used to annotate the Watson-Crick/Sugar edge pair between G and U in the capping apical 5′UUCG3′ tetraloops. (D) Experimental structure (green) superimposed on the model TS232_1 (blue) with the lowest RMSD (4.3 Å). (E) The deformation profile (see Section 2) between the same set of structures (at the right, the color scale, where white represents excellent superimposition). The reddish regions indicate where the discrepancies are largest; they concentrate at the 4-way junctions where the experimental structure is more compact and with H-bonding contacts between the strands than the model structure as shown in (F). (G-J) Models for R1128 (Paranemic Crossover Triangle, PXT). Cryo-EM of mature conformation (G) agrees better with blind CASP model TS232_4 (H) than with original models prepared by this nanostructure's designers (I). Cryo-EM also captured an early folding intermediate (J) that was not predicted well by any CASP15 groups.

FIGURE 8 Detailed inspection of "difficult" targets, two coronavirus SL5 domains solved by cryo-EM. (A) Superposition between the R1149 cryo-EM structure (first of 10 models representing experimental uncertainty) and the closest CASP15 prediction according to RMSD (TS110_2, 6.9 Å). (B) Deformation profile between the same two structures. (C) Superposition between the experimental structure (R1149) and the model ranked #1 by the modeling group (TS110_1, 21.7 Å). (D) Deformation profile between the same two structures. (E) Diagram of the secondary structure (2D) of target R1149 (first of 10 models representing experimental uncertainty). (F) Diagram of the secondary structure (2D) of the closest model, TS110_2. The outlines indicate regions with large discrepancies due to wrong 2D pairs and absence of 3D pairs. For example, in the model structure, the U54/U36 pair is not present, and the region circled in green shows a region with high clashscore. (G) Backbone traces of the experimental (green) and model (blue) structures showing the overall fit of the helices; however, as shown in the inset, the wrong choices in internal loops lead to large deviations in the path of the backbone at the central 4-way junction. (H, I) Experimental maps and models (gray) for R1156, whose cryo-EM data were subclassified into four separate conformations; conformations 1 (H) and 2 (I) are compared to the top-scoring CASP prediction TS128_5 (color).
Figure 9 shows the successful solution with model 287_3.For the other three CASP targets solved by x-ray diffraction, visual inspection and the Z MX values in Figure6Bmade clear that editing of the predictions would be required for successful MR, and, to focus resources, models from TS232 (AIchemy_RNA2) were subjected to various editing procedures.For solution of the two CPEB3 ribozymes, R1107 (one protein chain, one RNA chain) and R1108 (two protein chains, two RNA chains), the structural variance observed in group TS232 models after structural alignment with Theseus 78 was used as an indication of local prediction reliability and divergent regions removed before the edited model_1 was used as a search model.This approach borrows from that taken for proteins by the MR pipeline AMPLE.79R1107 was successfully solved by first placing the protein chain then the edited RNA search model, both with Phaser.

Table 1 gives the best RMSDs reached by the modeling groups for the 12 targets; they range between 2 Å and close to 17 Å, with many models being in the range between 4.3 Å and 8.3 Å. The trend follows the difficulty level of the targets. Interestingly, for the non-natural designed RNAs, the RMSDs reached are below 8.3 Å. It can be recalled that in a double-stranded RNA helix, the average distance between two successive phosphate groups is 7 Å. However, broadly speaking, except for targets R1189 and R1190 (for which the RMSDs reached are beyond 16 Å, see Table

Table 2). Ranking groups by clashscore alone did not correlate with the other rankings (Figures 3B and 4B), presumably because different predictors used somewhat different refinement schemes and were
TABLE 3 Matthews correlation coefficients and folding pattern of the best model from each group with an MCC greater than 0.1.
Note: The symbols B, R, G, and "-" indicate blue, red, green, and unbound regions as per Figure 5B.