On the evolution of the quality of macromolecular models in the PDB

Crystallographic models of biological macromolecules have been ranked using the quality criteria associated with them in the Protein Data Bank (PDB). The outcomes of this quality analysis have been correlated with time and with the journals that published papers based on those models. The results show that the overall quality of PDB structures has substantially improved over the last ten years, but this period of progress was preceded by several years of stagnation or even depression. Moreover, the study shows that the historically observed negative correlation between journal impact and the quality of structural models presented therein seems to disappear as time progresses.


Introduction
Structural biology has fulfilled a history-changing mission in science at the interface of physics, chemistry, and biology, as for over six decades it has maintained its leading role in providing the structural basis for our understanding of life [1][2][3][4]. Its results have always been regarded as exceptionally solid and have created a gold standard in biological research, almost unattainable in many other areas of the life sciences. This view has largely persisted until today, in part even fortified by the incredible technical advances in the generation and detection of X-rays, progress in computer software development, the revolution in biotechnology, and innovations in crystallogenesis. However, with the expansion of the Protein Data Bank (PDB) [5] from merely seven structures at its inception in 1971 to ~160 000 today, it is inevitable that some of the macromolecular models will be subpar and sometimes even incorrect. Unfortunately, suboptimal structures have a tangible negative impact on biomedical research that relies on structural data [6]. However, crystallographers, who have always been at the forefront of structural biology, seem in this regard, too, to be setting an example of how to deal with suboptimal or irreproducible science. The protein crystallographic community has been made painfully aware of these problems [7][8][9][10], partly by the rising wave of concern about the irreproducibility of scientific research, and of biomedical research in particular [11]. This awareness has led to positive outcomes, such as the development of structural validation criteria and protocols, or of tools for the detection and correction of model errors [12].
The PDB itself, which is the chief custodian of the structural treasury amassed by structural biologists, has been developing tools and standards for the assessment of the quality of the structural models deposited in its archives [13,14]. Similarly, more and more journals are starting to require structural validation reports generated by the PDB upon manuscript submission. However, in some opinions these actions are still insufficient, and many problems could be better checked at the source rather than tracked down in time-delayed model-correction actions [6,15], when the ripple effect of structural errors may have already taken its toll. Objectively speaking, however, in view of the immense scale of the PDB, one should in fact be grateful for all the effort already taken and the plans proposed for the future of the data bank. In particular, the PDB has been developing a consistent and informative set of quality indicators, which now accompany each new crystal structure deposition. These indicators have recently been used to assess the evolution of the quality of PDB deposits with time [16].
However, it is not only the PDB that bears the responsibility for maintaining the high standard of the structural information generated by structural biology. The prime burden is of course on the authors, but this is usually the weakest link: rarely because of ill intention or fraud, and more frequently because of haste, lack of training, lack of supervision, or the delusive belief that the incredible recent progress has converted crystallography into a very easy and almost completely automatic analytical method. A significant share of the responsibility rests with the referees and editors of the journals that publish those results, as the ripple effect of error and fatal contamination of science are most efficiently propagated through cited literature [17]. More than a decade ago, Brown and Ramaswamy (hereinafter B&R) published a survey of the quality of crystallographic models of biological macromolecules [18] and correlated the results with the journals in which those models had been published. The results came as a bit of a shock to many, because it turned out that the journals usually regarded as the most prestigious were found to publish worse-than-average structures when compared with other journals. The FEBS Journal was one of the first to request structure validation reports and was thus among the top journals in the B&R ranking list. Similar questions have been raised by Read and Kleywegt (hereinafter R&K), albeit using different statistical tools [19]. In contrast to the B&R study, R&K reported very small quality differences between structures published in high-impact journals and in other venues.
Nearly 13 years after the B&R study, and with the PDB expanded nearly four times, we decided to conduct a similar analysis to see whether the community at large, or at least its journals, has improved. In our approach, we used the statistical methods of data imputation and principal component analysis (PCA) of the model quality indicators recommended by the PDB. In contrast to previous studies, which focused on protein structures only, our analysis comprises all crystallographic structures in the PDB, that is, it also includes nucleic acids. Moreover, we also consider models marked as To be published, which were not analyzed by B&R or R&K. Although the scope of data and the statistical tools we are using are different from those used by B&R in 2007, we are still able to compare the journal rankings of the two surveys because our approach may be easily adapted to a retrospective analysis of data from past versions of the PDB. It is important to clarify that the omission of NMR and cryo-EM structures was intentional. Considering the difficulties connected with estimating the quality of NMR and cryo-EM structural models, and also the very small contribution of both of these methods to the characterization of structures that contain ligands (and are thus most interesting and important), we decided to focus on models provided by X-ray crystallography, which represent 89% of all models currently deposited in the PDB.
Our results show that the overall quality of PDB structures has substantially improved over the last 10 years. However, our study also shows that this period of improvement was preceded by several years of stagnation or, if one considers the improvement of software and hardware over time, even depression. Finally, the observation made by B&R that journal impact factor (reputation) is frequently negatively correlated with structure quality is no longer true.

Results
Measure of overall model quality and missing data imputation

To assess the quality of structures published in particular journals, we initially attempted to use the Q1_p measure proposed by Shao et al. [16]. Q1_p is a measure of overall protein structure quality that combines into one number five different indicators: Rfree, RSRZ (normalized real-space R-factor) outliers, Ramachandran outliers, Rotamer outliers, and Clashscore [20], using the following formula:

Q1_p = (P_Rfree + P_%RSRZ + P_PC1(geometry)) / 3    (1)

where P_Rfree, P_%RSRZ, and P_PC1(geometry) are ranking percentiles (the higher the better), characterizing for a given structural model, respectively, its Rfree, percentage of RSRZ outliers, and the first principal component of the PCA of Ramachandran outliers, Rotamer outliers, and Clashscore (see Methods section for details). Once Q1_p is calculated, each PDB deposit is ranked within the population to obtain its final ranking percentile P_Q1p, with the lowest (worst) value of Q1_p at 0% and the highest (best) at 100% [16]. We note that in this paper we took an averaging approach to percentiles, that is, a group of tied Q1_p values was assigned the same percentile rank, namely the average rank of the group. By combining five distinct quality measures, P_Q1p provides a simple way of comprehensively comparing and ranking many structural models. The P_Q1p metric was originally designed to assess protein structures only. For nucleic acid structures, which are also present in the PDB, Q1_p cannot be used directly because the notions of Ramachandran and Rotamer outliers are not applicable to those structures. However, for proteins both missing elements are implicitly contained in P_PC1(geometry). Therefore, for nucleic acids we calculated an analogous Q1_n without the use of PCA, applying the following simplified formula:

Q1_n = (P_Rfree + P_%RSRZ + P_Clashscore) / 3    (2)

where P_Clashscore is the ranking percentile of Clashscore.
In the following analysis, Q1_p and Q1_n (and, consequently, P_Q1p and P_Q1n) were computed separately for proteins and nucleic acids, respectively. This way, the percentiles P_Q1p and P_Q1n rank structures of the respective type. Protein-nucleic acid complexes were assigned to the protein group, since it is possible to calculate all quality metrics for such structures.
Since averaging multiple quality metrics might potentially blur the spotlight on models with serious problems, an alternative aggregation method could involve taking only the minimum percentile of all the metrics used. In this approach, a structure is considered as good as its weakest feature, according to the following formulas:

Q1_p^min = min(P_Rfree, P_%RSRZ, P_Ramachandran, P_Rotamer, P_Clashscore)    (3)

Q1_n^min = min(P_Rfree, P_%RSRZ, P_Clashscore)    (4)

In the remainder of the paper, we will focus mainly on the averaging approach using Eqns (1, 2), but will also compare it with the minimum approach based on Eqns (3, 4).
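As a minimal sketch, the two aggregation schemes (averaging and minimum-percentile) can be contrasted on toy data. All numbers below are illustrative, and the mapping of average ranks to midpoint percentiles is an assumption for illustration, not necessarily the exact convention used by the PDB:

```python
import numpy as np
from scipy.stats import rankdata

def percentile_rank(values, higher_is_better=True):
    """Average-rank percentiles: tied values share the mean rank of their group."""
    v = np.asarray(values, dtype=float)
    if not higher_is_better:
        v = -v  # for metrics where lower raw values (e.g. Rfree) are better
    ranks = rankdata(v, method="average")     # ranks 1..n, ties averaged
    return 100.0 * (ranks - 0.5) / len(v)     # map ranks to 0-100 midpoint percentiles

# Toy component percentiles for five hypothetical protein deposits
p_rfree = np.array([90.0, 40.0, 75.0, 10.0, 55.0])
p_rsrz  = np.array([80.0, 50.0, 70.0, 20.0, 60.0])
p_pc1   = np.array([85.0, 30.0, 95.0, 15.0, 50.0])

q1_avg = (p_rfree + p_rsrz + p_pc1) / 3.0               # averaging, as in Eqn (1)
q1_min = np.minimum.reduce([p_rfree, p_rsrz, p_pc1])    # weakest-feature alternative

p_q1 = percentile_rank(q1_avg)  # final ranking percentile of each deposit
```

Note how the minimum approach penalizes the fourth deposit for its single worst component, whereas averaging lets strong components partially compensate.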
It must be emphasized that P_Q1p can be computed only for those PDB structures that have all five (or, in the case of P_Q1n, all three) component measures attached to them. The PDB has done an excellent job of calculating these metrics for most of the deposits, but not all structures have all the necessary data to perform these calculations. Overall, 12.7% of all considered deposits are missing at least one quality metric, with RSRZ being the dominant missing value (Table 1). Leaving this situation as is would effectively limit the analysis to structures published after 1992, that is, to the time after Rfree was introduced [21]. To circumvent this dilemma and to perform a study encompassing the entire timespan of the PDB, we developed a protocol for the estimation of the missing values based on a machine-learning data imputation method.
The validity of the data imputation procedure was assessed on the complete portion of the PDB, into which artificially missing (i.e., deliberately removed) values were introduced at random, following the missing-data proportions of each metric. The missing values were then replaced using either the metric's mean, its median, or an iterative method called multiple imputation by chained equations (MICE) [22,23] with Bayesian ridge regression [24]. MICE builds regression functions for subsequent metrics based on the nonmissing values of the remaining metrics. The imputation errors of the three approaches are compared in Table 2.
It can be seen that MICE is superior to mean/median replacement for all metrics according to the mean absolute error (MAE) and root-mean-square error (RMSE), and for all but two metrics according to the median absolute deviation (MAD). All the differences between MICE and the remaining methods are statistically significant according to the Friedman and Nemenyi post hoc tests [25] (p < 0.001). In terms of absolute values, the MAE of MICE is usually two to four times smaller than the standard deviation of a given quality metric ( Table 1, Fig. S1). The results are particularly good for Clashscore and R free , owing to the small number of missing values and high correlation with R, respectively. In the remaining part of the paper, we discuss results obtained for the full PDB dataset with missing values imputed using the MICE method. We want to stress that in doing so our goal is to give an approximate overview of the average quality of structures in the early years of the PDB, and not to provide a way to assess individual deposits with missing quality metrics or to create nonexistent data.
Model quality at the time of deposition

Figure 1 shows that P_Q1p and P_Q1n tend to improve gradually over the years. Almost identical trends can be noticed when looking at deposits without imputed data (Fig. S2) and when using the minimum approach (Fig. S3). Obviously, this trend is correlated with the advances in the generation of X-rays and in data collection procedures, with better computer hardware and software, with heightened structural validation standards, and with progress in crystallogenesis. If one were to use P_Q1 (i.e., P_Q1p or P_Q1n, depending on structure type) calculated over all the analyzed years to rank journals, then journals with a longer history would be at a disadvantage because they contain old, quality-wise inferior structures. Thus, even though a structure might have been refined to an impressively high standard in its time, today it might be treated as a poorly refined case. One could, of course, recalculate the percentiles separately for each decade or even shorter time periods, but this might not be enough to cure the problem (see the rapid improvement in quality over the last 10 years) or could drastically reduce the data volume and effectively make journal comparisons impossible. Therefore, we introduce here a new, time (t)-dependent parameter, P_Q1(t), which corresponds to P_Q1 calculated at the time of structure deposition. For example, the 1990 PDB deposition 2RSP [26] achieves an overall quality percentile P_Q1 of 36%, meaning that it is better than only 36% of the protein deposits currently held in the archive. When ranked instead against the 416 structures deposited prior to 2RSP, it achieves a P_Q1(t) of 69%, meaning that it was significantly above-average at the time of its deposition.
Moreover, in view of the very high correlation between quality and resolution (Fig. 1, Figs S2 and S3), we propose yet another measure, called P_Q1(t,d). P_Q1(t,d) is the Q1 percentile calculated at the time of structure deposition (t) for a given resolution interval (d), where the resolution is rounded to the nearest 0.1 Å and capped at 1 and 4 Å. The 2RSP structure from the previous example scores a P_Q1(t,d) of 75%. The advantage of using P_Q1(t,d) is that data resolution does not affect the journal ranking list.
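The idea behind P_Q1(t,d) can be sketched as follows. This is an illustrative assumption of the procedure, using a simple "fraction of the pool not better" percentile; the paper's exact tie-handling and date comparison may differ:

```python
import numpy as np

def p_q1_t_d(dates, resolutions, q1):
    """Sketch of P_Q1(t,d): rank each deposit's Q1 against all deposits of the
    same resolution bin available up to its deposition date."""
    # 0.1 Å resolution bins, capped at 1.0 and 4.0 Å
    d_bin = np.round(np.clip(resolutions, 1.0, 4.0), 1)
    out = np.empty(len(q1))
    for i in range(len(q1)):
        # comparison pool: same-bin deposits up to and including this date
        pool = q1[(dates <= dates[i]) & (d_bin == d_bin[i])]
        out[i] = 100.0 * np.mean(pool <= q1[i])  # fraction of the pool not better
    return out

# Toy example: four deposits that all fall into the same 2.0 Å bin
scores = p_q1_t_d(np.array([1990, 1991, 1992, 1993]),
                  np.array([2.04, 1.98, 2.01, 2.00]),
                  np.array([10.0, 20.0, 5.0, 30.0]))
```

In this toy run, the third deposit scores poorly against its contemporaries while the last one tops its pool, regardless of how later deposits would compare.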
Using P_Q1p(t,d) and P_Q1n(t,d), one can assess the quality of protein and nucleic acid models over time.

(Table 2 footnote: Best values for each error estimation method are given in bold. MAD, median absolute deviation; MAE, mean absolute error; RMSE, root-mean-square error.)

The average P_Q1p(t,d) for proteins in the PDB is 58.7%, whereas nucleic acids have an average P_Q1n(t,d) of 59.9%. Figure 2 shows how the model quality at the time of deposition of these two types of macromolecules has evolved over the years. For many years in the past, newly deposited nucleic acid models were usually of better quality than newly deposited protein models, especially between 1993 and 2004. However, the steady improvement of the quality of protein models in the last decade has made them currently on a par with, if not better than, newly deposited nucleic acid models. Similar trends were observed using P_Q1p^min(t,d) and P_Q1n^min(t,d), that is, the minimum approach (Fig. S4).
In the following subsections, we will focus on ranking structures and their corresponding journals according to P_Q1(t,d). The rankings associated with P_Q1(t), P_Q1^min(t,d), and P_Q1^min(t) are available in the online supplementary materials for this publication. For the purposes of ranking journals, the percentiles for proteins and nucleic acids will be combined and denoted jointly as P_Q1(t,d) or P_Q1(t).

All-time journal ranking
Out of the 800 unique journals appearing as primary citations for the 141 154 deposits found in the PDB, we selected those that published papers presenting at least 100 macromolecular structures. We decided to limit the list to this subset because we believe it may be too early to assess journals with fewer than 100 described structures. The resulting 91 journals were ranked according to average P_Q1(t,d) (Table 3) as well as P_Q1(t), P_Q1^min(t,d), and P_Q1^min(t) (Tables S1-S3).
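The ranking procedure described above (filter journals by deposit count, then rank by mean percentile) could look roughly like this in pandas. The journal names, scores, and the lowered threshold are toy values for illustration only:

```python
import pandas as pd

# Toy per-deposit data: journal of the primary citation and the deposit's P_Q1(t,d)
deposits = pd.DataFrame({
    "journal": ["J A", "J A", "J B", "J B", "J B", "J C"],
    "p_q1_td": [70.0, 50.0, 90.0, 80.0, 85.0, 40.0],
})

MIN_STRUCTURES = 3  # the paper uses 100; lowered here to fit the toy data

# Keep only journals with enough deposits, then rank by mean percentile
counts = deposits["journal"].value_counts()
eligible = counts[counts >= MIN_STRUCTURES].index
ranking = (deposits[deposits["journal"].isin(eligible)]
           .groupby("journal")["p_q1_td"].mean()
           .sort_values(ascending=False))
```

With the toy data, only "J B" clears the threshold, so the ranking contains a single journal with a mean P_Q1(t,d) of 85.0.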
Surprisingly, the first place in all versions of the ranking is occupied by Tuberculosis, a venue that is not well known as a structural journal. However, this place is well earned, since Tuberculosis has over 16 percentage points of advantage over the second-ranked journal in terms of P_Q1(t,d) and 12 percentage points of advantage in the P_Q1(t) ranking. A closer inspection of the structures published in Tuberculosis reveals that the vast majority refer to a single publication, titled 'Increasing the structural coverage of tuberculosis drug targets' [27]. The publication and its corresponding structures are the result of the joint effort of various departments working in the Seattle Structural Genomics Center for Infectious Disease. This finding is in accordance with the conclusion of B&R [18] that structural genomics initiatives usually deposit structures of above-average quality [28,29]. Indeed, taking into account all 12 494 deposits attributed to structural genomics projects, they achieve a mean P_Q1(t,d) of 63.7% and P_Q1(t) of 64.3%, substantially above the averages of the entire PDB (58.6% and 57.7%, respectively). These differences are statistically significant according to Welch's t-test (p < 0.001) and are much more prominent than those reported in the R&K study [19]. This discrepancy most probably stems from the fact that in our study we used a relative measure that combines several quality metrics, and had 2.3 times more structural genomics deposits at our disposal and 6.1 times more structures overall. When looking at the most popular journals, that is, those with more than 1000 structures (Table 3, gray rows), the top three spots are occupied by Biochemical Journal, FEBS Journal, and Nature Chemical Biology. At the other end of the spectrum, we have EMBO Journal, Cell, and Nature Structural & Molecular Biology, which were ranked last according to P_Q1(t,d). It is worth noting that the latter three are the only journals with an average P_Q1(t,d) below 50%. This means that, on average, at the time of deposition, the structures presented in these journals were already worse than over 50% of PDB structures of similar resolution. A similar ranking was obtained using P_Q1(t) (Table S1), the main difference being that journals publishing structures at superior resolution, such as Chemistry or Acta Crystallographica D, achieved much higher positions. Table 3 and Tables S1-S3 also identify journals whose average P_Q1(t,d), P_Q1(t), P_Q1^min(t,d), and P_Q1^min(t) are significantly different from the expected values for the entire PDB population.
It should be noted that the ranking presented in Table 3 takes into account over 45 years of structural data. This means that the ranking averages the entire lifespans of journals, which in their own individual history might have evolved over time. That is why in the following section we analyze how the ranking of the most popular journals has changed over the years.

Quality of journals' structures over time
Owing to the fact that P Q1 (t,d) assesses structures at the time of deposition, we also analyzed rankings of journals as a function of time. Figure 3 presents the ranking of 25 all-time most popular journals in periods of 5 years. To minimize the effect of noise on the ranking, journals were assigned to a given 5-year period only when they contained primary citations to at least 30 structures within that period.
As Fig. 3 shows, only six of the 25 journals published at least 30 structures before 1991; however, these six journals were the primary reference for 482 of the 666 PDB deposits from this period. Biochemistry has remained one of the top journals in terms of structure quality to date; PNAS and J Biol Chem are in the middle of the ranking, whereas Nature, Science, and J Mol Biol occupy the bottom half. A journal that has steadily remained at the top of the ranking for most of those years is FEBS Journal. Apart from Biochemistry and FEBS Journal, Proteins can also pride itself on a solid presence in the top 10 of the ranking throughout the years. It is worth noting that these three journals were also highly ranked in the study of B&R [18].
Disappointingly, the relatively poor ranks of highly reputable venues are not a new concern, but rather a steady trend of many years. It must be noted, however, that the overall structure quality of practically all 25 of the most popular journals has greatly improved in the last ten years, with Science and Nature showing the most positive trends (Fig. S5). Similar observations were made when the journals were ranked according to P_Q1(t) (Figs S6 and S7).
A separate comment is required for the 'venue' To be published, most frequently found in PDB deposits. This category of PDB entries, omitted in the studies of B&R [18] and R&K [19], presents a very interesting pattern over the years. For several decades,

Retrospective comparison with the results of Brown and Ramaswamy
The journal rankings presented in this work were inspired by the study of Brown and Ramaswamy (B&R) [18]. Although the methodologies of the two analyses are different (most notably because of the incorporation of nucleic acids and the use of data imputation in the present work), it is worth verifying how the two approaches compare and what has changed since the original B&R study. To help answer these questions, Table 4 presents the journal ranking reported by B&R in 2007 together with two lists of the same journals ranked according to P_Q1(t,d): one based on the PDB deposits available in 2007 and one based on all currently available data. It can be noticed that the rankings bear several similarities, although they are not identical. Journals that were at the top of the B&R ranking generally remain highly ranked according to P_Q1(t,d). Similarly, the bottom regions of the rankings are occupied by the same group of journals. However, there are some notable differences. For example, Bioorg Med Chem Lett is ranked 19 places lower according to P_Q1(t,d), whereas J Biol Chem, Inorg Chem, FEBS Lett, and Nucleic Acids Res are ranked 11 places higher. These differences may be the result of the number of structures taken into account by each ranking. Compared to the time of the B&R study, significantly more precomputed quality metrics are now available, even for older PDB deposits. Moreover, the methodology proposed in this work imputes missing values, allowing for the inclusion of 12.7% additional structures. As a result, the rankings based on P_Q1(t,d) were compiled using much more data, occasionally changing a journal's rank substantially.

(Table 3 footnote: Journals whose average P_Q1(t,d) is significantly different from the average P_Q1(t,d) of the entire PDB, according to Welch's t-test with the Bonferroni correction at significance level α = 0.001. Mean denotes the arithmetic mean, G-mean denotes the geometric mean (log-average), and V-mean denotes the mean in Å−3.)

Correlation between structure quality and journal impact
The low ranking of high-impact journals in the current study raises the question of whether structure quality is negatively correlated with journal impact. The study of B&R [18] strongly suggested that this was the case, whereas the slightly more recent work of R&K [19] showed that the differences in structure quality between high-impact and other venues were relatively small. However, both studies manually categorized journals as high- or low-impact venues rather than investigating actual impact metrics for a large set of journal titles.
In this study, we decided to measure journal impact quantitatively and correlate it with our quantitative measure of structure quality. For this purpose, we used two metrics: impact per publication (IPP) and source-normalized impact per paper (SNIP) [30]. IPP is calculated in the same way as the 3-year impact factor (IF3) but uses only publications classified as articles, conference papers, or reviews in Scopus. SNIP is a modification of IPP that corrects for differences in citation practices between scientific fields [30]. Both journal metrics have 20 years (1999-2018) of publicly available statistics and are based on the same source data. Figure 4 shows the relation between P_Q1(t,d) and journal impact over time (separate plots for each year are presented in Fig. S8). It is evident that structure quality has substantially improved over the last decade and that the negative correlation between journal impact and the quality of structural models presented therein seems to disappear as time progresses. This observation is confirmed when the relation between journal impact (IPP, SNIP) and structure quality (P_Q1(t,d)) is gauged using Spearman's rank correlation coefficient. Figure 5 shows that even though structure quality and journal impact were indeed negatively correlated 20 years ago, currently there is no correlation between these two criteria. Figure 4 also shows a very interesting situation in the low-IF range, namely that low-IF journals publish just about anything: the most fantastic work as well as structures beneath contempt. On the other hand, medium-IF journals used to be the primary citations of mostly poor structures in the past. At present, however, they are doing a much better job, publishing mostly better-than-average structures.
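The year-by-year correlation test can be sketched as follows. The simulated journal data below merely mimic the reported pattern (a negative impact-quality trend in the early years, none recently) and are not the study's data:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical per-journal data for one year: impact (IPP) and mean P_Q1(t,d)
n_journals = 40
ipp = rng.uniform(0.5, 20.0, size=n_journals)
noise = rng.normal(0.0, 5.0, size=n_journals)

# Simulated scenarios: a negative trend (early years) vs. no trend (recent years)
quality_early = 70.0 - 1.5 * ipp + noise
quality_recent = 55.0 + noise  # quality unrelated to impact

rho_early, p_early = spearmanr(ipp, quality_early)
rho_recent, p_recent = spearmanr(ipp, quality_recent)
```

Repeating this per year, as in Fig. 5, yields a time series of Spearman coefficients; in the simulation, rho_early is strongly negative while rho_recent hovers near zero.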

Discussion
Our analysis confirms recent reports that the quality of crystallographic macromolecular structures has improved over recent years [16]. However, we also found that at the time of the B&R analysis the quality of PDB structures had temporarily stopped improving, which is most likely why B&R did not report any correlation between quality and time [18]. In addition to confirming earlier findings, by using a data imputation algorithm we were able to put into context the quality of structural models going back as far as 1972. As convincingly illustrated by Figs 1 and 2, the quality of PDB structures rapidly improved over the first two decades of the database.
The ability to analyze quality over time using the proposed P_Q1(t,d) measure (Fig. 3) shows that there is tight competition among journals as their number increases. Quite interestingly, it is also evident that the PDB treasures many good-quality structures that do not have primary citations. The fact that a structure remains To be published indicates that it is getting more and more difficult to publish papers based solely on crystallographic results, even if they are of high quality. Indeed, our study shows that structures without primary citations are on average of higher quality than structures published in many popular journals. Therefore, although many structures do not have any accompanying journal publications, they present a substantial value in their own right. As each PDB deposit has its own digital object identifier (DOI), citation of structures should be acknowledged not only by PDB IDs but also by DOIs. Full implementation of this mechanism would allow for easy estimation of the impact of To be published structures. The proposed P_Q1(t,d) and P_Q1(t) measures manifest the overall attitude of authors toward the quality of the PDB: each newly deposited structure should be better than the average previous deposit. Each new quality metric [20], visualization technique [31], set of restraints [32], validation algorithm [33], hardware improvement [34], or software update [35] makes it easier to produce good-quality structures and to avoid simple mistakes.

(Table 4. Comparison of the journal ranking by Brown and Ramaswamy [18] with rankings of the same journals created using P_Q1(t,d). Numbers of structures considered from a given journal are shown in parentheses. The top three journals according to B&R are highlighted in green, and the bottom three journals are highlighted in red. Journals whose quality was determined to be significantly different from the average quality of structures in the entire PDB are marked, at significance level α = 0.001.)
In an effort to promote constant improvement of overall PDB quality, it would be desirable to expect that newly added models are above the current average. However, such a recommendation should be applied judiciously, as each case is different and should always be judged in a context-dependent manner. It is gratifying to see that almost all journals publish structures that are, on average, better than most of the previous ones, while those not yet at that level seem to be heading in the right direction.

Methods

Data collection and cleaning
To provide a comprehensive analysis of structure quality over time, we examined all X-ray structures available in the PDB, which comprised 141 154 deposits made between 1972 and 2019. The data were downloaded by performing an SQL query on PDBj [36,37] as of December 10, 2019.
In order to perform an analysis of structure quality in correlation with the primary citations, journal names had to be extracted from PDB files, cleaned, and unified. Initially, the dataset contained 1342 unique values describing the primary citation journal. After eliminating typos, unifying punctuation and ISSNs, and taking into account that some journals have changed their titles over time, the number of unique journal names dropped to 800.
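A step of this cleaning could be sketched as follows. The regex rules and the alias table are illustrative assumptions (the actual procedure also relied on ISSNs); the one alias shown reflects a real renaming, as the European Journal of Biochemistry became the FEBS Journal in 2005:

```python
import re

# Hypothetical alias table mapping normalized historical titles to current ones
ALIASES = {"eur j biochem": "FEBS Journal"}

def normalize_journal(name):
    """Normalize a journal string: strip punctuation, collapse whitespace,
    lowercase, then resolve known historical renamings."""
    key = re.sub(r"[^\w\s]", " ", name).lower()  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()       # collapse whitespace
    return ALIASES.get(key, key)
```

This collapses variants such as "Eur. J. Biochem." and "Eur J Biochem" into a single canonical name, which is the kind of unification that reduced 1342 raw values to 800 journals.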
Bibliometric indicators of journals (IPP, SNIP) were downloaded from the Leiden University CWTS website (https://www.journalindicators.com/) and joined with the PDB data using ISSNs. Both indicators were calculated based on the Scopus bibliographic database produced by Elsevier.

Missing data imputation
To fill in missing data, three approaches were tested: filling missing values with the metric's mean value, with the metric's median, and using the multiple imputation by chained equations (MICE) method [22,23] with Bayesian ridge regression [24] as the predictor.

(Fig. 4. Scatterplot of mean journal P_Q1(t,d) and the journal's impact over time. Variation in mean journal P_Q1(t,d) (y-axis) in a given year (color) plotted against the journal's IPP. IPP uses the same formula as the 3-year impact factor, but is based on publicly available Scopus data. The two regression lines show linear trends for 1999 (indigo) and 2018 (yellow) along with 95% confidence intervals (gray areas).)

(Fig. 5. Correlation between structure quality and journal impact. The plot shows Spearman's rank correlation (y-axis) over time (x-axis) between structure quality measured by P_Q1(t,d) and journal impact measured using the IPP and SNIP metrics. IPP uses the same formula as the 3-year impact factor but is based on Scopus data, whereas SNIP additionally takes into account the scientific field.)
To see how well each of the three methods performed, the nonmissing (i.e., complete) portion of the PDB data was used as the basis for creating a test set. We randomly introduced missing values to the complete portion of the data in the same proportions as those present in the actual dataset. As a result, the test dataset had the same proportion of deposits with at least one missing value and the same percentage of missing values per metric as the original (full) dataset. Next, these randomly introduced missing values were imputed and compared against the values originally present in the dataset. To quantify the imputation error, we used the MAD, MAE, and RMSE error estimation methods [38]. The procedure was repeated 100 times with different subsets of values randomly eliminated from the complete dataset in each run. Imputed missing values were clipped when they were outside the range of possible values of a given metric.
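A minimal sketch of such an imputation experiment, using scikit-learn's IterativeImputer (a MICE-style imputer) with BayesianRidge as the predictor. The synthetic data are an assumption standing in for correlated quality metrics (e.g., Rfree correlating strongly with R), and the numbers do not come from the PDB:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(42)

# Synthetic stand-ins for three quality metrics; column 1 correlates with column 0
n = 500
r = rng.normal(0.20, 0.03, n)
rfree = r + rng.normal(0.03, 0.01, n)
clash = rng.gamma(2.0, 3.0, n)
X_true = np.column_stack([r, rfree, clash])

# Introduce artificial missingness into one column, as in the validation protocol
X = X_true.copy()
mask = rng.random(n) < 0.15
X[mask, 1] = np.nan

imputer = IterativeImputer(estimator=BayesianRidge(), random_state=0)
X_imp = imputer.fit_transform(X)
X_imp = np.clip(X_imp, 0.0, None)  # clip to the metric's valid range

# Compare regression-based imputation against naive mean replacement
mae_mice = np.mean(np.abs(X_imp[mask, 1] - X_true[mask, 1]))
mae_mean_fill = np.mean(np.abs(np.nanmean(X[:, 1]) - X_true[mask, 1]))
```

Because the masked column is strongly predictable from the others, the regression-based imputation achieves a much lower MAE than mean replacement, mirroring the pattern reported in Table 2.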

Principal component analysis
The PCA required to calculate Q1 p was performed as described by Shao et al. [16]. The PCA was performed on three quality metrics: Clashscore, Ramachandran outliers, and Rotamer outliers. Since Ramachandran outliers and Rotamer outliers are meaningful only for proteins, the PCA was performed for protein structures only. In the assessment of the quality of nucleic acid structures, the PCA step was not needed, as Clashscore was the only geometry-related quality index.
Upon visual inspection of the metrics' values (Fig. S9), structures were marked as outliers and removed when any of the following criteria was met: Rotamer outliers > 50%, Ramachandran outliers > 45%, or Clashscore > 250. In total, 16 structures were marked as outliers: 1C4D, 1DH3, 1G3X, 1HDS, 1HKG, 1HPB, 1PYP, 1SM1, 2ABX, 2GN5, 2OGM, 2Y3J, 3ZS2, 4BM5, 4HIV, and 5M2K. These structures were temporarily removed prior to PCA to decrease the effect of outlying values on the principal components, but they were retained in the quality analysis. After removal of these outliers, the input data for PCA were standardized to mean 0 and standard deviation 1. Running PCA on the standardized data resulted in three principal components, PC1, PC2, and PC3, explaining 78%, 14%, and 8% of the variance, respectively. The coefficients of PC1 were 0.60, 0.58, and 0.56, indicating nearly equal contributions of Clashscore, Ramachandran outliers, and Rotamer outliers. The explained variance of each principal component and the coefficients of PC1 were practically identical to those reported by Shao et al. [16].
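The standardize-then-PCA step can be reproduced in a few lines with scikit-learn. The sketch below uses synthetic correlated data as a stand-in for the three protein quality metrics (the real analysis uses the PDB-derived values after outlier removal), so the exact variance fractions will differ from the 78/14/8% reported above.

```python
# Hedged sketch of the PCA step: standardize three correlated metrics,
# fit PCA, and inspect explained variance and PC1 loadings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 1))
# three metrics sharing a common component, mimicking correlated quality scores
metrics = np.hstack([base + 0.4 * rng.normal(size=(1000, 1)) for _ in range(3)])

standardized = StandardScaler().fit_transform(metrics)  # mean 0, sd 1 per metric
pca = PCA(n_components=3).fit(standardized)

explained = pca.explained_variance_ratio_  # PC1 dominates when metrics co-vary
pc1 = pca.components_[0]                   # near-equal loadings for symmetric metrics
```

Because the three synthetic columns are constructed identically, PC1's loadings come out nearly equal, mirroring the (0.60, 0.58, 0.56) pattern observed on the real data.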
As noted by one of the reviewers, the PC1 coefficients (0.60, 0.58, 0.56) are almost identical and roughly equal to 1/√3, making the respective weights of the three contributions nearly equal. This means that the Q1_p measure could be approximated as:

Q1_p ≈ (Clashscore' + Ramachandran outliers' + Rotamer outliers') / √3,   (5)

where the primed quantities denote the standardized (mean 0, standard deviation 1) metric values. The above formula provides a simple metric that can be used without performing PCA. However, this approximation assumes that the relations between Clashscore, Ramachandran outliers, and Rotamer outliers are fixed and will not change. For this reason, we chose to use the exact formula (1) as proposed by Shao et al. [16]. Nevertheless, the approximate formula (5) may be considered a simpler solution for less technical studies.
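The quality of the equal-weight approximation can be checked numerically: project standardized metrics onto PC1 (the exact score) and compare with the unweighted sum scaled by 1/√3. The snippet below does this on synthetic correlated data standing in for the real metrics.

```python
# Numerical check: exact PC1 projection vs. the 1/sqrt(3) equal-weight
# approximation, on synthetic correlated stand-ins for the quality metrics.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
base = rng.normal(size=(500, 1))
metrics = np.hstack([base + 0.3 * rng.normal(size=(500, 1)) for _ in range(3)])
z = StandardScaler().fit_transform(metrics)

pca = PCA(n_components=3).fit(z)
exact = z @ pca.components_[0]        # exact PC1 scores
approx = z.sum(axis=1) / np.sqrt(3)   # equal-weight approximation

# the sign of a principal component is arbitrary; align before comparing
if np.corrcoef(exact, approx)[0, 1] < 0:
    exact = -exact
corr = np.corrcoef(exact, approx)[0, 1]
```

For strongly co-varying metrics the two scores are almost perfectly correlated, which is why the approximation is serviceable; it degrades only if the correlation structure of the metrics changes, as the text cautions.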

Computation
Data were extracted directly from PDBj using its SQL interface. All computations were performed with Python 3.7 using the SciPy [39] and scikit-learn [40] libraries. The SQL query used, the resulting dataset, and fully reproducible analysis scripts in the form of a Jupyter notebook are available at https://github.com/dabrze/pdb_structure_quality.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article.

Fig. S1. Histograms of quality metric values of structures found in the PDB.
Fig. S2. P_Q1 analysis without imputed values.
Fig. S3. P_Q1min analysis (minimum approach).
Fig. S4. Comparison of P_Q1min(t,d) of protein and nucleic acid structures over time.
Fig. S5. Average P_Q1(t,d) of popular journals for each year.
Fig. S6. Journal ranking over time according to P_Q1(t).
Fig. S7. Journal quality over time according to P_Q1(t).
Fig. S8. Scatterplots of mean journal P_Q1(t,d) and the journal's impact.
Fig. S9. Scatterplots of the values of Clashscore, Ramachandran outliers, and Rotamer outliers found in the PDB.
Table S1. All-time journal ranking according to P_Q1(t).
Table S2. All-time journal ranking according to P_Q1min(t,d).
Table S3. All-time journal ranking according to P_Q1min(t).