Caveat Usor: Assessing Differences between Major Chemistry Databases

Abstract The three databases of PubChem, ChemSpider, and UniChem capture the majority of open chemical structure records with February 2018 totals of 95, 63, and 154 million, respectively. Collectively, they constitute a massively enabling resource for cheminformatics, chemical biology, and drug discovery. As meta‐portals, they subsume and link out to the major proportion of public bioactivity data extracted from the literature and screening center assay results. Therefore, they not only present three different entry points, but the many subsumed independent resources present a fourth entry point in the form of standalone databases. Because this creates a complex picture it is important for users to have at least some appreciation of differential content to enable utility judgments for the tasks at hand. This turns out to be challenging. By comparing the three resources in detail, this review assesses their differences, some of which are not obvious. This includes the fact that coverage is significantly different between the 587, 282, and 38 contributing sources, respectively. This not only presents the “who‐has‐what” question, but also the reason “why” any particular inclusion is considered valuable is rarely made explicit. Also confusing is that sources nominally in common (i.e., having the same submitter name) can have significantly different structure counts, not only in each of the three but also from their standalone instantiations. Assessing a series of examples indicates that differences in loading dates and structural standardization are the main causes of this inter‐portal discordance.


Overview
PubChem (PC), UniChem (UC), and ChemSpider (CS) are the largest public data sources for bioactive chemistry and drug discovery. [1][2][3] Crucially, their funding has allowed each of them to maintain a steady rate of content expansion from the subsumption of new sources. This review cannot cover all the features of this impressive triad, but the focus will be on providing insight into differential content, the complexities thereof, and how this translates to complementarity. Because content is the principal value proposition for any database, it is important for users to appreciate differences when deciding relative utility. In addition, knowledge of the contributing sources can indicate where there are advantages to querying these directly, via their standalone instantiations, rather than what may be stripped-down records subsumed into integrated resources.
It is assumed that readers of this article are not only aware of PC and CS but have had at least some experience using one or the other. The community familiarity with the youngest of the three, UC, is likely to be lower (although ChEMBL and SureChEMBL users may have noticed the UC cross-references nested in each compound record). It should also be noted that most researchers assessing the use of integrated sources would include SciFinder (SF) as the fourth major source. [4] This currently declares 134 million organic and inorganic chemical substances (although at 105 million, the Reaxys commercial offering is not that far behind). Via academic departments and pharmaceutical company licensing, many cheminformaticians are thus likely to have access to the "big four".
This review focuses on the three public resources, as their details are largely open. However, the approaches outlined for content dissection and comparison (a.k.a. slicing and dicing) can also be applied to commercial databases (depending, of course, on what their internal query functionality allows). In context, it is important to note that, despite the adjective "proprietary" often being applied to describe licensed databases, their content is drawn entirely from public primary sources. Notwithstanding, selective capture means they may still contain a proportion of unique structures. [5] The challenges associated with divergent expansion of commercial and public sources were reviewed in 2015 (this includes discussions and references to quality aspects that cannot be covered here). [6] However, this new analysis includes more comparative detail and covers recent changes in the public "big three" (n.b., all numbers reported herein were initially harvested in November 2017 with some post-refereeing updates in January 2018).

Growth
Although PC is in the middle in terms of size ranking, we can begin with its description as an archetype against which to compare the other two databases. Growth since 2005 shows an approximately linear increase in just over a decade, now approaching 95 million distinct chemical structures (Figure 1).
Those cheminformatics practitioners who remember LBC (life-before-PubChem) may be more appreciative than their younger colleagues of what an achievement this represents for an open public resource, funded by the US National Institutes of Health (NIH). The obvious feature is that the increase is approximately linear, in contrast to the exponential growth of sequence data for the sister discipline of bioinformatics. So why is this? The most parsimonious assumption is that global output over this period was related to the number of chemists making compounds. The chart of Figure 1 indicates that long-term growth has been sustained (although there has been a distinct slowdown during 2017), despite the shrinking of medicinal chemistry resources in the pharmaceutical sector over the same decade (i.e., fewer people in companies making compounds). [7] Concomitantly, there is no clear sign of automated chemical synthesis accelerating output (but this could change). While the subject will be expanded on later, a proportion of this total is certainly derived from enumerated virtual structures that have never been made. Notwithstanding, there is no evidence these are making a significant contribution to the overall growth rate.
As for all three databases, PC is submitter-based. This means that chemical structures conforming to standardization rules are accepted as primary database records assigned to each discrete submitter by means of Substance Identifiers (SIDs). These are then merged, according to PubChem chemistry rules, into non-redundant Compound Identifiers (CIDs). Consequently, the 236 million SIDs, from different PC submitters, merge to 94 million CIDs, representing an average of just over 2.5 SIDs per CID. However, this is a heavily skewed distribution because 48 million CIDs in PC are unique as defined by being derived from a single source (i.e., have only one SID). As we can see from a breakdown of the top-ten submitters (Table 1), these already cover substantial amounts of single-source structures.
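The average and the skew quoted above can be checked with simple arithmetic. The following is a minimal sketch that uses only the rounded totals from the text; the variable names are illustrative, and it assumes that every single-source CID carries exactly one SID, as defined above.

```python
# Rounded totals quoted in the text (in millions)
total_sids = 236
total_cids = 94
single_source_cids = 48  # CIDs derived from exactly one SID

# Overall average number of SIDs merged into each CID
mean_sids_per_cid = total_sids / total_cids

# Single-source CIDs account for one SID each, so the remaining SIDs
# are shared among the remaining (multi-source) CIDs.
multi_source_cids = total_cids - single_source_cids
multi_source_sids = total_sids - single_source_cids
mean_sids_per_multi_source_cid = multi_source_sids / multi_source_cids

print(f"Mean SIDs per CID: {mean_sids_per_cid:.2f}")
print(f"Mean SIDs per multi-source CID: {mean_sids_per_multi_source_cid:.2f}")
```

On these rounded figures, the CIDs with more than one submitter average roughly four SIDs each, which illustrates how far the distribution departs from the overall mean.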
These contrast with more "popular" subsets, for example, approved drugs, where this is reflected in having many submitters. Using a selection query from the SID side we can establish that the source IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb) includes 1180 approved drugs as CIDs (as of release 2017.5). [8] Transforming these across to their contributing SIDs indicates a CID/SID ratio of 1:110. However, this average also covers a skewed distribution where newer approved drugs generally have fewer submitters than old drugs. We can thus take aspirin as one of the most "popular" examples to discern the extent of multiplexing (the same structure in different sources). The PC source mappings are indicated in Figure 2.
Thus, acetylsalicylic acid (aspirin) has 282 single structures, plus 1602 mixtures, as SIDs. But these collapse to 648 distinct CIDs, implying that some different sources are submitting identical mixtures. The "same connectivity records" (as distinct CIDs) turn out to be 13 isotopically labeled derivatives.

PubChem major sources
PC has 548 data sources, but only 488 of these have live (i.e., not on-hold) substance counts. This is because some contribute annotation links but not structures (e.g., ClinicalTrials.gov contributes 7788 annotations). The distribution has a long tail, where 282 have more than 1000 SIDs but over 90 sources have 10 or fewer SIDs (including the author's own submitted set of 10 as "TW2Informatics"[SourceName]). The top ten are listed in Table 1.
We will return to the Table 1 features when comparing the other sources, but some aspects can immediately be picked out. The first is the dominance of vendor aggregators. Another feature is that automated patent extraction comes in as second (sources 3 and 4). [9] However, 5, 7, and 8 are not straightforward to classify; indeed 7 and 8 have neither source outlinks nor metadata in the SIDs. Thus, along with 9, these are arguably legacy submissions, as none has been updated in recent years. Uniqueness is a key aspect of databases, but it is not always clear what this means for users. The numbers in Table 1 are defined by the chemistry rules for the formation of CIDs, although the details of these are not completely exposed in PC. However, the relationship computation and query navigation allow a level of exploration to divine what these rules actually are by observing the consequences they have in mapping results. Arguably, these can be considered stringent in the sense that differences in isotopes, stereochemistry, tautomers and mixtures all lead to the formation of distinct CIDs and InChIKeys (i.e., a philosophy of splitting rather than merging).

Dr. Christopher Southan has been a Senior Cheminformatician for the IUPHAR/BPS Guide to PHARMACOLOGY at the University of Edinburgh since 2013, but works remotely from Sweden. Prior to this he set up TW2Informatics, working on patent informatics for SureChem (2011/12) and AstraZeneca (AZ) on Chemistry Connect and Pharma Connect (2009-2011). In 2008/9 he coordinated the ELIXIR Database Provider Survey at EBI and was Team Leader in AZ Molecular Sciences (2004-2007), preceded by senior bioinformatics positions at Oxford Glycosciences, Gemini Genomics and SmithKline Beecham. His PubMed papers (Southan C) encompass pharmacology, drug discovery, bioactivity database comparisons, bioinformatics, and cheminformatics. Further information is on his LinkedIn profile, tweets at @cdsouthan and blogs at https://cdsouthan.blogspot.se/.

PubChem features
The other two databases solve the necessary redundancy collapse of submissions in a similar way to the SID/CID splits, but where the InChI system plays a more central role. [10] However, there are unique aspects of PC that should be mentioned. The first of these is that the substance submissions include entries that cannot form CIDs, since they have not been transformed into SMILES or SD files by the submitters. To become a CID, the SIDs have to be within the current upper limit of 1000 atoms, corresponding to ≈70 residues for a peptide. Those excluded from CID merging are thus larger peptides, polynucleotides, or siRNA reagents but also include biological therapeutics such as antibodies that have a designated INN or a clinical candidate designation. For example, a substance search with the INN "natalizumab" includes the SIDs shown in Figure 3. Thus PC indexes biologics of various kinds from different sources but cannot merge them. Because in UC and CS everything has to conform to chemistry rules, they do not currently have this category of submission. The second key difference is the surfacing of biological activity data in PubChem BioAssay (this can also include SIDs without CIDs). The many aspects of this third dimension of PC data cannot be detailed here, but they are crucial for PC's value to chemical biology and drug discovery. [11] The top-level statistics are that 2.4 million CIDs have been tested in 1.2 million assays (i.e., with distinct assay identifiers as AIDs) and 1.1 million of these have at least one SID recorded as "active". Compared to a typical HTS assay, where hit rates are usually threshold filtered to around 1%, there are clearly caveats with the definition of "active" when this is as high as 45%.
However, given that ChEMBL submissions dominate BioAssay (with 1.24 million entries compared with only ≈1000 each for the next four screening centers, ranked by AID counts) there is a clear bias toward positive results extracted from 67 722 papers into ChEMBL 23 (but note these include inactives from the same papers). BioAssay contains a small proportion of alternative uses of this as a submission category. For example, the 1216 SIDs in AID 1195 are approved drugs with US Food and Drug Administration (FDA) Maximum Daily Dose assignments (i.e., not assay results). BioAssay also has what appear to be false negatives in the information capture sense. For example, a simple INN query (which, from a limited number of record checks, seemed to be substantially true positives) retrieves 8949 CIDs, but only 4399 of these have BioAssays scored as "active". Because INNs are specific for advanced development compounds (mostly reaching at least Phase 1 clinical trials) accrued over many decades, we would expect the majority to have published bioactivity. We can test possible explanations directly from the PC source intersects. For example, only 3926 are in the NIH Molecular Libraries screening collection (and these have not all been screened against probable molecular targets anyway). We can also determine that ChEMBL has captured 7053 of the same structures but only recorded "actives" for 4321 of those. Thus, nearly half the INNs have neither explicit activity data against human protein targets reported in publications that ChEMBL has extracted, nor have accumulated such data from the screening centers.
Looking at examples indicated at least three causes: firstly new INNs, secondly old INNs, and thirdly cases where the BioAssay data had been assigned to a different stereoisomer. An analogous inference of significant false negatives also applies to the 15 514 CIDs that Medical Subject Headings (MeSH) annotators have classified as having pharmacological action by curation from PubMed. While these are implied to be active in vivo, only 7016 have positive BioAssay results.
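The proportions behind these BioAssay observations can be laid side by side. The sketch below is illustrative only: it uses the counts quoted in the preceding paragraphs, and the dictionary labels and helper name are not taken from PC.

```python
# (active, total) pairs quoted in the text; the first pair is in millions
bioassay_counts = {
    "all tested CIDs": (1.1, 2.4),
    "INN-named CIDs": (4399, 8949),
    "INN CIDs captured by ChEMBL": (4321, 7053),
    "MeSH pharmacological action CIDs": (7016, 15514),
}


def active_fraction(active, total):
    """Fraction of a set that has at least one 'active' BioAssay result."""
    return active / total


fractions = {
    label: active_fraction(active, total)
    for label, (active, total) in bioassay_counts.items()
}

for label, fraction in fractions.items():
    print(f"{label}: {fraction:.1%}")
```

The INN and MeSH subsets, where published activity would be expected for most members, come out at close to the same proportion as the database-wide figure, which is consistent with the false-negative inference drawn above.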
Another key difference is the integration of PC into Network Entrez. [12] This is a powerful and extensive system for the cross-referencing of information about biological and chemical entities from the 136 databases within the NCBI. This includes the direct connectivity between Compound, Substance, BioAssay, protein sequences, protein structures, and BioSystems (pathways). Another feature that presents an advantage for PC is the ability to upload, immediately visualize and then download, Entrez result sets from bulk queries. This can be accomplished either via the Structure Search file upload or the PubChem Identifier Mapping Service web page (capacity may vary, but can be in the thousands). For inter-database comparisons this means smaller sources from other databases with appropriate download options (e.g., SDF, SMILES, and InChI) can be mapped "into" PC to determine exact intersects and differences.

UniChem features
UC is very different from PC in being a large-scale database of pointers between chemical structures. This means, unlike PC and CS, it does not store the actual structures (e.g., as SD files or SMILES), but these can be accessed via the source URLs. Initially conceived to integrate chemistry across the internal EBI databases (ChEBI, ChEMBL, SureChEMBL, PDBe, and most recently MetaboLights) it now extends to 32 external sources. It is also designed to enable "on-the-fly" linking via REST web services and is designated with a CC-0 license as specified on the website. The different automated downloading procedures and loading dates are summarized for each of the 37 sources. These are compiled into the weekly release, regardless of individual source update cycles. The InChIKey (IK) centric cross-pointing and source redundancy reduction is conceptually similar to the SID merging rules in PC. It also uses features of the Standard InChI to enable mappings between molecules that share common atom connectivity via the inner Key layer. Importantly, this extends across isotopes, stereo forms, mixtures, and salts. This is analogous (but again, not exactly equivalent) to the PC "same connectivity" relationships between CIDs.
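A minimal sketch of this kind of connectivity matching is given below. It assumes that the first 14-character block of a Standard InChIKey is the hash of the molecular skeleton, so that keys sharing it differ only in layers such as stereochemistry or isotopes. The function names are hypothetical and this is not UC's own implementation; the example keys (for aspirin and the two alanine enantiomers) are included for illustration and should be verified against a source database before reuse.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def connectivity_block(inchikey: str) -> str:
    """Return the first (14-character) block of a Standard InChIKey."""
    parts = inchikey.strip().upper().split("-")
    if len(parts) != 3 or len(parts[0]) != 14:
        raise ValueError(f"Not a well-formed InChIKey: {inchikey!r}")
    return parts[0]


def group_by_connectivity(inchikeys: Iterable[str]) -> Dict[str, List[str]]:
    """Group full InChIKeys that share the same connectivity block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


example_keys = [
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
    "QNAYBMKLOCPYGJ-REOHSLKRSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
]

groups = group_by_connectivity(example_keys)
for block, members in groups.items():
    print(block, len(members))
```

Matching on the full key would treat the two alanine entries as different structures, whereas matching on the first block alone places them in one group.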

UniChem major sources
The top-ten source counts are shown in Table 2. We can return to Table 2 when discussing comparative content, but some unique characteristics of UC can be introduced. As an approximate (but not exact) equivalence to PC CID:SID ratios, the UC structure:assignment ratio is 1.37. This is not unexpected, as there are fewer sources. Unsurprisingly, PubChem becomes the major source contributor, corresponding to 30% of all unique structures. However, as UC point out, because many sources are loaded into both PC and UC independently (i.e., twice) this confounds the intra-database statistics. Comparative insights can be gleaned from the entry for aspirin in Figure 4.
We can see that, as IK matches, aspirin is indexed in 27 of the 37 sources. In addition, there are some multiple records on the source side for identical structures (e.g., Atlas, NMRShift, ACToR, and BRENDA). Another unique aspect of UC is the computation of different forms of equivalence between sources, as no less than seven 37 × 37 overlap matrices at different comparison stringencies. This is explained in detail on the site and the publications. A small section of the results based on the full IK is shown in Figure 5.
The matrix can be read as follows for GtoPdb (the IUPHAR/BPS Guide to PHARMACOLOGY): The total (i.e., row and column four) is 6575. The overlap with ChEMBL (as IKs in common) is 5079, DrugBank 1823, and PDB (i.e., the heteroatom small-molecule entries) is 1246.

By executing a search with the InChIKey inner layer (also sometimes termed a skeleton match), 160 data sources are indicated, but some of these are multiplexed by one-to-many entries. For example, the 10 ChemIDplus entries include nine mixtures and the nine Crystallography Open Database entries are each from distinct 3D structure determination references. While this multiplexing would expand to well over 160 individual links, these are still substantially fewer than the analogous 1884 PC SID entries in Figure 2. As outlined below, both CS and PC contain a similar number of large sources, indicating that the disparity is due to organizational and chemistry rule differences between the two. These are complex in both cases, arising from the challenges of integrating many different sources.

ChemSpider major sources
The top-ten sources for CS are listed in Table 3. Compared with PC and UC there is a notable absence of the large automated patent sources of SureChEMBL and IBM (Discovery Gate did

Because both databases have expanded extensively in the intervening years, as has been pointed out, the value of such legacy partial cross-pointing is questionable. [13]

Comparative Content

Sources in common
Differences between the major sources are listed in Tables 1-3. However, we can take a more detailed look at the distributions of the top 50 (Figure 7). As expected, because it contains only 34 sources, Figure 7 shows a steep fall-off in UC. While the other two both show a long tail, it is clear that CS is dominated by more smaller sources than PC. We could get more insight by comparing these by name. However, it is already clear from Tables 1-3 that some names are different but probably related. For example, "Aurora Fine Chemicals LLC" in PC corresponds to "Aurora Feinchemie" in CS (i.e., both the US and German links connect through to the same website, but UC does not include their feed). Similarly, "Thomson Pharma" has the same name in both CS and PC, but in UC it is named "pubchem_tpharma". Yet another example was the pointers to different Singapore or Chinese web addresses for Angene in CS and PC. By standardizing names, at least for the larger sources, it was possible to get an outline of divergence. The result is shown in Figure 8.
We can see from Figure 8 divergence between PC and CS but some degree of convergence with UC in that over 70% of its sources are shared with either of the other two. We can pick out a selection of unique sources to get an idea of differential value (even if there are overlaps between most sources). Comparing Figures 7 and 8 indicates that CS has on the order of 200 unique smaller sources, at least as judged by having a different name (n.b., it was not feasible to check all the websites to rule out if some had the same origins). One challenge here is discrimination of primary vendors versus aggregators. The former, by implication, are assumed to be the primary manufacturers of the compounds or at least to hold them as local stock. The majority of the "long-tail" sources in CS and PC are probably in this category. Aggregator (or secondary) vendors are brokers of many merged primary vendors and thus appear as some of the largest sources in the top-ten of all three databases. While some may self-declare as one or the other, without effecting some kind of due diligence it is difficult to cleanly discriminate secondary from primary (so use of the term "vendor" from this point on will not be qualified with such a distinction).

Unique sources
This refers to sources that appear in only one of the three databases (but note, this does not imply unique content). As we can see from Figure 8, PC also has many unique sources (while one of these is PubChem BioAssay, this can be classified as a collection of sub-sources), but space limitations preclude more than a few being noted here. It is useful to be able to read off the explicit uniqueness within PC internally by simply adding "1[DepositorCount]" to source selects. For example, the automated patent extraction source SCRIPDB has 4.0 million CIDs, of which 0.48 million are only from that source. In the case of Collaborative Drug Discovery, 0.87 million of their 1.40 million are unique. As one of the smaller sources, WikiPathways indexes 1997 structures with only 23 unique. Some larger PC-only sources have low levels of uniqueness, as we can see for the figure of 0.2% within the 10.2 million from NextBio. These turn out to be re-submission errors that have connectivity to other existing CIDs, confirming that this source has simply performed a one-off extraction and resubmission of pre-existing content labelled as their own SIDs in June of 2009.
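As a sketch of how such a uniqueness query might be composed, the snippet below builds an Entrez-style term from the two field tags quoted in this article ([SourceName] and [DepositorCount]) and converts the quoted counts into unique fractions. The helper name is hypothetical, the exact combined syntax is an assumption that should be confirmed in the PC search interface, and the counts are the rounded figures given above.

```python
def unique_to_source_term(source_name: str) -> str:
    """Compose a query term for CIDs deposited only by the named source.

    Assumption: the two field tags can be combined with AND, as is usual
    for Entrez queries.
    """
    return f'"{source_name}"[SourceName] AND 1[DepositorCount]'


# (unique CIDs, total CIDs) as quoted in the text
source_counts = {
    "SCRIPDB": (0.48e6, 4.0e6),
    "Collaborative Drug Discovery": (0.87e6, 1.40e6),
    "WikiPathways": (23, 1997),
}

unique_fractions = {
    name: unique / total for name, (unique, total) in source_counts.items()
}

for name, fraction in unique_fractions.items():
    print(f"{unique_to_source_term(name)} -> {fraction:.1%} unique")
```

Expressed this way, the three examples span roughly 1% to over 60% single-source content, so the size of a source says little about how much of it is found nowhere else in PC.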
One recent new source (October 2017) is from Springer Nature for their journal connectivity initiative. [14] This is currently at 0.61 million CIDs, of which 0.25 million are unique. A crucial advantage for internal comparisons of sources within PC is the facility to perform Boolean intersects between queries. This can not only reveal the exact overlap and difference between sources A and B, but can be extended to many combinations and the use of filters (e.g., mixtures can be counted as two or more noncovalent units).
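Outside the PC interface, the same Boolean logic can be reproduced on downloaded identifier lists. The following is a minimal sketch using Python sets; the source names and CIDs are toy values invented for illustration, and the function names are hypothetical.

```python
from typing import Dict, Set


def compare(a: Set[int], b: Set[int]) -> Dict[str, Set[int]]:
    """Boolean comparison of two sources, each given as a set of CIDs."""
    return {"both": a & b, "only_a": a - b, "only_b": b - a, "either": a | b}


def unique_to(name: str, sources: Dict[str, Set[int]]) -> Set[int]:
    """CIDs present in the named source and in no other source."""
    others = set().union(*(cids for n, cids in sources.items() if n != name))
    return sources[name] - others


# Toy CID sets for three hypothetical sources
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {4, 6}}

a_versus_b = compare(sources["A"], sources["B"])
print("A and B:", sorted(a_versus_b["both"]))
print("A not B:", sorted(a_versus_b["only_a"]))
print("Unique to A across all sources:", sorted(unique_to("A", sources)))
```

The `unique_to` helper corresponds to the single-depositor idea: a structure counts as unique only if no other source in the comparison contains it.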
From the data in Figure 8, we can ascertain that UC has six large sources not in CS or PC. Intra-database overlaps between sources can be calculated in a variety of ways depending on the definition of structural identity. As explained in the UC documentation this is reported at three levels, as: 1) identity of the full IK, 2) the connectivity layer of the IK, and 3) the connectivity layers of multiple molecular components (e.g., salt splitting). However, it should be noted that intra-source comparisons within UC will reflect circularity from the co-integration of PC into UC. The largest unique source is the US Environmental Protection Agency (EPA)'s Aggregated Computational Toxicology Online Resource (ACToR) at 411229 (4.7% unique). This is followed by the BRENDA Enzyme information system with 119395 (37% unique), Lincs (Library of Integrated Network-based Cellular Signatures) 41802 (0.2% unique), MetaboLights (sic) 19789 (no unique structures), PharmGKB (Pharmacogenomics Knowledgebase) 1633 (no unique structures), and the Recon knowledge base of human metabolism, 1529 (18% unique).
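To illustrate how the choice of identity level changes an overlap count, the sketch below computes a small source-by-source matrix at the first two levels (full IK and connectivity block). It is not UC's implementation: the source names are invented, the third level (component splitting) is not covered, and the example keys are illustrative values that should be verified before reuse.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def full_key(inchikey: str) -> str:
    """Level 1: compare on the complete InChIKey."""
    return inchikey


def connectivity_key(inchikey: str) -> str:
    """Level 2: compare on the first block only."""
    return inchikey.split("-")[0]


def overlap_matrix(
    sources: Dict[str, Iterable[str]], level: Callable[[str], str]
) -> Dict[Tuple[str, str], int]:
    """Count shared structures for every pair of sources.

    Diagonal cells hold each source's own total at the chosen level.
    """
    reduced = {name: {level(k) for k in keys} for name, keys in sources.items()}
    return {
        (row, col): len(reduced[row] & reduced[col])
        for row, col in product(reduced, repeat=2)
    }


sources = {
    "source_1": ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "QNAYBMKLOCPYGJ-REOHSLKRSA-N"],
    "source_2": ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"],
}

full = overlap_matrix(sources, full_key)
connectivity = overlap_matrix(sources, connectivity_key)
print(full[("source_1", "source_2")], connectivity[("source_1", "source_2")])
```

With these toy inputs the two sources share one structure at the full-key level but two at the connectivity level, because the stereoisomers collapse onto the same first block.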

Source differentials and dates
This section compares counts for nominally the same sources across the databases as well as their standalone instantiations. In some cases, these can have occupancy in all four categories (i.e., three databases and In situ), but we can also consider three-way and two-way cases. A selection of these is listed in Table 4.
This matrix of discrepancies from nominally identical sources is surprising from a cheminformatics standpoint. Indeed, not a single pair agrees exactly. Relatively minor differences, on the order of a few percent, are not unexpected. These can be attributed to differences in chemistry standardization rules, loading filtration stringencies and, in the UC case, generation of Standard InChIs. However, as can be seen in Table 4, many showed bigger differences in numbers, some of which could be plausibly explained, while others would need additional investigation. The most common reason seems to be loading dates. This issue is always problematic for large integration efforts, especially where sources frequently update. Frequent updating is inherently a good thing, but leaves the meta sources with two challenges: The first is their internal synchronization and submission processing times. The second is the balance between "pulling" and "pushing". These terms refer to the host portal either actively picking up (pulling) the source data, for example, as ftp and/or an automatic extract, transform and load procedure (ETL), or the submitter sending (pushing) their update manually. In regard to ascertaining dates as possible causes of differences, PC makes these query-selectable for all submissions (i.e., as SID dates). CS does not surface record dates in the interface, but does indicate first and latest upload dates for most sources. UC operates a weekly rebuild and automatically assigns that date to each source for the computed statistics. However, this is slightly misleading in that it is the source descriptions that include both the first and the latest actual load dates, regardless of the weekly release date.
We can look at selected rows in Table 4 to pick out both concordances and discordances. As an example of the former, we can see a less than 1% difference between the highest and lowest of the four ChEMBL counts. We can check that ChEMBL 22 from September 2016 In situ updated to version 23 in May of 2017. With this long release cycle the different loads would not affect the October 2017 numbers. We can check this by establishing that the UC load date was the 24th of May (but the data are within the same EBI infrastructure anyway). This was closely followed by CS on the 25th of May and processed within PC on the 6th of June according to the SID dates. In this case, the 6115 difference between PC CIDs and In situ can be explained by those peptide and protein substances that do not have CIDs. While this is supported by the SID count of 1735576, it exceeds the In situ count by 134, but this is a minor discrepancy. Other sources also show small differences such as the IUPHAR/BPS Guide to PHARMACOLOGY. However, this has a relatively rapid release schedule of six per year.
The discrepancies recorded for Thomson Pharma are far from minor, in that the PC count is more than twice that for CS. Establishing the reason for this major difference is confounded by Clarivate (previously Thomson Reuters) having ceased their PC cumulative feed at 4.3 million CIDs in January 2016 (the last SID date) for reasons that remain unclear and not declaring an in situ count. However, inspecting CS source dates indicates the 2.04 million was probably an early load from 2008 (possibly direct). The UC source page records that their set was selected from PC (i.e., as a secondary source) at the end of July 2013. This explains why their 3.8 million lies between the CS and PC counts. This means that users wanting to search against this large, high-quality compilation of manually curated structures from patents and papers would need to query PC for complete (even if now lapsed) coverage.
The ligands in PDB are a crucial small-molecule set for drug discovery, but come with particular challenges. Firstly, because bioactive chemicals specifically bound in protein pockets are difficult to define and filter cleanly (but are in the order of ≈8000), sources submit all the heteroatom structures (HETATM). These encompass resolved small molecules including salts and reagents. Secondly, there are four different sources within wwPDB and the NCBI. While PDBe and PDBj counts are more or less concordant at 25 057 and 25 252, respectively, RCSB PDB drops slightly to 24 140. It is not clear which of these sets was loaded into CS on the 25th of May, but UC refreshed their internal (EBI) PDBe on the 6th of November 2017. However, we are left with the anomaly of the new PC source of ligand extraction as NCBI Structure, which, at 35457 CIDs, implies over 10 000 more ligand structures than are indexed by PDB, for reasons that are not yet clear.
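The size of these PDB discrepancies can be put on a common scale with a short calculation. This is a sketch using only the counts quoted above; the helper name and the choice to express spread relative to the largest count are illustrative assumptions.

```python
from typing import Dict

# Ligand counts quoted in the text for the three wwPDB sources
wwpdb_counts = {"PDBe": 25057, "PDBj": 25252, "RCSB PDB": 24140}
ncbi_structure_cids = 35457


def relative_spread(counts: Dict[str, int]) -> float:
    """Difference between highest and lowest count, relative to the highest."""
    highest = max(counts.values())
    lowest = min(counts.values())
    return (highest - lowest) / highest


wwpdb_spread = relative_spread(wwpdb_counts)
ncbi_excess = ncbi_structure_cids - max(wwpdb_counts.values())

print(f"Spread across wwPDB sources: {wwpdb_spread:.1%}")
print(f"NCBI Structure excess over the largest wwPDB count: {ncbi_excess}")
```

The spread among the three wwPDB sources is a few percent, in line with the minor differences expected from standardization, whereas the NCBI Structure excess is of a different order.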
There seems to be no pattern to the discordances because each of the three has at least one example source where they are significantly the lowest. For the Human Metabolome Database (HMDB) it happens to be PC that is only 9% of the In situ figures. In this case, dates indicate the last SID load into PC was November 2011, but CS has a more recent load from June 2017. For some reason the UC count from September 2017 is 2585 less. However, users need to know that HMDB underwent a major expansion in situ in October and so should search against this externally as the latest version. [15] With comprehensive coverage in mind, users would thus need to check WikiPathways (unique to PC) as well as MetaboLights (unique to UC). [16,17] They would also need to check the first Recon set as loaded (also unique to UC) in October 2014, although this has not been updated to the latest published Recon 2.3 set of 5324 metabolites. [18] Thus, the important domain of metabolomics presents not only a mosaic of partial availability in different databases, but also needs the (hopefully pending) update of Recon (n.b., since Figure 3 was compiled, both HMDB and DrugBank have been updated in PC to 9765 and 114297, respectively).

Vendors and virtuals
The major contributors to these databases by far are vendors. These offer the key advantage of enabling bioactivity research by the purchasing of structural analogues as an alternative, or complement, to de novo synthesis. The 293 vendor sources in CS cover 41.8 million compounds, reaching 67% of total compounds. For PC the corresponding numbers are 284 sources merging to 63.0 million, coincidentally also covering 67%. We can determine that 29.5 million of the latter are unique structures. For UC there are also four vendors in the top-ten sources (Table 2), but assessing the overall proportion is confounded by subsuming some of the same sources from PC. Notwithstanding the convenience of procurement, the opportunistic vendor "push" to nearly 70% in both PC and CS (with ≈50% of these as unique structures in the former), while commercially understandable, can be seen as a mixed blessing from several viewpoints. Firstly, content overlap between vendors becomes higher than users probably want (e.g., PC indexes 92 vendors for aspirin, including 12 Sigma catalogue SIDs for identical structures). Secondly, high novelty levels have to be caveated with doubts over structural quality because many of these turn out to be related to known CIDs via "same connectivity" matches in PC (e.g., where the vendor may not resolve the stereoisomers).
These issues can be inferred from inspecting Table 4, which shows a confusing pattern for just five vendor examples, including some websites not declaring exact totals. Loading dates are problematic, as can be seen in the case of eMolecules not updating since 2009 in CS or 2012 in UC, further confounded by their deciding not to submit to PC at all. To add to the non-obvious, it turns out that users can, in fact, find some eMolecules links within ZINC entries. The most striking numerical discordance is for Mcule with 5.6 million records in CS compared to 34 million in UC. While the former is an older load from October 2015 compared with the latter in February 2016, we can assume that the former are probably extant stock compounds (i.e., in pots) with the latter being predominantly virtual representations that have never been made. These are sometimes termed "make on demand" (MODs), where they have been enumerated with synthetic tractability in mind and the consequent likelihood of order fulfilment.
The statistics computed within UC support the inference of virtuals by showing that no less than 28 million of the Mcule submissions are unique (i.e., not in PC either); indeed, this is the single largest contributor to the increased size of UC over PC. The chequered history of some vendors was made manifest when PubChem removed Angene as a source in 2015. This was because they had reached 40 million CIDs by a combination of piggy-backing (i.e., re-submitting existing structures as their own SIDs, as in the NextBio case) and virtual enumerations (as evidenced when PubChem shrank by 8 million after their removal).
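The proportions quoted in this section can be checked with simple arithmetic. The following is a minimal sketch, assuming the rounded February 2018 totals from the abstract (95 million for PC, 154 million for UC), so the results are approximate; the variable names are illustrative only.

```python
# All figures are in millions and are taken from the text.
# The database totals are the rounded February 2018 counts, so results are approximate.
PC_TOTAL = 95.0
UC_TOTAL = 154.0
PC_VENDOR_COMPOUNDS = 63.0       # merged compounds from the 284 PC vendor sources
PC_VENDOR_UNIQUE = 29.5          # vendor compounds that are unique structures in PC
MCULE_UNIQUE_IN_UC = 28.0        # Mcule structures in UC that are not in PC

# Fraction of PC vendor compounds that are unique structures.
unique_fraction = PC_VENDOR_UNIQUE / PC_VENDOR_COMPOUNDS

# Size difference between UC and PC, and the part of it attributable to Mcule.
uc_excess = UC_TOTAL - PC_TOTAL
mcule_share_of_excess = MCULE_UNIQUE_IN_UC / uc_excess

print(f"Unique fraction of PC vendor compounds: {unique_fraction:.0%}")
print(f"UC excess over PC: {uc_excess:.0f} million")
print(f"Mcule share of that excess: {mcule_share_of_excess:.0%}")
```

On these rounded inputs, the Mcule-only structures would account for a little under half of the size difference between UC and PC, which is consistent with their description as the single largest contributor.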

Utility Tips
So far this review has been more about problems than solutions (hence the title "Caveat Usor"). However, it is hoped that by unpicking at least some of the associated details (but by no means all), users can become more aware of potential pitfalls. They can then apply these insights to make comparable judgments for these or other resources. This section adds a few tips, some of which include caveats, to aid users in utility judgments. The first important point to note is that these databases are dynamic and consequently the snapshot of the four sets of numbers presented here may change fairly quickly (e.g., on the order of months). Checking current loading dates for sources of particular interest thus becomes important.
The decision of which of the three to search first (and the need to move on to either of the other two or not) clearly depends on the question being addressed. However, in the context of drug discovery (as the theme of this issue) the unequivocal first choice is PC. This is not only because of the unparalleled connectivity between Compound, Substance, BioAssay, PubMed, and Entrez but also the combination of filtering, mining, and analysis features. Selecting and intersecting are particularly powerful features of PC; for example, the search ("IUPHAR/BPS Guide to PHARMACOLOGY"[SourceName]) AND ("DrugCentral"[SourceName]) AND ("Therapeutic Target Database (TTD)"[SourceName]) AND ("DrugBank"[SourceName]) AND ("ChEMBL"[SourceName]) produced the result of 1110 CIDs in common between all five curated databases as a usefully cross-corroborated drug set, where we can directly select the 374 that are in PDB (n.b., this takes only minutes on the advanced menu using the "Add to history" option, but is much slower executing the searches in the interface). In this way users can isolate essentially any subset. Another feature of PC that may be less well known is the ability to perform quality assessment in situ. The appropriate selection options are in the limits drop-down menu under "Stereochemistry". There are six settings in each case that enable the counting (and filtration) of both chiral and E/Z centers at different stringencies. This does not fix quality issues in various sources (as discussed by Lipinski et al. [6]), but does enable some extent of amelioration. Note also that the ability to salt-strip via the "Chemical Properties" menu with "CovalentUnitCount" is another useful clean-up step for inter-source analysis or filtering sets for download.
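For users who prefer to script such intersections, the query string can be assembled programmatically. The following is a minimal sketch that only builds the Entrez term and a corresponding NCBI E-utilities esearch URL; the helper names are hypothetical, the use of the pccompound database with retmax=0 (count only) is an assumption about how one might run it, and the count returned today will differ from the 1110 reported here because the sources have since been updated.

```python
from urllib.parse import urlencode

# Source names exactly as quoted in the search described in the text.
CURATED_SOURCES = [
    "IUPHAR/BPS Guide to PHARMACOLOGY",
    "DrugCentral",
    "Therapeutic Target Database (TTD)",
    "DrugBank",
    "ChEMBL",
]


def build_intersection_query(sources):
    """Return an Entrez term that ANDs one [SourceName] clause per source."""
    if not sources:
        raise ValueError("at least one source name is required")
    return " AND ".join(f'("{name}"[SourceName])' for name in sources)


def build_esearch_url(term, db="pccompound"):
    """Hypothetical helper: E-utilities esearch URL requesting only the hit count."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": db, "term": term, "retmax": 0})


query = build_intersection_query(CURATED_SOURCES)
url = build_esearch_url(query)
print(query)
print(url)
```

The sketch deliberately stops at URL construction so that it runs without network access; fetching the URL and reading the count from the response is left to the user, subject to NCBI usage guidelines.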
Another tip that users may be less aware of is more of a caveat but is important to appreciate for interpretation of analysis results. This can be termed circularity, meaning identical content between sources for legitimate reasons (as opposed to piggy-backing by straight copying). The issue has already been raised in the context of UC content, where several sources come in twice, as independent loads and via PubChem subsumation. For PC it can be illustrated for the two important activity data submitters of ChEMBL and BindingDB. From the 1.7 million compound records in the former, 0.35 million are imported from PubChem confirmed BioAssays as well as 69 000 from BindingDB including 11 000 with curated patent activity data. Having introduced this reciprocity for content enhancement, BindingDB also imports compounds with protein target mapped activity data from ChEMBL. Now both these sources submit to PC, which means not only that some BioAssay data are going in twice via ChEMBL, but the valuable BindingDB patent curation data also enter twice from both sources. These reciprocity arrangements are not hidden, but can easily confound users who might assume their respective content is independently acquired.
If users are designing new chemical structures they need to address the basic question "is there anything out there similar to what I am working on" not only as a basic cheminformatics question, but also for freedom to operate checks against patent sources. In such cases it would be prudent to check the other two databases. While UC is out in front for raw numbers, we can still only approach the question with exact match rather than similarity searches (although this may change). As the smallest member of the triumvirate, CS nonetheless also has unique content, possibly running into millions, but this remains undeclared.
For novelty checking there already exists a crude approximation to the merging of all three databases in the form of searchable IKs as indexed in Google. [21] This can be demonstrated by a simple search with the connectivity layer inner key for aspirin (since Ref. [21] was published in 2013, it turns out that full IK searches have become more susceptible to various kinds of false positives). The results are shown in Figure 9.
In this case, we see that all three databases have in fact emerged as the top matches, but many other database links are in the hit list, which is remarkably clean in including very few false positives (note also that the searches execute faster than internal searches within the sources).
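The search term for this kind of lookup can be derived from a full InChIKey. The following is a minimal sketch, assuming that the "connectivity layer inner key" corresponds to the first 14-character block of a standard InChIKey; the function name is hypothetical, and the aspirin InChIKey is supplied as a known example.

```python
# Standard InChIKey for aspirin (acetylsalicylic acid).
ASPIRIN_INCHIKEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"


def connectivity_block(inchikey):
    """Return the first (connectivity) block of a standard InChIKey.

    A standard InChIKey has three hyphen-separated blocks of
    14, 10, and 1 characters; anything else is rejected.
    """
    parts = inchikey.strip().upper().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return parts[0]


# The first block alone is used as the web search term.
search_term = connectivity_block(ASPIRIN_INCHIKEY)
print(search_term)
```

Because the first block encodes connectivity only, a search on it would also be expected to retrieve records that differ in stereochemistry or isotopic labeling, which suits broad novelty checks but not exact-structure matching.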

Conclusions
The resources reviewed herein indicate that, on the one hand, researchers exploring bioactivity space and the wider chemical neighborhood for drug discovery have never had it so good. On the other, they are confronted with complex differences, including non-obvious ones, between these resources that have a direct bearing on utility. In this regard, a key issue needs to be highlighted in an attempt to understand the "why" of divergence. While the "who-has-what" questions can be laboriously addressed by comparisons of the type done in this work, such results do not explain the causes of this divergence, even where resources are described in detailed publications as well as internal documentation.
Notwithstanding, implicit divergence trends emerge from this work. For PC this includes the unique breadth of connectivity and the embracing of patent extraction sources to the level of 22 million CIDs. [9] For CS the alternative choice has been made of eschewing patent chemistry in the interests of overall structural quality (mainly due to the tendency of automated extraction to convert fragmented IUPAC names). The crowd-sourcing element is also designed to enhance quality in CS (although the statistics of entry names and/or structure corrections as a consequence of this have not been declared). For UC the focus on EBI databases is clear, but in addition, by subsuming structures from external sources, they have managed to not only overtake PC by nearly 60 million but also SciFinder by 20 million.
However, both from discussions with individuals and observing changes over the years, an undeclared diverging influence emerges. This is that, understandably, database teams have their work cut out simply to maintain the status quo while also pursuing both content and feature expansion. This leaves little reserve capacity for longer-term strategic changes in direction necessary to significantly shift balances of content. This could include, for example, encouraging submissions of novel chemical space, expanded activity data sets, deprecating sources where value has declined, filtering patent extractions at higher stringencies, and resisting vendor pushes of highly overlapping or virtual content. We therefore need to accept (but not see as a criticism) a certain ad hoc element to resource divergence, while noting at the same time the positive consequence of complementarity.
In terms of cumulative coverage of all three databases there is also an important caveat in the increasing numbers of "boutique" standalone databases with valuable internal small-molecule indexing related to different types of bioactive chemistry. Some of these are either not subsumed into the sources above (i.e., have decided not to submit) or have been languishing many years out of date. Inspection of the 2018 Nucleic Acids Research Database issue indicates examples of both the former (e.g., SuperDrug2) and the latter (e.g., Therapeutic Target Database had not updated in PC since 2012 but eventually refreshed in December 2017 with a major expansion to 22 134). [19,20] So wouldn't it be great if there was just a one-stop portal where: a) all standalone databases with significant value committed to submit to it, b) they ensured their pulls or pushes were synchronized with their internal releases, and c) they agreed to harmonize their different structure standardization processes? This would certainly be high on everyone's wish list looking at these major databases from the outside. However, given the different funding models, chemistry rules, and stakeholder interests on the inside, this does not seem so likely in the near future, but we can always hope.
Note added in proof: in the months elapsed between submission and proofing there have been changes in specific numbers (e.g., CS reducing their sources). It was not possible to pick up all of these instances, but they do not alter the general points.