Molecular Scaffold Analysis of Natural Products Databases in the Public Domain


Corresponding author: José L. Medina-Franco


Natural products represent important sources of bioactive compounds in drug discovery efforts. In this work, we compiled five natural products databases available in the public domain and performed a comprehensive chemoinformatic analysis focused on the content and diversity of the scaffolds with an overview of the diversity based on molecular fingerprints. The natural products databases were compared with each other and with a set of molecules obtained from in-house combinatorial libraries, and with a general screening commercial library. It was found that publicly available natural products databases have different scaffold diversity. In contrast to the common concept that larger libraries have the largest scaffold diversity, the largest natural products collection analyzed in this work was not the most diverse. The general screening library showed, overall, the highest scaffold diversity. However, considering the most frequent scaffolds, the general reference library was the least diverse. In general, natural products databases in the public domain showed low molecule overlap. In addition to benzene and acyclic compounds, flavones, coumarins, and flavanones were identified as the most frequent molecular scaffolds across the different natural products collections. The results of this work have direct implications in the computational and experimental screening of natural product databases for drug discovery.


Abbreviations: CSR, cumulative scaffold recovery; MACCS, Molecular ACCess System; MEQI, Molecular Equivalence Index; MOE, Molecular Operating Environment; SE, Shannon entropy; TCM, Traditional Chinese Medicine.

Traditionally, natural products have played a major role in drug discovery and development by providing novel chemical scaffolds, and serving as leads or drugs (1,2). For many years, 80% of drugs were either natural products or natural product-derived compounds. Even in the modern era, after the advent of techniques such as high-throughput screening of synthetic libraries, half of the drugs approved since 1994 are based on natural products research (3,4). Recent comprehensive reviews of natural products, or compounds inspired by natural products, indicate that more than 100 natural product compounds are currently in clinical trials. Natural products offer the advantage of discovering novel structural classes (5,6) because of their well-documented better coverage of chemical space relative to large synthetic compounds (7). Hence, the structural or chemical diversity of natural products can be utilized to access bioactive compounds with novel scaffolds (7–9). The inclusion of sources of natural products in drug discovery would open up more avenues for new classes of drugs (10) as well as new chemical entities, evidenced by studies that show structural or chemical complementarity between natural products and synthetic compounds (11,12). In addition to natural products, combinatorial libraries are attractive sources to expand the medicinal-relevant chemical space (13). Some combinatorial libraries, though small-sized, are inspired by natural product scaffolds (1,14,15).

Drug discovery programs that use natural products and compounds from other sources involve a large amount of structural and bioactivity data. The information is stored in commercial, in-house, or public databases and has been reviewed elsewhere (16–18). Several chemoinformatic analyses of natural products databases have been published. Recent examples include a comparison of the molecular complexities and structural diversities of three databases, including a natural product collection, and how these properties were related to specificity and diversity of biological performance (19). The authors demonstrated that compound specificity was associated with molecular complexity. In another recent study, the molecular topologies of natural products were compared with those of drugs, human metabolites, clinical candidates, and general bioactive compounds (20). Chen et al. showed that, among other measures, biologically relevant natural products and human metabolites had the highest ratios of single ring system compounds. In an effort to compute a natural product-likeness score, Ertl et al. (21) revealed that natural products contained fewer aromatic rings and were less flexible than drugs and an in-house set of synthetic compounds. A number of these and several other studies, such as the structural classification of the CRC Dictionary of Natural Products (22), involve the analysis of large natural products databases that are not freely available. Such analyses focus mainly on a single database and do not show comparisons across different natural products collections. To our knowledge, there are no published compilations of natural products databases whose chemical structures are available in the public domain.

In this work, we compiled five publicly available natural products databases and performed a comprehensive scaffold analysis including a recently introduced measure for scaffold diversity analysis (23). The natural products were compared with an in-house collection of compounds obtained from combinatorial chemistry and with a general screening library. The most and least diverse collections were identified as well as the most representative molecular scaffolds in all natural products libraries.


Compound databases

The scaffold diversity was analyzed for seven compound collections: five natural products databases, a general screening commercial library obtained from Maybridge, and a data set of compounds obtained from combinatorial libraries (24–26). Table 1 summarizes the source of each database and the number of compounds in each collection (downloaded July 2011). Of note, the Traditional Chinese Medicine database (TCM) was included (27). The databases contained between 267 and 89 425 compounds.

Table 1.   Databases analyzed in this study

Database                         Code        Molecules (M)   Molecular similarity(a), mean (SD)
Specs                            SC          562             0.54 (0.17)
Analyticon(b)                    NP-ALY      2472            0.48 (0.15)
TimTec                           NDL         3040            0.38 (0.12)
Traditional Chinese Medicine     TCM         32 358          0.51 (0.15)
ZINC (natural products subset)   NP-ZINC     89 425          0.39 (0.13)
Combinatorial library set        TPIMS       267             0.62 (0.15)
Commercial                       Maybridge   11 326          0.31 (0.11)

(a) Calculated with Molecular ACCess System (MACCS) keys (166 bits) and the Tanimoto coefficient.
(b) The structures were retrieved from Clemons et al. (19).

The seven databases were characterized by measuring the intermolecular diversity of the entire molecules using the Molecular ACCess System (MACCS) keys (166 bits; endnote a), as implemented in Molecular Operating Environment (MOE; endnote b), and the Tanimoto coefficient (28,29). Table 1 shows the corresponding mean values and standard deviations for each set. The similarity values were calculated from a random sample of 1000 molecules from each database, except for the natural products from Specs and the TPIMS set, where all molecules were used. It has been shown that random samples of 1000 molecules are representative of the molecular diversity within a larger collection (30–32). Results of three random samples are presented in Table S1 of the Supporting Information. Of note, the similarity values in Table 1 provide an overview of the molecular diversity considering both molecular scaffolds and side chains. We emphasize, however, that we did not perform an exhaustive study of the molecular diversity using structural fingerprints. Such analyses will be addressed in a forthcoming study.
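The similarity summary described above can be sketched as follows. This is a minimal illustration, not the MOE workflow used in this study: fingerprints are assumed to be already available as sets of on-bit indices (for MACCS keys, indices between 1 and 166), and the function names, the fixed random seed, and the use of the sample standard deviation are assumptions made for the example.

```python
import random
from itertools import combinations
from statistics import mean, stdev


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have similarity 0.
    return len(a & b) / union if union else 0.0


def pairwise_similarity_stats(fingerprints, sample_size=1000, seed=0):
    """Mean and standard deviation of all pairwise Tanimoto similarities.

    Collections larger than `sample_size` are reduced to a random sample,
    mirroring the 1000-molecule samples described in the text.
    """
    fingerprints = list(fingerprints)
    if len(fingerprints) < 2:
        raise ValueError("at least two fingerprints are required")
    if len(fingerprints) > sample_size:
        fingerprints = random.Random(seed).sample(fingerprints, sample_size)
    sims = [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]
    spread = stdev(sims) if len(sims) > 1 else 0.0
    return mean(sims), spread


# Toy fingerprints (hypothetical bit indices, not real MACCS keys).
toy = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]
mean_sim, sd_sim = pairwise_similarity_stats(toy)
```

A lower mean similarity corresponds to a more diverse collection, which is how the values in Table 1 are read.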

Definition of molecular scaffold

The term ‘molecular scaffold’ is employed to describe the core structure of a molecule (33). There are several ways to represent the scaffold of a molecule in a systematic manner (34,35). Herein, the scaffolds were defined as the ‘cyclic systems’ that result from iteratively removing the side chains of the molecule, as exemplified in Figure 1. The cyclic systems are part of the chemotype methodology developed by Johnson and Xu and were calculated with the program Molecular Equivalence Index (MEQI) (36,37). The scaffolds used here are similar to the ‘atomic frameworks’ of Bemis and Murcko (38). In MEQI, after iteratively removing the side chains of each compound (Figure 1), the resulting scaffold is labeled with a chemotype identifier, a code of five characters. The identifier is assigned to each scaffold using a unique naming algorithm based on the Morgan approach. Further details are given elsewhere (36). A useful feature of the Johnson and Xu scaffolds for comparing compound collections is that molecules classified in one scaffold do not lie in any other chemotype class (39). The MEQI approach has been extensively used to classify compound collections (23,31,40,41).

Figure 1.

 Definitions of scaffold used in this study. The scaffold is obtained after iteratively removing the side chains from the entire molecule. The cyclic system is identified by a code of five characters or chemotype identifier.
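The iterative removal of side chains can be illustrated on a bare molecular graph. The sketch below is not the MEQI implementation: it assumes a molecule is given as a hypothetical adjacency mapping of heavy-atom indices, ignores atom types and bond orders, and does not generate chemotype identifiers. It only shows the pruning idea, in which terminal atoms are stripped until none remain.

```python
def cyclic_system(adjacency):
    """Return the atoms left after iteratively stripping terminal atoms.

    `adjacency` maps each atom index to the set of its neighbors.
    Atoms with at most one neighbor are treated as side-chain ends and
    removed repeatedly; rings and the linkers between them survive.
    """
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    terminal = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in graph:  # already removed in an earlier step
            continue
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                terminal.append(nbr)
    return set(graph)


# Toluene as a graph: ring atoms 0-5, methyl carbon 6 attached to atom 0.
toluene = {0: {1, 5, 6}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4},
           4: {3, 5}, 5: {4, 0}, 6: {0}}
# Propane: an acyclic chain, which is pruned away completely.
propane = {0: {1}, 1: {0, 2}, 2: {1}}

ring_atoms = cyclic_system(toluene)   # the six ring atoms
no_atoms = cyclic_system(propane)     # empty set: acyclic compound
```

In this simplified picture, an empty result plays the role of the acyclic class, and toluene reduces to its benzene ring.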

Scaffold analysis

The analysis was organized in two sections: scaffold diversity of each database and comparison of scaffold overlap between collections. The first section was further divided into two parts: measures of diversity of the entire databases, namely frequency counts and cumulative scaffold recovery (CSR) curves, and measures of diversity of the most frequent scaffolds using the concept of Shannon entropy (SE).

Scaffold diversity

For each database, the number of scaffolds was recorded along with the number of singletons, that is, scaffolds containing only one compound. Acyclic compounds (chemotype identifier ‘00000’) were also recorded. The fraction of scaffolds relative to the size of the database, that is, number of scaffolds divided by the size of the database, and the fraction of singletons relative to the size of the database and relative to the number of scaffolds were analyzed. As noted before (23), these measures provide an idea of the scaffold diversity but do not provide information regarding the particular distribution of the scaffolds.
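The frequency counts can be sketched as below, assuming each compound has already been assigned a chemotype identifier. The function name and the toy assignment of identifiers to compounds are illustrative only.

```python
from collections import Counter


def frequency_counts(scaffold_ids):
    """Scaffold frequency measures for one database.

    `scaffold_ids` holds one chemotype identifier per compound.
    """
    counts = Counter(scaffold_ids)
    m = len(scaffold_ids)                               # molecules, M
    n = len(counts)                                     # distinct scaffolds, N
    n_sing = sum(1 for c in counts.values() if c == 1)  # singletons, Nsing
    return {
        "M": m,
        "N": n,
        "Nsing": n_sing,
        "N/M": n / m,
        "Nsing/N": n_sing / n,
        "Nsing/M": n_sing / m,
        "acyclic": counts.get("00000", 0),  # compounds in the acyclic class
    }


# Toy database of 10 compounds (identifier assignment is made up).
toy_ids = ["RYLFV"] * 4 + ["YSB4M"] * 2 + ["3P6AH", "Q874P", "00000", "00000"]
summary = frequency_counts(toy_ids)
# summary["N"] is 5, summary["Nsing"] is 2, summary["N/M"] is 0.5
```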

The distribution of scaffolds for each database was analyzed using CSR curves (23,35,42,43). To generate the CSR curves, the scaffolds are ordered by their frequency of occurrence (most to least common). Then, the fraction of scaffolds is plotted on the x-axis and the fraction of compounds that contain those scaffolds on the y-axis. The CSR curves were further characterized by obtaining the fraction of scaffolds required to retrieve 50% of compounds in the corresponding database (35).
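A CSR curve and the fraction of scaffolds needed to recover half of the compounds can be computed as in the following sketch. One assumption should be noted: the value returned is the smallest scaffold fraction whose cumulative recovery reaches 50%, without interpolating between points; the text does not state which convention was used.

```python
from collections import Counter


def csr_curve(scaffold_ids):
    """Points (fraction of scaffolds, fraction of compounds recovered).

    Scaffolds are ordered from most to least populated.
    """
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    m, n = sum(counts), len(counts)
    points, covered = [], 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        points.append((rank / n, covered / m))
    return points


def scaffold_fraction_at(scaffold_ids, target=0.5):
    """Smallest fraction of scaffolds that recovers `target` of the compounds."""
    if not scaffold_ids:
        raise ValueError("empty database")
    for x, y in csr_curve(scaffold_ids):
        if y >= target:
            return x
    return 1.0  # only reachable if target exceeds 1


# Toy database: scaffold populations of 4, 2, 2, 1 and 1 compounds.
toy_ids = ["RYLFV"] * 4 + ["YSB4M"] * 2 + ["3P6AH", "Q874P", "00000", "00000"]
f50 = scaffold_fraction_at(toy_ids)  # 0.4: two of five scaffolds cover 60%
```

A curve that stays close to the diagonal, and therefore a larger fraction at 50% recovery, corresponds to compounds spread more evenly over the scaffolds.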

The specific distribution of compounds in the n most populated scaffolds was quantified using an entropy-based information metric. The Shannon entropy (SE) (44,45) of a population of P compounds contained in n scaffolds is defined as:

$$\mathrm{SE} = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad p_i = \frac{c_i}{P}$$

where $p_i$ is the relative frequency of scaffold i in a population of P compounds containing a total of n distinct scaffolds, and $c_i$ is the absolute number of molecules containing scaffold i. The values of SE range between 0 and $\log_2 n$. If SE = 0, all P compounds share a single scaffold. If SE = $\log_2 n$, the P compounds are uniformly distributed among the n scaffolds, which represents the maximum scaffold diversity of the database. To normalize the SE values for different values of n, the scaled SE (SSE) is defined as (45):

$$\mathrm{SSE} = \frac{\mathrm{SE}}{\log_2 n}$$

The values of SSE range between 0, where all P compounds are contained in one scaffold, and 1.0, where each scaffold contains an equal number of compounds. Therefore, SSE values closer to 1.0 indicate large diversity within the n most populated scaffolds. This measure is discussed elsewhere (23,41).
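The entropy measures can be sketched as follows, taking P as the number of compounds in the n most populated scaffolds. The handling of edge cases is an assumption of this example: if a database has fewer than n scaffolds, all of them are used, and a single scaffold is assigned an SSE of 0.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """SE (in bits) of a list of scaffold populations c_i."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def scaled_shannon_entropy(scaffold_ids, n):
    """SSE of the n most populated scaffolds: SE divided by log2 of their number."""
    top = [count for _, count in Counter(scaffold_ids).most_common(n)]
    if len(top) < 2:
        return 0.0  # one scaffold holds every compound
    return shannon_entropy(top) / math.log2(len(top))


# Four equally populated scaffolds give the maximum value of 1.0.
uniform = ["A", "B", "C", "D"]
sse_uniform = scaled_shannon_entropy(uniform, 4)

# One dominant scaffold lowers the value.
skewed = ["A"] * 97 + ["B", "C", "D"]
sse_skewed = scaled_shannon_entropy(skewed, 4)
```

The skewed example shows how one heavily populated scaffold among the top n pulls the scaled entropy well below 1.0 even when the remaining scaffolds are evenly populated.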

Scaffold overlap

The scaffold contents of the libraries were compared with each other, and the most frequent scaffolds across all databases were identified. For each pair of databases A and B, the number of common scaffolds $S_{A,B}$ was recorded.
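Counting the scaffolds shared by each pair of databases amounts to a set intersection over chemotype identifiers, as in this minimal sketch. The database names and identifiers in the example are placeholders, not data from this study.

```python
from itertools import combinations


def scaffold_overlap(databases):
    """Number of shared scaffolds for every pair of databases.

    `databases` maps a database name to an iterable of chemotype
    identifiers (one per compound; duplicates are collapsed).
    """
    unique = {name: set(ids) for name, ids in databases.items()}
    return {
        (a, b): len(unique[a] & unique[b])
        for a, b in combinations(sorted(unique), 2)
    }


# Placeholder databases with made-up scaffold assignments.
toy = {
    "DB1": ["RYLFV", "RYLFV", "YSB4M", "3P6AH"],
    "DB2": ["RYLFV", "Q874P"],
    "DB3": ["YSB4M", "3P6AH", "3P6AH"],
}
shared = scaffold_overlap(toy)
# shared[("DB1", "DB2")] is 1 and shared[("DB1", "DB3")] is 2
```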

Results and Discussion

Databases of natural products in the public domain

Several large natural products collections are available, such as the Dictionary of Natural Products (endnote c) and the GVKBio databases (endnote d). However, access to the chemical structures of these collections is on a commercial basis. Herein, we focus the study on natural products collections whose chemical structures can be readily accessed on the web (Table 1). Of note, the number of molecules accessible in each database covers a wide range, from a few hundred to tens of thousands. Table S2 in the Supporting Information shows the number of duplicate compounds (same SMILES) in the five natural products databases. Despite the fact that TCM is a large collection with more than 32 000 compounds, it has a low overlap (<0.8%) with any other natural product database (Table S2). We did not identify duplicate molecules between the natural products from Specs and any other natural products database.

The molecular similarity summarized in Table 1, computed with MACCS keys from the entire molecules, that is, scaffolds plus side chains (see Methods), indicates that all five natural products databases are diverse, with MACCS/Tanimoto similarities <0.8 (mean similarity between 0.39 and 0.54) and that the natural products from ZINC (46) are the most diverse. For reference, the general screening library from Maybridge showed the largest molecular diversity (mean similarity of 0.31) and, not surprisingly, the data set from combinatorial libraries had the lowest diversity (mean similarity of 0.62). We emphasize that the similarity values in Table 1 serve as references. As discussed later, the molecular similarity computed with the whole molecules does not necessarily correlate with the scaffold diversity.

Natural products collections with structures available in the public domain are particularly attractive to the scientific community in academic, non-profit, and other groups to uncover bioactive compounds for emerging therapeutic targets such as DNA methyltransferases (47). This can be achieved with the assistance of virtual screening (48) as exemplified by the Drug Discovery Portal (3) and iScreen (49) initiatives.

Scaffold analysis

Diversity of entire databases: frequency counts

Similarities or differences between chemical libraries can be determined by analyzing physico-chemical properties and substructure or molecular scaffold composition. Complementary approaches to comparing these measures include, but are not limited to: projections of binary fingerprint data (50), principal component analyses (31), scaffold tree generation (51), clustering and statistical distributions. This work is focused on the analysis of molecular scaffolds. As detailed in the Methods section, the MEQI approach was utilized to identify molecular scaffolds present in the databases.

Scaffold diversity analysis of natural products databases is relevant because it represents an approach to assess the coverage of chemical space for probe or drug discovery. For example, hit compounds are, in general, more likely to be identified from a diverse collection in a virtual and/or experimental high-throughput screening campaign. To quantify scaffold diversity, three metrics were employed: frequency counts, CSR curves, and SSE. Previously, it was shown that the combination of the proportion of multiplet and singleton scaffolds is a useful measure to compare scaffold variation (23,52). Table 2 summarizes the results of the scaffold diversity measures for the five natural products libraries, Maybridge, and the combinatorial library set. In Table 2, M represents the total number of molecules per database; N denotes the number of unique scaffolds (which includes cyclic and acyclic compounds); Nsing signifies the number of singletons, that is, scaffolds with only one compound; F50 stands for the fraction of scaffolds containing 50% of database molecules, that is, half of all compounds can be retrieved using only a fraction F50 of the scaffolds; and SSEn is the SSE of the distribution of the top n populated scaffolds.

Table 2.   Results of scaffold diversity analysis using different metrics

Column groups: frequency counts (N, N/M, Nsing, Nsing/N, Nsing/M); CSR (F50); scaled Shannon entropy (SSE5 to SSE100).

Database    M        N        N/M    Nsing   Nsing/N   Nsing/M   F50    SSE5   SSE10   SSE20   SSE50   SSE100
TCM         32 358   12 267   0.38   6814    0.56      0.21      0.13   0.83   0.81    0.84    0.88    0.90
NP-ZINC     89 425   15 968   0.18   6402    0.40      0.07      0.05   0.90   0.90    0.92    0.92    0.92
Maybridge   11 326   8085     0.71   7229    0.89      0.64      0.30   0.79   0.75    0.72    0.74    0.75

CSR, cumulative scaffold recovery; SSE, scaled SE; TCM, Traditional Chinese Medicine.

The ratios N/M and Nsing/M point to the number of scaffolds and singleton scaffolds normalized by the number of compounds, while Nsing/N is the number of singleton scaffolds relative to the total number of scaffolds. In general, the higher the ratios, the more diverse the database. According to these fractions, the natural products collections from TimTec (NDL) and Analyticon (NP-ALY) are the most diverse among the five natural products databases. In contrast, natural products from ZINC (NP-ZINC) had the lowest scaffold diversity with N/M, Nsing/M and Nsing/N values of 0.18, 0.07, and 0.40, respectively. It is worth noting that the low scaffold diversity of natural products from ZINC does not mirror the high structural diversity of the whole molecules as measured by MACCS keys/Tanimoto (Table 1). These results indicate that the side chains of the molecules in this library play a remarkable role in the structural diversity.

For all seven databases compared in Table 2, Maybridge had the largest scaffold diversity with N/M, Nsing/M, and Nsing/N values of 0.71, 0.64, and 0.89, respectively. This conclusion is also reflected in the observation that the average pairwise similarity values of the Maybridge library were the lowest among all the databases (Table 1). The combinatorial library set possesses intermediate scaffold diversity.

Diversity of entire databases: CSR curves

Figure 2A shows the corresponding CSR curves for all databases, which point to the distribution of compounds among the scaffolds. CSR curves are interpreted as follows: a diagonal plot indicates an equal distribution of compounds across the scaffolds, while curves that rise more steeply than the diagonal signify databases with low scaffold diversity. For these databases, Figure 2A shows that Maybridge was the most diverse followed by the combinatorial set, while natural products from ZINC, followed by natural products from Specs, were the least diverse. Among the natural products databases, TimTec and Analyticon showed the largest scaffold diversity.

Figure 2.

 Scaffold diversity of the natural products databases and reference collections. (A) Cumulative scaffold recovery curves for the data sets analyzed in this work. (B) Plot of the fraction of singletons with respect to the size of the database (Nsing/M) vs. the fraction of scaffolds covering half of the molecules in each database (F50). Color scheme is the same as in panel A. Data points are sized by the number of compounds in the database. See text for details. See online version for colors.

The F50 values that can be obtained directly from the CSR curves indicate the fraction of scaffolds required to retrieve 50% of the corresponding database (see Methods). Large values indicate high scaffold diversity. Table 2 shows the F50 values for the seven databases. According to this measure, databases from Analyticon, TCM and TimTec are the most diverse natural products collections (F50 values of 0.11–0.14), whereas natural products from ZINC are the least diverse (F50 value of 0.05). All natural products databases had lower F50 values than Maybridge which showed the most scaffold diversity.

Figure 2B shows a plot of the fraction of scaffolds covering half of the molecules in each database (F50) versus the proportion of singleton scaffolds (Nsing/M). Points on the upper right corner of the graph represent databases that have a relatively higher number of singletons, and also more scaffolds to cover half of the compounds in the collection. These databases would be more diverse as opposed to databases denoted by points in the lower left corner of the plot. As illustrated in Figure 2B, and in agreement with the CSR plot, Maybridge and natural products from ZINC were the most and least diverse, respectively. Deviations from the diagonal indicate that the F50 and Nsing/M differ in their rankings of the given databases. For instance, on the F50 scale, the combinatorial set was more diverse than the natural products from TimTec and Analyticon, while the latter two databases ranked more highly on the Nsing/M scale. These results indicate the advantage of using more than one measure to assess scaffold diversity (23). The results of natural products from ZINC were remarkable given that it contained the highest number of compounds among the databases. This finding may be an exception to the general notion that larger-sized libraries are necessarily more diverse than their smaller-sized counterparts.

Diversity of most frequent scaffolds: Shannon entropy

The SE measure was introduced originally as a tool in digital communication to analyze the information content in transmitted signals (44), but it has been adopted in chemoinformatics applications as a non-parametric method to investigate and compare molecular content in compound libraries (23,45). When the SE is scaled by the number of bins present, the new measure, SSE, no longer depends on the sizes of the databases. SSE values closer to unity indicate high diversity, and vice versa. The SSEn analyses performed here were akin to the practice of examining the initial portion of a CSR curve. Table 2 summarizes the SSEn values for the top ‘n’ most frequent scaffolds, that is, n = 5, 10, 20, 50, and 100. SSE values for the top 30 and 40 most frequent scaffolds were also calculated and were very similar to either SSE20 or SSE50 (data not shown). Table 2 shows that as the percentage of scaffold coverage increased, the SSEs of Specs and TimTec decreased, while those for TCM increased. These results clearly show the importance of considering different numbers of top ‘n’ populated scaffolds to interpret the scaffold diversity using the concept of SE. General patterns can be extracted from the SSEn measure in Table 2. Among the natural products databases, TCM and Specs showed, overall, the lowest SSE10–100 values, indicating lower diversity considering the top 10–100 populated scaffolds. Looking at the top five most populated scaffolds, natural products from Specs and Analyticon were the most diverse (SSE5 value of 0.98). The combinatorial set was the most diverse except at SSE5. Remarkably, Maybridge was the least diverse (SSEn values ≤ 0.79), in contrast to the conclusions derived when the entire databases were considered.
Maybridge consistently showed the least diversity up to the top 100 most populated scaffolds because it has a large proportion of a single scaffold (benzene, 5.8%) as compared to the frequency of the other chemotypes (the second most frequent scaffold in Maybridge has a frequency of 1.4%). This is an example of the ‘early enrichment’ problem encountered in classic recovery curves (53) and also emphasizes the significant difference in scaffold diversity of compound databases when the entire collection is considered or only the most frequent scaffolds.

Scaffold overlap: most frequent scaffolds in natural products databases

To further explore the uniqueness of the scaffolds in each natural product database, we investigated the number of duplicate scaffolds in the five databases. Table S3 in the Supporting Information summarizes the number of unique scaffolds, determined by the unique chemotype identifiers, shared between the natural product databases. In contrast to the number of duplicate molecules summarized in Table S2 (see discussion above), Table S3 shows the number of different scaffolds that are present in any pair of natural products databases. The lowest number of shared scaffolds observed was four, between Specs and TCM, while the highest number (717) was between natural products from TimTec and ZINC. Although roughly 25% of the molecules in the TimTec natural product database are contained in natural products from ZINC (Table S2), about half of the scaffolds in TimTec are represented in natural products from ZINC (Table S3).

Having determined the degree of scaffold diversity (Table 2 and Figure 2) and explored the scaffold overlap between the databases, we next looked at the most prevalent scaffolds present across all the natural products databases. There are 127 857 compounds in all five natural products collections considered in this work. In total, there are 29 520 different scaffolds in all five collections of which 1267 (4.3%) scaffolds (including acyclic structures) occur in two or more natural product databases. Benzene (chemotype identifier: RYLFV) and pyridine (91DYR) are the only two scaffolds present in all five natural products databases. The benzene ring is also the most common scaffold in Maybridge, approved drugs (38) and several other data sets (31,40). Figure 3 shows scaffolds present in at least four of the five natural products databases with an average frequency of at least 0.3%. Acyclic compounds are present in all natural product databases except Specs with an average frequency of 1.4 ± 1%. In addition to benzene and acyclic compounds, flavones (YSB4M), coumarins (3P6AH), and flavanones (Q874P) were the most frequent molecular scaffolds found in the natural products databases (Figure 3). None of the last three scaffolds are common in Maybridge.

Figure 3.

 Most common scaffolds in all natural products databases in the public domain considered in this work. Scaffolds shown are present in at least four natural products databases with an average frequency of at least 0.3%. Chemotype identifier, average and standard deviation frequency are displayed.
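The selection of scaffolds shown in Figure 3 can be sketched as a filter on per-database frequencies. Two assumptions are made here that the text does not settle: the average and standard deviation are taken over the databases that contain the scaffold, and the population standard deviation is used. The function name and toy data are illustrative.

```python
from collections import Counter
from statistics import mean, pstdev


def frequent_shared_scaffolds(databases, min_databases=4, min_avg_freq=0.003):
    """Scaffolds found in at least `min_databases` collections with an
    average relative frequency of at least `min_avg_freq` (0.003 = 0.3%).

    Returns {scaffold: (mean frequency, standard deviation)}.
    """
    per_scaffold = {}
    for ids in databases.values():
        ids = list(ids)
        for scaffold, count in Counter(ids).items():
            per_scaffold.setdefault(scaffold, []).append(count / len(ids))
    hits = {}
    for scaffold, freqs in per_scaffold.items():
        if len(freqs) >= min_databases and mean(freqs) >= min_avg_freq:
            hits[scaffold] = (mean(freqs), pstdev(freqs))
    return hits


# Toy example with three placeholder databases and relaxed thresholds.
toy = {
    "DB1": ["X", "X", "Y"],
    "DB2": ["X", "Z"],
    "DB3": ["X", "Y", "Y", "W"],
}
common = frequent_shared_scaffolds(toy, min_databases=3, min_avg_freq=0.1)
# Only "X" occurs in all three toy databases.
```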

The frequencies and codes of the scaffolds shared between the combinatorial library set (TPIMS) and any of the natural product libraries are shown in Table S4 in the Supporting Information. The data showed that generally at most one scaffold was common to the combinatorial set and each natural product database, excluding Specs, which did not possess any scaffold found in TPIMS. Among the shared scaffolds, RJ7LM was found twice in both TPIMS and Maybridge (Table S4). The overall low scaffold overlap among the natural products databases suggests that they can be merged into a single database that can be the subject of virtual and/or experimental screening to increase the likelihood of identifying hits with different scaffolds. The low commonality of scaffolds between TPIMS and any of the natural product libraries pointed to the uniqueness of the compounds obtained from the combinatorial libraries analyzed here.

Conclusions and Perspectives

Several bioactive compounds of natural origin have been discovered fortuitously. However, there are efforts to systematically uncover bioactive compounds. One step in this direction is the comprehensive characterization of large natural products collections. In this work, we conducted a detailed analysis of the molecular scaffold content and diversity of five natural products collections whose chemical structures are in the public domain. To our knowledge, this is the first study that compiles and analyzes databases of natural products whose chemical structures are freely available on the web. We concluded that the natural products databases showed different scaffold diversity and content. Databases from Analyticon and TimTec had similar scaffold diversity and were more diverse than the other three collections. Although the natural products subset of ZINC was the largest database, it showed the lowest scaffold diversity, followed by natural products from Specs. These results can be attributed, at least in part, to the presence of synthetic analogs of compounds of natural origin that can bias the scaffold content. Considering the entire databases, all five natural product databases had lower diversity than the general screening collection used as a reference. However, the analysis focused on the most frequent scaffolds revealed that the general screening database had the lowest diversity owing to the large proportion of a single scaffold. Indeed, while some diversity measures used in this work assessed the diversity of the libraries as a whole (frequency counts and scaffold recovery curves), SE was used to measure the diversity of the most frequent scaffolds. In agreement with previous studies, the results of this work emphasized the advantage of using several complementary measures for the comprehensive analysis of scaffold diversity. Overall, natural products databases in the public domain had low molecule overlap.
Benzene and pyridine are the only two scaffolds present in all five natural products databases analyzed in this work. In addition to acyclic compounds, flavones, coumarins, and flavanones were identified as frequent molecular scaffolds in most of the natural products collections. The major perspectives of this work include the analysis of the same public natural product databases using physicochemical properties, structural fingerprints, and measures of molecular complexity. These analyses are the subject of a separate study that will be published in due course. Additional future studies include the analysis of marine- and microorganism-derived natural products databases.


a. MACCS Structural Keys, San Leandro, CA: MDL Information Systems Inc.

b. Molecular Operating Environment (MOE), version 2011.10, Montreal, QC, Canada: Chemical Computing Group Inc., available at: (accessed June 2012).

c. Dictionary of Natural Products, London: Chapman & Hall/CRC Informa.

d. GVKBio databases, Hyderabad, India: GVK Biosciences Private Ltd.


We are grateful to Dr. Mark Johnson for providing the program MEQI. This manuscript was supported by the State of Florida, Executive Office of the Governor’s Office of Tourism, Trade, and Economic Development and the Multiple Sclerosis National Research Institute.

Conflict of interest

The authors declare that they do not have any conflict of interest related to this manuscript.