• database of chemical toxicity;
  • histopathology;
  • kidney toxicity;
  • liver toxicity;
  • ontology of toxic pathology;
  • organ toxicity;
  • quantitative structure–activity relationship;
  • recursive partitioning


  1. Top of page
  2. Abstract
  3. Methods and Materials
  4. Results
  5. Discussion
  6. Conclusions and Future Directions
  7. Conflict of Interest Statement
  8. References
  9. Supporting Information

The ability to accurately predict the toxicity of drug candidates from their chemical structure is critical for guiding experimental drug discovery toward safer medicines. Under the guidance of the MetaTox consortium (Thomson Reuters, CA, USA), which comprised toxicologists from the pharmaceutical industry and government agencies, we created a comprehensive ontology of toxic pathologies for 19 organs, classifying pathology terms by pathology type and functional organ substructure. By manual annotation of full-text research articles, the ontology was populated with chemical compounds causing specific histopathologies. Annotated compound-toxicity associations defined histologically from rat and mouse experiments were used to build quantitative structure–activity relationship models predicting subcategories of liver and kidney toxicity: liver necrosis, liver relative weight gain, liver lipid accumulation, nephron injury, kidney relative weight gain, and kidney necrosis. All models were validated using two independent test sets and demonstrated overall good performance: initial validation showed 0.80–0.96 sensitivity (correctly predicted toxic compounds) and 0.85–1.00 specificity (correctly predicted non-toxic compounds). Later validation against a test set of compounds newly added to the database in the 2 years following initial model generation showed 75–87% sensitivity and 60–78% specificity. General hepatotoxicity and nephrotoxicity models were less accurate, as expected for more complex endpoints.


Abbreviations:

  • CoMFA, comparative molecular field analysis;
  • CoMSIA, comparative molecular similarity analysis;
  • DWD, distance weighted discrimination;
  • HLAED, human liver adverse effects database;
  • kNN, k-nearest neighbors;
  • MCC, Matthews correlation coefficient;
  • OECD, Organization for Economic Cooperation and Development;
  • QSAR, quantitative structure–activity relationship;
  • REACH, registration, evaluation, authorization and restriction of chemicals;
  • RF, random forest;
  • SAR, structure–activity relationship;
  • SIMCA, soft independent modeling of class analogies;
  • SVM, support vector machines

Toxicology is the single most expensive aspect of preclinical drug discovery, costing roughly as much as all other preclinical operations put together (1). This is because of rigorous testing in animals, which is unfortunately neither scalable nor amenable to miniaturization. More pressure is being applied by initiatives such as the European Union’s Registration, Evaluation, Authorization and restriction of Chemicals (REACH) regulation (2), which requires retesting of already marketed chemicals and limits animal testing in the development of new compounds. Thus, there is a drive in the pharmaceutical and chemical industries to find suitable technologies to predict toxic outcomes, explain molecular mechanisms of toxicity, replace animal tests, or prioritize additional studies.

Quantitative structure–activity relationship (QSAR) modeling has been widely used in toxicology to predict the liability of novel compounds from the structural features of known toxicants. During the last 40 years, many QSAR models have been published predicting carcinogenicity (3,4), mutagenicity (5,6), genotoxicity (7), developmental toxicity (8), and other endpoints using a wide spectrum of QSAR approaches, including k-nearest neighbors (9), artificial neural networks (7), and other machine learning methods (10). For a recent review of QSAR modeling in toxicology, the reader is referred to Valerio et al. (11).

Quantitative structure–activity relationship models are easy to derive using modern computational algorithms and available data, but the predictive quality of the models is a major concern (12). For that reason, regulatory agencies such as the US Food and Drug Administration (FDA) do not consider QSAR predictions as supporting information for safety decision-making without proper testing and validation (13). Other bodies, such as the Organization for Economic Cooperation and Development (OECD), are trying to establish principles for QSAR model validation (footnote a). These principles include the following: (i) a defined endpoint; (ii) an unambiguous algorithm; (iii) a defined domain of applicability; (iv) appropriate measures of goodness-of-fit, robustness, and predictivity; and (v) a mechanistic interpretation, if possible. Many QSAR models compliant with the above principles have been developed and tested (14–16).

Compliance with the OECD principles does not, however, guarantee a high-quality model, because model quality often depends on the complexity of the endpoint. For example, acceptable prediction models already exist for some toxicological endpoints based on well-understood mechanisms, such as mutagenicity and skin sensitization (17), whereas mechanistically more complex endpoints such as acute, chronic, or organ toxicities were claimed to be impossible to predict adequately (17).

Another major challenge in the development of new QSARs and the evaluation of existing ones is the scarcity of toxicity data and, to an extent, the fact that the available data are not structured in a form that makes them readily usable for modeling (footnote b) (18). Harmonizing the reporting of chemical toxicity data to facilitate comparison between available data sources and databases remains a critical need. This gap was recognized by members of the MetaTox consortium (Thomson Reuters, CA, USA), which comprised toxicologists from the pharmaceutical industry and the FDA (footnote c) (19). The consortium prioritized, as the most urgent need for a systems biology platform geared toward safety assessment applications, the development of a fixed terminology and ontology of toxic pathology and the manual annotation of publicly available human and animal toxicity data (toxicants and biomarkers) using that terminology.

Here, we describe the development of a database of animal organ toxicity and the construction and validation of several QSAR models based on compound-toxicity annotations from this database. The models predict general organ toxicity endpoints (e.g., hepatotoxicity and nephrotoxicity) and more precise subcategories of organ toxicity, providing greater mechanistic insight: liver necrosis, liver relative weight gain, liver lipid accumulation, kidney necrosis, kidney relative weight gain, and nephron injury. The models were built using a recursive partitioning algorithm shown to perform well when dealing with complex endpoints associated with multiple mechanisms (20). Models were validated twice with test sets comprising structures of compounds with known properties that were not used in training the model. Models demonstrated overall good performance and are available in the MetaDrug/ToxHunter™ systems pharmacology suite designed for the prediction and evaluation of biological effects of small molecules.

Methods and Materials


Database of chemical toxicity

A comprehensive ontology of toxic pathologies was developed by combining pathology terms for experimental findings in various organs with terms describing organ structure and functionality (Figure 1). The terminology is structured by organ at the highest level and classifies ontology leaves by pathology type (branches develop according to the pathology type observed) and by functional organ substructure (branches develop according to the organ structures affected). At the highest level of detail, a term includes an organ component, an organ substructure component (to the level of cell type where appropriate), a pathology component, and a pathology subtype component (e.g., Liver – centrilobular lipid accumulation and microvesicular). Currently, the ontology is complete and annotated for 19 organs. Ontologies for other organs and clinical pathology findings have been developed and are currently being annotated.


Figure 1.  Selection of the Thomson Reuters ontology of toxic pathology for kidney, showing organization of terms by pathology type and by organ substructure.


The ontology is populated by full-text annotation from the scientific literature with DNA, RNA, protein and metabolite changes associated with the specific pathological effects of particular compounds, alongside chemical compounds causing the specific toxicity. Information available in the public domain was annotated according to the following criteria:

  •  Experiments should be performed by the administration of compounds or drugs to mouse or rat. Drug overdose or toxicant poisoning in humans was also annotated. However, only compounds with confirmed toxicity in rodents were used for model building. Experiments performed on cell or ex vivo organ cultures were not captured.
  •  The endpoint should be defined by a macro- or microscopic evaluation of an organ or organ section/slice. Typically, this happens at post-mortem, but may also be measured from biopsy sections.
  • In vivo experiments must be described in the article under curation. Articles where authors refer to other experiments were not used.

Our annotation strategy provides partial compatibility with the first OECD principle, which requires identification of the experimental system that is being modeled by the QSAR. By selecting only two organisms for the identification of toxic compounds (rat and mouse) and considering only histological observations, we tried to reach a balance between a sufficient number of chemicals in the training sets and the similarity of experimental protocols and conditions (footnote d). Overall, a broad range of compounds, from industrial chemicals, pollutants, and pesticides to drugs, was used for each toxic endpoint when creating QSAR models.

Training and validation compound sets

The prediction of a particular toxicity is based on the ability of a QSAR model to differentiate between toxic and non-toxic compounds. The set of compounds for training a QSAR model must include chemicals that cause a particular toxicity (positives) and chemicals that do not cause that particular toxicity (negatives). As positives, we took compounds annotated as causing a particular toxic pathology from the ontology described previously. For general organ toxicity models such as ‘Hepatotoxicity,’ compounds associated with all liver pathologies were taken as positives. The choice of compounds for the negative training set is a particularly difficult proposition because it is impossible to definitively prove that the compound is negative for a given finding unless it has been reported specifically as having been tested against a finding and shown to be negative. This is rare for toxicity findings and applying this criterion would result in an insufficient number of compounds in the negative training set. We therefore selected as negatives a number of randomly selected FDA-approved drugs, equal in size to the positive training set, that were not associated with the particular toxicity in our database. The implicit assumption is that FDA-approved drugs have been thoroughly studied for possible toxic effects. If a drug was not found by our curators as causing a toxicity, it is therefore reasonable to assume that the compound is in fact truly negative for the finding in question and may serve as a negative training set compound. All compounds used have publicly disclosed structures and have been annotated from scientific publications as described earlier. The identity and structure of all training set chemicals are freely available to MetaDrug/ToxHunter users within the software. The chemical structures were cleaned of salts and inorganic compounds and exported as an SD file. 
Positives received an activity value of ‘1,’ and negatives received an activity value of ‘0.’ The initial set was then randomly split into a training set and a test set, with the test set sized at approximately 15% of the training set and containing an equal number of positives and negatives (Figure 2). The molecules from the test set were not included in model building.
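The assembly of a balanced compound set and its random split can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_balanced`, the compound identifiers, and the reading of "15% of the training set" as a test-to-training size ratio applied per class are all assumptions.

```python
import random

def split_balanced(positives, negative_pool, test_to_train=0.15, seed=0):
    """Label compounds 1/0 and hold out a class-balanced test set.

    test_to_train is the assumed ratio of test-set size to training-set size.
    """
    rng = random.Random(seed)
    # Draw as many presumed negatives as there are annotated positives.
    negatives = rng.sample(negative_pool, len(positives))
    # Per-class test size t such that t / (n - t) is about test_to_train.
    n_test = round(len(positives) * test_to_train / (1 + test_to_train))
    train, test = [], []
    for compounds, label in ((positives, 1), (negatives, 0)):
        shuffled = compounds[:]
        rng.shuffle(shuffled)
        test += [(c, label) for c in shuffled[:n_test]]
        train += [(c, label) for c in shuffled[n_test:]]
    return train, test

# Hypothetical identifiers standing in for annotated toxicants and FDA-approved drugs.
positives = [f"tox_{i}" for i in range(115)]
negative_pool = [f"drug_{i}" for i in range(400)]
train, test = split_balanced(positives, negative_pool)
```

Fixing the seed makes one split reproducible; drawing several splits with different seeds gives the kind of repeated random subsampling used later for model selection.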


Figure 2.  Schema of quantitative structure–activity relationship (QSAR) model building and validation.


The positives that were correctly predicted by the corresponding models were clustered based on maximum common substructures using the JKlustor 5.9.0 utility from ChemAxon (footnote e). Common substructures for each model training set can be found in Figures S1–S8.

QSAR model building

Quantitative structure–activity relationship models were built in MetaDrug, a software application that uses a recursive partitioning algorithm as implemented in ChemTree™ (GoldenHelix, MT, USA; footnote f). ChemTree uses two-dimensional structural descriptors represented as augmented atom pairs (21) that are extracted from compound structures. An augmented atom is a focal atom together with the non-hydrogen atoms immediately bonded to it. The value of a descriptor is the distance, in bonds, between the two augmented atoms of a pair. Figure 3 shows an example of augmented atoms. One oxygen atom bonded only to one carbon atom, O(C), is eight bonds away from a carbon bonded to two other carbons, C(CC), and another augmented atom O(C) is only three bonds away from the circled C(CC). This gives two instances of the atom pair O(C)-C(CC), one with a descriptor value of three and the other with a value of eight. If a compound does not contain these atoms, the descriptor value is missing and is denoted as ‘?’. Only the lowest (PLLO) and the highest (PLHI) pathway lengths between two augmented atoms are used as descriptors. As an example, all descriptors generated for acetaminophen are shown in Table S1. To reduce the number of descriptors, a minimum frequency of occurrence can be set. The default value is five, meaning that at least five compounds must have the given path length for the descriptor to be included in further analysis.
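To make the descriptor definition concrete, the sketch below derives augmented atom pairs and their lowest and highest path lengths for acetaminophen from a hand-written heavy-atom graph. It is not ChemTree code: the function names are hypothetical, and the path length is assumed to be the number of bonds on the shortest path between two atoms.

```python
from collections import deque

def augmented_atoms(elements, bonds):
    """Label each heavy atom as element(sorted neighbor elements), e.g. C(CC)."""
    neighbors = [[] for _ in elements]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    labels = [
        elements[i] + "(" + "".join(sorted(elements[j] for j in neighbors[i])) + ")"
        for i in range(len(elements))
    ]
    return labels, neighbors

def bond_distances(start, neighbors):
    """Breadth-first search: bonds on the shortest path from start to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    return dist

def atom_pair_descriptors(elements, bonds):
    """Map each augmented atom pair to (PLLO, PLHI), its lowest and highest path length."""
    labels, neighbors = augmented_atoms(elements, bonds)
    descriptors = {}
    for i in range(len(elements)):
        for j, d in bond_distances(i, neighbors).items():
            if j <= i:
                continue  # count each unordered atom pair once
            key = tuple(sorted((labels[i], labels[j])))
            lo, hi = descriptors.get(key, (d, d))
            descriptors[key] = (min(lo, d), max(hi, d))
    return descriptors

# Acetaminophen heavy atoms, CC(=O)Nc1ccc(O)cc1, indexed in SMILES order.
elements = ["C", "C", "O", "N", "C", "C", "C", "C", "O", "C", "C"]
bonds = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6),
         (6, 7), (7, 8), (7, 9), (9, 10), (10, 4)]
descriptors = atom_pair_descriptors(elements, bonds)
pllo, plhi = descriptors[("C(CC)", "O(C)")]
# A pair that does not occur in the molecule is reported as missing, '?'.
missing = descriptors.get(("Cl(C)", "O(C)"), "?")
```

Under these assumptions the O(C)-C(CC) pair in acetaminophen has a lowest path length of 2 (hydroxyl oxygen to an adjacent ring carbon) and a highest of 5 (carbonyl oxygen to a ring carbon next to the hydroxyl-bearing carbon). The values in Table S1 may differ if ChemTree defines pathways differently.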


Figure 3.  Example of augmented atoms as descriptors from ChemTree help manual. Two atom pairs within the same molecule are shown with varying distance of three and eight bonds between each other. Each molecule is described using many of these descriptors. The average number of descriptors for the molecules constituting the hepatotoxicity and nephrotoxicity training sets was 92.


Once all the descriptors have been generated from the structures of the training set compounds, the ChemTree software examines them and chooses the best descriptor to segment the entire data set into two or more nodes. Then, each node is split again with the next best descriptor. The statistical significance of each split is evaluated by calculating the p-value, and the segmentation of a tree continues until no statistically significant splits are found. The average activity value in the final node constitutes the prediction of that decision tree. These values are averaged for 50 trees and the resulting number is the final prediction of the QSAR model. The general statistics on the lowest and the highest number of leaves among 50 trees for each QSAR model can be found in Table S2. The most often observed descriptors among 50 trees for each QSAR model are shown in Table S5. The algorithm is unambiguous and the results with the same training set and the same parameters can be easily reproduced by a third party.
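The prediction step described above, a leaf average per tree followed by an average over trees, can be sketched as follows. The tree representation is hypothetical (a dict with a descriptor name, sorted cut points, one child per segment, and a branch for missing values); ChemTree's internal format and its handling of missing descriptors are not described in the text, and the statistical split selection is not reproduced here.

```python
from bisect import bisect_right
from statistics import mean

def predict_tree(node, descriptors):
    """Walk one tree; a leaf is the mean activity of the training compounds in it."""
    while isinstance(node, dict):
        value = descriptors.get(node["descriptor"])
        if value is None:
            node = node["missing"]  # assumed branch for a missing ('?') descriptor
        else:
            # Multiway split: k cut points define k + 1 segments.
            node = node["children"][bisect_right(node["cuts"], value)]
    return node

def predict_forest(trees, descriptors):
    """Final model output: the average of the individual tree predictions."""
    return mean(predict_tree(tree, descriptors) for tree in trees)

# Two toy trees with made-up splits and leaf averages.
tree_a = {"descriptor": "O(C)-C(CC) PLLO", "cuts": [3],
          "children": [0.9, 0.2], "missing": 0.1}
tree_b = {"descriptor": "N(CC)-C(CC) PLHI", "cuts": [2, 5],
          "children": [0.1, 0.6, 0.8], "missing": 0.3}
compound = {"O(C)-C(CC) PLLO": 2}  # the second descriptor is missing
score = predict_forest([tree_a, tree_b], compound)
```

With 50 trees instead of two, `predict_forest` returns the kind of score between 0 and 1 that the models report.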

The ChemTree parameters that gave the best results were as follows: a path length descriptor had to appear at least five times, the maximum number of segments in a multiway split was three, the p-value above which a multiway split was not considered significant was 0.99, and the number of random trees was 50. Altering these optimized parameters generally gave worse predictions, as can be observed for one of the models in Table S3. For each toxicity endpoint, 10 random training and test sets were generated from the initial set. A separate model was built for each training set, and its performance was evaluated on the corresponding test set. This random subsampling was also used to estimate cross-validation performance (Table S4). The best of the 10 models was selected based on specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC). The resulting binary QSAR models predict values in a range between 0 and 1. A value of 0.5 was the threshold between negative and positive, at which no decision is made: compounds with a score above 0.5 were considered toxic, and those with a score below 0.5 were considered non-toxic.
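Two of the settings above lend themselves to a short sketch: the minimum descriptor occurrence and the 0.5 threshold. The function names are hypothetical, and counting occurrences per (descriptor, path length) combination is an assumption about how the "at least five times" rule is applied.

```python
from collections import Counter

def frequent_descriptors(compound_descriptors, min_occurrence=5):
    """Keep (descriptor, path length) combinations seen in at least min_occurrence compounds."""
    counts = Counter(
        item for descriptors in compound_descriptors for item in descriptors.items()
    )
    return {item for item, n in counts.items() if n >= min_occurrence}

def classify(score, threshold=0.5):
    """Scores above the threshold are toxic, below it non-toxic; exactly at it, no decision."""
    if score > threshold:
        return "toxic"
    if score < threshold:
        return "non-toxic"
    return "undecided"

# Six toy compounds: one combination occurs in five of them, another in two.
compounds = [{"O(C)-C(CC) PLLO": 2} for _ in range(5)]
compounds[0]["N(CC)-C(CC) PLHI"] = 4
compounds.append({"N(CC)-C(CC) PLHI": 4})
kept = frequent_descriptors(compounds)
calls = [classify(s) for s in (0.68, 0.20, 0.5)]
```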

QSAR model validation

We conducted a two-stage validation of each QSAR model. In the first stage, the test sets separated from the initial compound sets were used for the forward validation of the models. In the second stage, toxicants added to the database in the period between the initial release of the QSAR models to the MetaTox consortium and preparation of this publication (2 years) were used to evaluate the prediction of positives. To evaluate the prediction of negatives, we took all FDA-approved drugs and removed those that were used in the corresponding training sets (criteria for selecting negatives are described in Training and validation compound sets).

Model performance was evaluated using Cooper statistics parameters: specificity, sensitivity, accuracy, MCC, positive and negative predictivity, calculated according to the following formulas.

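  • $\text{Specificity} = \dfrac{TN}{TN + FP}$
  • $\text{Sensitivity} = \dfrac{TP}{TP + FN}$
  • $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
  • $\text{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
  • $\text{Positive predictivity} = \dfrac{TP}{TP + FP}$
  • $\text{Negative predictivity} = \dfrac{TN}{TN + FN}$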

where TP, true-positive; FP, false-positive; TN, true-negative; FN, false-negative.

Sensitivity is the fraction of positives that are correctly predicted, specificity is the fraction of negatives that are correctly predicted, and accuracy is the fraction of all compounds that are correctly predicted. The closer these values are to 1, the better the quality of the model. The Matthews correlation coefficient is a correlation between observed and predicted values and ranges between −1 and 1, where 1 represents perfect prediction, 0 random prediction, and −1 inverse prediction. Positive predictivity is the fraction of true-positives among all predicted positives. Negative predictivity is the fraction of true-negatives among all predicted negatives. Together, these parameters provide a good measure of QSAR model goodness-of-fit, robustness, and predictivity, in agreement with the fourth OECD principle.
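The Cooper statistics defined above can be computed from the four confusion-matrix counts as in the sketch below. The function name and the choice to return NaN for an undefined ratio are assumptions. The example counts are taken from the liver necrosis rows of Tables 2 and 3; pooling those two separately reported test sets into one confusion matrix is done here only for illustration.

```python
from math import sqrt

def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    return numerator / denominator if denominator else float("nan")

def cooper_statistics(tp, fp, tn, fn):
    """Compute the six performance measures from confusion-matrix counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
        "positive_predictivity": ratio(tp, tp + fp),
        "negative_predictivity": ratio(tn, tn + fn),
    }

# Liver necrosis, second validation: 60 of 69 toxicants and 264 of 421 negatives correct.
stats = cooper_statistics(tp=60, fn=69 - 60, tn=264, fp=421 - 264)
perfect = cooper_statistics(tp=10, fp=0, tn=10, fn=0)
```

Rounded to two decimals, `stats["sensitivity"]` and `stats["specificity"]` reproduce the 0.87 and 0.63 reported for this endpoint.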


Results

Model building and initial external validation

At the time of model building, January 2008, eight endpoints in the database relating to hepatotoxic or nephrotoxic findings had more than 100 associated compounds: hepatotoxicity and its subcategories (liver necrosis, liver relative weight gain, and liver lipid accumulation) and nephrotoxicity with its subcategories (kidney necrosis, kidney relative weight gain, and nephron injury). Quantitative structure–activity relationship models were built for each endpoint as described, and their initial external test set performance characteristics are presented in Table 1.

Table 1.   Results of initial validation of the QSAR models

| QSAR model | Training set (no. of cpds) (a) | Test set (no. of cpds) | Sensitivity | Specificity | Accuracy | MCC | Positive predictivity | Negative predictivity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Liver necrosis | 300 | 57 | 0.91 | 0.91 | 0.91 | 0.82 | 0.94 | 0.88 |
| Liver relative weight gain | 305 | 54 | 1. | | | | | |
| Liver lipid accumulation | 172 | 28 | 0.8 | 0.85 | 0.82 | 0.64 | 0.86 | 0.79 |
| Kidney necrosis | 221 | 42 | 0.96 | 1.00 | 0.98 | 0.95 | 1.00 | 0.94 |
| Kidney relative weight gain | 240 | 49 | 0.95 | 1.00 | 0.98 | 0.96 | 1.00 | 0.97 |
| Nephron injury | 598 | 109 | 0.91 | 1.00 | 0.96 | 0.93 | 1.00 | 0.94 |

  1. MCC, Matthews correlation coefficient; QSAR, quantitative structure–activity relationship.

  2. (a) Training and test sets include approximately 50% positives (active compounds) and 50% negatives (presumed inactive compounds).

Most of the models showed good sensitivity, and specificity above 90%, except hepatotoxicity and liver lipid accumulation. For the hepatotoxicity model, this is likely due to its highly diverse training set of over a thousand compounds. The liver lipid accumulation model, on the other hand, has the fewest positive compounds. Both positive and negative predictivity are typically above 80%, indicating a good ability of the models to discriminate true-positives and true-negatives, respectively. In general, MCC shows that models for subcategories of hepatotoxicity and nephrotoxicity (liver necrosis, liver relative weight gain, kidney necrosis, kidney relative weight gain, and nephron injury) performed better, and had a higher correlation between observed and predicted values, than the general toxicity models (hepatotoxicity and nephrotoxicity). This reflects the well-known tendency of QSAR models to perform better for more precise endpoints, likely because fewer mechanisms cause the toxicity and thus fewer structure-function relationships are possible (footnote c). Cross-validation errors from 10-fold random subsampling cross-validation (Table S4) indicate that the models perform better in the prediction of negatives, while errors in positive predictions are generally greater.

Secondary external validation

In October 2010, we conducted a second external validation of the QSAR models using compounds whose toxicity was annotated between January 2008 and October 2010. About 300 novel compound-toxicity associations for hepatotoxicity and nephrotoxicity were annotated in this 2-year period (Table 2).

Table 2.   Results of the second validation for toxic agents (positives)

| QSAR model | Number of correctly predicted toxicants | Total number of toxicants in the second test set (a) | Fraction of correct positive predictions (sensitivity) |
| --- | --- | --- | --- |
| Liver necrosis | 60 | 69 | 0.87 |
| Liver relative weight gain | 102 | 118 | 0.86 |
| Liver lipid accumulation | 43 | 57 | 0.75 |
| Kidney necrosis | 41 | 51 | 0.80 |
| Kidney relative weight gain | 57 | 66 | 0.86 |
| Nephron injury | 117 | 147 | 0.80 |

  1. (a) Number of new toxicants annotated for the given toxic pathology over the 2 years following initial model release.

Results of the second validation exhibited the same phenomenon observed during initial validation. Quantitative structure–activity relationship models for subcategories outperformed general models in prediction of true-positives (Table 2). For example, kidney necrosis, kidney relative weight gain, and nephron injury QSAR models correctly predicted over 80% of the newly annotated toxic compounds, while the general nephrotoxicity model correctly identified only 66%.

Correct prediction of negative compounds is critical for the evaluation of model performance, yet poses a challenge, because a compound’s toxicity might not have been discovered or reported at the time of testing. Thus, a positive QSAR model prediction may be accurate and still have to be treated as a false-positive. For this reason, to evaluate the specificity of the liver pathology models, we used FDA-approved drugs that had no annotations for hepatotoxicity in the Thomson Reuters database and that did not participate in the initial training or test sets (421 drugs in total). The assumption is that these drugs had been thoroughly evaluated by the scientific community and our database curators and had not been found to be positive. Similarly, FDA-approved drugs that did not have any associated nephrotoxicity and did not participate in the initial training or test sets were used to validate the kidney pathology models (646 drugs). As expected, specificity in the second validation was lower than sensitivity: all models correctly identified <80% of negatives (Table 3).

Table 3.   Results of the second validation for non-toxic agents (negatives)

| QSAR model | Number of correctly predicted negatives | Total number of non-toxicants in the second test set (a) | Fraction of correct negative predictions (specificity) |
| --- | --- | --- | --- |
| Liver necrosis | 264 | 421 | 0.63 |
| Liver relative weight gain | 310 | 421 | 0.74 |
| Liver lipid accumulation | 253 | 421 | 0.60 |
| Kidney necrosis | 433 | 646 | 0.67 |
| Kidney relative weight gain | 502 | 646 | 0.78 |
| Nephron injury | 493 | 646 | 0.76 |

  1. (a) Total number of FDA drugs that do not have annotated hepatotoxicity (for validation of liver pathology models) or nephrotoxicity (for validation of kidney pathology models) and did not participate in initial training or test sets.

Dependence of QSAR model sensitivity on structural similarity

As QSAR models are limited by the chemical space of the training sets from which they were derived, it is important to estimate the applicability domain of QSAR models for successful extrapolation of their predictions. There are a number of ways to measure the applicability domain of QSAR models (22), and well-studied approaches exist that relate the applicability of the model to the structural similarity of the analyzed compound to the training set structures (23,24). Here, we took a slightly different approach to QSAR model reliability by estimating the dependence of QSAR model sensitivity on the structural similarity (in percent) of the test set molecules to the most similar compounds in the training set. This approach is implemented in the software as a Tanimoto Prioritization score. Structural similarity is based on 2D fingerprints as implemented in the Accord Chemistry Cartridge (Accelrys, San Diego, CA, USA). It was previously shown that the descriptors used to calculate similarity do not have to be the same as the descriptors used for QSAR (23).

For these calculations, we used the test set of positive toxicants from the second validation (Table 2). The Tanimoto Prioritization score was calculated for all the toxicants from the second validation test set, and the compounds were split into similarity ranges from 20% (the most dissimilar) to 99% (almost identical). For each structural similarity range, the ratio of correct toxicant predictions (sensitivity) was calculated. This approach is similar to the one implemented by Tong et al. (25), where the applicability domain was measured as a dependence of accuracy on the confidence interval. The results are summarized in Table 4. On the basis of this score, a user may choose not to consider QSAR model predictions in the case of a very dissimilar compound. For the majority of models, however, the ratio of correctly predicted toxicants does not decrease appreciably with a decline in structural similarity between the test and training set compounds, so the predictions are quite robust even when the query compound has low similarity to the training set. This can be attributed to the broad range of compounds comprising the training sets. It must be noted that because the structural similarity descriptors are different from the QSAR descriptors, this approach cannot be a definitive measure of applicability domain per se, yet it still provides a measure of QSAR prediction reliability for dissimilar structures.
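The similarity analysis can be sketched as follows: a Tanimoto coefficient on fingerprints, the maximum similarity of a query to the training set, and sensitivity per similarity range. The fingerprints are represented here as plain sets of 'on' bit positions rather than Accord Chemistry Cartridge fingerprints, and the function names, range edges, and toy data are hypothetical. The rule of reporting n/a for fewer than five compounds follows the note to Table 4.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def max_similarity(query, training_fingerprints):
    """Similarity of a query compound to its most similar training compound."""
    return max(tanimoto(query, fp) for fp in training_fingerprints)

def sensitivity_by_range(results, edges, min_count=5):
    """results: (similarity, predicted_toxic) pairs for known toxicants.

    Returns sensitivity for each [lo, hi) range, or None (n/a) when the
    range holds fewer than min_count compounds.
    """
    table = {}
    for lo, hi in zip(edges, edges[1:]):
        hits = [predicted for similarity, predicted in results if lo <= similarity < hi]
        table[(lo, hi)] = sum(hits) / len(hits) if len(hits) >= min_count else None
    return table

training = [{1, 2, 3, 4}, {2, 3, 5}, {7, 8, 9}]
nearest = max_similarity({2, 3, 4}, training)

# Toy second-validation results: all compounds are toxicants, so hits are true-positives.
results = [(0.30, True), (0.35, True), (0.40, False), (0.45, True),
           (0.50, True), (0.55, True), (0.70, True), (0.90, False)]
table = sensitivity_by_range(results, edges=[0.2, 0.6, 1.0])
```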

Table 4.   Sensitivity of QSAR models measured as a ratio of correctly predicted toxicants to the total number of compounds in each Tanimoto Prioritization range. Only compounds from the second validation test set were considered

| QSAR model | Tanimoto prioritization ranges | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Liver necrosis | 0.40 | 0.77 | 1.00 | 0.85 | 1.00 | 1.00 | 1.00 |
| Liver relative weight gain | 1.00 | 0.81 | 0.77 | 0.72 | 1.00 | 0.92 | 0.83 |
| Liver lipid accumulation | 0.75 | 0.58 | 0.57 | 0.80 | n/a | n/a | 1.00 |
| Kidney necrosis | n/a | 0.62 | 0.71 | 1.00 | 0.85 | 0.71 | n/a |
| Kidney relative weight gain | 0.88 | 1.00 | 0.80 | 0.82 | n/a | 1.00 | n/a |
| Nephron injury | 0.83 | 0.66 | 0.79 | 0.76 | 0.64 | 0.94 | 0.82 |

  1. QSAR, quantitative structure–activity relationship; n/a, not applicable.

  2. n/a indicates fewer than five compounds in that similarity range.

QSAR predictions for known drugs

To demonstrate how QSAR models perform for some standard drugs widely used in many toxicological studies, we took drugs with well-characterized hepatotoxicity (acetaminophen, valproate, and fenofibrate) and nephrotoxicity (acetaminophen, valproate, indomethacin, and tobramycin) and evaluated them using our models (Table 5). Note that the compounds themselves were included in the initial training sets and the goal of the table is not to externally validate the models but rather demonstrate the final outcomes.

Table 5.   Quantitative structure–activity relationship predictions for a set of selected drugs with known toxicological properties for the endpoints predicted

| QSAR model | | | | | |
| --- | --- | --- | --- | --- | --- |
| Liver necrosis | 0.68 | 0.94 | 0.59 | 0.60 | 0.62 |
| Liver relative weight gain | 0.66 | 0.90 | 0.63 | 0.20 | 0.32 |
| Liver lipid accumulation | 0.55 | 0.69 | 0.37 | 0.57 | 0.43 |
| Kidney necrosis | 0.89 | 0.98 | 0.73 | 0.71 | 0.96 |
| Kidney relative weight gain | 0.67 | 0.96 | 0.41 | 0.11 | 0.94 |
| Nephron injury | 0.85 | 1.00 | 0.28 | 0.74 | 0.88 |

  1. (a) Prediction was considered true and highlighted in bold when (i) the predicted value is ≥0.5 and there is an annotation linking this drug to this toxic pathology in Thomson Reuters database and (ii) the predicted value is ≤0.5 and above-mentioned annotation is absent.

The models were able to predict the major toxic effects of acetaminophen and valproate. Fenofibrate hepatotoxicity and liver relative weight gain were also correctly identified. Kidney pathology with a number of kidney subtoxicities was also predicted correctly for indomethacin and tobramycin.


Discussion

In this study, we report the development and implementation of a number of unique QSAR models to predict organ toxicity endpoints. Our goal was to create models predicting more defined subcategories of organ toxicity (liver necrosis, liver relative weight gain, liver lipid accumulation, kidney necrosis, kidney relative weight gain, and nephron injury) and to compare their performance to models predicting general organ toxicity (hepatotoxicity and nephrotoxicity).

A PubMed literature survey of the last 20 years (Table 6) indicates a general lack of comprehensive and well-validated QSAR models that combine broad coverage of chemical space with a focus on organ-specific toxicity endpoints. Most models reported in the literature focus on a specific class of compounds (organophosphates, haloaromatic compounds, and phenols) and thus have a very small training set, a very small or absent external validation set, and narrow applicability. For many of the datasets, quantitative correlations have not been established and the correlations exist only as SARs (omitted in Table 6). Our review table, while far from comprehensive, summarizes some of the QSAR models described in the literature that focus on organ-specific toxicity endpoints. Matthews et al. (26) describe, in detail, a QSAR analysis of two toxicity endpoints (hepatobiliary and urinary tract injuries) by four different models that use different QSAR paradigms. The authors used the FDA’s postmarket adverse effect reporting system to associate a particular drug with a specific toxicity endpoint using a validated approach. Such comprehensive studies are still the exception rather than the rule, and clearly, more work needs to be done on organ toxicity models.

Table 6. Overview of published organ toxicity QSAR models

| No. | Model | Reference | Phenotype | Software/Algorithm | Training set | Specificity | Sensitivity | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | Hepatotoxicity | (33) | Drug-induced hepatotoxicity | Algorithm: K-nearest neighbors (kNN) | 127 | 0.62 | 0.56 | n/r |
| 2 | Hepatotoxicity | (33) | Drug-induced hepatotoxicity | Algorithm: Support vector machines (SVM) | 127 | 0.62 | 0.48 | n/r |
| 3 | Hepatotoxicity | (33) | Drug-induced hepatotoxicity | Algorithm: Random forest (RF) | 127 | 0.60 | 0.56 | n/r |
| 4 | Hepatotoxicity | (33) | Drug-induced hepatotoxicity | Algorithm: Distance weighted discrimination (DWD) | 127 | 0.77 | 0.45 | n/r |
| 5 | Hepatotoxicity | (27) | Drug-induced liver injury (DILI) | Software: The ISIDA/Fragmentor program; Algorithm: Substructural Molecular Fragments with hierarchical clustering and SVM approach | 951 | n/r | n/r | 55.7–72.6% |
| 6 | Hepatotoxicity | (28) | General hepatotoxicity as estimated by five different biomarkers | Software: MolConnZ (eduSoft LC, Ashland, VA) and Dragon (v.5.4, Talete SRL, Milano, Italy); Algorithm: The variable selection k-nearest neighbor (kNN) QSAR method | 490 | 88.5–96.2% | 60–87.5% | 65–73% |
| 7 | Hepatotoxicity | (34) | General hepatotoxicity | Algorithm: recursive partitioning on a combination of 1D and 2D descriptors | 376 | 75–90% | 76–78% | n/r |
| 8 | Hepatotoxicity | (35) | General hepatotoxicity as estimated by four different in vitro assays | Software: Sybyl 6.9 (TRIPOS); Algorithm: SIMCA on CoMFA | 654 | n/r | n/r | 52% |
| 9 | Hepatotoxicity | (36) | Idiosyncratic hepatotoxicity | Algorithm: Linear Discriminant Analysis (LDA) on Radial Distribution Function (RDF) descriptors | 74 | 67% | 100% | 83% |
| 10 | Hepatotoxicity | (36) | Idiosyncratic hepatotoxicity | Algorithm: Artificial Neural Networks (ANN) on RDF descriptors | 74 | 67% | 67% | 67% |
| 11 | Hepatotoxicity | (36) | Idiosyncratic hepatotoxicity | Software: Weka; Algorithm: OneR on RDF descriptors | 74 | 100% | 67% | 83% |
| 12 | Hepatotoxicity/Nephrotoxicity | (37) | Hepatobiliary and urinary tract injuries as recorded in FDA’s postmarket AE reporting system | (i) MC4PC 1.7 (MultiCASE, Inc.); (ii) BioEpisteme 2.0 (Prous Institute for Biomedical Research, S.A.); (iii) MDL-QSAR 2.2 (MDL Information Systems, Inc.); (iv) LPDM 2.4 (Leadscope, Inc.) | 1600 | 69–89% | 46–68% | n/r |
| 13 | Nephrotoxicity | (38) | Rat in vitro nephrotoxicity | Algorithm: linear regression on ELUMO | 9 | n/r | n/r | 94% (a) |
| 14 | Cardiotoxicity | (39) | hERG Inhibition (IC50) | Software: Catalyst 4.5 (Accelrys); Algorithm: 3D-QSAR conformational alignment | 15 | n/r | n/r | R2 = 0.77 (b) |
| 15 | Cardiotoxicity | (40) | hERG Inhibition | Software: Sybyl (TRIPOS); Algorithm: 3D QSAR CoMFA | 32 | n/r | n/r | 74% (a) |
| 16 | Cardiotoxicity | (41) | hERG Inhibition (IC50) | Software: Sybyl (TRIPOS); Algorithm: 3D QSAR CoMSiA | 28 | n/r | n/r | n/r |

n/r, not reported; FDA, Food and Drug Administration; QSAR, quantitative structure–activity relationship.
(a) A standard deviation between observed and predicted value was calculated.
(b) R2, correlation coefficient.

Historically, organ pathologies have been considered difficult to predict because of their complexity and the involvement of multiple distinct mechanisms. At the same time, a hepatotoxicity alert given to a compound of interest by a predictive tool would be considered too generic by a toxicologist and, in isolation, would be unlikely to affect development of the compound. Many low or mildly hepatotoxic compounds are used as drugs, and knowing the specific type of hepatotoxicity is key to making a reasonable risk assessment for the compound. Thus, we aimed to create organ toxicity models based on subcategories of organ toxicity identified histologically in mouse and rat.

A detailed ontology of toxic pathologies for 19 organs was created and annotated from the literature in a consistent way to capture precise organ toxicity associations of drugs and of industrial, environmental, and other compounds. Although we understand that an ideal training set would include animals of the same species and sex, treated using the same scenario and dosage, and examined in the same laboratory, we tried to strike a balance between the amount of information available in the public domain and the appropriate size and diversity of training sets. The following uniform criteria allowed for consistent selection of experiments for annotation: (i) ontological classification of toxic endpoints; (ii) histological examination (as opposed to toxicity evaluation based on biomarkers that could be non-specific); (iii) experiments on mouse or rat.
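The three selection criteria can be illustrated as a simple record filter. This is only a sketch: the `Experiment` record, its field names, and the example values are hypothetical and do not reflect the actual annotation schema of the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Experiment:
    """Hypothetical annotation record; field names are illustrative only."""
    compound: str
    species: str                   # e.g. "rat", "mouse", "dog"
    evidence: str                  # "histology" or "biomarker"
    ontology_term: Optional[str]   # term from the pathology ontology, if classified


ALLOWED_SPECIES = {"rat", "mouse"}


def eligible(exp: Experiment) -> bool:
    """Apply criteria (i)-(iii): ontology term, histological evidence, rodent species."""
    return (
        exp.ontology_term is not None          # (i) classified in the ontology
        and exp.evidence == "histology"        # (ii) confirmed by histological examination
        and exp.species in ALLOWED_SPECIES     # (iii) mouse or rat experiment
    )


# Made-up records for illustration.
experiments = [
    Experiment("compound A", "rat", "histology", "liver necrosis"),
    Experiment("compound B", "rat", "biomarker", "liver necrosis"),
    Experiment("compound C", "dog", "histology", "nephron injury"),
    Experiment("compound D", "mouse", "histology", None),
]
selected = [e.compound for e in experiments if eligible(e)]
```

Only the first record passes all three criteria in this toy example.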

Fourches et al. (27) used a semi-automatic extraction from MEDLINE titles and abstracts to build a training set for a QSAR model of drug-induced liver injury. The experiment selection strategy used by the authors provided discrimination by species, but included terms relating to hepatobiliary anatomy and pathology, for example, liver, hepatic, cholestasis, biliary, hepatitis, focal necrosis, as well as hepatic physiological observations such as gluconeogenesis and cell growth. Different types of liver pathologies and adverse reactions determined by a variety of methods ended up in the model, which cannot provide a defined endpoint for users (as required by the first OECD principle). Rodgers et al. (28) used a training set including drugs causing elevation in activity of liver enzymes in humans that frequently are regarded by clinicians as signs of liver damage and, if other pathological states can be excluded, are attributed to drug-induced toxicity (29). However, elevation of alkaline phosphatase, one of the enzymes considered in the study, is associated in the Thomson Reuters database with over a hundred pathology terms across multiple tissues, and also with a dozen disease states, leaving the precise endpoint uncertain, and comprising multiple mechanisms.

We used a forward validation approach to investigate model accuracy over time. Primary external validation was carried out with compounds sequestered from the initial training set. Sensitivity, specificity, and accuracy were above 90% for most of the models. Two years after the initial release of the models, new annotations were used for validation of the existing models. On average, over 80% of positive compounds and over 70% of negative compounds were correctly predicted by the models. The models performed better at predicting distinct organ toxicity subcategories than general organ toxicity, reflecting the well-known tendency of QSAR models to perform better for narrower endpoints. This is likely due to fewer mechanisms leading to toxicity, and thus fewer possible structure-activity relationships.
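For reference, the validation statistics quoted here follow the standard definitions: sensitivity is the fraction of toxic compounds predicted as toxic, specificity is the fraction of non-toxic compounds predicted as non-toxic, and accuracy is the fraction of all compounds predicted correctly. A minimal sketch, using made-up labels rather than data from this study:

```python
def confusion_counts(observed, predicted):
    """Count true/false positives and negatives for binary toxicity labels (1 = toxic)."""
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        if obs and pred:
            tp += 1          # toxic, predicted toxic
        elif obs and not pred:
            fn += 1          # toxic, predicted non-toxic
        elif not obs and not pred:
            tn += 1          # non-toxic, predicted non-toxic
        else:
            fp += 1          # non-toxic, predicted toxic
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical external test set: 4 toxic and 6 non-toxic compounds.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fn, tn, fp = confusion_counts(observed, predicted)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = accuracy(tp, fn, tn, fp)
```

In a forward validation, the same three functions would be applied first to the sequestered compounds and later to the newly annotated ones, so that the two sets of figures are directly comparable.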

The accuracy of our QSAR models for general organ toxicity (0.81 for hepatotoxicity, 0.87 for nephrotoxicity) is better than or similar to that of the models presented in Table 6. Our models show some decrease in sensitivity and specificity compared with the initial validation when tested against the second external validation set. Predictive performance remained good, however, demonstrating general robustness of the models. We continue to annotate compound-toxicity associations, and future iterations of our models will add to the training set both the compounds used here for external validation and newly added compounds. For comparison, the accuracy of the QSAR models of Fourches et al. decreased from 77.6–99.3% to 55.7–72.6% when validation was performed on an external dataset. The number of correctly predicted toxicants in the Rodgers et al. paper dropped to 33% from an initial sensitivity of 60.0–87.5% (because of the absence of toxicity reports, the authors had to treat many predicted toxicants as false-positives).

Datasets for organ toxicity, even with defined subcategories, pose a particular challenge for QSAR modeling, mostly owing to their non-linearity, meaning that there are likely several structure-function activities within a large dataset. For this reason, a recursive partitioning model was selected for its ability to handle large non-linear datasets (20,30–32). The algorithm builds decision trees where the structural descriptors derived from the compounds are used to split the nodes of the decision tree to find statistically significant correlations with a particular toxicity. Ideally, structurally similar compounds responsible for each mechanism of toxicity end up in a separate node of the decision tree, eventually representing a potential mechanism of toxicity caused by the compounds in the training set.
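The idea of recursive partitioning on structural descriptors can be sketched in a few lines. The example below is not the ChemTree implementation used in this work: it substitutes a simple Gini impurity criterion for the statistical significance tests, uses three hypothetical binary descriptors (presence or absence of a substructure), and trains on a made-up toy dataset.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = toxic, 0 = non-toxic)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (descriptor index, impurity decrease) of the best split, or None."""
    parent = gini(labels)
    n = len(labels)
    best = None
    for j in range(len(rows[0])):
        present = [y for x, y in zip(rows, labels) if x[j]]
        absent = [y for x, y in zip(rows, labels) if not x[j]]
        if not present or not absent:
            continue  # descriptor does not separate the compounds in this node
        child = (len(present) * gini(present) + len(absent) * gini(absent)) / n
        gain = parent - child
        if best is None or gain > best[1]:
            best = (j, gain)
    return best


def build_tree(rows, labels, min_size=2, depth=0, max_depth=5):
    """Recursively split a node on the descriptor that best separates the labels."""
    split = None
    if len(labels) >= min_size and depth < max_depth:
        split = best_split(rows, labels)
    if split is None or split[1] <= 1e-12:
        # Leaf node: store the fraction of toxic compounds that ended up here.
        return {"leaf": True, "p_toxic": sum(labels) / len(labels), "n": len(labels)}
    j = split[0]
    with_j = [(x, y) for x, y in zip(rows, labels) if x[j]]
    without_j = [(x, y) for x, y in zip(rows, labels) if not x[j]]
    return {
        "leaf": False,
        "descriptor": j,
        "present": build_tree([x for x, _ in with_j], [y for _, y in with_j],
                              min_size, depth + 1, max_depth),
        "absent": build_tree([x for x, _ in without_j], [y for _, y in without_j],
                             min_size, depth + 1, max_depth),
    }


def predict(tree, x):
    """Follow the splits for descriptor vector x and return the leaf's toxic fraction."""
    while not tree["leaf"]:
        tree = tree["present"] if x[tree["descriptor"]] else tree["absent"]
    return tree["p_toxic"]


# Toy data: hypothetical descriptors (nitro group, halogen, phenol).
# A compound is labeled toxic if it has a nitro group, or both a halogen and a phenol.
rows = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1),
        (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

tree = build_tree(rows, labels)
```

In this toy tree the nitro-containing compounds fall into one leaf and the halogenated phenols into another, which mirrors the expectation that compounds sharing a mechanism of toxicity collect in the same node. A production model would typically combine many such trees, for example the 50 decision trees per model referred to in the Supporting Information.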

We attempted to find the most frequent compound structural descriptors from different decision tree nodes, and establish correlations with known mechanisms of toxicity (data not shown). Our goal was to associate compound substructures (toxicophores) with particular mechanisms of toxicity and thereby provide compliance with the fifth OECD principle, which requires a mechanistic interpretation, if possible. Unfortunately, in our non-exhaustive investigation we were not able to identify any obvious associations between discriminating structural descriptors identified by the algorithm (translated into compound substructures) and toxicophores with known mechanisms of toxicity. A possible complication is that many compounds induce several types of toxicity. For example, of 143 compounds causing lipid accumulation in liver, 59 compounds also induced liver necrosis and 57 compounds induced relative weight gain of the liver. Structural features discriminating specific toxicity subtypes may therefore be subtle. This warrants further investigation to better understand the structure-activity relationships for these endpoints.
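The extent of the overlap between endpoints can be made explicit with a short calculation. The counts below are the ones quoted in the text; the helper function and the small compound sets are hypothetical and only show how such an overlap would be computed from annotated compound lists.

```python
def overlap_fraction(reference, other):
    """Fraction of compounds in `reference` that are also annotated in `other`."""
    reference, other = set(reference), set(other)
    return len(reference & other) / len(reference)


# Counts reported in the text for liver lipid accumulation (143 compounds).
lipid_total = 143
also_necrosis = 59
also_weight_gain = 57

frac_necrosis = also_necrosis / lipid_total        # about 0.41
frac_weight_gain = also_weight_gain / lipid_total  # about 0.40

# Hypothetical compound identifiers, for illustration of the set-based version.
lipid = {"c1", "c2", "c3", "c4"}
necrosis = {"c2", "c4", "c9"}
toy_fraction = overlap_fraction(lipid, necrosis)
```

Roughly two in five of the lipid accumulation compounds are thus shared with each of the other two liver endpoints, which leaves limited structural contrast for the algorithm to exploit.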

Identification of toxicophores and solving mechanisms of toxicity are difficult tasks that require application of several approaches, and consideration of compound pharmacology from different angles (chemical stability and reactivity, compound metabolism, target profile, tissue distribution, and others). We believe that we have created one of the largest (and still growing) compound datasets with precise and well-defined toxicity endpoints, and we plan to apply other computational methods and complementary information types from our compound database (metabolism annotations, target records, and physico-chemical properties) to further improve and characterize our predictive models.

Conclusions and Future Directions

Overall, we believe that the organ toxicity QSAR models described herein are in agreement with two of the five OECD principles for QSAR model validation:

  •  A defined endpoint. The endpoints were comparatively well characterized and defined; only compounds that caused toxic pathologies in rodents confirmed by microscopic evaluation were annotated and used in the training set (Principle 1).
  •  Appropriate measures of goodness-of-fit, robustness, and predictability. All of the models were validated with compounds naive to the training set, using standard statistical parameters to evaluate model accuracy (Principle 4).

In conclusion, the predictive models of organ toxicity described herein present a valuable decision support tool for the toxicologist in prioritizing chemical scaffolds or specific compounds in pharmaceutical development, or in recommending additional empirical studies to further evaluate the toxicity of poorly characterized compounds. The models can provide some insights for mechanistic interpretation because they are associated with precise histopathological observations; however, correlated structural toxicophores are yet to be identified. Future development using the approach described will include updating the training sets with newly annotated compounds and with the test set compounds used in secondary validation, and developing QSAR models for the prediction of pulmonary, nasal, testicular, epididymal, and other pathologies.

Conflict of Interest Statement

All authors are employed by Thomson Reuters Inc., which commercializes the research described herein.


References

  1. Kühnel M.P., Cosovic B., Medic G., Russell R.B., Apic G. (2008) Pathway Analysis for Drug Discovery: Computational Infrastructure and Applications. New York: John Wiley and Sons Inc.
  2. Sauer U.G. (2003) The new EU Chemicals Policy: comments of Eurogroup for Animal Welfare and the Deutscher Tierschutzbund on the EU-Commission’s REACH system Consultation Documents. ALTEX;20:225–227.
  3. Mombelli E., Devillers J. (2010) Evaluation of the OECD (Q)SAR Application Toolbox and Toxtree for predicting and profiling the carcinogenic potential of chemicals. SAR QSAR Environ Res;21:731–752.
  4. Valerio L.G. Jr, Arvidson K.B., Chanderbhan R.F., Contrera J.F. (2007) Prediction of rodent carcinogenic potential of naturally occurring chemicals in the human diet using high-throughput QSAR predictive modeling. Toxicol Appl Pharmacol;222:1–16.
  5. Devillers J., Mombelli E. (2010) Evaluation of the OECD QSAR Application Toolbox and Toxtree for estimating the mutagenicity of chemicals. Part 1. Aromatic amines. SAR QSAR Environ Res;21:753–769.
  6. Zhang Q.Y., Aires-de-Sousa J. (2007) Random forest prediction of mutagenicity from empirical physicochemical descriptors. J Chem Inf Model;47:1–8.
  7. Shoji R., Kawakami M. (2006) Prediction of genotoxicity of various environmental pollutants by artificial neural network simulation. Mol Divers;10:101–108.
  8. Gombar V.K., Enslein K., Blake B.W. (1995) Assessment of developmental toxicity potential of chemicals by quantitative structure-toxicity relationship models. Chemosphere;31:2499–2510.
  9. Votano J.R., Parham M., Hall L.H., Kier L.B., Oloff S., Tropsha A., Xie Q., Tong W. (2004) Three new consensus QSAR models for the prediction of Ames genotoxicity. Mutagenesis;19:365–377.
  10. Leong M.K., Lin S.W., Chen H.B., Tsai F.Y. (2010) Predicting mutagenicity of aromatic amines by various machine learning approaches. Toxicol Sci;116:498–513.
  11. Valerio L.G., Yang C., Arvidson K.B., Kruhlak N.L. (2010) A structural feature-based computational approach for toxicology predictions. Expert Opin Drug Metab Toxicol;6:505–518.
  12. Snyder R.D., Pearl G.S., Mandakas G., Choy W.N., Goodsaid F., Rosenblum I.Y. (2004) Assessment of the sensitivity of the computational programs DEREK, TOPKAT, and MCASE in the prediction of the genotoxicity of pharmaceutical molecules. Environ Mol Mutagen;43:143–158.
  13. Valerio L.G. Jr (2009) In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol;241:356–370.
  14. Liu H., Papa E., Gramatica P. (2006) QSAR prediction of estrogen activity for a large set of diverse chemicals under the guidance of OECD principles. Chem Res Toxicol;19:1540–1548.
  15. Pavan M., Netzeva T.I., Worth A.P. (2006) Validation of a QSAR model for acute toxicity. SAR QSAR Environ Res;17:147–171.
  16. Saliner A.G., Netzeva T.I., Worth A.P. (2006) Prediction of estrogenicity: validation of a classification model. SAR QSAR Environ Res;17:195–223.
  17. Simon-Hettich B., Rothfuss A., Steger-Hartmann T. (2006) Use of computer-assisted prediction of toxic effects of chemical substances. Toxicology;224:156–162.
  18. Cronin M.T. (2002) The current status and future applicability of quantitative structure-activity relationships (QSARs) in predicting toxicity. Altern Lab Anim;30 (Suppl. 2):81–84.
  19. Brennan R.J. (2008) Fine-tuning compound safety assessments: facilitating comprehensive systems toxicology data analysis. Genet Eng News;28:34–35.
  20. Rusinko A. III, Farmen M.W., Lambert C.G., Brown P.L., Young S.S. (1999) Analysis of a large structure/biological activity data set using recursive partitioning. J Chem Inf Comput Sci;39:1017–1026.
  21. Adamson G.W., Lynch M.F., Town W.G.J. (1971) Analysis of structural characteristics of chemical compounds in a large computer-based file. 2. Atom-centred fragments. J Chem Soc C;22:3702–3706.
  22. Netzeva T.I., Worth A., Aldenberg T., Benigni R., Cronin M.T., Gramatica P., Jaworska J.S. et al. (2005) Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52. Altern Lab Anim;33:155–173.
  23. Sheridan R.P., Feuston B.P., Maiorov V.N., Kearsley S.K. (2004) Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J Chem Inf Comput Sci;44:1912–1928.
  24. Dimitrov S., Dimitrova G., Pavlov T., Dimitrova N., Patlewicz G., Niemela J., Mekenyan O. (2005) A stepwise approach for defining the applicability domain of SAR and QSAR models. J Chem Inf Model;45:839–849.
  25. Tong W., Xie Q., Hong H., Shi L., Fang H., Perkins R. (2004) Assessment of prediction confidence and domain extrapolation of two structure-activity relationship models for predicting estrogen receptor binding activity. Environ Health Perspect;112:1249–1254.
  26. Matthews E.J., Kruhlak N.L., Weaver J.L., Benz R.D., Contrera J.F. (2004) Assessment of the health effects of chemicals in humans: II. Construction of an adverse effects database for QSAR modeling. Curr Drug Discov Technol;1:243–254.
  27. Fourches D., Barnes J.C., Day N.C., Bradley P., Reed J.Z., Tropsha A. (2010) Cheminformatics analysis of assertions mined from literature that describe drug-induced liver injury in different species. Chem Res Toxicol;23:171–183.
  28. Rodgers A.D., Zhu H., Fourches D., Rusyn I., Tropsha A. (2010) Modeling liver-related adverse effects of drugs using k-nearest neighbor quantitative structure-activity relationship method. Chem Res Toxicol;23:724–732.
  29. Jaeschke H., Gores G.J., Cederbaum A.I., Hinson J.A., Pessayre D., Lemasters J.J. (2002) Mechanisms of hepatotoxicity. Toxicol Sci;65:166–176.
  30. Hawkins D.M.Y., Young S.S., Rusinko A. (1997) Analysis of a large structure-activity data set using recursive partitioning. Quant Struct-Act Relat;16:296–302.
  31. Tong W., Hong H., Fang H., Xie Q., Perkins R. (2003) Decision forest: combining the predictions of multiple independent decision tree models. J Chem Inf Comput Sci;43:525–531.
  32. Young S.S., Gombar G.V., Emptage M.R., Cariello N.F., Lambert C. (2002) Mixture deconvolution and analysis of Ames mutagenicity data. Chem Int Lab Syst;60:5–11.
  33. Low Y., Uehara T., Minowa Y., Yamada H., Ohno Y., Urushidani T., Sedykh A. et al. (2011) Predicting drug-induced hepatotoxicity using QSAR and toxicogenomics approaches. Chem Res Toxicol;24:1251–1262.
  34. Cheng A., Dixon S.L. (2003) In silico models for the prediction of dose-dependent human hepatotoxicity. J Comput Aided Mol Des;17:811–823.
  35. Clark R.D., Wolohan P.R., Hodgkin E.E., Kelly J.H., Sussman N.L. (2004) Modelling in vitro hepatotoxicity using molecular interaction fields and SIMCA. J Mol Graph Model;22:487–497.
  36. Cruz-Monteagudo M., Cordeiro M.N., Borges F. (2008) Computational chemistry approach for the early detection of drug-induced idiosyncratic liver toxicity. J Comput Chem;29:533–549.
  37. Matthews E.J., Ursem C.J., Kruhlak N.L., Benz R.D., Sabate D.A., Yang C., Klopman G. et al. (2009) Identification of structure-activity relationships for adverse effects of pharmaceuticals in humans: part B. Use of (Q)SAR systems for early detection of drug-induced hepatobiliary and urinary tract toxicities. Regul Toxicol Pharmacol;54:23–42.
  38. Jolivette L.J., Anders M.W. (2002) Structure-activity relationship for the biotransformation of haloalkenes by rat liver microsomal glutathione transferase 1. Chem Res Toxicol;15:1036–1041.
  39. Ekins S., Crumb W.J., Sarazan R.D., Wikel J.H., Wrighton S.A. (2002) Three-dimensional quantitative structure-activity relationship for inhibition of human ether-a-go-go-related gene potassium channel. J Pharmacol Exp Ther;301:427–434.
  40. Cavalli A., Poluzzi E., De Ponti F., Recanatini M. (2002) Toward a pharmacophore for drugs inducing the long QT syndrome: insights from a CoMFA study of HERG K(+) channel blockers. J Med Chem;45:3844–3853.
  41. Pearlstein R.A., Vaz R.J., Kang J., Chen X.L., Preobrazhenskaya M., Shchekotikhin A.E., Korolev A.M. et al. (2003) Characterization of HERG potassium channel inhibition using CoMSiA 3D QSAR and homology modeling approaches. Bioorg Med Chem Lett;13:1829–1835.

Supporting Information

Figures S1–S8. Common substructures identified by the JKlustor 5.9.0 utility from ChemAxon for QSAR models ‘Hepatotoxicity’, ‘Liver Weight Gain’, ‘Liver Necrosis’, ‘Liver Lipid Accumulation’, ‘Nephrotoxicity’, ‘Nephron Injury’, ‘Kidney Weight Gain’, and ‘Kidney Necrosis’, respectively.

Table S1. Example of all descriptors for acetaminophen.

Table S2. The lowest and the highest number of leaves among 50 decision trees in each model.

Table S3. Effect of different ChemTree parameters on QSAR predictions using Liver relative weight gain model as an example.

Table S4. Cross-validation error of correctly predicted toxicants averaged over 10 random subsamples of training and test sets ± standard deviation.

Table S5. Number of times each descriptor is used in 50 trees for each toxicity model.

