Construction and Consensus Performance of (Q)SAR Models for Predicting Phospholipidosis Using a Dataset of 743 Compounds

Authors

  • Amabel M. Orogo,

    1. U.S. Food and Drug Administration, Center for Drug Evaluation and Research, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA
  • Sydney S. Choi,

    1. U.S. Food and Drug Administration, Center for Drug Evaluation and Research, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA
  • Barbara L. Minnier,

    1. U.S. Food and Drug Administration, Center for Drug Evaluation and Research, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA
  • Naomi L. Kruhlak

    Corresponding author
    1. U.S. Food and Drug Administration, Center for Drug Evaluation and Research, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA

Abstract

Drug-induced phospholipidosis (PLD) continues to be a safety concern for pharmaceutical companies and regulatory agencies, prompting the FDA/CDER Phospholipidosis Working Group to develop a database of PLD findings that was recently expanded to contain a total of 743 compounds (385 positive and 358 negative). Three commercial (quantitative) structure-activity relationship [(Q)SAR] software platforms [MC4PC, Leadscope Predictive Data Miner (LPDM), and Derek for Windows (DfW)] were used to build and/or test models with the database and evaluated individually and together for their ability to predict PLD induction. Models constructed with MC4PC showed improved sensitivity over previous models constructed with an earlier version of the database and software (61.2 % vs. 50.0 %), but lower specificity in cross-validation experiments (58.2 % vs. 91.9 %) due in part to the more balanced ratio of positives to negatives in the training set. A new model created with LPDM gave good cross-validation statistics (79.0 % sensitivity, 78.0 % specificity) and the single DfW structural alert for PLD was found to have high positive predictivity (83.3 %) but low sensitivity (10.4 %) when tested with the entire PLD database. Combining the predictions of MC4PC, LPDM and/or DfW resulted in increased sensitivity and coverage over using one software platform alone, although it did not enhance the overall prediction accuracy beyond that of the best performing individual software platform. The comparison across software platforms, however, facilitated the identification and analysis of chemicals that were consistently predicted incorrectly by all platforms.
The prevalence of cationic amphiphilic drug (CAD) structural motifs in the database contributed heavily to many of the structural alerts and discriminating features in the models, but the subset of incorrectly predicted structures across all models underscores the need to account for mitigating features and/or additional filtering criteria to assess PLD, in particular for PLD-inducing non-CADs and non-PLD-inducing CADs. (Q)SAR tools may be used as part of an early screening battery or regulatory risk assessment approach to identify those compounds with the greatest chance of inducing PLD and potentially toxicity.

Disclaimer

The findings and conclusions in this article have not been formally disseminated by the Food and Drug Administration and should not be construed to represent any Agency determination or policy.

1 Introduction

Drug-induced phospholipidosis (PLD) is a condition characterized by phospholipid and drug accumulation within cells, and primarily occurs in lysosomes. PLD occurs in multiple tissue types and ultrastructural changes examined through electron microscopy are the hallmark feature.1 There are two hypotheses that attempt to explain the mechanism underlying PLD: (1) The drug itself may directly inhibit or interact with phospholipase function in the lysosome, or (2) drug and phospholipids bind to form drug-phospholipid complexes that cannot be broken down by phospholipases, which then accumulate and are stored in the form of lamellar bodies.2 PLD induction has been found to be dose-dependent in vivo and concentration-dependent in vitro, and is generally reversible upon termination of drug exposure.3 PLD is primarily thought to be an adaptive response to drug exposure and part of a detoxification mechanism that sequesters the drug in the lamellar bodies to reduce potential toxicity within the cell.4

Despite the lack of evidence definitively linking PLD to organ toxicity, it remains an area of concern for both regulatory agencies and the pharmaceutical industry. The U.S. Food and Drug Administration (FDA) Center for Drug Evaluation and Research (CDER) Phospholipidosis Working Group (PLWG) was established to develop guidance on the regulatory implications of PLD and has created a database of PLD findings from published literature and drug review submissions to gain a better understanding of PLD and its toxicological consequences.

Cationic amphiphilic drugs (CADs) containing hydrophilic positively charged amines and hydrophobic cyclic groups belonging to various drug classes have been associated with PLD,5 leading to the hypothesis that the presence of particular chemical structural features may be related to PLD induction. The presence of such structural motifs has also been linked to pharmacological promiscuity, causing a variety of other undesirable off-target effects, as seen with many CNS drugs through interactions with aminergic receptors.6 While it is generally accepted that a CAD feature may serve as a marker for PLD activity, it is also acknowledged that there are drugs that contain this fingerprint that do not cause PLD. Therefore, it is of interest to the scientific community to develop other approaches that can serve to complement or refine the CAD methodology for predicting PLD and potentially identify mitigating factors of this effect. In silico tools lend themselves to this task due to their ability to interpret large data sets of chemical structures and identify statistical and biological correlations within. In particular, these tools offer the ability to consider the effects of multiple contributing factors to better explain the observed experimental activity.

In silico tools have already found widespread application in assessing the toxicological potential of drug candidates for lead compound selection by the pharmaceutical industry,7 and in some cases have replaced the use of in vitro screens, particularly for PLD.8 In a drug regulatory setting, in silico tools may be used for the safety assessment of certain classes of chemicals, such as for assessing the genotoxic potential of drug impurities and degradants.9,10 FDA/CDER has developed databases and quantitative structure-activity relationship (QSAR) models for various toxicity and adverse event endpoints of relevance to drug safety,11 including those for the prediction of genetic toxicity12,13 and rodent carcinogenicity.14,15 FDA/CDER extended these efforts to assess PLD using a preliminary version of the PLWG database to generate a training set of 583 compounds and QSAR models created with the commercial software programs MC4PC and MDL-QSAR.16 These models showed moderate overall predictive performance, but with an emphasis on specificity rather than sensitivity.

Previous efforts to model PLD using in silico approaches have been reported with more limited data sets. An investigation of calculated physicochemical properties ClogP and pKa was conducted by Ploemen et al. using 41 non-proprietary compounds and an equation was derived from the observation that PLD-inducing drugs had high pKa (highly ionized amine) and high ClogP (hydrophobic) characteristics.17 The equation was used as a simplified method for predicting PLD-inducing potential, and a compound that exceeded the threshold value of 90 was classified as having a greater propensity to induce PLD. A subsequent evaluation of the Ploemen model by Tomizawa et al. using a test set of 33 compounds found that the equation was unable to predict the PLD-inducing potential of zwitterions.18 A modification of the prediction method by substituting pKa with net charge (NC) was found to better represent ionization of compounds in organelles and improve predictive performance. NC was calculated at pH 4.0, which is close to the pH in lysosomes (pH<5), and compounds with high ClogP (>1) and high NC (1≤NC≤2) were classified as positive in this investigation. The model was found to have improved predictive performance over the Ploemen model using the initial 33 compounds in addition to an external validation set of 30 compounds. A further study of the Ploemen model by Pelletier et al. used a set of 201 proprietary and non-proprietary compounds and found that a simple modification of the rules, by lowering the threshold to 50, improved the model’s concordance.19 In addition, a newer Bayesian model was built by Pelletier et al. on a subset of 125 compounds, with additional descriptors that were chosen based on their chemistry and empirical data. The inclusion of descriptors such as amphiphilic moment, number of basic and acidic centers, and a proprietary structural fingerprint, resulted in improved sensitivity and negative predictive value.
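As a concrete illustration, the Ploemen and Tomizawa decision rules described above can be sketched in a few lines of Python. The quadratic form pKa² + ClogP² is an assumption based on how the Ploemen model is commonly cited (the text above states only the threshold values of 90 and, per Pelletier et al., 50); the Tomizawa criteria (ClogP > 1, 1 ≤ NC ≤ 2) are taken directly from the text.

```python
def ploemen_positive(pka, clogp, threshold=90.0):
    """Ploemen-style rule sketch: flag a compound as a likely PLD inducer
    when a combined pKa/ClogP score exceeds a threshold (90; lowered to 50
    by Pelletier et al.). The sum-of-squares form is an assumption."""
    return pka ** 2 + clogp ** 2 >= threshold


def tomizawa_positive(net_charge, clogp):
    """Tomizawa modification: net charge (NC) calculated at pH 4.0 replaces
    pKa; positive when ClogP > 1 and 1 <= NC <= 2, as stated in the text."""
    return clogp > 1 and 1 <= net_charge <= 2
```

A zwitterion with a net charge near zero at pH 4.0 is called negative by the Tomizawa rule even if a pKa-based score would exceed the Ploemen threshold, which is the failure mode the modification was designed to address.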

Recently, a set of 103 proprietary and non-proprietary compounds (53 positive and 50 negative) was used to develop a model that predicted the occurrence of PLD by considering pharmacokinetic parameters.20 Because tissue distribution is an important factor in PLD and drug accumulation, the volume of distribution (Vd) was combined with the physicochemical parameters pKa and ClogP to derive the equation. Compounds were predicted positive if pKa×ClogP×Vd≥180 and negative if pKa×ClogP×Vd<180. Applying this modified model to the data set resulted in higher concordance in predicting PLD compared to using either the Ploemen or Tomizawa method alone.
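The Vd-based rule reduces to a one-line predicate. The function below is a direct transcription of the inequality in the text; the assumption that Vd is expressed in L/kg is ours, as the text does not state units.

```python
def pld_positive_vd(pka, clogp, vd, threshold=180.0):
    """Positive PLD call when pKa * ClogP * Vd meets or exceeds the
    threshold of 180 described in the text (Vd assumed in L/kg)."""
    return pka * clogp * vd >= threshold
```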

Amphiphilicity is another physicochemical parameter that has shown discriminating power in in silico PLD screening.8 A program called CAFCA (CAlculated Free energy of amphiphilicity of small Charged Amphiphiles) was developed to calculate the amphiphilic moment of molecules based on their lowest energy 3D conformations.21 This value is expressed in terms of free energy and defined as the vector sum of the distance between the charged and hydrophobic/hydrophilic portions of the molecule. Calculations of amphiphilicity and pKa have been used in drug lead optimization to predict the potential of compounds to induce PLD in vitro, and compounds with a pKa of less than 7 and a free energy of amphiphilicity greater than −6 kJ/mol may be considered to have low risk of PLD.
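The low-risk screen at the end of this paragraph can likewise be expressed as a predicate. The cutoffs (pKa < 7, free energy of amphiphilicity > −6 kJ/mol) are those quoted above; CAFCA itself would supply the amphiphilicity value from a 3D conformation.

```python
def low_pld_risk(pka, delta_g_amphiphilicity):
    """Low PLD risk per the amphiphilicity screen described in the text:
    basic center only weakly ionized (pKa < 7) and free energy of
    amphiphilicity above -6 kJ/mol (i.e. weakly amphiphilic).
    delta_g_amphiphilicity is in kJ/mol."""
    return pka < 7 and delta_g_amphiphilicity > -6.0
```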

Structure-activity relationship (SAR) modeling of phospholipidosis-inducing potential using machine learning techniques has been reported by Ivanciuc22 and Lowe et al.,23 based on non-proprietary data sets of 117 and 185 compounds, respectively. Ivanciuc utilized Weka machine learning algorithms and found that the best predictions were obtained with support vector machines (SVM), followed by perceptron artificial neural network, logistic regression, and k-nearest neighbors. Lowe et al. reported similar performance for SVM and random forest models, but noted improved performance when using circular fingerprints alone over other molecular descriptor combinations. Further work by Lowe et al.24 examined the mechanistic basis of PLD, and described machine-learning methods to predict the specific molecular targets that CADs may interact with, potentially inducing PLD. A comprehensive review by Ratcliffe25 provides a detailed overview of the relative performance and other characteristics of many of the modeling approaches described above.

The current report describes the construction of new QSAR models using an enhanced version of the CDER PLWG database and different modeling tools and methodologies than previously investigated. Commercially available global QSAR modeling software that utilize predominantly molecular descriptor and fragment-based statistical algorithms were explored, and in all cases selected physicochemical parameters were also applied by the software as part of the modeling process. The PLWG has facilitated access to new, proprietary PLD data donated by pharmaceutical companies, which when combined with existing data from FDA/CDER archives and the published literature yields a much larger PLD database for QSAR modeling. A notable focus in the current investigation was to increase the quality of negative data in the database by using a comprehensive search strategy to eliminate false negative training set compounds. In addition, an effort was made to balance the large number of positive findings to offer discriminating power during statistical analysis of structural features associated with PLD. This investigation evaluates different commercial (Q)SAR software (Leadscope Predictive Data Miner [LPDM] and Derek for Windows [DfW]) than in previous modeling efforts reported by Kruhlak et al. with a smaller data set.16 This is aimed at performing a comprehensive assessment of the strengths and weaknesses of widely used software platforms that offer prediction explainability to determine whether they can be used in a complementary manner to yield a more robust prediction of PLD induction.

2 Experimental

2.1 Data Sources

The first version of the CDER PLWG QSAR modeling database, described previously by Kruhlak et al. in 2008,16 contained 583 chemicals of which 190 were classed as PLD positive and 393 were classed as PLD negative. The positive classification was based on the presence of electron microscopy (EM)-confirmed PLD or the presence of foamy macrophages as documented in FDA internal archives and the published literature. Of the negative compounds, only 39 were EM-confirmed; the remainder were “assumed” negatives which were randomly selected from a database of marketed drugs based on the absence of documented PLD findings.

The enhanced modeling database used in this investigation is based on a comprehensive collection of clinical and preclinical PLD findings, including ADME data, results of toxicology studies from multiple animal species, and physicochemical properties. In addition, the majority of previously used negative compounds were eliminated in favor of those identified using a more comprehensive search strategy. To expand upon the earlier CDER PLD database, a rigorous keyword search for phospholipidosis synonyms was conducted across the published literature, internal FDA archives of IND and NDA submissions, and PharmaPendium. The presence of a subset of keywords, such as phospholipid accumulation or foamy macrophages, indicated a PLD-positive compound, while the absence of keywords suggested a PLD-negative compound. These compounds were further classified into high and medium confidence categories based upon the types of keywords found and the source of the data. For example, keywords that relate to electron microscopy confirmation of PLD were considered high confidence, whereas those relating to only the presence of foamy macrophages were considered of medium confidence. High confidence keywords and phrases used were: phospholipidosis, phospholipid accumulation, lamellar bodies, myeloid bodies, myelinoid bodies, myelin figure, myelin-like structure, and myelinosome. Medium confidence keywords and phrases used were: foamy macrophages, cytoplasmic vacuolation, cytoplasmic granules, lipidosis, dyslipidosis, and histiocytosis. For negatives, compounds with an absence of PLD keywords in NDA documents were considered of high confidence as compared to those based solely on a search of INDs, due to the level of testing information available in these documents. This data confidence rating refers to the reliability of the PLD score, not the potency of the compound in inducing PLD.

The final enhanced QSAR training database contained a total of 743 compounds, consisting of 385 positives (52 %) and 358 negatives (48 %). The names, activities, structures (as SMILES) and data confidence rating (expressed as high or medium) for all nonproprietary compounds used in this investigation are provided in Supporting Information Table SI-1. The data confidence rating is included so that others using the data can apply more conservative inclusion criteria for modeling if they desire.

2.2 Chemical Structures

Electronic representations of the chemical structures in the training data set were created in molfile format. Structural inclusions such as salts, hydrates, and simple counterions were stripped, leaving a single molecular entity in each molfile for modeling purposes. Initially, all compounds were compiled into one training set to be input into each of the QSAR software programs (MC4PC and LPDM) and as a testing set for DfW. MC4PC processes a text file that directs the software to the location of individual molfiles. For LPDM and DfW, the preferred input file type is a single structure data file (sdf), which was generated from the individual molfiles using ConSystant version 3.2 and Mol2SDF. Certain structures required further changes for compatibility with DfW: charge-separated atom pairs that had previously been converted to their neutral forms for compatibility with the other modeling software used in this investigation were reinstated.

Once the structures were imported, models were generated using MC4PC and LPDM, and optimized by varying the parameters specific to each software program, described below. The entire training data set was also run against the single PLD alert in DfW. A second training data set was constructed from only the high confidence findings in the database using the same method and was used to generate additional models for comparison purposes.

Due to processing limitations within the QSAR software used in this investigation, inorganic chemicals, simple salts, mixtures, and high molecular weight compounds (peptides, polysaccharides, proteins, polymers, etc.) were not included in the models.

2.3 MC4PC Modeling

MC4PC version 2.1.0.14 was obtained by the FDA through a Cooperative Research and Development Agreement (CRADA) and subsequently a Research Collaboration Agreement (RCA) with MultiCASE Inc. The program uses a molecular fragment-based algorithm that breaks up the training set molecules into all possible 2- to 10-atom fragments that serve as descriptors for QSAR modeling. MC4PC then identifies biophores or structural alerts, which are fragments primarily associated with active compounds, and modulators, which are molecular properties that contribute to differences in the activity of chemicals that share a common structural alert. The software uses a proprietary algorithm along with data from the training set to develop a QSAR model that may be used to estimate the potential toxicity of new compounds run against the model.26 Test compounds are screened for the presence of 2- or 3-atom fragments that are not represented in the model. These fragments are classified as “unknown fragments” and can lower the confidence in a test chemical prediction. Default predictions with the software can be filtered using “strict” or “relaxed” criteria. Strict criteria return a “no call” for any test compound with an unknown fragment, as well as any compound classified as inconclusive negative “(−)”, or inconclusive positive “(+)”. Relaxed criteria return a “no call” for test compounds with an unknown fragment but consider inconclusive negative “(−)”, or inconclusive positive “(+)” predictions as negative and positive, respectively. For both strict and relaxed criteria, borderline “m” predictions are excluded from performance statistical analyses. FDA/CDER has also co-developed expert rules to refine the process of identifying valid structural alerts by giving the highest weight to those derived from the largest clusters of training set chemicals.14 These rules differ from the default criteria in some cases and are automatically applied in the program to provide “expert” predictions. 
Predictions using all three methods were generated for comparison purposes in this report.

MC4PC translates molfiles into SMILES codes and converts input data into activity scores ranging from 10–19 (negative), 20–29 (marginal), and 30–80 (positive). The PLD modeling data set was manually scored according to this scheme prior to being imported. The negative compounds were uniformly assigned activity scores of 10 in all models. In contrast, all positives were assigned activity scores of either 35 (low range) or 80 (high range). Separate models were created with the 10/35 and 10/80 data sets, consisting of the entire database as well as the higher confidence subset, to determine whether the scoring method would affect predictive performance.

Test compounds containing two or more 2- to 3-atom fragments not represented in the training set are considered poorly covered and outside the model’s domain of applicability. Positively predicted compounds with poor coverage are still considered positive because the presence of unknown fragments in other parts of the test molecule does not negate the presence of the structural alert. However, negatively predicted compounds with poor coverage are excluded from the statistics and given a “no call” prediction. The software assigns a “marginal” (equivocal) prediction if it is unable to determine whether a compound is more likely to be positive or negative.

2.4 LPDM Modeling

LPDM version 2.4 was provided to the FDA under a CRADA and subsequently a RCA with Leadscope, Inc. The program has a built-in library of over 27,000 structural features commonly found in small molecule drug candidates and these predefined features serve as a knowledge base for classifying chemical structures. The program searches for commonly occurring substructures in the training set that discriminate for biological response or membership in a particular group. The macrostructure assembly algorithm reassembles the initial set of building blocks (structural fragments) to produce larger substructures (scaffolds) that are components used to predict for activity when building a model.27 The QSAR model builder within LPDM uses partial logistic regression to create QSAR models from binary data sets. The prediction results are given as probabilities and the user can manually set the probability threshold value for defining a compound as positive or negative.

There are several statistics that contribute to determining the significance of model features. Multiplying weights by the population of the feature yields loadings, and total positive loadings and total negative loadings indicate how much a model feature contributed to predicting a positive or negative result. Absolute total loadings are the sum of the respective positive and negative loadings. The positive/negative distribution of training set structures that contain a particular model feature, and frequency of a feature in the training set also help determine whether a feature has good positive or negative predictive power. Percent feature significance indicates the percentage of compounds that were used in fitting the regression line. High feature residuals (≥95 %) indicate that these features were not used extensively, while lower feature residuals (<70 %) indicate higher significance and very low values (<50 %) are most significant.

The domain of applicability is determined by comparing the similarity of structural features between the test compound and compounds in the training set. LPDM calculates the global distance, which is the minimum Tanimoto distance of a test chemical to the training set using a full fingerprint set before any feature selection. The program also calculates model distance, which is the Tanimoto distance of a test chemical to the training set using only the features used in the model.28 If the test compound does not have any model features or if there are no local neighbors in the training set that have ≥0.3 similarity, then that compound is considered out of domain.29 During automated cross-validation, no domain of applicability assessment is applied and all compounds are predicted. In contrast, during manual cross-validation a domain of applicability assessment is applied and compounds considered out of domain (uncovered) are not predicted.
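A minimal sketch of this style of domain check is shown below, using sets of feature identifiers as stand-ins for LPDM's structural fingerprints. The 0.3 similarity cutoff is the one cited above; the set representation and function names are assumptions for illustration, not LPDM's implementation.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature IDs."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def in_domain(test_features, training_feature_sets, min_similarity=0.3):
    """A compound is out of domain if it has no model features, or if no
    training-set neighbor reaches the similarity cutoff."""
    if not test_features:
        return False
    return any(tanimoto(test_features, t) >= min_similarity
               for t in training_feature_sets)
```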

After the binary (1 or 0) PLD data set was imported, LPDM default settings were initially selected for modeling. Subsequently, the number of compounds per scaffold and number of atoms per scaffold were adjusted to optimize the model. Only compounds with prediction probabilities <0.4 (negative) or ≥0.6 (positive) were included when calculating model statistics. Compounds not in domain or with equivocal predictions (probabilities between 0.4 and 0.6) were excluded from the analysis.
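The probability-to-call mapping used here is simple to state explicitly; the thresholds come from the text, and anything in the equivocal band (or out of domain) is dropped from the performance statistics.

```python
def lpdm_call(probability, low=0.4, high=0.6):
    """Map an LPDM prediction probability to a call: < 0.4 negative,
    >= 0.6 positive, otherwise equivocal (excluded from statistics)."""
    if probability < low:
        return "negative"
    if probability >= high:
        return "positive"
    return "equivocal"
```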

2.5 DfW Screening

DfW version 13.0.0 is an expert knowledge rule-based software provided to the FDA under a CRADA and subsequently an RCA with Lhasa Limited. The rules are derived from published literature, data from private industry and government, and expertise of committee members. Molecular structures imported into the program are analyzed with structure-activity relationship (SAR) and decision tree logic algorithms that apply expert rules to determine the level of likelihood that a particular outcome will occur. The alerts are supported by a summary of evidence, example compounds and associated toxicological data, and literature references that may be consulted for further information.30 DfW currently includes a single PLD alert (number 487) based on the CAD structure, and the entire enhanced PLD data set was screened against this alert as an “external” validation test. The PLD training set was not used to modify or optimize the predictive performance of alert 487 as part of this exercise. DfW uses a controlled vocabulary of confidence terms to express the likelihood that a prediction is correct based on the weight of evidence for and against it. Compounds that elicited a PLD structural alert with a likelihood level of plausible, possible, or certain were considered to be PLD-positive predictions. When DfW does not make a positive prediction it displays a “nothing to report” response; however, this response does not distinguish between a negative prediction due to a matching exclusion pattern or the test compound being outside the domain of applicability of the alert. DfW does not currently report the applicability domain of a model or coverage of a test compound.

2.6 Automated Cross-Validation

Each software program in this investigation utilized a different method of automated cross-validation for assessment of predictive performance of models trained with either all compounds or only the higher confidence subset.

MC4PC PLD models were cross-validated using a 10 by 10 % leave-many-out (LMO) procedure to assess model quality. The software randomly divided the entire data set into 10 subsets, setting aside one subset (10 % of the compounds) as a test set to run against the model created with the remaining 9 subsets (90 % of the compounds). This process was repeated 10 times, with a different test set and training set for each iteration.
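The partitioning step of a 10 by 10 % LMO experiment can be sketched generically as follows. This illustrates only the fold construction described above, not MC4PC's internal implementation.

```python
import random


def lmo_folds(compounds, n_folds=10, seed=0):
    """Randomly split the data set into n_folds subsets and return
    (test_set, training_set) pairs: each subset serves once as the
    held-out 10 %, with the remaining 90 % used to build the model."""
    rng = random.Random(seed)
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    return [(fold, [c for other in folds if other is not fold for c in other])
            for fold in folds]
```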

LPDM selects the LMO cross-validation size based on the number of compounds in the training set. In this case, since the PLD training set had more than 500 but less than 1000 compounds, the software automatically performed 20 by 5 % cross-validations. LPDM randomly divided the data set into 20 subsets and used 5 % as a test set and the remaining 95 % as a training set to create a model, repeating the process 20 times. During automated cross-validation, a domain of applicability assessment is not applied resulting in all compounds being predicted.

2.7 Y-Scrambling Experiments

Y-scrambling experiments were performed by cross-validating models created using a dummy training data set where activity scores were randomized against chemical structures. Models were created and cross-validated using the same procedures described above for MC4PC and LPDM, except that with LPDM the modeling process was constrained to use the same 283 descriptors as in the intact model. The performance results for the random models were then used as a baseline for comparison to the final optimized models created with those same software.
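The label-randomization step of Y-scrambling amounts to permuting the activity column while leaving the structures untouched, so that a model rebuilt on the permuted activities should perform near chance. A minimal sketch:

```python
import random


def y_scramble(activities, seed=0):
    """Return a random permutation of the activity scores; pairing these
    with the original structures yields the dummy training set used as a
    performance baseline."""
    rng = random.Random(seed)
    shuffled = list(activities)
    rng.shuffle(shuffled)
    return shuffled
```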

2.8 Manual Cross-Validation for Consensus Analysis

The degree of concordance and complementarity between predictions from specific software was assessed using cross-validation data for two software programs: MC4PC and LPDM. The automated cross-validation method for each software program, described above, differed by the size of the test set left out for each cycle of the experiment, allowing only comparison of models made within that particular platform. In addition, LPDM did not apply a domain of applicability assessment to identify compounds that were out of domain. In order to make a fair comparison among the models across software platforms, manual 10 by 10 % cross-validation of the LPDM model was performed using the same validation sets randomly generated by MC4PC, and in both cases out of domain (uncovered) compounds were identified. DfW predictions were also used for comparison purposes in this series of experiments. Individual predictions for each chemical from each software platform could then be combined to assess complementarity and degree of concordance across all platforms, as well as combined predictive performance.

Predictive performance was calculated for each software program individually, each pair of software, and for all three software. Furthermore, performance was calculated using a “1+” rule, where any single positive prediction from the two or three software programs gave a positive overall call, as well as using “2+” and “3+” rules, where two or three positive predictions, respectively, were required to give an overall positive call.
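The “N+” consensus logic above can be sketched as follows, assuming each platform contributes one of "positive", "negative", or "no call" per compound (the string labels are illustrative).

```python
def consensus_call(platform_calls, min_positives=1):
    """Combine per-platform calls under an "N+" rule: at least
    min_positives positive calls gives an overall positive; "no call"
    entries are ignored; if no platform made an informative call, the
    overall result is "no call"."""
    informative = [c for c in platform_calls if c in ("positive", "negative")]
    if not informative:
        return "no call"
    n_pos = sum(c == "positive" for c in informative)
    return "positive" if n_pos >= min_positives else "negative"
```

Under the "1+" rule a single positive among three platforms drives the overall call, which raises sensitivity at the cost of specificity; the "3+" rule does the opposite.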

2.9 External Validation

New compounds in the enhanced PLD database that were not included in the 2008 version of the database were run as an external validation set against the previously described MC4PC model.16 Predictive performance statistics for the external test set were calculated in a similar manner to the statistics from cross-validation experiments.

2.10 Statistical Analyses

Predictive performance statistics of the PLD QSAR models were calculated using the method of Cooper et al.31 Coverage was calculated as the percentage of all chemicals screened for which a prediction could be made. Compounds that were not in domain or equivocal were excluded from statistical calculations. Chi-squared statistics were calculated based on a 2×2 contingency table with one degree of freedom using Pearson’s chi-squared test. The statistic was used to determine the difference between the distribution of correct PLD predictions and incorrect predictions, with the null hypothesis being that there is no difference in their distributions.
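The Cooper-style statistics reduce to simple ratios over a 2×2 contingency table (with no-call and equivocal compounds excluded beforehand), and the chi-squared value is the standard one-degree-of-freedom form for a 2×2 table. As a check, the expert-rule 10/80 counts from Table 1 reproduce the reported figures.

```python
def performance(tp, fn, fp, tn):
    """Cooper-style predictive performance statistics (as percentages)
    plus Pearson's chi-squared for the 2x2 contingency table."""
    n = tp + fn + fp + tn
    chi2 = n * (tp * tn - fp * fn) ** 2 / (
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "concordance": 100.0 * (tp + tn) / n,
        "pos_pred": 100.0 * tp / (tp + fp),
        "neg_pred": 100.0 * tn / (tn + fn),
        "chi_squared": chi2,
    }


# Expert-rule 10/80 model from Table 1 (TP=229, FN=145, FP=138, TN=192)
stats = performance(tp=229, fn=145, fp=138, tn=192)
```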

3 Results

The performance of models constructed with the MC4PC software can be modified most significantly by (1) varying the scoring system used for training set compounds before model building commences, or (2) applying a set of predefined “expert rules” to the prediction output after a model is applied. Experiments to assess the effects of modifying both of these parameters were conducted and the results of 10 by 10 % cross-validation experiments are shown in Table 1.

Table 1. Contingency table values and 10 by 10 % cross-validation performance of PLD models created with MC4PC. Abbreviations: TP: true positive; FN: false negative; FP: false positive; TN: true negative; Exp: experimental value; Pred: predicted value; NC: no call; Eqv: equivocal; Spec.: specificity; Sens.: sensitivity; Conc.: concordance; Pos. Pred.: positive predictivity; Neg. Pred.: negative predictivity; Cov.: coverage.
 

| Model | TP | FN | FP | TN | NC | Eqv | Total | Spec. | Sens. | Conc. | FP Rate | FN Rate | Pos. Pred. | Neg. Pred. | Cov. | Chi-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Default strict criteria, (−)=10 and (+)=35 activity score | 132 | 101 | 84 | 107 | 232 | 84 | 740 | 56.0 | 56.7 | 56.4 | 44.0 | 43.3 | 61.1 | 51.4 | 57.3 | 6.75 |
| Default strict criteria, (−)=10 and (+)=80 activity score | 135 | 99 | 108 | 93 | 289 | 16 | 740 | 46.3 | 57.7 | 52.4 | 53.7 | 42.3 | 55.6 | 48.4 | 58.8 | 0.688 |
| Default relaxed criteria, (−)=10 and (+)=35 activity score | 173 | 154 | 114 | 197 | 18 | 84 | 740 | 63.3 | 52.9 | 58.0 | 36.7 | 47.1 | 60.3 | 56.1 | 86.2 | 17.0 |
| Default relaxed criteria, (−)=10 and (+)=80 activity score | 239 | 161 | 133 | 172 | 19 | 16 | 740 | 56.4 | 59.8 | 58.3 | 43.6 | 40.3 | 64.2 | 51.7 | 95.3 | 18.1 |
| Expert rules, (−)=10 and (+)=35 activity score | 169 | 200 | 97 | 233 | 41 | 0 | 740 | 70.6 | 45.8 | 57.5 | 29.4 | 54.2 | 63.5 | 53.8 | 94.5 | 19.9 |
| Expert rules, (−)=10 and (+)=80 activity score | 229 | 145 | 138 | 192 | 36 | 0 | 740 | 58.2 | 61.2 | 59.8 | 41.8 | 38.8 | 62.4 | 57.0 | 95.1 | 26.5 |

Two MC4PC models were constructed with positive compounds assigned scores on the upper or lower threshold of the positive range of 30 to 80 activity units (score of 10 defines a negative). Models in which the PLD-negative compounds were scored 10 and PLD-positive compounds were scored 35 were labeled as “10/35”. Models in which PLD-negative compounds were scored 10 and PLD-positive compounds were scored 80 were labeled as “10/80.” After 10 by 10 % cross-validation, the predictions were filtered in three different ways: using (1) default strict criteria, (2) default relaxed criteria, and (3) expert rules, as described in the materials and methods section of this report. The most significant difference in performance was for the expert rule models, where the 10/35 models showed the lowest sensitivity at 45.8 % but highest specificity at 70.6 % compared to the 10/80 models with 61.2 % sensitivity and 58.2 % specificity.

Overall concordance for the default relaxed criteria and expert rule predictions was comparable (58.3 % vs. 59.8 %), as was coverage (95.3 % vs. 95.1 %). In contrast, predictions using the default strict criteria resulted in lower concordance and substantially lower coverage (52.4 % and 58.8 %, respectively), despite the exclusion of perceived weaker predictions with unknown fragment warnings. For the default relaxed criteria and expert rule predictions, the 10/80 scoring system gave more balanced specificity and sensitivity (56.4 % and 59.8 %; 58.2 % and 61.2 %) compared to the 10/35 system (63.3 % and 52.9 %; 70.6 % and 45.8 %), which favored specificity. This effect was amplified when using the expert rules, which is not unexpected since the rules were originally developed to emphasize specificity and positive predictivity.

The chi-squared value was lowest for the 10/80 default strict criteria predictions at 0.688, below even the 3.841 threshold required for a statistically discriminating model. The highest chi-squared value was 26.5 for the 10/80 expert rule predictions, and consequently this model was used for comparison and consensus experiments with other software described later in this report.

Table 2. External validation statistical parameters of 2008 PLD models created with MC4PC. Abbreviations: TP: true positive; FN: false negative; FP: false positive; TN: true negative; Exp: experimental value; Pred: predicted value; NC: no call; Eqv: equivocal; Spec.: specificity; Sens.: sensitivity; Conc.: concordance; Pos. Pred.: positive predictivity; Neg. Pred.: negative predictivity; Cov.: coverage.
 

| Model | TP | FN | FP | TN | NC | Eqv | Total | Spec. | Sens. | Conc. | FP Rate | FN Rate | Pos. Pred. | Neg. Pred. | Cov. | Chi-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| External test set | 135 | 91 | 86 | 183 | 66 | 0 | 561 | 68.0 | 59.7 | 64.2 | 32.0 | 40.3 | 61.1 | 66.8 | 88.2 | 38.3 |

The availability of an enhanced PLD database provided an external set of compounds that could be tested against previously published FDA/CDER PLD models.16 All compounds not present in the 2008 version of the database and models were combined to create an external test set consisting of 561 compounds, with 316 negatives (56.3 %) and 245 positives (43.7 %). These compounds were screened against the MC4PC model and the results are shown in Table 2. The MC4PC model resulted in 68.0 % specificity, 59.7 % sensitivity, 64.2 % concordance, and a chi-squared value of 38.3. The previously reported cross-validated performance of this model gave 91.9 % specificity, 50.0 % sensitivity, 77.6 % concordance and a chi-squared value of 115.5.

LPDM interprets chemical structures in the training data set using a proprietary set of 27,000 substructural features, which may be combined to create scaffolds. During the stepwise modeling process in LPDM, users can alter the default settings and specify the minimum number of compounds required to qualify as a cluster of a particular scaffold (compounds/scaffold) and the minimum number of atoms that comprise a scaffold (atoms/scaffold). The default settings selected by the software are compounds/scaffold=5 and atoms/scaffold=5. Twenty by 5 % cross-validations were automatically run based on the number of compounds in the PLD training data set (<1000). A comparison of the cross-validation statistics (Table 3) showed little variation in any of the predictive performance parameters (e.g., concordance ranging from 78.2 % to 79.7 %) across the LPDM models built with all compounds and modified compounds/scaffold and atoms/scaffold settings.

Table 3. Contingency table values and 20 by 5 % cross-validation performance of PLD models created with LPDM. Abbreviations: TP: true positive; FN: false negative; FP: false positive; TN: true negative; Exp: experimental value; Pred: predicted value; NID: not in domain; Eqv: equivocal; Spec.: specificity; Sens.: sensitivity; Conc.: concordance; Pos. Pred.: positive predictivity; Neg. Pred.: negative predictivity; Cov.: coverage.
 

| Model settings | TP | FN | FP | TN | NID | Eqv | Total | Spec. | Sens. | Conc. | FP Rate | FN Rate | Pos. Pred. | Neg. Pred. | Cov. | Chi-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Compounds/scaffold=5, atoms/scaffold=5 | 263 | 70 | 63 | 223 | 0 | 123 | 742 | 78.0 | 79.0 | 78.5 | 22.0 | 21.0 | 80.7 | 76.1 | 83.4 | 200.2 |
| Compounds/scaffold=3, atoms/scaffold=6 | 265 | 68 | 62 | 237 | 0 | 110 | 742 | 79.3 | 79.6 | 79.4 | 20.7 | 20.4 | 81.0 | 77.7 | 85.2 | 218.5 |
| Compounds/scaffold=3, atoms/scaffold=5 | 261 | 70 | 62 | 234 | 0 | 115 | 742 | 79.1 | 78.9 | 78.9 | 20.9 | 21.1 | 80.8 | 77.0 | 84.5 | 209.8 |
| Compounds/scaffold=3, atoms/scaffold=4 | 259 | 63 | 62 | 231 | 0 | 127 | 742 | 78.8 | 80.4 | 79.7 | 21.2 | 19.6 | 80.7 | 78.6 | 82.9 | 216.0 |
| Compounds/scaffold=4, atoms/scaffold=6 | 267 | 67 | 65 | 234 | 0 | 109 | 742 | 78.3 | 79.9 | 79.1 | 21.7 | 20.1 | 80.4 | 77.7 | 85.3 | 214.3 |
| Compounds/scaffold=4, atoms/scaffold=5 | 261 | 69 | 60 | 227 | 0 | 125 | 742 | 79.1 | 79.1 | 79.1 | 20.9 | 20.9 | 81.3 | 76.7 | 83.2 | 208.2 |
| Compounds/scaffold=4, atoms/scaffold=4 | 261 | 74 | 64 | 235 | 0 | 108 | 742 | 78.6 | 77.9 | 78.2 | 21.4 | 22.1 | 80.3 | 76.1 | 85.4 | 201.9 |
| Compounds/scaffold=5, atoms/scaffold=6 | 259 | 66 | 60 | 229 | 0 | 128 | 742 | 79.2 | 79.7 | 79.5 | 20.8 | 20.3 | 81.2 | 77.6 | 82.7 | 212.8 |
| Compounds/scaffold=5, atoms/scaffold=4 | 263 | 65 | 63 | 225 | 0 | 126 | 742 | 78.1 | 80.2 | 79.2 | 21.9 | 19.8 | 80.7 | 77.6 | 83.0 | 209.3 |

The model with the highest chi-squared value (218.5 for compounds/scaffold=3 and atoms/scaffold=6) gave 79.3 % specificity, 79.6 % sensitivity, and 79.4 % concordance, which did not vary significantly from the model using default settings (compounds/scaffold=5, atoms/scaffold=5), with 78.0 % specificity, 79.0 % sensitivity, and 78.5 % concordance. Overall, the LPDM models exhibited high chi-squared values across all settings tested, ranging from 200.2 to 218.5, but coverage in all cases was notably lower (82.7 % to 85.4 %) than for MC4PC due to the high number of compounds classified as equivocal. Coverage would be expected to be even lower for LPDM if a domain analysis were applied, which does not occur during automated cross-validation cycles.

Y-scrambling experiments were performed on the MC4PC and LPDM models selected for manual cross-validation comparison and consensus experiments to ensure that performance was significantly better than random (Table 4). The Y-scrambled cross-validated MC4PC model gave a specificity of 64.5 % and sensitivity of 35.7 %, with an overall chi-squared value of 0.00173, and the LPDM model gave 39.9 % specificity and 56.6 % sensitivity, with a chi-squared value of 0.430. Both results indicated the lack of a statistical correlation. Furthermore, in both cases the overall concordance was slightly worse than 50 %, i.e., no better than a coin flip.
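Conceptually, Y-scrambling only permutes the training set activity labels while leaving the structures untouched; the model is then rebuilt and cross-validated on the scrambled labels. A minimal sketch of the label permutation step (the modeling itself is performed by the commercial software and is not shown):

```python
import random

def y_scramble(activities, seed=42):
    """Return a randomly permuted copy of the training set activity
    labels; each structure keeps its row position, so any real
    structure-activity association is destroyed while the overall
    class balance is preserved."""
    shuffled = list(activities)            # copy; structures stay in order
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled

labels = ["+", "+", "-", "-", "+", "-"]
scrambled = y_scramble(labels)
assert sorted(scrambled) == sorted(labels)  # same class balance, new pairing
```

A model trained on such labels should perform near chance; the chi-squared values of 0.00173 and 0.430 above, both far below the 3.841 significance threshold, are consistent with this expectation.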

Table 4. Y-scrambling statistics for baseline comparison using randomized training set and 10 by 10 % cross-validation for MC4PC and LPDM. Abbreviations: TP: true positive; FN: false negative; FP: false positive; TN: true negative; Exp: experimental value; Pred: predicted value; NC: no call; NID: not in domain; Eqv: equivocal; Spec.: specificity; Sens.: sensitivity; Conc.: concordance; Pos. Pred.: positive predictivity; Neg. Pred.: negative predictivity; Cov.: coverage.
 

| Model | TP | FN | FP | TN | NC/NID | Eqv | Total | Spec. | Sens. | Conc. | FP Rate | FN Rate | Pos. Pred. | Neg. Pred. | Cov. | Chi-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MC4PC: 10=(−) and 80=(+), expert rules | 127 | 229 | 119 | 216 | 49 (NC) | 0 | 740 | 64.5 | 35.7 | 49.6 | 35.5 | 64.3 | 51.6 | 48.5 | 93.4 | 0.00173 |
| LPDM: compounds/scaffold=5, atoms/scaffold=5 | 94 | 72 | 107 | 71 | 0 (NID) | 398 | 742 | 39.9 | 56.6 | 48.0 | 60.1 | 43.4 | 46.8 | 49.7 | 46.4 | 0.430 |

The human expert rule-based SAR software Derek for Windows (DfW) contains a single alert, derived from the cationic amphiphilic drug (CAD) motif, to predict PLD. Since the software does not disclose the structures used to derive the alert, it was not possible to exclude them, if present, from the test set to create a confirmed external validation set. Consequently, the predictive performance of the alert was tested using the entire PLD data set of 743 compounds, where it showed a low false positive rate (2.5 %) but low sensitivity (10.4 %). The software fired the alert for only 49 compounds, but 40 of those 49 were predicted correctly, demonstrating that the PLD alert had high positive predictivity (81.6 %). All other compounds in the test set yielded “nothing to report” outcomes.

To allow a direct comparison of the predictive performance of the software platforms using the same validation procedure, and to determine their degree of complementarity, a manual 10 by 10 % cross-validation exercise was performed for the statistical-based models using the exact same test and training data sets for each cycle. The entire data set of 743 compounds was predicted using the 10/80 expert rules MC4PC model and the 5/5 LPDM model. The DfW PLD alert predictions for all 743 compounds were also used for comparison and consensus prediction purposes. The results of the three software programs applied in this way are presented in Table 5, which shows that LPDM had the highest individual performance statistics of the three (73.8 % specificity, 77.9 % sensitivity, 76.1 % concordance, and a chi-squared value of 98.2), followed by MC4PC (58.2 % specificity, 61.2 % sensitivity, 59.8 % concordance, and a chi-squared value of 26.5) and DfW (97.8 % specificity, 10.4 % sensitivity, 52.4 % concordance, and a chi-squared value of 20.3). However, of the statistical-based models, MC4PC had the highest coverage at 95.1 %, followed by LPDM with 49.7 %; DfW specifies no applicability domain and predicts any molecule, resulting in a nominal coverage value of 100 %.

To determine whether the overall predictive performance for PLD could be improved by combining the individual predictions from the three QSAR software platforms, which use different prediction methodologies, predictions were compared using all possible combinations of two software programs as well as all three, with different call criteria. An assessment of the degree of complementarity showed the following concordance between predictions, including no calls: MC4PC and LPDM=42.2 %, MC4PC and DfW=50.7 %, and LPDM and DfW=26.8 %. These values indicate that the three software programs make different predictions across the entire set of 743 compounds and may offer complementarity if used with an appropriate interpretation scheme. For example, overall sensitivity can be enhanced by using a “1+” rule, where any single positive prediction across two or more complementary software platforms is sufficient evidence to generate a positive overall call. In contrast, the use of a “2+” rule, requiring two or more positive predictions across all software, typically leads to higher specificity (a lower false positive rate) and positive predictivity.32
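The pairwise agreement figures quoted above can be computed as a simple call-by-call concordance, with no calls treated as their own category. The sketch below illustrates that calculation under that assumption; it is not the authors' actual analysis script:

```python
def agreement(calls_a, calls_b):
    """Percentage of compounds for which two programs return the same
    call; None (no call / equivocal) counts as its own category, so two
    no calls agree, while a call never matches a no call."""
    assert len(calls_a) == len(calls_b)
    matches = sum(1 for a, b in zip(calls_a, calls_b) if a == b)
    return 100.0 * matches / len(calls_a)

# Hypothetical calls for four compounds from two programs
print(agreement(["+", "-", None, "+"], ["+", "+", None, "-"]))  # -> 50.0
```

Low pairwise agreement (e.g., 26.8 % between LPDM and DfW) is precisely what makes a “1+” combination attractive: each program can contribute positive calls the others miss.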

When comparing pairs of software using a 1+ rule, overall concordance did not exceed that of LPDM alone, although coverage was improved from 49.7 % to 59.6 % in combination with DfW, and to 77.4 % in combination with MC4PC. Overall sensitivity for LPDM was increased from 77.9 % to 83.2 % in combination with MC4PC, but with a significant loss of specificity from 73.8 % to 41.6 %, indicating that the accuracy for the additional compounds predicted only by MC4PC was not as high as for the rest of the set. Combining MC4PC with DfW did not give significantly different performance over MC4PC alone due to the small number of positive predictions made by DfW, most of which were already predicted positive by MC4PC. When all three software platforms were combined using a 1+ rule, the overall coverage increased to 96.5 %, but overall prediction concordance was still lower than for LPDM alone. The use of a 2+ rule resulted in slightly higher concordance than the 1+ rule, but with a bias towards higher specificity (88.0 %) and lower sensitivity (46.0 %). Coverage was also lower at 76.8 %.

Table 5. DfW predictions and 10 by 10 % cross-validation statistics of MC4PC and LPDM alone and in combination. Abbreviations: TP: true positive; FN: false negative; FP: false positive; TN: true negative; Exp: experimental value; Pred: predicted value; NC: no call; NID: not in domain; Eqv: equivocal; Spec.: specificity; Sens.: sensitivity; Conc.: concordance; Pos. Pred.: positive predictivity; Neg. Pred.: negative predictivity; Cov.: coverage.
 

Individual model predictions:

| Model | TP | FN | FP | TN | NC/NID | Eqv | Total | Spec. | Sens. | Conc. | FP Rate | FN Rate | Pos. Pred. | Neg. Pred. | Cov. | Chi-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MC4PC: 10=(−) and 80=(+) activity score/expert rules | 229 | 145 | 138 | 192 | 36 (NC) | 0 | 740 | 58.2 | 61.2 | 59.8 | 41.8 | 38.8 | 62.4 | 57.0 | 95.1 | 26.5 |
| LPDM: compounds/scaffold=5, atoms/scaffold=5 | 159 | 45 | 43 | 121 | 312 (NID) | 60 | 740 | 73.8 | 77.9 | 76.1 | 26.2 | 22.1 | 78.7 | 72.9 | 49.7 | 98.2 |
| DfW: Alert 487[a] | 40 | 344 | 8 | 348 | 0 | 0 | 740 | 97.8 | 10.4 | 52.4 | 2.2 | 89.6 | 83.3 | 50.3 | 100.0 | 20.3 |

[a] DfW does not make negative predictions; however, for statistical calculation purposes a lack of a positive prediction by DfW was treated as a negative prediction.

Combined model predictions:

| Combination | TP | FN | FP | TN | NC | Eqv | Total | Spec. | Sens. | Conc. | FP Rate | FN Rate | Pos. Pred. | Neg. Pred. | Cov. | Chi-squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MC4PC+LPDM (1+) | 263 | 53 | 150 | 107 | 167 | 0 | 740 | 41.6 | 83.2 | 64.6 | 58.4 | 16.8 | 63.7 | 66.9 | 77.4 | 43.5 |
| MC4PC+DfW (1+) | 233 | 141 | 139 | 191 | 36 | 0 | 740 | 57.9 | 62.3 | 60.2 | 42.1 | 37.7 | 62.6 | 57.5 | 95.1 | 28.6 |
| LPDM+DfW (1+) | 173 | 75 | 45 | 148 | 299 | 0 | 740 | 76.7 | 69.8 | 72.8 | 23.3 | 30.2 | 79.4 | 66.4 | 59.6 | 93.7 |
| MC4PC+LPDM+DfW (1+) | 265 | 112 | 151 | 186 | 26 | 0 | 740 | 55.2 | 70.3 | 63.2 | 44.8 | 29.7 | 63.7 | 62.4 | 96.5 | 47.5 |
| MC4PC+LPDM+DfW (2+) | 139 | 163 | 32 | 234 | 172 | 0 | 740 | 88.0 | 46.0 | 65.7 | 12.0 | 54.0 | 81.3 | 58.9 | 76.8 | 77.7 |

4 Discussion

Cationic amphiphilic drugs feature heavily in the data set used in this investigation and as such are expected to be the basis of many of the structural alerts and structure-activity relationships identified. The broadly described features generally used to define a CAD are a hydrophilic amine side chain and a hydrophobic ring system; however, the specific definitions of the amine and ring systems are somewhat ambiguous, as they require knowledge of the ionization state of particular atoms relative to the biological environment in which the molecule is present. Furthermore, there is no set definition with regard to the distance separating these two features within the molecule. While CAD molecules are most frequently used to investigate the mechanism of PLD induction in biomarker studies,33 there are still a significant number of exceptions to the proposed mechanism, as indicated by PLD-positive non-CADs in the database. In addition, examples of PLD-negative CADs suggest that mitigating features play a role in downgrading PLD activity in some instances. A statistical-based analysis of substructural features, such as that used in QSAR modeling, may provide insights into other structural motifs associated with the effect and other possible mechanisms by which the effect occurs or is mitigated.

In this investigation, three software programs were utilized to analyze the same database from different perspectives with respect to structural interpretation and QSAR methodology. The fragment-based statistical software MC4PC provides test compound predictions based upon the presence of 2–10 atom structural alerts using a default set of evaluation criteria preset by the software developer or an optimized set of expert rules previously developed by FDA/CDER to enhance specificity and positive predictivity.14 The predictive performance of the software can also be adjusted by using different scoring systems for training set data, where positive compounds can be assigned a range of activity scores between 30 and 80 units. The PLD training data set used in this investigation does not incorporate a potency measure based on the dose at which PLD is induced, and as such, all compounds are weighted equally. However, despite the data set being binary in nature, scoring systems at the extremes of the activity unit scheme were tested, and it was shown that a 10/80 model gave a better balance of sensitivity and specificity, with overall better chi-squared values, than a 10/35 model. Furthermore, performance using the expert rule predictions was improved over that obtained with the default predictions.

In part, the expert rules classify structural alerts identified by the software as either significant or non-significant based upon a cumulative index which is calculated for each alert from the number of training set compounds from which the alert was derived and the average activity score of those compounds. In the 10/80 model constructed from all 743 compounds, MC4PC identified a total of 144 structural alerts, of which 119 were associated with active compounds in the training data set and 25 were associated with a lack of activity (deactivating fragments). The structural alerts were rank-ordered based on their cumulative index score and those with the highest cumulative scores were considered to be the most significant and reliable in supporting a positive prediction. Based on previous investigations,14 structural alerts with an index score above 150 activity units were considered reliable predictors of activity.

Of the 64 alerts with index scores greater than 150 activity units, 26 were derived from clusters of 4 or more compounds and 7 of these alerts were derived from clusters of more than 10 compounds. Furthermore, 15 of the 26 alerts show features of a CAD structure, which was broadly defined as having a hydrophobic region containing aromatic carbons and a basic nitrogen. Figure 1 shows several examples of PLD-positive chemical structures from the training set that contain these particular CAD-supporting structural alert fragments, where the separation between the basic amine and aromatic system ranges from 1 to 16 atoms. Two of the CAD structural alerts consist of fluoromethyl-substituted aromatic rings, where the halogen increases lipophilicity and drug permeability by electron withdrawal, and may explain the compounds’ ability to induce PLD. Interestingly, a further four structural alerts containing both an amide and an aromatic ring fragment were identified, even though an amide nitrogen is generally considered insufficiently basic to meet the definition of a CAD. Upon closer inspection of the compounds from which these alerts were derived, the structures were found to contain another, more basic nitrogen, which defines a CAD and suggests that identification of the amide moiety as the cause of the PLD activity may be coincidental. However, this same feature was identified as being independently significant by LPDM using a different modeling algorithm and different underlying clusters of training set compounds, suggesting that there may be some biological significance to this observation, perhaps when metabolic considerations are made.

Figure 1.

Positive compounds with MC4PC structural alerts in bold.

LPDM predictions are based on significant structural descriptors (“model features”), which are derived from the Leadscope feature hierarchy or scaffolds assembled from individual features. Because there was little variation in statistics among the models made with all compounds, the PLD model built with default settings was selected for the consensus analysis for convenience. LPDM identified 283 model features, and the % Residuals statistic was used as an initial sorting filter to determine which of these were most significant. Sorting on this parameter showed that 16 model features had ≤70 % Residuals, with the most significant feature at 43.7 %. Thirteen of these 16 features contained CAD motifs composed of aromatic rings and an amine side chain, and 4 features (one overlapping with the 13) contained an amide. These same 16 features also ranked highly when the feature list was sorted by Total Positive Loadings and Absolute Total Loadings, indicating their importance in the model. Figure 2 shows several examples of these significant model features, in which CAD structures are heavily represented.

Figure 2.

LPDM positive scaffolds ranked by % residuals (lowest residuals=most significant for positive prediction).

A comparison of the top-ranked MC4PC alerts and LPDM structural features showed significant overlap despite differences in the way in which the features were derived and the way in which they are applied as part of a prediction exercise. The benzylamine moiety is responsible for the largest overlap between the two sets, followed by amides. Alerts or features with more than a one-carbon separation between amine and aromatic ring are not observed in either software platform due to the way structural features are limited during the model building process. In MC4PC, structural alerts are restricted to 10 atoms or less, and in LPDM, the Leadscope fingerprint of molecular features does not by default include aromatic amines with a greater separation between ring and nitrogen. However, LPDM does allow the import of user-defined features for consideration during the regression analysis, which may provide a way to seed the feature set with more relevant scaffolds for PLD to improve the predictive performance of future versions of the model.

A 10 by 10 % cross-validation analysis for two software platforms was performed using the same test and training data sets for each iteration to determine the degree of complementarity and/or consensus of predictions. By using the exact same test and training sets a direct comparison of predictive performance between platforms was first obtained, before assessing the degree of benefit in combining predictions to obtain an overall call for each chemical. When all three platforms were evaluated in the same way, LPDM showed the highest overall performance of a single software (Table 5), but with the lowest coverage of the three at only 49.7 %, caused by a significant number of equivocal and “not in domain” predictions. In contrast, MC4PC showed far fewer uncovered compounds and/or equivocal predictions but gave lower overall predictive performance; DfW gave high positive predictivity for its single alert, but overall low sensitivity in detecting PLD-causing compounds across the entire database.

Each program uses a unique approach to interpreting chemical structures, which determines whether each model is capable of making a prediction and what that prediction will be. Due in part to these differences, as well as differences in the (Q)SAR methodologies used, it was expected that there would also be differences in the predictions made by each program on a chemical-by-chemical basis. This was confirmed by the observation that the highest degree of concordance, between MC4PC and DfW, was only 50.7 %, with lower values for the other model pairs. Despite the perceived complementarity of 49.3 % of the MC4PC and DfW predictions, the majority of the differences were “no structural alert” calls from DfW paired with positive calls from MC4PC, and almost all compounds predicted positive by DfW were also predicted positive by MC4PC. This observation suggests that there is little benefit to using both models to identify positives using a 1+ rule. In contrast, the use of DfW in combination with LPDM showed greater benefit by making calls for a larger number of compounds predicted as equivocal or no call by LPDM, with only a small drop in overall concordance (76.1 % to 72.8 %).

In cases where all three software programs are positive in consensus (3+ rule), one can have the highest degree of confidence in those positive predictions; however, the number of overall positive predictions is very low, in part due to the low number of positive predictions made by DfW. Only 30 positive predictions were generated from applying a 3+ rule to this data set, resulting in a false positive rate of only 1.7 % but sensitivity at an unacceptable 6.4 % (data not shown).

Of the 312 compounds classified as not in domain by LPDM, all but 26 were covered by MC4PC, which supports the concept that a compound that is not in domain for one software platform may still be covered by another due to the different methods of structural interpretation and applicability domain assessment. Overall, combining individual predictions from multiple software platforms is an advantageous approach because it yields reasonable sensitivity and specificity while increasing the number of chemicals with useable predictions. Furthermore, where predictions are in consensus, greater confidence in those predictions can be inferred, especially for positives when the alerting portion of the molecule is the same.

A more detailed assessment of the 44 structures in the database that were consistently predicted incorrectly was performed to better understand the deficiencies of the models. A total of 6 compounds were incorrectly classified as positive by all three software programs, and 38 compounds were incorrectly classified as negative by all three. The 37 non-proprietary false positives and false negatives are presented in Table 6, and selected structures are shown in Figures 3 and 4.

Figure 3.

Selected compounds with false positive predictions from MC4PC, LPDM and DfW.

Figure 4.

Selected compounds with false negative predictions from MC4PC, LPDM and DfW.

Of the 6 false positives, all meet the strict definition of a CAD structure by matching DfW alert 487. Three of these compounds (methapyrilene, procainamide, and tiapride) were EM-confirmed negatives, suggesting that the PLD-inducing effect is mitigated by molecular properties that are not represented in any of the models. The remaining 3 compounds were classified as negative based on the absence of documented PLD activity at the time the database was compiled. All 6 drugs have a most basic pKa value between 8.7 and 9.3,34 and DfW-calculated logP values ranging from 1.3 to 4.1. The highest logP value was for doxepin, which, when combined with high basicity, a high volume of distribution (Vd=170 L/kg), and a half-life of 15.3 hours,35 suggests that the training set classification of negative may be inappropriate. Doxepin is classified as a medium-confidence experimental negative due to the more limited documentation available for searching for PLD-related terms during database construction. In contrast, the three EM-confirmed (high confidence) training set negatives were found to have relatively short half-lives, from 0.2 to 4 hours, and Vd values between 1.4 and 3.9 L/kg,36 indicating rapid metabolism and limited tissue distribution of the drugs. Of interest is that the three non-EM-confirmed negatives are also predicted as false positives by the pKa- and logP-based Ploemen model;17 however, only doxepin is predicted positive by the Hanumegowda model,20 which takes volume of distribution into consideration. This suggests that pharmacokinetics should be accounted for to better predict the lack of observed activity for these compounds.

Of the 38 false negatives (non-proprietary and proprietary), only 6 meet the strict definition of a CAD. In the remaining cases the software is unable to identify other substructural features that are statistically correlated with PLD activity, perhaps due to the lack of representative structures in the database, or because other molecular properties such as bioavailability or physicochemical parameters contribute heavily to the activity and are not adequately accounted for in the models. An additional limitation is the pooling of data across different species and doses, as well as the combination of EM-confirmed and light microscopy findings.

Table 6. Nonproprietary PL-negative and PL-positive compounds incorrectly predicted by MC4PC, LPDM and DfW.

| Compound name | Experimental score | (Q)SAR prediction |
|---|---|---|
| Atomoxetine hydrochloride | − | + |
| Doxepin hydrochloride | − | + |
| Methapyrilene | − | + |
| Metoclopramide | − | + |
| Procainamide | − | + |
| Tiapride | − | + |
| 5-Hydroxydopamine | + | − |
| Amantadine | + | − |
| Atropine | + | − |
| Cefprozil | + | − |
| Chlorhexidine gluconate | + | − |
| Entecavir | + | − |
| Eprosartan mesylate | + | − |
| Etiprednol dicloacetate | + | − |
| Felbamate | + | − |
| Flunisolide | + | − |
| Flutamide | + | − |
| Fluvastatin sodium | + | − |
| Formoterol | + | − |
| Gabapentin | + | − |
| Levodopa | + | − |
| Lovastatin | + | − |
| Lubiprostone | + | − |
| Megestrol acetate | + | − |
| Midazolam | + | − |
| Nilutamide | + | − |
| Olmesartan medoxomil | + | − |
| Pantoprazole sodium | + | − |
| Perindopril erbumine | + | − |
| Pregabalin | + | − |
| Ribavirin | + | − |
| Sapropterin dihydrochloride | + | − |
| Stavudine | + | − |
| Thalidomide | + | − |
| Trimethoprim | + | − |
| Ursodiol | + | − |
| Valacyclovir hydrochloride | + | − |

DfW generates predictions based on human expert knowledge rather than statistical calculations, and provides references to published studies to support its reasoning. The DfW PLD alert, which specifically describes the phospholipidosis-inducing potential of amines, is primarily based on the cationic amphiphilic structure, but also takes logP and pKa parameters into consideration. The very strict definition of this alert gave very high positive predictivity but low sensitivity when assessed with the entire PLD database, with the low sensitivity being due in part to the highly specific substructural alerting fragment as well as non-CAD PLD-positive examples in the test set. Broadening the structural definition of a CAD may improve the software’s predictive performance for this endpoint. Furthermore, performing an in-depth analysis of the many non-CAD PLD-positive exceptions in this and other data sets may lead to the development of a non-CAD-specific alert that takes into consideration other structural features or physicochemical parameters and can be used to supplement the existing alert.

The cross-validation statistics of the new models built with MC4PC demonstrated lower concordance and specificity than those of the previously reported FDA/CDER models,1 but greater sensitivity, a direct consequence of the higher ratio of actives to inactives in the training set. This ratio is close to 50 : 50 in the new model, compared with 33 : 67 in the previously published model. Notably, external validation of the 2008 models using a relatively balanced (44 : 56) set of new PLD data revealed that MC4PC performed with higher sensitivity than in the previously reported cross-validation experiments.
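The performance measures quoted throughout (sensitivity, specificity, positive predictivity, concordance) are the standard 2 × 2 confusion-matrix statistics. A minimal helper, hypothetical and not part of any of the software platforms discussed, makes the definitions explicit:

```python
def classification_stats(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard confusion-matrix statistics used to compare (Q)SAR
    models, each expressed as a percentage:
      sensitivity           = TP / (TP + FN)
      specificity           = TN / (TN + FP)
      positive predictivity = TP / (TP + FP)
      concordance           = (TP + TN) / total
    """
    total = tp + tn + fp + fn
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "positive_predictivity": 100.0 * tp / (tp + fp),
        "concordance": 100.0 * (tp + tn) / total,
    }

# Illustrative counts only, not the counts behind the values reported here:
stats = classification_stats(tp=60, tn=70, fp=30, fn=40)
print(stats)  # sensitivity 60.0, specificity 70.0, concordance 65.0
```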

5 Conclusions

The application of chemical structure-activity relationships derived using in silico methodologies provides a rapid and efficient way to assess the potential of compounds to induce PLD. The expanded PLD database and access to additional software platforms facilitated the creation and interrogation of (Q)SAR models that incorporate more chemical structures and have a broader applicability domain, and enabled exploration of a modeling approach not previously investigated. The new models were trained with a more robust and balanced data set and demonstrated improved sensitivity over the previously described models,16 which emphasized specificity. Furthermore, the new models show benefit when used as part of a screening battery; however, they do not address the bigger issue of whether PLD translates to toxicity, which can only be evaluated using a targeted set of compounds with organ-specific PLD and toxicological outcomes. Despite each software program’s unique approach to identifying relevant molecular descriptors, the CAD motif features significantly in all models and is the main driver for positive predictions. Nevertheless, the exceptions demonstrate that CAD characteristics are not the only predictive features, and that there are other important factors to take into consideration, including mitigating structural features and physicochemical properties.

(Q)SAR screening provides an advantage over simple CAD identification by visual inspection because of its ability to rapidly and consistently identify structural features across a large number of test chemicals, as well as its ability to account for the effects of other modulating groups within a molecule. (Q)SAR models for this endpoint are more commonly used by the pharmaceutical industry to screen chemical lead candidates in drug development, although these models also have potential to be used as part of a safety assessment for drugs under regulatory review. However, a significant limitation is that they do not currently address issues such as organ-specific toxicity or the dose at which PLD may occur. In this investigation, the underlying data were obtained from various sources under different study protocols, and consequently the resulting models do not account for dose, duration of exposure, or species specificity. In addition, the models consider only a limited number of physicochemical descriptors and could likely benefit from the addition of descriptors such as amphiphilicity and logD, which have proven useful in previously reported modeling efforts.25

The current models do not show sufficiently high predictive performance, alone or in consensus, to serve as a comprehensive screen, and would be best used as part of a risk assessment strategy that incorporates other tools and supporting evidence. A more robust screening approach may be realized through a stepwise combination of structural alerts, simple physicochemical parameters, and QSAR models. Further research on the biological mechanism of PLD may provide greater clarity on the link between PLD and toxicity for a wider range of chemicals, which may in turn provide more relevant experimental data for the development of new and more focused models to predict PLD-induced toxicity.
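The stepwise strategy suggested above, a structural alert first, then a physicochemical flag, then a consensus over the statistical (Q)SAR platforms, can be sketched as follows. The individual predictors are placeholders for whichever tools an assessor has available; the voting scheme shown (any alert is sufficient, otherwise majority vote) is one plausible choice, not the scheme used in this study.

```python
def stepwise_pld_screen(alert_hit: bool,
                        physchem_hit: bool,
                        qsar_calls: list) -> str:
    """Tiered PLD screen: a structural-alert hit or a physicochemical
    flag is treated as sufficient for a positive call; otherwise the
    compound falls through to a majority vote over the (Q)SAR
    platforms.  qsar_calls holds per-platform booleans, with None
    marking a compound outside that platform's applicability domain."""
    if alert_hit or physchem_hit:
        return "positive"
    votes = [call for call in qsar_calls if call is not None]
    if not votes:
        return "out of domain"
    return "positive" if sum(votes) * 2 > len(votes) else "negative"

print(stepwise_pld_screen(False, False, [True, True, False]))  # positive
print(stepwise_pld_screen(False, False, [None, None, None]))   # out of domain
```

Combining platforms this way trades specificity for sensitivity and coverage, which mirrors the behavior of the consensus experiments reported above.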

Acknowledgements

The authors thank the FDA CDER Phospholipidosis Working Group for providing the PLD database for QSAR modeling and for feedback on this manuscript, and Ms. Esther Kim for assistance with database curation. The authors also thank RCA partners MultiCASE Inc., Leadscope Inc., and Lhasa Limited for providing software and technical support.
