Keywords:

  • Average precision;
  • Boosting;
  • Ensemble algorithms;
  • Kernel algorithms;
  • Random forest;
  • Support vector machine;
  • Virtual screening

Machine-learning algorithms such as the support vector machine (e.g., Cristianini and Shawe-Taylor1) and random forest2 are increasingly being used by chemists in the pharmaceutical industry to build models for virtually screening compounds and identifying potential targets for drug development.3–6 Bajorath7 gave a comprehensive review in 2002, and data mining for high-throughput screening (HTS) data continues to be an active area of research. However, claims about the superiority of particular algorithms are often misleading, as Hand8 noted more broadly. We report a small, simple study showing that, for these chemistry problems, the choice of learning algorithm can be far less important than the choice of the underlying chemical space, as defined by the predictors used. Note also that a vast array of chemical descriptors is available.9

A total of 1280 compounds were taken from Sigma-RBI’s Library of Pharmacologically Active Compounds (LOPAC). For each compound, we obtained two sets of predictors and created six target labels. At the time this work was initiated, access to HTS data sets was relatively limited. Now, large data sets are more readily available; see, for example, a large data set, DUD, from the University of California, San Francisco.10

For our experiments, two types of chemical descriptors were used. The first predictor set contained 480 binary, atom-pair predictors. The second predictor set contained 6 continuous predictors, consisting of a selected set of BCUT-like descriptors based on the eigenvalues of the molecule’s connectivity matrix (e.g., Burden11). The molecular descriptors were computed using PowerMV,12 which can be freely downloaded from www.niss.org/PowerMV. These descriptors are chemically opaque, but they have been successful in making chemical predictions. It is increasingly recognized that data and methods useful for prediction are not necessarily easy to interpret, and that easily interpreted data and models can be less predictive; see, for example, Shmueli’s article, “To Explain or to Predict?”.13

The 6 representative target labels chosen (see Table 1) were binary (1/0) indicators of whether a compound belongs to the adenosine, antibiotic, antibiotic-Cephamycin, cholinergic, or GABA class, or whether it is a hormone.1

Table 1. Number of compounds belonging to each target class.

Target label             Yes     No   Total
Adenosine                 57   1223    1280
Antibiotic                29   1251    1280
Antibiotic-Cephamycin     11   1269    1280
Cholinergic               77   1203    1280
GABA                      42   1238    1280
Hormone                   34   1246    1280

For each target-label and predictor-set combination, we evaluated 3 algorithms: support vector machine (SVM), random forest (RF), and rank boost (RB). These algorithms were selected for being representatives of two broad classes of successful data-mining algorithms, respectively known as kernel algorithms and ensemble algorithms.14

In particular, SVM is a prototypical kernel algorithm, while RF and RB are prototypical ensemble algorithms. Ensembles can be constructed either independently or sequentially. Among independently constructed ensembles, bagging15 and RF are perhaps the best known and most widely used. Since bagging is a special case of RF,14 we considered only RF in our study. Sequentially constructed ensembles include various boosting algorithms.16 The RB algorithm17 is a particular variation of boosting that is suitable for ranking problems; below, we say more about why we have a ranking problem.

Altogether, a total of 36 experiments (all combinations of 6 targets, 2 predictor sets, and 3 algorithms) were performed. Each experiment consisted of 25 repeated runs. In each run, the data set was first split randomly into two halves. (Generally, only a very small fraction of the compounds belong to any given class. Hence, when randomly splitting the data into two halves, extra care was taken to ensure that the resulting training and test sets were well balanced. For example, suppose that only 30 (out of 1280) compounds belong to a given class; call these compounds “actives” and the remaining 1250 “inactives.” To ensure good balance in such a situation, we would always randomly sample 15 from the actives and 625 from the inactives to form our training set and use the remainder as our test set.) Each algorithm was trained on the first half (the training set) and evaluated on the second half (the test set).
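The balanced splitting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name `stratified_half_split` is hypothetical, the labels are synthetic (30 actives out of 1280, as in the example), and rounding down when a class has an odd number of members is our own choice, since the text does not say how that case was handled.

```python
import random

def stratified_half_split(labels, rng):
    """Split item indices into two halves, sampling half of each class.

    Hypothetical helper: for odd class sizes the training half is
    rounded down (an assumption, not stated in the article).
    """
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]
    train = set(rng.sample(actives, len(actives) // 2))
    train |= set(rng.sample(inactives, len(inactives) // 2))
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(train), test

# Synthetic labels matching the worked example: 30 actives, 1250 inactives.
labels = [1] * 30 + [0] * 1250
rng = random.Random(0)  # fixed seed so the split is reproducible
train_idx, test_idx = stratified_half_split(labels, rng)

n_train_actives = sum(labels[i] for i in train_idx)
n_test_actives = sum(labels[i] for i in test_idx)
print(len(train_idx), n_train_actives, len(test_idx), n_test_actives)
```

Repeating the call with a fresh random state 25 times would mimic the 25 repeated runs of one experiment.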

A salient feature of this type of chemical screening problem is that the class labels are almost always highly unbalanced, i.e., there are many more “no”s than “yes”s in the data set (see Table 1). For such unbalanced problems, the misclassification rate is not a good performance measure since, by simply classifying everything into the majority class, one can still achieve a very low misclassification rate. Instead, we used a criterion called the average precision (AP) to evaluate the algorithms’ performances.
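To make the point about the misclassification rate concrete, the short Python sketch below computes the error rate of a trivial rule that labels every compound “no”, using the “yes” counts from Table 1. The variable names are ours and purely illustrative.

```python
# "Yes" counts per target label, copied from Table 1; every target has 1280 compounds.
yes_counts = {
    "Adenosine": 57,
    "Antibiotic": 29,
    "Antibiotic-Cephamycin": 11,
    "Cholinergic": 77,
    "GABA": 42,
    "Hormone": 34,
}
TOTAL = 1280

# A rule that always predicts "no" misclassifies exactly the "yes" compounds.
baseline_error = {target: yes / TOTAL for target, yes in yes_counts.items()}

for target, err in baseline_error.items():
    print(f"{target}: {err:.2%}")
```

Even for the most common class (cholinergic), this uninformative rule errs on only about 6% of the compounds, which is why a ranking-based criterion is used instead.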

The AP measures how effectively an algorithm is able to rank the candidate items; an algorithm that ranks all the “yes”s ahead of the “no”s will have AP=1, whereas an algorithm that ranks the candidates randomly will have AP=π, where π is the fraction of “yes”s in the data set (see Zhu et al.,18 Appendix A, for more details). In other words, the higher the AP, the better. As a measure of ranking, the AP is similar in spirit to the area under the receiver-operating characteristic (ROC) curve.19 However, the AP weighs the initial part of the ROC curve more heavily20 and is, therefore, both a more suitable and a more practically relevant measure for highly unbalanced problems.
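The behaviour of the AP can be illustrated with a minimal Python sketch. We assume the common discrete form of the AP (the mean, over the “yes” items, of the precision at each item’s rank); the definition in Zhu et al., Appendix A, may differ in details such as the handling of ties. The function name and the synthetic data are illustrative only.

```python
import random

def average_precision(scores, labels):
    """AP of the ranking obtained by sorting items by descending score.

    Assumed discrete definition: average of precision-at-rank over the
    positions occupied by "yes" (label 1) items.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# All "yes" items ranked first: AP equals 1.
perfect = average_precision([5, 4, 3, 2, 1], [1, 1, 0, 0, 0])
# All "yes" items ranked last: precision is 1/4 and 2/5 at the two hits.
worst = average_precision([1, 2, 3, 4, 5], [1, 1, 0, 0, 0])

# Random rankings on a synthetic unbalanced set (30 "yes" out of 1280).
rng = random.Random(0)
labels = [1] * 30 + [0] * 1250
pi = sum(labels) / len(labels)
random_aps = []
for _ in range(200):
    scores = [rng.random() for _ in labels]
    random_aps.append(average_precision(scores, labels))
mean_random_ap = sum(random_aps) / len(random_aps)

print(perfect, worst, pi, mean_random_ap)
```

In this sketch the mean AP of the random rankings stays close to π; in a finite sample it tends to sit slightly above π, so AP=π is best read as the large-sample baseline.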

Results are displayed in Figure 1. One can easily see that, for any given target-label and predictor-set combination, the three algorithms’ performances were not statistically different from one another. It was much more important to use the “right” predictor set for predicting different targets than to choose the “right” algorithm. For example, the second predictor set (6 continuous, BCUT-like predictors) was clearly more informative for predicting the adenosine class, whereas the first predictor set (480 binary, atom-pair predictors) was more informative for predicting the hormone class, regardless of which algorithm was used.

Formal ANOVA results including all-way interactions (Table 2) clearly show that the “target” main effect, the “predictors” main effect, and the “target:predictors” interaction effect are by far the most significant, while the “algorithm” main effect is the least significant apart from the “block” effect.

Our experiments confirmed the intuitive notion that certain chemical predictor sets contain more information about whether a chemical compound has certain properties. This small experiment indicates that, in order to screen chemical compounds for certain properties, it is much more important to choose the “right” chemical predictor set to match the target than to choose the “right” computational screening algorithm.1

Figure 1. Experimental results: box plots based on 25 repeated runs for each of the 36 experiments (all combinations of 6 targets, 2 predictor sets, and 3 algorithms). “SVM480”=using the support vector machine with 480 (binary, atom-pair) predictors; “RF6”=using the random forest with 6 (continuous, BCUT-like) predictors; etc. Vertical axis=average precision (AP).


Our study is small by current standards and depends only on the 1280 compounds in LOPAC. The data mining algorithms selected are favoured by researchers and, as such, may be capturing most of the information contained in the descriptors. Our result, that the choice of algorithms makes relatively little difference when compared to the choice of descriptors, is in line with Hand.8 More automated methods for comparing algorithms and descriptors are coming on line21 and access to data sets is better than before (e.g., DUD10) so that, with some effort, our results can and should be replicated with more compound sets and more types of descriptors.2

Table 2. ANOVA results. “Block” refers to 25 repetitions of each experiment. “X : Y” denotes the interaction effect between X and Y.

Effect                               DF   Sum Sq   Mean Sq         F   % Variance explained
Main effects
  Target                              5   25.629     5.126   514.111   58.32
  Predictors                          1    3.270     3.270   328.016    7.44
  Algorithm                           2    0.075     0.038     3.768    0.17
  Block                              24    0.353     0.015     1.475    0.80
Two-way interactions
  Target : Predictors                 5    4.724     0.945    94.751   10.75
  Algorithm : Predictors              2    0.551     0.275    27.607    1.25
  Target : Algorithm                 10    0.620     0.062     6.215    1.41
Three-way interactions
  Target : Algorithm : Predictors    10    0.350     0.035     3.508    0.80
Residuals                           840    8.375     0.010              19.06
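The derived columns of Table 2 can be recomputed from the degrees of freedom and sums of squares alone. The Python sketch below does this under the usual conventions, which we assume here: the percentage of variance explained is each sum of squares divided by the total, and each F statistic is the effect’s mean square divided by the residual mean square. Because the inputs are the rounded values printed in the table, small discrepancies in the last digits are to be expected.

```python
# (degrees of freedom, sum of squares) for each row, copied from Table 2.
rows = {
    "Target": (5, 25.629),
    "Predictors": (1, 3.270),
    "Algorithm": (2, 0.075),
    "Block": (24, 0.353),
    "Target:Predictors": (5, 4.724),
    "Algorithm:Predictors": (2, 0.551),
    "Target:Algorithm": (10, 0.620),
    "Target:Algorithm:Predictors": (10, 0.350),
    "Residuals": (840, 8.375),
}

total_df = sum(df for df, _ in rows.values())  # 36 experiments x 25 runs - 1
total_ss = sum(ss for _, ss in rows.values())

resid_df, resid_ss = rows["Residuals"]
resid_ms = resid_ss / resid_df

# Share of the total sum of squares attributed to each row, in percent.
pct_explained = {name: 100 * ss / total_ss for name, (_, ss) in rows.items()}

# F statistic: effect mean square over residual mean square.
f_stat = {
    name: (ss / df) / resid_ms
    for name, (df, ss) in rows.items()
    if name != "Residuals"
}

for name in rows:
    f_text = f"{f_stat[name]:9.2f}" if name in f_stat else " " * 9
    print(f"{name:30s}{f_text}{pct_explained[name]:8.2f}")
```

Read this way, the “algorithm” main effect accounts for well under 1% of the total variation in AP, compared with more than half for the “target” main effect.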

Additional Materials

Three data sets and a ReadMe file are deposited at dryad.org.

Acknowledgements

This research was partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and by the Mathematics of Information Technology and Complex Systems (MITACS) Network. The experiments were performed while the second author was a graduate student at the University of Waterloo.