Partial least square and k-nearest neighbor algorithms for improved 3D quantitative spectral data–activity relationship consensus modeling of acute toxicity



A diverse set of 154 chemicals that included US Food and Drug Administration–regulated compounds tested for their aquatic toxicity in Daphnia magna were modeled by a 3-dimensional quantitative spectral data–activity relationship (3D-QSDAR). Two distinct algorithms, partial least squares (PLS) and Tanimoto similarity-based k-nearest neighbors (KNN), were used to process bin occupancy descriptor matrices obtained after tessellation of the 3D-QSDAR space into regularly sized bins. The performance of models utilizing bins ranging in size from 2 ppm × 2 ppm × 0.5 Å to 20 ppm × 20 ppm × 2.5 Å was explored. Rigorous quality-control criteria were imposed: 1) 100 randomized 20% hold-out test sets were generated and the average R2test of the respective models was used as a measure of their performance, and 2) a Y-scrambling procedure was used to identify chance correlations. A consensus between the best-performing composite PLS model using 0.5 Å × 14 ppm × 14 ppm bins and 10 latent variables (average R2test = 0.770) and the best composite KNN model using 0.5 Å × 8 ppm × 8 ppm and 2 neighbors (average R2test = 0.801) offered an improvement of about 7.5% (R2test consensus = 0.845). Projection of the most frequently occurring bins on the standard coordinate space indicated that the presence of a primary or secondary amino group—substituted aromatic systems—would result in an increased toxic effect in Daphnia. The presence of a second aromatic ring with highly electronegative substituents 5 Å to 7 Å apart from the first ring would lead to a further increase in toxicity. Environ Toxicol Chem 2014;33:1271–1282. © 2014 SETAC