Development of Simple QSPR Models for the Prediction of the Heat of Decomposition of Organic Peroxides

Quantitative structure-property relationships (QSPR) represent an alternative to experiments for estimating the physico-chemical properties of chemicals, both for screening purposes at the R&D level and for gathering missing data in a regulatory context. In particular, such predictions are encouraged by the REACH regulation for the collection of data, provided that the models are developed in accordance with the rigorous validation principles proposed by the OECD. In this context, a series of organic peroxides, unstable chemicals which can easily decompose and may lead to explosion, were investigated to develop simple QSPR models that can be used in a regulatory framework. Only constitutional and topological descriptors were employed to obtain QSPR models predicting the heat of decomposition, so that the models can be used without any time-consuming preliminary structure calculations at the quantum chemical level. To validate the models, the original experimental dataset was divided into a training set and a validation set according to two partitioning methods, one based on the property value and the other based on the structure of the molecules by means of PCA. Four QSPR models were developed, depending on the type of descriptors and the partitioning method. The two models resulting from the PCA-based method are highlighted because they present good predictive power and are easier to apply than our previous quantum chemical based model, since they do not need any preliminary calculations.


Introduction
Organic peroxides are reactive compounds containing the -O-O- bond. They are widely employed in the chemical industry [1][2][3][4][5] because they generate radicals during their decomposition, which can be used as catalysts and as radical polymerization initiators. Nevertheless, their decomposition can be dangerous and can lead to serious effects [6][7][8]. This decomposition can be triggered by heat, mechanical shock or friction and can be caused or accelerated by various contaminants [9][10]. To reduce the risk of incidents and accidents, their hazards are intensively studied by academic laboratories, competent regulatory authorities and industrial producers.
In order to improve and harmonize knowledge of commercialized chemicals, the European Union regulation REACH (Registration, Evaluation, Authorization and Restriction of Chemicals) [11] requires the evaluation of physico-chemical, toxicological and eco-toxicological properties for all chemicals produced or imported in quantities of more than one ton per year in Europe. To help industry meet the requirements of the REACH regulation as far as the chemical safety assessment is concerned, technical guidance documents were published [12] and a general testing strategy for physico-chemical properties was proposed to define the order of testing.
Considering organic peroxides, even if they belong to a dedicated regulated division or class, the knowledge of their explosive properties, as defined by the UN recommended tests [13], can be of great importance. The thermal stability (energy and temperature of decomposition) of organic peroxides is a key point, as this property is used as a pre-selection criterion to identify substances that could present explosive properties and is part of the complex classification procedure for explosives and organic peroxides [13]. Indeed, the UN regulation indicates that there is no need to perform this complex procedure when the heat of decomposition (corresponding to the amount of energy released during the decomposition) measured by calorimetric analyses, notably by differential scanning calorimetry (DSC), is lower than 500 J/g and when the onset temperature is lower than 500 °C. For safety reasons but also for technical reasons, experimental tests can be difficult to implement for unstable substances like organic peroxides. As a consequence, predictive methods, being simple and quick, are considered a great help at the research and development stage and can help to accelerate and meet the next registration deadlines. In particular, in the REACH regulatory framework, Quantitative Structure-Activity/Property Relationships (QSAR/QSPR) were clearly recommended as alternative methods to experimental testing for obtaining data. Indeed, they represent powerful prediction tools [14], successfully used for toxicological endpoints [15] but also for physico-chemical ones [16][17].

[a] V. Prana
Full Paper
www.molinf.com

To support the development and use of QSPR models, the OECD proposed the following five principles for their validation in a regulatory context [18]: (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness and predictivity; and (5) a mechanistic interpretation, if possible. Some recent reviews [19][20][21] list the existing predictive models developed for the relevant properties of chemicals in the context of REACH.
Concerning the prediction of thermal stability, some models exist for different families of compounds such as nitroaromatics [22][23][24][25][26], nitramines [27][28][29], ionic liquids [30] or polymers [31]. In the case of organic peroxides, QSPR models were developed for the prediction of the SADT (self-accelerating decomposition temperature) [32][33][34] and for the thermal stability [35][36]. In the study of Lu [35], first MLR models were derived for the heat and temperature of decomposition from a database of only 16 organic peroxides, with neither a validation set nor a definition of the applicability domain, so that the robustness and the predictivity of these models could be improved. In the recent study of Zohari [36], correlations for predicting the decomposition onset temperature and the heat of decomposition of organic peroxides were proposed from a dataset of 41 molecules (including those of our previous work [37]), but without defining the applicability domains of these models. In a recent publication, Prana et al. proposed new QSPR models satisfying the OECD principles, based on different descriptors including quantum chemical ones [37]. From a relatively large dataset containing 38 thermal stability data of organic peroxides obtained from DSC, a new MLR model was derived for the prediction of the heat of decomposition divided by the concentration of the organic peroxide, with very high performance in terms of fitting, robustness and predictivity (R² = 0.97, Q² = 0.94 and R²_ext = 0.81). However, this model requires a preliminary determination of quantum chemical descriptors, which can be time-consuming and not necessarily straightforward.
Consequently, the aim of the present paper was to obtain new QSPR models for the heat of decomposition of organic peroxides that yield high accuracy, are validated through a complete validation process and use only simple descriptors (constitutional and topological) that need no prior quantum chemical calculations, as already done in previous work for the prediction of the heat of decomposition of nitroaromatic compounds [38] and for the impact sensitivity of nitramines [39]. Moreover, two methods of partitioning the molecules between training and validation sets (property-ranking-based or PCA-based partition) were used to derive the new QSPR models.

Experimental Dataset
The collection of experimental data is critical in any QSPR analysis, as experimental conditions can have a significant effect on the measured property, which can propagate into the performance of the developed models. For this reason, a robust database of 38 experimental values of heats of decomposition of organic peroxides (measured by Differential Scanning Calorimetry (DSC) with a few milligrams of sample in a closed stainless steel crucible and a scanning rate of 5 K/min from ambient temperature to 300 °C, see Table 1) was obtained in our previous work [37]. In this database, different types of organic peroxides are considered, such as dialkyl peroxides, diacyl peroxides, hydroperoxides, peroxyesters, peroxyketals and peroxydicarbonates. The concentrations of organic peroxides were close to 95-99 %, except for some samples which contain inert diluents (water, organic solvent, or an inert mineral). The range of heats of decomposition of these samples (441-2622 J/g) covers most of the commercial grades of organic peroxides. As the importance of the concentration of the organic peroxide for the value of the heat of decomposition was already demonstrated in our previous work [37], the modelled property was the heat of decomposition divided by the concentration.

Partition of the Dataset
In order to estimate the predictive power of our models, the experimental dataset was divided into a training set, containing two thirds of the molecules of the dataset and a validation set constituted by the remaining molecules. The training set was used to develop models and perform their internal validation while the validation set was considered for external validation to evaluate their predictive power.
In the present work, two methods of partitioning the dataset were used: one based on the distribution of the property value (the "property-ranking method") and the other based on the structure. The property-ranking method consisted in ranking the molecules by increasing property value and then selecting one molecule out of three to constitute the validation set. To avoid the presence of extreme values in the validation set, the second molecule of the ranked database was taken as the first one of the validation set. This partition gives both sets similar property distributions, allowing a robust development and validation of models. The partitioning of the dataset was also visually inspected to ensure that the validation set covered, as far as possible, the chemical diversity of the applicability domain of the model, i.e. of the molecules included in the training set. The range of the property values was also considered in order to prevent a few data points from having an excessive influence on the correlation. For example, the heat of decomposition divided by the concentration of 2,5-dimethyl-2,5-dihydroperoxy hexane was too extreme compared to the values of the other organic peroxides, so this compound was removed from the dataset. Finally, the training set contained 25 organic peroxides and the validation set contained 12.
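The property-ranking selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `property_ranking_split` and its parameters are hypothetical, the input values are placeholders rather than the experimental data, and the visual inspection step mentioned in the text is not reproduced.

```python
def property_ranking_split(values, step=3, first=1):
    """Split molecule indices into training/validation sets by property rank.

    values : property values, one per molecule
    step   : one molecule out of `step` goes to the validation set
    first  : 0-based rank of the first validation molecule (1 = second molecule)
    """
    # Rank the molecules by increasing property value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Every `step`-th ranked molecule, starting at rank `first`, is held out.
    validation = order[first::step]
    held_out = set(validation)
    training = [i for i in order if i not in held_out]
    return training, validation


# Placeholder values only (not the experimental data): 37 distinct numbers.
example_values = [400.0 + 60.0 * k for k in range(37)]
train_idx, valid_idx = property_ranking_split(example_values)
print(len(train_idx), len(valid_idx))
```

Applied to 37 molecules, this rule yields 25 training and 12 validation molecules, which matches the set sizes reported in the text, and neither the lowest nor the highest ranked molecule ends up in the validation set.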
Considering that the validation set has to cover as well as possible the chemical diversity of the applicability domain of the model, i.e. of the molecules in the training set, the second partitioning method, based on the structure of the organic peroxides, was carried out using the score plots from a Principal Component Analysis (PCA) computed with the R program [40]. In this method, the initial set of descriptors is projected onto a reduced set of new orthogonal variables, called principal components (PCs), accounting for the maximum variance of structures in the data set. The score plots represent the molecules of the data set in the new space defined by the PCs and thus reflect the variance of the data set in terms of chemical diversity. Here, the score plot based on the first two PCs, accounting for the maximum variance in the whole set of 128 descriptors, was obtained for the entire dataset (Figure 1). Then, training and validation molecules were selected so as to be homogeneously distributed over the global chemical space represented on this score plot and to favour, as far as possible, the presence of molecules of each family of organic peroxides in both sets. Finally, it was checked that the distribution of data was also homogeneous in terms of property.
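As an illustration of how such a score plot can be obtained, the sketch below computes the scores on the first two principal components with NumPy. It is not the R workflow used in this work: the function name `pca_scores` is hypothetical, and autoscaling the descriptors (zero mean, unit variance) before the decomposition is an assumption, since the preprocessing is not specified in the text.

```python
import numpy as np


def pca_scores(descriptors, n_components=2):
    """Return PCA scores and explained-variance ratios for a descriptor matrix.

    descriptors : array of shape (n_molecules, n_descriptors)
    """
    X = np.asarray(descriptors, dtype=float)
    # Autoscale each descriptor (assumption: mean-centred, unit variance).
    centred = X - X.mean(axis=0)
    std = centred.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant descriptors at zero instead of dividing by 0
    Z = centred / std
    # Singular value decomposition: the scores are U * S.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]
```

Plotting the first column of `scores` against the second gives the kind of two-dimensional map on which training and validation molecules can then be chosen so that both sets cover the same region of the descriptor space.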

Molecular Descriptors Calculation
The molecular structures of the 37 organic peroxides were characterized by a series of descriptors [41][42] that do not require any time-consuming computation:
- Constitutional descriptors, based on the identification and count of specific atoms, functional groups or bonds in molecules. This class includes parameters that only need an empirical formula (e.g. molecular weight);
- Topological descriptors, calculated from the atomic connectivity in the molecule (extracted from its 2D structure), which give information about size, composition and degree of branching (e.g. Wiener, Balaban or Randic indices).
Most of the descriptors considered in the present paper were calculated using the Codessa software [43], but additional descriptors were also taken into account for the specific case of organic peroxides and their characteristic functional groups, such as the number of peroxide bonds (n_OO). Some other descriptors successfully included in previous QSPR models were also considered, such as the oxygen balance [13,44], an empirical descriptor well known for evaluating hazards related to energetic materials.
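To show how little input such constitutional descriptors need, the sketch below computes two of them from an empirical formula CxHyOz. It rests on assumptions: the oxygen balance is taken as the commonly quoted expression OB = -1600 (2x + y/2 - z) / M, the relative number of oxygen atoms is taken as the oxygen count divided by the total atom count, and the function names are hypothetical. The exact definitions used by Codessa or in the cited references may differ.

```python
# Standard atomic masses in g/mol, rounded.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def oxygen_balance(n_c, n_h, n_o):
    """Oxygen balance (%) of CxHyOz relative to full oxidation to CO2 and H2O."""
    molar_mass = (n_c * ATOMIC_MASS["C"]
                  + n_h * ATOMIC_MASS["H"]
                  + n_o * ATOMIC_MASS["O"])
    return -1600.0 * (2 * n_c + n_h / 2 - n_o) / molar_mass


def relative_oxygen_count(n_c, n_h, n_o):
    """Number of oxygen atoms divided by the total number of atoms (C, H, O only)."""
    return n_o / (n_c + n_h + n_o)


# Illustrative formula C8H18O2 (the formula of di-tert-butyl peroxide).
print(round(oxygen_balance(8, 18, 2), 1), round(relative_oxygen_count(8, 18, 2), 3))
```

Both quantities follow from atom counts alone, which is what makes this class of descriptors usable without any structure optimization.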

Building and Validation of Models
All QSPR models developed in this work were derived through multilinear regressions on the training set using the Best Multi Linear Regression (BMLR) approach as implemented in the Codessa program [43]. This stepwise descriptor selection approach starts by reducing the number of descriptors, eliminating those which present an insignificant variance or correlation with the studied property. Then, when two descriptors are highly correlated with each other, the one presenting the lower correlation with the property is also eliminated. Next, two-parameter regressions involving orthogonal descriptors are computed. Higher-rank models are then built by adding new descriptors that are not correlated with those already present in the model, until no improvement of the model is found. Finally, the algorithm proposes, at each rank (i.e. for each number of descriptors), the model presenting the highest correlation with the studied property. The final model was chosen as the best compromise between the correlation observed within the training set and the number of descriptors, also taking into consideration the relevance of the descriptors in the model from a chemical point of view. To further ensure the relevance of the descriptors in the models, t-test values were checked at a 95 % confidence level.
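The general idea of this stepwise selection can be illustrated with a simplified greedy forward selection. This is a stand-in and not the BMLR algorithm as implemented in Codessa: the function names, the inter-correlation threshold `max_intercorr`, the stopping threshold `min_gain` and the cap `max_terms` are all hypothetical choices made for the sketch.

```python
import numpy as np


def fit_r2(X, y):
    """R2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def forward_select(X, y, max_terms=4, max_intercorr=0.8, min_gain=0.01):
    """Greedy descriptor selection; returns (selected column indices, R2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Discard descriptors with no variance.
    candidates = [j for j in range(X.shape[1]) if X[:, j].std() > 0]
    selected, best_r2 = [], 0.0
    while len(selected) < max_terms:
        best_j, best_new = None, best_r2
        for j in candidates:
            if j in selected:
                continue
            # Skip descriptors highly correlated with one already in the model.
            if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > max_intercorr
                   for k in selected):
                continue
            r2 = fit_r2(X[:, selected + [j]], y)
            if r2 > best_new:
                best_j, best_new = j, r2
        # Stop when no remaining descriptor improves the fit enough.
        if best_j is None or best_new - best_r2 < min_gain:
            break
        selected.append(best_j)
        best_r2 = best_new
    return selected, best_r2
```

Unlike this sketch, the procedure described above keeps the best model at each rank and leaves the final choice to the modeller, who also weighs the chemical relevance of the descriptors and their t-test values.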
The performance of the models was evaluated using a series of internal and external validations. The goodness of fit was measured by the determination coefficient (R²), the mean absolute error (MAE) and the root mean square error (RMSE) between predicted and experimental values. Leave-one-out (LOO) and leave-many-out (LMO) cross validations were used to evaluate the robustness of the model, i.e. the dependence of the fit on any molecule(s) of the training set, via the Q²_LOO, Q²_5CV, Q²_10CV and Q²_7CV coefficients (for LOO, 5-fold, 10-fold and 7-fold cross validations, respectively). Robust models are expected to present small differences among the different Q² values and between these and the R² coefficient. Finally, the predictive power of the models was measured by an external validation on the molecules of the validation set, based on R²_ext, RMSE_ext and MAE_ext, as done in the fitting step. Additional validation metrics were calculated: Q²_F1 proposed by Tropsha [45] and the OECD guidance document [18], Q²_F2 defined by Schuurman [46], Q²_F3 by Consonni [47] and the concordance correlation coefficient (CCC) by Lin [48][49]. QSPR models are correlative methods. As such, they can only ensure reliable predictions within their applicability domain (AD), defined by the training data set [50][51]. In this study, the AD has been defined based on the descriptors included in the model according to the Euclidean distance method available in Ambit
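The validation metrics listed above can be computed as in the following sketch, which uses the definitions commonly given for Q²_F1, Q²_F2, Q²_F3, CCC and Q²_LOO. The function names and the returned dictionary keys are hypothetical, and the exact conventions of the software used in this work may differ in detail.

```python
import numpy as np


def external_metrics(y_train, y_ext, y_pred):
    """External validation metrics for predictions y_pred of the values y_ext."""
    y_train, y_ext, y_pred = (np.asarray(a, dtype=float)
                              for a in (y_train, y_ext, y_pred))
    n_ext, n_train = len(y_ext), len(y_train)
    press = np.sum((y_ext - y_pred) ** 2)  # squared prediction errors
    ss_ext_train_mean = np.sum((y_ext - y_train.mean()) ** 2)
    ss_ext_own_mean = np.sum((y_ext - y_ext.mean()) ** 2)
    ss_train = np.sum((y_train - y_train.mean()) ** 2)
    ss_pred = np.sum((y_pred - y_pred.mean()) ** 2)
    covariance = np.sum((y_ext - y_ext.mean()) * (y_pred - y_pred.mean()))
    return {
        "Q2_F1": 1.0 - press / ss_ext_train_mean,
        "Q2_F2": 1.0 - press / ss_ext_own_mean,
        "Q2_F3": 1.0 - (press / n_ext) / (ss_train / n_train),
        "CCC": 2.0 * covariance / (ss_ext_own_mean + ss_pred
                                   + n_ext * (y_ext.mean() - y_pred.mean()) ** 2),
        "RMSE": float(np.sqrt(press / n_ext)),
        "MAE": float(np.mean(np.abs(y_ext - y_pred))),
    }


def q2_loo(X, y):
    """Leave-one-out Q2 of a multilinear regression with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([X, np.ones(len(y))])
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # refit without molecule i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)
```

Q²_F1 and Q²_F3 use the training-set mean as reference, whereas Q²_F2 uses the validation-set mean, which is why the three values can differ for the same predictions.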

Results
The present study targets the development of simpler models than those developed in a previous work that included quantum chemical descriptors [37]. To this aim, the new models focused on topological and constitutional descriptors, which only require knowledge of the 2D molecular structure of organic peroxides. Even simpler models were then sought by focusing on constitutional descriptors only, since topological descriptors require the use of a computer program (such as Codessa in the present case). In both cases, the two investigated partitions (property-ranking-based or PCA-based partition) were used. Details of the models are provided in the Supporting Information (Tables S1-S4) and their performances are summarized in Table 2.

Models Based on Topological and Constitutional Descriptors
83 topological and constitutional descriptors were computed. For the property ranking based partition, the best compromise between correlation and number of descriptors among the different models proposed by the BMLR algorithm was obtained for the four-parameter model in Eq. 1.
where n_OO is the number of -O-O- bonds, n_OOH is the number of -O-OH bonds, n_O,rel is the relative number of oxygen atoms and ¹BIC is the bonding information content of order 1.
In this equation, the most important descriptor is the number of -O-O- bonds, with an absolute t-test value of 10. Three of the four descriptors are directly linked to the presence of the peroxy bond or of oxygen, which is in accordance with the critical role of this bond in the decomposition process of organic peroxides [53][54].
where n_OO is the number of -O-O- bonds, n_OOH is the number of -O-OH bonds, OB_100 is the oxygen balance according to Kamlet [44] and ¹IC is the information content index of order 1.
Two of the four descriptors of this model were already included in Eq. 1: the numbers of -O-O- and -O-OH bonds, which again have the highest significance in the model in terms of t-test.
As shown in Figure 2, only one molecule of the validation set (tert-butyl hydroperoxide) was found to be outside the applicability domain, and this model presented, this time, very good predictive performance within its applicability domain when applied to the validation set, with R²_in = 0.89, RMSE_in = 268 J/g, Q²_F1 = Q²_F2 = 0.83, Q²_F3 = 0.81 and CCC = 0.90, i.e. even higher than the values obtained when including quantum chemical descriptors in previous work [37], as shown in Table 2.

Models Focused only on Constitutional Descriptors
To achieve even simpler models, the same analyses were performed by focusing on the 45 calculated constitutional descriptors. For the property-ranking partition, the four-parameter equation (Eq. 3) was finally chosen among the regressions proposed by the BMLR algorithm.
where n_OO is the number of -O-O- bonds, n_OOH is the number of -O-OH bonds, OB is the oxygen balance according to the TDG regulation [13] and G_b is the gravitation index considering all bonds.
Once again, the n_OO and n_OOH descriptors are directly connected to the critical role of the peroxy bond in the decomposition process of organic peroxides.
The goodness of fit of the model is high, with R² = 0.92 and RMSE = 103 J/g. Internal validations were satisfactory: all cross validations gave the same level of Q²_nCV = 0.86. It is interesting to note that, for this partition, although the model including topological descriptors (Eq. 1) presented a higher correlation in the training set (R² = 0.95 for Eq. 1 and R² = 0.92 for Eq. 3), the predictive capability of this simpler model proved higher, with R²_in = 0.80 and RMSE_in = 303 J/g. With the PCA-based partition, the obtained model (Eq. 4) is even slightly better.
$$\Delta H / C = -663\, n_{OO} - 699\, n_{OOH} - 4.79\, \mathrm{OB} + 11\, n_{single} - 2036 \qquad (4)$$
where n_OO is the number of -O-O- bonds, n_OOH is the number of -O-OH bonds, OB is the oxygen balance according to the TDG regulation [13] and n_single is the number of single bonds.
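Read as ΔH/C = -663 n_OO - 699 n_OOH - 4.79 OB + 11 n_single - 2036, Eq. 4 can be applied with a single arithmetic expression. The sketch below only illustrates that arithmetic: the function name `predict_eq4` is hypothetical, the descriptor values are invented inputs and not data from this work, and the result is assumed to be in J/g with the sign convention of the equation as printed.

```python
def predict_eq4(n_oo, n_ooh, ob, n_single):
    """Heat of decomposition divided by concentration, following Eq. 4.

    n_oo     : number of -O-O- bonds
    n_ooh    : number of -O-OH bonds
    ob       : oxygen balance
    n_single : number of single bonds
    """
    return -663.0 * n_oo - 699.0 * n_ooh - 4.79 * ob + 11.0 * n_single - 2036.0


# Hypothetical descriptor values, chosen only to illustrate the calculation.
print(predict_eq4(n_oo=1, n_ooh=0, ob=-250.0, n_single=27))
```

With these invented inputs the expression evaluates to about -1204.5. A prediction should only be trusted for molecules that fall within the applicability domain of the model.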
This model has three descriptors in common with the constitutional model obtained from the property-ranking partition. In particular, the presence of n_OO and n_OOH is in agreement with the great importance of the peroxy bond.
The correlation in the training set is similar to that of Eq. 3 (with R² = 0.89 and RMSE = 166 J/g, as shown in Figure 3). Robustness and chance correlation were checked (see Table 2). Finally, the external validation revealed a slightly better predictive power than Eq. 3, with the same R²_in = 0.80 and slightly better external metrics in terms of Q²_Fn and CCC. It should be noted that the only molecule of the validation set outside the applicability domain is the same as for Eq. 2, obtained from the same partition: tert-butyl hydroperoxide.

Discussions
In this study, four models were developed to provide easy-to-use alternatives to the model proposed in a previous work, which included quantum chemical descriptors.
It should be noted that the four new models are mainly based on the same two descriptors, the numbers of -O-O- and -O-OH bonds (n_OO and n_OOH). The inclusion of these descriptors in the models is in line with the critical importance of the peroxy bond in the decomposition process of organic peroxides, which starts with the breaking of this bond [53][54]. The specific treatment of hydroperoxides through the n_OOH descriptor is also worth noting, since it highlights the known specificity of this family among organic peroxides.
Finally, the best model obtained is Eq. 2, with an observed predictive power even slightly higher than that of the quantum chemical model of the previous work, with notably R²_in = 0.89 vs. 0.81 and RMSE_in = 268 J/g vs. 301 J/g. However, the application of this model still requires software to compute the topological index it contains (information content index of order 1).
An even simpler model is therefore proposed with Eq. 4, based on simple constitutional descriptors that can be determined directly from the 2D structural formula of the organic peroxides. Despite its simplicity, it achieves relatively accurate predictions, even if its performance is slightly lower than that of Eq. 2, with R²_in = 0.80 and RMSE_in = 311 J/g. Moreover, both models are in accordance with the OECD validation principles. They target the prediction of heats of decomposition obtained from a homogeneous protocol (Principle 1). The algorithms of the models are simple and explicitly provided in Eqs. 2 and 4 (Principle 2). These models are applicable to organic peroxides within a defined AD (Principle 3). Performances were evaluated by internal and external validation tests: evaluation of the fit for the molecules of the training set, LOO and LMO cross-validations, and external validation based on a series of external validation metrics (Principle 4). Finally, the descriptors included in the models are meaningful with regard to the predicted property (Principle 5). Indeed, n_OO and n_OOH are related to the peroxy bond, which is critical in the decomposition process of organic peroxides.

Conclusion
In this paper, new QSPR models were proposed to predict the heat of decomposition of organic peroxides. In particular, simple models were targeted that avoid time-consuming quantum chemical descriptors (in contrast to the model proposed in a previous work). For this reason, only topological and constitutional descriptors were investigated. Moreover, two different methods were used to partition the data set into training and validation sets.
Finally, two alternative models were proposed, both following all the OECD validation principles, which enables their use in a regulatory context. The best one, which includes a topological descriptor, presented an even better predictive power than the quantum chemical model on the tested validation set, but requires the use of a program to compute this topological descriptor.
An even simpler model was obtained, with slightly lower predictive capabilities but based only on constitutional descriptors that can be easily determined by visual examination of the 2D structures of the organic peroxides.
Such simple models are more accessible to industrial users and regulatory bodies, since their use and evaluation do not require any quantum chemical expertise.