Software News and Updates
Electronegativity equalization method: Parameterization and validation for organic molecules using the Merz-Kollman-Singh charge distribution scheme
Authors
Zuzana Jiroušková,
National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
The electronegativity equalization method (EEM) was developed by Mortier and coworkers [1–4] as a semiempirical method based on density functional theory (DFT) [5, 6] that can be used for the fast calculation of the charge distribution in a molecule. Because the method is semiempirical, however, the EEM must be parameterized before it can be used for atomic charge calculations.
The parameterization of the EEM is a time-consuming process that consists of several steps. First, a sufficiently large set of appropriate molecules must be chosen, and ab initio atomic charges must be calculated for them. These charges are then used to develop the EEM parameters A_{i} and B_{i} and the adjusting factor κ.
Essentially, the EEM is parameterized for a specified level of theory (e.g., HF, B3LYP), basis set (e.g., STO-3G, 3-21G, 6-31G*), and charge calculation scheme [e.g., Mulliken population analysis (MPA), Merz-Kollman-Singh (MK), charges from electrostatic potentials using a grid-based method (CHELPG)] [7]. EEM parameters are now available for different combinations of theory level, basis set, and charge calculation scheme; see, for example, Yang and Shen [8, 9], Menegon et al. [10], and Bultinck et al. [11, 12]. The EEM is also used in combination with other chemical techniques (e.g., Smirnov and van de Graaf [13], Heidler et al. [14], and others). In this article, the EEM parameters were developed according to the methodology described in detail in ref. 15, where the EEM parameterization methodology was validated on large sets of organic, organohalogen, and organometallic molecules, which contained up to 6000 molecules. The validated methodology was afterward used to improve existing and calculate additional EEM parameters for HF/STO-3G MPA charge calculations.
One goal of this article was to show that the EEM parameterization methodology, which was successfully used for the calculation of HF/STO-3G MPA charges, can also be used for the calculation of HF/6-31G* and B3LYP/6-31G* MK charges. Once this was confirmed, a further aim was to calculate the EEM parameters for HF/6-31G* and B3LYP/6-31G* MK charges. The quality of all obtained EEM parameters was carefully validated using a reference set of molecules.
Theoretical Basis
The EEM is derived from DFT and is based on three basic principles. The first is Sanderson's electronegativity equalization principle [16, 17], which states that the electronegativities of the atoms forming a molecule are equal:
\chi_{1} = \chi_{2} = \dots = \chi_{i} = \chi_{j} = \dots = \chi \qquad (1)
where χ_{i} and χ_{j} are the individual effective atomic electronegativities and χ is the molecular electronegativity.
The second one is the principle of charge balance in a molecule:
\sum_{i} q_{i} = Q \qquad (2)
where q_{i} is the atomic charge distributed on atom i and Q is the total charge of the molecule.
The third principle is the definition of the effective (charge-dependent) atomic electronegativity [1, 4], which is the core principle on which the EEM is based:
\chi_{i} = A_{i} + B_{i} q_{i} + \kappa \sum_{j \neq i} \frac{q_{j}}{R_{ij}} \qquad (3)
where χ_{i} is the effective electronegativity of atom i in the molecule, q_{i} and q_{j} are the atomic charges distributed on atoms i and j, R_{ij} is the distance between atoms i and j, and A_{i} and B_{i} represent the valence-state electronegativity and the hardness of atom i, respectively. The parameter κ is an adjusting factor. The parameters A_{i} and B_{i} are defined by eqs. (4) and (5):
A_{i} = \chi_{i}^{0} + \Delta\chi_{i} \qquad (4)
B_{i} = 2(\eta_{i}^{0} + \Delta\eta_{i}) \qquad (5)
where χ_{i}^{0} is the electronegativity of the isolated neutral atom i and η_{i}^{0} is its hardness. Δχ_{i} and Δη_{i} describe the corrections invoked by the change in size and shape of the atom in the molecule and by the influence of the surrounding molecules.
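As an illustration of how the three principles yield charges once the parameters are known, the sketch below collects the equalization condition for every atom and the charge balance into one linear system and solves it with NumPy. The function name `eem_charges`, the water geometry, and the use of ångströms for the distances are assumptions made for this example only; the parameter values are the B3LYP/6-31G* MK values for O and H with a bond order of 1 from Table 2.

```python
import numpy as np

def eem_charges(coords, A, B, kappa, total_charge=0.0):
    """Solve eqs. (1)-(3) for the charges q_i and the molecular electronegativity chi.

    coords: (n, 3) atomic positions; A, B: per-atom EEM parameters.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pairwise interatomic distances R_ij.
    R = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Off-diagonal elements kappa / R_ij. The identity is added only to keep
    # the diagonal finite; the diagonal is overwritten with B_i right after.
    block = kappa / (R + np.eye(n))
    np.fill_diagonal(block, np.asarray(B, dtype=float))
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = block
    M[:n, n] = -1.0   # the unknown chi, moved to the left-hand side of eq. (3)
    M[n, :n] = 1.0    # charge balance, eq. (2)
    rhs = np.append(-np.asarray(A, dtype=float), total_charge)
    solution = np.linalg.solve(M, rhs)
    return solution[:n], solution[n]

# Water with an ASSUMED geometry: O-H = 0.96 angstrom, H-O-H = 104.5 degrees.
half_angle = np.radians(104.5 / 2)
coords = [
    [0.0, 0.0, 0.0],                                               # O
    [0.96 * np.sin(half_angle), 0.96 * np.cos(half_angle), 0.0],   # H
    [-0.96 * np.sin(half_angle), 0.96 * np.cos(half_angle), 0.0],  # H
]
A = [2.825, 2.385, 2.385]   # Table 2, B3LYP/6-31G* MK: O (bond order 1), H, H
B = [0.844, 0.737, 0.737]
q, chi = eem_charges(coords, A, B, kappa=0.302)
print("charges:", q, "molecular electronegativity:", chi)
```

Solving one (N + 1)-dimensional linear system per molecule is what makes the method fast compared with a quantum chemical calculation; the oxygen atom comes out negative and the two hydrogen atoms carry equal positive charges, as expected by symmetry.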
EEM Parameterization
The aim of the EEM parameterization is to determine the parameters A_{i}, B_{i}, and κ for all specified atom types. To achieve this, it is necessary to know the atomic charges q_{i} of all atoms i, the molecular electronegativity χ, and the interatomic distances R_{ij}. Because eq. (1) sets every χ_{i} equal to χ, eq. (3) can then be rewritten in the form:
A_{i} + B_{i} q_{i} = \chi - \kappa \sum_{j \neq i} \frac{q_{j}}{R_{ij}} \qquad (6)
which is more suitable for EEM parameterization by least-squares minimization based on linear regression. Equation (6) can also be written in the following form:
y_{\kappa} = A_{i} + B_{i} x_{\kappa} \qquad (7)
where x_{κ} = q_{i} and y_{κ} = χ − κ Σ_{j≠i} q_{j}/R_{ij}. The values x_{κ} and y_{κ} are linked through the properly selected value of κ.
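The regression step for a single value of κ can be sketched as follows. The helper names `regression_points` and `fit_A_B` are hypothetical, and the numbers in the usage part are synthetic values chosen only to show that the fit recovers a known line; they are not data from this work.

```python
import numpy as np

def regression_points(q, coords, chi, kappa):
    """Return x_kappa = q_i and y_kappa = chi - kappa * sum_{j != i} q_j / R_ij
    for every atom of one molecule."""
    q = np.asarray(q, dtype=float)
    coords = np.asarray(coords, dtype=float)
    R = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(R, np.inf)          # q_i / inf = 0 removes the j == i term
    y = chi - kappa * (q[None, :] / R).sum(axis=1)
    return q, y

def fit_A_B(x, y):
    """Least-squares fit of the line y = A + B * x for one atom type."""
    B, A = np.polyfit(x, y, 1)           # polyfit returns the slope first
    return A, B

# Synthetic two-atom example (illustrative numbers only).
x_demo, y_demo = regression_points([0.1, -0.1], [[0, 0, 0], [1, 0, 0]],
                                   chi=2.5, kappa=0.3)

# Synthetic points lying exactly on y = 2.4 + 0.7 * x.
x_line = np.array([-0.2, 0.0, 0.1, 0.3])
A_fit, B_fit = fit_A_B(x_line, 2.4 + 0.7 * x_line)
print(A_fit, B_fit)
```

In an actual parameterization, the points from all atoms of one atom type across the whole training set would be pooled before calling `fit_A_B`, and the fit would be repeated for every tested κ.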
A brief description of the EEM parameterization process follows (for a detailed description, see ref. 15):
First, a suitable set of molecules has to be chosen. The atomic charges are then calculated using a quantum mechanical approach, and χ is taken as the harmonic mean of the electronegativities of the neutral atoms that constitute the molecule.
In the next step, the atoms of all molecules are divided into sets so that each set contains only atoms of the same element and bond order.
Then, according to eq. (7), a pair of x_{κ} and y_{κ} values is calculated for each atom in each set and for every tested value of κ. Finally, the best value of κ is chosen according to the R value.
The R value is the average of the R_{mol} values, which are calculated during the EEM parameterization for each molecule in the training set. The R_{mol} value is the R-squared value of the linear regression line fitted through the points [q_{i}(ab initio), q_{i}(EEM)], where q_{i}(ab initio) and q_{i}(EEM) are the ab initio and EEM charges of atom i, respectively. The R_{mol} value is a number between 0 and 1; the closer the R value is to 1, the better the results.
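A minimal sketch of the R value, assuming that the R-squared of a simple linear regression is computed as the squared Pearson correlation coefficient (the two are identical for a straight-line fit). The function names and the charges in the usage part are hypothetical and serve only as an illustration.

```python
import numpy as np

def r_mol(q_ab_initio, q_eem):
    """R-squared of the regression line through the points
    [q_i(ab initio), q_i(EEM)] of one molecule."""
    # Note: undefined (nan) if all charges in either list are identical.
    r = np.corrcoef(q_ab_initio, q_eem)[0, 1]
    return r ** 2

def r_value(molecules):
    """Average of R_mol over (ab initio charges, EEM charges) pairs."""
    return float(np.mean([r_mol(ab, eem) for ab, eem in molecules]))

# Illustrative charges only, not results from this work.
perfect = ([-0.4, 0.1, 0.3], [-0.8, 0.2, 0.6])     # exactly linear relation
noisy = ([-0.6, 0.2, 0.3, 0.1], [-0.5, 0.3, 0.1, 0.1])
R = r_value([perfect, noisy])
print(R)
```

Because each molecule contributes one R_{mol} regardless of its size, small and large molecules have the same weight in the averaged R value.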
Methods
Selected Sets of Molecules
For the parameterization of the EEM, molecules from the Cambridge Structural Database (CSD) [18] were chosen. The CSD is maintained by the Cambridge Crystallographic Data Centre (CCDC) and is a repository of small-molecule crystal structures that have been determined experimentally using X-ray and/or neutron diffraction.
Three sets of molecules were taken from the CSD. The first, the training set, consisted of 380 molecules containing hydrogen, oxygen, carbon, nitrogen, and sulphur atoms (the elements present in proteins) as well as some other elements, namely bromine, chlorine, fluorine, and zinc. This set was used for the parameterization and is independent of the second set, the validation set, which served as a reference set for the validation of the calculated parameters. The validation set contains 116 molecules with the same atom types (that is, the same elements and bond orders) as the training set.
To compare our EEM parameters based on the B3LYP/6-31G* MK calculations with EEM parameters from the literature [12], a third set of molecules, the comparative set, was needed. It contains only molecules whose elements and bond orders are covered by both EEM parameter sets. Specifically, the comparative set contains 111 molecules composed of hydrogen, oxygen, carbon, nitrogen, and fluorine atoms.
Details about the composition of all sets are presented in Table 1. The geometries of all molecules were stored in SDF format [19].
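Table 1. Details of the Sets of Molecules Used During EEM Parameterization and Validation

| Element | Bond order | Training set: molecules | Training set: atoms | Validation set: molecules | Validation set: atoms | Comparative set: molecules | Comparative set: atoms |
|---|---|---|---|---|---|---|---|
| Br | 1 | 56 | 77 | 19 | 24 | - | - |
| C | 1 | 342 | 2670 | 105 | 752 | 103 | 776 |
| Cl | 1 | 58 | 130 | 33 | 81 | - | - |
| F | 1 | 36 | 134 | 12 | 42 | 21 | 78 |
| H | 1 | 368 | 7401 | 112 | 2159 | 107 | 2192 |
| N | 1 | 209 | 403 | 74 | 131 | 72 | 122 |
| O | 1 | 266 | 770 | 76 | 196 | 84 | 245 |
| S | 1 | 54 | 120 | 16 | 34 | - | - |
| Zn | 1 | 41 | 52 | 15 | 17 | - | - |
| C | 2 | 361 | 4246 | 114 | 1350 | 107 | 1258 |
| N | 2 | 130 | 282 | 47 | 96 | 38 | 74 |
| O | 2 | 257 | 556 | 72 | 156 | 85 | 182 |
| Total | | 380 | 16841 | 116 | 5038 | 111 | 4927 |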
Quantum Chemistry Calculations
For the parameterization of the EEM, quantum chemistry calculations were performed to determine the partial charges on atoms. The Merz-Kollman-Singh (MK) algorithm [7, 20] was used for the calculation of these atomic charges. Because MK charges are derived from the electrostatic potential, they are known to be much less basis-set dependent than Mulliken charges.
For each set of molecules (training, validation, and comparative), two sets of charges were calculated: MK charges at the HF/6-31G* and at the B3LYP/6-31G* level of theory.
All quantum chemistry calculations were performed using the program GAUSSIAN 03 [21].
The EEM Validation Procedure
The EEM charges obtained with the EEM parameters presented in this article were validated using the program EEM_SOLVER [22]. This software was developed in our group and is freely available at http://ncbr.chemi.muni.cz/~n19n/eem_abeem.
Results and Discussion
For all molecules in the training set described in Table 1, the HF/6-31G* and B3LYP/6-31G* charges calculated using the Merz-Kollman-Singh (MK) charge distribution scheme were used to parameterize the EEM. During the EEM parameterization, the parameters A_{i}, B_{i}, and κ were obtained and further optimized using the procedure described in ref. 15.
The accuracy of the obtained parameters is expressed by the R value, which specifies the correlation between EEM and ab initio charges. The R value is a number between 0 and 1, and the closer it is to 1, the better the results.
Table 2 presents the EEM parameters A_{i}, B_{i}, and κ obtained after optimization for both B3LYP/6-31G* and HF/6-31G* MK charges.
Table 2. The EEM Parameters A, B, and κ for the Training Set with Statistical Significance Test and Standard Deviation s for All Regressions

| Element | Bond order | B3LYP/6-31G* MK charges (κ = 0.302): A | B | s | HF/6-31G* MK charges (κ = 0.227): A | B | s |
|---|---|---|---|---|---|---|---|
| Br | 1 | 2.659 ± 0.026 | 1.802 ± 0.228 | 0.051 | 2.615 ± 0.025 | 1.436 ± 0.234 | 0.060 |
| C | 1 | 2.482 ± 0.003 | 0.464 ± 0.011 | 0.076 | 2.481 ± 0.003 | 0.373 ± 0.009 | 0.075 |
| Cl | 1 | 2.519 ± 0.013 | 1.450 ± 0.220 | 0.087 | 2.517 ± 0.015 | 1.043 ± 0.215 | 0.098 |
| F | 1 | 3.577 ± 0.268 | 3.419 ± 0.972 | 0.172 | 3.991 ± 0.357 | 3.594 ± 0.945 | 0.169 |
| H | 1 | 2.385 ± 0.003 | 0.737 ± 0.017 | 0.056 | 2.357 ± 0.004 | 0.688 ± 0.016 | 0.055 |
| N | 1 | 2.595 ± 0.022 | 0.468 ± 0.039 | 0.092 | 2.585 ± 0.023 | 0.329 ± 0.031 | 0.093 |
| O | 1 | 2.825 ± 0.032 | 0.844 ± 0.058 | 0.095 | 2.870 ± 0.035 | 0.717 ± 0.051 | 0.094 |
| S | 1 | 2.452 ± 0.013 | 0.362 ± 0.047 | 0.088 | 2.450 ± 0.014 | 0.269 ± 0.035 | 0.090 |
| Zn | 1 | 2.298 ± 0.060 | 0.420 ± 0.074 | 0.063 | 2.185 ± 0.091 | 0.375 ± 0.074 | 0.064 |
| C | 2 | 2.464 ± 0.002 | 0.392 ± 0.007 | 0.072 | 2.475 ± 0.002 | 0.292 ± 0.005 | 0.071 |
| N | 2 | 2.556 ± 0.009 | 0.377 ± 0.018 | 0.068 | 2.556 ± 0.009 | 0.288 ± 0.014 | 0.067 |
| O | 2 | 2.789 ± 0.064 | 0.834 ± 0.130 | 0.094 | 2.757 ± 0.064 | 0.621 ± 0.108 | 0.096 |

T: 0.941 (B3LYP/6-31G* MK charges), 0.940 (HF/6-31G* MK charges)

The statistical significance test is calculated for α = 0.05, which means that the intervals show where each parameter lies with a probability of 95%. All values are in Pauling units.
The quality of the EEM parameters A_{i}, B_{i}, and κ was confirmed by the validation process. For the validation set, EEM charges were calculated using the EEM parameters shown in Table 2 and then compared with the ab initio charges calculated for the same set. The R value was taken as the comparison criterion; it is 0.925 for B3LYP/6-31G* MK charges and 0.926 for HF/6-31G* MK charges. These R values show that the charges calculated using the EEM are in very good agreement with the ab initio charges.
To verify our results further, we calculated the absolute differences between the EEM and ab initio charges on the training set. For this purpose, all atoms in the training set were separated into subsets according to atom type. For each atom in each subset, the absolute difference between the EEM and ab initio charges was recorded and plotted in a histogram.
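The bookkeeping behind such histograms can be sketched as follows. The function names, the bin settings, and the charges in the usage part are assumptions made for illustration; they do not reproduce the data or the binning used for the figures.

```python
from collections import defaultdict
import numpy as np

def abs_diff_by_type(atom_types, q_ab_initio, q_eem):
    """Group |q(EEM) - q(ab initio)| by atom type (element, bond order)."""
    groups = defaultdict(list)
    for atom_type, q_ab, q_e in zip(atom_types, q_ab_initio, q_eem):
        groups[atom_type].append(abs(q_e - q_ab))
    return {t: np.array(v) for t, v in groups.items()}

def difference_histogram(diffs, n_bins=10, max_diff=0.5):
    """Count differences per bin; values above max_diff fall into the last bin."""
    clipped = np.clip(diffs, 0.0, max_diff)
    counts, edges = np.histogram(clipped, bins=n_bins, range=(0.0, max_diff))
    return counts, edges

# Illustrative charges only, not results from this work.
types = [("H", 1), ("H", 1), ("O", 1)]
q_ab = [0.30, 0.35, -0.65]
q_eem = [0.32, 0.31, -0.57]
diffs = abs_diff_by_type(types, q_ab, q_eem)
counts_H, edges = difference_histogram(diffs[("H", 1)])
counts_O, _ = difference_histogram(diffs[("O", 1)])
print(counts_H, counts_O)
```

Keying the subsets by (element, bond order) mirrors the atom types of Table 1, so one histogram is obtained per row of that table.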
The histograms show that the EEM charges obtained using EEM parameters derived from B3LYP/6-31G* and HF/6-31G* MK charges are similar. A more detailed view shows a higher accuracy of the EEM charges based on the B3LYP/6-31G* MK calculations.
The best agreement between the EEM and ab initio charges is observed for hydrogen atoms (see Fig. 1a). This is because hydrogen is the most numerous atom type in the training set (7401 of 16841 atoms, Table 1); the more atoms of a given type the set contains, the more accurate the results are.
The worst agreement between the EEM and ab initio charges is observed for nitrogen atoms with a bond order of 1 (see Supp. Info.). This is due to the fact that some of the nitrogens, which were labeled as nitrogens with a bond order of 1, are in fact conjugated.
The histogram in Figure 1b shows that the agreement between the EEM and ab initio charges for zinc atoms with a bond order of 1 is also very good, even though zinc is a metal that may exhibit significant charge transfer.
Histograms obtained for other atom types are shown in the Supporting Information.
Another interesting quantity is the R(EEM) value, which expresses the correlation between the EEM charges obtained with the HF/6-31G* MK parameters and those obtained with the B3LYP/6-31G* MK parameters. Like the R value, it lies between 0 and 1, and the closer it is to 1, the better the agreement. For the training set, the R(EEM) value is equal to 0.991.
The accuracy of our EEM parameters based on the B3LYP/6-31G* MK calculations was compared with that of the EEM parameters published by Bultinck et al. in ref. 12, using the R value as the comparison criterion. Because the κ value is not specified in ref. 12, it had to be determined using our methodology (for a detailed description, see ref. 15). After κ was found, we calculated two sets of EEM charges for the comparative set of molecules and determined their R values: the first set with the EEM parameters from the literature and the second with our EEM parameters. The R values obtained for the first and second set were 0.940 and 0.956, respectively, which shows a better correlation for our parameters than for those from the literature.
Conclusions
In this work, we have parameterized the EEM for the Merz-Kollman-Singh charge distribution scheme at the B3LYP/6-31G* and HF/6-31G* levels of theory, obtaining EEM parameters for hydrogen, carbon, oxygen, nitrogen, sulphur, fluorine, chlorine, bromine, and zinc atoms.
As a training set, we used organic molecules with experimentally determined structures from the CSD database of crystallographic structures. The EEM parameters were calibrated according to a methodology that had been carefully validated on large sets of organic, organohalogen, and organometallic molecules. We have shown that the EEM parameterization methodology, which was successful for HF/STO-3G MPA charges, can also be successfully applied to HF/6-31G* and B3LYP/6-31G* MK charges.
The obtained EEM parameters were carefully verified by a validation procedure on a different set of molecules. In addition, the absolute differences between the EEM and ab initio charges were analyzed and presented as a set of histograms. Our EEM parameters were also compared with the EEM parameters published in the literature, and it was shown that our EEM parameters provide more accurate results.
The results presented in this article show that the EEM charges obtained using our EEM parameters are in very good agreement with ab initio charges. For B3LYP/6-31G* MK charges, our parameters are more accurate than those described in the literature; for HF/6-31G* MK charges, no EEM parameters have been published so far.
The EEM parameters published as a part of this article can be used directly for charge calculations with the program EEM_SOLVER [22], which is freely available at http://ncbr.chemi.muni.cz/~n19n/eem_abeem.
Acknowledgements
The authors thank the Supercomputing Centre in Brno, Czech Republic, for providing access to computer facilities.