Interpretable Fragment‐Based Molecule Design with Self‐Learning Entropic Population Annealing

Self‐learning entropic population annealing (SLEPA) is a recently developed method used for achieving interpretable black‐box optimization via density‐of‐states estimation. Applying SLEPA to a chemical space is not straightforward, however, because of its dependence on Markov chain Monte Carlo sampling in the space of generated entities. Herein, SLEPA is applied to optimal molecule generation by combining an irreducible Markov chain in the space of fragment multisets and a probabilistic fragment assembler such as MoLeR. The weighted samples from SLEPA are used to identify salient fragments for the highest occupied molecular orbitals‐lowest unoccupied molecular orbitals (HOMO‐LUMO) gap maximization and minimization, and the relationship between the identified fragments and the electronic structures is elucidated. This approach offers a viable platform to reconcile the incompatible goals of optimization and interpretation during molecular design.


Introduction
Although molecular design by molecular fragment assembly is a conventional topic of computational chemistry, [1,2] it has gained renewed interest, because molecules designed by fragment assembly are amenable to robotic synthesis.For example, Wu et al. built a robotic system to assemble molecular fragments using Suzuki-Miyaura coupling and reported the successful design of organic laser materials. [3]By assuming an oracle that returns the properties of a given molecule (i.e., simulations or real-world experiments), black-box optimization algorithms [4] such as Bayesian optimization [5] and evolutionary algorithms [6] can be applied to optimize the generated molecules.However, in addition to optimization, scientists are often interested in identifying salient fragments related to advantageous properties.To this end, the molecules generated by black-box optimization have been analyzed statistically to discover new property structure relationships. [7]However, such an analysis may not always be reliable, because black-box optimizers are biased samplers. [8]elf-learning entropic population annealing (SLEPA) is a recently developed blackbox optimization method that allows for consistent statistical analyses. [8]Although this method has been successfully applied to the optimization and analysis of biological sequences, it poses challenges for application in molecular fragment assembly owing to its dependency on Markov chain Monte Carlo (MCMC) sampling, [9] which requires an irreducible Markov chain in the space of generated entities.In other words, a set of possible moves must be defined such that any point in the space can be reached from any other point.The space of all molecules formed by fragments is difficult to characterize, thus rendering difficult the design of suitable Monte Carlo moves that enable provably irreducible Markov chains.In this study, we followed a SLEPA approach for molecule generation by considering the space of fragment multisets to perform MCMC sampling.The energy function of a fragment multiset is defined as the expected property with respect to a probabilistic fragment assembler (PFA) such as MoLeR. [10]The proposed algorithm is called MolSLEPA.MolSLEPA creates weighted samples such that the property distribution estimated from them is identical to the density of states, i.e., the distribution obtained from brute-force enumeration (Figure 1).Using the weights, we obtain the joint distribution between the property and a descriptor of interest consistently.
The highest occupied and the lowest unoccupied molecular orbitals (HOMO and LUMO) are of special interest in materials science. [10]Their molecular orbital energies and the difference between them (i.e., HOMO-LUMO gap) are related to excitation energies, stability, polarizability, and chemical reactions.Modulation of HOMO-LUMO gaps is essential in the development of light-emission diodes, photovoltaic solar panels, and electrochromic devices. [10]In this study, MolSLEPA is applied to optimization problems of the HOMO-LUMO gap, where it is both minimized and maximized.
We demonstrate that MolSLEPA offers higher accuracy in estimating the property distribution compared with simple random DOI: 10.1002/aisy.202300189Self-learning entropic population annealing (SLEPA) is a recently developed method used for achieving interpretable black-box optimization via density-ofstates estimation.Applying SLEPA to a chemical space is not straightforward, however, because of its dependence on Markov chain Monte Carlo sampling in the space of generated entities.Herein, SLEPA is applied to optimal molecule generation by combining an irreducible Markov chain in the space of fragment multisets and a probabilistic fragment assembler such as MoLeR.The weighted samples from SLEPA are used to identify salient fragments for the highest occupied molecular orbitals-lowest unoccupied molecular orbitals (HOMO-LUMO) gap maximization and minimization, and the relationship between the identified fragments and the electronic structures is elucidated.This approach offers a viable platform to reconcile the incompatible goals of optimization and interpretation during molecular design.sampling.At the same time, the optimization performance of MolSLEPA is on par with evolutionary strategy, [6] a standard evolutionary algorithm.Using the sample weights, salient fragments in top molecules are identified in both HOMO-LUMO gap minimization and maximization.The fragments associated with minimization and maximization showed completely different characteristics in terms of atomic hybridization states and structural features and were consistent mostly with existing chemical knowledge.For example, the fragments associated with narrow HOMO-LUMO gaps are found to have sp 2 hybridized atoms, leading to high photoactivity. [11,12]Our results imply that MolSLEPA is a viable optimization method that offers a reasonable way of interpretation.

MolSLEPA Overview
Assume an oracle OðmÞ that returns the property of molecule m.Without loss of generality, a smaller value of OðmÞ is regarded as more favorable.Let [a,b] denote the set of integers ranging from a to b.A point of the fragment-based chemical space (FCS) is a multiset of fragments represented by a vector of fragment counts x, where n is the total number of fragments and a molecule is made by assembling K fragments.The transition between any two points in this space can be defined by a series of fragment replacements.A fragment replacement is realized by subtracting one from the count of the deleted fragment and adding one to that of the added fragment.Thus, an irreducible Markov chain needed in MCMC sampling can easily be realized in an FCS, unlike in the space of molecular graphs.Given that a fragment multiset can correspond to multiple molecules, we define the expected property of x using a probabilistic fragment assembler Pðm j xÞ as follows In experiments, it must be approximated as the mean over α samples from PFA, m 1 ðxÞ, : : : , m α ðxÞ MolSLEPA creates a set of weighted samples using a probabilistic fragment assembler (PFA) such that the estimated property distribution is identical to that from brute-force enumeration (i.e., density of states).
The density of states (DoS) nðeÞ is the histogram of property distribution between e min and e max .Denoted by E the set of ν evenly spaced points between e min and e max .Let e α ðxÞ be approximated by the closest point in E. DoS can then be computed exactly as the fraction of points in X whose mean property is where Ið⋅Þ is the indicator function that is one, if the given condition is satisfied, and zero otherwise.Salient fragments are those overpresented in molecules with favorable properties.
The saliency of a fragment j is defined as the occurrence probability in favorable multisets where t is the user-defined threshold.Although ranking all fragments by the saliency can help users identify important fragments, exact computation is computationally infeasible for large K.
MolSLEPA generates a set of weighted samples ðx i , w i Þ N i¼1 .The samples are generated such that they are overrepresented in high-property regions in comparison to uniform sampling.The weights are determined such that the DoS is approximated as These weights allow us to consistently estimate a variety of statistics.For example, the saliency of fragment j is approximated as MolSLEPA maintains M samples (i.e., a set of multisets) and updated τ À 1 times to yield N ¼ Mτ samples.The number of oracle calls is Nα.It also uses a machine learning surrogate model for predicting e α ðxÞ.See Method section for details.

HOMO-LUMO Gap Optimization
We applied MolSLEPA to design molecules with an optimized HOMO-LUMO gap. Figure 2 illustrates the 30 fragments used in our experiments.First, 100 frequently occurring fragments in ZINC15 [13] were obtained using Python script preprocess.py in MoLeR package.Then, Max-Min algorithm [14] was applied to select maximally diverse 30 fragments.Molecules are constructed by assembling four fragments (K ¼ 4).As the oracle, the message passing neural network [15] implemented in the Chemprop GitHub repository [16] is used to predict the HOMO-LUMO gap.It is trained by the PubChemQC PM6 dataset [17] containing about 2.6 million molecules.The dataset was split to training, validation, and test sets in an 8:1:1 ratio.Chemprop has the following hyperparameters: hidden size, depth, dropout, and the number of feed-forward layers.They were optimized via Bayesian optimization using hyperopt package. [18]The root mean squared error for the test set was 0.16 eV.
MoLeR is a graph neural network that maps an input vector to a molecule. [10]The code is provided by the GitHub repository. [19]lthough it can be used for fragment assembly or atom-by-atom generation, here we use it as a probabilistic fragment assembler by feeding random vectors as the input.The mean property (3) was computed using ten MoLeR samples, i.e., α ¼ 10.Note that MoLeR was trained with 5.7 million molecules from ZINC15. [13]olSLEPA's population size was set to M ¼ 100 and the inverse temperature was changed from β 1 ¼ 0 to β 19 ¼ β max with equal intervals in-between.The surrogate model was a Gaussian process regression [20] defined on the 512-dimensional latent vectors of MoLeR. [10]Per one MolSLEPA run, the number of generated samples was 2,000 and the number of oracle calls was 20 000.
MolSLEPA is evaluated in terms of optimization efficiency and DoS estimation.Optimization efficiency is measured by how quickly the property of the best sample improves as the sampling proceeds.To measure the accuracy of the DoS estimation, we conducted a brute-force enumeration of all possible 40 920 multisets.The estimated DoS, nðeÞ, was compared with the one by brute-force, n Ã ðeÞ, based on the Hellinger distance     where t is the user-defined threshold.By setting t to a small value, it is capable of comparing distribution tails only.This was compared with random selection of fragments, denoted as "random," and ðμ, λÞ-evolutionary strategy, [6] denoted as "ES."Here, λ is the size of population and μ is number of selected elites at each iteration.For a fair comparison, the population size was set to 100, 20 iterations are applied, and the fitness function is derived from 10 MoLeR samples.μ is set to 10.We applied MolSLEPA to minimize and maximize the HOMO-LUMO gap. Figure 3 summarizes the results of HOMO-LUMO gap minimization (β max ¼ 5).The progress of the minimum HOMO-LUMO gap min i∈½1,N e α ðx i Þ as a function of the number of samples N is shown in Figure 3a.The means and standard deviations over 10 attempts are plotted.SLEPA performed significantly better than random; however, ES showed better efficiency.Figure 3b shows the accuracy of DoS estimation from 2,000 samples at different thresholds, defined in terms of percentiles.At a wide range of thresholds, SLEPA achieved better accuracy compared with random and ES.These results show that SLEPA achieves efficient optimization and accurate DoS estimation at the same time, while ES focuses purely on optimization.Figure 3c shows the saliency of fragments derived at two different thresholds (1 and 0.1 percentiles).The best estimate among the 10 attempts using SLEPA, random, and ES are plotted.The fragments are sorted according to the saliency derived by the brute-force method.SLEPA exhibited the best match with the brute-force results, while the saliencies of ES are widely different, suggesting that the biased samples obtained using black-box optimization may produce misleading results.

S H O O
Figure 4 shows the results of HOMO-LUMO gap maximization (β max ¼ 3).Similar trends were observed in the optimization and DoS estimation performance, whereas the salient fragments are completely different from those in the minimization.

Salient Fragments
Table 1 lists the most salient fragments identified by MolSLEPA.The salient fragments for wide HOLO-LUMO gaps consist entirely of sp 3 -hybridized atoms, whereas those for narrow HOMO-LUMO gaps consist entirely of sp 2 -hybridized atoms.The hybridization state of an atom in a molecule is known to affect the HOMO-LUMO gap. [21]In general, molecules including sp 2 -hybridized atoms tend to have a smaller HOMO-LUMO gap than those composed only of sp 3 -hybridized atoms, because the former can form π conjugate systems.
Our results indicate the influence of ring-type fragments on the HOMO-LUMO gap.Strained rings, such as cyclopropane or cyclobutane, have nonplanar structures that can cause significant perturbations in the electronic structure of molecules.These perturbations result in the deviation in the bond angles and bond lengths from their standard values.Ring strain destabilizes the molecular orbitals, leading to an increase in the energy required to promote electrons from HOMO to LUMO.
MolSLEPA assigns a high saliency to fragments with π-bond in a narrow HOMO-LUMO gap.Multibond systems, including π-bonds, have typically narrower HOMO-LUMO gaps than σ-bond systems, owing to the orbital splitting energy in the formation π-bond being smaller than that of the σ-bond.However, pure carbon conjugation fragments were assigned a low saliency for narrow HOMO-LUMO gaps, and aromatic rings including hetero atoms (N, S) were deemed more salient.
Figure 5 illustrates the MolSLEPA-generated molecules that exhibited the widest and narrowest HOMO-LUMO gaps.Fluorides are known as stable molecules with a wide HOMO-LUMO gap.In particular, the three molecules with the widest HOMO-LUMO gaps belong to the family of α-ω dihydroperfluoropolyethers (HFPE), and have potential applications in fire extinguishing agents and heat exchange fluids owing to their low melting points, low viscosity, high stability, and environmental compatibility. [22,23]The first molecule has already been used in cleaning and coating applications, owing to its unique chemical stability. [24]Accordingly, the third molecule, which is nonflammable and environmentally friendly, has been used as a foaming agent material. [25]n contrast to the wide HOMO-LUMO gap molecules, MolSLEPA generates π-bond rich molecules for a narrow HOMO-LUMO gap.Furthermore, the narrowest HOMO-LUMO gap molecules had a σ-, β-diketone, furan, and so forth, which are  characterized by nonbonding orbitals.Hence, MolSLEPA seems to narrow the HOMO-LUMO gap by inserting nonbonding orbitals between π and π* orbitals.See Supporting Information for an additional analysis about the orbitals.A similar tendency has been observed in other molecule generation studies for longwavelength light absorption. [11,12]Generally, a narrow HOMO-LUMO gap indicates high photoactivity.For example, diketone derivatives have traditionally played an important role in photochemistry and are commonly used as photoreaction initiators. [26]

Method
MolSLEPA employs the surrogate model ẽα ðxÞ, and the following Boltzmann distribution where β is the inverse temperature, and f is the free energy We can sample from P β ðxÞ using the Metropolis sampling method. [9]In one step of the Metropolis sampling process (Figure 6a), sample x is randomly perturbed to x 0 by random fragment replacement and it is accepted with probability minð1, P β ðx 0 Þ=P β ðxÞÞ (11)   If rejected, the sample remains the same.Additionally, MolSLEPA uses resampling [27] as a key technique (Figure 6b).Let x 1 , : : : , x M are the samples subject to distribution pðxÞ.To transform this sample set to one that is subject to another distribution p 0 ðxÞ, the probability of k-th samples is defined as π k ∝ p 0 ðxÞ=pðxÞ (12)   and the set was updated by drawing M samples with replacements.
The algorithm of MolSLEPA is shown in Algorithm 1.Samples at multiple inverse temperatures β i are obtained through population annealing. [27]Starting from a small β, M samples are updated by 50 Metropolis steps.The inverse temperature is increased in a step-by-step manner.Whenever the inverse temperature increases, resampling is applied to the sample set to adapt to the new temperature.Simultaneously, e α ðxÞ is computed for all samples using the oracle and fragment assembler, which is then used for training the surrogate model.
Then, we obtain τ sets of M samples that possess both ẽα ðxÞ and e α ðxÞ.Each sample set was resampled to adapt to the true energy e ðxÞ.Using the resampled sets, the multiple histogram method [28] was used to estimate the density of states.The interval endpoints e min , e max were determined by minimum and maximum values found so far, respectively.The number of grid points ν was set to 50.Let h i ðeÞ denotes the histograms obtained from the samples at β i .By assuming that the histogram counts are subject to the Poisson distribution, the sample weights for temperature β i are derived as follows See Li et al. [8] for details regarding the derivation process.The free energies f i and DoS nðeÞ were obtained from histograms h i ðeÞ by alternating the following two steps until convergence is achieved.
Finally, the weight r i ðeÞ was assigned to all samples.

Conclusion
We have demonstrated how MolSLEPA can be successfully applied to optimal fragment assembly.Since the use of fragments results in a limited chemical space, de novo generators such as JTVAE [29] and ChemTS [30] would work better in finding molecules with optimized properties.DoS estimation with de novo generators could be possible, but it would need a more intricate algorithm and substantially more samples.The oracle used in this study is based on a neural network, but it can be replaced with quantum chemical calculations for better accuracy.
MolSLEPA requires a relatively large number of samples, so it may be difficult to use with low-throughput experimental systems.High throughput systems based on robotics [3] might enable the application of MolSLEPA to diverse properties.Our results depend on a deep learning-based fragment assembler, MoLeR.The use of MoLeR might introduce unwanted bias in The evolution strategy is a rather simple optimization algorithm.In term of optimization performance, more sophisticated algorithms can outperform MolSLEPA in a wider margin.In our experience, DoS estimation requires more samples than optimization.It would be an interesting research topic to investigate how to achieve both of goals in a more efficient manner.
Interpretable machine learning methods such LIME [31] and SHAP [32] can identify salient descriptors, providing a dataset and a prediction model. [33]Because their results depend on the dataset, biased datasets generated by black-box optimization are harmful to these methods.Weighted samples by MolSLEPA may also be used to remove bias for such methods.
The current implementation of MolSLEPA depends on classical Metropolis sampling, but one can also take advantage of recent efficient samplers such as locally balanced samplers. [34]n addition, it is possible to employ other methods for DoS estimation such as the replica-exchange Monte Carlo method, [35] broad histogram, [36] and Wang-Landau method. [37]We hope that MolSLEPA serves as a basis of robotic synthesis systems that can generate useful chemical knowledge.

Figure 2 .
Figure 2. Fragment vocabulary in our experiments.SMILES representation of each fragment is also shown.

Figure 3 .
Figure 3. Optimization and DoS estimation performances in HOMO-LUMO gap minimization.a) The best HOMO-LUMO gap against the number of samples.b) The accuracy of the density-of-states estimation in terms of the Hellinger distance at various thresholds.The number of samples of all the methods is 2000.c) The saliency of fragments at different thresholds.The top and bottom panels correspond to the thresholds of 1 and 0.1 percentiles.The fragments are sorted according to the saliencies of the brute-force method.

Figure 4 .
Figure 4. Optimization and DoS estimation performances in HOMO-LUMO gap maximization.a) The best HOMO-LUMO gap against the number of samples.b) The accuracy of the density-of-states estimation in terms of the Hellinger distance at various thresholds.The number of samples of all the methods is 2000.c) The saliency of fragments at different thresholds.The top and bottom panels correspond to the thresholds of 1 and 0.1 percentiles.The fragments are sorted according to the saliencies of the brute-force method.