Machine Learning for Screening Large Organic Molecules

Organic semiconductors are promising materials for cheap, scalable and sustainable electronics, light-emitting diodes and photovoltaics. For organic photovoltaic cells, it is a challenge to find compounds with suitable properties in the vast chemical compound space. For example, the ionization energy should match the optical spectrum of sunlight, and the energy levels must allow efficient charge transport. Here, a machine-learning model is developed for rapidly and accurately estimating the HOMO and LUMO energies of a given molecular structure. It is built upon the SchNet model (Schütt et al. (2018)) and augmented with a `Set2Set' readout module (Vinyals et al. (2016)). The Set2Set module has more expressive power than sum and average aggregation and is more suitable for the complex quantities under consideration. Most previous models have been trained and evaluated on rather small molecules. Therefore, the second contribution is extending the scope of machine-learning methods by adding larger molecules from other sources and establishing a consistent train/validation/test split. As a third contribution, we introduce a multitask ansatz to resolve the problem of different sources coming at different levels of theory. All three contributions in conjunction bring the accuracy of the model close to chemical accuracy.


Figure 1: A machine learning (ML) algorithm that estimates molecular properties at very low cost, embedded in the big picture of a materials discovery scheme.
Machine learning in materials discovery for OPV. Chemical compound space is huge, so it is prohibitively expensive to determine the desired properties of each compound with ab-initio methods. This is where machine learning (ML) steps in: as a data-driven approach, it promises to speed up the search, for example by substituting expensive DFT calculations with statistical models that estimate chemical properties at a fraction of the numerical cost. Figure 1 shows a scheme where a machine-learning algorithm is used as a filter, and only those candidates that pass the ML filter are processed in the subsequent steps (ab-initio calculations and, where applicable, synthesis and characterization in the laboratory).

Machine Learning for Quantum Chemistry
Bridging the gap between molecular properties and the performance of an OPV device. A number of efforts have been made to relate the performance of an OPV device to the elemental properties of the materials. The Scharber model [9] relates the power-conversion efficiency to the donor band gap (under some assumptions on the acceptor and electrode materials). Padula et al. [10, 11] attempted to go beyond that with a machine-learning model taking both electronic and structural features as input.
Goal and context of the present work. In the present work, we take an intermediate step and develop a machine-learning model that reliably estimates the HOMO and LUMO energies of a given molecular structure.
The wider context in a materials-discovery scheme is represented in Fig. 1. The task of the ML filter is to provide estimates of the molecular properties (currently the HOMO and LUMO energies) in order to identify the "good" candidates from a pool of molecules. Beyond the scope of this article, the good candidates would be further investigated, first by means of ab-initio (usually DFT) calculations, where again some candidates will be discarded and others pass on to the next stage, experiments in the lab, eventually leading to a better product. Along the way, the knowledge obtained from negative results should not be thrown away, but rather stored in a database in order to be used as training data for future development of the ML algorithm. Another aspect is applying a genetic algorithm that generates new candidate molecules by crossover and mutation of good candidates and feeds them back into the pool of molecules.

Short review of algorithms for the prediction of molecular properties.
There is a long history of advances in applying ML to the prediction of chemical properties. Early approaches used standard ML methods on top of fixed-length molecular fingerprints [12]. But molecular fingerprints are hand-crafted and not invertible.
Duvenaud et al. [13], as well as Kearnes et al. [14], integrated the fingerprint feature extraction into the deep neural network, which resulted in graph convolutional networks that operate on the vertices (atoms) and edges (bonds) of a molecular graph. This was developed further by Schütt et al. [15], who added continuous-filter convolutions, and by Chen et al., who developed MEGNet [16].
Refs. [17, 18] are benchmarks of state-of-the-art models on QM9 data and have shown that accuracies better than the so-called chemical accuracy (1 kcal/mol = 0.043 eV) can be reached with models like enn-s2s, SchNet, and SchNetE on a dataset of small molecules (for details on QM9, see the following Section 2.1).
For the real-world application of materials discovery for organic photovoltaics, one has to go beyond the scope of the QM9 dataset. The contributions of this article are the development of a relevant data corpus (Section 2) and of a suitable architecture for predicting the HOMO and LUMO energies of large organic molecules (Section 3), and finally the empirical validation (Section 4).

Data Sources
Several datasets containing significant numbers of molecular structures and properties are already publicly available [19, 20, 21, 22, 23]. These sets consist of pairs (x, y), where x is the point cloud of the atoms in space, i.e., a set of quadruples representing the coordinates and the atomic charge of each atom, and y is a set of scalar properties. The geometries and properties are the outcome of DFT calculations at a given level of theory. The properties y include, in particular, the energies of the molecular frontier orbitals HOMO and LUMO, which are proxies for the ionization energy and the electron affinity of the respective molecule.
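To make this data layout concrete, a single (x, y) pair can be represented with the ASE package, to which the datasets are converted (see Appendix A.2.2). This is a minimal sketch; the coordinates and property values below are made-up placeholders, not actual dataset entries:

```python
from ase import Atoms

# x: the point cloud of atoms, i.e. one (Z, x, y, z) quadruple per atom.
x = Atoms(numbers=[6, 8, 1, 1],             # atomic charges Z (here: C, O, H, H)
          positions=[[0.00, 0.00, 0.00],    # coordinates in Angstrom (placeholders)
                     [1.21, 0.00, 0.00],
                     [-0.55, 0.94, 0.00],
                     [-0.55, -0.94, 0.00]])

# y: scalar DFT properties of the molecule (placeholder values, in eV).
y = {"homo": -6.3, "lumo": -1.9, "gap": 4.4}
```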
• The QM9 dataset [19] contains (presumably) all molecules of up to 9 "heavy" atoms (C, N, O, F) plus hydrogens, some 130k structures. The data was obtained at the B3LYP/6-31G(2df,p) level of theory. This dataset is commonly used in many theory papers [15, 16, 17, 18].
• In 2019, Chen et al. released a dataset of 200k molecules of up to 12 heavy atoms (C, N, O, F, S, and Cl) within the scope of the "Alchemy" contest [20]. The level of theory and the set of properties are the same as for QM9, but the dataset extends to slightly larger molecules and a larger variety of chemical elements.
• Stuke et al. released the "OE62" dataset containing 62k structures, including larger ones [21]. The initial structures were extracted from organic crystals in the Cambridge Structural Database (CSD), and the molecular geometries and properties were determined via DFT. The chemical composition is much wider than that of the other sources, ranging from hydrogen to iodine (H, Li, B, C, N, O, F, Si, P, S, Cl, As, Se, Br, Te, I).
• The HOPV dataset contains 350 different molecules (5k conformers), which are large and relevant for organic photovoltaics [22]. The HOPV set contains the elements H, C, N, O, F, Si, S, and Se.
• Kuzmich et al. published a set of 80 structures that are relevant for organic photovoltaics [23]. Two molecules were discarded because of ambiguous labeling. Since this set is small, we use it for testing only. The variety of chemical elements in the Kuzmich set is that of HOPV plus B and Cl, which otherwise occur only in OE62.
Table 1 summarizes the properties of the different data sources. In particular, we will focus on the subsets marked in bold face and group them into (i) the "B3LYP" exchange-correlation functional with a reasonably large basis set (QM9, Alchemy, HOPV) and (ii) the "PBE0" exchange-correlation functional with a reasonably large basis set (OE62, HOPV). Ideally, a sufficiently "good" level of theory would produce data close to the real values. But in practice, there may be systematic shifts between data obtained at different levels of theory (Figure 2). We will address this problem with a model that makes separate predictions for B3LYP and PBE0; see Sec. 3.1 below.

Train/Validation/Test Split
In machine learning, the available data is commonly split into three parts: (i) the training data for fitting the free parameters of the model, (ii) the validation data used to make decisions, e.g., when to stop the training process, and (iii) the test set to finally evaluate the accuracy of the model. For a fair evaluation, it is important that the three subsets are disjoint; in particular, the test set shall measure the ability of the model to generalize to data that was not "seen" during training. Since the same structures may occur in several data sources, care has to be taken that the train/validation/test split is consistent across all sources. For example, the molecule C6H3NO2 with the molecular graph shown in Figure 3 occurs both in QM9 and in Alchemy, as do some 18 000 other molecules.
The international chemical identifier (InChI) is a unique identifier of a molecule [30] with a layered structure from coarse to fine information. This opens a convenient way of grouping similar structures together, namely structures that coincide in the first InChI layers. For example, if only the first two sublayers of the standard InChI code are used, the chemical formula and the backbone structure are fixed, but the locations of hydrogens, the charge information, and the stereochemical information, as well as the isotopic information, may differ.
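For illustration, truncating a standard InChI after these two sublayers amounts to splitting at the layer separator "/". The following is a minimal RDKit sketch; the helper truncate_inchi is our own name for this illustration, not part of RDKit or necessarily of the actual data-preparation scripts:

```python
from rdkit import Chem

def truncate_inchi(inchi: str, n_sublayers: int = 2) -> str:
    """Keep the 'InChI=1S' prefix plus the first n_sublayers layers,
    i.e. the chemical formula and the 'c' connectivity layer for n_sublayers=2."""
    parts = inchi.split("/")                 # layers are separated by '/'
    return "/".join(parts[:n_sublayers + 1])

inchi = Chem.MolToInchi(Chem.MolFromSmiles("c1ccccc1O"))  # phenol, as an example
print(truncate_inchi(inchi))  # formula + backbone; hydrogen/charge/stereo layers dropped
```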
The split used in this work is constructed according to the following rules:
1. Perform the split in terms of InChIs truncated after the second layer (molecules that differ only in later layers are assigned to the same part of the split by construction).
2. Assign the test set of the Alchemy contest to test.
3. Assign the original Alchemy validation set to validation, except for those structures whose truncated InChIs are already in the test set (2% of the truncated InChIs); these go to test.
4. Assign the rest of the Alchemy data such that 18% and 9% of the truncated InChIs are assigned to test and validation, respectively; the rest is assigned to train.
5. Subsequently assign the truncated InChIs of QM9, OE62, and HOPV in the same manner: truncated InChIs that have occurred previously keep their earlier assignment; the rest is filled up such that the test and validation fractions are 18% and 9%, respectively.
Table 2 collects the sizes of the different data sets and their intersections in terms of truncated InChIs. Finally, Table 3 collects the sizes of the train, validation, and test parts that result from the procedure described above.

Model Architecture
The machine-learning model does not question the training data; it just learns to imitate the results. As mentioned above (Figure 2), there may be systematic offsets between data obtained at one or the other level of theory.
Our approach to dealing with these systematic shifts between different data sources is to let the ML model make separate predictions for the B3LYP values and the PBE0 values of the frontier orbital energies.

Multitask Approach
The idea of multitask learning [31] is to train multiple target values in the same model. The model has a common part and then splits into different branches, one for each task. Here, the tasks are the B3LYP estimates and the PBE0 estimates. Figure 4 depicts two example architectures with a single input and multiple outputs (more details are given below in Sections 3.2 and 3.3).
A given item from the training set may or may not have target values for all tasks. For those tasks where targets are available, errors are computed in the loss function and backpropagated through the network. Notably, all tasks contribute to the training of the common part of the network and help to learn a meaningful representation of the input. Since the tasks are correlated, it is expected that they benefit from each other.
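A minimal sketch of such a masked loss in PyTorch follows. This is our own illustration, assuming squared error and dict-valued batches; the function and variable names are ours and the actual maltose implementation may differ. Targets that are missing for a given task are masked out, so only labeled entries produce gradients:

```python
import torch

def masked_multitask_loss(predictions, targets, masks):
    """Mean-squared error summed over tasks, counting only labeled entries.

    predictions, targets: dicts mapping task name -> tensor of shape [batch]
    masks: dict mapping task name -> bool tensor, True where a label exists
    """
    loss = torch.zeros(())
    for task, pred in predictions.items():
        m = masks[task]
        if m.any():  # skip tasks without any labels in this batch
            loss = loss + torch.mean((pred[m] - targets[task][m]) ** 2)
    return loss

# Toy example: a batch of 3 molecules, B3LYP labels for all, PBE0 for one.
pred = {"homo_b3lyp": torch.randn(3, requires_grad=True),
        "homo_pbe0": torch.randn(3, requires_grad=True)}
targ = {"homo_b3lyp": torch.randn(3), "homo_pbe0": torch.randn(3)}
mask = {"homo_b3lyp": torch.tensor([True, True, True]),
        "homo_pbe0": torch.tensor([False, True, False])}
masked_multitask_loss(pred, targ, mask).backward()  # gradients only from labeled entries
```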

Molecular-Graph Processing
The input to the model is a point cloud of atoms, which is readily interpreted as a graph, where the atoms are the vertices and the pairwise distances define the edges. For the backbone of our model, we use the SchNet architecture by Schütt et al., which was designed to model atomistic systems and makes use of continuous-filter convolutional layers, i.e., learned interactions that respect the symmetries of the underlying physics [15]. In a nutshell, SchNet maps the atomic charges to a learnable, n-dimensional (here: n = 128) embedding, which is then processed by several (here: 6) learned interaction modules, each of them incorporating a continuous filter generator.
The filter generator. For atoms i, j, the (continuous) distance r = |r_j − r_i| is Gaussian-expanded, e_k = exp(−γ (r − µ_k)²), for equidistant centers µ_k and k = 0, …, 24. This expansion passes through two dense layers with shifted-softplus nonlinearities, yielding the filter W^l_{i,j} ∈ R^n for the l-th filter generator.

The interaction module. The atom features are first transformed atom-wise (matrix multiplication plus a bias term). Then, the continuous-filter operation lets each atom i interact with the other atoms, weighted by the filter, Σ_{j=0}^{n_atoms} x^l_j ∘ W^l_{i,j}, where "∘" denotes element-wise multiplication. That is, the interaction module is factorized: the interactions are channel-wise, while the channels are mixed in the atom-wise operations. Then there are three more layers inside the interaction block: an atom-wise layer, a softplus nonlinearity, and another atom-wise layer. Finally, there is a skip connection, i.e., the sum of the outcome of the described interaction chain and its input is returned.
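To make the distance expansion concrete, here is a minimal PyTorch sketch. It is our own illustration; the cutoff r_max and the width γ are placeholder values, and schnetpack ships its own implementation of this expansion:

```python
import torch

def gaussian_expansion(r, r_max=10.0, n_gaussians=25, gamma=10.0):
    """Expand interatomic distances in equidistant Gaussians.

    r: tensor of distances, arbitrary shape
    returns: tensor of shape r.shape + (n_gaussians,), entries exp(-gamma (r - mu_k)^2)
    """
    mu = torch.linspace(0.0, r_max, n_gaussians)   # equidistant centers mu_k
    return torch.exp(-gamma * (r.unsqueeze(-1) - mu) ** 2)

distances = torch.tensor([0.9, 1.5, 2.4])          # e.g. in Angstrom
features = gaussian_expansion(distances)           # shape [3, 25]
```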

Aggregation and Readout
At some point, the representation of a molecular graph of arbitrary size (one feature vector per atom) has to be reduced to the set of output estimates, which are per molecule.
The simplest way of aggregating would be to take the sum or the average. This is even the natural choice in some cases. For example, the energy is an extensive quantity, and the energy of a molecule is the sum of the contributions from all atoms in their local environments. In fact, this is how the energy is constructed as the sum of local contributions in Ref. [15].
Sum and Average Aggregation. These are the standard aggregation modes in the SchNet paper [15]. The (high-dimensional) atomic representation is reduced to the desired output dimension (one for a single regression target) by a small neural network (typically two affine layers with a nonlinearity in between). Then, the contributions from the atomic environments are summed up or averaged, which gives the final result; see Figure 4(a).
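A minimal sketch of such a readout, using the pytorch scatter package mentioned in Appendix A.2.1, is given below. This is our own re-implementation for illustration, not the schnetpack code, and the layer sizes are placeholders:

```python
import torch
import torch.nn as nn
from torch_scatter import scatter_add, scatter_mean

class PoolReadout(nn.Module):
    """Per-atom two-layer network followed by sum or mean pooling per molecule."""
    def __init__(self, n_features=128, n_targets=3, mode="sum"):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_features, 64),
                                 nn.Softplus(),
                                 nn.Linear(64, n_targets))
        self.pool = scatter_add if mode == "sum" else scatter_mean

    def forward(self, atom_features, batch_index):
        # atom_features: [n_atoms_total, n_features]
        # batch_index:   [n_atoms_total], molecule id of each atom
        contributions = self.mlp(atom_features)            # per-atom contributions
        return self.pool(contributions, batch_index, dim=0)  # [n_molecules, n_targets]
```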
Set2Set Aggregation. The scaling of the present quantities of interest, the HOMO and LUMO energies, with the system size is not clear a priori. Both are related to the level spacing and tend to decrease slowly with the system size, but neither sum nor average is the obvious choice.
Therefore, we resort to the Set2Set module that was developed by Vinyals et al. [32] to learn a fixed-length representation for input sets of arbitrary size. It is based on the repeated application of a long short-term memory (LSTM) module:
• The initial LSTM input and hidden (cell) states are zeros.
• Project the atom features x (one per atom) onto the LSTM output q (one per molecule), apply a softmax over the atoms, and form the weighted average r of the atom features x with the softmax weights.
• Then, feed the pair (q, r) into the LSTM again.
• After N repetitions (here, N = 3), return the vector (q, r), one for each molecule.
Thus, the Set2Set module learns to generate a weighted average of the atom features. It has more expressive power than sum or average and, at the same time, respects the invariance with respect to the order of the input features.
The Set2Set aggregation works on the atomic embedding and doubles its dimension. Afterwards, a small neural network (here: 2 affine layers with a nonlinearity in between) reduces the dimension to the output dimension.
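The Set2Set implementation we build on is provided by pytorch geometric (cf. Appendix A.2.1). A minimal usage sketch with toy shapes, assuming the atom features have already been computed by the SchNet backbone:

```python
import torch
from torch_geometric.nn import Set2Set

n_features = 128                                  # dimension of the SchNet atom embedding
set2set = Set2Set(n_features, processing_steps=3) # N = 3 LSTM repetitions

# Toy input: 7 atoms belonging to 2 molecules (one molecule id per atom).
x = torch.randn(7, n_features)
batch = torch.tensor([0, 0, 0, 1, 1, 1, 1])

out = set2set(x, batch)  # shape [2, 2 * n_features]: the (q, r) vector per molecule
```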

Impact of Graph Aggregation
To start with, we train SchNet models on the smallest subset of the training data (QM9 only) with different aggregation modes and output modules (training details and parameters are listed in Sec. A.1).

Sum. For each of the targets (HOMO, LUMO, gap), atomic contributions are obtained with a 2-layer neural network and summed up. Results are shown in column (a) of Fig. 5. The model achieves good accuracy on test data from the same distribution as the training data (QM9), but is completely off for out-of-distribution data, like the large molecules from HOPV and Kuzmich (2017).
Average. Results for average aggregation (with otherwise the same parameters) are given in panel (b) of Fig. 5. Average aggregation is not as far off as sum aggregation.
Set2Set. The Set2Set output module aggregates in a flexible, learned manner; the dimension is reduced after aggregation. This yields even better results (panel (c) of Fig. 5) than average aggregation. Of course, systematic deviations of the order of 1 eV are still not acceptable, but Set2Set is the best starting point for the next stage, extending the training set.

Impact of the Training Data
We successively extend the training set by adding sets of larger molecules (QM9 → Alchemy → OE62 → HOPV), Fig. 6 (a)-(d). This significantly reduces the regression error on the sets of large molecules (HOPV and Kuzmich (2017)).
From panel (e) on, we go in the opposite direction: whereas (d) is trained on the full training set, (e) is trained without the small molecules.

The full model (Set2Set output module for both tasks, full training set, see Fig. 6 (d)) shows errors of the order of 0.05 eV for HOMO and LUMO for small molecules (QM9, Alchemy) and 0.1 eV for large molecules and molecules with large chemical variety (OE62, HOPV, Kuzmich (2017)). We stress that this is already close to the so-called "chemical accuracy" (1 kcal/mol = 0.043 eV) and should be sufficient for practical applications.

Ablation Study
Three efforts have led to the accuracy achieved by the full model (Fig. 6 (d)): the concept of multitask training, the choice of the output module, and the extended training set. Previously, in Sec. 4.2, we analyzed the impact of the training set, given the design decisions "multitask training" and "Set2Set aggregation". In this section, we verify that these design decisions were actually important to achieve the accuracy reported above.

Multi-task learning
In Table 4, the column "Main Experiment" is the multitask model, i.e., the B3LYP and PBE0 tasks are trained with a Set2Set output module on top of the representation learned by a common SchNet graph-processing network, as represented in Figure 4(b) (with N = 2 × 3 predictions, because 3 outputs (HOMO, LUMO, and gap) are trained for each of the two tasks, B3LYP and PBE0). In the other columns, the B3LYP task and the PBE0 task are trained separately (Figure 4(b) with N = 3 predictions).
For most quantities, the multitask model achieves better accuracy. The only exceptions are the HOPV-PBE0 errors, which are lower in the model trained on PBE0 only. On the other hand, for those quantities that are hard in the single-task setting (error measures larger than 0.1 eV, marked in red in the columns "B3LYP only" and "PBE0 only"), the errors are reduced significantly when the tasks are trained together (column "Main Experiment"). We conclude that the concept of multitask training is indeed important for the accuracy and reliability of the model.

Aggregation modules
We have shown above (Sec. 4.1) that the Set2Set output module is best if the training set represents only a small part of the test distribution (QM9 only). Here, we repeat this investigation with the full training set. Table 5 collects the results. The column labeled "Main Experiment" is the model SchNet(6) + Set2Set output module.
Average aggregation. The model with the average output module is competitive on small molecules (QM9 and Alchemy), which are well represented in the training set. It also performs slightly better on the HOMO(PBE0) task on the HOPV test set. However, it performs much worse on those four tests and quantities that are already difficult for the baseline Set2Set model (error measures above 0.1 eV, marked in red in the table). Altogether, the average output module has 12 error measures above 0.1 eV.
Sum aggregation. When moving on from average to sum, the picture is similar: sum performs worse than average on those tests and quantities that are already difficult for average and Set2Set. In total, the sum output module yields 16 error measures (out of 33) that are above 0.1 eV. The estimates of the LUMO values for the Kuzmich (2017) data are systematically shifted: the mean of the ground-truth LUMO values is −3.30 eV, but the mean estimated LUMO is −4.86 eV, manifesting an unphysical scaling of the estimate with the system size.

In conclusion, out of the three output modules considered, the Set2Set output module remains the most reliable one, especially on challenging data.
We have achieved a model with reasonable errors on the large molecules (i) by incorporating the Set2Set output module for improved generalization, (ii) by extending the training set with data from four different sources, and (iii) by introducing a multitask ansatz to handle the domain shifts due to the different levels of theory of the different sources.
Outlook and Future Work. The method is not restricted to organic photovoltaics and is, in general, transferable to any problem in materials discovery where material properties are available as training data; think, for example, of organic light-emitting diodes, organic circuits, or materials for batteries and supercapacitors.
Our model can be used in future explorations of the chemical compound space in the search for better materials for organic photovoltaics.As indicated in Fig. 1, we have interdisciplinary approaches in mind that incorporate machine learning for fast prediction, but also algorithms that generate new candidates, as well as ab-initio methods and experiments for further verification and technical exploitation.
Approaches for molecular geometry generation range from genetic algorithms [33, 34] over fragment-based approaches [35] to autoencoder-based methods [36, 37, 38] and generative neural networks [39, 40]. One challenge in using the current model for such applications is that it was trained on the exact molecular geometries obtained from DFT geometry optimizations. If the model is applied to structures that come from a generating algorithm, the geometries will not be exact and may result in unexpected predictions by the model. Thus, the sensitivity of the model accuracy with respect to noise in the input coordinates needs to be investigated. The problem can probably be alleviated by training the model with random noise added to the input geometries. Note, however, that this would change the regression task from "What are the properties of this molecular geometry?" to the question "Imagine this geometry relaxes to its nearby ground state; what would be its properties?"
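Such a noise augmentation could look as follows. This is a sketch with ASE; the helper name and the noise scale sigma are our own placeholders that would have to be tuned:

```python
import numpy as np
from ase import Atoms

def perturb_geometry(atoms: Atoms, sigma=0.02, seed=None) -> Atoms:
    """Return a copy of the molecule with Gaussian noise (in Angstrom)
    added to all atomic coordinates."""
    rng = np.random.default_rng(seed)
    noisy = atoms.copy()
    noisy.positions += rng.normal(scale=sigma, size=noisy.positions.shape)
    return noisy
```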

A.2.1 The Maltose Library
The library is installed as a Python module and provides the Set2Set output module and the multitask functionality. It builds upon the schnetpack package [41] (at least v1.0.1), which in turn makes use of the Atomic Simulation Environment (ASE) [42] and pytorch-1.8.0. The implementation of the Set2Set output module depends on the packages pytorch geometric and pytorch scatter.

A.2.2 Running experiments
Data preparation. The scripts to download and prepare the datasets (i.e., to convert them to the ase format) are located at scripts/primary data/ in the maltose repository, as well as the script that creates the unified train/validation/test split. These scripts depend on the xyz2mol code from https://github.com/jensengroup/xyz2mol, which implements the algorithm from Ref. [43], and on the chemical toolkit RDKit [44].

Configuration and execution of experiments. The experiment configurations are provided in scripts/configs/.
There are scripts to run the training, to evaluate the errors, and to create the plots included in this article.

Figure 2: Energy of the HOMO obtained by DFT with the PBE0 exchange-correlation functional vs. the B3LYP functional for the molecules of the OE62 dataset. DFT results at different levels of theory are strongly correlated, but may have significant systematic shifts.

Figure 4(b) represents the resulting network architecture.

Figure 5: Regression accuracy of SchNet models with different output modules, trained only on the QM9 part of the training data. Shown are the target-estimate pairs of 25 molecules randomly sampled from each test set. The mean absolute error (MAE) measured on the entire test set is given in the legend.

Figure 6: Regression accuracy of SchNet models with Set2Set output modules trained on different subsets of the training data, as indicated above (Q = QM9, A = Alchemy, O = OE62, H = HOPV). Note that the panels for the PBE0 values in columns (a) and (b) compare B3LYP estimates to PBE0 target values (gray background) because, without PBE0 targets in the training data, the model cannot make separate PBE0 estimates.

Table 1: Summary of dataset sizes and properties. Items marked in bold are the ones considered in this work. (b) The "Tier 2" basis set in "tight" settings: the numeric atom-centered orbitals as implemented in the FHI-aims package [26]. (c) The hybrid PBE0 exchange-correlation functional of Refs. [27, 28]. (d) Multipole Expansion (MPE) implicit solvation method for water [29]. (e) Only the molecules with the identifiers FPDI-T and T2 have the same InChI; they differ only by single-bond rotations and have very similar properties.

Table 2: Intersections of the datasets in terms of truncated InChIs (sum formula and non-hydrogen connectivity). The numbers on the diagonal are the counts of distinct truncated InChIs in each data source.

Table 4: Comparison of the multitask model ("Main Experiment") and the single-task models ("B3LYP only" and "PBE0 only") on the basis of the error measures mean error (ME), mean absolute error (MAE), and root-mean-square error (RMSE), all measured in eV. Bold: best MAE or RMSE value in a given row. Red: error measures above 0.1 eV.

Table 5: Regression errors (in eV) for different output modules. Bold: best MAE or RMSE value in a given row. Red: error measures above 0.1 eV. The "Main Experiment" is SchNet(6) + Set2Set trained on the full training set.