Novel machine learning insights into the QM7b and QM9 quantum mechanics datasets

This paper (i) explores the internal structure of two quantum mechanics datasets (QM7b, QM9), composed of several thousands of organic molecules and described in terms of electronic properties, and (ii) further explores an inverse design approach to molecular design consisting of using machine learning methods to approximate the atomic composition of molecules, using QM9 data. Understanding the structure and characteristics of this kind of data is important when predicting the atomic composition from physical‐chemical properties in inverse molecular design. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that for both datasets the intrinsic dimensionality is several times smaller than the number of descriptive dimensions. The QM7b data is composed of well‐defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers, and an inner, core region that concentrates clustered inlier objects. A significant relationship exists between the number of atoms in the molecule and its outlier/inlier nature. The spatial structure exhibits a relationship with molecular weight. Despite the structural differences between the two datasets, the predictability of variables of interest for inverse molecular design is high. This is exemplified by models estimating the number of atoms of the molecule from both the original properties and from lower dimensional embedding spaces. In the generative approach the input is given by a set of desired properties of the molecule and the output is an approximation of the atomic composition in terms of its constituent chemical elements. This could serve as the starting region for further search in the huge space determined by the set of possible chemical compounds. The quantum mechanics dataset QM9 is used in the study, composed of 133,885 small organic molecules and 19 electronic properties. Different multi‐target regression approaches were considered for predicting the atomic composition from the properties, including feature engineering techniques in an auto‐machine learning framework. High‐quality models were found that predict the atomic composition of the molecules from their electronic properties, as well as from a subset comprising only 52.6% of them. Feature selection worked better than feature generation. The results validate the generative approach to inverse molecular design.


| INTRODUCTION
The goal of molecular design (MD) is to generate molecules with desired properties, which would correspond to new materials, new drugs, and compounds of interest. This is a very complex undertaking considering the size of the chemical compound space, where conventional approaches are inefficient, time-consuming, and ultimately prohibitive. In that sense, the introduction of technologies leveraging the progress made in computationally intensive data analysis and artificial intelligence disciplines (e.g., machine learning and others) represents an important step towards achieving the aforementioned goal. Specifically from a machine learning perspective, the two main approaches are the so-called discriminative and generative. In the first case (also known as forward design), the purpose is to relate molecules with their properties (electronic or otherwise), whereas in the generative approach (so-called inverse design), the idea is to start with a given collection of properties and derive the actual molecules compatible with them. 1 The latter has been approached with different techniques, including deep learning and others. Variational autoencoders (VAE) working with real-valued vectors representing the molecules have been used for property optimization in latent spaces. 2 The targeted application was the generation of new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds with molecules having fewer than nine heavy atoms. A comparison of a VAE with an adversarial autoencoder (AAE) was presented in Reference 3 and applied to generate novel compounds with activity against dopamine receptor type 2. VAE and AAE were also used in Reference 4 with the purpose of identifying new molecular fingerprints with predefined anticancer properties, showing that the models significantly enhanced the efficiency of the development of new molecules with specific anticancer properties. The application of other approaches to inverse molecular design, such as generative adversarial networks (GAN), reinforcement learning, and transfer learning, has been described in Reference 5. In Reference 6 inverse molecule design was formulated as a classification problem and high-quality models for predicting atomic composition from electronic properties were found.
However, there is a lack of studies oriented to the understanding of the structure of the molecule datasets used in the aforementioned approaches. Gaining insights related to the internal structure of the information and the characteristics of the data could benefit inverse molecule design and related topics. One objective of this paper is to start filling that gap by using unsupervised machine learning methods on two of the datasets (QM7b and QM9) that relate quantum mechanical properties with atomic composition and that have been used in inverse molecular design studies.
From the point of view of inverse molecule design, this paper extends previous studies in several ways: (i) it works with a large dataset (133,885 molecules described by 19 properties), (ii) it includes isomers (6095), (iii) it formulates the problem as a regression (instead of classification), considering that the number of atoms of a given element in a molecule is a ratio variable with discrete values, (iv) it produces a simultaneous, multi-target prediction of the whole molecular composition, which benefits from the interrelationship of the chemical elements present in the molecule, (v) it introduces feature engineering methods, and (vi) it considers model scenarios that include the original predictor space and latent, embedding spaces.
The common methodology for the analysis of these datasets starts with a description of their general characteristics in terms of size and the physical-chemical properties used for describing the set of molecules that they cover, as well as the preprocessing applied to the data in order to better condition the original description for the subsequent processing stages. From there the process follows two main stages: (i) an unsupervised analysis oriented to the exploration and understanding of the internal structure of the data (with methods described in Section 2.3.1), and (ii) a supervised analysis consisting of assessments of the predictive capabilities of the properties, feature analysis and selection, aiming at finding the most informative ones for modeling, and the actual machine learning model search (with the methods from Section 2.3.2). Supervised modeling focused on the QM9 dataset only, because it is the largest and most comprehensive of the two datasets considered.
It is important to emphasize that even though this paper centered on the analysis of the QM7b and QM9 datasets, the methodology used is of general character and could be applied to similar, likely more comprehensive, and more up-to-date molecular datasets. Moreover, in order to gain additional insights into the understanding and exploitation of the available data, other unsupervised and supervised machine learning methods should be incorporated. This would certainly be beneficial for materials science research, drug design, and other applications.
The rest of this paper is organized as follows: Section 2.1 describes the data used in the study, Section 2.3 describes the collection of machine learning methods applied and the auto-machine learning approach, Section 3 presents the results, and Section 4 the conclusions.

| METHODOLOGY
The unsupervised analysis starts with estimating the intrinsic dimension of the data (with different approaches presented in Section 2.3.1), in order to have an initial idea of the actual dimensionality of the space containing the information of the entire molecular corpus. This is important because of the differences between the dimension of the original space determined by the features (properties) used for describing the objects (molecules) and that of an abstract space capable of containing the same information. For real-world data, the former tends to be larger than the latter, and the magnitude of that difference is an indicator of the amount of irrelevancies, redundancies, and noise present in the data. A second element following that assessment involves manifold learning procedures for actually constructing approximations to such more concise spaces (with a state-of-the-art approach presented in Section 2.3.1). Other elements strongly related to the internal structure of the information are (i) the data distribution within the space, that is, the presence of natural clusters (Section 2.3.1), and (ii) the existence of abnormalities or oddities (outliers) (with a variety of methods mentioned in Section 2.3.1).
The supervised analysis is mainly oriented to inverse design, understood as the prediction of the chemical species (the targets) from the physical-chemical properties (the original predictors), in this case using machine learning methods. In particular, the paper covers several modeling scenarios (Section 2.3.2): (i) target prediction directly from the original properties, (ii) target prediction from generated features in an embedding space, (iii) target prediction from features in the original descriptor space, reconstructed by inverse-mapping from the embedding space, and (iv) target prediction from a subset of selected features in the original descriptor space. In these processes, modeling was approached as a multi-target regression task from an AutoML (automated machine learning) perspective (Section 2.3.4), in order to wrap model selection and model-parameter optimization in a unified framework covering simultaneously the five inter-related chemical species targeted. In that sense, the paper also represents an experiment in testing such capabilities on a real-world complex problem. In addition, explainable AI (XAI) methods were introduced to better understand the behavior of the found models and the role of the different predictors in controlling model performance. Contrary to the QM7b data, which includes none, QM9 has 6095 constitutional isomers among its 134k molecules. This dataset is considered to provide very accurate quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules (presented in Table 1).
Accordingly, it has been used for a variety of tasks, like benchmarking existing methods, hybrid quantum mechanics/machine learning, and the identification of structure-property relationships. In this paper, the focus is on characterizing the internal structure of the data from a machine-learning perspective and on exploring models for inferring the atomic composition from molecular properties.
T A B L E 1 QM9: Properties characterizing the molecules (columns: Index, Molecular property, Description).

The correlation structure of the chemical elements, shown in Table 2, reveals that there are strong relationships between the chemical elements composing the molecules in the dataset (all p-values were highly significant at the 0.05 level).

| Preprocessing
For both QM7b and QM9, the original properties describing the data are expressed in different units of measure. Variables that are measured at different scales do not contribute equally to the analysis and create biases, affecting a broad variety of statistical and machine-learning methods. In order to make their values comparable, and also to eliminate this source of bias, prior to the application of the machine learning methods described in the following section, each dataset was preprocessed by converting the original property values to z-scores.
With the new variables having zero mean and unit variance, all properties are expressed in the same unit of measure (their variance), eliminating that source of bias.
For the application of supervised machine learning methods aiming at creating predictive models, the 133,885 objects of the QM9 dataset were divided into training and testing sets following a uniform random 90%/10% split, with 120,496 and 13,389 objects in the training and testing sets, respectively.
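As an illustration, the standardization and split described above can be expressed compactly. The following is a minimal sketch assuming scikit-learn (the paper does not name an implementation); random arrays stand in for the actual QM9 matrices:

```python
# Sketch: z-score standardization and the 90%/10% split described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(133885, 19)             # placeholder for the 19 QM9 properties
Y = np.random.randint(0, 10, (133885, 5))  # placeholder for the 5 atom-count targets

# Uniform random 90%/10% split -> 120,496 / 13,389 objects
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.10, random_state=123)

# z-scores (zero mean, unit variance); fitting on the training set only
# avoids leaking test-set statistics into the model search
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```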

| Machine learning methods
The machine learning methods used in the study consisted of unsupervised and supervised techniques. The former aimed at creating alternative data representations and their characterization, including the construction of visual representations and nonlinear mappings for manifold extraction. The latter consisted of prediction algorithms applied to the different data representations, including automated machine learning based on Auto-sklearn AutoML. 48 The architecture of the system used in the paper for this process is shown in Figure 1, where the data go through preprocessing and dedicated feature engineering stages prior to the model search phase. The preprocessing required by the QM9 data was a straightforward standardization, which need not be included in the search procedure, and powerful feature selection and generation approaches were chosen a priori for the feature engineering operations. Accordingly, the AutoML process was designed to concentrate the meta-learning, ensemble building, and Bayesian hyperparameter optimization components on the important task of model-space exploration.

Intrinsic dimension
Data is typically represented in two forms: as objects described by their features, or as a collection of pairwise similarities or dissimilarities. In real-world problems, the dimensionality of these spaces can be large, leading to the curse of dimensionality, which can negatively impact the performance of statistical and machine learning algorithms.
However, the information contained in the data is often concentrated in a subspace of much lower dimensionality than that of the original descriptor space. The intrinsic dimension (IDim) is the dimension of an embedded manifold that contains the data. Identifying these manifolds can aid in understanding the internal structure of the information, and working with them can alleviate the curse of dimensionality and improve the performance of machine learning methods.
Finding the intrinsic dimension has no unique solution. Estimates based on different approaches vary, and it is necessary to work with a consensus of the estimated values, since they are based on a variety of principles and assumptions. The following approaches were used in this paper for estimating the intrinsic dimension of the QM9 dataset: angle and norm concentration (DANCo), 12 correlation integral (CorrIn), 13 Fisher separability (FisherS), 14,15 minimum neighbor distance-ML (MiND_ML), 16 maximum likelihood (MLE), 17 estimation within tight localities (TLE), 18 method of moments (MOM), 19 PCA Fukunaga-Olsen (lPCA), 20,21 and minimal neighborhood information (TwoNN). 22 They are implemented in Reference 23 under a common framework.
The estimates of the intrinsic dimension provide an idea of the required dimension of embedding spaces containing the data. Since they are based only on the information contained in the predictor features, the new sets of generated features can be used as predictors for modeling once the target variables are added.
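These estimators can be run under one interface. The following is a sketch assuming the scikit-dimension (skdim) package, which implements estimators of this kind under a common framework; the data array is a placeholder:

```python
# Consensus intrinsic-dimension estimation with several estimators.
import numpy as np
import skdim

X = np.random.rand(1000, 19)  # placeholder for the standardized QM9 properties

estimators = {
    "DANCo":   skdim.id.DANCo(),
    "CorrInt": skdim.id.CorrInt(),
    "FisherS": skdim.id.FisherS(),
    "MiND_ML": skdim.id.MiND_ML(),
    "MLE":     skdim.id.MLE(),
    "TLE":     skdim.id.TLE(),
    "MOM":     skdim.id.MOM(),
    "lPCA":    skdim.id.lPCA(),
    "TwoNN":   skdim.id.TwoNN(),
}
ids = {name: est.fit(X).dimension_ for name, est in estimators.items()}

# Work with a consensus, since the estimators rest on different assumptions
print(sorted(ids.items()))
print("mean:", np.mean(list(ids.values())), "median:", np.median(list(ids.values())))
```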

Manifold learning
A low-dimensional transformation of the original representation space provides an environment where the information can be visualized and/or modeled with supervised methods. When the objective is to visualize the data, if the intrinsic dimension of the data is sufficiently low, the new space could provide useful insights into the structure of the information. Otherwise, it would provide an approximation.
Whenever estimates of the intrinsic dimension are available, they can be used to make appropriate choices for the dimension of the embedding space via nonlinear transformations. Principles used for computing these transformations are, for example, the preservation of local distances/dissimilarities, the preservation of conditional probability distributions within neighborhoods, or more complex criteria. This paper used Uniform Manifold Approximation and Projection (UMAP), a general-purpose manifold learning and dimension reduction technique, for creating the embedding spaces.
The final transformation is given by the function composition φ = P ∘ U, where U is the nonlinear Umap mapping and P is a principal components transformation applied to the nonlinear space, with a number of components equal to the number of target dimensions. As a result, φ is a canonical mapping whose axes are ordered by monotonically decreasing variance in the nonlinear space; it will be referred to as the canonical Umap.
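A minimal sketch of this canonical mapping, assuming the umap-learn and scikit-learn packages (placeholder data):

```python
# Canonical Umap phi = P o U: a nonlinear UMAP mapping followed by a PCA
# that orders the embedding axes by decreasing variance.
import numpy as np
import umap
from sklearn.decomposition import PCA

X = np.random.rand(5000, 19)  # placeholder for the standardized properties
d = 5                         # target dimension, e.g. near the estimated IDim

U = umap.UMAP(n_components=d, random_state=123)  # nonlinear mapping U
Z = U.fit_transform(X)                           # nonlinear embedding coordinates

P = PCA(n_components=d)                          # rotation P on the nonlinear space
X_e = P.fit_transform(Z)                         # canonical Umap coordinates
```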

Tree-SNE hierarchical clustering
Tree-SNE 26 is a hierarchical clustering and visualization algorithm based on stacked one-dimensional t-SNE embeddings 27 with varying perplexity and degrees of freedom of the t-distribution underlying the mapping. As the perplexity and degrees of freedom decrease, smaller and finer clusters are produced in the embedding. The approach uses one-dimensional t-SNE embeddings for computational efficiency and ease of clustering, via a Fourier transform-based fast t-SNE algorithm. 28,29 To that end, a generalized t-distribution kernel k(d) = (1 + d²/α)^(−α) is used, where d is the Euclidean distance between two objects and α is a scaling parameter for the degrees of freedom ν, given by ν = 2α − 1. This reformulation allows kernels with tails heavier than any possible t-distribution to have negative degrees of freedom, which has proven to enhance cluster separability. Typically, about 30 to 100 one-dimensional t-SNE embeddings are stacked to create the tree-SNE visualization, with the first layer at the bottom of the plot and with α and perplexity decreasing for each upper layer. In the tree visualization created using t-SNE, each layer is initialized with the embedding from the previous layer rather than with random initialization. This allows each layer to refine the clustering found in the previous layer, with larger clusters being broken into smaller clusters on subsequent levels of the tree. As a result, the path of a single observation or group of observations can be traced vertically through the tree.
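The stacking mechanism can be sketched conceptually with scikit-learn's t-SNE (the published tree-SNE code relies on an FFT-accelerated t-SNE and additionally anneals the kernel's degrees of freedom, which plain scikit-learn does not expose); placeholder data is used:

```python
# Conceptual sketch of stacked one-dimensional t-SNE layers, each initialized
# from the previous one, with perplexity decreasing from bottom to top.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(2000, 19)                # placeholder data
perplexities = np.geomspace(100, 5, num=8)  # decreasing perplexity per layer

layers, init = [], "pca"
for p in perplexities:
    emb = TSNE(n_components=1, perplexity=p, init=init,
               random_state=123).fit_transform(X)
    layers.append(emb.ravel())
    init = emb                              # next layer refines the previous one

# 'layers' stacks bottom-to-top; tracing a point vertically through the stack
# follows how its cluster splits at finer granularities.
```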
Alpha-clustering is a method that assigns cluster labels to the t-SNE embeddings obtained with tree-SNE. It recommends the optimal cluster assignment by finding cluster stability across multiple scales, without prior knowledge of the number of clusters. The group assignment is derived from the t-SNE embeddings obtained with tree-SNE by automatically selecting the best level of granularity for the data, defined as the stable clustering for which the range of α values is the largest. It uses spectral clustering 30 on each layer of the tree to determine the number of clusters present in the data, rather than using a prespecified number of clusters.
The shared nearest neighbors method is used to build the graph for spectral clustering, connecting points in the t-SNE embedding that are nearest neighbors in the original high-dimensional space. Alpha-clustering allows for explainable clustering on multiple levels of granularity. It can be used with tree-SNE visualizations to compare the results to other unsupervised clustering algorithms and to determine the meaningful aspects of the data captured by tree-SNE. Alpha-clustering extends and validates the tree-SNE algorithm and provides an opportunity for explainable unsupervised clustering, which is difficult to achieve with other approaches that often rely on deep neural networks. Alpha-clustering also helps prevent over-fitting to noise in the data, as it tends to select clusterings that appear earlier in the tree structure. One-dimensional t-SNE tends to generate easily distinguishable clumps of data, making it easy to identify and count the clumps on each layer of the tree-SNE tree. The difference in magnitude between alpha values rapidly decreases as more layers are added.

Hierarchical clustering methods have strong limitations posed by the size of the data, but the advantage of the tree-SNE approach is twofold: on the one hand, it is capable of processing large datasets (prohibitive for classical hierarchical clustering approaches, but required for QM9); on the other hand, the representation mechanism of tree-SNE allows the visualization of dendrograms composed of many thousands of objects, which is the case for QM7b and particularly QM9.

Outlier detection
Outliers in a dataset are considered rare or odd observations, events, or individuals characterized by unusual attribute values, which deviate from the general distribution observed within a population. Due to their peculiar composition and the fact that they deviate from the main distribution characterizing the data, outliers are also referred to as anomalies. This definition is inherently subjective, which is why several criteria exist to decide what should be considered abnormal. Outliers may also represent novel or special elements containing valuable information, not necessarily associated with faulty situations. Before considering the possible elimination of these points from the data, it is necessary to understand why they appeared and whether it is likely that similar values will appear again. Outliers may affect the results of statistical and machine learning procedures, which is why preprocessing stages often remove them from the data prior to modeling. However, in the present study outliers are considered a source of information about the molecules. Considering the great care with which QM7b and QM9 were prepared and curated, it is unlikely that their outliers are wrong data that should be disregarded. Most existing outlier detection (OD) algorithms are unsupervised due to the high cost of acquiring ground truth, 31 and a good strategy is to use heterogeneous OD approaches, where ensemble methods select and combine diversified base models and produce more reliable results. The recently introduced SUOD framework 32 enables the combined application of different methods with an ensemble approach based on three independent and complementary levels: (i) the data level, using random projections, (ii) the model level, using pseudo-supervised approximations, and (iii) the execution level, where the algorithms are applied via balanced parallel scheduling. The data level relies on the Johnson-Lindenstrauss projection 33 to create a space for training the models. The model level constructs fast supervised models which use the unsupervised models' outputs as "pseudo ground truth", and the execution level uses a balanced parallel scheduling mechanism that can forecast the time cost (training time) of each OD model before scheduling, so that the task load can be evenly distributed among workers. OD was performed with the SUOD approach, using the following combination of state-of-the-art individual techniques: CBLOF, 34 HBOS, 35 LODA, 36 LOF, 37 and Isolation Forest. 38
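A minimal sketch of this heterogeneous ensemble, assuming the PyOD library (which provides SUOD and the five base detectors named above); the data array is a placeholder:

```python
# Heterogeneous OD ensemble with SUOD over five diversified base detectors.
import numpy as np
from pyod.models.cblof import CBLOF
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.loda import LODA
from pyod.models.lof import LOF
from pyod.models.suod import SUOD

X = np.random.rand(10000, 19)  # placeholder for the standardized QM9 properties

detector = SUOD(
    base_estimators=[CBLOF(), HBOS(), LODA(), LOF(), IForest()],
    combination="average",     # aggregate the base models' scores
    verbose=False,
).fit(X)

scores = detector.decision_scores_  # continuous outlier scores
labels = detector.labels_           # 0 = inlier, 1 = outlier
```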

| Supervised modeling
The objective of the supervised methods in the present context is to construct predictive models capable of relating molecular properties with their atomic composition.
Several modeling scenarios are illustrated in Figure 2.
There are several errors associated with these mappings: (i) the prediction error of the target for a given object X, δ_X = ‖T − F(X)‖; (ii) the reconstruction error of X under the inverse transform φ⁻¹ from the embedding space, ε_X = ‖X − X̂‖, where X̂ = φ⁻¹(φ(X)); (iii) the error between a given prediction and the one based on the object's reconstruction, δ_X̂ = ‖F(X) − F(X̂)‖; and (iv) the error of a target prediction based on embedded predictor features, δ_Xe = ‖T − ψ(X_e)‖. These mappings and their associated errors will be discussed in the subsequent subsections.

Feature selection
Like many real-world datasets, the ones analyzed here are described in terms of a large number of descriptor features. Often many of them are irrelevant or noisy for the task at hand (classification or regression), but that is not known in advance. Working with overly large feature sets makes many machine learning algorithms lose accuracy when the number of variables is significantly higher than optimal, aside from practical issues like slowing down the algorithms and increasing the amount of computational resources required.
F I G U R E 2 Data spaces and mappings. X is a predictor object and T is its associated target (linked by the dashed line). F: ℝⁿ → ℝᵐ is the mapping from the predictor space to the target space. φ: ℝⁿ → ℝᵈ is the mapping from the predictor space to a low-dimensional embedding-generative space (d is the dimension of the embedding space), and X_e is the image of X in the embedding. φ⁻¹ is the inverse mapping from the embedding space to the predictor space (X̂ = φ⁻¹(X_e) is the inverse reconstruction). F(X̂) is the prediction of a reconstructed (inverse) object under F. ψ is the mapping from the embedding to the target space, and ψ(X_e) is the predicted target based on X_e.
Using multiple feature selection algorithms is beneficial in order to make the selection more robust and unbiased, 41,42 but it can also add extra complexity due to discrepancies in the rankings produced by the different algorithms. This so-called rank aggregation problem arises when ranking alternatives based on one or more criteria. While ranking is relatively straightforward when only one criterion is considered, it becomes more complex when multiple criteria are involved. This is the case when different feature selection algorithms produce different rankings of importance scores computed on the same set of data descriptor features.
In such situations, it is necessary to compute a "consensus" ranking based on the individual rankings produced by the different algorithms.
Many approaches have been proposed for solving the rank aggregation problem, most of which are based on some notion of optimality that the aggregation algorithm must use to create a fully ordered list from the different total or partial lists provided by individual sources. 43 However, there are rank aggregation methods that are not optimization-based but are still effective in practice. In particular, Markov chain-based methods have proven to be very effective. 44 In them, the states of the chain correspond to the candidates to be ranked, and the transition probabilities depend on the given (partial) lists. The state space is the union of the sets of features ranked by the various selection algorithms, and the algorithm looks for the stationary distribution of the transition probabilities. In the MC4 variant, if the current state of the chain is feature P, then the next state is chosen as follows: (i) first pick a feature Q uniformly from the union of all features ranked by the selection algorithms; (ii) if τ(Q) < τ(P) for a majority of the lists τ that ranked both P and Q, then move to Q, else stay at P. This chain generalizes Copeland's suggestion of sorting the candidates by the number of pairwise majority contests they have won. 45 The Markov chain approach allows handling partial lists of candidates (features) and also enhances other proposed heuristics. In addition, it is computationally efficient and outperforms other classical rank aggregation algorithms with respect to the most used metrics.
This paper uses MC4 as the rank aggregation procedure.
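A sketch of the MC4 chain described above (our own implementation; the small perturbation that guarantees ergodicity is an assumption on our part, and the example rankings are hypothetical):

```python
# MC4 rank aggregation: states = features; move from P to Q if a majority of
# the lists ranking both place Q above P; rank by the stationary distribution.
import numpy as np

def mc4(rankings, features, damping=0.95, iters=1000):
    n = len(features)
    idx = {f: i for i, f in enumerate(features)}
    P = np.zeros((n, n))
    for p in features:
        for q in features:
            if p == q:
                continue
            both = [r for r in rankings if p in r and q in r]
            # transition p -> q if a majority of lists rank q above p
            if both and sum(r[q] < r[p] for r in both) > len(both) / 2:
                P[idx[p], idx[q]] = 1.0 / n
        P[idx[p], idx[p]] = 1.0 - P[idx[p]].sum()  # otherwise stay at p
    P = damping * P + (1 - damping) / n            # small perturbation (ergodicity)
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):                         # power iteration
        pi = pi @ P
    return sorted(features, key=lambda f: -pi[idx[f]])

# Hypothetical partial rankings (feature -> position, lower is better)
ranks = [{"zpve": 1, "alpha": 2, "u0": 3},
         {"alpha": 1, "zpve": 2, "u0": 3},
         {"zpve": 1, "u0": 2}]
print(mc4(ranks, ["zpve", "alpha", "u0"]))
```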

| Gamma test
The Gamma test 46,47 is a non-parametric technique that estimates the amount of residual variance in a dataset (the component of the target variable's variance that does not come from the predictor variables). Let S be a system described in terms of a set of variables, with y ∈ ℝ a variable of interest potentially related to a set of m variables x ∈ ℝᵐ through y = f(x) + r, where f is a smooth unknown function representing the system, x is a set of predictor variables, and r is a random variable representing noise or unexplained variation. Despite f being unknown, under some assumptions it is possible to estimate the variance of the residual term r using available data obtained from S. This gives an indication of the possibility of developing models for y based on the information contained in x. Among the most important assumptions are: (i) the function f is continuous within the input space, (ii) the noise is independent of the input vector x, and (iii) the function f has bounded first and second partial derivatives.
Let x_{N[i,k]} denote the k-th nearest neighbor of x_i in the input set {x_1, …, x_M}. If p is the number of nearest neighbors considered, for every k ∈ [1, p] a sequence of estimates of E[½(y′ − y)²] based on sample means is computed as
γ_M(k) = (1/(2M)) Σᵢ₌₁ᴹ |y_{N[i,k]} − y_i|²,
and in each case an 'error' indication is given by the mean squared distance between the k-th nearest neighbors,
δ_M(k) = (1/M) Σᵢ₌₁ᴹ |x_{N[i,k]} − x_i|²,
where E denotes the mathematical expectation and |·| the Euclidean distance.
The relationship between γ_M(k) and δ_M(k) is assumed linear as δ_M(k) → 0, and Γ = Var(r) is then obtained by the linear regression γ_M(k) = Γ + A·δ_M(k). The three main statistics are Gamma (Γ, the residual variance), the Gradient (A, related to model complexity), and V_ratio = Γ/Var(y) ∈ [0, 1], where Var(y) is the variance of the target variable y. The latter is particularly useful since it is a normalized version of Γ and allows comparability across different datasets. All of these magnitudes can be estimated directly from the data and allow an assessment of the predictability of the target variable given the input (predictor) variables, without any actual modeling.
The Gamma test was used for making predictability assessments of molecular weight from the original electronic properties and from features generated in embedding spaces, as well as for evaluating the individual capabilities of predictor variables in relation to other variable-importance estimation methods.
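A minimal sketch of these statistics, implemented directly from the formulas above (our own implementation, assuming scikit-learn for the nearest-neighbor search; the toy data is a placeholder):

```python
# Gamma test: estimate Gamma (residual variance), Gradient, and Vratio
# from nearest-neighbor statistics plus a linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

def gamma_test(X, y, p=10):
    nbrs = NearestNeighbors(n_neighbors=p + 1).fit(X)
    _, idx = nbrs.kneighbors(X)          # column 0 is the point itself
    delta, gamma = [], []
    for k in range(1, p + 1):
        nk = idx[:, k]
        delta.append(np.mean(np.sum((X[nk] - X) ** 2, axis=1)))  # delta_M(k)
        gamma.append(np.mean((y[nk] - y) ** 2) / 2.0)            # gamma_M(k)
    reg = LinearRegression().fit(np.array(delta).reshape(-1, 1), gamma)
    Gamma, gradient = reg.intercept_, reg.coef_[0]  # intercept estimates Var(r)
    return Gamma, gradient, Gamma / np.var(y)       # Vratio = Gamma / Var(y)

X = np.random.rand(2000, 19)
y = X[:, 0] ** 2 + 0.05 * np.random.randn(2000)     # toy target, Var(r) = 0.0025
print(gamma_test(X, y))
```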

| Model search with AutoML
The Gamma test is an existential estimator of modeling feasibility.
Therefore, actual modeling must be performed with dedicated machine learning procedures, and it is well known that there is no universally best method. Accordingly, a model search is necessary.
AutoML is a research area that targets the progressive automation of machine learning processes.It provides tools and algorithms for data preprocessing and cleaning, selection and construction of appropriate features, model selection, and model hyperparameters optimization.
In this paper, the AutoSklearn AutoML tool 48 was used for finding multi-target regression models on datasets using the original features, selected features, and generated features (those from embedding spaces). The modeling regression repertoire provided by AutoSklearn is composed of the following algorithms: {AdaBoost, Automatic Relevance Determination, Decision Tree, Extra Trees, Gaussian Process, …}.
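A sketch of such a search, assuming auto-sklearn's AutoSklearnRegressor (which accepts a multi-output target) configured along the lines reported in Section 3: 5-fold CV, MSE, seed 123, instance-based models excluded, and a per-model time cap of 0.1 of the total budget; the data arrays are placeholders:

```python
# AutoML multi-target regression search sketch (parameter names assume a
# recent auto-sklearn release).
import numpy as np
from autosklearn.regression import AutoSklearnRegressor

X_tr, Y_tr = np.random.rand(1000, 19), np.random.rand(1000, 5)  # placeholders
X_te = np.random.rand(100, 19)

total_seconds = 24 * 3600
automl = AutoSklearnRegressor(
    time_left_for_this_task=total_seconds,
    per_run_time_limit=int(0.1 * total_seconds),     # cap per candidate model
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
    exclude={"regressor": ["k_nearest_neighbors"]},  # no instance-based models
    seed=123,
)
automl.fit(X_tr, Y_tr)        # Y_tr: (n_samples, 5) atom counts per element
Y_hat = automl.predict(X_te)  # real-valued, later defuzzified by rounding
```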
SHAP values build on a family of additive feature attribution methods. 57-61 The idea is to use a simpler explanation model as an interpretable approximation of the original complex model. With the so-called local methods, the prediction of a model is based on a single input, and explanation models use simplified inputs. 51 If f is a (possibly complex) model and g is the explanation model, a mapping function h_x maps simplified inputs x′ to the original model inputs x, via x = h_x(x′). In local methods the idea is that if z′ ≈ x′, then g(z′) ≈ f(h_x(z′)). In Additive Feature Attribution methods the explanation model is a linear function of M simplified binary variables z′ ∈ {0, 1}ᴹ,
g(z′) = φ₀ + Σᵢ₌₁ᴹ φᵢ z′ᵢ,
with coefficients φᵢ ∈ ℝ representing an effect for each feature. Adding all the effects produces an approximation of the output of the original model f for input x.
The coefficients are estimated using Shapley regression values. 56 It is proven 51 that a unified measure of feature importance can be found by computing Shapley values on a conditional expectation function of the original model, f_x(z′) = f(h_x(z′)) = E[f(z) | z_S], where S is the set of non-zero indexes in z′. These SHAP (SHapley Additive exPlanations) values associate with each attribute the change in the expected model prediction when conditioning on that attribute. In this paper, this approach was applied to the model obtained with AutoML procedures when using all of the original 19 properties to predict atomic composition.
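For tree ensembles like the one found here, SHAP values can be computed exactly and efficiently. A minimal sketch assuming the shap package, with a random forest standing in for the AutoML model and placeholder data:

```python
# Model-level SHAP explanation for a fitted tree ensemble.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

X_tr, X_te = np.random.rand(1000, 19), np.random.rand(200, 19)  # placeholders
y_tr = np.random.randint(0, 10, 1000)   # e.g., Carbon atom counts (placeholder)

model = RandomForestRegressor(random_state=123).fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)        # exact and fast for tree models
shap_values = explainer.shap_values(X_te)    # one attribution per feature/sample
shap.summary_plot(shap_values, X_te)         # global, model-level view
```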

| Unsupervised analysis
The sorted intrinsic dimension estimates obtained with the methods from Section 2.3.1 are shown in Table 3, where n7 = 20 and n9 = 100 are neighborhood sizes; fan, (mli, mle), and fo are the Fan, maximum likelihood, and Fukunaga-Olsen variants of the corresponding algorithms, respectively; and k1 = 20, k2 = 100 are the first and second neighborhood sizes considered. The mean and median dimensions are 5.088 and 4.469, respectively, for an IDim/N_v ratio in the [0.235, 0.268] range. Accordingly, most of the information is contained in a subspace of dimension ≈ 25% of the descriptor space.

| Manifold learning
For visualization purposes, and considering the limitations imposed by hard media, 2D approximations of the manifolds containing the data in QM7b and QM9 can be obtained by computing the mapping θ: ℝ^{N_v} → ℝ² using Umap (Section 2.3.1), with N_v = 14 and 19, respectively. They are shown in Figure 3. For QM9 (Figure 4B) there is no clear division at low α-levels, and cluster stability is found at higher levels ([13, 20]), with 33 clusters, indicated by the curly braces between the horizontal dashed lines.
These 33 clusters successively split into smaller ones as perplexity decreases in the hierarchical process, and they represent the main constituents of the core region. The cluster separation process in the [13, 20] α-level interval is visible as gaps that segment the individual lines. Kriging 62,63 was used to create the function N_atoms = f(x, y), where x and y are the coordinates of each molecule according to the θ-mapping (kriging produces the best linear unbiased prediction at unsampled locations). The same procedure was applied to QM7b, and the resulting functions are shown in Figure 7.
For the two datasets, highly accurate models were obtained when using the original properties, as evidenced by the low error measures, the high correlations, and the behavior of the predictions versus the observations. When the predictions are made from small-dimension embedding spaces the errors are much higher, but the correlations are only slightly smaller. The predictions versus the observations are more scattered, particularly for QM9, but still with a highly significant R². Considering that the dimension of the embedding space is only 2, much smaller than the mean intrinsic dimension estimate, achieving this level of predictability is noteworthy.
F I G U R E 7 Distribution of the number of atoms in the molecule in the 2D embedding spaces. Top: QM7b. Bottom: QM9.

| Inverse design
The sorted intrinsic dimension estimates for the QM9 dataset obtained with the methods from Section 2.3.1 are shown in Table 5, where nn = 100 (neighborhood size), comb = "mle" (maximum likelihood), and k1 = 20, k2 = 100 (first and second neighborhood sizes). The mean and median estimated dimensions are 5.088 and 4.469, respectively, for an IDim/N_v ratio in the [0.235, 0.268] range. Accordingly, most of the information is contained in a subspace of dimension roughly 25% of that defined by the number of properties.
A 2D approximation of the manifold containing the data was obtained by computing the mapping θ: ℝ¹⁹ → ℝ² using Umap (Section 2.3.1). It is shown in Figure 9A, where the outlier/inlier information is overlaid a posteriori on the unsupervised mapping for comparison. Clearly, the outer region is composed predominantly of outliers, whereas the core region concentrates inlier objects. It must be kept in mind that since the estimated IDim is around 5 (larger than 2), there are inevitable deformations and misplacements of objects in the space. The structure shows a ring-shaped cloud of scattered, outlying points surrounding a densely packed core containing most objects. A closer view of the core region exhibits the presence of a large quantity of relatively small to medium-sized clusters (Figure 9B), with some outliers.
A general property related to atomic composition is molecular weight (W_m). When added as a third dimension (z-axis) to the space of Figure 9, it is possible to create a function W_m = f(x, y) via kriging interpolation, 62,63 where x and y are the coordinates of each molecule according to the θ-mapping (kriging produces the best linear unbiased prediction at unsampled locations). The resulting function is shown in Figure 10.
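A minimal sketch of this interpolation, assuming the pykrige package; the coordinates and weights are random placeholders for the θ-mapping coordinates and W_m values:

```python
# Kriging the W_m = f(x, y) surface over the 2D embedding.
import numpy as np
from pykrige.ok import OrdinaryKriging

rng = np.random.default_rng(123)
x, y = rng.random(500), rng.random(500)  # 2D embedding coordinates (placeholder)
w = 100.0 + 30.0 * rng.random(500)       # molecular weights (placeholder)

ok = OrdinaryKriging(x, y, w, variogram_model="spherical")
gx = np.linspace(x.min(), x.max(), 200)
gy = np.linspace(y.min(), y.max(), 200)
W_grid, variance = ok.execute("grid", gx, gy)  # BLUP surface W_m = f(x, y)
```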
The comparison with Figure 9A identifies an outer ring mostly composed of outliers with lighter molecules. There is an inner core with a more complex structure, composed of a ridged rim concentrating heavier, clustered molecules. The tilted perspective of Figure 10 (bottom) presents a clearer view of these characteristics of the W_m distribution, suggesting an actual functional dependency between the location of a molecule in the 2D manifold and its weight, as an overall expression of its constituent atoms.
A predictability assessment for W_m was made using residual variance analysis (Section 2.3.3), with predictors given by the original properties, as well as by generated features corresponding to embedding spaces of various dimensions (Figure 2). The results in Table 6 indicate that the predictability of molecular weight is very high when using all original features or embedding spaces of dimension 5 or 10 (low Gamma values with little difference among them, and an extremely small V_ratio, close to 0, the ideal). The Gradient values indicate that the models obtained from embedding spaces are expected to be more complex than those using the original features.

| Supervised inverse design
The objective of inverse design is to use a set of molecular electronic properties as input and to predict the atomic composition of the molecules that correspond to these properties. Some important elements were considered when formulating the modeling problem: (i) for each chemical element, its number of atoms in a molecule is a ratio variable with discrete values, interpreted as a fuzzy number (real-valued, defuzzified to obtain a discrete value by rounding to the nearest integer, since the predicted number of atoms must be a non-negative integer); (ii) modeling is formulated as a regression (classification would have treated the targets as nominal, which they are not); specifically, since a molecule is composed of atoms of different elements, all of which are interrelated (Table 2), the formulation is that of a multi-target regression; (iii) contrary to common practice, outliers were not removed prior to the learning process; and (iv) in all scenarios (Section 2.3.4, Figure 2), modeling was performed using the AutoML approach with the same training and testing sets (Section 2.2) and the same parameters, in order to ensure comparability: 5-fold cross-validation, optimization metric = mean squared error (MSE), scoring functions = {mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), coefficient of determination (R²)}, random seed = 123. Instance-based models (e.g., k-nearest neighbors) were excluded. The time limit for trying any model during the search process was set to 0.1 of the total time allotted for the entire search, to ensure that a complex but potentially relevant model would not be disregarded. Moreover, ensemble models were always allowed. The remaining learning parameters were set to their default values.
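A minimal sketch of the defuzzification step in (i): round the real-valued predictions to the nearest integer and clip at zero.

```python
# Convert real-valued multi-target predictions into atom counts.
import numpy as np

def defuzzify(y_pred):
    # round to nearest integer, never below zero
    return np.clip(np.rint(y_pred), 0, None).astype(int)

print(defuzzify(np.array([[6.93, 0.02, 9.41, 1.12, 2.77]])))
# -> [[7 0 9 1 3]]  one row per molecule, one column per chemical element
```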

| Prediction using the 19 original properties
The AutoML model search architecture is shown in Figure 11, where the inputs were set to the 19 original QM9 properties and the outputs to the five chemical species. The settings used for conducting the search process are included, and the model repertoire was the one presented in Section 2.3.4. Interestingly, despite the great flexibility allowed for the search process, the resulting model was not an ensemble but a single random forest. The results on the testing set obtained when using all of the 19 original properties for predicting atomic composition are shown in Table 7.
The metrics correspond to a multi-target regression model of the form F: ℝ^{N_v} → ℝ^{N_t}, resulting from the 24 h of AutoSklearn exploration.
In addition to regression metrics, the table presents the individual accuracies for each chemical element in order to provide a more detailed perspective. All error measures were small, and particularly the R² value indicates that the model exhibits very high quality. Moreover, training and testing errors were very similar, indicating no overfitting. The individual accuracies for the chemical elements were computed after a post-processing phase, where the real-valued estimations of the number of atoms of each element are defuzzified and converted to discrete values. Once in discrete form, the accuracy values were computed from the confusion matrices constructed for each element. The element-wise accuracies are very high in all cases, with the lowest one corresponding to Fluorine (95.73%), which is high in itself. For predicting the entire molecule composition, this model achieves a very high accuracy as well (97.80%). Altogether, this approach allows very accurate estimations of molecule composition from inputs given by their electronic properties.

3.4.1 | Explainable AI
The methods from Sections 2.3.3 and 2.3.5 can be used to derive model-level explanations for the predictions associated with given atomic species, for instance Carbon (similar procedures can be applied to the other elements). A Gamma test-based predictability assessment of the number of Carbon atoms in the molecule by the different properties, derived from the information contained in the training set, is shown in Table 8. The properties are sorted according to the Γ values (since the target is the same for all predictors, and therefore so is its variance, the ordering determined by V_r is the same). There is a first group of predictors {u298, h298, g298, u0} with a very strong functional relationship with the Carbon variable, where the (1 − V_r) values are on the order of 99.66%, a relative determinism index. A second group of variables {g298_atom, u0_atom, h298_atom, u298_atom, alpha, zpve} still exhibits a somewhat fair predictive relationship with Carbon, although much weaker than the first group ((1 − V_r) values on the order of 60%). For the rest of the predictor variables, the noise levels are much higher, indicating a weak direct relationship with Carbon.
However, they may still have influence due to interactions with other variables, which in this case is noteworthy: the (1 − V_r) value when all predictors are considered is 99.90%, much higher than for any individual predictor.
F I G U R E 1 1 AutoML model search architecture for predicting the five targets using the 19 original properties. It corresponds to the X to F(X) mapping (F: ℝ¹⁹ → ℝ⁵) in Figure 2.
T A B L E 7 Performance measures on the testing set for predicting atomic composition using the 19 original properties.
The role of all predictor variables in the AutoML model, as evaluated from the point of view of their PVI and SHAP values on the testing set, is shown in Figure 12 (A and B, respectively). For each method, the top 10 features are overwhelmingly the most important ones. Moreover, both methods coincide in identifying the same features as their top 10, with only slight differences in ranking. Interestingly, the zpve property is seen as the most important one by the two methods, in contrast with the Gamma test, which also considers its relation to the target much more complex than that of the other features, as indicated by the very high values of the Gradient measure. 47 The difference between the XAI methods and the Gamma test with respect to zpve is probably due to the fact that the former use the model to evaluate the importance measures (with the other predictors present), whereas the latter focuses on the single predictor-target relationship exclusively using the data, in the absence of any model.

| Prediction using selected features
The results from Section 3.4.1 showed that, when predicting the number of atoms of each element present within each molecule, certain properties are more relevant than others, and it is important to find them. In order to explore this situation in greater detail, the feature selection algorithms described in Section 2.3.2 were applied to the data described by the entire set of 19 original properties.
Since these algorithms operate on single targets, a double aggregation process was set in place: (i) a first one creating a consensus among the different algorithms that evaluate feature importance when predicting a given element, and (ii) a second one aggregating those consensus results across all elements. The result is a consensus subset of properties appropriate for the multi-target case. The results of the two aggregation processes using the MC4 Markov algorithm are shown in Table 9.
Based on the second-order aggregation results, a subset composed of the top 10 properties was considered as most relevant, representing 52.6% of the original predictors.They are in good agreement with the subsets found in Section 3.4.1 and were used in a model search exercise with the same parameters as when using the original 19 predictors.
The model search process based on the 10 selected features is shown in Figure 13, and the results obtained with the best model found are shown in Table 10.In this case, it corresponded to an ensemble of two random forests with weights 0.88 and 0.12, respectively.
The model quality measures are better than the ones obtained when using all features (Table 7), indicating that there are irrelevancies, noise, and harmful interactions within the set of original properties. This is confirmed by the results of a Diebold-Mariano test 64,65 between the predictions of the number of atoms of each chemical element given by the model using all features and by the one using the 10 selected features (Table 11).
For the majority of the targets, the differences are statistically significant, indicating that with the 10 selected features the predictions are equal to or better than with all of the features. The improvement obtained when using the selected features cannot be attributed to overfitting, since the training and testing errors were very similar.
F I G U R E 1 3 AutoML model search architecture for predicting the five targets using the 10 selected properties. It corresponds to the X to F(X) mapping (F: ℝ¹⁰ → ℝ⁵) in Figure 2.
T A B L E 1 0 Performance on the testing set for predicting atomic composition using 10 selected properties.
When predicting from a Umap embedding space of dimension 10 (Figure 14), the best model found was an ensemble composed of one extra-trees and one random forest, with weights 0.64 and 0.36, respectively. There was a prediction performance improvement, but the small magnitude of the gain shows that even though there was still predictive information in the extra dimensions, it was not enough to produce an overall high-performance model.
The model obtained with the 10 selected original properties was clearly better. An additional experiment using a variational autoencoder with a latent-space layer of dimension 10 for the φ and φ⁻¹ mappings was actually an under-performer, as shown in Table 13.
Overall, feature selection performed better than feature generation.

| Prediction using the 19 reconstructed properties
This scenario corresponds to the mapping (F ∘ (φ⁻¹ ∘ φ)): ℝ^{N_v} → ℝ^{N_t} in Figure 2, where φ and φ⁻¹ are the encoder and decoder components of the described VAE, and F is the target predictor obtained with AutoML. The results are shown in Table 14.
Model metrics lag behind those obtained when working directly from the original predictor space, either using all properties or a selected subset, as was found for the (ψ ∘ φ) case. However, it is interesting that all performance metrics are very close to the ones obtained when predicting the targets from an embedding space of the same dimension, despite the additional transformation (φ⁻¹) involved in (F ∘ (φ⁻¹ ∘ φ)), with its additional error contribution.

| CONCLUSIONS
Unsupervised analysis revealed that the overall data structures of the QM7b and QM9 molecule datasets are quite different. While the former has a clearly defined two-cluster structure, the latter is composed of an outer region predominantly containing outliers and an inner core region that concentrates inlier objects grouped in clusters. For QM9 there is a significant relationship between the number of atoms in the molecule and its outlier/inlier nature, so that molecules with either a small or a large number of atoms are predominantly outliers, whereas molecules with a number of atoms in the middle range are mostly inliers and clustered. These characteristics should be taken into account when developing predictive models for the inverse design of de novo molecules.
T A B L E 1 2 Performance on the testing set for predicting atomic composition using Umap embedding space features. Metrics (2 embedding space features): (ψ ∘ φ)(X) in Figure 2.
For both datasets the intrinsic dimension is several times smaller than the descriptor dimension, indicating a high level of redundancy.
Despite the structural differences, their properties carry strong predictive information for molecular composition, as illustrated by the number of atoms in the molecule, which can be accurately predicted from the original properties. Embedding spaces of small dimension still retain predictive capabilities.
For the QM9 dataset, models predicting the atomic composition of molecules using all of the electronic properties as predictors exhibited very high performance according to all common model quality metrics, supporting the proposed approach to inverse molecular design. Model-free (Gamma test) as well as model-based XAI methods and feature selection procedures coincide in identifying a subset of features that are the most relevant for predicting atomic composition from physical-chemical properties; they represent approximately one half of the total. For this problem and data, models using only a selected 52.6% subset of the original properties outperformed those using all of them. Moreover, working with the original features (either all of them or selected subsets) resulted in better models than those based on features generated from embedding spaces. Beyond the individual chemical elements, global properties like molecular weight also exhibited high predictability from the electronic properties.
AutoML procedures proved useful in the automation of inverse molecule design approaches and deserve dedicated attention in future studies.
2.1 | Data
QM7b and QM9 are the two molecule datasets used in the paper. They are part of a larger collection of resources available in Reference 7, with the aim of accelerating the development of fast and accurate simulations of quantum-chemical systems from first principles.

2.1.1 | QM7b dataset
The QM7b dataset 8,9 is a derivation of the QM7 data, which is composed of all molecules of up to 23 atoms (including up to seven heavy atoms: C, N, O, and S), totaling 7165 molecules. This dataset features a large variety of molecular structures such as double and triple bonds, cycles, carboxy, cyanide, amide, alcohol, and epoxy groups. QM7b is an extension of QM7 oriented to multitask learning, where 13 additional properties (e.g., polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). Additional molecules comprising chlorine atoms are also included, totaling 7211 molecules. QM7b is composed of 7211 objects described in terms of 14 properties representing the inputs (Coulomb matrices). Each molecule is composed of atoms from the following six chemical elements: Carbon (C), Chlorine (Cl), Hydrogen (H), Nitrogen (N), Oxygen (O), and Sulfur (S). The following quantum-derived properties describe the molecules: (1) PBE0 atomization energy, (2) ZINDO excitation energy, (3) ZINDO highest absorption intensity, (4) ZINDO HOMO, (5) ZINDO LUMO, (6) ZINDO first excitation energy, (7) ZINDO ionization potential, (8) ZINDO electron affinity, (9) PBE0 HOMO, (10) PBE0 LUMO, (11) GW HOMO, (12) GW LUMO, (13) PBE0 polarizability (Å³), (14) SCS polarizability (Å³).

2.1.2 | QM9 dataset
QM9 10,11 contains N = 133,885 small organic molecules, where each molecule is composed of a combination of the following five chemical elements (N_t = 5): Carbon (C), Fluorine (F), Hydrogen (H), Nitrogen (N), and Oxygen (O). In inverse design approaches, these represent the model targets. The molecules are characterized by N_v = 19 electronic properties consisting of geometric, energetic, electronic, and thermodynamic properties computed using quantum chemistry techniques.

F I G U R E 1 Architecture of the AutoML model search process.

Feature selection in machine learning generally involves three types of methods: (i) filter methods (the selection is based on scores from statistical tests that are independent of machine learning methods), (ii) wrapper methods, in which features are used within a machine learning algorithm and are added or removed based on the quality of the inferences made, and (iii) embedded methods, in which the algorithms have their own built-in feature selection mechanisms (e.g., Lasso). In this paper, six methods were used for feature selection: {Boruta, Permutational Variable Importance (Dalex), Multivariate …}.

2.3.5 | Explainable AI
The acceptance of AI systems for decision-making tasks in critical domains, like medicine and health care, requires the introduction of techniques capable of delivering explanations of model behavior. Many models built with ML and deep learning methods are considered 'black boxes' because their underlying structures are complex, non-linear, and extremely difficult to interpret and explain. Popular ML algorithms like Random Forest, XGBoost, SVM, and neural networks (shallow and deep) fall in this category. Their opacity has created the need for architectures aiming at producing more transparent models, developing techniques that enable humans to interact with these models, and making their inferences trustworthy. Explanation methods work either at the model level or at the instance level. The former refers to the transparency and interpretability of the entire model. This involves understanding how input features are used across all decisions made by the model for a given dataset (i.e., explaining how certain features influence the predictions for the testing set). On the other hand, instance-level explanations (also known as local interpretability) focus on individual predictions. They explain why a specific decision was made for a given data instance (i.e., why a certain sample was classified as belonging to some class). Both types of explanations are important for understanding, validating, and trusting AI models. They also offer different insights: model-level explanations help understand the overall logic of the model, while instance-level explanations help understand individual predictions. In this paper only model-level explanations were considered when predicting atomic composition from the original molecular properties, focusing on the predictors' role in model behavior. To that end, variable importance was evaluated via Permutation Variable Importance and SHAP values.
Permutation Variable Importance (PVI) 49,50 is a technique used to estimate the importance of different features in a predictive model. It is model-agnostic, meaning it can be applied to any type of predictive model. The PVI is calculated by permuting the values of each feature and measuring the decrease in model performance, under the assumption that permuting the values of an important feature should result in a significant performance drop. The process is composed of the following steps (a sketch follows the list):
1. Fit the model: fit the predictive model M with the dataset D.
2. Evaluate the model: evaluate the performance of the fitted model using a chosen metric (e.g., accuracy for classification, MSE or R² for regression) on D. This gives a baseline performance score s.
3. Permute feature f_j and evaluate: for each feature, permute (i.e., randomly shuffle) its values in the validation set K times, and re-evaluate the model performance on each perturbed (permuted) set.
4. Compute importance: the importance of each feature is then computed as the difference between the baseline performance and the mean performance obtained with the permuted feature, i_j = s − (1/K) Σₖ₌₁ᴷ s_{kj}, where i_j is the importance of feature f_j and s_{kj} is the score of M on the k-th permuted dataset.
This process is repeated for all predictor features in the dataset, providing an estimate of the importance of each feature. It is important to note that PVI measures the importance of a feature given a particular model and does not necessarily imply causality.
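A minimal sketch of the PVI loop above, using scikit-learn's permutation_importance (K corresponds to n_repeats; the data arrays are placeholders):

```python
# Permutation Variable Importance via scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X_tr, Y_tr = np.random.rand(1000, 19), np.random.rand(1000, 5)  # placeholders
X_te, Y_te = np.random.rand(200, 19), np.random.rand(200, 5)

model = RandomForestRegressor(random_state=123).fit(X_tr, Y_tr)
pvi = permutation_importance(
    model, X_te, Y_te,
    scoring="neg_mean_squared_error",  # metric defining the baseline score s
    n_repeats=10,                      # K permutations per feature
    random_state=123,
)
importances = pvi.importances_mean     # i_j = s - mean_k(s_kj) for each feature
```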
If F is the set of all features and S ⊆ F, the method retrains the model on feature subsets and considers the effect on the prediction of including or excluding a given feature. Models f_{S∪{i}} and f_S are trained with and without the i-th feature, where x_S and x_{S∪{i}} represent the description of a given input x composed exclusively of the attributes in S and in S ∪ {i}, respectively. The difference f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) measures the magnitude of the effect of the i-th attribute when it is computed for all possible subsets S ⊆ F∖{i}. The Shapley regression values are obtained by considering all possible differences, weighted by the number of feature orderings consistent with each subset:
φᵢ = Σ_{S ⊆ F∖{i}} [|S|! (|F| − |S| − 1)! / |F|!] · (f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S)).

In the case of QM7b, the most distinctive feature of Figure 3 is the sharp division of the data into two well-defined clusters of different sizes, with a very large distance in between. The second, larger one exhibits a marked elongation, indicating a strong covariance structure in the nonlinear Umap space. For QM9 the space has a quite different structure, composed of an outer broad region of scattered, outlying objects surrounding a much smaller and densely packed inner core area concentrating most objects. Both substructures are quasi-circular in shape, indicating a lack of covariance between the two embedding-space variables. In any case, the observed structures of the 2D embedding spaces are only approximations of the real situation, since the intrinsic dimension for both datasets is around five. Clustering brings another perspective, since it was performed in the original descriptor spaces (the spaces defined by the original properties). A hierarchical clustering approach was preferred here in order to avoid methods that require, for example, an estimation of the number of clusters present. Other methods that are free from this requirement…

The structure of QM9 in Figure 4B is very different from that of QM7b: there are many more clusters, and they are defined at higher α stages. Altogether, the α-clustering results (involving all properties) coincide with…

F I G U R E 5 Umap 2D space with the 133,885 QM9 dataset molecules. Blue: inlier objects. Red: outliers. (A) Overall 2D mapping showing the approximate distribution of inlier (blue) and outlier (red) objects. (B) Zoomed view of the spherical core region of Figure 3B.
F I G U R E 6 Relationship between the number of atoms in the molecule and the outlier score. Blue: inliers. Red: outliers. The second-degree polynomial regression is least squares, and 95.2% of the N = 133,885 points are inliers, determining the location of the middle x-axis range minimum.
For QM7b, the lower left corner, which corresponds to the smaller cluster of Figure 3A, is predominantly composed of larger molecules (> 17 atoms). The upper right region of the space is mostly composed of smaller molecules (< 15 atoms), with the cluster periphery occupied by larger molecules. For QM9 there are approximately concentric rings composed of alternating smaller (< 10 atoms) and medium (10-16 atoms) molecules, with an inner core composed of a rim containing large molecules (> 18 atoms) encircling a small area of medium-sized molecules. A predictability assessment of the functional dependency between the location of the molecules in the 2D manifold and their size (given by the number of atoms in the molecule) was made with regression tree models using a random 90%/10% training/testing split. They were used because of a trade-off between their simplicity and their ability to work with either a small or large number of predictor features (the embedding spaces of Figure 3 have only two features, whereas the original descriptor spaces have 14 and 19, respectively). From them, baseline models were produced to illustrate the predictive capability of such spaces, using the number of atoms in the molecule as the target. The results for the testing sets of both QM7b and QM9 are shown in Table 4 and Figure 8.


F I G U R E 9 Umap 2D space with the 133,885 QM9 dataset molecules. Blue: inlier objects. Red: outliers. (A) Overall 2D mapping showing the approximate distribution of inlier (blue) and outlier (red) objects. (B) Zoomed view of the spherical core region of Figure 9A, showing clusters within the structure.
F I G U R E 1 0 Dependency between molecular weight and location in a 2D embedding space. Top: view along the negative z-axis. Bottom: tilted view.
T A B L E 6 Predictability assessment for the molecular weight from the original properties and from generated features.
F I G U R E 1 4 AutoML model search architecture for predicting the five chemical targets using an embedding space of dimension 10 obtained with UMAP. It corresponds to a (ψ ∘ φ): ℝ¹⁰ → ℝ⁵ mapping in Figure 2.
T A B L E 2 Pearson correlation matrix for the QM9 targets.
T A B L E 5 Intrinsic dimension estimates.
T A B L E 4 Regression tree models for QM7b and QM9 predicting the number of atoms in the molecule.
T A B L E 1 1 Diebold-Mariano tests between predictions obtained with the original 19 properties and with the 10 found by feature selection (significant differences in bold).
T A B L E 1 3 Performance on the testing set for predicting atomic composition using the 10 VAE latent space features.
T A B L E 1 4 Performance on the testing set for predicting atomic composition using the 19 reconstructed properties from the VAE used in Table 13 (X̂ = φ⁻¹(X_e) in Figure 2).