Tackling the challenge of a huge materials science search space with quantum-inspired annealing

Efficient screening of chemicals is essential for exploring new materials. However, the search space is astronomically large, making calculations with conventional computers infeasible. For example, an $N$-component system of organic molecules generates>$10^{60N}$ candidates. Here, a quantum-inspired annealing machine is used to tackle the challenge of the large search space. The prototype system extracts candidate chemicals and their composites with desirable parameters, such as melting temperature and ionic conductivity. The system can be at least $10^4$-$10^7$ times faster than conventional approaches. Such exponential acceleration is critical for exploring the enormous search space in virtual screening.


Introduction
Finding new functional materials with improved performance is critical, especially in developing next-generation batteries, solar cells, and other energy-related devices. [1] Materials informatics has been attracting attention because of its potential to accelerate research and development. [1,2] Machine learning, which can handle the big data from materials science and identify statistical trends, is a critical technology in this field. [1][2][3] Trained models can predict the performance (y) of diverse materials from their structures (x). [1,2] The predicted values (ypred) from a trained model (fML) have become increasingly reliable. Various parameters, including basic molecular properties (e.g., mechanical strength and melting point), [4] conductivity, [3] photoconversion efficiency, [5] and synthetic yields, [6] have been predicted successfully. If an appropriate training database and model are prepared, the predictions will become as accurate as those from human researchers and molecular simulations. [3] Another major advantage of machine learning is its high prediction speed (e.g., 10 -3 s per condition), [1] allowing researchers to screen chemicals with a trained model much faster than with real experiments and first-principles calculations.
A key challenge in materials informatics is the astronomically large search space for candidate materials. Even for small organic molecules, there are more than 10 60 candidates. [7] Current virtual compound screening using a machine learning function, ypred = fML(x), may not finish before the end of the universe; the required calculation time of 10 60 × 10 −3 s is longer than the estimated life of the universe, 10 19 s [8] (Figure 1). Furthermore, the search space increases exponentially when a composite system of multiple chemicals is considered (10 60N candidate composites for an N-component system), and most conventional materials and devices consist of multiple components. [1,3,5]  Various approaches have been proposed to solve the massive search space problem, such as defining an inverse function x = fML -1 (y), which predicts structures or experimental conditions from a target parameter y. [9][10][11][12] However, the function is intrinsically hard to define because of the uniqueness of solution problem. [1,10,12] Furthermore, the preparation of chemical structures itself is still a huge challenge even with cutting-edge deep learning techniques due to the current limitations of model complexity and computing power. [13] Other molecule generation techniques, such as Bayesian approaches, [14] also tend to have similar limitations.
Most studies have focused on single-component systems and the methodology for multiplecomponent systems has not been developed satisfactorily. [10,11,14] A practical approach to exploring materials is to narrow the search space by filtering candidates manually. Based on the knowledge of human researchers, only feasibly useful structures are selected. [3,5] Typically, 10 2 −10 4 candidate chemicals would be selected to allow 4 calculations to be finished quickly [3,5,9] . However, filtering based on existing knowledge may overlook hidden potential candidates. The approach also may fail when treating N-component systems. If there are 10 M chemical candidates, their available combinations will be 10 × , excluding their weight compositions (i.e., N -1 additional independent variables). For instance, the calculation time would be as long as 10 9 × 10 −3 s ≅ 300 h with × = 9. This type of combinatorial explosion is critical because ≥ 3 and ≥ 3 are typical for conventional materials and devices. [3,5] Here, we introduce a quantum physics-inspired annealing machine to tackle the challenge of the huge search space ( Figure 1). The main limitation of conventional approaches is the insufficient computing power compared with the search space. Quantum physics-based or -inspired computing solves the exploration problems efficiently [15][16][17][18][19][20] owing to massive parallelization that cannot be achieved by conventional computers. [15][16][17][18][19][20] To demonstrate the potential of quantum computing approaches, we constructed a prototype system to explore organic chemicals and composites displaying desirable parameters (e.g., melting point and ionic conductivity). The system was at least 10 4 −10 7 times faster than conventional approaches, helping to solve the combinatorial explosion problem in materials science.

Molecular exploration system framework
Quantum computers can solve some types of problems quickly, including combinatorial optimization, prime factorization, and molecular quantum processes. [16,19] The superposition principle is useful for exploring new materials from a vast search space efficiently (Figure 2a).
Gate-based quantum computers can solve various problems, although they currently have many technical problems, such as long-time coherence. [16,19] In contrast, quantum annealing machines have already been commercialized. [21] Quantum annealing machines are used to solve the Ising 5 model (equation 1), [15,21,22] which was used to explore materials in our current study. Previously, annealers have been used in limited cases to explore materials (e.g., orders of nanomaterials). [15] In this work, our prototype system was used to explore general organic materials with diverse properties. The architecture of our material screening system is shown in Figure 2a. First, structural information was converted into binary arrays (0 or 1) using a mapping function, fmap ( Figure   6 2b). Binarization was needed because quantum annealers accept only binaries. [22] A machine learning function, ML , was prepared in the form of the Ising model. An experimental database of target materials was used to train the model.
The machine learning function was a linear model considering quadratic terms (xixj, i≠j).
Annealers can solve the problem of = ML ( ) (or ) promptly (around 10 -6 to 10 -0 s, Figure 2c). [20,22] The obtained could give ypred, which is close to the global maximum (or minimum) of the system (equation 1). [22] Conventional computers may not be able to solve the problem when the dimension of x becomes higher than around 100 (i.e., 2 100 ≅ 10 30 -dimensional search space). [20] Finally, chemical structures were proposed as real-world information based on an inverse mapping function, fmap -1 .
For the annealer, we used a digital annealing unit, which was developed based on quantum annealing. [20] The machine is a digit simulator, but it can treat up to 8192 fully connected binary variables with massive parallelization. [20] In contrast, the commercialized quantum annealer D-Wave still has a strict limitation on the input variables and binary interactions; there are typically fewer than 100 available variables for achieving fully connected bits. [8,21,22]

Exploration with a single-component system
First, we extracted high-melting-temperature molecules with the annealer. An open experimental database of organic molecules [23] was used for machine learning (about 3000 chemicals, Table S1). The structural information was converted into 2048-dimensional 7 molecular fingerprints xFP. [24] The array was compressed to n-dimensional arrays to obtain a binary input, x (0 < n < 2048, Figures 2b, 3a, S1, and S2). Compression was needed because nC2 interactions must be added as quadratic terms xixj (i.e., 2048C2 = 2,096,128 terms with n = 2048 are too many for regression). After adding the terms, the statistical relationships between the input and z-scores of the melting temperatures (y) were inputted to train a linear regression model ( Figure 2a and equation 1). The regression determined the coefficients Jij and hi. Regularized models [25] were used for regressions. Ridge (i.e., L2 norm) models gave better prediction performances than did Lasso models (i.e., L1 norm, Figure S3). The sparsity obtained by the Lasso model may have discarded too much of the essential contribution of quadratic terms (xixj) in equation 1. Because regression algorithms were developed for continuous variables, [25] a new specialized algorithm or theory for binary arrays may be needed in future work.

8
For regression, the database was split randomly into the training and testing datasets.
The training dataset was used for the regression, and the testing dataset was used only for prediction. The R 2 score for the testing dataset was as high as 0.5 with a dimension of x = 576 ( Figure 3b; results with other conditions are shown in Figure S3). The regression scores were even higher than those obtained with random forest regression models, which are regularly used to treat nonlinear systems ( Figure S4). [1,5] Therefore, the proposed linear model with quadratic terms was sufficient to predict chemical properties from binary arrays.
The dimension of arrays, n, should be sufficiently larger than 100 to achieve high prediction accuracy ( Figures S3). However, the corresponding search space of >2 100 ≅ 10 30 was too large for conventional computers. Thus, new machines are needed to explore best input xideal effectively.
It was essential to obtain xideal yielding the highest predicted performance, ypred, from equation 1. A digital annealer was used to solve the problem ( Figure 4a). As a control, simulated annealing on a conventional computer was examined. The annealing time with the annealer was only around 2.5 s, regardless of dimension n. In contrast, the required time with the conventional machine increased quasi-exponentially and exceeded 300 s with n = 1250.
Moreover, ypred obtained in the control experiment was substantially smaller than in the annealer with n > 100, demonstrating that the control could not reach the global maximum of ypred ( Figure 4a). An impractically long time would be required to match the results of the annealer. After decompressing xideal to the fingerprint form, xideal,FP, Tanimoto similarity scores [26] were calculated to compare the fingerprints of the ideal compound and those recorded in the test dataset ( Figure 4b). Experimental y became statistically higher when the high similarity scores increased ( Figure 4c, red plots). The increase indicated that xideal found by the annealer maintained essential and universal information to achieve desirable performance. In contrast, there was no significant relationship between y and the similarity scores when the fingerprints of an existing compound with the highest y in the training dataset were compared ( Figure 4c, blue plots). This was because the structural information of only one compound was too specific and narrow, and may have overlooked some essential structural information to achieve the desired performance. We calculated the average y of the top 5% most similar compounds ( Figure 4d). Statistically, the annealing approach was preferable with various dimensions n rather than with existing fingerprints. Most of the extracted chemicals displayed z-scores of >1, corresponding to melting points of around 200 °C.
The proposed framework was useful for extracting chemicals with other desirable parameters, such as low melting temperature, high solubility parameter, high glass transition temperature, and high photoconversion efficiency in organic thin-film solar cells (Figures S5−S10). This versatile framework could be used for a wide range of materials and properties, not only for organic molecules, but also for inorganic compounds. During the annealing processes, additional restriction conditions using the fingerprints of existing chemicals also improved the extraction performance. The restrictions increased the structural feasibility of xideal, whereas only y was pursued without the restriction ( Figure S5, see Supporting Discussion for further information). Additional studies are needed to provide systematic insights into the extraction processes.

Generating new chemicals from ideal fingerprints
Instead of comparing the similarity scores with the existing database, directly generating molecules from xideal is preferable for accessing the vast chemical search space. With a large dimension n (e.g., >1000), x can express a sufficiently large number of chemicals, 2 1000 ≅ 10 300 . Therefore, if appropriate mapping functions (fMAP and fMAP -1 ) are given, the annealing framework may be able to explore arbitrary organic molecules. However, determining the functions, especially the inverse functions, is still unwieldy, regardless of the remarkable progress in chemoinformatics and machine learning. [1,11,13] Here, we used a more straightforward approach to preparing chemicals directly from xideal. After decompressing xideal to the 2048-dimensional fingerprint, xideal,FP, chemical structures were generated from the fingerprint fragments ( Figure 5a). A fingerprint bit xFP,i represents a specific fragment of a chemical structure. In the example in Figure 5a, bit xFP,0 indicates whether an aromatic ring is available in a molecule. The bit is 1 if the ring is present and else 0.
From the melting temperature database, chemical fragments were extracted by the breaking of retrosynthetically interesting chemical substructures (BRICS) algorithm. [27] Then, some fragments sharing the same bit of the ideal fragment were extracted and reconnected to make molecules according to BRICS (Figures 5b, 5c, and S11). Predicted y of the generated chemicals was statistically higher than the chemicals of the original database and randomly generated chemicals (control, Figure 5b). The average scores were 1.80, 0, and 0.71 for the proposed approach, original database, and random generation, respectively. Large values of ca. 4 were even obtained from the ideal fingerprint (Figures 5c and S11). Our high-throughput molecular generation system will help researchers design new materials with desirable properties.

Optimizing the composition of Li + -conducting electrolytes
Finally, we extended the annealing framework to find the optimal compositions of multiple chemicals forming composite materials. The search space increases exponentially with 13 N-component systems (N > 1). We focused on finding the best composition of organic Li +conducting electrolytes consisting of up to six chemicals (Figures 6a and S12).
Li + -conducting electrolytes were selected because of the essential role of Li + batteries in society and the demand for higher conductivity to improve outputs. [28] Most electrolytes consist of at least two components, namely, a solvent and salt. [3,[29][30][31] Additional chemicals are often added to improve viscosity, conductivity, stability, and other battery properties. [29,31] The search space becomes even larger when polymeric electrolytes are considered as essential components for next-generation solid-state batteries and conventional gel electrolytes. [3,30] We have previously explored organic Li + -conducting electrolytes using machine learning. [3] The largest database of Li + -conducting electrolytes was prepared, which covered both monomeric and polymeric conductors (around unique 1200 records near room temperature, Table S1 and Figure S13). The prediction accuracy of a model trained with the database was as accurate as that of experienced researchers [3] and the machine learning model identified new polymeric conductors. However, a narrow search space (<10,000) was used during the screening and it is likely that candidates in the broader search space were overlooked. 15 Composition information about electrolytes was converted into binary arrays. For simplicity, up to six chemical components [3] were considered for one composite. Weight ratios of the components were also included in x. The continuous values were converted into binaries by a unary method (see Experimental section for details). Then, regression was conducted to predicted room temperature conductivity. A linear model yielded high R 2 scores of 0.88 and 0.68 for the training and testing datasets, respectively ( Figure S13b).
After obtaining xideal with the annealer from the regression model, the decompressed array was compared with the fingerprints of a chemical compound database. There were around 10,000 candidate chemicals in the database. Therefore, the search space was as big as 10 4×6 , ignoring weight ratios (i.e., an additional five independent variables).
Compounds were extracted from the database by comparing their fingerprints with those of the ideal compounds. The selections were made by a random process, where high-similarity compounds were extracted preferentially according to a Gaussian distribution. The weight ratio was calculated by restoring the corresponding binaries from xideal. The selection was repeated to prepare 100 types of candidate composite (Figure 6b). These selection processes were executed five times to ensure statistical validity. On average, it took only around 3 s to obtain composites with ypred > 1.3. After finishing the 100 iterations, the maximum value increased to 1.4, which took about 14 s.
However, 10 5 iterations were not sufficient to achieve ypred = 1.3 when the compounds and weight ratios were selected by a completely random process (Figure 6b). The average time to reach the maximum (ypred = 1.17) was 15,600 s, whereas the annealing method took less than 0.85 s. Therefore, the latter was at least 15,600/0.85 ≅ 18,000 times faster than the standard approach.
Bayesian optimization, a conventional approach to explore optimal conditions, [1,32] did not find the best combination (Figure 6b). The best score was only 0.87 after 4000 iterations, although the method took an exceptionally long time of 356,000 s. The poor result was attributed to the non-smoothness of the function and search space that was too large.
Smoothness is essential to finding the global maximum (or minimum) effectively by Bayesian optimization. [32] However, ypred changed drastically only when a single component of a composite was changed ( Figure S14) because the compound ID in the database was a categorical variable, not a molecular descriptor. [1] At this point, the extraction performance of the Bayesian process was similar to that of the random process (see histogram of ypred in Figure   6c). It took an excessive amount of time to fit the responses to the fluctuating responses with a costly stochastic model during Bayesian optimization. [32] The annealing approach improved the statistical distribution of ypred greatly. For instance, 7% of the proposed electrolytes provided ypred of over 1.2, whereas in the random approach only 0.0002% did (Figure 6c). The speedup ratio for this specific task was as large as 4 × 10 4 . The The quality of the extracted compounds obtained by annealing was also higher than that of the controls (Figures 6d, S15−S17). About half of the chemicals extracted by annealing were experimentally examined as electrolyte components for Li + batteries [3,29] (shown in yellow, Figure S15). In contrast, 75% or more of the compounds extracted by the controls were unconventional ( Figures S16 and S17). Experimentally, most of the compounds proposed by the random approach would not provide ionic conductivity because the components were far from salt-or solvent-like structures (e.g., no polar solvents in a composite, Figures S16 and S17).
High prediction scores could be obtained by fML even with noisy composites, whose structures were dissimilar from electrolytes, because of the lack of training data during machine learning. We trained the model only with Li + -conducting electrolytes, not with non-Li +conducting composites. This was because current machine learning models can process only specific tasks in narrow fields due to the limited complexity of the model and computing power, [1] and adding too broad a context may decrease the prediction accuracy. [33] The preparation of databases, which is generally a manual process, [3,5] is also hugely timeconsuming. Therefore, the inaccurate predictions by fML are currently unavoidable and should be avoided during the candidate selection step.
The ratio of conventional and unconventional compounds in proposals is crucial during screening. If there are too many unconventional compounds, the proposed compounds will be "noisy" (Figures S16 and S17), whereas suggesting too many conventional compounds hinder the discovery of new materials. The ratio of 50% provided by annealing may be acceptable during the exploration of potential candidates. The balance can be tuned by adding a restriction condition during annealing in a similar way to the single-component system ( Figure S5).

Future challenges
Although the proposed annealing system is superior to conventional approaches, there are still problems that must be addressed. The prediction accuracy of fML is not sufficient. Based on standard knowledge of electrochemistry, we suspect that most of the proposed unconventional compounds, even those obtained by the annealing approach, will not improve sufficient conductivity ( Figure S15). [31] Apart from experiments or costly simulations, there is no way to confirm whether the proposed unconventional compounds are unexpectedly good or are noise.
Regardless of the higher accuracy required, fML is currently limited to the linear Ising model (equation 1). Models that are more sophisticated may be needed to improve accuracy.
Hybrid calculations with conventional computers are needed to implement more complex functions while awaiting the development of gate-based quantum computers. [18,19] In addition, the mapping functions to convert material information reversibly to binary arrays (fMAP and fMAP -1 ) must be optimized to access the enormous search space more efficiently.

Conclusion
We constructed a rapid extraction system for chemicals using machine learning and a quantum-inspired annealing machine. The framework allowed rapid exploration of chemicals and compositions, and the extraction was at least 10 4 −10 7 times faster than conventional methods. Increasing the speed of prediction methods is essential to tackle the combinatorial explosion problem in chemical screening (e.g., 10 60N candidates for an N-component system of small organic molecules). The structures extracted by our method were higher quality than those extracted by the control methods because the annealer could extract the essential features of desirable compounds. The screening system will be critical to dramatically shortening the time required to optimize versatile, functional materials and devices.

Acknowledgments
This work was partially supported by Grants-in-Aid for Scientific Research (Nos. 17H03072, 18K19120, 18H05515, 20H05298, 19K15638, and 20H05298) from MEXT, Japan. The work was also partially supported by the Research Institute for Science and Engineering, Waseda University.

Tackling the challenge of a huge materials science search space with quantum-inspired annealing
Kan Hatakeyama-Sato, Takahiro Kashikawa, Koichi Kimura, and Kenichi Oyaizu*

General information
Program code (written in Python 3) and databases used in this paper can be accessed at https://github.com/KanHatakeyama/annealing_project. Annealing was conducted using digital annealing unit 2 (DAU). [1] Another calculation was done using an Intel Xeon Gold 6148 CPU @ 2.40 GHz.

Databases
Five types of databases were introduced in this study (Table S1). a) Bradley's dataset [2] The database contains the molecular structures of about 3000 types of organic chemicals and their melting points. Molecular structures and melting points were set as x and y, respectively. b) Delaney' dataset [3] The database contains properties of about 1100 organic chemicals. Molecular structures and their experimental solubility parameters were set as x and y, respectively.
c) Polymer database [4] The database contains glass transition temperatures of about 170 conventional polymers.
Repeating polymer unit structures and the transition temperatures were set as x and y, respectively. d) Organic solar cells [5] The database records the performances of organic thin-film solar cells and contains the experimental responses of about 1200 cells. Molecular structures of the donor molecules and the photoconversion efficiency of the corresponding cell were set as x and y, respectively. Other parameters that may contribute to the efficiency, such as molecular weight, were ignored for simplicity.
e) Li + -conducting electrolytes [6] This database contains various types of organic liquid-and solid-state Li + -conducting electrolytes. Data for around 1200 unique electrolytes are recorded around room temperature.
Typically an electrolyte consists of multiple chemicals. Chemical structures (up to six compounds) and their weight ratio were set as x, and room temperature ionic conductivity was used as y. Other parameters, including inorganic additives and polymer structures (e.g., crosslinking), were excluded from x for simplicity.

Scheme to explore chemicals for one-component system
Except for Li + -conducting electrolytes, chemical structures were explored using DAU according to the following scheme ( Figure S1).

1) Data splitting
The original database was split into training and testing datasets randomly. In Figures 4d,   S3, S4, and S6−S10, five-fold cross-validation was used to ensure statistical validity.

3) Fingerprint compression
The generated fingerprints were compressed into n-dimensional binary arrays, x ( Figure S2 and equation S2).

4) Addition of interactions
Interactions were added to obtain binary arrays, xint, for machine learning (equation S3).
= ( 1 , 2 , 3 , … , n , The coefficients and ℎ were determined by training. The models were prepared by stochastic gradient descent (SGD) learning, using a module of the scikit-learn library. [9] SGD learning was used because the quadratic terms ( ) increased the dimension of the inputs, and thus increased calculation time enormously. The Ridge regression with a regularization strength α = 0.1 was used unless noted otherwise. For the control experiments, random forest repressors (scikit-learn) were trained with x without adding quadratic terms.
6) Calculation of R 2 score R 2 scores were calculated for the actual (y) and predicted values (ypred) of the records in the training and testing datasets. The testing datasets were used only for prediction, not for training the model.

7) Finding ideal binary xideal using DAU
The model was equivalent to the quantum Ising model, and thus could be processed by annealers to find the global minimum. The task is called the quadratic unconstrained binary optimization problem (QUBO). [1] The QUBO matrix, QUBO , was passed to DAU to find xideal, In Figures S6−S10, the QUBO matrix was biased with a diagonal matrix with a perturbation coefficient of pert (equation S6).
Here, was that gave the highest y in the training dataset.
Then, Tanimoto similarity scores were calculated for each compound in the test dataset against xideal,FP.

9) Extraction of chemicals with high-similarity compounds
Compounds in the test dataset were sorted according to their similarity scores. The average y of the compounds with the highest 5% similarity was defined as ytop5%. For the control, results are also shown using a restored fingerprint, , , instead of xideal,FP in Figure 4c.

10) Generation of new chemicals
New chemicals were generated directly from xideal,FP ( Figure 5). RDKit modules were used to make molecules. Chemicals in the corresponding database were fragmented by the breaking of retrosynthetically interesting chemical substructures (BRICS) algorithm. Fragments with a specific fingerprint value of xFP,j (= 1) were extracted. Here, j could be the integers satisfying xideal,FP,j = 1. The extracted fragments were reconnected randomly according to the BRICS algorithm. As a control, all types of fragments were reconnected randomly (random generation, Figure 5b).

Scheme to explore chemical composites
Chemical components of Li + -conducting electrolytes were explored using DAU according to the following scheme ( Figure S12). Most procedures before converting into binaries were the same as in our previous report. [6] 1) Load database The database consisted of compound and composite databases. The compound database maintained the chemical structures of organic molecules, including polymers and normal lowmolecular-weight molecules. Li atoms were omitted because the database focused on only Li + electrolytes. For convenience, bonds representing repeating or grafting units were expressed by Mg and Ca atoms, respectively. Conductivity data around room temperature were extracted.
Conductivity was converted in a logarithmic scale, and then into a z-score.

2) Calculation of the weight ratio of the components
In the databases, chemical compositions were recorded as either molar or weight ratio. All compositions were recalculated as weight ratios.

3) Compound sorting by weight ratio
For the unique expression of the database, the recorded compounds were sorted by their weight ratio (from large to small). If there were more than six components, only the top six weight ratio components were used for data processing for simplicity.

4) Calculation of fingerprints
For one composite, 2048-dimensional fingerprints of the components were calculated.

5) Conversion of weight ratio into a binary array
The composition ratios of chemicals, which were continuous values, were expressed by binary arrays. First, the original weight ratio r was normalized to .
The normalized form, ri/ri+1 (≥1), provided a restriction condition during the composition screening. The restriction of si ≥ 1 was easier to implement than the original condition ( 1 + 2 + ⋯ + 6 = 1 and r 1 > 2 > ⋯ > 6 > 0). The values were converted on a logarithmic scale because of their large scale ranges.
Here, max and min were the maximum and minimum values of i observed in the database. The number of binaries used, Nbin, was set as 10. The binary method (∑ 2 −1 ) was not compatible with the following linear regression and annealing because its dynamic range was too large.

6) Split dataset annealing
The binary database was split into the training and testing datasets (9/1 ratio) and compressed to n-dimensional arrays (= x). The dimension was always set to be 398 for the electrolyte system. The fingerprint and weight ratio information was included in x. Polymer structures, molecular weight, inorganic additives, and other related parameters [6] were excluded for simplicity. Up to annealing, the same procedures were applied in the same way as in the single-component system.

7) Calculation of similarity to decompressed xideal
A maximum of six types of ideal fingerprints could be obtained from decompressed xideal.
The components were explored by referencing a chemical compound database, containing about 10,000 SMILES structures from various sources. Chemicals with higher similarities were selected as the components. The selection was made randomly using sampling with a Gaussian distribution. The weight ratio was restored from equation S9. The random component selection was repeated 100 times ( Figure 6b). For the control experiments, chemicals and weight ratios were selected completely randomly or by Bayesian optimization using a Python library of GPyOpt. [11] Table S1. Information about databases used for analyses.

Name
Input (x) Target property (y) Number of records Ref.

Multiple molecules (including polymers)
Ionic conductivity 1200 a) [6] a) Number of records around room temperature. Figure S1. General scheme for exploring chemicals using DAU (one-component system).

Figure S2.
Compression and decompression of the binary arrays. In the figure, x3 was removed during compression because it gave 1 in most cases. During decompression, x3 was restored using the mode value of 1. Therefore, the compression/decompression processes are irreversible.
The compression was done with any xi with a variance was less than the threshold.

Extraction of chemicals with versatile parameters
In addition to compounds with a high melting temperature, compounds with a low melting temperature, high solubility parameter, high glass transition temperature, and high photoconversion efficiency were explored (Table S1, Figures S6−S10). Except for the last parameter, single-molecule properties were targeted. On the other hand, the photoconversion efficiency of organic thin-film solar cells can be determined by many various parameters, such as cell configuration. [5] R 2 scores for the test datasets were as high as 0.5 with a sufficiently large n > 100 for most cases. Only the scores of solar cells were around 0.3, indicating that it was necessary to add other explanatory variables for regression.
A perturbation parameter, cpert, was changed during the annealing process ( Figure S5, equations S5 and S6). Standard annealing, discussed in the main manuscript, was done with cpert = 0. As the coefficient increased, the binary obtained from the annealer would converge to xexist, corresponding to the structure of an existing chemical, which gave the highest y in the training dataset. The coefficient was given to increase the feasibility of the proposed chemical structures as a restriction condition.
In the high-melting-temperature exploration, the extraction score, ytop5% (average y of the top 5% most similar compounds to the found binary), was high with smaller cpert, indicating that this perturbation was unnecessary. On the other hand, the feasibility of xideal may be questionable for ypred. The observed ypred of >10 in Figure 4a was too large as a z-score. Although it did not affect the extraction efficiency of chemicals substantially, more feasible predictions may be required using restriction conditions (e.g., cpert > 0).
When low-melting-temperature compounds were explored, the extraction was not successful with cpert = 0 ( Figure S7). The reason for this was unclear, but the lack of molecular feasibility may have affected the result. On the other hand, the extraction score ytop5% increased with cpert > 0 and gave the maximum at cpert ≅ 0.1. The mixed properties of the ideal (xideal) and actual (xexist) structure were needed for this extraction task.
For the glass transition temperature, mixed trends for high-melting-temperature compounds were detected ( Figure S9). The small number of records in the database (= 170) made it difficult to extract reliable statistical trends, as the large error bars indicated.
For the compounds and solar cells with high solubility parameters, ytop5% reached the maximum at cpert ≅ 0.1 (Figures S8 and S10). The maximum was also observed at cpert = 1 with some dimensions. Further studies are needed to understand the complex responses by systematically considering the effects of the fingerprint algorithms, regression model, similarity functions, functional groups, and database formats.