ASBase: The universal database for aggregate science

This paper reports the first universal and versatile database of aggregate materials for aggregate science research. At the current stage, the database (http://119.91.135.188:8080/) contains over 1000 entries of organic aggregate material systems (mainly luminescent systems) with a data structure designed specifically for aggregate materials. Each entry records the photophysical and physicochemical properties of a compound in different states of aggregation, including dilute solution, pristine solid, stable crystal, and nanoaggregates formed in solvents. The web-based interface of the database provides functions to index, search, manipulate, fetch, and deposit data entries. In addition, a background calculation service optimizes the chemical structures of new entries at different levels of accuracy. The database also provides a background API for the interactive development of prediction or regression models based on machine-learning algorithms.


INTRODUCTION
Aggregate science, which focuses not only on chemical structures but also on the organization of molecules, has inspired research in chemistry and materials science in recent years. [1,2] As a representative subarea of aggregate science, research on aggregation-induced emission (AIE) has been a breeding ground for thousands of new materials with outstanding optical performance, unique physicochemical behavior, and potential applications as bio-imaging agents, photovoltaic devices, light-emitting diodes, sensors, and actuators. [2][3][4][5][6][7][8][9] The concept of AIE has been extended from traditional synthetic organic small molecules to a wide range of systems, including polymers, [10,11] natural products, [12] biomacromolecules, [13] metal-organic coordination compounds, [14,15] metal-organic frameworks, [16,17] covalent-organic frameworks, [18] and even inorganic compounds, [19][20][21] displaying rich diversity, complexity, and possibilities. However, in contrast to the rapid progress in the laboratory, research methods remain at the classical stage of experimental chemistry, relying heavily on rules of thumb, experience from case studies, and trial and error. Moreover, advances in computer science and data science are still underutilized in aggregate science because data have not been accumulated and generalized.
Databases such as the Cambridge Structural Database (CSD), [22,23] SciFinder, [24] Reaxys, [24] and the Materials Project [25] provide data retrieval, comparison, filtering, and traceability, which are indispensable in modern chemistry and materials research. For luminescent materials specifically, there are also some good examples. [26][27][28] Nevertheless, all existing databases are somewhat ill-suited to aggregate science. The poorest fit is in the storage pattern of the data: aggregate science requires that the chemical structure and the state of aggregation together serve as the key under which macroscopic properties are stored. Researchers in aggregate science may pay more attention to the relationship between properties and the organization of molecules. This characteristic creates the demand for a database design tailored to this field of research.
FIGURE 1 Basic web interface of ASBase
In this work, we report ASBase, the first versatile database system designed and developed for aggregate science, with a web-based interface (Figure 1) providing multiple retrieval functions, back-end computation, and statistics services. Compared with existing databases for luminescent materials, ASBase includes not only the photophysics of the systems but also many other physicochemical properties that are indispensable for application research. In addition, the database is a platform with a data deposit function, which enables researchers from all over the world to share their systems. Currently, the database contains over 1000 entries of aggregate materials with traceable references and reported properties. Each entry contains around 40 fields of uploaded or generated data covering chemical structure, literature references, patents, photophysics (absorption, emission, quantum yield, and lifetime), physicochemical properties (stabilities and solubilities), and application information. Notably, the photophysics of an entry is organized by aggregation form: dilute solution (solution), pristine solid powder (solid), stable crystal (crystal), and aggregates formed in solvents (aggregate).

Database structure and organization
The structure of ASBase is shown in Figure 2. The whole system can be divided into two parts: the server side and the client side. The server side, built on the Django 4.0 framework, [29] can be further divided into two sub-modules: the data server module and the calculation server module. The data server module is built on three sub-tables of the database. An entry refers to a specific chemical structure and contains several aspects of information about the compound: (1) the structure information contains the chemical structure, the SMILES, the code name of the compound, and the index in the database; (2) the reference information contains the reported feature (AIE, aggregation-caused quenching (ACQ), room-temperature phosphorescence (RTP), [30] thermally activated delayed fluorescence (TADF), [31] or aggregation-induced delayed fluorescence (AIDF) [32]), the report year, the bibliography, the names of the first author and the corresponding author of the literature reporting the compound, and the patent number if available; (3) the photophysics contains the tested solvent and all photophysical properties (the absorption peak wavelength (λ_abs), the emission peak wavelength (λ_em), the molar extinction coefficient (ε), the quantum yield (Φ), and the lifetime (τ) of photoluminescence) in different states of aggregation (in dilute solution in the tested solvent, in pristine solid powder form, in the stable crystalline form, and in aggregates formed in solvents); (4) the physicochemical properties contain the stability (photostability and thermostability) and the solubilities in different solvent systems; (5) the application information contains four keywords for the potential applications of the compound; (6) the uploader information contains information about the uploader of the entry.
FIGURE 2 The structure of ASBase
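The entry layout described above, with photophysics keyed by aggregation state, can be sketched as a small in-memory data structure. This is only an illustration under stated assumptions: the class names, field names, and the demo values are hypothetical and do not reflect the actual Django models or any record in ASBase.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The four aggregation states used to organize photophysics in the text.
STATES = ("solution", "solid", "crystal", "aggregate")


@dataclass
class Photophysics:
    """Photophysical properties of one compound in one aggregation state."""
    abs_nm: Optional[float] = None         # absorption peak wavelength (nm)
    em_nm: Optional[float] = None          # emission peak wavelength (nm)
    epsilon: Optional[float] = None        # molar extinction coefficient
    quantum_yield: Optional[float] = None  # fraction between 0 and 1
    lifetime: Optional[float] = None       # photoluminescence lifetime


def _empty_states() -> Dict[str, Photophysics]:
    """One empty photophysics record per aggregation state."""
    return {state: Photophysics() for state in STATES}


@dataclass
class Entry:
    """Hypothetical in-memory view of one entry (field names are assumptions)."""
    index: int
    code_name: str
    smiles: str
    feature: str = "other"                 # e.g. AIE, ACQ, RTP, TADF, AIDF
    solvent: Optional[str] = None          # solvent used for the solution data
    photophysics: Dict[str, Photophysics] = field(default_factory=_empty_states)
    applications: List[str] = field(default_factory=list)

    def value(self, state: str, prop: str) -> Optional[float]:
        """Look up one property in one aggregation state."""
        if state not in STATES:
            raise KeyError(f"unknown aggregation state: {state}")
        return getattr(self.photophysics[state], prop)


# Placeholder values for illustration only; not taken from the database.
demo = Entry(index=1, code_name="demo-1", smiles="c1ccccc1")
demo.photophysics["solid"].em_nm = 470.0
demo.photophysics["solid"].quantum_yield = 0.5
```

Keying the properties by state keeps a missing measurement (for example, no crystal data) distinct from a measured value, which matters because not every entry has data in every state.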

Data uploading and verification
A registered user can upload an entry through the data deposit interface on the webpage, and the uploaded entry is stored in the temporary database for verification. At the current stage, we periodically review the entries in the temporary database to determine their reliability, move the reliable entries into the core database, and move the unreliable ones into the problematic database. Unreliable data are returned to the uploader for double-checking. Once a reliable entry is recorded in the core database, additional information is generated, including the calculated logarithm of the partition coefficient (log P), the calculated topological polar surface area (TPSA), the structure identifiers (InChI and InChIKey), the chemical formula, and the molecular weight. Scalable vector graphics (SVG) images of the two-dimensional (2D) chemical structures are also generated on the data server. Periodically, the entries in the core database are sent to the calculation server for more demanding computations: generation of the 3D structure at different levels of theory (MMFF94, [33] GFN2-xTB, [34] and density functional theory (DFT)), the orbital energies of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO), and transition energies from time-dependent DFT (TD-DFT).
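The routing of uploads between the temporary, core, and problematic databases can be sketched as a simple state transition. The review in ASBase is described as a periodic manual check, so the `looks_reliable` function below is a hypothetical stand-in that applies only basic sanity checks; the dictionary keys and the sample uploads are likewise assumptions made for illustration.

```python
from enum import Enum


class Store(Enum):
    TEMPORARY = "temporary"
    CORE = "core"
    PROBLEMATIC = "problematic"


def looks_reliable(entry):
    """Hypothetical stand-in for the manual review: basic sanity checks only."""
    qy = entry.get("quantum_yield")
    has_reference = bool(entry.get("reference"))
    qy_ok = qy is None or 0.0 <= qy <= 1.0
    return has_reference and qy_ok


def review(entries, is_reliable=looks_reliable):
    """Route temporary entries to the core or problematic store.

    Returns the uploaders whose entries need a double check.
    """
    to_notify = []
    for entry in entries:
        if entry["store"] is not Store.TEMPORARY:
            continue  # already reviewed in an earlier pass
        if is_reliable(entry):
            entry["store"] = Store.CORE
        else:
            entry["store"] = Store.PROBLEMATIC
            to_notify.append(entry["uploader"])
    return to_notify


# Placeholder uploads for illustration only.
uploads = [
    {"uploader": "user_a", "reference": "doi-placeholder",
     "quantum_yield": 0.42, "store": Store.TEMPORARY},
    {"uploader": "user_b", "reference": "",
     "quantum_yield": 1.7, "store": Store.TEMPORARY},
]
notified = review(uploads)
```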

Data retrieval and searching
The client side provides a comprehensive web interface for users to work with the database interactively. Its most important function is the data retrieval system. The web interface provides three ways to retrieve data entries: (1) the basic search lets users search entries with a single filter, such as the author name, the photophysics in different states of aggregation, the reported feature, or the reported application; (2) the advanced search lets users retrieve entries with multiple filters simultaneously; (3) the chemical structure search provides a chemical structure editor and a SMILES input box for finding compounds that contain the input chemical substructure. All search results can be sorted by ID, absorption/emission peak wavelength, or solid-state quantum yield. The detailed information of the entries that meet the search requirements can be reviewed as shown in Figure 3. Table 1 provides a detailed list of the recorded information and properties of a given entry in the database.
TABLE 1 Five pairs of properties with the highest values of Pearson's correlation coefficient
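The difference between the basic search (one filter) and the advanced search (several filters applied together) can be sketched with composable predicates. This is a minimal sketch over plain dictionaries, not the actual query code of ASBase; the helper names, the field names, and the sample records are hypothetical. Substructure search is not covered here because it needs a cheminformatics toolkit.

```python
def equals(key, value):
    """Filter that keeps entries whose field equals the given value."""
    return lambda entry: entry.get(key) == value


def in_range(key, low, high):
    """Filter that keeps entries whose numeric field lies in [low, high]."""
    def check(entry):
        v = entry.get(key)
        return v is not None and low <= v <= high
    return check


def search(entries, filters, sort_key="id"):
    """Return entries passing every filter, sorted by one field.

    Entries that lack the sort field are placed last.
    """
    hits = [e for e in entries if all(f(e) for f in filters)]
    return sorted(hits, key=lambda e: (e.get(sort_key) is None, e.get(sort_key)))


# Placeholder records for illustration only.
entries = [
    {"id": 1, "feature": "AIE", "em_solid_nm": 520.0, "qy_solid": 0.30},
    {"id": 2, "feature": "ACQ", "em_solid_nm": 610.0, "qy_solid": 0.02},
    {"id": 3, "feature": "AIE", "em_solid_nm": 480.0, "qy_solid": None},
    {"id": 4, "feature": "AIE", "em_solid_nm": 700.0, "qy_solid": 0.10},
]

# Basic search: a single filter.
basic = search(entries, [equals("feature", "AIE")])

# Advanced search: several filters at once, sorted by emission wavelength.
advanced = search(
    entries,
    [equals("feature", "AIE"), in_range("em_solid_nm", 450, 550)],
    sort_key="em_solid_nm",
)
```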

Statistics
Another critical feature of ASBase is statistical report generation. The server side generates real-time statistical reports from the entries in the core database and renders the graphs on the webpage. In the current version, the statistical report contains six graphs: three histograms showing the distributions of absorption peak wavelengths in solution and of emission peak wavelengths and fluorescence quantum yields in the solid state; two pie charts showing the distributions of reported features and reported photoluminescence mechanisms; and a correlation scatter plot with the absorption peak wavelength on the x-axis and the solid-state emission peak wavelength on the y-axis.
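The aggregation behind such a report can be sketched as two small helpers: one that bins numeric values for a histogram and one that computes category shares for a pie chart. This is a sketch under assumptions; the function names, the bin width, and the sample values are hypothetical, and the real server-side code that feeds Chart.js may be organized differently.

```python
from collections import Counter


def histogram(values, bin_width, start=0):
    """Count values per bin; keys are the left edges of the bins.

    Missing values (None) are skipped, since not every entry reports
    every property.
    """
    counts = Counter()
    for v in values:
        if v is None:
            continue
        left_edge = start + bin_width * int((v - start) // bin_width)
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


def category_shares(labels):
    """Fraction of entries per category, as needed for a pie chart."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Placeholder data for illustration only.
abs_bins = histogram([350, 365, 410, None, 399], bin_width=50, start=300)
feature_shares = category_shares(["AIE", "AIE", "ACQ", "other"])
```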

Algorithms
Pearson's correlation coefficient. For a set of n values x_i of one variable and the corresponding values y_i of another variable, Pearson's correlation coefficient between x and y is defined as:
$$R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where \bar{x} and \bar{y} are the mean values of x_i and y_i. The value of R is between −1 and 1. When R > 0, the correlation between x and y is positive. When R < 0, the correlation is negative. The absolute value of R indicates the strength of the linear correlation between x and y.
Spearman's correlation coefficient. For a set of n values x_i of one variable and the corresponding values y_i of another variable, Spearman's correlation coefficient between x and y is defined as:
$$R_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where d_i is the difference between the ranks of the i-th elements of x and y. The value of R_s is also between −1 and 1. When R_s > 0, the correlation between x and y is positive. When R_s < 0, the correlation is negative. The absolute value of R_s indicates the strength of the monotonic relationship between x and y.
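The two coefficients described above can be sketched in plain Python to make the difference between linear and monotonic correlation concrete. This is a minimal illustration, not the code used by ASBase. The Spearman function uses the rank-difference formula, which assumes that neither input contains tied values, and both functions assume non-constant inputs of equal length.

```python
import math


def pearson(x, y):
    """Pearson's R: covariance divided by the product of the spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # fails if x or y is constant


def ranks(values):
    """Rank of each element, 1 for the smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman's coefficient from squared rank differences (no ties)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_linear = pearson(x, y)
r_monotonic = spearman(x, y)
```

For this example the Spearman coefficient is exactly 1 because the ranks agree perfectly, while the Pearson coefficient is slightly below 1 because the relationship is not a straight line. This is the sense in which two properties can be "positively but not linearly" correlated.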
t-Distributed stochastic neighbor embedding (t-SNE). t-SNE is a tool for reducing the dimensionality of high-dimensional data and visualizing it. Its main idea is to convert similarities between data points into joint probabilities and to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. [35]
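For reference, the quantities mentioned above can be written out following the standard formulation of t-SNE. The symbols are assumptions introduced here, since the text does not define them: x_i are the n high-dimensional data points, y_i are their low-dimensional images, and σ_i is a per-point bandwidth set by the chosen perplexity.

```latex
% Similarity of x_j to x_i in the high-dimensional space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarity of y_i and y_j in the low-dimensional embedding (Student t kernel)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimized with respect to the embedding coordinates y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed kernel in q_ij lets moderately dissimilar points sit far apart in the embedding, which is why t-SNE plots tend to show well-separated clusters.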

Software
The server side of the database is currently hosted on an x86 server and runs on the Django ver. 4.0 framework. All background logic is implemented in Python ver. 3.8 or above with RDKit ver. 2021 [36] or above and OpenBabel ver. 3.1.1. [37] The web interface is built with Bootstrap CSS ver. 5.0 [38] or above, with Chart.js [39] rendering the statistical report. The chemical structure editor is the JavaScript Molecule Editor (JSME) ver. 2022 or above. [40]

Computational methods
The log P, TPSA, chemical formula, and molecular weight are generated from the SMILES of the entries by algorithms provided by RDKit. The SVG images of the 2D chemical structures are also generated from the SMILES by OpenBabel. The 3D structures are optimized at the MMFF94 level with OpenBabel and at the GFN2-xTB level with Grimme's xTB software. The 3D structures at the DFT level and the HOMO and LUMO energies are calculated with ORCA ver. 5.0.2 at the BLYP/def2-SVP level. [41] The implementation of the t-SNE algorithm is provided by the scikit-learn package ver. 1.1. [42]

RESULTS AND DISCUSSION

The purpose of creating ASBase
The potential audience of ASBase includes not only academic researchers but also engineers in industry who develop products for specific demands, investors who need a reference on significance and potential when evaluating a project, and policymakers who want a macroscopic view of the development of aggregate science. As shown in Figure 4, the primary purpose of ASBase is to provide these users with a tool for searching entries, compiling statistics, exploring new theories, and predicting photophysics by data-mining or machine-learning methods. The prediction function is still under development. [43] The main idea is to encode the chemical structure as a serial bit string, such as a molecular fingerprint, and then train a prediction model on the labeled data in the database. The trained model can then make predictions from the serially encoded chemical structure of an unknown compound.
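The idea of predicting a property from a bit-string encoding can be illustrated with a deliberately simple model. The text does not specify which model ASBase will use, so the nearest-neighbor predictor below is a swapped-in technique chosen only for brevity. Fingerprints are represented as sets of on-bit indices, and the training fingerprints and labels are hypothetical placeholders rather than real database records.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def predict_knn(query, training, k=3):
    """Average the labels of the k training fingerprints most similar to query.

    training is a list of (fingerprint, label) pairs.
    """
    if not training:
        raise ValueError("training set is empty")
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]),
                    reverse=True)
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Placeholder fingerprints and labels (e.g. emission wavelengths in nm).
training = [
    ({1, 2, 3}, 450.0),
    ({1, 2, 4}, 470.0),
    ({7, 8, 9}, 650.0),
]
prediction = predict_knn({1, 2, 3, 4}, training, k=2)
```

The query shares three of four bits with each of the first two training fingerprints and none with the third, so the prediction is the mean of the first two labels.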

Current data
At the current stage, the database contains 1100+ entries. Each entry corresponds to a reported chemical in aggregate science, with reported features including AIE, [3] ACQ, [44] RTP, [30] TADF, [31] and AIDF. [32] By their reported photoluminescence mechanism, the entries can also be categorized into intramolecular charge transfer (ICT), twisted intramolecular charge transfer (TICT), [45] excited-state intramolecular proton transfer (ESIPT), [46] neutral aromatic (NA), and clusterluminescence (CL) systems. [47] It is noteworthy that the NA mechanism denotes systems with aromatic rings but without obvious donor, acceptor, or ESIPT substructures, such as pyrene, tetraphenylethylene, and anthracene. In addition, many entries have no specifically reported feature or mechanism in the source literature and are placed in an "other" category. Systems whose features or mechanisms are too unusual to fit a regular category are also placed in the "other" category. Figure 5 displays the distribution of photophysics of the first 988 entries in the database. The violin plots reveal information that cannot easily be observed from isolated papers and case studies. For example, the distributions of wavelengths indicate an average Stokes shift of around 120 nm. In addition, the mean emission peak wavelength in the crystal is the lowest among all four states of aggregation, which might be caused by rigidification in the crystal inhibiting the excited-state relaxation process. Chemicals A1 and A2 are the bluest and reddest emitters, respectively, in pristine solid powder form among the analyzed entries. Chemical A1 contains three connected benzene rings with many methyl groups that act as steric hindrance, which diminishes the conjugation and leads to a high transition energy.
In contrast, chemical A2, which has an emission peak wavelength of around 1100 nm in pristine solid powder, contains a donor-π-acceptor-π-donor skeleton, with two triphenylamines as the electron donors, two thiophenes as the conjugation bridges, and a 1H,5H-benzo[1,2-c:4,5-c']bis([1,2,5]thiadiazole) as a strong electron acceptor. Near-infrared (NIR) emitters like chemicals A2 and A4 typically have strong donors, strong acceptors, and a large conjugation scale. In the distributions of quantum yields, those in aggregates, crystal, and pristine solid powder show two prominent density peaks, whereas the quantum yields in dilute solution show only one obvious density peak at around 0.05. This result provides empirical evidence that aggregation acts as an amplifier of photoluminescence for the compounds in ASBase, which is consistent with the focus of the data collection. It is noteworthy that not every entry in the database has complete data in all photophysics fields. Most source literature reports the quantum yields in solid and solution, but fewer papers report the quantum yields in crystal and aggregate forms because the measurements are more complicated. This imbalance in data collection makes the density of high quantum yields in dilute solution appear higher than that in aggregates and crystal. In general, quantum yield correlates negatively with emission wavelength because non-radiative decay is fast from states with small energy gaps. Chemical B1 is therefore a rare system that combines red emission with a high fluorescence quantum yield of around 0.63. Chemicals B2 (quantum yield 0.49) and B3 (quantum yield 0.03) exhibit typical quantum yields for their emission colors in the green-yellow and NIR regions, respectively.
The range of lifetimes in the crystal is the largest among all four states of aggregation because crystallinity is a critical requirement for most RTP systems to display an ultra-long lifetime. As a typical RTP compound, chemical C2 in Figure 5 was reported in 2021 with a lifetime of 868 ms in the crystal.
One significant use of the database is detecting latent patterns in the recorded entries. Figure 6 displays the correlation coefficient matrices of the data analyzed at the current stage. The matrices indicate strong correlations among wavelengths, quantum yields, and lifetimes. A negative correlation between molar extinction coefficients and quantum yields in aggregates can be explained by the way quantum yields are calculated. The quantum yield is the ratio of emitted to absorbed photons. A large molar extinction coefficient therefore reduces the quantum yield of a given compound if the emission intensity is held constant. These conclusions are largely in line with expectations. However, there are also some unexpected correlations that might never be considered without the accumulation of data. For example, molar extinction coefficients are positively but not linearly correlated with log P values. These two quantities are hardly ever discussed together in case studies. A possible explanation is that high molar extinction coefficients are usually achieved by extended conjugation, which may enhance lipophilicity. The positive correlation between molecular weight and log P can be explained in the same way.
TABLE 2 Five pairs of properties with the highest absolute values of negative Pearson's correlation coefficient
FIGURE 7 Scatter plots of t-SNE components of molecular access system (MACCS) keys mapped by features, mechanisms, emission energy (ΔE_em^solid), and quantum yield (Φ_solid) in pristine solid powder of the compounds in the database at the current stage
To explore the relationship between chemical structures and the properties recorded in the database, we encoded all chemical structures as molecular access system (MACCS) keys fingerprints [48] and performed a t-SNE analysis. As shown in Figure 7, each point represents a specific MACCS keys pattern, which corresponds to a particular set of features in the chemical structure of a compound. The distances between points reflect differences in chemical structure. All points are color-mapped by feature, mechanism, emission transition energy, and quantum yield in the solid state (pristine solid powder). Systems with AIE features appear in almost all clusters, implying the structural diversity of AIE systems. For mechanisms, in contrast, many categories display structural dependence, and some clusters contain only entries with the same photoluminescence mechanism. In addition, the transition energy shows a stronger structural correlation than the quantum yield in pristine solid powder, indicating that the quantum yield is harder to predict than the emission wavelength for a given chemical structure.

CONCLUSION
We have presented ASBase, a versatile database for aggregate science sourced mainly from the reported literature. To better fit the requirements of aggregate science, we designed the database structure to organize photophysical properties by state of aggregation. ASBase is the first database designed specifically for aggregate science research. Currently, the database contains 1100+ unique entries with around 40 fields of information each and a comprehensive web interface to retrieve, deposit, and manipulate data. In addition, the statistical report generation function provides a convenient way to obtain a macroscopic view of the development of aggregate materials and the progress of aggregate science, which is valuable not only for academic researchers but also for engineers, investors, and policymakers. The database will continue to operate and expand with new data and new features. We hope that ASBase will promote the data-driven research paradigm in aggregate science.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.