Implement the Materials Genome Initiative: Machine Learning Assisted Fluorescent Probe Design for Cellular Substructure Staining

The Materials Genome Initiative (MGI) is accelerating the pace of advanced materials development by integrating high-throughput experimentation, database construction, and intelligent computation. Live-cell imaging agents, such as fluorescent dyes, are exemplary candidates for MGI applications for two reasons: i) they are essential in visualizing cellular structures and functional processes, and ii) the unclear relationship between the chemical structure of fluorescent dyes and their live-cell imaging properties severely restricts the current trial-and-error dye development. Herein, the MGI is followed to present an intelligent combinatorial methodology for predicting the cell-staining ability of dyes using machine learning (ML) driven by a structurally diverse combinatorial library. This study demonstrates the high-throughput synthesis of 1,536 dyes and the evaluation of their imaging properties to establish a feature dataset for ML. A set of high-precision ML predictors is then successfully modeled to assist live-cell staining and to judge endoplasmic reticulum targeting. This approach is believed to bridge the gap between dye structure and the corresponding staining behavior, and can accelerate the discovery of novel organelle-specific stains.


Introduction
DOI: 10.1002/admt.202300427
Advanced materials are essential to human well-being and sustainable development. [1] Current research strategies rely primarily on scientific intuition and trial-and-error experimentation. However, the resulting process is costly and time-consuming, creating a bottleneck in technological change and scientific progress. An innovative model of materials development is urgently needed to meet the burgeoning needs of these grand challenges. The Materials Genome Initiative (MGI), launched in 2011, has accelerated the transition from material innovation to market applications while simultaneously cutting costs. [2] By synergistically combining high-throughput experimentation, database construction, and artificial intelligence (AI) computation, MGI-integrated models driven by large experimental databases can directly predict, screen, and optimize new materials at unparalleled scales and rates, even when a fundamental understanding of the chemistry, physics, or biology behind their properties is lacking. [3][4][5] These purely data-dependent computational models have helped scientists make breakthroughs in new alloys, [6,7] bio-surrogate materials, [8,9] energy storage, [10,11] and optoelectronic materials, [12][13][14][15] and are gradually being utilized in other research fields, such as chemical reaction optimization, [16] protein engineering, [17] and drug and vaccine development. [18][19][20][21] For example, the MGI approach was utilized for the development of the lipid nanoparticle (LNP) delivery agents used in the BioNTech/Pfizer (BNT162b2) and Moderna (mRNA-1273) COVID-19 vaccines. [22][23][24] LNPs fabricated from charged or ionizable lipid molecules (a polar head group and a hydrophobic tail) have many applications in disease prevention and treatment. [25][26][27] However, the complexity of biological systems and the highly selective permeation barrier of cell membranes severely limit further investigation of the cellular changes caused by LNP-mRNA delivery. Visualizing cells or their substructures undoubtedly provides one of the most effective means for direct assessment and in-depth understanding of cellular responses and interactions in various life processes.
Live-cell imaging agents are indispensable in visualizing cellular morphology, structures, and functional processes for various applications, for example, biomedical research, [28,29] disease diagnosis, [30] and targeted drug discovery. [31] Due to their extensive use and multiple functionalities, current or developing fluorescent dyes must have good cell permeability and biocompatibility. [32] Generally, fluorescent chromophores can achieve transmembrane staining by modifying and adjusting the hydrophilic and hydrophobic groups, like the amphipathic structure of lipids that deliver mRNA. [27] Selective targeting of specific molecules, organelles, and structures is also required; however, the relationship between chemical structure and target recognition is largely unknown. Machine learning (ML), an important part of AI computation, can analyze and extract useful information from high-throughput data without any additional knowledge, and establish correlations between features by developing predictive models. [33,34] Coupled with the construction of material genomic libraries, ML provides a powerful approach for discovering new target-specific fluorescent dyes and for guiding the rational design of their structures, a relationship that is necessary and challenging to elucidate.
In this work, we utilize the MGI strategy by integrating combinatorial chemical synthesis, high-throughput experimentation, and machine learning to develop a set of predictive models for assisting live-cell staining (Figure 1a). We constructed a triazine-derived fluorescent dye library based on the temperature-mediated selective reaction for screening active live-cell stains. All fluorescent dyes were characterized with a microplate reader and high-throughput fluorescence microscopy and were further classified with quasi-subjective and objective cross-validation. We then generated a feature dataset of the whole structurally diverse library to link chemical fingerprints with staining behaviors and developed machine learning models to predict the staining ability of potential dyes. We also evaluated the target selectivity of sub-library stains to the endoplasmic reticulum (ER) via colocalization and used an ML algorithm to summarize the feature importance for ER-targeted fluorescent dyes.

Results and Discussion
Triazine (1,3,5-triazine), possessing structural symmetry, is widely selected as a robust scaffold for rapid generation of diverse molecular libraries with drug and biological activity because it allows gradual functionalization through nucleophilic reactions with high selectivity. [35,36] Furthermore, triazine and its derivative groups are also excellent electron-withdrawing acceptors and can be linked with donor units (such as triphenylamine, carbazole, or acridine) to form organic light-emitting compounds. [37] In this work, a triazine-based fluorescent dye library containing 1,536 compounds was synthesized in three steps based on the temperature-mediated selective reaction of 2,4,6-trichloro-1,3,5-triazine (TCT). As shown in Figure 1b,c, the fluorescent framework with two chlorine atoms (4-(4,6-dichloro-1,3,5-triazin-2-yl)-N,N-diphenylbenzenamine, TT) was constructed first via Suzuki cross-coupling of TCT and 4-(diphenylamino)phenylboronic acid at 50°C. Then, the two chlorine atoms of precursor TT were sequentially substituted by 64 amines/thiols A01∼64 at 30°C and 24 diamines B01∼24 at 80°C to produce 1,536 fluorescent dyes in situ. After removal of the precipitate by centrifugation and evaporation of the solvent, the crude products were dissolved in dimethyl sulfoxide (DMSO) for later use. All products in the library exhibited excellent solubility in DMSO and other common organic solvents, and their structures were verified via mass spectrometry (MS, Table S2 and Figures S1-S5, Supporting Information) and 1H and 13C NMR spectra of representative compounds (Figures S6-S16, Supporting Information).
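The size of the library follows directly from the two substitution steps: every one of the 64 A building blocks is paired with every one of the 24 B building blocks. The short sketch below only illustrates that bookkeeping. The ID format `TTA01B01` and the function name `enumerate_library` are assumptions made for this example and are not taken from the original workflow.

```python
from itertools import product

N_A = 64  # amines/thiols introduced in the first substitution (30 °C)
N_B = 24  # diamines introduced in the second substitution (80 °C)


def enumerate_library(n_a=N_A, n_b=N_B):
    """Return one hypothetical ID per dye, e.g. 'TTA01B01' (naming is an assumption)."""
    return [
        f"TTA{a:02d}B{b:02d}"
        for a, b in product(range(1, n_a + 1), range(1, n_b + 1))
    ]


library = enumerate_library()
print(len(library))             # 64 x 24 combinations
print(library[0], library[-1])  # first and last ID in the enumeration
```

Laying the IDs out in this order maps naturally onto plate-style heat maps, with one axis per building-block series.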
The yellow color of the TT solutions was observed to gradually fade during the two-step nucleophilic substitution reactions. TT, TTA01∼64, and TTA01∼64B01∼24 in DMSO were examined with a microplate reader to determine the photophysical relationship between the absorbance shift and nucleophilic substitution (Figure 2a,e). The statistics show that the main absorption peak of TT (396 nm) was blue-shifted as each of the two chlorine atoms was substituted with the A01∼64 (375 nm) and B01∼24 (360 nm) groups, respectively. The blue shift is attributed to the replacement of the chlorine atom with the amine/thiol group, which reduces the electron-withdrawing ability of the triazine acceptor group and increases the overlap between the HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital), resulting in a wider optical energy gap (Figure 2d,f). [38][39] By contrast, the maximum fluorescence emission exhibited a slightly different trend. From TT to TTA01∼64, the median emission red-shifted from 460 to 521 nm, while from TTA01∼64 to TTA01∼64B01∼24 it blue-shifted to 472 nm (Figure 2b). The fluorescence intensity was significantly enhanced with the sequential substitution of chlorine atoms (Figure 2c). More than 75% of the dyes in the library showed strong blue emission that can be discerned by the naked eye under 365 nm excitation, which is beneficial for reducing background noise during imaging and obtaining sharp cell images (Figure 2g). These apparent optical changes and features allow automatic monitoring of the reaction, as well as identification and preliminary screening of the target product during the high-throughput synthesis process.
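To make the link between the blue shift and the wider optical energy gap concrete, the absorption maxima can be converted to photon energies with E = hc/λ. The sketch below does only that conversion. Treating the peak energy as a proxy for the optical gap is a simplification made here, and the dictionary keys are shorthand labels chosen for this example.

```python
H = 6.62607015e-34          # Planck constant, J s
C = 299792458.0             # speed of light, m/s
E_CHARGE = 1.602176634e-19  # joules per electronvolt


def photon_energy_ev(wavelength_nm):
    """Photon energy in eV for a wavelength given in nanometres (E = h*c/lambda)."""
    return H * C / (wavelength_nm * 1e-9) / E_CHARGE


# Main absorption peaks reported in the text (shorthand labels are assumptions).
peaks_nm = {"TT": 396, "TTA": 375, "TTAB": 360}
energies = {name: photon_energy_ev(nm) for name, nm in peaks_nm.items()}

for name, energy in energies.items():
    print(f"{name}: {peaks_nm[name]} nm -> {energy:.2f} eV")
```

Under these assumptions the peak energy rises from about 3.13 eV (TT) to about 3.31 eV and then 3.44 eV as the two chlorine atoms are replaced, which is the direction expected for a widening gap.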
The cell permeability and biocompatibility of fluorescent agents are crucial characteristics for live-cell imaging. High-permeability, bright fluorescent dyes are ideal imaging tools; however, the dye quantities required can induce cytotoxic effects. Therefore, the biocompatibility of the TTA01∼64B01∼24 fluorescent dyes at 5 and 10 μM concentrations was assessed via the tetrazolium-8-[2-(2-methoxy-4-nitrophenyl)-3-(4-nitrophenyl)-5-(2,4-disulfophenyl)-2H-tetrazolium] monosodium salt (CCK-8) cell viability assay (Figures S17-S19, Supporting Information). As shown in Figure 3a, at the higher concentration of 10 μM, 35% of the dye library kept cell viability at or above 95%. In contrast, at 5 μM, ≈50% of the dye library gave cell viability >95% after 8 h of co-incubation at 37°C. Next, HeLa cells were stained with the dye library at 5 μM for 30 min, followed by freezing and collection of the fluorescence distribution under uniform detection settings. The resulting data were summarized in a heat map after being classified into 6 levels according to the fluorescence intensity (Figure 3b,c). Nearly 45% of the dyes, classified as level 5 and level 4, can clearly illuminate the cells with high signal intensity and low photobleaching. The level 3 dyes, which account for ≈6% of the library, can also effectively stain cells but with a significant loss of fluorescence signal. The statistical analysis of these 768 positive results revealed that isopentylamine (A13) and N,N,N',N'-tetramethyldipropylenetriamine (B06) are the building blocks most likely to yield dyes that stain cells (Figure 3d,e). In contrast, the remaining dyes from level 2 to level 0 either cannot stain the cells or emit fluorescence only as weak as the background, and are all defined as negative results.
This visual classification method is simple and easy to use, but it relies on subjective judgment, which may introduce inconsistencies.
To avoid this, the staining results above were verified again using the mean gray value (Mean) of each fluorescence distribution image as an objective evaluation parameter for staining. The macro calculation process (Figure S20, Supporting Information) for all Means is shown in Figure 3f,g. First, the acquired RGB color image is split into channels and the blue channel is converted to 8-bit grayscale. Next, the optimal threshold for each grayscale image is calculated automatically using the triangle algorithm, which is suited to histograms with a single dominant peak. [40] Based on the threshold, the grayscale image is further binarized, and the corresponding Mean is computed in batch. There are 821 dyes (53%) in the TTA01∼64B01∼24 library that can illuminate cells with a Mean >50. Compared with the visual classification, the heat map of the Mean-based calculation exhibits a near-identical sample distribution with a similar proportion of positive samples, confirming the visual classification results.
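The thresholding step above can be sketched in a few lines. This is a minimal, library-free illustration of the triangle method on a 256-bin histogram, not the macro used in the study. It assumes that the Mean is the average gray value of the pixels above the threshold, which the text does not state explicitly, and the function names are chosen for this example.

```python
def triangle_threshold(hist):
    """Triangle method: bin with the largest distance to the peak-to-tail line."""
    nonzero = [i for i, count in enumerate(hist) if count > 0]
    if not nonzero:
        raise ValueError("empty histogram")
    lo, hi = nonzero[0], nonzero[-1]
    peak = max(range(len(hist)), key=lambda i: hist[i])
    # Use the longer tail of the histogram as the far end of the line.
    end = lo if (peak - lo) > (hi - peak) else hi
    if end == peak:
        return peak
    # Line through (peak, hist[peak]) and (end, hist[end]): a*x + b*y + c = 0.
    a = hist[peak] - hist[end]
    b = end - peak
    c = peak * hist[end] - end * hist[peak]
    step = 1 if end > peak else -1
    best_bin, best_dist = peak, -1.0
    for x in range(peak, end + step, step):
        dist = abs(a * x + b * hist[x] + c)  # proportional to the true distance
        if dist > best_dist:
            best_bin, best_dist = x, dist
    return best_bin


def mean_gray_value(pixels):
    """Mean of the 8-bit blue-channel pixels above the triangle threshold."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    threshold = triangle_threshold(hist)
    foreground = [value for value in pixels if value > threshold]
    return sum(foreground) / len(foreground) if foreground else 0.0


# Synthetic image: a dark background peak plus a long, dim foreground tail.
pixels = [10] * 1000 + [11] * 400 + [12] * 100
pixels += [v for v in range(13, 201) for _ in range(5)]
mean = mean_gray_value(pixels)
print(mean, mean > 50)  # the text treats Mean > 50 as a positive stain
```

The distance is left unnormalized because only the position of its maximum matters. Published implementations differ in small details, such as where the far end of the line is placed, so thresholds may not match a given tool exactly.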
In view of the reliable staining classification results for our structurally diverse dye library, a machine learning model for predicting the cell-staining ability of potential dyes was developed for the first time. As shown in Figure 1a, the dataset for calculation, including molecular descriptors, fingerprints, and measured parameters, was first exported from the fluorescent dye library containing all 1,536 structures using the open-source tool "PaDEL-Descriptor". [41] A total of 4,247,040 effective structural parameters (2,765 descriptors per molecule, Table S5, Supporting Information) were obtained. Next, these data points were processed with machine learning algorithms to solve a binary classification problem, formulated as whether or not a dye can stain cells. Thanks to the balanced ratio of positive and negative samples in the dataset, multiple machine learning algorithms can be run directly, such as k-nearest neighbor, logistic regression, random forest, gradient boosting, and multi-layer perceptron. [8] The dataset was randomly split into independent training and testing subsets in a 70%:30% ratio, and this ratio was kept constant for every algorithm.
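The dataset bookkeeping and the random 70%/30% split can be sketched as follows, together with the accuracy, precision, and recall by which a binary classifier of this kind is typically scored. This is a standard-library illustration under stated assumptions, not the authors' pipeline: the function names, the seed, and the toy labels are all hypothetical.

```python
import random

N_DYES = 1536
N_DESCRIPTORS = 2765


def split_indices(n_samples, test_fraction=0.30, seed=0):
    """Shuffle sample indices and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for labels coded as 1 (stains) / 0 (does not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


print(N_DYES * N_DESCRIPTORS)          # total number of descriptor values
train_idx, test_idx = split_indices(N_DYES)
print(len(train_idx), len(test_idx))   # sizes of the two subsets

# Toy labels, for illustration only (not data from the study).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```

Reusing one fixed split for every algorithm, as the text describes, lets the five models be compared on identical held-out dyes.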
After optimizing the hyperparameters (Table S1, Supporting Information), all five machine learning models could effectively predict the cell-staining ability of dyes with an accuracy ranging from 69% to 84%. The precision-recall (PR) curves, receiver operating characteristic (ROC) curves, and relevant reports are summarized in Figure 4a-c. Precision and recall are two critical model evaluation metrics, while the area under the curve (AUC) value derived from the ROC curves quantifies model prediction performance. The comparison found that the two models constructed with the random forest and gradient boosting algorithms exhibited the best predictive performance, achieving high precisions of 80% and 81% and excellent recalls of 86% and 89%, respectively, with AUC values close to 1. Furthermore, the feature importance for the two best models was also evaluated (Table S4, Supporting Information), and the top 20 descriptors were ranked and visualized in Figure 4d,e. The results indicated that the descriptors L1m (1st component size directional WHIM index) and VP-3 (valence path eigenvalue) contribute most to the cell-staining ability of dyes. As there are >2,000 feature descriptors for each molecule, principal component analysis (PCA) was used to reduce the feature dimensions. Figure 5 shows the PR and ROC curves for the different models after applying PCA with 2, 5, and 10 components, respectively. Note that the best performance is obtained from the model trained on the features without PCA.
The successful construction of the predictive model and the large number of potentially available dyes prompted us to further investigate their subcellular targeting ability in living cells. The level 5 group, containing 192 dyes with the best all-around performance, was selected for more targeted screening through colocalization tests with commercial trackers for various organelles (Figure S23, Supporting Information). The endoplasmic reticulum (ER) is a key organelle for cellular function and metabolic adaptation.
Malfunction of the ER can induce the unfolded protein response (UPR) or ER-associated protein degradation (ERAD), [42,43] which are closely implicated in a variety of diseases, including diabetes, neurodegenerative disorders, and even cancer. [44] Therefore, imaging the endoplasmic reticulum is essential for understanding the fundamental biological processes of these diseases and studying the related early pathological mechanisms. HeLa cells were stained with 5 μM candidate dye and counterstained with ER-Tracker Red before the fluorescence distribution signal was collected with a laser scanning confocal microscope (LSCM) at 630× magnification. The confocal fluorescence imaging revealed that 40 dyes in this group can stain ER-like cellular components around the nucleus, which is consistent with the ER staining pattern shown by the ER-Tracker Red counterstain (Pearson's correlation coefficient >0.800, Figure S21 and Table S3, Supporting Information). As shown in Figure 6a, three examples, TTA02B21, TTA12B06, and TTA52B06 (blue channel), tightly overlapped with ER-Tracker Red, with Pearson's correlation coefficients of 0.909, 0.871, and 0.911, respectively. To verify the ER selectivity, the dye TTA02B21 was further examined by staining HEK293T (human embryonic kidney), HFF (human foreskin fibroblast), and MCF-7 (human breast cancer) cells, giving Pearson's correlation coefficients of 0.894, 0.842, and 0.833, respectively. The colocalization with ER-Tracker Red in these cells also revealed patterns identical to those in HeLa cells, indicating that these dyes were indeed selective to the ER (Figure 6b). Furthermore, a heterocyclic amine group (2-ethylpiperidine) that can bind to the ER-associated STING protein was also found in our ER-specific stains with the B09 building block, [27] such as TTA46B09 (Pearson's correlation coefficient = 0.960), further supporting our judgment.
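The colocalization criterion used above is Pearson's correlation coefficient between the dye channel and the tracker channel, computed over paired pixel intensities. A minimal sketch is given below. The pixel values are invented for illustration, and the function name is an assumption; only the 0.800 cutoff comes from the text.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long intensity lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant channel has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pixel intensities from the same positions in the two channels.
dye_channel = [12, 40, 95, 180, 220, 60, 15, 130]      # blue channel (dye)
tracker_channel = [10, 35, 90, 170, 230, 70, 20, 120]  # red channel (ER tracker)

r = pearson(dye_channel, tracker_channel)
print(round(r, 3), r > 0.800)  # cutoff used in the text for ER-like staining
```

In practice the two lists would be the flattened pixel arrays of the registered blue and red channel images, usually restricted to a region of interest.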
To further examine the relationship between dye structure and ER-targeted imaging, we developed another machine learning model to solve the new binary classification problem of whether or not a dye can target the ER for imaging. The input dataset was obtained by calculating the 2D structural parameters of the level 5 group of 192 dyes and contained 204,480 data points (1,065 descriptors per molecule). Compared with the staining model described above, solving this problem was more challenging because the sample set is small (only 40 of 192 samples could target the ER) and the data are highly imbalanced. To alleviate these effects, we increased the proportion of the training subset to 75% and introduced an adaptive synthetic (ADASYN) sampling approach for data preprocessing (Figure 7a). ADASYN is a common over-sampling approach for handling imbalanced datasets: it weights the minority-class examples according to how difficult they are to learn and generates more synthetic data for the harder examples. [45] After preprocessing, the obtained dataset was input directly to the same five algorithms as before. The results show that, with the over-sampling approach, the random forest model had the best precision of the models tested (83% at 80% recall), achieving the highest accuracy of 92% and the maximum AUC value of 0.95 (Figure 7b,c). The feature importance was calculated from the random forest model, and the statistics of the top 20 descriptors were summarized (Table S4, Supporting Information). We found that the descriptors JGT (global topological charge index) and JGI9 (mean topological charge index) contribute most to ER-targeted imaging and have a strong positive correlation with it (Figure 7d,e).
The principal component analysis (PCA), based on 109 effective importance descriptors of the random forest algorithm, showed individual differences between the dyes and that the cumulative variance percentage in the two principal component dimensions was as high as 97.24% (Figure 7f). The determination of these descriptors and their interrelationships can help us better understand the data structure and can be used to further optimize our predictive model in the future.
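The over-sampling idea behind the ADASYN preprocessing described above can be sketched without any external library. The code below is a simplified illustration under stated assumptions, not the implementation used in the study: label 1 is assumed to be the minority class, distances are Euclidean, and the toy data only mimic the 40:152 class ratio in two dimensions instead of 1,065 descriptors.

```python
import math
import random


def adasyn_sketch(X, y, k=5, beta=1.0, seed=0):
    """Return synthetic minority points (label 1 is assumed to be the minority)."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    n_needed = round((len(majority) - len(minority)) * beta)
    if n_needed <= 0 or len(minority) < 2:
        return []

    def nearest(i, pool):
        # The k samples in `pool` closest to sample i, excluding i itself.
        others = [j for j in pool if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        return others[:k]

    # Difficulty of a minority sample: share of majority points among its neighbours.
    ratios = []
    for i in minority:
        neighbours = nearest(i, range(len(X)))
        ratios.append(sum(1 for j in neighbours if y[j] == 0) / len(neighbours))
    total = sum(ratios)
    if total > 0:
        weights = [ratio / total for ratio in ratios]
    else:
        weights = [1 / len(minority)] * len(minority)

    # Harder samples receive more synthetic points, interpolated towards
    # randomly chosen minority-class neighbours.
    synthetic = []
    for i, weight in zip(minority, weights):
        same_class = nearest(i, minority)
        for _ in range(round(weight * n_needed)):
            j = rng.choice(same_class)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy 2D data with the 40:152 class ratio mentioned in the text (values invented).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(152)]
X += [[data_rng.gauss(1.5, 1), data_rng.gauss(1.5, 1)] for _ in range(40)]
y = [0] * 152 + [1] * 40

new_points = adasyn_sketch(X, y)
print(len(new_points))  # roughly the class gap of 112; rounding shifts it slightly
```

Over-sampling of this kind is normally applied to the training subset only, so that the held-out dyes remain real measurements.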
Furthermore, another group containing 78 dyes (Figure S22, Supporting Information) with similar patterns was also distinguished by colocalization with a Golgi staining kit and confocal imaging, such as TTA44B12, TTA48B16, and TTA61B05 in Figure 8. The fluorescence distribution showed that these dyes tended to aggregate in regions consistently co-stained with the commercial Golgi dyes (green channel), but only partially overlapped with them because of their larger staining area. Considering that both the ER and the Golgi apparatus are dynamic single-membrane-bound organelles and are closely related in the secretion pathway of proteins, lipids, and other components, [46][47][48] we speculate that this group of dyes may only bind non-specifically to the Golgi apparatus, or may bind similar liposomes, endosomes, or other structures.

Conclusion
Here we show a new paradigm based on the MGI strategy for accelerating the development and application of live-cell imaging agents by integrating combinatorial chemical synthesis, high-throughput experimentation, and machine learning. We first utilized a combinatorial design to rapidly establish a structurally diverse chemical library for discovering potentially active stains. The library of 1,536 candidate fluorescent dyes underwent high-throughput screening to monitor and evaluate their live-cell imaging behaviors, in order to summarize molecular features and generate computational datasets. Then, we used a quantitative structure-property relationship as a bridge to link chemical structure features and their biological properties, and successfully developed a machine learning model to predict the cell-staining ability of dyes, achieving a high prediction accuracy of 84% and an excellent recall of 89% with an AUC of 0.9 in the gradient boosting model. Finally, we identified organelle-specific imaging agents from the dye sub-library through colocalization and developed another ER-targeted predictor based on the random forest algorithm to elucidate the feature importance of dyes for ER-specific staining. We envision that our MGI combinatorial methodology and machine learning models can provide robust and intelligent support for developing live-cell imaging tools for biomedical applications such as subcellular research, targeted drug discovery, and disease diagnosis.

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.