The Rise of Machine Learning in Polymer Discovery

In recent decades, with rapid developments in computing power and algorithms, machine learning (ML) has exhibited enormous potential in new polymer discovery. Herein, the history of ML is described and the basic process of ML-accelerated polymer discovery is summarized. Next, the four steps in this process are reviewed, that is, dataset selection, fingerprinting, ML framework, and new polymer generation. Finally, the main challenges for ML-accelerated polymer discovery are presented and outlooks for the field are given. It is expected that this review can serve as a useful tool for readers who are just stepping into this field and deepen the understanding of those already working in it.


Introduction
Conventionally, there are three typical steps in polymer design: 1) constituting a large chemical space with enough candidates based on domain knowledge, 2) synthesizing all the possible candidates and characterizing them, and 3) comparing and screening the candidates against the targeted design properties. In other words, polymer design is a bottom-up, brute-force approach. To constitute a reasonable chemical space, a great reserve of domain knowledge, chemical intuition, and experience is necessary. Meanwhile, large amounts of funding and manpower are needed to synthesize and characterize new polymers, and sometimes a piece of luck is indispensable. Often, the brute-force approach does not yield polymers that meet the desired properties, so that either the design threshold must be lowered or the entire design fails. Therefore, discovery or design of new polymers usually comes from large and renowned labs with sufficient domain knowledge, talent, money, time, and manpower. To speed up polymer design and discovery, there has been a long history of attempts to streamline it with the assistance of artificial intelligence (AI).
As is well known, machine learning (ML) is a branch of AI, which enables machines to simulate human behaviors without explicit programming. It originated from statistics in the 19th century. [1,2] The concept of ML was first proposed in 1959 [3] and drew some attention at that time. Mathematically, the reason why ML can achieve better decisions is not so difficult to understand. As shown in Figure 1, imagine that we need a weight θ_0 and a bias θ_1 to make a prediction; our aim is to obtain a local optimum point with the least loss, and these inputs and the loss constitute a hyperplane. In other words, the artificial neural network (ANN) in ML solves a minimization problem (making the overall difference between the prediction and the ground truth minimal) for a loss function by a series of mathematical methods. First, it leverages the first-order, high-dimensional Taylor theorem to update the loss function L(θ) in the ith step, which reads

L(θ_{i+1}) ≈ L(θ_i) + ∇L(θ_i) · (θ_{i+1} − θ_i)   (1)

where θ is the updatable tensor. Let θ_{i+1} − θ_i = −α∇L(θ_i), where α is called the "learning rate" (α > 0). Then Equation (1) can be written as

L(θ_{i+1}) ≈ L(θ_i) − α‖∇L(θ_i)‖²   (2)

In practice, the loss function is usually chosen to be greater than zero, such as the mean square error (MSE), the mean absolute percentage error (MAPE), etc. Therefore, the loss can be continuously reduced by subtracting a fixed fraction of the squared norm of its gradient, which is the so-called "gradient descent method." Following this update rule for the updatable tensor, the weights and biases (which are components of the updatable tensor) can be updated (using the chain rule) as

W* = W − α ∂L/∂W,  B* = B − α ∂L/∂B   (3)

In this process, the weights and biases of the layers are updated from back to front, which is called "backpropagation." Meanwhile, ML introduces different activation functions in every neuron, which aim to improve stability (such as the sigmoid function, the hyperbolic tangent function, etc.) or to introduce piecewise behavior (such as the ReLU function, the step function, etc.).
Finally, by adjusting the learning rate, the prediction of the final function comes closer and closer to the ground truth. Through knowledge or experience (which corresponds to the weight and bias on every neuron), human beings can generate an evaluation algorithm and then reach a local optimum point (point 1 in Figure 1b). However, it should be mentioned that, limited by our knowledge base, this point is not necessarily the global optimum. Here, we do not want to imply that machine learning is absolutely better than human intelligence; in fact, machine learning is far less capable than humans in many respects, and its limitations will be further discussed in the last section. On the other hand, if a machine can extract enough features and apply appropriate algorithms, it is no longer limited by our knowledge base; hence, it may fit a better function or find a more appropriate local optimum point (point 2 in Figure 1) through the gradient descent approach. Although ML exhibited enormous potential, the hardware and algorithms of the time strongly impeded its application (because of the comparatively slow convergence speed, a huge volume of data is needed to find the global minimum). Take the most popular pattern-extraction method, the convolution operation, as an example. Although only additions and multiplications are used, a huge number of calculations is necessary. For example, to perform a convolution on a color image of 1080 × 768 pixels with a 3 × 3 filter (or kernel), we need roughly 22 million additions and 20 million multiplications. Apparently, the computers of that time were not capable of handling this volume of computation. Meanwhile, for complex problems, the algorithms of that time could not efficiently update the weights of each layer in the neural network.
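The update rule above can be sketched in a few lines of code. The following is a minimal illustration only: a two-parameter model (weight w and bias b) fit to made-up data by gradient descent on the mean square error; the data, learning rate, and step count are assumptions for the example.

```python
# Minimal gradient descent for a two-parameter model y = w*x + b,
# minimizing the mean-square-error loss L(w, b).
# Data, learning rate, and iteration count are illustrative only.

def gradient_descent(xs, ys, lr=0.05, steps=500):
    w, b = 0.0, 0.0                       # initial weight and bias
    n = len(xs)
    for _ in range(steps):
        # Gradients of L = (1/n) * sum((w*x + b - y)^2)
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Update rule: theta_{i+1} = theta_i - lr * grad L(theta_i)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Fit points lying exactly on y = 2x + 1; the fit should approach w = 2, b = 1.
w, b = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With a loss that is exactly quadratic, the iterates contract toward the unique minimum as long as the learning rate is small enough, which mirrors the Taylor-expansion argument above.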
Although we still do not fully understand the operating mechanism of neurons in the brain, the ANN shows higher operational ability, accuracy, and execution speed than human intelligence, which is why machine learning is superior to human intelligence in some scenarios.
A big milestone was achieved in 1997, when the IBM supercomputer Deep Blue defeated the world chess champion Garry Kasparov. [4] This monumental event brought ML back into public view. Meanwhile, two advances fueled the development of ML. In hardware, the graphics processing unit (GPU) has far more cores than the central processing unit (CPU); hence, it enables a computer to process a huge number of parallel computations, which suits ML well. In algorithms, new approaches such as backpropagation [5] and the convolutional neural network [6] drastically improved computational efficiency. These studies contributed to significant improvements in ML and led to another iconic event: in 2016, DeepMind's program AlphaGo defeated the world Go champion Lee Sedol, which made even the most intelligent humans question the uniqueness of human creativity. [7] Since then, ML has developed into a useful tool serving human beings in a variety of fields, such as visual data processing, [8] agriculture, [9] and sports. [10] Although ML can indeed find better local optimum points than human beings, it is still difficult to apply ML in the field of polymer discovery. Currently, we face two major issues. First, there are not enough data available for most polymers. It is well known that ML models are typically data-hungry, that is, they need a large amount of training data; therefore, there is a gap between needs and supplies. Second, feature identification, or fingerprinting, is usually difficult for researchers who do not have sufficient domain knowledge, especially for crosslinked polymers. Fingerprinting is the process of converting molecular structures, including 3D networks, into scalars, vectors, and matrices so that a computer can read and understand the polymer. Nevertheless, a crosslinked network has structures at multiple length scales, including the atomistic, topological, and morphological levels.
It is a challenge to extract these multiscale features because polymerization is a random process. For example, different molar ratios between monomers, the use of catalysts, and the curing time and temperature during synthesis may lead to polymers with different topological and morphological structures, even from the same monomers.
The first two papers in the field of ML-assisted polymer discovery date back to the mid-1990s. At that time, Venkatasubramanian et al. utilized a genetic algorithm (GA) and designed a couple of polymer structures with desired properties. [11,12] After that, with some new ML approaches, several works on polymer property prediction or polymer design were carried out successively. [13-17] Basically, these works are property prediction models or new material discovery models, which follow a similar routine of six steps, as shown in Figure 2.

Figure 1. A schematic diagram describing why ML can outperform human intelligence. a) An assumed neural network (θ is the updatable tensor; z is the activation function). b) Gradient descent method (L(θ_0, θ_1) is the loss function, a comprehensive difference between prediction and ground truth). In the training process, the weights and biases are repeatedly updated by W* = W − α dL/dW and B* = B − α dL/dB, where α is the learning rate. Through repeated updating, the weights and biases that minimize the difference between prediction and ground truth can be found via L* = L − α‖∇L(θ)‖² (a simple minimization problem obtained from the Taylor theorem). The superscript (n) indicates the nth layer and the subscript m indicates the mth neuron, respectively. It should be noted that Figure 1b corresponds only to a simple case of Figure 1a.
Step 1: Collecting molecular structures from the literature or available databases (Figure 2a).
Step 2: Fingerprinting molecules into forms that computers can recognize, that is, converting the 3D molecular structures into tensors (Figure 2b).
Step 3: Choosing an appropriate ML framework and inputting the features that represent the molecules into it, which enables researchers to obtain a structure-property prediction model (Figure 2c).
Step 4: Choosing an appropriate ML model (Figure 2d) to generate new molecules (Figure 2e,f) and feeding them into the framework from Steps 1-3 to predict their properties.
Step 5: Screening the new molecules by applying threshold values for selected properties and deleting those that do not meet the thresholds (Figure 2g).
Step 6: Screening the remaining molecules by domain knowledge to ensure that they obey fundamental chemical laws and are synthesizable (Figure 2h). Finally, one obtains the desired polymer (Figure 2i).
Of these six steps, property prediction requires only Steps 1-3, which can be formulated as a forward mapping

w = f(S)   (4)

To conduct Steps 4-6, we need another, inverse mapping, such as

S = φ(w)   (5)

in which S represents the appropriate polymer structure and w the target property. It is worth noting that this mapping is not necessarily one to one. Among all the steps, the most difficult are Steps 1-3, that is, molecular structure collection, fingerprinting, and selection of the ML framework. A number of review papers have been published recently in this important area of study. [16-27] The focus of this article is to provide new insight into ML-assisted polymer discovery, including crosslinked polymer networks, both for researchers who are already in this field and for researchers who are just stepping into it. We expect that many more excellent studies will be conducted in the coming years. The review starts with a brief introduction to ML development and the basic framework of ML-assisted polymer discovery, followed by reviews of every step of the state-of-the-art ML-assisted polymer design and discovery process. Finally, we discuss challenges in this area of study and provide some outlooks for future studies.

Dataset Selection
As is well known, ML is essentially a fitting model; hence, the quantity and quality of the datapoints directly determine the performance of the models. So far, there are four types of approaches to collect datapoints: dataset collection from available databases, from the literature, from scientific computations or theoretical models, and from experiments.

Dataset Collection from Available Database
Leveraging an available database is one of the most popular methods and has been adopted by many studies. [28-30] To date, a few databases are available to the public, such as CROW, [31] NIST, [32] NIMS, [33-35] 3PDB, [36] SCIFINDER, [37] and Huan et al.'s database. [14] Among them, PoLyInfo is the largest polymer database, wherein one can find 18,526 homopolymers, 7,442 copolymers, 19,136 monomers, and 492,645 property points. Because these databases can provide enough datapoints, one is able to establish ML models with high accuracy, which is undoubtedly the greatest advantage of this method. However, most databases in this field mainly focus on homopolymers rather than copolymers. In addition, for some functional polymers, such as vitrimers, [38-40] shape memory polymers (SMPs), [41-43] and elastocaloric polymers, [44,45] these databases do not provide the relevant information. Therefore, for many polymer designs, the mentioned databases cannot provide enough help. For this reason, some researchers have leveraged transfer learning to solve this problem. Specifically, through similarity comparisons between the target and a different but relevant object, databases that would otherwise be inappropriate can be used for new material discovery. For example, Wu et al. transferred the domain knowledge of glass transition temperature to design polymers with desired thermal conductivity (Figure 3). [30] Yan et al. leveraged information from drugs to learn the features of SMPs (Figure 4). [46]

Dataset Collection from Literatures
In the largest global citation database, Web of Science, one can find about 1.3 million papers under the keyword "polymer". Moreover, considering that polymer science is a rapidly growing field, the number of available polymers will continue to grow. Apparently, this method unshackles researchers from the limitations of the available databases and enables them to design desired polymers, and it has been adopted by many researchers. [47-50] Its limitation is that dataset establishment is time-consuming, and the resulting datasets can contain fewer datapoints than those from the available databases. Recently, some researchers implemented automatic information extraction systems via natural language processing and ML, which could be a promising direction for breaking through the limitations of literature-based dataset collection. [51] Another challenge facing this approach is that almost all research works report only selected results, which means that only partial information is available. Thus, these data may not follow a Gaussian distribution and could induce overfitting of ML models.

Dataset Collection from Scientific Computations or Theoretical Models
Sometimes, when researchers are unable to find the desired information in the available databases or literature, they adopt scientific computation tools or theoretical models to obtain the desired datapoints. For example, based on density functional theory (DFT), Liu and Cao calculated 11 relevant parameters to predict the glass transition temperature. [52] Zhu et al. built 300 types of single-chain polymers with molecular dynamics (MD) models and combined them with ML to discover polymers with high thermal conductivity. [53] Mattioni and Jurs leveraged the Automated Data Analysis and Pattern Recognition Toolkit (ADAPT) to generate 100-300 relevant parameters to predict the glass transition temperature with the help of ML. [54] In addition, some researchers have leveraged more convenient theoretical models or semiempirical approaches for data creation. For example, to make up for a scarce dataset, Yan et al. [49] used a simplified formula to estimate the recovery stress of SMPs,

σ_r = E_r · ε_prog · R_fix · R_re   (6)

where σ_r is the recovery stress, E_r is the rubbery modulus, ε_prog is the programming strain, R_fix is the shape fixity ratio, and R_re is the shape recovery ratio. This approach allowed them to supplement the dataset with 60 datapoints. Apparently, this data collection approach possesses one obvious advantage: all the calculations are based on first principles or theoretical models and thus should be reliable. However, they can be computationally expensive.

Figure 3. Transfer learning for thermal conductivity prediction. [30] Because of the correlation between thermal conductivity and glass transition temperature, thermal conductivity is learned by first studying the glass transition temperature. Reproduced with permission. [30] Copyright 2019, Nature Publishing Group.
For example, for a typical biomolecular MD simulation with around 10^5-10^6 atoms and several nanoseconds of simulated time, 8 to 32 processors would take a couple of weeks to complete; [55] for DFT-based electronic structure calculations, generating meaningful, physical tight binding (TB) parameter sets with the semiempirical approach usually takes up to a couple of years. [56] In addition, limited by the computation time, the results from scientific computation tools are sometimes not very accurate. For example, because an MD model can only simulate nanoseconds, it omits the relaxation effect on the recovery stress of SMPs, leading to a result that is about twice the actual experimental measurement. [57]
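As an illustration of this kind of semiempirical data augmentation, the simplified recovery-stress estimate above reduces to a one-line function. The product form and all numerical values below are illustrative assumptions, not values from Yan et al.'s dataset.

```python
def recovery_stress(E_r, eps_prog, R_fix, R_re):
    """Estimated SMP recovery stress, assuming the simplified product form:
    rubbery modulus times programming strain, scaled by the shape fixity
    and shape recovery ratios (both dimensionless).

    E_r: rubbery modulus [MPa]; eps_prog: programming strain [-];
    R_fix: shape fixity ratio [-]; R_re: shape recovery ratio [-].
    """
    return E_r * eps_prog * R_fix * R_re

# Illustrative (made-up) values: 2 MPa rubbery modulus, 50% programming
# strain, 95% fixity, 90% recovery.
sigma_r = recovery_stress(E_r=2.0, eps_prog=0.5, R_fix=0.95, R_re=0.90)
```

Sweeping such a function over measured moduli and ratios is how a handful of experimental records can be expanded into additional training datapoints.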

Dataset Collection from Experiments
To establish a satisfactory dataset, some researchers carry out experiments themselves. [30] However, because polymer synthesis is typically time-consuming, this can take far longer than any method discussed above. For example, synthesizing an SMP can take weeks or months, not to mention the related property measurements. Therefore, establishing even a small dataset of about 100 datapoints could take years, which is basically unbearable from an application standpoint.
To better understand the pros and cons of each of the methods discussed above, we summarize their advantages and limitations in Table 1.

Fingerprinting
In this step, researchers convert 3D molecules into tensors or scalars, which enables the computer to read and further process them. To our knowledge, there are so far eight different fingerprinting methods, which are elaborated one by one below.

Linear Notations Encoding and Direct Label Encoding
This approach includes two procedures, that is, linear notation and number encoding. First, a linear notation is applied to convert the 3D molecular structure into a linear string of characters or symbols. Popular linear notations include the simplified molecular-input line-entry system (SMILES), [58,59] the Wiswesser line notation (WLN), [60] the modular chemical descriptor language (MCDL), [61] BigSMILES, [62] etc. Second, this linear string is converted into numbers. The common approaches include one-hot and ASCII conversions. For one-hot conversion, one needs to provide a dictionary for encoding. The dictionary can be obtained by collecting all the types of atoms, bonds, and numbers, as well as other symbols included in the training dataset, and writing them as a string of text. For example, as shown in Figure 5, converting 2,5-difluorostyrene into a binary matrix takes three steps. First, one writes the SMILES code (see Figure 5a) in the first column and the dictionary in the first row. It is noted that some cheminformatics software and websites can directly convert a polymer molecule to a SMILES code; for example, one can leverage the RDKit package or the website Cheminfo. Next, wherever a cross point corresponds to the same character, one fills in "1" in the matrix. Third, one fills the rest of the matrix with "0". This matrix (see Figure 5b) can be viewed as a gray image in Figure 5c. Alternatively, one can convert every symbol in the SMILES string into ASCII, which can be further converted into a binary vector (see Figure 6).

Figure 4. Transfer learning from drugs to polymers. [46] The features of polymers are studied through drugs: the behaviors of TSMPs are learned by first training a VAE model on a large drug database consisting of tens of thousands of small molecules with molecular structures similar to those of the monomers in TSMPs. Reproduced with permission. [46] Copyright 2021, American Chemical Society.
[63] In addition, for some simple linear polymers, one can directly convert the repeat unit or compositional unit into a vector by label encoding. As shown in Figure 7, beads with different colors represent different atoms.
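The linear-notation-plus-one-hot procedure can be sketched as follows. This is a toy illustration only: the dictionary is collected from a single example string (a fluorinated styrene-like SMILES), not the full training-set dictionary of Figure 5.

```python
def one_hot_smiles(smiles, dictionary):
    """Convert a SMILES string into a binary matrix.

    Row i corresponds to character i of the SMILES string; column j is 1
    when that character equals dictionary[j] (cf. Figure 5b).
    """
    return [[1 if ch == sym else 0 for sym in dictionary]
            for ch in smiles]

# Toy dictionary collected from the characters of one example string.
dictionary = sorted(set("C=Cc1cc(F)ccc1F"))
matrix = one_hot_smiles("C=C", dictionary)
```

Each row of the resulting matrix contains exactly one "1", so the matrix can equally be read as the gray image of Figure 5c.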

Morgan Fingerprinting
The Morgan fingerprint was developed by Rogers and Hahn as a reimplementation of the extended-connectivity fingerprint (ECFP). [64] Through this method, one can obtain a unique, sequential atom numbering for any given monomer. The approach involves three main procedures. First, integer identifiers are assigned to each atom in the monomer. Second, each atom identifier is updated to reflect the atom's neighbors (see Figure 8); in this process, each iteration creates an identifier that denotes a larger and larger circular substructure around the central atom. Third, duplicate features are removed to generate a single representative. Because it describes each atom together with its neighbors, Morgan fingerprinting can represent the 3D topology better than linear notation encoding.

Table 1. Advantages and limitations of the dataset collection methods.

Dataset collection from literatures. Advantages: 1) Not limited to the polymers covered by available databases. Limitations: 1) Time-consuming to search the data. 2) Authors most of the time report only "successful" data; thus, the data distribution may not accord with a Gaussian distribution.

Dataset collection from scientific computations or theoretical models. Advantages: 1) Can produce desired training datapoints without being limited by available databases. 2) Can even generate training datapoints for polymers that do not exist yet. Limitations: 1) Time-consuming computation. 2) The time scale is limited by computational resources. 3) The number of atoms or molecules involved is also limited by computational resources, and thus the computational results may not be very accurate.

Dataset collection from own experiments. Advantages: 1) Not limited by available databases. 2) More comprehensive types of data, including "bad" data, to satisfy a Gaussian distribution. Limitations: 1) Extremely time-consuming. 2) Extremely labor- and money-consuming.

Figure 5. Fingerprinting of 2,5-difluorostyrene by combining linear notation with the one-hot method. [50] a) Monomer and corresponding SMILES; b) binary matrix (the horizontal symbols represent the dictionary and the vertical symbols represent the SMILES); and c) binary image. Reproduced with permission. [50] Copyright 2020, Elsevier.
This method has also been adopted by many studies. [28,29,65,66] In practice, one can conveniently leverage the open-source cheminformatics software RDKit to implement it.
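To illustrate the iterative neighbor-update idea (this is not RDKit's actual implementation), the toy sketch below hashes each atom together with its neighbors over a hand-written molecular graph. The initial atom invariants (just the element symbol) and the bit-vector folding are simplified assumptions.

```python
def morgan_identifiers(atoms, bonds, radius=2, n_bits=64):
    """Toy Morgan-style fingerprint: iteratively hash each atom together
    with its neighbors, collecting identifiers for larger and larger
    circular substructures, then fold them into a short bit vector."""
    # Build an adjacency list from the bond list.
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    ids = [hash(sym) for sym in atoms]          # initial atom identifiers
    collected = set(ids)
    for _ in range(radius):                     # each round widens the environment
        ids = [hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
               for i in range(len(atoms))]
        collected.update(ids)

    bits = [0] * n_bits
    for identifier in collected:                # fold identifiers into bits
        bits[identifier % n_bits] = 1
    return bits

# Ethanol backbone C-C-O as a tiny graph (hydrogens omitted for brevity).
fp = morgan_identifiers(["C", "C", "O"], bonds=[(0, 1), (1, 2)])
```

Note how the two carbons start with identical identifiers but diverge after one iteration, because their neighborhoods differ; this is the deduplication-by-environment behavior described above.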

Combined Tensor Representation Based on Compositional Block
Based on the different repetitive blocks in a polymer, this method constructs a combination of tensors. For example, in Mannodi-Kanakkithodi et al.'s study, the polymers include seven different building blocks and three distinct surrounding blocks; thus, they established three types of tensors, including a 7-dimensional vector, a 7 × 7 matrix, and a 7 × 7 × 7 matrix, to represent the building blocks, neighboring blocks, and triplet neighboring blocks, respectively. [67] However, if the topology of a polymer is relatively complex, the matrix combination for a polymer can become very large and computationally expensive. Huan et al. leveraged a combination of vectors for fingerprinting, that is, the fractions of all the element types existing in the structure together with single-bond, two-bond, three-bond, and four-bond countings [68] (see Figure 9). This fingerprint can extract the features of small organic molecules made up of the elements C, H, O, N, and F. Similar fingerprinting can be found in other studies. [69-71]

Molecular Graph
In this method, the atoms and the chemical bonds are represented by vertices and edges, so the whole monomer can be represented by an undirected graph. It should be mentioned that this method can only be used with graph convolutional neural networks (GCNNs). [28,72] This is because the GCNN redefines the kernel and pooling operations, which can capture highly unordered data such as polymer graphs. Figure 10 shows the fingerprint of a polymer by the molecular graph approach. Because a polymer is an aggregate of many monomers, the connections between compositional units are also critical. Considering that, Aldeghi and Coley further modified the molecular graph method for linear polymers. [73] One main contribution of their study is the use of weighted edges (the dotted lines in Figure 11) to denote the probability of a connection occurring between compositional units. In addition, they represented the arrangement of compositional units in a linear polymer chain by the letter sequences shown below each subgraph in Figure 11. With this method, they were able to extract more information from the graph.
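A minimal sketch of the graph representation fed to a GCNN: a one-hot node-feature matrix plus a symmetric adjacency matrix for an undirected monomer graph. The three-atom example is made up for illustration.

```python
def to_graph(atoms, bonds):
    """Represent a monomer as an undirected graph: a one-hot node-feature
    matrix (rows = atoms) plus a symmetric adjacency matrix (edges = bonds),
    the typical input pair for a graph convolutional network."""
    elements = sorted(set(atoms))
    features = [[1 if a == e else 0 for e in elements] for a in atoms]
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in bonds:
        adjacency[i][j] = adjacency[j][i] = 1   # undirected edge
    return features, adjacency

# Made-up three-atom backbone C-C-O (hydrogens omitted).
features, adjacency = to_graph(["C", "C", "O"], bonds=[(0, 1), (1, 2)])
```

Weighted edges in the sense of Aldeghi and Coley would simply store a connection probability instead of 1 in the adjacency matrix.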

Quantitative Structure-Property Relationship (QSPR) Descriptors
This method involves two steps: first, identifying all the factors that could relate to the prediction target; and second, applying scientific computation tools to calculate these factors. This method has become widespread in ML-assisted polymer discovery and prediction. For example, in order to predict the glass transition temperature, Liu and Cao employed DFT simulations to obtain 11 parameters, including thermodynamic parameters, the energy of the lowest unoccupied molecular orbital, the energy of the highest occupied molecular orbital, etc. [52] Mattioni and Jurs leveraged the Automated Data Analysis and Pattern Recognition Toolkit (ADAPT) to generate 100-300 parameters in three categories (topological, geometric, and electronic features) for predicting the glass transition temperature. [54] Shafe et al. defined 20 atomistic-level fingerprints and used molecular dynamics (MD) models and statistical analysis to correlate the fingerprints with the thermomechanical properties of amine-epoxy thermoset SMPs. [74] Figure 12 shows a schematic of the fingerprinting scheme for isophorone diamine (left) and bisphenol A diglycidyl ether (right) monomers.

Figure 7. Direct label encoding for a simple linear polymer, wherein beads with different colors represent different atoms or compositional units, leading to a binary vector. [123] Here, A is represented as 0 and B as 1, yielding a vector of binary values corresponding to each monomer position in the chain. Reproduced with permission. [123] Copyright 2020, Royal Society of Chemistry.

Figure 8. A typical iteration process in Morgan fingerprinting. [64] The central atom is atom 1 in (a). After each iteration, Morgan fingerprinting creates an identifier that denotes a larger and larger circular substructure around the central atom. Reproduced with permission. [64] Copyright 2020, American Chemical Society.

k-mer Method
The k-mer method aims to reflect the distinct neighboring relations in a polymer chain. It is a common method in bioinformatics, where it conveniently represents DNA sequences, and it is also applicable to linear polymers. As shown in Figure 13, this method enumerates the frequencies of occurrence of all possible k-mers of compositional units; the counts of all these frequencies constitute a vector.
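For a linear copolymer written as a string of compositional units, k-mer counting reduces to sliding a window along the sequence. The sketch below assumes a hypothetical two-unit alphabet A/B.

```python
from collections import Counter
from itertools import product

def kmer_vector(sequence, alphabet, k=2):
    """Count occurrences of every possible k-mer over the alphabet in a
    linear unit sequence, returning a fixed-length frequency vector."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[kmer] for kmer in kmers]

# 2-mers of the made-up sequence AABAB, in the order AA, AB, BA, BB.
vec = kmer_vector("AABAB", alphabet="AB", k=2)  # -> [1, 2, 1, 0]
```

The vector length is fixed by the alphabet and k (here 2^2 = 4), regardless of the chain length, which is what makes k-mer counts convenient ML inputs.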

Sequence Fingerprinting
For some databases of linear copolymers, if researchers limit the research scope to a small branch of datapoints, then the building blocks of the polymers are limited, which allows researchers to fingerprint the polymers in a comparatively easy way. First, different digitization techniques are leveraged to convert the common building blocks into low-dimensional vectors. Second, according to the order of the sequence to be applied, these low-dimensional vectors are concatenated into a high-dimensional vector corresponding to a polymer unit. For example, Wu et al. implemented OpenSoundControl (OSC) to encode the common fragments in the database and then leveraged them to establish a high-dimensional binary vector corresponding to the sequence [69] (see Figure 14). Kim et al. used 108 common fragments, that is, atom triples such as C3-S2-C3 and H1-N3-C4, in their study. [75] These fragments can constitute a diverse range of organic materials; therefore, the high-dimensional vector corresponding to the polymer can be formed from the low-dimensional vectors representing the atom triples. However, this method is not appropriate for polymer networks with indefinite units. For example, as shown in Figure 15, a monomer BIS-GMA and a crosslinker TATATO can be synthesized into a polymer network. However, because the connection possibilities for the C═C groups of the monomer (or crosslinker) are almost equal, the building blocks of the resulting polymer can be diverse, such as the two building blocks in Figure 15b,c. In other words, if the sequence is not definite, this method is inappropriate.

Figure 9. Illustration of the three types of motifs involving building blocks (top row, A_i), neighboring blocks (middle row, A_i-B_j), and triplet neighboring blocks (bottom row, A_i-B_j-C_k) [68] for materials composed of hydrogen, oxygen, and carbon. Reproduced with permission. [68] Copyright 2020, American Physical Society.
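The two-stage procedure described above (encode each building block as a low-dimensional vector, then concatenate in sequence order) can be sketched as follows; the block codes are hypothetical stand-ins for the fragment encodings used in the cited studies.

```python
def sequence_fingerprint(sequence, block_codes):
    """Concatenate the low-dimensional vector of each building block, in
    sequence order, into one high-dimensional vector for the polymer unit."""
    fingerprint = []
    for block in sequence:
        fingerprint.extend(block_codes[block])
    return fingerprint

# Hypothetical 3-bit codes for two common fragments.
block_codes = {"A": [1, 0, 0], "B": [0, 1, 0]}
fp = sequence_fingerprint(["A", "B", "A"], block_codes)
# -> [1, 0, 0, 0, 1, 0, 1, 0, 0]
```

Because the output depends on a single definite unit ordering, the sketch also makes the stated limitation concrete: for a network with indefinite connection order, no unique `sequence` argument exists.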

Hierarchy Fingerprinting
Basically, all the methods discussed above extract features at a single scale, which cannot capture the whole picture of a polymer. Therefore, some investigators have turned to hierarchical fingerprinting. This approach enables researchers to extract features at multiple length scales and dimensions and thus can make up for the incomplete information of single-scale fingerprinting methods. For instance, Kim et al. proposed a three-level fingerprint for 854 linear molecules. [75] First, they defined a series of atomistic fragments to form fragment-based vectors. Second, they applied QSPR descriptors, in which a polymer is encoded into a group of numbers with clear physical meanings, such as molecular quantum numbers, the fraction of sp3 carbon atoms, etc. Finally, they applied "morphological descriptors," in which morphological features of the polymers, such as the fraction of atoms belonging to sidechains and the shortest topological distances between rings, are measured by DFT computations. The inputs are therefore three groups of vectors with different dimensions. Patel et al. proposed another hierarchical fingerprint involving three scales: [76] one-hot encoding was used to describe the arrangement of the repeat units; Morgan fingerprinting was used to convert every distinct compositional repeat unit into an integer vector; and QSPR descriptors were leveraged to extract further key features. Mohapatra and Gómez-Bombarelli proposed a new hierarchical method combining three representations: [77] SMILES was leveraged to express the monomer information; a vector called MONOMERS was used to express the monomer indices in the synthesis process; and a vector called BONDS listed all the bond indices in the synthesis process. Tao et al. also applied a hierarchical fingerprint at three scales on the basis of linear notations, [28] that is, monomer, repeat unit, and oligomer (see Figure 16).

Weighted Vector Combination Method
Most fingerprinting methods discussed above are specialized for homopolymers, wherein a single monomer or a single repeat unit can represent a polymer network; thus, it is reasonable to ignore the molar ratio. However, in copolymers, the molar ratios among monomers directly determine the performance of the final polymer, such as crosslink density and topology. Hence, the molar ratio is a critical factor for copolymers. Meanwhile, molar ratio and chemical structure are two quantities in different dimensions; hence, combining them is not a trivial issue. In view of this, Yan et al. developed a weighted vector combination method. [46] This method can be divided into three steps. First, the monomers and crosslinkers of a copolymer are converted into linear notations. Second, the linear notations are input into a variational autoencoder (VAE) model, which converts the monomers and crosslinkers into high-dimensional vectors m_1, m_2, …, m_n. Third, according to the molar ratios in the copolymer synthesis process, the copolymer network can be expressed as the weighted combination

m_1·a_1 + m_2·a_2 + … + m_n·a_n (7)

where a_i (i = 1, 2, …, n) represents the molar percentage of a monomer in the whole copolymer network. Because the high-dimensional vector based on the VAE is a sequence of continuous numbers, this method has shown some superiority over linear notation and label encoding. [46] However, it should be mentioned that some new topologies are still not represented by this method; thus, improvements will be needed in the future. The schematic diagram for this fingerprinting is shown in Figure 17.

Figure 10. Molecular graph for a protein kinase inhibitor. [82] Every edge and vertex is assigned a solid color and label, respectively. The colored vertices are updated with the colors from neighboring vertices and the labels of the connected edges. Finally, the vertices possess mixed colors. Reproduced with permission. [82] Copyright 2022, Nature Portfolio.

Figure 11. A modified graph that considers the connection probability (red decimals in the figure) after polymerization. For different types of polymerization processes, the connection between two atoms is different. The sequences below every subgraph represent the connections in a linear polymer chain. [73] Reproduced with permission. [73] Copyright 2020, Royal Society of Chemistry.

Figure 12. Schematic of the fingerprinting scheme for an amine-epoxy system. a,b) Sidechain dihedral and bond angles and c,d) backbone dihedral and bond angles. All the quantities of properties (a)-(d) can be calculated by MD simulation. [74] Reproduced with permission. [74] Copyright 2022, Elsevier.
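The weighted combination in Equation (7) is a simple linear mixing of latent vectors; a minimal sketch (the 4-dimensional latent vectors are hypothetical, not taken from ref. [46]):

```python
import numpy as np

def combine_latents(latents, molar_fractions):
    """Weighted vector combination (Eq. 7): sum of each monomer's
    latent vector m_i scaled by its molar fraction a_i."""
    latents = np.asarray(latents, dtype=float)
    a = np.asarray(molar_fractions, dtype=float)
    if not np.isclose(a.sum(), 1.0):
        raise ValueError("molar fractions must sum to 1")
    return latents.T @ a  # sum_i a_i * m_i

# Hypothetical latent vectors for a monomer and a crosslinker
m1 = np.array([1.0, 0.0, 2.0, 0.0])
m2 = np.array([0.0, 1.0, 0.0, 2.0])
copolymer_vec = combine_latents([m1, m2], [0.75, 0.25])
```

The resulting vector lives in the same latent space as the monomers, which is what allows a downstream property model to consume copolymers and homopolymers uniformly.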
Clearly, each of the methods discussed above has pros and cons. Table 2 summarizes the advantages and limitations of each fingerprinting method.

ML Framework
In this step, researchers have employed a variety of models to explore the correlation between polymer fingerprints and polymer properties. Herein we summarize some classical ML models.

Figure 13. The k-mer method. [123] The beads with different colors represent different atoms. In the figure, the counts for the white bead, the combination of a white and a black bead, and the combination of two black beads are 3, 8, and 8, respectively, which can thus be written as a vector of integers [3, 8, 8]. Reproduced with permission. [76] Copyright 2020, Royal Society of Chemistry.

One example of the fingerprint structure of polymer D. The procedure is provided on the left side, and that of the small-molecule nonfullerene acceptors is provided on the right side. R1, R2, and R3 represent alkyl chains with different lengths. [73] It is open access. [73]

Regression Model
In this model, the correlation between QSPR descriptors and target properties is assumed to be

y = α_1 x_1 + α_2 x_2 + … + α_n x_n + ε

where x_1, x_2, …, x_n are QSPR descriptors from scientific computation, α_1, α_2, …, α_n are undetermined parameters, and ε is an error. The aim of this method is to look for a subset of {x_1, x_2, …, x_n} that minimizes the errors, including the root mean square error, mean average percentage error, etc. This method was employed by some early studies [52,54] and still exhibits its capability in recent research. [78] The limitation of the regression model is obvious: it assumes a linear relation between QSPR descriptors and target properties, while polymers usually possess an intricate nonlinear relation between the two, which could bring some unavoidable errors.
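Fitting the undetermined parameters α_i reduces to ordinary least squares; a minimal sketch with a synthetic descriptor matrix (the data are made up for illustration):

```python
import numpy as np

# Hypothetical QSPR descriptor matrix X (rows: polymers, cols: descriptors)
# and a target property y; the linear model y = X @ alpha + eps is fit by
# ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_alpha = np.array([2.0, -1.0, 0.5])
y = X @ true_alpha + 0.01 * rng.normal(size=50)

# Closed-form least-squares solution
alpha_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(np.mean((X @ alpha_hat - y) ** 2))
```

Subset selection then amounts to repeating this fit over candidate descriptor subsets and keeping the one with the smallest validation error.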

Artificial Neural Networks
As shown in Figure 18, an ANN is a simulation motivated by the biological neural network, which can correlate QSPR descriptors with properties. [52,54] Biologically, the cognitive process is realized by the conduction of chemical and electrical signals between an ocean of synapses. Computer scientists modeled this process and divided it into three parts, that is, the input layer, hidden layer(s), and output layer. Mathematically, one expresses it with a mapping [79]

f_NN: R^I → R^K

where I and K are the dimensions of the input space R^I and the target (output) space R^K, respectively. The net input signal to a neuron is the weighted sum of all input signals:

Figure 16. Hierarchy fingerprint at three scales for 1-(4-vinylphenyl)-3-piperidino-1-propanol. a) Monomer, b) repeat unit, and c) crosslinked repeat units (redrawn from ref. [28]). The first is fingerprinted with SMILES and the other two with modified SMILES. Redrawn with permission. [28] Copyright 2022, Elsevier.

Figure 17. The monomer and crosslinker are first fingerprinted into vectors by the latent distribution; then, they form a new vector by combining with the molar ratio. [46] Reproduced with permission. [46] Copyright 2022, American Chemical Society.
net = Σ_i w_i z_i

where z_i and w_i are the inputs and weights, respectively. The activation function receives the net input signal and the bias and determines the output. Typically used activation functions include the linear function, step function, ramp function, etc. By choosing an appropriate loss function, the weights and bias can be updated by the gradient descent approach (see Figure 1).

Table 2. Advantages and limitations of each fingerprinting method.

(…) Limitations: Only applicable for linear polymers with a fixed compositional unit.

Sequence fingerprinting
Advantages: 1) Very effective for databases with limited building blocks, especially for linear polymers with a fixed sequence. 2) The prediction accuracy is usually high because of the range of the database.
Limitations: Cannot effectively capture the important features of polymer networks or stochastic linear copolymers, wherein the sequences are not definite.

Hierarchy fingerprinting
Advantages: 1) Can extract features at multiple scales. 2) Could produce good models.
Limitations: 1) It is time-consuming to collect data at every scale. 2) Could be computationally expensive.

Weighted vector combination method
Advantages: 1) Can deal with both homopolymers and copolymers. 2) Can consider molar ratios in copolymers.
Limitations: 1) Cannot completely represent the topologies of polymer networks. 2) The meaning of the vectors from the VAE model is vague.
Figure 18. Schematic diagram for a) a whole neural network and b) a neuron. z_1, z_2, …, z_n represent inputs from the last layer. w_1, w_2, …, w_n are weights. f_AN is the activation function, which can be an exponential function, piecewise function, logarithmic function, etc.
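The neuron of Figure 18b computes a weighted sum plus a bias and passes it through an activation; a minimal sketch with toy weights (sigmoid chosen here for illustration):

```python
import numpy as np

def neuron(z, w, b, activation=lambda x: 1.0 / (1.0 + np.exp(-x))):
    """Single artificial neuron: net input = weighted sum + bias,
    passed through an activation function (sigmoid by default)."""
    net = np.dot(w, z) + b
    return activation(net)

z = np.array([0.5, -1.0, 2.0])   # inputs from the previous layer
w = np.array([0.2, 0.4, 0.1])    # weights
out = neuron(z, w, b=0.1)        # net input is 0 here, so sigmoid gives 0.5
```

Stacking many such neurons into layers, and composing layers, yields the mapping f_NN: R^I → R^K described above.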

Convolutional Neural Network
CNN was inspired by the study of the neocognitron. [81] According to Fukushima, human recognition of an object follows the flow path: pixel → edge and orientation → contour and details → recognition. [80] Motivated by this study, LeCun developed CNN by combining the neocognitron with NNs. [6] To date, CNN has been widely leveraged in ML-assisted polymer discovery. [28,49,50] As indicated above, the linear fingerprint of a polymer, SMILES, can be represented by a binary image; thus, it can be conveniently recognized by a CNN. For example, Miccio and Schwartz utilized CNN to look for the mapping between polymer composition and glass transition temperature. [50] Sometimes, researchers have to develop more complex CNN networks to search for structure-property relations. For example, with different programming strains and programming temperatures, the recovery stress in SMPs can be significantly different, so the recovery stress is not only related to the polymer structure but also to external inputs (see Figure 19). Therefore, Yan et al. presented a clever ML framework to resolve this issue. [49] Specifically, they first developed a CNN model to predict a polymer property (glass transition temperature) that is directly related to molecular structure (see Figure 20a). Next, according to the output of the first CNN model and an empirical equation, they further predicted the recovery stress σ_r through a second CNN (see Figure 20b). This framework allowed them to successfully predict 14 new thermoset SMPs (TSMPs) with higher recovery stress than ever before. It should be mentioned that Yan et al. did not limit the range of the target polymers to a small chemical space. Instead, the polymers that they dealt with spanned a large chemical space; hence, their model could have more general applicability. Besides, as shown in Figures 21 and 22, CNN can directly recognize the 2D image of a chemical structure without conversion to a linear notation. [81]

Although CNN performs well, it bears two limitations. First, a CNN model usually requires a large training dataset. Second, it is difficult for a CNN to recognize the location and orientation of an image. Hence, the polymer dataset must provide massive datapoints for the model (considering the complexity of polymers, thousands of samples could be needed to achieve an excellent model). Meanwhile, if the inputs are 2D molecule images, rotations of the images could be needed to ensure that CNNs genuinely recognize molecular structures.
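The core operation a CNN applies to such 2D molecule images is the sliding-kernel convolution; a naive sketch (the binary "molecule image" and edge-detecting kernel are toy examples):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as used in
    most CNN implementations): slide the kernel over the image and take
    the elementwise product-sum at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A binary "molecule image" with a vertical boundary, and an edge detector
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
edge = conv2d(img, np.array([[1.0, -1.0]]))  # responds only at the boundary
```

Because the kernel weights are shared across all positions, the response is translation-equivariant, which is why CNNs tolerate shifted inputs but, as noted above, not rotated ones.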

Graph Convolutional Neural Network
As indicated above, GCNN can be directly employed on molecular graphs for property prediction. [28,72,82] GCNN differs from CNN in two aspects, that is, convolution and pooling (see Figure 23). First, the kernel in CNN possesses a regular 2D receptive field and moves with a regular Euclidean stride. On the contrary, the receptive field of the kernel in GCNN is determined by each node's surroundings. Second, CNN applies uniform 2D grid pooling, while GCNN applies a nonuniform pooling operation over a node's neighborhood. Because of these two characteristics, GCNN can handle irregular domains, such as the complex network connectivity in a polymer graph. For example, Wang et al. applied GCNN to predict the interaction sites of protein kinase inhibitors and achieved a prediction accuracy up to 86%. [82] Tao et al. employed GCNN to predict the glass transition temperature of homopolymers, which led to a satisfactory model accuracy up to 79.18%. Altae-Tran et al. leveraged GCNN to achieve one-shot learning, and the model shows promising prediction performance for small molecules. [83] By implementing a modified GCNN (see Figure 24), Aldeghi et al. took the connection after polymerization into consideration and proposed a new network, wD-MPNN, which exhibited an apparent performance improvement compared with previous GCNNs.
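The neighborhood aggregation that replaces the grid kernel can be sketched with a common normalized propagation rule (the 3-atom path graph and features are hypothetical, not taken from the cited works):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer in the widely used normalized form
    H' = relu(D^-1/2 (A + I) D^-1/2 H W): every node aggregates features
    from itself and its neighbors, then applies a shared linear map."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Hypothetical 3-atom molecular graph (a path 0-1-2), 2 features per atom
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)
H1 = gcn_layer(A, H, W)
```

Stacking such layers lets information flow over longer paths in the molecular graph, mimicking the growing receptive field of a CNN but on irregular connectivity.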

Bayesian Model
The Bayesian model is able to calculate the conditional probabilities of newly generated chemical structures, which can be written as

p(S | Y ∈ U) ∝ p(Y ∈ U | S) p(S)

where Y and S are the molecular properties of polymers and the molecular structures, respectively. p(S | Y ∈ U), p(Y ∈ U | S), and p(S) are the probability of generating the correct polymer structure under the current design requirement, the probability of obtaining the desired property characteristics given such a structure, and the probability that a linear notation generates a legitimate molecular structure, respectively. This method can choose the de novo polymers that meet the design requirements with the greatest possibility, and it has been widely used by researchers. [63,84,85] The likelihood p(Y ∈ U | S) can be evaluated by different approaches. [63,85] For instance, Ikebata et al. calculated the maximum likelihood by a t-distribution linear regression model. [85]

Figure 19. Correlation between polymer structure and ground truth. The programming temperature and recovery temperature are about 20 °C higher than the glass transition temperature of the polymer. The estimated ground truth σ_r is predicted from an input 3D vector (T_pg, T_re, ε_pg).

Figure 20. Basic pipeline structures for the networks predicting a) glass transition temperature and b) recovery stress. [49] The inputs for the two subfigures are SMILES and SMILES with a 3D vector (T_pg, T_re, ε_pg), respectively. The activation functions are not shown in the figure. Reproduced with permission. [49] Copyright 2022, Elsevier.
Bayesian optimization has also been leveraged to screen appropriate candidates for polymer-protein hybrids. [86]
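The Bayesian scoring p(S | Y ∈ U) ∝ p(Y ∈ U | S) p(S) can be illustrated with a toy example (the structure labels, priors, and likelihoods below are made up for illustration):

```python
# Candidate structures are ranked by the product of a structural prior
# p(S) and a likelihood that the property lands in the target window U.
priors = {"S1": 0.5, "S2": 0.3, "S3": 0.2}        # p(S)
likelihoods = {"S1": 0.1, "S2": 0.6, "S3": 0.8}   # p(Y in U | S)

unnormalized = {s: priors[s] * likelihoods[s] for s in priors}
Z = sum(unnormalized.values())
posterior = {s: v / Z for s, v in unnormalized.items()}
best = max(posterior, key=posterior.get)
```

Note that the most probable structure a priori ("S1") is not the one selected: the likelihood of hitting the target property window dominates, which is exactly the trade-off the posterior encodes.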

Long Short-Term Memory
Long short-term memory (LSTM) is another popular ANN, developed at the end of the 20th century. [87] Different from general ANNs, LSTM is able to learn the sequential order of a sentence. As a type of recurrent neural network (RNN), LSTM can determine not only how much short-term memory affects the output, but also how long-term memory affects the output. Hence, it is appropriate for learning the knowledge related to the linear notations of polymers. For example, for the SMILES "Cc1ccccc1," the last symbol "c1" of a benzene ring is not only related to the adjacent symbol "c," but is also determined by the distant symbol "c1." These characteristics enable LSTM to be commonly used in ML-assisted polymer discovery. For example, based on LSTM (see Figure 25), Simine et al. leveraged the vector of 29 intermonomer dihedral angles to predict the associated value of the jth excited-state energy relative to the preceding one. [88]

Figure 21. The CNN "Chemception" predicts chemical properties for small molecules. [81] Molecules are converted into "grid" images (the 2D structure of each molecule is mapped onto an 80 × 80 grid, where each pixel has a resolution of 0.5 Å) before they are input into a deep neural network. Reproduced with permission. [81] Copyright 2017, arXiv.

Figure 22. Basic pipeline structures of "Chemception." a) Depiction of a typical CNN for toxicity prediction of small molecules and b) high-level architectural details of the "Chemception" architecture. [81] Reproduced with permission. [81] Copyright 2017, arXiv.
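A single LSTM time step, showing how the cell state carries long-term memory while the gates control what is kept, can be sketched as follows (the weight shapes and toy "sequence" are illustrative assumptions, not the architecture of any cited work):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step. The stacked weights produce the forget (f),
    input (i), candidate (g), and output (o) gates; the cell state c
    carries long-term memory, the hidden state h short-term memory."""
    z = Wx @ x + Wh @ h + b          # shape (4*H,)
    H = h.size
    f = sigmoid(z[0:H])
    i = sigmoid(z[H:2 * H])
    g = np.tanh(z[2 * H:3 * H])
    o = sigmoid(z[3 * H:4 * H])
    c_new = f * c + i * g            # forget some memory, write some new
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Hypothetical sizes: 3-dimensional token embedding, hidden size 2
rng = np.random.default_rng(1)
Wx, Wh, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
h, c = np.zeros(2), np.zeros(2)
for x in np.eye(3):                  # a toy 3-token "sequence"
    h, c = lstm_step(x, h, c, Wx, Wh, b)
```

The additive update of c is what lets a distant symbol such as the opening "c1" of a ring still influence the prediction many steps later.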

Decision Tree
A decision tree is a decision support tool that uses a tree-like structure to predict possible consequences. [90] Because it leverages a graphical representation, researchers can easily understand it, and it has been commonly used for polymer discovery. [15,91] Many algorithms have been used to establish a decision tree, such as information gain, gradient boosting, etc. For example, Li et al. utilized decision trees, gradient boosting, and logistic regression to predict the self-assembly behavior of hydrogels (see Figure 26), which can be used in biomedical applications such as cell culture. [15] Bhowmik et al. leveraged decision trees and principal component analysis (PCA) to predict the specific heat of polymers. [92] Kumar et al. used random forests to classify cellular uptake, toxicity, and editing efficiency into high/low categories for ribonucleoproteins. [93]
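The smallest possible decision tree, a single-split "stump," already shows the idea of threshold-based classification; a sketch on made-up data (the descriptor and labels are hypothetical, not from ref. [15]):

```python
# A one-node "decision tree": pick the threshold on a single descriptor
# that best separates gel-forming from non-gel-forming samples, by
# minimizing the number of training misclassifications.
def fit_stump(xs, ys):
    best = None
    for t in sorted(set(xs)):
        preds = [1 if x >= t else 0 for x in xs]
        err = sum(p != y for p, y in zip(preds, ys))
        if best is None or err < best[1]:
            best = (t, err)
    return best  # (threshold, training errors)

xs = [0.1, 0.4, 0.35, 0.8, 0.9, 0.75]   # e.g. a hydrophobicity descriptor
ys = [0, 0, 0, 1, 1, 1]                 # 1 = forms a gel
threshold, errors = fit_stump(xs, ys)
```

A full tree repeats this search recursively on each branch, and ensembles such as random forests or gradient boosting combine many such trees.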

Gaussian Process Regression
Similar to fitting a Gaussian distribution, Gaussian process regression (GPR) seeks two functions that define a Gaussian process,

f(x) ∼ GP(m(x), k(x, x′))

where m(x) and k(x, x′) are the mean function and covariance function, respectively. It has been widely used for polymer discovery or property prediction. [75,94] For example, Chen et al. used GPR to predict the frequency-dependent dielectric constant of polymers. [96] The kernel (or covariance function) is given by

k(x, x′) = σ_f² exp(−‖x − x′‖² / (2σ_l²)) + σ_n² δ(x, x′) (18)

where x and x′ are the features of two materials, and σ_f, σ_l, and σ_n are the variance, the length-scale parameter, and the expected noise in the data, respectively. Kim et al. leveraged GPR to predict seven polymer properties, [75] including bandgap, dielectric constant, refractive index, atomization energy, glass transition temperature, solubility parameter, and density. The kernel is a radial basis function (RBF), which is similar to Equation (18) and reads

k(x_i, x_j) = σ² exp(−‖x_i − x_j‖² / (2l²)) + σ_n² δ_ij

where σ, l, and σ_n are hyperparameters in the model, and x_i and x_j are fingerprint vectors. This method was also used in other studies. [96][97][98]

Figure 23. Comparison between the general convolution operation and the graph convolution operation (the convolution operation indicates a double-dot operation between a submatrix and kernel), and between general pooling and graph pooling [72] (pooling denotes a method to reduce the amount of data). Reproduced with permission. [72] Copyright 2020, American Chemical Society.

Figure 24. Graph convolution considering the connection after polymerization. [73] x_v and e_uv represent the atom features. h represents the hidden features. W represents trainable weights. w_kv represents the connection probability after polymerization. cat represents the concatenation operation. τ represents the activation function. It is open access.
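GPR prediction with an RBF kernel of the form above can be sketched as follows (zero mean function and 1D toy data; the noise level is an arbitrary choice):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, length=1.0):
    """RBF covariance k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-d2 / (2 * length ** 2))

def gpr_predict(X_train, y_train, X_test, sigma_n=0.1):
    """GPR posterior mean and variance, zero mean function assumed."""
    K = rbf_kernel(X_train, X_train) + sigma_n ** 2 * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    Kss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

X = np.array([0.0, 1.0, 2.0])    # e.g. a scalar polymer fingerprint
y = np.sin(X)                    # a toy property
mean, var = gpr_predict(X, y, np.array([1.0]))
```

The returned variance is the model's own uncertainty estimate, the feature of GPR highlighted in Table 3 and the quantity that drives Bayesian-optimization-style screening.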

Gradient Boosting
Different from the general ML models, wherein the model and loss function hold fixed forms, in gradient boosting researchers usually combine many weak learners (weak models) to achieve a strong model. By continuously updating the form of the model, one can obtain a better model of the form

F_m(x) = F_{m−1}(x) + γ_m h_m(x)

where F_{m−1}(x) is the model from the last iteration, γ_m is the learning rate, and h_m(x) is the new estimator. The method can be carried out in both regression and classification models and is often combined with decision trees. For example, Kumar et al. leveraged gradient boosting to iteratively update the cloud point of a polymer, which enabled them to successfully find 17 new polymers. [91] Li et al. adopted gradient boosting to predict the Akron abrasion of rubber through some related mechanical properties. [99] Ethier et al. used gradient boosting for quantitative prediction of the phase behavior of polymers in solution. [100]

Figure 25. Methods for predicting the spectra of conjugated polymers. A) Spectra prediction based on coarse-grained (CG) molecular dynamics models (inputs are atomistic descriptions of molecules) and B) a CG model with an explicitly included term for the intermonomer dihedrals {ϕ} (defined in the figure by the encircled CG sites). LSTM is leveraged to predict excited-state energies using selected sequences of {cos ϕ}. [88] Reproduced with permission. [88] Copyright 2019, National Academy of Sciences.

Figure 26. Machine learning algorithms for gel prediction by the random forest algorithm. [15] The method achieves 54% precision at 50% recall. Reproduced with permission. [15] Copyright 2020, National Academy of Sciences.
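The update F_m = F_{m−1} + γ h_m, with each h_m fit to the current residuals (the negative gradient of the squared loss), can be sketched on 1D toy data:

```python
import numpy as np

def fit_stump(x, residual):
    """Weak learner: a depth-1 regression tree on a 1D feature."""
    best = (None, np.inf, 0.0, 0.0)
    for t in x:
        left, right = residual[x < t], residual[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        lm, rm = left.mean(), right.mean()
        sse = ((left - lm) ** 2).sum() + ((right - rm) ** 2).sum()
        if sse < best[1]:
            best = (t, sse, lm, rm)
    t, _, lm, rm = best
    return lambda z: np.where(z < t, lm, rm)

def gradient_boost(x, y, n_rounds=50, lr=0.3):
    """F_m(x) = F_{m-1}(x) + lr * h_m(x), h_m fit to the residuals."""
    pred = np.full_like(y, y.mean())
    for _ in range(n_rounds):
        h = fit_stump(x, y - pred)
        pred = pred + lr * h(x)
    return pred

x = np.linspace(0, 1, 40)
y = np.where(x < 0.5, 0.0, 1.0)      # a step-like toy property
pred = gradient_boost(x, y)
```

Each round removes a fraction of the remaining error, so the training residual shrinks geometrically; production libraries add regularization and deeper trees but follow the same update.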

Kernel Model
The kernel model is a class of algorithms for pattern analysis, which can solve nonlinear problems with a linear classifier and has been widely used by researchers. For example, Mannodi-Kanakkithodi et al. leveraged kernel ridge regression (KRR) to predict polymer dielectrics. [67] This method aims to learn a linear function in the space defined by the kernel and the data, and its loss function is the least squares loss, which reads

L = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

where y_i and ŷ_i are the prediction and the ground truth, respectively, and n is the number of datapoints. Another commonly used kernel method is the support vector machine (SVM), which has also been used for polymer discovery; for example, SVM was applied to predict the refractive index of polyimides. [101] The loss for SVM is the epsilon-insensitive loss, which reads

L_ε = max(0, |y − f(x, ω)| − ε)

where f(x, ω) is the fitted linear model. To better understand the advantages and limitations of each ML framework, a detailed comparison is summarized in Table 3.
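KRR has a convenient closed-form solution in the kernel space; a minimal sketch with an RBF kernel on synthetic data (the regularization and kernel width are arbitrary choices):

```python
import numpy as np

def krr_fit_predict(X, y, X_test, lam=1e-3, gamma=1.0):
    """Kernel ridge regression with an RBF kernel: solve
    (K + lam * I) c = y, then predict f(x) = sum_i c_i k(x, x_i)."""
    def K(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    c = np.linalg.solve(K(X, X) + lam * np.eye(len(X)), y)
    return K(X_test, X) @ c

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(60, 1))
y = X[:, 0] ** 2                     # a nonlinear toy target
pred = krr_fit_predict(X, y, X)
```

The model is linear in the coefficients c yet fits the nonlinear target, which is precisely the "nonlinear problem solved by a linear algorithm" advantage noted in Table 3; the n × n solve is also why kernel methods lose efficiency on very large databases.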

New Polymer Generation
In this procedure, ML frameworks will generate new chemical structures of polymers. Basically, according to the type of polymers, we can divide the frameworks into two categories, that is, new polymer generation for homopolymers and new polymer generation for copolymers.

New Polymer Generation for Homopolymers
As is well known, a homopolymer is synthesized from a single repeating chemical unit, that is, one monomer; it is therefore reasonable to represent it with a single monomer. Herein, we list several classical generation methods of chemical structures for homopolymers. It should be mentioned that these methods can also be used for drugs because drugs can usually be synthesized from small molecules.
Table 3. Comparison between different ML frameworks.

Bayesian method
Advantages: 1) Provides a natural way to combine prior information with data. 2) Can be used for inverse design.
Limitations: 1) A specific mutational model should be specified. 2) To form the Gaussian distribution for the probability model, a large amount of computation is needed.

Long short-term memory
Advantages: 1) Can effectively learn the order information from linear notations of polymers. 2) Has better performance than the general recurrent neural network.

GPR
Advantages: 1) Easy to use. 2) Can estimate its own uncertainty.
Limitations: 1) Not sparse. 2) Loses efficiency in high-dimensional spaces.

Gradient boosting
Advantages: 1) Better performance than random forest. 2) Good model accuracy.
Limitations: 1) Prone to overfitting. 2) Model updates could need more time.

Kernel model
Advantages: The nonlinear problem can be solved with a linear algorithm; thus, the computational cost is reasonable.
Limitations: Not suited to low-dimensional problems with large databases.

Variational Autoencoder (VAE): The VAE model was created by Kingma and Welling and belongs to the family of probabilistic graphical models. [102] It can be divided into an encoder and a decoder. The encoder maps the inputs into vectors in a hidden space, which can be represented by

E = q(z|x)

where x is the input and z is the vector in the hidden space. On the contrary, the decoder maps the vector in the hidden space back to the inputs, such that
D = q(x|z)

Mathematically, a classical VAE makes the mapped probability distribution close to the actual probability of the inputs and maximizes the probability of reconstructing the inputs from the latent space, which reads

arg min_θ [KL(q(z|x) ‖ p_θ(z)) − E_{q(z)}[ln p_θ(x|z)]] (25)

where q(z) is the prior distribution of the latent space, p_θ(x|z) is the conditional distribution of input x given a high-dimensional vector z, and KL(q(z|x) ‖ p_θ(z)) is the Kullback-Leibler divergence between q(z|x) and p_θ(z). The second term also represents the reconstruction loss. θ denotes the trainable parameters of the neural network.
In order to design a satisfactory VAE model, one can choose the same or different layers for the encoder and decoder. For example, one can choose CNN layers and dense layers as both encoder and decoder; [103] one can choose dense layers as both encoder and decoder; [104] one can also choose CNN layers as the encoder and LSTM layers as the decoder (see Figure 27). [46] If the training data are well sampled, then the vectors in the latent space could accord with a multivariate Gaussian variable X = N(μ, Σ), where μ and Σ are the mean vector and covariance matrix, respectively. By sampling points that accord with X = N(μ, Σ) but do not come from mappings of the inputs, we are able to obtain some new vectors. Through the decoder, one can further obtain new chemical structures. For example, Samanta designed a novel VAE model that can learn the spatial coordinates of atoms in molecules. [105] By sampling the latent space, they found that this model is able to identify new molecules with property values 121% higher than its rivals. Batra leveraged the combination of VAE and GPR to discover some new polymers that satisfy three design requirements, that is, high glass transition temperature, high bandgap, and both high glass transition temperature and bandgap (see Figure 28). [94] Gómez-Bombarelli et al. leveraged the VAE model to search for optimized functional compounds in drug-like molecules and molecules with nine heavy atoms, respectively, [106] and obtained satisfactory results. Shmilovich implemented a VAE model to extract the features of peptides and identified candidates from 8000 possible XXX tripeptides. [96]

Figure 27. Convolutional layer 1 (102 × 28 × 8). Reproduced with permission. [46] Copyright 2021, American Chemical Society.
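The regularization term of the VAE objective in Equation (25) has a closed form when the encoder outputs a diagonal Gaussian; a minimal sketch of that term:

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): the closed-form
    regularization term of the VAE objective for a standard-normal
    prior and a diagonal-Gaussian encoder."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# If the encoder outputs the standard normal itself, the KL term vanishes
kl0 = kl_diag_gaussian(np.zeros(4), np.zeros(4))
# Shifting one latent mean away from zero incurs a penalty
kl = kl_diag_gaussian(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
```

This term is what keeps the latent space close to N(0, I), so that sampling from the prior, as described above, decodes to plausible new structures.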
Generative Adversarial Network (GAN): GAN is a generative model that was developed in 2014. [107] As shown in Figure 29, a GAN includes two parts, that is, a generator and a discriminator. In the beginning, the generator generates some random samples and inputs them into the discriminator, while real samples are also input into the discriminator. Next, the discriminator compares the real samples with the generated samples and tries to distinguish between them. After that, the generator is modified to make the generated samples closer to the real samples. Finally, this loop repeats until the discriminator cannot distinguish the real samples from the fake samples. Mathematically, GAN can be represented by [107]

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] (26)

where D and G represent the discriminator and generator, respectively, and E represents the expected value.
In the training process, one aims to train the discriminator D to maximize the probability of assigning the correct label to both the real samples and the generated samples from the generator G, as well as to train G to minimize the probability that D classifies its samples as fake. The loop in the GAN continues until the discriminator cannot distinguish the real samples from the generated samples. Researchers have implemented GAN to discover new polymers. For example, as shown in Figure 30, based on a training dataset of 9,800 2D polymer images, Hiraide leveraged the Wasserstein GAN to discover phase separation structures of polymer alloys with 15 stipulated Young's moduli. [108] Prykhdro et al. leveraged LatentGAN to generate 200 000 drug-like compounds. [109]

Figure 28. Polymer discovery with VAE and GPR. [94] a) VAE is used to map SMILES into vectors in the continuous hidden space. b) Polymers with known properties are encoded by the encoder. A supervised learning method, GPR, is used to map the vectors to polymer properties. c) In the design stage, known polymers with desired properties are encoded to search for the region of interest in the latent space. The latent points are sampled to meet the design goal through the GPR model. Reproduced with permission. [94]

Figure 29. Basic structure of GAN. The generator continually produces samples that resemble the real samples, aiming to minimize the difference between fake and real samples. The loop in the GAN stops when the discriminator cannot distinguish the real samples from the samples of the generator. Reproduced with permission. [24] Copyright 2020, Elsevier.
Genetic Algorithm (GA): The GA model was inspired by Charles Darwin's theory of natural evolution and was first proposed in the 20th century. [110] The logic of GA is to search for the fittest individuals to produce the next generation of offspring. The basic procedure of GA involves four steps, that is, initialization, selection, crossover, and mutation. [111] The flowchart of GA is shown in Figure 31. GA was employed by some early ML-assisted polymer discoveries. [11,12] In this process, an appropriate fitness function is key to generating the desired target. For example, Venkatasubramanian et al. defined a Gaussian-like fitness function [11]

fitness(x) = exp(−α Σ_i (P_i − P̄_i)²) (27)

where P_i is the ith property value and P̄_i is the average of the maximum and minimum acceptable property values P_i,max and P_i,min, respectively. In another work, Venkatasubramanian et al. utilized a sigmoid fitness function, [12] which reads

fitness(x) = 1 / (1 + exp(−(P_i − P_{F=0.5,i}))) (28)

where P_{F=0.5,i} is the property value at which the evaluated fitness is 0.5. The fitness function evaluates how close the optimized result is to the desired aim (Figure 31).

Particle Swarm Optimization (PSO): PSO was originally developed in the 1990s [112] and was first used to simulate social behaviors. [113] The known training dataset is considered as particles, and the particles gradually move through the search space to look for the swarm's best-known position. Mathematically, PSO can be represented by [91]

min F(x, f̂(x)) subject to G(x, f̂(x)) ≥ 0 (29)

where F: R^d × R → R is the objective function and G: R^d × R → R^m is the vector-valued constraint function. A pseudocode of the standard PSO algorithm is listed in Table 4. [114] This method can also be used to discover polymers. For example, Kumar et al. successfully combined PSO with gradient boosting and decision trees to discover 17 polymers with aimed cloud points of 37-80 °C. [91]
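A Gaussian-like fitness in the spirit of Equation (27) can be sketched as follows (the exact normalization used in ref. [11] may differ; the property windows below are hypothetical):

```python
import math

def gaussian_fitness(props, targets, alpha=1.0):
    """Gaussian-like fitness: equals 1 when every property P_i hits its
    target midpoint and decays as properties deviate (a sketch, not the
    exact form of ref. [11])."""
    return math.exp(-alpha * sum((p - t) ** 2 for p, t in zip(props, targets)))

# Target midpoints P_i as averages of acceptable min/max property values
targets = [(100 + 200) / 2, (1.0 + 3.0) / 2]   # hypothetical windows
perfect = gaussian_fitness([150, 2.0], targets)
off_target = gaussian_fitness([160, 2.5], targets, alpha=0.001)
```

Because the fitness varies smoothly between 0 and 1, it gives the selection step a graded signal rather than a hard accept/reject, which is what drives GA populations toward the design window.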
Monte Carlo Algorithm (MC): Monte Carlo (MC) is a random algorithm that is able to find an aim with high probability through repeated sampling. [115] The common MC methods include the MC algorithm for the minimum feedback arc set, Gibbs sampling, etc.

Figure 30. Generating new polymer alloys through a Wasserstein GAN. [108] The inputs are pseudopolymer images, which are arrays with the shape (32, 32, 1). By leveraging the Wasserstein loss function, the difference between the fake image from the generator and the real image can be significantly reduced. It is open access.

Figure 31. Basic flowchart of GA. It starts with an initial population produced by random generation or other heuristics, which is then implemented with a small random tweak (mutation) and swapping (crossover) to get the first generation. A fitness function is applied to evaluate how close a new generation is to the desired goal. When a satisfactory fitness level is reached, the cycle terminates. Reproduced with permission. [24] Copyright 2020, Elsevier.

The resampling algorithm of ref. [85] is shown in Table 5. Apparently, to generate samples that accord with the Gaussian distribution, a mass of resampling must be carried out, and thus it is computationally expensive.

Reinforcement Learning (RL): Among the five generative algorithms discussed above, VAE, GAN, and PSO limit the search space around the training database, which may miss some molecules that are not in the dataset. Therefore, some researchers leveraged another approach named reinforcement learning. It is a model that simulates a Markov decision process. In this process, one needs to define a set of environment and agent states S and a set of actions a for the agent. For every action a from s to s′, an immediate reward R_a(s, s′) can be evaluated. Finally, one is able to evaluate "how good" a given state is by the value function V^π(s), which reads

V^π(s) = E[R | s, π]

where R is the sum of future discounted rewards and can be written as

R = Σ_t γ^t r_t

in which r_t is the reward at step t and γ is the discount rate. As shown in Figure 32, Zhou et al. defined three actions for a molecule, namely atom addition, bond addition, and bond removal. [117] To find the locally optimized points, the model aims to minimize the loss function

L(θ) = E[f_l(y_t − Q(s_t, a_t; θ))]

where θ denotes the trainable parameters, y_t = r_t + max_a Q(s_{t+1}, a; θ) is the target value, and f_l is the Huber loss.
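The discounted return R = Σ_t γ^t r_t that the value function estimates can be computed directly; a minimal sketch (the reward sequence is a toy example):

```python
def discounted_return(rewards, gamma=0.9):
    """R = sum_t gamma^t * r_t, the quantity the value function
    V^pi(s) estimates in expectation."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reward only at the end (e.g., a valid molecule completed at step 3)
R = discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```

The discount rate γ < 1 makes a reward earned sooner worth more than the same reward earned later, which is how the agent is pushed toward efficient action sequences.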
With this method, Zhou et al. optimized drug molecules with desired properties in specific ranges with 100% success. It should be mentioned that this is also the ML algorithm that AlphaGo used to defeat Lee Sedol. Therefore, although this method has so far only been adopted for drugs, it does show some unique creativity over other ML algorithms and is appropriate for polymer discovery. However, this method still emphasizes the effects of the variation of a single atom, whereas some properties of polymers are not mainly determined by single atoms. For example, for SMPs, whether an atom is S or O does not affect the shape memory effect (SME) too much; instead, SMEs are affected by the numbers of rigid chains and flexible chains. [74] Meanwhile, monomers often possess symmetric structures, whereas RL often leads to unsymmetric structures, which may be hard to synthesize.

Table 4. Pseudocode of the standard PSO algorithm. [124] Reproduced with permission. [124] Copyright 2017, Elsevier.
for each particle i = 1, …, N do
    initialize the particle's position with a uniformly distributed random vector: x_i ~ U(b_lo, b_up)
    initialize the particle's best known position to its initial position: p_i ← x_i
    if f(p_i) < f(g) then update the swarm's best known position: g ← p_i
    initialize the particle's velocity: v_i ~ U(−|b_up − b_lo|, |b_up − b_lo|)
end for
while a termination criterion is not met do
    for each particle i = 1, …, N do
        for each dimension d = 1, …, n do
            pick random numbers: r_p, r_g ~ U(0, 1)
            update the particle's velocity: v_{i,d} ← ω v_{i,d} + φ_p r_p (p_{i,d} − x_{i,d}) + φ_g r_g (g_d − x_{i,d})
        end for
        update the particle's position: x_i ← x_i + v_i
        if f(x_i) < f(p_i) then
            update the particle's best known position: p_i ← x_i
            if f(p_i) < f(g) then update the swarm's best known position: g ← p_i
        end if
    end for
end while

Table 5. Backward prediction algorithm. [85] It is open access.

Input: T, R, E, θ, {S_r^0 | r = 1, …, R}, {β_t | t = 1, …, T}
Output: {S_r^t | r = 1, …, R, t = 1, …, T}
Set t = 0 and w_r^t = 1/R for r = 1, …, R.
for t = 1, …, T do
    for r = 1, …, R do
        Transform S_r^{t−1} to an intermediate state S_r^* using the structure manipulation model G_θ(S_r^{t−1}, S_r^*) (the procedure is detailed in the main body of the article).
        Update the weight of the r-th structure. (Note: if the modified S_r^* contains unclosed ring or branch specifications, those indicators must be temporarily removed before the chemical string is converted to a descriptor in the likelihood calculation.)
    end for
    Normalize the weights to obtain the selection probabilities W_{S_r^*} ∝ w_r^t such that they sum to one.
end for
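The PSO pseudocode in Table 4 can be rendered as a short runnable sketch. The hyperparameter values (ω = 0.7, φ_p = φ_g = 1.5), the bound clipping, and the sphere test function are illustrative assumptions, not taken from ref. [124].

```python
import numpy as np

def pso(f, b_lo, b_up, n_particles=30, n_iter=200,
        omega=0.7, phi_p=1.5, phi_g=1.5, seed=0):
    """Minimize f over the box [b_lo, b_up] with standard PSO."""
    rng = np.random.default_rng(seed)
    b_lo, b_up = np.asarray(b_lo, float), np.asarray(b_up, float)
    dim = len(b_lo)
    span = np.abs(b_up - b_lo)
    x = rng.uniform(b_lo, b_up, (n_particles, dim))    # positions
    v = rng.uniform(-span, span, (n_particles, dim))   # velocities
    p = x.copy()                                       # personal bests
    p_val = np.apply_along_axis(f, 1, x)
    g = p[np.argmin(p_val)].copy()                     # swarm best
    g_val = p_val.min()
    for _ in range(n_iter):
        r_p = rng.random((n_particles, dim))           # per-dimension r_p
        r_g = rng.random((n_particles, dim))           # per-dimension r_g
        v = omega * v + phi_p * r_p * (p - x) + phi_g * r_g * (g - x)
        x = np.clip(x + v, b_lo, b_up)                 # keep inside the box
        x_val = np.apply_along_axis(f, 1, x)
        better = x_val < p_val                         # update personal bests
        p[better], p_val[better] = x[better], x_val[better]
        if p_val.min() < g_val:                        # update swarm best
            g_val = p_val.min()
            g = p[np.argmin(p_val)].copy()
    return g, g_val

# Usage: minimize the sphere function; the optimum is at the origin
best_x, best_f = pso(lambda x: float(np.sum(x ** 2)), [-5, -5], [5, 5])
```

In a polymer-discovery setting, f would instead score a candidate's fingerprint vector with a trained property model, and each particle would be a point in fingerprint space.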

New Polymer Generation for Copolymers
Most papers in the field of ML-assisted polymer discovery focus only on homopolymers; they can therefore modify the chemical structures of polymer networks by modifying the chemical structure of the corresponding single monomer. However, it should be understood that most polymers are synthesized by combining different monomers or crosslinkers, and hence the single-monomer method is inappropriate for them. So far, only a few papers in this field have conducted studies on copolymers. [46,49] In those studies, a method called the "monomer combination method (MCM)" was proposed. The method can be divided into two steps (see Figure 33). First, monomers and crosslinkers were collected from the literature. Second, different monomers and crosslinkers were combined into distinct arrangements. It should be mentioned that the combinations must be in accordance with the laws of chemical reactivity; for example, monomers with the functional group C═C cannot react with epoxides, so such pairings should be avoided in the arrangement. With this method, the authors discovered 14 new TSMPs with higher recovery stress [49] and 5 new UV-curable TSMPs with desired properties. [46] For linear copolymers, MCM can also be used. Shmilovich et al. presented a similar traversal combination method to constitute new copolymers. [96] In their study, they leveraged all possible permutations of the 20 natural amino acids to generate new copolymers in the DXXX-OPV3-XXXD family, which constitutes a chemical space of 20^3 = 8000 samples. They then implemented GPR, VAE, and active learning to search for the best candidates. Wu et al. also applied MCM in their study. [73] Specifically, they summarized commonly used building blocks (fragments) from the database and then constituted new linear copolymers with them. Jablonka et al. defined four different monomer building blocks (beads) and then leveraged them to produce different linear copolymer combinations. [71] Webb et al. leveraged backbone beads and pendant beads constituting 10 building blocks (constitutional units) and then leveraged them to generate new linear copolymers [70] (see Figure 34). In another work, Aldeghi and Coley leveraged a different method, namely, the connection possibility method (CPM).

www.advancedsciencenews.com www.advintellsyst.com

Figure 33. It should be noted that only the monomers and crosslinkers with functional groups that can react with each other are arranged into combinations, such as the combination of two C═C monomers, three C═C monomers, one C═C monomer and one thiol crosslinker, or one epoxy monomer and one imine crosslinker. Reproduced with permission. [49] Copyright 2021, Elsevier.

Figure 32. Three defined actions for RL. [116] a) Atom addition: atoms from a set of elements ε = {C, O} are added at the possible locations. b) Bond addition: a bond addition action is performed between two atoms with free valence (not counting implicit hydrogens). c) Bond removal: the bond order of an existing bond is reduced. It is open access.
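The combination step of MCM can be sketched with a toy monomer library: enumerate candidate combinations and keep only those whose functional groups can mutually react. All names, functional groups, and compatibility rules below are hypothetical placeholders, not data from refs. [46,49].

```python
from itertools import combinations

# Hypothetical mini-library: (name, reactive functional group)
monomers = [("MMA", "C=C"), ("BA", "C=C"), ("St", "C=C"), ("DGEBA", "epoxy")]
crosslinkers = [("EDT", "thiol"), ("IPD", "amine")]

# Hypothetical compatibility rules: which sets of groups can co-react
compatible = {
    frozenset({"C=C"}),             # radical copolymerization
    frozenset({"C=C", "thiol"}),    # thiol-ene reaction
    frozenset({"epoxy", "amine"}),  # epoxy-amine curing
}

def valid(parts):
    """Keep a combination only if its functional groups can co-react."""
    return frozenset(group for _, group in parts) in compatible

candidates = []
# two- and three-monomer combinations
for k in (2, 3):
    candidates += [c for c in combinations(monomers, k) if valid(c)]
# one monomer paired with one crosslinker
for m in monomers:
    for x in crosslinkers:
        if valid((m, x)):
            candidates.append((m, x))
```

Each surviving combination would then be fingerprinted and fed to the trained property model; invalid pairings (e.g., a C═C monomer with an epoxide) are filtered out before any prediction is attempted.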
They introduced connection possibilities for atoms to quantify the different topological structures of copolymers [73] (see Figure 11). In summary, several frameworks for discovering new polymers have been discussed; their advantages and disadvantages are summarized in Table 6.

Advantages of MCM:
1) It can deal with both linear copolymers and copolymer networks.
2) It can clearly describe linear copolymers with nonstochastic structures.
3) All the monomers come from reality, so the generated polymers can be comparatively easily synthesized.

Disadvantages of MCM:
1) It cannot discover polymer networks with topologies that do not yet exist in reality.
2) It cannot completely describe the topology of a new polymer network.

Challenges and Outlooks
With ML, researchers have discovered polymers that never existed before and have synthesized some of them, which proves the great potential and promising outlook of ML. However, researchers still face several main challenges. First, current fingerprinting cannot extract all the effective features from polymer networks. To date, most studies in this field still mainly focus on the topology of a single monomer or crosslinker. However, a polymer network can generate new topologies that never show up in its monomers or crosslinkers. For example, as shown in Figure 35, EPON-IPD [118] not only keeps the main features of the monomer (EPON) and the crosslinker (IPD) (highlighted by red circles), but also generates new tetrahedron-like unit cells (highlighted by blue circles). This new feature cannot be described by linear notations for monomers (or crosslinkers); such topologies should be extracted with other feature descriptors.
Second, ML models for polymers still need to be combined with physical and chemical principles to achieve better predictions. Currently, the loss functions in ML models for polymers only measure the direct differences between the predictions and the ground truth of polymer behaviors, which is still brute-force, like fishing for a needle in the ocean. The convergence of the loss is relatively slow because of the loose constraints. For example, for the protein AI program AlphaFold 2, with about 170 000 protein structures in the training dataset, the prediction accuracy of the protein structures is only about 35%. [118] On the contrary, if we combine ML with physical and chemical laws, the loss function will have more constraints, which would contribute to a quicker search for optimum points and require fewer datapoints. In recent years, some researchers have conducted studies in this direction, [119-121] wherein loss functions with physical constraints (e.g., an added force equilibrium equation) exhibit much faster convergence than common neural networks. Meanwhile, theoretically, all polymer behaviors are driven by first principles and Newtonian mechanics. Therefore, adding the necessary physical laws could lead to more "intelligence" in ML models and should be an important direction for the ML-assisted discovery of polymers.
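The idea of constraining the loss with a physical law can be sketched as a data term plus a weighted residual of the governing equation. The zero-sum "equilibrium" operator and the weighting λ below are purely illustrative stand-ins for a real force equilibrium equation, and all numbers are invented.

```python
import numpy as np

def data_loss(pred, truth):
    """Plain MSE between prediction and ground truth."""
    return np.mean((pred - truth) ** 2)

def physics_residual(pred, physics_op):
    """Penalty for violating a physical law written as physics_op(pred) = 0."""
    return np.mean(physics_op(pred) ** 2)

def total_loss(pred, truth, physics_op, lam=1.0):
    """Physics-informed loss: data mismatch + weighted constraint residual."""
    return data_loss(pred, truth) + lam * physics_residual(pred, physics_op)

# Toy example: pretend equilibrium requires the predictions to sum to zero
pred = np.array([1.0, 2.0, -2.5])
truth = np.array([1.0, 2.0, -3.0])
op = lambda u: np.array([np.sum(u)])   # residual of the invented constraint
```

A prediction that fits the data but violates the constraint is penalized by the second term, so gradient descent is steered toward the physically admissible region, which is the mechanism behind the faster convergence reported in refs. [119-121].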
Third, currently available datapoints still mainly cover linear polymers. The datapoints in the databases are not enough to design some desired polymers, especially crosslinked polymer networks. Moreover, for some new polymers, such as SMPs and vitrimers, no database is available at all. On the other hand, because polymer networks possess relatively complex topological and morphological structures, at least tens of thousands of datapoints could be needed for a good ML model. For example, for the well-known AI program AlphaFold 2, researchers collected 170 000 protein structures in the training dataset. [119] Therefore, one suggestion would be to establish an online platform for QSPR descriptors that enables researchers to submit and share data, which would significantly increase the volume of the training dataset. We expect that future ML models for crosslinked polymers could then reach a performance similar to that of AlphaFold 2.
Fourth, more comprehensive data are needed to produce better ML models for polymers. As is well known, data matching a Gaussian distribution generally yield better ML predictions. However, if we choose to collect data from the literature, one should note that almost all reported results are inevitably biased and do not meet the Gaussian distribution requirement. That is, although experimental results are supposed to follow a Gaussian distribution, almost all research works tend to report selected results, which means that only particular polymers were reported. For example, papers on SMPs report only SMPs with large recovery stress or high recovery efficiency; SMPs with moderate recovery stress or a moderate shape recovery ratio were involuntarily ignored. Therefore, it is suggested that a uniform online platform be established to collect these ignored data. With the addition of these ignored data, we expect that ML predictions will become more accurate and reliable.

Figure 35. a) EPON reacts with b) IPD to form a c) EPON-IPD polymer network. The red circles highlight topological structures from the monomer and crosslinker, and the blue circles highlight new topological structures. Reproduced with permission. [74] Copyright 2022, Elsevier.

Fifth, the problems of ML itself should be further resolved. In other words, although ML is believed to be "intelligent," it is still controversial whether it is really intelligent or not. As is well known, ML can only look for a local minimum within the provided dataset; intrinsically, its learning is, ironically, "superficial, not conceptual." [122] For example, the famous deep neural network Gato (developed by DeepMind) "has no idea what is actually in the picture as opposed to what is typical of roughly similar images." [122] In other words, when ML is fitting a model, the computer does not really understand whether it is designing a polymer or playing chess; hence, it can get stuck somewhere. In this sense, it seems very difficult for computers to find a global optimum or global minimum. On the contrary, although traditional approaches are slow, scientists understand the underlying mechanisms behind the phenomena; hence, it is possible that they can find the global optimum in the long run. From this point of view, we still have a long way to go before computers truly understand the world around them.
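The reporting bias described in the fourth challenge can be illustrated with a toy simulation: if experimental outcomes follow a Gaussian distribution but only the most impressive results are published, the published mean is systematically inflated. The distribution parameters and the publication cutoff below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose "true" recovery stresses across many labs follow a Gaussian
# (mean 5.0, std 1.0; units and numbers purely illustrative)
true_stress = rng.normal(loc=5.0, scale=1.0, size=10_000)

# Reporting bias: only results above a headline-worthy cutoff get published
reported = true_stress[true_stress > 6.0]

# The published sample mean overestimates the true mean
bias = reported.mean() - true_stress.mean()
```

A model trained only on `reported` never sees the moderate results and therefore learns a distorted picture of the property landscape, which is why collecting the "ignored" data matters.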
Although the field of ML-assisted polymer design is still in its infancy, ML has exhibited the ability to reach some otherwise unreachable areas in this field. With further development in fingerprinting, expansion of the databases, and improvement of ML models, it is promising to synthesize de novo polymers by overcoming the cognitive blind spots of human beings.