Molecular design and performance improvement in organic solar cells guided by high-throughput screening and machine learning

Over the past two decades, organic photovoltaics (OPVs), with their unique advantages of low cost and flexibility, have seen significant development opportunities, and the official world record for the power conversion efficiency (PCE) of organic solar cells (OSCs) has reached 17.3%. Traditionally, efficiency breakthroughs have required a constant input of intensive labor and time. Artificial intelligence, as a rising interdisciplinary field, is bringing a revolution in research methods. In this review, we introduce a state-of-the-art theoretical methodology that combines high-throughput screening and machine learning (ML) to accelerate the discovery of high-efficiency OSC materials. We present key details, rules, and experience in database construction, selection of molecular features, fast-screening calculations, model training, and the prediction capabilities of the resulting models. In addition, three typical ML frameworks are summarized to reveal the structure-property-efficiency relationship, suggesting that this theoretical methodology can train powerful models from molecular configurations and theoretical calculations alone for molecular design and efficiency improvement.


INTRODUCTION
FIGURE 1 A general framework of ML application in OSCs to unfold the relationship between molecular structures, microscopic properties, and device efficiency
… ∼1% to ∼18%. [2][3][4][5][6] Compared with conventional silicon-based solar cells and emerging perovskite solar cells, OSCs offer unique advantages of low cost, flexibility, translucence, and light weight, [7][8][9] but are still overshadowed by their low PCE (below 20%). Thus, there is still room for efficiency development and performance optimization in OSCs.
To approach the theoretical limit of OSCs, strenuous research efforts have been made in three directions: (1) designing and synthesizing novel donor and acceptor materials, (2) optimizing fabrication conditions and device structures, and (3) exploring the mechanism of device operation. [10] The first two are trial-and-error procedures that require expensive materials, long time scales, and a large manpower input, whereas the last studies the physicochemical properties of materials and simulates device operation using only multiscale computational methods, such as Density Functional Theory (DFT), Molecular Dynamics (MD), and Monte Carlo (MC), [11][12][13][14][15][16][17] which provide helpful guidelines and insight for designing novel molecules. In principle, the final device performance depends on its materials, and the function of manufacturing technology and device structures is to let the materials play their role to the full. Understanding the structure-property relationship of OSC materials is therefore the crucial step in optimizing and improving OSC performance.
With the development of computer science and mathematical statistics, ML, the study of how computers can simulate human learning activities, has become a rising artificial intelligence method and attracts wide attention. The aim of ML is to help people understand and extract the knowledge behind big data across a wide range of fields, for example, biology, [18] physics, [19,20] and materials science. [21][22][23][24][25] In the field of OSCs, the combination of ML and high-throughput screening has been identified as an effective method to deepen our understanding of the as yet unknown structure-property-efficiency relationship. [26][27][28] Herein, we organize recent successful ML applications in OSCs into a general framework (Figure 1), covering the quantitative representation of molecular structure, microscopic properties, and device efficiency in Section 2, a brief introduction to ML models in Section 3, and a thorough exploration of how ML has been used to unfold the connections among them in Section 4. In the last section, we give a short conclusion and some perspectives on future development trends of ML in OSCs.

MOLECULAR FEATURES AND DESCRIPTORS
To build efficient ML models that capture the structure-property-efficiency relationship in OSCs, the general procedure follows four steps: (1) database construction from high-throughput materials screening by calculations or experiments; (2) extraction of quantitative and representative molecular information, namely molecular feature/descriptor selection; (3) model training and evaluation under specified ML algorithms to build connections; (4) model prediction for exploring unknown materials or properties.
Once enough data are available, the first step toward ideal ML models is to find suitable quantitative features and descriptors to express various molecular properties. An optimal representation should be beneficial to model construction and contain enough structural information.
More importantly, the descriptors should be invariant to rotations and translations and keep a non-linear correlation with each other. Moreover, the generation of descriptors should be convenient and user-friendly.

Representation of molecular structures
When preparing a molecule for theoretical simulation, it is natural to consider the elements and spatial coordinates of its atoms, namely the geometric configuration, as a descriptor of the molecular structure. Although the structural information can be described clearly in the form of coordinates, coordinates are ill-suited as a molecular representation in ML because obtaining reasonable XYZ coordinates at an acceptable computational cost remains difficult.
The Coulomb matrix (CM) representation has been exploited for describing the 3D geometry of the molecular structure in the form of vectors. [29] For atom i and atom j in a molecule, the element of the matrix can be expressed by the equation below:
$$
M_{ij} =
\begin{cases}
0.5\, Z_i^{2.4}, & i = j \\
\dfrac{Z_i Z_j}{\left| \mathbf{R}_i - \mathbf{R}_j \right|}, & i \neq j
\end{cases}
$$
where $Z_i$ is the nuclear charge of atom i, and $\mathbf{R}_i$ and $\mathbf{R}_j$ represent the Cartesian position vectors of atom i and atom j, respectively. According to the equation, the Coulombic interaction between atom i and atom j depends on their Cartesian coordinates and nuclear charges. Once a 3D conformer is encoded as a vector of fixed length, the molecular structure is converted into a language that the machine can recognize, which is advantageous for subsequent applications.
Fingerprints, which are regarded as graph-based representations, have been widely used in the field of pharmaceutical chemistry, [30,31] and have recently shown their potential for describing molecular structures in ML. [32,33] This may be attributed to the easy generation of fingerprints from Simplified Molecular Input Line Entry System (SMILES) [34] strings with common packages such as RDKit [35] and Open Babel. [36] As shown in Figure 1, rapid conversion from structure to machine language makes it possible to generate novel molecules automatically, allowing for large-scale screening of potential materials. Many kinds of fingerprints serve as molecular descriptors, including extended connectivity fingerprints (ECFPs), [37] Molecular ACCess System (MACCS) keys, and Morgan circular fingerprints. [38] Morgan circular fingerprints are traditional representations of chemical structures that use vectors to record the environment of each atom within a molecule. Derived from Morgan circular fingerprints, ECFPs were developed for structure-activity modeling and optimized to save computational effort during fingerprint generation. Unlike the former two, MACCS is a substructure-key-based fingerprint. In MACCS, atom properties, bond properties, and atomic neighborhoods are described by one or more bits in a keyset, and each bit corresponds to one piece of structural information: a set bit indicates its presence and an unset bit its absence.
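As a minimal sketch of the Coulomb matrix encoding described above, the following Python snippet builds the matrix in its commonly used form (diagonal $0.5 Z_i^{2.4}$, off-diagonal $Z_i Z_j / |\mathbf{R}_i - \mathbf{R}_j|$) and flattens it into a fixed-length vector. The function names and the toy geometry are assumptions made for illustration; the coordinates are not an optimized structure, and no unit convention is enforced.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a list of rows.

    charges: nuclear charges Z_i; coords: (x, y, z) positions R_i.
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term of the commonly used Coulomb matrix form.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Pairwise Coulomb-like term between atoms i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


def flatten_upper(matrix):
    """Flatten the upper triangle (with diagonal) into a fixed-length vector."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


# Toy three-atom geometry (illustrative values only).
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
# The same molecule translated in space.
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in coords]

cm = coulomb_matrix(charges, coords)
cm_shifted = coulomb_matrix(charges, shifted)
```

Because the matrix depends only on nuclear charges and interatomic distances, `cm` and `cm_shifted` agree up to floating-point rounding, which illustrates the translational invariance required of a descriptor. The matrix still depends on the ordering of the atoms, which this sketch does not address.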

Representation of microscopic properties
The operation mechanism for OSCs includes four fundamental steps: [9,39] (1) upon absorbing solar photons, photoinduced excitons (bound electron-hole pairs) are generated in donors; (2) then excitons migrate towards the interface between donors and acceptors and (3) dissociate into free electrons and holes via the interfacial driving force; (4) holes and electrons transport to their corresponding electrodes, thereby producing the electrical power.
The key macroscopic metric for evaluating the performance of OSCs is the PCE, which is calculated by the following formula:
$$
\mathrm{PCE} = \frac{V_{\mathrm{OC}} \, J_{\mathrm{SC}} \, \mathrm{FF}}{P_{\mathrm{in}}}
$$
where $V_{\mathrm{OC}}$ represents the open-circuit voltage, $J_{\mathrm{SC}}$ the short-circuit current density, FF the fill factor, and $P_{\mathrm{in}}$ the total incident optical power. [40] In theoretical studies, FF and $P_{\mathrm{in}}$ are usually set to constant values. Thus, the relevant microscopic properties can be grouped into the following two aspects:

$V_{\mathrm{OC}}$-relevant microscopic features
$V_{\mathrm{OC}}$ is directly influenced by the highest occupied molecular orbital (HOMO) of the donor and the lowest unoccupied molecular orbital (LUMO) of the acceptor, so the frontier molecular orbitals (FMOs) play an important role in device performance. [41][42][43]

$J_{\mathrm{SC}}$-relevant microscopic features
Photoexcitation is a complex process affected by light-absorbing ability and by intramolecular and intermolecular charge transfer. [42][43][44] $J_{\mathrm{SC}}$ is a photophysical parameter that depends on this process. Several concrete indices are closely related to $J_{\mathrm{SC}}$, including maximum absorption peaks, excitation energies, electron affinities (EAs), ionization potentials (IPs), and the reorganization energies of electrons ($\lambda_e$) and holes ($\lambda_h$). [45,46]
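As a worked illustration of how the device parameters combine, the sketch below evaluates PCE = V_OC × J_SC × FF / P_in. The function name, the input values, and the default incident power of 100 mW/cm² (the usual standard-illumination value) are assumptions for illustration and are not taken from any study reviewed here.

```python
def power_conversion_efficiency(v_oc, j_sc, ff, p_in=100.0):
    """Return the PCE in percent.

    v_oc: open-circuit voltage in V
    j_sc: short-circuit current density in mA/cm^2
    ff:   fill factor, dimensionless, between 0 and 1
    p_in: incident optical power in mW/cm^2 (assumed default: 100)
    """
    if not 0.0 <= ff <= 1.0:
        raise ValueError("fill factor must lie between 0 and 1")
    return 100.0 * v_oc * j_sc * ff / p_in


# Illustrative device parameters (hypothetical values).
pce = power_conversion_efficiency(v_oc=0.85, j_sc=25.0, ff=0.75)
```

With FF and P_in held constant, as is common in theoretical studies, the predicted PCE scales linearly with both V_OC and J_SC, which is why the microscopic features are grouped by these two quantities.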

ML models
Advanced mathematical algorithms are the kernel of ML applications, and choosing a suitable ML algorithm improves the final prediction performance. Depending on whether known output data are available, ML models are mainly divided into supervised learning and unsupervised learning. Supervised learning has matured, with many algorithms that have been widely used in applying ML to OSCs. The k-Nearest Neighbors (kNN) algorithm [47] is considered the simplest ML model because it makes predictions only by searching for the nearest neighbors in the training set. This algorithm is easy to understand and its model is quick to build, but prediction becomes slow when the dataset is very large, and it does not work well if a dataset contains too many features. The Decision Tree (DT) [48] is another popular algorithm that uses a tree-like model to represent the mapping between target and input variables. Since this algorithm does not require scaling of the data and is easy to visualize, it is often used to handle datasets with a mix of binary and continuous features. However, overfitting and poor generalization performance are two main weaknesses of DT, even when pre-pruning is applied. To overcome these obstacles, two ensemble methods built on DT, Random Forest (RF) [49] and Gradient Boosted Decision Tree (GBDT), [50] have been proposed. RF injects randomness into tree building so that every tree is different, while GBDT builds trees sequentially, each one trying to correct the mistakes of the previous one, which gives it the advantages of regularization and support for multiple loss functions. Neural Network (NN) [51] is a class of algorithms whose framework resembles biological neurons. Every node represents a specific output function, while a weight between two nodes serves as their connection. Artificial neural network (ANN), multilayer perceptron (MLP) neural network, and deep tensor neural network (DTNN) belong to this class. Other algorithms such as Linear Regression, Ridge Regression, and Support Vector Machines (SVM) have also been applied to train ML models in OSCs.
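To make the simplest of these algorithms concrete, the following is a minimal sketch of kNN regression using Euclidean distance. The function name and the toy data are hypothetical; in practice the feature vectors would be molecular descriptors and the targets a property or PCE.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the training-set size")
    # Every prediction scans the whole training set, which is why
    # kNN prediction becomes slow for very large datasets.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Toy two-feature dataset (hypothetical descriptor values and targets).
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

pred_k3 = knn_predict(train_x, train_y, (0.1, 0.1), k=3)
pred_k1 = knn_predict(train_x, train_y, (0.1, 0.1), k=1)
```

There is no training step beyond storing the data, which is why the model is quick to establish. The sketch applies no feature scaling, which distance-based methods generally need when descriptors have different units.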
In contrast, unsupervised learning applies when not enough known output data are available but we still want to extract knowledge from the input data. A common application of unsupervised learning is clustering, [52] which aims to split a dataset into distinct groups of similar items. DBSCAN, k-Means, and Agglomerative Clustering are three powerful clustering algorithms. Dataset transformation is another application, whose goal is to create a new representation of a dataset or to explore the components that "make up" the dataset. Common algorithms for this purpose include principal component analysis, non-negative matrix factorization, and manifold learning algorithms. Unsupervised learning is a helpful tool for better understanding the properties of data. However, there have so far been few successful attempts to adopt unsupervised learning in OSCs, and its applications in this field remain to be explored.

Model evaluation indices
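Cross-validation [53] is widely used to evaluate the generalization performance of ML models. Mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination ($R^2$) are three common evaluation indices, whose mathematical definitions are listed below, respectively:
$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|
$$
$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
$$
$$
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the target value, $\bar{y}$ is the mean value of $y$ in the sample, and $n$ is the number of samples. Generally, lower MAE and RMSE indicate higher model accuracy, while $R^2$ represents how well the regression model fits the target data, and a higher coefficient indicates a better fit.
Besides, the Pearson correlation coefficient ($r$) is a metric for evaluating the linear correlation between two variables ($x$ and $y$), which is determined by:
$$
r = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right)}{\sqrt{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}}
$$
where $\bar{x}$ and $\bar{y}$ represent the mean values of $x$ and $y$, respectively. The coefficient $r$ measures how well two variables are related. In ML model evaluation, it is often applied to reflect the strength of the relationship between the predicted value and the target value, and a higher value of $r$ means better model performance.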
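The evaluation indices above can be sketched in a few lines of Python using their standard textbook definitions. The function names and the toy target and prediction values are assumptions for illustration only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient between two sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical target values and model predictions.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]

scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r_squared(y_true, y_pred),
    "r": pearson_r(y_true, y_pred),
}
```

Note that $R^2$ and $r$ are different quantities: $R^2$ penalizes systematic offsets between prediction and target, whereas $r$ only measures linear association, so the two should not be compared directly across studies.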

Relationship of structure and property
Although electrical and optical properties can be calculated by DFT and time-dependent DFT (TD-DFT), optimizing molecular structures is a time-consuming process. [54,55] Hence, a very high computational cost is still required when high-throughput screening is used for molecular design. ML can accelerate this process to a large extent.
As listed in Table 1, many ML techniques have been applied to predict the electrical properties of OSC materials from their structures. Xin et al. [56] used an ANN model to predict the energy gaps of porphyrins and obtained an RMSE lower than 0.06 eV, while Shen et al. [57] used an ANN model to predict the absorption maxima ($\lambda_{\max}$) of dye-sensitized solar cells (DSSCs) with a squared correlation coefficient ($R^2$) of up to 0.991. A supervised GBDT model was built by Jelfs et al. [58] to predict the gaps of new donor-acceptor oligomers. They then generated about 1700 potential candidates whose energy gap is …
TABLE 1 Summary of properties predicted from structures by ML techniques
The parameters related to charge transfer have a great influence on $J_{\mathrm{SC}}$; however, they are complex to calculate even with quantum-chemical (QC) methods, and the computational cost is too large for high-throughput calculation. [66] Notably, with the help of ML, these properties can be predicted and the computation time reduced.
FIGURE 2 … [70] Copyright (2019), American Chemical Society
As shown in Figure 2A, Atahan-Evrenk and Atalay [67] predicted the intramolecular reorganization energy (RE) using only simple representations of the molecular structure. For dataset construction, over 5000 molecules were generated from seven building blocks and their REs were calculated. After testing different models and descriptors, they found that the best model, with an $R^2$ of 0.92 and an MAE of ∼12 meV, was obtained by the deep neural network (DNN) method based on Morgan circular fingerprints. Traditionally, the reorganization energy is calculated through quantum chemistry and Marcus theory. [68,69] Their results offer a new route to accurate reorganization energies that bypasses complex calculations, and demonstrate that charge transport parameters can be predicted from molecular structures. As shown in Figure 2B, a kernel ridge regression (KRR)-based ML approach for evaluating electronic coupling was reported by Hsu et al. [70] Ethylene dimers extracted from the disordered condensed phase by MD simulation were chosen as the dataset, and their electronic couplings were calculated by QC methods. With the Coulomb matrix representation, the best model achieved 98% accuracy and a low MAE of 3.5 meV. This ML approach was noted to reduce the computational cost by a factor of 10 to $10^4$ compared with QC calculation.
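To show why the reorganization energy and the electronic coupling matter together, the sketch below evaluates the standard semiclassical Marcus rate expression, $k = \frac{2\pi}{\hbar} |V|^2 \frac{1}{\sqrt{4\pi\lambda k_B T}} \exp\left[-\frac{(\Delta G + \lambda)^2}{4\lambda k_B T}\right]$, in which both quantities appear. This is a textbook formula rather than the method of the studies cited above, and the function name and input values are hypothetical.

```python
import math

HBAR_EV_S = 6.582119569e-16   # reduced Planck constant in eV*s
KB_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def marcus_rate(coupling_ev, reorg_ev, delta_g_ev=0.0, temperature_k=300.0):
    """Semiclassical Marcus charge-transfer rate in 1/s.

    coupling_ev: electronic coupling V in eV
    reorg_ev:    reorganization energy lambda in eV
    delta_g_ev:  driving force Delta G in eV (0 for self-exchange)
    """
    kt = KB_EV_PER_K * temperature_k
    prefactor = (2.0 * math.pi / HBAR_EV_S) * coupling_ev ** 2
    density = 1.0 / math.sqrt(4.0 * math.pi * reorg_ev * kt)
    barrier = (delta_g_ev + reorg_ev) ** 2 / (4.0 * reorg_ev * kt)
    return prefactor * density * math.exp(-barrier)


# Hypothetical inputs: 10 meV coupling, 0.20 eV reorganization energy.
rate = marcus_rate(coupling_ev=0.010, reorg_ev=0.20)
rate_double_coupling = marcus_rate(coupling_ev=0.020, reorg_ev=0.20)
rate_larger_reorg = marcus_rate(coupling_ev=0.010, reorg_ev=0.40)
```

The rate grows with the square of the coupling and, for zero driving force, falls as the reorganization energy increases, so errors in either ML-predicted quantity propagate directly into any transport estimate built on them.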

Relationship of property and efficiency
As discussed above, the high predictive power of ML models has paved the way for learning molecular properties from structural information. Nevertheless, the efficiency of a molecule is still hard to predict before the experiment, since it cannot be deduced directly from molecular properties. This may be attributed to the complicated photoexcitation processes and photophysical behavior, which are affected by the donor/acceptor interface morphology and by device fabrication conditions such as additives and weight ratio. [71][72][73][74] Therefore, it is necessary to build the relationship between molecular properties and PCEs, thus reducing experimental cost.
Scharber et al. [75] proposed design rules for donors derived from molecular properties to unravel the correlations between the PCE, the bandgap, and the LUMO of the donor. The Scharber model has been commonly used in the computer-aided design of new molecules because it is easy to compute. However, Troisi et al. [76] reported that photovoltaic parameters including $V_{\mathrm{OC}}$, $J_{\mathrm{SC}}$, and PCE are predicted very poorly by the model, as the Pearson's coefficients between experimental and calculated properties are all smaller than 0.4. The model is therefore not accurate enough for predicting the efficiencies of unknown materials. Moreover, the Scharber model was designed only for the PC61BM acceptor, and whether it can be extended to other acceptors has not been validated. Hence, a general model for predicting PCEs from molecular properties is urgently needed, and ML has been recognized as a promising approach.
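For orientation, the open-circuit-voltage part of the Scharber model is usually written as V_OC = (1/e)(|E_HOMO(donor)| − |E_LUMO(acceptor)|) − 0.3 V, where 0.3 V is an empirical loss term. The sketch below evaluates only this relation; the function name and the orbital energies are illustrative assumptions, and the current and fill-factor assumptions of the full model are not reproduced here.

```python
def scharber_voc(homo_donor_ev, lumo_acceptor_ev, loss_v=0.3):
    """Estimate V_OC in volts from frontier orbital energies in eV.

    homo_donor_ev:    donor HOMO energy (negative, relative to vacuum)
    lumo_acceptor_ev: acceptor LUMO energy (negative, relative to vacuum)
    loss_v:           empirical voltage loss (0.3 V in the Scharber relation)
    """
    # Energies in eV divided by the elementary charge give volts directly.
    return abs(homo_donor_ev) - abs(lumo_acceptor_ev) - loss_v


# Illustrative orbital energies (hypothetical donor, fullerene-like acceptor).
voc = scharber_voc(homo_donor_ev=-5.4, lumo_acceptor_ev=-4.3)
```

Because the estimate uses only two orbital energies, it is cheap enough for large-scale screening, which explains its popularity; the same simplicity is consistent with the weak correlations with experiment noted above.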
Wen et al. [77] implemented a new approach combining ML and virtual screening to discover potential organic dyes, as illustrated in Figure 3A. First, the molecular properties of the DSSC dyes were calculated by quantum chemistry. Next, using a heterogeneous model consisting of gradient boosted regression trees (GBRT), SVM, and ANN, two robust ML models were trained with different numbers of input features to predict PCEs from the initial molecular descriptors. To suit large-scale screening, the simple model, with only nine easily available features, was selected for predicting the PCEs of more than 10,000 molecules. The top 500 candidates were then verified using the complex model because of its higher accuracy. Finally, eight DSSCs with excellent efficiency were presented for further exploration.
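The two-stage funnel described above (a cheap model over the full pool, a costlier model over the shortlist) can be sketched as follows. The scoring functions here are random stand-ins for trained models, and all names are hypothetical; only the pool size, shortlist size, and final count follow the numbers quoted in the text.

```python
import random


def two_stage_screen(candidates, cheap_score, accurate_score,
                     shortlist=500, final=8):
    """Rank all candidates with a cheap model, then re-rank the shortlist."""
    # Stage 1: the inexpensive model scores every candidate.
    ranked = sorted(candidates, key=cheap_score, reverse=True)[:shortlist]
    # Stage 2: the costlier, more accurate model scores only the shortlist.
    return sorted(ranked, key=accurate_score, reverse=True)[:final]


# Hypothetical pool: random numbers stand in for model-predicted PCEs.
random.seed(0)
pool = [
    {"id": i, "simple": random.random(), "full": random.random()}
    for i in range(10_000)
]

hits = two_stage_screen(
    pool,
    cheap_score=lambda m: m["simple"],
    accurate_score=lambda m: m["full"],
)
```

The design trades a small risk of discarding good candidates in the first stage for a large reduction in how often the expensive model must be evaluated.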
Similarly, a series of impressive studies was reported by Sahu et al., [78] who used ML to bring theoretically predicted PCEs close to the experimental values. After constructing a database of 280 molecules from experiments, they chose 13 quantum-mechanical parameters related to important microscopic properties as the descriptor input. As shown in Figure 3B, the GBRT method gave the best outcome, with a good Pearson's coefficient of 0.79. Later, they performed large-scale screening to search for new donor molecules for OSCs. [79] They simplified the above model so that it contains only seven descriptors that can be collected at marginal computational cost. With this modified model, they screened over 10,000 candidate molecules composed of 32 unique building blocks, and then discussed the importance of these moieties and their arrangements in Figure 3C. Finally, 126 potential molecules were proposed for in-depth experimental exploration, suitable building blocks were identified, and useful design principles were provided. Additionally, their group [80] used RF and GBRT models to predict $V_{\mathrm{OC}}$, $J_{\mathrm{SC}}$, and FF, aiming to unravel the relationship between molecular properties and device parameters. Hutchison et al. [81] reported the computational design and selection of OSCs. They used a genetic algorithm to predict the potential efficiency of conjugated polymers and screened 90,000 copolymers to search for high-efficiency targets.
The advances above all concern fullerene and its derivatives; in recent years, however, non-fullerene acceptors (NFAs) have attracted great interest for their low-cost synthesis, tunable energy levels, and superior light absorption. [82,83] ML can also be applied to speed up the virtual screening of candidates and provide helpful guidelines for the synthesis of new NFAs.
Lee [84] built an ML model to predict the PCEs of NFAs from their electronic descriptors. In this study, a dataset of more than 100 donor/non-fullerene acceptor pairs was constructed, and the HOMO, LUMO, and bandgap of the donors and acceptors were taken as feature inputs. Trained with the RF model, the approach reached $R^2$ = 0.80 on the test set, making it a credible method for developing the next generation of NFAs. Su et al. [85] designed a class of novel NFAs based on bistricyclic aromatic ene (BAE) derivative cores and investigated the qualitative parameters associated with PCEs. Notably, they used ML to predict the PCEs of these acceptors when matched with different donors and found three good candidates. This research, from conformation effects to PCE prediction, enhanced the understanding of how conformations of the central BAE influence molecular properties.

Relationship of structure and efficiency
For the rapid screening of potential candidates, the desired aim is to develop a convenient scheme for predicting performance and to establish the relationship between chemical structures and efficiencies.
A conventional way is to implement a quantitative structure-property relationship (QSPR) model calibrated from empirical data. Methenitis et al. [86] and Mustafin et al. [87] surveyed QSPR models for fullerene (C60) derivatives as electron acceptors of polymer solar cells using the genetic algorithm (GA) method and the self-consistent regression method, respectively. Alsberg et al. [88,89] employed linear multivariate methods and an evolutionary strategy to develop two QSPR models for phenothiazine-based DSSCs. However, these QSPR models are proposed for specific structures and cannot be generalized to other chemical spaces. ML offers a way around this limitation, using the descriptors of molecular structures discussed above. As shown in Figure 4A, Min et al. [90] presented a paradigm for screening combinations of polymer donors and non-fullerene acceptors. GBRT and RF models were noted to achieve reasonable prediction accuracies without preliminary knowledge of material properties. Next, 32,076,000 D/A pairs were screened using the models, and six of them were characterized experimentally to test the reliability of the models in practical application. The good consistency between predicted values and experimental outcomes demonstrated that the workflow can be used to guide the compatible design of polymer-NFA-based OSC devices.
FIGURE 3 A, The overall framework of the work in Ref [65]. Reproduced with permission. [77] Copyright (2020), Wiley-VCH. B, Theoretically predicted versus experimental PCE for the testing set (a) and all data points using the leave-one-out cross-validation technique for the GB model …
Sun et al. [91] studied an efficient approach to developing new donor materials by combining experimental analysis with ML prediction. In this scheme, they were able to classify molecules into high performance (higher than 3.00%) and low performance (0% to 2.99%) from their chemical structures. After trying different expressions as feature inputs, they showed in Figure 4B that using Hybridization fingerprints for molecular description reached the highest accuracy of 81.76%. The practicality of the scheme was examined by experimental verification of ten new donors (Figure 4B). Additionally, as shown in Figure 4C, their group [92] reported a deep learning approach to quickly estimate the performance of donor materials directly from images of their molecular structures. The CNN model was found to learn chemical knowledge extracted from the images of molecular structures and to use it for predicting efficiency values. Chen [93] confirmed that SVM and RF models can be applied to predict the PCEs of polymer OSCs from the fingerprints of the donor structures alone. Peng and Zhao [94] found that the performance of non-fullerene acceptors can be predicted and analyzed by CNN models.
To further improve the accuracy of ML models, some researchers have attempted to use molecular structures together with molecular properties as feature inputs, as summarized in Table 2. To facilitate the screening of polymer-fullerene OSCs, Saeki et al. [95] used experimental parameters and molecular structures as descriptors. Their results showed that the RF model gave good accuracy and was acceptable for predicting the performance of a conjugated polymer before it was synthesized. Zhao et al. [96] carried out a comprehensive investigation of the effect of different features in ML applications. In their research, various combinations of physical descriptors and structural descriptors were explored to maximize the predictive power of the model. They indicated that using larger and more varied databases for training is accompanied by higher accuracy. Several predictive models were explored by Padula and Troisi, [97] who used both electronic and structural parameters as feature inputs at the same time. The predicted values obtained by the KRR model showed good correlation with experimental data (r = 0.7). Their group [76] also offered a more precise model that yielded better consistency between experiment and prediction (r = 0.78). To test whether this model allows reliable prediction in practical application, they predicted the efficiencies of some recently reported donor-acceptor pairs. The results illustrated that this model is usable for designing new donor-acceptor pairs. More significantly, it was inferred from these studies that considering both electronic and structural parameters as feature inputs can improve the predictive capability of models.

OUTLOOK
In summary, we present a comprehensive review of the application of ML and high-throughput calculations in pushing forward the development of high-efficiency OSCs. From the above-mentioned successful examples, it should be noted that ML is becoming a feasible strategy to construct the structure-property-efficiency relationship of materials and accelerate the high-throughput screening for potential molecules. Accordingly, we also point out some prospects and challenges below to advance the progress in this exciting field.

Database construction
For ML applications, model accuracy depends to a considerable degree on the size and quality of the database. Indeed, although several ML models such as CGCNN, SchNet, [98] and MEGNet [99] have demonstrated their capacity to predict molecular properties, their applications in OSCs are limited by the lack of adequate training datasets. To handle small materials databases, Zhang and Ling [100] presented a strategy that incorporates a crude estimate of the property into the feature space to construct accurate ML models. This solution has proved feasible in predicting the band gap of binary semiconductors. Since there is a crucial demand for large, high-quality datasets of OSC materials, researchers are encouraged to make their datasets readily accessible and to share them with low acquisition barriers.
FIGURE 4 … dipole moment in going from the ground state to the first excited state for donor molecules, Δ. Reproduced with permission. [90] Copyright (2020), Springer Nature. B, The testing results of RF using different types of fingerprints as input (left). Prediction results versus experimental data for the predicted donor materials with the RF algorithm and Daylight fingerprints (right). Reproduced with permission. [91] Copyright (2019), AAAS. C, Structure of the convolutional neural network (CNN) in Ref [92]. Reproduced with permission. [92] Copyright (2019), Wiley-VCH
TABLE 2 …
| Dataset | Feature inputs | ML model | Reported score | Ref. |
| … | … | … | … | [65] |
| 280 small molecule donors/fullerene acceptor | molecular properties | GBRT | 0.78 | [66] |
| 135 donors/non-fullerene acceptors | molecular properties | RF | 0.80 | [72] |
| 161 donors/non-fullerene acceptors | molecular properties | SVM | 0.96 | [73] |
| 565 donor/acceptor pairs | molecular structures | GBRT | 0.71 | [78] |
| ∼1200 polymer donors/fullerene acceptor | molecular structures | RF | 0.64 | [81] |
| ∼1200 polymer donors/fullerene acceptor | structural descriptors, experimental data | RF | 0.62 | [83] |
| 566 donor/acceptor pairs | molecular properties, molecular structures | k-NN | 0.73 | [84] |
| ∼320 donor/acceptor pairs | molecular properties, molecular structures | KRR | 0.78 | [85] |

Feature selection
When considering ML applications in the field of OSCs, there are two major objectives in the process of feature selection. One is to transform the molecular structure into reasonable descriptors, and the other is to find suitable molecular properties as feature inputs. Sparse learning [101] and compressed sensing [102] may be helpful for both. SISSO, [103] which was created to identify the best low-dimensional descriptor among various candidates, has been applied to markedly improve ML models for predicting lattice thermal conductivity [104] and the stability of perovskite oxides and halides. [105] Undoubtedly, more robust and universal models can be obtained when they are fed with good descriptors. Since researchers still rely on their intuition to choose descriptors, intelligent feature selection is worthy of further exploration.

ML algorithms and models
The application of more advanced ML methods can help build more accurate models. Transfer Learning (TL), [106,107] which refers to the transfer of knowledge from related, already learned tasks to improve learning in new tasks, is a promising way to address the problem of insufficient data in materials science. In the field of OSCs, TL has proved useful for achieving more predictive models of the HOMO values of donors. [108] Unsupervised learning is a novel approach to model building that has a strong ability to find patterns in large datasets and has demonstrated its potential in atomistic simulations. [109][110][111] More importantly, more attention should be paid to improving model interpretability. One-dimensional or multi-dimensional physically interpretable descriptors are desired for expressing the structure-property relationship of OSC materials, with which the operating principle of the device can be better understood.