Fast Exploring Literature by Language Machine Learning for Perovskite Solar Cell Materials Design

Making computers automatically extract latent scientific knowledge from literature is highly desired for future materials and chemical research in the artificial intelligence era. Herein, the natural language processing (NLP)‐based machine learning technique to build language models and automatically extract hidden information regarding perovskite solar cell (PSC) materials from 29 060 publications is employed. The concept that there are light‐absorbing materials, electron‐transporting materials, and hole‐transporting materials in PSCs is successfully learned by the NLP‐based machine learning model without a time‐consuming human expert training process. The NLP model highlights a hole‐transporting material that receives insufficient attention in the literature, which is then elaborated via density functional theory calculations to provide an atomistic view of the perovskite/hole‐transporting layer heterostructures and their optoelectronic properties. Finally, the above results are confirmed by device experiments. The present study demonstrates the viability of NLP as a universal machine learning tool to extract useful information from existing publications.


Introduction
[9] The data-driven approach typically relies on numerical data such as the efficiencies and capacities obtained from high-throughput calculations or high-throughput experiments, [10][11][12][13] while new types of data such as text and images are also available.
However, compared with other types of data, textual data, which store the majority of scientific information in published articles (e.g., abstracts), are often neglected by materials scientists. [16][17] For instance, Tshitoyan et al. employed a natural language processing (NLP)-based method to successfully identify patterns and concepts in the materials and chemistry domains using historical textual data; the periodic table was automatically constructed without human knowledge inputs, and new thermoelectric materials were discovered. [18] Zhang et al. employed a text mining method to explore new energy materials and identified several potential high-performance photo-rechargeable materials that were verified via first-principles calculations. [19,20] Inverse materials design is highly desired for the materials discovery process. [21] Historically, the typical materials-design period from laboratory discovery to commercial product is 15 to 20 years. The standard procedure entails the following steps: 1) develop a novel or improved material concept and test its appropriateness; 2) synthesize the material; 3) integrate the material into a device or system; and 4) characterize and measure the properties and performance. [22] However, conventional materials design proceeds by trial and error and relies on time-consuming human reading of the literature. In the age of big data, scientists struggle to keep up with the massive volume of literature and are working to speed up research cycles and revolutionize the current materials- and chemical-design paradigms.
[32] The widely used organic HTLs, such as Spiro, are thermally unstable due to the low melting points of most organic molecules. Therefore, a new high-melting-point inorganic HTL is needed.
In this manuscript, we employ the NLP method to explore PSC materials in an effort to automatically extract scientific information and predict candidate materials using the new data type. The prototypical metal halide perovskites, the ETL and HTL materials, and the additive materials are analyzed in detail via the word2vec-based NLP technique; a forecast of appropriate material ingredients of the PSC is provided based on the NLP results. An HTL candidate, Fe3O4, which has not been recognized as an appropriate HTL material in the existing database, is predicted by the NLP model. The candidate is then structurally and electronically characterized via first-principles calculations, and Fe3O4/CH3NH3PbI3 heterostructures are constructed to reveal their optoelectronic properties. Device experiments based on the Fe3O4 HTL were conducted to confirm the theoretical analysis. This study provides a platform to apply the NLP method to analyze and predict suitable PSC materials.

Machine Learning
A database containing 29 060 literature abstracts regarding perovskites is prepared from SpringerLink with the publication year ranging from 1997 to 2021. It should be noted that the PSC first appeared in 2009; however, the employment of perovskite materials in materials science is an old concept, and preparing a database containing articles published before 2009 is necessary. [35] The selected articles are from different journals (Figure S2, Supporting Information). Starting from 2013, the number of articles mentioning PSCs increased significantly over the years. The NLTK toolkit and the word2vec method are employed for the NLP preprocessing and model construction steps, respectively. ChemDataExtractor is employed to extract materials names and chemical formulas. [36] The materials candidates are ranked according to the cosine similarity between the word vectors of the materials names and the application in PSCs. The cosine similarity is calculated according to the following formula

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

where x·y is the dot product of the vectors x (material) and y (application), ‖x‖ is the length of vector x, and ‖y‖ is the length of vector y. In the word2vec-based model construction process, the skip-gram method is used with the following parameters: words with a frequency of less than 2 are removed from the dictionary; the feature dimension is 100; the window size, indicating the maximum distance between the current word and the predicted word, is 5; and the threshold for the random downsampling of high-frequency words is 1 × 10⁻⁴. The word2vec skip-gram model takes in pairs of words by moving a window across the textual data and trains a neural network with one hidden layer on the given input word, providing a predicted probability distribution of the words close to the input. The input words are effectively one-hot encoded, and the learned projection weights are then interpreted as the word embeddings. As a result, the network provides 100-dimensional (or 200-dimensional) word embeddings if the hidden layer contains 100 (or 200) neurons. In contrast, the continuous bag of words model predicts the center word using the average of several input context words rather than a single word as in the skip-gram case.
The preparation of a custom dictionary, or user-defined dictionary, is an important and common step in NLP to facilitate the tokenization. For most professional subjects, there are domain-specific terminologies and jargon that may reduce the tokenization accuracy. In these scenarios, the default word segmentation can split such terms incorrectly and fail to meet the requirements of the analysis. The flexible expansion of the thesaurus provided by user-defined dictionaries can solve this problem. As a result, in this study, a custom dictionary is provided, including materials domain-specific terminologies such as "solar cell," "perovskite solar cell," etc., which help the tokenization process and the cosine similarity calculation. The overall workflow of the NLP study for the PSCs is provided in Figure 1. More information regarding the details of the preprocessing, named-entity recognition, and the model construction steps can be found in the literature. [37]
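The ranking step described above can be illustrated with a minimal sketch. It assumes the word vectors are already available as plain Python lists; the names `cosine_similarity`, `rank_materials`, and the three-dimensional toy vectors are hypothetical and are not taken from the study, whose vectors are 100-dimensional and learned by word2vec.

```python
import math


def cosine_similarity(x, y):
    """Return x.y / (||x|| ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def rank_materials(material_vectors, application_vector):
    """Sort material names by descending similarity to the application vector."""
    scores = {
        name: cosine_similarity(vec, application_vector)
        for name, vec in material_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors, for illustration only (not data from the study).
toy_vectors = {
    "material_A": [1.0, 0.0, 0.0],
    "material_B": [1.0, 1.0, 0.0],
    "material_C": [0.0, 0.0, 1.0],
}
toy_application = [1.0, 0.0, 0.0]
ranking = rank_materials(toy_vectors, toy_application)
```

In this toy case the material whose vector is parallel to the application vector ranks first and the orthogonal one ranks last, which is the behavior the cosine-similarity ranking relies on.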

First-Principles Calculation
The first-principles calculation is performed using CASTEP; [38] the cut-off energy is 430 eV, and the density functional is Perdew-Burke-Ernzerhof. A spin-polarized calculation is performed because of the magnetic properties of Fe3O4.
The CH3NH3PbI3 halide perovskite is utilized to construct the advanced perovskite/HTL interface, because it represents the majority of the perovskite light-absorbing materials.
The perovskite surface along the (001) direction is the focus, and different terminations are investigated to construct the heterostructures.
[41] A lattice mismatch within 5% is achieved for the heterostructure of CH3NH3PbI3 and Fe3O4. The Kramers-Kronig relations are expressed as the momentum matrix elements between the occupied and the unoccupied electronic states. [42,43]
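The lattice-mismatch criterion mentioned in this section can be sketched as a small search over supercell repeats. This is only an illustration under stated assumptions: the mismatch is defined here relative to the mean of the two supercell lengths (other conventions exist), the function names are hypothetical, and the lattice constants in the example are round placeholder numbers, not values from this study.

```python
def lattice_mismatch(length_a, length_b):
    """Relative mismatch of two in-plane lengths, referenced to their mean.

    Assumed convention; some works divide by the substrate length instead.
    """
    return abs(length_a - length_b) / ((length_a + length_b) / 2.0)


def best_supercell(a_film, a_substrate, max_repeat=6):
    """Find repeats (m, n) that minimize the mismatch of m*a_film vs n*a_substrate."""
    best = None
    for m in range(1, max_repeat + 1):
        for n in range(1, max_repeat + 1):
            mismatch = lattice_mismatch(m * a_film, n * a_substrate)
            if best is None or mismatch < best[2]:
                best = (m, n, mismatch)
    return best


# Placeholder lattice constants in angstrom (hypothetical, for illustration only).
m, n, mismatch = best_supercell(a_film=6.0, a_substrate=8.0)
within_threshold = mismatch <= 0.05  # the 5% criterion quoted in the text
```

A search of this kind returns the smallest commensurate supercell first, which keeps the heterostructure model as small as the mismatch threshold allows.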

Results and Discussion
The NLP model based on the textual data of the perovskite literature successfully extracts important chemical information. For example, the NLP model effectively recognizes the knowledge of the periodic table, and different elements are grouped into respective regions. The alkali metals such as Li, Na, and Cs are located in similar positions, while the alkaline earth metals such as Ca are located next to the alkali metals; the transition metals such as Cu, Mn, Fe, and Zn are located in another region; the nonmetal elements such as I, P, and O are also located next to each other. In this way, the distributions of the different elements identified by the word embedding are analogous to the periodic table (Figure 2a), and it should be noted that this knowledge extraction process represents a machine learning approach in which minimal human effort is required. The word "perovskite" and the phrase "solar cells" are closely related to each other, and the relationship can be quantified by the large value of the cosine similarity (0.7), while the machine learning model can be further interpreted through the presence of the bridging words (Figure 2b). For instance, the cosine similarity between "perovskite" and "photovoltaic" is 0.78, while the cosine similarity between "photovoltaic" and "solar cells" is slightly higher (0.79). [46][47][48]
Apart from the capability to extract general chemical/materials knowledge from the literature, the present NLP model successfully extracts domain-specific knowledge about PSCs. For example, in PSCs, the perovskite light-absorbing layer, ETL, and HTL are three critical ingredients that are responsible for the light-absorption, charge-dissociation, and charge-transport processes. This concept is successfully learned by the machine learning model just by reading the text of the literature. In the materials maps (Figure 2c) of the representative materials in PSCs, the perovskite, ETL, and HTL materials are distributed in three distinctly different regions after the dimensionality reduction by principal component analysis (PCA). Specifically, TiO2, SnO2, and ZnO, which are typical ETL materials in PSCs, are in close vicinity of each other; the HTL materials for PSCs, including CuI, CuS, WO3, and CuO, are in close proximity; and the perovskites, such as the prototypical CH3NH3PbX3 and CsPbCl3, neighbor each other. A relatively uncommon HTL material, Fe3O4, is highlighted, which overlaps with the other HTL materials and will be atomistically simulated to understand its optoelectronic properties in the context of the halide PSCs. In addition, the present machine learning model successfully differentiates more chemical and materials concepts, including the elements, applications, and oxides that align in their respective directions (Figure 2d).
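The dimensionality reduction behind such materials maps can be sketched with a plain SVD-based PCA. This is a minimal illustration that assumes NumPy is available; `pca_project` is a hypothetical helper, and the random array stands in for learned 100-dimensional word vectors, so the resulting coordinates carry no chemical meaning.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their leading principal components via SVD."""
    data = np.asarray(vectors, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical stand-ins for the word vectors of six materials.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 100))
coords = pca_project(toy_embeddings)  # one (x, y) point per material
```

Each row of `coords` would be plotted as one labeled point; materials with similar contexts in the corpus are expected to land near each other, as described for the ETL, HTL, and perovskite groups.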
The evolutions of different PSC materials (perovskite, ETL, HTL, and additive materials) are revealed in the NLP process. The perovskite materials are traditional mineral materials that had been widely researched prior to 2009, when the first PSC report appeared and scientists began to embrace the halide perovskite materials for solar cell applications. The NLP analysis demonstrates that before 2009 the oxide perovskites, rather than the halide perovskites, are the predominant materials for solar cell-related research (Table 1), which agrees with the domain knowledge. The word frequencies of CH3NH3PbI3, CH3NH3PbBr3, and CsPbBr3 (Figure S3, Supporting Information) appearing in the database exhibit similar patterns; for example, the word frequency of CH3NH3PbI3 increased significantly from 2020 (twice the number of CsPbBr3 and four times the number of CH3NH3PbBr3 in the same year), despite a quasi-plateau region from 2018 to 2019. [49,50] The synonyms of the halide perovskite materials often cause issues in the named-entity recognition stage, and more analysis is performed on perovskite synonyms such as CH3NH3PbI3 and MAPbI3, where the two entities correspond to the same material. The trends are also consistent for CH3NH3PbI3 and MAPbI3, which exhibit similar rising patterns from 2014 to 2021 (Figure S4a, Supporting Information), despite the fact that the popularity of their bromine counterparts decreases after 2014 (Figure S4b, Supporting Information). Moreover, the metal halide perovskite materials most relevant to solar cells in the top 10 list exhibit slight variations over the years (Table S1, Supporting Information). For example, the CH3NH3-based halide perovskites are predominant in the years from 2014 to 2017; in contrast, more Cs-based and formamidinium (FA)-based halide perovskites appear from 2018 to 2021. In addition, CH3NH3PbI3 ranks 1st from 2014 to 2021 among all the metal halide perovskite materials, suggesting CH3NH3PbI3 as the primary metal halide perovskite material for solar cells. In summary, the NLP analysis demonstrates the dominance of the iodine-based hybrid organic-inorganic CH3NH3PbI3 over its bromine-containing CH3NH3PbBr3 and CsPbBr3 counterparts for solar cell application.
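The word-frequency trends and the synonym issue discussed above can be illustrated with a short counting sketch. It assumes the abstracts are already tokenized and tagged with a publication year; the synonym table, the function name, and the toy records are hypothetical. Note that merging synonyms into one canonical name is a design choice made here for illustration, whereas the text compares the two spellings as separate entities.

```python
from collections import Counter, defaultdict

# Hypothetical synonym table mapping alternative spellings to one canonical name.
SYNONYMS = {"MAPbI3": "CH3NH3PbI3", "MAPbBr3": "CH3NH3PbBr3"}


def yearly_frequencies(records, targets):
    """Count mentions of each target material per publication year.

    records: iterable of (year, tokens) pairs, where tokens is a list of strings.
    """
    wanted = set(targets)
    counts = defaultdict(Counter)
    for year, tokens in records:
        for token in tokens:
            canonical = SYNONYMS.get(token, token)
            if canonical in wanted:
                counts[year][canonical] += 1
    return {year: dict(counter) for year, counter in counts.items()}


# Toy records, for illustration only (not data from the study).
toy_records = [
    (2019, ["MAPbI3", "film", "CH3NH3PbI3", "CsPbBr3"]),
    (2020, ["CH3NH3PbI3", "MAPbI3", "MAPbI3", "CsPbBr3"]),
]
freq = yearly_frequencies(toy_records, ["CH3NH3PbI3", "CsPbBr3"])
```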
The NLP model is employed to display the domain-specific information, including the time evolution of the ETL materials. [53][54][55] The word2vec-based rankings of SnO2 demonstrate the importance of this ETL material (Figure 3b), and a continuous rise in the ranking table is observed for SnO2. A slightly different trend is observed for the word frequency analysis (Figure S5, Supporting Information), where the frequencies of SnO2 appearing in the text source fluctuate in 2017 and 2020, demonstrating the noteworthy effect of the methodology on the materials trend analysis. Importantly, SnO2 ranks high (5th) in the years from 2018 to 2021 based on the cosine similarity (Table S2, Supporting Information), demonstrating the significant attention on SnO2 for PSC applications compared with the traditional ETL materials such as TiO2 and ZnO. For instance, in the years from 2014 to 2017, Al2O3/TiO2 ranks high in the table of materials relevant to the perovskite ETL; nevertheless, SnO2 ranks significantly higher than TiO2 from 2018 to 2021, which agrees with the higher efficiencies and stabilities offered by the SnO2-based PSCs in the literature.
The present NLP model provides the HTL information for PSCs (Table S3, Supporting Information). CuSCN is selected as a case study to display the trend of the HTL materials for PSCs based on the word2vec model. Traditionally, spiro-OMeTAD is employed as the HTL material, but it suffers from instability in the ambient environment due to its organic nature; as a result, new inorganic materials such as CuSCN and NiO have been developed to substitute the organic counterpart. On the one hand, from 2014 to 2021, the NLP model demonstrates the rising popularity of the CuSCN HTL material for PSCs (Figure 3c); on the other hand, fluctuations are observed for the word frequencies of CuSCN, which highlights the better capability of the word-embedding method compared with the word-frequency method (Figure S6, Supporting Information) for the materials trend analysis. In addition, both CuSCN and NiO rank higher in the years from 2018 to 2021 (Table 1) than the organic counterpart because of the high performance and stability of the two inorganic materials. The NLP results highlight the importance of CuSCN and NiOx as HTL materials for PSC applications.
The present NLP model is also employed to analyze the additive materials (Table S4, Supporting Information) that are often introduced in the halide perovskite solution or ETL/HTL layers to enhance the perovskite device performance. [58][59] Li2CO3 is selected as a case study to understand the applicability of the NLP method for analyzing additive materials for PSCs. After 2014, the rankings of Li2CO3 fluctuate over the years (Figure 3d and S7, Supporting Information), which is partially ascribed to the complicated multidimensional design space of perovskite additives, which have more candidates than the ETL and HTL; nevertheless, the ranking of Li2CO3 is generally high over recent years despite several rises and falls, demonstrating the popularity of this additive material for PSCs. The top 10 list of the perovskite additive materials according to the word2vec model demonstrates the importance of Li2CO3 for PSCs (Table S4, Supporting Information), especially in the more recent years from 2014 to 2021.
The rankings of typical HTL materials calculated according to the values of their cosine similarity with the target output HTL demonstrate the prediction accuracy of the NLP model for PSC materials. For example, CuSCN, CuI, NiOx, MoO3, and CuOx are all existing HTL materials for PSCs (Figure 3e). However, a relatively uncommon HTL material, Fe3O4, exists in the materials table, and we suggest that this relatively uncommon material is a potentially appropriate HTL material for PSCs.
We have conducted a more comprehensive analysis by counting the word frequencies of additional materials and incorporating them into the original figure. It is pertinent to highlight that certain common materials that are apparently unrelated to perovskite HTL materials, such as the metals Cu, Ag, and Au, and even water, rank higher than Fe3O4 in terms of word frequency and far exceed the validated HTL materials (Figure S8, Supporting Information). We also conducted an additional model training using 100 full-text documents, a number limited by copyright restrictions, and performed word vector dimensionality reduction and clustering (Figure S9, Supporting Information). However, the performance of the model trained on the full text is poor: it shows poor clustering results and almost fails to correctly classify clearly related materials, such as MAPbI3, MAPbBr3, FAPbI3, and CsPbBr3. This is partially because the full-text literature contains some noise and is not as concise and compact as the abstracts, which may be detrimental to the scientific inference. In contrast, training the model using abstracts can help circumvent this irrelevant noise. We expect that more full-text articles can marginally improve the clustering, but the literature abstracts are important alternatives considering their scientific conciseness and the absence of copyright issues. It should also be noted that Fe3O4 has not been recognized as a candidate perovskite HTL material in the present database. Therefore, the relatively uncommon HTL material Fe3O4 is further analyzed via DFT calculations to reveal the predictive capability of the NLP model.
The atomic-level heterostructures and corresponding optoelectronic properties are calculated. The candidate Fe3O4, which is predicted by the NLP model and is not present in the initial database as a representative HTL material, is selected as a case study. [60] It should be noted that the availability of the NLP-based materials with high rankings for PSCs is a good indicator of the accuracy of the present NLP model. Three CH3NH3PbI3/Fe3O4 heterostructures, namely MAI-O, PbI2-O, and PbI2-Fe, are acquired at the atomic level (Figure 4). The optoelectronic properties of the CH3NH3PbI3/Fe3O4 systems are revealed by their PDOS spectra and UV-vis absorption spectra. For the three CH3NH3PbI3/Fe3O4 heterostructures, the valence bands are predominantly contributed by the perovskite material, while Fe3O4 contributes to both the valence and conduction bands (Figure 4). In addition, the presence of the Fe3O4 layer introduces a semimetallic feature revealed by the negligible band gap, which represents enhanced charge-carrier conductance at the interface at the cost of stronger interfacial charge-carrier recombination. [61,62] The specific orbitals such as Fe-3d, O-2p, and I-2p demonstrate that the Fe-3d orbitals contribute to the unoccupied states near the Fermi level, while the I-2p orbitals from the halide perovskite layer mainly contribute to the occupied states near the Fermi level (Figure S10, Supporting Information). In addition, the 2p orbitals of the oxygen also contribute to the occupied orbitals near the Fermi level. The spin-polarized PDOS spectra (Figure S11, Supporting Information) demonstrate the universal presence of the spin orbitals contributed by the Fe-3d orbitals near the Fermi level for MAI-O, PbI2-O, and PbI2-Fe. For example, a distinctive Fe-3d spin contribution in the energy region from 0 to 1 eV is available for MAI-O. The heterostructures demonstrate balanced light-absorption capabilities in the UV-vis region and the infrared region because of the presence of Fe3O4. [65]
There are limitations of the NLP model, and some of the perovskite materials pointed out by the model are controversial. For example, on the one hand, BaY2O4 and LiMn2O4 are not strictly perovskites, but they appear in the list. This may originate from the insufficient amount of data and inappropriate tokenization and named-entity recognition. On the other hand, it should be noted that although BaY2O4 and LiMn2O4 are not strictly ABX3-type perovskites, some studies suggest that they are perovskite-like structures. [66,67] Apart from that, an experimental investigation of an Fe3O4 HTL material appeared recently in the literature; [60] however, it should be noted that this HTL is not present in the initial database, and it is interesting that, by reading a limited amount of data, the NLP model can successfully uncover an alternative HTL material that is only available and verified in another database. The present model can be improved in the future by preparing a larger database including more publishers, patents, and other languages. [68,69]

Experimental Verification
From the above language machine learning analysis, Fe3O4 may be a new inorganic hole-transport material. We conducted experiments to assess its viability as the HTL in PSCs. The normal device structure Au/Fe3O4/perovskite/TiO2/FTO was used (Figure 5a). Figure 5c shows the optimized concentration of Fe3O4 in the solution during spin coating. With the increase of the Fe3O4 concentration, the PCE of the PSCs increases and reaches a maximum of 9%, implying a suitable thickness of the Fe3O4 HTL; however, a further increase of the Fe3O4 concentration decreases the PCE because the thickened HTL creates too long a distance for hole transport and thus blocks the hole extraction. Figure 5b shows the increased short-circuit current from the Fe3O4 concentration optimization, confirming the suitable thickness of the HTL. Finally, the ambient stability of the unencapsulated devices was tested. The coverage provided by the Fe3O4 HTL inhibits water/oxygen invasion into the perovskite absorber, thus improving the ambient stability (Figure 5d). External quantum efficiency spectra and the calculated current density (24.17 mA cm⁻²) are presented in Figure S13 and Table S5, Supporting Information.

Conclusion
The PSC materials are successfully modeled via the NLP method. Various chemical information, such as the periodic table grouping and the automatic classification of perovskite/ETL/HTL materials, can be obtained by the NLP model. The model suggests that iodine-based perovskite serves as the predominant metal halide perovskite material for PSCs compared with the Cs-, Cl-, and Br-based counterparts. In addition, the model suggests that SnO2, CuSCN, and Li2CO3 are highly relevant to the ETL, HTL, and additive materials for PSCs, respectively. The first-principles calculations provide the atomistic view of the NLP-predicted HTL material candidate Fe3O4, which receives insufficient attention in the literature. The present study highlights the viability of NLP-based machine learning techniques for PSC materials analysis.

Experimental Section
Fluorine-doped tin oxide (FTO) glass was chosen as the substrate. The FTO was cleaned by ultrasonication sequentially in cleaning concentrate, deionized water, and ethanol for 30 min. Then the cleaned substrate was dried by nitrogen flow and treated in a UV-ozone cleaner for 15 min. The TiO2 electron-transfer layer was prepared by chemical-bath deposition on the FTO/glass substrate as reported previously. First, the cleaned FTO/glass substrate was treated with the UV-ozone cleaner for 10 min. Next, the substrate was placed in a 0.2 M TiCl4 aqueous solution at 70 °C for 60 min, and then the substrate was washed with deionized water and ethyl alcohol alternately three times. The substrate was then annealed at 200 °C for 30 min in air and treated in UV-ozone for 20 min to enhance the wettability. ≈80 μL of perovskite precursor was dropped onto the TiO2/FTO/glass substrates, and the antisolvent method was employed to form perovskite films with diethyl ether as the antisolvent. The solution was spin-coated at 3000 rpm for 10 s and then at 5000 rpm for 30 s. Twenty seconds before the end of the 5000 rpm step, 600 μL of diethyl ether was quickly dropped onto the surface. Next, the film was annealed at 150 °C for 15 min. After the perovskite (PVK) film was cooled, 2-phenylethylamine hydroiodide solution (5 mg mL⁻¹ in isopropanol) was spin-coated on the PVK film at 2000 rpm for 30 s for top surface passivation without thermal annealing. Then, 50 μL of Fe3O4 solution was spin-coated at 5000 rpm for 30 s to form the hole-transfer layer. The gold electrode (≈80 nm) was thermally evaporated onto the surface of the Fe3O4 through a shadow mask, and the active area was 0.09 cm².

Figure 1. Flowchart of the NLP process for analyzing PSC materials. 1) A starting database containing the perovskite literature is prepared. 2) The preprocessing is performed using the NLTK toolkit for the tokenization, part-of-speech tagging, and spell checking. ChemDataExtractor is employed to extract materials names and chemical formulas. 3) The domain-specific technical words are identified in the named-entity recognition step, the neural network-based word2vec method is employed to construct the NLP model, and the relationships between the materials and the applications are extracted. 4) The results are generated and visualized based on time-evolution plots and materials ranking. A potential HTL material candidate is selected, and density functional theory (DFT) calculations are performed to obtain the atomic structure and optoelectronic properties of the NLP-predicted candidate material. The custom dictionary is included to facilitate the tokenization step. In this study, the custom dictionary includes the materials domain-specific terminologies such as "solar cell" and "perovskite solar cell".
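Steps 2 and 3 of this workflow can be sketched in a few lines: multi-word terms from a custom dictionary are merged into single tokens, and skip-gram (center, context) pairs are then generated with a sliding window. This is a simplified illustration, not the pipeline used in the study: the helper names, the underscore joining convention, and the example sentence are assumptions, and real word2vec implementations additionally subsample frequent words and may shrink the window randomly.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge the longest dictionary phrase found at each position."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so "perovskite solar cell"
        # wins over the shorter "solar cell".
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + size]) in phrase_set:
                merged.append("_".join(tokens[i:i + size]))
                i += size
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def skipgram_pairs(tokens, window=5):
    """List (center, context) pairs for every context word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(start, stop) if j != i)
    return pairs


CUSTOM_DICTIONARY = ["solar cell", "perovskite solar cell"]
tokens = "the perovskite solar cell uses an oxide layer".split()
merged = merge_phrases(tokens, CUSTOM_DICTIONARY)
pairs = skipgram_pairs(merged, window=1)
```

Without the dictionary, "perovskite", "solar", and "cell" would be embedded as three unrelated tokens; merging them first lets the model learn one vector for the application term that the materials are later compared against.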
In the CH3NH3PbI3/Fe3O4 heterostructures, different terminations of both components are simulated: both methylammonium iodide (MAI)-terminated and PbI2-terminated surfaces are included for the halide perovskite layer; for the Fe3O4 surface, both Fe-termination and O-termination are incorporated. These combinations give rise to several different heterostructures; however, several heterostructures consisting of the perovskite layer and the HTL layer undergo convergence failure during the geometry optimization stage, and three possible heterostructures of the perovskite/Fe3O4 systems are reported here. The optical absorption coefficients in the UV-vis absorption spectra are determined by the real part of the dielectric functions, which are obtained from the imaginary part via the Kramers-Kronig relationships.
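For reference, the relations alluded to here are commonly written as follows. These are the standard textbook forms, given as an assumption about the expressions intended; they are not reproduced from the source, and the symbols (real part ε₁, imaginary part ε₂, angular frequency ω, speed of light c, principal value 𝒫) are introduced only for this illustration.

```latex
\begin{align}
\varepsilon_1(\omega) &= 1 + \frac{2}{\pi}\,\mathcal{P}\!\int_0^{\infty}
  \frac{\omega'\,\varepsilon_2(\omega')}{\omega'^2 - \omega^2}\,\mathrm{d}\omega' \\
\alpha(\omega) &= \frac{\sqrt{2}\,\omega}{c}
  \left[\sqrt{\varepsilon_1^2(\omega) + \varepsilon_2^2(\omega)} - \varepsilon_1(\omega)\right]^{1/2}
\end{align}
```

The first line is the Kramers-Kronig transform that yields the real part from the imaginary part, and the second gives the absorption coefficient α from both parts of the dielectric function.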

Figure 2. Prediction of new HTL materials. a) Formation of elemental groupings from the machine learning model using the perovskite literature. b) Further illustration of the automatic chemical information extraction from the perovskite literature, demonstrating that the relationships between "perovskite" and "solar cells" can be explained via various bridging words. For example, "photovoltaic" exhibits large cosine similarities with "perovskite" and "solar cells," while "battery" can be correlated with both "perovskite" and "solar cells," signifying the energy-storage applications of the perovskite materials. c) Materials maps of perovskite, ETL, and HTL materials that are distributed in three respective regions, demonstrating the capability of the machine learning model to automatically recognize the PSC materials. A relatively uncommon HTL material, Fe3O4, is highlighted, which overlaps with other HTL materials and will be atomistically simulated to understand its optoelectronic properties in the context of the halide PSCs; the axes refer to the two dimensions after applying PCA to reduce the dimensionality of the word vectors. d) Relationship extraction showing that the chemical elements distribute in a consistent direction, the oxides distribute in another direction, and the applications of materials align in an alternative direction.
For example, from 2002 to 2005, the inorganic perovskite oxides such as SrGeO3, LiNbO3, LiMn2O4, LiNO3, Sr2RuO4, NiCr2O4, YMnO3, YAlO3, LiNbO3, and LiMn2O4 correspond to the 10 perovskite materials most relevant to solar cells. We find that these perovskite materials have been deployed for various optoelectronic and catalytic applications, although they are not directly employed for solar cells. From 2006 to 2009, the 10 perovskite materials most relevant to solar cells are LaSrAlO4, Sr2RuO4, AgCuF3, AgNbO3, La-Cr-O, BaTiO3, BaY2O4, AgTaO3, PrMnO3, and ZnFe2O4. Interestingly, during that time the PSCs were not receiving immediate attention from the scientific community. From 2010 to 2013, the 10 perovskite materials most relevant to solar cells are still BaTiO3, LaFeO3, CaCoO3, and their derivatives, with the absence of the halide perovskite materials. In contrast, from 2014 to 2017, the halide perovskite materials begin to be well received by researchers, and the 10 perovskite materials most relevant to solar cells correspond to a mixture of halide and oxide perovskites: CH3NH3PbI3, BaTiO3, MAPbI3, CH3NH3PbBr3, CsPbI3, La0.58Sr0.4Co0.2Fe0.8O3, CH3NH3SnCl3, La0.8Sr0.2Fe0.8Cr0.2O3, CsPbBr3, and SrTiO3. The oxide perovskites disappear entirely from the top 10 list in the years from 2018 to 2021: in this period, the perovskite materials most relevant to solar cells are CH3NH3PbI3, FAPbI, CsPbI2Br, CsPbI3, MASnI3, FAPbBr3, CsGeBr3, MAPbX3, MAPbI3, and CsPbBr3, demonstrating the golden period of the halide perovskite materials for solar cell application. The rankings of the perovskite materials in terms of their relevance to solar cells clearly demonstrate the transition from the oxide perovskites to the halide perovskites in solar cell research. It is expected that the halide perovskite materials will be the predominant perovskite types for solar cell application beyond 2021.
The NLP model provides more domain-specific information, including the time evolution of the metal halide perovskites. CH3NH3PbI3, CH3NH3PbBr3, and CsPbBr3 are three common prototypical metal halide perovskite materials for PSCs, and their evolutions are compared. The word2vec model suggests the predominance of CH3NH3PbI3 over CH3NH3PbBr3 and CsPbBr3 in terms of its higher cosine similarity with the solar cell application (Figure 3a). The development of the iodine-based metal halide perovskite CH3NH3PbI3 remains promising, while the rankings of CH3NH3PbBr3 and CsPbBr3 based on their relevance to solar cells drop significantly after 2014, signifying the decreasing popularity of the bromine-based metal halide perovskite materials in recent years. The word frequencies of CH3NH3PbI3, CH3NH3PbBr3, and CsPbBr3 (Figure S3, Supporting Information) exhibit similar patterns.

Figure 3. a) Evolution of the word2vec-based rankings of CH3NH3PbI3, CH3NH3PbBr3, and CsPbBr3 according to their relevance to solar cells in different years, which is estimated by the cosine similarity between the material's formula and the application "solar cells." b) Evolution of the ranking of SnO2 according to the correlation coefficients with ETL from 2014 to 2021. c) Evolution of the ranking of CuSCN according to the correlation coefficients with HTL from 2014 to 2021. d) Evolution of the ranking of Li2CO3 as the additive for PSCs from 2014 to 2021. e) Rankings of typical HTL materials based on the NLP model, demonstrating the predictive accuracy of the model for the HTL materials of PSCs. A relatively uncommon HTL material, Fe3O4, is further analyzed via DFT calculations to reveal the predictive capability of the NLP model.
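The relevance rankings described in this caption can be illustrated with a minimal sketch that orders candidate formulas by their cosine similarity with a target term and tracks one material's rank across periods. The function names, the token "solar_cells," and the three-dimensional toy vectors are hypothetical and invented for illustration; in practice the vectors would presumably come from word2vec models trained on the literature of each year range.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_by_relevance(embeddings, target, candidates):
    """Candidates sorted by decreasing cosine similarity with the target term."""
    target_vec = embeddings[target]
    scored = [(name, cosine_similarity(embeddings[name], target_vec))
              for name in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def rank_evolution(embeddings_by_period, target, candidates, material):
    """1-based rank of one material among the candidates in each period."""
    evolution = {}
    for period, embeddings in embeddings_by_period.items():
        ordered = [name for name, _ in
                   rank_by_relevance(embeddings, target, candidates)]
        evolution[period] = ordered.index(material) + 1
    return evolution

# Toy embeddings invented for illustration (not model output).
toy = {
    "2010-2013": {
        "solar_cells": [1.0, 0.0, 0.0],
        "CH3NH3PbI3": [0.2, 1.0, 0.0],
        "BaTiO3": [0.9, 0.1, 0.0],
        "CsPbBr3": [0.0, 0.0, 1.0],
    },
    "2018-2021": {
        "solar_cells": [1.0, 0.0, 0.0],
        "CH3NH3PbI3": [0.95, 0.05, 0.0],
        "BaTiO3": [0.1, 1.0, 0.0],
        "CsPbBr3": [0.5, 0.5, 0.0],
    },
}
materials = ["CH3NH3PbI3", "BaTiO3", "CsPbBr3"]
evolution = rank_evolution(toy, "solar_cells", materials, "CH3NH3PbI3")
print(evolution)
```

With these toy vectors, CH3NH3PbI3 moves from rank 2 in the earlier period to rank 1 in the later one, mimicking the kind of trend plotted in panel a.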
materials trend analysis. Importantly, SnO2 ranks high (5th) in the years from 2018 to 2021 based on the cosine similarity (Table

CH3NH3PbI3/Fe3O4 heterostructures at the atomic level are successfully acquired, namely MAI-O (the CH3NH3PbI3 surface terminates with MAI while the Fe3O4 surface terminates with O; MAI stands for methylammonium iodide), PbI2-O (the CH3NH3PbI3 surface terminates with PbI2 while the Fe3O4 surface terminates with O), and PbI2-Fe (the CH3NH3PbI3 surface terminates with PbI2 while the Fe3O4 surface terminates with Fe) (Figure 4). MAI-O demonstrates an interfacial H•••O distance of 1.88 Å between the methylammonium hydrogen and the Fe3O4 oxygen, suggesting the importance of the intermolecular hydrogen bond for stabilizing the heterostructure. PbI2-O demonstrates an interfacial Pb•••O distance of 2.23 Å, while PbI2-Fe demonstrates an interfacial I•••Fe distance of 4.07 Å. In summary, to demonstrate the NLP design process, first-principles calculations are utilized to construct different CH3NH3PbI3/Fe3O4 heterostructures, which are stabilized via the interfacial intermolecular hydrogen bonds contributed by the A-site cation molecule and the inorganic layer's oxygen species.

Figure 4. Atomistic view of different heterostructures of CH3NH3PbI3/Fe3O4. a) MAI-O, where the CH3NH3PbI3 surface terminates with methylammonium iodide (MAI) while the Fe3O4 surface terminates with O. b) PbI2-O, where the CH3NH3PbI3 surface terminates with PbI2 and the Fe3O4 surface terminates with O. c) PbI2-Fe, where the CH3NH3PbI3 surface terminates with PbI2 and the Fe3O4 surface terminates with Fe. Typical distance values between the perovskite layer and the Fe3O4 layer are highlighted. However, PbI2-O suffers from structural disintegration in the perovskite layer after the geometrical optimization, while PbI2-Fe exhibits a large interlayer distance (thus negligible interaction); as a result, MAI-O is suggested to have balanced structural integrity and decent interlayer interactions. d) PDOS spectra of MAI-O. e) PDOS spectra of PbI2-O. f) PDOS spectra of PbI2-Fe. g) UV-vis spectra of MAI-O, PbI2-O, and PbI2-Fe. The Fermi level corresponds to 0 eV in the PDOS spectra.

Figure 5. a) Device structure of the solar cell. b) J-V curves of the solar cells formed using different Fe3O4 concentrations. c) PCE statistics of the devices based on different concentrations of Fe3O4 in the HTL. d) Ambient stability of the devices at 25 °C and 30% relative humidity.

Table 1. Top 10 perovskite materials relevant to solar cells according to the machine learning model for perovskite articles published in different year ranges.