Functional Material Systems Enabled by Automated Data Extraction and Machine Learning

The development of new functional materials is crucial for addressing global challenges such as clean energy and the discovery of new drugs and antibiotics. Functional material systems are typically composed of functional molecular building blocks, organized across multiple length scales in a hierarchical order. The large design space allows properties to be precisely tuned to specific applications, but also makes screening for optimal structures with traditional experimental methods time-consuming and expensive. Machine learning (ML) models can potentially revolutionize the field of materials science by predicting chemical syntheses and materials properties with high accuracy. However, ML models require data for training and validation. Methods to automatically extract data from the scientific literature make it possible to build large and diverse datasets for ML models. In this article, opportunities and challenges of data extraction and machine learning methods for accelerating the discovery of high-performing functional material systems are discussed, with an emphasis on ensuring that the predicted materials are stable, synthesizable, scalable, and sustainable. The potential impact of large language models (LLMs) on the data extraction process is examined, as is the importance of research data management tools for overcoming the intrinsic limitations of data extraction approaches.


Introduction
A current challenge for research on functional material systems is the need to simultaneously consider multiple aspects across several length scales (Figure 1); the hierarchical organization of these systems poses an additional challenge. The synthesis of functional material systems can be subdivided into the synthesis of molecular components and the assembly of these components with specific composition and morphology on the nano- or micrometer length scale. In the next step, the materials are processed, for example, into thin films, membranes, or certain reactor designs, in order to implement and "fit" the materials to the final device. All these steps need tailored synthesis and processing conditions to ensure their performance. [16] The synthesis, characterization, processing, and application of functional material systems produce large amounts of hierarchical and interdependent data. Making these data machine-readable and ready for ML, and combining them with data extracted from the scientific literature, represents a particular challenge. [17] The use of tailored research data infrastructure is highly recommended, especially when working in large interdisciplinary consortia. Thus, the development of such research data management tools represents an essential task for the scientific community. [20] In this perspective, we will briefly outline the design, synthesis, and characterization of functional material systems using metal-organic frameworks (MOFs) as example materials. Following this outline, we will highlight selected publications on enabling functional material systems through a combination of automated data extraction and ML. We will discuss the accomplishments, prospects, challenges, and limitations of this approach. Finally, we will conclude with a discussion of research data management tools and a unifying materials science ontology. [21] The combination of research data management, data extraction from scientific literature, and ML is essential to fully explore the potential of functional material systems in addressing urgent social, economic, and environmental challenges.

Figure 1. Hierarchical structure of functional material systems based on metal-organic frameworks (MOFs). Reproduced with permission. [4] Copyright 2022, Wiley-VCH; [39] Copyright 2020, American Chemical Society; [40] Copyright 2021, Wiley-VCH; [41] Copyright 2018, Elsevier; [42] Copyright 2015, Wiley-VCH; [43] Copyright 2011, American Chemical Society.

Functional Material Systems
Functional material systems are typically composed of functional molecular building blocks, organized across multiple length scales in a hierarchical order (illustrated in Figure 1). [24,25] Their modular synthesis enables the incorporation of diverse functionalities and the tuning of their structures for desired applications. [26] The chemical design space of new MOFs is virtually unlimited, due to the numerous possibilities of combining metal nodes and organic linkers. Currently, about 100 000 MOFs have been synthesized and over 500 000 predicted. [27,28] However, the wide design space also makes it infeasible to screen for optimal structures via brute-force trial and error or traditional high-throughput experimental screening approaches. [29] Multiple techniques were developed for the synthesis of MOFs to control their structure across multiple length scales, [30] starting from the synthesis of the organic linkers and the precursors of the metal nodes, all the way to crystal synthesis and further processing into the desired shape and formulation. The synthesis of MOFs started with solvothermal synthesis via multiple heating methods. Over time, new techniques were added, such as mechanochemical and vapor-phase synthesis, and sacrificial or epitaxial growth. [31] The choice of synthesis conditions and synthesis method dictates the final MOF crystal quality, defect density, crystal size, and morphology, and enables interfacial growth. [32,33] The MOF materials can afterward be processed, for example, as thin films or freestanding membranes, or formulated, for example, by mixing with polymers, pelletized, and processed into the required shapes for the final device. [34,35] The enormous amount of research related to functional material systems based on MOFs, starting from the synthesis of the molecular components, their assembly into MOF crystals with different topologies and morphologies, and their integration and testing in the final device, represents a hidden treasure.
[36] Exposing this treasure of data and making it ready for ML applications could lead to the development of tools that guide researchers and accelerate their efforts in the preparation of MOF-based devices that can address global challenges. [37,38] To fully exploit this treasure of data, a combination of tailored research data management tools, efficient data extraction from scientific literature, and ML is essential.

Data Extraction
One of the main challenges in applying ML to problems of high scientific relevance is the lack of openly accessible, structured, and machine-readable data. Existing databases, typically maintained and extended by particular scientific communities (e.g., the protein structure database, certain MOF databases, crystal structure databases, etc.), can be used to train ML models for particular tasks, for example, the prediction of materials properties. However, the majority of potentially relevant data generated in scientific labs is not published at all, and of the fraction that is published, the majority appears in the form of graphs, tables, and unstructured text. Therefore, the extraction of data from the scientific literature opens a vast amount of yet untapped possibilities to train ML models and use them to predict materials properties, extract and learn relevant relationships in the data, and eventually discover or design new materials. In the following, we describe approaches to extract structured data from publications, focusing on text extraction but also discussing the extraction of information from tables, graphs, and images. Data extraction in other scientific domains, for example, biology, dates back more than 20 years, [39] with seminal work in the late 1990s, for example, by Andrade et al. [40] One of the earliest attempts to automatically extract information from the chemistry literature was OSCAR, [41] on which the ChemicalTagger method (2011) was built.
[42] ChemicalTagger is a rule-based multistep method based on tokenization (preprocessing of raw text), tagging (using OSCAR and regular expressions), phrase parsing (assignment of syntactical structure to text), and finally action-phrase identification (extraction of chemical information) based on parse trees. The ChemDataExtractor toolkit developed by Cole and co-workers starting in 2016 [43,44] extends the rule-based natural language processing approach further, among others with ML methods, and adds functionality for table extraction. [44] In recent years, ML approaches started to play an increasingly important role in literature data extraction, where, for example, article-section relevance scores [45] and learned word embeddings [12,45] were used to enhance existing information extraction methods, or conditional random field models were used. With the increasing capabilities of language models such as BERT [46] and GPT, [47] new possibilities for extracting information from the literature have emerged. Seminal examples of literature extraction methods based on LLMs include MatSciBERT by Gupta et al., [48] a BERT model fine-tuned for materials science; BatteryBERT by Huang et al., [49] which among others uses question-answering algorithms to translate text into structured information; and a GPT-3-based model by Dunn et al., [50] which uses fine-tuning to directly translate scientific text into structured tabular data in JSON format. Semi-manual and crowd-sourcing-based approaches to extract information from the chemistry and materials science literature have also been reported, [51-53] some of which extract information from sources other than scientific literature, for example, lab notebooks, to retrieve data about failed experiments that are usually not reported in scientific articles.
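To make the flavor of such rule-based extraction concrete, the following minimal Python sketch (our own illustration, not ChemicalTagger or ChemDataExtractor) pulls a reaction temperature and time from a synthesis sentence using regular expressions, the same basic building block used in early rule-based pipelines; the output field names are hypothetical.

```python
import re

# Minimal rule-based extractor: regular expressions for temperature (°C)
# and duration (h or min) in a synthesis sentence.
TEMP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*°?\s*C\b")
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(h|hours?|min|minutes?)\b")

def extract_conditions(sentence: str) -> dict:
    """Return a structured record of the conditions found in free text."""
    record = {}
    if (m := TEMP_RE.search(sentence)):
        record["temperature_C"] = float(m.group(1))
    if (m := TIME_RE.search(sentence)):
        value, unit = float(m.group(1)), m.group(2)
        # Normalize minutes to hours so the record uses one unit.
        record["time_h"] = value / 60 if unit.startswith("min") else value
    return record

text = "The mixture was heated at 120 °C for 24 h under solvothermal conditions."
print(extract_conditions(text))  # {'temperature_C': 120.0, 'time_h': 24.0}
```

Real pipelines add tokenization, chemical entity tagging, and parse trees on top of such rules, precisely because free text is far more varied than any small set of patterns.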
[54] The automated extraction of data from tables, graphs, and images in many cases poses even larger challenges than the extraction of data from text. However, a detailed discussion of methods to extract data from tables, [44] graphs, [49] and images, in particular optical chemical structure recognition (OCSR), that is, the extraction of chemical structures from images, [55-58] is beyond the scope of this article. Using various ways of literature data extraction, a large number of databases have been generated and published, spanning from synthesis conditions [53,59-63] over materials stability [64] to materials properties, for example, for magnetic and superconducting properties, [65-67] semiconductors, [68] battery materials, [49] thermoelectric materials, [69] glasses, [70] and more general knowledge graphs. [12,71] In most cases, the databases are only a means to an end, that is, to provide sufficient training data for ML models for the prediction of synthesis routes and conditions as well as materials properties of a wider range of materials.

Technical Challenges and Intrinsic Limitations
Despite fast progress and promising new avenues related to the increasing use of ML, and in particular LLMs, in literature data extraction, there is still a range of important limitations and challenges. These can be grouped into technical challenges, which can in principle be solved by improving the data extraction methods, and intrinsic challenges, which concern inherent problems of unstructured literature as well as the quality and reliability of the data that can be extracted from it. Technical challenges include current limitations of LLMs such as GPT-3 and similar models, which are either only obtainable via OpenAI's commercial APIs or require state-of-the-art GPUs with large amounts of memory for prediction and retraining, both of which are affordable only for a small group of researchers worldwide. Another limitation is the availability and free accessibility of research papers, which makes automated access difficult and again excludes the large number of researchers who do not have access to all journals and publishers. Furthermore, if access is limited to, for example, abstracts, the amount of information that can be extracted is rather limited. [71] In addition, the use of LLMs (compared to algorithmic, rule-based models) comes at the cost of potentially higher processing times due to the size and computational cost of the models (even after retraining), [49] as well as a non-negligible amount of uncertainty regarding whether the output of LLMs is fully trustworthy, or whether they can output wrong information, give wrong answers, or generate data that are not contained in the input text. [50] At the same time, LLMs might help to analyze complex texts and sentence structures that are not extractable using conventional approaches.
[72] Beyond that, one of the main challenges in literature data extraction currently relates to the fact that large amounts of data, for example, synthesis protocols, are not tabular but can only be represented in more complex data structures. Examples are flexible, potentially multistep processes with dynamic data types and complex relations, [59,73] which not only require the development of more sophisticated extraction methods but also need flexible data blueprints for complex scientific data. One development in that direction is formal description languages for materials science and chemistry, for example, the XDL language by Cronin and co-workers. [74] Intrinsic limitations mostly refer to the completeness, reliability, unambiguousness, and precision of data reported in the scientific literature. Materials entity names might not always be unique and can pose fundamental challenges to extraction algorithms. [72] Databases constructed from extracted literature data might contain noise and errors [59] due to differences in experimental setups, experimental measurement conditions, reporting accuracies, and missing metadata. Furthermore, even if data extraction from graphs and figures becomes possible and reliable, [49] the reported data might be highly processed and condensed (i.e., lacking possibilities for further analysis of raw data), limited in accuracy and completeness, and in many cases ambiguous. These intrinsic challenges are inherent to all approaches that aim to extract and collect data from published literature, independent of the reliability of the extraction methods used. Such intrinsic limitations can only be overcome if access to high-quality data and metadata is given directly by the research groups that produce the data, for example, through publication in repositories and databases, rather than through the "information bottleneck" of scientific literature. Given the rapid recent progress in the development of data extraction methods, and more generally of natural language processing tools, LLMs will likely become one of the most widely used tools to extract (also complex and heterogeneous) data from the literature, as LLMs are capable of systematically analyzing natural language and also generating formal languages, for example, tabular formats or structured data templates. Retraining LLMs on small datasets can help to improve their accuracy for given tasks, which will become more affordable with the development of smaller and more efficient LLMs. Major breakthroughs can be expected in the coming years regarding systematic, widespread efforts to disclose data and knowledge currently hidden in scientific publications. The sustainable provision of, if possible, FAIR data, that is, findable, accessible, interoperable, and re-usable data in suitable databases and repositories, could maximize the benefit for the whole scientific community. Main challenges on the way there include the development of more flexible yet formal and thus computer-readable descriptions of complex data structures, the standardization of data and metadata, and the further development of data extraction methods to reduce both the amount of data missed during extraction and the error rate. However, the intrinsic limitations of data extractable from the scientific literature indicate that loss of data is unavoidable unless the way data are published changes in the future, implying that the further development and use of methods for research data management and FAIR data publication [75] is of the highest importance to ensure the best possible outcomes in data science and ML approaches applied to questions in materials science, chemistry, and beyond. [76]
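As an illustration of what such flexible data blueprints can look like, the sketch below (our own simplified example, not the XDL language) models a multistep synthesis protocol as nested, typed records that serialize to machine-readable JSON; the reagents and amounts shown are hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json

# A minimal structured template for a multistep synthesis protocol,
# showing how flexible, non-tabular data can still be made
# machine-readable. Not a published standard; values are illustrative.

@dataclass
class Step:
    action: str                      # e.g., "add", "heat", "wash"
    parameters: dict = field(default_factory=dict)

@dataclass
class Protocol:
    product: str
    steps: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

protocol = Protocol(
    product="HKUST-1",
    steps=[
        Step("add", {"reagent": "Cu(NO3)2·3H2O", "amount_g": 0.875}),
        Step("add", {"reagent": "H3BTC", "amount_g": 0.42, "solvent": "DMF/EtOH/H2O"}),
        Step("heat", {"temperature_C": 85, "time_h": 20}),
        Step("wash", {"solvent": "EtOH", "repeats": 3}),
    ],
)
print(protocol.to_json())
```

The point of such a template is that each step carries its own parameter set, so protocols of arbitrary length and branching can be represented without forcing everything into one fixed table.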

Research Data Management to Publishing Data in a FAIR Way
So far, we discussed approaches to extract published data from the text, tables, and graphs of research papers and other scientific texts, along with their limitations and perspectives. However, even if data extraction methods can be perfected, one of the main challenges cannot be solved with this approach: a lot of valuable data is not published at all, because it was considered unsuccessful, not publication-relevant, or for other reasons. Nonetheless, such data can be highly relevant, and thus valuable, in other contexts, which underlines the importance of approaches that lower the barrier to publishing the majority of generated data in a FAIR way, making it accessible and findable for other researchers.
It is well accepted that the systematic collection of research data in digital form, and its disclosure, is highly important for the transparency and reproducibility of scientific work. If research data management (RDM) can be tied to the FAIR data principles, RDM processes have enormous potential to systematically provide any data the research community needs for a variety of projects. In past decades, the use of efficient tools for digital RDM was difficult to achieve for the wider materials science community due to the lack of the necessary software tools, storage resources, and policies. Meanwhile, great progress has been made in all three areas, especially in recent years; thus, at least partial digitization of research processes and modern methods of RDM are now technically achievable goals for scientists of many disciplines. For best-practice guides, we refer to Talley et al. [77] and Herres-Pawlis et al. [78] Nevertheless, for a broad adoption of RDM processes by scientists, a cultural change is needed, that is, a change of mindset with respect to the importance of research data and its appropriate storage.
While this cultural change is progressing only slowly worldwide, scientists in Germany are facing an important turning point: after several years of preparation, the requirements for the practice of data provision have been changed by the German Research Foundation (DFG), one of the most important funding agencies in Germany, and an extension of the obligation to disclose research data will come into force in 2023. [79,80] Additionally, the importance of FAIR provision of research data has recently been reaffirmed and strengthened by establishing the National Research Data Infrastructure (NFDI), [81] an infrastructure to store and preserve FAIR research data in Germany. As a result of these and many other changes in the scientific system, researchers are making more and more efforts to adopt existing RDM offerings, which can make valuable contributions to the provision of high-quality, standardized, machine-readable data in the long run. In this regard, three essential steps can be described:

Methods and Software for Digitalization Strategies and Data Availability
Electronic laboratory notebooks (ELNs) and laboratory information management systems (LIMS) have been used for decades in industry as valuable RDM tools for the digitization of research processes. With the availability of powerful open-source software as an alternative to commercial systems, many academic institutions can now use these RDM tools. Thus, research data can be digitally stored and tagged with the relevant metadata as soon as they are created. Open-source ELNs such as Chemotion ELN, [82] eLAB, [83] NOMAD ELN, [20] Kadi4Mat, [84] and many others bring direct advantages, especially with regard to the potential subsequent use of the data: since they can be extended by one's own developments and thus, if necessary, also reflect changing requirements, the necessary data and metadata schemas can be made available to scientists on a permanent basis. Open-source ELN software allows the scientists themselves to specify the type and level of detail of the stored information. Automatic test protocols and algorithms can be integrated and used to achieve high data quality and, if necessary, to offer correction suggestions to the scientists. In this way, open-source systems in particular form an important basis for the self-determined acquisition of research data and content. The central collection of data enables access to all relevant data via one user interface. If data from measurement devices are consistently integrated into the ELN/LIMS process, experiments can be linked to the measurement data without data loss or errors. Especially with regard to the importance of the completeness and quality of data for ML, this step is a milestone in improving the data situation for various reuse purposes. ELNs are, on the one hand, the means for complete documentation of research processes for the individual scientist and, on the other hand, a powerful tool for building community-driven databases that can be searched and reused.
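A minimal sketch of such an automatic test protocol is given below; it assumes a hypothetical set of required metadata fields rather than any published ELN standard, and simply reports what is missing or malformed when a record is saved.

```python
# Hypothetical completeness check of the kind an ELN could run
# automatically on save. The field names are illustrative only.

REQUIRED_FIELDS = {"sample_id", "operator", "date", "method", "instrument"}

def check_record(record: dict) -> list:
    """Return a list of human-readable issues; an empty list means the record passes."""
    issues = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "date" in record and len(str(record["date"])) != 10:
        issues.append("date should use ISO format YYYY-MM-DD")
    return issues

record = {"sample_id": "MOF-042", "operator": "A. Example", "date": "2023-05-17"}
print(check_record(record))  # ['missing field: instrument', 'missing field: method']
```

Real systems would validate against a full metadata schema and offer correction suggestions, but the principle, validation at the moment of data entry rather than at publication time, is the same.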

Standardization of Discipline-Specific Data, Processes, and Metadata
In addition to the systematic digital recording and linking of research processes and data, the standardization of data and metadata is particularly important in order to ensure their efficient subsequent use by others. The goal of standardization is to ensure the completeness of reported data and metadata; that is, all relevant variables and parameters should be included in a data standard to ensure qualitative reproducibility, including external conditions that are known to be crucial for the respective experiment. Furthermore, the data accuracy should be high enough to also ensure quantitative reproducibility. Metadata schemas and ontologies are helpful for the standardization of data and processes, as well as for linking data published by different researchers. The use of metadata schemas and ontologies becomes accessible to a broad range of scientists through their integration into ELNs. Currently, tools for standardizing data and metadata are also being developed in many initiatives. [90,91] With the mostly direct involvement of scientists, freely available descriptions and software can thus be obtained to enable the uniform storage of information. As an example, within the Excellence Cluster 3DMM2O, data converters are being developed to obtain open, standardized data from non-standard, proprietary file formats, available for comparative display, analysis, and interpretation. [92] This also allows metadata to be extracted and merged with standardized metadata schemas in a way that enables reuse without the need to develop custom scripts. When these digital tools and standardization elements are embedded in ELNs, the standardized data and metadata can be made directly usable via appropriate interfaces and form an ever-growing resource of machine-readable data for ML.
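The idea of such a converter can be sketched in a few lines; the input here is a hypothetical instrument export (header lines starting with `#`, then x,y data), not an actual proprietary format, and the output is an open JSON structure with explicit metadata.

```python
import json

# Illustrative converter: parse a hypothetical instrument export into an
# open structure that separates metadata from the measured data points.

raw = """\
# instrument: XRD-9000
# wavelength_nm: 0.15406
10.0,120
10.1,135
10.2,410
"""

def convert(text: str) -> dict:
    metadata, data = {}, []
    for line in text.splitlines():
        if line.startswith("#"):
            # Header line: "# key: value" becomes a metadata entry.
            key, _, value = line.lstrip("# ").partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            # Data line: "x,y" becomes a numeric record.
            x, y = line.split(",")
            data.append({"x": float(x), "y": float(y)})
    return {"metadata": metadata, "data": data}

print(json.dumps(convert(raw), indent=2))
```

Once the output follows an agreed schema, records from different instruments and labs can be merged and queried without per-source custom scripts, which is precisely the benefit the converters described above aim for.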

Data Publication in Openly Accessible Repositories
If all locally available resources such as ELNs are brought together, there is huge potential for making data available across the entire materials science and chemistry community. This is possible, for example, through the use of research data repositories. [19,93,94] Research data repositories, especially if they provide a subject focus with appropriate support for relevant data and metadata standards, can serve as a central resource for decentrally provided data. In the case of curated repositories, the data can be further enhanced by author-independent, partly automated checks for consistent data quality. [95] Repositories offer many more options to host data in the long run: in addition to their most prominent functions to date, storing and providing research data contributed by the authors themselves, repositories can also be used to provide data extracted from the literature. This yields a combination of research data repository and database, which may be able to provide a much larger number of datasets than would be possible through direct active contribution by the community based on actual papers. An example is the extraction of chemical reactions from supplementary information files, which has been used in the past to enrich the database of the Chemotion repository.
[96] Methods of data extraction can, of course, also be used to enrich data available in internal environments such as ELNs, but then the benefit to the community is limited. Being openly accessible, repositories could become a key infrastructure for materials science and the development of new AI methods: repositories could become the primary resource for obtaining data for ML and many other methods in the future. They could also be the perfect environment for many models obtained through AI to be tested and, if necessary, put to long-term use. AI models that enable, for example, data simulation, data analysis, or curation could enrich repositories with important functions that can be directly harnessed by scientists. [97] Thus, in the long term, repositories could provide the solution to current problems of data availability: models trained on repository data could contribute to the curation of new data in the future and thus successively increase data quality (for previously non-curated repositories) or decrease processing time and time investment (for curated repositories). Furthermore, when it comes to computational studies, including the development and application of ML methods, not only data but also code should be published to improve reproducibility and accelerate development cycles within the scientific community. Repositories for sharing openly accessible code, for example, GitHub, are widely used. Best-practice guides can be found in Coudert, [98] Wang et al., [99] and Artrith et al. [76]

Examples Where Data Mining and Machine Learning Enabled the Design and Application of Functional Material Systems
In this section, we briefly describe the reuse of structured data with ML models and its benefit for the synthesis and optimization of functional material systems. The selected examples focus on literature data extraction and ML related to MOF research, but the application of both is of course not limited to the scientific challenges of MOFs and could be applied to many other topics.
The most general use case for data about materials, molecules, and their properties is computational (inverse) design. To introduce new materials, it has recently been proposed to replace the conventional trial-and-error approach, a long stepwise procedure from molecular design down to experimental assessment, with a fully data-driven inverse design methodology that directly designs the target molecules. [100,101] Inverse design of materials focuses on first identifying the desired properties of materials and then determining the optimal structure and composition to achieve those properties. The traditional forward design process involves synthesizing and testing a large number of materials in order to find one with the desired characteristics. Inverse design relies on computational methods, in particular ML, and thus on large amounts of data, to explore vast chemical and structural design spaces more efficiently. [102] In one of the early applications of inverse design in materials science, Zunger et al. used a genetic algorithm to design solid-state materials with desired electronic properties. [103,104] More recent applications of inverse design have focused on the generative design of polymer dielectrics, [105] MOF membranes, [106] nanomaterials, [107] multilayer metasurfaces, [108] metamaterials, [109] and high-entropy alloys.
[110] One of the most important aims of materials design is the improvement of the sustainability of technologies, as implied, for example, by the UN Sustainable Development Goals, which aim to provide a more sustainable future for human society. A paradigm shift in how materials and chemicals discovery is approached, that is, a shift from conventional experimental exploration to computer-aided design and AI-facilitated experimentation, can help to reach sustainability goals. Using ML methods to learn as much as possible from existing data, in order to avoid redundant computations and experiments that consume energy and resources, is critical. Without efficiently collecting and reusing previously published data, and without reporting newly generated data in FAIR ways, this potential cannot be fully used. At the same time, ML methods can be used to develop materials for sustainable technology. For example, Hardian et al. described how ML methods can be used to produce MOFs in an environmentally friendly way. [111] Kumar et al. investigated several green solvents for the sustainable synthesis of covalent organic frameworks. [112] ML techniques can also be used in most steps of a conventional environmental risk assessment of, for example, smart nanomaterials to ensure sustainability. [113] Moreover, to develop sustainable and eco-friendly alkali-activated materials (AAMs) or geopolymers, Shah et al. employed ML methods to facilitate and accelerate the development of a one-part AAM binder with the desired properties. [114] Electrocatalysis has received enormous attention as a clean and sustainable technology. In this regard, Chen et al. reviewed the application of ML in electrocatalyst design as a way to circumvent the traditional trial-and-error preparation method. [115]
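The inverse-design workflow outlined above can be caricatured in a few lines: specify a target property first, then rank a combinatorial design space by a property model. In the sketch below the model is a deterministic toy stand-in for a trained ML model, and the building-block names are illustrative only.

```python
import itertools

# Toy inverse design: target property first, then rank candidates.
metals = ["Zn", "Cu", "Zr", "Al"]
linkers = ["BDC", "BTC", "NDC"]

def predicted_property(metal: str, linker: str) -> float:
    # Toy surrogate "pore volume" derived from the composition string;
    # in practice this would be a trained ML property model.
    return (sum(map(ord, metal + linker)) % 100) / 10.0

target = 5.0  # desired property value (arbitrary units)

# Rank all metal/linker combinations by predicted distance to the target.
candidates = sorted(
    itertools.product(metals, linkers),
    key=lambda c: abs(predicted_property(*c) - target),
)
print("best candidate:", candidates[0])
```

Real inverse-design pipelines replace the exhaustive enumeration with generative models or genetic algorithms, since realistic design spaces are far too large to enumerate, but the inversion of the workflow (property before structure) is the same.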

Synthesis of FMS
The synthesis of MOF-based functional material systems involves multiple steps, from the molecular precursors, over topology and morphology, to the final device integration. Pioneering articles showed the possibilities of supporting researchers in finding suitable conditions using ML optimization algorithms, such as Bayesian optimization or genetic algorithms. Examples by Shields et al. [116] for the synthesis of organic molecules with improved yield and by Moosavi et al. [117] for the synthesis of MOFs with improved crystallinity and BET surface area demonstrated the possibilities of using ML to rationally optimize the synthesis conditions for organic molecules and MOF crystals. Chen et al. [118] demonstrated the possibility of employing ML techniques to design MOFs with desired shapes or morphologies, and Pilz et al. [119] demonstrated the possibility of optimizing the crystallinity and preferential orientation of interfacially grown SURMOF thin films. However, these approaches rely on the generation of synthesis data on which the algorithms can operate and additionally require the knowledge of the involved scientists to set the parameter and condition space for the optimization algorithms. By operating on large synthesis databases, Segler et al. [120] demonstrated that retrosynthesis design is possible for small organic molecules. The work by Park et al. [73] and Luo et al.
[59] demonstrated that automated data extraction can be combined with ML models to predict the synthesis conditions of new MOFs and to gain insights into the synthesis process. Taken together, these selected examples demonstrate that automated data extraction and ML techniques are well suited for synthesis planning, parameter prediction, and further optimization of MOF-based functional material systems, from the molecular components up to the final MOF structures with desired topology, morphology, and crystal orientation. The combination of such tools promises to accelerate the discovery of new MOFs, especially as additional data become available via extraction from the scientific literature or are collected in tailored electronic lab notebooks and deposited in openly accessible repositories.
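The iterative, ML-guided optimization of synthesis conditions described above can be sketched in a deliberately simplified form: a shrinking-window search rather than the actual Bayesian optimization of the cited works, with a hypothetical yield function standing in for a real experiment.

```python
import random

# Simplified iterative optimization of one synthesis parameter
# (temperature). Each "experiment" is a toy response surface with an
# optimum near 120 °C; real campaigns would run actual syntheses and
# typically use a Gaussian-process surrogate instead of window shrinking.

random.seed(0)

def run_experiment(temp_C: float) -> float:
    return max(0.0, 1.0 - ((temp_C - 120.0) / 60.0) ** 2)

low, high = 25.0, 220.0
observations = []
for _ in range(5):                     # 5 rounds of 4 experiments each
    for _ in range(4):
        t = random.uniform(low, high)
        observations.append((run_experiment(t), t))
    best_yield, best_t = max(observations)
    # Concentrate the next round of experiments around the best result.
    width = (high - low) / 2
    low, high = max(25.0, best_t - width / 2), min(220.0, best_t + width / 2)

print(f"best temperature ≈ {best_t:.0f} °C, yield {best_yield:.2f}")
```

Even this crude loop converges toward the optimum with a handful of experiments; proper Bayesian optimization additionally models uncertainty to balance exploration and exploitation across several parameters at once.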

Optimization of MOF-Based FMS
The design of ideal MOF structures using high-throughput computational screening and ML is a highly active and quickly developing area of research, [121] enabled by well-structured databases such as the MOF subset of the Cambridge Structural Database [28] and curated databases such as CoREMOF, [122] MOFX-DB, [123] ToBaCCo, [124] QMOF, [125] and others. [126] Starting from suitable databases allows the automated screening for ideal structures from a large pool of already synthesized or predicted materials. [127] Despite numerous publications on the design of MOFs via high-throughput computational screening and inverse design, only very few target structures have been realized experimentally. [11,128,129] The reasons why many interesting structures have not been realized experimentally are, on the one hand, their difficult or very expensive synthesis and, on the other hand, their poor stability. [73,130,131] In addition, the communication between theoretical and experimental groups is often challenging, leading to missed opportunities to cooperate. [14,129] Addressing these issues, pioneering work based on simulation and ML was realized for the prediction of mechanical stability by Moghadam et al. [132] and of synthesizability by Anderson et al. [133] The alternative approach of automated data mining from the scientific literature combined with ML also proved a valuable strategy to predict important features of MOFs. Important prediction tools were developed by Batra et al. [134] for water stability and by Nandy et al. [64,135] for thermal stability and stability toward solvent removal. Exploiting the large community knowledge hidden within the scientific literature will further refine these tools and enable the prediction of tailored MOF-based functional material systems that simultaneously fulfill multiple objectives imposed by the processing and operation conditions. Figure 2 describes the identification of functional material systems for a target application, biased by multiple objectives. The relevant data for such ML-based predictions can be mined from the scientific literature via automated data extraction. In addition, the synthesis of the target structure can be facilitated via ML prediction and optimization tools.
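The biased, multi-objective screening step can be sketched as follows: hard constraints such as stability and cost first filter the candidate pool, and the remaining materials are ranked by the target property. The candidate records, property names, and threshold values below are hypothetical and only illustrate the workflow, not data from any real database.

```python
# Hypothetical candidate records; fields and values are invented
# for illustration, not taken from CoREMOF or any other database.
candidates = [
    {"name": "MOF-A", "uptake": 8.2, "water_stable": True,  "cost": 12.0},
    {"name": "MOF-B", "uptake": 9.6, "water_stable": False, "cost": 5.0},
    {"name": "MOF-C", "uptake": 7.1, "water_stable": True,  "cost": 3.5},
    {"name": "MOF-D", "uptake": 6.0, "water_stable": True,  "cost": 40.0},
]

def screen(pool, max_cost):
    # Hard constraints (stability, cost) act as the bias on the pool;
    # the remaining candidates are ranked by the target property.
    feasible = [c for c in pool if c["water_stable"] and c["cost"] <= max_cost]
    return sorted(feasible, key=lambda c: c["uptake"], reverse=True)

ranked = screen(candidates, max_cost=20.0)
# MOF-B drops out (not water stable), MOF-D drops out (too expensive),
# leaving MOF-A and MOF-C ranked by uptake.
```

Stability labels of the kind predicted by the tools cited above would populate the constraint fields, turning literature-derived knowledge directly into screening criteria.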

Conclusions and Outlook
Simulation and machine learning (ML) have evolved into important tools for guiding researchers and for identifying materials of interest. By replacing the traditional heuristic approach, associated with labor- and time-intensive trial-and-error experiments, computational discovery and inverse design promise to speed up the development of new materials. However, ML approaches rely on sufficient data in machine-readable formats. Combining ML with automated data extraction from the scientific literature, using natural language processing, not only provides insights into the ideal design of functional material systems for a desired application but also allows information to be collected on important features such as thermal or mechanical stability. An ML workflow can be implemented to utilize the extracted data and identify the ideal design, from the composition over the structure across several length scales to the final device. Additional features, such as stability, cost, or abundance of the components, can be implemented in the ML workflow as a bias to identify the ideal material under the operating conditions of the desired application. In addition, the use of automatically extracted data on synthesis conditions, in combination with ML, can guide researchers in realizing the target materials experimentally. Efficiently operating with the complex, interconnected, and hierarchical data involved in functional material systems requires advanced research data management tools. In addition, electronic lab notebooks can facilitate the implementation of feedback loops and the complementary use of new experimental data. Although at an early stage, the combination of automated data extraction and ML has already shown promising results for the prediction of important properties and synthesis conditions as well as for high-throughput computational screening and inverse design of functional material systems. The development of advanced tools such as LLMs (e.g., GPT-3) allows domain specialists in materials science to automatically extract datasets to feed ML models. This workflow holds promise to accelerate the development of new functional material systems, urgently needed to tackle global challenges.
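At its simplest, such extraction turns a synthesis sentence into a structured record. The sketch below uses a hand-written example sentence and simple regular expressions; production pipelines based on NLP toolkits or LLMs handle far more linguistic variability, but the input/output shape is the same.

```python
import re

# A hypothetical synthesis sentence of the kind found in MOF papers.
text = ("The mixture was heated at 120 °C for 24 h, "
        "yielding crystals of the framework.")

# Rule-based extraction of temperature and reaction time.
temp_match = re.search(r"(\d+(?:\.\d+)?)\s*°C", text)
time_match = re.search(r"(\d+(?:\.\d+)?)\s*h\b", text)

record = {
    "temperature_C": float(temp_match.group(1)) if temp_match else None,
    "time_h": float(time_match.group(1)) if time_match else None,
}
```

Records of this form, aggregated over many papers, are exactly the machine-readable synthesis data that the ML prediction and optimization tools discussed above require.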
Pascal Friederich, after his Ph.D. in physics under the supervision of Wolfgang Wenzel, received a Marie-Sklodowska-Curie Postdoctoral Fellowship at Harvard University and the University of Toronto, where he worked with Alán Aspuru-Guzik on machine learning methods for chemistry. In 2020, he was appointed assistant professor at the Informatics Department of the Karlsruhe Institute of Technology, leading the AI for Materials Science (AiMat) research group. His research focuses on developing and applying machine learning methods for property prediction, simulation, understanding, and design of molecules and materials. In 2022, Pascal Friederich received the Heinz-Maier-Leibnitz Prize from the German Research Foundation.

Figure 2. Automated data extraction and machine learning enable researchers to select and synthesize functional material systems tailored for their desired applications.