Turning chemistry into information for heterogeneous catalysis

The growing generation of data and their wide availability has led to the development of tools to produce, analyze and store this information. Computational chemistry studies and especially catalytic applications often yield a vast amount of chemical information that can be analyzed and stored using these tools. In this manuscript we present a framework that automatically performs a full automated procedure consisting in the transfer of an adsorbate from a known metal slab to a new metal slab with similar packing. Our method generates the new geometry and also performs the required calculations and analysis to ﬁnally upload the processed data to an online database (ioChem-BD). Two diﬀerent implementations have been built, one to relocate minimum energy point structures and the second to transfer transition states. Our framework shows good performance for the minimum point location and a decent performance for the transition state identiﬁcation. Most of the failures occurred during the transition state searches needed additional steps to fully complete the process. Further improvements of our framework are required to increase the performance of both implementations. These results point to the avoidhuman path as a feasible solution for studies on very large systems that require a signiﬁcant amount of human resources and in consequence are prone to human errors.


Introduction
Computational chemistry is nowadays ubiquitous and has applications in Chemistry, Biology, Physics, Materials Science and Nanotechnology. As the access to massive computers and robust codes (Lejaeghere et al., 2016) extends worldwide, databases for molecules, nanostructures and materials containing structural data, spectroscopic fingerprints (Grimme et al., 2017) and general properties can be easily generated. The ultimate applications of these databases can vary from environmental detection through spectroscopy to data mining for materials in Catalysis and Electrocatalysis (Kitchin, 2018). And yet, most of the purpose-oriented calculations are not saved (this is the general case in Materials and Heterogeneous Catalysis) or are only presented as lengthy xyz coordinate listings in error-prone Supplementary files. Only lately, the relevance of keeping this data in the form of databases has been acknowledged (Bo et al., 2018). Most of the systems, though, have emerged in Materials Science in projects such as the Materials Project (Jain et al., 2013;Materials Project, n.d.), NoMaD (NoMaD Repository, n.d.), Materials Cloud (Materials Cloud, n.d.) and Computational Materials Repository (Computational Materials Repository, n.d.). Data are mostly unlinked to the corresponding works and thus the traceability (who, when, what) and fairness (findable, accessible, interoperable, reusable) are lost (Wilkinson et al., 2016).
Human factors pose an additional problem. Researchers have to perform very routine tasks, therefore they can make mistakes and the generated dense datasets often have multiple deficiencies. Human factors include, for example: cognitive functions (such as attention, detection, perception, judgement and reasoning, including heuristics and biases), and decision making. Each of these is further divided into sub-categories. These issues are particularly problematic when statistical learning techniques (Garca-Muelas & Lpez, 2019;Bruix et al., 2019;Turcani et al., 2018;Butler et al., 2018;Gryn'ova et al., 2018;Ulissi et al., 2017;Gómez-Bombarelli et al., 2018;Schlexer Lamoureux et al., 2019;Meyer et al., 2018;Nandy et al., 2019;Moghadam et al., 2019) are applied; sparse datasets are biased towards a particular type of successful event, exactly what statistical learning algorithms need to avoid to ensure their robustness. Repetitiveness has been addressed by different groups by using scripts with different level of sophistication. However, the emergence of new frameworks that can ease the tedious tasks and generate/check/upload to a database significant data blocks of the phase spaces and provide the right tools into the automation concepts (Larsen et al., 2017;Pizzi et al., 2016;Open Babel: Te Open Source Chemistry Toolbox, n.d.;Gromski et al., 2019). This has lead the (pun) idea of avoidhumans in the data production. Generation is still one of the key steps. Some very recent efforts have been made to automate the identification of adsorption sites in metal surfaces, and to create new structures with different combinations of sites and adsorbates (Boes et al., 2019;Tran et al., 2018;Montoya & Persson, 2017). Statistical learning techniques have also been applied to get an estimate for the adsorption energies (Ulissi et al., 2017;Tabor et al., 2018). Our work tries to use structure inheritance between different metals not only to automate the study of reaction networks in different metals and alike materials (like oxides), but also to reduce the explicit Density Functional Theory calculation time for adsorbates (a relatively straightforward task) and, most relevantly, for transition states for a real catalytic problem. Diagnose exceptions and analysis will be the focus of the research in computational chemistry in the coming years. This will increase our abilities to find outliers that can be crucial to performance (and identification of new families of molecules and compounds with particularly appealing properties), to refine the analytics, to incorporate graph theory (Bjørn Jørgensen et al., 2019) or other encodings, like SMILES (Weininger, 1988), to be able to transfer active patterns irrespective of the nature of the compound (solid, enzymatic, molecular).
The ways of acquiring information have changed along with data generation and analysis. New editorial platforms like Authorea can also integrate the advances of these systematic approaches. The process of reading documents has changed drastically since the establishment of the World Wide Web and the possibility of several instances running simultaneously. Reading in the 21st century is a completely different experience than it has been for at least 500 years, as the meta-and linked data are accessible and are consulted almost simultaneously with the primary source. New ways of reading can thus now benefit from interactive viewers that can integrate the content and improve the visualization of complex information (for instance 3D).
Our manuscript tries to address all these new challenges in computational chemistry and with the overall idea of transforming the results into a true seamless information science that can be interactively read, dynamically searched, analyzed and mined, whilst ensuring the transferability among different fields through metadata storage. Hence, the combination of Fireworks (Jain et al., 2015), ioChem-BD (Álvarez-Moreno et al., 2014;ioChem-BD, n.d.) and Authorea (Open Research Collaboration and Publishing -Authorea, n.d.) constitutes a privileged platform.

Computational Details
Density functional theory (DFT) simulations have been performed using Vienna ab-initio Simulation Package (VASP) (Kresse & Furthmüller, 1996;Kresse & Furthmüller, 1996). Generalized Gradient Aproximation with the Perdew-Burke-Ernzerhof functional (GGA-PBE) have been used to obtain the exchange-correlation energy term (Perdew et al., 1996). Valence electrons have been represented using the Projector Augmented Wave (PAW) (Blöchl, 1994) with a kinetic energy cutoff of 450 eV. A Γ-centered mesh has been built using the Monkorst-Pack method (Monkhorst & Pack, 1976) to generate the k-points. The Improved Dimer Method has been employed to locate the transition states (Heyden et al., 2005).
Fireworks (Jain et al., 2015) is an open source software package that allows managing, building and running workflows. In this project, it has been used to build and configure the ties between the different steps of our calculations. Fireworks also allows to track the status of active workflows and stops the process if one of the steps fails, allowing to restart the process completely or to continue the step after the application of the needed changes. Software written in Python and Bash has been developed in our lab and used to script the preparation, transfer and checking processes. Developed Python libraries and scripts focus on geometry manipulation and input/output parsing, while Bash scripts manage the files related to the calculations, and control the execution of VASP.
The framework has been tested using a reaction network composed of 9 different metals, 7 of them with p(3x3)-(111) fcc packing and 2 with p(3x3)-(0001) hcp packing. One of the hcp metals serves as the ansatz host for the rest of the metals. Both packings share similar adsorption sites in their surfaces. The two lowest layers have been frozen and a vacuum of 15Å has been applied to all metal slabs.
A total of 12 organic intermediates with the chemical formula C 1 Br x H y and 8 transition states have been evaluated with our method. All the intermediates belong to the same reaction network; the transition states are all the possible elemental steps that link all our intermediates. For the minimum energy relaxations, different adsorption sites have been calculated for every metal/species combination, while only one adsorption site has been tested for the transition states. The species were first obtained manually for the hcp metal and then recalculated for the other metals by using our framework.

Results and Discussion
Full mechanistic DFT studies on the decomposition of organic molecules are among the most challenging processes in heterogeneous catalysis due to the amount of elemental steps and intermediate species comprised in the network, as well as to the subsequent screening that the calculation of these reactions over a set of metal slabs require (Muelas et al., 2017). The size of these networks are tightly bonded with the number of carbons of the organic species and, for a significant number of carbons, the systems require a massive amount of time and resources to be evaluated. To address this issue, some databases have been designed to store and ease the access to these steps with the aim of recycling old calculations and reducing the required steps when studying these networks (CatApp database, n.d.).
We propose an automated procedure to overcome the drawbacks associated with the repetitive processes required for the study of large reaction networks. We have built a framework that combines Fireworks, VASP, ioChem-BD and ad-hoc developed software to fully automate the study of reaction networks over different metals. This framework allows transferring all the relaxed species and transition states obtained for a metal slab to the slab of a different metal as well as preparing and performing the DFT calculations by using the generated geometries for the new slab automatically. The simplicity of both the C 1 species and the pure metal slabs, together with the complexity degree added by the inclusion of a halogen give us a optimal scenario to benchmark the framework.

Transfer algorithm
Figure 1: Interactive diagram of the steps that the transfer algorithm performs. While some of the steps only use the data from the host slab, the first (1)-(5) steps are applied concurrently to both slabs.
The process of generating guess geometries for intermediates is one of the most tedious and time-consuming tasks, particularly when dealing with large reaction networks, e.g. C 6 sugar alcohol decomposition consists of 10 6 intermediates (Sutton & Vlachos, 2015). Inheriting previously obtained structures from similar calculations eases the problem, but it still results in a repetitive routine problem that can be fully automated.
A transfer algorithm that performs the relocation of an adsorbed molecule in a metal slab to a similar metal slab has been developed. Automatic transfer of adsorbed species between similar metallic surfaces allows not only saving a considerable amount time during the geometry production, but also classifying the generated geometries accurately.
When there are two different metal slabs, a host with an adsorbed molecule (1a) and a guest consisting on an empty metal slab (1b), the transfer algorithm works as follows: First, it identifies the layers (2) for both slabs and selects the highest layer (3). Then, it searches for all the possible adsorption sites (4) on both surfaces. The site is found by triangulation of the different metal centers. Once the adsorption sites have been identified, the algorithm associates the nearest adsorption site with the lowest atom of the adsorbed molecule in the host slab. (5) In the next step, the adsorption length is computed employing the distance between the assigned site and the lowest atom of the adsorbate. However, for some molecules the lowest atom (z-axis position) is not perfectly aligned with the adsorption site. To overcome this issue, the deviation of the z vector between both is also computed in this step (6). Lastly, the algorithm transfers the molecule to a similar site in the guest surface, taking into account the adsorption site type and maintaining the adsorption length (7). As an optional step, the algorithm can be set to identify the nearest adsorption site of the second lowest (5o) atom and compute the angle between both (6o), this information is then used to rotate the molecule around the z-axis of the lowest atom to preserve the original alignment of the molecule (8). Figure 1 depicts the procedure of the algorithm.
The transfer algorithm applies different methods to find the possible adsorption sites. For the fcc and hcp holes detection, the Voronoi tetrahedron method (Isayev et al., 2017) is used to compute the bonds between the atoms of the upper layer, and then to search for cycles of three atoms. Differentiation between hcp and fcc holes is achieved by projecting the triangle formed by the cycles in the lower layer and searching for atoms inside this space. Once the bonds are defined, the bridge and top positions are easy to find. The geometries generated using the transfer algorithm are not optimized and require further DFT calculations to obtain a full description of the reaction network. To address this problem, two different workflows have been developed and embedded in the framework: one for the minimum energy relaxation (MER) and another for the transition state search (TS). Both workflows use the transfer algorithm as the first step to generate a new metal slab with the desired adsorbate; then, different Bash scripts are prepared and run the pertinent DFT calculations using VASP.

Workflow design
While the workflow to search relaxed geometries only generates a single geometry, the transition state (TS) workflow generates three unique geometries with different rotations along the z-axis of the lowest atom. A partial minimum energy search for a small number of ionic steps is performed for the three structures, and the atoms of the molecule with high initial forces are allowed to relax, thus preventing the TS to fall to a minimum energy point. The best candidate among the relaxed structures is selected using a lowest energy criteria; in the next step, the chosen candidate is used as the starting geometry in an improved dimer method calculation. Selection steps are unnecessary for the MER workflow due to the efficiency of the algorithms that relax the geometry to the minimum point energy compared with the transition search ones. Thus, the generated geometry for MER workflow is directly used as starting point for the minimum energy search calculations.
Different errors can occur during the calculation steps; besides, the geometry obtained after the calculation may be chemically meaningless, therefore, it is important to manage errors with care. To address these issues, a checking algorithm is applied to the results of the calculations in order to search for inconsistencies through the calculation steps. Additionally, a bond identification algorithm is used to verify that no bond breaking occurs during the relaxations. This algorithm, that detects spurious non-valid broken adsorbates was implemented at the checking stage. The process splits the structure between the (metal) surface and the adsorbed molecules. Then the bonds between the atoms of the adsorbates are detected using the Voronoi tetrahedra algorithm (Isayev et al., 2017) (with a cutoff radius) and converted to a graph and then the number of disconnected graphs is computed. The difference between the initial and final number of disconnected graphs illustrates whether the adsorbate has broken during optimization. If the result passes the check, a frequency calculation is launched.
After a few tests, an improvement has been integrated to the transfer algorithm. The bond recognition was reused to analyze the bonds of the geometries obtained from the MER workflows. As a result, a list of the average distance between the metal surface and the lower atom was obtained for each metal surface. The difference between these distances was used to correct the adsorption distance of some of the failed MER workflows and all of the TS workflows.

Storing data
Due to the large number of individual results that compose complex reaction networks, it is mandatory to compile, sort and store the results. ioChem-BD (Álvarez-Moreno et al., 2014;ioChem-BD, n.d.) provides the essential tools to perform the last step of our project: to convert our results into organized data. The shell client of ioChem-BD allows an easily upload of the generated output files to a private server, once these are uploaded, it transforms, parses and stores the results to ensure a clear representation and an easy online access to the data. It can also generate molecular labels as SMILES.
Consequently, as the final step of both previously described workflows, the results of the DFT calculations as well as the frequency calculations performed at the end of the workflows are uploaded to ioChem-BD. For the TS workflow, the data generated by the dimer method is uploaded, whereas for the MER workflow the data from the relaxation method is uploaded. (Pablo-Garca, 2019;Pablo-Garca, 2020) Once all the data are processed, they can be published in the public repository of the ioChem-BD server and then the obtained information can be shared with other researchers keeping the FAIR principles. Moreover, ioChem-BD generates interactive figures that improve the understanding of geometries and reactivity for complex adsorbed molecules when working in multidisciplinary environments (f. ex. with experimental groups).
Datasets can also be published with an embargo option, the dataset is published in a private repository and both a DOI and a Reviewer Link are generated. This link allows the coworkers, editor and reviewers to inspect the dataset before making it public. Once the associated manuscript is published, the dataset is synchronized to the public repository, making it accessible to everyone through the dataset's DOI. The data can then be accessed via DOI, or by searching directly in the platform by Author, date, SMILES, or chemical formula.
When a calculation is uploaded to ioChem-BD, additional metadata are generated and stored within the calculation, a trustworthy fingerprint of every calculation is also created and deposited in the system. Table  1 contains the minimum parameters that are saved when a calculation is uploaded to ioChem-BD. Two different schemes are used to generate the content tree: the Dublin Core standard (DCMI, n.d.) and a the Chemical Custom format created for ioChem-BD to append additional data.
While the output file from VASP is processed and converted to our custom .cml format, the raw input files are stored as they are. To optimize space usage, the particular electronic and ionic iterations are not parsed and only the final structure is stored. Inclusion of the raw input files and the VASP version used allows reproducing every calculation of the dataset accurately. Additional information on the structure of the different calculations or just for clarity can also be added. ioChem-BD offers several options to include this information and notes can be added to its description. However, this information is rather complex in some cases and requires a different format. When this happens, ioChem-BD offers the possibility of attaching raw files together with the input files. Both the description and the attachments are available once the calculation is published in the ioChem-BD server.

Test and results
Figure 3: Efficiency of the MER and TS workflows. "Good" points to regularly terminated calculations, "Additional steps" imply that error checking routines have identified a deficiency in the calculation (i.e. maximum convergence steps reached) and "Manual preparation" points out that the calculation failed and requires human intervention (i.e. bond breaking).
Both workflows yield reasonable results and apply the procedure for the different species (see Figure 3). Almost 88% of the minimum energy relaxations and 56% of the transition states converge at the first attempt. Additional steps are required for 0.1% of the relaxed models and 27% of the transition states. The rest require manual preparation and supervision for them to terminate successfully. The exhaustion of the number of cycles was the main cause of the failures for MER, the solution was to add about 10% ionic cycles (not completely used in all cases). This can thus be directly introduced in alternative network studies for metals and alloys. For transition states (TSs) there were two issues: (i) the same as for MER, insufficient ionic steps in the algorithm, this was sorted in the same manner and was by far the most common issue; (ii) alternative convergence was not achieved and thus the initial seed was not taken from the reference host (Ru) but from a metal with chemical properties closer to the running TS search (Pt was inherited from Ir).
In the case of relaxations, most of the unobtainable structures fall into different adsorption sites during the calculation, while only a few of them end the calculation with a broken bond. On the other hand, all the unobtainable TSs states end with a broken bond and more precise methods such as the Nudged-Elastic Band (NEB) (Henkelman & Jnsson, 2000;Henkelman et al., 2000) are required to obtain the geometries.
This automatization procedure has been employed to generate the databases for further machine learning studies. (Saadun et al., 2020)

Integration with Authorea
The Authorea (Open Research Collaboration and Publishing -Authorea, n.d.) platform allows an easy integration of the data stored in ioChem-BD. For instance, the reaction networks and particular structures can be directly linked in the database. In our case, the structure for the CH 3 on two different surfaces, Ni and Ru, can be retrieved from Guest Slab and Host Slab, and the transition state for decomposition from Transition State.

Conclusions
We have proved that our framework automates two different kinds of molecular transfers through similar metals. Although our method is not yet able to fully automate the entire process successfully, it is possible to classify the different error cases obtained during our study and incorporate the solutions as additional steps for our workflows. In addition, coupling our framework to the Authorea and ioChem-BD tools has the following advantages: (i) it enables establishing a seamless link between the computed data, the manuscript and linkes the corresponding structures interactively thus avoiding tedious and error-prone supporting information; (ii) this is particularly attractive for complex databases with massive reaction networks and/or several material compositions; (iii) the workflow reduces the computing time, systematizes the nomenclature and labeling of the different species, reduces the chaos, and increases transferability; (iv) the metadata directly embedded in ioChem-BD is made transparent through Authorea, which can improve the design (and self-definition) of the working flows that could potentially include data analysis through coupled machine learning algorithms; (v) the data curation is thus directly enforced by this procedure.

Acknowledgments
We are thankful to the "Ministerio de Ciencia e Innovacin" RTI2018-101394-B-I00 and the Barcelona Supercomputing Center (BSC-RES) for providing generous computer resources. We thank Ms. Vendrell for carefully reading the manuscript.