Towards automatic generation of Piping and Instrumentation Diagrams (P&IDs) with Artificial Intelligence

Developing Piping and Instrumentation Diagrams (P&IDs) is a crucial step during the development of chemical processes. Currently, this is a tedious, manual, and time-consuming task. We propose a novel, completely data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) are translated to P&IDs. To use established transformer-based language translation models, we represent the P&IDs and PFDs as strings using our recently proposed SFILES 2.0 notation. Model training is performed in a transfer learning approach. Firstly, we pre-train our model using generated P&IDs to learn the grammatical structure of the process diagrams. Thereafter, the model is fine-tuned leveraging transfer learning on real P&IDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated P&IDs and 89.2% on 100,000 generated P&IDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real P&IDs indicate the need for a larger dataset of real P&IDs for industry applications.


Introduction
Piping and Instrumentation Diagrams (P&IDs) are important engineering documents of chemical plants depicting the arrangement of process equipment, valves, piping, control structure, and instrumentation [1]. In contrast, Process Flow Diagrams (PFDs) focus on major equipment parts and material streams. While PFDs are typically used during the early-stage conceptual design phase, P&IDs are developed in the basic design and detailed engineering phases. They are essentially the central document in every industrial chemical plant for storing, revising, and exchanging information [2]. The applications of P&IDs range from engineering and design, to hazard and operability studies (HAZOPs), construction, operation, maintenance, and decommissioning [2].
The development of P&IDs from PFDs is a tedious and time-consuming task that offers great potential to reduce costs and speed up the development process [3]. Commonly, process engineers manually develop P&IDs by adopting and modifying schemes from prior projects, design rules, and their own experience, using Computer-Aided Design (CAD) software. However, this traditional development can be laborious because finding, manually adjusting, and transferring suitable technical solutions from old projects is tedious and error-prone. Time constraints can lead to the adoption of non-optimal solutions from previous projects, and possible alternatives may not be considered [3]. Unleashing the potential of (semi-)automated process engineering may help to reduce development times, reduce costs, increase safety, and avoid errors.
Researchers have been working on the automation of process development since the 1990s. To assist engineers during the creation of P&IDs, multiple rule-based systems have been developed [3,4,5]. Modularization approaches for chemical plants commonly provide the underlying framework of rule-based systems and aim to accelerate process development [6,7,8]. The method proposed by Blitz et al. [4] asks a user to define certain inputs, such as material properties and process-specific requirements. Then, a P&ID is generated based on the user input and the underlying knowledge-based approach, which is implemented as a decision tree. Similarly, Uzuner and Schembecker [3] and Obst et al. [5] also utilize a knowledge-based method, which is represented as a hierarchical decision tree. Uzuner and Schembecker [3] first divide the chemical process into modules to reduce the complexity of the design problem. Secondly, design questions and options guide the user to obtain a P&ID of the desired module. While these previous works demonstrate the potential of computer-assisted P&ID development, they have not yet been broadly adopted by industry. In general, many expert systems in chemical engineering have not led to the expected major advances [9]. In particular, rule-based systems are often difficult to set up, maintain, and extend [3,9].
Recent research and development in deep learning-based Artificial Intelligence (AI) applications promise improvements over expert systems, revealing outstanding performance in numerous disciplines, as highlighted by the following examples. In particular, Natural Language Processing (NLP), a subfield of AI focusing on natural language, with its powerful models (e.g., GPT-3 [10], T5 [11]) showed breakthrough performance in many natural language tasks, outperforming systems that previously used handcrafted rules [10,11,12,13]. Similarly, deep learning has outperformed rule-based approaches in other fields such as organic chemistry. For example, transformer-based language models can accurately predict reaction outcomes based on string representations of reactions using the Simplified Molecular-Input Line-Entry System (SMILES) notation [14,15,16,17,18].
In the context of process engineering, there exist a few very recent and promising methods, which learn patterns from existing PFDs and P&IDs [19,20,21,22]. Zhang et al. [19] and Zheng et al. [20] use the Simplified Flowsheet-Input Line-Entry System (SFILES) [23] notation to describe flowsheet topologies as strings, in conjunction with sequence alignment algorithms, to identify design heuristics in process diagrams. Oeing et al. [21] propose an AI-assisted method to predict the subsequent equipment using a Recurrent Neural Network (RNN). Similarly, we proposed a methodology for auto-completion of flowsheets based on transformer language models [22]. To enable the use of NLP models, we utilized the SFILES 2.0 [24] notation. While previous methods focus on the completion of incomplete process diagrams, there is no method available that enables the (semi-)automated generation of P&IDs directly from PFDs.
We propose a novel methodology to generate P&IDs from PFDs. The underlying idea of our approach is to cast the control structure prediction as a translation task, with the PFD as the source language and the P&ID as the target. To leverage the potential of state-of-the-art sequence-to-sequence translation models based on the transformer architecture [12], we utilize the text-based SFILES 2.0 notation [24] to represent the topological information of P&IDs.
The remainder of this paper is structured as follows: Section 2 describes the fundamentals of the applied natural language model and summarizes the concept of the SFILES 2.0 string representation of chemical process topologies. Thereafter, in Section 3 we describe the data acquisition. In Section 4 we introduce the transformer model adapted for predicting the control structure. Afterward, the results are discussed in Section 5 and demonstrated with an illustrative example in Section 5.3.

Background
This section summarizes the fundamentals of sequence-to-sequence models for the translation of natural language (Section 2.1). In Section 2.2, we highlight the transformer architecture as the state-of-the-art deep learning architecture for translation. Thereafter, the concept of the SFILES 2.0 notation, which enables a text-based representation of PFDs and P&IDs, is outlined in Section 2.3.

Sequence-to-sequence models
Sequence-to-sequence models are machine learning models that map an input sequence to an output sequence.They are utilized in numerous NLP tasks, e.g., in translation [25], text summarization [26], speech recognition [27], and image captioning [28].
Typically, a sequence-to-sequence model comprises an encoder and a decoder stack, as depicted in Figure 2. During encoding, a numerical embedding of the input sequence is determined, which is subsequently used by the decoder stack to generate the output sequence in an auto-regressive way. The decoder iteratively processes the preceding output sequence together with the numerical embedding of the encoder to predict the next token (e.g., a word). The iterative decoding stops once the decoder predicts the end-of-sequence token.

Figure 2: Encoder-decoder structure of sequence-to-sequence models (example translation: "Here is an example sentence." → "Hier ist ein …").

During decoding, the decoder stack determines the probabilities for each token in its vocabulary, whereupon the next token of the output sequence is identified using a decoding strategy (e.g., greedy or beam search). Greedy search selects the next token of a sequence based on the highest predicted probability at the current decoding step. The greedy strategy is computationally cheap. However, it does not ensure a sequence with maximal overall probability, because sequences with a high probability can also contain some tokens with a low probability. To mitigate this issue, the beam search algorithm was introduced in sequence-to-sequence models (e.g., [29,30,31]). Beam search selects and memorizes the N best tokens at every decoding step, creating a tree of possible output sequences. Every selected token is added separately to the preceding output sequence, and thus the decoder is prompted in total N times to predict the output probabilities of the next tokens. From these, the decoder selects the N tokens with the highest probabilities for the next decoding step and discards branches with a lower probability. In the end, either the sequence with the overall highest probability is selected or the N best sequences are returned to a user for selection. Choosing an appropriate beam size N is a trade-off between generating sequences with high probabilities and computational cost.
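The decoding strategy described above can be sketched in a few lines. The following minimal sketch searches over a toy next-token probability function (the "model" and its probabilities are invented for illustration, not the paper's actual decoder):

```python
import math

def beam_search(step_probs, vocab, beam_width, max_len, eos="</s>"):
    """Minimal beam search sketch: step_probs(prefix) returns a dict
    mapping each vocabulary token to its predicted probability."""
    # Each beam entry: (log-probability of the sequence, token list)
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == eos:  # finished sequences are kept as-is
                candidates.append((logp, seq))
                continue
            probs = step_probs(seq)
            for tok in vocab:
                candidates.append((logp + math.log(probs[tok]), seq + [tok]))
        # Keep only the N best (partial) sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

# Toy "model": prefers "a" first, then "b", then end-of-sequence
def toy_model(prefix):
    if len(prefix) == 0:
        return {"a": 0.6, "b": 0.3, "</s>": 0.1}
    if prefix == ["a"]:
        return {"a": 0.1, "b": 0.8, "</s>": 0.1}
    return {"a": 0.05, "b": 0.05, "</s>": 0.9}

best = beam_search(toy_model, ["a", "b", "</s>"], beam_width=3, max_len=3)
print(best[0][1])  # most probable sequence
```

Note how the log-probabilities are summed over the whole sequence, so a token with a modest probability can still belong to the overall best sequence — exactly the case greedy search misses.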
Training of sequence-to-sequence models is typically performed using the cross-entropy loss between the predicted output probabilities of the next tokens and the ground truth [11]. With the aid of the computed cross-entropy loss, the parameters of the model are adjusted to improve the model performance. Teacher forcing [32] is commonly applied to correct the model at each decoding step by feeding it the ground-truth prefix corresponding to a given input sequence [11].
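The training objective can be made concrete with a small sketch: under teacher forcing, the model produces one probability distribution per decoding step given the ground-truth prefix, and the loss is the average negative log-likelihood of the ground-truth next tokens (the distributions and token names below are made up for illustration):

```python
import math

def cross_entropy_loss(predicted_dists, target_tokens):
    """Average negative log-likelihood of the ground-truth tokens.
    predicted_dists[i] is the model's probability distribution over the
    vocabulary at decoding step i (computed from the ground-truth prefix,
    i.e., teacher forcing); target_tokens[i] is the ground-truth next token."""
    nll = 0.0
    for dist, tok in zip(predicted_dists, target_tokens):
        nll += -math.log(dist[tok])
    return nll / len(target_tokens)

# Two decoding steps with hypothetical model outputs over a tiny vocabulary
dists = [{"(v)": 0.7, "(hex)": 0.3}, {"(v)": 0.2, "(hex)": 0.8}]
targets = ["(v)", "(hex)"]
print(cross_entropy_loss(dists, targets))
```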

Transformer architecture
Originally, the underlying model architectures of sequence-to-sequence models comprised variations of RNNs [29]. To avoid vanishing or exploding gradients, long short-term memory networks [33] and gated recurrent neural networks [34] were introduced. Recently, the transformer architecture [12], which is based on the sequence-to-sequence model structure, revolutionized the field of NLP, demonstrating breakthrough performance on numerous tasks [10,11,12].
The transformer architecture [12] is based on the auto-regressive encoder-decoder model structure and was originally proposed for translation tasks. The transformer model relies entirely on attention mechanisms, dispensing with any recurrence or convolutions. Eliminating recurrence and using attention significantly reduces the number of sequential computations and enables fast parallel processing and model training [12].
The original transformer architecture comprises an encoder and a decoder stack, each containing six identical layers. The encoder layers consist of a multi-head attention sub-layer followed by a position-wise fully connected feed-forward network. Each sub-layer is followed by the addition of a residual connection and layer normalization, which prevent "losing" information from the previous layer and facilitate the gradient flow. The structure of the decoder stack is similar to the encoder, with the difference that the decoder contains two attention sub-layers. The first attention sub-layer is masked, which limits the decoder to attend only to already generated tokens and prevents it "from glancing into the future". The second attention sub-layer, the encoder-decoder attention layer, performs multi-head attention combining the numerical embedding of the last encoder layer with the results of the preceding self-attention layer.
Attention, an important core component of the transformer architecture, enables the model to efficiently capture the meaning of a token depending on the context present in the sequence. During model training, the weights of the query, key, and value matrices are adjusted to learn the bidirectional context of words in a sequence. These matrices are used to compute a query vector q, a key vector k, and a value vector v from the input embedding. The resulting vectors are packed into query Q, key K, and value V matrices to efficiently compute the scaled dot-product attention. The scaled dot-product attention in the transformer architecture includes a scaling factor of 1/√d_k, where d_k is the dimension of the key vectors:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

To allow the model to learn different representations of a single word, multi-head attention is introduced. For this purpose, the queries, keys, and values are linearly projected to different dimensions and processed in parallel by multiple attention heads, whose outputs are thereafter concatenated.
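The scaled dot-product attention formula translates directly into code. The following pure-Python sketch computes it for tiny matrices represented as lists of row vectors (the concrete Q, K, V values are arbitrary illustrations):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for small
    pure-Python matrices (lists of row vectors)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot products of the query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

The query attends more strongly to the first key (larger dot product), so the output is a value mixture weighted toward the first value row.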
The attention mechanisms cannot capture any positional information due to the absence of recurrence or convolutions in the transformer architecture. Therefore, a positional encoding, utilizing sine and cosine functions, is added to the input and output embeddings to provide information about the position in the sequence.
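A minimal sketch of the sinusoidal positional encoding from the original transformer paper, interleaving sine and cosine components per position:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding from the transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pe = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print(positional_encoding(0, 4))  # position 0: sine terms 0, cosine terms 1
```

Each dimension oscillates with a different wavelength, so every position in the sequence receives a unique vector that is simply added to the token embedding.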

Graph- and text-based representation of process diagrams
This section briefly summarizes the graph- and text-based representation of process diagrams as SFILES 2.0 [24]. Process diagrams (e.g., PFDs or P&IDs) of chemical plants can be represented as directed graphs [24,35]. Unit operations and control units can be illustrated as nodes in the graph, while material streams and signals are directed edges connecting the nodes. Figure 3 shows an illustrative example process containing a reactor with level control and a recycle loop with flow control. This process diagram can be converted to its corresponding graph representation as depicted in Figure 4. Notably, the two-stream heat exchanger (hex-1) is split into two nodes to distinguish the two separate material flows, which do not mix inside the heat exchanger. The control units are stored as nodes like unit operations.

SFILES 2.0 [24] is a text-based representation of process topologies, extending the original SFILES notation as proposed by d'Anterroches [23]. The SFILES notation is inspired by the SMILES notation, which is used for representing molecules as strings [14,15]. With SFILES 2.0 we can efficiently store the topological information of a process graph (e.g., Figure 4) as text, which enables the application of advanced data processing methods, such as NLP models (Section 2.2). Converting the graph in Figure 4 to the SFILES 2.0 notation with our publicly available GitHub repository [36] results in the following string:

The SFILES 2.0 notation is read from left to right, with two consecutive unit operations or control units in parentheses implying a material flow in between. Branching in the process, for example after the stream splitter (splt), is represented by putting the individual branches in brackets (here prod), but omitting the brackets for the stream noted last at the branching point (here the recycle flow over C/FC). Material recycles are included in the SFILES 2.0 notation using a number # for the starting point (v) and <# for the corresponding target (mix). The heat exchanger is noted twice in the string with a number in braces, indicating that it is the same heat exchanger but two streams enter and leave the equipment. Independent material streams, such as the utility stream flowing through the heat exchanger compartment hex-1/2, are appended to the SFILES 2.0 string, separated with n|. Control units are inserted in the same way as unit operations, with subsequent braces indicating the letter code of the instrument. Signal connections are implemented similarly to material recycles but include an underscore (_#, <_#).

Data
We use generated data and a dataset of real P&IDs for model training and evaluation. Section 3.1 describes the generation algorithm for P&IDs, which are utilized for pre-training the model. Subsequently, Section 3.2 summarizes the pre-processing of real P&IDs derived from publicly available sources used for model fine-tuning.

Generated data for pre-training
Typically, NLP models are trained on huge corpora of text that are publicly available on the internet. For example, Common Crawl is a publicly available database that extracts around 20 TB of text from the web every month [11]. Filtered and cleaned, data from Common Crawl was used as the C4 (Colossal Clean Crawled Corpus) dataset of about 750 GB to pre-train the roughly 220 million parameters of the T5-base model [11]. Commonly, transfer learning techniques are employed to reduce this massive data demand for new applications [11].
Although the SFILES 2.0 notation, with its small, limited vocabulary, is less complex than natural language, a reasonable amount of pre-training data is necessary to train the randomly initialized weights of the transformer model. Because the vocabulary of SFILES 2.0 is completely different from that of natural language, we cannot leverage transfer learning from human language models. Also, there is no publicly available database of P&IDs.
We generate a large set of P&IDs by extending the approach previously proposed by Vogel et al. [22] for generating PFDs. For this purpose, we create P&ID patterns of sub-processes, such as thermal separation or reaction, into which a chemical process is typically divided. These P&ID patterns are thereafter added together to create the P&ID of a chemical process consisting of multiple sub-processes. The construction of the P&ID dataset follows a first-order Markov chain-like sampling process with fixed probabilities, i.e., the selection of the next sub-process depends only on the current state. Compared to Vogel et al. [22], we add control structures based on several basic design heuristics (inspired by [1,37]) for every generated sub-process. As illustrated in Figure 5, we initialize up to three feed streams, which may be pre-processed by inserting heat exchangers, pumps, compressors, or mixing units. Thereafter, a Markov transition selects either thermal separation or reaction as the next sub-process. The generation of the reaction pattern serves as an example: Firstly, upstream unit operations comprising heat exchangers, pumps, and compressors are selected. Thereafter, present heat exchangers may be pre-selected for heat integration utilizing a reactor outlet stream. In the next step, one of six stored reactor patterns with an optional material recycle stream is selected. Optionally, a second or third reactant is fed to the reactor in the final step, completing the reaction pattern. In general, the patterns have several outlet streams transitioning to the "Next sub-process" state, which lead to multiple Markov transitions to subsequent sub-processes. Branches are terminated either after reaching the conditioning step or if the generation algorithm detects a node number exceeding 65, which prevents the generation of very large flowsheet graphs. Duplicates and process diagrams exceeding a node number of 100 are deleted.
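The first-order Markov chain-like sampling can be sketched as follows. The sub-process names and transition probabilities below are illustrative placeholders, not the actual values used for dataset generation:

```python
import random

# Illustrative transition probabilities (not the paper's actual values):
# the next sub-process depends only on the current state (first-order Markov).
TRANSITIONS = {
    "feed": {"reaction": 0.5, "thermal_separation": 0.5},
    "reaction": {"thermal_separation": 0.7, "reaction": 0.2, "product": 0.1},
    "thermal_separation": {"product": 0.6, "reaction": 0.2,
                           "thermal_separation": 0.2},
}

def sample_flowsheet(rng, max_steps=10):
    """Sample a chain of sub-processes until termination or max_steps."""
    state, chain = "feed", ["feed"]
    for _ in range(max_steps):
        if state == "product":
            break
        options = TRANSITIONS[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        chain.append(state)
    return chain

rng = random.Random(42)
print(sample_flowsheet(rng))
```

In the actual generation algorithm, each sampled sub-process is replaced by a stored P&ID pattern with its heuristic control structure; the sketch only shows the state-to-state sampling logic.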
The resulting process graphs with control structure are automatically converted to SFILES 2.0 using our graph-to-SFILES 2.0 algorithm [36]. In a subsequent step, the SFILES 2.0 with control structure are converted to SFILES 2.0 without control structure by removing all control instruments (abbreviated C) with their corresponding letter code in braces, as well as the signal connections identifiable by an underscore before the number #. Finally, the generated pre-training dataset consists of process diagrams without control structure (input data) and process diagrams with control structure (output data). Table 1 summarizes the number of training/validation/test samples for model pre-training and key properties of the dataset. Besides the number of samples, Table 1 shows the average number of nodes n_nodes, the standard deviation of the number of nodes σ(n_nodes), and the vocabulary size.
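The removal of the control structure described above can be sketched with two regular expressions; the example string and the exact token patterns are hypothetical, constructed from the notation description in Section 2.3 rather than taken from the actual implementation [36]:

```python
import re

def remove_control_structure(sfiles):
    """Strip control instruments "(C){...}" and signal connections
    ("_#" and "<_#") from an SFILES 2.0 string (sketch)."""
    without_instruments = re.sub(r"\(C\)\{[^}]*\}", "", sfiles)
    without_signals = re.sub(r"<?_\d+", "", without_instruments)
    return without_signals

# Hypothetical string: a reactor with a temperature controller acting on a valve
pid = "(raw)(hex){1}(r)(C){TC}_1(v)<_1(prod)"
print(remove_control_structure(pid))
```

Applying this to every generated P&ID yields the paired input (no control structure) and output (with control structure) samples for pre-training.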

Real data for fine-tuning
We collected 312 P&ID-like images from publicly available sources, including the Google and Bing image search engines, and extracted process diagrams from scientific literature using data mining [38]. After the manual selection of process diagrams containing control structure, automatic object detection and path exploration are performed using our flowsheet digitization algorithm [39]. Correcting faulty nodes and edges, adding the letter codes of control units, and adding the connectivity of the unit operations and control structures is performed using LabelGraph, our custom extension to LabelImg [40]. The resulting process graphs are converted to SFILES 2.0 using our code [36]. Then, all control structures are removed to build a dataset consisting of SFILES 2.0 without control structure as our input data and SFILES 2.0 with control structures as our output data. Key statistics of the dataset are denoted in Table 1. The table shows that the standard deviation of the number of nodes in the real data (28) is significantly higher than in the generated data (20), while the average number of nodes is smaller in the real data. This indicates that the sizes of the process diagrams vary more strongly in the real dataset. The table also highlights a significantly higher vocabulary size of the real dataset (390) compared to the generated dataset (113). The reason for this is mainly a diversity of additional, new letter codes, but also other new unit operations, which are not present in the generated data.

Data augmentation
Data augmentation methods are commonly applied to datasets to increase their size without the effort of manual labeling and to improve the robustness of machine learning models. In computer vision, images are rotated, cropped, or distorted to obtain multiple instances of the original image, which are, from the computer's point of view, completely different. In the field of NLP, data augmentation is more difficult since the meaning of the sentence has to be preserved. NLP data augmentation techniques include, e.g., synonym replacement, back-translation, and random insertion, deletion, or swapping of words [41,42,43].
To augment the process diagram datasets, we modify the branching decision in the SFILES 2.0 generation algorithm to create different SFILES 2.0 strings representing the same process diagram [44]. This procedure is motivated by significant performance advances observed when using augmented SMILES in neural networks [17,45,46]. When generating augmented (non-canonical) SFILES 2.0, the branching decision is made randomly, whereas for canonical SFILES 2.0 the branching decision is predetermined by assigning every node of the graph a unique rank. The resulting augmented SFILES 2.0 is grammatically correct and contains the same information as the canonical SFILES 2.0 and thus describes the same process flowsheet. During augmentation, only the uniqueness of the SFILES 2.0 representation is lost. For the augmented model training runs, we roughly doubled the training data by generating a second SFILES 2.0 for every PFD in the input dataset. As an example, the PFD corresponding to Figure 3 is represented by the following canonical SFILES 2.0, which can be augmented to
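The idea of randomizing the branching decision can be sketched with a toy graph traversal. The graph, node names, and bracketing below are simplified illustrations of the SFILES 2.0 branching convention, not the actual generation algorithm [36,44]:

```python
import random

ADJ = {  # hypothetical flowsheet graph: a splitter with two product branches
    "feed": ["splt"],
    "splt": ["prod-1", "prod-2"],
    "prod-1": [],
    "prod-2": [],
}

def traverse(node, rng=None):
    """Emit a bracketed traversal string; with an rng, the branch order is
    randomized (augmented form), otherwise the fixed order is canonical."""
    succ = list(ADJ[node])
    if rng is not None:
        rng.shuffle(succ)
    parts = [f"({node})"]
    for i, child in enumerate(succ):
        sub = traverse(child, rng)
        # The branch noted last at a branching point is written without brackets
        parts.append(sub if i == len(succ) - 1 else f"[{sub}]")
    return "".join(parts)

canonical = traverse("feed")
augmented = traverse("feed", random.Random(1))
print(canonical)
print(augmented)
```

Both strings describe the same graph; only the order in which the branches are written differs, which is exactly the non-uniqueness exploited for augmentation.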

P&ID prediction model
In the following section, we provide an overview of the general procedure to predict the control structure of P&IDs utilizing a sequence-to-sequence transformer model. In Section 4.2, we describe the tokenizer that enables the model to process the SFILES 2.0 strings. Thereafter, key parameters of the utilized transformer architecture are briefly summarized in Section 4.3.

Overview
Figure 6a presents an overview of the P&ID prediction model, which is described in the following. Firstly, the PFD, which is subject to the development of a control structure, is converted to the corresponding SFILES 2.0 string as described in Section 2.3 (Step 1). Then, the SFILES 2.0 string is split into chunks of text using the SFILES 2.0 tokenizer as explained in Section 4.2 (Step 2). After converting the tokenized string to an input embedding and adding a positional encoding, the encoder stack computes a numerical embedding of the input string (Step 3). In Step 4, the decoder stack is initially prompted with a start-of-sequence token. In combination with the numerical embedding of the input sequence produced by the encoder, the decoder stack predicts the next token of the output sequence. The predicted token is then added to the preceding tokens of the output sequence and the decoder is again prompted to predict the next token (Step 5). This auto-regressive prediction of tokens continues until an end-of-sequence token terminates the prediction process (Step 6). Lastly, the resulting SFILES 2.0 string is converted to its corresponding graph representation, the P&ID. Eventually, this procedure could be implemented in CAD software packages to automatically generate the control structure of a drawn PFD, as depicted in Figure 6b.

Tokenization
Tokenizers are generally used in NLP to split text sequences into pieces that can be processed by the language model. The aim is to compress as many words of a language as possible into a fixed vocabulary while preserving the meaning of the words. Using the vocabulary, tokenizers convert the input sequence into a numerical vector, which can be processed by the NLP model. Different tokenization algorithms have been developed according to different languages and intended use cases. The most commonly used tokenization algorithms comprise word- and subword-based tokenizers, which split the text into words or parts of words and automatically build their vocabulary. Examples of popular subword-based tokenizers include Byte-pair encoding (BPE) [47] and SentencePiece [48].
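A tokenizer for SFILES 2.0 strings can be sketched with a single regular expression over the token classes described in Section 2.3. The pattern and the example string are assumptions for illustration; the tokenizer actually used by the model may split the strings differently:

```python
import re

TOKEN_PATTERN = re.compile(
    r"\(.*?\)"    # unit operations and control units, e.g. "(hex)", "(C)"
    r"|\{.*?\}"   # letter codes / equipment numbering, e.g. "{TC}", "{1}"
    r"|<?_?\d+"   # recycle and signal markers, e.g. "1", "<1", "_1", "<_1"
    r"|n\|"       # independent-stream separator
    r"|[\[\]]"    # branching brackets
)

def tokenize(sfiles):
    """Split an SFILES 2.0 string into tokens for the NLP model (sketch)."""
    return TOKEN_PATTERN.findall(sfiles)

print(tokenize("(raw)(hex){1}(r)[(prod)](v)1(mix)<1"))
```

Because the vocabulary of such tokens is small and closed, a simple word-level tokenizer of this kind suffices, in contrast to the subword tokenizers needed for natural language.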
Transformer model

We use the T5 transformer model [11], a state-of-the-art model easily accessible through Hugging Face, casting the control structure prediction as a translation task. Therefore, the employed model is a sequence-to-sequence model with an encoder-decoder structure as explained in Section 2.1. The T5 model is in large parts equivalent to the original transformer architecture proposed by Vaswani et al. [12]. Modifications include removing the bias from the layer normalization, placing the layer normalization outside the residual connections, and applying a different positional encoding [11]. Since the SFILES 2.0 vocabulary is limited to a few hundred entries, we utilize the T5-small version with originally about 60 million parameters. The T5-small model has an embedding size of 512, utilizes an 8-headed attention mechanism, and consists of six encoder and six decoder layers. Preliminary tests on a generated SFILES 2.0 dataset with around 10,000 samples indicate that an even smaller architecture may be sufficient and advantageous. For this reason, we further decrease the model size of T5-small by reducing the embedding size to 128 and the number of encoder and decoder layers to two each. In summary, our model comprises roughly 7.9 million trainable parameters. During model training, early stopping is utilized to prevent overfitting and unnecessarily long training runs. Evaluation of the model is performed by generating predictions with beam search as the decoding strategy, as described in Section 2.1. The beam width is set to five, and the five most probable predictions are returned to the user as recommendations for possible control structures of the provided PFD. A constrained beam search could be implemented to prevent the model from predicting unit operations that are not present in the PFD. However, such a constraint is not applied in the following experiments.

Results and discussion
This section summarizes the training procedure for pre-training and fine-tuning the P&ID prediction model. Thereafter, the model is evaluated based on the top-k accuracy metric.

Model training
We perform model pre-training with different generated training set sizes as denoted in Table 2. Additionally, independent validation and test sets are generated with 1,000 samples each. During pre-training, we use a learning rate of 3 × 10^-4 and a batch size of 32. Model evaluation is performed depending on the dataset size: every 500 steps for the training datasets containing 10,000 and 100,000 samples, every 25 steps for the dataset with 1,000 samples, and every 5 steps for the dataset with 100 samples. Early stopping is applied with a patience of 10 steps to prevent overfitting.
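The early stopping criterion used here can be sketched as a simple patience counter over the logged validation losses (the loss values below are invented for illustration):

```python
def train_with_early_stopping(val_losses, patience):
    """Stop when the validation loss has not improved for `patience`
    consecutive evaluations; return the index of the best checkpoint."""
    best_loss, best_step, waited = float("inf"), 0, 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step, waited = loss, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # early stop: no improvement for `patience` evaluations
    return best_step

# Validation losses logged at every evaluation; improvement stalls after step 2
print(train_with_early_stopping([0.9, 0.6, 0.5, 0.55, 0.56, 0.57, 0.58],
                                patience=3))
```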
Subsequently, we fine-tune the pre-trained model on real P&IDs, splitting the dataset into a train (80%), validation (10%), and test (10%) set. Model fine-tuning is performed with a reduced learning rate of 0.5 × 10^-4 and a batch size of 2. We evaluate the model every 20 steps and apply early stopping with a patience of 40 steps.
Figure 7a shows exemplary training and validation loss curves during model pre-training with a dataset size of 10,000 generated P&IDs. The first few epochs exhibit a steep decrease of both training and validation loss, whereupon the losses in the subsequent epochs asymptotically approach a constant value. The gap between training and validation loss is small, indicating a small generalization error, which is likely due to the limited variance in the generated dataset. Additionally, the samples of the training and validation sets are drawn from the same probability distribution and thus form a representative validation set. As indicated in Table 1, the real data shows higher variation in the number of nodes and contains additional unit operations and letter codes in the control structures, resulting in an extended vocabulary. Along with the small dataset size, this means the validation set is likely not representative. The experiments with different dataset sizes resulted in qualitatively similar loss curves during pre-training and fine-tuning.

Model evaluation
The model performance after pre-training on different generated dataset sizes is evaluated based on the top-k accuracy. For this, the top-5 predictions are determined with beam search decoding. A prediction is counted as correct if the target P&ID is present in the top-k predictions of the model. The results, presented in Table 2, show that increasing the dataset size significantly improves the model performance. It is evident that a dataset size of 100 or 1,000 samples is not sufficient for pre-training the P&ID prediction model. With 10,000 generated process diagrams, we already reach a top-5 accuracy of roughly 75% on the test set. The top-5 accuracy increases up to 89.2% on the test set when pre-training with 100,000 samples. We therefore conclude that the P&ID prediction model learns the grammatical structure of SFILES 2.0 and correctly gives recommendations for the control structure of unknown PFDs by learning the patterns present in the training data. In addition, the results indicate that SFILES 2.0 data augmentation has positive effects on the model performance. Especially on the dataset with 10,000 samples, a significant increase in the top-1 accuracy is observed after augmentation.
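The top-k accuracy metric used above can be made precise with a short sketch (the prediction and target strings are toy placeholders for SFILES 2.0 strings):

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of samples whose ground-truth P&ID appears among the
    top-k ranked beam search predictions."""
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:k])
    return hits / len(targets)

# Beam search returns five ranked predictions per sample (toy strings)
preds = [["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]]
targets = ["c", "q"]
print(top_k_accuracy(preds, targets, k=5))  # "c" is found, "q" is not: 0.5
print(top_k_accuracy(preds, targets, k=1))
```

Note that a prediction only counts as a hit if the generated string matches the target exactly, so any invalid or slightly altered SFILES 2.0 lowers the metric.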
Since valves are often omitted in PFDs, an additional pre-training run is performed with 10,000 training samples in which the entire control structure and all valves are removed from the input dataset. Thus, the model learns to predict not only the control structures but also the valves. The results, denoted in Table 2, indicate that it is significantly more difficult for the model to predict correct control structures in this setting. The top-1 accuracy decreases from 37.7% (10,000 input samples with valves) to 17.8% (10,000 input samples without valves). However, this demonstrates that the model is also capable of predicting correct valve positions, which are not necessarily present in the PFDs. In a first experiment, we trained the P&ID prediction model directly on 250 real P&IDs. This approach did not yield useful results, as the dataset of 250 real P&IDs is apparently not sufficient to train a transformer-based NLP model. In a second experiment, we applied a transfer learning method: we fine-tuned the P&ID prediction model with real P&IDs using checkpoints obtained from pre-training with generated data. Still, the results after fine-tuning revealed a top-5 accuracy of 0% on the test set of real P&IDs. This result is not usable for industry applications but is consistent with the results from pre-training. In particular, the pre-training on a small number of generated P&IDs indicates that 100 or even 1,000 training samples are not sufficient for reasonable results (cf. Table 2). The pre-training results highlight that a sufficiently large number (here, 10,000) of training P&IDs is necessary to enable the model to learn patterns in the provided data.
Error sources and difficulties for the P&ID prediction model stem not only from the small size of the real dataset but also from its composition. The P&IDs are derived from scientific literature and publicly available sources; they represent laboratory setups as well as chemical plants and fictive examples and may contain errors, incomplete control structures, and wrong or non-standardized letter codes. In addition, the real dataset, as described in Section 3.2, contains very heterogeneous and generally more complex P&IDs. In combination with the small size of the dataset, this leads to errors in the model predictions, including added or missing unit operations, invalid SFILES 2.0, and unconnected material recycles or signal connections. These errors could be partly mitigated by implementing a constrained beam search algorithm, which sets the probabilities of unit operations not present in the input sequence to zero and forces the model to add only the control structure and valves to the output sequence. Nevertheless, since at least one P&ID exists for every section of a chemical plant, we believe that enough data is available in the proprietary domain to train our P&ID prediction model such that it makes no arbitrary changes to the PFD and predicts correct control schemes.
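The proposed constrained beam search could be prototyped as a mask applied to the next-token distribution at each decoding step: unit-operation tokens that do not occur in the input sequence receive probability zero, so the decoder can only reuse input units and add control elements. The vocabulary split below is a made-up example, not the model's real vocabulary.

```python
def constrain_probs(probs, vocab, input_tokens, control_tokens):
    """Zero out probabilities of unit-operation tokens absent from the
    input, then renormalize. Control and valve tokens stay allowed."""
    allowed = set(input_tokens) | set(control_tokens)
    masked = [p if tok in allowed else 0.0 for p, tok in zip(probs, vocab)]
    total = sum(masked)
    return [p / total for p in masked] if total > 0 else masked

# Hypothetical vocabulary and a uniform next-token distribution:
vocab = ["(hex)", "(dist)", "(pump)", "(C){TC}", "(v)"]
probs = [0.2, 0.2, 0.2, 0.2, 0.2]
out = constrain_probs(probs, vocab, ["(hex)", "(dist)"], ["(C){TC}", "(v)"])
# "(pump)" is absent from the input, so its probability is forced to zero
```

In a full implementation this mask would be applied to the decoder logits inside each beam-search step rather than to a standalone probability list.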

Illustrative example
This section illustrates the model predictions on one representative sample taken from the independent test set. For this illustrative example, we use the model that has been trained on 10,000 training samples without data augmentation and fine-tuning. The model is prompted with a PFD (colored black in Figure 8) of the test dataset, as denoted in the following SFILES 2.0 string: Using beam search decoding, the model predicts the following five syntactically correct SFILES 2.0. These SFILES 2.0 contain the input PFD colored in black and the predicted five most likely control structures illustrated in blue: Figure 8 illustrates the five model predictions. The PFD, colored black in Figure 8, contains two feed pre-heaters, a mixing point of two material streams, and a distillation column. The model predicts a temperature-dependent control of the utility stream for both feed pre-heaters. Mixing of the two raw material streams is, according to the model, most likely done with a flow ratio control. Furthermore, the model provides correct predictions of four different distillation column control schemes, which are included in the seven column control structures used to generate the data. The first prediction (Figure 8a) corresponds to the ground truth for the PFD fed to the model as input.
Apart from the correct predictions, Figure 8 illustrates limitations and errors of the P&ID prediction model. In Figure 8d, the model inserts a flow transmitter and fails to predict a corresponding signal connection. In addition, the mixing of the material flows upstream of the distillation column could be problematic from a control perspective, as flow control is proposed here both before and after mixing. This problem arises from model training with generated data, which is synthesized by adding small control patterns to a final P&ID. The addition of the utilized control patterns may not result in a meaningful, correct control architecture, and furthermore, no long-range dependencies are considered in the data generation procedure. For the model trained without any valve in the model input, the third prediction, as illustrated in Figure 9, represents the ground truth. This shows that our model also has the potential to learn the positioning of valves in combination with the prediction of the control structure. Overall, our results show great potential for automatically predicting P&IDs from PFDs. However, the results also demonstrate several current limitations that need to be overcome for industry applications. The current model learns only from the process topology and lacks additional information about the context (e.g., operating conditions, reactants, products, safety measures, and sizing of the equipment). This severely limits the model performance. Also, the current model outputs an SFILES 2.0 string. The integration of the model in CAD software could enable automatic P&ID drawing and enhance the user experience.

Conclusion
Predicting the control structure of P&IDs with machine learning models is a promising strategy to accelerate the development of chemical processes. We propose a novel method of casting the prediction task as a translation task and leveraging the transformer architecture from the field of NLP. To apply NLP techniques, we represent the graph-based process diagrams in the text-based SFILES 2.0 notation. We successfully trained a fully data-driven sequence-to-sequence model to predict the control structure of generated chemical processes without relying on handcrafted rules. Experiments on 312 real P&IDs indicate that larger datasets are necessary for reasonable results.
Future work should focus on the acquisition of a larger dataset of real P&IDs, which can be used to fine-tune our model and leverage the possible advantages of transfer learning. Additionally, the context of the chemical process, such as operating conditions, basic control structures already present in the PFD, or stream information, may be included in advanced models to refine the prediction of the control structure. Besides the prediction of the control structure, extensions of the P&ID prediction model could include, e.g., pipe classes or valve types. Moreover, validity checks may be included to further increase the accuracy of the model predictions. Ultimately, our model should not be seen as an alternative to the control engineer or existing rule-based systems. Rather, we envision a combination of the algorithm with other process development methods to assist the engineer with recommendations, reduce the number of manual tasks, and generally make process development more efficient.

Acknowledgements
This publication is part of the project "ChemEng KG -The Chemical Engineering Knowledge Graph" with project number 203.001.107 of the research program "Open Science (OS) Fund 2020/2021" which is (partly) financed by the Dutch Research Council (NWO).

Figure 3 :
Figure 3: Exemplary chemical process diagram with branching, recycle stream, control units and different mass trains

Figure 4 :
Figure 4: Graph representation of the process diagram in Figure 3

Figure 6 :
Figure 6: Overview of the control structure prediction with the transformer model. (a) Conversion of the PFD to SFILES 2.0 (1). Processing of the input SFILES 2.0 with the transformer model to predict the control structure (2-5). Conversion of the output SFILES 2.0 to the PFD including the corresponding control structure (6-7). (b) Example control structure prediction

Figure 7b depicts the training and validation loss curves during model fine-tuning. Compared to Figure 7a, a larger gap between the training and validation loss curves and generally higher fluctuations are observed. This behavior is most likely related to the training on real P&IDs, which generally exhibit a higher complexity than the generated examples. As indicated in Table 1, the real data shows higher variation in the number of nodes and contains additional unit operations and letter codes in the control structures, resulting in an extended vocabulary size. Along with the small dataset size, the validation set is likely not representative. The experiments with different dataset sizes during pre-training resulted in qualitatively similar loss curves during pre-training and fine-tuning.

Figure 7 :
Figure 7: Training and validation loss curve during (a) pre-training with 10,000 training samples and (b) fine-tuning

Figure 9 :
Figure 9: Control structure prediction (in blue) of the model prompted with the PFD (colored black) as input

Table 1 :
Dataset properties and training (tr), validation (val), test (te) splits used for the experiments

Table 2 :
Top-k accuracy of the pre-trained model on the generated test set