Flowsheet synthesis through hierarchical reinforcement learning and graph neural networks

Process synthesis is experiencing a disruptive transformation accelerated by digitization and artificial intelligence. We propose a reinforcement learning algorithm for chemical process design based on a state-of-the-art actor-critic logic. Our proposed algorithm represents chemical processes as graphs and uses graph convolutional neural networks to learn from process graphs. In particular, the graph neural networks are implemented within the agent architecture to process the states and make decisions. Moreover, we implement a hierarchical and hybrid decision-making process to generate flowsheets, where unit operations are placed iteratively as discrete decisions and corresponding design variables are selected as continuous decisions. We demonstrate the potential of our method to design economically viable flowsheets in an illustrative case study comprising equilibrium reactions, azeotropic separation, and recycles. The results show quick learning in discrete, continuous, and hybrid action spaces. Due to the flexible architecture of the proposed reinforcement learning agent, the method is well suited to be extended to large action-state spaces and an interface to process simulators in future research.


Introduction
The chemical industry is approaching a disruptive transformation towards a more sustainable and circular future [1-3]. As a major contributor to global emissions, the chemical industry must change profoundly and faces a paradigm shift [1]. This also requires rethinking the conceptualization of novel processes [2,4]. Simultaneously, innovations are driven by emerging digital technologies. Digitization, and in particular artificial intelligence (AI), offers new possibilities for process design and therefore has the potential to contribute to the transformation of chemical engineering [1,3,5].
In the last decade, reinforcement learning (RL) has demonstrated its potential to solve complex decision-making problems, e.g., by showing human-like or even superhuman performance in a large variety of game applications [6-8]. RL is a subcategory of machine learning (ML) in which an agent learns to interact with an environment by trial and error [9]. Especially since 2016, when DeepMind's AlphaGo [10] succeeded against a world-class player in the game Go, RL has attracted great attention. More recently, RL agents have competed successfully with top-tier human players even in real-time strategy video games such as StarCraft II [11] and Dota 2 [12].
The accomplishments of RL in gaming have initiated significant developments in other research fields, including chemistry and chemical engineering. In process systems engineering, RL has been mainly applied to scheduling [13,14] and process control [15-19]. After first appearances of RL for process control in the early 1990s [15], the development was pushed by the rise of deep RL for continuous control in games [20] and physical tasks [21]. Spielberg et al. [16] first transferred deep RL to chemical process control. Recent works considered the satisfaction of joint chance constraints [17] and the integration of process control into process design tasks [18,19] via RL.
In contrast to continuous process control tasks, RL in molecule design is characterized by discrete decisions, such as adding or removing atoms. Several methods use RL for the design of molecules with desired properties [22-26]. First applications generate simplified molecular-input line-entry system (SMILES) strings using RL agents with pre-trained neural networks [23,26]. Zhou et al. [24] introduced a method solely based on RL, thereby ensuring chemical validity. Recently, RL-based molecule design has been further enhanced in terms of exploration strategies [27] or by combining RL with orientation simulations [28]. In another approach, You et al. [22] introduced a graph convolutional policy network (GCPN) that represents molecules as graphs. It allows using graph neural networks (GNNs) to approximate the policy of the RL agent and to learn directly on the molecular graph. Beyond RL, using GNNs on molecule graphs to predict molecule properties [29-32] has also shown promising results. For example, Schweidtmann et al. [29] achieved competitive results for fuel property prediction by concatenating the output of a GNN into a molecule fingerprint and further passing it through a multi-layer perceptron (MLP).
Graph representation and RL are also applied in other engineering fields. For example, Ororbia and Warn [33] represent design configurations of planar trusses as graphs in an RL optimization task.
Recently, important first steps have been made towards using RL to synthesize novel process flowsheets [34-39]. Midgley [34] introduced the "Distillation Gym", an environment in which distillation trains for non-azeotropic mixtures are generated by a soft actor-critic RL agent and simulated in the open-source process simulator COCO. The agent first decides whether to add a new distillation column to the intermediate flowsheet and subsequently selects continuous operating conditions. In an alternative approach to generate process flowsheets, Khan and Lapkin [35] presented a value-based agent that chooses the next action by assessing its value, based on previous experience. The agent operates within a hybrid action space, i.e., it makes discrete and continuous decisions. In a recent publication, Khan and Lapkin [40] introduced a hierarchical RL approach to process design, capable of designing more advanced process flowsheets, also including recycles. A higher-level agent constructs process sections by choosing subobjectives of the process, such as maximizing the yield. Then, a lower-level agent operates within these sections and chooses unit types and discretized parametric control variables that define unit conditions. Due to the discretization, the agent operates only in a discrete action space. As another approach to synthesize flowsheets with RL, Göttl et al. [36] developed a turn-based two-player-game environment called "SynGameZero". The interpretation of flowsheeting as a two-player game allowed them to reuse an established tree search RL algorithm from DeepMind [8]. Recently, Göttl et al. [37] enhanced their work by allowing for recycles and utilizing convolutional neural networks (CNNs) for processing large flowsheet matrices. Additionally, the company Intemic [38] has recently developed a "flowsheet copilot" that generates flowsheets iteratively, embedded in a one-player game. Intemic offers a web front-end in which raw materials and desired products can be specified. Then, an RL agent selects unit operations as discrete decisions, using the economic value of the resulting process as the objective. Furthermore, Plathottam et al. [39] introduced an RL agent that optimizes a solvent extraction process by selecting discrete and continuous design variables within predefined flowsheets.
One major gap in the previous literature on RL for process synthesis is the state representation of flowsheets. We believe that a meaningful information representation is key to enabling breakthroughs of AI in chemical engineering [5]. Previous works represent flowsheets as matrices comprising thermodynamic stream data, design specifications, and topological information [37]. However, computer science research has shown that passing such matrices through CNNs is limiting, as CNNs can only operate on fixed grid topologies, thereby exploiting spatial but not geometrical features [41]. In contrast, graph convolutional neural networks (GCNs) handle differently sized and ordered neighborhoods [42], with the topology becoming part of the network's input [43]. Since flowsheets are naturally represented as graphs with varying size and order of neighborhoods, GCNs can take their topological information into account. Another gap in the literature concerns the combination of multiple unit operation types, recycle streams, and a larger, hybrid action space. While previous works proposed these promising techniques in individual contributions [34-40], they have not yet been combined into a unified framework.
In this contribution, we represent flowsheets as graphs consisting of unit operations as nodes and streams as edges (cf. [44,45]). The developed agent architecture features a flowsheet fingerprint, which is learned by processing flowsheet graphs in GNNs. To this end, proximal policy optimization (PPO) [46] is deployed with modifications to learn directly on graphs and to allow for hierarchical decisions. In addition, we combine a hybrid action space, hierarchical actor-critic RL, and graph generation in a unified framework.

Reinforcement learning for process synthesis
In this section, we introduce the methodology and the architecture of the proposed method. To apply RL to process synthesis, the problem is first formulated as a Markov decision process (MDP), which is defined by the tuple M = {S, A, T, R}. An MDP consists of states s ∈ S, actions a ∈ A, a transition model T : S × A → S, and a reward function R [9]. In the considered problem, states are represented by flowsheet graphs, while actions comprise discrete and continuous decisions. More specifically, the discrete decisions consist of selecting a new unit operation as well as the location where it is added to the intermediate flowsheet. The continuous decisions define one or several continuous design variables per unit operation. For the environment, we implemented simple functions in Python to simulate the considered flowsheet. Finally, a reward is calculated and returned to the agent.
While most RL methods can be divided into value-based and policy-based approaches, actor-critic RL takes advantage of both concepts [9]. In contrast to value-based RL methods, which cannot be easily adapted to continuous actions [21,47], actor-critic approaches can learn policies for both discrete and continuous action spaces and are thus also suitable for hybrid tasks [48]. Consequently, several recent state-of-the-art policy optimization methods propose an actor-critic setup [21,46-50]. As shown in Figure 1, actor-critic agents consist of a critic that estimates the value function and an actor that selects actions by approximating the policy [9].
The RL framework presented in this work is derived from the actor-critic PPO algorithm by OpenAI [46]. In PPO, the objective function is clipped to prevent a collapse of the agent's performance during training. To favor exploration, an entropy term [51] is added to the loss function. Additionally, the generalized advantage estimate Â [52] is used for updating the networks.
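The three ingredients named above (clipped objective, entropy bonus, generalized advantage estimation) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation; the function names and the default values of `gamma`, `lam`, `eps`, and `c_ent` are assumptions chosen for the example.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one finished episode.

    rewards[t] is the reward received after step t and values[t] is the
    critic's estimate for the state at step t. The value after the
    terminal step is taken as zero. gamma and lam are assumed defaults.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), 0.0)
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, entropy, eps=0.2, c_ent=0.01):
    """Clipped PPO surrogate with an entropy bonus, returned as a loss."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # Maximizing surrogate plus entropy equals minimizing its negative
    return -(surrogate.mean() + c_ent * np.mean(entropy))
```

With a reward only at the end of the episode, as in the flowsheet setting, the advantage of every earlier step is driven by that final reward and by the critic's value estimates along the way.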

State representation
The main feature of the proposed method is the representation of states by directed flowsheet graphs. This representation allows us to process the states in GNNs, thereby taking topological information into account.
Figure 2 demonstrates the graph representation of flowsheets. Feeds, products, and unit operations are represented by nodes, which store the type of unit operation and its design variables. The edges hold thermodynamic information about process streams, such as temperature, molar flow, and molar fractions.
Intermediate flowsheets feature nodes of the type "undefined". Whenever a new unit operation is added to the flowsheet, the resulting open streams end in such "undefined" nodes. In subsequent steps, they represent possible locations for placing new unit operations. Consequently, adding a new unit operation means replacing an "undefined" node with a defined one.
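The graph representation and the role of "undefined" nodes can be illustrated with a small data structure. The sketch below is hypothetical: the class `FlowsheetGraph`, its method names, and the dictionary keys for stream data are assumptions made for illustration and are not taken from the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class FlowsheetGraph:
    # node id -> {"type": unit type, "design": design variables}
    nodes: dict = field(default_factory=dict)
    # (source id, target id) -> stream data (temperature, flow, fractions)
    edges: dict = field(default_factory=dict)
    _next_id: int = 0

    def add_node(self, unit_type, design=None):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"type": unit_type, "design": design}
        return node_id

    def add_stream(self, src, dst, stream):
        self.edges[(src, dst)] = dict(stream)

    def open_locations(self):
        """Ids of all 'undefined' nodes, i.e. open streams."""
        return [n for n, d in self.nodes.items() if d["type"] == "undefined"]

    def place_unit(self, node_id, unit_type, design, outlet_streams):
        """Replace an 'undefined' node by a unit; each outlet stream
        ends in a new 'undefined' node."""
        if self.nodes[node_id]["type"] != "undefined":
            raise ValueError("units can only be placed on 'undefined' nodes")
        self.nodes[node_id] = {"type": unit_type, "design": design}
        new_open = []
        for stream in outlet_streams:
            nxt = self.add_node("undefined")
            self.add_stream(node_id, nxt, stream)
            new_open.append(nxt)
        return new_open
```

A feed followed by one open stream would be built with `add_node("feed")`, `add_node("undefined")`, and `add_stream(...)`; calling `place_unit` on the open node then turns it into, e.g., a reactor and creates a new open location downstream.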

Figure 1: Agent-environment interaction in an actor-critic policy optimization approach for flowsheet synthesis. The agent approximates the policy and makes decisions. Meanwhile, the critic estimates the value of the environment's state using the flowsheet graph, which is used to evaluate the agent's decisions. Here, actor and critic both deploy graph convolutional neural networks.

Agent
At the heart of the proposed RL method is a hierarchical, hybrid actor-critic agent composed of multiple GNNs and MLPs. Its characteristics are introduced in the following.

Hierarchical, hybrid action space
The architecture of the agent is decisively shaped by the considered hierarchical and hybrid action space. The decision-making process is illustrated in Figure 3. Every action consists of three levels of decisions: (i) select a location, (ii) add a new unit operation, and (iii) define a continuous design variable. In the presented flowsheet, both streams leaving the column can be chosen. Then, the agent selects a unit operation. The options are to add a heat exchanger, a reactor, a column, or a recycle, or to sell the stream as a product. Finally, a continuous design variable is selected for each unit operation. This third decision depends on which unit operation was selected previously.
In the first level, the agent selects an open stream and thus the location of the next flowsheet expansion. As discussed in Section 2.1, open streams are identified by "undefined" nodes. In the second level, the agent decides which type of unit operation will be added. The agent can choose to add a distillation column, a heat exchanger, or a reactor. Furthermore, it can decide to add a recycle by introducing a splitter and a mixer into the flowsheet. As a fifth option, the agent can declare the considered stream as a product. If a unit operation is added, the third-level decision is to specify the design variables of the corresponding unit operation. Although it is possible to set multiple design variables in this step, we chose to set only one variable for simplicity. Thus, one characteristic variable for each unit operation is defined in this step while all other variables are fixed. In the current implementation of the agent, the recycle stream is always inserted into the feed stream. Whereas the first two levels are discrete decisions, the third-level decisions are continuous. This combination of discrete and continuous decisions is referred to as a hybrid action space.
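The three decision levels can be sketched as one sampling routine. The code below is an illustrative assumption, not the authors' agent: the probabilities would come from the actor networks, and the names `UNIT_TYPES`, `sample_hybrid_action`, and the callback `sample_design` are hypothetical.

```python
import numpy as np

# Second-level options as listed in the text
UNIT_TYPES = ["heat_exchanger", "reactor", "column", "recycle", "product"]

def sample_hybrid_action(location_probs, unit_probs, sample_design, rng):
    """Sample one hierarchical, hybrid action.

    location_probs: dict mapping open ("undefined") node ids to probabilities
    unit_probs:     probabilities over UNIT_TYPES
    sample_design:  callable (unit_type, rng) -> float for the third level
    """
    # Level 1 (discrete): where to extend the flowsheet
    node_ids = list(location_probs)
    p_loc = np.array([location_probs[n] for n in node_ids], dtype=float)
    location = node_ids[rng.choice(len(node_ids), p=p_loc / p_loc.sum())]

    # Level 2 (discrete): which unit operation to place there
    p_unit = np.asarray(unit_probs, dtype=float)
    unit = UNIT_TYPES[rng.choice(len(UNIT_TYPES), p=p_unit / p_unit.sum())]

    # Level 3 (continuous): one design variable, unless the stream is sold
    design = None if unit == "product" else sample_design(unit, rng)
    return location, unit, design
```

Declaring a stream as a product needs no continuous variable, which is why the third level is skipped in that branch.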

Using GNNs to generate flowsheet fingerprints
In RL, every iteration of the agent-environment interaction starts with the observation of the environment's state s, as shown in Figure 1. In other approaches [34,36,37,40], states, i.e., flowsheets, are represented by vectors or matrices and, e.g., passed through CNNs for the observation step [37]. Instead, in the approach presented here, states are represented by flowsheet graphs (cf. Section 2.1). To observe and process the information stored therein, the flowsheet graphs are passed through GCNs and encoded into a vector format called the flowsheet fingerprint. The advantage of using graphs and GCNs is that they allow operating in variable neighborhoods with different numbers and orderings of nodes, thereby taking spatial and spectral information into account [41-43]. Thus, we believe that graphs and GCNs are better suited for representing and processing the branched connectivity of flowsheets than passing matrices through CNNs.
For this step, we transfer the method introduced by Schweidtmann et al. [29], who apply GNNs to generate molecule fingerprints, to flowsheets. The approach utilizes the message passing neural network (MPNN) proposed by Gilmer et al. [30].
The overall scheme to process a flowsheet graph is displayed in Figure 4 and consists of a message passing phase and a readout phase. First, the flowsheet graph is processed through an MPNN, using a GCN with several layers to exchange messages and update node embeddings. After several steps of message passing, a pooling function, here sum-pooling, generates a vector format, the flowsheet fingerprint, in the readout phase.
Figure 4: The flowsheet graph is processed through an MPNN, using GCNs to perform message passing and update node embeddings. In the readout step, a pooling function is applied, resulting in a vector format, the flowsheet fingerprint.
In every step of the message passing phase, the node and edge features of the neighborhood of each node in the flowsheet graph are processed. To this end, GCNs are utilized to exchange and update information. The functionality of a graph convolutional layer is illustrated in Figure 5, following Schweidtmann et al. [29]. The figure visualizes the procedure to update the node embedding of the blue node. The information stored in the yellow neighboring nodes and the corresponding edges is processed and combined into a message through the message function M. Then, the considered node is updated with this message through the update function U. In each layer of a GCN, this procedure is conducted for every node of the graph.
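The message function M, the update function U, and the sum-pooling readout can be sketched with plain NumPy. This is a simplified stand-in, not the architecture used in the paper: single linear maps with tanh replace the learned functions, messages are passed only along the stream direction, and the weight shapes and the number of steps are assumptions.

```python
import numpy as np

def message_passing_step(h, edges, edge_feat, W_msg, W_upd):
    """One message passing step on a directed graph.

    h:         (n_nodes, d) node embeddings
    edges:     list of (src, dst) node index pairs
    edge_feat: (n_edges, d_e) edge features, same order as edges
    W_msg:     (d + d_e, d) weights of the message function M (assumed form)
    W_upd:     (2 * d, d) weights of the update function U (assumed form)
    """
    messages = np.zeros_like(h)
    for k, (src, dst) in enumerate(edges):
        # M: combine the neighbor embedding with the connecting edge
        m = np.tanh(np.concatenate([h[src], edge_feat[k]]) @ W_msg)
        messages[dst] += m  # aggregate incoming messages by summation
    # U: update every node from its old embedding and its message
    return np.tanh(np.concatenate([h, messages], axis=1) @ W_upd)

def flowsheet_fingerprint(h, edges, edge_feat, W_msg, W_upd, steps=3):
    """Message passing followed by a sum-pooling readout."""
    for _ in range(steps):
        h = message_passing_step(h, edges, edge_feat, W_msg, W_upd)
    return h.sum(axis=0)
```

Because the readout sums over nodes, the fingerprint has a fixed length regardless of the number of units and does not depend on the order in which nodes are stored.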

Hierarchical agent architecture
For the architecture of the agent, a structure suggested by Fan et al. [48] for hierarchical and hybrid action spaces is used. Individual MLPs are applied for each level of decisions, and one MLP is applied as a critic to evaluate the decisions.
The architecture of the actor-critic approach is illustrated in Figure 6. In the "fingerprint generation" step, the state represented by a flowsheet graph is processed into a flowsheet fingerprint through a GCN (cf. Section 2.2.2). Additionally, the updated graph resulting from the message passing phase of the fingerprint generation is passed to the "actor" step. There, the updated graph is further processed by an additional GCN. This represents the first level of the actor, which selects an open stream at which to extend the flowsheet. The method thereby takes advantage of the graph representation, in which open streams end in "undefined" nodes. In the GCN of the first-level decision, the number of node features is reduced to one (cf. related literature on node classification tasks [42]). Furthermore, all nodes that do not correspond to open streams are filtered out. The remaining node feature of each node in the last GCN layer represents its probability of being chosen as the location for adding a new unit. Then, the ID of the selected node is concatenated with the previously computed flowsheet fingerprint before it is passed on to the second- and third-level actors as input.
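The filtering of nodes in the first-level decision can be sketched as a masked normalization. The text only states that the remaining scalar node feature represents a probability; the softmax used below to normalize over the open nodes is an assumption, as is the function name.

```python
import numpy as np

def location_probabilities(node_scores, node_types):
    """Map one scalar score per node to probabilities over open streams.

    node_scores: scalar output of the last GCN layer for every node
    node_types:  node type strings; only "undefined" nodes are kept
    """
    open_ids = [i for i, t in enumerate(node_types) if t == "undefined"]
    if not open_ids:
        raise ValueError("flowsheet has no open streams left")
    scores = np.array([node_scores[i] for i in open_ids], dtype=float)
    # Softmax over the open nodes only (assumed normalization)
    scores -= scores.max()  # for numerical stability
    weights = np.exp(scores)
    probs = weights / weights.sum()
    return dict(zip(open_ids, probs))
```

Nodes that are already defined never appear in the returned dictionary, so they cannot be sampled as a location.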
The second-level actor consists of an MLP that returns probabilities for each unit operation to be chosen. For each type of unit operation, an individual MLP is set up as the actor for the third-level decision. The third-level MLPs take the concatenated vector including the flowsheet fingerprint and the ID of the selected location as input. They return two outputs, which are interpreted as the parameters α and β of a beta distribution B(α, β) [53]. Based on this distribution, a continuous decision regarding the respective design variable is made.
Figure 5: The considered node is marked in blue and its neighbors in yellow. First, the information stored in the neighboring nodes and the respective edges is processed and combined through a message function M. Then, a message is generated to update the information embedded in the considered node through the update function U. The approach and its illustration follow a method proposed by Schweidtmann et al. [29].
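The third-level decision draws a design variable from a beta distribution, which lives on the interval [0, 1]. A minimal sketch of how such a sample could be mapped to a physical range is given below; the linear scaling and the names `DESIGN_RANGES` and `sample_design_variable` are assumptions, while the ranges themselves are the ones stated in the case study.

```python
import numpy as np

# Design variable ranges taken from the case study description
DESIGN_RANGES = {
    "reactor": (0.05, 20.0),        # PFR length in m
    "heat_exchanger": (5.0, 53.8),  # water inlet temperature in °C
    "column": (0.05, 0.95),         # distillate-to-feed ratio D/F
}

def sample_design_variable(unit_type, alpha, beta, rng):
    """Sample from B(alpha, beta) and scale linearly to the unit's range."""
    low, high = DESIGN_RANGES[unit_type]
    u = rng.beta(alpha, beta)  # sample on the unit interval
    return low + (high - low) * u
```

A bounded distribution is convenient here because every sample already respects the lower and upper limit of the design variable, so no clipping is needed.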
The critic, which estimates the value of the original state, is displayed in the upper half of Figure 6. For this purpose, the flowsheet fingerprint is passed through another MLP. The resulting value is an estimate of how much reward the agent is expected to receive until the end of an episode when starting at the considered state and following the current policy [9]. In our approach, we utilize the value to compute the generalized advantage estimate Â introduced by Schulman et al. [52]. It indicates whether an action performed better or worse than expected and is used to calculate the losses of the actor's networks. By comparing the value to the actual rewards, an additional loss is computed for the critic.

Agent-environment interaction
The interaction between the environment and the hierarchical actor-critic agent is further clarified in Algorithm 1. After the environment is initialized with a feed, the flowsheet is generated in an iterative scheme. The agent first observes the current state s of the environment and chooses actions a for all three hierarchical decision levels by sampling. The agent returns the probabilities and the selected actions as well as the value v of the state.
In the next step, the actions are applied to the environment. The next state s′ is computed by simulating the extended flowsheet. Additionally, the environment checks whether any open stream is left in the flowsheet, indicating that the episode is not yet complete. Since the weights of the agent's networks are randomly initialized, early training episodes can result in very large flowsheets. Thus, the total number of units is limited to 25 as additional guidance. If a flowsheet exceeds this number, all open streams are declared as products.
Additionally, the environment calculates the reward, which depends on whether the flowsheet is completed or not. If the net cash flow is positive, the reward equals the net cash flow. If the net cash flow is negative, the reward equals the net cash flow divided by a factor of 10. This procedure is implemented in order to encourage exploration by the agent. For the intermediate steps during the synthesis, rewards of zero are given to the agent. After each iteration, the transition is stored in a batch and later used for batch learning.
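The reward rule and the unit limit described above translate directly into two small functions. The function names are hypothetical; the factor of 10 and the limit of 25 units are the values stated in the text.

```python
def shaped_reward(net_cash_flow, flowsheet_complete):
    """Reward returned by the environment after one step."""
    if not flowsheet_complete:
        return 0.0                   # intermediate steps are not rewarded
    if net_cash_flow > 0:
        return net_cash_flow         # profitable process: full net cash flow
    return net_cash_flow / 10.0      # losses are damped to encourage exploration

def must_close_flowsheet(n_units, max_units=25):
    """True if the flowsheet exceeds the unit limit, in which case all
    open streams are declared as products."""
    return n_units > max_units
```

Damping negative outcomes keeps early, unprofitable flowsheets from dominating the learning signal while the sign of the reward still tells the agent that the process loses money.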

Training
The presented method, including the flowsheet simulations, is implemented in Python 3.9. The training procedure is adapted from PPO by OpenAI [46]. It consists of multiple epochs of minibatch updates, where the minibatches are sampled from the transition tuples stored in the memory. The agent's networks are updated by gradient descent, using a loss function obtained as the weighted sum of the losses of the individual actors, their entropies, and the loss of the critic.

Case study
The proposed method is demonstrated in an illustrative case study considering the production of methyl acetate (MeOAc), a low-boiling liquid often used as a solvent [54]. In an industrial setting, MeOAc is primarily produced in reactive columns by esterification of acetic acid (HOAc) [55,56]. For illustration, we consider only simplified flowsheets that use separate units for reaction and separation.

Process simulation
For computing new states and rewards, the flowsheets generated by the agent are simulated in Python. To this end, we implemented a model for each type of unit operation that can be selected in the second-level decision. In our case study, the agent can decide to place reactors, distillation columns, and heat exchangers. Furthermore, the agent can add recycles or sell open streams as products.
Reactor. The reactor is modeled as a plug flow reactor (PFR), in which the reversible equilibrium reaction shown in Equation 1 takes place.

$\mathrm{HOAc} + \mathrm{MeOH} \rightleftharpoons \mathrm{MeOAc} + \mathrm{H_2O}$ (1)
MeOAc and its by-product water (H2O) are produced by esterification of HOAc with methanol (MeOH) in the presence of a strong acid. To calculate the composition of the process stream leaving the PFR, we formulated a boundary value problem, depending on the reaction rate, and manually implemented a fourth-order Runge-Kutta method with fixed step size as the solver. The reactor is modeled as isothermal, at the temperature of the inflowing stream. The reaction kinetics are based on Xu and Chuang [57]. The length of the PFR is specified by the agent as the continuous third-level decision within the range of 0.05 m to 20 m. The ratio of the cross-sectional area A of the PFR to the molar flow Ṅ passing through it is fixed to A/Ṅ = 0.1 m² s mol⁻¹. Notably, the length of the reactor significantly influences the conversion in the PFR. In addition, the equilibrium of the considered reaction depends on the temperature of the process stream, which thus affects the reaction rate and the conversion in the PFR. The temperature of the process stream can in turn be influenced by heat exchangers upstream of the reactor.
Heat exchanger. In the heat exchanger, heat is transferred between the process stream and a water stream. The continuous third-level decision specifies the inlet temperature of the water and thus also whether the process stream is cooled or heated. To avoid evaporation of the process stream, the inlet water temperature is chosen within the range of 5 °C to 53.8 °C, where the upper limit corresponds to the lowest possible boiling point of the considered quaternary system. The heat exchanger model computes the heat duty, the required heat transfer area, and the outlet temperature of the process stream. The model is based on a countercurrent-flow shell-and-tube heat exchanger [58]. A typical heat transfer coefficient of 568 W m⁻² K⁻¹ is used [59]. Additionally, we assume that the process stream always approaches the water stream temperature within 5 K in the heat exchanger.
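A fixed-step fourth-order Runge-Kutta integration along the reactor length can be sketched as follows. The integrator is the classical scheme; the rate expression is a toy stand-in for a reversible reaction with equimolar feed, and its constants `kf` and `kr` are hypothetical values, not the kinetics of Xu and Chuang.

```python
def rk4(f, y0, z0, z1, n_steps):
    """Classical fourth-order Runge-Kutta with fixed step size.

    Integrates dy/dz = f(z, y) from z0 to z1, starting at y(z0) = y0.
    """
    h = (z1 - z0) / n_steps
    y, z = y0, z0
    for _ in range(n_steps):
        k1 = f(z, y)
        k2 = f(z + h / 2, y + h / 2 * k1)
        k3 = f(z + h / 2, y + h / 2 * k2)
        k4 = f(z + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        z += h
    return y

def conversion_rate(z, x, kf=0.5, kr=0.1):
    """Toy rate for A + B <-> C + D with equimolar feed.

    x is the conversion; kf and kr are hypothetical lumped constants
    in 1/m and do not represent the kinetics used in the paper.
    """
    return kf * (1.0 - x) ** 2 - kr * x ** 2

# Conversion at the outlet of a 1 m and a 20 m reactor
x_short = rk4(conversion_rate, 0.0, 0.0, 1.0, 100)
x_long = rk4(conversion_rate, 0.0, 0.0, 20.0, 200)
```

In this toy model the longer reactor gives the higher conversion and approaches the equilibrium value, which mirrors the statement that the reactor length strongly influences conversion.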

Distillation column. The distillation column is deployed to separate the quaternary system MeOAc, MeOH, HOAc, and H2O. The vapor-liquid equilibrium of the system is displayed in Figure 7. It contains two binary minimum azeotropes, one between MeOAc and H2O and one between MeOAc and MeOH. As shown in Figure 7, the azeotropes split up the separation task into two distillation regimes. To simplify the problem, we follow the assumption made by Göttl et al. [37] that the distillation boundary can be approximated by the simplex spanned between both azeotropes and the fourth component, HOAc.
We implemented a shortcut column model using the ∞/∞ analysis [60-62]. The only remaining degree of freedom in the ∞/∞ model is the distillate-to-feed ratio D/F. It is set by the agent in the continuous third-level decision within a range of 0.05 to 0.95.
Recycle. The agent can also choose to recycle an open process stream back to the feed stream. The fraction of the considered stream that will be recycled is selected by the agent in the third-level decision. The recycle is modeled by adding a splitting unit and a mixing unit to the flowsheet. First, the considered stream is split into a recycle stream and a purge stream. The latter ends in a new "undefined" node. To simulate the recycle, a tear stream is initialized. Then, the Wegstein method [63] is used to solve for the recycle stream flow rate iteratively. Once the Wegstein method has converged, the tear stream is closed and the recycle stream is fed into the feed stream by the mixing unit. This method is based on the implementation of flexsolve [64].
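The Wegstein iteration for a tear stream can be illustrated with a scalar fixed-point problem x = g(x), where x is the guessed recycle flow and g(x) is the recycle flow obtained after simulating the flowsheet once. The code below is a textbook-style sketch written for this illustration; it is not the flexsolve API, and the bounds on the acceleration parameter q and the toy flowsheet function are assumptions.

```python
def wegstein(g, x0, tol=1e-8, max_iter=100, q_min=-5.0, q_max=0.0):
    """Solve x = g(x) with bounded Wegstein acceleration."""
    x_prev, g_prev = x0, g(x0)
    x = g_prev  # first step: direct substitution
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx - x) <= tol * max(1.0, abs(x)):
            return gx
        dx = x - x_prev
        if dx == 0.0:
            return gx
        s = (gx - g_prev) / dx          # secant slope of g
        q = q_min if s == 1.0 else s / (s - 1.0)
        q = min(max(q, q_min), q_max)   # keep the acceleration bounded
        x_prev, g_prev = x, gx
        x = q * x + (1.0 - q) * gx      # accelerated update
    raise RuntimeError("Wegstein iteration did not converge")

def recycle_flow(tear, feed=100.0, recycled_share=0.2):
    """Toy flowsheet pass: a fixed share of the mixed stream (feed plus
    tear stream) returns as recycle. All numbers are hypothetical."""
    return recycled_share * (feed + tear)

converged = wegstein(recycle_flow, 0.0)
```

For this linear toy problem the exact solution is a recycle flow of 25 mol/s, and the secant-based update reaches it after a single accelerated step; real flowsheets are nonlinear and need more iterations.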

Reward
The reward assesses the economic viability of the generated process, following Seider et al. [59] for calculating the annualized cost and Smith [58] for estimating unit capital costs. After completing a flowsheet by specifying all open streams as products, the agent receives a final reward. This final reward r represents an approximate net cash flow of the process within one year. If this net cash flow is negative, it is divided by a factor of 10 to encourage exploration by the agent. The economic value of incomplete flowsheets is more difficult to estimate because it may depend on future actions. Thus, a reward of zero is given after every intermediate action, since the actual value of an action can only be assessed when an episode is complete. As shown in Equation 2, the final reward includes costs for units and feeds as well as revenue for sold products.
The values of the products are estimated by an s-shaped price function P, depending on the purity of the considered streams. The pure component price C is used to compute the cost of the raw material stream. The annualized cost is computed by adding the annual utility costs U and the total capital investment I multiplied by a factor of 0.15 [59]. Furthermore, the reward is used to teach the agent to make feasible decisions. Whenever infeasible actions are selected that cause the simulation to fail, e.g., if the reactor simulation fails due to bad initial values in the solver, the episode is interrupted immediately and a negative reward of −10 million € is given. When the agent decides not to add any units at all and simply sells the feed streams, the same penalty is given to prevent the agent from falling into this trivial local optimum.
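The economic terms named above can be combined in a short sketch. Only the factor 0.15 and the penalty of −10 million € are taken from the text; the logistic form of the s-shaped price function with its `steepness` and `midpoint` parameters, the function names, and the way the terms are combined are assumptions for illustration.

```python
import math

FAILURE_PENALTY = -10e6  # €, given for failed simulations and trivial flowsheets

def s_shaped_price(purity, pure_price, steepness=20.0, midpoint=0.9):
    """Hypothetical logistic price curve: low purity sells for little,
    high purity approaches the pure component price."""
    return pure_price / (1.0 + math.exp(-steepness * (purity - midpoint)))

def annualized_cost(utility_cost, capital_investment, factor=0.15):
    """Annual utility costs U plus 0.15 times the capital investment I."""
    return utility_cost + factor * capital_investment

def net_cash_flow(product_revenue, feed_cost, utility_cost, capital_investment):
    """Approximate yearly net cash flow: revenue minus feed and unit costs."""
    return product_revenue - feed_cost - annualized_cost(
        utility_cost, capital_investment
    )
```

An s-shaped price rewards the agent only modestly for partially purified streams and strongly once the purity approaches a saleable specification, which steers it towards sharp separations.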
Notably, the considered case study is meant to facilitate illustration and the considered parameter values for prices are only approximations.

Results & discussion
In this section, we present and analyze the learning behavior of the developed agent. To investigate each part of the agent separately, the training procedure was first conducted in a discrete action space, consisting of the first and second hierarchical decision levels. Afterward, the same procedure was conducted in a continuous action space, which only includes the third decision level. Finally, all decision levels were combined into the hybrid action space. In all runs, the environment was initialized with a feed consisting of an equimolar binary mixture of MeOH and HOAc. The feed's molar flow rate was set to 100 mol s⁻¹ and its temperature to 27 °C.
The proposed learning process and the agent architecture include several hyperparameters, which are listed in Table 4 in the appendix. The selected hyperparameters are based on literature [29,30,46,65].

Flowsheet generation in a discrete action space
To investigate the agent's behavior in a discrete action space, the third-level actor was deactivated and only the first- and second-level decisions were made. Thus, in each step, the agent selected a location for a new unit operation as well as its type. Fixed values were used for the units' continuous design variables. They are displayed in Table 1. Throughout the presented case study, a constant pressure of 1 bar was assumed. The agent was trained for 10 000 episodes with the procedure described in Section 2.3.
Figure 8 shows the learning curve of the agent in the discrete action space. The displayed scores correspond to the reward, which is the estimated net cash flow of the final process. Thus, they are a measure of the economic viability of the final process.
During the first 2000 episodes, the learning curve rises almost exponentially. In this early training stage, the agent produces predominantly long flowsheets and often reaches the maximum allowed number of unit operations. However, throughout the training the agent learns that shorter flowsheets are economically more valuable. Soon, the agent mainly produces flowsheets with a positive score, meaning that the final process is economically viable. Afterward, the learning curve still rises, but only slightly. One reason for the marginal improvements could be that the agent mainly exploits its experience at this stage while still finding slightly better flowsheets through exploration.
The best flowsheet the agent generated throughout training is displayed in Figure 9. The depicted process first uses a reactor (R1) to produce MeOAc and its side product H2O from the feed (F1). Then, the resulting quaternary mixture is split up in two distillation columns. The distillate (P1) of the first column (C1) is enriched in MeOAc but also includes MeOH and H2O. The bottom product of the first column is further split up in a second column (C2) to produce a mixture of H2O and MeOH in the distillate (P2) and pure MeOH in the third product stream (P3). 90 % of the latter product is recycled and mixed with the feed stream. During the training, the agent learned, for example, that heat exchangers do not add value to the flowsheet.

Flowsheet generation in a continuous action space
The third-level actor was investigated by deactivating the first- and second-level actors and thus only including continuous decisions. To this end, the sequence of unit operations in the flowsheet was fixed, as shown in Figure 10, and only the continuous design variables defining each unit were selected by the agent. Within this structure, the agent was trained for 10 000 episodes. Similar to the findings in the discrete action space, the agent learns quickly at the beginning of the training. After the steep increase, the policy starts to converge and is almost constant after 10 000 episodes. The resulting learning curve of the continuous agent is displayed in Figure 11, showing the scores of the final flowsheets averaged over 50 episodes.
Table 2 lists the continuous design variables of the best flowsheet the agent observed throughout the training. In the heat exchanger (HEX1), the feed is slightly heated before entering the reactor. In the investigated flowsheet shown in Figure 10, the bottom product is partially recycled to the feed. Remarkably, the recycle ratio is set to zero in the depicted best flowsheet. This result indicates that a recycle does not make economic sense for the illustrative flowsheet used in this study.

Flowsheet generation in a hybrid action space
After the previous sections have shown that all three actors are able to learn separately, they are combined in this section. The hybrid agent, which combines all previously described elements, is trained for 10 000 episodes.
The resulting learning curve is displayed in Figure 12, showing the scores of the flowsheets generated during the training, averaged over 50 episodes. Despite the complexity of the hybrid problem, the agent learns fast and produces flowsheets with a positive value after approximately 1000 episodes. The best flowsheet the agent observed during training is shown in Figure 13. The continuous design variables the agent selected for this best flowsheet are shown in Table 3.
The feed (F1) is fed directly into a reactor (R1), where MeOAc and H2O are produced by esterification of HOAc with MeOH. With a length of 18.4 m, the reactor is significantly larger than in the best flowsheet generated with the continuous agent in Section 4.2, which results in a higher conversion but also higher costs. In the next step, the resulting quaternary mixture is heated in a heat exchanger (HEX1) and split up in a column (C1). In the distillate of the column (P1), MeOAc is enriched, but the distillate also includes MeOH and residues of H2O. The bottom product of the column (P2) contains HOAc and MeOH. The sequence of unit operations differs from the best flowsheet generated by the discrete agent in Section 4.1, where no heat exchanger and two columns were used. Here, the desired product MeOAc is completely in the distillate and the bottom product consists of less valuable chemicals. Thus, the agent learned that the second column does not add economic value. Before entering the column, 24 % of the process stream is recycled to the feed. In contrast to the flowsheet investigated in the continuous action space in Section 4.2, the recycle does add value to the flowsheet since it increases the total conversion in the reactor.
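The effect of a recycle on the total conversion can be made concrete with an idealized balance. The sketch below is not the reactor model of this study: it assumes a single limiting reactant, a fixed single-pass conversion X (the value 0.5 is hypothetical), and a splitter that returns a fraction r of the unreacted reactant to the reactor inlet. Under these assumptions, the overall conversion is X / (1 - r (1 - X)). The recycle is also converged by successive substitution on the recycled flow to show that both routes agree.

```python
def overall_conversion(single_pass, recycle_fraction):
    """Closed form X / (1 - r (1 - X)) for the idealized balance."""
    return single_pass / (1.0 - recycle_fraction * (1.0 - single_pass))


def converge_recycle(feed, single_pass, recycle_fraction, tol=1e-10, max_iter=1000):
    """Successive substitution on the recycled reactant flow (tear stream)."""
    recycle = 0.0
    for _ in range(max_iter):
        inlet = feed + recycle                      # mixer: fresh feed + recycle
        unreacted = (1.0 - single_pass) * inlet     # reactor outlet, reactant only
        new_recycle = recycle_fraction * unreacted  # splitter returns a fraction r
        converged = abs(new_recycle - recycle) < tol
        recycle = new_recycle
        if converged:
            break
    inlet = feed + recycle
    # Overall conversion: reactant consumed relative to fresh feed
    return single_pass * inlet / feed


# r = 0.24 as in the hybrid flowsheet; X = 0.5 is a hypothetical value
x_closed = overall_conversion(0.5, 0.24)
x_iter = converge_recycle(feed=1.0, single_pass=0.5, recycle_fraction=0.24)
```

In this simplified setting, any recycle fraction above zero raises the overall conversion above the single-pass value. In an equilibrium-limited system with product in the recycled stream, as considered in the case study, the gain will generally differ.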

Discussion
Overall, the learning curves shown in the previous sections indicate that all parts of the agent learn quickly. It is assumed, however, that the policy does not always converge towards the global optimum for the considered task since the hyperparameters have not been optimized for this first fundamental study. In future work, an extensive hyperparameter study should be conducted to investigate their influence on the learning behavior.
Compared to other approaches, the main contribution of the presented method is the representation of flowsheets as graphs and the combination of GNNs with RL. GNNs have already shown promising performance in various deep learning tasks [42]. One of their key advantages is that they are able to process the topological information of graphs [43]. Since the structural information about flowsheets is automatically captured in the graph format, GNNs can take advantage of this structure. Deriving fingerprints from graphs with GNNs has already shown promising results in the molecular domain [29,66,67]. Here, we transfer the methodology to the flowsheet domain. During the implementation and analysis of the training procedure, the graph representation of the flowsheets has proven to be convenient. The graphs generated by the agent can be visualized easily and thus immediately give an insight into the process and its meaningfulness.
An additional advantage of the approach is its flexibility. Through its hierarchical structure, the different components of the agent can be easily decoupled and new parts can be added. By using a separate MLP for each unit operation in the third-level decision, the number of continuous decisions can vary between unit operations. In the presented work, only one continuous decision is made for each unit operation, but the agent architecture allows including more decisions within this step. By allowing for more unit operations and setting more design variables, the action space and thus the complexity of the problem should be increased in future investigations.
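The graph representation discussed above can be sketched with a small data structure. This is a minimal illustration, not the implementation used in this work: the class `FlowsheetGraph` and its method names are hypothetical, and stream attributes such as composition or flow rate are omitted. Unit operations, feeds, and products are nodes, streams are edges, and open streams end in nodes of type "undefined".

```python
from dataclasses import dataclass, field


@dataclass
class FlowsheetGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source id, target id) streams

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_stream(self, source, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a stream must be existing nodes")
        self.edges.append((source, target))

    def open_streams(self):
        """Nodes still marked 'undefined', i.e. locations that can be extended."""
        return [n for n, t in self.nodes.items() if t == "undefined"]

    def neighbors(self, node_id):
        """Upstream and downstream neighbors of a node."""
        upstream = {s for s, t in self.edges if t == node_id}
        downstream = {t for s, t in self.edges if s == node_id}
        return sorted(upstream | downstream)


# Intermediate flowsheet: feed -> reactor -> column with two open outlets
g = FlowsheetGraph()
for node_id, node_type in [("F1", "feed"), ("R1", "reactor"), ("C1", "column"),
                           ("U1", "undefined"), ("U2", "undefined")]:
    g.add_node(node_id, node_type)
for source, target in [("F1", "R1"), ("R1", "C1"), ("C1", "U1"), ("C1", "U2")]:
    g.add_stream(source, target)
```

In this example, `g.open_streams()` returns the two column outlets, which are the candidate locations for the first-level decision.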
Furthermore, the reward function will require additional attention. Assigning rewards is not straightforward in the considered problem since it is hard to assess the value of an intermediate flowsheet. Still, the reward is crucial for the performance of the RL algorithm. In the presented work, the reward function is only an estimate of an economic assessment and neglects multiple cost factors of real processes. For future developments, reward shaping [68] will be an interesting aspect to investigate, as it can stabilize the training process, especially as the size of the considered problem grows.

Conclusion
We propose the first RL agent that learns from flowsheet graphs using GNNs to synthesize new processes. The deployed RL agent is hierarchical and hybrid, meaning that it takes multiple dependent discrete and continuous decisions within one step. In the proposed methodology, the agent first selects a location in an existing flowsheet and a unit operation to extend the flowsheet at the selected position. Both selections are discrete. Then, it takes a continuous decision by selecting a design variable that defines the unit operation. Naturally, each sub-decision strongly depends on the previous one. Flowsheets are represented as graphs, which allows us to utilize GNNs within the RL structure. As a result, our methodology generates economically valuable flowsheets based only on the experience of the RL agent.
In an illustrative case study considering the production of methyl acetate, the approach shows steep and mostly stable learning in discrete, continuous, and hybrid action spaces. This work is a fundamental study that demonstrates that graph-based RL is able to create meaningful flowsheets. Thus, it encourages the incorporation of AI in chemical process design.
A further advantage of the presented approach is that the proposed architecture is a good foundation for further developments such as enlarging the state-action space. Thus, the selected structure of the agent is predestined for increasing the complexity and solving more advanced problems in the future. A subsequent step following this paper should be to implement an interface to an advanced process simulator. This will tremendously increase the complexity of the problem but also allow for easier extension of the action space and more rigorous simulations. As the process simulator will need to deal with random combinations of unit operations, guaranteeing convergence will become a major challenge and including constraints is advisable.

Figure 2: Example of a flowsheet displayed as a graph. Unit operations, feeds, and products are represented as nodes, whereas streams are represented as edges.

Figure 3: Hierarchical decision levels of the agent, starting from an intermediate flowsheet. In the first level, the agent selects a location where the flowsheet will be extended. Possible locations are open streams, represented by "undefined" nodes. In the presented flowsheet, both streams leaving the column can be chosen. Then, the agent selects a unit operation. The options are to add a heat exchanger, a reactor, a column, or a recycle, or to sell the stream as a product. Finally, a continuous design variable is selected for each unit operation. This third decision depends on which unit operation was selected previously.

Figure 4: Flowsheet fingerprint generation derived from Schweidtmann et al. [29]. The flowsheet graph is processed through an MPNN, using GCNs to perform message passing and update node embeddings. In the readout step, a pooling function is applied, resulting in a vector format, the flowsheet fingerprint.

Figure 5: Update of the node embeddings during the message passing phase in a graph convolutional layer. The considered node is marked in blue and its neighbors in yellow. First, the information stored in the neighboring nodes and the respective edges is processed and combined through a message function M. Then, a message is generated to update the information embedded in the considered node through the update function U. The approach and its illustration follow a method proposed by Schweidtmann et al. [29].
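The message passing and readout steps described in Figures 4 and 5 can be sketched as follows. This is a simplified stand-in rather than the network used in this work: it assumes the widely used graph convolution rule with self-loops and symmetric degree normalization, treats streams as undirected, ignores edge features, and uses sum pooling as the readout. The weights are random and untrained, and all names are hypothetical.

```python
import numpy as np


def gcn_layer(adjacency, features, weight):
    """One graph convolution: aggregate neighbors, transform, apply ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)
    norm = a_hat / np.sqrt(np.outer(deg, deg))      # symmetric normalization
    return np.maximum(norm @ features @ weight, 0.0)


def fingerprint(adjacency, features, weights):
    """Stack graph convolutions, then pool node embeddings into one vector."""
    h = features
    for w in weights:
        h = gcn_layer(adjacency, h, w)
    return h.sum(axis=0)  # sum pooling readout (assumed)


# Flowsheet F1 -> R1 -> C1 -> {P1, P2}; node order: F1, R1, C1, P1, P2
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (2, 4)]:
    A[i, j] = A[j, i] = 1.0
# One-hot node types: feed, reactor, column, product
X = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]
fp = fingerprint(A, X, weights)
```

Because the readout sums over all nodes, the fingerprint has a fixed length regardless of the flowsheet size and does not depend on the order in which nodes are listed.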

Figure 6: Architecture of the deployed actor-critic agent. First, a GNN is used to process the graph representation of the flowsheet into a flowsheet fingerprint. While the critic estimates the value of the fingerprint in one linear MLP, the actor takes three levels of decisions. The first decision is to choose a location for expanding the flowsheet. In practice, this means selecting the ID of a node representing an open stream. The selected node ID is combined with the flowsheet fingerprint and passed through an MLP for the second-level decision of choosing a type of unit operation. Finally, a continuous design variable of the unit is chosen, using a different MLP for each unit type.
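The three decision levels of the actor can be sketched as a sampling procedure. The code below is an illustration under several assumptions and not the trained agent: each head is a single random linear layer instead of an MLP, the first level scores each open stream from its node embedding concatenated with the fingerprint, the continuous variable is sampled from a Gaussian and squashed to (0, 1), and selling a stream as a product is assumed to need no design variable. The class and variable names are hypothetical.

```python
import numpy as np

UNIT_TYPES = ["heat_exchanger", "reactor", "column", "recycle", "product"]


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


class HierarchicalActor:
    def __init__(self, fp_dim, node_dim, seed=0):
        self.rng = np.random.default_rng(seed)
        n_in = fp_dim + node_dim
        self.w_loc = self.rng.normal(0.0, 0.1, (n_in, 1))                 # level 1
        self.w_unit = self.rng.normal(0.0, 0.1, (n_in, len(UNIT_TYPES)))  # level 2
        # Level 3: one head per unit type, outputs mean and log std
        self.w_cont = {u: self.rng.normal(0.0, 0.1, (n_in, 2))
                       for u in UNIT_TYPES if u != "product"}

    def act(self, fp, open_nodes):
        """open_nodes maps node id -> node embedding of an open stream."""
        ids = list(open_nodes)
        inputs = np.stack([np.concatenate([fp, open_nodes[i]]) for i in ids])
        # Level 1: discrete choice of the location
        p_loc = softmax((inputs @ self.w_loc).ravel())
        k = self.rng.choice(len(ids), p=p_loc)
        # Level 2: discrete choice of the unit type, given the location
        p_unit = softmax(inputs[k] @ self.w_unit)
        unit = UNIT_TYPES[self.rng.choice(len(UNIT_TYPES), p=p_unit)]
        if unit == "product":
            return ids[k], unit, None
        # Level 3: continuous design variable from the unit-specific head
        mean, log_std = inputs[k] @ self.w_cont[unit]
        raw = self.rng.normal(mean, np.exp(log_std))
        return ids[k], unit, 1.0 / (1.0 + np.exp(-raw))


actor = HierarchicalActor(fp_dim=4, node_dim=3)
location, unit, design = actor.act(
    np.ones(4), {"U1": np.zeros(3), "U2": np.ones(3)})
```

Keeping one head per unit type in a dictionary reflects the flexibility of the architecture: a new unit operation only requires an additional entry, and heads may output different numbers of design variables.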

Figure 7: Vapor-liquid equilibrium in the quaternary system consisting of MeOAc, HOAc, H2O, and MeOH at 1 bar. The gray surface marks the distillation boundary spanned by the two azeotropic points and the fourth component HOAc, splitting the diagram into two distillation regimes.

Table 1:

Figure 8: Learning curve of the agent in a discrete action space over 10 000 episodes. It shows the scores of the generated flowsheets, averaged over 50 episodes. The score of each episode corresponds to the reward, which is the estimated net cash flow. An episode is a sequence of actions to generate a flowsheet, starting with a feed.

Figure 9: Best flowsheet generated by the agent in a discrete action space after training for 10 000 episodes. In a reactor (R1), MeOAc and its side product H2O are produced from the feed (F1). Then, the resulting quaternary mixture is split up in two columns (C1 and C2). Parts of the third product stream (P3) are recycled and mixed with the feed stream.

Figure 10: Fixed flowsheet structure during the training in a continuous action space. It consists of a heat exchanger (HEX1), a reactor (R1), and a column (C1). The bottom product (P2) is split up and partially recycled.

Figure 11: Learning curve of the agent in a continuous action space over 10 000 episodes. Analogously to Figure 8, it shows the scores of the generated flowsheets, averaged over 50 episodes.

Figure 12: Learning curve of the agent in a hybrid action space over 10 000 episodes. Analogously to Figure 8 and Figure 11, it shows the scores of the generated flowsheets, averaged over 50 episodes.

Figure 13: Best flowsheet generated by the agent in a hybrid action space within 10 000 training episodes. First, MeOAc and its side product H2O are produced from the feed (F1) in a reactor (R1). Then, the resulting quaternary mixture is heated in a heat exchanger (HEX1) and split up in a column (C1). Before entering the column, 24 % of the stream is split off and recycled. The first product (P1) is enriched with MeOAc but also includes MeOH and residues of H2O. The second product (P2) is a mixture of HOAc and MeOH.

Algorithm 1: Pseudocode of the agent-environment interaction.
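Since only the caption of Algorithm 1 is reproduced here, the following sketch illustrates a generic agent-environment interaction loop of the kind it refers to. It is not the original pseudocode: the environment `ToyFlowsheetEnv`, its placeholder reward, the limit of ten unit operations, and the random agent are all hypothetical, and the policy update of the actor-critic algorithm is omitted. An episode starts from a feed and ends when no open streams remain or the maximum number of unit operations is reached.

```python
import random

UNITS = ("heat_exchanger", "reactor", "column", "recycle", "product")


class ToyFlowsheetEnv:
    """Hypothetical stand-in for the flowsheet environment."""
    MAX_UNITS = 10  # assumed limit on unit operations per flowsheet

    def reset(self):
        self.open_streams = ["feed"]
        self.n_units = 0
        return self._state()

    def _state(self):
        return tuple(self.open_streams), self.n_units

    def step(self, action):
        location, unit, _design = action
        self.open_streams.remove(location)
        self.n_units += 1
        if unit == "column":                       # two outlet streams
            self.open_streams += [f"s{self.n_units}a", f"s{self.n_units}b"]
        elif unit in ("reactor", "heat_exchanger"):  # one outlet stream
            self.open_streams.append(f"s{self.n_units}")
        # "product" and "recycle" close the selected stream
        reward = 1.0 if unit == "product" else -0.1  # placeholder, not economics
        done = not self.open_streams or self.n_units >= self.MAX_UNITS
        return self._state(), reward, done


def random_agent(state, rng):
    """Samples location, unit type, and design variable uniformly at random."""
    open_streams, _ = state
    return rng.choice(open_streams), rng.choice(UNITS), rng.random()


def run_episode(env, agent, rng):
    state, score, done = env.reset(), 0.0, False
    while not done:
        action = agent(state, rng)              # three-part hybrid action
        state, reward, done = env.step(action)  # environment extends the flowsheet
        score += reward                         # episode score is the summed reward
    return score


env = ToyFlowsheetEnv()
score = run_episode(env, random_agent, random.Random(0))
```

In a learning setting, the random agent would be replaced by the hierarchical actor, and the collected transitions would be used to update actor and critic after each step or episode.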

Table 2: Continuous design variables selected by the continuous agent in the best flowsheet observed during 10 000 episodes of training.

A shorter reactor means a lower conversion but also lower costs. The column (C1) is characterized by a distillate-to-feed ratio D/F of 0.59. As a result, MeOAc is enriched in the distillate, which also contains MeOH and H2O. The bottom product is a mixture of MeOH and HOAc.

Table 3: Continuous design variables selected by the hybrid agent in the best flowsheet observed during 10 000 episodes of training.