Using process data to generate an optimal control policy via apprenticeship and reinforcement learning

Reinforcement learning (RL) is a data-driven approach to synthesizing an optimal control policy. A barrier to wide implementation of RL-based controllers is their data-hungry nature during online training and their inability to extract useful information from human operator and historical process operation data. Here, we present a two-step framework to resolve this challenge. First, we employ apprenticeship learning via inverse RL to analyze historical process data for synchronous identification of a reward function and parameterization of the control policy. This step is conducted offline. Second, the parameterization is improved online and efficiently on the ongoing process via RL within only a few iterations. Significant advantages of this framework include allowing hot-starting of RL algorithms for process optimal control and robust abstraction of existing controllers and control knowledge from data. The framework is demonstrated on three case studies, showing its potential for chemical process control.

An optimal policy satisfies the Bellman optimality equation, which is a discrete-time analogue to the continuous-time Hamilton-Jacobi-Bellman equation.3 Dynamic programming (DP) methods provide an exact solution to the Bellman optimality equation. However, such an approach assumes knowledge of the exact process dynamics. DP also becomes impractical in the high-dimensional continuous state and action spaces often observed in the process industries.4 In contrast, RL methods do not require knowledge of the exact process dynamics to learn a solution policy. Instead, RL learns from experience of the process, allowing π(·) to be recalibrated from process data as the process evolves through time.5 Furthermore, RL has shown significant industrial potential as demonstrated in a number of research works, which have explored application to the calibration of PID controllers;6 set point tracking;7 dynamic optimization of nonlinear, stochastic systems;5,8,9 de novo drug10 and protein design;11 and augmentation of the performance of various model predictive control (MPC) approaches.12,13 Indeed, the potential use of RL draws discussion of its relation to MPC in the development of APC schemes. MPC schemes require periodic recalibration, which demands expense in technical expertise and often process downtime. The data-driven nature of RL could well mitigate this. Further, the framework provided by MDPs accounts for process stochasticity in a closed-loop manner, in contrast to MPC, where decisions are based on open-loop simulation of the process model and the loop is only "closed" upon observation of the system state at the next discrete time index. Hence, inputs from an RL controller will account for disturbances, whereas those from MPC may not. This provides a theoretical basis for the benefit of RL over MPC controllers.
One set of RL algorithms is known generally as policy optimization methods. Policy optimization methods aim to learn a policy by implicitly learning the value or cost over the decision space14-16 and directly parameterizing a policy. There are a number of approaches to policy optimization, underpinned by evolutionary strategies, finite difference, and policy gradient methods.17,18 Policy optimization methods have been deployed for tasks including dynamic optimization of nonlinear stochastic processes19 and tracking problems.6 For further review of RL methods and their application within the process industries, we direct the reader to the following works.7,20 The learning process encapsulated by RL demands both time and technical investment in policy training. This is highlighted further given that RL-based controllers are currently unable to generalize well across control tasks, for example, different changes of set point, meaning policy training is typically undertaken for each task.21 As a result, implementation of RL control policies is expensive in computation and expertise. To solve this problem, this work proposes a method to reduce the time and resource investment demanded by RL by leveraging process data to learn from demonstrations provided by an existing (but unknown) control policy. The initialized RL policy is then improved by learning from the real process over a short time period, thus outperforming the existing control policy. This two-step strategy has recently been deployed in domains including autonomous helicopter flight22 and self-driving cars.23,24 To demonstrate this approach, Section 2 introduces the preliminaries and motivation, Section 3 outlines the methodology, and Section 4 exhibits the case studies.

| Policy gradients and reinforce
Policy gradient methods directly learn a policy. Through the use of artificial neural networks (ANNs) as the parameterization, the policy may be deployed naturally in either discrete or continuous action spaces through appropriate network construction.25 Policy gradient methods do not explicitly learn the value of the policy. Instead, under the policy gradient theorem, acting with respect to the policy and gaining experience of the process dynamics provides an approximation of the direction in which value increases fastest in parameter space. Learning hence proceeds through gradient ascent on the parameters of the policy, ensuring control policies of high value (or low cost) become more probable.18 One policy gradient algorithm, reinforce with baseline, approximates the direction in which the policy observes increased performance through Monte Carlo realizations of the process dynamics under the current policy parameterization. This algorithm has several advantages, such as convergence to locally optimal solutions in policy space26 and efficient exploration of the decision space without requirement for a bandit strategy or a further optimization routine for action selection, as is the case in many pure action-value methods.27 Demonstration of the method is also available.19 Therefore, it is used in this work to learn an RL parameterization of an existing control policy from process data. Despite favoring the reinforce with baseline algorithm, other RL methods capable of operating in continuous action and state spaces could be implemented, such as entropy regularized policy optimization methods,16 trust region policy optimization,14 and proximal policy optimization (PPO) methods.15

| Learning from demonstrations via apprenticeship
Learning from demonstrations encompasses an increasingly prevalent and established group of methods, which leverage data generated from an existing but unknown control policy to aid learning-based control systems. This concept is generally termed apprenticeship learning (AL). AL has been adopted in a number of complex control domains,22,24 but to our knowledge, this work is the first to propose use of the method to leverage plant data directly, and this is one of the primary contributions of this work. The concepts of AL are expressed in three main subfields: behavioral cloning (i.e., supervised learning), inverse optimal control, and inverse reinforcement learning (IRL).

FIGURE 1 Translation of the framework provided by Markov decision processes (MDPs) to process control, where the process is analogous to an environment, and the controller to an agent. x_t is the true system state at discrete time t; u_t is the control action computed by the control law at discrete time t; and R_{t+1} is the scalar feedback signal (reward) indicative of the quality of process evolution at time t + 1
This study exploits IRL built upon the framework provided by MDPs.28 MDPs express process objectives mathematically as a reward function. The reward function provides a scalar feedback signal indicative of the optimality of process evolution. IRL is concerned with the task of mathematically abstracting the reward function given process knowledge and demonstrations from an existing control policy. The IRL problem is formalized as: given observations of an existing policy over time, the sensory inputs available for determination of the originally demonstrated control law, and a model of the process, determine the reward function that best justifies the demonstrated behavior.24,29,30 IRL proceeds on the assumption that the demonstrated control actions are noisily optimal under the derived reward function.30,31 However, it should be noted that this does not necessarily imply that the policy is optimal in view of the true objectives for process control and optimization.
As such, IRL leverages process data to learn a reward function that encodes the control objectives of an existing scheme into a feedback signal. A control policy that maximizes the utility of this reward function within the MDP framework provides a parameterization of the existing control scheme. Hence, pairing IRL with RL as an MDP solver allows for synchronously learning the parameterization of an existing but unknown control policy as described in process data. The generated reward function can be compared against the process objective (if known) to suggest whether the extracted control policy is suitable for online learning. Moreover, manual modifications are always implemented during process control even if the process objective is known. These manual modifications cannot be quantified by human operators, but they can be retrieved from historical data by IRL.
Therefore, using IRL to generate a reward function is advantageous for parameterization of the optimal control policy.

| Motivation
In the following work, we demonstrate a framework for learning and optimization of chemical processes. The framework consists of two steps: offline learning, and online learning and improvement. Here, the terminology is used conversely to that common in the ML community. In this work, offline learning indicates a process of AL (via IRL) to infer control objectives from process data and to learn a corresponding parameterization of the control policy described by the data; online improvement then indicates the transfer of the learned parameterization to the real system for the purpose of further policy improvement under the true process objective. The framework enables the learning of an RL-based control policy by leveraging process data from existing control schemes (offline) and subsequently improves the learned policy parameterization via further RL (online).
The automation of offline learning and of the associated policy tuning process is a significant contribution, given the technical, computational, and data demands of RL-based policy learning.
Offline learning produces a parameterization of the existing control policy, which could be deployed directly for control. This parameterization will achieve similar performance to that expressed by the original control scheme. If necessary, the parameterization may then be transferred to the second stage of online learning for further policy improvement. It should be emphasized that leveraging process data is significant given the practical difficulties in learning an RL-based policy "from scratch".19,32 The framework also lends itself to the improvement and recalibration of the control scheme over time.

| Problem statement
The following work proceeds on the formulation of the underlying problem of process control as an MDP. The true dynamics of an MDP are described as follows:

x_{t+1} ~ p(x_{t+1} | x_t, u_t),    y_t ~ p(y_t | x_t)

where x ∈ ℝ^{n_x} is a vector of continuous variables representative of the true system state, u ∈ ℝ^{n_u} the manipulated variables (MVs), y ∈ ℝ^{n_y} the observed control variables, and t is the discrete time index.33 The process evolution between discrete time indices t and t + 1 is governed by the conditional density function p(x_{t+1} | x_t, u_t). Similarly, the observation y_t of the true state of the system x_t is governed by the conditional density p(y_t | x_t). To facilitate learning of a policy prior to transfer to the real system, approximation of the true dynamics proceeds based on state-space models and assumptions regarding process stochasticity, hence:

x_{t+1} = f(x_t, u_t, d_t),    y_t = g(x_t) + ν_t

where f(·) : ℝ^{n_x × n_u × n_d} → ℝ^{n_x} is representative of the process dynamics and d_t ∈ ℝ^{n_d} is representative of the process disturbance. The mapping g(·) : ℝ^{n_x} → ℝ^{n_y} is the state observation, which is associated with measurement noise ν_t.2
The following work deploys RL to learn a control policy from process data. The objective of RL is to minimize the expected cost of a dynamic process (or, equivalently, to maximize its value). In the following, a process trajectory, τ = (x_0, y_0, u_0, …, u_{T−1}, x_T, y_T), describes the manner in which a process evolves over a given discrete time horizon of length T. The cost or value G(τ) of the process trajectory over a finite horizon is denoted:

G(τ) = Σ_{t=1}^{T} γ^{t−1} R_t

where γ ∈ (0, 1] is a discount factor, which provides a net present value interpretation of future value, and R_t is the credit (reward) assigned to the process' evolution between time indices t − 1 and t.
However, in view of process stochasticity, the probability of observing τ adheres to a conditional density p(τ | θ) based on the control policy and process dynamics:

p(τ | θ) = ρ(x_0) ∏_{t=0}^{T−1} π(u_t | y_t, θ) p(x_{t+1} | x_t, u_t)   (3.1.6)

where ρ(x_0) is the probability density of the initial system state; π(u_t | y_t, ·) is the conditional density function descriptive of the learned policy, which is parameterized by θ ∈ ℝ^{n_θ}; and p(x_{t+1} | x_t, u_t) is the conditional density function representative of the process dynamics.
Note that the definition of a policy as a conditional density function implies that it is stochastic. This is important in the scope of the learning process associated with RL, but it does not necessarily assert the use of a stochastic policy upon deployment for control of the real system (only the mode might be used in practice). The objective of the RL problem and learning process is to find a policy π(·, θ*) that maximizes the objective J(θ), such that

θ* = arg max_θ J(θ),    J(θ) = E_{τ ~ p(τ | θ)}[G(τ)]   (3.1.8)

The description provided in this section formalizes the problem of optimal control under the framework provided by MDPs. One approach to finding an approximate solution to the problem described by Equation (3.1.8) is encompassed by policy optimization RL methods.

| Policy gradient and reinforce
Policy gradient methods are a subset of policy optimization methods, which estimate the gradient of the objective detailed by Equation (3.1.8) with respect to the parameters of the current policy.
Mathematically, this is described by the policy gradient theorem.18 The Supporting Information (SI) provides a full derivation and explanation of the policy gradient theorem. Given an estimate of the true policy gradient, gradient ascent methods facilitate policy improvement to make trajectories of higher reward more probable. In this manner, the policy parameterization is updated (via Equation (3.2.2)) in the direction provided by the policy gradient (Equation (3.2.1)):

∇_θ J(θ) = E_{τ ~ p(τ | θ)}[ ∇_θ log p(τ | θ) G(τ) ]   (3.2.1)

θ^{(j+1)} = θ^{(j)} + ω ∇_{θ^{(j)}} J(θ)   (3.2.2)

where j is the iteration of policy optimization and ω is the step size in the direction of the policy gradient, ∇_{θ^{(j)}} J(θ). The derivation of Equation (3.2.1) leverages a logarithmic identity (see SI), which enables mathematical separation of the conditional probability functions descriptive of the process dynamics and of the policy (see Equation (3.1.6)).
Given that the process dynamics are independent of the parameterization, θ, of the policy, π(θ, ·), examination of Equation (3.1.6) provides:

∇_θ log p(τ | θ) = Σ_{t=0}^{T−1} ∇_θ log π(u_t | y_t, θ)

Consequently, the policy gradient described by Equation (3.2.1) is reformulated as:

∇_θ J(θ) = E_{τ ~ p(τ | θ)}[ ( Σ_{t=0}^{T−1} ∇_θ log π(u_t | y_t, θ) ) G(τ) ]

In this work, the policy is parameterized by a long short-term memory (LSTM) network, which has previously been applied to dynamic stochastic control problems with extension to systems characterized by partial observability.2 General detail of the mathematical operations specific to LSTMs can be found in the following works,34,35 with a figurative description of the network used in this application provided by Section SI.2 of the SI. The investigation utilized the PyTorch 1.3.1 framework and the first-order gradient ascent method Adam to train the proposed LSTM network. The network structure was composed of two hidden layers, each with 20 LSTM cells. A leaky rectified linear unit (ReLU) activation function was applied across both hidden layers and a ReLU6 activation function was applied across the output layer, naturally bounding the output prediction. For a random variable z, the ReLU6 transformation is described as:

ReLU6(z) = min(max(0, z), 6)

The network designed in the context of this work predicts the mean (μ_t) and standard deviation (σ_t) of a unimodal multivariate normal distribution. This distribution describes the conditional density function representative of the control policy, such that:

π(u_t | y_t, θ) = N(μ_t(H_t, θ), σ_t²(H_t, θ))

where H_t is a learned parameterization of the history of process states provided by the LSTM cells, and σ_t² is the variance. Here, we formally construct the control policy as stochastic.
However, upon deployment of the policy to the real system, the policy may be assumed deterministic through selection of the actions corresponding to the mode (equivalently, the mean) of the multivariate normal distribution, such that u t = μ t .
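To make the parameterization concrete, the sketch below shows one way such a policy could be implemented in PyTorch, consistent with the architecture described above (two hidden layers of 20 LSTM cells, a leaky ReLU applied to the recurrent output, and a ReLU6 bounding the head). The class name, the exact placement of the activations, and the small variance offset are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn


class LSTMGaussianPolicy(nn.Module):
    """Stochastic policy pi(u_t | y_t, theta): an LSTM summarizes the observation
    history into H_t, from which the mean and standard deviation of a normal
    distribution over the control action are predicted."""

    def __init__(self, n_obs, n_mv, hidden=20):
        super().__init__()
        self.lstm = nn.LSTM(n_obs, hidden, num_layers=2)  # two layers of 20 LSTM cells
        self.act = nn.LeakyReLU()
        self.head = nn.Linear(hidden, 2 * n_mv)           # predicts [mu_t, sigma_t]
        self.bound = nn.ReLU6()                            # bounds raw outputs to [0, 6]
        self.n_mv = n_mv

    def forward(self, y_seq, hidden_state=None):
        # y_seq: (sequence length, batch, n_obs) observation history
        h, hidden_state = self.lstm(y_seq, hidden_state)
        out = self.bound(self.head(self.act(h[-1])))       # use H_t, the final hidden output
        mu = out[..., : self.n_mv]
        sigma = out[..., self.n_mv :] + 1e-6               # keep the std strictly positive
        return torch.distributions.Normal(mu, sigma), hidden_state
```

During training, actions are sampled from the returned distribution; upon deployment, the deterministic input u_t = μ_t (the mode of the distribution) may be taken, as described above.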
In this section, we have presented an approach to solving the MDP characteristic of a control problem through use of the policy gradient method, reinforce with baseline, in combination with an LSTM network for parameterization of the learned policy. In the following, we introduce an approach to policy learning, namely maximum entropy IRL (MaxEnt IRL), which utilizes existing process data to learn from demonstration. Conceptually, this approach is commonly known as AL.
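The training loop below sketches how reinforce with baseline (summarized above, and in Algorithm 1 below) could be assembled around such a policy. The rollout interface simulate(policy, T), assumed to return the log-probabilities of the actions taken and the rewards collected in one episode, and the use of the mean episode return as the baseline are illustrative assumptions rather than the authors' implementation.

```python
import torch


def reinforce_with_baseline(policy, simulate, K=20, T=50, epochs=100, lr=1e-3, gamma=0.99):
    """Monte Carlo policy gradient (reinforce with baseline).

    simulate(policy, T) is assumed to return (log_probs, rewards) for one episode,
    where log_probs is a list of tensors log pi(u_t | y_t, theta) and rewards a
    list of scalar rewards R_t."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        returns, log_prob_sums = [], []
        for _ in range(K):                                        # K Monte Carlo rollouts
            log_probs, rewards = simulate(policy, T)
            G = sum(gamma ** t * r for t, r in enumerate(rewards))  # discounted return G(tau)
            returns.append(G)
            log_prob_sums.append(torch.stack(log_probs).sum())
        returns = torch.tensor(returns)
        baseline = returns.mean()                                 # baseline reduces gradient variance
        # Gradient ascent on J(theta): minimize the negative score-function surrogate.
        loss = -(torch.stack(log_prob_sums) * (returns - baseline)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```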

| AL via IRL
AL via IRL is a general approach to policy learning from demonstration (i.e., process data). The benefits of such an approach are twofold. First, AL via IRL provides a parameterization of the existing control policy expressed in the process data. Second, it facilitates RL-based policy learning under the "real" process objective, as it provides an initial policy to hot-start the RL procedure. Otherwise, the agent (or controller) will initially explore the control action space randomly, which results in a data hungry and time-consuming approach. These benefits are exploited by the framework proposed in Section 2.3. The foundational IRL algorithms construct the reward function R : Y → ℝ as a linear combination of state features representative of the system state, φ ∈ ℝ^{d×1}, such that:

R(α, y) = α^T φ(y) = Σ_{i=1}^{d} α_i φ_i(y)

where α_i are feature weightings and φ_i : Y → ℝ explicitly represent the system state (y) but also implicitly encode control objectives.

Algorithm 1 Reinforce with baseline
Input: a policy π with initial parameters θ_0; learning rate ω; episode length T; number of episodes K for Monte Carlo rollouts of the policy; and number of training epochs N. Early stopping conditions may also be implemented.
Output: a policy π(u | y, θ).

Typically, φ are hand designed based on process and control task knowledge.29 Knowledge of the process objectives can also be applied to place bounds on the weights α in the reward function; however, this may not always be desired, as one could assert technical bias on the problem and reduce the feasible region. From this definition of the reward function, R(α, y), the policy optimization objective is consequently reformulated in terms of the expected cumulative reward collected along a trajectory. This may be further decomposed through the definition of trajectory features, υ_i, such that, for the discounted case:

υ_{γ,i}(τ) = Σ_{t=1}^{T} γ^{t−1} φ_i(y_t)   (3.3.4)

so that the discounted return of a trajectory may be written compactly as G(τ) = α^T υ_γ(τ).
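As a concrete illustration of these quantities, the short sketch below computes the linear reward R(α, y) = α^T φ(y) and the discounted trajectory features υ_γ from a trajectory of pre-computed state features; the array shapes and function names are illustrative.

```python
import numpy as np


def linear_reward(alpha, phi_y):
    """Linear reward R(alpha, y) = alpha^T phi(y) for a single observed state."""
    return float(np.dot(alpha, phi_y))


def discounted_trajectory_features(phi_traj, gamma=0.99):
    """Discounted trajectory features v_gamma = sum_t gamma^(t-1) phi(y_t),
    where phi_traj has shape (T, d): one row of state features per time step."""
    T = phi_traj.shape[0]
    discounts = gamma ** np.arange(T)
    return discounts @ phi_traj                      # shape (d,)


def expected_features(demo_feature_trajs, gamma=0.99):
    """Empirical feature expectation v_gamma_E over a set of demonstrations."""
    return np.mean(
        [discounted_trajectory_features(p, gamma) for p in demo_feature_trajs], axis=0
    )
```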

| Maximum entropy IRL
In AL, we are interested in learning a policy described by a conditional probability density function π(u_t | y_t, ·), such that, upon deployment of the policy to the real system, the process observes the same evolution as that described by the process data (see Equation (3.1.6)).
Explicitly, the investigation learns the expert's policy expressed by

p(τ | α) = exp{α^T υ(τ)} / Z(α, ·)

where υ = [υ_1, υ_2, …, υ_d], and Z(α, ·) = Σ_{τ ∈ Τ} exp{α^T υ(τ)} is the partition function, which enforces normalization of the distribution. Formally, the approach prescribes that each of the demonstrations, τ_E ∈ Τ, is independently and identically distributed, such that the likelihood of observing the set of trajectories, Τ, expressed in the process data is:

p(Τ | α) = ∏_{τ_E ∈ Τ} exp{α^T υ(τ_E)} / Z(α, ·)

where Z(α, ·) is assumed constant for all τ_E ∈ Τ,30 and p(Τ | α) is the likelihood of observing the set of demonstrations. Under the maximum entropy formulation,30,31,36,39 the optimal solution for the feature weights, α*, is:

α* = arg max_α log p(Τ | α)   (3.4.3)

The gradient of the log-likelihood objective (Equation (3.4.3)) with respect to the feature weights, α, is formulated as:

∇_α log p(Τ | α) = Σ_{τ_E ∈ Τ} υ(τ_E) − |Τ| E_{τ ~ p(τ | α)}[υ(τ)]

that is, up to a factor of |Τ|, the difference between the empirical feature expectation of the demonstrations and the feature expectation under the current weights. It should be noted that the approaches to policy optimization provided by PPO and entropy regularization could provide further stability in learning and accuracy in estimation of the partition function, respectively. In view of the length of the horizon specific to many control tasks, the discounted trajectory features, υ_γ, as described by Equation (3.3.4), should be used rather than the undiscounted features.
This establishes the upper MaxEnt IRL task as a nonconvex optimization37 but provides performance improvements in the lower level policy optimization task. Algorithm 2 details the MaxEnt IRL algorithm further.
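A minimal sketch of the resulting bi-level procedure is given below: the upper level performs gradient ascent on the feature weights using the difference between the demonstrated and policy-induced feature expectations, while the lower level re-optimizes the policy under the current reward. The functions train_policy and rollout_features are placeholders for the reinforce with baseline routine and for Monte Carlo estimation of E_π[υ_γ], respectively; the step size and iteration count are illustrative.

```python
import numpy as np


def maxent_irl(v_gamma_expert, train_policy, rollout_features, alpha0, eta=0.05, n_iters=50):
    """Maximum entropy IRL (sketch of the bi-level procedure).

    v_gamma_expert       : expected discounted trajectory features of the demonstrations.
    train_policy(alpha)  : lower-level step, e.g., reinforce with baseline under R = alpha^T phi.
    rollout_features(pi) : expected discounted trajectory features under the current policy.
    """
    alpha = np.asarray(alpha0, dtype=float)
    policy = None
    for _ in range(n_iters):
        policy = train_policy(alpha)                 # lower-level policy optimization
        v_gamma_policy = rollout_features(policy)    # Monte Carlo estimate of E_pi[v_gamma]
        grad = v_gamma_expert - v_gamma_policy       # gradient of the log-likelihood w.r.t. alpha
        alpha = alpha + eta * grad                   # gradient ascent on the feature weights
    return alpha, policy
```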

| Overview of the proposed methodology
In the case studies presented, the process model is of deviation variable form and was derived from first principles. The deviation variable, z*, of a random variable, z, is expressed as:

z* = z − z_ss

where z_ss is the previous steady-state value of z. Process stochasticity (disturbance) is assumed zero-mean Gaussian, as is the nature of system observation. Therefore, the approximation of the true underlying process dynamics takes the form of a system of stochastic differential equations in the state x = [C_A, T]^T, and the Euler-Maruyama method was utilized for system integration.38 The SI provides the formal derivation and parameter values. Given the formulation of the MIMO problem, the investigation is concerned with controlling the evolution of the error, ε, within both the temperature, T^{obs}, and reagent concentration, C_A^{obs}, control loops.
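For illustration, the sketch below shows Euler-Maruyama integration for a generic drift function; the drift here stands in for the first-principles CSTR model given in the SI, and the disturbance scaling sigma_d is an assumed parameter.

```python
import numpy as np


def euler_maruyama(f, x0, u_seq, dt, sigma_d, rng=None):
    """Euler-Maruyama integration of dx = f(x, u) dt + sigma_d dW, where the
    process disturbance is zero-mean Gaussian (one Brownian increment per step)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for u in u_seq:
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)   # Brownian increment
        x = x + f(x, u) * dt + sigma_d * dW
        trajectory.append(x.copy())
    return np.array(trajectory)
```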

| Design of state features for AL
The introduction provided in Section 3.4 outlines a framework for learning the weight vector α*, which provides a linear mapping from state representations, φ, to scalar cost. Further, for a given representation, there exists a set of possible process trajectories that match the feature counts (trajectory features) of the existing policy. Therefore, the design of φ should consider the process, the optimization objectives, and the restriction of the possible set of trajectories. As a result, this work proposes the use of three types of state features, all of which provide consistent control objectives temporally and utilize knowledge of the underlying process control task.

| Type I
The first state feature proposed is encapsulated by the radial basis function (RBF). The RBF provides a similarity measure and allocates exponentially lower cost, or greater value, to those control policies which achieve set point tracking. The feature is formulated on the scaled control loop error,

ε = (y_sp − y) / (y_sp − y_ss)   (4.2.1)

as φ_I(ε) = exp(−β ε²), where y_ss is the previously observed steady state of the system, y_sp is the desired set point, β is the shape parameter, and φ_I(ε) ∈ [0, 1]. The closer the value of β to zero, the greater the offset tolerated and the denser the reward landscape. Conversely, higher values of β provide exponentially greater rewards for trajectories closer to the set point, but a sparser reward landscape. In the following case studies, the investigation utilized β = 10.

| Type II
Although the Type I feature is an absolute measure of control performance, alone it does not fully characterize the evolution of the system response. Furthermore, the set of possible process trajectories that could match the representation of the demonstrated policy, υ_E, is large.
To restrict the possible set, Type II and III features take inspiration from the PID control law, which at a given time is a linear combination of the error, ε = y_sp − y, in the control loop at the current time point (proportional), the manner in which the error has evolved over time (integral), and the projected evolution of the error in the future (derivative). Hence, the Type II state feature proposed intends to quantify how the absolute error in a control loop evolves temporally; it is constructed from the absolute loop error and the sampling interval, where Δt is equivalent to the sampling time, or the times at which control actions are computed.

| Type III
The design of Type III state features aims to quantify how the error in the control loop may evolve into the future. As a result, the feature approximates the derivative of the error in the control loop at the sampled time:

φ_III(t_c) = (ε_{t_c} − ε_{t_c − 1}) / Δt

where t_c − 1 is the previous discrete time index. In view of the proposed state features, the investigation is able to characterize control trajectories and provide a direct and consistent control objective. As a result, the reward function R of the MDP is specified as a weighted combination of the Type I-III features across both control loops, R(α, y) = α^T φ(y).
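For illustration, the three feature types could be computed as below. Type I follows the radial basis form reconstructed above and Type III the finite-difference derivative; the Type II expression shown is only an indicative accumulation of the absolute loop error, since its exact form is not reproduced here.

```python
import numpy as np


def type_i(y, y_sp, y_ss, beta=10.0):
    """Type I: radial basis similarity to the set point on the scaled error
    eps = (y_sp - y) / (y_sp - y_ss); the value lies in [0, 1]."""
    eps = (y_sp - y) / (y_sp - y_ss)
    return float(np.exp(-beta * eps ** 2))


def type_ii(abs_error_history, dt):
    """Type II (indicative form only): accumulated absolute loop error over time,
    analogous to the integral term of a PID law."""
    return float(np.sum(np.abs(abs_error_history)) * dt)


def type_iii(eps_t, eps_prev, dt):
    """Type III: finite-difference approximation of the error derivative."""
    return (eps_t - eps_prev) / dt
```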

| Case Study I-Learning from near optimal demonstrations
The purpose of this case study is to construct an RL controller which learns from demonstration provided by a near optimal control policy and then to improve it further. The results of offline learning via AL are summarized in Table 2.

| Online learning and optimal control
Further improvement of the initial policy (Section 5.1.1) utilizes Algorithm 1 and a real process reward function, shown as Equation (5.1.2.1), which expresses the pure set point tracking objective demonstrated by the PID controller. Explicitly, the policy improvement was provided by two rounds of online learning, with 10 training iterations (epochs) per round. As a result, the agent is able to facilitate a system response which meets the set point faster, with less overshoot, than the PID controller (shown in Figure 3(C)). Similar observations are made in analysis of Figure 3(B,D), which demonstrates the response of the temperature control loop. In this case, the online updated RL policy yields a better temperature response, characterized by a fast rise time with no observable overshoot.

| Case Study II-Learning from suboptimal demonstrations
In Case Study II, the demonstrations (process data) are derived from a second PID controller (PID2 detailed by the SI). Compared to Case Study I, the demonstrations provided by the PID controller here are of an overdamped control response, which subjectively appears suboptimal.
TABLE 3 The expected discounted trajectory features of PID2 (υ_{γ,E}) and of the policy learned through AL (υ_{γ,π}), together with the IRL feature weights (α*) generated in this case study. The row labels (y* − Type) indicate the type of trajectory feature and the respective control loop error.

In this case study, state features relevant to the concentration control loop are allocated the greatest weighting. This is because the set points are changed in the same direction (as detailed by Table 1).
Naturally, a rise in reagent concentration will cause a decrease in temperature (endothermic reaction), whilst a rise in temperature will facilitate the conversion of the reagent. As the reaction equilibrium is more sensitive to the temperature change, greater weightings must be placed on the concentration control loop to reach the new set point. Furthermore, Type I state features are allocated negative weights, which is unusual. Intuitively, Type I features represent a similarity measure between the current state of the system and the desired set point. Given that the feature value is non-negative (φ_I ∈ [0, 1]), a negative reward weighting means that the IRL-learnt objective function will prevent the process from reaching the new set point. This is primarily attributed to the fact that a large proportion of the demonstrations never reached the new set point (Figures 4 and 5) due to the overdamped control response. As AL considers the expert's (i.e., the PID controller's) actions to be a noisily optimal control policy, it will find the optimal weight vector, α*, to reproduce this overdamped control response. Therefore, the current result indicates that if the demonstration data do not contain a good control policy, it is essential to further improve the AL-generated policy through online learning.

| Online learning and optimal control
As in Section 5.1.2, online learning is performed to improve the AL policy (initialized for RL). Given that a degree of offset was present in both control loops, as detailed by Figure 4, two short rounds of RL policy improvement, again consisting of 10 training epochs each, proceeded with hand tuning of the parameter β in each round.

| Case Study III-Knowledge transfer in learning from demonstration
Finally, Case Study III demonstrates how knowledge transfer from one task improves the efficiency of offline AL for further set points.
Here, we again assume the availability of existing demonstrations as described by process data. The control task (set point change) in this study is described by Table 1 and is different to both tasks examined in Case Studies I and II. Again, we would like to learn a parameterization of the control policy (offline) expressed in the process data and then improve it further (online), but we wish to reduce the computational budget associated with offline AL. Thus, we propose to transfer knowledge from a previous study to improve computational and learning efficiency.
Knowledge transfer takes the form of the offline-learned policy parameterization, π(θ_{k0}), and the weight vector, α*, from a previous task. Here, knowledge is transferred from Case Study I, given its better PID performance than Case Study II. Both α* and π(θ_{k0}) from Case Study I are provided as the initialization for AL of the new task in Case Study III. Updating this initialization takes only 80 epochs.
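A minimal sketch of this warm start is given below; the function and variable names are illustrative, and the policy networks are assumed to share the same architecture across tasks.

```python
import copy


def warm_start(policy_prev, alpha_prev):
    """Initialize offline AL for a new set-point task from a previous case study
    by reusing the learned policy parameterization and IRL feature weights."""
    policy_new = copy.deepcopy(policy_prev)   # transfer the policy parameterization
    alpha_new = alpha_prev.copy()             # transfer the reward feature weights alpha*
    return policy_new, alpha_new
```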
Previously, the two studies recovered demonstrated behavior within a total of 300 and 250 epochs of policy optimization, respectively.
This reduction in the computational intensity of policy learning demonstrates that the computational burden of AL via IRL, under the current methodology, may be significantly reduced through knowledge transfer. In this study, the process data were generated using PID1. Table 4 details the corresponding trajectory feature expectations, υ_{γ,E}.
Given the parameterization as learned via IRL, a further two rounds of 10 epochs of RL enabled further policy improvement online.
The results are presented in Figure 6. Figure 6(A,B) highlights how the policy learned under knowledge transfer achieves pure set point tracking with a smoother control policy than that demonstrated by PID1. Once again, Figure 6(C,D) shows that this control policy successfully facilitates a system response with fast rise time, but no overshoot or oscillatory behavior around the set point, as is present in the demonstrations.

| CONCLUSIONS
In this article, we propose a framework based on AL to learn a control law from process data. This approach allows us to synthesize a neural network control policy from a previous controller (e.g., PID, MPC, or human controllers) more robustly than with supervised learning. Having learned a parameterization of the control law, subsequent deployment of RL enables further policy improvement by directly interacting with the real process, thus outperforming the existing control law. Here, AL is implemented through IRL. Given the data-driven nature of IRL, the RL-based policy parameterization promises to express both the action of the control scheme and the process knowledge of the operators. RL is constructed using a policy optimization algorithm, although other methods could also be applied in the future. Based on the case studies, it is concluded that the proposed framework can effectively extract control information from available process data, transfer knowledge between different cases, and efficiently arrive at an improved optimal control policy. It should be noted that we assume the availability of rich, informative datasets. If the data are not informative, the framework is unlikely to be effective. Future work will explore the implementation of various data augmentation strategies, based on physical knowledge or statistical analyses, to artificially synthesize informative datasets.

DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no experimental datasets were generated or analysed during the current study.