Online optimal and adaptive integral tracking control for varying discrete‐time systems using reinforcement learning

The conventional closed-form solution to the optimal control problem using optimal control theory is only available under the assumption that there are known system dynamics/models described as differential equations. Without such models, reinforcement learning (RL) has been successfully applied as a candidate technique to iteratively solve the optimal control problem for unknown or varying systems. For the optimal tracking control problem, existing RL techniques in the literature assume either the use of a predetermined feedforward input for the tracking control, restrictive assumptions on the reference model dynamics, or discounted tracking costs. Furthermore, by using discounted tracking costs, zero steady-state error cannot be guaranteed by the existing RL methods. This article therefore presents an optimal online RL tracking control framework for discrete-time (DT) systems, which does not impose the restrictive assumptions of the existing methods and equally guarantees zero steady-state tracking error. This is achieved by augmenting the original system dynamics with the integral of the error between the reference inputs and the tracked outputs for use in the online RL framework. It is further shown that the resulting value function for the DT linear quadratic tracker using the augmented formulation with integral control is also quadratic. This enables the development of Bellman equations, which use only the system measurements to solve the corresponding DT algebraic Riccati equation and obtain the optimal tracking control inputs online. Two RL strategies are thereafter proposed based on both the value function approximation and Q-learning, along with bounds on excitation for the convergence of the parameter estimates. Simulation case studies show the effectiveness of the proposed approach.


INTRODUCTION
Reinforcement learning (RL) is a type of machine learning technique that has been used extensively in the area of computing and artificial intelligence to solve complex optimization problems. 1,2 Due to its successes, there have been concerted efforts by researchers in the control community to explore the overlap between RL and optimal control theory, which usually involves solving the general-purpose Hamilton-Jacobi-Bellman (HJB) equations. The conventional approach to optimal control minimizes a weighted cost function composed of state and control minimization objectives. A closed-form solution (eg, the Riccati equation) to this problem is available under the assumption that there are known system dynamics described as differential equations. 3 Without such models, this closed-form solution is not available. RL has been successfully applied to iteratively optimize these control cost functions for unknown or varying systems by providing solutions to the HJB equations online. 4,5 This article addresses the optimal online tracking control of varying systems under less restrictive assumptions than previously proposed solutions and builds upon the objectives of performance seeking and real-time optimization techniques. 6,7 Variations occur in systems due to a number of factors, including degradation and changing operating conditions, which can result in a reduction in system performance. From the traditional control perspective, adaptive control offers strategies to compensate for the system variations and can be indirect or direct. Indirect adaptive schemes use the system measurements to learn new system models, which are then used in a conventional model-based control design, while direct schemes use the system measurements to adapt some parameterized controllers. In both of these schemes, optimality is not directly achieved in the sense of optimizing some user-defined cost function. 8
RL enables the development of both optimal and adaptive strategies that are able to cope with the system variations by using only the system measurements and has been linked to both optimal and adaptive control. [9][10][11][12] These enabling methods are therefore classed as intelligent, defined as self-diagnostic, prognostic, and optimizing, resulting in a through-life adaptation strategy, and have been widely reported in many applications. [13][14][15][16] Mathematical implementation of RL is enabled through approximate/adaptive dynamic programming (ADP) 17,18 and has been described under several other labels, including neurodynamic programming and adaptive critic designs. 9,19,20 Through interaction with the systems, the RL-ADP strategies have been applied to incrementally improve the desired control behavior for the regulation of feedback systems involving unknown continuous and discrete-time (DT) dynamics. [21][22][23][24][25][26][27][28] For the tracking control problem, existing RL strategies are split between methods that employ dynamics inversion and those that use an augmented formulation. 29 In Reference 30, a method employing the dynamics inversion for the infinite-time tracking control of DT nonlinear systems has been proposed. The method assumes that the steady-state feedforward control input is known a priori and uses a new quadratic performance index to compute the feedback control input using RL techniques. A finite-horizon neurooptimal equivalent that minimizes the tracking error over a finite horizon but equally assumes a known steady-state feedforward input is proposed in Reference 31. Likewise, the authors in Reference 32 have proposed an optimal tracking control for nonlinear DT systems that uses three online approximators and a heuristic tuning law for the feedforward portion of the control input but assumes bounded approximation errors with fixed model structures for the identification of the system parameters.
A similar approach that uses three approximators and a generalized policy iteration ADP with two iteration procedures for the tracking control has been proposed in Reference 33. This approach learns a model of the system dynamics online and generally requires pretrained models while assuming fixed model structures for the identification. In contrast to these approaches, strategies that employ the augmented formulation obviate the need for a predetermined feedforward control input by transforming the tracking control into a regulation problem using augmented system states. In Reference 34, an approach that transforms the optimal tracking control problem into a regulation problem has been provided by augmenting the system states with the reference model dynamics for linear DT systems. The approach assumes that the reference generator matrix is governed by dynamics that tend toward zero, thereby limiting any practical usage of the approach in the case of nonzero reference inputs. Consequently, an approach that relaxes the restriction on the reference dynamics by using a discounted tracking cost is given in Reference 35. Extension of the approaches to DT nonlinear systems using neural networks and a discounted tracking cost is given in Reference 36, while the continuous-time equivalents of the strategies, which include input constraints, are given in Reference 37. However, by introducing a discounted tracking cost, convergence of the tracking error to zero can no longer be guaranteed, thereby limiting the practicality of the approaches. As a result, none of the existing optimal online tracking RL strategies is able to guarantee zero steady-state tracking error using either the system dynamics inversion or the augmented formulations. Moreover, the restrictive assumptions involved in both formulations make the approaches less desirable for use in practical tracking applications.
This article therefore presents an optimal online reinforcement learning tracking control framework for DT systems, which uses an augmented formulation with integral control and transforms the DT optimal tracking control problem into one of regulation. In contrast to the approaches discussed in References 30, 31, and 34-36, the proposed framework removes the need for either a predetermined feedforward control input, restrictive assumptions on the reference model dynamics, or a discounted tracking cost that limits the practical applications of existing online tracking RL approaches. Furthermore, the proposed RL framework eliminates steady-state tracking error and is able to cope with systems with unknown or varying dynamics, leading to a through-life adaptation strategy. It is shown in this article that the resulting value function for the DT linear quadratic tracker (LQT) using the augmented formulation with integral control is also quadratic. This enables the development of Bellman equations, which use only the system measurements to solve the corresponding DT algebraic Riccati equation (ARE) and obtain the optimal control inputs online. Two RL strategies are proposed in this article based on both the value function approximation (VFA) and Q-learning, along with bounds on excitation for the convergence of the parameter estimates.
The rest of the article is organized as follows. Section 2 presents the general optimal tracking control problem for DT systems along with the existing solution strategies and their limitations. Section 3 presents the proposed augmented formulation, while Section 4 provides the model-based control solution to the augmented LQT problem. In Section 5, the two RL strategies and an intelligent framework for the augmented LQT problem are developed and respective algorithms provided, while Section 6 gives two representative simulation case studies using the proposed algorithms.

PROBLEM FORMULATION
Consider the control affine-in-input discrete-time system with the following dynamics:

x_{k+1} = f(x_k) + g(x_k) u_k,  y_k = h(x_k),  (1)

where x ∈ R^n, u ∈ R^m, and y ∈ R^p are, respectively, the system states, inputs, and outputs. The aim of the tracking control problem is to minimize a cost function:

J = Σ_{i=k}^∞ γ^{i−k} [(y_i − r_i)^⊤ Q (y_i − r_i) + u_i^⊤ R u_i],  (2)

where u_i^⊤ R u_i is a quadratic energy function with Q = Q^⊤ ≥ 0 and R > 0, r is a desired reference trajectory, and 0 < γ ≤ 1 is the discount factor.
It can be shown that for the special case where (1) is a linear time invariant (LTI) system:

x_{k+1} = A x_k + B u_k,  y_k = C x_k,  (3)

the standard solution to the given infinite horizon tracking problem using calculus of variations with γ = 1 is: 10

u_k = −K_x x_k + K_v v_{k+1},  (4)

where:

K_x = (R + B^⊤ P B)^{−1} B^⊤ P A,  K_v = (R + B^⊤ P B)^{−1} B^⊤,  v_k = (A − B K_x)^⊤ v_{k+1} + C^⊤ Q r_k,  (5)

and P = P^⊤ > 0 is the solution to the algebraic Riccati equation:

P = A^⊤ P A − A^⊤ P B (R + B^⊤ P B)^{−1} B^⊤ P A + C^⊤ Q C.  (6)

It is noted that the control input (4) for the tracking problem consists of both a feedback term K_x x_k that stabilizes the system and a feedforward term K_v v_{k+1} for reference tracking. Furthermore, the given standard solution is noncausal, 35 as it is dependent on a backward-in-time recursion of the variable v_k. An implication of this is that the standard solution to the tracking problem can only be obtained offline and with full knowledge of the system dynamics. Consequently, causal solution strategies that can be computed online have been proposed in the literature and will now be briefly presented.
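As a numerical illustration of the feedback part of this standard solution, the DT algebraic Riccati equation can be solved by a simple fixed-point (Riccati) recursion. The sketch below uses hypothetical matrices A, B, Q, R, not any system from this article:

```python
import numpy as np

# Fixed-point (value) iteration on the DT algebraic Riccati equation.
# A, B, Q, R are hypothetical placeholders, not the article's system.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = np.eye(2)
for _ in range(2000):
    # P <- A'PA - A'PB (R + B'PB)^{-1} B'PA + Q
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = A.T @ P @ (A - B @ K) + Q

# Feedback gain K_x = (R + B'PB)^{-1} B'PA of the tracking solution
K_x = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

The recursion converges for any stabilizable and detectable pair, after which K_x stabilizes the closed loop.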
Causal solution to the optimal tracking problem

Existing causal solution strategies to the online tracking control problem can be categorized into two groups and are briefly summarized as follows:
(1) Strategies using dynamics inversion: [30][31][32] These methods assume that the desired reference dynamics is given as:

r_{k+1} = f(r_k) + g(r_k) u_{d,k},  (7)

where u_{d,k} = g^{−1}(r_k)(r_{k+1} − f(r_k)) is the feedforward tracking control input. The tracking error e_k = x_k − r_k is minimized by defining a cost function as:

J_e = Σ_{i=k}^∞ (e_i^⊤ Q_e e_i + u*_i^⊤ R_e u*_i),  (8)

with Q_e ≥ 0, R_e > 0, and where u* is the feedback tracking control input. The overall control input is thus given as:

u_k = u_{d,k} + u*_k.

Remarks
• Complete knowledge of the system dynamics is needed to compute the feedforward term u_d, with the further assumption that the function g(r) is invertible.
• Online implementation of this approach therefore assumes u_d is known a priori, and only the feedback term u* is computed online. As a result, practical online adaptation strategies to cope with varying or unknown system dynamics are limited using this strategy.
(2) Strategies using augmented formulation: [34][35][36]38,39 These methods enable the simultaneous online computation of both the feedforward and feedback terms of the tracking control input. This approach assumes that the reference dynamics is governed by:

r_{k+1} = ψ(r_k),  (9)

where ψ(r_k) is some reference generator model with ψ(0) = 0. An augmented system is then formulated using the tracking error and the reference dynamics as:

X_k = [e_k; r_k],  X_{k+1} = [e_{k+1}; r_{k+1}],  (10)

where the augmented dynamics follow from (1) and (9), and with a new cost defined as:

J = Σ_{i=k}^∞ γ^{i−k} (X_i^⊤ Q_1 X_i + u_i^⊤ R u_i).  (11)

This way, the tracking problem is recast as a regulation problem, the solution of which gives both the feedforward and feedback terms of the control input online.

Remarks
• It is assumed that ψ(r_k) → 0 as k → ∞; where this is not the case, a discounted performance function with 0 < γ < 1 must be used to ensure the value of the cost function remains finite. 35 This assumption poses a restriction on the class of reference generators that can be used with the approach.
• By using a discount factor in the cost function, this approach cannot guarantee zero steady-state tracking error, as discussed in Reference 35. The restrictive assumption on the reference dynamics and the discounted cost make the approach less desirable for use in practical tracking applications.
Consequently, existing RL techniques for the online optimal tracking control problem either assume the use of a predetermined feedforward input for the tracking control or impose restrictive assumptions on the reference model dynamics and discounted tracking costs. In the following, a new augmented formulation for the online optimal tracking control problem is proposed that guarantees zero steady-state tracking error without imposing restrictive assumptions on the reference dynamics or a discounted performance cost, thereby overcoming the limitations of the existing strategies.

AUGMENTED FORMULATION FOR THE OPTIMAL TRACKING PROBLEM WITH INTEGRAL CONTROL
Consider again the optimal tracking control problem for system (1) and let a new state ż be defined as the integral of the difference between the desired reference and the system output as:

ż = r − y.  (12)

Using Euler's approximation, an equivalent discrete-time state with sampling time t_s gives:

z_{k+1} = z_k + t_s (r_k − y_k).  (13)

An augmented system can therefore be formed using the new state as:

x_{k+1} = f(x_k) + g(x_k) u_k,
z_{k+1} = z_k + t_s (r_k − y_k).  (14)

At steady state, the augmented system (14) becomes:

x_∞ = f(x_∞) + g(x_∞) u_∞,
z_∞ = z_∞ + t_s (r_∞ − y_∞).  (15)

For a constant reference signal, that is, r_∞ = r_k, subtracting (15) from (14) gives:

x_{k+1} − x_∞ = f(x_k) + g(x_k) u_k − f(x_∞) − g(x_∞) u_∞,
z_{k+1} − z_∞ = (z_k − z_∞) − t_s (y_k − y_∞).  (16)

Further simplification of (16) becomes:

X_{k+1} = F(X_k) + G(X_k) ũ_k,  (17)

with X_k = [(x_k − x_∞)^⊤, (z_k − z_∞)^⊤]^⊤ and ũ_k = u_k − u_∞. The tracking cost (2) is therefore redefined as:

J = Σ_{i=k}^∞ (X_i^⊤ Q_1 X_i + ũ_i^⊤ R ũ_i),  (18)

where Q_1 ∈ R^((n+p)×(n+p)). This way, the tracking problem is converted to that of regulation such that the control input for a minimum of (18) eliminates the steady-state error by ensuring that x_k → x_∞ and z_k → z_∞ as X_k → 0. Furthermore, as the new augmented system states are not dependent on the reference dynamics, this approach removes the restrictive assumptions of the existing methods. An equivalent difference equation to (18) for a given fixed policy ũ_k = μ(X_k) is given by the value function and defined as:

V(X_k) = X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + V(X_{k+1}).  (19)

Using the Bellman principle of optimality, 9 the optimum value becomes:

V*(X_k) = min_{ũ_k} (X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + V*(X_{k+1})).  (20)

Equation (20) gives the DT Hamilton-Jacobi-Bellman (HJB) equation for the augmented tracking formulation with integral control, from which the optimal tracking control input is obtained as:

ũ*_k = arg min_{ũ_k} (X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + V*(X_{k+1})).  (21)
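The integrator augmentation above can be checked numerically. The sketch below specializes to a hypothetical linear plant (A, B, C), which is a placeholder used only to show that the augmented one-step update reproduces the separate x and z updates of (14):

```python
import numpy as np

# Integrator augmentation: z accumulates the tracking error,
#   z_{k+1} = z_k + t_s * (r_k - y_k),
# and is stacked with x. The linear plant (A, B, C) is a hypothetical
# placeholder used only to demonstrate the construction.
t_s = 0.03
A = np.array([[1.0, t_s],
              [0.0, 0.95]])
B = np.array([[0.0],
              [t_s]])
C = np.array([[1.0, 0.0]])
n, m, p = 2, 1, 1

# Augmented update: [x; z]_{k+1} = A1 [x; z]_k + B1 u_k + E1 r_k
A1 = np.block([[A, np.zeros((n, p))],
               [-t_s * C, np.eye(p)]])
B1 = np.vstack([B, np.zeros((p, m))])
E1 = np.vstack([np.zeros((n, p)), t_s * np.eye(p)])

# One augmented step checked against the separate x and z updates
x = np.array([[0.5], [-0.2]])
z = np.array([[0.1]])
u = np.array([[0.3]])
r = np.array([[1.0]])
X_next = A1 @ np.vstack([x, z]) + B1 @ u + E1 @ r
x_next = A @ x + B @ u
z_next = z + t_s * (r - C @ x)
```

The reference enters only through the integrator row, which is why the augmented regulation state does not depend on any reference generator model.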

MODEL-BASED SOLUTION TO THE AUGMENTED LQT FORMULATION WITH INTEGRAL CONTROL
A model-based control solution to the optimal tracking problem using the augmented formulation with integral control for discrete-time (DT) linear systems is first presented for comparison with the model-free RL approaches introduced in later sections. Using the system dynamics in (3), the augmented system of (17) becomes:

X_{k+1} = A_1 X_k + B_1 ũ_k,  A_1 = [A, 0; −t_s C, I],  B_1 = [B; 0].  (22)

Lemma 1 (Quadratic value function). Given the LQT cost of (18) and the system with dynamics (22), for any stabilizing control law:

ũ_k = −K_1 X_k,  (23)

where K_1 = [K_x, −K_I] ∈ R^(m×(n+p)), K_x ∈ R^(m×n), and K_I ∈ R^(m×p), the value function for the augmented formulation with integral control is quadratic for some matrix P_1 = P_1^⊤ > 0 ∈ R^((n+p)×(n+p)) and given as:

V(X_k) = X_k^⊤ P_1 X_k.  (24)

For simplicity of notation in the subsequent analysis, x_∞ and z_∞ are dropped in the augmented states.
Proof. Change the lower limit for the summation in (19) and substitute for ũ_k to give:

V(X_k) = Σ_{i=k}^∞ X_i^⊤ (Q_1 + K_1^⊤ R K_1) X_i.  (25)

Noting that

X_{k+1} = (A_1 − B_1 K_1) X_k

and

X_i = (A_1 − B_1 K_1)^(i−k) X_k,

Equation (25) becomes:

V(X_k) = X_k^⊤ [Σ_{j=0}^∞ ((A_1 − B_1 K_1)^⊤)^j (Q_1 + K_1^⊤ R K_1) (A_1 − B_1 K_1)^j] X_k.  (26)

Therefore,

V(X_k) = X_k^⊤ P_1 X_k,  (27)

where

P_1 = Σ_{j=0}^∞ ((A_1 − B_1 K_1)^⊤)^j (Q_1 + K_1^⊤ R K_1) (A_1 − B_1 K_1)^j = P_1^⊤ > 0.  ∎

Using (27) in (20), the Bellman equation for the optimal value function is thus given as:

V*(X_k) = X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + X_{k+1}^⊤ P_1 X_{k+1},  (28)

and the optimal control input of (21) with γ = 1 becomes:

ũ_k = −K_1 X_k,  (29)

where

K_1 = (R + B_1^⊤ P_1 B_1)^{−1} B_1^⊤ P_1 A_1.

Equation (29) gives the model-based control solution to the augmented DT LQT problem consisting of both the integral feedforward and feedback gains. Substituting for ũ_k in (28) and simplifying gives the corresponding algebraic Riccati equation (ARE) as:

P_1 = A_1^⊤ P_1 A_1 − A_1^⊤ P_1 B_1 (R + B_1^⊤ P_1 B_1)^{−1} B_1^⊤ P_1 A_1 + Q_1.  (30)

Lyapunov stability can be shown for the LQT system by using the Lyapunov function:

V(X_k) = X_k^⊤ P_1 X_k.  (31)

Substituting for the control input (29) gives:

ΔV = X_{k+1}^⊤ P_1 X_{k+1} − X_k^⊤ P_1 X_k = X_k^⊤ [(A_1 − B_1 K_1)^⊤ P_1 (A_1 − B_1 K_1) − P_1] X_k.  (32)

Add and subtract K_1^⊤ R K_1, then simplify further to give:

ΔV = X_k^⊤ [(A_1 − B_1 K_1)^⊤ P_1 (A_1 − B_1 K_1) − P_1 + K_1^⊤ R K_1] X_k − X_k^⊤ K_1^⊤ R K_1 X_k.  (33)

Finally, substituting for the ARE of the LQT system, which is given in terms of P_1 in (30), yields:

ΔV = −X_k^⊤ (Q_1 + K_1^⊤ R K_1) X_k;  (34)

therefore, Lyapunov stability is guaranteed for the condition ΔV ≤ 0, which holds if and only if Q_1 and R are positive semidefinite. Figure 1 shows the block diagram of the augmented tracking control framework with integral control, consisting of both a feedforward integral gain K_I and a feedback gain K_x. The given baseline integral-proportional (I-P) control structure is widely used in practice, where the tracking error is fed into the feedforward integral term while the proportional term is implemented in feedback. 40,41 Therefore, using knowledge of the system dynamics, the above tracking framework with integral control can be used to achieve optimal tracking control online and does not impose restrictions on the reference model dynamics or use discounted tracking costs. For systems with unknown or varying dynamics, an approximate online solution to the optimal tracking control framework with integral control is developed in the next section using reinforcement learning. This offers the advantage of not requiring full knowledge of the system dynamics while converging to the optimum values.
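A minimal numerical sketch of this model-based solution iterates the ARE (30) for P_1, extracts K_1 as in (29), and checks that the integral action removes the steady-state error for a constant reference. The plant matrices, weights, and reference are hypothetical placeholders, not the article's case study:

```python
import numpy as np

# Model-based augmented LQT with integral control (Section 4 sketch).
# Plant, weights, and reference are hypothetical placeholders.
t_s = 0.03
A = np.array([[1.0, t_s],
              [0.0, 0.95]])
B = np.array([[0.0],
              [t_s]])
C = np.array([[1.0, 0.0]])

# Augmented matrices as in (22)
A1 = np.block([[A, np.zeros((2, 1))],
               [-t_s * C, np.eye(1)]])
B1 = np.vstack([B, np.zeros((1, 1))])
Q1 = np.eye(3)
R = np.array([[0.1]])

# Solve the ARE (30) by Riccati recursion, then extract K1 as in (29)
P1 = np.eye(3)
for _ in range(3000):
    K1 = np.linalg.solve(R + B1.T @ P1 @ B1, B1.T @ P1 @ A1)
    P1 = A1.T @ P1 @ (A1 - B1 @ K1) + Q1
K1 = np.linalg.solve(R + B1.T @ P1 @ B1, B1.T @ P1 @ A1)

# Closed loop u = -K1 [x; z]: the integrator state drives y -> r
x = np.zeros((2, 1))
z = np.zeros((1, 1))
r = 1.0
for _ in range(8000):
    u = -K1 @ np.vstack([x, z])
    x = A @ x + B @ u
    z = z + t_s * (r - (C @ x).item())
y_ss = (C @ x).item()
```

At the closed-loop equilibrium the integrator state can only settle when r − y = 0, which is the mechanism behind the zero steady-state error claim.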

REINFORCEMENT LEARNING FRAMEWORK FOR THE OPTIMAL TRACKING CONTROL USING AUGMENTED FORMULATION WITH INTEGRAL CONTROL
As discussed in Section 2, existing approaches for the optimal tracking control problem using RL either assume that the feedforward part of the control is known a priori or make restrictive assumptions on the reference model dynamics and use of discounted tracking costs. These restrictive assumptions are eliminated by using the augmented formulation with integral control as proposed in Section 3. Consequently, a novel optimal RL framework is proposed for the LQT problem that converges to the optimum solution for systems with varying or unknown system dynamics using the augmented formulation with integral control. Furthermore, unlike the previously proposed RL tracking approaches, [30][31][32][34][35][36]38,39 the proposed formulation is able to guarantee zero steady-state tracking error and provides adaptation for both the feedforward and feedback controller gains. The framework continually adapts the controller gains to optimum values and provides a through-life adaptation strategy.
Model-free RL approaches are enabled by iterative techniques that utilize the Bellman optimality equations to develop forward-in-time update equations, which are solved at each time step. 2,42 One such iterative technique is the policy iteration (PI) method, which requires an initially admissible policy 8 (ie, a stabilizing policy with a finite cost V(⋅)) and successively alternates between the following update equations for (20) and (21):

V_{j+1}(X_k) = X_k^⊤ Q_1 X_k + μ_j(X_k)^⊤ R μ_j(X_k) + V_{j+1}(X_{k+1}),  (36)

μ_{j+1}(X_k) = arg min_{ũ} (X_k^⊤ Q_1 X_k + ũ^⊤ R ũ + V_{j+1}(X_{k+1})).  (37)
Given an admissible policy μ(X), the value is evaluated by solving (36) till convergence, while an improved policy is computed using (37). Both update equations, respectively, constitute the policy evaluation and policy update steps of the PI method. The PI method is justified in Reference 43 by showing that the improved policy ensures that V_{j+1}(X_k) ≤ V_j(X_k), which is associated with the monotonicity property of the update equations. This way, the PI recursion computes a strictly improved policy, and convergence to the optimal policy and value under Assumption 1 has been shown in Reference 44.
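In its model-based LQR form, the PI recursion above amounts to alternately solving a discrete Lyapunov equation (policy evaluation) and recomputing the greedy gain (policy improvement). The sketch below uses hypothetical matrices; the article's RL algorithms replace the evaluation step with measurement-based least squares:

```python
import numpy as np

# Model-based PI for the LQR: evaluate the current gain via a discrete
# Lyapunov equation, then improve it greedily. Matrices are hypothetical.
A = np.array([[1.0, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

K = np.array([[1.0, 2.0]])  # initial admissible (stabilizing) policy
for _ in range(15):
    Acl = A - B @ K
    M = Q + K.T @ R @ K
    # Policy evaluation: P = Acl' P Acl + M (solved here by iteration)
    P = M.copy()
    for _ in range(3000):
        P = Acl.T @ P @ Acl + M
    # Policy improvement: greedy gain for the evaluated value
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

Each improved gain is again stabilizing and its value is no worse than the previous one, which is the monotonicity property referred to above.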
Model-free approaches for the LQT problem are therefore enabled by approximating the value function of (20) as follows:

V(X) ≈ θ_c^⊤ Φ(X),  (38)

where Φ(X) is a set of basis functions and θ_c are the function weights. Equation (38) gives the value function approximation (VFA), which is defined as the sum of the discounted reward signal L_k starting from state X_k under some fixed policy μ(X).
Similarly, an approximation to the state-action value function with basis functions Ψ(X, ũ) and weights θ is given as:

Q(X, ũ) ≈ θ^⊤ Ψ(X, ũ).  (39)

Equation (39) is the Q-function approximation (QFA), which is defined as the sum of the discounted reward signal L_k starting from state X_k, taking action ũ_k, and following the policy μ(X) thereafter. Depending on the function that is being approximated, two RL strategies are therefore proposed for the LQT problem.

VFA-based RL algorithm
For the VFA approximation, the Bellman equation for the value function (38) becomes:

θ_c^⊤ Φ(X_k) = X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + θ_c^⊤ Φ(X_{k+1}).  (40)

A second function approximation is used to adapt the controller gains and is given as:

μ(X) = θ_a^⊤ X.  (41)

The RL adaptation utilizes the PI recursion (36), (37), consisting of both value and policy update steps. For the value update step, the policy is kept fixed while the value function parameters are updated using the system measurements at N episodic intervals (ie, from some initial state X_0 to a terminal state X_N). After each episode, the controller parameters are adapted from (21) using a gradient descent tuning as:

θ_a^(i+1) = θ_a^i − l_a ∂(X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + V(X_{k+1}))/∂θ_a,  (42)

where l_a > 0 ∈ R is a tuning step size. This is repeated till convergence of both the value function parameters and the controller gains. This way, the VFA-based RL method solves the online LQT problem of Section 2 using the proposed augmented formulation with integral control and without requiring knowledge of the system dynamics. Algorithm 1 describes the VFA-based adaptation of the controller parameters using a policy iteration (PI) recursion.

Algorithm 1. VFA-based RL algorithm for the LQT problem
Initialize V(X) ≈ θ_{c,k}^⊤ Φ(X) at k = 0 for some stabilizing policy μ(X) = θ_{a,k}^⊤ X, and do till convergence:
Value function update step
1: for j = 0 : N do
2: At X_j, compute the control input ũ_j with exploration signal ε as ũ_j = μ(X_j) + ε.
3: Compute the least squares solution for θ_{c,j+1} using the measurements L_j = X_j^⊤ Q_1 X_j + ũ_j^⊤ R ũ_j, X_j, and X_{j+1} from the relation (Φ(X_j) − Φ(X_{j+1}))^⊤ θ_{c,j+1} = L_j.
4: j = j + 1.
5: end for
Policy update step
Require: Set θ_{c,k+1} = θ_{c,j+1}|_{j=N}
6: Update the policy parameters using the gradient descent tuning as θ_a^(i+1) = θ_a^i − l_a ∂(X_j^⊤ Q_1 X_j + ũ^⊤ R ũ + V(X_{j+1}))/∂θ_a.
7: At the end of the gradient tuning, set θ_{a,k+1} = θ_a^(i+1) and update the policy as μ(X) = θ_{a,k+1}^⊤ X.
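The value-update step above can be sketched as a batch least squares fit of the quadratic VFA weights on the Bellman relation. The plant and gain below are illustrative, and, for simplicity of the sketch, excitation is provided by random episodic resets of the state rather than by an added exploration signal:

```python
import numpy as np

# Least squares fit of the VFA weights theta_c from measurements only:
#   (phi(X_j) - phi(X_{j+1}))' theta_c = L_j  under a fixed policy.
# The plant and gain are hypothetical placeholders; random resets of
# the state provide the excitation for the regression.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
K = np.array([[0.5, 0.5]])  # fixed stabilizing policy being evaluated

def phi(X):
    x1, x2 = X.ravel()
    return np.array([x1 * x1, x1 * x2, x2 * x2])  # quadratic basis

rows, L_vals = [], []
for _ in range(30):
    x = rng.standard_normal((2, 1))   # episodic reset (excitation)
    u = -K @ x                        # follow the fixed policy
    x_next = A @ x + B @ u
    L_vals.append((x.T @ Q @ x + u.T @ R @ u).item())
    rows.append(phi(x) - phi(x_next))

theta_c, *_ = np.linalg.lstsq(np.array(rows), np.array(L_vals), rcond=None)
P_hat = np.array([[theta_c[0], theta_c[1] / 2],
                  [theta_c[1] / 2, theta_c[2]]])  # recovered value kernel
```

Because the Bellman relation holds exactly along policy-following transitions, the regression recovers the value kernel of the evaluated policy.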

Remarks on implementation of Algorithm 1
• The number of gradient tuning update steps i can be chosen as the number of episodic steps j of the value function update.
• The VFA-based RL algorithm is not completely model free, as knowledge of the input matrix B_1 is needed in computing the controller parameter updates. Consequently, this approach is limited to systems with variations occurring only in the drift or dynamics matrix A_1, as this is assumed unknown.
• For convergence of the parameter estimates, a persistence of excitation (PE) condition on the regressor matrix, as given in References 45 and 46, is required. An exploration signal ε is therefore added in the algorithm to ensure that the regressor satisfies:

β_0 I ≤ Σ_{j=0}^N φ_j φ_j^⊤ ≤ β_1 I,  (43)

where φ_j = Φ(X_j) − Φ(X_{j+1}) is the regressor and β_0, β_1 > 0 are bounds on the level of excitation.

Q-function-based RL algorithm
Similar to the VFA approximation method, the Bellman equation for the Q-function (39) becomes:

θ^⊤ Ψ(X_k, ũ_k) = X_k^⊤ Q_1 X_k + ũ_k^⊤ R ũ_k + θ^⊤ Ψ(X_{k+1}, μ(X_{k+1})).  (44)

The RL adaptation equally utilizes the PI recursion (36), (37) and consists of both Q-function and policy update steps. In contrast to the VFA algorithm, the Q-function explicitly approximates the control inputs for each state, from which the optimal control input can be obtained via a greedy optimization. This makes the QFA algorithm completely model free, using only the measurements observed along the system trajectories for the controller updates; it is further described in Algorithm 2.

Algorithm 2. QFA-based RL algorithm for the LQT problem
Initialize Q(X, ũ) ≈ θ_k^⊤ Ψ(X, ũ) at k = 0 for some stabilizing policy μ(X), and do till convergence:
Q-function update step
1: for j = 0 : N do
2: At X_j, compute the control input ũ_j with exploration signal ε as ũ_j = μ(X_j) + ε.
3: Compute the least squares solution for θ_{j+1} using the measurements L_j = X_j^⊤ Q_1 X_j + ũ_j^⊤ R ũ_j, X_j, and X_{j+1} from the relation (Ψ(X_j, ũ_j) − Ψ(X_{j+1}, μ(X_{j+1})))^⊤ θ_{j+1} = L_j.
4: j = j + 1.
5: end for
Policy update step
Require: Set θ_{k+1} = θ_{j+1}|_{j=N}
6: Update the policy parameters using a greedy optimization as:

ũ_{i+1}(X) = arg min_{ũ} [θ_{k+1}^⊤ Ψ(X, ũ)].  (45)

The Q-function parameters are updated in each episode while keeping the policy fixed, and this constitutes the Q-function update step. For the policy update, a greedy optimization is performed after each episode using the adapted Q-function parameters as in the expression above. Like Algorithm 1, an exploration signal is added to ensure PE and to satisfy (43).

FIGURE 2: Schematic of the proposed RL framework for the optimal tracking control using the augmented formulation with integral control. The RL block represents either the VFA or QFA algorithm that continually uses the observed system measurements to adapt the tracking controller gains to optimum values subject to varying or unknown system dynamics.

The RL control strategies described above solve the online LQT problem without knowledge of the system dynamics or variations. Furthermore, by using the proposed augmented formulation with integral control, the RL frameworks do not require any predetermined feedforward tracking control input, restrictive assumptions on the reference generator dynamics, or the use of discounted tracking costs. The RL tracking control scheme is represented schematically in Figure 2.
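The QFA recursion can be sketched in its LQR form: the quadratic Q-kernel H is fitted by least squares from one-step measurements only, and the greedy update extracts the gain directly from the fitted kernel, with no knowledge of the system matrices (which appear below only to generate the data). The plant and initial gain are hypothetical placeholders:

```python
import numpy as np

# Model-free Q-learning PI for the LQR: fit the quadratic Q-kernel H by
# least squares from one-step data, then update the gain greedily as
# K = H_uu^{-1} H_uX. A and B are used only to generate measurements.
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
K = np.array([[0.2, 0.2]])  # initial stabilizing policy

def psi(X, u):
    z = np.vstack([X, u]).ravel()  # z = [x1, x2, u]
    return np.array([z[0]**2, z[1]**2, z[2]**2,
                     z[0]*z[1], z[0]*z[2], z[1]*z[2]])

for _ in range(10):  # PI: Q-evaluation + greedy policy update
    rows, targets = [], []
    for _ in range(40):
        X = rng.standard_normal((2, 1))                 # episodic reset
        u = -K @ X + 0.1 * rng.standard_normal((1, 1))  # exploration
        X_next = A @ X + B @ u
        u_next = -K @ X_next              # follow the policy thereafter
        rows.append(psi(X, u) - psi(X_next, u_next))
        targets.append((X.T @ Q @ X + u.T @ R @ u).item())
    th, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    H = np.array([[th[0], th[3] / 2, th[4] / 2],
                  [th[3] / 2, th[1], th[5] / 2],
                  [th[4] / 2, th[5] / 2, th[2]]])
    K = np.linalg.solve(H[2:, 2:], H[2:, :2])  # greedy: K = H_uu^{-1} H_uX
```

Unlike the VFA case, the one-step Bellman identity for the Q-function holds exactly even for exploratory inputs, which is why exploration introduces no bias here.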

SIMULATION CASE STUDIES
The LQT RL approach is demonstrated on two simulation case studies. The first is a system with an initially unstable and unknown dynamics that shows convergence of the proposed RL tracking methods to the optimal tracking controller gains. The second case study addresses the optimal tracking control problem in a buck power converter system, which is subject to uncertain or varying component tolerances under different operating conditions.

Case study 1
Consider a 2-state system with unstable continuous-time dynamics. Using a sampling time t_s = 0.03 seconds, an equivalent discrete-time system is formed using Euler's discretization. The tracking control problem is to track a time-varying step reference input from any finite initial condition x_0, representative of step commands in a servo system or precision-tracking applications. The tracking cost parameters in (18) are chosen as Q_1 = 2I_3, R = 0.05, and γ = 1.
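Since the case study matrices are not reproduced in this text, the Euler discretization step can be illustrated with a hypothetical unstable continuous-time pair (A_c, B_c):

```python
import numpy as np

# Euler discretization x_{k+1} = (I + t_s*Ac) x_k + t_s*Bc u_k of a
# hypothetical unstable continuous-time system (a placeholder for the
# case study's matrices, which are not reproduced here).
t_s = 0.03
Ac = np.array([[0.0, 1.0],
               [2.0, -1.0]])  # eigenvalues 1 and -2: open-loop unstable
Bc = np.array([[0.0],
               [1.0]])

Ad = np.eye(2) + t_s * Ac
Bd = t_s * Bc
```

The unstable continuous-time eigenvalue maps to a discrete-time eigenvalue outside the unit circle, so the discretized system is also unstable, as in the case study.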
(1) Existing online solution approach with the use of discounted cost: The existing online solution to the optimal tracking control problem, as discussed in Section 2, requires knowledge of the reference dynamics and the use of a discounted tracking cost. For the given tracking problem, consider the reference dynamics of (9) to be given by the linear difference equation:

r_{k+1} = F r_k,

where F = 1. An augmented system with the reference dynamics can then be formulated according to (10). Furthermore, as a result of using reference dynamics that do not tend to zero, a discounted cost must be used. A comparison of the performance of this approach using different discount factors against the proposed augmented formulation with integral control is shown in Figure 3. As observed in the simulation result, a discount factor of γ = 0.8 gave a slower response but a reduced steady-state error, while a discount factor of γ = 0.7 gave a faster response but a larger steady-state error. Existing online tracking approaches using a discount factor are therefore not only restricted in the type of reference dynamics that can be used but also cannot guarantee zero steady-state tracking error. In the following, the proposed online solution approaches, which require neither knowledge of the reference dynamics nor the use of a discounted cost, are presented.
(2) Model-based solutions using the proposed augmented formulation with integral control: The baseline solution for the augmented formulation with integral control using the system models is first presented. An augmented system with integral control is formed according to (22). Using the given system models (A_{1(1)}, B_1), the optimal solution P_1 to the corresponding ARE (30) is obtained, together with the optimal tracking controller gains. However, in practice, the system dynamics may be unknown or time varying, therefore motivating the use of online RL methods.
(3) Model-free RL solutions: The proposed model-free RL approaches can be used to obtain the optimal tracking controller gains online subject to the unknown or varying system dynamics.

VFA-based RL adaptation
From Lemma 1, the value function for the augmented formulation with integral control is quadratic; thus, the value function approximation for the given 2-state system in Algorithm 1 is chosen to be a quadratic function of the three augmented states. From Algorithm 1, an initially suboptimal but stabilizing policy μ(X) = θ_{a,0}^⊤ X is arbitrarily selected. The rest of Algorithm 1 is then run online till convergence of the tracking controller parameters using only the observed system measurements. The VFA parameters converged to the optimal values θ*_{c(1)} and θ*_{a(1)}. To demonstrate the adaptation of the tracking controller gains to optimal values using the proposed RL tracking control framework, the system drift matrix A is changed instantaneously during simulation to A_{(2)}, with a new baseline model-based solution computed using the system A_{(2)} matrix. Following this system variation, the tracking controller gains are no longer optimal, resulting in a decline in the system performance. This can be detected in practice by using a threshold on standard step response parameters, such as percentage overshoot (P.O.) and rise time, and used as an enable signal to reinitiate the RL learning process. The VFA parameters after the system variation converged to θ*_{c(2)} and θ*_{a(2)}. Figure 4 shows the parameter convergence using the VFA-based RL adaptation to the optimal but assumed unknown values before and after the system variation. Figure 5 shows the overall system response to time-varying step reference inputs at the various stages of the RL adaptation. The region with θ_{a,c(0)} in the figure corresponds to the system response using the initial suboptimal controller gains, while the region with θ_{a,c(1)} shows the system response after convergence to the optimal controller values from the RL adaptation. After the system variation and keeping the controller values fixed, the region with θ_{a,c(1)} with variation shows the decline in system performance, following which the RL adaptation is reenabled. The new system performance after convergence to the new optimal control gains is then shown in the region with θ_{a,c(2)}.

FIGURE 4: Online adaptation and convergence of both the value function and controller parameters to the optimal values (in black dashed lines) using Algorithm 1. θ_{a,c(0)} are the initial suboptimal controller parameters, while θ_{a,c(1)} and θ_{a,c(2)} are, respectively, the identified optimal controller parameters before and after the system variation.

FIGURE 5: System response showing the system states and tracked output at the various stages of the RL adaptations. The region with θ_{a,c(0)} shows the response using the initial suboptimal controller gains, while the region with θ_{a,c(1)} shows the response from the adapted controller gains to the optimal values using the proposed algorithms. The region with θ_{a,c(1)} with variation shows the decline in the system performance following variations in the system dynamics while keeping the controller values fixed, while the region with θ_{a,c(2)} onward shows the response after adaptation to the new optimal control gains.

QFA-based RL adaptation
The QFA provides a completely model-free approach to the LQT problem. Similar to the VFA, the Q-function from Algorithm 2 is approximated for the 2-state system using a quadratic basis set as:

Q(X, ũ) = [X^⊤ ũ^⊤] H [X; ũ],

with the symmetric kernel matrix parameterized by the weights θ as:

H = [θ(1), 0.5θ(2), 0.5θ(3), 0.5θ(4);
     0.5θ(2), θ(5), 0.5θ(6), 0.5θ(7);
     0.5θ(3), 0.5θ(6), θ(8), 0.5θ(9);
     0.5θ(4), 0.5θ(7), 0.5θ(9), θ(10)].

The corresponding controller gains are then derived according to (45) as:

K_1 = θ(10)^{−1} [0.5θ(4), 0.5θ(7), 0.5θ(9)].

Therefore, the optimal controller gains are computed from the converged parameters θ*_1. Figure 6 shows the online adaptation and convergence of the Q-function parameters before and after the system variation, respectively. After convergence to the optimal values, the system response using the QFA-based RL adaptation is as shown in Figure 5. The QFA RL approach therefore provides a completely model-free online tracking control solution.

F I G U R E 6
Online adaptation and convergence of the Q-function parameters to the optimal values (in black dashed lines) using Algorithm 2. (0) are the initial suboptimal controller parameters, while (1) and (2) are, respectively, the identified optimal controller parameters before and after the system variation

Case study 2
This case study addresses the optimal tracking control problem in a buck power converter system, which is subject to uncertain or varying component tolerances under different operating conditions. Consider a buck power converter with a switching element and consisting of an inductor L_p with a small series resistance r, a capacitor C_p, and a diode. The voltage drop across the diode can be neglected as its value is typically small. 47 For continuous conduction mode (CCM) operation, the control input is defined as the duty ratio u ∈ [0, 1], and the buck converter dynamics are given as: 47

$$
L_p \frac{di}{dt} = Eu - ri - v, \qquad C_p \frac{dv}{dt} = i - i_o,
$$

where E is the dc input voltage, i is the inductor current, v is the output voltage, i_o is the load current, and R_L is the load resistor, so that i_o = v/R_L for a resistive load.
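These dynamics can be simulated directly. The sketch below is a minimal forward-Euler simulation under the stated component values, assuming a purely resistive load (i_o = v/R_L) and a fixed duty ratio chosen for illustration:

```python
import numpy as np

# Buck converter CCM dynamics with the stated component values;
# a purely resistive load (i_o = v / R_L) is an assumption here.
r, Lp, Cp, E = 0.5, 1e-3, 50e-6, 48.0

def buck_deriv(x, u, RL=200.0):
    """x = [i, v]: inductor current and output voltage; u in [0, 1]
    is the duty ratio acting on the dc input voltage E."""
    i, v = x
    di = (E * u - r * i - v) / Lp
    dv = (i - v / RL) / Cp
    return np.array([di, dv])

# Minimal forward-Euler integration at a fixed, illustrative duty ratio
x, dt, u = np.array([0.0, 0.0]), 1e-6, 0.5
for _ in range(20000):  # 20 ms of simulated time
    x = x + dt * buck_deriv(x, u)
# v settles near the ideal steady-state value u*E / (1 + r/R_L)
```

The steady-state relation v ≈ uE/(1 + r/R_L) is what makes duty-ratio control of the output voltage possible, and it also shows why a load change alters the operating point the controller must track.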
The aim of the controller is to regulate the output voltage to a given v ref .
With the states chosen as the inductor current i and the output voltage v, so that x = [i v]^T, the corresponding state-space dynamics are formulated as:

$$
\dot{x} = \begin{bmatrix} -r/L_p & -1/L_p \\ 1/C_p & -1/(R_L C_p) \end{bmatrix} x + \begin{bmatrix} E/L_p \\ 0 \end{bmatrix} u.
$$

The system component parameters are given as r = 0.5 Ω, L_p = 1 mH, C_p = 50 μF, and E = 48 V. Variations can occur due to modeling uncertainties and component tolerances under different operating conditions. For this example, the load resistor is changed instantaneously during simulation from R_L = 200 Ω to 100 Ω and is assumed unknown. To demonstrate the proposed online tracking RL approach, an augmented system as given in (22) is formed.

F I G U R E 9 System response at the various stages of the online RL adaptation. Region with (0) shows the response using the initial suboptimal controller gains, while region with (1) shows the response from the adapted controller gains to the optimal values using the proposed Algorithms. Following variations in the load resistor R_L, region with (2) onward shows the response after adaptation to the new optimal control gains

Figure 8 shows the convergence of the online adaptation of the Q-function parameters compared with the optimal but assumed unknown values. With a variation in the load resistor to R_L = 100 Ω, the Q-function parameters reconverged to the new optimal values, as shown in Figure 8, and the control gains converged to a(2) = [−0.2954; −0.0851; 0.1090] = −K_1*(2). Figure 9 shows the overall buck power converter system response at the various stages of the online RL adaptation. The region with (0) in the figure corresponds to the system response using the initially suboptimal tracking controller gains, while the region with (1) shows the system response after convergence to the optimal controller values from the RL adaptation. Following the variation in the load resistor R_L, the system performance after convergence to the new optimal control gains is then shown in the region with (2). In this way, the proposed online optimal and adaptive tracking RL framework is able to maintain the desired level of system performance subject to variations in the system parameters.
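For reference, the model-based optimal gains that the RL algorithms identify online can be checked offline when the model is known. The sketch below discretizes the converter dynamics under zero-order hold (the sample time Ts and the weights Qa, R are illustrative assumptions, not the paper's values), augments the state with the integral of the output-tracking error in the spirit of (22), and solves the DT algebraic Riccati equation:

```python
import numpy as np
from scipy.linalg import expm, solve_discrete_are

# Continuous-time buck model with state x = [i, v]
r, Lp, Cp, E, RL = 0.5, 1e-3, 50e-6, 48.0, 200.0
A = np.array([[-r / Lp, -1.0 / Lp],
              [1.0 / Cp, -1.0 / (RL * Cp)]])
B = np.array([[E / Lp], [0.0]])

# Zero-order-hold discretization via the matrix exponential
# (the sample time Ts is an illustrative choice)
Ts, n = 1e-4, 2
M = expm(np.block([[A, B], [np.zeros((1, n + 1))]]) * Ts)
Ad, Bd = M[:n, :n], M[:n, n:]

# Augment with the integral of the output-tracking error,
# e_{k+1} = e_k + (v_ref - C x_k); v_ref enters as an exogenous
# signal and does not affect the gain computation
C = np.array([[0.0, 1.0]])
Aa = np.block([[Ad, np.zeros((n, 1))], [-C, np.ones((1, 1))]])
Ba = np.vstack([Bd, np.zeros((1, 1))])

# Illustrative weights; the case study's actual Q, R may differ
Qa, R = np.diag([0.1, 1.0, 10.0]), np.array([[1.0]])

# Solve the DT algebraic Riccati equation for the benchmark gain
P = solve_discrete_are(Aa, Ba, Qa, R)
K = np.linalg.solve(R + Ba.T @ P @ Ba, Ba.T @ P @ Aa)  # u = -K z
```

Repeating the computation with R_L = 100 Ω yields the benchmark gains after the load variation; the point of the proposed framework is that the RL algorithms reach such gains from measurements alone, without access to A and B.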

CONCLUSIONS
This article has proposed and demonstrated an online optimal and adaptive reinforcement learning (RL) tracking controller using an augmented formulation with integral control for unknown or varying discrete-time (DT) systems. Existing online tracking methods assume either a predetermined feedforward input for the tracking control, restrictive assumptions on the reference model dynamics, or discounted tracking costs. Moreover, the existing online tracking methods are unable to guarantee zero steady-state tracking error. By contrast, the proposed method transforms the DT optimal tracking control problem into a regulation problem and solves the resulting DT algebraic Riccati equation online, without knowledge of the system dynamics or any of the restrictive assumptions of the existing methods, while eliminating steady-state tracking error. Two RL strategies are proposed for the LQT based on both the value function approximation and Q-learning. The approaches offer a through-life adaptation strategy for the controller gains and guarantee zero steady-state tracking error, as shown in the case study examples.