Neural network-based optimal tracking control for partially unknown discrete-time non-linear systems using reinforcement learning

Optimal tracking control of discrete-time non-linear systems with unknown drift dynamics is investigated in this paper. Firstly, in light of the discrete-time non-linear system and the reference signal, an augmented system is constructed, so that the optimal tracking control problem of the original non-linear system is transformed into an optimal regulation problem for the augmented system. The solution to the optimal regulation problem can be found by solving its Hamilton–Jacobi–Bellman (HJB) equation. To solve the HJB equation, a new critic-actor neural network (NN) structure-based online reinforcement learning (RL) scheme is proposed to learn its solution, while the corresponding optimal control input that minimizes the HJB equation is calculated in a forward-in-time manner without requiring any value or policy iterations or the system drift dynamics. The uniform ultimate boundedness (UUB) of the NN weight errors and the closed-loop augmented system states is proved via Lyapunov theory. Finally, simulation results are given to validate the proposed scheme.


INTRODUCTION
Robot manipulators [1], unmanned aerial vehicles [2], nonholonomic wheeled mobile robots [3], aircraft [4] and many other physical systems in practical engineering can be modelled as non-linear systems. Optimal tracking control has long been a hot topic in control theory and has many practical engineering applications, e.g. [5][6][7] and references therein. Therefore, exploring the optimal tracking control of non-linear systems has great theoretical significance and practical value. Reinforcement learning is a biologically inspired approximation method with clear advantages in coping with optimization problems involving uncertain models or unknown dynamics. Adaptive dynamic programming [8][9][10][11] and approximate dynamic programming (ADP) [12][13][14][15] also belong to the category of reinforcement learning and overcome the deficiencies of traditional dynamic programming, such as the curse of modelling and the curse of dimensionality [16,17].
Over the past few decades, many studies have been carried out on optimal tracking control for non-linear systems. Reinforcement learning, as an advanced intelligent optimization method, has been successfully applied to optimal tracking control for both general non-linear and linear systems and has achieved good performance, e.g. discrete-time non-linear systems [18][19][20][21][22][23], continuous-time non-linear systems [24][25][26][27][28][29][30][31], discrete-time linear systems [32][33][34] and continuous-time linear systems [35,36] and references therein. For linear systems, an adaptive dynamic programming method that does not require the system dynamics was proposed in [36]. An input-output measured-data-based reinforcement learning method was proposed to tackle the optimal tracking control of unknown discrete-time linear systems in [32,33]. Linear quadratic tracking control for partially unknown continuous-time systems was investigated in [35]. For non-linear systems, a model-based reinforcement learning method was proposed in [29], and, for the optimal tracking control of continuous-time non-linear systems, [37] developed a model-free RL method.
In [21], an actor-critic NN-based reinforcement learning algorithm was developed to solve the optimal tracking control of partially unknown discrete-time non-linear systems. General value or policy iteration-based ADP was proposed to tackle the optimal tracking control of non-linear systems in [20,38]. Optimal tracking control of non-linear systems with input constraints was investigated in [28,30] using reinforcement learning, and an integral reinforcement learning idea was proposed in [30]. The ADP method was applied to coal gasification in [22]. A data-driven robust ADP method for the optimal tracking control of unknown general non-linear systems was developed in [23]. To achieve model-free optimal tracking control, a three-layer feed-forward NN identifier was adopted to reconstruct the non-linear system dynamics in [39]. The work in [40] considered the H∞ optimal tracking control of constrained non-linear systems using an iterative adaptive learning algorithm. In [41], distributed H∞ optimal tracking control was presented for a class of physically interconnected large-scale non-linear systems in strict-feedback form with external disturbances and saturating actuators. In [42], the optimal output tracking control problem of discrete-time non-linear systems was considered and an adaptive dynamic programming algorithm with multi-step policy evaluation was presented. In [43], a critic-only Q-learning method was developed to solve the model-free optimal tracking control problem of discrete-time non-affine non-linear systems. Liu and Wei [44] developed a new discrete-time policy iteration ADP method for solving the infinite-horizon optimal control problem of non-linear systems. In [45], an iterative ADP algorithm based on heuristic dynamic programming was introduced to deal with the optimal tracking control problem of a class of unknown discrete-time non-linear systems. A new infinite-horizon NN-based adaptive optimal tracking control scheme for discrete-time non-linear systems was developed using an iterative ADP algorithm in [46]. A policy iteration reinforcement learning algorithm was employed in [47] to find the solution to the infinite-horizon linear quadratic tracker. In [48], an iterative ADP algorithm using the globalized dual heuristic programming technique was developed to deal with the optimal control of a class of unknown discrete-time non-linear systems.
Although reinforcement learning methods have been widely used to tackle optimal tracking control problems, the existing reinforcement learning methods are mostly based on policy iteration or value iteration. As is well known, policy iteration usually requires an initial admissible control, while value iteration generally converges very slowly. These characteristics limit the online practical application of RL methods. Therefore, in order to eliminate these limitations, this paper employs a new reinforcement learning scheme to deal with the optimal tracking control of partially unknown discrete-time non-linear systems without requiring any value or policy iterations. The main innovations of this paper are summarized in the following three aspects: (1) A novel NN-based online reinforcement learning scheme is proposed to approximately solve the Hamilton-Jacobi-Bellman (HJB) equation. The optimal solution of the HJB equation is learned in a forward-in-time manner instead of by traditional value iteration or policy iteration, and new NN weight tuning laws are proposed accordingly. (2) The system drift dynamics is not required in the proposed scheme; in other words, the proposed scheme allows the non-linear system to be partially unknown. (3) The uniform ultimate boundedness (UUB) of the NN weight errors and the closed-loop augmented system states is analysed. The value function and the control input are also proved to converge to the optimal value function and optimal control input with a small bounded error.
The structure of this paper is as follows. In Section 2, we introduce the problem formulation. In Section 3, an NN-based online reinforcement learning algorithm is presented to solve the tracking HJB equation. In Section 4, the stability and convergence of our scheme are established via a Lyapunov approach. Simulation studies are given to demonstrate the effectiveness of our design scheme in Section 5. Section 6 concludes this paper and gives directions for future research.

PROBLEM FORMULATION
Consider the following discrete-time non-linear system:

$$x(k+1) = f(x(k)) + g(x(k))u(k), \tag{1}$$

where x(k) ∈ ℝⁿ is the measurable system state, u(k) ∈ ℝᵐ is the control input, f(x(k)) ∈ ℝⁿ is the system drift dynamics and g(x(k)) ∈ ℝ^{n×m} is the system input dynamics, which has a generalized inverse.
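For illustration, the following minimal sketch shows the affine structure of (1) in simulation code. The drift f_drift and input map g_input are hypothetical stand-ins chosen only for this sketch (they are not the example systems used later); the proposed scheme treats the drift as unknown and uses only measured states.

```python
import numpy as np

# Hypothetical drift and input map, for illustration of the affine form (1) only.
def f_drift(x):
    return np.array([0.9 * x[0] + 0.1 * x[1],
                     -0.2 * np.sin(x[0]) + 0.9 * x[1]])

def g_input(x):
    return np.array([[0.0], [1.0]])   # n x m input dynamics

def system_step(x, u):
    # x(k+1) = f(x(k)) + g(x(k)) u(k)
    return f_drift(x) + g_input(x) @ u
```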

Assumption 1.
The non-linear system (1) is stabilizable, f(x(k)) + g(x(k))u(k) is locally Lipschitz and f(0) = 0.

The desired reference signal is generated by the following bounded command:

$$r(k+1) = \psi(r(k)), \tag{2}$$

where r(k) ∈ ℝⁿ is the reference signal and ψ(·) is the command generator dynamics; the reference signal needs only to be stable in the sense of Lyapunov, not necessarily asymptotically stable. The objective of this paper is to find an optimal control input u*(k) that makes x(k) follow the reference signal r(k) while minimizing a pre-defined cost function.
In order to achieve tracking of the reference signal, we define the tracking error as

$$e(k) = x(k) - r(k). \tag{3}$$

The cost function is defined as

$$V(e(k)) = \sum_{i=k}^{\infty}\gamma^{\,i-k}\left[e(i)^{T}Qe(i) + u(i)^{T}Ru(i)\right], \tag{4}$$

where Q = Qᵀ ≥ 0, R = Rᵀ > 0 and 0 < γ ≤ 1 is the discount factor. According to (1) and (3), the following tracking error dynamics are obtained:

$$e(k+1) = f(x(k)) + g(x(k))u(k) - \psi(r(k)). \tag{5}$$

Define the augmented state ξ(k) = [e(k)ᵀ r(k)ᵀ]ᵀ ∈ ℝ²ⁿ. Then, the augmented system dynamics comprised of (2) and (5) can be given as follows:

$$\xi(k+1) = F(\xi(k)) + G(\xi(k))u(k), \tag{6}$$

where

$$F(\xi(k)) = \begin{bmatrix} f(e(k)+r(k)) - \psi(r(k)) \\ \psi(r(k)) \end{bmatrix}, \qquad
G(\xi(k)) = \begin{bmatrix} g(e(k)+r(k)) \\ 0 \end{bmatrix}.$$
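As a sketch of how the augmented state is propagated in simulation, the code below composes the tracking-error dynamics (5) with the command (2). It reuses system_step from the sketch above; the rotation-based psi is a hypothetical command generator, chosen because a rotation is stable in the sense of Lyapunov but not asymptotically stable, matching the assumption on the reference.

```python
import numpy as np

theta = 0.1  # hypothetical rotation angle for the command generator
A_r = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

def psi(r):
    # r(k+1) = psi(r(k)); marginally stable, not asymptotically stable
    return A_r @ r

def augmented_step(xi, u):
    # xi(k) = [e(k); r(k)]; implements (6) using only measured quantities
    n = xi.size // 2
    e, r = xi[:n], xi[n:]
    x = e + r                              # recover x(k) = e(k) + r(k)
    e_next = system_step(x, u) - psi(r)    # tracking-error dynamics (5)
    return np.concatenate([e_next, psi(r)])
```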

Assumption 2. F(ξ(k)) satisfies the Lipschitz condition and G(ξ(k)) is bounded, i.e. ‖G(ξ(k))‖ ≤ G_M for a positive constant G_M.
Now the objective of this paper becomes finding an optimal control input u*(k) for the augmented system dynamics (6) which minimizes the following cost function:

$$V(\xi(k)) = \sum_{i=k}^{\infty}\gamma^{\,i-k}\left[\xi(i)^{T}Q_{1}\xi(i) + u(i)^{T}Ru(i)\right], \tag{7}$$

where

$$Q_{1} = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{2n\times 2n}.$$

In the rest of this note, we will be committed to finding the optimal control input u*(k) for the augmented system (6).
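A short code sketch of (7), mainly to make the block structure of Q₁ explicit: only the tracking-error half of ξ is penalized, so a persistent bounded reference does not inflate the discounted cost. The helper names are illustrative assumptions.

```python
import numpy as np

def q1_matrix(Q):
    # Q1 = blkdiag(Q, 0): only the tracking-error part of xi is penalized
    n = Q.shape[0]
    Q1 = np.zeros((2 * n, 2 * n))
    Q1[:n, :n] = Q
    return Q1

def discounted_cost(xis, us, Q1, R, gamma):
    # finite-horizon truncation of (7): sum_i gamma^(i-k) [xi'Q1 xi + u'Ru]
    return sum(gamma**i * (xi @ Q1 @ xi + u @ R @ u)
               for i, (xi, u) in enumerate(zip(xis, us)))
```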

NN-BASED ONLINE REINFORCEMENT LEARNING OPTIMAL TRACKING CONTROL DESIGN SCHEME
In order to find the optimal controller u*(k), this section presents the design of an online NN-based RL scheme. Equation (7) can be rewritten as follows:

$$V(\xi(k)) = Q_{1}(\xi(k)) + u(k)^{T}Ru(k) + \gamma V(\xi(k+1)), \tag{8}$$

where Q₁(ξ(k)) = ξ(k)ᵀQ₁ξ(k).
In line with Bellman's principle of optimality [5], V*(ξ(k)) satisfies the following discrete-time tracking HJB equation:

$$V^{*}(\xi(k)) = \min_{u(k)}\left[Q_{1}(\xi(k)) + u(k)^{T}Ru(k) + \gamma V^{*}(\xi(k+1))\right]. \tag{9}$$

According to the stationarity condition ∂V*(ξ(k))/∂u(k) = 0 of traditional optimal control theory [5], the optimal control input u*(k) for (6) which minimizes (7) or (8) can be obtained as follows:

$$u^{*}(k) = -\frac{\gamma}{2}R^{-1}G(\xi(k))^{T}\,\nabla V^{*}(\xi(k+1)), \tag{10}$$

where ∇V*(ξ(k+1)) = ∂V*(ξ(k+1))/∂ξ(k+1). As is well known, due to the inherent non-linearity and the dependence on ξ(k+1), the discrete-time tracking HJB equation is difficult to solve directly.
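The minimizing control (10) is straightforward to evaluate once a value-gradient estimate is available; the sketch below expresses it as a generic helper. Note that the gradient argument must already be evaluated at ξ(k+1), which is exactly the difficulty just mentioned.

```python
import numpy as np

def greedy_control(G_xi, grad_V_next, R, gamma):
    # u*(k) = -(gamma/2) R^{-1} G(xi(k))^T dV*(xi(k+1))/dxi(k+1), cf. (10)
    return -(gamma / 2.0) * np.linalg.solve(R, G_xi.T @ grad_V_next)
```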
In what follows, we develop a critic-actor NN-based online reinforcement learning approach to approximate the solution of the tracking HJB equation (9).
In line with the Weierstrass higher-order approximation theorem [49], using a single-layer NN, the value function V*(ξ(k)) and its gradient are represented as follows:

$$V^{*}(\xi(k)) = W_{c}^{T}\phi(\xi(k)) + \varepsilon(\xi(k)), \tag{11}$$

$$\nabla V^{*}(\xi(k)) = \nabla\phi(\xi(k))^{T}W_{c} + \nabla\varepsilon(\xi(k)), \tag{12}$$

where φ(ξ(k)) is a suitable linearly independent basis function vector comprising L₁ items, W_c is the ideal critic NN weight vector and ε(ξ(k)) is the approximation error; φ(ξ(k)), W_c and ε(ξ(k)) are bounded. Since W_c is unknown, the value function is approximated as

$$\hat{V}(\xi(k)) = \hat{W}_{c}^{T}(k)\phi(\xi(k)), \tag{13}$$

with gradient

$$\nabla\hat{V}(\xi(k)) = \nabla\phi(\xi(k))^{T}\hat{W}_{c}(k), \tag{14}$$

where Ŵ_c(k) is the estimate of W_c.
Subsequently, substituting (13) into (8), we introduce the auxiliary residual error

$$e_{HJBk} = Q_{1}(\xi(k)) + \hat{u}(k)^{T}R\hat{u}(k) + \gamma\hat{V}(\xi(k+1)) - \hat{V}(\xi(k)), \tag{15}$$

which can be written compactly as

$$e_{HJBk} = Q_{1}(\xi(k)) + \hat{u}(k)^{T}R\hat{u}(k) + \hat{W}_{c}^{T}(k)\Delta\phi(\xi(k+1)), \tag{16}$$

where Δφ(ξ(k+1)) = γφ(ξ(k+1)) − φ(ξ(k)). In the critic NN, we tune Ŵ_c to minimize e_HJBk and choose the following error function:

$$E_{c}(k) = \frac{1}{2}\,e_{HJBk}^{T}e_{HJBk}. \tag{17}$$

Applying gradient descent to (17), using the chain rule and normalizing, the weight tuning law for the critic NN is obtained as follows [51]:

$$\hat{W}_{c}(k+1) = \hat{W}_{c}(k) - l_{c}\,\frac{\Delta\phi(\xi(k+1))\,e_{HJBk}}{\Delta\phi(\xi(k+1))^{T}\Delta\phi(\xi(k+1)) + 1}, \tag{18}$$

where 0 < l_c < 1 is a positive constant.
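As a sketch, one step of the normalized critic update (16)–(18) can be coded as follows, assuming a scalar Bellman residual and the basis vector φ evaluated at two consecutive augmented states:

```python
import numpy as np

def critic_update(W_c, phi_k, phi_next, stage_cost, gamma, l_c):
    # residual (16): e_HJB = stage + W_c' (gamma*phi(k+1) - phi(k))
    dphi = gamma * phi_next - phi_k
    e_hjb = stage_cost + W_c @ dphi
    # normalized gradient-descent step on E_c = 0.5 * e_HJB^2, as in (18)
    return W_c - l_c * dphi * e_hjb / (dphi @ dphi + 1.0)
```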

Remark 1.
It is easily observed that the cost function (7) and the NN cost function approximation (13) become zero only when ξ(k) = 0. Once the augmented system states become zero, the cost function approximation is no longer updated. Thus, this can be viewed as a persistency of excitation (PE) requirement for the inputs to the cost function NN, wherein the system states must be persistently exciting for long enough for the NN to learn the optimal cost function. This PE requirement guarantees that there is a constant ℏ that satisfies ℏ ≤ ‖Ω_k Ω_kᵀ‖_F [52] and is consistent with the PE condition used in [53]; a crude numerical proxy is sketched below.
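The following sketch tests the Frobenius-norm bound from Remark 1 over a sliding window of regressor vectors; stacking the window into a single matrix Ω and the threshold h_bar are assumptions made purely for illustration.

```python
import numpy as np

def excitation_ok(regressors, h_bar):
    # stack a window of regressor vectors Omega_k column-wise and test
    # h_bar <= ||Omega Omega^T||_F, a rough proxy for the PE requirement
    Omega = np.stack(regressors, axis=1)
    return np.linalg.norm(Omega @ Omega.T, ord='fro') >= h_bar
```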
In order to obtain the estimation error dynamics of the critic NN weights, using (11), we can rewrite (9) as

$$Q_{1}(\xi(k)) + u^{*}(k)^{T}Ru^{*}(k) = -W_{c}^{T}\Delta\phi(\xi(k+1)) + \varepsilon_{HJB}(k), \tag{19}$$

where ε_HJB(k) = ε(ξ(k)) − γε(ξ(k+1)). Defining the estimation error of the critic NN weights as W̃_c = W_c − Ŵ_c and substituting (19) into (16), we have

$$e_{HJBk} = -\tilde{W}_{c}^{T}(k)\Delta\phi(\xi(k+1)) + \varepsilon_{HJB}(k) + \hat{u}(k)^{T}R\hat{u}(k) - u^{*}(k)^{T}Ru^{*}(k). \tag{20}$$

Therefore, (16) can be represented as

$$e_{HJBk} = -\tilde{W}_{c}^{T}(k)\Delta\phi(\xi(k+1)) + E_{HJBk}, \tag{21}$$

where E_HJBk = ε_HJB(k) + û(k)ᵀRû(k) − u*(k)ᵀRu*(k). Now, according to (18) and (21), the critic NN weights estimation error dynamics can be represented as

$$\tilde{W}_{c}(k+1) = \left(I - l_{c}\,\frac{\Delta\phi(\xi(k+1))\,\Delta\phi(\xi(k+1))^{T}}{\Delta\phi(\xi(k+1))^{T}\Delta\phi(\xi(k+1)) + 1}\right)\tilde{W}_{c}(k) + l_{c}\,\frac{\Delta\phi(\xi(k+1))\,E_{HJBk}}{\Delta\phi(\xi(k+1))^{T}\Delta\phi(\xi(k+1)) + 1}, \tag{22}$$

where 0 < l_c < 1 is a design parameter and I is an identity matrix of suitable dimension.

Assumption 4. E_HJBk is bounded, i.e. ‖E_HJBk‖ ≤ E_M for some positive constant E_M.
Similar to the value function, we also use a single-layer NN to represent the optimal control input:

$$u^{*}(k) = W_{a}^{T}\sigma(\xi(k)) + \varepsilon_{a}(\xi(k)), \tag{23}$$

where σ(ξ(k)) is a suitable linearly independent basis function vector comprising L₂ items; σ(ξ(k)), ε_a(ξ(k)) and W_a are bounded. However, the ideal weight vector W_a is unknown. Therefore, (23) can be approximated as follows:

$$\hat{u}(k) = \hat{W}_{a}^{T}(k)\sigma(\xi(k)), \tag{24}$$

where Ŵ_a(k) is the estimate of W_a.
In the actor NN, we choose the error function as follows:

$$E_{a}(k) = \frac{1}{2}\,e_{a}(k)^{T}e_{a}(k), \tag{25}$$

where e_a(k) = Ŵ_aᵀ(k)σ(ξ(k)) + (γ/2)R⁻¹G(ξ(k))ᵀ∇φ(ξ(k+1))ᵀŴ_c(k) is the mismatch between the actor output and the control input indicated by the current critic estimate. Applying gradient descent to (25), using the chain rule and normalizing, the weight tuning law for the actor NN is given as follows [51,54]:

$$\hat{W}_{a}(k+1) = \hat{W}_{a}(k) - l_{a}\,\frac{\sigma(\xi(k))\,e_{a}(k)^{T}}{\sigma(\xi(k))^{T}\sigma(\xi(k)) + 1}, \tag{26}$$

where 0 < l_a < 1 is a design parameter.
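A sketch of one normalized actor step follows, with u_from_critic standing for the control indicated by the current critic estimate (the negative of the second term in e_a(k)); factoring the step into this helper is an illustrative assumption.

```python
import numpy as np

def actor_update(W_a, sigma_k, u_from_critic, l_a):
    # e_a: actor output minus the control implied by the critic, cf. (25)
    e_a = W_a.T @ sigma_k - u_from_critic
    # normalized gradient-descent step on E_a = 0.5 * e_a' e_a, as in (26)
    return W_a - l_a * np.outer(sigma_k, e_a) / (sigma_k @ sigma_k + 1.0)
```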
Since the control input u*(k) in (23) minimizes the value function (9), we have

$$W_{a}^{T}\sigma(\xi(k)) + \varepsilon_{a}(\xi(k)) = -\frac{\gamma}{2}R^{-1}G(\xi(k))^{T}\,\nabla V^{*}(\xi(k+1)). \tag{27}$$

Then, defining the estimation error of the actor NN weights as W̃_a = W_a − Ŵ_a and using (26) and (27), we have

$$\tilde{W}_{a}(k+1) = \tilde{W}_{a}(k) + l_{a}\,\frac{\sigma(\xi(k))\,e_{a}(k)^{T}}{\sigma(\xi(k))^{T}\sigma(\xi(k)) + 1}. \tag{28}$$

In terms of u*(k) and W̃_a(k), the closed-loop dynamics of the augmented system (6) can be represented as follows:

$$\xi(k+1) = F(\xi(k)) + G(\xi(k))\left(u^{*}(k) - \tilde{W}_{a}^{T}(k)\sigma(\xi(k)) - \varepsilon_{a}(\xi(k))\right). \tag{29}$$

So far, the online critic-actor NN-based RL optimal tracking control algorithm for discrete-time non-linear systems has been given as Algorithm 1. The schematic of the proposed actor-critic learning scheme is shown in Figure 1.
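Algorithm 1 itself is not reproduced in this extract; the sketch below stitches the preceding pieces into one plausible forward-in-time loop under the stated assumptions. It reuses critic_update, actor_update, greedy_control and augmented_step from the earlier sketches; grad_phi (the analytic Jacobian of the critic basis) and the exploration noise signal are assumptions of this illustration. Note that the drift F(ξ(k)) never appears: only the measured transition ξ(k) → ξ(k+1) and the input map G(ξ(k)) are used.

```python
import numpy as np

def online_rl_tracking(xi0, phi, grad_phi, sigma, G_of_xi, augmented_step,
                       Q1, R, gamma, l_c, l_a, steps, noise, seed=0):
    # Forward-in-time actor-critic loop: no value/policy iteration,
    # no use of the augmented drift F(xi(k)).
    rng = np.random.default_rng(seed)
    W_c = rng.uniform(-1.0, 1.0, phi(xi0).size)                 # critic weights
    W_a = rng.uniform(-1.0, 1.0, (sigma(xi0).size, R.shape[0])) # actor weights
    xi = xi0
    for k in range(steps):
        u = W_a.T @ sigma(xi) + noise(k)          # actor output (24) + probing
        xi_next = augmented_step(xi, u)           # measured transition only
        stage = xi @ Q1 @ xi + u @ R @ u
        W_c = critic_update(W_c, phi(xi), phi(xi_next), stage, gamma, l_c)
        # control implied by the critic, cf. (10) with the critic estimate
        grad_V = grad_phi(xi_next).T @ W_c        # dV/dxi at xi(k+1)
        u_c = greedy_control(G_of_xi(xi), grad_V, R, gamma)
        W_a = actor_update(W_a, sigma(xi), u_c, l_a)
        xi = xi_next
    return W_c, W_a
```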

Remark 2.
In our proposed Algorithm 1, the system drift dynamics F(ξ(k)) of the augmented system (6) is not required; only the measured states and the input dynamics G(ξ(k)) are used.

THE ANALYSIS OF STABILITY AND CONVERGENCE
In this section, the main results of this paper are summarized in the following theorem.

Theorem 1. Consider the augmented system dynamics (6) with the discrete-time tracking HJB equation (9), and let the control input be provided by (24). Let the critic tuning law for Ŵ_c(k) be given by (18) and the actor tuning law for Ŵ_a(k) be given by (26). Then, there exist positive constants l_c and l_a such that the closed-loop augmented system state ξ(k) and the NN weight errors W̃_c(k) and W̃_a(k) are UUB for sufficiently large L₁ and L₂. Moreover, the approximate cost function V̂(k) and control input û(k) converge to the optimal values with small bounded errors, i.e. ‖V*(k) − V̂(k)‖ < ε_V and ‖u*(k) − û(k)‖ < ε_u, where ε_V, ε_u are small positive constants.
Proof. Choose the following positive definite Lyapunov function candidate:

$$L(k) = L_{\xi}(\xi(k)) + L_{c}(\tilde{W}_{c}(k)) + L_{a}(\tilde{W}_{a}(k)),$$

where L_ξ(ξ(k)) = ξ(k)ᵀξ(k), L_c(W̃_c(k)) = tr{W̃_cᵀ(k)W̃_c(k)} and L_a(W̃_a(k)) = tr{W̃_aᵀ(k)W̃_a(k)}. Here, λ_max(R⁻¹) denotes the maximum eigenvalue of R⁻¹ and Ω_max is a positive constant related to Ω_k.

SIMULATION STUDIES
Example 1. In this example, consider the discrete-time non-linear system and desired reference signal borrowed directly from [21]. The critic NN activation function vector is chosen as the quadratic polynomial basis

φ(ξ(k)) = [ξ₁²(k), ξ₁(k)ξ₂(k), ξ₁(k)ξ₃(k), ξ₁(k)ξ₄(k), ξ₂²(k), ξ₂(k)ξ₃(k), ξ₂(k)ξ₄(k), ξ₃²(k), ξ₃(k)ξ₄(k), ξ₄²(k)]ᵀ,

where ξ₁(k) = e₁(k), ξ₂(k) = e₂(k), ξ₃(k) = r₁(k), ξ₄(k) = r₂(k). The initial NN weight vectors Ŵ_c and Ŵ_a are taken randomly from the interval [−1, 1]. The evolution of the critic NN weights is displayed in Figure 2 and that of the actor NN weights in Figure 3. As can be seen from Figures 2 and 3, the NN weights converge quickly. The evolution of the difference (x − r) between the system states and the reference signal is depicted in Figures 4 and 5. The reference trajectory and the actual trajectory are plotted in Figure 6. The evolution of the tracking errors is shown in Figure 7. The optimal control input is plotted in Figure 8. Comparing our simulation results with the findings in [21], our design scheme achieves good results in terms of convergence speed and tracking performance, which further validates the effectiveness of our design scheme.
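For concreteness, the quadratic basis of Example 1 can be coded directly; this matches the ten monomials listed above (all products ξᵢξⱼ with i ≤ j for ξ ∈ ℝ⁴).

```python
import numpy as np

def phi(xi):
    # 10-term quadratic basis from Example 1 for xi in R^4
    x1, x2, x3, x4 = xi
    return np.array([x1 * x1, x1 * x2, x1 * x3, x1 * x4,
                     x2 * x2, x2 * x3, x2 * x4,
                     x3 * x3, x3 * x4, x4 * x4])
```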

FIGURE 7 The evolution of tracking errors

Example 2. The desired reference signal is generated by a bounded command of the form (2). To guarantee the PE condition, a small exploratory noise e_noise = 0.05[sin(k) + sin(3k) + sin(5k) + sin(7k) + sin(9k) + sin(11k) + sin(13k)] is added to the control input u(k) for the first 600 time steps. The critic NN activation function vector φ(e(k)) is chosen analogously to Example 1.
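The probing signal above is simple to reproduce; a minimal sketch, with the cutoff at k = 600 taken from the text:

```python
import numpy as np

def e_noise(k):
    # exploratory noise from Example 2: 0.05 * sum of sin((2i+1) k) over the
    # seven odd frequencies 1, 3, ..., 13, applied only while k < 600
    if k >= 600:
        return 0.0
    return 0.05 * sum(np.sin((2 * i + 1) * k) for i in range(7))
```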

FIGURE 15 The optimal control input

Comparing our results with the findings of [55] shows that our design scheme presents better tracking performance.

CONCLUSION
This paper develops a novel NN-based online reinforcement learning optimal tracking control algorithm for discrete-time non-linear systems. Two NNs are adopted to approximately solve the tracking HJB equation of the augmented system. The UUB of the NN weight errors and the closed-loop augmented system states is proved, and the value function and control input are shown to converge to the optimal value function and optimal control input with a small bounded error. Finally, simulation studies verify the effectiveness of our design scheme. In future work, we will extend the results of this note to cooperative optimal tracking control for discrete-time non-linear multi-agent systems.