Learning-based control for discrete-time constrained nonzero-sum games

A generalized policy-iteration-based solution to a class of discrete-time multi-player non-zero-sum games with control constraints is proposed. Starting from initial admissible control policies, the iterative value function of each player, constructed from iterative control policies that satisfy the Nash equilibrium, converges approximately to the optimum. A stability analysis then shows that the iterative control policies stabilize the system and minimize the performance index function of each player. Meanwhile, neural networks are employed to approximate the iterative control policies and value functions under control constraints. Finally, two numerical simulations of discrete-time two-player non-zero-sum games, for a linear and a non-linear system, illustrate the effectiveness of the proposed scheme.


| INTRODUCTION
Reinforcement learning, one of the most active research fields in machine learning, is devoted to constructing an interactive bridge that lets an agent interact with its environment to achieve a goal. This interaction proceeds through a series of operations: the agent acts based on the reward supplied by the environment, and the environment state changes in response to the action [1-4]. Connecting reinforcement learning with optimal control, Werbos first advocated adaptive dynamic programming (ADP) [5]. Unlike dynamic programming (DP), the traditional optimal control solution, ADP solves the optimal control problem forward-in-time rather than backwards, avoiding the curse of dimensionality that afflicts the traditional method for high-dimensional and complex systems. The actor-critic structure is the basic structure of the ADP algorithm: the actor chooses a control policy for the system based on the perceived environment, and the critic improves the control policy by evaluating system performance. In this way the agent interacts with the environment and converges to an approximately optimal point through continued iteration on the optimal control problem [6-8].
The iterative methods in ADP algorithms are commonly divided into policy iteration (PI) and value iteration (VI). The VI algorithm consists of two steps, value update and policy improvement, while the PI algorithm consists of policy evaluation and policy improvement. For VI, convergence is slow because the iterative control policies may not be admissible during the iteration process. By contrast, all iterates in the PI procedure are admissible control policies, at the cost of computational complexity [9,10]. To balance the properties of the PI and VI algorithms, the generalized policy iteration (GPI) algorithm was first put forward in [5]; both PI and VI are special forms of GPI [11]. Although the idea of GPI was put forward many years ago, a proof of stability did not appear until [11]. With the actor-critic structure and an iterative method, ADP has become an efficient and simple approximate optimal method for high-dimensional and complex systems through the effort of many researchers, covering, for example, multi-agent optimal control [12,13], zero-sum (ZS) games [14,15], and non-zero-sum (NZS) games [10,16].
The NZS game is a complex and difficult optimal control problem with multiple players, that is, multiple controllers. Unlike other optimal control problems, which need only guarantee the stability of the system and the minimum of each player's performance index function, it must also reach a Nash equilibrium formed by a sequence of control inputs. Most of the discussion centres on solving the coupled Hamilton-Jacobi equations for the Nash equilibrium solution. Many researchers have paid attention to continuous-time NZS games. An online ADP method based on policy iteration for the continuous-time multi-player NZS game was presented in [17]. Afterwards, a near-optimal control for a continuous-time non-linear system was put forward in [18] using just a critic network rather than an actor-critic dual network. More recent studies present an off-policy learning method and an experience replay method for continuous-time systems in [19,20], respectively. Research on the discrete-time multi-player NZS game is scarce, with the exception of [10]. Meanwhile, control constraints are widespread in control systems and degrade system stability and performance, as in [21-23]. There is therefore a need for a unified treatment of the discrete-time multi-player NZS game solved by the ADP idea, which motivates our research drawing on the GPI algorithm while considering control constraints.
The contributions of this paper are as follows. First, the GPI algorithm is applied for the first time to solve discrete-time multi-player NZS games with control constraints. Benefiting from the idea of GPI, it provides a unified solution to discrete-time multi-player NZS games compared with the traditional PI and VI algorithms. In addition, detailed proofs of convergence and optimality for the control-constrained multi-player NZS games are provided. Next, simulations of a linear system and a non-linear system are given to show the effectiveness of the proposed scheme. Finally, based on statistical techniques, it is shown for the first time that the GPI algorithm incurs slightly less energy loss and requires much less runtime than the traditional PI and VI algorithms.
The rest of the article is organized as follows. Section 2 presents the background of discrete-time control-constrained multi-player NZS games in detail. Section 3 develops the GPI algorithm for these games, demonstrates its stability, and describes the actor-critic neural-network structure. Section 4 gives two numerical examples to illustrate the effectiveness of the proposed scheme. Section 5 provides a brief conclusion.

| PRELIMINARIES
Consider a class of discrete-time multi-player NZS games with control constraints described by
$$x_{k+1} = f(x_k) + \sum_{q=1}^{N} g_q(x_k)\,\bar{u}_q(k), \qquad (1)$$
where $x_k = [x_{1k}, x_{2k}, \ldots, x_{nk}]^\top \in \mathbb{R}^n$ is the state vector, $\bar{u}_q(k)$ is the control vector of the $q$th player, $f(x_k) \in \mathbb{R}^n$ denotes the system function, and $g_q(x_k) \in \mathbb{R}^{n \times m}$ denote the input matrices. Let $\bar{u}_q(k) \in \{\bar{u}_q \mid \bar{u}_q \in \mathbb{R}^m,\ |\bar{u}_q| \le \nu_q,\ q = 1, 2, \ldots, N\}$ be the control inputs satisfying the control constraints, where $\nu_q$ represents the saturation bound of the $q$th actuator. In the following sections, $x_{k+1}$ is also written as $F(x_k, \bar{u}_q(k))$.
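As a concrete sketch of such a constrained transition, the step below clips each player's input to its saturation bound before applying the dynamics. The dynamics $f$, $g_q$ and the bounds used here are hypothetical placeholders for illustration, not the systems of Section 4.

```python
import numpy as np

def saturate(u, nu):
    """Clip a control input elementwise to the actuator bound |u| <= nu."""
    return np.clip(u, -nu, nu)

def step(x, controls, f, g_list, nu_list):
    """One transition x_{k+1} = f(x_k) + sum_q g_q(x_k) u_q(k) under saturation."""
    x_next = f(x)
    for u, g, nu in zip(controls, g_list, nu_list):
        x_next = x_next + g(x) @ saturate(u, nu)
    return x_next

# Hypothetical two-player linear example: f(x) = A x, g_q(x) = B_q.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[1.0], [0.0]])
x0 = np.array([0.5, 0.5])
x1 = step(x0, [np.array([1.0]), np.array([-1.0])],
          lambda x: A @ x, [lambda x: B1, lambda x: B2], [0.2, 0.2])
```

Note that both requested inputs exceed the bound 0.2 and are clipped before entering the dynamics, which is exactly the effect the actuator saturation $\nu_q$ has on the game.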
The performance index function of the $q$th player is presented as
$$J_q(x_k) = \sum_{p=k}^{\infty} U_q\big(x_p, \bar{u}_q(p), \bar{u}_{-q}(p)\big), \qquad (2)$$
where $\bar{u}_{-q}(k)$ denotes the control inputs of the players other than $q$. Let $U_q(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)) > 0$ for all $x_k, \bar{u}_q(k), \bar{u}_{-q}(k) \ne 0$ be the utility function, built from $Q_q(x_k)$ and the positive definite symmetric matrices $M_{qq}$ and $M_{qj}$.
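As an illustration of how such a utility combines the state weight $Q_q(x_k)$ with the input weights $M_{qj}$, a quadratic-type form $U_q = Q_q(x) + \sum_j u_j^\top M_{qj} u_j$ can be evaluated as below. This specific form, and all numerical weights, are assumptions for the sketch, not the paper's exact utility.

```python
import numpy as np

def utility(q, x, controls, Q_funcs, M):
    """U_q = Q_q(x) + sum_j u_j^T M_qj u_j, a common quadratic-type utility.

    Q_funcs[q] maps the state to a positive scalar; M[q][j] is the positive
    definite symmetric weight on player j's input in player q's cost.
    """
    cost = Q_funcs[q](x)
    for j, u in enumerate(controls):
        cost += float(u @ M[q][j] @ u)
    return cost

# Hypothetical two-player setup.
Q_funcs = [lambda x: float(x @ x), lambda x: 2.0 * float(x @ x)]
M = [[np.eye(1), 0.5 * np.eye(1)],
     [0.5 * np.eye(1), np.eye(1)]]
x = np.array([1.0, 0.0])
u = [np.array([0.2]), np.array([-0.1])]
U0 = utility(0, x, u, Q_funcs, M)  # player 1's stage cost
```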
In this paper, a set of control policies $\{\bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N\}$ that stabilizes the system (1) and minimizes the performance index function (2) is considered admissible. Given the admissible control policies of the $q$th player $\bar{u}_q$, the optimal performance index function can be expressed as $J_q^*(x_k) = \min_{\bar{u}_q} J_q(x_k)$. Assume that the optimal control policy of the $q$th player can be expressed as $\bar{u}_q^*(k) = \arg\min_{\bar{u}_q(k)} J_q(x_k)$. It should be pointed out that each player has equal status in the game, and the players arrive at a Nash equilibrium as in the following definition.
Definition 2 [10] (Nash equilibrium). A set of control policies $\{\bar{u}_1^*, \bar{u}_2^*, \ldots, \bar{u}_N^*\}$ is deemed a Nash equilibrium solution if, for every player $q$ and every admissible $\bar{u}_q$,
$$J_q\big(\bar{u}_1^*, \ldots, \bar{u}_q^*, \ldots, \bar{u}_N^*\big) \le J_q\big(\bar{u}_1^*, \ldots, \bar{u}_q, \ldots, \bar{u}_N^*\big),$$
that is, no player can reduce its own performance index by unilaterally deviating.
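The definition can be checked numerically: at a Nash point, no player improves its own cost by a unilateral deviation over its admissible action set. The sketch below does this for a static two-player game with hypothetical coupled quadratic costs (not a game from this paper).

```python
import numpy as np

def is_nash(costs, u_star, grids, tol=1e-9):
    """Check that (u_1*, u_2*) is a Nash point over discretized action grids:
    costs[q](u1, u2) must not decrease when only player q deviates."""
    base = [costs[q](*u_star) for q in (0, 1)]
    for q in (0, 1):
        for u in grids[q]:
            trial = list(u_star)
            trial[q] = u
            if costs[q](*trial) < base[q] - tol:
                return False
    return True

# Hypothetical coupled quadratic costs with equilibrium at (0, 0).
costs = [lambda u1, u2: u1 ** 2 + 0.5 * u1 * u2,
         lambda u1, u2: u2 ** 2 + 0.5 * u1 * u2]
grid = np.linspace(-1.0, 1.0, 41)
ok = is_nash(costs, [0.0, 0.0], [grid, grid])
```

Here `ok` confirms $(0, 0)$ as an equilibrium, while a point such as $(0.5, 0)$ fails the check because player 1 can profitably deviate to 0.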
Let the value function of the $q$th player be
$$V_q(x_k) = U_q\big(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)\big) + V_q(x_{k+1}).$$
Hence, the multi-player NZS game can be expressed as
$$V_q^*(x_k) = \min_{\bar{u}_q(k)} \Big\{U_q\big(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)\big) + V_q^*(x_{k+1})\Big\}. \qquad (9)$$
In order to satisfy Equation (9) and the control constraints [22], the optimal control policy of Equation (6) is expressed through a bounded one-to-one (tanh-type) transformation so that $|\bar{u}_q^*(k)| \le \nu_q$. Then, Equation (9) can be derived as
$$V_q^*(x_k) = U_q\big(x_k, \bar{u}_q^*(k), \bar{u}_{-q}^*(k)\big) + V_q^*\big(F(x_k, \bar{u}_q^*(k))\big), \qquad (11)$$
which is known as the coupled NZS game Bellman optimality equation [10].
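On a finite state grid, the coupled Bellman equation can be approximated by repeated best-response sweeps: each player minimizes its own cost-to-go while the other player's action is frozen at its current choice. The 1-D dynamics, grid, and stage costs below are toy assumptions used only to make the backup concrete.

```python
import numpy as np

# 1-D grid and discretized bounded action sets (all values hypothetical).
states = np.linspace(-1.0, 1.0, 21)
actions = [np.linspace(-0.2, 0.2, 5), np.linspace(-0.2, 0.2, 5)]

def step_fn(x, u):                      # x_{k+1} = 0.5 x + u_1 + u_2
    return 0.5 * x + u[0] + u[1]

def utility_fn(q, x, u):                # U_q = x^2 + u_q^2
    return x * x + u[q] * u[q]

def nearest(x):                         # project a successor onto the grid
    return int(np.argmin(np.abs(states - np.clip(x, -1.0, 1.0))))

def bellman_backup(V, policy):
    """One best-response sweep of the coupled Bellman backup: each player
    minimizes its own cost-to-go with the other player's action frozen."""
    V_new = [v.copy() for v in V]
    new_policy = [p.copy() for p in policy]
    for s, x in enumerate(states):
        for q in (0, 1):
            other = 1 - q
            costs = []
            for u_q in actions[q]:
                u = [0.0, 0.0]
                u[q] = u_q
                u[other] = actions[other][policy[other][s]]
                costs.append(utility_fn(q, x, u) + V[q][nearest(step_fn(x, u))])
            new_policy[q][s] = int(np.argmin(costs))
            V_new[q][s] = min(costs)
    return V_new, new_policy

V = [np.zeros(len(states)), np.zeros(len(states))]
policy = [np.zeros(len(states), dtype=int), np.zeros(len(states), dtype=int)]
for _ in range(50):                      # repeated sweeps approximate V_q^*
    V, policy = bellman_backup(V, policy)
```

After the sweeps, the cost-to-go at the origin stays zero (the zero state is cost-free and invariant), while states far from the origin accumulate a positive cost-to-go.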

| Generalized policy iteration algorithm
There are two iterations in the GPI algorithm, which differs from the traditional policy iteration and value iteration algorithms. Two iteration indices $i$ and $j_i$, related to the two iterations, increase from 1 and 0, respectively. Let the iterative value function of the $q$th player $V_q^{i,j_i}(x_k)$, constructed by $\bar{v}_q^i(x_k)$, satisfy Equation (11). Define the initial control policy of the $q$th player $\bar{v}_q^1(x_k)$ as an admissible one, which means $\bar{v}_q^1(x_k) \in \Omega_u$. Hence, for all $x_k \in \Omega_x$, $V_q^1(x_k)$ is an initial value function constructed by $\bar{v}_q^1(x_k)$ which satisfies
$$V_q^1(x_k) = U_q\big(x_k, \bar{v}_q^1(x_k), \bar{v}_{-q}^1(x_k)\big) + V_q^1(x_{k+1}). \qquad (12)$$
Then, the two iterations in the GPI algorithm can be expressed as follows: 1) policy improvement, in which the iterative control policy $\bar{v}_q^{i+1}(x_k)$ is obtained by minimizing the right-hand side of the Bellman equation with the current iterative value function; 2) policy evaluation,
$$V_q^{i,j_i+1}(x_k) = U_q\big(x_k, \bar{v}_q^{i+1}(x_k), \bar{v}_{-q}^{i+1}(x_k)\big) + V_q^{i,j_i}(x_{k+1}),$$
where $j_i = 0, 1, \ldots, N_i$ and $N_i$ is an arbitrary non-negative integer. In summary, the whole process of the GPI algorithm is presented in Figure 1.
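The two-loop structure above (one improvement per outer index $i$, then $N_i + 1$ evaluation steps indexed by $j_i$) can be written as a generic skeleton. The `improve` and `evaluate` callables stand in for the paper's update equations; the scalar toy operators at the bottom are purely illustrative.

```python
def generalized_policy_iteration(v_init, V_init, improve, evaluate, N_seq):
    """Skeleton of the GPI loop: at each outer step i the policy is improved
    once, then the value function is re-evaluated for j_i = 0, 1, ..., N_i.

    improve(V)     -> new policy (policy improvement step)
    evaluate(v, V) -> updated value estimate (one policy evaluation step)
    N_seq          -> the iteration sequence {N_1, N_2, ...}
    """
    v, V = v_init, V_init
    history = []
    for N_i in N_seq:
        v = improve(V)
        for _ in range(N_i + 1):   # j_i = 0, 1, ..., N_i
            V = evaluate(v, V)
        history.append(V)
    return v, V, history

# Toy scalar check: "improvement" halves the value, "evaluation" averages
# toward the current policy's value. Setting every N_i = 0 recovers a
# VI-style sweep; letting N_i grow large approaches PI-style full evaluation.
v_star, V_star, hist = generalized_policy_iteration(
    v_init=1.0, V_init=1.0,
    improve=lambda V: 0.5 * V,
    evaluate=lambda v, V: 0.5 * (v + V),
    N_seq=[0, 1, 2, 3],
)
```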
It should be noted that the PI and VI algorithms are special forms of the GPI algorithm. When every element of the iteration sequence {N 1 , N 2 , …, N i } is set to 0, the algorithm reduces to the VI algorithm. Similarly, if N i → ∞, the algorithm can be regarded as the PI algorithm [11].

| Stability analysis
In this subsection, it is proved that the iterative value function of each player $V_q^{i,j_i}(x_k)$ converges to the optimal performance index function $J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N)$.

Theorem 1. Let the initial control policy of each player be an admissible one satisfying Equation (12). Let the iterative control policy $\bar{v}_q^i(x_k)$ and the iterative value function $V_q^{i,j_i+1}(x_k)$ be obtained by Equations (12)-(17) for $i = 1, 2, \ldots$ and all $x_k \in \Omega_x$. Then, the following properties can be established.
Proof. This proof is divided into two parts.

Lemma 1
Let the iterative value function of the $q$th player $V_q^i(x_k)$ be defined as in Equation (17). Then, for $i = 1, 2, \ldots$, the iterative value function of the $q$th player $V_q^i(x_k)$ is a monotonically non-increasing and convergent sequence.

Lemma 2
If a monotonically non-increasing sequence $\{a_n\}$, $n = 1, 2, \ldots$, contains a convergent subsequence, then $\{a_n\}$ converges to the same limit as that subsequence.
The proofs of Lemmas 1 and 2 are omitted, as they are easily obtained from [11]. The convergence of the iterative value function $V_q^{i,j_i}(x_k)$ to the optimal performance index function $J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N)$ is now established using the above lemmas.
Theorem 2. Let the iterative value function $V_q^{i,j_i}(x_k)$ and the iterative control policy $\bar{v}_q^i(x_k)$ be obtained using Equations (12)-(17) for $i = 1, 2, \ldots$, $j_i = 0, 1, \ldots, N_i$ and all $x_k \in \Omega_x$. Then the iterative value function $V_q^{i,j_i}(x_k)$ converges to the optimal performance index function $J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N)$.

Proof. Define a sequence of the iterative value functions $\{V_q^{i,j_i}(x_k)\}$. Meanwhile, another sequence $\{V_q^i(x_k)\}$ can be defined. It is obvious that $\{V_q^i(x_k)\}$, a monotonically non-increasing sequence, is a subsequence of $\{V_q^{i,j_i}(x_k)\}$. According to Lemma 1, the limit of $\{V_q^i(x_k)\}$ exists. From Lemma 2, the sequence $\{V_q^{i,j_i}(x_k)\}$ and its subsequence $\{V_q^i(x_k)\}$ possess the same limit, that is,
$$\lim_{i\to\infty} V_q^{i,j_i}(x_k) = \lim_{i\to\infty} V_q^i(x_k).$$
Therefore, the proposition stated in Theorem 2 reduces to the statement $\lim_{i\to\infty} V_q^i(x_k) = J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N)$. According to Lemma 1, the limit of the iterative value function of the $q$th player can be defined as
$$V_q^\infty(x_k) = \lim_{i\to\infty} V_q^i(x_k).$$
According to Equations (17) and (18), and then Equation (34), when $i \to \infty$, for all $x_k \in \Omega_x$,
$$V_q^\infty(x_k) \ge \min_{\bar{u}_q(k)} \Big\{U_q\big(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)\big) + V_q^\infty(x_{k+1})\Big\}. \qquad (37)$$
Let $\varepsilon_p > 0$ be an arbitrary positive number. Hence, according to Equation (15),
$$V_q^\infty(x_k) \le U_q\big(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)\big) + V_q^\infty(x_{k+1}) + \varepsilon_p.$$
Since $\varepsilon_p > 0$ is arbitrary, for all $x_k \in \Omega_x$,
$$V_q^\infty(x_k) \le \min_{\bar{u}_q(k)} \Big\{U_q\big(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)\big) + V_q^\infty(x_{k+1})\Big\}. \qquad (40)$$
According to Equations (37) and (40), for all $x_k \in \Omega_x$, $V_q^\infty(x_k)$ can be obtained, that is,
$$V_q^\infty(x_k) = \min_{\bar{u}_q(k)} \Big\{U_q\big(x_k, \bar{u}_q(k), \bar{u}_{-q}(k)\big) + V_q^\infty(x_{k+1})\Big\}.$$
Let $\varepsilon_r > 0$ be an arbitrary positive number. According to Equation (2), there exists a sequence of the iterative control policies $\bar{\gamma}_r = \{\bar{\gamma}_1, \bar{\gamma}_2, \ldots, \bar{\gamma}_N\}$ which satisfies
$$J_q(x_k, \bar{\gamma}_r) \le J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N) + \varepsilon_r. \qquad (42)$$
Assume that the iteration index is $\tau$, related to the sequence $\bar{\gamma}_r$. Then, according to Equation (2) and Theorem 1,
$$V_q^\infty(x_k) \le V_q^\tau(x_k) \le J_q(x_k, \bar{\gamma}_r).$$
Combined with Equation (42),
$$V_q^\infty(x_k) \le J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N) + \varepsilon_r.$$
Since $\varepsilon_r > 0$ is arbitrary,
$$V_q^\infty(x_k) \le J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N). \qquad (45)$$
Next, according to Equation (5),
$$V_q^\infty(x_k) \ge J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N). \qquad (46)$$
Therefore, according to Equations (45) and (46),
$$V_q^\infty(x_k) = J_q^*(x_k, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_N).$$
The proof is completed.

| Neural network approximation
In this subsection, a neural network-based approach is presented to approximate the iterative value function and iterative control policy of each player, each in the form $W_q^\top \phi_q(x_k)$, where $W_q$ denotes the weight matrix of the $q$th player and $\phi_q(\cdot)$ denotes the activation function of the $q$th player. Firstly, the critic network is constructed to approximate the iterative value function of the $q$th player as
$$\hat{V}_q^{i,j_i,l}(x_k) = \big(W_q^{c(i,j_i,l)}\big)^\top \phi_q^c(x_k),$$
where $l = 0, 1, 2, \ldots$ denotes the iteration index of the gradient descent method below, $W_q^{c(i,j_i,l)}$ is the critic network weight vector, and $\phi_q^c(\cdot): \mathbb{R}^n \to \mathbb{R}^s$ is the activation function of the critic network, with $s$ the number of neurons in the hidden layer. Define the approximation error of the critic network for the $q$th player as
$$e_q^{c(i,j_i,l)}(x_k) = V_q^{i,j_i}(x_k) - \hat{V}_q^{i,j_i,l}(x_k).$$
Then, define the objective function to be minimized in the critic network as
$$E_q^{c(i,j_i,l)}(x_k) = \tfrac{1}{2}\big(e_q^{c(i,j_i,l)}(x_k)\big)^2.$$
Hence, the gradient-based weight update law [24,25] is
$$W_q^{c(i,j_i,l+1)} = W_q^{c(i,j_i,l)} - \alpha_q^c\, \frac{\partial E_q^{c(i,j_i,l)}(x_k)}{\partial W_q^{c(i,j_i,l)}},$$
where $\alpha_q^c > 0$ is the learning rate of the critic network for the $q$th player. Next, the actor network for the $q$th player is introduced.
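For a critic that is linear in its weights, the gradient of the objective has the closed form $-e\,\phi(x)$, so the update law reduces to the step below. The polynomial basis, the fixed target, and the learning rate are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def critic_update(W, x, target, phi, alpha):
    """One gradient step on E = 0.5 * e^2 with e = target - W^T phi(x).
    For a linear-in-weights critic, dE/dW = -e * phi(x)."""
    feats = phi(x)
    e = target - float(W @ feats)
    return W + alpha * e * feats   # W - alpha * dE/dW

phi = lambda x: np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])  # assumed basis
W = np.zeros(3)
x = np.array([0.5, -0.5])
for _ in range(200):               # repeatedly fit a fixed Bellman-style target
    W = critic_update(W, x, target=1.0, phi=phi, alpha=0.5)
value = float(W @ phi(x))
```

Since the error contracts by the factor $1 - \alpha\,\|\phi(x)\|^2$ at each step, the fitted value converges to the target for a sufficiently small learning rate.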
The actor network is constructed to approximate the iterative control policy of the $q$th player as
$$\hat{v}_q^{i,l}(x_k) = \big(W_q^{a(i,l)}\big)^\top \phi_q^a(x_k),$$
where $l = 0, 1, 2, \ldots$ denotes the iteration index of the gradient descent method below, $W_q^{a(i,l)}$ is the actor network weight vector, and $\phi_q^a(\cdot): \mathbb{R}^n \to \mathbb{R}^s$ is the activation function of the actor network, with $s$ the number of neurons in the hidden layer. Define the approximation error of the actor network for the $q$th player as
$$e_q^{a(i,l)}(x_k) = \bar{v}_q^i(x_k) - \hat{v}_q^{i,l}(x_k).$$
Then, define the objective function to be minimized in the actor network as
$$E_q^{a(i,l)}(x_k) = \tfrac{1}{2}\big(e_q^{a(i,l)}(x_k)\big)^\top e_q^{a(i,l)}(x_k).$$
Hence, the gradient-based weight update law [24,25] is
$$W_q^{a(i,l+1)} = W_q^{a(i,l)} - \beta_q^a\, \frac{\partial E_q^{a(i,l)}(x_k)}{\partial W_q^{a(i,l)}},$$
where $\beta_q^a > 0$ is the learning rate of the actor network for the $q$th player.
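The actor update has the same structure with a vector-valued error, giving $\partial E/\partial W = -\phi(x)\,e^\top$. The bounded tanh activation, the fixed improved-policy target, and the learning rate below are assumptions for the sketch.

```python
import numpy as np

def actor_update(W, x, target_u, phi, beta):
    """One gradient step on E = 0.5 * e^T e with e = target_u - W^T phi(x).
    For a linear-in-weights actor, dE/dW = -phi(x) e^T."""
    feats = phi(x)
    e = target_u - W.T @ feats
    return W + beta * np.outer(feats, e)   # W - beta * dE/dW

phi = lambda x: np.tanh(x)        # assumed bounded activation
W = np.zeros((2, 1))              # 2 features -> 1 control input
x = np.array([0.4, -0.3])
for _ in range(300):              # fit a fixed improved-policy target
    W = actor_update(W, x, target_u=np.array([0.15]), phi=phi, beta=0.5)
u_hat = W.T @ phi(x)
```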

| SIMULATION RESULTS
In this section, two simulation examples are provided to show the performance of the proposed scheme for discrete-time control-constrained multi-player NZS games.

| Example 1
Consider the following control-constrained two-player linear NZS game based on [26,27]. The performance index function is defined as in Equation (2), with the weight matrices in the utility function for the two players as specified. Firstly, define the state space as Ω x = {x k | −0.5 ≤ x 1k ≤ 0.5, −0.5 ≤ x 2k ≤ 0.5}. States are sampled in Ω x at intervals of 0.1 to apply the GPI algorithm and obtain the optimal control policies. Four neural networks are used to approximate the iterative value functions and control policies, with the corresponding activation functions in the actor and critic networks, respectively. In addition, the initial control policies chosen for the two players must yield admissible control inputs. The iteration sequence {N 1 , N 2 , …, N i } for each player can be chosen randomly. Let the initial state be x 0 = [0.5, 0.5, 0.5, 0.5] T , the iteration index i = 50, and the actuator saturation bounds ν 1 = ν 2 = 0.2. Then, the actor and critic networks of the two players are trained with learning rates 0.65, 0.05, 0.8, and 0.75, respectively. The trajectories of the system states and the control inputs are shown in Figure 2. Trajectories of the performance index function for the system and each player under the initial and the obtained control policies are presented in Figure 3. The trajectories of the actor and critic network weights for the two players are shown in Figure 4, where the differently coloured lines represent the individual terms of the actor and critic network weights. From the above results, it is clear that the system states converge to the equilibrium after around 550 time steps. Besides, it can be observed that the obtained control inputs of the two players converge under the control constraints, as shown in Figure 2. Compared with the initial control policies, the performance index functions of the system and each player under the obtained control policies are significantly reduced, as shown in Figure 3.
Meanwhile, the actor and critic network weights approach equilibrium after about 30 iteration steps.

| Example 2
Consider the following control-constrained two-player non-linear NZS game based on [9,28]. The performance index function is defined as in Equation (2), where the weight matrices in the utility function for the two players are
$$Q_1(x_k) = 0.01x_{1k}^2 + 0.01x_{2k}^2 + 0.02x_{2k}^4, \qquad Q_2(x_k) = 0.02x_{1k}^2 + 0.02x_{2k}^2 + 0.01x_{2k}^4.$$
States are sampled in Ω x at intervals of 0.02 to apply the GPI algorithm and obtain the optimal control policies. Four neural networks are used to approximate the iterative value functions and control policies, with the corresponding activation functions in the actor and critic networks, respectively. In addition, the initial control policies chosen for the two players must yield admissible control inputs. The iteration sequence {N 1 , N 2 , …, N i } for each player can be chosen randomly. The trajectories of the system states and the control inputs under the chosen initial state are shown in Figure 5. Trajectories of the performance index function for the system and each player under the initial and the obtained control policies are presented in Figure 6. The trajectories of the actor and critic network weights for the two players are shown in Figure 7, where the differently coloured lines represent the individual terms of the actor and critic network weights. As shown in Figure 8, the proposed scheme using the GPI, PI, or VI algorithm is tested 100 times under the same conditions, with the parameter N i set to 0 for the VI algorithm and 50 for the PI algorithm, respectively. Besides, the error of each test is presented as a shaded region in the colour of the algorithm used. Meanwhile, the runtime results of the tests are presented in Figure 9. Finally, concrete numerical analyses of the runtime results and the system performance index function results are given in Tables 1 and 2. From the above results, it is clear that the system states converge to the equilibrium after around 700 time steps.
Besides, it can be observed that the obtained control inputs of the two players converge under the control constraints, as shown in Figure 5. Compared with the initial control policies, the performance index functions of the system and each player under the obtained control policies are significantly reduced, as shown in Figure 6. Meanwhile, the actor and critic network weights approach equilibrium after about 100 iteration steps. What is more, Figure 8 shows that, on the whole, the GPI algorithm yields system performance index function results similar to those of the traditional PI and VI algorithms. On closer comparison, the GPI algorithm incurs slightly less energy loss, in both the mean and the minimum value. Although the GPI algorithm is only slightly better than the PI algorithm in the system performance index function results, it is much better in runtime. As shown in Figure 9, over the 100 tests, the runtime results of the GPI algorithm are more tightly clustered than those of the traditional PI and VI algorithms. Besides, the runtime results of the GPI algorithm are significantly smaller than those of the traditional PI and VI algorithms, whether in the mean, the minimum, or the maximum value. From Tables 1 and 2, the runtime of the GPI algorithm is reduced by more than 25% compared with the PI algorithm, whose system performance index function results are suboptimal. Additionally, the standard deviation of the runtime results of the GPI algorithm is smaller than that of the PI algorithm by around 40%, and than that of the VI algorithm by around 60%, which reflects the improved stability and reliability of the GPI algorithm. Therefore, the GPI algorithm is more effective than the PI and VI algorithms for discrete-time control-constrained multi-player NZS games.
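The three summary indices reported in Tables 1 and 2 can be reproduced as follows; the sample runtimes are placeholder data, not the paper's measurements.

```python
import statistics

def trimmean(values, n_trim=10):
    """Average after discarding the n_trim largest and the n_trim smallest
    values, matching the 'Trimmean' index used in the tables."""
    if len(values) <= 2 * n_trim:
        raise ValueError("not enough samples to trim")
    kept = sorted(values)[n_trim:len(values) - n_trim]
    return sum(kept) / len(kept)

# Placeholder runtimes for 100 hypothetical test repetitions.
runtimes = [10.0 + 0.1 * (i % 7) for i in range(100)]
summary = {
    "Mean": statistics.mean(runtimes),
    "Std": statistics.pstdev(runtimes),   # population std; use stdev for sample
    "Trimmean": trimmean(runtimes, n_trim=10),
}
```

Trimming both tails before averaging makes the index robust to a few unusually slow or fast runs, which is why it is reported alongside the plain mean.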
It should be noted that, for the sake of clarity, a marker is placed every 20 time steps in Figures 2 and 5, and every five iteration steps in Figures 4 and 7. In Tables 1 and 2, the index 'Mean' denotes the average of the results, the index 'Std' denotes the standard deviation of the results, and the index 'Trimmean' denotes the average of the results after removing the ten largest and ten smallest values.

| CONCLUSION
In this article, a generalized policy iteration ADP approach to discrete-time control-constrained multi-player NZS games is presented. Besides, the stability analysis of the proposed scheme is illustrated in detail. Meanwhile, a neural network-based implementation is developed to approximate the iterative control policies and value functions. Finally, the proposed scheme is applied to linear and non-linear simulation examples to demonstrate its performance.