Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games

This study addresses the challenge of intelligent decision‐making for command‐and‐control systems in air defense combat operations. Current autonomous decision‐making systems suffer from limited rationality and insufficient intelligence during operation processes. Recent studies have proposed methods based on deep reinforcement learning (DRL) to address these issues. However, DRL methods typically face challenges related to weak interpretability, a lack of convergence guarantees, and high computing‐power requirements. To address these issues, this article presents a technique for large‐scale air defense decision‐making that combines DRL with game theory. The proposed method transforms the target assignment problem into a potential game, which provides theoretical guarantees for a Nash equilibrium (NE) from a distributed perspective. The air‐defense decision problem is decomposed into separate target selection and target assignment problems. A DRL method is used to solve the target selection problem, while the target assignment problem is translated into a target assignment optimization game. This game is proven to be an exact potential game with theoretical convergence guarantees for an NE. The proposed decision‐making method is simulated in a digital battlefield environment, and the results demonstrate its effectiveness.


Introduction
Modern warfare is heavily reliant on advanced weaponry, a decisive factor in determining the victor of a conflict. [1] Decision-making is the cornerstone of combat operations, necessitating the judicious utilization of multitype and multiplatform weaponry based on intelligence collected about the adversary, the rational assignment of incoming targets, and the successful completion of countermeasures. [2] The battlefield dynamics faced by air defense operations have been further complicated and rendered more unpredictable by the emergence of new random air-attack weapons, such as unmanned cluster weaponry. [3] Such unpredictable weaponry poses a new challenge to the operational decision systems configured to defend against air attacks. Figure 1 outlines an air defense operational process. Efficient decision-making in air defense operations can increase operational effectiveness by more than three times compared to unstructured decision-making. [4] The rational scheduling of combat resources to enhance interception efficacy is therefore a pressing concern for operational decision-making systems.
Recently, deep reinforcement learning (DRL) has become widely adopted in control, [5] scheduling, [3b,6] and decision-making systems [7] and has emerged as a notable asset in the field of intelligent decision-making. Yang et al. [8] proposed a real-time scheduling algorithm for unmanned aerial vehicle (UAV) clusters driven by reinforcement learning (RL) to address the channel assignment problem in UAV cluster scheduling. Song et al. [6] developed an enhanced multiobjective reinforcement learning algorithm configured to efficiently balance multiple objectives; this algorithm was applied to offload an application consisting of dependent tasks in multiaccess edge computing. Sun et al. [9] developed a multiagent hierarchical policy gradient algorithm to enable autonomous decision-making for air-combat confrontations, discovering that the agent exhibits tactical evolutionary abilities. Li et al. [10] proposed a generative adversarial deep reinforcement learning scheduling algorithm that leveraged expert knowledge to direct the DRL training process, resulting in high-quality scheduling of workloads and optimization objectives. Ren et al. [7b] employed a reinforcement learning framework based on the multiagent deep deterministic policy gradient (DDPG) to enable autonomous decision-making in air warfare and allow for collaborative operations. Hu et al. [11] incorporated a dynamic quality replay method into conventional DRL, decreasing reliance on expert knowledge and augmenting the efficiency of the new algorithm; this method enabled autonomous cooperative operations involving multiple UAVs. Zhang et al. [12] designed an air combat maneuver decision model, using the actor-critic architecture, to achieve cooperative air combat decision-making with multiple UAVs. Liu et al. [13] developed a reinforcement learning approach, using one general agent with multiple narrow agents, to overcome the slow convergence of traditional task assignment methods and enable intelligent task assignment for air defense combat.
However, DRL methods suffer from weak interpretability, a lack of convergence guarantees, and high computing-power requirements in intelligent decision-making applications. [14] To solve these problems, methods combining game theory and reinforcement learning have emerged in recent years. [15] Yarahmadi et al. [16] modeled the multiagent credit assignment problem (MPC) in multiagent reinforcement learning as a bankruptcy game and then solved the MPC problem by selecting the best assignment result through an evolutionary game; the results showed that the proposed method performs well in terms of group learning rate, confidence, expertness, certainty, and correctness metrics. Zhang et al. [17] proposed a reciprocal reputation mechanism, combined with a double deep Q-network algorithm, to suppress attack motivation in vehicular ad hoc networks and reduce selfish node attacks. Xu et al. [18] formulated the task offloading problem in vehicular edge computing as an exact potential game, achieving task offloading resource optimization by solving the Nash equilibrium (NE) using multiagent reinforcement learning. Duan et al. [19] combined game theory and reinforcement learning to enable autonomous vehicle driving, proposing a multiagent negotiation model based on an incomplete-information game to facilitate automatic negotiation between agents. Zhu et al. [20] combined the minimax Q algorithm with game theory to solve the NE of two-player zero-sum games; subsequent simulation results showed that the proposed method was applicable to both symmetric and nonsymmetric games. Cao et al. [15b] proposed a game theory-based inverse learning framework to obtain the parameters of both the dynamic system and the individual costs of multistage games, by solving affine-quadratic games and calculating the gradient of the game system parameters. Peng et al. [21] formulated the distributed control problem as a differential graphical game and combined it with DRL to achieve bounded control. Albaba et al. [15a] combined hierarchical game theory with DRL for simultaneous decision-making in multiagent traffic scenarios and experimentally demonstrated that the proposed method modeled more traffic situations and effectively reduced the collision rate compared to conventional methods. Gao et al. [22] proposed a passivity-based methodology for multiagent finite games in reinforcement learning dynamics, as well as algorithms for analysis and design; the results indicated that the proposed method improved convergence.
Previous studies have applied DRL to decision-making; however, limited research has explored the integration of potential games into intelligent decision-making systems for air defense combat. To address this research gap, and to improve the interpretability and convergence of DRL when used in decision-making, potential game theory is introduced into the DRL framework. Furthermore, proof that the target assignment (TA) game is an exact potential game is provided, which is a key contribution of this article.
In this article, an intelligent decision system for air defense operations is proposed based on potential games and DRL, leveraging the advantages of both DRL and game theory to attain an effective and high-quality decision system for air defense operations. Potential games are a branch of game theory with desirable properties that guarantee the existence of NE solutions. Specifically, potential game theory is utilized to construct a multitarget, multi-intercept-unit resource assignment model for efficient TA, as well as to obtain an NE solution in the TA phase. The primary contributions of this article are as follows.
First, an air defense operational decision-making framework is proposed that combines DRL and potential games for target selection and radar control. A type-different-based reward mechanism is employed to incentivize efficient decision-making and TA.
Second, a DRL-based target selection and radar control module is proposed to facilitate intelligent decision-making for air defense operations. The air defense combat problem is modeled as a Markov decision process (MDP), and the state space, action space, and reward are formulated according to the characteristics of the problem. To facilitate the learning of better strategies for this MDP, a bidirectional gated recurrent unit (BiGRU)-based feature extraction method, configured to extract air situation features, and a multihead attention mechanism are proposed. Both are tailored to the specific characteristics of the air defense combat problem.
Finally, a potential game-based target assignment (PG-TA) module is proposed that enables agents to coordinate their behaviors through interactions and reach a cooperative equilibrium. To optimize the air defense combat TA, a TA optimization game is formulated from a distributed perspective, and it is proven that this game is an exact potential game. Hence, the NE solution can be obtained by applying the best response method to solve the game while avoiding incorrect assignment, duplicate assignment, and omitted assignment, thereby increasing decision effectiveness.
The remainder of this article is organized as follows. Section 2 provides an overview of reinforcement learning and potential games. Section 3 describes the air defense combat problem and the methods used to solve it. Section 4 outlines the experimental setup, states the results from the experiments that were conducted, and evaluates those results. Section 5 concludes and discusses potential future research directions.

Deep Reinforcement Learning
The reinforcement learning process involves an intelligent agent interacting with the external environment in order to maximize the cumulative external reward. [23] The problem is typically modeled as a Markov reward process, represented by the tuple M = ⟨S, A, T, R, γ⟩, where S is the set of states, A is the set of actions, T is the transition probability matrix, R is the reward function, and γ ∈ [0, 1] is the discount factor.
At each time step t of the reinforcement learning process, the agent is in state s_t, observes the environment, performs an action a_t, receives environmental feedback R_t, and transitions to the next state s_{t+1} at time step t + 1. In an MDP, a state is called a Markov state when it satisfies the following condition: the next state depends only on the current state and is conditionally independent of all earlier states.
The state transition probability matrix describes the probability of transitioning from the current state s to the subsequent state s′, P_{ss′} = P(s_{t+1} = s′ | s_t = s).
The reward function represents the expected reward the agent receives after taking action a in state s and transitioning, R_s^a = E[r_{t+1} | s_t = s, a_t = a], where E is the expectation operator. The return R_t, also known as the cumulative reward, is the discounted sum of all rewards from time step t to the end of the episode, R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯. The state-value function V^π(s) = E_π[R_t | s_t = s] is the expected cumulative reward when in state s and following policy π, and the action-state-value function Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] is the expected cumulative reward of taking action a in state s and thereafter following policy π. Neural networks are typically utilized to approximate the action-state-value function Q^π(s, a; θ) and the policy π when the state space is large or continuous, where θ is the parameter vector of the neural network. This idea forms the foundation of the policy gradient approach, in which the parameters θ are updated along the gradient of the expected return.
The objective (expected reward) function is defined as J(θ) = E_{π_θ}[R_t], and its gradient is computed as ∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t | s_t) Q^π(s_t, a_t)]. The two main categories of reinforcement learning techniques are policy-based and value-based methods. The advantages of these two approaches are combined in the actor-critic approach, in which the actor module employs a policy-based approach and the critic module employs a value-based approach. The asynchronous advantage actor-critic (A3C) technique [24] is a popular implementation of the general actor-critic method.
Based on the advantage actor-critic (A2C) method, [25] the A3C algorithm employs asynchronous parallelism. Typically, there are m worker nodes and one server node. The server node holds the most recent policy network π_θ(s_t, a_t) and value network Q_w(s, a) or V(s_t; w), where θ denotes the policy network parameters and w denotes the value network parameters. The server node updates these parameters based on the gradient data transmitted by the worker nodes. Each worker node holds a copy π_θ′(s_t, a_t) of the policy network and a copy Q_w′(s, a) or V(s_t; w′) of the server node's value network, where θ′ and w′ denote the worker node's copies of the policy network and value network parameters, respectively. Gradient information is sent to the server node at regular intervals to update the server parameters; the latest parameters are then requested from the server node, ensuring that θ′ = θ and w′ = w. Generally, the policy network and the value network share parameters in the nonoutput layers.
The updated objective gradient is defined as ∇_θ J(θ) = E[∇_θ log π_θ(a_t | s_t) Â_t], where Â_t represents an estimate of the advantage (dominance) function; Â_t indicates whether it is advantageous or disadvantageous to perform action a_t in state s_t relative to the average behavior of the policy.
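For illustration, the listing below gives a minimal sketch of the parameter-server pattern underlying A3C, using a linear softmax policy, a scalar value baseline, and a hypothetical one-state environment with fixed per-action rewards; it is not the implementation used in this article, and all names and reward values are placeholders.

```python
import numpy as np

# Minimal sketch of the A3C parameter-server pattern: a server node holds the
# policy/value parameters, workers copy them, compute gradients on their own
# rollouts, and push the gradients back. The linear softmax policy, scalar
# value baseline, and one-state bandit environment are placeholder assumptions.

class Server:
    def __init__(self, n_actions):
        self.theta = np.zeros(n_actions)   # policy parameters (action logits)
        self.w = 0.0                       # value parameter (state value estimate)

    def apply_gradients(self, g_theta, g_w, lr=0.05):
        # Server node updates its parameters from worker-supplied gradients.
        self.theta += lr * g_theta
        self.w += lr * g_w

class Worker:
    def __init__(self, server, rewards):
        self.server = server
        self.rewards = np.asarray(rewards)  # hypothetical per-action rewards

    def rollout_and_push(self):
        # Pull the latest parameters so that theta' = theta and w' = w.
        theta, w = self.server.theta.copy(), self.server.w
        probs = np.exp(theta) / np.exp(theta).sum()
        a = np.random.choice(len(probs), p=probs)
        advantage = self.rewards[a] - w                        # A = R - V(s)
        g_theta = advantage * (np.eye(len(probs))[a] - probs)  # advantage-weighted policy gradient
        g_w = self.rewards[a] - w                              # moves V(s) toward the observed return
        self.server.apply_gradients(g_theta, g_w)              # push gradients to the server

server = Server(n_actions=3)
workers = [Worker(server, rewards=[1.0, 0.2, -0.5]) for _ in range(4)]
for _ in range(500):                     # sequential loop standing in for asynchronous threads
    for wk in workers:
        wk.rollout_and_push()
print("learned action preferences:", np.round(server.theta, 2))
```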

Potential Game
A game in game theory is defined by three components: the players, the strategies, and the utility functions. [26] A game is denoted by Γ = ⟨N, S, {U_i}_{i∈N}⟩, where each element i is a player and N = {1, 2, ..., n} is the set of players. S = S_1 × S_2 × ⋯ × S_N represents the strategy space of the game Γ, S_i is the strategy space of the i-th player, and s_i ∈ S_i is a strategy of the i-th player. Let A represent a subset of N and −A represent the complement of A, so that a strategy combination can be written as s = (s_A, s_{−A}). The utility function U_i: S → ℝ of the i-th player maps each strategy combination to a real number. A strategy combination z = (z_i, z_{−i}) is said to be a pure strategy NE if, for every player i ∈ N and every strategy s_i ∈ S_i, U_i(z_i, z_{−i}) ≥ U_i(s_i, z_{−i}).
Potential games are categorized into ordinal potential games, weighted potential games, and exact potential games. The exact potential game used in this study is described below.
Definition 2 (Exact Potential Game): The game Γ = ⟨N, S, {U_i}_{i∈N}⟩ is said to be an exact potential game if there exists a potential function G: S → ℝ such that, for ∀i ∈ N, ∀s_{−i} ∈ S_{−i}, and ∀y_i, z_i ∈ S_i, U_i(y_i, s_{−i}) − U_i(z_i, s_{−i}) = G(y_i, s_{−i}) − G(z_i, s_{−i}).

Lemma 1: Every finite potential game has a pure strategy NE solution.
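As a numerical illustration of Definition 2 (unrelated to the target assignment game introduced later), the following sketch builds a toy two-player congestion-style game with hypothetical target values and verifies that every unilateral deviation changes the deviating player's utility and the potential function by exactly the same amount.

```python
import itertools

# Toy numerical check of Definition 2: two players each pick one of two
# targets, and a target's value is shared equally among the players that
# chose it. Rosenthal's potential G(s) = sum over targets of
# sum_{k=1..n_t} v_t / k is an exact potential for this game.

values = {"A": 6.0, "B": 4.0}          # hypothetical target values
players = [0, 1]

def utility(i, strategy):
    t = strategy[i]
    n_t = sum(1 for s in strategy if s == t)
    return values[t] / n_t

def potential(strategy):
    return sum(sum(v / k for k in range(1, sum(1 for s in strategy if s == t) + 1))
               for t, v in values.items())

# Every unilateral deviation changes U_i and G by exactly the same amount.
for strategy in itertools.product(values, repeat=len(players)):
    for i in players:
        for alt in values:
            dev = list(strategy)
            dev[i] = alt
            dU = utility(i, tuple(dev)) - utility(i, strategy)
            dG = potential(tuple(dev)) - potential(strategy)
            assert abs(dU - dG) < 1e-9
print("exact potential property verified for every unilateral deviation")
```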

Problem Formulation and Proposed Method
This section introduces the air defense combat intelligent decision problem and models it as an MDP problem.The state space and action space of the air defense combat decision problem are proposed.An event-based reward mechanism is developed, and a potential game-based TA model is proposed that is configured to solve the air defense decision problem.

Problem Formulation
First, the air defense combat intelligent decision problem is framed as an MDP and solved using a DRL-based target selection and radar control (DRL-TSRC) module. Neural networks and DRL techniques are used to analyze the air situation and the threat level of individual targets, to assign the air targets to be intercepted and their threat levels, and to decide when to turn on the guidance radar of each fire unit.
Second, the PG-TA module is used to assign the targets selected by the DRL-TSRC module for interception and to control the fire units during interception. The process is illustrated in Figure 2.
In response to air defense decision problems, several equipment limitations should be considered: 1) The guidance radar cannot be turned off after it is turned on. 2) The guidance radar of a fire unit must be turned on before the fire unit can launch a missile to intercept a target. 3) The guidance radar can only guide missiles launched by the fire unit to which it belongs. 4) The blue side cannot detect the fire unit position until the guidance radar is turned on. 5) The air situation posture of the red side is provided by the airborne warning and control system and does not require guidance radar detection. 6) If the guidance radar of a fire unit is destroyed, the fire unit cannot launch and guide missiles. 7) Each fire unit satisfies rational choice theory.
It should be noted that in this article, the term "red side" refers to the defending side, which utilizes ground-to-air missiles as its main weapons, while the term "blue side" refers to the attacking side, which employs UAVs and fighter aircraft as its main weapons.

Deep Reinforcement Learning-Based Target Selection and Radar Control Module (DRL-TSRC)
An agent takes action a_i to intercept an enemy target in battlefield state s_i. The Markov process model consists of a state set S = [s_1, s_2, ..., s_n] and an action set A = [a_1, a_2, ..., a_n]. The agent selects action a_i from the action set A in state s_i according to the policy π: S × A → [0, 1]. The battlefield environment transitions to the next state according to the state transition function P: S × A × S → [0, 1]. The goal of the agent is to maximize the cumulative reward R_t = Σ_{k=0}^{T} γ^k r_{t+k}, where r_t is the reward received at step t, T is the time horizon, and γ is the discount factor. Equation (5) provides the calculation for the state-value function V^π(s), which is the expectation of the cumulative reward R_t, and Equation (7) provides the calculation for the action-state-value function Q^π(s, a). Compared to the average reward under policy π, that is V^π(s), the advantage (dominance) function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t) indicates how good the reward earned by taking action a_t in the current state s_t is.

State and Action Space of Air Defense Problem
In this section, the air defense combat intelligent decision problem is formulated as an MDP by specifying the states, actions, rewards, and goals in air defense combat. The rewards are related to the type of fire unit used and are specified in Section 3.2. The state space of the air defense combat problem includes the state of the defense sites, the state of the fire units, the state of the detectable enemy targets, and the state of the attackable enemy targets. This state information is maintained by the digital battlefield environment. The state information of the defense sites comprises the site number, location, type, and attack status. The state information of the fire units includes the fire unit tag number, location, number of missiles remaining, usability, the number of targets the unit can attack, and attack status. The state of the reconnoitered enemy targets includes the target number, location, type, movement, and state under attack. The entire air defense combat state space is outlined in Table 1.
The action space of the air defense combat problem includes target selection, target threat level, radar selection, and radar action. The entire air defense combat action space is outlined in Table 2.
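For illustration only, the following sketch shows one possible way to encode the state and action elements listed above; the field names and types are assumptions, and the authoritative definitions are those in Tables 1 and 2, maintained by the digital battlefield environment.

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative encodings of the state and action elements described above.

@dataclass
class FireUnitState:
    unit_id: int
    position: Tuple[float, float, float]
    missiles_remaining: int
    usable: bool
    attackable_target_count: int
    under_attack: bool

@dataclass
class TargetState:
    target_id: int
    position: Tuple[float, float, float]
    target_type: int                      # e.g., fighter / UAV / bomber / cruise missile
    velocity: Tuple[float, float, float]  # movement information
    under_attack: bool

@dataclass
class AirDefenseAction:
    selected_target: int   # which detected target to engage
    threat_level: int      # threat level assigned to that target
    selected_radar: int    # which fire unit's guidance radar is addressed
    radar_action: int      # e.g., 0 = keep off, 1 = turn on
```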

Air Situation Feature Extraction based on BiGRU Network
The BiGRU method is used to analyze air situation information in air defense missions, which are time-series events; it considers the state inputs over a period of time before the current time. The BiGRU network consists of a forward GRU and a reverse GRU, and its structure is shown in Figure 3. The BiGRU method enhances the ability of the model to learn air situation features by analyzing the laws of the state information in two directions: past to present and present to past. The GRU is a simplified version of the long short-term memory network that uses an update gate and a reset gate in place of the input, forget, and output gates. The update gate determines how much historical information is retained, and the reset gate determines how the historical information is combined with the current information. The main quantities of the GRU are calculated as r_t = σ(w_r·[h_{t−1}, x_t] + b_r), z_t = σ(w_z·[h_{t−1}, x_t] + b_z), h̃_t = tanh(w_h·[r_t ⊙ h_{t−1}, x_t] + b_h), and h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t, where x_t is the input at time t, h_t is the output at time t, r_t is the reset gate, z_t is the update gate, h̃_t is the candidate information generated according to the reset gate, σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, and w and b are the weight and bias terms, respectively.
The forward unit in the BiGRU network analyzes the forward state sequence law (from past to present), and the reverse unit analyzes the state sequence law in the reverse direction. The main process is computed as h→_t = GRU(x_t, h→_{t−1}), h←_t = GRU(x_t, h←_{t+1}), and h_t = f(w_{t1} h→_t + w_{t2} h←_t + b_t),
where h→_t denotes the forward hidden layer state, h←_t denotes the reverse hidden layer state, w_{t1} and w_{t2} denote the output weights of the forward and backward hidden layers, respectively, b_t is the bias term, and f is the activation function, for which the sigmoid function is used in this article.
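A minimal PyTorch sketch of such a BiGRU feature extractor is given below; the layer sizes are illustrative assumptions rather than the configuration used in this article.

```python
import torch
import torch.nn as nn

# Minimal sketch of a BiGRU feature extractor for a sequence of
# air-situation state vectors; dimensions are illustrative assumptions.

class BiGRUExtractor(nn.Module):
    def __init__(self, state_dim=64, hidden_dim=128):
        super().__init__()
        # bidirectional=True runs one GRU past->present and one present->past
        self.bigru = nn.GRU(state_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)  # weighted fusion of both directions

    def forward(self, seq):                     # seq: (batch, time, state_dim)
        out, _ = self.bigru(seq)                # out: (batch, time, 2 * hidden_dim)
        h_now = out[:, -1, :]                   # forward and backward states at the current step
        return torch.sigmoid(self.fuse(h_now))  # sigmoid activation, as in the text

features = BiGRUExtractor()(torch.randn(8, 10, 64))
print(features.shape)  # torch.Size([8, 128])
```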

Multiheaded Attention Mechanism
The digital battlefield environment is characterized by a high-dimensional state space and a large amount of information to be processed by the network. The multihead attention mechanism enables the neural network to identify and prioritize important information with high attention. The multihead attention scheme assembles multiple self-attention layers, and the multiple attention layers linearly transform the same input from different angles to obtain information from different subspaces. Figure 4 illustrates the multihead attention structure.
For single-head self-attention, the processed situation information is transformed into three vectors: the query Q, key K, and value V vectors, obtained from the input through a linear transformation with matrix W_P. The attention is calculated as a scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the key vectors. For the parallel stacking of multiple self-attention modules, the multihead attention mechanism is calculated as MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), where W_i^Q, W_i^K, and W_i^V are matrices of learnable parameters in the data projection and h is the number of heads. In this article, h = 4.
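The following sketch implements the multihead attention computation described above with h = 4 heads; the model dimension and sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of multihead attention built from scaled dot-product
# attention, with h = 4 heads as in the text; dimensions are illustrative.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=128, h=4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # projection applied after Concat

    def forward(self, x):                        # x: (batch, tokens, d_model)
        b, n, _ = x.shape
        q, k, v = (w(x).view(b, n, self.h, self.d_k).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))       # (batch, h, tokens, d_k)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5        # scaled dot product
        attn = F.softmax(scores, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(b, n, -1)      # Concat(head_1, ..., head_h)
        return self.w_o(heads)

out = MultiHeadAttention()(torch.randn(8, 32, 128))
print(out.shape)  # torch.Size([8, 32, 128])
```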

Type-Different Reward Mechanism
Since the firing and exposure costs of long-range fire units are higher than those of short-range fire units, it is important to train the agent to use a higher proportion of short-range fire units to reduce usage costs. To this end, the type-different-based reward mechanism (TDRM) was designed, which provides different reward values for different types of fire units. Depending on the type of fire unit, different rewards are given for radar detection, missile launch, and target interception. Since the state space and action space of air defense operations are large, giving the agent a reward only at the end of each round of combat is not conducive to training. Therefore, this article proposes a critical-event-based reward mechanism for the different fire units that provides timely feedback after the agent makes a decision, thereby speeding up the training process. The specific reward design is shown in Table 3.
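A minimal sketch of such a type-different, event-based reward lookup is shown below; the event names and reward magnitudes are hypothetical placeholders, and the actual values are those given in Table 3.

```python
# Illustrative type-different, event-based reward lookup in the spirit of the
# TDRM. All event names and reward magnitudes here are hypothetical.

TDRM_REWARDS = {
    ("long_range",  "radar_on"):         -2.0,   # higher exposure cost
    ("short_range", "radar_on"):         -0.5,
    ("long_range",  "missile_launch"):   -1.0,   # higher firing cost
    ("short_range", "missile_launch"):   -0.3,
    ("long_range",  "target_intercept"):  5.0,
    ("short_range", "target_intercept"):  6.0,   # favors cheaper short-range units
}

def tdrm_reward(unit_type: str, event: str) -> float:
    """Return the event-triggered reward for the given fire-unit type."""
    return TDRM_REWARDS.get((unit_type, event), 0.0)

print(tdrm_reward("short_range", "target_intercept"))  # 6.0
```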

Potential Game-based Target Assignment Module
The targets to be intercepted, as provided by the DRL-TSRC module, and the fire units whose guidance radars have been activated are modeled using potential game theory. The target group is defined as {1, 2, ..., T}, where the targets to be intercepted are those provided by the DRL-TSRC module.
For the red side, the total number of fire units is denoted by m (m = l + s), consisting of l long-range fire units and s short-range fire units. To defend the red strategic locations against blue air targets, the red side cooperatively intercepts the targets using the launchers and guidance radars of each fire unit.

Player Set
The players are defined as the fire units whose guidance radars have been activated, with the set of players defined as N_m = {LM_1, ..., LM_l, SM_1, ..., SM_s}. The set of long-range fire units is denoted by LM = {LM_1, ..., LM_l}, and the set of short-range fire units is denoted by SM = {SM_1, ..., SM_s}, where l and s are the numbers of long-range and short-range fire units, respectively, so that the number of players is |N_m| = l + s = m.

Strategy Space
The strategy space is defined as S = S_1 × S_2 × ⋯ × S_m, where S_i represents the set of strategies of the i-th player.
s_i = {s_{i1}, s_{i2}, ..., s_{ij}, ..., s_{iT}} is the strategy of the i-th player, where each element s_{ij} ∈ {0, 1} indicates whether player i intercepts target j. If the j-th target satisfies the interception condition of player i, this is denoted by C_{ij} = 1 (and C_{ij} = 0 otherwise). The equipment of each type of fire unit is limited: long-range fire units can fire at up to eight targets at once, while short-range fire units can fire at up to six targets at once. These limits act as constraints on each player's strategy, Σ_{j=1}^{T} s_{ij} ≤ 8 for long-range fire units and Σ_{j=1}^{T} s_{ij} ≤ 6 for short-range fire units.
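The strategy constraints can be illustrated with a simple feasibility check; the binary encoding and the function names below are assumptions used only for illustration, while the firing limits are the ones stated above.

```python
# Illustrative feasibility check for a player's strategy vector over T targets,
# assuming the binary encoding s_ij in {0, 1} sketched above.

MAX_SIMULTANEOUS_TARGETS = {"long_range": 8, "short_range": 6}

def strategy_feasible(s_i, unit_type, interceptable):
    """s_i: 0/1 target selections; interceptable: C_ij flags for the same targets."""
    within_capacity = sum(s_i) <= MAX_SIMULTANEOUS_TARGETS[unit_type]
    only_valid_targets = all(c == 1 for s, c in zip(s_i, interceptable) if s == 1)
    return within_capacity and only_valid_targets

print(strategy_feasible([1, 0, 1, 1], "short_range", [1, 1, 1, 0]))  # False: last target has C_ij = 0
```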

Utility Functions
The number of targets each fire unit can engage is a local constraint affecting the player and was taken into account when designing the strategy set. It is preferable to intercept a target with the least amount of fire resources possible, in order to conserve fire resources; to this end, a penalty function is defined. In this distributed target assignment problem, each player seeks to improve its own gain, and the penalty function is added to each player's reward to form the utility function, where α_j denotes the threat level of target j, t denotes the target type, n_i^t denotes the number of targets of type t assigned to fire unit i, and f_i^t denotes the reward value obtained by fire unit i for intercepting a type-t target. Define J_i = {i′ | i′ ∈ N_m, C_{ij} = 1, C_{i′j} = 1, i′ ≠ i} as the set of neighboring fire units of i that can also intercept target j. r_i^j is the range shortcut between fire unit i and target j, and r_j^max is the maximum value of the course shortcut from target j over the fire units that can intercept it. The penalty coefficient β is a large constant used to ensure that the penalty function term converges to zero as the game progresses, thus achieving cost savings.
The reward function of the i-th fire unit is defined in terms of the threat levels α_j and the type-dependent reward values f_i^t of its assigned targets; combining it with the penalty function yields the utility function U_i. The reward value for intercepting the same target varies between fire units according to their respective equipment characteristics and interception costs, as shown in Table 3.

Target Assignment Optimization Game
The target assignment optimization game (Γ) is proposed to solve the target assignment problem and is defined as each player maximizing its own utility, (Γ): max_{y_i ∈ S_i} U_i(y_i, y_{−i}), ∀i ∈ N_m. For the target assignment optimization game (Γ), a potential function G(y_i, y_{−i}) exists.

Theorem 1: An NE exists for the target assignment optimization game (Γ), and the optimal solution of the target assignment problem converges to the NE.

Proof: Consider the potential function G(y_i, y_{−i}) and a player i whose strategy changes from y_i to z_i while the strategies of the other players remain fixed. The resulting change in the potential function equals the change in the utility function of the i-th player, ΔG = ΔU_i. By Definition 2, the target assignment optimization game (Γ) is therefore an exact potential game.
For a potential game, by Lemma 1, an NE must exist, and the optimal point of the potential function is a pure strategy NE. Hence, the equilibrium solution of the target assignment optimization game (Γ) exists and is optimal. Therefore, it is feasible to design an algorithm to search for the NE of the game.
It is not difficult to prove by contradiction that Σ_{i=1}^{m} y_{ij} = 1 holds at the optimal point, that is, each target is assigned to exactly one fire unit.

Best Response-based NE Solving Method
Definition 3: Best response dynamics is the process in which, when making a decision, each player i chooses the strategy that maximizes its utility function given the strategies of the other players; in other words, s_i* ∈ arg max_{s_i ∈ S_i} U_i(s_i, s_{−i}).

Lemma 2: Best response dynamics converges an exact potential game to a pure strategy NE.

Theorem 2: Best response dynamics converges the target assignment optimization game (Γ) to an NE.

Proof: By Theorem 1, the target assignment optimization game (Γ) is an exact potential game. Combining this with Lemma 2 yields Theorem 2.
For the target assignment optimization game (Γ) proposed in this article, an NE solving algorithm is proposed. This game is a noncooperative game, and the best response approach can be used to solve for the NE. Since the strategy change of a single player leads to strategy changes by the other players, a best-response-based NE solving algorithm is proposed to account for this interaction. However, because the dynamic best-response algorithm requires participants to make sequential decisions, which typically requires a decision coordinator, it is not well suited to distributed decision-making. Therefore, this section proposes a distributed decision-maker selection protocol that randomly selects a participant to make a decision in each round.
Algorithm 1 sets a maximum waiting time τ_max at the outset; at the start of each iteration, each player i generates a random waiting time τ_i in the interval [0, τ_max]. Player i then waits for the first τ_i time units and withdraws from the decision-maker selection if it receives a decision request (DR) signal from another player during this time. Otherwise, it sends a DR signal at time τ_i and is identified as the decision maker for this round.
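A minimal simulation of this random-waiting-time selection protocol is sketched below, with message passing abstracted away; the player with the smallest waiting time stands in for the first DR signal.

```python
import random

# Minimal simulation of the random-waiting-time decision-maker selection of
# Algorithm 1: each player draws tau_i uniformly in [0, tau_max]; the player
# whose timer expires first sends the DR signal and becomes the decision
# maker, and the others withdraw.

def select_decision_maker(players, tau_max=1.0):
    waits = {p: random.uniform(0.0, tau_max) for p in players}  # each player's tau_i
    return min(waits, key=waits.get)                            # first to send a DR signal

print(select_decision_maker(["LM1", "LM2", "SM1", "SM2"]))
```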
Using Algorithm 2, a pure strategy NE point of the game (Γ) can be obtained, which by Theorem 1 is the optimal solution of the target assignment optimization game. The proposed algorithm has a complexity of O(N_it N_s), whereas an exhaustive search has a complexity of O(N_s!/(N_s − N)). Consequently, the proposed algorithm performs better than the exhaustive search approach.
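In the spirit of Algorithm 2, the sketch below runs best-response dynamics on a small assignment game. The utility used here (target threat shared among the units assigned to it, minus a per-unit usage cost) is a simplified stand-in rather than the utility function defined above, the threat levels and costs are hypothetical, and the randomly chosen decision maker stands in for Algorithm 1.

```python
import random

# Best-response dynamics on a toy assignment game: a randomly selected
# decision maker repeatedly switches to its best response, which converges
# to a pure strategy NE in a potential game (Lemma 2).

threat = {0: 0.9, 1: 0.6, 2: 0.4}            # hypothetical target threat levels
players = ["LM1", "SM1", "SM2"]               # fire units with guidance radar on
cost = {"LM1": 0.5, "SM1": 0.2, "SM2": 0.2}   # long-range units are costlier to use

def utility(player, target, assignment):
    sharing = sum(1 for p, t in assignment.items() if t == target and p != player) + 1
    return threat[target] / sharing - cost[player]

def best_response(player, assignment):
    return max(threat, key=lambda t: utility(player, t, assignment))

assignment = {p: random.choice(list(threat)) for p in players}
for _ in range(100):                           # N_it best-response iterations
    decision_maker = random.choice(players)    # stand-in for Algorithm 1
    assignment[decision_maker] = best_response(decision_maker, assignment)
print("assignment after best-response dynamics:", assignment)
```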

Results and Analysis
In this section, the effectiveness of the proposed method is evaluated in a simulated environment that faithfully replicates the physical laws, terrain occlusion, and Earth curvature of a realistic battlefield environment for air defense operations.The weapon parameters used in the simulation are consistent with those of a real battlefield.The digital battlefield environment is shown in Figure 5.

Force Configuration of Red Army
In the simulations, the red army (acting as the defending side) has a number of defensive units that can be employed: 1) Strategic locations: a command post and an airport; 2) Short-range fire units: each including one short-range radar and three launch vehicles (with each vehicle carrying eight short-range missiles); 3) Long-range fire units: six sets, each including one long-range radar and eight launch vehicles (with each vehicle carrying six long-range missiles and three close-range missiles); and 4) AEW aircraft: one unit with a detection range of 400 km.

Force Configuration of Blue Army
In the simulations, the blue army (acting as the attacking side) has a number of attacking units that can be employed: 1) Fighters: 12, each carrying 6 antiradiation missiles (ARMs) and 2 air-to-surface missiles; 2) UAVs: 20, each carrying 2 ARMs and 1 air-to-surface missile; 3) Bombers: 4; 4) Cruise missiles: 18; and 5) Electronic jammers: 2.

Battlefield Environment Rules
There are a number of constraints placed on the simulation as a result of the battlefield environment. These rules are listed below: 1) If the guidance radar of a fire unit is destroyed, the fire unit is rendered inoperable. 2) During the guidance procedure, the guidance radar must be turned on and therefore produces electromagnetic radiation.
3) The maximum range of the guidance radar is limited by the radar horizon, which is determined by H_T, the altitude of the target, and H_R, the altitude of the radar antenna, set as H_R = 4 m in this simulation. 4) The range of the guidance radar is affected by terrain shading and the Earth's curvature, taking atmospheric refraction into account. 5) An antiaircraft missile follows the minimum-energy trajectory during flight. 6) The short-range and long-range antiaircraft missiles have maximum interception ranges of 40 and 160 km, respectively. 7) The high and low kill probabilities in the kill zone against cruise missiles are 45% and 35%, respectively. 8) The high and low kill probabilities in the kill zone against fighters, UAVs, bombers, ARMs, and air-to-surface missiles are 75% and 55%, respectively. 9) The air-to-surface missiles have a range of 60 km and an 80% hit rate. 10) The ARMs have a range of 110 km and an 80% hit rate. 11) The effective jamming direction of an electronic jammer is a cone of about 15°, and jamming against the radar lowers the kill probability of a missile.
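For reference, the listed rules can be collected into a single simulation-parameter structure; the dictionary layout and key names below are assumptions, while the numerical values are taken directly from the rules above.

```python
# Battlefield rules collected into one illustrative parameter structure.

BATTLEFIELD_RULES = {
    "radar_antenna_height_m": 4,
    "sam_max_intercept_range_km": {"short_range": 40, "long_range": 160},
    "kill_probability": {
        "cruise_missiles": {"high": 0.45, "low": 0.35},
        "aircraft_and_missiles": {"high": 0.75, "low": 0.55},  # fighters, UAVs, bombers, ARMs, ASMs
    },
    "air_to_surface_missile": {"range_km": 60, "hit_rate": 0.80},
    "arm": {"range_km": 110, "hit_rate": 0.80},
    "jammer_cone_deg": 15,
}
```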

Performance Analysis
This section compares the performance of the proposed decision-making method, the A3CPG algorithm, with mainstream methods such as the A3C, proximal policy optimization (PPO), and DDPG algorithms, with the DDPG algorithm used as the baseline in the comparison.
Figure 6 illustrates the reward curves obtained by the agent during the training process, with the horizontal axis representing the confrontation episodes.
It can be observed that the various methods exhibit different degrees of reward growth as the number of training rounds increases during the learning process; similarly, the growth rate of the reward and the final rewards obtained also differ. From Figure 6, it is clear that the highest degree of reward growth is achieved by the A3CPG algorithm, outperforming the other conventional methods. Among the conventional methods, the A3C algorithm outperformed the PPO algorithm, which in turn outperformed the DDPG algorithm. The proposed method is also significantly better than the other methods in terms of average reward (P < 0.05). In terms of learning speed, from best to worst: A3CPG > A3C > PPO ≈ DDPG. The proposed method is also significantly better than the other methods in terms of convergence speed (P < 0.05). It can be seen that the proposed A3CPG method produces more effective target assignment results than the A3C method during the target assignment process, helping the agent to accelerate the learning process and obtain a better air defense strategy. The DDPG method had demanding requirements for the quality of reward design and feedback during training. In contrast, the PPO method normalized rewards during training and had less demanding requirements for the reasonableness of the reward setting and real-time feedback. In the experimental environment of this study, real-time feedback of rewards was challenging due to the large state-action space and the multiple entities involved, which may explain why the PPO method performed better than the DDPG method. The A3CPG and A3C methods were more efficient at exploring the environment and were able to utilize more computational resources due to parallel computing, which may explain why they outperformed the PPO and DDPG methods.

Cumulative Reward and Winning Rate
Figure 7 shows the training winning rate statistics, and the analysis combined with Figure 6 shows that the higher-reward methods also have higher winning rates. The reward function is designed to guide the agent in learning the winning strategy. Consequently, a method that yields a high final reward is indicative of a reasonable reward design and the ability of the designed reward to guide the agent toward better strategic choices. The proposed method achieved the highest winning rate of 81.7%, followed by the A3C method with 77.3%, followed by the PPO method with 70.5%, with the DDPG method displaying by far the worst winning rate of 45.4%. The proposed method showed a statistically significant superiority over the other conventional methods (P < 0.05), with a 5.69% increase in winning rate compared to the next best method (the A3C method). It is possible that the superior performance of the A3CPG method in the context of target assignment can be attributed to the potential game module it incorporates. By leveraging the NE, this module facilitates efficient target assignment and enables the attainment of optimal strategies, resulting in a higher winning rate. The variances of the A3CPG, A3C, DDPG, and PPO methods were 6.9, 7.1, 9.3, and 11.2, respectively. In terms of algorithm stability, the A3CPG method performed slightly better than the A3C method. This behavior can be attributed to the A3CPG system being embedded with a potential game method, which was more stable than the A3C system based on a neural network.
● indicates the mechanism is used in the method; ○ indicates the mechanism is not used in the method (legend for Table 4).

Mechanism Effectiveness Analysis
To evaluate the effectiveness of the proposed multihead attention mechanism, BiGRU mechanism, and TDRM, ablation experiments were conducted.
The design of the ablation experiment is shown in Table 4, which compares the effects of the three mechanisms proposed in this article on the reward. The experiment uses the A3C method as a baseline and compares the effectiveness of using a single mechanism with the A3C approach against the A3CPG approach, in which all three mechanisms are used. The results of the ablation experiment are shown in Figure 8.
As shown, using any one of the mechanisms with the A3C approach enhances the reward compared to the baseline approach, indicating that the three proposed mechanisms are effective in the experimental scenario. In particular, the multihead attention mechanism produced the best final reward and the fastest training among the single-mechanism methods: the reward after training was higher for the A3C-M method than for the other two single-mechanism methods, and the slope of its reward curve was steeper. This can be attributed to the multihead attention mechanism allowing the agent to analyze the battlefield situation information more efficiently, thereby speeding up training and focusing on more important targets to obtain better air defense strategies. With the addition of the type-different reward mechanism (A3C-T), the trained reward of the agent was slightly improved compared to the A3C-B and baseline approaches. This can be attributed to the TDRM helping the agent to apply a more reasonable amount of defensive force, using fewer long-range fire units and reducing exposure, thereby accelerating the convergence of the agent. After the BiGRU mechanism was added (A3C-B), the convergence speed of the agent was significantly accelerated relative to the baseline, as evidenced by the fact that the slope of the A3C-B reward curve before 48 episodes was significantly higher than that of the A3C-M, A3C-T, and baseline methods, almost reaching the convergence level. This can be attributed to the BiGRU mechanism's greater capability for analyzing air situation information, which gave the agent a better perception of the current situation and thus accelerated training. However, its final reward was similar to the baseline due to a lack of attention to the global situation.

Convergence Analysis of Potential Function
Figure 10 shows the variation of the potential function during training: Figure 10a shows the case of scenario 1 in Figure 9, and Figure 10b shows the case of scenario 2 in Figure 9. In these figures, the horizontal axis represents the number of iterations of the potential game and the vertical axis represents the potential function value.
From Figure 10, it can be seen that in both scenarios the potential function converges to the NE, with the potential function value in scenario 1 converging to 9.56 and the potential function value in scenario 2 converging to 13.38. Convergence to the NE occurred at step 57 in scenario 1 and at step 74 in scenario 2. The greater number of steps required for convergence in scenario 2 can be attributed to its larger number of targets, which further complicates the air situation; as a result, the process of converging toward the NE entails more steps.

Rational Analysis of Target Assignment
In order to compare the performance of the proposed method with the traditional method, two metrics are introduced. The first is the ratio of missiles consumed to targets eliminated (ROT), calculated as ROT = N_c / N_e, where N_c is the number of consumed missiles and N_e is the number of eliminated targets.
The second is the ratio of consumed short-range missiles to consumed long-range missiles (ROS), calculated as ROS = N_s / N_l, where N_s is the number of consumed short-range missiles and N_l is the number of consumed long-range missiles.
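Assuming the ratio orientations given above, the two metrics reduce to simple helper functions; the example inputs below are arbitrary.

```python
# Helper functions for the two evaluation metrics defined above.

def rot(n_consumed: int, n_eliminated: int) -> float:
    """Missiles consumed per eliminated target (lower is better)."""
    return n_consumed / n_eliminated

def ros(n_short: int, n_long: int) -> float:
    """Short-range vs. long-range missile consumption (higher means cheaper units preferred)."""
    return n_short / n_long

print(rot(40, 20), ros(30, 20))  # 2.0 1.5
```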
According to Figure 11, the ROT values of the two methods at 20k episodes were 4.26 and 4.28, respectively, with no significant difference (P > 0.05). However, at 100k episodes, the ROT values for the A3CPG and A3C methods were 1.91 and 2.31, respectively, a statistically significant reduction of 17.32% for the A3CPG method over the A3C method (P < 0.05). This improvement can be attributed to the potential game module embedded in the proposed A3CPG method, which optimizes the assignment of firepower resources by assigning targets to the appropriate fire units. The observed decrease in ROT values for both methods during training suggests that the agent learnt effective strategies to conserve firepower resources.
According to Figure 12, the ROS values of the two methods were 0.541 and 0.592 at 20k episodes, respectively, and increased to 1.971 and 1.537 as training progressed to 100k episodes. The observed increase in ROS values suggests that the agent learnt to preferentially use short-range missiles. At 20k episodes, no significant difference in ROS values was observed between the two methods (P > 0.05). However, after training, the A3CPG method showed a 28.24% improvement over the A3C method at 100k episodes, and the difference was statistically significant (P < 0.05). These results indicate that the potential game module in A3CPG effectively enhances the rationality of firepower resource use and promotes the preferential use of short-range missiles for target interception by the agent, over more costly long-range missiles.

Conclusion
In this article, a potential game-embedded, reinforcement learning-based intelligent decision-making method is proposed. The proposed method is divided into two modules: the DRL-TSRC module and the PG-TA module. The DRL-TSRC module is based on DRL and uses BiGRU and multihead attention mechanisms to process battlefield situational information for target selection and radar control. The PG-TA module models the multitarget, multi-fire-unit assignment problem as a target assignment optimization game, proves that this game is an exact potential game, and integrates the potential game into the intelligent decision system to ensure that the target assignment result is an NE solution. The experimental results demonstrate that the proposed intelligent decision-making method can effectively enhance the winning rate (a 5.69% improvement compared to the traditional method), optimize the rationality of target assignment, conserve firepower resources, and generate superior strategies compared to other methods (a 17.32% reduction in ROT and a 28.24% improvement in ROS compared to the traditional method).
In the future, we plan to explore two directions for improvement.First, we will seek to embed the potential game more deeply into the DRL model and develop game theory-inspired DRL algorithms to solve the agent game problem.Second, we plan to design more complex tasks, such as scheduling air defense firepower, fighter aircraft, and early warning aircraft, and develop the corresponding algorithms to solve these tasks.

Figure 2 .
Figure 2. The architecture of the proposed decision-making method. The proposed method is composed of a DRL-TSRC module and a PG-TA module.

Figure 3 .
Figure 3. Diagram of the BiGRU structure.

Figure 4 .
Figure 4. Diagram of the multihead attention structure.

Algorithm 1 .
Distributed decision-maker selection protocol. Set the maximum waiting time τ_max. At the start of each iteration, each player i generates a random waiting time τ_i in [0, τ_max] and starts a timer. If player i receives a decision request (DR) signal from another player within the first τ_i time units after the iteration starts, it withdraws from the decision-maker selection; otherwise, player i is identified as the decision maker for the round and sends DR signals to the other players.

Algorithm 2 .
Best response-based target assignment NE solving (BRTA) algorithm. Initialize the target assignment game Γ = ⟨N, S, {U_i}_{i∈N}⟩, where M and T are the number of fire units and the number of targets, respectively, and N_it is the number of iterations; randomly initialize the strategy combination S_r, and let n denote the player that changes its strategy in the current iteration. In each iteration, the decision maker generated by Algorithm 1 is player i. Let S_i denote the set of available strategies for the i-th fire unit and N_0 the number of elements in S_i. Take the current strategy s_i ∈ S_r of the i-th fire unit, let S_i′ = S_i − {s_i}, and calculate its utility function; then repeatedly draw a candidate strategy s_r′ at random from S_i′, update S_i′ = S_i′ − {s_r′}, and, if the candidate satisfies the interception condition, let the i-th fire unit temporarily change its strategy to the candidate, update the strategy combination S_r, and calculate the corresponding utility function. The strategy yielding the highest utility is retained as the best response of player i.

Figure 5 .
Figure 5. Schematic diagram of digital battlefield environment.

Figure 7 .
Figure 7. Air defense combat battle winning rate after training.

Figure 6 .
Figure 6. Change curve of agent reward during the training process.

Figure 8 .
Figure 8. Change curve of agent reward in the ablation experiments.

Figure 9 .
Figure 9. Air defense combat scene during training, a) scene 1 and b) scene 2.

Figure 10 .
Figure 10. The convergence process of the target assignment optimization game: a) the convergence process of scene 1 and b) the convergence process of scene 2.

Table 1 .
Air defense operation states definition.

Table 2 .
Air defense operation actions definition.

Table 3 .
Air defense operation reward design.