Toward Collaborative Multitarget Search and Navigation with Attention‐Enhanced Local Observation

Collaborative multitarget search and navigation (CMTSN) is highly demanded in complex missions such as rescue and warehouse management. Traditional centralized and decentralized approaches fall short in terms of scalability and adaptability to real‐world complexities such as unknown targets and large‐scale missions. This article addresses this challenging CMTSN problem in three‐dimensional spaces, specifically for agents with local visual observation operating in obstacle‐rich environments. To overcome these challenges, this work presents the POsthumous Mix‐credit assignment with Attention (POMA) framework. POMA integrates adaptive curriculum learning and mixed individual‐group credit assignments to efficiently balance individual and group contributions in a sparse reward environment. It also leverages an attention mechanism to manage variable local observations, enhancing the framework's scalability. Extensive simulations demonstrate that POMA outperforms a variety of baseline methods. Furthermore, the trained model is deployed over a physical visual drone swarm, demonstrating the effectiveness and generalization of our approach in real‐world autonomous flight.


Introduction
Multitarget search (MTS) and navigation with intelligent systems such as autonomous drones [1] are in high demand in critical applications, [2] such as disaster rescue after earthquakes, surveillance and reconnaissance in hazardous areas, and parcel management in cluttered warehouses (see Figure 1). Multiagent systems (MAS) are considered a more efficient way to conduct such missions, exchanging more information while enhancing the resilience and robustness of search operations. [3,4] However, ensuring effective collaboration among these agents is challenging, especially in obstacle-rich environments with multiple targets. Collaboration involves not only information exchange but also the synchronization of actions and strategies among all participating agents. Moreover, coordinated interaction is fundamentally required: agents must work toward a common goal, sharing information and making decisions that benefit the overall mission.
Traditional approaches mainly address MTS problems via heuristic algorithms [5] or subtask optimization [6] in centralized or decentralized manners. Centralized approaches, such as task assignment from a central node, rely on a central controller to allocate tasks to individual nodes and execute their actions based on global information. Gábor et al. [6] proposed a centralized optimization algorithm that considers the spatial constraints of the environment and the capabilities of drones to achieve coordinated flocking of a drone swarm in constrained environments. While these methods are easy to deploy, they are vulnerable to single-node failures and lack scalability. Once the central node is attacked or disturbed, [7] the entire system ceases to function. Meanwhile, the complexity of communication and computation grows rapidly with the number of nodes, which hinders extension to large-scale scenarios. Decentralized approaches, such as distributed predictive control [8] and decentralized trajectory planners, [9] have been proposed to overcome these limitations. They rely on local observation and communication among agents to achieve the desired performance but suffer from state-consensus problems. [10] Besides, these approaches often struggle to adapt to the uncertainties and complexities of real-world environments because they assume simplified environments and simulated sensor data. Furthermore, they do not always fully exploit the potential for collaboration when multiple agents and targets are involved, as each agent computes its optimal strategy individually.
Multiagent reinforcement learning (MARL) [11][12][13][14][15][16] has recently emerged as a powerful tool for handling uncertainties and generating effective collaboration among agents in sophisticated tasks such as path planning and navigation in cluttered environments. These methods leverage reinforcement learning (RL) to find the optimal policy for agents, with reward functions intentionally designed to encourage collaboration and benefit the entire team. However, these methods either require global observations of the environment [11] or rely on bird's-eye-view grid states and stable communication, [12,15] which makes fully decentralized planning infeasible in uncertain real-world settings such as MTS. In this article, we aim to solve the collaborative multitarget search and navigation (CMTSN) problem in 3D space with local visual perception. Several challenges must be overcome to achieve fully decentralized decision-making for this task: 1) balancing exploration and exploitation in a 3D sparse-reward space; 2) assigning credit efficiently within a team; and 3) improving the scalability of the trained model. An adaptive curriculum embedded multistage learning framework [17] was proposed to address the first challenge, yet only for single-target search. Multi-Agent POsthumous Credit Assignment (MA-POCA) [18] addressed posthumous credit assignment by considering the group contribution among multiple agents and introducing an attention mechanism into the centralized critic network. However, MA-POCA ignores individual motivation and does not consider variable observations during execution, which is critical for local perception and scalability since detected signals may be lost. Besides, existing MARL and MTS approaches rarely consider physical implementations of CMTSN, which pose additional challenges in communication and observation synchronization when the number of found targets changes.
In this work, we propose a POsthumous Mix-credit assignment with Attention (POMA) framework to address the remaining challenges, namely efficient credit assignment and variable observation. POMA first combines adaptive curriculum learning (ACL) and mixed individual-group credit assignment based on MA-POCA, and then introduces the attention mechanism into observations to improve the scalability of the framework when variable local observations are involved. We validate our framework through extensive simulations and real-world flight experiments.
Our main contributions are: 1) we address the complex CMTSN task by formulating it as a partially observable MARL problem and proposing a novel MARL approach, POMA, to tackle the critical challenges of credit assignment and model scalability in a 3D sparse-reward environment, demonstrating significant performance improvements over well-established MARL approaches such as MADDPG, [19] MA-POCA, and IPPO; [20] 2) we develop a real drone swarm equipped with local visual perception for CMTSN in indoor, obstacle-rich environments (see Figure 2), utilizing an attention mechanism to enhance collaboration and manage variable observations, and target-status synchronization to update the global information; 3) to the best of our knowledge, this is the first work to deploy a deep reinforcement learning (DRL) model on a physical drone swarm for CMTSN tasks, proving its effectiveness in real-time flight and highlighting its practical applicability, thereby establishing a new benchmark in robot swarm intelligence.

MTS
MTS has attracted significant attention due to its critical importance in various missions such as search and rescue (SAR).
Figure 1. A drone swarm inspecting parcels in a warehouse, where efficient collaboration is required. [47]

MTS involves exploring the environment and finding multiple, initially unknown targets. Distinct from mere environment exploration, [21,22] which aims at mapping explored areas into a unified representation, MTS focuses on finding all targets efficiently, prioritizing high success rates and minimal search time. Probabilistic search (PS) methods, [23] which leverage probabilistic models such as the probability hypothesis density (PHD) [24] or belief [5] to estimate target locations and derive optimal search paths, are prevalent in MTS. Commonly, these PS problems are reduced to particle swarm problems and tackled with heuristic strategies such as particle swarm optimization (PSO), [25,26] ant colony optimization (ACO), [5] artificial bee colony (ABC), [27] or brain storm optimization (BSO). [28] However, these studies typically rely on simplified 2D maps and simulated sensor data, limiting their applicability in real-world scenarios. For instance, ref. [5] utilized an enhanced ACO technique for MTS within a 2D, obstacle-free grid-cell environment, employing simulated radar sensors, which simplifies real-world complexity.
To overcome these limitations, some researchers have framed MTS as a partially observable Markov decision process (POMDP), solved using search algorithms [29,30] or RL. [31] For instance, ref. [30] modeled MTS in 3D space using a POMDP framework and addressed it with the adaptive belief tree (ABT) algorithm. In an attempt to maintain multi-UAV connectivity, ref. [31] adopted an RL model with omnidirectional perception in a 2D grid environment and applied a convolutional neural network (CNN), trained with DQN, [32] to learn policies from image representations of trajectory histories and connectivity states. These approaches aim to reflect MTS challenges more accurately but still fall short of real-world deployment due to their reliance on grid maps and simulated sensors.
To better address model uncertainties in real-world applications and facilitate collaboration, we propose a novel MARL approach to solve the CMTSN problem, which involves multiple agents and obstacles, using a real visual drone swarm. As shown in Table 1, which summarizes recent MTS works, our work is unique in utilizing a real visual drone swarm for MTS in a high-fidelity 3D continuous environment, distinguishing it from the predominantly 2D, simulation-based approaches. Our work bridges this gap by leveraging a novel MARL framework validated through physical experiments, marking a significant advancement in real-world CMTSN applications.
Our approach, POMA, is specifically designed to address inherent challenges in applying MARL to CMTSN, such as learning in sparse reward spaces, resolving individual-group credit conflicts, and managing variable observations with inactive agents during task execution.

MARL
Recently, MARL [33][34][35] has been widely applied in path planning, navigation, and target search for intelligent systems. From the perspective of the training framework, MARL can be categorized into three main structures: centralized training and execution (CTE), decentralized training and execution (DTE), and centralized training with decentralized execution (CTDE). CTE [36] uses a central controller that collects all state observations and actions from the agents and shares them with the group; the controller both trains and executes the policy with perfect information. However, CTE is not scalable due to the curse of dimensionality and is limited to environments where perfect information is accessible. DTE provides a scalable way to search for the optimal policy of each agent via local observations, without communication among agents. PPO [37] is a common algorithm for generating individual policies and extends well to MARL within the DTE framework. [14] However, the agents then simply coexist without significant collaboration. Hence, CTDE is the preferred framework in most research, such as MADDPG [19] with centralized critics, COMA [38] with centralized joint policy search, and QMIX [39] with value-function factorization. CTDE allows agents to collaborate during training while maintaining scalability during execution, combining the advantages of CTE and DTE.
However, to effectively apply the CTDE framework to CMTSN tasks, we need to address the sparse reward space and credit assignment issues during training, especially for agents with only local visual perception. PRIMAL [12] and PRIMAL2 [40] combined imitation learning and RL to train fully decentralized policies for multiagent path finding in sparse-reward worlds and highly structured warehouses, respectively. To better utilize information from neighbors, Li et al. [15] developed GNNHIM, a graph-neural-network-based framework that extracts information from multihop neighbors. However, all these methods assume 2D grid environments and known target information, which is not practical for CMTSN with visual drones in our case.

Credit Assignment
Credit assignment is a crucial issue in cooperative MARL problems as it determines the contribution of each agent to the group's success or failure. In our CMTSN problem, where agents work together to find all targets, it is essential to identify which action by which agent led to a higher search success rate.
Various works have addressed the credit assignment issue in MARL. To estimate the contribution of each agent, Devlin et al. [41] calculated the difference between the global reward and the estimated reward when the agent is ignored. However, this method suffers from high computational cost since the absence of every agent must be simulated at each step. COMA [38] used a centralized critic to estimate the action-value function and, for each agent, computed a counterfactual advantage function to represent the value difference, which determines the contribution of each agent. However, COMA still does not fully address the model complexity issue. QMIX [39] addresses the scalability issue by decomposing the global action-value function into individual agents' value functions. However, QMIX assumes that the global action-value function is a monotonic function of the individual value functions, which is hard to satisfy in practice [18] and may not directly handle scenarios with continuous action spaces.
Another issue regarding credit assignment is the variable number of agents, which induces the posthumous credit assignment problem. For instance, in the CMTSN task, drones may crash into an obstacle or other drones during flight, but the remaining drones are required to continue the search. MA-POCA [18] provides valuable insights for addressing posthumous credit assignment by introducing a self-attention mechanism into the value-function estimation and counterfactual baseline, [42] but it lacks consideration of individual motivation. This degrades the performance of MA-POCA, especially in environments with sparse rewards and multiple targets such as CMTSN.
To address these challenges, in this article we combine the advantages of MA-POCA and PPO to balance group contribution and individual motivation through well-designed group and individual reward functions. We further introduce an attention mechanism to encode the local observations, which significantly improves collaboration within the team and the scalability of the policy across scenarios.

MARL
The CMTSN problem can be characterized as a decentralized POMDP defined by $\mathcal{M} := \langle \mathcal{N}, \mathcal{S}, \{\mathcal{O}^i\}_{i\in\mathcal{N}}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, P, R, \rho_0 \rangle$, where $\mathcal{N} = \{1, \ldots, N\}$ represents the set of $N \ge 1$ agents. The state of the environment at time $t$ is denoted by $s_t \in \mathcal{S}$. The local observation and action of agent $i$ at $s_t$ are $o^i_t \in \mathcal{O}^i$ and $a^i_t \in \mathcal{A}^i$, respectively, and the joint action space is $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^N$. The transition function gives the probability $P(s_{t+1} \mid s_t, a_t)$ of moving from state $s_t \in \mathcal{S}$ to the subsequent state $s_{t+1} \in \mathcal{S}$ under the joint action $a_t \in \mathcal{A}$. The reward function shared by all agents is $R: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$, with $r(s_t, a_t)$ the reward received by the team; for brevity, $r_t$ denotes $r(s_t, a_t)$. $\rho_0$ is the distribution of initial states and $\gamma \in [0, 1)$ is the discount factor for the cumulative rewards. Let $\pi := (\pi^1, \ldots, \pi^N)$ be the joint policy, with $\pi^i(a^i_t \mid o^i_t)$ the policy of each independent agent $i \in \mathcal{N}$, and let $T$ be the maximum number of steps in an episode. In the CTDE framework, the centralized state-value function for state $s_t$ is

$$V(s_t) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{j=0}^{T-t} \gamma^{j} r_{t+j} \,\Big|\, s_t\Big] \quad (1)$$

and the centralized state-action value function is

$$Q(s_t, a_t) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{j=0}^{T-t} \gamma^{j} r_{t+j} \,\Big|\, s_t, a_t\Big] \quad (2)$$

The objective of MADRL is to find an optimal joint policy $\pi^{*}$ that maximizes the state value of the team starting from the initial state $s_0$. To evaluate the contribution of each agent, the counterfactual baseline [38] $b^i(s, a)$ is computed by marginalizing the action of agent $i$ out of the joint action:

$$b^i(s, a) = \textstyle\sum_{a'} \pi^i(a' \mid o^i)\, Q\big(s, (a^{-i}, a')\big) \quad (3)$$
Here, $a'$ ranges over the actions of agent $i$ and $a^{-i}$ is the joint action of all agents except $i$. With the baseline, the advantage of agent $i$ is defined as

$$A^i(s, a) = Q(s, a) - b^i(s, a) \quad (4)$$

and the optimal policy for each agent $i$ is obtained by iteratively updating the policy with the gradient

$$\nabla_{\theta^i} J = \mathbb{E}_{\pi}\big[\nabla_{\theta^i} \log \pi^i(a^i_t \mid o^i_t)\, A^i(s_t, a_t)\big] \quad (5)$$

The advantage function captures the contribution of an individual agent to the shared team reward by measuring the value difference with and without that agent's action.
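For a discrete-action toy case, the counterfactual baseline and advantage in (3) and (4) can be computed directly. The following minimal sketch uses made-up Q values and policy probabilities purely for illustration; it is not the authors' implementation and does not apply as-is to continuous actions:

```python
import numpy as np

def counterfactual_advantage(q_values, policy_i, action_i):
    """Counterfactual baseline and advantage for one agent (discrete toy case).

    q_values:  (K,) centralized Q(s, (a^{-i}, a')) for each alternative action a'
               of agent i, with the other agents' actions a^{-i} held fixed
    policy_i:  (K,) probabilities pi^i(a' | o^i)
    action_i:  index of the action agent i actually took
    """
    baseline = float(np.dot(policy_i, q_values))      # eq. (3): marginalize agent i's action
    return q_values[action_i] - baseline              # eq. (4): advantage of agent i

# Toy example with 3 candidate actions for agent i.
q = np.array([1.0, 0.4, 0.2])
pi = np.array([0.5, 0.3, 0.2])
adv = counterfactual_advantage(q, pi, action_i=0)     # positive: better than the policy average
```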
To address the posthumous credit assignment issue, MA-POCA [18] uses observation entity encoders and self-attention modules to handle the varying number of agents at every time step. First, to obtain the value estimate, the attention-masked centralized state-value function parameterized with $\phi$ in MA-POCA is

$$V_{\phi}\Big(\mathrm{RSA}\big(g^1(o^1_t), \ldots, g^{k_t}(o^{k_t}_t)\big)\Big) \quad (6)$$

where $k_t$, with $1 \le k_t \le N$, is the number of active agents at time step $t$; $g^i: \mathcal{O}^i \mapsto \mathcal{E}$ is the observation entity encoder for agent $i$; and RSA is the residual self-attention module used. Similarly, the counterfactual baseline parameterized with $\psi$, used to compute $b^i(s, a)$ for agent $i$, is

$$Q_{\psi}\Big(\mathrm{RSA}\big(g^i(o^i_t),\, f^{-i}(o^{-i}_t, a^{-i}_t)\big)\Big) \quad (7)$$

where $f^{-i}: \mathcal{O}^{-i} \times \mathcal{A}^{-i} \mapsto \mathcal{E}$ is the encoding network for the observation-action pairs of all agents except $i$.

Attention Mechanism
The attention mechanism [42] has been widely applied in various deep learning architectures. In DRL, attention has been used to create a weighted sum of observations, where the weights reflect the relevance of each observation to the task. [43] Consider a partially observable task where agent $i$ has an uncertain observation sequence $o^i_t := \{s^j_t \mid j \in \mathcal{N}^{-i}\}$ that we want to encode into a fixed-length context vector $c^i_t$ with attention weights $w_{ij}$:

$$c^i_t = \textstyle\sum_{j \in \mathcal{N}^{-i}} w_{ij}\, W_v s^j_t \quad (8)$$

The attention weights are calculated as

$$w_{ij} = \mathrm{softmax}_j\big(\mathrm{score}(W_q s^i_t,\, W_k s^j_t)\big) \quad (9)$$

where $\mathrm{score}(\cdot)$ is the scoring function that measures the relevance of the neighbor state $s^j_t$ to the ego state $s^i_t$; the scaled dot product is widely adopted as the score function. $W_v$, $W_q$, and $W_k$ are the trainable parameters that encode the states.
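The attention-based encoding of neighbor states in (8) and (9) can be sketched in a few lines of NumPy. The scaled dot-product score and the randomly initialized projection matrices below are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_to_neighbors(ego_state, neighbor_states, W_q, W_k, W_v):
    """Encode a variable-length set of neighbor states into one context vector.

    ego_state:        (d,)   state of agent i
    neighbor_states:  (n, d) states of the n observable neighbors
    W_q, W_k, W_v:    (d, d_k) projection matrices (trainable in practice)
    Returns the fixed-length context vector c_i of size (d_k,).
    """
    q = ego_state @ W_q                  # query from the ego state
    K = neighbor_states @ W_k            # keys from neighbor states
    V = neighbor_states @ W_v            # values from neighbor states
    d_k = K.shape[-1]
    scores = (K @ q) / np.sqrt(d_k)      # scaled dot-product relevance scores
    w = softmax(scores)                  # attention weights w_ij, eq. (9)
    return w @ V                         # weighted sum, eq. (8)

# Toy usage: 3 neighbors, 6-dimensional states, 8-dimensional embedding.
rng = np.random.default_rng(0)
d, d_k, n = 6, 8, 3
c = attend_to_neighbors(rng.normal(size=d), rng.normal(size=(n, d)),
                        rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)))
print(c.shape)  # (8,)
```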

Problem Formulation
CMTSN with autonomous agents is in high demand and has been investigated in various applications such as disaster SAR and warehouse management in complicated environments. Unlike the two-dimensional (2D) discrete grid scenarios considered in other works, [15,40] in such real-life environments QR-code distribution and prior goal information are infeasible in most cases.
In this article, we consider a bounded, obstacle-rich, three-dimensional (3D) continuous space $\mathcal{D} \subset \mathbb{R}^3$ of size $L \times W \times H$, where $L$, $W$, and $H$ are the length, width, and height of the environment, respectively. Assume that we have $N$ agents $\mathcal{N} = \{1, \ldots, N\}$ that move in 3D with continuous actions, and $M$ blocks $\mathcal{B} = \{B_1, \ldots, B_M\}$ of size $l_j \times w_j \times h_j$ for $B_j \in \mathcal{B}$. Each agent starts from a preset initial position $p^i_{\mathrm{ini}}$ and attitude $q^i_{\mathrm{ini}}$ with random position perturbation $p^i_w$ and rotation perturbation $q^i_w$, for $i \in \mathcal{N}$. The CMTSN task requires the agents to find and approach all $G$ targets $\mathcal{T} = \{T_1, \ldots, T_G\}$ in the shortest time. Each target is marked as found ($F$) or not found ($NF$), with status $ST_k \in \{F, NF\}$ for $k \in \mathcal{T}$. Note that, during the task, an agent keeps searching for the remaining targets unless it crashes or all targets are found.

Observation and Action Space
In the partially observable RL task, each agent draws actions from its individual policy conditioned on the current local observation, i.e., $a^i_t \sim \pi^i(\cdot \mid o^i_t; \theta^i_t)$ with $\theta^i_t$ the trainable parameters. In this article, the local observation of agent $i$ includes the visual observation $o^i_{v,t}$, i.e., an RGB image $I^i_t \in \mathbb{R}^{3\times224\times224}$; the ego-state observation $o^i_{e,t}$, i.e., position $p^i_t \in \mathbb{R}^3$, rotational quaternion $q^i_t \in \mathbb{R}^4$, normalized forward direction $n^i_W \in \mathbb{R}^3$, and previous action $a^i_{t-1} \in \mathbb{R}^4$; and the ego-centric observation $o^i_{c,t}$, i.e., the relative positions toward neighbors $p^{ij}_t = p^j_t - p^i_t$ for $j \in \mathcal{N}^{-i}$. Since the number of agents is not fixed in our task, a maximum number of observable neighbors $N_{\max}$ is preset, and hence $p^{ij}_t \in \mathbb{R}^{N_{\max} \times 3}$. The attention mechanism is adopted to calculate the scores of the observable neighbors, as discussed in the proposed POMA framework below. The visual observation $I^i_t$ is encoded into a vector $o^i_{v,t} \in \mathbb{R}^{256}$ with a simple CNN. To align with the real operation of drones, the action space of each agent consists of four continuous velocity commands, namely the translational velocities $v^B_x$, $v^B_y$, $v^B_z$ and the yaw rate $\omega_z$ in the body-fixed frame, constrained by $\|v^B\| \le v_{\max}$ and $-\omega_{\max} \le \omega_z \le \omega_{\max}$.
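The following sketch illustrates how such a local observation could be assembled and how the action constraints could be enforced; the bound values and helper names are illustrative assumptions, not the authors' code:

```python
import numpy as np

N_MAX = 5          # preset maximum number of observable neighbors
V_MAX = 1.0        # m/s, illustrative bound on translational speed
W_MAX = 1.0        # rad/s, illustrative bound on the yaw rate

def build_observation(image_embedding, position, quaternion, forward_dir,
                      prev_action, neighbor_positions):
    """Assemble the per-agent observation described in the text.

    image_embedding:     (256,) CNN encoding of the 3x224x224 RGB image
    position:            (3,) world position p_t
    quaternion:          (4,) rotational quaternion q_t
    forward_dir:         (3,) normalized forward direction
    prev_action:         (4,) previous velocity command
    neighbor_positions:  (k, 3) relative positions p_t^{ij}, k <= N_MAX
    """
    ego_state = np.concatenate([position, quaternion, forward_dir, prev_action])  # 14-dim
    # Zero-pad the variable-size ego-centric observation to a fixed N_MAX x 3 block.
    ego_centric = np.zeros((N_MAX, 3))
    k = min(len(neighbor_positions), N_MAX)
    ego_centric[:k] = neighbor_positions[:k]
    return image_embedding, ego_state, ego_centric

def constrain_action(action):
    """Clip the 4D action [vx, vy, vz, wz] to ||v|| <= V_MAX and |wz| <= W_MAX."""
    v, wz = np.asarray(action[:3], dtype=float), float(action[3])
    speed = np.linalg.norm(v)
    if speed > V_MAX:
        v = v * (V_MAX / speed)
    wz = np.clip(wz, -W_MAX, W_MAX)
    return np.concatenate([v, [wz]])
```

Note that the 14-dimensional ego-state (3 + 4 + 3 + 4) together with the 256-dimensional visual embedding matches the 270-dimensional concatenation used later in the network architecture.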

State Transition Function
Assuming that the low-level proportional-integral-derivative actuator controller can track the commanded actions with acceptable delay, we consider the high-level policy of the drone. The 4-degree-of-freedom dynamics under the aforementioned velocity commands can be described as

$$\dot{s} = g(s)\, a + \omega \quad (10)$$

where $s = [x, y, z, \psi]^{T}$ with $x, y, z$ the world-frame position coordinates and $\psi$ the heading angle; $g(s)$ is the transition matrix from the body-fixed frame $\Omega_B$ to the world frame $\Omega_W$; and $a := [v^B_x, v^B_y, v^B_z, \omega_z]^{T}$ is the velocity command (generated action) sent to each drone. $\omega$ represents the process noise and uncertainties, assumed to follow a Gaussian distribution. Assuming no significant tilting perturbation, $g(s)$ is

$$g(s) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 & 0 \\ \sin\psi & \cos\psi & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (11)$$

Note that the transition function (10) applies to each agent and determines the next state $s_{t+1}$ in discrete time, given the current state $s_t$, the joint action $\{a^i_t\} \in \mathcal{A}$, and the stochastic noise $\omega$; hence, the transition is not deterministic in our task.
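A minimal discretized simulation of the transition (10) and (11) might look as follows; the integration step, noise level, and function names are assumptions made for illustration:

```python
import numpy as np

def step_dynamics(state, action, dt=0.1, noise_std=0.01, rng=None):
    """One discretized step of the 4-DOF kinematics in eq. (10)-(11).

    state:  [x, y, z, psi] in the world frame
    action: [vx_B, vy_B, vz_B, wz] velocity command in the body-fixed frame
    dt, noise_std: illustrative integration step and process-noise level
    """
    rng = rng or np.random.default_rng()
    x, y, z, psi = state
    # Body-to-world rotation about the yaw axis (no tilt assumed), eq. (11).
    g = np.array([[np.cos(psi), -np.sin(psi), 0.0, 0.0],
                  [np.sin(psi),  np.cos(psi), 0.0, 0.0],
                  [0.0,          0.0,         1.0, 0.0],
                  [0.0,          0.0,         0.0, 1.0]])
    noise = rng.normal(scale=noise_std, size=4)   # Gaussian process noise omega
    return np.asarray(state) + dt * (g @ np.asarray(action)) + noise

# Example: command 0.5 m/s forward with a small yaw rate.
s_next = step_dynamics([0.0, 0.0, 1.0, 0.0], [0.5, 0.0, 0.0, 0.1])
```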

POMA
In this section, we describe our proposed POMA framework (see Figure 3), which extends MA-POCA (retaining the same posthumous credit assignment mechanism) and leverages ACL, mixed credit assignment (MCA), and attention-enhanced observation (AEO) to improve the performance and generalization capability of MA-POCA on CMTSN tasks.

ACL
Similar to automatic curriculum learning, [44] the motivation of our ACL module is to gradually guide the agents to learn to search for targets from near locations (simple tasks) to far locations (difficult tasks), given the sparse rewards in our scenario. Our ACL adjusts the task difficulty level (TDL) during on-policy training but only controls the initial state distribution ρ0. Hence, it can be regarded as a specific instance of automatic curriculum learning.
At the beginning of each episode, the targets are randomly spawned in three space sets partitioned by search distance (along the L direction), namely the near set $C_N$, the middle set $C_M$, and the far set $C_F$, with probabilities $\Pr(C_N)$, $\Pr(C_M)$, and $\Pr(C_F)$, respectively, where $\Pr(C_N) + \Pr(C_M) + \Pr(C_F) = 1$. Since far targets are expected to be more difficult to find, the TDL is denoted by $\varepsilon = 1 - \Pr(C_N) \in (0, 1)$, which represents the difficulty of finding all targets. The main performance metric for our task is the average success rate (ASR) of the team over a moving time window $T_w$:

$$\mathrm{ASR} = \frac{1}{|W_t|} \sum_{i \in W_t} \mathbb{1}\big(\mathrm{Out}(i)\big) \quad (12)$$

where $W_t$ is the set of episodes finished within the window $T_w$, $\mathbb{1}(1) = 1$ and $\mathbb{1}(0) = 0$, and $\mathrm{Out}(i)$ denotes whether episode $i$ is successful. Our ACL adjusts the TDL against the variation of the team's success rate with a clipped control law:

$$\varepsilon \leftarrow \mathrm{clip}\big(\varepsilon + \eta\,(\mathrm{ASR} - \mathrm{ASR}_d),\; 0.5 - \xi,\; 0.5 + \xi\big) \quad (13)$$

where $\eta$ is the rate that controls the update speed and $\xi$ is the coefficient that controls the update range of $\varepsilon$; $\varepsilon_0$ and $\mathrm{ASR}_d$ are the initial TDL value and the desired ASR for the task. The basic idea is that when the ASR is low at the beginning, the TDL drops toward the lower boundary $0.5 - \xi$, and when the ASR exceeds the desired ASR, the TDL increases to pose more challenge to the team, which in turn lowers the ASR. This feedback from the ASR adjusts the TDL in real time, and $\varepsilon$ converges to $\varepsilon^{*}$ for any ASR.
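A possible implementation of the clipped ACL update is sketched below; the additive feedback form mirrors the reconstruction of (13) above and should be treated as an assumption, as should the parameter defaults:

```python
def update_tdl(eps, asr, asr_d=0.7, eta=0.001, xi=0.4):
    """Adaptive curriculum update of the task difficulty level (TDL).

    Raises the TDL when the team's average success rate (ASR) exceeds the
    desired ASR and lowers it otherwise, clipped to [0.5 - xi, 0.5 + xi].
    The additive feedback form is an assumption consistent with eq. (13).
    """
    eps = eps + eta * (asr - asr_d)
    return min(max(eps, 0.5 - xi), 0.5 + xi)

# Example: the ASR is tracked over a moving window of episode outcomes.
eps = 0.1
outcomes = [0, 1, 1, 0, 1]            # 1 = all targets found in the episode
asr = sum(outcomes) / len(outcomes)   # eq. (12)
eps = update_tdl(eps, asr)
```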

MCA
Different from existing baselines, where IPPO [20] uses only individual critics to evaluate each agent's behavior and MA-POCA [18] considers only a group critic to evaluate each agent's behavior toward the group objective, the proposed MCA module combines individual critics and a group critic to optimize the policy with both individual and group rewards, as illustrated in Figure 4. Within the MCA module, the individual critic encourages an agent to search for as many targets as possible through individual effort, while the group critic encourages collaboration with other agents to find all targets efficiently. Compared to MA-POCA, POMA adds an individual critic that provides additional information and criticism for agents to make better-informed decisions. The individual critic maintains a state-action value function $Q^i$ and a state value function $V^i$ for agent $i$, trained on the individual reward $r^I_t(s_t, a^i_t)$. The generalized advantage estimation for this individual critic is

$$A^{i}_{I}(s_t, a^i_t) = \sum_{j=0}^{T-t} (\gamma\lambda)^{j}\, \delta_{t+j} \quad (15)$$

where $\delta_{t+j} = Q^i(s_{t+j}, a^i_{t+j}) - V^i(s_{t+j})$ denotes the temporal-difference error, $\gamma$ is the discount factor, and $\lambda$ is a hyperparameter. Since each actor now has two separate critics (an individual critic and a group critic) providing guidance, the actor faces the dilemma of pursuing individual or group rewards. Compared to IRAT, [45] which clips the difference between the individual policy and the group policy for each agent and hence introduces many additional trainable parameters, especially for large-scale MAS, our MCA module shapes the advantage function directly from the individual and group rewards. The advantage functions from the individual reward and the group reward are combined for agent $i$ into a mixed advantage $A^{i}_{\mathrm{mix}}$ weighted by the hyperparameter $\lambda_c$, where $\hat{A}^{i}$ is the group advantage calculated from (2), (4), and (7) and $\lambda_c$ balances the group critic against the individual critic. Note that the reward in (2) is the group reward $r^G_t(s_t, a_t)$. The updated gradient to iterate the policy of agent $i$ is

$$\nabla_{\theta^i} J = \mathbb{E}_{\pi}\big[\nabla_{\theta^i} \log \pi^i(a^i_t \mid o^i_t)\, A^{i}_{\mathrm{mix}}\big] \quad (17)$$

To make full use of the MCA module, the reward functions must be well designed to balance individual effort and team contribution in different tasks. In this article, individual rewards $r^I$ are given only when an agent finds a target or crashes. This encourages competitive behavior within the team and ensures that the entire group is not punished for a particular agent's mistake. Group rewards are given for finding all targets and for all drones crashing; they provide feedback on collaborative search behavior while discouraging crashes. Beyond these task rewards, an existential reward $r_e = -1.0/T$ and a collision reward $r_c$ are added for all methods. Here, $p^{ig}_T$ denotes the displacement from the collision position to the target position for agent $i$, and $p^{ig}_{\mathrm{ini}}$ the initial displacement; $\|\cdot\|$ denotes the vector norm and $(\cdot)$ the inner product. Two weights, $\alpha \in (0, 1]$ and $\beta \in (0, 1]$, adjust the penalty associated with the navigation distance and the penalty arising from the forward direction, respectively. The individual and group rewards are listed in Table 2, where G denotes a group reward and I an individual reward.
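The mixing of the two advantage signals can be sketched as follows. The convex combination weighted by lambda_c is an assumption (the text only states that λ_c balances the two critics), and the GAE helper follows the standard backward recursion over the TD errors:

```python
import numpy as np

def gae(deltas, gamma=0.99, lam=0.95):
    """Generalized advantage estimation from per-step TD errors, as in eq. (15)."""
    adv = np.zeros_like(deltas, dtype=float)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def mixed_advantage(group_adv, indiv_adv, lambda_c=0.5):
    """Blend the group (counterfactual) advantage and the individual advantage.

    The convex combination weighted by lambda_c is an assumption; the paper
    only states that lambda_c adjusts between the group and individual critics.
    """
    return lambda_c * np.asarray(group_adv) + (1.0 - lambda_c) * np.asarray(indiv_adv)

# Example: TD errors from the individual critic, plus a group advantage
# already computed from eq. (2), (4), and (7).
indiv_adv = gae(np.array([0.1, -0.05, 0.2, 0.0]))
group_adv = np.array([0.3, 0.1, 0.25, -0.1])
adv = mixed_advantage(group_adv, indiv_adv, lambda_c=0.6)
```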

AEO
To address the generalization issue of the policy under variable ego-centric observations, ref. [17] used the relative position toward the nearest agent as the ego-centric observation. However, such a design ignores the information from other agents. To better utilize all observable neighbors, the attention mechanism is adopted. The ego-centric observations are first zero-padded to $I^i$ with a fixed size of $N_{\max} \times 3$. The padded observation is then embedded into query $Q$, key $K$, and value $V$ vectors with linear layers $\mathrm{LN}(I^i \mid \Theta^i)$ of embedding size $d_k / m$. The output of the AEO is

$$\mathrm{AEO}(I^i) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_m\big) W^{O}, \qquad \mathrm{head}_h = \mathrm{softmax}\!\Big(\frac{Q_h K_h^{T}}{\sqrt{d_k}}\Big) V_h$$

where $m$ is the number of heads and $d_k$ is the dimension of the padded observation. The output is further processed with residual connections and batch normalization, as in the Transformer architecture. [42]
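A compact PyTorch sketch of such an attention-enhanced encoder is given below. The use of nn.MultiheadAttention, the layer-normalized residual output, the mean-pooling over non-padded slots, and the embedding sizes are illustrative choices rather than the authors' exact implementation (which applies batch normalization):

```python
import torch
import torch.nn as nn

class AEOEncoder(nn.Module):
    """Self-attention over the zero-padded ego-centric observation block (sketch)."""

    def __init__(self, n_max=5, embed_dim=128, heads=4):
        super().__init__()
        self.embed = nn.Linear(3, embed_dim)                     # per-neighbor linear embedding
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, neighbor_obs, pad_mask):
        """neighbor_obs: (B, N_max, 3) zero-padded relative positions;
        pad_mask: (B, N_max) bool, True where the slot is zero padding."""
        x = self.embed(neighbor_obs)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm(x + attn_out)                              # residual + normalization
        # Mean-pool the non-padded slots into one fixed-length vector.
        keep = (~pad_mask).unsqueeze(-1).float()
        return (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)

# Usage with a batch of 2 agents and 5 (all valid) neighbor slots.
enc = AEOEncoder()
out = enc(torch.randn(2, 5, 3), torch.zeros(2, 5, dtype=torch.bool))
print(out.shape)  # torch.Size([2, 128])
```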

Network Architecture
The actor network comprises a visual encoder and a policy network. The visual encoder is a simple CNN with two convolutional layers (16 × 8 × 8 | 32 × 4 × 4) and Leaky ReLU activations; it encodes the raw image input into a compact 256-dimensional vector. This visual encoding is concatenated with the ego-state observation $o_e$ to form $C_1 \in \mathbb{R}^{270}$, which is then cross-encoded with the ego-centric observation ($o_c \in \mathbb{R}^{N_{\max} \times 3}$) through a multihead attention network; the result is concatenated with $C_1$ into a vector $C_2 \in \mathbb{R}^{398}$. The policy network consists of two fully connected (FC) layers of 256 nodes each, followed by a final layer of 4 nodes that generates the four continuous action values controlling the agent. The individual critic and group critic networks share the same structure: two FC layers of 256 nodes and a final value layer.
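The following PyTorch sketch follows the layer sizes stated above (a 256-dimensional visual embedding, C1 in R^270, a 128-dimensional attention output giving C2 in R^398, and a 2 x 256 policy head). The convolution strides, the tanh output bounding, and all module names are assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of the actor: CNN visual encoder + ego-state concat + cross-attention
    over neighbors + 2x256 FC policy head with 4 continuous outputs."""

    def __init__(self, n_max=5, attn_dim=128, heads=4):
        super().__init__()
        self.cnn = nn.Sequential(                       # 3x224x224 RGB input
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.LeakyReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.LeakyReLU(),         # 256-d visual embedding
        )
        self.neighbor_embed = nn.Linear(3, attn_dim)
        self.query_embed = nn.Linear(270, attn_dim)     # C1 = 256 visual + 14 ego-state
        self.attn = nn.MultiheadAttention(attn_dim, heads, batch_first=True)
        self.policy = nn.Sequential(
            nn.Linear(270 + attn_dim, 256), nn.LeakyReLU(),   # C2 = 398-d input
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, 4),                                # [vx, vy, vz, wz]
        )

    def forward(self, image, ego_state, neighbors):
        c1 = torch.cat([self.cnn(image), ego_state], dim=-1)   # (B, 270)
        q = self.query_embed(c1).unsqueeze(1)                  # query from C1
        kv = self.neighbor_embed(neighbors)                    # (B, N_max, attn_dim)
        ctx, _ = self.attn(q, kv, kv)
        c2 = torch.cat([c1, ctx.squeeze(1)], dim=-1)           # cross-encoded C2, (B, 398)
        return torch.tanh(self.policy(c2))                     # bounded continuous actions

actor = Actor()
action = actor(torch.randn(1, 3, 224, 224), torch.randn(1, 14), torch.randn(1, 5, 3))
```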

CTDE with Behavior Manager
The proposed framework is trained with CTDE, where the critic networks collect all agents' trajectories and each agent executes the same actor network. A central behavior manager is designed to manage the group behavior of all objects of interest (OOIs), i.e., targets, agents, and obstacles, including group registration and the number of OOIs, while agent-level behavior such as dynamics is controlled by each agent's own behavior script. The group rewards are also defined in the behavior manager and triggered from the agent behavior scripts. The detailed training process for POMA is listed in Algorithm 1.
The behavior manager controls the initialization and update of the environment with the TDL rule (13). Each agent is updated with an individual MonoBehaviour to obtain its observations and transitions to the next time step using its dynamics and the actions from the current policy. The policy is updated every episode with the policy gradient (17).

Theorem 4.1. Under the update rule (13), the TDL ε converges to ε* ∈ [0.5 − ξ, 0.5 + ξ] for any given desired ASR, ASR_d, and initial TDL ε0.
Proof. Assume that an arbitrary ε0 ∈ (0, 1) is given such that ASR < ASR_d over a finite number of episodes [E]; then ε decreases to the lower boundary 0.5 − ξ within E × T time steps, where the tasks are easiest, so the ASR converges to some ASR′ with ASR′ > ASR_d, after which ε increases while remaining within [0.5 − ξ, 0.5 + ξ]. If ε0 is instead given such that ASR > ASR_d over finite episodes [E], let ASR′ = ASR; the argument follows symmetrically from the conditions above, and ε again converges to some ε* ∈ [0.5 − ξ, 0.5 + ξ]. This completes the proof.
Theorem 4.1 provides theoretical support to ensure the convergence of training with ACL if the basic policy improvement is convergent.

Settings
Our simulation environments are developed with the Unity game engine, and the proposed POMA framework is implemented on top of the Unity ML-Agents Toolkit Release 20. [46] During training, two visual drones with a field of view (FOV) of 82.5° are randomly spawned to search for two targets (G = 2) in a room of size 5 m × 5 m × 2 m with 2 obstacles. The initial TDL is set to ε0 = 0.1. All OOIs are randomly spawned and pre-checked to avoid any collisions at the beginning of each episode. The ASR is calculated over 500 time steps (T_w = 500) after each episode, and the TDL is adaptively adjusted according to rule (13) with ξ = 0.4, η = 0.001, and ASR_d = 0.7. Each episode allows a maximum of 5000 time steps, and we train the policy for 10 million episodes. We set N_max = 5 since the room size limits the number of agents in the team. Training is accelerated with 2 parallel environment instances, each containing 6 scenarios.
Considering that IPPO has shown similar or superior performance to both MADDPG and QMIX in complex tasks, [14] and that MA-POCA outperforms COMA, we use IPPO [20] and MA-POCA [18] as baselines and test their performance on different missions, such as 1 drone versus 2 targets (1D2T), 3D3T, etc., in the 3-obstacle room with the TDL set to ε = 0.2. Note that MADDPG [19] is also compared, based on the project https://github.com/4kasha/Multi_Agent_DDPG, to further support our baseline selection. An ablation study is further conducted to examine the effectiveness of the designed modules.

Environment Setup
In this article, the policy is first trained in simulated environments and then transferred to the physical world. The training scenario is developed with the Unity simulation engine, as illustrated in Figure 5, where the drones are trained to search for several target boxes covered with AprilTags while avoiding obstacles (blocks and walls). To improve the generalization capability of the policy, the block sizes are randomized at the beginning of each training episode. Each block is randomly placed in the space D and cannot overlap with other blocks. All OOIs are meshed with colliders (for collision detection) and labeled with agent, target, and obstacle tags. An agent disappears, and the number of agents N decreases by one, if it collides with other agents or obstacles. Similarly, a target disappears, and the number of targets G decreases by one, once it is approached by any agent. A training episode terminates if N = 0, G = 0, or the maximum allowable number of steps T is reached.

Hyperparameters for RL
Since all of our implementations are extensions of PPO and MA-POCA, they share the same policy update mechanism and RL hyperparameters. The critic and actor networks share the same network size. Table 3 lists the hyperparameters used for all training runs.

Hyperparameters for POMA
The parameters used in this work were tuned through trial runs across several modules. For instance, the changing rate η is chosen to reduce oscillations during training. Table 4 summarizes the other hyperparameters specific to POMA in the training environment.
All training and testing are carried out on a workstation running Ubuntu 20.04 with an AMD Zen 3 Ryzen 9 5900X CPU and an RTX 3090 Ti GPU. Unity 2021.3.11, ml-agents 0.27.0, PyTorch 1.8.2, and Python 3.8 are used to develop the simulation and the algorithm.

Domain Randomization
To improve the generalization of policies during training, several domain randomization techniques are adopted. Table 5 lists the parameters for the different types of domain randomization. Note that U(x1, x2) denotes the uniform distribution over [x1, x2], and B(c, r) denotes the uniform distribution over the sphere with center c and radius r. x_l, x_r, z_l, and z_u are the left, right, lower, and upper sides of the room, respectively. The positions of the targets depend on the TDL: when ε is larger, the mean distance between drones and targets along the L direction is larger. The light-intensity randomization provides generalization for visual perception.
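A per-episode randomization routine in this spirit could look as follows; only the room size follows the text, while all sampling ranges are illustrative placeholders rather than the values in Table 5:

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_randomization(tdl, room=(5.0, 5.0, 2.0)):
    """Sample per-episode domain randomization (illustrative sketch).

    The parameter ranges here are placeholders, not the paper's exact values;
    only the room size (5 m x 5 m x 2 m) follows the text. The target is pushed
    further along the search (L) direction as the task difficulty level grows.
    """
    L, W, H = room
    block_size = rng.uniform([0.2, 0.2, 0.2], [0.8, 0.8, 1.0])      # obstacle size
    light_intensity = rng.uniform(0.5, 1.5)                          # visual randomization
    drone_perturb = rng.uniform(-0.1, 0.1, size=3)                   # initial-pose noise
    target_x = rng.uniform(tdl * 0.5 * L, L)                         # deeper when TDL is high
    target = np.array([target_x, rng.uniform(0.0, W), rng.uniform(0.2, H)])
    return block_size, light_intensity, drone_perturb, target

params = sample_episode_randomization(tdl=0.2)
```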

Main Results
Table 6 lists the performance of our trained model in comparison with the baseline and ablation methods over different testing missions with 3 obstacles. Performance is evaluated with metrics including the ASR and the average time steps (ATS) over 500 episodes. More results are provided in Figure 7 and Table 7.

Baseline Comparison
In all multidrone missions, POMA consistently exhibits the highest ASR and lowest ATS, outperforming MA-POCA and IPPO. This can largely be attributed to its superiority in addressing sparse rewards and in capturing the effects of variable neighbors. IPPO performs better in single-agent missions but with the highest ATS, since only individual contributions are considered. IPPO also does not encourage collaboration: its success rate decreases in 3-drone missions compared with 2-drone missions because the room space becomes limited when more drones are involved. In contrast, MA-POCA and POMA can handle collaboration in constrained spaces, as their success rates increase when more drones are involved. Besides, as expected, MADDPG shows the worst performance in all missions due to its inefficiency in handling posthumous credit assignment.

Ablation Study
The ablation study shows that the AEO module improves the POMA framework in multiagent missions, as the performance of POMA drops considerably without it. The attention mechanism scores the neighbors and hence improves collaboration. However, the AEO module degrades performance in single-agent missions: the model performs best in single-agent missions without it, a side effect of introducing the extra Transformer network. Without the MCA module, POMA performs worse in single-agent missions, which verifies the effectiveness of the designed MCA module. In the no-ACL training, we fix ε = 0.2; the results show that the ACL module significantly alleviates the sparse reward issue and improves training efficiency, as illustrated in Figure 7.

Qualitative Analysis
The snapshots from the simulation are illustrated in Figure 6.
From (a) to (b), Agent 2 sees the two targets and navigates toward them, while Agent 3 searches for other targets ahead and Agent 1 stays behind. From (c) to (d), Agent 2 finds the last target while Agent 1 is moving forward to search. To avoid collisions, Agent 1 moves back to give Agent 2 space to navigate to the final target. These maneuvers show significant collaboration and the advantages of using a visual drone swarm to search for targets when prior information is not available.

Quantitative Analysis
The training curves regarding the ASR are illustrated in Figure 7.
From the curves, MADDPG shows the worst training performance and IPPO shows higher learning efficiency, while POMA obtains the highest success rate even under a higher TDL. Without the ACL module, it is hard to find the optimal policy in the sparse reward space, as observed in the POMA w/o ACL curve. Compared with the baseline MA-POCA, all designed modules contribute to the improvement of POMA.

For the physical experiments, the drones are controlled from decentralized edge computers via WiFi communication. To label whether a target has been found, the status of targets is broadcast from a central computer. The positions of OOIs are captured with a motion capture system and streamed from the central computer. Implementation details are provided in the following. We tested our model in 1D1T, 1D2T, 2D2T, and 3D3T missions.

Physical Implementation
The policy model is trained in simulation (2D2T mission) and deployed on several edge computers. The control framework is illustrated in Figure 9. The Robot Operating System (ROS) Melodic is deployed on all computers, with the central computer as the master node. Each edge computer subscribes to the pose topic from the central computer and the video stream from the connected drone. The obtained information is processed and passed into the loaded policy model using the ONNX runtime package on the edge computer. The output actions drive the Tello Edu drones to search for and navigate to targets via the Tello Edu SDK 2.0. To label the status of targets, a target-status topic (a vector) is created and maintained on the central computer and updated by all drones. If a target is found (its distance to a drone is within 5 cm), its status is labeled 1 and its position is set to infinity; when all targets are found, the drones stop flying. This avoids repetitive searching.
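A minimal sketch of the target-status synchronization on the central computer is shown below. The topic names, message types, and placeholder target positions are assumptions; only the 5 cm threshold and the set-to-infinity rule follow the text:

```python
# Sketch of target-status synchronization on the central computer (assumed setup).
import math
import rospy
from std_msgs.msg import Int32MultiArray
from geometry_msgs.msg import PoseStamped

NUM_TARGETS = 3
FOUND_DIST = 0.05                       # 5 cm "found" threshold from the text
target_status = [0] * NUM_TARGETS       # 0 = not found, 1 = found
target_pos = [[1.0, 1.0, 0.5], [2.0, 3.0, 0.5], [4.0, 2.0, 0.5]]  # placeholder positions

def on_drone_pose(msg):
    """Mark any target within FOUND_DIST of a drone as found."""
    p = msg.pose.position
    for k, (x, y, z) in enumerate(target_pos):
        if target_status[k] == 0 and math.dist([p.x, p.y, p.z], [x, y, z]) < FOUND_DIST:
            target_status[k] = 1
            target_pos[k] = [float('inf')] * 3   # stop drones from revisiting this target

rospy.init_node('target_status_manager')
pub = rospy.Publisher('/target_status', Int32MultiArray, queue_size=1)
rospy.Subscriber('/drone1/pose', PoseStamped, on_drone_pose)   # one subscriber per drone

rate = rospy.Rate(10)                   # broadcast the shared status at 10 Hz
while not rospy.is_shutdown():
    pub.publish(Int32MultiArray(data=target_status))
    rate.sleep()
```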

Main Results and Discussion
Figure 2 illustrates the performance of our trained model in a typical 3D3T mission. Three drones fly to search for targets while avoiding obstacles, with real-time video streams (see Figure 2a). Their trajectories show significant collaborative behavior: Tello 3 attempts to navigate to Target 1 after Tello 2 has already found Target 3 and Target 2 (see Figure 2b), while Tello 1 tries to find other hidden targets. These collaborations improve search and navigation efficiency, especially in time-critical missions such as SAR. Figure 10 illustrates the trajectories of the drones in different missions (1D1T, 1D2T, and 2D2T), demonstrating the effectiveness of sim-to-real transfer of the trained model across missions. Compared with traditional approaches, the DRL-based policy can be applied seamlessly to various missions without fine-tuning. The trained policy shows outstanding adaptiveness and scalability across scenarios with variable numbers of targets and agents.
Note that in simulation the target disappears when touched by a drone, while in the physical experiments it remains observable after being labeled "found". This affects the search performance (a drone may hover for some time in front of an already-found target), even though we designed a mechanism to avoid repetitive searching by setting the found target's position to infinity. To address this issue, rules could be added in the loop to guide the drone. In addition, the unstable position stream from the motion capture system also affects the search behavior; we observed that some drones hesitate to move at the corners of the room. Nevertheless, the search policy shows the resilience to finish the task and can be extended to various missions. More physical experiments are provided in the Supporting Information video.

Conclusion
This article presents the POMA framework, an advanced solution that leverages MADRL for the complex CMTSN problem. Our method incorporates a novel ACL module and a mixed individual-group credit assignment mechanism that balance individual and group contributions in sparse reward environments. Meanwhile, the attention mechanism in POMA encodes variable local observations, leading to significant improvements in collaboration and scalability. Experimental results in drone simulations demonstrate that our model attains superior performance over the baseline methods.
We further deploy the trained model on a visual drone swarm and conduct physical tests in different missions. Real-world flight experiments across these missions demonstrate the effectiveness and generalization capability of our approach. The framework's ability to balance individual and group contributions in sparse reward environments, coupled with its demonstrated effectiveness in drone swarm management for real-world applications, underscores its potential to enhance efficiency, decision-making, and adaptability across diverse and critical fields where collaborative behaviors are required. In future work, robustness to communication loss and human guidance in the loop will be investigated to improve the performance of our policy in more complex scenarios.

Figure 2 .
Figure 2. a) Physical experiment in a 3 drones versus 3 targets (3D3T) mission. The initial positions are labeled with directions. b) Trajectories of the 3 Tello drones from the motion capture system. Their trajectories show significant collaborative behavior: Tello 3 attempts to navigate to Target 1 considering that Tello 2 has already found Target 3 and Target 2, while Tello 1 tries to find other hidden targets.

Figure 3.
Figure 3. Overview of our POMA framework. A central behavior manager controls the TDL of the environment with an ACL feedback module. The MCA module integrates the individual critic with the group critic to address individual-group conflicts. The AEO encoder with a self-attention mechanism improves the generalization capability of the policy with variable neighbors.

Figure 4.

Figure 5 .
Figure 5. The simulation environment for CMTSN with 3 drones, 3 targets (3D3T), and 3 obstacles (3O) of different sizes. The task is successful only if all targets are found and approached. The left three views are the drones' visual perception.

Figure 9 .
Figure 9. Each edge computer (Jetson Orin NX) connects to the WiFi hotspot of its Tello Edu drone (AP mode with different names) and is meanwhile wire-connected to the TP-Link router with a distinct fixed IP address (192.168.0.121-192.168.0.123). The central computer is wire-connected to the OptiTrack system and streams the pose information via WiFi through the TP-Link router (192.168.0.100).

Table 1 .
Comparative analysis of different methods for MTS.

Table 2 .
Reward function used for various methods.

Table 3 .
Hyperparameters for all RL algorithms.

Table 4 .
Parameters for the POMA and environment.

Table 6 .
Comparison with baseline and ablation models over different testing missions with 3 obstacles (ASR ↑ / ATS ↓; the best performance is highlighted in bold).

Table 7 .
Performance comparison in large-scale missions (ASR ↑ / ATS ↓; the best performance is highlighted in bold).