Bidirectional Obstacle Avoidance Enhancement‐Deep Deterministic Policy Gradient: A Novel Algorithm for Mobile‐Robot Path Planning in Unknown Dynamic Environments

Real‐time path planning in unknown dynamic environments is a significant challenge for mobile robots. Many researchers have attempted to solve this problem by introducing deep reinforcement learning, which trains agents through interaction with their environments. A method called BOAE‐DDPG, which combines the novel bidirectional obstacle avoidance enhancement (BOAE) mechanism with the deep deterministic policy gradient (DDPG) algorithm, is proposed to enhance the learning ability of obstacle avoidance. Inspired by the analysis of the reaction advantage in dynamic psychology, the BOAE mechanism focuses on obstacle‐avoidance reactions from the state and action. The cross‐attention mechanism is incorporated to enhance the attention to valuable obstacle‐avoidance information. Meanwhile, the obstacle‐avoidance behavioral advantage is separately estimated using the modified dueling network. Based on the learning goals of the mobile robot, new assistive reward factors are incorporated into the reward function to promote learning and convergence. The proposed method is validated through several experiments conducted using the simulation platform Gazebo. The results show that the proposed method is suitable for path planning tasks in unknown environments and has an excellent obstacle‐avoidance learning capability.


Introduction
Planning a safe and efficient path has become a key requirement for mobile robots. Robotic autonomy presents challenges for path planning, particularly in unknown dynamic environments without an advance map. In unknown dynamic environments, the main intention of a mobile robot is to move safely to a goal position without environmental information. Currently, the most widely used planning method is hierarchical path planning, which combines global and local path planning. [1] However, classical global path planning methods, such as Dijkstra, [2] the A* algorithm, [3] and the rapidly exploring random tree (RRT), [4] rely heavily on a prior global map, which is time-consuming and labor-intensive to build and maintain. Local path planning methods, such as the artificial potential field (APF), [5] dynamic window approach (DWA), [6] and timed elastic band (TEB), [7] may become trapped in a local minimum and require high-precision sensors to update the local cost map. Therefore, a path-planning method with low dependence on environmental information is needed.
Reinforcement learning (RL) [8] is a branch of machine learning that does not require prior information and enables an agent to learn a behavior policy by continuously interacting with the environment. However, traditional RL methods face significant challenges when dealing with complex action spaces or high-dimensional state spaces in real-world scenarios. Deep reinforcement learning (DRL) [9] incorporates the decision-making capability of RL and the perceptual capability of deep learning to achieve superior performance in complex scenes and tasks. [10,11] DRL-based path planning methods enable a mobile robot to learn to make continuous moving decisions according to real-time states, including sensor data, robot velocity information, and relative goal information. Consequently, DRL-based path planning methods provide a practical approach for addressing the challenges of mobile-robot path planning in unknown dynamic environments.
The deep deterministic policy gradient (DDPG) [12] is a classical DRL algorithm for continuous control problems that has been frequently applied to mobile-robot path planning in unknown environments. However, the obstacle-avoidance performance of the original DDPG algorithm is limited in complex and dynamic environments. On the one hand, laser range data are used as part of the input due to their lower sim-to-real gap compared with image data. With increasing environmental complexity, sparse laser range data ignore much critical information, whereas dense laser range data contain much redundant information. Therefore, increasing the utilization of valid information in dense laser range data is beneficial for obstacle avoidance. On the other hand, in goal-driven path planning tasks, the agent may hesitate when balancing the trade-off between avoiding obstacles and reaching the goal during training. Such hesitation is commonly caused by a sparse reward function, which has a detrimental effect on training and convergence. Preventing such trade-off hesitation is therefore also a problem worth considering.
To enhance the obstacle-avoidance learning ability of an agent and solve the above problems, we propose a novel method called BOAE-DDPG, which incorporates the designed bidirectional obstacle avoidance enhancement (BOAE) mechanism into the original DDPG algorithm. Human behavior involves a selective mapping from stimulus to reaction. Dynamic psychology suggests that the advantage of one reaction over another depends not only on the stimulus causing the reaction but also on the reaction itself. [13] This mapping from stimuli to reactions in dynamic psychology is analogous to the mapping from state to action in RL. Accordingly, when selecting the next action, the agent should consider not only the state but also the priority of different actions in the current context. Based on this motivation, and to train an agent with excellent obstacle-avoidance capabilities, the agent's focus on obstacle-avoidance reactions must be enhanced. Therefore, the core intention of the BOAE mechanism is to amplify the attention devoted to valuable obstacle-avoidance information and the corresponding behavioral advantages.
Thus, the BOAE mechanism shown in Figure 1 was designed based on the actor-critic framework and incorporated into the DDPG algorithm. The cross-attention [14] module is introduced to amplify the attention given to valuable information related to obstacle avoidance in the current context. Specifically, the BOAE mechanism places more emphasis on critical information, such as smaller values in dense laser range data within risky scenarios, which reduces unnecessary consideration of redundant information and improves the utilization of effective information. Additionally, a modified dueling network [15] is proposed to separate the obstacle-avoidance behavioral advantage from the overall behavioral advantage, thereby improving the learning efficiency of obstacle-avoidance behaviors. To further reinforce the agent's learning in the path planning process, new assisted reward factors are added to the reward function. These factors are designed to accelerate convergence and can also reduce hesitation.
The main contributions of this study are summarized as follows. 1) The BOAE mechanism, motivated by dynamic psychology, is applied to DDPG to enhance the learning ability of obstacle avoidance. 2) New assisted reward factors are introduced to guide the agent to avoid obstacles for faster convergence. 3) Experimental results on the Gazebo platform show that BOAE-DDPG outperforms DDPG and twin-delayed DDPG (TD3) [16] during the learning process.

Figure 1. Analysis of the reaction advantage in dynamic psychology is analogous to RL; thus, the BOAE mechanism is designed and implemented based on this motivation. It aims to enhance the agent's focus on obstacle-avoidance reactions by amplifying the attention devoted to valuable obstacle-avoidance information and the corresponding behavioral advantage. In detail, the BOAE mechanism combines the cross-attention module and modified dueling network with the actor-critic framework.
The remainder of this article is organized as follows. Section 2 reviews related studies on hierarchical and DRL-based path planning methods. Section 3 defines the problem, details the network structure and algorithm, and presents the designed reward function. Section 4 describes the simulation environment and analyzes the results. Finally, Section 5 concludes the study and discusses the future outlook.

Hierarchical Path Planning
The widely used hierarchical path planning for mobile robots comprises global and local path planning. [1] Global path planning determines a global path for the robot based on the goal position and global cost map, whereas local path planning dynamically adjusts the local path according to the global path and local cost map during the robot's movement. Essentially, the global path planner serves as the guide for the local path planner, and the local path planner acts as a bridge between the global path planner and the robot controller.
Extensive research has been conducted on global path planning algorithms, which can be broadly classified into graph search, random sampling, and intelligent bionic algorithms. Dijkstra [2] and A* [3] are efficient graph search algorithms in relatively simple 2D environments, but both suffer from high computational complexity in large-scale or high-dimensional environments. [17] The RRT [4] algorithm inspired a series of random sampling algorithms, which perform well in dynamic or high-dimensional environments. However, they converge slowly and thus require significant amounts of memory and computational resources to determine the optimal path. [18] The intelligent bionic algorithms, generally including the genetic algorithm, [19] ant colony algorithm, [20] and particle swarm optimization algorithm, [21] have also been explored extensively. However, they share common challenges such as a low convergence speed and a tendency to fall into local minima.
Commonly used local path planning algorithms include the APF, [5] DWA, [6] and TEB. [7] The APF method treats the movement space of the mobile robot as a virtual potential field, in which the final moving direction results from the repulsive force generated by obstacles and the attractive force generated by the target point. APF offers good real-time performance, low computational cost, and fast convergence; however, it is not suitable for complex environments. The DWA algorithm samples multiple groups of velocities in the velocity space, simulates their motion tracks over a specific time, and selects the velocity command corresponding to the optimal trajectory based on an evaluation function. The DWA has low computational complexity and supports real-time obstacle avoidance; however, it performs poorly in dynamic environments. The TEB algorithm modifies the initial trajectory generated by the global path planner to satisfy multiple constraints. Although the TEB has a better dynamic obstacle-avoidance effect than the DWA, it has high computational complexity and unstable control performance.

DRL-based Path Planning
Recently, research on DRL-based mobile-robot path planning has increased due to advancements in artificial intelligence technology. DRL provides a practical end-to-end learning method for a control strategy that enables agents to learn from the experiences generated by constantly interacting with the environment, and the learnt policy is the mapping from observable information to available actions. Specifically, deep learning is responsible for extracting environmental features from observable information, and RL selects an action according to the obtained environmental features and continuously updates the behavior policy based on the rewards from performing the selected action. DRL-based path planning methods can completely eliminate the dependence on map information and potentially achieve effective path planning in unknown environments.
The deep Q network (DQN) algorithm [22] proposed by Google DeepMind combines deep learning with Q-learning, achieving end-to-end learning from perception to action by using a neural network to approximate the value function. Tai et al. [23] proposed an end-to-end learning method based on the DQN to explore unknown environments; it uses depth images as the only input, outputs control commands, and trains the neural network weights end-to-end. Zhang et al. [24] and Xin et al. [25] used the DQN algorithm to train an end-to-end behavioral policy from visual perception to action. Because the DQN cannot address continuous control problems, the above methods simplify the control of the mobile robot into a discrete control problem, which is inconsistent with reality. Moreover, due to the high training cost, most DRL-based path planning algorithms are trained in a simulation environment; thus, the sim-to-real gap limits their application in the real world.
To address continuous control problems, Google DeepMind proposed the DDPG algorithm [12] by integrating the DQN algorithm with the actor-critic framework. DRL-based path planning methods that use laser range data as input have a low cost and a small gap to the real world and have attracted considerable interest in recent years. Tai et al. [26] used an asynchronous DDPG to train an end-to-end mapless motion planner whose input included sparse 10D laser range data, the previous velocity, and the relative target position. Based on this research, Jesus et al. [27] applied DDPG to the navigation of mobile robots in a virtual environment. Zhao et al. [28] proposed D-DDPG, which integrates a dueling network into the critic network to improve the estimation accuracy of the Q value. Gong et al. [29] and Zhou et al. [30] introduced long short-term memory (LSTM) into DDPG to achieve long-term capability in mapless navigation. Li et al. [31] added prioritized experience replay and dynamic delayed policy updates to the TD3 algorithm to improve path planning performance. However, these approaches developed from the DDPG algorithm consider sparse laser range data as part of the input and ignore critical information in complex and dynamic environments. The absence of critical information restricts the ability to learn obstacle avoidance. In addition, the sparse reward problem of the path planning task limits their learning effect. Overall, obstacle-avoidance learning ability in complex and dynamic environments should be improved by solving the above problems.

Experimental Section
The DRL framework provides a promising solution for mobile-robot path planning through end-to-end policy learning. The improved end-to-end model based on the proposed BOAE mechanism incorporates the actor-critic framework, cross-attention module, and modified dueling network. The BOAE-DDPG algorithm is built on this improved model and the original DDPG algorithm. [12]

Problem Definition
This study aimed to build a real-time path planner for mobile robots in unknown dynamic environments (Figure 2). We attempted to learn an end-to-end mapping from the robot's observable information to the velocity control command for goal-driven path planning:

v_t = f(s_t) = f(l_t, g_t, v_c)
where l_t is the 72D laser range data from the laser scanner mounted on the robot, g_t includes the relative distance and angle from the robot to the goal, and v_c includes the current linear and angular velocities of the robot. To obtain more valid environmental information, we set the field of view of the laser scanner to 360°. The dense laser range data are sampled at 5° intervals and range from 0.1 to 3.5 m. The velocity action command v_t includes the linear and angular velocities, and the robot observes a new state after executing the action. Thus, the robot can be controlled in real time by continuously executing an action command and observing its changing state.
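For illustration, the following minimal Python sketch assembles the 76-dimensional state from the three observation components described above. The text only states that all network inputs are normalized, so the min-max scaling and the maximum goal distance goal_dist_max used here are assumptions; the laser range and velocity limits follow Table 1.

```python
import numpy as np

def build_state(laser_ranges, goal_dist, goal_angle, v_lin, v_ang,
                laser_max=3.5, goal_dist_max=10.0, v_max=(0.25, 1.5)):
    """Assemble s_t from l_t (72 laser readings), g_t (goal distance/angle),
    and v_c (current velocities); goal_dist_max is a hypothetical constant."""
    l_t = np.clip(np.asarray(laser_ranges, dtype=float), 0.1, laser_max) / laser_max  # 72-D
    g_t = np.array([goal_dist / goal_dist_max, goal_angle / np.pi])                   # 2-D
    v_c = np.array([v_lin / v_max[0], v_ang / v_max[1]])                              # 2-D
    return np.concatenate([l_t, g_t, v_c])                                            # 76-D state s_t
```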

Network Structure with the BOAE Mechanism
According to the analysis of the reaction advantage in dynamic psychology, [13] the superiority of one reaction over others can be attributed both to the stimulus of the reaction and to the reaction itself, which implies that the state and the action are both critical to increasing the relative advantage of a reaction. Therefore, the BOAE mechanism aims to enhance the robot's ability to learn obstacle avoidance by focusing on valuable obstacle-avoidance information and the corresponding behavioral advantage. The cross-attention module and the modified dueling network are used to realize the BOAE mechanism in BOAE-DDPG.

The primary objective of mobile-robot path planning is to reach the goal position, and the secondary objective is to avoid obstacles. Because different objectives focus on different types of information, the observable information can be divided into two parts according to its relevance to obstacle avoidance. The obstacle-avoidance state s_oa guiding the robot includes the laser range data l_t. The other state s_ot consists of the relative goal information g_t and the robot velocity information v_c. The dense laser range data adopted in BOAE-DDPG contain a large amount of redundant information; therefore, the robot should pay more attention to the information critical for obstacle avoidance and thus improve the utilization of environmental information. In the path planning process, the priority of the obstacle-avoidance reaction changes dynamically with the robot's scenario, which changes which information is critical. For example, in dangerous scenarios, the robot must prioritize obstacle avoidance and pay more attention to the information critical for avoiding obstacles. In addition, the determination of critical obstacle-avoidance information should consider not only s_oa but also s_ot. Therefore, BOAE-DDPG incorporates a cross-attention module at the front of the actor network. As shown in Figure 3, the cross-attention module is based on multihead attention, [14] takes s_oa and s_ot as inputs, and computes the attention-weighted obstacle-avoidance state s_oa^attn.

The dueling network [15] divides Q values into a state value and behavioral advantages, which achieves a more robust learning effect through decoupling. To enhance the focus on the advantage of the obstacle-avoidance reaction, BOAE-DDPG splits the behavioral advantage into an obstacle-avoidance behavioral advantage derived from s_oa and another behavioral advantage derived from s_ot.
The independent estimation of the obstacle-avoidance behavioral advantage further enhances the learning efficiency of obstacle-avoidance behaviors. The final Q value in BOAE-DDPG is calculated as follows.
Q(s, a) = V(s) + A(s_oa, a) + A(s_ot, a)

The overall network structure of BOAE-DDPG with the implementation of the BOAE mechanism is shown in Figure 3. All inputs of the actor and critic networks are normalized. In the actor network, the cross-attention module is used to obtain the attention-weighted obstacle-avoidance state s_oa^attn. Subsequently, s_oa^attn is combined with the relative goal information g_t and robot velocity information v_c of s_ot. The integrated 76D vector passes through two fully connected (FC) neural network layers with 256 nodes and is transformed into the linear and angular velocities through different activation functions. The linear velocity is constrained to (0, 1) through the sigmoid activation function, and the tanh activation function constrains the angular velocity to (−1, 1). The output action is multiplied by a_high to obtain the final velocity command executed by the mobile robot, where a_high comprises the robot's maximum linear and angular velocities. In the critic network, s_oa is combined with s_ot to predict the state value V(s). In parallel, the action v_t is combined with s_oa to predict the obstacle-avoidance behavioral advantage A(s_oa, a) and with s_ot to predict the other behavioral advantage A(s_ot, a). All three prediction networks contain two FC layers with 256 nodes, and each FC layer of the actor and critic networks uses the rectified linear unit (ReLU) as its activation function.
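As a concrete illustration, the following PyTorch sketch mirrors the structure described above: an actor whose cross-attention module re-weights the 72 laser readings using a query derived from s_ot, and a critic that predicts Q(s, a) = V(s) + A(s_oa, a) + A(s_ot, a) with three separate heads. The layer widths, activations, and output scaling follow the text; the embedding width, number of attention heads, and the specific way the attention weights are applied to the laser beams are assumptions, since these details are not given.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Re-weight the laser readings s_oa with multi-head attention conditioned on s_ot
    (d_model, n_heads, and the re-weighting scheme are assumptions)."""
    def __init__(self, n_beams=72, s_ot_dim=4, d_model=32, n_heads=4):
        super().__init__()
        self.beam_embed = nn.Linear(1, d_model)          # one token per laser beam
        self.query_proj = nn.Linear(s_ot_dim, d_model)   # query from goal/velocity state
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, s_oa, s_ot):
        tokens = self.beam_embed(s_oa.unsqueeze(-1))     # (B, 72, d_model)
        query = self.query_proj(s_ot).unsqueeze(1)       # (B, 1, d_model)
        _, w = self.attn(query, tokens, tokens, need_weights=True)  # w: (B, 1, 72)
        return w.squeeze(1) * s_oa                       # attention-weighted s_oa^attn

class Actor(nn.Module):
    def __init__(self, a_high=(0.25, 1.5)):              # max linear/angular velocity (Table 1)
        super().__init__()
        self.ca = CrossAttention()
        self.fc = nn.Sequential(nn.Linear(76, 256), nn.ReLU(),
                                nn.Linear(256, 256), nn.ReLU())
        self.head = nn.Linear(256, 2)
        self.register_buffer("a_high", torch.tensor(a_high))

    def forward(self, s_oa, s_ot):
        x = torch.cat([self.ca(s_oa, s_ot), s_ot], dim=-1)   # integrated 76-D vector
        v, w = self.head(self.fc(x)).chunk(2, dim=-1)
        v = torch.sigmoid(v)                                  # linear velocity in (0, 1)
        w = torch.tanh(w)                                     # angular velocity in (-1, 1)
        return torch.cat([v, w], dim=-1) * self.a_high        # scale by a_high

def _mlp(in_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

class Critic(nn.Module):
    """Modified dueling critic: Q(s, a) = V(s) + A(s_oa, a) + A(s_ot, a)."""
    def __init__(self, oa_dim=72, ot_dim=4, a_dim=2):
        super().__init__()
        self.v = _mlp(oa_dim + ot_dim)       # state value V(s)
        self.a_oa = _mlp(oa_dim + a_dim)     # obstacle-avoidance behavioral advantage
        self.a_ot = _mlp(ot_dim + a_dim)     # remaining behavioral advantage

    def forward(self, s_oa, s_ot, a):
        return (self.v(torch.cat([s_oa, s_ot], dim=-1))
                + self.a_oa(torch.cat([s_oa, a], dim=-1))
                + self.a_ot(torch.cat([s_ot, a], dim=-1)))
```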

Algorithm Details
The BOAE-DDPG algorithm shown in Algorithm 1 is built on the original DDPG, [12] which incorporates the experience replay mechanism and the target network method from the DQN [22] to minimize correlations between samples and provide consistent targets during temporal difference backups. In contrast to DDPG, BOAE-DDPG uses uncorrelated mean-zero Gaussian noise, which performs well, as the exploration noise instead of time-correlated Ornstein-Uhlenbeck noise. Moreover, to prevent the algorithm from converging to a local minimum, we employ an epsilon-greedy exploration strategy along with the exploration noise. This hybrid exploration strategy enables the agent to explore the environment more frequently by adopting random control commands under non-greedy conditions. The output action a_t = v_t(v, ω) at time step t is

a_t = clip(μ_θ(s_t) + n, a_low, a_high) with probability ε; otherwise, a_t is sampled uniformly from (a_low, a_high),

where n is the uncorrelated mean-zero Gaussian noise, θ represents the parameters of the online actor network, and ε is the greedy rate of the epsilon-greedy strategy. The environmental state s_t includes the laser range data l_t, relative goal information g_t, and robot velocity information v_c. The clip function limits the action with noise to (a_low, a_high), and random actions are sampled uniformly from (a_low, a_high). a_low and a_high are the robot's linear and angular velocity boundary values, which are determined by the robot model used in the experiment.
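This hybrid exploration rule can be sketched as follows; the noise standard deviation sigma is an assumed value, and the interpretation of ε as the probability of taking the greedy (actor) branch follows the description of ε as the greedy rate.

```python
import numpy as np

def select_action(actor_fn, state, a_low, a_high, eps_greedy, sigma=0.1, rng=np.random):
    """Greedy action (actor output + Gaussian noise, clipped) with probability eps_greedy;
    otherwise a uniformly random velocity command. sigma is an assumed noise scale."""
    a_low, a_high = np.asarray(a_low, dtype=float), np.asarray(a_high, dtype=float)
    if rng.random() > eps_greedy:                  # non-greedy branch: random command
        return rng.uniform(a_low, a_high)
    a = np.asarray(actor_fn(state), dtype=float)   # mu_theta(s_t)
    n = rng.normal(0.0, sigma, size=a.shape)       # uncorrelated mean-zero Gaussian noise
    return np.clip(a + n, a_low, a_high)           # clip to (a_low, a_high)
```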
Whenever the algorithm updates the actor and critic networks, a minibatch of samples B is sampled from the replay buffer. The critic network is then updated by one step of gradient descent to minimize the following loss function:

L(φ) = (1 / |B|) Σ_{(s, a, r, s′, d) ∈ B} ( Q_φ(s, a) − y(r, s′, d) )²    (5)

Algorithm 1 (update step): randomly sample a batch of transitions B; compute the targets using Equation (6); update the critic network using the gradient of Equation (5); update the actor network using Equation (7); update the target networks using Equation (8).

In Equation (5), B represents the samples used to update the networks, |B| is the batch size, and φ represents the parameters of the online critic network. The target Q value y(r, s′, d) is calculated as follows.
y(r, s′, d) = r + γ (1 − d) Q_{φ_targ}(s′, μ_{θ_targ}(s′))    (6)

where φ_targ represents the parameters of the target critic network and θ_targ represents the parameters of the target actor network. r represents the reward value, and d indicates whether the episode ends: d is set to 1 when the episode ends and to 0 otherwise. γ is the discount factor of the delayed reward, which adjusts the impact of the delayed reward on the target Q value. The actor network is updated by one step of gradient ascent using

∇_θ (1 / |B|) Σ_{s ∈ B} Q_φ(s, μ_θ(s))    (7)

The Adam optimizer is used to learn the parameters of the actor and critic networks. The target networks are updated using the soft update method; that is, the parameters of the target networks are moved a small amount toward the online parameters after a certain number of steps:

φ_targ ← τ φ + (1 − τ) φ_targ,  θ_targ ← τ θ + (1 − τ) θ_targ    (8)
where τ is the soft update factor, which typically takes a small value so that the target networks change slowly, improving learning stability.
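The full update described by Equations (5)-(8) can be summarized in the following sketch. It assumes PyTorch networks actor and critic with target copies actor_targ and critic_targ and Adam optimizers, treats the state as a single tensor for brevity (omitting the s_oa/s_ot split), and uses placeholder values for γ and τ.

```python
import torch
import torch.nn.functional as F

def update_step(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s2, d = batch                           # minibatch B from the replay buffer
    with torch.no_grad():                            # Equation (6): target Q value
        y = r + gamma * (1.0 - d) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)        # Equation (5): critic loss
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(s, actor(s)).mean()         # Equation (7): gradient ascent on Q
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):  # Equation (8): soft update
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```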

Reward Function for Path Planning
Path planning is a classical sparse reward problem in RL.
Researchers typically introduce assisted reward factors on top of the mainline reward factors to accelerate convergence and improve performance. The mainline reward factors for path planning commonly consist of a reward r_success when the robot reaches the goal and a penalty r_collision when the robot collides. Rewards for subgoals are the primary form of assisted reward factor, which increase the probability of completing the main goal by guiding the agent to complete the subgoals. In mobile-robot path planning, the agent frequently learns to reach the goal position through an assisted reward factor r_dis that varies with the distance between the agent and the goal. [26] The reward function r(s_t, a_t) comprises the reward factors described below, defined in terms of the following quantities.
d_t is the distance from the robot to the goal, d_succ is the distance threshold used to determine whether the robot has reached the goal position, d_c is the collision distance used to determine whether the robot has collided, and d_safe is the safety distance used to warn the robot of a collision risk. o_t is the relative angle from the robot to the goal, v_t^l is the linear velocity of the robot, and v_t^a is the angular velocity of the robot. l_min is the minimum value of all the laser range data observed by the robot, which identifies whether the robot is in collision or at risk of collision, and l_min^moving is the minimum value of the laser range data within 120° of the robot's moving direction, indicating the risk of collision in the forward direction of the robot. c_d, c_o, c_s, c_l, and c_a are positive scaling factors used to adjust the relative magnitudes of the reward factors and to ensure the centrality of the mainline reward factors.
The mainline reward factors are r_success and r_collision. The assisted reward factors for reaching the goal are r_dis and r_ori: in addition to r_dis, which encourages the robot to approach the goal position, r_ori is added to encourage the robot to move in the direction of the goal. Furthermore, r_safe, r_linear, and r_angular are assisted reward factors for avoiding obstacles, which can also reduce hesitation. Specifically, r_safe is responsible for maintaining a safe distance between the robot and obstacles to reduce collisions in risky scenarios. If there is a risk of collision in the robot's forward direction, the penalty r_linear discourages acceleration; conversely, if there is no such risk, the penalty r_angular discourages turning. These assisted reward factors are designed to expedite the agent's learning process by guiding the robot to avoid obstacles.
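Since the piecewise expressions of the reward factors are not reproduced above, the following sketch shows one plausible reading of the described reward structure. The distance thresholds and scaling factors follow Table 1, whereas the mainline reward magnitudes and the exact functional forms of r_dis, r_ori, r_safe, r_linear, and r_angular are assumptions.

```python
def reward(d_t, d_prev, o_t, v_lin, v_ang, l_min, l_min_moving,
           d_succ=0.25, d_c=0.2, d_safe=0.4,               # distance thresholds (Table 1)
           c_d=10.0, c_o=0.1, c_s=10.0, c_l=1.0, c_a=0.1,  # scaling factors (Table 1)
           r_success=100.0, r_collision=-100.0):           # mainline magnitudes (assumed)
    if d_t < d_succ:                  # mainline reward: goal reached
        return r_success
    if l_min < d_c:                   # mainline penalty: collision
        return r_collision
    r = c_d * (d_prev - d_t)          # r_dis: progress toward the goal (form assumed)
    r -= c_o * abs(o_t)               # r_ori: penalize heading away from the goal
    if l_min < d_safe:                # r_safe: penalize entering the safety margin
        r -= c_s * (d_safe - l_min)
    if l_min_moving < d_safe:         # risk ahead: discourage acceleration (r_linear)
        r -= c_l * v_lin
    else:                             # no risk ahead: discourage turning (r_angular)
        r -= c_a * abs(v_ang)
    return r
```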

Results and Discussion
Several experiments were conducted on the Gazebo simulation platform to compare the learning ability during training and the learnt obstacle-avoidance policies of the different DRL algorithms. All training and testing were performed on Ubuntu 20.04, and the Robot Operating System [32] was used to implement the communication between the algorithms and the simulation environments. We first validated the effectiveness of the introduced assisted reward factors on the original DDPG algorithm and then compared different algorithms during training and in random-goal performance tests. The compared algorithms were DDPG, TD3, and the proposed BOAE-DDPG. We also compared the effects of separately incorporating the cross-attention module (DDPG with CA) and the modified dueling network (DDPG with Dueling). Furthermore, DDPG incorporating the LSTM network (DDPG with LSTM) was used to compare the improvements of the LSTM and BOAE mechanisms. Finally, we conducted fixed-goal tests to compare the planning and generalization performance of BOAE-DDPG and DDPG. The relevant experimental parameters are listed in Table 1.

Simulation Setup
We constructed four simulation scenes in Gazebo for training and testing. A TurtleBot3 Waffle equipped with a 360 Laser Distance Sensor LDS-01, shown in Figure 4, was used as the DRL agent to learn obstacle avoidance. The LDS-01 is a 2D laser scanner with a 360° field of view and was used to collect laser range data around the robot. Maps of the simulation scenes are shown in Figure 5. Scene-1 was used for training; scene-2 and scene-3 were used for both training and testing; and scene-4 was used only for the generalization test. Scene-1 did not contain any obstacles except walls, whereas scene-2 and scene-3 added obstacles to scene-1. The distribution of static obstacles in scene-3 was more complex, and scene-3 contained two dynamic obstacles. Scene-4 was similar to scene-3 but with more static obstacles, more dynamic obstacles, and a denser obstacle distribution. The motion trajectories of the dynamic obstacles are marked by blue two-way arrows, and the goal of each episode is indicated by a red cube spawned at a randomly selected location within the marked green areas. The goal-generation areas were selected to keep the difficulty of the corresponding tasks as similar as possible.

Assisted Reward Factors Validation
To validate the effectiveness of the proposed assisted reward factors, we trained DDPG in scene-1 and scene-2 to compare the reward functions with and without the new assisted reward factors, which include r_ori, r_safe, r_linear, and r_angular. The agent was expected to acquire the skills required to reach the goal and avoid walls during training in scene-1. Subsequently, the model trained in scene-1 was trained in scene-2 to learn static obstacle avoidance. In detail, the agent was trained for 200 episodes in scene-1 and 2000 episodes in scene-2. We recorded the number of successful episodes at fixed intervals to compare the training performance.
The training results are shown in Figure 6. In scene-1, the algorithm trained with the reward function without the new assisted reward factors converged in ≈110 episodes, whereas that trained with the new assisted reward factors converged in only ≈50 episodes. In scene-2, the algorithm trained without the new assisted reward factors reached convergence in ≈1050 episodes, whereas that trained with the new assisted reward factors reached convergence in only ≈650 episodes. These improvements demonstrate that the assisted reward factors accelerate learning and thus promote convergence.

Training Analysis
The training processes were conducted in scene-1, scene-2, and scene-3 to compare the learning effects of the different algorithms within a certain number of episodes. The training strategy was to first train each algorithm in scene-1 to learn primarily the ability to reach the goal. The resulting networks were then used to initialize the network parameters for scene-2 and scene-3 at the beginning of training to learn obstacle avoidance. Note that during training in scene-1, the agent performed random actions in the first ten episodes to collect experiences for updating the networks in the early stage. Goals were randomly spawned in the goal-generation areas before the start of each episode, and the termination conditions for each episode were as follows. 1) Goal: the robot reaches the goal; 2) Collision: the robot collides; and 3) Overstep: the number of training steps exceeds the limit.
Figure 7a shows the training process in scene-1. The average return increased over the training episodes and gradually converged to a stable value. Because scene-1 was simpler than the other scenes, the trends of the training curves in scene-1 were similar among the different algorithms, although BOAE-DDPG learnt better and changed more smoothly than the other algorithms. Figure 7b,c shows the training processes in scene-2 and scene-3, respectively. In the comparison of BOAE-DDPG, DDPG, DDPG with LSTM, and TD3, BOAE-DDPG performed better in both scene-2 and scene-3. In scene-2, TD3 outperformed DDPG, mainly because TD3 improves the original DDPG through several strategies, including double-Q learning, delayed policy updates, and target policy smoothing. However, in scene-3, TD3 was only slightly superior to DDPG, which shows that with increasing scene complexity and the appearance of dynamic obstacles, the performance improvement from the tricks in TD3 is limited. In contrast, BOAE-DDPG improved the training effect more consistently than TD3, obtaining better results in both scene-2 and scene-3. DDPG with LSTM also achieved better results than DDPG and TD3, although its enhancement was inferior to that of BOAE-DDPG.
In the comparison of BOAE-DDPG, DDPG, DDPG with Dueling, and DDPG with CA, BOAE-DDPG performed better than the other algorithms in both scene-2 and scene-3. As shown in Figure 7b,c, the trends of DDPG with Dueling were similar to those of DDPG, but its final converged values were higher, which demonstrates that the modified dueling network can improve the learning efficiency to some extent. Similarly, the trends of BOAE-DDPG and DDPG with CA were relatively similar, and the final converged values of BOAE-DDPG were higher than those of DDPG with CA, which also confirms the role of the modified dueling network. Additionally, the average return values of the algorithms with the cross-attention module, BOAE-DDPG and DDPG with CA, were higher at the turning point of the curves. This indicates that integrating the cross-attention module enabled the algorithm to obtain relatively higher reward values during the early stages of training, meaning that the agent accumulated more experiences beneficial to learning, which increased the learning speed throughout the training process. These results show that the cross-attention module effectively improves learning performance by focusing on critical information in the obstacle-avoidance state.
The success rate and average return of the training process in all scenes are shown in Table 2, and the improvements indicated that BOAE-DDPG had better learning performance than the other algorithms.

Testing Analysis
To compare the performance of the behavior policies trained by the different algorithms, we conducted random-goal performance tests by running the trained policies for 2000 episodes in scene-2 and scene-3. The results are presented in Table 3. Because of the moderate complexity of scene-2, the results of all algorithms there did not differ significantly; however, BOAE-DDPG achieved a slightly higher average return than the other algorithms. In scene-3, BOAE-DDPG, DDPG with LSTM, and TD3 outperformed the other algorithms, and BOAE-DDPG was better than DDPG with LSTM and TD3.
We also conducted fixed-goal tests with fixed goal positions to observe the motion trajectories in detail. The fixed-goal test in scene-3 was used to further compare and analyze the path-planning performance of BOAE-DDPG and DDPG, and the fixed-goal test in scene-4 was used to compare their generalization. The results of the fixed-goal tests are shown in Figure 8; the selected goal positions in scene-3 were the same as those in scene-4. Whenever the DRL path planner output an action command, the robot's position was recorded, and these positions were combined to construct its motion trajectory.
The motion trajectories in Figure 8a show that BOAE-DDPG accomplished the six fixed-goal path planning tasks with higher quality than DDPG. Except for the sixth goal point, the trajectories of BOAE-DDPG in scene-3 were smoother than those of DDPG. DDPG collided at the first and fifth goal points; a common feature of these two goals was that they were located in corners. When the robot approached such a goal point, it also approached the corner walls. At this point, the robot hesitated between approaching the goal and moving away from the corner wall, eventually resulting in a collision. In contrast, BOAE-DDPG did not hesitate in any of the four corners of scene-3.
The fixed-goal test results in scene-4 were used to compare generalization because scene-4 was not used for training. The motion trajectories in Figure 8b indicate that BOAE-DDPG generalized better than DDPG. Although part of scene-4 was somewhat similar to scene-3, DDPG reached the goal point successfully only once, whereas BOAE-DDPG had only one collision, with a static obstacle. In addition, the collisions of DDPG were caused by dynamic obstacles, whereas BOAE-DDPG's single collision involved a static obstacle, indicating that BOAE-DDPG performed better in unknown dynamic environments.

Conclusion
This article proposes a novel DRL-based method called BOAE-DDPG for mobile-robot path planning in unknown dynamic environments. The designed BOAE mechanism applied in BOAE-DDPG was inspired by the analysis of the reaction advantage in dynamic psychology. [13] It aims to enhance the algorithm's ability to learn obstacle avoidance by increasing its attention to valuable obstacle-avoidance information and the corresponding behavioral advantage.
The key contribution of this study is the design and implementation of the BOAE mechanism. The BOAE mechanism enhances the algorithm's attention to valuable information in the obstacle-avoidance state using the cross-attention module. Furthermore, it enhances the algorithm's attention to the obstacle-avoidance behavioral advantage by estimating it separately using the modified dueling network. The combination of the cross-attention module and modified dueling network achieves an overall bidirectional enhancement of obstacle avoidance.
With the application of the BOAE mechanism and the new assisted reward factors, BOAE-DDPG can acquire obstacle-avoidance abilities without relying on environmental information. The validation results demonstrated that incorporating the new assisted reward factors accelerates learning and convergence. Moreover, the training results of the different algorithms showed that the algorithms with the dueling network exhibited higher learning efficiency, whereas the algorithms with the cross-attention module exhibited more effective learning performance. The test results for the policies trained using the different algorithms indicate that BOAE-DDPG outperformed the other algorithms in both scene-2 and scene-3 and outperformed DDPG in the generalization test in scene-4. However, further studies are required. Fundamentally, the proposed BOAE mechanism is implemented on the actor-critic framework to enhance the learning of obstacle avoidance; therefore, the learning of other abilities could theoretically be enhanced by combining a similar mechanism with DRL algorithms based on the actor-critic framework. Moreover, while laser range data provide the distances to obstacles, they do not provide the richer spatial information near the robot that is implicit in the image data of an RGB or RGB-D camera. Therefore, overcoming the sim-to-real gap of image data [33] or fusing laser range data with image data [34] is a promising research topic for DRL-based path planning methods. In addition, the training improvement obtained with LSTM is considerable; thus, the combination of the BOAE mechanism and recurrent neural networks is worth exploring.

Figure 2. DRL path planner to compute and output the real-time moving direction v_t of the mobile robot; its input includes laser range data l_t, relative goal information g_t, and current velocity v_c.

Figure 3. Network structure of BOAE-DDPG with the BOAE mechanism achieved by the cross-attention module and modified dueling network. Note that FC denotes a fully connected neural network layer.
Table 1. Experimental parameters.
Laser range: [0.1, 3.5 m]
Action lower boundary a_low: [0 m s−1, −1.5 rad s−1]
Action higher boundary a_high: [0.25 m s−1, 1.5 rad s−1]
Success distance d_succ: 0.25 m
Collision distance d_c: 0.2 m
Safety distance d_safe: 0.4 m
Scaling factors of the reward factors c_d, c_o, c_s, c_l, c_a: 10, 0.1, 10, 1, 0.1

Figure 5. Simulation scenes for training and testing. a) Scene-1. b) Scene-2. c) Scene-3. d) Scene-4. The red cube in (a) is the goal of the path planning task, and the green areas are the goal-generation areas. The static obstacles were modeled as wood-colored geometry, and the dynamic obstacles moving along the blue two-way arrows were modeled as white cylinders.

Figure 6. a) Number of successful episodes per 20 episodes during the training process in scene-1. b) Number of successful episodes per 100 episodes during the training process in scene-2.

Figure 7. Average return of different algorithms during the training process in a) scene-1, b) scene-2, and c) scene-3.

Figure 8. Results of the fixed-goal tests in a) scene-3 and b) scene-4.

Table 2. Results of training. Bold values indicate the best result among the compared algorithms.

Table 3. Results of the random-goal performance test. Bold values indicate the best result among the compared algorithms.