A Multiobjective Collaborative Deep Reinforcement Learning Algorithm for Jumping Optimization of Bipedal Robot

Due to the nonlinearity and underactuation of bipedal robots, developing efficient jumping strategies remains challenging. To address this, a multiobjective collaborative deep reinforcement learning algorithm based on the actor‐critic framework is presented. Initially, two deep deterministic policy gradient (DDPG) networks are established for training the jumping motion, each focusing on different objectives and collaboratively learning the optimal jumping policy. Following this, a recovery experience replay mechanism, predicated on dynamic time warping, is integrated into the DDPG to enhance sample utilization efficiency. Concurrently, a timely adjustment unit is incorporated, which works in tandem with the training frequency to improve the convergence accuracy of the algorithm. Additionally, a Markov decision process is designed to manage the complexity and parameter uncertainty in the dynamic model of the bipedal robot. Finally, the proposed method is validated on a PyBullet platform. The results show that the method outperforms baseline methods by improving learning speed and enabling robust jumps with greater height and distance.


Introduction
Bipedal robots hold a significant position in robotics research. Their bipedal structure allows them to emulate human locomotor functions, adapting to human living and working environments and facilitating seamless collaboration. This offers enriched application scenarios and research opportunities. [1] However, realizing human-like flexibility in such robots presents a considerable challenge. Jumping, a crucial human motor ability, provides robots with enhanced flexibility and improved adaptability to irregular and complex terrains. [2] The robot's jumping process encompasses three swift contact transitions. First, in the takeoff phase, the robot needs to generate ample force to lift off. During the flight phase, the robot, being in an underactuated state, lacks sufficient contact points for effective motion control. Finally, in the landing phase, the robot must adjust its body posture to regain ground contact, minimizing impact force and ensuring a safe, stable landing. Managing the multivariate, strongly interconnected jumping process presents a significant challenge, particularly for high-dimensional bipedal robots, which further complicates advancements in jump height and stability.
[5] Some have utilized pneumatic actuators and tendon-driven mechanisms to achieve stable jumping. [6] Others have integrated bird-like legs and perching capabilities to enhance performance. [7] Despite initial successes, these biomimetic mechanisms have limitations in control algorithms and gait diversity, and rely heavily on the rationality of their structural design.
To study the jumping dynamics of humanoid robots with distributed mass, some researchers have simplified robot models and planned trajectory constraints using principles of dynamics, together with trajectory tracking control methods. For instance, Xiong et al. modeled the robot as a spring-mass system for trajectory optimization to achieve dynamic jumping. [8] However, these methods overlooked stability and other constraints during jumping. To address stability, simulations and experiments have implemented online trajectory generation based on resolving the Eulerian zero moment point (ZMP). [9,10] Without forcing the angular momentum to zero, undesirable torso fluctuations were greatly reduced. However, lacking scalability and compatibility, it was difficult to incorporate overlooked constraints such as joint limits or to append new tasks to the scheme. Hong et al. generated optimal vertical center of mass (CoM) trajectories using asymptotic optimization for DARwIn, [11] while Shen et al. optimized energy-efficient and maximum-height jumping motions via nonlinear programming. [12] However, these methods were limited to specific robots in certain environments. Although they presented improved results, they lacked scalability when robot and environment parameters changed.
Deep reinforcement learning (DRL) has emerged as a promising approach for bipedal robot motion control in recent years. DRL is an artificial intelligence method that leverages the strengths of neural networks and RL. It requires no prior knowledge and derives optimal policies through trial-and-error interactions with the environment. [13,14] As such, DRL has seen widespread use in novel controller design. [15,16] Bellegarda et al. developed a DRL-based approach enabling robust jumping on uneven terrain. [17] While DRL has shown great potential for robot motion control, several challenges remain. Previous work has primarily focused on walking control, with limited research on bipedal robot jumping. Moreover, jumping control involves high-dimensional state spaces, which a single DRL framework may not adequately address. Directly training jumping strategies could also lead to overly conservative robot behavior due to lack of experience and inefficient action selection. The goal of this study is thus twofold. First, we seek to reduce the sample complexity of bipedal robot learning and expedite the acquisition of stable jumping skills. Second, we aim to enable high-performance jumping capabilities through a more sample-efficient DRL method. The main contributions of this work are summarized as follows: 1) A multiobjective collaborative deep reinforcement learning (MCDRL) method based on the actor-critic (AC) framework is proposed to address poor robustness and height limitations in bipedal robot jumping (Section 4.1). The method establishes two deep deterministic policy gradient (DDPG) networks for the key jumping objectives of stability and height. The interacting networks collaboratively learn to achieve globally optimal control and ensure high, stable jumping. 2) To promote network convergence, a dynamic time warping (DTW)-based recovery experience replay mechanism is created that samples experiences with varying probabilities, improving experience utilization efficiency and accelerating training (Section 4.2). Moreover, a timely adjusted discount factor is incorporated to alleviate large value function fluctuations in the AC framework and enhance convergence accuracy (Section 4.3). 3) Considering the high complexity and uncertainty of bipedal robot dynamics, a Markov decision process (MDP) model for jumping gait control is established that reduces the number of adjustable variables and the action space (Section 4.4). Various reward functions are devised in the model to optimize the learned control strategy.

Traditional Jumping Control Methods
Except for specific jumping robots designed based on biological principles, [3-5,7] traditional robotic jumping methods usually fall into two types. One uses a motion model and the expected motion to plan a reference trajectory for jumping, with various constraints imposed during trajectory planning. The other optimizes parameters using control algorithms, then calculates the trajectories of each joint angle using kinematic models and tracks them online with the controller. To ensure the jump stability of bipedal robots, researchers have developed various methods. For instance, an online trajectory generation method based on Eulerian ZMP resolution has been used to reduce torso angle fluctuations without forcing the angular momentum to zero. [18,19] Some have divided the jumping motion into multiple phases, mathematically modeled each phase, and then optimized the trajectory to achieve long-displacement forward jumps. [20] However, due to nonconvex constraints and high dimensionality, jump trajectories are often difficult to optimize reliably. To address this, a two-layer optimization approach has been used to quickly find the optimal trajectory for jumping. [21] In addition, hybrid planning has been utilized to eliminate redundancies and find dynamically executable trajectories for continuous jumps. [22] Another approach has been to divide the CoM trajectory of the robot into vertical and horizontal directions to maintain the speed and stagnation time for jumping. [23] Although effective, these approaches face the challenges of complex modeling and high computational costs.

RL-Based Jumping Control Methods
As RL does not require an exact mathematical model, researchers are increasingly applying RL methods to control the motion of bipedal robots. For instance, Gil et al. used RL to learn the joint angles of a bipedal robot, obtaining a sequence of poses that enabled the robot to achieve the fastest linear walking motion and the farthest distance in the shortest time. [24] With the further development of machine learning, researchers have proposed DRL by combining RL and deep learning, yielding an end-to-end artificial intelligence algorithm. [25,26] [29] Escontrera et al. optimized the reward function mechanism to train a natural robot movement policy, obtaining results that outperform other algorithms. [30] In contrast, Batke et al. developed a running gait for a bipedal robot, but their reference movement required additional parameterization, which was prescriptive and limited the robot's ability to respond flexibly to the environment. [31] Despite the various skills that have been developed for bipedal robots, such as walking and running, there has been limited research on using DRL to enable bipedal robots to jump.

Background
In this section, a brief overview is provided on the background of DRL and the DDPG network, which is utilized in the construction of the MCDRL algorithm.

DRL Background
RL is a machine learning paradigm in which an agent learns the optimal policy for sequential decision making through interactions with the environment so as to maximize cumulative rewards. In RL, the agent-environment interaction is commonly modeled as an MDP defined by the tuple (S, A, P, R). Here, S denotes the state space, representing all possible states of the system; A denotes the action space, representing all actions available to the agent; P refers to the state transition probability function, defining the probability of transitioning from the current state s_t to the next state s_{t+1} after taking action a_t, formulated as p(s_{t+1} | s_t, a_t), where t indexes the time step; R denotes the reward function, defining the immediate scalar reward r_t obtained by taking action a_t in state s_t and transitioning to s_{t+1}, formulated as r(s_t, a_t, s_{t+1}). The jump control problem is formulated as an MDP, where the jumping robot is modeled as an agent that can perceive states and take actions. At each time step t, based on the observed state s_t, the agent takes action a_t, resulting in a transition to a new state s_{t+1} and a reward r_t returned from the environment. The objective of RL is to maximize the expected discounted cumulative reward E[∑_{t=0}^{T} γ^t r_t] over an episode of length T, where γ is a discount factor. RL encounters the curse of dimensionality when confronted with large or continuous state and action spaces, hindering its direct application. To overcome this limitation, function approximators become indispensable: they allow RL to generalize across states and actions, enabling more effective representation and learning in complex environments. DRL approximates the policy using deep neural networks (DNNs) as function approximators. Specifically, DRL employs DNNs to learn an optimal policy π*: S → A that maps states to actions. The parameters of the DNNs are optimized through extensive interactions with the environment, during which the agent experiences sufficient state transitions and learns from trial and error. The framework diagram is shown in Figure 1.
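To make the rollout and the discounted return concrete, the following minimal Python sketch implements the generic agent-environment loop described above; the env and policy objects (and their reset/step interfaces) are placeholders, not components of the paper's implementation.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward sum_t gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Generic agent-environment interaction loop for an MDP.

    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done), and `policy`
    maps a state to an action; both are placeholders, not the paper's implementation.
    """
    state = env.reset()
    transitions, rewards = [], []
    for _ in range(max_steps):
        action = policy(state)                        # a_t = pi(s_t)
        next_state, reward, done = env.step(action)   # environment returns s_{t+1}, r_t
        transitions.append((state, action, reward, next_state))
        rewards.append(reward)
        state = next_state
        if done:
            break
    return transitions, discounted_return(rewards, gamma)
```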

DDPG Background
The DDPG algorithm can be applied to decision-making problems with continuous action spaces. DDPG adopts an AC framework wherein the actor network generates actions, while the critic network evaluates the state-action value function. Specifically, the actor network is trained to output optimal policies by maximizing the expected reward, while the critic network learns to assess the quality of state-action pairs. The network architecture is depicted in Figure 2.
The update process of the DDPG algorithm can be described by the following formulas. Here, θ_μ and θ_q denote the parameters of the actor and critic networks, respectively. The discount factor is denoted by γ, and λ represents the smoothing parameter used during the update of the target networks. The parameters of the target actor and critic networks are represented by θ_μ′ and θ_q′, respectively.

The parameters of the critic network are updated by minimizing the mean squared temporal difference error

L(θ_q) = (1/N) ∑_i (y_i − Q(s_i, a_i | θ_q))², with y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ_μ′) | θ_q′),
θ_q ← θ_q − η_q ∇_{θ_q} L(θ_q),

where N represents the batch size and η_q denotes the learning rate of the critic network.

The parameters of the actor network are updated by ascending the deterministic policy gradient

∇_{θ_μ} J ≈ (1/N) ∑_i ∇_a Q(s, a | θ_q)|_{s=s_i, a=μ(s_i)} ∇_{θ_μ} μ(s | θ_μ)|_{s=s_i},
θ_μ ← θ_μ + η_μ ∇_{θ_μ} J,

where N represents the batch size and η_μ denotes the learning rate of the actor network. The parameters of the target critic and actor networks are updated by the soft updates

θ_q′ ← λ θ_q + (1 − λ) θ_q′,  θ_μ′ ← λ θ_μ + (1 − λ) θ_μ′.
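For illustration, the following PyTorch sketch performs one DDPG update consistent with the equations above; the network modules, optimizers, and the convention that critics take (state, action) pairs are assumptions for this sketch rather than details of the authors' code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, lam=5e-3):
    """One DDPG update on a minibatch; tensors are assumed to have shape (N, dim)."""
    s, a, r, s_next = batch

    # Critic: minimize (y_i - Q(s_i, a_i))^2 with y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks: soft update theta' <- lam*theta + (1 - lam)*theta'
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for tp, sp in zip(tgt.parameters(), src.parameters()):
            tp.data.mul_(1.0 - lam).add_(lam * sp.data)
```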

Multiobjective Collaborative Deep Reinforcement Learning Algorithm
To address the issues of low learning efficiency and suboptimal utilization of experience in DRL methods for bipedal robot jump control, and to achieve more stable and higher jumps, we propose the MCDRL algorithm. The robot jumping task is modularly decomposed, and two collaborative agents are trained to control two key objectives of the jump. This generates an improved and easier-to-implement jumping control strategy. Figure 3 illustrates the framework of the proposed algorithm. This paper describes the approach in four parts. First, we present the framework of MCDRL. Second, we delineate a recovery experience replay mechanism introduced to quickly train an effective strategy. Third, we detail the parameter update laws of adaptive optimization to improve model accuracy. Finally, we construct the MDP model.

The Framework of MCDRL
We address the bipedal robot jumping task using a modularized cooperative DDPG approach. Specifically, two DDPG algorithms are implemented to train autonomous agents for stability control (the stability DDPG agent) and height control (the height DDPG agent), respectively, within the continuous action space. Each DDPG architecture contains an actor network and a critic network (denoted A_i and C_i, i = 1, 2), as well as the corresponding target networks. The actor network generates the control action for each agent based on the observed environment, while the critic network evaluates the quality of actions. Target networks aid stable learning and prevent overestimation during training. The agents are trained collaboratively through action coupling and cross-network information exchange. During training, the actor output of each agent is added as an additional input to the critic network of the other agent. In this process, each agent is an active participant in decision-making, adjusting its strategy based on the behavior of the other agent. In contrast, in distributed RL, collaboration among multiple agents is typically achieved through alternative mechanisms such as information exchange: the main purpose of distributed RL is to utilize multiple parallel agents to speed up the learning process and directly optimize the global goal, and individual agents do not consider the strategies of other agents. The specific interaction can be written as Q_1(s, a_1, a_2 | θ_{1,q}) and Q_2(s, a_2, a_1 | θ_{2,q}), where a_i denotes the action of agent i and s denotes the state. Through this interactive training approach, the agents are able to adjust their strategies in response to one another's actions. Specifically, each critic network evaluates the synergy between its corresponding actor's output and that of the other agent, providing valuable feedback. This feedback is then used to adjust the actor network's output via backpropagation, thereby facilitating effective collaborative learning. Algorithm 1 outlines the MCDRL algorithm and its network updates in detail.
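A minimal sketch of such an action-coupled critic is given below, assuming a simple fully connected architecture; the layer sizes and interfaces are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class CoupledCritic(nn.Module):
    """Critic that scores the joint effect of both agents' actions, Q_i(s, a_own, a_other)."""

    def __init__(self, state_dim, own_action_dim, other_action_dim, hidden=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + own_action_dim + other_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, own_action, other_action):
        # The partner agent's action is treated as an extra input, so the gradient fed back
        # to the actor reflects how well the two policies cooperate.
        return self.net(torch.cat([state, own_action, other_action], dim=-1))
```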

Recovery Experience Replay Mechanism
The conventional DRL algorithm employs a single experience buffer, typically structured as a queue. While this design ensures exploration randomness and adheres to space constraints, it may inadvertently discard valuable experience data. Furthermore, achieving convergence within an acceptable timeframe often necessitates a substantial investment of training time. To address this limitation, recent studies have introduced the concept of partitioning the experience sample set based on experience importance. This partitioning distinguishes between near-policy experiences and far-policy experiences, enabling a higher probability of sampling near-policy experiences. [32] However, excessive sampling of near-policy experiences can lead to their overutilization, causing the gradient network to oscillate and increasing the susceptibility of the algorithm to becoming trapped in local optima.
To address the aforementioned issues, a recovery experience replay mechanism based on DTW is proposed. In this study, a recovery judgment operation is performed on experience samples that are about to be removed due to buffer capacity constraints. Potentially valuable experience samples are retrieved and stored in an experience recovery unit, from which they are subsequently selected with varying probabilities. Initially, the experiences obtained from the interaction between the bipedal robot and the environment are stored in the replay buffer, denoted as RM = {τ_1, τ_2, ..., τ_M}, where τ_i = {s_i, a_i, r_i, s_{i+1}}. When the buffer reaches its capacity, the DTW algorithm is employed to assess the similarity between the oldest experience sample τ_o and the new experience sample τ_n, facilitating the retrieval of potentially valuable experiences. The specific operation is as follows: 1) treat the two experience samples τ_o and τ_n as time series and convert them into feature vectors, i.e., τ_o = {o_1, o_2, o_3, o_4} and τ_n = {n_1, n_2, n_3, n_4}; 2) taking the two feature vectors as input sequences, measure the similarity between them with the DTW recursion (Equation (8)) D(i, j) = Dist(o_i, n_j) + min{D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)}, where Dist(o_i, n_j) represents the Euclidean distance between o_i and n_j (Equation (9)). A similarity distance threshold, denoted as φ, is employed to assess the similarity between samples; φ represents the average distance between experience samples in the replay buffer. When the DTW distance between the oldest experience sample τ_o and the new experience sample τ_n exceeds the threshold φ, τ_o is selected for retrieval and transferred to the experience recovery unit. During each iteration, the probability of sampling and updating parameters from the replay buffer is defined as α, while the probability of sampling and updating from the experience recovery unit is 1 − α. In comparison to existing studies, this recovery experience replay approach not only maintains the role of the experience replay mechanism and the diversity of samples but also preserves the positive influence of potentially valuable samples on the strategy.
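The sketch below illustrates one possible realization of this mechanism, assuming each transition is summarized by a fixed-length feature sequence and interpreting the recovery rule as "recover the oldest sample when its DTW distance to the new sample exceeds φ"; the class and parameter names are hypothetical.

```python
import random
from collections import deque
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two feature sequences (elements may be vectors)."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(seq_a[i - 1]) - np.asarray(seq_b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

class RecoveryReplayBuffer:
    """Replay buffer with a secondary recovery unit for old but potentially valuable samples."""

    def __init__(self, capacity, recovery_capacity, alpha=0.8):
        self.buffer = deque(maxlen=capacity)      # main replay buffer RM
        self.recovery = deque(maxlen=recovery_capacity)
        self.alpha = alpha                        # probability of sampling from the main buffer

    def add(self, transition, features, threshold):
        # `features` summarizes (s, a, r, s'); `threshold` plays the role of phi.
        # Assumed interpretation: recover the oldest sample when it is sufficiently
        # different (DTW distance above phi) from the incoming sample.
        if len(self.buffer) == self.buffer.maxlen:
            oldest, oldest_feat = self.buffer[0]
            if dtw_distance(oldest_feat, features) > threshold:
                self.recovery.append((oldest, oldest_feat))
        self.buffer.append((transition, features))

    def sample(self, batch_size):
        pool = self.buffer if (random.random() < self.alpha or not self.recovery) else self.recovery
        batch = random.sample(list(pool), min(batch_size, len(pool)))
        return [t for t, _ in batch]
```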

Timely Adjustment Strategy
In AC algorithms, the policy network aims to maximize expected returns, while the value function network aims to minimize the error in value estimation. The discount factor γ represents the weighting of future discounted rewards versus immediate rewards. Higher γ values emphasize long-term returns, whereas lower γ places more weight on short-term returns. However, during early training, when the value function has yet to converge, a high γ can induce instability in temporal difference errors and fluctuations in the value estimation. Therefore, it is preferable to initially use a lower γ focused on short-range rewards. As learning progresses, γ can be gradually increased to enhance long-term planning. This approach reflects threshold effects observed in economic psychology and can improve model convergence accuracy. Based on these insights, we propose an adaptive method for timely discount factor adjustment to facilitate gradual value expectation estimation, where γ_M is the maximum discount factor, episode denotes the number of training episodes, and δ_1 and δ_2 are tuning parameters. When episode = 0, the initial value is γ_0 = tanh(δ_1/δ_2) × γ_M. As episode increases, γ grows nonlinearly according to Equation (10). This adaptive discount factor places more emphasis on long-term rewards as training proceeds.
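Since Equation (10) is not reproduced here, the following sketch assumes one plausible tanh-shaped schedule consistent with the described behavior (a small initial γ that grows nonlinearly toward γ_M); the functional form and parameter values are assumptions.

```python
import math

def adaptive_gamma(episode, gamma_max=0.99, delta1=10.0, delta2=5000.0):
    """Timely adjusted discount factor: small early in training, growing toward gamma_max.

    A plausible tanh-shaped schedule; at episode 0 it reduces to tanh(delta1/delta2)*gamma_max.
    The exact form of Equation (10) and the parameter values are assumptions.
    """
    return gamma_max * math.tanh((episode + delta1) / delta2)
```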

State and Action Space
Continuous control of bipedal robotic locomotion in dynamic, uncertain environments poses significant challenges. State information forms the basis for decision-making and long-term reward estimation by bipedal robots. The design of the state space directly impacts the performance of control algorithms. When defining the state space, two key principles must be considered: relevance, by including variables affecting stability and height; and consistency, through conformity with bipedal robotic motion characteristics. The action space determines the achievable performance ceiling of an algorithm, as it represents how the robot interacts with its surroundings. Therefore, optimal action space design is critical for accomplishing superior motion control. Overall, careful consideration of state and action space design enables bipedal robots to make effective decisions and maximize dynamic control performance.
To enhance algorithm performance, factors such as the robot's motion state, dynamics, and environment are comprehensively considered in the state representation. The selected observation variables include: the position P_i = [x_i, y_i, z_i] of each joint i relative to the world coordinate origin; the linear velocity V_i = [v_{xi}, v_{yi}, v_{zi}] of each joint i; the relative rotation θ_i = [θ_i^{roll}, θ_i^{pitch}] of each joint i from its initial posture; the ground contact forces F = [F_N^l, F_N^r]; and the CoM offset CoM_off = [CoM_off,x, CoM_off,y, CoM_off,z] along the coordinate axes. The action space for the stability DDPG agent is defined as the lower-body joint angles, while for the height DDPG agent the action space comprises the upper-body joint angles.
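As a rough illustration, the following PyBullet-based sketch assembles part of such an observation vector (joint positions, velocities, and foot contact forces); the link indices, the use of the base position as a CoM-offset stand-in, and the omission of the relative-rotation terms are simplifications, not the paper's exact implementation.

```python
import numpy as np
import pybullet as p

def get_observation(robot_id, plane_id, joint_links, left_foot_link, right_foot_link):
    """Assemble part of the observation vector described above from a PyBullet simulation.

    The link indices are placeholders for the NAO model; the relative-rotation terms are
    omitted and the base position is used as a simple stand-in for the CoM offset.
    """
    obs = []
    for link in joint_links:
        ls = p.getLinkState(robot_id, link, computeLinkVelocity=1)
        obs.extend(ls[0])   # link world position [x, y, z]
        obs.extend(ls[6])   # link world linear velocity [vx, vy, vz]
    for foot in (left_foot_link, right_foot_link):
        contacts = p.getContactPoints(bodyA=robot_id, bodyB=plane_id, linkIndexA=foot)
        obs.append(sum(c[9] for c in contacts))  # element 9 of a contact point is the normal force
    base_pos, _ = p.getBasePositionAndOrientation(robot_id)
    obs.extend(base_pos)
    return np.asarray(obs, dtype=np.float32)
```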

Reward Space
Reward design is crucial in DRL to guide the agent toward desired goals. Sparse rewards can hamper optimal policy learning, whereas properly structured reward and punishment functions facilitate faster convergence. We define a comprehensive reward R comprising several terms to reinforce stable, coordinated bipedal locomotion. The reward R_1 motivates stable motion transformations in both transient and long-term decisions by penalizing instability. It is formulated as a weighted penalty on the deviations of the n sensor readings from their expected values, where n is the number of sensors, E_p and E_v are the expected position and velocity sensor values at each moment based on the prior timestep, respectively, and k_1 and k_2 are weighting coefficients.
The reward R_2 penalizes unstable body postures to reduce the likelihood of falls. The touchdown reward R_3 encourages equal contact forces on both feet for stability, where F_N^l and F_N^r are the left and right foot contact forces. The time reward R_4 minimizes the duration of the decision sequence, where t_i denotes the current moment. The stability reward R_5 enables smooth motions via joint angle gradients. The imitation reward R_6 utilizes the errors between predicted and actual joint values to guide learning, where P_l and CoM_off,l are prediction values obtained by applying the reference motion; the reference motion is a human jumping animation edited with MOCAP, collected from a public dataset. [33] The height reward R_7 drives upward movement, where v_z is the vertical velocity. The arm swing reward R_8 guides the upper body to assist jumping through angle and velocity matching, where θ_I and V_I are prediction values obtained by applying the reference motion. Together, by optimizing R = R_1 + ... + R_8, the agent develops coordinated whole-body policies for bipedal jumping.
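Because the individual reward formulas are not reproduced here, the sketch below only illustrates the structure of a few terms whose intent is stated in the text (contact-force symmetry, joint-angle smoothness, and upward velocity); the expressions and weights are assumptions.

```python
import numpy as np

def partial_reward(f_left, f_right, v_z, joint_angles, prev_joint_angles,
                   w3=0.01, w5=0.1, w7=1.0):
    """Illustrative structure for a few of the reward terms (R3, R5, R7 only).

    The exact formulas and weights are not given in the text; these expressions merely
    follow the stated intent of each term.
    """
    r3 = -w3 * abs(f_left - f_right)                    # touchdown: symmetric contact forces
    delta = np.asarray(joint_angles) - np.asarray(prev_joint_angles)
    r5 = -w5 * float(np.sum(np.abs(delta)))             # stability: smooth joint-angle changes
    r7 = w7 * max(v_z, 0.0)                             # height: reward upward vertical velocity
    return r3 + r5 + r7
```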

Experimental Contents and Configurations
In line with our endeavor to verify the jump optimization capabilities of MCDRL, a series of experiments was systematically conducted. The jumping gait of the NAO robot, an integrated, programmable humanoid robot standing 58 cm tall and weighing approximately 5.4 kg, was rigorously tested. This testing was performed within a PyBullet simulation environment, an open-source physics engine renowned for its applicability in diverse areas such as dynamics simulation and control. For the simulation, we loaded the URDF model file of the NAO robot into PyBullet. The NAO model has 26 degrees of freedom, partitioned as follows: 2 for the head, 5 for each of the left and right arms, 2 for the pelvis, and 6 for each of the left and right legs. The range of joint angles for the NAO robot is provided in Table 1. To simulate the robot's physical properties precisely, we extracted parameters such as the robot's physical dimensions and mass from the official NAO documentation and set them accordingly in PyBullet. Concurrently, the joint limits of the robot were set to ensure that the robot does not exceed its permissible range of motion during the jumping process.
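A minimal setup sketch along these lines is shown below; the URDF path, base height, and the practice of reading joint limits from the URDF are placeholders and assumptions rather than the exact configuration used in the paper.

```python
import pybullet as p
import pybullet_data

def setup_simulation(nao_urdf="nao.urdf", gui=False):
    """Minimal PyBullet setup sketch; the URDF path and base height are placeholders."""
    p.connect(p.GUI if gui else p.DIRECT)
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.81)
    plane_id = p.loadURDF("plane.urdf")
    robot_id = p.loadURDF(nao_urdf, basePosition=[0, 0, 0.35])
    # Read joint limits from the URDF so commanded angles stay within the permissible range
    joint_limits = {}
    for j in range(p.getNumJoints(robot_id)):
        info = p.getJointInfo(robot_id, j)
        joint_limits[info[1].decode()] = (info[8], info[9])  # (lower, upper) in radians
    return robot_id, plane_id, joint_limits
```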
Figure 4 depicts the action sequence of a forward jump executed by the bipedal robot in the PyBullet simulation environment. The robot entered the flight phase after takeoff at approximately 1.3 s and reached its maximum altitude at around 1.6 s. The robot successfully completed the entire jumping process, validating that the proposed algorithm was able to produce coordinated behavior for the jumping motion.

Performance Evaluation of Algorithms
To further evaluate algorithm efficacy, the proposed MCDRL algorithm was benchmarked against trust region policy optimization (TRPO), proximal policy optimization (PPO), and DDPG algorithms under identical experimental conditions.
Training Parameters: At each round, the robot state and position were reinitialized, and training was conducted for 40 000 rounds. The AC network architecture comprised two hidden layers of 500 nodes each. The smoothing and learning rates were set to λ = 5 × 10^−3 and η = 1 × 10^−3, respectively. A replay buffer of 10 000 samples was used. A total of 4 × 10^7 timestep interactions were simulated for policy optimization. The actor and critic networks were updated every 4 steps. Computationally, each trial required approximately 156 h to complete on a single GPU.
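For reference, the reported hyperparameters can be collected into a single configuration, as in the sketch below; the key names are illustrative, not taken from the authors' code.

```python
# Reported training hyperparameters collected into one place (key names are illustrative).
TRAIN_CONFIG = {
    "episodes": 40_000,
    "total_timesteps": int(4e7),
    "hidden_layers": (500, 500),      # two hidden layers of 500 nodes each
    "target_smoothing": 5e-3,         # lambda
    "learning_rate": 1e-3,            # eta
    "replay_buffer_size": 10_000,
    "update_every_n_steps": 4,
}
```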
Performance was analyzed by comparing the learning curves achieved by MCDRL, DDPG, TRPO, and PPO over iterations.
Figure 5 shows the episode length curves of the four algorithms over training time. For MCDRL, episode duration was initially shorter in the first 1 × 10^7 timesteps than in later stages, intuitively because the agent lacked a complete strategy and fell frequently at the outset. However, through learning, the agent gradually learned to avoid falls, realizing longer episodes. The increasing episode length of MCDRL after the initial 1 × 10^7 timesteps indicates that the agent expanded its exploration of the state space to discover an optimal policy. Compared to the other algorithms, MCDRL attained the shortest episode lengths and the fastest training. More rapid termination of episodes allows for more environmental interactions and parameter updates per unit time, thereby accelerating the learning process.

Table 1. The range of some of the joint angles of the NAO robot.
As depicted in Figure 6, all four DRL algorithms demonstrated convergent policy optimization following extensive exploratory training. The observation that MCDRL converged faster than the benchmarks suggested that the introduced recovery experience replay mechanism effectively accelerated the training process, enabling the agents to learn more efficiently from past experiences. Overall, the average reward obtained by the proposed MCDRL algorithm was higher than those achieved by the three comparative algorithms, implying that the MCDRL framework facilitated consistent identification of superior action sequences for acquiring the optimal bipedal jumping policy.
The jump heights of the robot controlled by the four algorithms were compared under the condition of maintained stability. Table 2 presents the average results from 200 experimental trials. As shown, the proposed MCDRL algorithm achieved the highest jump heights among the approaches, obtaining an average of 9.236 × 10^−2 m, a maximum of 9.41 × 10^−2 m, and a minimum of 9.17 × 10^−2 m. Among the remaining algorithms, the robot optimized by the DDPG algorithm achieved the best performance with an average height of 6.874 × 10^−2 m; MCDRL thus achieved jump heights more than 25% greater than DDPG.
Figure 7 shows a comparison of the CoM heights of the bipedal robot controlled by the four algorithms preceding the landing phase of the jump. As can be seen, under MCDRL control the robot achieved a deeper crouching posture, which induces higher forces at the leg joints and consequently results in a greater jump height. These results suggested that the MCDRL algorithm was more effective than the other algorithms at optimizing the jumping motion policy by permitting deeper sinking before launch, thereby augmenting the potential energy convertible to kinetic energy during the flight phase. Figure 8 compares the changes in knee joint angle of the bipedal robot controlled by the four algorithms. The joint angle trajectory of the robot under MCDRL control exhibited the smoothest transitions, signifying a high degree of stability throughout the jumping process. The jumping motion produced by the proposed algorithm effectively mitigated explosive unstable behavior at the knee joint and oscillatory fluctuations in the joint angle. These results suggested that the MCDRL algorithm was more effective than the alternative approaches at optimizing the jumping motion policy, generating smoother and more stable joint angle dynamics.
Minimal body oscillation during landing was desired for the NAO robot. In this study, the ZMP position was used to estimate the oscillation amplitude. The ZMP is an important parameter for describing a robot's balance and stability: if the ZMP remains within the support area, the robot is stable; if the ZMP moves beyond the support area, the robot may lose balance and fall over. Figure 9 illustrates the change in ZMP position during the landing phase of the bipedal robot controlled by MCDRL. The bounding range of the support polygon is indicated in the figure. The ZMP constantly fell within the support polygon, from which the stability of the bipedal robot during landing can be inferred.
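A simple way to express this stability criterion in code is a point-in-polygon test on the (x, y) ZMP coordinate, as sketched below; constructing the support polygon from the foot contact geometry is not shown, and the helper function is hypothetical.

```python
import numpy as np
from matplotlib.path import Path

def zmp_is_stable(zmp_xy, support_polygon_xy):
    """Return True if the (x, y) ZMP lies inside the support polygon.

    `support_polygon_xy` is an ordered list of polygon vertices; building it from the
    foot contact geometry is not shown here.
    """
    return Path(np.asarray(support_polygon_xy)).contains_point(zmp_xy)
```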
In summary, the MCDRL algorithm demonstrated significant advantages over the other algorithms with regard to training efficiency, height, and stability in controlling the bipedal robot. The experiments showed that MCDRL achieved higher training efficiency and faster convergence, properties crucial for practical applications. Additionally, the robot under MCDRL control attained a greater jump height and smoother joint angle transitions, signifying superior motion performance. Furthermore, stability during landing was enhanced for MCDRL, as evidenced by the ZMP remaining constantly within the support polygon. These findings suggested that the MCDRL algorithm constituted a promising technique for optimizing the jumping policy of bipedal robots, with the potential to augment execution capabilities and safety in robotic jumping tasks.

Conclusion
In this work, an MCDRL algorithm was proposed to optimize the jumping behavior of a bipedal robot. By decomposing the motion modularly, two DDPG algorithms were constructed to control the objectives of stability and height independently, interacting to ultimately produce a cooperative optimal jumping strategy that enables higher jumps with maintained stability. Additionally, an experience recovery replay mechanism was introduced to augment the influence of promising experiences, improving training speed, and a timely adjusted discount factor was used to improve network convergence accuracy. Through simulations and experiments, the proposed optimization method for bipedal jumping was demonstrated to be effective. Further studies are needed to investigate the robustness and adaptability of the MCDRL approach under varied conditions and environmental contexts.

Figure 1. Schematic diagram of the DRL framework.

Figure 2. The architecture of the DDPG network.

Figure 3. Schematic diagram of the overall framework of MCDRL.

Figure 5. Episode length values among the four algorithms for jump training.

Figure 6. Average rewards for the four jump training algorithms. The curves record the average reward obtained by each algorithm, with a different color for each; the proposed method showed the best performance in the early stages of training.

Figure 7. The CoM height before landing of the bipedal robot controlled by the four algorithms.

Figure 8. The variation of the knee joint angle of the bipedal robot controlled by the four algorithms.

Figure 9. The change of ZMP position during the landing phase of the bipedal robot controlled by MCDRL. The dashed lines mark the maximum and minimum values of the ZMP on the y-axis and x-axis.
Algorithm 1. MCDRL training procedure (excerpt): initialize the replay buffer D (capacity N) and the recovery unit DR (capacity N_R); at each step, execute both agents' actions, receive reward r_t, observe the new state s_{t+1}, store the transition τ = (s_t, {a_1,t, a_2,t}, r_t, s_{t+1}) in D, execute the recovery experience replay mechanism when D is full, sample from D and DR with probabilities α and 1 − α, and compute the target Q value.

Table 2. Average height under the four algorithms.