Learning passive policies with virtual energy tanks in robotics

Within a robotic context, we merge the techniques of passivity-based control (PBC) and reinforcement learning (RL) with the goal of eliminating some of their reciprocal weaknesses, as well as inducing novel promising features in the resulting framework. We frame our contribution in a scenario where PBC is implemented by means of virtual energy tanks, a control technique developed to achieve closed-loop passivity for any arbitrary control input. Albeit the latter result is heavily used, we discuss why its practical application at its current stage remains rather limited, which makes contact with the highly debated claim that passivity-based techniques are associated with a loss of performance. The use of RL allows us to learn a control policy that can be passivized using the energy tank architecture, combining the versatility of learning approaches and the system theoretic properties which can be inferred due to the energy tanks. Simulations show the validity of the approach, as well as novel interesting research directions in energy-aware robotics.


I. INTRODUCTION
Robotics is increasingly focusing on the development of control frameworks allowing the transition from industrial cages to unstructured environments. This transition carries the ambitious objective of stable and safe execution of complex tasks where robots coexist with other robots or possibly humans. The practical impossibility of carefully modeling the dynamics of the environment along those tasks makes the stability objective even more challenging, and model-based approaches poorly suited.
Passive controllers have been presented as a feasible solution to tackle this problem, as the stability of the closed-loop system is in principle independent of the external environmental interaction [1], [2]. Passive systems route physical energy rather than producing it, which is the ultimate reason why stability proofs (using energy functions as Lyapunov candidates) are conveniently assessed in this framework [3], [4]. A powerful technique allowing the passivization of any control action is represented by virtual energy tanks [5]-[7]. These store information about the physical energy flows through the system, and introduce at the control level a scalar quantity representing the energy budget of the controlled robot. A conditional check on this budget, with proper passive reaction strategies in case of its depletion, is what guarantees closed-loop passivity. The main drawback of passivity-based control methods (including those involving energy tanks) is the lack of optimization over a task performance metric in the design of the controller [8]. This fact led to the claim that passivity-based control methods are associated with a loss of performance, which becomes more severe as the complexity of tasks increases.
On the other side of the control theoretic spectrum, recent advancements in the machine learning community are leading to robots with an outstanding awareness of complex environments and tasks. The intelligence encoded in the control policies, normally optimized by means of performance-based metrics, reflects complex high-level decision-making strategies which are learned thanks to the availability of huge datasets [9]-[11]. The drawback of these families of approaches is the difficulty in guaranteeing system theoretic properties such as passivity and stability for the controlled system [12].
In this work we merge the passivity-based control (PBC) design involving energy tanks with the reinforcement learning (RL) framework, combining the system theoretic properties of the closed-loop system induced by the ultimately passive design with the high performance typical of the RL framework. This procedure, presented both in inference and in training, allows PBC to be meaningfully scaled to tasks requiring the learning of complex control policies.
Related work: We recognize related work casting the energy tank algorithm (seen as a task-agnostic passivization tool) into a framework using task-based information for performance augmentation of the underlying passive system. In [13] the authors introduce the so-called valve-based energy tanks (also used in, e.g., [14]), in which extra parameters are introduced on the energy tanks in order to embed control objectives in the design, beyond achieving passivity. Task specifications are translated into tank design rules by controlling the power flows through the system. The idea of using a reference power trajectory in combination with energy tanks is also present in [15], [16], where task-based specifications are used to tune the tank parameters. These works present very stiff task specifications and scale poorly to complex tasks in which high-level policies need to be learned. A family of methods using energy tanks and sharing a motivation similar to that of this work is [17], [18], where an explicit optimization problem is introduced to find the closest passive approximation of a given control action. The idea is to exploit the versatility of the energy tank architecture to perform an optimization that can be generalized and scaled to different tasks and whose outcome is a passively controlled system. An important difference between this work and the cited ones (beyond the specific technique used to solve the optimization) is that in [17], [18] the desired control action is already given, and the optimization aims at finding the closest passive approximation to it, imposed by a non-depletion constraint on the tank. We will comment on the difference between the approaches, which involves the degree of freedom represented by the initial state of the tank.
The paper is organized as follows. In Sec. II the PBC problem in robotics and its solution using energy tanks are thoroughly reviewed, followed by a discussion of the limitations of the methodology. In Sec. III the proposed scheme is presented, and it is validated in Sec. IV by means of different simulations. Conclusions and future work are sketched in Sec. V.

II. BACKGROUND AND MOTIVATION
We review, restricted to robotics, the motivation underlying PBC, and how energy tanks formally solve the problem of passivizing an arbitrary control input. We then discuss the limits of the approach, which are tackled with the subsequent RL integration.

A. Passivity-based control and energy tanks
Consider the dynamic model of an $n$-DoF robot in Lagrangian joint-space coordinates:
$$M(q)\ddot q + C(q,\dot q)\dot q + D(q)\dot q + \frac{\partial V}{\partial q}(q) = \tau, \tag{1}$$
where $q \in \mathbb{R}^n$ are the joint coordinates, $M(q) = M^\top(q) > 0$ is the inertia tensor, $C(q,\dot q)$ is the matrix collecting centrifugal and Coriolis terms, $D(q) = D^\top(q) \ge 0$ is the matrix collecting friction coefficients, $V : \mathbb{R}^n \to \mathbb{R}$ maps the joint coordinates into the total conservative potentials (e.g., gravity and elastic springs), and $\tau \in \mathbb{R}^n$ are the generalized forces at the joints, collocated with the joint coordinates $q$. The time derivative of the total mechanical energy $E(q,\dot q) = \frac{1}{2}\dot q^\top M(q)\dot q + V(q)$ verifies the inequality
$$\dot E = \dot q^\top \tau - \dot q^\top D(q)\dot q \le \dot q^\top \tau, \tag{2}$$
which is a statement of passivity for system (1) with the energy function $E(q,\dot q)$ as storage function. Inequality (2) states that system (1) is passive with respect to its energy function $E(q,\dot q)$ and its so-called power port $(\tau, \dot q)$, an input-output pair whose pairing $\dot q^\top \tau$ produces the scalar power flow through the system. Once (2) is integrated in time, the left-hand side $E(t) - E(0)$ represents the energy stored in the system, which is always less than or equal to the energy supplied through the power port, represented by the right-hand term $\int_0^t \dot q(s)^\top \tau(s)\,ds$. The positive dissipated energy $\int_0^t \dot q(s)^\top D(q(s))\dot q(s)\,ds$ determines the convergence rate to the equilibrium and acts as a passivity margin, i.e., the higher it is, the bigger the margin by which the passivity inequality (2) is satisfied¹. If applied to an (open-loop) physical system like (1), passivity just represents the thermodynamic statement of energy conservation, while if applied to a controlled system, passivity implies stability of the minimum of the storage function under weak conditions qualifying the latter as a Lyapunov function for the equilibrium [2], [4]. Next, we illustrate the full motivation behind the necessity of achieving a passive closed-loop system.
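The integrated form of the passivity inequality can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: it replaces the $n$-DoF robot with a hypothetical 1-DoF mass-spring-damper (so $M = m$, $C = 0$, $D = d$, $V = \frac{1}{2}kq^2$), drives it with an arbitrary torque $\tau(t) = \sin t$, and integrates with a semi-implicit Euler scheme. All names and parameter values are illustrative and not taken from the paper.

```python
import math


def simulate(m=1.0, d=0.5, k=2.0, dt=1e-4, t_end=10.0):
    """Integrate m*q'' + d*q' + k*q = tau and track the energy bookkeeping.

    Returns (stored, supplied, dissipated), where
      stored     = E(t_end) - E(0)
      supplied   = integral of q'(s) * tau(s) ds   (power port)
      dissipated = integral of d * q'(s)^2 ds      (passivity margin)
    """
    def energy(q, v):
        return 0.5 * m * v * v + 0.5 * k * q * q

    q, v = 0.0, 0.0
    e_start = energy(q, v)
    supplied = 0.0
    dissipated = 0.0
    for i in range(int(round(t_end / dt))):
        tau = math.sin(i * dt)            # arbitrary (hypothetical) input torque
        supplied += v * tau * dt
        dissipated += d * v * v * dt
        v += dt * (tau - d * v - k * q) / m   # semi-implicit Euler step
        q += dt * v
    return energy(q, v) - e_start, supplied, dissipated


stored, supplied, dissipated = simulate()
print(f"stored={stored:.4f} supplied={supplied:.4f} dissipated={dissipated:.4f}")
```

Up to the discretization error, the stored energy equals the supplied energy minus the dissipated one, so the stored energy never exceeds what entered through the port.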
Passivity as must: To understand the motivation behind a passive design, consider system (1) as an open one, i.e., a system that, independently of the way it is controlled, can interact with a new, dynamically unknown system that we call the s-environment². This is shown by considering system (1) in which we split the torque contributions into two terms, i.e.,
$$\tau = \tau_c + \tau_e, \tag{3}$$
where $\tau_c$ are the control torques that can be directly applied at the joints by means of the actuators, while $\tau_e = J^\top(q)F$ corresponds to the torques at the joints produced by an external interaction at the end-effector of the robot. Here $F \in \mathbb{R}^6$ is a vector collecting forces and torques at the end-effector in the workspace and $J(q)$ is the Jacobian matrix of the robot. The latter relates joint-space and work-space velocities and forces through the dual relations $\dot x = J(q)\dot q$ and $\tau_e = J^\top(q)F$, where $\dot x \in \mathbb{R}^6$ is the rate of the work-space coordinates $x$, collocated with the forces $F$, resulting in $\tau_e^\top \dot q = F^\top \dot x$. Specializing the power balance (2) to the uncontrolled system (1) with the split (3) one obtains
$$\dot E = \dot q^\top \tau_c + \dot x^\top F - \dot q^\top D(q)\dot q \le \dot q^\top \tau_c + \dot x^\top F, \tag{4}$$
i.e., the robotic system (1) is passive with respect to two different power ports: $(\tau_c, \dot q)$, whose pairing is the power flow through the actuators, and $(F, \dot x)$, whose pairing is the power flow through the s-environment, an external dynamical system with its own (unknown) dynamics. In this context, the PBC objective can be described as follows.
PBC objective. Given the system (1) with the split (3) inducing the power balance (4), design a control law on the port $(\tau_c, \dot q)$, possibly by designing a controller as a dynamical system in its own right, with state $x_c$ and energy $V_c(x_c)$, such that for the closed-loop storage function $\mathcal{E}(q,\dot q,x_c) = E(q,\dot q) + V_c(x_c)$ the passivity condition
$$\dot{\mathcal{E}} \le \dot x^\top F \tag{5}$$
is verified, i.e., such that the controlled system is passive with respect to the s-environment port $(F, \dot x)$.
This objective is considered of utmost importance (hence the passivity as must claim [1]) since, when an interaction takes place ($F \neq 0$), the dynamics of the closed-loop system depend strongly on the dynamics of the s-environment, which cannot be modeled in an oversimplified way, e.g., by assuming linearity. This claim is particularly valid if the task to be executed takes place in an unstructured, hazardous scenario where sources of disturbances like unforeseen interactions with humans are possible.
The passive design (5) aims at excluding the possibility that, when the robot interacts with a passive s-environment, an unbounded amount of energy can be produced during the interaction³. This property is referred to as contact stability.
For completeness, we report that passivity also induces a stronger property than contact stability, which is sometimes referred to as robust stability [17]: even if the s-environment is not a passive system, contact stability can be assured if the passivity margin of the controlled system covers the energy generated by the active s-environment. All these properties follow from the fact that passive systems are closed under power-preserving interconnection [2], [3], and we refer to [2] for further relations between passivity and different system theoretic stability properties. To conclude, in this work we are aligned with the necessity of achieving the PBC objective when a robotic task needs to be performed.
Energy Tanks: An extremely effective method to achieve the PBC objective relies on the energy tank control algorithm, which we describe next and whose basic schematic is represented in Fig. 1. Mathematically, an energy tank is a dynamical system that constitutes an atomic energy-storing element. It can be represented as the (lossless) 1D system:
$$\dot x_c = u_t, \tag{6}$$
$$y_t = \frac{\partial V_c}{\partial x_c} = x_c, \tag{7}$$
where the energy function is simply $V_c(x_c) = \frac{1}{2}x_c^2$, and as a consequence it presents a power port $(u_t, y_t)$ since $\dot V_c = y_t u_t$. The tank (6)-(7) is interconnected to (1) in a way that allows the implementation of a control action that fulfills the execution of some task while at the same time meeting the desired passivity constraint. This is possible thanks to a suitable power-preserving interconnection between the two systems, defined as follows:
$$\begin{pmatrix} \tau_c \\ u_t \end{pmatrix} = \begin{pmatrix} 0 & \dfrac{w}{x_c} \\ -\dfrac{w^\top}{x_c} & 0 \end{pmatrix} \begin{pmatrix} \dot q \\ y_t \end{pmatrix}, \tag{8}$$
where $w \in \mathbb{R}^n$ is the desired task-dependent control action to be passively implemented. This interconnection, which is the core of the energy tank algorithm, produces two effects: i) the desired action $w$ is correctly implemented, i.e., from the side of the robot (1), one obtains
$$M(q)\ddot q + C(q,\dot q)\dot q + D(q)\dot q + \frac{\partial V}{\partial q}(q) = w + J^\top(q)F; \tag{9}$$
and ii) at every time the mechanical power $\dot q^\top \tau_c$ injected into the robot by the actuators leaves the energy tank, since
$$\dot V_c = y_t u_t = -w^\top \dot q = -\dot q^\top \tau_c. \tag{10}$$
Evaluating the variation of the closed-loop storage function $\mathcal{E} = E + V_c$ one achieves the PBC objective (5), which shows that the interconnection indeed achieves a passive closed-loop system with respect to the s-environment port. The scalar $V_c(x_c)$ indicates the amount of energy that is still at the disposal of the control mechanism implementing the action $w$ before losing the described formal passivity. In fact, the interconnection (8), which is modulated by the tank state, becomes singular when $x_c$, and thus $V_c(x_c)$, is zero, representing the moment in which it becomes impossible to passively perform the desired action $w$. To meaningfully implement the energy tank algorithm, it is thus necessary to constantly observe the energy in the tank in order to implement a control action $\bar w$, rather than $w$, where
$$\bar w = \begin{cases} w & \text{if } V_c(x_c) \ge \varepsilon, \\ 0 & \text{otherwise}, \end{cases} \tag{11}$$
for some small positive energy level $\varepsilon$. In this way, formal passivity of the closed-loop system is recovered completely, since the system is simply detached from the controller at the moment in which no energy budget is left. Before discussing the peculiarities and limitations of the described algorithm, we introduce the concept of task energy, denoted by $e^*$ and first defined in [19], which is the minimum energy in the tank necessary to passively fulfill some task. In other words, if $V_c(x_c(t))|_{t=0} > e^*$, the tank never depletes along the whole task horizon. It is important to notice that $e^*$ does not depend on the task only, but also on the specific tank dynamics which is chosen: the choice of $u_t$ in (8) is not unique, and many variations have been proposed [5] (e.g., recirculation of dissipation in the tank) which would change the value of the task energy for the same task.
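The passivity claim for the tank interconnection can be made explicit with a one-line computation. The derivation below is a sketch under the standard tank assumptions used in this section: $V_c = \frac{1}{2}x_c^2$, an interconnection giving $\tau_c = w$ and $u_t = -w^\top \dot q / x_c$ (valid for $x_c \neq 0$), and a robot power balance of the form $\dot E = \dot q^\top \tau_c + \dot x^\top F - \dot q^\top D(q)\dot q$.

```latex
\begin{aligned}
\dot{\mathcal{E}} &= \dot E + \dot V_c \\
  &= \big(\dot q^\top \tau_c + \dot x^\top F - \dot q^\top D(q)\dot q\big) + y_t u_t \\
  &= \dot q^\top w + \dot x^\top F - \dot q^\top D(q)\dot q
     + x_c \Big(-\frac{w^\top \dot q}{x_c}\Big) \\
  &= \dot x^\top F - \dot q^\top D(q)\dot q \;\le\; \dot x^\top F .
\end{aligned}
```

The actuator power $\dot q^\top w$ cancels against the tank power, so only the s-environment port and the friction dissipation remain; when the switch sets the applied action to zero, both cancelled terms vanish and the same bound holds.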
Comments and limitations: i) The energy tank algorithm constitutes a clever way to disembody the implementation of a passive control action from any physical dynamics. In fact, there is no need to design the controller so that it mimics a physical system (e.g., in the PD case the controller can be seen as a control spring and damper): as long as a task is achieved through a control input $w$ and $e^*$ is finite, a passive implementation is possible by means of the energy tank algorithm, just by setting $V_c(x_c(t))|_{t=0} > e^*$; ii) The aforementioned initial energy assignment is a degree of freedom in the algorithm, whose implications are often naively addressed. In fact, a very high (yet bounded) energy initialization in the tank would technically still fulfill the PBC objective (5), yet it would create so-called "practically unstable behaviors" [5], [17]. As a consequence, a naive tank design de facto makes robust stability a property that is not connected to any safety guarantee. In fact, notice that until the energy in the tank is depleted, the control input $w$ is completely transparent to the tank algorithm, which reduces to a trick to formally prove passivity, with limited significance in contexts where tasks need to be performed in unstructured scenarios. The fact that the mechanical power flow is undefined in sign (i.e., when $\dot q^\top \tau_c < 0$ the tank gets refilled) worsens this criticality, which is sometimes addressed with empirical tank saturation arguments.
To conclude the discussion, both the task energy $e^*$ and the control action $w$ are often difficult to determine a priori, and they are not independent variables. For complex task executions, it is reasonable to take advantage of simulations to optimize for both variables in a combined way. Since the PBC objective can be naively achieved just by initializing the energy in the tank to a sufficiently high value, what we claim to be a useful system theoretic property in the energy tank context is the achievement of the PBC objective combined with an estimation of the task energy. In fact, the knowledge of the latter leads to a meaningful energy tank initialization, so that a depletion of the tank can be used as a diagnostic tool to detect an important divergence from nominal task execution, beyond formally achieving a passive closed-loop system.

B. Reinforcement learning
Reinforcement learning (RL) [20] is a model-free framework, consisting of an agent interacting with an environment, to solve optimal control problems stated as Markov decision processes (MDPs). MDPs are a mathematical formulation of decision-making in situations where outcomes are partly stochastic and partly under the control of a decision-maker. An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r)$: at any given state $s_k \in \mathcal{S}$ at time step $k$, the agent chooses and executes an action $a_k \in \mathcal{A}$ according to a learnt policy $\pi(a_k|s_k)$; the environment then transitions to a new state $s_{k+1} \in \mathcal{S}$ with the unknown state transition probability $p(s_{k+1}|s_k,a_k)$ and produces a reward $r_k = r(s_k,a_k)$. An important condition that characterizes MDPs is that the state transition probability satisfies the Markov property $p(s_{k+1}|s_k,a_k) = p(s_{k+1}|s_k,a_k,\dots,s_1,a_1)$ for any trajectory $s_1,a_1,\dots,s_k,a_k$, which means that the environment is memoryless because the transition to the next state depends only on the current state and action. The Markov property is usually fundamental to guarantee the convergence to the optimal policy in RL algorithms [20]. The goal of an RL algorithm is to find an optimal policy $\pi^*$ that maximizes the expected $\gamma$-discounted sum of future rewards (which we define as the return).
While the proposed framework could be combined with any RL algorithm, benchmarking is out of the scope of this work. For this reason, in the rest of the paper we only consider the off-policy soft actor-critic (SAC) algorithm [21] with continuous action spaces.
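As a minimal illustration of the return defined above, the following sketch computes the $\gamma$-discounted sum of a recorded reward sequence. The function name and the example rewards are hypothetical; the expectation over trajectories that an RL algorithm maximizes is not modeled here.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], evaluated backwards (Horner form)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # prints 1.75
```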

III. ENERGY TANKS MEET RL
The arbitrariness of the control input $w$ and its disembodiment from any physical dynamics are tempting motivations to choose it as a decision variable in an optimization framework. Indeed, the output of an RL control policy can be directly mapped to the torques $w$ commanded at the actuator level, as shown in Fig. 2. This map can trivially be the identity (in case the generalized forces are directly learned in the RL scheme) or be represented by a transducer (e.g., an internal PID controller that maps a position or velocity reference to the commanded torques $w$), which is often shown to critically improve sample efficiency in the RL framework [22]. No specific transducer between the action space and $w$ is required to implement the proposed scheme, as long as the commanded torques $w$ are accessible. Since MDPs are discrete-time processes, we start by exploiting a simple but powerful result that allows the energy tank algorithm to be reproduced at the discrete-time level.
The energy sampling approach: Let us denote by $e_k$ the level of energy in the tank at the discrete time step $k$, i.e., $e_k = V_c(kT)$. The energy in the tank at the next time step, initialized by $e_0$, is updated according to the rule
$$e_{k+1} = e_k - \Delta e_{k+1}, \tag{12}$$
where $\Delta e_{k+1}$ is the amount of energy exiting the tank at time $k+1$, which approximates, at a discrete-time level, the energy leaving the tank in a sampling interval of duration $T$, i.e., $-\int_{Tk}^{T(k+1)} \dot V_c(s)\,ds$. With a look at the continuous energy balance (10), this energy equals the sum of the energies used by each actuator of the robot in the time interval $[Tk, T(k+1))$: $\int_{Tk}^{T(k+1)} w(s)^\top \dot q(s)\,ds$. Now, assuming the torque signal is constant with value $w_k$ along the $k$-th sampling interval, and indicating $q_k := q(Tk)$, we can further manipulate the previous expression to define $\Delta e_{k+1}$ as:
$$\Delta e_{k+1} = w_k^\top (q_{k+1} - q_k). \tag{13}$$
It is worth remarking that, under the common conditions of constant torque along the sampling interval and position sensors collocated with the motors, the quantity so defined is the exact amount of energy injected by the actuators, independently of the value of $T$, and not an estimate. This property makes the connection between the discrete-time RL framework and the energy tank framework particularly appealing, since the tank algorithm can be exactly reproduced at a discrete-time level without the need to integrate the tank dynamics (6) with the input defined in (8).
The definition of $\Delta e_{k+1}$ in (13) exactly reproduces at a discrete-time level the tank algorithm as reviewed in Sec. II, but this choice is not unique, and variations are possible on the basis of the passivity margin that one wants to achieve for the closed-loop system. In the sequel of this work, we want to prevent refilling of the tank, and thus use the following update rule:
$$\Delta e_{k+1} = \begin{cases} w_k^\top (q_{k+1} - q_k) & \text{if } w_k^\top (q_{k+1} - q_k) \ge 0, \\ 0 & \text{otherwise}. \end{cases} \tag{14}$$
Such an update rule physically represents the impossibility of recovering negative mechanical energy flowing into the system, which is a reasonable assumption for many mechanical systems. Furthermore, this choice best serves our purpose of interpreting the task energy as a metabolic metric to assess an initial energy budget: the energy in the tank can only decrease along the task execution. In this discrete-time implementation of energy tanks, we aim to achieve the following finite-difference version of the PBC objective (5). We indicate by $E_k = E(kT)$ the sampled mechanical energy of the robot and by $\mathcal{E}_k = E_k + e_k$ the sampled closed-loop storage function.
Sampled PBC objective. Given the system (1) with the split (3) inducing the power balance (4), design a control policy inducing the desired control action $\tau_c = w_k$ such that the condition
$$\mathcal{E}_{k+1} - \mathcal{E}_k \le \int_{Tk}^{T(k+1)} \dot x(s)^\top F(s)\,ds \tag{15}$$
is verified $\forall k$, i.e., such that the sampled difference of the closed-loop storage function is bounded by the energy injected in a sampling interval by the s-environment through the port $(F, \dot x)$.
Theorem. Assuming constant generalized forces $w_k$ in the sampling interval $[Tk, T(k+1)]$, the PBC objective (15) is achieved using the tank update rule (14).
Proof. Define $R_{k+1} := \int_{Tk}^{T(k+1)} \dot q(s)^\top D(q(s))\dot q(s)\,ds$, the energy dissipated by friction in a sampling interval. Using (12) and integrating (4) in a sampling interval, we compute
$$\mathcal{E}_{k+1} - \mathcal{E}_k = (E_{k+1} - E_k) - \Delta e_{k+1} = w_k^\top (q_{k+1} - q_k) - \Delta e_{k+1} - R_{k+1} + \int_{Tk}^{T(k+1)} \dot x(s)^\top F(s)\,ds,$$
where we used the fact that $\tau_c = w_k$ is constant in a sampling interval. The final claim (15) follows from (14), which gives $w_k^\top (q_{k+1} - q_k) - \Delta e_{k+1} \le 0$, and from $R_{k+1} \ge 0$.
Note that the update rule (14) induces a stronger passivity margin than (13), since when $\Delta e_{k+1} = 0$ the term $w_k^\top (q_{k+1} - q_k)$ is negative, and as such acts as dissipation.
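The discrete tank logic described in this section fits in a few lines of code. The sketch below is a hedged illustration, not the authors' implementation: the class name `EnergyTank`, its method names, and the numeric values are hypothetical. It combines a switch that passes the action only while the stored energy is at least a threshold with a no-refill update that drains the tank by the non-negative part of $w_k^\top(q_{k+1} - q_k)$.

```python
from typing import List, Sequence


class EnergyTank:
    """Discrete-time energy tank without refilling (illustrative sketch)."""

    def __init__(self, e0: float, eps: float = 1e-3):
        self.e = e0      # current energy level e_k
        self.eps = eps   # small positive depletion threshold

    def gate(self, w: Sequence[float]) -> List[float]:
        """Pass w while the tank holds at least eps, otherwise command zero torque."""
        if self.e >= self.eps:
            return list(w)
        return [0.0] * len(w)

    def update(self, w_applied: Sequence[float],
               q: Sequence[float], q_next: Sequence[float]) -> float:
        """Drain the tank by the positive part of w^T (q_next - q); return that amount."""
        work = sum(wi * (b - a) for wi, a, b in zip(w_applied, q, q_next))
        delta = max(0.0, work)   # negative work is not recovered
        self.e -= delta
        return delta


tank = EnergyTank(e0=1.0, eps=0.125)
w = tank.gate([1.0, 2.0])                      # tank is full: action passes through
spent = tank.update(w, [0.0, 0.0], [0.5, 0.25])
print(spent, tank.e, tank.gate([1.0, 2.0]))    # 1.0 0.0 [0.0, 0.0]
```

Because the check happens before the step, the last active step can draw the level below the threshold; a stricter variant would have to bound the step's energy in advance, which this sketch does not attempt.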
We define the energy spent at step $k$ as the sum of the energies exiting the tank up to that step:
$$\hat e_k = \sum_{i=1}^{k} \Delta e_i.$$
Since with the update rule (14) the sequence $\hat e_k$ is monotonically non-decreasing, the task energy $e^*$ over a task with $N$ time steps can be simply calculated as
$$e^* = \hat e_N.$$
We now describe two procedures combining the tank and the RL algorithm. The first one aims at passivizing a control policy that was previously learned agnostically to any passive design. The second exploits the tank architecture also in the training phase, and produces by construction a passive learned policy.
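The spent-energy sequence and the task energy can be computed directly from a logged trajectory. The following is a minimal sketch assuming the log holds $N$ torque vectors $w_0,\dots,w_{N-1}$ and $N+1$ joint-position vectors $q_0,\dots,q_N$; the function name and the sample data are hypothetical.

```python
from typing import List, Sequence


def spent_energy(torques: Sequence[Sequence[float]],
                 positions: Sequence[Sequence[float]]) -> List[float]:
    """Cumulative energy e_hat_1..e_hat_N drained under the no-refill rule."""
    if len(positions) != len(torques) + 1:
        raise ValueError("need one more position sample than torque samples")
    e_hat, total = [], 0.0
    for k, w in enumerate(torques):
        work = sum(wi * (b - a)
                   for wi, a, b in zip(w, positions[k], positions[k + 1]))
        total += max(0.0, work)   # only positive work leaves the tank
        e_hat.append(total)
    return e_hat


# 1-DoF example: push forward, get pushed back, then pull back actively.
e_hat = spent_energy([[1.0], [1.0], [-1.0]], [[0.0], [0.5], [0.25], [0.0]])
task_energy = e_hat[-1]           # e* is the last (largest) element
print(e_hat, task_energy)         # [0.5, 0.5, 0.75] 0.75
```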

A. Passivization in Inference (Passivizing learned policies)
We indicate with the term inference the phase where the training is finished and the agent implements the learnt policy, without exploring anymore. The combination of the energy tank architecture and the reinforcement learning technique in inference is quite straightforward, as an RL agent represents in this context an arbitrary controller that the tank algorithm is able to passivize. In fact, we can use any previously trained agent in a generic environment and passivize the controlled system in inference by wrapping the control action with (11), where the energy in the tank is updated according to (12) and (14). Furthermore, an agent able to fulfill a given task in a generic environment with a specific task energy $e^*$ is also able to passively fulfill that task in an environment where the control action is wrapped with (11) and the energy tank is initialized with a value greater than $e^*$. This is indeed the only design requirement to preserve the agent's performance in executing the task with the learnt policy while achieving formal passivity (15). In order to equip the controlled system with a meaningful passivity property, it is important to initialize the tank with an amount of energy that slightly exceeds $e^*$, which can be estimated by running the agent without (11) for a number of episodes and measuring the maximum consumption of energy over all the episodes. The proposed procedure provides a constructive way to achieve the PBC objective and brings the additional benefit of task energy estimation, which produces a meaningful closed-loop passivization. This step provides a diagnostic tool indicating a non-regular task execution in case of tank depletion, which is a piece of extra information that a naive tank design lacks.
Note. It is meaningful to compare the described procedure with the works [17], [18], which aim at passivizing an already given control input, and as such can also be seen as passivization-in-inference algorithms. In [17], [18] the task energy is not explicitly estimated and, as a consequence, the initialization of the tank (whose non-depletion constraint in an optimization problem is what produces the passive approximation) is arbitrary. By solving the optimization with different tank initializations, different passive approximations are achieved in [17]. Here instead, motivated by the fact that passivity can be achieved anyhow for any finite task energy and that the learnt policy represents the way the task is optimally executed, we first estimate the task energy to constrain the degree of freedom of the tank initialization.
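The inference-time procedure can be sketched end to end on a toy problem. Everything below is a hypothetical stand-in chosen so the arithmetic is exact: a 1-DoF kinematic "environment" whose joint moves by half the applied torque per step, a bang-bang "policy" pushing toward $q = 2$, and a perturbed variant in which a constant drift pushes the joint back. The sketch estimates the task energy from non-passivized nominal episodes, initializes the tank slightly above it, and then compares the energy spent with and without the tank under perturbation.

```python
def rollout(policy, dynamics, q0, n_steps, e0=float("inf"), eps=0.0):
    """Run one episode and return the total energy drained from the tank.

    With e0 = inf the switch never triggers, which reproduces the
    non-passivized agent used to measure the task energy.
    """
    q, e, spent = q0, e0, 0.0
    for _ in range(n_steps):
        w = policy(q) if e >= eps else 0.0      # gate the action on the budget
        q_next = dynamics(q, w)
        delta = max(0.0, w * (q_next - q))      # no-refill tank update
        e -= delta
        spent += delta
        q = q_next
    return spent


def policy(q):                 # hypothetical learnt policy: push until q reaches 2
    return 1.0 if q < 2.0 else 0.0


def nominal(q, w):             # hypothetical nominal environment
    return q + 0.5 * w


def perturbed(q, w):           # same environment with an external drift
    return q + 0.5 * w - 0.25


# 1) estimate e* as the worst case over nominal, non-passivized episodes
e_star = max(rollout(policy, nominal, q0, 40) for q0 in (0.0, 0.5, 1.0))
# 2) initialize the tank slightly above e* (margin chosen for illustration)
e0 = e_star + 0.25
# 3) compare spending under perturbation, without and with the tank
spent_free = rollout(policy, perturbed, 0.0, 40)
spent_passive = rollout(policy, perturbed, 0.0, 40, e0=e0, eps=0.125)
print(e_star, spent_free, spent_passive)   # 2.0 6.0 2.25
```

In the nominal environment the tank never depletes, so the wrapped agent behaves exactly like the original one; under the perturbation the unwrapped agent keeps spending as the horizon grows, while the wrapped one stops once its budget is used.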

B. Passivization in Training (Learning passive policies)
In many circumstances, it might be desirable to continuously achieve a passive closed-loop system also in the training phase, i.e., when the task-dependent control policy is being learned. This is desirable for instance when the training takes place in real life, and not in simulation, so that the initial exploration phases are guaranteed not to undergo hazardous unbounded energy generation. Furthermore, the passivized training phase endows the agent with awareness of the metabolic spending encoded in the tank architecture, induced by the initial energy budget $e_0$. As a consequence, the learned policy will be influenced by the tank initialization $e_0$, and in particular, the task energy $e^*$ will be directly learned together with the control policy in a combined way, strengthening the significance of the resulting passivity property. This mechanism provides a clear biomimetic perspective on the scheme, since the energy budget and the task-dependent policy are not independent variables.
The proposed procedure is achieved by implementing the energy tank architecture of Fig. 2 also during the training phase. We recognize the following technical issue, which needs to be addressed before implementing the training.
Loss of Markovian property: Notice that the current value of the tank state $e_k$ depends on the entire state-action trajectory of the system from the beginning to step $k$. This is due to the update rule (12), where $\Delta e_{k+1}$ depends both on the joint position $q_k$ (which in most robotics applications is part of the RL state $s_k$) and on the action $a_k$.
Furthermore, the value $e_k$ is able, due to (11), to influence the future dynamics of the environment in case of tank depletion. For these reasons, the new environment that the agent perceives once the energy tank architecture is included as in Fig. 2 does not preserve the memorylessness of the original environment; in other words, it loses the Markov property.
If the Markov property is not satisfied, most RL algorithms lose their formal proof of convergence to the optimal policy, and long-term prediction performance can degrade when the one-step predictions defining the Markov property become inaccurate [23]. We identify two solutions to restore the Markov property.
Extended State: A common practice when the environment is non-Markovian because a relevant time-dependent piece of information is missing from the agent's state, but is accessible, is simply to include it in the state [23]. However, under this formulation, the agent might require a dedicated tuning process to maintain similar performance.
Extended Termination: As an alternative, the episode's termination condition can be extended to the situation in which the energy in the tank depletes. In case a robotic platform is involved, terminating the episode would mean braking the joints, instead of giving zero torque as done in the original formulation. After the termination, the environment is reset and a new episode can start with a restored level of energy in the tank. In this formulation, the Markov property is restored without changing the state and the reward. In fact, the termination of the episode removes the switching behavior (11), and the agent never meets a state influenced by the free dynamics.
The reward function has to be non-negative, since the length of an episode is not fixed. In fact, with a negative reward, the agent might attempt to maximize the cumulative return by learning actions that lead to tank depletion.
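Both remedies can be expressed as a thin wrapper around the environment. The sketch below is an assumption-laden illustration: `ToyJoint` is a hypothetical 1-DoF environment returning `(joint position, reward, done)`, and `TankTermination` is a hypothetical wrapper, not an API of any RL library. It ends the episode when the tank drops below the threshold and, optionally, appends the tank level to the observation as in the extended-state variant.

```python
class ToyJoint:
    """Hypothetical 1-DoF environment: the action is a torque w and the
    joint moves by 0.5 * w per step (stand-in for a real simulator)."""

    def reset(self):
        self.q = 0.0
        return self.q

    def step(self, w):
        self.q += 0.5 * w
        return self.q, 1.0, False          # joint position, reward, done


class TankTermination:
    """Terminate the episode on tank depletion; refill the tank on reset."""

    def __init__(self, env, e0, eps, extend_state=False):
        self.env, self.e0, self.eps = env, e0, eps
        self.extend_state = extend_state   # also expose e_k to the agent

    def _obs(self):
        return (self.q, self.e) if self.extend_state else self.q

    def reset(self):
        self.e = self.e0                   # restored energy budget
        self.q = self.env.reset()
        return self._obs()

    def step(self, w):
        q_next, reward, done = self.env.step(w)
        self.e -= max(0.0, w * (q_next - self.q))   # no-refill tank update
        self.q = q_next
        depleted = self.e < self.eps
        return self._obs(), reward, done or depleted


env = TankTermination(ToyJoint(), e0=1.0, eps=0.125)
env.reset()
print(env.step(1.0))   # (0.5, 1.0, False): half of the budget is left
print(env.step(1.0))   # (1.0, 1.0, True): tank depleted, episode ends
```

With termination on depletion, the agent never observes a transition produced by the zero-torque branch of the switch, which is the mechanism by which this variant restores the Markov property.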
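The sign requirement on the reward can be seen with a short calculation. As a simplifying assumption made only for this illustration, take a constant per-step reward $r$ and an episode that terminates after $n$ steps, with discount factor $\gamma \in (0,1)$:

```latex
G(n) = \sum_{k=0}^{n-1} \gamma^{k} r = r\,\frac{1-\gamma^{n}}{1-\gamma},
\qquad
G(n+1) - G(n) = \gamma^{n} r .
```

For $r < 0$ every additional step lowers the return, so ending the episode early by depleting the tank is rewarded; for $r \ge 0$ an additional step never lowers it, and early depletion brings no advantage.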

IV. SIMULATIONS
We consider two environments implemented in the MuJoCo physics simulator [24]: 1) DoorOpening: Here, we adopted an implementation from the robosuite simulation framework [11], where a 7-DoF robotic arm must learn to open a door by turning the handle, as shown in Fig. 3. The door location is randomized at the beginning of each episode. The robot is a Franka Emika Panda, which mounts a parallel-jaw gripper equipped with two small finger pads. The torques provided to the 7 joints of the robot are generated using a proportional control law $\tau = k_p(\dot q_d - \dot q)$, with $k_p = 0.03$, which tracks a target joint velocity $\dot q_d$ provided by the agent at 50 Hz.
2) Pendulum: In this environment we created a pendulum composed of a rod suspended by one extremity from a pivot actuated by a torque-controlled motor with a control frequency of 50 Hz. The rod is subject to gravity, and friction losses are present in the joint. Let us denote the joint angle by $\beta$. In the reference configuration $\beta = 0$ the pendulum points horizontally to the right. At the beginning of each episode, the angle is randomly initialized as $\beta \sim \mathcal{N}(-\pi/2, 0.05\pi)$. The environment state is defined as the vector $s_k = (\sin\beta_k, \cos\beta_k, \tanh\dot\beta_k)^\top$, while the agent action corresponds to the torque applied to the joint. Since the inverted pendulum problem consists of bringing the rod to the inverted position ($\sin\beta^* = 1$) and holding it there as long as possible, we design a positive reward function $r_k = (1 + |\sin\beta_k - 1| + 0.1\,|\tanh\dot\beta_k| + 0.01\,|\tau_k|)^{-1}$, which is the inverse of one plus a weighted sum of the absolute values of the position error ($\sin\beta_k - 1$), the velocity error ($\tanh\dot\beta_k$) and the torque ($\tau_k$). These last two components help to remove oscillatory behaviors in the transition phase and at the equilibrium point.
In all the simulations presented in this work, the agents are trained using the SAC algorithm implemented in PyTorch, on an NVIDIA GeForce GTX 1080 Ti with an Intel Core i7-7700K CPU clocked at 4.20 GHz. The training configurations for both environments are reported in Table I. Additionally, generalized State-Dependent Exploration (gSDE) is employed, where a new noise matrix is sampled every episode. The entropy regularization coefficient is learned automatically, as done in [25].
In the rest of this section, we present the results obtained from the simulations run on the two environments introduced above. In particular, the passivization of a policy learned on DoorOpening and the learning of a passive policy on Pendulum are discussed in Sec. IV-A and Sec. IV-B, respectively. The notation adopted for each agent in training and inference is briefly reported in Table II.
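The Pendulum observation and reward described above can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, and the velocity term is taken in absolute value, following the prose description of a weighted sum of absolute values.

```python
import math


def observe(beta, beta_dot):
    """State s_k = (sin(beta), cos(beta), tanh(beta_dot))."""
    return (math.sin(beta), math.cos(beta), math.tanh(beta_dot))


def reward(beta, beta_dot, tau):
    """Positive reward, maximal (= 1) when upright, at rest, with zero torque."""
    cost = (abs(math.sin(beta) - 1.0)          # position error
            + 0.1 * abs(math.tanh(beta_dot))   # velocity error (assumed absolute)
            + 0.01 * abs(tau))                 # torque penalty
    return 1.0 / (1.0 + cost)


print(reward(math.pi / 2, 0.0, 0.0))    # upright and still: 1.0
print(reward(-math.pi / 2, 0.0, 0.0))   # hanging down: about 0.333
```

Since the cost is non-negative, the reward always lies in $(0, 1]$, which is compatible with the non-negativity requirement of the Extended Termination scheme.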

A. Passivization in Inference
To study the effects of passivization in inference, an external force field is applied to the system to perturb the nominal task. The magnitude of the force is chosen so that the tank depletes when it is initialized with an amount of energy equal to the task energy measured in the nominal case. For this experiment, we consider the DoorOpening environment and employ a pre-trained model available online⁴. As visible in Fig. 4, under the effect of the external force, the agent without passivization in inference, $\phi_{\infty|\infty,\delta}$, can spend an unbounded amount of energy, while the consumption is limited when the same agent is wrapped with passivization, $\phi_{\infty|e^*,\delta}$. The energy tank is initialized with the task energy $e^*$, measured as the maximum $\hat{e}_k$ over 100 episodes of the agent $\phi_{\infty|\infty}$, i.e., without the external force and without passivization. This simulation illustrates the versatility of the proposed framework: an agent trained by a third party to optimize the process in a non-passive way is readily passivized by wrapping the decision variable as detailed in Sec. III-A, without the need to retrain or modify the agent.
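The idea of wrapping a pre-trained policy with an energy budget can be illustrated with a minimal sketch. The class below is hypothetical and is not the algorithm of Sec. III-A: it assumes the wrapped quantity is the joint torque, that the spent energy is the mechanical work of the torque on the joint velocity, and that the passive reaction on depletion is simply zero torque.

```python
from typing import List, Sequence


class EnergyTank:
    """Hypothetical minimal tank: a scalar budget drained by mechanical work."""

    def __init__(self, budget: float) -> None:
        # Remaining energy; float("inf") reproduces the non-passivized case.
        self.level = budget

    def passivize(self, torque: Sequence[float], velocity: Sequence[float],
                  dt: float) -> List[float]:
        """Return the torque to apply, given the torque requested by the policy."""
        # Work the requested torque would perform during this control step.
        work = sum(t * v for t, v in zip(torque, velocity)) * dt
        if work > self.level:
            # Assumed passive reaction: zero torque, tank level left unchanged.
            return [0.0] * len(torque)
        self.level -= work  # negative work (energy flowing back) refills the tank
        return list(torque)
```

In use, the tank would be created once per episode with the measured task energy as its budget, and `passivize` would be called at every control step between the policy output and the actuators, so the policy itself is never modified.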

B. Passivization in Training
Let us consider two agents $\phi_\infty$ and $\phi_{e_0}$ trained in the Pendulum environment with the same hyperparameter configuration, where $\phi_{e_0}$ is additionally passivized with the framework of Sec. III-B and the Extended Termination method. For each simulation on the Pendulum, 5 instances with different random seeds for the pseudo-random generators are trained. Training with a sufficiently low $e_0$ provides some level of meaningful robust stability in the training phase. Furthermore, we aim to analyze how the energy budget $e_0$ imposed in training influences the learned policy and the task energy $e^*$, which are correlated variables.

⁴ https://github.com/ARISE-Initiative/robosuite-benchmark
We choose as $e_0$ the task energy estimated from $\phi_\infty$, measured in inference as the maximum $\hat{e}_k$ over 100 episodes. As visible in Fig. 5d, the task energy $e^*$ estimated for $\phi_{e_0}$ is lower than the energy that the system spends to perform the task when using the policy $\phi_\infty$: the agent $\phi_{e_0}$ learns to perform the task with lower energy consumption. Furthermore, the average returns of the two agents during training, shown in Fig. 5a, converge to the same value (almost 2500), while $\phi_\infty$ can spend almost 50 times more energy than $\phi_{e_0}$, see Fig. 5b. Clearly, the episode terminations caused by the energy constraint (i.e., when the tank depletes) disturb the exploration during training, which results in a slower convergence to the final policy. Fig. 5c compares the two agents in inference, where the position error $d_k = 1 - \sin(\beta_k)$ is adopted as a success metric for the task. Both agents converge to the same error, with a smoother trajectory for $\phi_{e_0}$.
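The two quantities used in this experiment, the task energy measured as a maximum over episodes and the step at which a budget is exhausted, can be sketched as below. This is an illustration under assumptions: the spent energy is approximated by a running sum of power samples at 50 Hz, all function names are hypothetical, and the details of the Extended Termination method of Sec. III-B are not reproduced.

```python
from typing import List, Optional, Sequence

DT = 1.0 / 50.0  # control period for the 50 Hz loop reported in the text


def spent_energy_trace(powers: Sequence[float], dt: float = DT) -> List[float]:
    """Cumulative spent energy after each step, from per-step power samples."""
    trace, total = [], 0.0
    for p in powers:
        total += p * dt
        trace.append(total)
    return trace


def task_energy(episodes: Sequence[Sequence[float]], dt: float = DT) -> float:
    """Maximum spent energy over all steps of all recorded episodes."""
    return max(max(spent_energy_trace(ep, dt)) for ep in episodes)


def depletion_step(powers: Sequence[float], e0: float,
                   dt: float = DT) -> Optional[int]:
    """Index of the first step at which the spent energy reaches e0, else None."""
    for k, e_hat in enumerate(spent_energy_trace(powers, dt)):
        if e_hat >= e0:
            return k  # a training episode would be terminated here
    return None
```

A budget for training could then be obtained as `task_energy(recorded_episodes)` from inference runs of the unconstrained agent, and `depletion_step` indicates where an episode would be cut short under that budget.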

V. CONCLUSIONS AND FUTURE WORK
We introduced a framework merging the energy tank algorithm, used as a tool to passivize arbitrary control schemes, and reinforcement learning, representing the most versatile method to learn control policies for complex tasks. The presented procedures allow us to passivize any control policy and to learn policies that are passive by construction, tackling the lack of optimization in the design phase of passive controllers. As future work, we are pursuing a thorough study of energy-aware robotics in the proposed neural framework, exploiting the energy tank architecture to embed metabolic and safety metrics, which depend on the physical energy and power flows in the robot. This goes beyond passive designs, a necessary step both because many important tasks require continuous energy injection (e.g., periodic locomotion) and because safety and passivity, although sometimes naively considered equivalent, are distinct concepts.

Fig. 1: The essentials of the energy tank algorithm.

Fig. 5: (a) Average returns and (b) energy spent in training; (c) average position error and (d) energy spent in inference.

TABLE I: Hyperparameters