Sliding mode based fault diagnosis with deep reinforcement learning add‐ons for intrinsically redundant manipulators

This article presents a fault diagnosis control scheme for intrinsically redundant robot manipulators based on the combination of a deep reinforcement learning (DRL) approach and a battery of sliding mode observers. The DRL plays the role of detecting and isolating possible sensor faults, thus generating an alarm and pin-pointing the source. This in turn makes it possible to compensate the sensor faults independently of the actuator ones. The latter are therefore detected and isolated by a set of sliding mode observers driven by input laws designed according to an optimal reaching algorithm. In order to design and apply such observers, a global feedback linearization is performed, which transforms the multi-input-multi-output (MIMO) nonlinear robot model into a chain of double integrators. The proposal is analyzed and assessed in realistic conditions using the PyBullet environment, in which a 7 degrees-of-freedom (DOFs) Franka Emika Panda robot manipulator is reproduced.

model, and then the corresponding output signals are analyzed (see, e.g., References 3 and 4). On the other hand, active approaches exploit the injection of specific signals in order to improve the detectability of the faults (see References 5-7, among many others). A recent alternative is represented by the so-called projection methods, which are based on the definition of a set of controllers from which the control law associated with the specific detected fault is selected, thus giving rise to a switched mechanism. Other methods are instead based on adaptation and rely on robust control approaches, ranging from H-infinity methods, 8 to model following and multimodel approaches, 9 using linear quadratic regulators (LQR) 10 or model predictive controllers (MPC), 11 and relying on neural networks, 12 to cite a few.
Among these techniques, sliding mode control (SMC) has also emerged as a particularly promising method for FD applications. 13 SMC is indeed well known for its robustness against a wide class of uncertain terms, and in particular against matched uncertainties acting directly on the input channel. 14 In certain cases this property makes it possible to avoid a controller reconfiguration, but it requires some assumptions on the knowledge of the possible faults. Therefore, SMC based observers have been studied and introduced to design fault tolerant control schemes. In fact, SMC observers are capable of providing a robust estimate of the fault signals through a suitable scaling and filtering of the so-called equivalent output error injection. More precisely, the concept of the so-called equivalent control 15 is exploited, and the fault identification is performed without requiring any trade-off between state estimation and fault sensitivity. This technique has been exploited in several works (see, for instance, References 16 and 17, among many others) and in several fault scenarios.
In the context of FD, particular attention is devoted to robotic machines, since nowadays robots are rapidly expanding to numerous applications. 19,20 Moreover, such faults can occur simultaneously on both these components, and more than one fault at a time can perturb the system operation. This scenario is made further complex by the nonlinear coupled nature of the robot model, 21,22 so that, even if in most of the FD literature such coupling effects are considered negligible, 19,20,23,24 this is generally false in robotics. A possible solution to decouple the effects of the fault signals acting on sensors and actuators is the introduction of additional vision sensors, such as cameras externally located with respect to the robot system. As for the adopted methodologies to address the FD problem, the use of sliding mode based approaches for robot manipulators has also been investigated, both in the case of genuine controllers (see, e.g., References 25-30) and in the case of observer design (see, for instance, References 31 and 32).
In Reference 31, a vision sensor is employed to enhance the fault diagnosis capability of the proposed scheme relying on SMC based observers. An extension of this work is proposed in Reference 32, where a planar robotic manipulator is considered and the MIMO nonlinear coupled model is transformed into a set of decoupled linear systems using the so-called inverse dynamics approach. 33 Moreover, detection, isolation and identification of sensor faults are carried out using a low-cost IP camera, by processing the images in order to extract the correct position of the end-effector in case of fault alarms. Suboptimal second order sliding mode (SSOSM) 34 based unknown input observers (UIOs) have also been adopted.
In this work, we put forward a novel fault diagnosis control scheme capable of detecting, isolating and identifying both sensor and actuator faults in the case of intrinsically redundant robot manipulators. It is in fact well known that kinematically redundant robots have more degrees of freedom than are needed to execute a given task, thus implying possibly infinite solutions to the inverse kinematics problem. This makes the use of a vision sensor not sufficient to correctly compute the joint positions given the end-effector coordinates in space. The novelty of the proposal lies in the application of a model-free DRL based reconstruction of signals, in order to correct the corrupted sensor measurements before they are used by the robot controller, and of a battery of sliding mode observers aimed at the actuator fault diagnosis. The considered scenario is related to the so-called recoverability problem (see, e.g., References 35 and 36 and the references therein), according to which a sufficient redundancy of the robot is needed to accommodate fault situations. Specifically, this article aims at providing an alternative method for the design of a FD control scheme enabling the following features: 1. a DRL approach relying on the twin delayed deep deterministic policy gradient (TD3) algorithm is adopted to generate the map capable of estimating multiple sensor faults independently of the actuator faults; 2. the sensor fault signals are compensated to provide the correct joint positions, which makes it possible to close the loop and perform the actuator fault diagnosis via dedicated observers; 3. exploiting the diagnosis decoupling provided by the DRL mechanism, a battery of sliding mode observers is applied for multiple actuator fault detection, isolation and identification; 4.
differently from previous works, 31,32 where a suboptimal second-order sliding mode (SSOSM) approach was used, a second-order sliding mode (SOSM) law with optimal reaching 37 is here adopted. Although, differently from References 31 and 32, an estimate of the joint acceleration is needed, the proposal improves the fault estimation thanks to the minimum time convergence property that such an approach guarantees against the worst realization of the uncertainties.
The proposal is finally assessed in a realistic simulation scenario created in the PyBullet environment, in which a 7 DOFs Franka Emika Panda robot manipulator is recreated using data provided by a real setup.
The article is organized as follows. In Section 2, the considered robot model and the faulty scenario are described, and the addressed problem is formulated. In Section 3, the adopted control scheme is described, while in Section 4 the proposed FD approach, characterized by the DRL mechanism and the SOSM UIOs, is discussed in detail and analyzed. Simulation results are illustrated and commented on in Section 5, while some conclusions are gathered in Section 6.

PROBLEM FORMULATION
In this section the considered robot model is presented and the faulty scenario is described.

Robot kinematic and dynamics
In this work, an intrinsically redundant robot manipulator is considered (see Figure 1), that is, one in which the dimension of the operational space is smaller than the dimension of the joint space 33 (Ch. 2). The robotic system consists of n_q revolute joints and n_q + 1 links. The joint variables vector is denoted as q ∈ R^{n_q}, where the orientation of the first link with respect to the y-axis is clockwise positive, while the displacement of the generic jth link, j ≠ 1, is clockwise positive with respect to the (j − 1)th one. Then, let O_0, {x, y, z}, be the base frame of the robotic manipulator, with center O_0 placed in the center of the first joint of the robot, while O_e, {n, s, a}, is the end-effector frame, with center O_e placed on the robot end-effector and associated with the axes {n, s, a}. In the following, the proposed control scheme exploits both the kinematics and the dynamics of the robot, defined as follows.
As for the kinematic model, the so-called direct kinematics, determining the pose of the end-effector with respect to the base frame, can be computed through the homogeneous transformation matrix T^e_0 ∈ R^{4×4} given by

T^e_0(q) = [ R^e_0(q)   p^e_0(q)
             0_{1×3}    1 ],  (1)

where p^e_0(q) ∈ R^3 is the position of the end-effector frame with respect to the base frame, R^e_0(q) ∈ R^{3×3} is the rotation matrix between the two frames, while 0_{1×3} is a row zero vector. The dynamics of the n_q-joints robot manipulator is instead captured by

B(q)q̈ + n(q, q̇) = τ,  (2a)
n(q, q̇) = C(q, q̇)q̇ + F_v q̇ + F_s sgn(q̇) + g(q),  (2b)

where q̇ ∈ R^{n_q} is the joint velocity vector, q̈ ∈ R^{n_q} is the corresponding joint acceleration vector, B(q) ∈ R^{n_q×n_q} is the manipulator inertia matrix, C(q, q̇) ∈ R^{n_q×n_q} is the matrix containing the Coriolis and centripetal effects, F_v ∈ R^{n_q×n_q} is the viscous friction matrix, F_s ∈ R^{n_q×n_q} is the static friction matrix, g(q) ∈ R^{n_q} is the vector which takes into account gravitational forces and torques, while τ ∈ R^{n_q} is the vector of the motor torques.
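To fix ideas, the block structure of (1) and the composition of homogeneous transforms along a kinematic chain can be sketched in a few lines of Python. The rotation axes and link offsets below are illustrative placeholders, not the Panda's actual kinematic parameters.

```python
# Sketch of homogeneous transforms T = [[R, p], [0_{1x3}, 1]] as in (1),
# composed along a chain; link data here are hypothetical.
import numpy as np

def homogeneous(R: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Assemble T in R^{4x4} from a rotation R in R^{3x3} and a position p in R^3."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def rot_z(theta: float) -> np.ndarray:
    """Elementary rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Direct kinematics of a chain is the product of the link transforms
T01 = homogeneous(rot_z(0.3), np.array([0.0, 0.0, 0.333]))
T12 = homogeneous(rot_z(-0.5), np.array([0.0, 0.0, 0.316]))
T02 = T01 @ T12

# The bottom row stays [0 0 0 1] and the rotation block stays orthonormal
assert np.allclose(T02[3], [0.0, 0.0, 0.0, 1.0])
assert np.allclose(T02[:3, :3] @ T02[:3, :3].T, np.eye(3))
```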

Faults modeling
In this article, faults on both sensors and actuators are considered. A fault on the ith sensor is modeled as an error Δq_i ∈ R which is added to the value of the ith element of the vector q. Analogously, a fault on the ith actuator is modeled as an additive term Δτ_i ∈ R which sums up with the ith element of the torque vector τ. Moreover, letting Δq ∈ R^{n_q} and Δτ ∈ R^{n_q} be the vectors containing the sensor and actuator faults, respectively, the following assumption holds.

Assumption 1 (Boundedness of the faults). The fault vectors Δq and Δτ are bounded.
If single or multiple sensor faults (SFs) occur, the vector of the joint variables containing such anomalies is indicated as

q_F = q + Δq,  (3)

while the motor torques affected by actuator faults (AFs) are

τ_F = τ + Δτ.  (4)

Problem statement
We are now in a position to introduce the considered scenario and formulate the problem to solve. Consider any possible task in the operational space described by m < n_q variables, that is, more DOFs than task variables are available. Assume that all the joint variables are measurable from the encoders fastened on the robot joints, and that these are susceptible to faults, while an external sensor (e.g., a vision sensor) is capable of directly retrieving only the end-effector position p^e_0 without being affected by any possible fault. Furthermore, assume that, during the task execution, the fault events (FEs) reported in Table 1 can occur. Therefore, the goal is to design a FD scheme capable of providing at each time instant t the estimates Δq̂ ∈ R^{n_q} and Δτ̂ ∈ R^{n_q} of the sensor and actuator faults in order to minimize the following performance index

J(t) = r_s(t) + r_a(t),  (5a)

where the first term is related to the sensor fault estimation error, that is,

r_s(t) = ||Δq(t) − Δq̂(t)||,  (5b)

while the second one corresponds to the actuator fault estimation error, that is,

r_a(t) = ||Δτ(t) − Δτ̂(t)||.  (5c)
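The estimation-error terms of the performance index can be sketched numerically as follows; the fault values below are arbitrary illustrative numbers, and the Euclidean norm is used for both error terms.

```python
# Minimal sketch of the sensor and actuator fault estimation errors entering
# the performance index (5); all numerical values are illustrative.
import numpy as np

n_q = 7
dq_true = np.array([0.0, -0.03, 0.0, 0.0, 0.0, 0.11, 0.0])    # sensor faults (rad)
dq_hat  = np.array([0.0, -0.029, 0.0, 0.0, 0.0, 0.112, 0.0])  # their estimates
dtau_true = np.zeros(n_q)  # actuator faults (N m), none in this example
dtau_hat  = np.zeros(n_q)  # their estimates

r_s = np.linalg.norm(dq_true - dq_hat)      # sensor fault estimation error
r_a = np.linalg.norm(dtau_true - dtau_hat)  # actuator fault estimation error
J = r_s + r_a                               # the index to be minimized

assert r_s < 0.01 and r_a == 0.0
```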

ADOPTED ROBUST CONTROL SCHEME
Before introducing the proposed FD approach, it is instrumental to clarify the role of the controller in the overall control scheme.

Inverse dynamics
In order to achieve an equivalent controlled system behaving as a set of double integrators, the so-called inverse dynamics approach is applied. 33 At this stage, assume that Δq = 0 and Δτ = 0 (i.e., no faults occur), and consider the control scheme in Figure 2. For the sake of simplicity, and without losing any generality for the applicability of the proposal, the following assumption is considered.
Assumption 2 (Model knowledge). Both the inertia matrix B(⋅) ∈ R^{n_q×n_q} and the force vector n(⋅, ⋅) ∈ R^{n_q} are known.
The inverse dynamics of a robot manipulator can be expressed as a nonlinear relationship between plant inputs and outputs. Given an auxiliary control vector y ∈ R^{n_q}, dimensionally equal to an acceleration, the control torque is designed as

τ = B(q)y + n(q, q̇).  (6)

Hence, by applying (6) to (2), and by virtue of Assumption 2, one obtains

q̈ = y,  (7)

that is, a chain of n_q decoupled double integrators, one for each joint of the robot.
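A minimal numerical sketch of the computed torque law (6) on a hypothetical 2-joint model follows; the matrices B and n below are invented for illustration and do not model any real robot.

```python
# Sketch of the inverse dynamics law (6): with exact model knowledge
# (Assumption 2), the closed inner loop reduces to q_ddot = y as in (7).
import numpy as np

def B(q):
    """Toy symmetric positive-definite inertia matrix (illustrative)."""
    return np.array([[2.0 + np.cos(q[1]), 0.3], [0.3, 1.0]])

def n(q, dq):
    """Toy Coriolis/friction/gravity vector (illustrative)."""
    return np.array([0.5 * dq[0] + 9.81 * np.sin(q[0]), 0.2 * dq[1]])

def inverse_dynamics_torque(q, dq, y):
    """tau = B(q) y + n(q, dq), as in (6)."""
    return B(q) @ y + n(q, dq)

q, dq = np.array([0.4, -0.2]), np.array([0.1, 0.0])
y = np.array([1.5, -0.7])  # auxiliary acceleration command
tau = inverse_dynamics_torque(q, dq, y)

# Plugging tau back into the dynamics (2): q_ddot = B^{-1}(q)(tau - n) = y
q_ddot = np.linalg.solve(B(q), tau - n(q, dq))
assert np.allclose(q_ddot, y)
```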

FIGURE 2 The adopted control scheme for the considered Franka Emika Panda robot manipulator.

Controller design
We are now in a position to introduce the control algorithm. In fact, if Assumption 2 holds and the inverse dynamics law (6) is applied, the controller can be designed for each joint in order to stabilize an unperturbed double integrator. However, having in mind faulty scenarios, the inverse dynamics law affected by faults becomes

τ = B(q_F)y + n(q_F, q̇_F).  (8)

By adding the actuator faults Δτ to (8), and substituting it into (2), one has that the equivalent controlled system is

q̈ = B^{-1}(q)B(q_F)y + B^{-1}(q)(n(q_F, q̇_F) − n(q, q̇)) + B^{-1}(q)Δτ
  = φ(q, q̇, q_F, q̇_F) + Δy + G(q, q_F)y,  (9)

where φ(q, q̇, q_F, q̇_F) = B^{-1}(q)(n(q_F, q̇_F) − n(q, q̇)) ∈ R^{n_q} and G(q, q_F) = B^{-1}(q)B(q_F) ∈ R^{n_q×n_q} are unknown vector fields, while the corresponding acceleration fault is given by

Δy = B^{-1}(q)Δτ.  (10)

Therefore, the application of the inverse dynamics linearization law in the presence of faults results in a coupled uncertain MIMO system. The following assumption needs now to be introduced.
Assumption 3 (Boundedness of vector fields). The control effectiveness matrix G(q, q_F) ∈ R^{n_q×n_q} and the drift term φ(q, q̇, q_F, q̇_F) ∈ R^{n_q} are bounded, that is,

||G(q, q_F)|| ≤ Γ,  φ(q, q̇, q_F, q̇_F) ∈ Φ,  (11)

with Γ > 0 and Φ being a compact set containing the origin, limited by Φ_sup = sup_{φ ∈ Φ} {||φ||}, respectively.
Moreover, from (10), the acceleration fault Δy is bounded by virtue of the boundedness of Δτ.
Assumption 4 (Boundedness of the acceleration fault). The acceleration fault Δy is bounded, that is,

Δy ∈ Υ,  (12)

with Υ being a compact set containing the origin, limited by Υ_sup = sup_{Δy ∈ Υ} {||Δy||}. The previous assumptions are instrumental to design a suitable controller capable of guaranteeing the closed-loop stability in the presence of model mismatches. A viable solution can be a SMC law for MIMO systems, see, for example, References 38 and 39. Note that the design of the controller is out of the scope of the present article, which proposes a FD scheme aimed at detecting, isolating and identifying sensor and actuator faults, without compromising the closed-loop stability of the control system.

PROPOSED FAULT DIAGNOSIS SCHEME
In order to solve the problem discussed in Section 2, in this work, we propose the FD control scheme illustrated in Figure 3.
Starting from the scheme in Figure 2, two additional key elements are included: the DRL based block for the sensor fault diagnosis, and the battery of SOSM UIOs. In the following, some elements of the DRL approach are given before describing our proposal. Then, the SOSM UIOs are described and analyzed.

Elements of reinforcement learning
The basic concept underlying any reinforcement learning (RL) algorithm is that a decision maker, called agent, interacts with an environment while performing a sequence of actions chosen according to a certain strategy, called policy.The goodness of each action depends on the so-called reward function, which is assigned to the agent by the environment.
The value of such a reward in turn depends on the task that the decision maker must perform. The main goal of the agent is to improve its policy in order to perform a sequence of actions that maximizes not the single reward received after each action, but a long term reward which takes into account all the actions performed during a certain time horizon. In order to do that, it is important that the agent exploits the actions that in the past led to a higher instantaneous reward, while exploring new actions that could lead to better results. As for the environment, it is characterized by a certain set of state variables s ∈ S, with S being the state space. The action that the agent performs is instead represented by a set of variables a ∈ A, with A being the so-called action space. At each time instant t, the environment is described by a state s_t ∈ S, and the agent, according to a policy π(a_t|s_t), selects a certain action a_t ∈ A and, on the basis of a transition function P(s_{t+1}|s_t, a_t), the environment changes its state from s_t to s_{t+1} ∈ S. Depending on the landing state s_{t+1}, the environment gives a reward r_{t+1} ∈ R, with R being the so-called reward space.
Therefore, given the current state s, the action a, and the next state indicated as s′, the expected value of the reward, with E{⋅} denoting the expectation, can be computed as

r(s, a, s′) ≔ E{r_{t+1} | s_t, a_t, s_{t+1}}.  (13)

The ultimate goal of the agent is then to maximize the so-called cumulative reward R_t, which takes into account all the rewards collected during the time horizon T_h. In particular, one has

R_t = Σ_{k=0}^{T_h} γ^k r_{t+k+1},  (14)

where 0 < γ ≤ 1 is the discount factor, which prioritizes earlier rewards over later ones. Given a policy π(a|s), it is possible to compute the value of a state s, denoted as V^π(s) and defined as the expected cumulative reward if the agent starts in state s and chooses actions according to the policy π, that is,

V^π(s) = E_π{R_t | s_t = s},  (15)

with R_t as in (14). Analogously, it is possible to define the value of an action a in a certain state s by computing the expected cumulative reward starting in s, taking action a, and then following the policy π thereafter. Such a value is called action-value function for policy π, and it is defined as

Q^π(s, a) = E_π{R_t | s_t = s, a_t = a}.  (16)

If the state space S and the action space A have finite dimension, it is possible to approximate the action-value function using the so-called Q-Table. Such a table has a number of rows equal to the number of states and a number of columns corresponding to the number of possible actions. The Q-Table can then be easily approximated using Q-Learning algorithms. 40 If instead the state space becomes too large, or continuous, building the Q-Table becomes unfeasible, and a parametric approximator for the action-value function must be used. A way to build such an approximator is the use of a deep neural network (DNN), that is, a parametric function which is able to model complex nonlinear relationships.
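For the finite-dimensional case just mentioned, a minimal tabular Q-learning sketch on a hypothetical 5-state chain (all problem data invented for illustration) is the following.

```python
# Tabular Q-learning sketch: the Q-Table has one row per state and one column
# per action; the update moves Q(s,a) toward r + gamma * max_a' Q(s',a').
# The tiny chain MDP below is hypothetical: action 1 moves right, 0 stays,
# and reaching the last state yields reward 1.
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))  # the Q-Table
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy action selection (explore with probability 0.2)
        a = rng.integers(n_actions) if rng.random() < 0.2 else int(Q[s].argmax())
        s_next = min(s + a, n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning temporal-difference update
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# Moving right should dominate in every non-terminal state
assert all(Q[s, 1] > Q[s, 0] for s in range(n_states - 1))
```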
When dealing with DNNs, it is possible to approximate the action-value function also using deep Q network (DQN) algorithms. 41 However, DQNs are no longer sufficient when dealing with continuous action spaces. In these cases, several algorithms such as deep deterministic policy gradient (DDPG) 42 and twin delayed deep deterministic policy gradient (TD3) 43 can be used.
In this work, having in mind a robotic application with continuous state and action spaces, the TD3 algorithm is adopted (see Algorithm 1). It is aimed at concurrently learning the optimal action-value function and the policy. It uses target networks and a replay buffer D, and it presents some peculiar characteristics that improve the performance with respect to other similar approaches such as DDPG. The first one is that it uses two Q-functions, approximated by two critic networks Q_{φ1} and Q_{φ2}, parametrized by two sets of weights φ1 and φ2. Given a state s′ and an action ā, chosen according to the target policy π_{θ′}, the smaller of the two Q-values is used to build the targets in the Bellman error loss functions (lines 9-10 in Algorithm 1). The second characteristic is that the update of the parameters θ of the actor network π_θ and the update of the target networks are not performed at every time step, but with a certain frequency f_θ (lines 11-14 in Algorithm 1). Finally, the third peculiarity is that, in order to reduce the risk that the policy exploits some error in the Q-functions, TD3 adds noise to the target action (see line 8 in Algorithm 1).

Algorithm 1. TD3 algorithm 43
1: Initialize the "critic" networks Q_{φ1}, Q_{φ2} and the "actor" network π_θ with random weights φ1, φ2, θ
2: Initialize the target networks φ′1 ← φ1, φ′2 ← φ2, θ′ ← θ, and the replay buffer D
3: for each iteration it do
4:   Select the action with added exploration noise a ← π_θ(s) + ε, with ε ∼ N(0, σ_e)
5:   Execute a, observe the reward r and the new state s′
6:   Store the transition (s, a, r, s′) in D
7:   Sample a mini-batch of N transitions (s, a, r, s′) from D
8:   Compute the target action ā ← π_{θ′}(s′) + ε, with ε ∼ clip(N(0, σ̄), −c, c)
9:   Compute the targets z ← r + γ min_{i=1,2} Q_{φ′i}(s′, ā)
10:  Update the "critic" networks φ_i ← argmin_{φ_i} N^{-1} Σ (z − Q_{φi}(s, a))²
11:  if it mod f_θ = 0 then
12:    Update θ using the deterministic policy gradient
13:    Update the target networks φ′i ← ρφ_i + (1 − ρ)φ′i, θ′ ← ρθ + (1 − ρ)θ′
14:  end if
15: end for
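The three ingredients just listed (clipped target-action noise, the minimum over the two target critics, and the resulting target) can be sketched as follows; the actor and critics are toy scalar functions used only for illustration.

```python
# Sketch of the TD3 target computation: target policy smoothing (clipped
# Gaussian noise on the target action) and clipped double-Q (min over two
# target critics). The networks are replaced by toy linear/tanh functions.
import numpy as np

rng = np.random.default_rng(1)
gamma, sigma_bar, c = 0.99, 0.2, 0.5

def actor_target(s):
    """Toy deterministic target policy pi_theta'(s)."""
    return np.tanh(0.5 * s)

def critic_target(weights, s, a):
    """Toy target critic Q_phi'(s, a)."""
    return weights[0] * s + weights[1] * a

w1, w2 = (1.0, 0.8), (0.9, 1.1)  # two sets of target critic weights
s_next, r = 0.4, 1.0

# Target policy smoothing: add clipped Gaussian noise to the target action
eps = np.clip(rng.normal(0.0, sigma_bar), -c, c)
a_bar = actor_target(s_next) + eps

# Clipped double-Q: the smaller of the two target critics forms the target
z = r + gamma * min(critic_target(w1, s_next, a_bar),
                    critic_target(w2, s_next, a_bar))

assert abs(eps) <= c
assert z <= r + gamma * critic_target(w1, s_next, a_bar)
```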

Fault diagnosis for sensors: DRL algorithm
In order to perform detection, isolation and identification of faults concurrently occurring on multiple sensors, a model-free DRL agent is designed. Although the application of the proposal to an intrinsically redundant robot might increase the computational complexity of the DRL, its training phase is easily performed off-line, and it has the advantage of overcoming possible issues due to the kinematic inversion, which could lead to infinite solutions.
Remark 1. Note that, if the fault detection concerned only sensor faults, one could use a model based approach when the model is available. For example, one could apply the control signal τ both to the robot and to the model (2) and evaluate the differences in the output signals. In our case, the possible simultaneous presence of actuator faults and sensor faults requires a method to decouple the effects of the two types of faults in order to allow for the fault diagnosis activity. ▿

Making reference to the previous section, the environment is represented by the state space S given by

s = [ (p^e_0)^⊤  q_F^⊤ ]^⊤ ∈ S,  (17)

depending on the end-effector position expressed in the base reference frame and on the joint variable vector affected by faults, as defined in Section 2.2. Note that assuming the position of the end-effector to be known as a state is quite reasonable in practice, because it can be recovered either in simulation or by using, for instance, a stereo camera. The objective of the agent is therefore to provide at each time step an estimate Δq̂ ∈ R^{n_q} of the sensor faults. The bounded action space A for the DRL agent is instead defined by

a = Δq̂ = [ Δq̂_1 … Δq̂_{n_q} ]^⊤ ∈ A,  (18)

with Δq̂_i being the estimate of the fault on the ith sensor. Now, the following assumptions need to be introduced.
Assumption 5 (Boundedness of the sensor fault estimation error). The sensor fault estimation error Δq − Δq̂ is bounded, that is, there exists E > 0 such that

||Δq(t) − Δq̂(t)|| ≤ E, ∀t ≥ 0.  (19)

Assumption 6 (Relationship between sensor fault estimation errors and end-effector position). The sensor fault estimation errors depend on the actual end-effector position through some unknown static function. This implies that, when the end-effector is in a certain position, the associated sensor errors always take the same value.
Note that Assumption 6 is always true for constant sensor faults.
In order to train the agent, exploiting a realistic simulator of the robot, different sensor fault signals Δq ∈ R^{n_q} are injected into the system, and the reward function, used to assign a value to the action, is designed as

r = −r_s,  (20)

where r_s is as in (5b), while the minus sign indicates that the reward is treated as a penalty.
During the training phase, the simulated robot is asked to follow random trajectories using the closed-loop scheme. At each time step, data from the sensors are retrieved and constant faults Δq_i, i = 1, …, n_q, with random amplitude are added. The result is fed into the DNN together with the end-effector pose; the DNN generates the estimates of the faults and, depending on the reward, updates its weights. Once the training phase is completed, the weights of the DNN are kept constant for the online test phase. Note that, since the auxiliary control y does not belong to S, it does not affect the estimates of the sensor faults.
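The fault-injection step of the training phase can be sketched as follows; the joint pose is a dummy value, while the fault amplitudes match the range used later in the case study.

```python
# Sketch of the per-episode fault injection used during training: a random
# constant bias is added to every joint reading before it reaches the agent.
import numpy as np

n_q = 7
rng = np.random.default_rng(2)

# Random constant sensor faults for one episode, one per joint (rad)
dq = rng.uniform(-0.35, 0.35, size=n_q)

def faulty_reading(q: np.ndarray) -> np.ndarray:
    """Corrupted measurement q_F = q + dq fed to the DRL agent."""
    return q + dq

q_true = np.array([0.0, -0.78, 0.0, -2.36, 0.0, 1.57, 0.78])  # a sample pose
q_F = faulty_reading(q_true)

# The bias is constant along the episode, in line with Assumption 6
assert np.allclose(q_F - q_true, dq)
assert np.all(np.abs(dq) <= 0.35)
```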
Once the DRL agent provides an estimate Δq̂, it is possible to perform both detection and isolation by building a vector f_s ∈ R^{n_q}, which contains a flag for each sensor component, that is,

f_{s,i} = 1 if Δq̂_i ∉ [Q̲_{s,i}, Q̄_{s,i}],  f_{s,i} = 0 otherwise,  (21)

where Q̲_{s,i} and Q̄_{s,i} are constants indicating the ith lower and upper thresholds, respectively. Note that, in an ideal scenario, such thresholds are equal to zero, but noise on the sensors and on the vision system must be taken into account. Moreover, by analyzing the time evolution of the signals Δq̂, it is possible to perform fault identification, thus compensating the corrupted signals in order to close the control loop. It is worth highlighting that this fault diagnosis on the sensors is independent of the actuator faults. In other words, by virtue of the DRL mechanism, the sensor faults can be compensated, thus improving the fault diagnosis of the actuator failures. Specifically, letting q_c be the joint position vector used to close the feedback loop, by virtue of the DRL based identification one has

q_c = q_F − Δq̂.  (22)

Remark 2. Note that the previous definition of the corrected joint variables q_c means that, if no fault occurs, its ith component is equal to the actual position of the robot joint; otherwise, q_{c,i} is the corrected position after the compensation of the ith sensor fault. Relying on the control scheme described in Section 3, such a value is used to close the inner inverse dynamics loop and the outer feedback loop. If the DRL estimate Δq̂_i differs from the actual fault Δq_i, since by Assumption 5 the estimation error is bounded, the controlled system becomes the one in (9), and, as mentioned in Section 3, it is assumed that a suitable robust controller is designed to guarantee the closed-loop stability even in this case. ▿
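A sketch of the threshold-based detection/isolation logic and of the joint-position compensation follows; the threshold value mirrors the one used later in the case study, while the readings and estimates are invented for illustration.

```python
# Sketch of sensor fault detection/isolation (flag per joint) and of the
# compensation q_c = q_F - dq_hat that closes the loop; data are illustrative.
import numpy as np

n_q = 7
Q_low, Q_high = -0.0262 * np.ones(n_q), 0.0262 * np.ones(n_q)

def detect_and_compensate(q_F, dq_hat):
    # flag raised when the ith estimate leaves the band [Q_low[i], Q_high[i]]
    f_s = ((dq_hat < Q_low) | (dq_hat > Q_high)).astype(int)
    q_c = q_F - dq_hat  # corrected joint positions
    return f_s, q_c

q_F = np.array([0.1, -0.811, 0.0, -2.36, 0.0, 1.685, 0.78])     # faulty readings
dq_hat = np.array([0.0, -0.031, 0.0, 0.0, 0.0, 0.115, 0.0])     # DRL estimates
f_s, q_c = detect_and_compensate(q_F, dq_hat)

assert f_s.tolist() == [0, 1, 0, 0, 0, 1, 0]  # joints 2 and 6 flagged
assert np.allclose(q_F - q_c, dq_hat)
```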

Fault diagnosis for actuators: SOSM UIOs
In order to design the battery of observers aimed at the fault diagnosis on the actuators, a canonical form for the state model of the plant has to be introduced. This step indeed enables to describe the dynamics of the estimation error in a form suitable for SMC design. Therefore, exploiting the nominal linearized model of the robot in (7) when no fault occurs, and using the corrected joint variable measures q_c, such that q̈_c = y + Δy, one can design a battery of observers, one for each joint, in order to estimate possible actuator faults. Define the UIO model as

q̂̈_c = y + u,  (23)

where q̂_c ∈ R^{n_q} is the state of the observer and u ∈ R^{n_q} is the observer input. Consider e_{1,i} = q_{c,i} − q̂_{c,i} and e_{2,i} = ė_{1,i}, for any i = 1, …, n_q. We are now interested in designing a sliding mode based observer capable of enabling an optimal reaching time of the estimation error, which has to be steered to zero, and a minimum time convergence whenever the worst realization of the uncertain terms occurs. Making reference to Reference 37, a SOSM law is hereafter designed. Define the sliding variable as a linear combination of the estimation errors, that is,

σ_{1,i} = e_{2,i} + μ e_{1,i},  (24)

where μ > 0 is constant. By computing the time derivative of the sliding variable σ_{1,i}, relying on (23), the relative degree with respect to the observer input is equal to 1, so that a first order sliding mode law would apply. Since our goal is to design a SOSM law according to Reference 37, we introduce an auxiliary system with relative degree 2, given by

σ̇_{1,i} = σ_{2,i},
σ̇_{2,i} = f(e_{2,i}, q̈_{c,i}, ẏ_i) + v_i,  (25)

where v_i is the new observer input, while f(e_{2,i}, q̈_{c,i}, ẏ_i) is a drift term collecting the uncertain contributions, including Δẏ_i. Due to the mechanical nature of the robotic system, the following assumption holds.
Assumption 7 (Boundedness of the auxiliary drift term). The drift term f(e_{2,i}, q̈_{c,i}, ẏ_i) is bounded, that is,

|f(e_{2,i}, q̈_{c,i}, ẏ_i)| ≤ F_i,  (26)

where F_i > 0 is the bound associated with the ith joint.
The auxiliary control input v_i is therefore designed as

v_i = −α_i sign( σ_{1,i} + (σ_{2,i} |σ_{2,i}|) / (2 α_{r,i}) ),  (27)

where α_i > F_i are the control gains, while α_{r,i} are the so-called reduced control amplitudes, defined as

α_{r,i} = α_i − F_i,  (28)

that is, the minimum control amplitudes needed to cope with the worst realization of the uncertain drift terms. The following result proves the minimum time convergence of the sliding variables to zero.
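The behavior of an optimal-reaching SOSM law of this kind can be checked numerically on the auxiliary double-integrator dynamics; the gains mirror those of the case study (α = 150, F = 50), while the drift realization and the initial condition are illustrative.

```python
# Numerical sketch of a second-order sliding mode law with optimal-reaching
# switching curve, driving (sigma_1, sigma_2) of a perturbed double
# integrator to the origin despite a bounded drift |f| <= F.
import numpy as np

alpha, F = 150.0, 50.0          # gain alpha > F, drift bound
alpha_r = alpha - F             # reduced control amplitude
dt, steps = 1e-5, 100_000
s1, s2 = 1.0, 0.5               # illustrative initial (sigma_1, sigma_2)

for k in range(steps):
    # switching function built on the minimum-time curve
    sw = s1 + s2 * abs(s2) / (2.0 * alpha_r)
    v = -alpha * np.sign(sw)
    f = F * np.sin(40.0 * k * dt)   # some bounded drift realization, |f| <= F
    s1 += dt * s2                   # sigma_1_dot = sigma_2
    s2 += dt * (f + v)              # sigma_2_dot = f + v

# both auxiliary states end up in a small neighborhood of the origin
assert abs(s1) < 1e-2 and abs(s2) < 5e-2
```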
Proposition 1 (Convergence). Consider system (2) with the inverse dynamics law (6) based on the measurements (22) provided by the DRL mechanism. If Assumption 7 holds and f(e_{2,i}, q̈_{c,i}, ẏ_i) = F_i sign(v_i), then each of the components σ_{1,i}(t) of the auxiliary system (25), controlled by (27), is steered to zero in minimum time.
Proof. The proof of the minimum time convergence to the origin of the plane {σ_{1,i}, σ_{2,i}} directly follows from References 44 (Th. 2) and 45 (Th. 2). Consider, for the sake of simplicity, an initial condition (σ_{1,i}(0), σ_{2,i}(0)) with σ_{2,i}(0) > 0 (all the other symmetric or specular cases are analogous). Since the curve

σ_{1,i} + (σ_{2,i} |σ_{2,i}|) / (2 α_{r,i}) = 0

corresponds to the minimum time curve in the nominal case, one has to prove that, in the case of the worst realization of the uncertain terms, the auxiliary state trajectory under the control law (27) follows this curve in an equivalent sense. In fact, computing the vector field above the curve, one has σ̇_{1,i} = σ_{2,i} > 0 and σ̇_{2,i} ≤ −α_{r,i} < 0, that is, the trajectory moves towards the curve while pointing downward. When the curve is reached, the control sign changes and the vector field becomes σ̇_{1,i} = σ_{2,i}, σ̇_{2,i} ≥ α_{r,i} > 0, with σ_{2,i} < 0. As a consequence, the trajectory is always tangent to the curve, and the state moves towards the origin in minimum time. ▪

The previous result guarantees the finite time convergence to zero of the sliding variables and their derivatives, in turn implying that the error signals e_{1,i}(t) and e_{2,i}(t) exponentially decay to zero constrained to σ_{1,i}(t) = 0, ∀t ≥ t̄, with t̄ ≥ 0 being the convergence time. The following proposition shows instead that, once in sliding mode σ_{1,i} = σ_{2,i} ≡ 0, the equivalent control, namely ũ_i, allows one to identify the ith actuator fault.
Proposition 2 (Identification). Consider system (2) with the inverse dynamics law (6) based on the measurements (22). If the sliding mode σ_{1,i} = σ_{2,i} ≡ 0 is enforced, then the equivalent control ũ asymptotically identifies the actuator fault, in turn implying that r_a tends to zero.

Proof. In order to prove the proposition, we need to exploit the concept of equivalent control in Reference 15. More precisely, the ith element of the equivalent control signal ũ_i can be computed by solving σ̇_{1,i} = 0, that is,

Δy_i − ũ_i + μ e_{2,i} = 0,

so that one has

ũ_i = Δy_i + μ e_{2,i}.

Since, by virtue of Proposition 1, enforcing the sliding mode σ_{1,i} = σ_{2,i} ≡ 0 implies that the error e_{2,i} exponentially goes to zero, it follows that ũ_i tends to Δy_i. Therefore, given Δy = B^{-1}Δτ, so that ũ tends to B^{-1}Δτ, as t goes to infinity one has Δτ − Δτ̂ = 0, in turn implying r_a = 0, which concludes the proof. ▪

Making reference to the previous proposition, a practical way to obtain the equivalent control is to apply a low-pass filter to the discontinuous control input (27). However, in our case, there is no need to add a filter because, although the auxiliary control signal v_i(t) is discontinuous, the observer input u_i(t) is continuous and equal to

u_i(t) = ∫_0^t v_i(ζ) dζ.  (33)

This means that the filtering operation is performed by the integrator in (33), and it is possible to conclude that the signal u_i(t) can be used as an estimate of the actual actuator fault Δy_i(t).
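The filtering role played by the integration of the discontinuous signal can be illustrated on a first-order analogue, with a hypothetical scalar sliding variable and a constant disturbance standing in for the fault: averaging the switching injection recovers the disturbance, in the spirit of the equivalent control.

```python
# First-order analogue of equivalent-control extraction: a discontinuous
# injection w = K sign(s) enforces sliding on s = 0, and its low-pass
# average reconstructs the (hypothetical, constant) disturbance d.
import numpy as np

K, d = 2.0, 0.7        # switching amplitude K > |d|; d plays the fault's role
dt, steps, tau_f = 1e-4, 20_000, 0.02
s, w_f = 0.5, 0.0      # sliding variable and filtered injection

for _ in range(steps):
    w = K * np.sign(s)             # discontinuous injection
    s += dt * (d - w)              # s_dot = d - w: s is steered to zero
    w_f += dt * (w - w_f) / tau_f  # first-order low-pass filter

# after the transient, the filtered signal reconstructs d
assert abs(s) < 1e-3
assert abs(w_f - d) < 0.1
```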
Analogously to the sensor fault diagnosis, it is possible to perform the actuator fault detection and isolation by defining a vector f_a ∈ R^{n_q} such that

f_{a,i} = 1 if u_i ∉ [Y̲_{a,i}, Ȳ_{a,i}],  f_{a,i} = 0 otherwise,  (34)

with Y̲_{a,i} and Ȳ_{a,i} being the ith lower and upper thresholds, respectively.
Remark 3. In order to tune the ith SOSM UIO, the knowledge of the bound F_i of the drift term and of the reduced control amplitude α_{r,i} is required. The former can be estimated through data retrieved directly from experiments on the system. Then, the reduced control amplitude α_{r,i} can be computed according to (28), selecting α_i > F_i. As for the residual thresholds in (34), in order to avoid intermittent detections when oscillating faults occur, possible practices are the reduction of the thresholds (always taking into account the bounds on the modeling mismatches, in order to avoid false detections), or the adoption of a waiting time between two subsequent detection instants (in this second case one has to accept possible misdetections). ▿

CASE STUDY
In this section, the performance of the proposed FD scheme is discussed, relying on simulations carried out in the PyBullet environment, a module for physics simulation, robotics and DRL algorithms developed on top of the Bullet Physics SDK. In particular, sensor and actuator faults have been injected into a virtualized model of the 7 DOFs Franka Emika Panda robot manipulator, which has to reach a specific point in the workspace (see Figure 4).

DRL training
As discussed in Section 4.2, the DRL algorithm executing the sensor fault diagnosis is based on the TD3 agent, while the state space S and the action space A are defined as in (17) and (18). The training phase is divided into episodes; each episode has a fixed duration of 5 s, such that at each time step the simulation advances by 4.2 ms, and a vector of random constant sensor faults Δq is injected into the system. Specifically, such faults belong to the set {Δq_i ∈ R | −0.35 ≤ Δq_i ≤ 0.35}, for all i = 1, …, n_q. Notice that the choice of considering constant sensor faults, even if the proposed approach can also deal with more general cases, is motivated by the fact that constant signals are commonly used as sensor faults to emulate a malfunction given by a bias in different applications; see, for example, Reference 46 for a classification of faults in robotics. The reward function is defined as in (20). Both the actor and the critics are approximated using multilayer perceptron (MLP) artificial neural networks, with an input layer composed of 10 neurons, 2 hidden layers composed of 64 neurons each, and an output layer with 7 neurons. Figure 5 illustrates the average cumulative reward R_t during the training phase. One can notice that the reward is maximized, as expected, meaning that the agent learns how to perform the fault diagnosis whenever a sensor failure occurs.

5.2 Simulation results
In order to assess the proposal, the behavior of the whole FD scheme in Figure 3 has been tested in the scenarios FE3, FE4, and FE5 reported in Table 1. The initial values of the sensor estimates are set equal to the sensor readings, while the actuator fault estimates are initialized to zero. The gains of the SOSM UIOs are selected as 150, that is, greater than F_i = 50, i = 1, …, n_q, with the remaining observer parameter set equal to 1. The adopted control scheme consists of the inverse dynamics inner loop discussed in Section 3, while the outer-loop controller is the one selected by default in the simulator of the considered Franka Emika Panda robot manipulator, that is, a PD controller with gains K_P = diag(3, 3, 3, 3, 3, 3, 3) and K_D = diag(2, 2, 2, 2, 2, 2, 2). The sampling time is selected as 1 ms.
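The two-loop structure described above can be sketched as follows. The gains K_P and K_D are those reported in the text; the function and argument names are illustrative, and the inertia matrix and the Coriolis/gravity terms are assumed to be available (e.g., from the simulator).

```python
import numpy as np

# Outer-loop PD gains reported in the text.
K_P = np.diag([3.0] * 7)
K_D = np.diag([2.0] * 7)

def outer_loop_pd(q_des, q, qd_des, qd):
    """PD outer loop: produces the virtual acceleration command that is
    fed to the inverse-dynamics inner loop."""
    return K_P @ (np.asarray(q_des) - q) + K_D @ (np.asarray(qd_des) - qd)

def inner_loop_inverse_dynamics(M, n, a_cmd):
    """Inverse-dynamics inner loop: tau = M(q) a + n(q, qdot), where M
    is the inertia matrix and n collects the Coriolis, centrifugal, and
    gravity terms (both assumed supplied by the robot model)."""
    return M @ a_cmd + n
```

In the nominal case, this feedback linearization reduces each joint to a double integrator driven by the PD law, consistent with the chain-of-double-integrators structure exploited by the observers.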

5.2.1 Faults on multiple actuators (scenario FE3)
In this scenario, the faulty actuators are those of joints 1, 3, 4, and 6, so that Δy = [4 sin(6t), 0, 7 sin(10t), 3 sin(2t), 0, 8 sin(4t), 0]^⊤, which can be interpreted, for instance, as a classical bearing failure (see, e.g., Reference 46 for a brief overview of fault characteristics analysis). The faults occur at different time instants: the fault on joint 1 occurs at t = 1.5 s, the one on joint 3 at t = 2 s, the one on joint 4 at t = 1 s, and the one on joint 6 at t = 3.5 s. Figure 6 shows that the battery of SOSM UIOs is capable of identifying the actuator faults acting on the corrupted joints in finite time. The lower and upper thresholds in (34) are selected as −0.5 and 0.5, respectively, in order to take into account possible measurement noises or disturbances. The root-mean-square (RMS) estimation error has then been computed for all the considered actuators, and the results are reported in Table 2.
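The fault profile of scenario FE3 can be reproduced as follows; the amplitudes, frequencies, and onset times are those listed above, while the function name and the convention of a zero value before each onset are illustrative.

```python
import numpy as np

# Amplitude [rad/s^2], angular frequency [rad/s], and onset time [s] of
# each actuator fault in scenario FE3, keyed by (1-indexed) joint number.
FAULTS = {1: (4.0, 6.0, 1.5),
          3: (7.0, 10.0, 2.0),
          4: (3.0, 2.0, 1.0),
          6: (8.0, 4.0, 3.5)}

def actuator_fault(t):
    """7-dimensional actuator fault vector Delta_y(t): each component is
    zero before its onset time and sinusoidal afterwards."""
    dy = np.zeros(7)
    for joint, (amp, freq, t_on) in FAULTS.items():
        if t >= t_on:
            dy[joint - 1] = amp * np.sin(freq * t)
    return dy
```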

5.2.2 Faults on multiple sensors (scenario FE4)
In this scenario, we assess the effectiveness of the proposed DRL algorithm. A constant fault equal to −0.0309 rad is first injected into the sensor of joint 2, while a constant fault equal to 0.1149 rad is added to the measurement of joint 6. The lower and upper thresholds in (21) are selected as −0.0262 and 0.0262 rad, respectively, in order to take into account possible measurement noises or disturbances.
The results, illustrated in Figure 7, are satisfactory, and the RMS estimation errors reported in Table 3 confirm the effectiveness of the proposed approach.
Then, joints 1, 4, and 5 are also considered to be affected by sensor faults, equal to 0.134, −0.202, and −0.08 rad, respectively. The results are illustrated in Figure 8, and the corresponding RMS estimation errors are indicated in Table 3.
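The RMS estimation error used in Tables 2-4 can be computed as follows; since the averaging window is not specified in the text, the full recorded trajectory is assumed here.

```python
import numpy as np

def rms_error(estimate, actual):
    """Root-mean-square error between an estimated and an actual fault
    signal, sampled over a trajectory."""
    e = np.asarray(estimate, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))
```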
It is worth highlighting that, although the estimated values slightly deviate from the actual ones, the proposed approach has the merit of performing a completely model-free fault diagnosis, using only the measurements of the end-effector position and the sensor data, and without compromising the closed-loop stability of the controlled system.

5.2.3 Faults on multiple sensors and actuators (scenario FE5)

In this scenario, both actuator and sensor faults can occur simultaneously on the same joint. Specifically, joint 1 is affected by a constant sensor fault equal to −0.3 rad and by a sinusoidal actuator fault equal to 3 sin(8t) rad s⁻², which occur at t = 1.5 s and t = 1 s, respectively. Moreover, joint 7 is affected only by a sinusoidal actuator fault equal to 5 sin(12t) rad s⁻², which occurs at t = 3 s. The time evolution of the estimated faults is depicted in Figure 9. It is possible to observe that, also in this more complex scenario, the proposed control scheme is capable of distinguishing among the fault components and of satisfactorily identifying all the faults. These results are also confirmed by the RMS estimation errors reported in Table 4.
Finally, maintaining the same parameter setting, we consider a scenario in which no fault occurs on joints 1 and 7. The results, presented in Figure 10, highlight that, in the absence of faults, no false alarm is generated by the proposed fault detection approach. Note that all the results presented in this subsection have been obtained in the presence of a random measurement noise, sampled uniformly between −0.1 deg and 0.1 deg, in order to assess the performance of our proposal in a more realistic setting. As evident from Figures 9 and 10, even in the presence of measurement noise on the joint sensors, the fault detection approach proves satisfactorily robust in detecting both sensor and actuator faults.
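The measurement noise described above can be reproduced as follows; the uniform ±0.1 deg range comes from the text, while the function name and the random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_measurement(q):
    """Corrupt a joint reading with uniform noise in [-0.1, 0.1] deg,
    converted to rad, matching the setting described in the text."""
    noise = rng.uniform(-0.1, 0.1, size=np.shape(q)) * np.pi / 180.0
    return np.asarray(q, dtype=float) + noise
```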

FIGURE 1 Franka Emika Panda robot manipulator: setup (a); schematic view of a planar manipulator with three joints (b).

TABLE 1 Possible FEs during the task execution.

FIGURE 3 The proposed FD scheme for the considered Franka Emika Panda robot manipulator.

FIGURE 6 Time evolution of the actual and estimated actuator faults when joints 1, 3, 4, and 6 are subject to failures in scenario FE3: (a) AF joint 1; (b) AF joint 3; (c) AF joint 4; (d) AF joint 6.

FIGURE 7 Time evolution of the actual and estimated sensor faults when joints 2 and 6 are subject to failures in scenario FE4: (a) SF joint 2; (b) SF joint 6.

TABLE 3 RMS estimation errors for scenario FE4.

FIGURE 8 Time evolution of the actual and estimated sensor faults when joints 1, 4, and 5 are subject to failures in scenario FE4: (a) SF joint 1; (b) SF joint 4; (c) SF joint 5.

FIGURE 9 Time evolution of the actual and estimated faults on both sensors and actuators when joints 1 and 7 are subject to failures in scenario FE5: (a) SF joint 1; (b) AF joint 1; (c) AF joint 7.

TABLE 4 RMS estimation errors for scenario FE5 (columns: joint #; RMS of the sensor fault estimation error [rad]; RMS of the actuator fault estimation error [rad s⁻²]).

ALGORITHM 1 TD3 training procedure (fragment): initialize the two critic networks and the actor network with random parameters; initialize the corresponding target networks; initialize the replay buffer; set the episode counter to 1 and the iteration counter to 0; then iterate over the episodes and, within each episode, over the time steps.

FIGURE 4 Laboratory setup (left) and virtual robot in the PyBullet environment (right).

FIGURE 5 Average cumulative reward R_t during the training phase.