Online reinforcement learning control via discontinuous gradient

This work proposes a reinforcement learning control scheme for systems affected by persistent external perturbations. The scheme combines ℋ∞ and high-order sliding mode control techniques to estimate the parameters with a certain degree of precision while simultaneously attenuating persistent and state-dependent perturbations. The proposed solution is a novel design technique based on a minimization method via the Discontinuous Gradient. The stability of the proposed scheme is proved via the Lyapunov approach.

Various articles use adaptive or robust control techniques to achieve the mentioned objective. To name a few, in the work of Cao et al.,9 a model of the infectious spread of COVID-19 is established, and estimates of the unknown parameters are obtained from an adaptive law; however, the model does not include perturbations. Moreover, in Yala et al.,10 an adaptive controller for linear systems was designed, nevertheless without considering external perturbations. In these works, the persistent excitation condition is necessary to ensure stability in the Lyapunov sense.11 Furthermore, in Varga et al.,12 an adaptive-gain PID is presented with a scheme alternative to the Lyapunov-based one; however, it is limited to unperturbed mechanical systems.
On the other hand, recent papers on adaptive control deal with perturbed systems. Just to cite a few, Xiao et al.13 and Cao et al.14 describe a sliding mode (SM) control for attitude tracking of an aircraft with uncertain inertia. Besides, Edwards and Shtessel15 proposed a super-twisting (ST) control algorithm with adapted gains to reject a vanishing disturbance with unknown bound. In the work of Cai et al.,16 an adaptive control method was implemented to estimate the unknown parameters under perturbation. In the paper of Shtessel et al.,17 a pseudo-continuous control is described, showing that the parameter errors are driven to zero in finite time in the case of nonvanishing perturbations. Furthermore, in Furtat and Chugina,18 a control with an adaptive loop method and SM was designed for linear systems. In general, a disadvantage of these works is that they require the perturbation to satisfy the matching condition.
The case of systems with parameter uncertainties and both matched and unmatched perturbations has been studied in some recent works. For example, in the work of Haibin et al.,19 an adaptive backstepping control combined with Levant's exact differentiator is used to estimate the perturbation for nonlinear systems. Its limitation is that the perturbation must be modeled as a polynomial function. Also, in the paper of Golestani et al.,20 a solution for non-polynomial perturbations is proposed, although applicable only to second-order systems. Alternatively, Yang et al.21 presented a control for nth-order systems, valid only in the linear case.
To attenuate the unknown perturbations optimally, previous works in the nonlinear ℋ∞-control literature present design techniques for perturbed systems. For example, Loukianov et al.22 combine SMs, to reject the matched perturbation, with the ℋ∞ algorithm for the unmatched disturbance. However, this control scheme assumes known parameters. In the work of Sun et al.,23 a nonlinear ℋ∞ SM controller is presented, making use of the nominal system parameters. On the other hand, in the work of Fu et al.,24 an ℋ∞ control with variable gains to estimate uncertainties is applied to polynomial systems with time-varying parameters and additive perturbations. Again, the nominal values of the system parameters need to be known.
An alternative approach involves using AI techniques such as neural networks (NN) or deep learning in conjunction with control methodologies. Several notable works have explored this avenue: in Li et al.,25 a scheme is described where nonlinearities are adapted using NN; in Wang et al.,26 a control with NN is described for a very specific type of systems; Zhuang27 deals with the attitude control of a drone by adapting the parameters with NN. However, these works do not consider perturbations in their models. On the other hand, Ning et al.28 and Liu et al.29 describe a control with reinforcement learning (RL) using NN for a missile and a predator-prey system, respectively, considering perturbations, but these are matched. In Verrelli et al.,30 an adaptive control method tailored for nonminimum phase systems with periodic reference signals is introduced. Nevertheless, this method does not establish stability guarantees for perturbed systems and is limited to periodic reference signals. Perez-Villalpando et al.31 present a robust control technique that addresses parametric variations, incorporating NN to predict and mitigate external disturbances. However, it is worth highlighting that this approach demands a substantial historical dataset for training the NN, which can be challenging to obtain in practice.
In contrast, building on the preceding discussion, the contribution of this paper stems from a novel approach that combines elements of RL algorithms with ℋ∞ control techniques. The primary objective is to devise an effective adaptive control scheme through RL for single-input single-output nonlinear systems, particularly those affine in the input. The unique aspect of this contribution lies in developing an algorithm that operates independently of the tracking error. It achieves this by employing a discontinuous function based on the gradient of a predefined cost function that depends on the adaptive parameter error. This approach fills a notable gap in the current literature, as the algorithm can approximate the system parameters with a certain degree of precision for single-input single-output affine nonlinear systems with unknown parameters while dealing with matched and unmatched perturbations. The Actor policy, defined as the Discontinuous Gradient (DG), first involves designing a state estimator using the ST algorithm. Then, the cost function of the Critic is proposed in terms of the equivalent control of the ST. Once the gradient of this function with respect to the adaptation error is calculated, the adaptive policy is proposed. To mitigate the action of external perturbations, the nonlinear ℋ∞ technique is used to attenuate the chattering effect. This algorithm guarantees finite-time stability of the adaptive errors despite external unmatched perturbations. The works of Busoniu et al.32 and Kamalapurkar et al.33 classify the DG algorithm as an RL control, that is, an adaptive control using optimal feedback, specifically a policy gradient.
The paper is organized as follows. In Section 2, the problem to be studied is stated. In Section 3, an RL control scheme is described for an unperturbed system. In Section 4, the control scheme is extended to perturbed systems using the nonlinear ℋ∞ algorithm to attenuate the perturbation. In Section 5, the simulation results of the novel design technique are shown. Finally, some conclusions are drawn.

PROBLEM STATEMENT
Consider an uncertain nonlinear system of the form (1), where x ∈ Dx ⊂ ℝn is the system state vector, u ∈ ℝ is the control, y ∈ ℝ is the output, w ∈ ℝp is the disturbance, and θx, θu are unknown constant parameters. The control objective is to design a control scheme which achieves tracking of a smooth reference signal yr despite the uncertainty in θx, θu. To this end, the following assumptions are fundamental.
Assumption 1. The state x of system (1) is measurable.
Assumption 2. The relative degree of system (1) is well defined and equal to n.
Assumption 3. The disturbance w and its derivative ẇ are bounded in terms of known positive real constants c1, c2, and c3.
The scheme proposed in this paper is shown in Figure 1. In the following, the adaptive control design problem is first solved for the unperturbed case, that is, when w = 0. Afterwards, the solution is extended to the case of a non-zero disturbance.

State and parameter estimator design: actor-critic
Let w = 0 in (1). To calculate the estimates θ̂x, θ̂u of θx, θu, a fictitious estimate x̂ of the system state is implemented as in (4), where ve is designed with the well-known super-twisting algorithm3 as in (5), e = x − x̂ is the state estimation error, and λ, α > 0 are the estimator gains. Furthermore, sign(e): ℝn → ℝn is defined as the vector ( sign(e1), … , sign(en) )ᵀ. From (1), (4), and (5), one works out the state error dynamics (6), where θ̃ = θ − θ̂ is the parameter estimation error. Since the term Γ(t, x, u)θ̃ is what drives the estimation error, it is natural to assume that e bounds this product in the Critic, that is, ||Γ(t, x, u)θ̃|| ≤ γ1||e|| for a known scalar γ1 > 0. Then, as proved in Proposition 1 of Nagesh et al.,4 there exist gains λ, α in (5) such that e and ė converge to zero in finite time and remain at zero for all subsequent time instants. When e ≡ 0, one obtains the so-called equivalent input
ve = veq := Γ(t, x, u)θ̃, (7)
which can be used to define the scalar quadratic function (8) that will be used to design a feedback law for θ̂. The following lemma is instrumental in obtaining the desired Actor-Critic policy in the absence of the disturbance w.
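Since equations (4)–(6) are not reproduced in this excerpt, the following scalar Python sketch illustrates the estimator mechanism only, under stated assumptions: the term Γ(t, x, u)θ̃ is frozen to a constant d, so the error dynamics reduce to ė = d − ve, and a standard super-twisting injection (gains λ, α and all numerical values are illustrative) drives e to zero, after which the integral state recovers the equivalent input veq = d.

```python
import numpy as np

def simulate_st_estimator(d=1.5, lam=3.0, alpha=2.0, dt=1e-4, T=2.0):
    """Scalar sketch: error dynamics e' = d - v_e, where d stands for the
    (here constant) term Gamma(t, x, u) * theta_tilde. The super-twisting
    injection v_e drives e, e' to zero in finite time; afterwards the
    integral state z equals the 'equivalent input' v_eq = d."""
    e, z = 1.0, 0.0                            # initial error and integral state
    for _ in range(int(T / dt)):
        v_e = lam * np.sqrt(abs(e)) * np.sign(e) + z
        z += alpha * np.sign(e) * dt           # discontinuous (integral) term
        e += (d - v_e) * dt                    # error dynamics, Euler step
    return e, z

e_final, v_eq = simulate_st_estimator()
print(e_final, v_eq)
```

After the sliding phase is reached, `v_eq` approximates the constant `d`, which is the mechanism behind using the equivalent control as the Critic signal in (7).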
Lemma 1. Consider the quadratic function (8) and the error θ̃ = θ − θ̂. Then, from the definition of V(θ̃), and considering an increment ε ≪ 1, one works out (9), where it has been taken into account that −(1 − ε)zᵀΓᵀΓz < 0. Hence, for ε → 0, one obtains the claimed expression, and the lemma is verified. ▪
The following result ensures the finite-time stability of the parameter estimation error θ̃.

Proof. Differentiating the Lyapunov function Vθ = θ̃ᵀθ̃ and applying Lemma 1, one obtains a bound which implies that θ̃ goes to zero in finite time. ▪
Finally, taking Λ = diag{λi} with i ∈ {1, 2, … , q}, the feedback law (9) becomes (10).
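Equations (9)–(10) are not reproduced in this excerpt, so the following toy Python sketch only illustrates the mechanism under stated assumptions: the Critic is taken as V = ½ veq², with veq = Γθ̃ the ideal equivalent input from (7), its gradient with respect to the parameter error is Γᵀveq, and the Actor is assumed to move θ̂ along Λ sign of that gradient. The regressor Γ(t) and all gains are illustrative.

```python
import numpy as np

def dg_actor_demo(dt=1e-3, T=8.0, lam=3.0):
    """Toy Discontinuous Gradient (DG) actor. Assuming ideal sliding in the
    estimator, v_eq = Gamma @ theta_tilde; the Critic is V = 0.5 * v_eq**2,
    whose gradient w.r.t. the parameter error is Gamma * v_eq. The actor
    update moves theta_hat along Lambda @ sign(gradient)."""
    theta_true = np.array([1.0, 0.5])          # illustrative unknown parameters
    theta_hat = np.zeros(2)
    Lam = lam * np.eye(2)                      # Lambda = diag{lambda_i}
    v_hist = []
    t = 0.0
    for _ in range(int(T / dt)):
        Gamma = np.array([1.0, np.sin(t)])             # example regressor
        v_eq = Gamma @ (theta_true - theta_hat)        # ideal equivalent input
        v_hist.append(v_eq)
        grad = Gamma * v_eq                            # gradient of the Critic
        theta_hat = theta_hat + Lam @ np.sign(grad) * dt  # discontinuous step
        t += dt
    return np.array(v_hist)

v_hist = dg_actor_demo()
```

The sketch drives the Critic residual veq into a small neighborhood of zero; exact finite-time parameter convergence relies on the conditions established in the paper, which this toy example does not reproduce.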

Control design
Consider the diffeomorphism (11),34 where the operator ℒfn h denotes the nth Lie derivative of h along f. Then, under Assumption 2, one works out (12). Hence, the proposed control is of the form (13), where K is the matrix which stabilizes the origin of the closed-loop system (12), (13), given by (14), with the corresponding mappings. Notice that, due to the finite-time convergence of θ̂x, θ̂u, the initial conditions θ̂x(0), θ̂u(0) can be chosen such that h(x) in (13) is different from zero for all x, θ̂x, θ̂u in some neighborhood of the equilibrium point x0.
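To make the feedback-linearization step concrete, the following Python sketch applies the same idea to an illustrative second-order pendulum-like system with known parameters (not the paper's plant): the control cancels ℒf²h and imposes stable error dynamics on the tracking error, as the control (13) does for the transformed system (12).

```python
import numpy as np

def simulate_tracking(dt=1e-3, T=10.0, k1=4.0, k2=4.0):
    """Feedback-linearizing tracking for the illustrative system
    x1' = x2, x2' = -sin(x1) - 0.5*x2 + u, y = x1 (relative degree 2).
    The control cancels Lf^2 h and imposes e'' + k2 e' + k1 e = 0
    on the tracking error e = y - yr."""
    x = np.array([1.0, 0.0])
    for i in range(int(T / dt)):
        t = i * dt
        yr, dyr, ddyr = 0.3 * np.sin(t), 0.3 * np.cos(t), -0.3 * np.sin(t)
        e, de = x[0] - yr, x[1] - dyr
        # u = (v - Lf^2 h) / (Lg Lf h); here Lf^2 h = -sin(x1) - 0.5*x2, Lg Lf h = 1
        u = np.sin(x[0]) + 0.5 * x[1] + ddyr - k2 * de - k1 * e
        x = x + dt * np.array([x[1], -np.sin(x[0]) - 0.5 * x[1] + u])
    return x[0] - 0.3 * np.sin(T)  # final tracking error

e_final = simulate_tracking()
print(e_final)
```

With k1 = k2 = 4 the error dynamics have a double pole at −2, so the initial error decays exponentially and only a small discretization residual remains.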
The following theorem shows the stability of the overall closed-loop system.
Proof. Substituting (10) in (14), the whole system can be rewritten in the form (15). From Theorem 1, θ̃ has a finite-time stable equilibrium point, and from Proposition 1 in Reference 4, (e, ė) = (0, 0) is a finite-time stable equilibrium point. Also, since A is Hurwitz, the first equation of (15) is ISS with respect to the input θ̃.35 Since θ̃ converges to zero in finite time, its norm admits an upper bound in terms of a class-𝒦 function, as in (16). Therefore, (17) holds with class-𝒦 function bounds. Substituting (16) in (17) and defining an auxiliary variable that satisfies the resulting inequality, with a class-𝒦 function bound, the origin of the system (15) is globally asymptotically stable. ▪

A perturbed system estimator
In the case of nonzero disturbance w, taking the same estimator (4) and the feedback law (10), the state estimation error dynamics result in (18). Proceeding as in the previous section, and choosing the same diffeomorphism (11), the control (19) stabilizes the system (20). For this system, a virtual output ζ can be chosen, where the matrices C and D are such that the pair (C, A) is observable, whereas D satisfies DᵀD = I and CᵀD = 0.36 In order to attenuate the external disturbance w, the ℋ∞-control technique will be used together with the Actor-Critic scheme previously considered in Section 3. To this end, system (20), with the virtual output ζ, can be rewritten as (21). Consider now a positive definite function V(z) satisfying the Hamilton-Jacobi-Isaacs equation (22), where Vz is the Jacobian of V(z). Hence, selecting the auxiliary control v accordingly, when z(0) = 0 and for a sufficiently small positive constant γ, the corresponding ℒ2-gain attenuation result is obtained.6 Therefore, for wΓ bounded, the equilibrium point z = 0 of system (21) is stable.35 To analyze the stability in the case of a disturbed system, the following assumption is considered.
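Equation (22) and the associated auxiliary control are not reproduced in this excerpt. For orientation only, a standard form of the HJI-based nonlinear ℋ∞ design, for dynamics ż = f(z) + g1(z)w + g2(z)v with penalized output ζ (generic symbols, assumed here and not necessarily the paper's exact notation), reads:

```latex
% HJI inequality for an L2-gain level \gamma (standard form, assumed):
V_z(z)\, f(z)
  + \tfrac{1}{4}\, V_z(z)\!\left[ \tfrac{1}{\gamma^{2}}\, g_1(z) g_1^{\top}(z)
  - g_2(z) g_2^{\top}(z) \right]\! V_z^{\top}(z)
  + \zeta^{\top}\zeta \;\le\; 0,
% attenuating control and worst-case disturbance:
v = -\tfrac{1}{2}\, g_2^{\top}(z)\, V_z^{\top}(z), \qquad
w^{*} = \tfrac{1}{2\gamma^{2}}\, g_1^{\top}(z)\, V_z^{\top}(z),
% resulting dissipation bound for z(0) = 0:
\int_{0}^{T} \|\zeta\|^{2}\, dt \;\le\; \gamma^{2} \int_{0}^{T} \|w\|^{2}\, dt .
```

This structure is consistent with the simulation section, where the auxiliary control is taken proportional to −Vzᵀ (the choice v = −( 0 0 1 )Vzᵀ) and γ plays the role of the small positive attenuation constant.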
Hence, the Lyapunov function derivative is negative definite whenever ||θ̃|| exceeds a computable threshold. Next, the ultimate bound is found from the corresponding inequality, involving a constant 0 < κ1 < 1; then, the adaptive error is uniformly ultimately bounded. Finally, let V(z) be a function that satisfies (22) and define the Lyapunov function candidate accordingly. Substituting (2) in the previous expression, and defining positive constants κ2, κ3 for which the required inequality holds, the ultimate bounds on ||z|| follow.35 Hence, the composite system (20) with estimator error (18) and the feedback law (10) has an ultimately bounded solution. ▪
On the other hand, the following result guarantees ultimate boundedness regardless of Assumption 4, in a region D close to the origin.
Theorem 4. Let Assumption 3 hold. Then, for the composed system (20), (10), (18), the estimation error is bounded and the solution z(t) is ultimately bounded.
For subsystem (10), take the same Lyapunov function candidate Vθ = θ̃ᵀθ̃, and choose a positive constant κ3 such that the perturbation term Bww is dominated. Then, the solution θ̃(t) is ultimately bounded, and the input-to-state stability of the subsystem (18), (10) is demonstrated. Finally, for V2 = V(z) + Vθ, the solution found is valid in a region D = D4 ∩ D2, where D4 = Dx × ℝn. Then, the solution (e(t), θ̃(t), z(t)) of the system (18), (10), (15) is ultimately bounded in the region D. ▪

SIMULATION RESULTS
The following examples show the performance of the proposed scheme.

RL control for an unperturbed system
Let us consider the unperturbed magnetic suspension system35 with dynamics (27), where x1 and x2 are the position (downward) and velocity of a ferromagnetic ball of mass m, and x3 is the current through an electromagnet with inductance L(x1) = L1 + L0/(1 + x1/a), where L1 and L0 are the fixed and variable inductances, respectively, and a is the inductance variation coefficient. Moreover, k is the viscous friction coefficient, g is the gravitational acceleration, R is the series resistance of the electric circuit, and u is the voltage source.
To establish a relationship between (1) and (27), L(x1) is assumed to be completely known, so that the system has relative degree 3 and an unknown constant parameter vector. The reference yr(t) = 0.2 + 0.1 sin(t) is established. Following the previous sections, one can consider the diffeomorphism from (11) and calculate the control input as in (13); the matrix stabilizing the origin of the closed-loop system is chosen as K = ( 10,000 2000 600 ). To find the unknown parameters, the estimator is defined from (4), where ve = ( ve1 ve2 )ᵀ is designed as in (5). Finally, using (10), the Actor policy results in (28). For the simulation, the system parameters and the gains of the online RL structure are shown in Table 1. As a result, the tracking and the exact parameter estimation obtained with the proposed discontinuous gradient method are shown in Figures 2 and 3. The voltage input control needed is depicted in Figure 4.

TABLE 1 System parameters and gains of the online RL structure.

FIGURE 3 Parameter approximation using the Discontinuous Gradient policy: θ̂1 (blue), k/m estimation; θ̂2 (orange), 1/m estimation; θ̂3 (green), R estimation.

FIGURE 4 Voltage input control for position tracking.

FIGURE 5 Online calculation of the Vz components for the ℋ∞-control technique.
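Since equation (27) and Table 1 are not reproduced in this excerpt, the following Python sketch encodes a common textbook-style magnetic suspension model consistent with the inductance L(x1) = L1 + L0/(1 + x1/a) given above; the force and voltage equations come from the standard coenergy derivation and all parameter values are illustrative assumptions, not Table 1. It verifies numerically that the hovering current balancing gravity at a desired position is an equilibrium of the model.

```python
import numpy as np

# Assumed textbook-style maglev parameters (illustrative, not Table 1)
m, k, g_acc, a, L0, L1, R = 0.1, 0.01, 9.81, 0.05, 0.01, 0.1, 1.0

def L_ind(x1):
    """Inductance L(x1) = L1 + L0 / (1 + x1/a), as in the text."""
    return L1 + L0 / (1.0 + x1 / a)

def maglev_rhs(x, u):
    """State: x1 position (downward), x2 velocity, x3 coil current.
    Force F = 0.5 * dL/dx1 * x3^2 (coenergy), circuit from u = R*x3 + d(L*x3)/dt."""
    x1, x2, x3 = x
    F = -0.5 * (L0 / a) * (x3 / (1.0 + x1 / a)) ** 2
    dx1 = x2
    dx2 = g_acc - (k / m) * x2 + F / m
    dx3 = (u - R * x3 + (L0 / a) * x2 * x3 / (1.0 + x1 / a) ** 2) / L_ind(x1)
    return np.array([dx1, dx2, dx3])

# Hovering current that balances gravity at a desired position x1*
x1_star = 0.2
i_star = (1.0 + x1_star / a) * np.sqrt(2 * m * g_acc * a / L0)
rhs = maglev_rhs(np.array([x1_star, 0.0, i_star]), R * i_star)
print(rhs)
```

At (x1*, 0, i*) with constant voltage u = R·i*, the right-hand side vanishes, which is the operating point around which the tracking reference yr(t) = 0.2 + 0.1 sin(t) evolves.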

RL control for a perturbed system
Let us consider now the case of a perturbed magnetic suspension system, with the same choice for the gains λ1, λ2, λ3 and for yr. Analogously to the previous example, one computes the control from (19), attenuating the perturbations w1 and w2 through v using the ℋ∞ technique. Because the disturbances w1 and w2 act on the transformed variables z1 and z2, a virtual output of the closed-loop system (21) is set as ζ = z1 + z2 + v, with the corresponding matrix choice, so that ζ is attenuated by the control v = −( 0 0 1 )Vzᵀ, where Vz is calculated online according to (22) with γ = 0.5. The Actor policy is as in (28).
The components of the Jacobian Vz are shown in Figure 5. These values allow the calculation of the control required to attenuate the disturbances. Figure 6 shows the tracking achieved by the proposed discontinuous gradient method in the presence of disturbances, and Figure 7 shows the evolution of the parameter estimation. Notice that θ̂2 and θ̂3 converge asymptotically, whereas θ̂1 converges to a small neighborhood of the nominal value, due to the nature of the disturbances affecting the system.
FIGURE 6 Output tracking for the perturbed magnetic suspension system.

FIGURE 7 Unknown parameter estimation for the perturbed magnetic suspension system using the Discontinuous Gradient policy.

CONCLUSIONS
This paper presents a new technique to deal with the problem of parameter estimation for a class of perturbed nonlinear systems. In the absence of perturbations, the parameters are obtained exactly in finite time thanks to the sliding modes in the Actor-Critic Discontinuous Gradient. For systems with matched and unmatched disturbances, the parameters are approximated within a computable margin.
With the help of the ℋ∞-control technique, the relationship between the output and the disturbance is optimized. In addition, in the RL scheme based on the DG, the minimization of the error in the knowledge of the parameters is obtained. However, this robust RL control scheme requires the online calculation of the Jacobian Vz.

FIGURE 1 Diagram of the proposed control algorithm. The quantities marked with a circle are measured.