Adaptive Dynamic Programming for Model-free Tracking of Trajectories with Time-varying Parameters

In order to autonomously learn to control unknown systems optimally w.r.t. an objective function, Adaptive Dynamic Programming (ADP) is well-suited to adapt controllers based on experience from interaction with the system. In recent years, many researchers have focused on the tracking case, where the aim is to follow a desired trajectory. So far, ADP tracking controllers assume that the reference trajectory follows time-invariant exo-system dynamics, an assumption that does not hold for many applications. In order to overcome this limitation, we propose a new Q-function which explicitly incorporates a parametrized approximation of the reference trajectory. This makes it possible to learn to track a general class of trajectories by means of ADP. Once our Q-function has been learned, the associated controller copes with time-varying reference trajectories without the need for further training and independently of any exo-system dynamics. After proposing our general model-free off-policy tracking method, we provide an analysis of the important special case of linear-quadratic tracking. We conclude our paper with an example which demonstrates that our new method successfully learns the optimal tracking controller and outperforms existing approaches in terms of tracking error and cost.


INTRODUCTION
Adaptive Dynamic Programming (ADP), which is based on Reinforcement Learning, has gained extensive attention as a model-free adaptive optimal control method. 1 In ADP, pursuing the objective to minimize a cost functional, the controller adapts its behavior on the basis of interaction with an unknown system. The present work focuses on the ADP tracking case, where a reference trajectory is intended to be followed while the system dynamics is unknown. As the long-term cost, i.e. the value, of a state changes depending on the reference trajectory, a controller that has learned to solve a regulation problem cannot be directly transferred to the tracking case.
Therefore, in the literature, there are several ADP tracking approaches in discrete time 2,3,4,5 and continuous time. 6,7 All of these methods assume that the reference trajectory can be modeled by means of a time-invariant exo-system r_{k+1} = f_ref(r_k) (and ṙ(t) = f_ref(r(t)) in continuous time, respectively) from which the reference trajectory is derived. Whenever this reference trajectory and thus the function f_ref changes, the learned value function and consequently the controller is no longer valid and needs to be re-trained. Therefore, the exo-system tracking case with time-invariant reference dynamics f_ref is not suited for all applications. 8 For example, in autonomous driving, process engineering and human-machine collaboration, it is often required to track flexible and time-varying trajectories. In order to account for various references, the multiple-model approach presented by Kiumarsi et al. 9 uses a self-organizing map that switches between several learned models. However, in their approach, new sub-models need to be trained for each exo-system f_ref. Thus, our idea is to define a state-action-reference Q-function that explicitly incorporates the course of the reference trajectory, in contrast to the commonly used Q-function (see e.g. Sutton and Barto 10 ) which only depends on the current state x_k and control u_k. This general idea was first proposed in our previous work, 11 where the reference is given on a finite horizon and assumed to be zero thereafter. Thus, the number of weights to be learned depends on the horizon on which the reference trajectory is considered. As the reference trajectory is given for each time step, this allows high flexibility, but the sampling time and (unknown) system dynamics significantly influence the reasonable horizon length and thus the number of weights to be learned. Motivated by these challenges, our major idea and contribution in the present work is to approximate the reference trajectory by means of a potentially time-varying parameter set in order to compress the information about the reference compared to our previous work 11 and to incorporate this parameter into a new Q-function. In doing so, the Q-function explicitly represents the dependency of the expected long-term cost on the desired reference trajectory. Hence, the associated optimal controller is able to cope with time-varying parametrized references. We term this method Parametrized Reference ADP (PRADP).
Our main contributions include:
• The introduction of a new reference-dependent Q-function that explicitly depends on the reference parameter θ_k.
• Function approximation of this Q-function in order to realize Temporal Difference (TD) learning (cf. Sutton 12 ).
• Rigorous analysis of the form of this Q-function and its associated optimal control law in the special case of linear-quadratic (LQ) tracking.
• A comparison of our proposed method with algorithms assuming a time-invariant exo-system f_ref and with the ground truth optimal tracking controller.
In the next section, the general problem definition is given. Then, PRADP is proposed in Section 3. Simulation results and a discussion are given in Section 4 before the paper is concluded.

GENERAL PROBLEM DEFINITION
Consider a discrete-time controllable system
x_{k+1} = f(x_k, u_k), (1)
where k ∈ ℕ_0 is the discrete time step, x_k ∈ ℝ^n the system state and u_k ∈ ℝ^m the input. The system dynamics f(⋅) is assumed to be unknown.
Furthermore, let the parametrized reference trajectory r(j, θ_k) ∈ ℝ^n which we intend to follow be described by
r(j, θ_k) = θ_k φ(j). (2)
At any time step k, the reference trajectory is described by means of a parameter matrix θ_k ∈ ℝ^{n×p} and given basis functions φ(j) ∈ ℝ^p. Here, j ∈ ℕ_0 denotes the time step on the reference from the local perspective at time k, i.e. for j = 0, the reference at time step k results, and j > 0 yields a prediction of the reference for future time steps. Thus, in contrast to methods which assume that the reference follows time-invariant exo-system dynamics f_ref, the parameters θ_k in (2) can be time-varying, allowing much more diverse reference trajectories.
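To illustrate (2), the following minimal Python sketch evaluates such a parametrized reference for a cubic polynomial basis (the basis used in our simulation section); the sampling time and the parameter values are purely illustrative and not taken from the experiments below.

```python
import numpy as np

def phi(j, dt=0.01):
    """Basis functions φ(j) for a cubic polynomial reference (p = 4)."""
    s = j * dt
    return np.array([s**3, s**2, s, 1.0])

def reference(j, theta, dt=0.01):
    """Parametrized reference r(j, θ) = θ φ(j) with θ ∈ R^{n×p}."""
    return theta @ phi(j, dt)

# Illustrative parameter matrix for n = 2 (e.g. position and velocity):
theta_k = np.array([[0.0, 0.0, 1.0, 0.5],   # first state: ramp starting at 0.5
                    [0.0, 0.0, 0.0, 1.0]])  # second state: constant reference
print(reference(0, theta_k))   # reference at the current time step (j = 0)
print(reference(5, theta_k))   # predicted reference 5 steps ahead
```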
Our aim is to learn a controller which does not know the system dynamics and minimizes the cost
J_k = ∑_{i=0}^∞ γ^i c(x_{k+i}, u_{k+i}, θ_k^(i)), (3)
where γ ∈ [0, 1) is a discount factor and c(⋅) denotes a non-negative one-step cost. We define our general problem as follows.
Problem 1. For a given parametrization of the reference by means of θ_k according to (2), an optimal control sequence that minimizes the cost (3) is denoted by u*_k, u*_{k+1}, … and the associated cost by J*_k. The system dynamics f is unknown. At each time step k, find u*_k.

PARAMETRIZED REFERENCE ADP (PRADP)
In order to solve Problem 1, we first propose a new, modified Q-function whose minimizing control represents a solution u*_k to Problem 1. In the next step, we parametrize this Q-function by means of linear function approximation. Then, we apply Least-Squares Policy Iteration (LSPI) (cf. Lagoudakis and Parr 13 ) in order to learn the unknown Q-function weights from data without requiring a system model. Finally, we discuss the structure of this new Q-function for the linear-quadratic tracking problem, where analytical insights are possible.

Proposed Q-Function
The relative position on the current reference trajectory that is parametrized by means of θ_k according to (2) needs to be considered when minimizing the cost as given in (3). In order to do so, one could explicitly incorporate the relative time j into the Q-function that is used for ADP. This would yield a Q-function of the form Q(x_k, u_k, θ_k, j). However, this would unnecessarily increase the complexity of the Q-function and hence the challenge to approximate and learn such a Q-function. Thus, we decided to implicitly incorporate the relative time on the current reference trajectory parametrized by θ_k into the reference trajectory parametrization. This yields a shifted parameter matrix θ_k^(j) according to the following definition. Thus, θ_k^(j) is a modified version of θ_k = θ_k^(0) such that the associated reference trajectory is shifted by j time steps, i.e. r(i, θ_k^(j)) = r(i+j, θ_k), where T(j) is a suitable matrix. Note that T(j) is in general ambiguous as in the general case p > 1 the system of equations (4b) to be solved for T(j) is underdetermined. Thus, T(j) can be any matrix such that (4) holds.
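As a concrete illustration of Definition 1, consider a scalar reference (n = 1) with the affine basis φ(i) = (i 1)^⊺, chosen here purely for illustration. The shift matrix then follows directly:

```latex
\varphi(i+j) = \begin{pmatrix} i+j \\ 1 \end{pmatrix}
             = \underbrace{\begin{pmatrix} 1 & j \\ 0 & 1 \end{pmatrix}}_{T(j)} \varphi(i)
\quad\Rightarrow\quad
\theta_k^{(j)} = \theta_k T(j),
\qquad
r\big(i, \theta_k^{(j)}\big) = \theta_k T(j)\varphi(i) = \theta_k \varphi(i+j) = r(i+j, \theta_k).
```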
Our proposed Q-function, which explicitly incorporates the reference trajectory by means of θ_k, is given as follows.
Here, the optimal control policy is denoted by π*(⋅), hence π*(x_{k+1}, θ_k^(1)) = u*_{k+1}. This Q-function is useful for solving Problem 1 as can be seen from the following lemma.
Lemma 1. The control u_k minimizing Q*(x_k, u_k, θ_k) is a solution for u*_k minimizing J_k in (3) according to Problem 1.
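In the notation above, the first relation below sketches the Bellman-like recursion (7) and the second the optimal control law (9), with c the one-step cost from (3) and θ_k^(1) the parameter matrix shifted by one step:

```latex
Q^*(x_k, u_k, \theta_k)
  = c(x_k, u_k, \theta_k)
  + \gamma\, Q^*\!\left(x_{k+1},\, \pi^*\!\big(x_{k+1}, \theta_k^{(1)}\big),\, \theta_k^{(1)}\right),
\qquad
u_k^* = \arg\min_{u_k} Q^*(x_k, u_k, \theta_k).
```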
Lemma 1 and (9) reveal the usefulness of Q*(x_k, u_k, θ_k) for solving Problem 1. Thus, we express this Q-function by means of linear function approximation in the following. Based on the temporal difference (TD) error, the unknown Q-function weights can then be estimated.

Function Approximation of the Tracking Q-Function
As classical tabular Q-learning is unable to cope with large or even continuous state and control spaces, it is common to represent the Q-function, which is assumed to be smooth, by means of function approximation 14 . This leads to
Q*(x_k, u_k, θ_k) = w^⊺ σ(x_k, u_k, θ_k) + ε(x_k, u_k, θ_k). (10)
Here, w ∈ ℝ^q is the unknown optimal weight vector, σ ∈ ℝ^q a vector of activation functions and ε the approximation error. According to the Weierstrass higher-order approximation theorem, 15 a single hidden layer and appropriately smooth hidden-layer activation functions σ(⋅) are capable of an arbitrarily accurate approximation of the Q-function. Furthermore, if q → ∞, the approximation error ε vanishes. As w is a priori unknown, let the estimated optimal Q-function be given by
Q̂*(x_k, u_k, θ_k) = ŵ^⊺ σ(x_k, u_k, θ_k). (11)
In analogy to (9), the estimated optimal control law is defined as
π̂*(x_k, θ_k) = arg min_{u_k} Q̂*(x_k, u_k, θ_k). (12)
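For the LQ case analyzed below, a natural choice of activation functions is quadratic in the joint vector z_k = [x_k^⊺ u_k^⊺ vec(θ_k)^⊺]^⊺ (cf. Lemma 2). A minimal sketch of such a feature map:

```python
import numpy as np

def features(x, u, theta):
    """Quadratic activation functions σ: the non-redundant entries of z z^T,
    where z stacks state, control and the vectorized reference parameters."""
    z = np.concatenate([np.atleast_1d(x), np.atleast_1d(u),
                        theta.flatten(order="F")])  # column-major vec(θ)
    outer = np.outer(z, z)
    return outer[np.triu_indices(len(z))]  # q = len(z) (len(z)+1) / 2 weights
```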
Based on this parametrization of our new Q-function, the associated TD error 12 is defined as follows.
Definition 3. (TD Error of the Tracking Q-Function) The TD error δ_k which results from using the estimated Q-function Q̂*(⋅) (11) in the Bellman-like equation (7) is defined as
δ_k = c(x_k, u_k, θ_k) + γ Q̂*(x_{k+1}, π̂*(x_{k+1}, θ_k^(1)), θ_k^(1)) − Q̂*(x_k, u_k, θ_k). (13)
Our goal is to estimate ŵ in order to minimize the squared TD error δ_k² as the TD error quantifies the quality of the Q-function approximation. However, (13) is scalar while q weights need to be estimated. Thus, we utilize N ≥ q tuples 𝒟_i = {x_i, u_i, θ_i, x_{i+1}, θ_i^(1)} (14) from interaction with the system in order to estimate ŵ using Least-Squares Policy Iteration (LSPI) (cf. Lagoudakis and Parr 13 ). Stacking (13) for the tuples 𝒟_i, i = 1, …, N, yields
δ = c − Φŵ, (15)
where the i-th row of Φ is (σ(x_i, u_i, θ_i) − γ σ(x_{i+1}, π̂*(x_{i+1}, θ_i^(1)), θ_i^(1)))^⊺ and c stacks the one-step costs. If the excitation condition
rank(Φ) = q (16)
holds, ŵ minimizing δ^⊺δ exists, is unique and given by
ŵ = (Φ^⊺Φ)^{−1}Φ^⊺c (17)
according to Åström and Wittenmark, Theorem 2.1. 16
Note 1. Using θ_k^(1) according to (5) in the training tuple 𝒟_i (14) rather than an arbitrary subsequent θ_{k+1} guarantees (in combination with (1)) that the Markov property holds, which is commonly required in ADP. 1
Remark 1. The procedure described above is an extension of Lagoudakis and Parr, Section 5.1 13 to the tracking case where minimizing the squared TD error is targeted. In addition, an alternative projection method described by Lagoudakis and Parr, Section 5.2 13 which targets the approximate Q-function to be a fixed point under the Bellman operator has been implemented. Both procedures yielded indistinguishable results for our linear-quadratic simulation examples but might differ in the general case.
Note that π̂*(⋅) in Q̂*(x_{k+1}, π̂*(x_{k+1}, θ_k^(1)), θ_k^(1)) depends on ŵ, i.e. the estimation of ŵ depends on another estimation (of the optimal control law). This mechanism is known as bootstrapping (cf. Sutton and Barto 10 ) in Reinforcement Learning. As a consequence, rather than estimating ŵ once by means of the least-squares estimate (17), a policy iteration is performed starting with ŵ^(0). This procedure is given in Algorithm 1, where ε_ŵ is a threshold for the terminal condition.
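A compact sketch of this policy iteration under the notation above is given below. The helpers one_step_cost and argmin_u (returning the greedy control for the current weight estimate) as well as the data layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def lspi(data, features, one_step_cost, argmin_u, gamma, eps_w=1e-5, max_iter=100):
    """LSPI sketch: data holds tuples (x, u, theta, x_next, theta_shift),
    where theta_shift is the shifted parameter matrix θ^(1) (cf. Note 1)."""
    q = len(features(*data[0][:3]))
    w = np.zeros(q)                                    # initial weights ŵ^(0)
    for _ in range(max_iter):
        Phi, c = [], []
        for x, u, theta, x_next, theta_shift in data:
            u_next = argmin_u(w, x_next, theta_shift)  # greedy target policy
            Phi.append(features(x, u, theta)
                       - gamma * features(x_next, u_next, theta_shift))
            c.append(one_step_cost(x, u, theta))
        # Least-squares solution minimizing the stacked squared TD error (17):
        w_new, *_ = np.linalg.lstsq(np.asarray(Phi), np.asarray(c), rcond=None)
        if np.linalg.norm(w_new - w) < eps_w:          # terminal condition
            return w_new
        w = w_new
    return w
```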
Note 2. Due to the use of a Q-function which explicitly depends on the control u_k, this method performs off-policy learning. 10 Thus, during training, the behavior policy (i.e. the control which is actually applied to the system) might include exploration noise in order to satisfy the rank condition (16), but due to the greedy target policy π̂* (cf. the policy improvement step (12)), the Q-function associated with the optimal control law is learned.
Lagoudakis and Parr 13 point out that the appropriate choice of basis functions and the sample distribution (i.e. excitation) determine the approximation error bound ε̄. According to Theorem 1, Algorithm 1 converges to a neighborhood of the optimal tracking Q-function under an appropriate choice of basis functions σ(⋅) and excitation. However, for general nonlinear systems (1) and cost functions (3), an appropriate choice of basis functions and the number of neurons is "more of an art than science" 18 and still an open problem. As the focus of this paper lies on the new Q-function for tracking purposes rather than the tuning of neural networks, we focus on linear systems and quadratic cost functions in the following, a setting that plays an important role in control engineering. This allows analytic insights into the structure of Q*(x_k, u_k, θ_k) and thus a proper choice of σ(⋅) for function approximation in order to demonstrate the effectiveness of the proposed PRADP method.

The LQ-Tracking Case
In the following, assume linear system dynamics
x_{k+1} = A x_k + B u_k (20)
and the quadratic cost
J_k = ∑_{i=0}^∞ γ^i (e_{k,i}^⊺ Q e_{k,i} + u_{k+i}^⊺ R u_{k+i}), e_{k,i} = x_{k+i} − r(i, θ_k). (21)
Here, Q penalizes the deviation of the state x_{k+i} from the reference r(i, θ_k) and R penalizes the control effort. Furthermore, let the following assumptions hold.
Note 3. Assumption 1 is rather standard in the LQ setting in order to ensure the existence and uniqueness of a stabilizing solution to the discrete-time algebraic Riccati equation associated with the regulation problem given by (20) and (21) for r ≡ 0 (cf. Kučera, Theorem 8 19 ). Furthermore, it is obvious that the reference trajectory r(j, θ_k) must be defined such that a controller exists which yields finite cost in order to obtain a reasonable control problem. As will be seen in Theorem 2, Assumption 2 guarantees the existence of this solution.
The tracking error e_{k,i} can be expressed as
e_{k,i} = x_{k+i} − r(0, θ_k^(i)) = [I_n  −(φ(0)^⊺ ⊗ I_n)] x̃_{k,i}, i = 0, 1, …, (22)
where ⊗ denotes the Kronecker product, I_n the n × n identity matrix and x̃_{k,i} = [x_{k+i}^⊺  vec(θ_k^(i))^⊺]^⊺ the extended state. The associated optimal controller is given in the following theorem.
Theorem 2. Let the system dynamics (20), the cost (21) and a reference parametrized according to (2) with shift matrix T(j) as in Definition 1 be given. Then, (i) the optimal controller which minimizes (21) subject to the system dynamics (20) and the reference is linear w.r.t. x̃_{k,i} (cf. (22)) and can be stated as
u*(x_{k+i}, θ_k^(i)) = u*_{k+i} = −K x̃_{k,i}, (23)
and (ii) the optimal gain K is given by (24).
Considering that the discounted problem is equivalent to an undiscounted problem with √γ Ã, √γ B̃, Q̃ and R (cf. Gaitsgory et al. 20 ), the given problem can be reformulated as a standard undiscounted LQ problem. For the latter, it is well-known that the optimal controller is linear w.r.t. the state (here x̃_{k,i}) and the optimal gain is given by (24). □
Note 4. The proof of Theorem 2 demonstrates that in case of known system dynamics by means of A and B, the optimal tracking controller can be directly calculated by solving the discrete-time algebraic Riccati equation 22 associated with √γ Ã, √γ B̃, Q̃ and R.
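For known dynamics, Note 4 suggests the following computation of the ground truth gain. This sketch assumes the extended-state construction from (22), with vec(θ_k^(i)) propagating via (T(1)^⊺ ⊗ I_n); the function name and the API usage are ours, and the solvability conditions discussed in Note 3 are assumed to hold.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ground_truth_gain(A, B, T1, phi0, Q, R, gamma):
    """Optimal tracking gain K (u* = -K x̃) via the undiscounted reformulation."""
    n = A.shape[0]
    S = np.kron(T1.T, np.eye(n))       # vec(θ^(i+1)) = (T(1)^T ⊗ I_n) vec(θ^(i))
    A_t = np.block([[A, np.zeros((n, S.shape[1]))],
                    [np.zeros((S.shape[0], n)), S]])
    B_t = np.vstack([B, np.zeros((S.shape[0], B.shape[1]))])
    C_t = np.hstack([np.eye(n), -np.kron(phi0.reshape(1, -1), np.eye(n))])
    Q_t = C_t.T @ Q @ C_t              # penalize the tracking error e = C̃ x̃ (22)
    a, b = np.sqrt(gamma) * A_t, np.sqrt(gamma) * B_t   # √γ Ã, √γ B̃
    P = solve_discrete_are(a, b, Q_t, R)
    return np.linalg.solve(R + b.T @ P @ b, b.T @ P @ a)
```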
Equation (28) demonstrates that the Markov property holds (cf. Note 1). As a consequence of Theorem 2, for unknown system dynamics, this yields the following problem in the LQ PRADP case.
Problem 2. For k = 0, 1, …, find the linear extended-state feedback control (23) minimizing (21) and apply u*_k = −K x̃_{k,0} to the unknown system (20).
Before we derive the control law, we analyze the structure of Q*(x_k, u_k, θ_k) associated with Problem 2 in the following lemma.
Lemma 2. The optimal Q-function associated with Problem 2 is quadratic in the joint vector z_k = [x_k^⊺  u_k^⊺  vec(θ_k)^⊺]^⊺, i.e. Q*(x_k, u_k, θ_k) = z_k^⊺ H z_k, where H is chosen such that H = H^⊺.
As a consequence of Lemma 2, Q* can be exactly parametrized by means of Q̂* according to (11) if ŵ = w corresponds to the non-redundant elements of H = H^⊺ (doubling the elements of ŵ associated with off-diagonal elements of H) and σ corresponds to the non-redundant elements of z ⊗ z. Based on Lemma 2, the optimal control law is given as follows.
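The following sketch makes this parametrization concrete: it rebuilds the symmetric matrix H from learned weights ŵ and, anticipating the control law below, extracts the gain by minimizing z^⊺Hz over u. The partitioning of H is ours; dimensions n and m are as above.

```python
import numpy as np

def H_from_weights(w, dim):
    """Rebuild H = H^T from the non-redundant weights ŵ (off-diagonal
    entries of H were doubled in the weight parametrization)."""
    H = np.zeros((dim, dim))
    H[np.triu_indices(dim)] = w
    return (H + H.T) / 2.0

def gain_from_H(H, n, m):
    """Minimizing z^T H z over u with z = [x; u; vec(θ)] yields
    u* = -H_uu^{-1} (H_ux x + H_uθ vec(θ)) = -K x̃."""
    H_uu = H[n:n + m, n:n + m]
    H_u_rest = np.hstack([H[n:n + m, :n], H[n:n + m, n + m:]])
    return np.linalg.solve(H_uu, H_u_rest)
```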
Thus, if H (or equivalently w) is known, both Q* and the optimal control u*_k can be calculated.

RESULTS
In order to validate our proposed PRADP tracking method, we show simulation results in the following, where the reference trajectory is parametrized by means of cubic splines. 1 Furthermore, we compare the results with an ADP tracking method from the literature which assumes that the reference can be described by a time-invariant exo-system f_ref(r_k). Finally, we compare our learned controller, which does not know the system dynamics, with the ground truth controller which is calculated based on full system knowledge.

Cubic Polynomial Reference Parametrization
We choose r(j, θ_k) to be a cubic polynomial w.r.t. j, i.e. φ(j) = [(jΔt)^3  (jΔt)^2  jΔt  1]^⊺, where Δt denotes the sampling time. The associated transformation T(j) in order to obtain the shifted version θ_k^(j) of θ_k according to Definition 1 thus results from the binomial expansion of the shifted basis functions (38). In order to fully describe r(j, θ_k), the values of θ_k remain to be determined. Therefore, given sampling points of the reference trajectory every n_s = 25 time steps, let θ_k, k = l n_s, l = 0, 1, …, result from cubic spline interpolation. In between the sampling points, let θ_{k+j} = θ_k^(j) = θ_k T(j), j = 1, 2, …, n_s − 1 (cf. Definition 1 and (38)). This way, the controller is provided with θ_k at each time step k when facing Problem 2.
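A sketch of this construction: the shift matrix T(j) follows from binomially expanding each entry of φ(i + j), and the segment-wise spline coefficients directly furnish θ at the sampling points. We use scipy's CubicSpline for illustration; the paper's own interpolation routine is not restated here.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def T_shift(j, dt):
    """Shift matrix for φ(i) = [(iΔt)^3, (iΔt)^2, iΔt, 1]^T such that
    φ(i + j) = T(j) φ(i) and hence θ^(j) = θ T(j) (cf. Definition 1)."""
    s = j * dt
    return np.array([[1.0, 3*s, 3*s**2, s**3],
                     [0.0, 1.0, 2*s,    s**2],
                     [0.0, 0.0, 1.0,    s   ],
                     [0.0, 0.0, 0.0,    1.0 ]])

def spline_thetas(t_samples, r_samples):
    """θ at each sampling point from cubic spline interpolation of a scalar
    reference: row [a3, a2, a1, a0] of the local segment polynomial
    (t_samples must be given in the same time units as Δt)."""
    cs = CubicSpline(t_samples, r_samples)
    return [cs.c[:, l].reshape(1, -1) for l in range(len(t_samples) - 1)]
```

Between two sampling points, theta_k @ T_shift(j, dt) then reproduces the spline segment at local step j.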
Note 5. The given procedure to generate the parameters θ_k decouples the sampling time of the controller from the availability of sampling points of the reference trajectory (in our example, these are only given every n_s = 25 time steps).

Example System
Consider a mass-spring-damper system with mass m_sys = 0.
Here, x_1 corresponds to the position, x_2 to the velocity of the mass m_sys and the control u corresponds to a force. We desire to track the position (i.e. x_1), thus we set Q = diag(100, 0) and R = 1 in order to strongly penalize the deviation of the first state from the parametrized reference (cf. (21)), and γ = 0.9. In this example setting, Assumptions 1-2 hold.
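For reference, a sketch of how such an example system can be set up is given below. The physical parameter values and the sampling time are placeholders chosen by us (the paper's exact values are not restated here); the discretization uses the standard zero-order-hold formula.

```python
import numpy as np
from scipy.linalg import expm

# Placeholder parameters: mass, spring constant, damping, sampling time.
m_sys, c_spring, d_damp, dt = 1.0, 2.0, 0.5, 0.01

# Continuous-time mass-spring-damper: x = [position, velocity], u = force.
A_c = np.array([[0.0, 1.0],
                [-c_spring / m_sys, -d_damp / m_sys]])
B_c = np.array([[0.0],
                [1.0 / m_sys]])

# Exact zero-order-hold discretization via the matrix exponential.
M = expm(np.block([[A_c, B_c], [np.zeros((1, 3))]]) * dt)
A, B = M[:2, :2], M[:2, 2:]

Q = np.diag([100.0, 0.0])  # strongly penalize position deviation (cf. (21))
R = np.array([[1.0]])
gamma = 0.9                # discount factor
```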

Simulations
In order to investigate the benefits of our proposed PRADP tracking controller, we compare our method with an ADP tracking controller from the literature, 2,3 which assumes that the reference trajectory is generated by a time-invariant exo-system f_ref(r_k). Both our method (with ε_ŵ = 1 × 10^−5 in Algorithm 1) and the comparison method from the literature are trained on data of 500 time steps, where Gaussian noise with zero mean and standard deviation of 1 is applied to the system input for excitation. Note that neither of the methods requires the system dynamics (20). Let x_0 = [0  1]^⊺. The reference trajectory during training is the sinusoid (41) for the comparison method and the associated spline for our method.
The learned controllers of both our method and the comparison algorithm are tested on a reference trajectory for x_1 that equals the sine described by r_{k,1} according to (41) for the first 250 time steps. Then, the reference trajectory deviates from this sine as depicted in Fig. 1 in gray. Here, the blue crosses mark the sampling points for spline interpolation, the black dashed line depicts x_1 resulting from our proposed method and the red dash-dotted line shows x_1 for the comparison method. Furthermore, to gain insight into the tracking quality by means of the resulting cost, the one-step cost c(x_k, u_k, θ_k) is depicted in Fig. 2 for both methods. Note the logarithmic ordinate, which is chosen in order to render the black line representing the cost associated with our method visible.
Comparing the gain learned by PRADP with the ground truth solution yields ‖K̂_PRADP − K*‖ = 1.08 × 10^−13. Thus, the learned controller is almost identical to the ground truth solution, which demonstrates that the optimal tracking controller has successfully been learned using PRADP without knowledge of the system dynamics.

Discussion
As can be seen from Fig. 1, our proposed method successfully tracks the parametrized reference trajectory. In contrast, the method proposed by e.g. Luo et al. 2 and Kiumarsi et al. 3 causes major deviations from the desired trajectory as soon as the reference does not follow the same exo-system on which it was trained (i.e. as soon as (41) does not hold anymore after 250 time steps).
In addition, the cost in Fig. 2 reveals that both methods yield small and similar costs as long as the reference trajectory follows f_ref. However, as soon as the reference trajectory deviates from the time-invariant exo-system description f_ref at k > 250, the cost of the comparison method drastically exceeds the cost associated with our proposed method. With a maximum one-step cost of approximately 270 for the exo-system method compared to approximately 2.8 for PRADP, our method clearly outperforms the comparison method. PRADP does not require the assumption that the reference trajectory follows time-invariant exo-system dynamics but is nevertheless able to follow this kind of reference (see k ≤ 250 in the simulations) as well as all other references that can be approximated by means of the time-varying parameter θ_k. Thus, PRADP can be interpreted as a more generalized tracking approach compared to existing ADP tracking methods.

CONCLUSION
In this paper, we proposed a new ADP-based tracking controller termed Parametrized Reference Adaptive Dynamic Programming (PRADP). This method implicitly incorporates the approximated reference trajectory information into the Q-function that is learned. This allows the controller to track time-varying parametrized references once it has been trained, without the further adaptation or re-training that previous methods require. Simulation results showed that our learned controller is more flexible than state-of-the-art ADP tracking controllers, which assume that the reference to be tracked follows a time-invariant exo-system. Motivated by a straightforward choice of basis functions, we concentrated on the LQ tracking case in our simulations, where the optimal controller was successfully learned. However, as the mechanism of PRADP allows more general tracking problem formulations (see Section 3), general function approximators can be used in order to approximate the Q-function and allow for nonlinear ADP tracking controllers in the future.