Learning‐based parametrized model predictive control for trajectory tracking

This article is concerned with the tracking of nonequilibrium motions with model predictive control (MPC). It proposes to parametrize input and state trajectories of a dynamic system with basis functions to alleviate the computational burden in MPC. As a result of the parametrization, an optimization problem with fewer variables is obtained, and the memory requirements for storing the reference trajectories are reduced. The article also discusses the generation of feasible reference trajectories that account for the system's dynamics, as well as input and state constraints. In order to cope with repeatable disturbances, which may stem from unmodeled dynamics for example, an iterative learning procedure is included. The approach relies on a Kalman filter that identifies the repeatable disturbances based on previous trials. These are then included in the system's model available to the model predictive controller, which compensates for them in subsequent trials. The proposed approach is evaluated on a quadcopter whose task is to balance a pole while flying a predefined trajectory.


INTRODUCTION
Model predictive control (MPC) is an effective and popular control strategy for systems that have input and state constraints. There are numerous examples where MPC has been successfully applied in practice; see, for example, the works of Richalet et al, 1 Borrelli et al, 2 Geyer et al, 3 and Papafotiou et al. 4 The underlying idea is to repeatedly solve an optimal control problem that takes input and state constraints into account, and generates optimal input and state trajectories that guide the system from its current state to a desired state (often the origin). In order to be robust against modeling errors and disturbances, only a small portion of the input trajectory is applied to the system before the optimal control problem is solved again, with the current state as the initial condition. 5 As a consequence, the underlying optimal control problem often needs to be solved very quickly in order to achieve a sufficiently small sampling time. Even if the optimization is warm-started, the resulting computational load is often substantial and represents a bottleneck of MPC. To reduce the computational load, the optimal control problem is often simplified, for example by using a coarser system model with fewer dimensions, and/or by reducing the prediction horizon. However, both the simplified system model and the smaller prediction horizon may degrade the resulting closed-loop performance and may even lead to instability. The parametrization of the input and state trajectories with basis functions is an approach that avoids the discretization and the truncation of the time horizon. For the regulator problem, it has been shown in the work of Muehlebach and D'Andrea 6 that the resulting MPC controller asymptotically stabilizes the origin and is recursively feasible (without the need for a terminal constraint and a terminal cost). This article discusses the extension of such a parametrized MPC strategy from the regulator problem to the tracking problem.
In fact, this step is non-trivial, particularly with regard to practical implementation. In addition, a strategy to identify and account for repeatable disturbances is presented, in case the same trajectory is executed multiple times. The implementation of the resulting learning-based MPC control strategy on a flying inverted pendulum system (see Figure 1) is discussed.
Related work: In the work of Dunbar et al, 7 a testbed for a ducted-fan actuated flying vehicle is controlled with MPC. The authors discuss the regulator problem and show that the region of attraction of the MPC controller is larger than that of the related linear quadratic regulator.
Various works focus on reducing the computational complexity of MPC; see, for example, move-blocking strategies 8 or tailored MPC solvers. 9,10 The works of Wang, 11 Rossiter and Wang, 12 and Khan and Rossiter 13 propose parametrizing the input trajectories with Laguerre or Kautz basis functions. The potential benefits of the parametrization in terms of performance and extension of the feasibility regions are discussed.
The article by Faulwasser and Findeisen 14 proposes an extension of MPC from the regulation problem to trajectory tracking. A nonlinear and continuous-time MPC formulation is presented, including time-varying terminal set constraints that guarantee recursive feasibility and closed-loop stability. Stephens et al 15 discuss the design and implementation of an MPC controller to track trajectories of an industrial machine tool servo drive. The authors suggest extending the reference trajectory, for example via first-order hold, and are able to reduce the input horizon (they distinguish between the input and the prediction horizons), such that they can apply explicit MPC. In the work of Neunert et al, 16 a general nonlinear control strategy is presented, which uses time-varying feedforward and feedback terms. The resulting cascaded control structure includes a low-level controller that can be executed at very high rates, while MPC is used at a higher level for updating the parameters of the low-level controller. The research of Kamel et al 17 compares a linear and a nonlinear MPC controller for trajectory tracking with a hexacopter. The authors conclude that the nonlinear MPC controller outperforms the linear one for the experiments conducted. In the work of Mueller and D'Andrea, 18 a diminishing horizon MPC controller is introduced to track interception trajectories with a quadcopter. The trajectories generated at each sampling step are obtained by optimizing a quadratic program that is tailored to the quadcopter's dynamics and is based on conservative bounds on the quadcopter's jerk. Compared to the approach proposed in this article, this results in a comparable computational effort. However, the algorithm does not provide convergence guarantees and does not adapt to the environment.
In order to cope with modeling errors, a learning-based predictive controller is presented in the work of Bouffard et al. 19 The dynamics are assumed to be affine and time-invariant, and an extended Kalman filter is used to learn, respectively update, the dynamics. As a result, the authors demonstrate offset-free tracking and discuss the improvements of a step response compared to the nominal MPC controller. In the work of Wang et al, 20 iterative learning control is included in an MPC framework to control multi-phase batch processes. Two different strategies are proposed to design an updating law using MPC, which can intrinsically deal with constraints. A reference-free approach is presented in the work of Rosolia and Borrelli, 21 where a nonlinear MPC controller improves the performance over repetitive tasks when the reference trajectory is not known. The design of terminal constraints at each iteration is proposed to guarantee performance improvement and stability requirements.

[FIGURE 2: At each trial, the generated trajectory is followed by means of a trajectory tracking controller. The state and input rollouts are used to estimate the repeatable disturbances after trial completion, and are then taken into account in the next generation of the trajectories to improve the tracking performance. Note that x_des(t) and v(t) denote the state and disturbance trajectories in continuous time for the entire trajectory duration. Similarly, x[k] and u[k] denote the state and input rollouts in discrete time for the entire trajectory duration.]
However, the approaches cited mostly rely on a discretized formulation of the dynamics and/or a piece-wise constant control input. The approach followed in this article is different: It aims to reduce the number of degrees of freedom and alleviate the computational burden of MPC by parametrizing input and state trajectories with basis functions. In addition to requiring less memory (see Section 3) and reducing computation, inherent (without the addition of terminal constraints) closed-loop stability and recursive feasibility results can be demonstrated when regulating linear time-invariant systems. 6,22 This parametrization is originally presented in the work of Muehlebach and D'Andrea, 6 where simulation results show the regulation of a linear time-invariant system to equilibrium, without addressing the problem of constraint satisfaction. In the work of Muehlebach et al, 22 a strategy to guarantee constraint satisfaction is implemented and the resulting algorithm is shown to run at 100 Hz on an embedded platform, regulating the system to equilibrium. In the work of Sferrazza et al, 23 a heuristic strategy has been proposed to apply this idea to the trajectory tracking problem. In this article, non-equilibrium motions are successfully tracked with a general framework that also accounts for the generation of feasible trajectories. This approach is demonstrated in experiments, where a quadrotor balances an inverted pendulum. In order to cope with the system's nonlinearities and unmodeled dynamics, an iterative learning approach that identifies the repeatable disturbances and augments the model accordingly is applied. The computation related to the learning can be done offline, thus not altering the complexity of the MPC algorithm during the tracking of a trajectory. The trajectory generation problem is formulated here as a single optimization over the entire trajectory duration. 
This differs from the work of Sferrazza et al, 23 where a greedy approach was used to simplify the optimization. The main differences between the two approaches are emphasized throughout this article.
Outline: Section 2 introduces the different building blocks of the proposed control strategy and shows how they are connected: In Section 2.1, the MPC controller that tracks the generated reference trajectories is presented. Section 2.2 describes how the reference trajectories are generated, and the identification of repeatable disturbances is explained in Section 2.3. The benefits in terms of memory usage are elaborated in Section 3. Section 4 presents experimental results, where a quadrotor balances an inverted pendulum while flying a three-dimensional figure-eight trajectory. Section 5 concludes the article with final remarks.

METHOD
The following section proposes a parametrized MPC approach to generate and track non-equilibrium motions of a dynamic system. It is assumed that the system can be roughly approximated by linear time-invariant dynamics. A scheme of the proposed framework is shown in Figure 2 and consists of the following building blocks: The trajectory generation procedure takes a desired trajectory, x_des(t), and, based on an estimate of the repeatable disturbances v(t), computes feasible reference trajectories for the state and input. These trajectories are parametrized by linear combinations of basis functions and include steady-state offsets (as indicated by the variables η_x,ref, η_u,ref, η_v, x_ss, u_ss, and v_ss). The trajectory tracking procedure implements an (online) MPC controller for tracking the reference trajectories. The actual state and input trajectories are recorded, and after execution of the trajectory, the disturbance estimation procedure updates the estimate v(t) of the repeatable disturbances. In the first iteration, the disturbance estimate v(t) is typically initialized with zero, unless some prior knowledge is available.
The following subsections describe each individual block in detail. In the remainder of this article, vectors are expressed as tuples, with dimension and stacking clear from the context.

Trajectory tracking
In the following subsection, the trajectory tracking procedure is described. As indicated in Figure 2, its aim is to track feasible reference trajectories, given a disturbance estimate. All trajectories are parametrized with basis functions, that is,

x̃(t) = (Ψ(t)^T ⊗ I_n) η_x + x_ss,  ũ(t) = (Ψ(t)^T ⊗ I_m) η_u + u_ss,  ṽ(t) = (Ψ(t)^T ⊗ I_{n_v}) η_v + v_ss,

for all t ∈ [0,∞), where x̃ and ũ denote the planned state and input trajectories, and x̃_ref, ũ_ref, and ṽ are the reference and disturbance trajectories that are obtained from the trajectory generation routine, are parametrized analogously, and are fixed during trajectory tracking. The parameter vectors of the planned state and input trajectories, which are subject to the (online) optimization, are denoted by η_x ∈ R^(ns) and η_u ∈ R^(ms); the parameters of the reference state, the reference input, and the disturbance trajectory are denoted by η_x,ref ∈ R^(ns), η_u,ref ∈ R^(ms), and η_v ∈ R^(n_v s); and the steady-state offsets on the state, the input, and the disturbances are denoted by x_ss ∈ R^n, u_ss ∈ R^m, and v_ss ∈ R^(n_v). The integers n, m, and n_v describe the dimensions of the state, the input, and the disturbances, and ⊗ refers to the Kronecker product. In contrast to the work of Sferrazza et al, 23 steady-state offsets are also included in the inputs and the disturbances. In addition, these offsets are decision variables in the optimization problem introduced in Section 2.2. This makes it possible to capture large steady-state disturbances and gives the trajectory generation algorithm the flexibility to trade off the placement of these offsets during the optimization. The basis functions Ψ(t) := (Ψ_1(t), Ψ_2(t), … , Ψ_s(t)) ∈ R^s, where s refers to the number of basis functions, are assumed to satisfy the following standing assumptions:

H1 The basis functions Ψ_1, Ψ_2, … , Ψ_s are linearly independent.

H2 The basis functions satisfy the first-order differential equation Ψ̇(t) = M Ψ(t) for all t ∈ [0,∞), where the matrix M ∈ R^(s×s) is Hurwitz.
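For illustration, evaluating such a parametrized trajectory amounts to a single Kronecker-product multiplication. The following minimal NumPy sketch uses arbitrary toy dimensions and values (two basis functions, state dimension two), which are assumptions for the example and not quantities from the experiments:

```python
import numpy as np

def eval_parametrized(psi_t, eta, offset):
    """Evaluate a parametrized trajectory at one time instant:
    value = (psi(t)^T kron I) @ eta + offset, where psi_t holds the s
    basis function values and eta stacks s parameter blocks."""
    dim = offset.shape[0]
    return np.kron(psi_t, np.eye(dim)) @ eta + offset

# Toy numbers: s = 2 basis functions, state dimension n = 2.
psi_t = np.array([1.0, 0.5])             # basis function values at some t
eta_x = np.array([1.0, 0.0, 0.0, 2.0])   # stacked parameter vector
x_ss = np.array([0.1, -0.1])             # steady-state offset
x_tilde = eval_parametrized(psi_t, eta_x, x_ss)
print(x_tilde)   # -> [1.1 0.9]
```

The same helper evaluates input and disturbance trajectories by passing u_ss or v_ss as the offset.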
Assumption H1 is required to guarantee that the resulting optimization problems have unique solutions. The motivation of Assumption H2 is twofold: First, if both H1 and H2 are fulfilled, the linear time-invariant dynamics dx̃(t)/dt = A x̃(t) + B ũ(t) + G ṽ(t) are fulfilled if and only if the corresponding parameter vectors satisfy

(M^T ⊗ I_n) η_x = (I_s ⊗ A) η_x + (I_s ⊗ B) η_u + (I_s ⊗ G) η_v,

where A, B, and G are the system's matrices. A proof of this statement can be found in the work of Muehlebach and D'Andrea. 6 Second, if Assumption H2 is fulfilled, it implies that the basis functions are able to capture arbitrary time shifts. More precisely, given the trajectory x̃(t) parametrized by the basis functions, the shifted version of this trajectory, x̃(t + Δt), can be parametrized with the same basis functions,

x̃(t + Δt) = (Ψ(t)^T ⊗ I_n) (e^(M^T Δt) ⊗ I_n) η_x + x_ss,

for all t ∈ [0,∞), where Δt ∈ [0,∞) is any time shift. This is particularly important for fast online execution, where the optimization at the next time step can be warm-started with the solution obtained at the current time. The time-shift property also implies closed-loop stability and recursive feasibility of the resulting MPC algorithm in the regulation case (that is, for η_x,ref = 0, η_u,ref = 0). 6 Examples of basis functions that fulfill Assumptions H1 and H2 are exponentially decaying polynomials or exponentially decaying harmonics. In the remainder, exponentially decaying polynomials of the form

Ψ_i(t) = t^(i−1) e^(−λt),  i = 1, 2, … , s,  λ > 0,

will be used, which are obtained by choosing M as the lower bidiagonal matrix with −λ on the main diagonal and the entries 1, 2, … , s − 1 on the first subdiagonal. The steady-state offsets x_ss, u_ss, v_ss obtained from the trajectory generation procedure describe an equilibrium, that is,

A x_ss + B u_ss + G v_ss = 0.

The aim of the trajectory tracking procedure is to repeatedly plan state and input trajectories that start from the current state of the system and follow the reference trajectories x̃_ref, ũ_ref as closely as possible. The reference trajectories x̃_ref and ũ_ref will ultimately converge to x_ss and u_ss (since lim_(t→∞) Ψ(t) = 0), hence the name steady-state offsets.
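Both properties can be checked numerically for the exponentially decaying polynomials. The sketch below constructs the matrix M for Ψ_i(t) = t^(i−1) e^(−λt) and verifies Assumption H2 (via a finite difference) together with the time-shift relation Ψ(t + Δt) = e^(MΔt) Ψ(t); the values of s, λ, and the time instants are assumptions chosen for the example:

```python
import numpy as np
from scipy.linalg import expm

def M_decaying_poly(s, lam):
    """M for psi_i(t) = t^(i-1) exp(-lam*t): -lam on the diagonal and
    1, 2, ..., s-1 on the first subdiagonal; Hurwitz for lam > 0."""
    M = -lam * np.eye(s)
    for i in range(1, s):
        M[i, i - 1] = i
    return M

def psi(t, s, lam):
    return np.array([t**i * np.exp(-lam * t) for i in range(s)])

s, lam, t, dt = 4, 2.0, 0.3, 0.15
M = M_decaying_poly(s, lam)

# Time-shift property: psi(t + dt) = expm(M * dt) @ psi(t).
assert np.allclose(expm(M * dt) @ psi(t, s, lam), psi(t + dt, s, lam))

# Finite-difference check of Assumption H2: psi_dot = M @ psi.
h = 1e-6
fd = (psi(t + h, s, lam) - psi(t - h, s, lam)) / (2 * h)
assert np.allclose(fd, M @ psi(t, s, lam), atol=1e-5)
```

The eigenvalues of M all equal −λ, so M is Hurwitz for any λ > 0, as required by H2.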
This leads to the following optimization problem:

min over x̃, ũ of ∫_0^∞ (x̃(t) − x̃_ref(t))^T Q (x̃(t) − x̃_ref(t)) + (ũ(t) − ũ_ref(t))^T R (ũ(t) − ũ_ref(t)) dt,    (8)

subject to the dynamics, the initial condition x̃(0) = x_0, and a linear input and state constraint (9), where Q ≽ 0 and R ≻ 0 are matrices of appropriate dimension. The problem is repeatedly solved with the current state x_0 as the initial condition (as is common in MPC). Note that for simplicity, only a linear constraint on the state and input trajectories is included in the optimization. Slightly more general linear constraints, including derivatives or integrals of the state and input trajectories, can be dealt with in an analogous way. Feedback control is achieved by repeatedly solving (8) and applying the first portion of the input ũ(t), t ∈ [0, T_s), to the system, where T_s denotes the sampling time. The reference trajectories x̃_ref and ũ_ref, as well as the disturbances ṽ, are adapted according to the time that has elapsed. Thus, at time t = T_s, the reference and disturbance trajectories are shifted by T_s. According to (6), this amounts to a simple multiplication of the parameter vectors η_x,ref, η_u,ref, and η_v with the matrix

e^(M^T T_s) ⊗ I,    (10)

where the dimension of the identity matrix I is chosen accordingly.
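The shift of the reference parameters by one sampling period can be sketched in a few lines; the basis functions, dimensions, and numerical values below are assumptions for the example. The check confirms that evaluating the shifted parameters at time t reproduces the original parametrization at time t + T_s:

```python
import numpy as np
from scipy.linalg import expm

# Assumed basis: psi_i(t) = t^(i-1) e^{-lam t}, s = 3, state dim n = 2.
s, n, lam, Ts = 3, 2, 1.5, 0.02
M = -lam * np.eye(s)
for i in range(1, s):
    M[i, i - 1] = i
psi = lambda t: np.array([t**i * np.exp(-lam * t) for i in range(s)])

rng = np.random.default_rng(0)
eta = rng.standard_normal(s * n)     # stacked reference parameters
x_ss = rng.standard_normal(n)        # steady-state offset

# Advance the reference by one sampling period:
# eta_shifted = (expm(M*Ts)^T kron I_n) @ eta
eta_shifted = np.kron(expm(M * Ts).T, np.eye(n)) @ eta

# The shifted parameters reproduce the trajectory Ts seconds later.
t = 0.4
x_orig = np.kron(psi(t + Ts), np.eye(n)) @ eta + x_ss
x_shift = np.kron(psi(t), np.eye(n)) @ eta_shifted + x_ss
assert np.allclose(x_orig, x_shift)
```

The steady-state offsets are unaffected by the shift, since they parametrize the constant part of the trajectory.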
For orthonormal basis functions, † the above optimization problem reduces to the quadratic program 6

min over η_x, η_u of (η_x − η_x,ref)^T (I_s ⊗ Q) (η_x − η_x,ref) + (η_u − η_u,ref)^T (I_s ⊗ R) (η_u − η_u,ref),    (11)

subject to the constraints (12) to (14). The linear equality constraint (12) imposes the dynamics and (13) the initial condition. The linear inequality constraint (14) describes the input and state constraints. These are obtained by sampling the constraint (9). The values of the matrices F_x, F_u, D_x, and D_u may change slightly for different applications, and are given in Appendix C for the system considered in Section 4. The optimization therefore simplifies to solving the quadratic program (11), subject to the optimization variables η_x ∈ R^(ns) and η_u ∈ R^(ms). Due to the parametrization with the basis functions, the quadratic program (11) typically has few optimization variables and can be solved efficiently. As highlighted before, by exploiting the time-shift property of the basis functions (see (6)), the problem can be warm-started with the solution from the previous time step.
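The equality-constrained core of such a tracking problem (dynamics plus initial condition, with the inequality constraints that the active-set solver handles omitted) can be sketched for a double integrator by solving the KKT system directly. All dimensions, weights, the decay rate λ = 1, and the reference parameters below are assumptions for the sketch, not values from the article:

```python
import numpy as np
from scipy.linalg import block_diag

# Double integrator: n = 2 states, m = 1 input, s = 3 basis functions
# psi_i(t) = t^(i-1) e^{-t}; disturbances and offsets set to zero.
n, m, s = 2, 1, 3
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
M = np.array([[-1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0],
              [0.0, 2.0, -1.0]])       # psi_dot = M psi (lambda = 1)
psi0 = np.array([1.0, 0.0, 0.0])       # psi(0)

Q, R = np.eye(n), 0.1 * np.eye(m)      # tracking weights
H = block_diag(np.kron(np.eye(s), Q), np.kron(np.eye(s), R))

# Dynamics constraint (M^T kron I - I kron A) eta_x - (I kron B) eta_u = 0
D = np.hstack([np.kron(M.T, np.eye(n)) - np.kron(np.eye(s), A),
               -np.kron(np.eye(s), B)])
# Initial condition (psi(0)^T kron I) eta_x = x0
E = np.hstack([np.kron(psi0, np.eye(n)), np.zeros((n, m * s))])
A_eq = np.vstack([D, E])
x0 = np.array([1.0, 0.0])
b_eq = np.concatenate([np.zeros(n * s), x0])

rng = np.random.default_rng(1)
z_ref = rng.standard_normal((n + m) * s)   # stacked reference parameters

# Solve  min (z - z_ref)^T H (z - z_ref)  s.t.  A_eq z = b_eq  via KKT.
KKT = np.block([[H, A_eq.T],
                [A_eq, np.zeros((A_eq.shape[0], A_eq.shape[0]))]])
sol = np.linalg.solve(KKT, np.concatenate([H @ z_ref, b_eq]))
z = sol[:(n + m) * s]                      # optimal (eta_x, eta_u)
assert np.allclose(A_eq @ z, b_eq)
```

In the actual controller, the sampled inequality constraints (14) are added on top of this core and the problem is handed to an active-set QP solver.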

Trajectory generation
The goal of the trajectory generation is to generate feasible input, state, and disturbance trajectories, such that the state trajectory follows the desired trajectory (given to the algorithm) as closely as possible. The desired trajectory does not necessarily prescribe the motion for all states. For example, it could contain only a position reference, while finding the appropriate reference trajectories for the velocities is left to the algorithm. The feasible trajectories satisfy the system's dynamics, as well as the input and state constraints. They are generated by solving an optimization problem, where the desired trajectory is given in the time domain. The resulting trajectories are parametrized by η_x,ref, η_u,ref, and η_v, and the steady-state offsets x_ss, u_ss, and v_ss, as indicated in Figure 2.
Due to the finite number of basis functions, only trajectories of finite length can be approximated well. In order to overcome this limitation, the trajectory's duration is split into N smaller intervals of length T, and a different linear combination of basis functions is used to approximate the trajectory in each of these intervals, as shown in Figure 3. Therefore, the trajectory tracking problem (11) is modified such that, depending on the time that has elapsed, the reference trajectory in the corresponding interval is tracked, that is, η_x,ref, η_u,ref, η_v, x_ss, and u_ss are replaced by the corresponding η_x,ref,j, η_u,ref,j, η_v,j, x_ss,j, and u_ss,j, for j = 0, 1, … , N − 1. A practical advantage of this approach is the possibility of aborting the trajectory at any time during execution, leading the system to a safe equilibrium position (represented by the current interval's steady-state offsets). In that case, provided that the constraints are satisfied for all times (a sampling strategy is applied in this article), closed-loop stability and recursive feasibility are guaranteed. 6,22 The approach is particularly suited to smooth desired trajectories, where the approximations in two consecutive intervals generally overlap for a certain time (see Figure 3). However, this might not be the case for trajectories with sharp changes at the time of switching between two intervals.
In order to find a feasible approximation that is closest to the desired trajectory in the L2-sense, while penalizing the input and fitting the estimated disturbances, the set of reference parameters and steady-state offsets is chosen to minimize the cost (15), where Q ≽ 0, R ≻ 0, and V ≽ 0 are suitable symmetric tuning matrices, an additional scalar weight controls the smoothness of the input trajectories, and the desired state trajectory and the estimated disturbances are divided over the N intervals, for t ∈ [0,T]. Defining the optimization variables through the auxiliary quantities z_x,ref,j := (η_x,ref,j, x_ss,j), z_u,ref,j := (η_u,ref,j, u_ss,j), and z_v,j := (η_v,j, v_ss,j), and eliminating the terms that do not depend on these variables, simplifies (15) to a quadratic cost with Q̃, R̃, Ṽ, f_x, and f_v defined in Appendix A.
In addition, the dynamics need to be fulfilled by the generated trajectories, and the steady-state offsets are required to be equilibria. This is encoded, for j = 0, … , N − 1, as

(M^T ⊗ I_n) η_x,ref,j = (I_s ⊗ A) η_x,ref,j + (I_s ⊗ B) η_u,ref,j + (I_s ⊗ G) η_v,j,
A x_ss,j + B u_ss,j + G v_ss,j = 0.
In order to ensure that the state, the input, and the disturbances are continuous over subsequent intervals, the following constraints are added, for j = 1, … , N − 1:

(Ψ(T)^T ⊗ I_n) η_x,ref,j−1 + x_ss,j−1 = (Ψ(0)^T ⊗ I_n) η_x,ref,j + x_ss,j,    (25)
(Ψ(T)^T ⊗ I_m) η_u,ref,j−1 + u_ss,j−1 = (Ψ(0)^T ⊗ I_m) η_u,ref,j + u_ss,j,    (26)
(Ψ(T)^T ⊗ I_{n_v}) η_v,j−1 + v_ss,j−1 = (Ψ(0)^T ⊗ I_{n_v}) η_v,j + v_ss,j.    (27)

The constraints (25), (26), and (27) prevent discontinuities in the actuation when the trajectory tracking controller switches reference parameters between consecutive intervals. In case the disturbances stem from the system's nonlinearities, (27) encodes the natural fact that the disturbances are continuous. Without loss of generality, the trajectories are enforced to start at zero (other cases can be dealt with in a similar way, up to a translation), that is,

(Ψ(0)^T ⊗ I_n) η_x,ref,0 + x_ss,0 = 0,  (Ψ(0)^T ⊗ I_m) η_u,ref,0 + u_ss,0 = 0.

Finally, the state and input constraints are imposed for j = 0, … , N − 1. Stacking the decision variables of all intervals into a single vector z, the complete optimization problem that is solved to generate the reference set of parameters and steady-state offsets is rearranged as the quadratic program (33), with H and f defined in Appendix A, and where A_eq, A_in, z_min, and z_max are defined through appropriate concatenations of the constraints (23) to (31). As mentioned in the previous subsection, in contrast to the work of Sferrazza et al, 23 the steady-state offsets are included among the decision variables.
In the authors' experience, using an active-set solver, 24 the optimization problem (33) can be solved more efficiently when the equality constraints are eliminated. This is done by means of a QR factorization of A_eq^T,

A_eq^T P_A = Q_A R_A = [Q_A,1  Q_A,2] [R_A,1; 0],

where Q_A = [Q_A,1  Q_A,2] is an orthogonal matrix, R_A and R_A,1 are upper triangular matrices, P_A is a permutation matrix chosen such that the magnitude of the elements on the main diagonal of R_A is decreasing, and 0 is the zero matrix of appropriate dimensions. From the orthogonality of Q_A, it follows that the equality constraints are satisfied by replacing

z = Q_A,1 R_A,1^(−T) P_A^T b_eq + Q_A,2 ẑ,    (36)

where b_eq denotes the right-hand side of the equality constraints and ẑ is an arbitrary vector of appropriate dimension. Therefore, according to (36), the optimization problem in (33) can be reformulated as the reduced problem (38) in the variable ẑ, subject only to the inequality constraints. The problem in (38) generally has considerably fewer optimization variables and constraints than the one in (33), and it exhibited smaller solution times in the setup used by the authors.
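The elimination can be sketched with SciPy's pivoted QR routine; the random problem data below are assumptions for the example (a fat A_eq of full row rank), and the check confirms that every choice of the reduced variable ẑ satisfies the equality constraints:

```python
import numpy as np
from scipy.linalg import qr

def eliminate_equalities(A_eq, b_eq):
    """Parametrize {z : A_eq z = b_eq} as z = z_p + N @ zhat, zhat free,
    via a pivoted QR factorization of A_eq^T; A_eq is assumed to have
    full row rank (a sketch of the elimination described in the text)."""
    n_eq = A_eq.shape[0]
    QA, RA, piv = qr(A_eq.T, pivoting=True)
    QA1, QA2 = QA[:, :n_eq], QA[:, n_eq:]   # range / null space of A_eq^T
    RA1 = RA[:n_eq, :n_eq]
    # Since A_eq[piv, :] = RA1^T QA1^T, a particular solution is:
    z_p = QA1 @ np.linalg.solve(RA1.T, b_eq[piv])
    return z_p, QA2

# Small random instance: 3 equality constraints on 7 variables.
rng = np.random.default_rng(2)
A_eq = rng.standard_normal((3, 7))
b_eq = rng.standard_normal(3)
z_p, N = eliminate_equalities(A_eq, b_eq)

# Any reduced variable zhat yields a feasible z.
zhat = rng.standard_normal(N.shape[1])
assert np.allclose(A_eq @ (z_p + N @ zhat), b_eq)
```

Substituting z = z_p + N ẑ into the cost and inequality constraints of (33) yields the smaller QP in ẑ, as described in the text.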

Kalman filter
The repeatable disturbances of the system are estimated with a Kalman filter, and are provided at each trial to the optimization problem described in Section 2.2. The filter aims at fusing prior and measurement information.
The repeatable disturbances are modeled as constant from trial to trial, up to process noise, that is,

v[k]_(i+1) = v[k]_i + n[k]_i,

where i is the current trial and the process noise n[k]_i is assumed to be zero-mean Gaussian with variance Q_KF. Therefore, the Kalman filter's prior update leaves the disturbance estimate unchanged and increases its variance by Q_KF. The measurements of the disturbances are the deviations of the recorded trajectories from the nominal discrete-time model,

z[k]_i = x[k + 1]_i − A_d x[k]_i − B_d u[k]_i,

where A_d, B_d, and G_d are the discrete-time system's matrices, and the actual state and input trajectories, x[k]_i and u[k]_i, are recorded during the ith trial. In order to increase the signal-to-noise ratio, the disturbance vector is held constant for N_KF discrete time steps. In this way, at each trial, N_KF measurements subsequently contribute to updating a single disturbance estimate value. Assuming that z[k]_i is corrupted by zero-mean Gaussian noise with variance R_KF, the measurement update is performed as described in Algorithm 1. The current disturbance estimate is transformed to a continuous-time signal v(t) through a zero-order hold, and is used as an input to the trajectory generation problem, as described in Section 2.2. In contrast to the work of Sferrazza et al, 23 the parametrized approximation describing v(t) is obtained by optimizing (15), providing a trade-off between the placement of the various offsets and the fit of the reference trajectories.
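The trial-to-trial estimation can be sketched for a single held disturbance value in the scalar case; all names and numbers below (G_d, Q_KF, R_KF, N_KF, the true disturbance) are illustrative assumptions, and noise-free measurements are used so that the convergence is easy to see:

```python
# Scalar sketch of the trial-to-trial Kalman filter for one disturbance
# value that is held constant over N_KF time steps.
G_d = 1.0     # discrete-time disturbance input (scalar here)
Q_KF = 1e-3   # trial-to-trial process noise variance
R_KF = 0.05   # measurement noise variance
N_KF = 5      # time steps over which the disturbance is held constant

v_hat, P = 0.0, 1.0   # zero-mean initial estimate and its variance
true_v = 0.4          # repeatable disturbance to be identified

for trial in range(20):
    # Prior update: v_{i+1} = v_i + n_i with n_i ~ N(0, Q_KF), so the
    # estimate is unchanged and the variance grows by Q_KF.
    P += Q_KF
    # Measurement update: each of the N_KF deviations
    # z[k] = x[k+1] - A_d x[k] - B_d u[k] refines the same held value
    # (noise-free measurements are used here for clarity).
    for _ in range(N_KF):
        z = G_d * true_v
        K = P * G_d / (G_d * P * G_d + R_KF)
        v_hat += K * (z - G_d * v_hat)
        P *= (1.0 - K * G_d)

assert abs(v_hat - true_v) < 1e-3
```

In the actual implementation, one such estimate is maintained per held segment and per disturbance dimension, and the updates run offline after each trial.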

Algorithm 1. Pseudo-code for measurement update
Initialize: k = 0 while k is smaller than the trajectory length do

MEMORY CONSIDERATIONS
The parametrization of the system's states, inputs, and disturbances is particularly advantageous when applied to trajectory tracking. The reference trajectories need to be stored in memory and loaded during task execution, as is also the case with a standard approach; see, for example, the work of Kühne et al. 25 When these trajectories are represented in discrete time, a reference value needs to be available for each quantity of interest at every sampling step.
Therefore, the space requirements amount to

c (n + m + n_v) t_f / T_s,

where c is the space needed to store a floating-point number, t_f is the trajectory duration, and T_s is the sampling time.
Conversely, by using the parametrization introduced in Section 2.1, a set of parameters and steady-state offsets represents the reference trajectory at each interval. Therefore, the space requirements are

c (n + m + n_v) (s + 1) t_f / T,

where T is the length of the intervals the trajectory is divided into. Space requirements are reduced through the parametrization when

(s + 1) T_s < T.    (46)

Introducing the space-efficiency coefficient

ρ := T / ((s + 1) T_s),    (47)

the condition in (46) can be expressed as ρ > 1. In the application presented in Section 4, a space-efficiency coefficient of about 8.33 is achieved, meaning that almost an order of magnitude less memory is required compared with a standard approach. Note that this also greatly affects the dimension of the trajectory generation problem. As pointed out in the book by Bentley, 26 space efficiency often has positive effects on run time. In fact, a smaller program is likely to fit into the faster levels of the memory hierarchy, 27 which generally have limited size. These effects are even more relevant when a collection of several trajectories needs to be available on embedded systems with relatively small random access memory. In this case, the access times stemming from the need to load the trajectory data from disk storage can be reduced by a compact representation of the reference trajectories, as described in this article.
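The comparison can be sketched numerically; the dimensions, interval length, and number of basis functions below are illustrative assumptions (chosen so that the ratio lands near the value reported for the experiments), not the values actually used in Section 4:

```python
# Memory comparison sketch; all numeric values are illustrative assumptions.
def discrete_storage(n, m, n_v, t_f, T_s, c=8):
    """Bytes to store discrete-time references: one value per quantity
    and per sampling step (c bytes per floating-point number)."""
    return c * (n + m + n_v) * round(t_f / T_s)

def parametrized_storage(n, m, n_v, t_f, T, s, c=8):
    """Bytes to store s parameters plus one steady-state offset per
    quantity, for each of the N = t_f / T intervals."""
    return c * (n + m + n_v) * (s + 1) * round(t_f / T)

n, m, n_v = 13, 4, 13            # assumed state/input/disturbance dims
t_f, T_s, T, s = 12.0, 0.02, 2.0, 11

rho = (discrete_storage(n, m, n_v, t_f, T_s)
       / parametrized_storage(n, m, n_v, t_f, T, s))
print(rho)   # rho = T / (T_s * (s + 1)), about 8.33 for these numbers
```

Note that the dimensions cancel in the ratio, so ρ depends only on T, T_s, and s, in agreement with (47).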

EXPERIMENTAL RESULTS
The proposed approach is evaluated on a quadcopter, with the task of balancing a pole while flying a predefined trajectory. This section presents the setup and the experimental results.

Hardware
The experiments are conducted on a quadcopter which uses the frame, motors, and motor controllers of an AscTec Hummingbird. The quadcopter is equipped with custom electronics and a circular plate that serves as a base for the pole. The physical parameters of the experimental setup are summarized in Table 1. An onboard controller, running at 1 kHz on a PX4FMU flight controller, tracks the desired angular rates and the desired total thrust, which are commanded by the parametrized MPC controller. The latter runs on a standard desktop computer that communicates with the quadcopter's flight controller through wireless communication and achieves a sampling time of 0.02 s. The experiments were conducted in the Flying Machine Arena, 28 where an external motion tracking system provides measurements of the system's state.

Model
The quadcopter's state is represented by its position, that is, r_x, r_y, and r_z, the corresponding linear velocities v_x, v_y, and v_z, and the xyz-Euler angles φ, θ, and ψ. The pole is represented by its roll and pitch angles, φ_p and θ_p, and the corresponding angular velocities. The pole is assumed to be very thin, and its rotation about its axis of symmetry is not modeled. The quadcopter's angular rates ω_x, ω_y, and ω_z, and the difference between the total thrust f_tot and the hovering thrust f_tot,0, are regarded as the control inputs of the MPC controller. These are then sent to the onboard controller, where they are tracked with a very high bandwidth. 28 Summarizing, the complete state vector stacks the position, the linear velocities, the Euler angles, and the pole angles and angular velocities, and the input vector is

u = (ω_x, ω_y, ω_z, f_tot − f_tot,0).

The system's matrices are derived from a simple first-principles model (see Appendix B for a detailed derivation), which considers the quadcopter and the pole to be rigid bodies. Friction between the pendulum and the quadcopter, drag of the pendulum and the quadcopter, as well as any other aerodynamic forces, are neglected (the propellers are assumed to provide a thrust aligned with the symmetry axis of the quadcopter). The nonlinear model is linearized about hover, which yields the system matrices A and B, in which k_1, k_2, and k_3 are constants (see Appendix B), m_q is the mass of the quadcopter, and m_p is the mass of the pole. The matrix G, which models how the disturbances affect the system's dynamics, is chosen as the identity matrix of dimension n_v. Actuation constraints are imposed on the thrust f_l of each single rotor, that is,

f_min ≤ f_l ≤ f_max,  for l = 0, 1, 2, 3.

State constraints on the relative angles between the pole and the quadcopter ensure sufficient contact between the two bodies; they bound the magnitudes of φ_p − φ and θ_p − θ. These constraints can be reduced to the form in (14), as shown in Appendix C.
The desired trajectory x_des(t) is only provided for r_x, r_y, and r_z, and is equal to zero for all the other quantities, therefore leaving the solver the flexibility to choose feasible trajectories for these states. The tuning matrices of the trajectory tracking and trajectory generation problems are chosen to be positive definite, which implies that the trajectory tracking problem in (8) and the trajectory generation problem in (33) are strictly convex quadratic programs. This ensures convergence of both programs to a feasible solution (if a solution exists) when solved with an active-set method. 29 A "prerun" and a "postrun" portion are generally added at the start and the end of the trajectory (see Figure 3) by padding the desired trajectory with the initial and final state, respectively, for an appropriate time. Both the prerun and the postrun ensure a smoother transition to and from the hovering phase, which precedes and follows each execution. In addition, the postrun portion serves as a penalty for steering the system to the final state (note that the trajectory generation does not impose a hard constraint on the final state).
The Kalman filter's state is initialized with zero mean and a suitable initial variance. The remaining scalar parameters are summarized in Table 2.
Both the trajectory generation and the online trajectory tracking are solved using qpOASES, 24 which implements an active set strategy. The trajectory generation program is solved on average in 250 ms, while the trajectory tracking controller runs at 50 Hz.
A 12 s trajectory representing a three-dimensional figure eight is encoded in x_des(t), and the parametrization obtained through the trajectory generation procedure provides a feasible approximation, as shown in Figure 4. As a result, the space-efficiency coefficient introduced in (47) is about 8.33. This trajectory is repeatedly tracked by the parametrized MPC controller introduced in Section 2.1, and the disturbances are updated after each trial, as explained in Section 2.3. The resulting tracking performance is shown in Figures 5 and 6. ‡ The framework presented in this article builds upon the scenario that trajectories are repetitively tracked, and that the system's disturbances can be captured by the basis functions used for the parametrization. In fact, the iterative learning algorithm especially needs to compensate for tracking errors when the quadcopter tilts and changes direction, that is, when the model linearized about hover is not very accurate. The fact that the system improves its tracking performance in these regions over subsequent trials shows that this framework is able to account for unmodeled nonlinearities, provided that the above assumptions are met. Compared with the heuristic approach of Sferrazza et al, 23 the use of steady-state offsets enables the proposed strategy to cope with trajectories and disturbances of larger magnitude (even at the end of an interval), as shown in the example in Appendix D. Together with the fact that a different system is used for the experiments, this facilitates the tracking of considerably more aggressive trajectories. The disturbance estimate shows a higher variance toward the end of the trajectory, as a consequence of the particular choice of modeling the system's disturbances as time-varying trajectories.
In fact, an incorrect disturbance estimate in an earlier part of the trajectory propagates in time and affects the later parts (it might even drive the system to a state where the current disturbance estimate is no longer a valid approximation). However, after multiple trials the disturbance estimates converge (see, for example, Figure 7), which alleviates this effect.
The quadratic cost for the different trials, defined through the matrices Q and R, is shown in Figure 8. Because the disturbances are not modeled as a function of the state, they are subject to change when the system explores different regions of the state space, that is, during the first trials, when the algorithm has not yet converged. The procedure might therefore first exhibit an increase in the cost, corresponding to this initial exploration phase, before reaching an overall lower steady-state cost. However, modeling the disturbances as time-varying trajectories has the benefit of not increasing the complexity of the online trajectory tracking MPC controller.
In the experiment presented, the algorithm requires only five trials to reach the steady-state cost, and it exhibits little fluctuation in the remaining executions. Although in this example the system always executed the same task under the same conditions, the fast learning rate and the fact that the learning algorithm keeps running after each trial (ie, no early stopping is required) make it possible to responsively adapt the disturbance estimate to new situations, for example, if the quadcopter is subject to a different load.
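The trial-to-trial disturbance update described above can be sketched as a standard Kalman filter iterated over trials. The code below is a minimal illustration only: the dimensions, noise covariances, and the measurement model C are made-up placeholders, not the article's actual filter.

```python
import numpy as np

def kalman_trial_update(v_hat, P, y_meas, C, Q_trial, R_meas):
    """One trial-to-trial Kalman update of the repeatable-disturbance estimate."""
    # prediction: the disturbance is assumed repeatable, with a small
    # random walk between trials to keep the filter adaptive
    P = P + Q_trial
    # measurement update with the observed model mismatch y = C v + noise
    S = C @ P @ C.T + R_meas
    K = P @ C.T @ np.linalg.inv(S)
    v_hat = v_hat + K @ (y_meas - C @ v_hat)
    P = (np.eye(len(v_hat)) - K @ C) @ P
    return v_hat, P

# toy run: a constant true disturbance of 2 is identified over repeated trials
v_hat, P = np.zeros(1), np.eye(1)
C, Qt, R = np.eye(1), 0.01 * np.eye(1), 0.1 * np.eye(1)
for _ in range(15):
    v_hat, P = kalman_trial_update(v_hat, P, np.array([2.0]), C, Qt, R)
```

Because `Q_trial` never vanishes, the filter keeps a nonzero gain after convergence, which is what allows the estimate to track a changed load in later trials.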

APPENDIX A. MATHEMATICAL DERIVATION OF THE PARAMETRIZED MPC MATRICES
In the following, the cost function (15) is rearranged by splitting the integral and expanding the different terms. Expanding the terms involving the system's state yields (A1). Neglecting the first term in the sum, which does not depend on the optimization variables, and defining z_{x,ref,j} := (η_{x,ref,j}, x_{ss,j}), (A1) can be rewritten as (A2), with the weighting matrix Q̃ defined accordingly. The terms involving the system's disturbances are expanded in the same way; defining z_{v,j} := (η_{v,j}, v_{ss,j}) yields (A7). Note that, due to Assumption H2 in Section 2.1, the parametrized input ũ provides a means to rearrange the input terms as (A10). Defining z_{u,ref,j} := (η_{u,ref,j}, u_{ss,j}), (A10) can be rewritten as (A11). Summing (A2), (A6), and (A11) over all intervals, the cost function (15) reduces to a quadratic form in the parameter vector, with z, z_{x,ref}, z_{u,ref}, and z_v defined as in (32), (19), (20), and (21).
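Schematically, each integral term reduces to a finite-dimensional quadratic form in the coefficient vectors; the expression below is an assumed generic form for illustration (the symbol Φ and the closed-form claim are not copied from the article):

```latex
\int_0^{\infty} \| x(t) - x_{\mathrm{ref}}(t) \|_Q^2 \,\mathrm{d}t
  = (z_x - z_{x,\mathrm{ref}})^\top \tilde{Q}\,(z_x - z_{x,\mathrm{ref}}),
\qquad
\tilde{Q} := \int_0^{\infty} \Phi(t)^\top Q\,\Phi(t)\,\mathrm{d}t,
```

where Φ(t) stacks the basis functions and the steady-state offset. Since the basis functions are exponentially decaying, a Gramian of this kind can be precomputed offline in closed form, so no online integration is required.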

APPENDIX B. FIRST PRINCIPLES MODEL
For the purpose of modeling the system's dynamics, the following frames are introduced: {I}, the inertial frame; {B}, the quadcopter-fixed frame, with its z-axis aligned with the thrust direction (and symmetry axis of the quadcopter); and {C}, the pole-fixed frame, with its z-axis aligned with the symmetry axis of the pole.
The equations of motion are derived using the principle of virtual power. This yields (B1), describing the position of the quadcopter, and (B2), describing the attitude of the pendulum, where m_p and m_q are the masses of the pole and the quadcopter, respectively, v_p and v_q are the linear velocities of the pole and the quadcopter, R_IB ∈ SO(3) is the rotation matrix that transforms vectors from the frame B to the frame I, R_IC ∈ SO(3) is the rotation matrix that transforms vectors from the frame C to the frame I, f_tot is the total thrust generated by the four rotors, g is the gravity vector, Θ_p is the inertia of the pole, r_PQ is the distance vector from the pole's center of gravity (positioned at half-length of the pole) to the quadcopter's center of gravity, expressed in the frame C, and ω_IC is the angular velocity of the frame C with respect to the inertial frame. Throughout this section, a preceding superscript denotes the frame in which a particular vector or tensor is expressed. Moreover, the "tilde" symbol on top of a vector denotes its corresponding skew-symmetric matrix, that is, ã is defined by a × b = ãb for a ∈ R³ and all b ∈ R³. Note that the quadcopter's angular rates, ω^B_IB, are considered control inputs for the MPC controller (their components are tracked with very high bandwidth by the onboard controller). The angular rates govern the quadcopter's attitude through the kinematic relation (B3) and enter the combined quadcopter and pole dynamics through R_IB in (B1). Moreover, the fact that the quadcopter and the pole are connected yields a constraint expressed on acceleration level, (B4). Thus, the combination of (B1), (B2), (B3), and (B4) describes the entire quadcopter-pole dynamics. The algorithms presented in this article rely on a linear time-invariant system model. To that end, the nonlinear equations of motion are linearized about hover.
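The tilde operator can be checked numerically; the following minimal snippet (illustrative only, with arbitrary test vectors) constructs the skew-symmetric matrix and verifies a × b = ãb:

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix a~ such that skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0,  -a[2],  a[1]],
                     [a[2],  0.0,  -a[0]],
                     [-a[1], a[0],  0.0]])

a = np.array([1.0, 2.0, 3.0])
b = np.array([-0.5, 4.0, 1.0])
```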
The hover equilibrium is given by zero linear velocity and zero angular velocity of the quadcopter, the pole at rest in the upright position, and the constant thrust f_tot = (m_p + m_q) g_0, with g_0 := 9.81 m/s². Linearizing (B4) about hover yields (B6). Linearizing (B1) and including the relation (B6) results in (B7), where e_x, e_y, and e_z are the standard unit vectors in R³, and φ and θ are the quadcopter's roll and pitch angles. Linearizing (B2) and including (B6) yields (B8), where φ_p and θ_p are the pole's roll and pitch angles, and d_p is the pole's half-length, that is, |r_PQ| = d_p.
The quantities in Equations (B7) and (B8) relate to the ones in (49) and (50) through the following equalities, where the angular velocity ω_p of the pole is not included in the system's state.
For the system on which the experiments are conducted, the pole's inertia matrix has a diagonal form. Moreover, the following relation holds for the individual rotor thrusts û, where Θ_q is the quadcopter's inertia matrix, d_q is the length of a quadcopter's arm, and k_f is a physical constant. For the system on which the experiments are conducted, the inertia matrix has the shape Θ_q = diag(I_x, I_y, I_z).
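A relation of this kind can be illustrated with the standard mixer matrix of a plus-configuration quadcopter; the layout below is a common textbook form and an assumption for illustration, not necessarily the article's exact rotor convention or (C1) itself.

```python
import numpy as np

def mixer(d_q, k_f):
    """Map rotor thrusts u_hat = (f1, f2, f3, f4) to (f_tot, tau_x, tau_y, tau_z)."""
    return np.array([
        [1.0,   1.0,  1.0,  1.0],   # total thrust: sum of rotor thrusts
        [0.0,   d_q,  0.0, -d_q],   # roll torque from the two side rotors
        [-d_q,  0.0,  d_q,  0.0],   # pitch torque from front/back rotors
        [k_f,  -k_f,  k_f, -k_f],   # yaw torque from rotor drag
    ])

M = mixer(d_q=0.2, k_f=0.01)        # hypothetical arm length and drag constant
```

Because the matrix is invertible, box constraints on the individual rotor thrusts û translate into constraints on the total thrust and body torques, which is what the constraint formulation exploits.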
The physical parameters introduced in this section are summarized in Table C1.
With the introduced relations, (C1) becomes (C11). Considering the approximation in (3) and Assumption H2 in Section 2.1, (C11) can be rearranged as (C12). The above semi-infinite constraint is required to hold for all times t ∈ [0, ∞) and is implemented by sampling at the time instances t_i, for i = 0, … , s − 1. In the experiments described in this article, t_0 = 0 s, t_1 = 0.0929 s, t_2 = 0.3215 s, t_3 = 0.7164 s, and t_4 = 1.3692 s. Thus, by defining the matrices F̃_u and D̃_u accordingly, the constraint (C12) can be approximated as u_min ≤ F̃_u η_u + D̃_u u_ss ≤ u_max.
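The sampling of the semi-infinite constraint can be sketched as follows; the sampling instants t_i are taken from the article, whereas the decay rates, the bounds, and the scalar-input setup are hypothetical placeholders.

```python
import numpy as np

t_i = np.array([0.0, 0.0929, 0.3215, 0.7164, 1.3692])  # sampling instants (article)
lambdas = np.array([1.0, 2.0, 4.0])                     # hypothetical decay rates

# for exponentially decaying basis functions, u(t_i) = F_u_tilde[i] @ eta_u + u_ss
F_u_tilde = np.exp(-np.outer(t_i, lambdas))

def box_feasible(eta_u, u_ss, u_min, u_max):
    """Check u_min <= F_u_tilde @ eta_u + u_ss <= u_max at the sampled instants."""
    u = F_u_tilde @ eta_u + u_ss    # here the D~ term is simply a column of ones
    return bool(np.all(u >= u_min) and np.all(u <= u_max))
```

Since the basis functions decay monotonically, enforcing the constraint at a few well-chosen instants is a reasonable surrogate for the continuum of times.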
Applying the same sampling strategy to the state constraints (52) and (53) yields a condition of the same form, where the matrices F̃_x and D̃_x and the vectors x_min and x_max are defined accordingly. Consequently, the matrices F_x, F_u, D_x, and D_u, and the vectors b_min and b_max in (14) are obtained by appropriate stacking, as given in (C14).

APPENDIX D. EXAMPLE SIMULATION FOR COMPARISON TO PREVIOUS WORK
In Figure D1, simulation results are provided that compare the approach presented here to previous work. 23 The simulations are conducted on an unconstrained double integrator system, that is, ẍ(t) = u(t) + v, where v = 1 m/s² is a constant disturbance. A ramp of 10 m is prescribed as the desired trajectory, and the same parameters chosen for the experiment presented in this article are used. The plots show that with the approach of Sferrazza et al, 23 the steady-state disturbance cannot be fully captured by the basis functions over the entire length of an interval. As a consequence, the approximation exhibits oscillations and discontinuities at the interval boundaries. This inaccurate approximation reduces the smoothness of the trajectory compared to the desired one (see the right plots in Figure D1). In contrast, the approach presented here yields an approximation that overlaps with the estimated steady-state disturbance and results in significantly smoother trajectories, as shown in the experiments presented in this article. Note that this effect becomes considerably more relevant for the tracking of aggressive trajectories on higher dimensional systems.
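The effect of the steady-state disturbance can be reproduced in a few lines. The sketch below uses a simple PD tracking loop with disturbance feedforward (chosen for illustration rather than the article's MPC, with made-up gains) to simulate ẍ(t) = u(t) + v on the ramp reference:

```python
import numpy as np

def simulate(v=1.0, v_hat=0.0, T=10.0, dt=0.01, kp=4.0, kd=4.0):
    """Track x_des(t) = t on a double integrator with disturbance v.

    Returns the absolute tracking error at the final time; v_hat is the
    disturbance estimate used as a feedforward correction.
    """
    x, xd = 0.0, 0.0
    slope = 1.0                     # 1 m/s ramp, ie, 10 m over 10 s
    for k in range(int(T / dt)):
        t = k * dt
        x_des, xd_des = slope * t, slope
        u = kp * (x_des - x) + kd * (xd_des - xd) - v_hat
        xdd = u + v                 # true plant: acceleration plus disturbance
        xd += xdd * dt
        x += xd * dt
    return abs(slope * T - x)

err_ff = simulate(v_hat=1.0)        # disturbance estimated correctly
err_no = simulate(v_hat=0.0)        # disturbance ignored
```

With the correct estimate the steady-state tracking error vanishes, whereas ignoring the disturbance leaves a residual offset of roughly v/kp, illustrating why capturing the steady-state disturbance matters.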

FIGURE D1
The plots on the left compare the disturbance estimation using (A) the approach presented in the work of Sferrazza et al 23 and (B) the one introduced here. The disturbance captured by the Kalman filter after nine trials (in red) is approximated at each interval with exponentially decaying basis functions (in black). The plots on the right show the tracking performance of the state x at the tenth trial for the respective approaches. The generated trajectory, which accounts for the current approximation of the disturbance estimate, is shown in black. The actual x trajectory is shown in blue, while the desired ramp trajectory is shown in red. Note that the three trajectories overlap in (B).