Online learning‐based model predictive control with Gaussian process models and stability guarantees

Model predictive control allows to provide high performance and safety guarantees in the form of constraint satisfaction. These properties, however, can be satisfied only if the underlying model, used for prediction, of the controlled process is sufficiently accurate. One way to address this challenge is by data‐driven and machine learning approaches, such as Gaussian processes, that allow to refine the model online during operation. We present a combination of an output feedback model predictive control scheme and a Gaussian process‐based prediction model that is capable of efficient online learning. To this end, the concept of evolving Gaussian processes is combined with recursive posterior prediction updates. The presented approach guarantees recursive constraint satisfaction and input‐to‐state stability with respect to the model–plant mismatch. Simulation studies underline that the Gaussian process prediction model can be successfully and efficiently learned online. The resulting computational load is significantly reduced via the combination of the recursive update procedure and by limiting the number of training data points while maintaining good performance.


INTRODUCTION
Model predictive control (MPC) 1 is naturally capable of dealing with multi-input multi-output systems and constraints on the input, state, and output already in the design process. This has led to manifold scientific interest, as well as practical applications. 2,3 In terms of performance, MPC can be superior to other control approaches because the prediction of the process under consideration allows to compute control actions based on future outcomes and facilitates to take preview information about references and disturbances into account. Hence, the prediction model plays a crucial role in MPC. Unfortunately, there is always a certain process-model error or model uncertainty present in practice and the system might change over time, which limits the prediction quality of the model. One way to deal with this situation is to resort to robust MPC schemes, such as, for instance, min-max MPC, 4 tube-based MPC, 5 multiscenario approaches, 6,7 or stochastic approaches 8 that take the uncertainty explicitly into account.
Prediction models are often based on first principles approaches, which can be very time consuming or even impossible in practice. Furthermore, if the underlying process or environmental conditions change, a once good model can degrade and thus needs to be adapted. An alternative to first principle approaches is to derive prediction models directly from measured data. The resulting models, so-called black or grey box models, 9 can in principle be learned or refined during operation by including newly available data. Thereby, they can account for changing process dynamics or a changing process environment. Combining data-driven with first principles models is another possibility. [10][11][12] Although data-driven modeling is not a new field of research, it gained significant attention over the last years due to increasing computational power, the possibility to widely collect data, and the rise of machine learning algorithms, such as neural networks, deep learning, support vector machines, or Gaussian processes (GPs). 13,14 Especially the use of GPs within MPC has attracted significant interest in recent years. 10,11,[15][16][17][18] However, combining GPs with MPC leads to multiple challenges, such as the cubical increase of the computational load with the number of training data points. This also increases the overall necessary computations to solve the resulting optimal control problem. Furthermore, the utilization of GPs in an optimal control problem can render the resulting optimization very nonlinear, even for a small number of data points, which increases the probability of obtaining suboptimal or infeasible solutions. Despite these challenges, GPs are employed together with MPC as they provide several advantages. For instance, they do not only allow to compute a prediction of the system evolution but also a prediction variance (an effective measure of the uncertainty of the learned model), they are less susceptible to overfitting, and they have, under certain circumstances, universal approximation capabilities for a large class of functions, 19 thereby allowing to model the underlying dynamics of a wide variety of systems.
In order to reduce the computational load of GPs one can distinguish two main approaches. The first approach basically fixes the maximum number of training data points, while the second approach employs so-called sparsity. 20,21 The first approach often entails the drawback that the GP might not be able to model the system with sufficient accuracy throughout the full operation space. To compensate for this, one can resort to online learning (or adaptation) of the GP during operation, which also allows to account for time-varying systems or changing environmental conditions. On the downside, some of the computation time that is saved by reducing the number of training data points is in turn spent by the learning process, which includes updates of the training data set and covariance matrix, recalculation of the covariance matrix inverse, and hyperparameter optimization in each time step. While these often computationally extensive calculations can be performed offline, only very few publications exist that combine MPC with online learning of GPs. The required computations often take too long to control most processes. Thus, GPs are mostly trained/learned offline. 17,22,23 Exceptions are, for instance, the works by Ortman et al, 24 where the system had a large time constant in the order of hours or Klenske et al., 16 which provided a hyperparameter optimization tailored to the specific application.
Another important aspect when combining GPs and model predictive control is safety, constraint satisfaction, and stability, for which different approaches have been proposed. One can, for example, avoid to enforce stability by design and include instead the GP posterior variance in the cost function of the optimal control problem. This avoids steering the plant into regions where the model validity is questionable. [25][26][27] Also, one can perform a posteriori stability verification. For instance, Berkenkamp et al. 28 proposed to learn the region of attraction of a given closed-loop system, whereas Vinogradska et al. 29 calculated invariant sets for the validation of stability in a closed-loop with GP models. Another possibility is to use invariant safe sets and employ a two-layer control framework, where a safe controller is combined with a control policy that optimizes performance. [30][31][32][33] For instance, in the works by Aswani et al. 34 and Bethge et al. 35 two different prediction models were used in parallel, where the first is a nominal model, used to guarantee robust stability using tubes, and the other can be a general learning-based model (e.g., a GP) used to optimize performance. In the work of Soloperto et al. 36 tube-based MPC was considered together with GPs, which were also used to derive robust stability. To this end, uncertainty sets that are based on the GP variance were used to construct tightened state and input constraint sets. Since the uncertainty sets hold probabilistically, the same goes for the stability result. The two-layer framework was extended to three layers in Bastani. 37 The aforementioned approaches are based on the assumption of full state information and the use of invariant terminal regions.
In Maiworm et al. 18 we considered an output nominal MPC scheme (which neither does require full state information nor terminal region in the optimal control problem) with an offline trained GP prediction model and combined it with input-to-state stability (ISS), a framework that covers inherent robust stability of nominal MPC and stability of robust MPC schemes in the presence of constraints. 38 If a system under a predictive controller is shown to be ISS, then this property is preserved even in the case of suboptimal solutions of the involved optimal control problem. We outlined conditions under which the GP-MPC scheme is inherently robustly stable (i.e., bounded disturbances lead to bounded effects on the output) and guarantees recursive constraint satisfaction. To this end, the uncertainty or disturbance has to be bounded deterministically. At the expense of a potentially smaller domain of attraction, the advantage of guaranteeing inherent robust stability lies in its simplicity. The already involved ingredients in MPC merely have to satisfy certain properties (e.g., uniform continuity). The aforementioned methods in the literature on the other hand are conceptually more complex and/or more computationally expensive than the nominal MPC case because different control layers with backup controllers are required, [30][31][32][33]37 different prediction models are employed that have to be evaluated in parallel, 35 or tubes have to be computed. 36 Furthermore, since the employed MPC formulation provides guarantees without a terminal region, then if also no state constraints have to be fulfilled, the resulting optimal control problem is easier to solve.
In this work, we extend our previous results to the case of a limited training dataset of the GP and aim toward online learning for a wide class of applications. To reduce the computational load we do not consider online hyperparameter optimization. Instead, we focus on a recursive approach to adapt the training dataset and compute the inverse covariance matrix tailored to MPC. The main contributions of this work are: • Online learning of the GP model, by means of adaptation of the training dataset, at reduced computational cost. This facilitates the possibility of deployment for faster processes. For this purpose, we employ a recursive formulation to update the GP prediction model online.
• Guaranteed ISS with constraint satisfaction for the presented online learning approach. The result is not confined to GPs but holds for general prediction models that are learned online and satisfy the presented conditions.
• The extension of the method such that it yields good performance with only limited prior process knowledge (e.g., lack of training data in important regions of the operation space). To this end, we incorporate the concept of evolving GPs to facilitate online learning by means of adaptation of the training dataset. 14, 39 We derive criteria that use the GP prediction error and the variance to determine which points to add to the training dataset.
• The use of analytic linearized GP models for the determination of the MPC terminal components.
The paper is structured as follows: The considered problem setup is formulated in Section 2. The concept of GPs, together with the recursive formulation for online learning, is outlined in Section 3 and used for the formulation of the optimal control problem in Section 4. The same section also contains the stability results. Section 5 presents simulation results with focus on online learning of the GP before Section 6 concludes the paper.
Notation: Vectors, matrices, and sequences (of vectors or scalars) are set using bold variables. For matrices we use upper case (Y), for vectors slanted lower case (y), and for sequences upright lower case (y). Sets are denoted by calligraphic upper case variables (). The distance of a point z ∈ R p to a set  ⊂ R p is defined as d(z, ) = inf y∈ ||y − z|| ∞ , where ||⋅|| ∞ is the infinity norm (i.e., d(z, ) = 0 if z ∈ ). If not stated otherwise, ||⋅|| denotes the Euclidean vector norm. A function is  ∞ in s for any value of t and lim t→∞ (s, t) = 0, ∀s ≥ 0.

PROBLEM FORMULATION
We consider nonlinear discrete-time systems represented by a nonlinear autoregressive model with exogenous input (NARX) * Here k denotes the discrete time index, u k ∈ R the input, y k ∈ R the output, and x k ∈ R n x is the NARX "state vector" that consists of the current and past outputs and inputs, and where m y , m u determine the NARX model order n x = m y + m u + 1. The output is corrupted by Gaussian noise ∼  (0, 2 n ) with zero mean, noise variance 2 n , and bounded support | | ≤ < ∞. † Inputs and outputs are restricted to lie in the constraint compact sets  ⊆ R and  ⊆ R, where  are hard constraints and  can be hard or soft constraints that we denote by  h and  s respectively. The NARX state and the output are connected via . The considered control objective is set-point stabilization and optimal set-point change, that is, we want to steer the system from an initial point (x 0 , u 0 ) to a target reference point (x ref , u ref ), while satisfying the constraints and stabilizing the system at the target. To this end, we employ model predictive control, which requires a model of the process (1a) that is capable of predicting future output values with sufficient accuracy. The hat notation( ⋅) denotes an estimated quantity. We outline an approach to learn the system model approximationf (x k , u k ) from measured input-output data using a GP, which is capable of online learning during operation based on newly available data. This results in a GP-based NARX prediction model.
We consider a NARX model with one output that is modeled by a GP. The presented approach can be extended to more outputs, where for each output an individual GP is used, compared to Ostafew et al. 11,22 or Klenske et al. 16 The theoretical results obtained in Section 4 are also valid for the multi-output case.

GAUSSIAN PROCESSES
We first review the basics of GP regression and then present a recursive formulation that is based on the concept of evolving GPs. This facilitates the generation of a NARX prediction model capable of adapting to changing conditions. To reduce the online computational cost, we do not consider online hyperparameter optimization. Instead, we focus on updating the training dataset efficiently and how to perform the required computations online. To this end, we combine this concept with a recursive update of the involved Cholesky decomposition.

Basics
A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. 13 It generalizes the Gaussian probability distribution to distributions over functions and can therefore be used to model/approximate functions that can be used to capture dynamic systems. 14 They can be utilized for models purely derived from data or combined in a hybrid way with other, for instance, deterministic models. [10][11][12]28,30,31,36,41,42 For regression, GPs are employed to derive or approximate maps of the form z = f ( ) + with input , output z, and where f (⋅) is the underlying but unknown latent function. The output is assumed to be corrupted by Gaussian noise ‡ ∼  (0, 2 n ) with zero mean and noise variance 2 n . The objective is to infer the function f (⋅) using measured input-output data ( , z) with a GP g(w) with input w ∈ R n w , called regressor. In the present case (1a), we have z = y k + 1 and f ( ) = f (x k , u k ). The regressor of the GP will be w k = (x k , u k ) ∈ R n w with regressor order n w = n x + 1. For the sake of brevity we omit the dependence on the discrete time step k in the remainder of this section whenever possible.
The first required element is a GP prior distribution g(w) ∼ (m(w), k(w, w ′ )) that is specified via the mean function m(w) = E[g(w)] and the covariance function § k(w, w ′ ) = cov[g(w), g(w ′ )] = E[(g(w) − m(w))(g(w ′ ) − m(w ′ ))] with w, w ′ ∈ R n w and E[⋅] denoting the expected value. The mean and covariance function together with a set of so-called hyperparameters , detailed later, fully specify the GP. † In real processes the measurement noise is always bounded, for instance, due to the limitations of the involved data acquisition systems. ‡ The concept of GPs assumes Gaussian noise in the measurements, that is, noise with unbounded support. The considered real system (1), however, is corrupted by Gaussian noise with bounded support. The resulting approximation error can be absorbed in the prediction error (8) defined further below. On the other hand, Gaussian noise with unbounded support can be regained by GP warping. 43 The smaller the bounded support, the larger the difference between the distributions and the larger the correcting effect of warping. § The covariance function is also denoted as kernel.
F I G U R E 1 Gaussian process (GP) inference: The top figure depicts a GP prior distribution with the dashed black line representing the mean function m(w) and the green lines representing random function realizations drawn from the prior distribution. The grey shaded area is the 95% (twice the SD) confidence interval computed via k(w, w ′ ). When data points  are added (bottom figure, red crosses), the GP posterior with m + (w|) and 2 + (w|) is inferred from this data The GP prior is trained/learned using a set of n measured input-output data points, where the input dataset is w = [w 1 … w n ] T ∈ R n×n w and the output data set z = [z 1 … z n ] T ∈ R n×1 . The combined data  = {w, z} is denoted as training dataset and is used to infer the posterior distribution This is also a GP with posterior mean m + (w|) and posterior variance 2 + (w|) given by with Note that realizations of the posterior can yield infinitely many function outcomes but as it is conditioned on the training data points, it rejects all possible functions that do not go through or nearby (if 2 n ≠ 0) these points ( Figure 1).
The posterior mean function (4a) is the desired estimator of the unknown output latent function f (x k , u k ) in (1a), which we highlight by definingŷ The key elements for a GP to yield a sensible model are the prior mean and covariance function. Both depend generally on a set of hyperparameters , that is, m(w| ) and k(w, w ′ | ). Very often just a constant zero prior mean m(w| ) = c = 0 is used. 15,44,45 However, other choices include, for instance, the use of a deterministic base model x k+1 = f (x k , u k ) as the prior mean function. 10,46 Regarding the covariance function, it is often assumed or known that the system dynamics can be modeled by a member of the space of smooth functions C ∞ . A covariance function that provides this property is the squared exponential covariance function with automatic relevance determination where w i , w j ∈ R n w , = { 2 f , Λ}, and Λ = diag(l −2 1 , … , l −2 n w ). The measurement noise 2 n is added via the Kronecker delta ij in (6). The minimal required number of regressors n w can be determined through optimization of the length scale parameters l in Λ. 47,48 Other choices include, for instance, the combination of (6) with a linear kernel. 47,49 A common approach to determine the hyperparameters , given a training dataset  = {w, z}, is to maximize the log marginal likelihood 13 log(p(z|w, )) = − 1 2 An advantage of GPs is that (4b) naturally provides a quantification of the model uncertainty in the form of its variance. On the other hand, the involved computations in (4a) and (4b) scale with (n 3 ) due to K −1 , where n is the number of training data points. This severely limits the application of GP models for fast processes, where small sampling times are required; especially in the case of relatively large training datasets with several hundred or thousands of data points. If online or close to online hyperparameter optimization is needed, this drawback becomes even more pronounced.

Evolving GPs
In order to efficiently refine the GP model online we seek to update the training dataset  k , possibly at each time step k, during operation. To this end, we resort to the concept of so-called evolving GPs, 14,39 which can be used, for instance, if the training data is only available for certain regions of the operating space and one wants to expand operation beyond these regions online. The concept basically leads to GPs whose training data set  k is updated online using some type of information criterion. Different criteria can be used to select new data points to be added and already existing points to be removed if necessary. The general idea is to include an incoming data point to the training dataset only if it contributes enough new valuable information, which can be defined in different ways and depends on the respective application. Possible options are the use of the information gain, entropy difference, or the expected likelihood. 50,51 We employ the GP as a prediction model in MPC and are therefore particularly interested in how accurate the current model is able to predict the output value at the next time step and how confident this prediction is. To this end, given a new data point (w k , y k+1 ), we first define the prediction error via and define the following rule that determines a new training dataset candidate  ′ k+1 . Rule 1 (New training dataset candidate). At the current time step k with regressor w k and training dataset  k computê y k+1 = m + (w k | k ) and 2 + = 2 + (w k | k ). Once the next output y k + 1 is available, the new data point (w k , y k+1 ) is considered as a candidate for inclusion into the training dataset  k whereē and 2 are prespecified thresholds and  ′ k+1 is the new training dataset candidate for k + 1. Thus, if the prediction error e p is larger then the thresholdē, the data point is considered to be included in the training data set  k because the current posterior model is not able to predict the output with the specified accuracy. If it is smaller but the resulting posterior variance 2 + (w| k ) is larger than the threshold 2 , the data point is also a candidate because the current posterior model is not sufficiently confident in its prediction. This allows to include data points that are relevant to attain a certain prediction quality and effectively allows to limit the necessary number of data points in  k . This becomes especially important for long operation times and many encountered data points with new information during operation.
Remark 2. Since Rule 1 would also consider outliers for inclusion, we propose to combine it with an additional update rule presented in Theorem 1 (Section 4.3). The application of both update rules is contained in Algorithm 1.

Initialization
Training data set .
Optimize hyperparameters (7) with initial data set . Initialize GP posterior mean function m + (w) with covariance matrix K −1 , Cholesky factor R, and (Section 3.3). Compute GP posterior mean gradient m + (w) (Section B). Compute linear GP model at x ref (Section 5.5).

Recursion for each time step k do
Solve optimal control problem (12) for initial condition x k and obtain optimal input sequenceû * k|k .
Remove oldest data point and downdate K ′ and R ′ via (A2).
As the available computational power is always limited and depending on the concrete system, this can require the limitation of the maximum number of points in  k by a constant M ∈ N. ¶ If this limit is reached, data points have to be removed to maintain the size of  k . Again, different criteria can be employed to determine which data point shall be deleted. For instance, the point in the training dataset with the lowest benefit for the model quality (e.g., the data point that is most accurately predicted under the current posterior) can be deleted. This, however, can be computationally expensive because the prediction has to be evaluated for every of the M training data points at each time instant k. For online implementation, we employ a more simple approach that deletes the oldest point contained in  k . Remark 3. The concept of evolving GPs, in particular the outlined data handling approach, leads to a training dataset  k that captures the system dynamics in an (evolving) subregion of the whole operating region. Thus, information about already visited regions can be lost when moving towards other regions and have to be regained when visited again. This could be counteracted, for instance, by exploiting multiple GPs for different regions or by GP blending. 35 Remark 4. In principle, the smaller the thresholdsē and 2 , the better the prediction. However, then also the overhead for the computational evaluation for adding and removing data points becomes larger. In addition, the smaller the thresholds, the smaller the region in which the training dataset captures the system behavior, given the case that only a finite number of training data points is allowed. Hence, the selection of the thresholdsē and 2 is an application specific trade-off and might be chosen heuristically by the user. Some general guidelines are, (i) a lower bound for 2 is the measurement noise variance, and (ii)ē could be chosen proportional to 1 that is, to the mean value of all the absolute values of the prediction errors, based on the current training data set  k . In the same way 2 could be chosen.

Avoiding numerical illconditioning for MPC by Cholesky decomposition
The squared exponential covariance function (6) and other smooth covariance functions lead to a poor conditioned covariance matrix K. 53,54 This results in numerical problems when computing the inverse K −1 with computational cost (n 3 ), as required for (4a), (4b), or (7). These problems become even worse if (4a) and (4b) are nested within an optimization procedure like model predictive control. One way to alleviate this problem is by adding an additional noise or jitter term 53 to the diagonal of the covariance matrix. An effective approach however is to avoid the numerical instabilities that arise in the explicit computation of the matrix inverse by performing the required computations using the Cholesky decomposition, which is numerically more stable.
Given a system of linear equations Ax = b with a symmetric positive matrix A, we denote the solution by is an upper triangular matrix that is called the Cholesky factor. It can be used to obtain the solution via x = R ⧵ (R T ⧵ b). In order to use the Cholesky factor to solve (4a) and (4b), we define which can then be computed with the Cholesky decomposition K = R T R via The computational cost of computing R is  ) and the cost of computing and is (n 2 ). 13 If the training data set  k does not change, the Cholesky decomposition K = R T R and the computation of have to be performed only once at the beginning, whereas has to be recomputed for every new test point w. If  k changes, that is, with each inclusion or removal of a data point, the covariance matrix K has to be updated for an appropriate evaluation of the GP posterior. If a data point is included, a row and column have to be added to K. If a data point is removed, the respective row and column associated with this point have to be removed. These changes require in principle a full recalculation of the Cholesky factor R, which is the most expensive computation. To reduce this computational load we employ the approach of Osborne 54 to recalculate the Cholesky factor recursively, taking advantage of the available factor of the previous step. The precise procedure is outlined in the Appendix in Section A1.
Remark 5. The recursive update of the Cholesky factor can only be applied if the hyperparameters do not change because otherwise, every single element of K changes and a recursive approach is not applicable anymore.
Remark 6. Note that in many works 55-57 not the Cholesky decomposition but the covariance matrix inverse K −1 is recursively computed, which is based on the partitioned block inverse using the Woodbury matrix identity. Presumably for the numerical issues outlined above, this approach has never been used in combination with MPC. It has, however, in the signal processing literature, where it is strongly connected to the concept of kernel recursive least-squares. 56,57 Due to the recursive nature, both in the data inclusion approach and the Cholesky decomposition, we denote the resulting GP as recursive GP (rGP). The most important steps of the resulting rGP-MPC formulation are presented in Algorithm 1.

GP-BASED OUTPUT FEEDBACK MODEL PREDICTIVE CONTROL
In this section, we present the output feedback model predictive control formulation, based on the rGP NARX model for prediction. We highlight the necessary components and show under which conditions stability can be guaranteed even if the GP model changes online.

Prediction model
In Section 4.3 we establish input-to-state stability for the considered system, which is defined using the evolution of the state and not the output. For this reason, we first reformulate the GP output prediction in terms of the NARX statex k . We start by setting k : = k + 1 inx k and arrive at Since the predicted outputŷ k+1 is computed by (5) we obtain the NARX prediction model which we also denote as the nominal model. Correspondingly, for the NARX model of the real process (1a) we have and due to (8) this can be reformulated as ] T , that is, the real NARX model can be represented as the superposition of the nominal/prediction model and the prediction error.

MPC optimization problem
Using the prediction model (10), we consider at each time step k the optimization problem The input sequence to be optimized is denoted byû k|k = {û k|k , … , û k+N−1|k }, N is the prediction horizon, x k is the initial condition of the measured NARX state (2), and  ⊆ R n x is the resulting constrained set of the NARX state that is a combination of multiple instances of  h depending on the specific composition of x k . || Since  h is compact, the resulting  is also compact. As cost function in (12) we consider where V f (⋅) is the terminal cost function that is weighted by a design parameter ≥ 1. The employed positive stage cost is given by where s (⋅) penalizes input and state deviations from the reference and b (⋅) is a barrier function that can account for soft output constraints  s . It is defined by is a -function and d(⋅) the distance function as defined in Section 1.
The optimal solution of (12) is denoted byû * k|k , the resulting optimal state sequence byx * k|k . The first element ofû * k|k , that is, û * k|k , is applied to the process such that we obtain , also denoted as value function, because they depend on the changing prediction model associated with  k . Note furthermore that (12) does not include any explicit terminal region constraint for stability. This makes its solution less computationally expensive, especially if only soft output/state constraints are considered.

Stability
Establishing stability in MPC is often based on the use of a terminal cost function V f (⋅) and a terminal region  f . 58 Here we employ an approach where the optimal control problem (12) does not require an explicit terminal region  f . Instead, we use V f (⋅) weighted by a factor , as proposed by Limon et al, 59 to establish ISS.

Definition 1 (ISS). Consider the closed-loop system
holds for all initial states x 0 , errors e k , and for all k.
ISS combines nominal stability as well as uniformly bounded influence of uncertainty in a single condition. It implies asymptotic stability of the undisturbed (nominal) system (with e k ≡ 0) and a bounded effect of the uncertainty on the state evolution. Furthermore, if the error signal e k fades, the uncertain system asymptotically converges to the reference point. We therefore consider stability first for the nominal case, i.e., when the prediction/nominal model (10) and the true system (11) are exactly the same. After that, we establish robust stability in the sense of ISS.

Nominal stability
In the following, let the current deviation from the reference point and the deviation at the next time step bex = x − x ref andx Assumption 1 ensures that the system is locally asymptotically stable on the positive invariant set  f , while satisfying state and input constraints. It can be satisfied if we have at least a locally valid description of the process at the target point. This can, for instance, be a linearized version of the GP prediction model at the reference (e.g.,  =  ref ), which in turn can then be used to derive a suitable terminal cost and controller. Possible options are then, for instance, the use of a linear-quadratic regulator and/or applying Lyapunov methods (see also Section 5.5). Although it is sufficient to determine the terminal components from the nominal model, one could also consider the design of a robust terminal controller and cost. For instance, using a GP model for the target region one could consider a specific probability bound given by the posterior variance and then based on this design a robust terminal controller.
Theorem 1 (Nominal stability). Let MPC (x k | k ) be the predictive controller derived from the optimal control problem (12) and let Assumption 1 be satisfied. Furthermore, let  k be the training dataset at time k,  k+1 the one that will be used at time k + 1, and  ′ k+1 the updated training dataset candidate resulting from Rule 1. If  k is updated using the additional rule Proof. Letx * k|k = {x * k|k ,x * k+1|k , … ,x * k+N|k } be the predicted state sequence that results from applying the optimal input sequenceû * k|k . Then we can write the optimal cost for initial condition x k =x * k|k also as V * the respective sequences that start at k + 1 computed at time k, where the last state is given by the terminal control law, that is, . By Assumption 1 we have that the stage and terminal cost are positive definite. Hence, the cost function V N (x k ,û k|k ) is positive definite. Furthermore we also obtain by Assumption 1, which is a well-known result in standard MPC (for the derivation see, for instance, Rawlings and Mayne 1 or Rakovic et al. 60 ). Given the update rule in Theorem 1 we have Combining the previous two equations we obtain Thus, the value function is decreasing even if the prediction model changes. Hence the value function is a Lyapunov function.
Regarding the feasible region, we first review a result of Limon et al. 59 and show afterwards an extension to the present case. In particular, theorem 3 of Limon et al. 59 shows for the nominal and time-invariant case of (12) (i.e., constant prediction model and no model-plant mismatch) with value function V * N (x k ) that ∀ ≥ 1 there exists a feasible region  N ( ) such that ∀x 0 ∈  N ( ) the nominal closed-loop system x k+1 =F(x k , MPC (x k )) is recursively feasible and asymptotically stable. The feasible region is characterized by where is defined in Assumption 1 and d is a positive constant such that (x k , u k ) > d, ∀x k ∉  f and ∀u k ∈  . The size of the set  N ( ) increases with . ** In this work, the value function V * N (x k | k ) changes at certain time instances k whenever the data set  k changes. For this reason we extend the definition of the feasible region to  k which then also changes with k. Due to (14), the optimal cost is decreasing for a particular state sequence and therefore  k N ( ) is increasing along the state sequence. Thus, if the initial state x 0 ∈  0 N ( ), then the subsequent states x k ∈  k N ( ) and the optimal control problem is recursively feasible. Hence, the target x ref is asymptotically stable for the nominal closed-loop system x k+1 =F(x k , MPC (x k | k )). ▪ At x k (with the current output measurement y k ) the optimal control problem is solved with data set  k and the resulting input u k = MPC (x k | k ) = û * k|k is applied to the system. If the next data point (w k , y k+1 ) is a candidate for updating the GP, the previous optimal cost is recomputed using the updated GP. If the cost does not increase, the GP update becomes effective. Thus, the update rule in Theorem 1 is executed additionally after the data selection process of Rule 1. This is also reflected in Algorithm 1.
Remark 7 (Conflicting objectives). Theorem 1 establishes nominal stability despite a changing training dataset  k . In order to determine the new dataset candidate  ′ k+1 we use Rule 1, whose objective is to refine the current prediction model. Note that also other rules, which utilize different selection criteria for model refinement (e.g., statistical methods, see Section 3.2), can be employed. Now, one could assume that the additional update rule in Theorem 1 is not necessary because with every new data point the prediction model should become more accurate. This is, however, not necessarily the case if, for instance, the output is corrupted by noise or if outliers are present. In both cases, the apparent process behavior differs from the true behavior and it cannot be guaranteed that the prediction model becomes more accurate with every added data point, nor that the value function continues decreasing monotonically. Thus, the objective of the update rule in Theorem 1 is to make sure that safety, in the sense of stability and constraint satisfaction, is guaranteed. This is also illustrated in the simulations section in Figure 10. However, in the same simulations we also see that data points, selected by Rule 1 and which carry valuable information, are discarded by the update rule of Theorem 1 because the decreasing value function condition, and with that stability, could not be guaranteed. In other words, the two objectives of model refinement (expressed by Rule 1) and safety (in the sense of stability, expressed by the update rule of Theorem 1) are conflicting objectives, especially in the case of corrupted measurements. In this work we prioritize safety, thereby sacrificing a bit of the potential of model refinement.
On the basis of the nominal stability result for the online rGP-MPC scheme, we now establish robust stability.

Robust stability
Based on Theorem 1 we show that the real process controlled by the proposed predictive controller is ISS w.r.t. the prediction error e p .
Theorem 2 (ISS). Let MPC (x k | k ) be the predictive controller derived from optimal control problem (12) satisfying Assumption 1 and Theorem 1. If • the nominal modelF(x k , u k | k ) is uniformly continuous in x k for all x k ∈  0 N ( ), all u k ∈  , and all  k during the prediction horizon † † , and ** Note that theorem 3 of Limon et al. 59 is stated the other way round, that is, for each region  N ( ) and for all x k ∈  N ( ), there exists a ≥ 1 such that the nominal closed-loop system is asymptotically stable at x ref .
† † Note that this condition does not prohibit the change of the nominal model from the current time instant k to the next k + 1.
• the stage cost function (x k , u k ) and the terminal cost function V f (x k ) are uniformly continuous in x k for all x k ∈  0 N ( ) and all u k ∈  , then the target x ref of the closed-loop system x k+1 = F(x k , MPC (x k | k ), e p ) is ISS w.r.t. the prediction error e p in a robust feasible set Ω 0 r ( ) ⊂  0 N ( ) for a sufficiently small with |e p | < < ∞. The smaller , the larger the set Ω 0 r ( ).
Proof. We first establish the set Ω 0 r ( ) and prove recursive feasibility. Afterwards we prove the ISS property. Regarding the nature of Ω 0 r ( ) we first review a result of Limon et al. 38 and then extend it to our case. Proposition 1 (C2) in 38 shows for the time-invariant case of (12) (i.e., for a nonchanging prediction model) that the closed-loop x k+1 = F(x k , MPC (x k | k ), e p ) is robustly feasible for all x k in a robust feasible set Ω r ( ). In particular, it is proven that if |e p | < with a sufficiently small , there exists a r such that Ω r ( ) ∶= {x k ∈ R n x ∶ V * N (x k ) ≤ r} ⊂  N ( ) is a compact and positive invariant set (where  N ( ) is the feasible set of the OCP with e p ≡ 0) and for all x k ∈ Ω r ( ) the resulting predicted state sequence remains in Ω r ( ). Therefore the state constraints  do not become active. Hence, for all x 0 ∈ Ω r ( ) the MPC scheme is recursively feasible and the constraints are robustly satisfied. Furthermore, larger values of lead to a larger region Ω r ( ).
In the definition of Ω r ( ) in Reference 38 the value function V * N (x k ) is time-invariant, whereas in this work V * N (x k | k ) depends on the changing dataset  k . For this reason we extend the definition of the robust feasible region to Ω k Thereby, N ⋅ d + ⋅ establishes an upper bound for r. Like the feasible set  k N ( ) (see the proof to Theorem 1), also Ω k r ( ) increases with and in particular with k along a particular state sequence x = {x 0 , x 1 , …}. Therefore, if the initial state x 0 ∈ Ω 0 r ( ), then the subsequent states x k ∈ Ω k r ( ) and the state constraints do not become active. The existence of Ω 0 r ( ) is established by proposition 1 (C2) in Reference 38 (as outlined above) and therefore, if x 0 ∈ Ω 0 r ( ) then (12) is recursively feasible and the constraints are robustly satisfied. Now we show that the closed-loop system the prediction error e p . To this end we start by showing that the cost function V N (x k ,û k|k ) is uniformly continuous in x k . Since the nominal modelF(x k , u k | k ) is uniformly continuous in x k during the prediction horizon, there exists a -function N ( ), all u k ∈  , and for a given data set  k . In accordance with lemma 2 in Reference 38, the predicted state evolution then satisfies ||x k+i|k −ẑ k+i|k || ≤ i x (||x k − z k ||) for i ∈  0∶N−1 . Furthermore, since the stage and terminal cost are uniformly continuous in x k , there exists a couple of -functions (⋅), N ( ) and all u ∈  . Combining these properties we obtain where • denotes the concatenation of functions (e.g. 1 • 2 (x) = 1 ( 2 (x))) and V (⋅) is a -function. Therefore the cost function is uniformly continuous in x k for all x k ∈  0 N ( ) and allû k|k . As shown, for every x k ∈ Ω 0 r ( ) the state constraints do not become active. Thus, the optimal solutionû * k|k of (12) is feasible for every x 0 ∈ Ω 0 r ( ) and we obtain Therefore, the value function V * N (x k | k ) is also uniformly continuous in x k for all x k ∈ Ω 0 r ( ) and a given dataset  k .
At last we show that the value function is a ISS-Lyapunov function. Since V * N (x k | k ) is a Lyapunov function for the nominal system (Theorem 1) there exists  ∞ -functions 1 (⋅), 2 (⋅), 3 (⋅), such that 1 . Moreover, from (11) we have that F(x k , u k , e p ) is affine in e p and is therefore uniformly continuous in e p . Then, there exists a -function e (⋅) such that ||F( for all x k ∈  0 N ( ), all u k ∈  , and all |e p | ≤ . From these facts, it can be inferred that is a ISS-Lyapunov function and the closed- Remark 8 (Differences in soft and hard output constraints). In the case of soft constraints  s , the proposed controller ensures robust stability and constraint satisfaction for all initial states that lie in the feasible region  0 N ( ) of the optimal control problem. In the case of hard constraints  h , the proposed controller ensures robust stability and constraint satisfaction for all initial states in a robust feasible set Ω 0 r ( ) ⊂  0 N ( ) where the constraints are not active. Thus, from a practical point of view, if in the soft constraints case the initial state x k leads to a feasible solution, we then have x k ∈  0 N ( ) and the above guarantees hold. If in the hard constraints case the initial state x k leads to a feasible solution, then we also have x k ∈  0 N ( ). However, in that case, one cannot be sure if also x k ∈ Ω 0 r ( ) is satisfied. If x k ∉ Ω 0 r ( ), then feasibility might be lost at one point. Thus, for safety critical applications the set Ω 0 r ( ) would required to be known in order to check x k ∈ Ω 0 r ( ), which is challenging because Ω 0 r ( ) (as well as  0 N ( )) can in general not be computed but has to be estimated via simulations. 1 However, this issue could be circumvented if the hard constraints were tightened, 38 thereby enlarging Ω 0 r ( ).
Remark 9. Notice that the ISS property is based on the uniform continuity of the optimal cost function and this does not depend on the size of the error signal. Hence, even if e p is larger than for a short period of time in which we assume that the feasibility of the optimal control problem is not lost, that is, x k remains in  k N ( ) and ends in Ω k r ( ), then the closed-loop ISS property and constraint satisfaction will still hold.
Remark 10 (Generalization). Theorems 1 and 2 are independent of the control input dimension and also hold for general errors e independent of the concrete structure of the state x k , that is, whether x k is a vector comprised of NARX states or of physical states. Thus, the theorems also include the multi-input multi-output case. In addition, as long as the presented assumptions are satisfied, in particular the update rule in Theorem 1, the stability results also hold for the case of online hyperparameter optimization and even further, for general prediction mod-elsF(x k , u k | k ) that are updated online, that is, the stability guarantees are not confined to the use of GP prediction models.
A necessary condition in Theorem 2 is that the nominal modelF(x k , u k | k ) is uniformly continuous in x k for all x k ∈  0 N ( ), all u k ∈  , and all  k during the prediction horizon. In the case of GPs, this can be guaranteed by the following proposition.

Proposition 1 (GP Uniform Continuity 18 ). The nominal model (10) is uniformly continuous in
Since the prior mean m(w k ) is added to m + (w k |), the prior mean has to be uniformly continuous in x k ‡ ‡ . Then, one way to ensure that m + (w k |) is uniformly continuous in x k , is to employ continuously differentiable kernels (e.g., the squared exponential covariance function, the Matérn class covariance function with appropriate hyperparameters, or the rational quadratic covariance function). In that case the process is mean square differentiable, 13,61 that is, the posterior mean function is differentiable and therefore also uniformly continuous § § .
Remark 11. Although not required for Theorem 2, note that uniform continuity of the process F(x k , u k , e p ) = F(x k , u k | k ) + de p in x k is ensured ifF(x k , u k | k ) is uniformly continuous in x k , which can be established via Proposition 1. ‡ ‡ The prior mean is usually specified by the user and often set to zero. Thus uniform continuity of m(w) is not an issue. § § Continuous differentiability is a stronger assumption than uniform continuity.

Resulting prediction errors
We finish this section with a discussion on the prediction error e p = y k −ŷ k . According to Theorem 2, the smaller the error bound |e p | ≤ , the larger the feasible set Ω 0 r ( ). Since the noise (affecting y k ) is in practice bounded by a finite , the error bound is finite ifŷ k is finite (given of course that the original process y k is finite), which translates to the necessity that the GP posterior mean (4a) is bounded.
From a theoretical point of view, such a bound exists under certain conditions. Note that the posterior mean (with zero prior mean m(w) = 0) can also be expressed via m + (w * |) = ∑ n i=1 i k(w i , w * ), with w i ∈ w, as a linear combination of n kernel functions 13 that determines a reproducing kernel Hilbert space (RKHS). As shown in Steinwart and Christmann, 19 a bound in the RKHS exists if universal kernels are employed. One such kernel is, for instance, the squared exponential covariance function ¶ ¶ (6) for which the existence of a bound had already been shown by Park and Sandberg. 62 De Nicolao and Pillonetto 63 have presented a very similar result when modeling the impulse response via a spline kernel. The result has also been used in Pillonetto and Chiuso. 64 Furthermore, Engel 65 and Srinivas et al 66 provide ways to explicitly compute the bound, though only with high probability.
In practice, however, m + (w * |) will generally be bounded assuming that the employed GP prior is well chosen and sufficiently informative training data  is used. Thus, the actual bound depends on the designer's choices regarding the particular employed GP model and the involved tuning parameters. Among these, in particular the thresholds for the prediction error and posterior variance for the presented rGP approach.

SIMULATIONS
In this section, we provide simulation results for the presented rGP-MPC scheme and consider a continuous stirred-tank reactor (CSTR) as simulation case study. We present the model equations, the training dataset generation, and the terminal components for the MPC based on the linearized GP posterior mean function. The closed-loop simulations involve investigations regarding the tuning parameters of the rGP-MPC, the influence of different initial training data sets, as well as comparisons with other MPC controllers.

Continuous stirred-tank reactor
As exemplary case study, we consider the CSTR, where a substrate A is converted into product B. 67 The following set of differential equations describes the reactor dynamics: C A (t).
The coolant temperature reference T r (K) is the input and the concentration C A (mol/l) the output, that is, u = T r and y = C A . The tank and coolant temperatures are T and T c , respectively. The model parameters are given in Table B1.

Training datasets
A raw data set  raw (depicted in Figure 2) is generated using the plant (15). The data points (z i , w i ) consist of values of (y k + 1 , y k , … , y k − my , u k , … , u k−m u ), where z = y k + 1 is going be the GP output and w = (y k , … , y k−m y , u k , … , u k−m u ) ¶ ¶ The squared exponential covariance function is sometimes also denoted as Gaussian radial basis function. Especially in the field of neural networks or support vector machines. The sets  0 and  ref are generated by selecting first all points z = y k + 1 (and their corresponding w) that are located within a local neighborhood of the respective set-points and second, by reducing the number of points via exclusion of those that add only little information. For a given data point (z i , w i ), all following (z j , w j ), j > i, are removed, for which ||w i − w j || < w with a chosen threshold w. As a result, the sets are less dense but still contain enough informative data points. The thresholds for  0 and  ref are chosen such that both sets contain approximately 40 data points.
Remark 12. All input and output values are given in the original units of the system (15). However, it is beneficial for the modeling process with the GP to normalize the input-output data to the interval [0, 1].

GP prediction model
For the GP prior we employ a constant mean function with constant c. Since the underlying process equations are smooth and to obtain the universal approximation property (see Section 4.3.2) we employ the covariance function (6) with regressor w = [y k , y k−1 , y k−2 , u k ]. According to (2), the NARX state is then x k = [y k , y k−1 , y k−2 ] T . The hyperparameters are = {c, l 1 , l 2 , l 3 , l 4 , 2 f } and are computed offline via maximization of (7) for each of the three datasets  0 ,  ref , and  comb . We obtain three sets of hyperparameters, respectively (Table B2) and with that three different GP prediction models that use the same prior but different training data sets and hyperparameters. The cross-validation results of these different GP models are shown in Figure 3, where we select test points throughout the regions of the respective training datasets. Test points are chosen such that they are not part of  0 ,  ref , or  comb . As can be seen, appropriate GP predictions are achieved with prediction error e p <ē = 0.02 mol/l and posterior standard deviation + < 2 = 5 ⋅ 10 −3 mol/l for all three GPs.

Optimal control problem
The continuous-time model (15) is discretized with Euler's method and a sampling time of T s = 0.5 min. The input constraints are  = {335K ≤ T r ≤ 372K}, the output constraints  h = {0.35} mol/L ≤ C A ≤ 0.65} mol/l. We add measurement noise ∼  (0, 2 n ) to the output data with 2 n = 0.003 2 , which we furthermore bound *** by ±4 n . The employed *** According to the considered system class we add Gaussian noise with bounded support. Since GPs are based on Gaussian noise with unbounded support, there is a small difference, which could be accounted for by GP warping. 43 However, due to the large bound of four SDs, the difference is so small that the following simulation results are equal to the ones with unbounded noise.
F I G U R E 3 Cross-validation results: Top, the prediction error e p (8) is depicted. Bottom, the posterior SD quadratic stage cost is given by with Q = diag(100, 0, 0), and R = 5. The prediction horizon is set to N = 5. The resulting optimal control problem is solved in MATLAB using fmincon.

Terminal controller and cost function
The As the next output is computed using the GP, that is, y k+1 = m + (w k ), the parameters a 11 , a 12 , a 13 , b 1 can be determined using the posterior mean gradient derived in the appendix, Section B1. In particular we have [a 11 , a 12 , a 13 . The resulting linear model becomes We define the feedback vector as k = Ps with s ∈ R 3 , P = P T > 0, and G = P −1 . We furthermore define the state constraint set  =  h ×  h ×  h and reformulate  and  as polyhedral sets of the form  = {x ∈ R n x ∶ q T i x ≤ r i , i = 1, … , n  } and  = {u ∈ R n u ∶ v T l u ≤ t l , l = 1, … , n  }, where n  and n  are the respective numbers of inequalities. Then, we compute s and P offline by solving the semidefinite optimization problem 7 max G,s log(det(G)) and obtain The optimization problem (17) 7 results from using the Schur complement in combination with the discrete-time Lyapunov equation and the support function concept of closed convex sets. The resulting s and P are such that the closed-loop linearized system is asymptotically stable in parameterized with , could be used to characterize this neighborhood. Then we would need to take the nonlinear remainder term into account to calculate a particular value for , which would require the solution of a global optimization problem. Such a problem could be solved by using scenarios or a Monte Carlo approach. However, since the optimal control problem does not need the terminal region constraint, is not required.

Simulation results
First, we simulate the set-point change from (u 0 , y 0 ) to (u ref , y ref ) and compare the closed-loop results of the rGP-MPC, a batch GP approach (bGP-MPC) that uses a fixed training dataset, and an output feedback MPC scheme (oMPC) that uses the model Equation (15) and acts as a performance bound. We evaluate the performance for the three cases, where  0 ,  ref , and  comb are used as initial training data sets. The bGP and rGP are initialized with the same initial training data and hyperparameters but the rGP updates its training dataset during operation. We setē = 2 = 0 such that every data point is considered as a candidate for inclusion † † † with no upper limit on the number of data points. Hence, no points are removed. Due to the stochastic nature of the noise component, we simulate each case N sim = 50 times. The results are depicted in Figures 4-6. To quantify the performance we employ the measure which averages the stage costs of the resulting state and input sequences over all time steps k ∈ {0, 1, … , N step }, as well as the individual simulations j ∈ {1, 2, … , N sim }. The resulting V values are presented in Table B3. As expected, the oMPC scheme that uses the true model performs best and always the same (see Table B1) because it does not depend on any training data points. The rGP outperforms the bGP in the  0 and  ref cases due to the additional information gained during operation. The performance difference is especially large for  ref , where the bGP, throughout the whole operation, has only data points at the reference at its disposal but not at the initial condition. The rGP performs significantly better due to the added data points at the beginning of operation. In the  comb case, the rGP and bGP performance is almost the same for the employed training data points.
Remark 14. The previous simulation results suggest that one should in general prefer the  ref case over the other cases, which is convenient for the used MPC scheme because knowledge at the reference is required anyway to determine the terminal cost and controller. Furthermore, this also suggests a practical rule for offline hyperparameter determination, namely that the hyperparameters should be optimized for a dataset that contains the target reference.
In the second set of simulations, we investigate the influence of different thresholds used in Rule 1, that is, different values for the maximum prediction errorē and the maximum prediction variance 2 . To this end, we start with Figure 7 that combines the rGP results of the previous figures for the three training data cases, together with the now plotted evolution of the prediction error e p and the prediction variance 2 + . In particular the prediction variance illustrates nicely the difference between the three cases. In the case of  0 , the variance is small at the beginning and increases around t = 8 min when the system leaves the neighborhood of the initial condition and moves towards the reference. The same holds, but the other way round, for the case with  ref , where the initial (t < 3 min) large error and variance is caused by their computation before the first data points are added to the training set. The prediction error bound is 0.033, 0.021, and 0.024 for the cases  0 ,  ref , and  comb , respectively. Figures 8 and 9  be achieved by adding only a fraction of them. Hence, this shows not only that online learning can be achieved but also that it allows working with significantly smaller training data sets, which in turn result in lower computational costs. After evaluating the influence of the parameters of Rule 1, we illustrate the influence of the update rule in Theorem 1, which guarantees a decrease of the value function. To this end, we continue with the  ref case and additionally insert outliers into the output measurements in the course of the simulations. The effect of the update rule is shown in Figure 10. With it, the results are almost the same as before, except for the distortions due to the outliers, which however are compensated shortly after. All simulation outcomes are very similar in that case. Without the update rule, the resulting mean output sequence is different but not necessarily worse (smaller rise time, similar settling time, no overshoot) than the mean output sequence with the update rule. Some of the individual simulation outcomes perform even better, which is an indication that data points with valuable information are indeed discarded by the update rule as was also pointed out in Remark 7. On the other hand, the variability among the individual simulations is much larger. Several of the simulated output evolutions converge slower to the target and some do not converge at all until the end of the simulation. This is a direct result of the corresponding input sequences computed by the optimizer. In between 5 min and 11 min, the deviation of the mean input sequence from the optimal input sequence of the performance bound (oMPC, see  is larger than in the case with the update rule. Furthermore, the individual input sequence outcomes vary considerably, even hitting the lower constraints. Due to the inclusion of every encountered data point candidate, the prediction model changes in some cases in an unfavorable way during the respective simulations, which leads to the depicted results. Note that qualitatively the same results (including not converging output sequences) are obtained, even without outliers. For instance, between the reference change at 5 min and the first outlier at 7 min, we observe that the input sequences already deviate considerably from the case with an active update rule, that is, the outlier is not the cause but usual noisy data F I G U R E 11 Influence of a limited number of training data points on the rGP-MPC with initial training data  ref F I G U R E 12 Comparison of computation times of the full recalculation of the Cholesky factor and the recursive update. The computational reduction that goes along with the recursive update increases with the amount of training data points points. This illustrates the importance of the update rule in Theorem 1, not only for theoretical guarantees but also in terms of practical application.
Next, we consider the case that the number of training data points is limited by M. For the case of  ref we set M = 40, which is the number of initially available training points, that is, the training data set cannot increase but old data points are exchanged with newer more informative ones. To this end, whenever a new point is added, the oldest data point is removed. In Figure 11 we compare the bGP (the initial training dataset is not updated at all), the rGP with M = ∞, e = 2 = 0 (every encountered data point is considered to be added), and the rGP with M = 40,ē = 0.01, 2 = 2 ⋅ 10 −5 (data points are only exchanged). The bGP result is the same as in Figure 5 and represents the worst case because the training data set is not updated at all. The M = ∞ case on the other hand represents the performance bound for this specific case because it includes the maximum of the incoming data points and does not remove any. As can be seen, the reaction of the limited case is a bit slower than the performance bound case but the resulting settling times are almost identical. Thus, with a training data set of only 40 points, where the points are exchanged during operation, almost the same performance can be achieved for the considered example as if every encountered point was included in the training dataset .
Besides the computational cost reduction due to the possibility to work with smaller training data sets, we also illustrate the computational reduction due to the recursive update of the Cholesky factor. In Figure 12 we continue with the  ref case, where we add every incoming point to the training data set and compare the computation times of the full and the recursive update of the Cholesky factor. The results show that the larger the training dataset becomes, the larger the absolute and relative computational reduction. At t = 24 min the full recomputation of the Cholesky factor increases significantly. Investigations point to the reason lying in the generation of the covariance matrix and the inner workings of Matlab's chol function to compute the Cholesky decomposition.
At last we present simulations of the robust feasible set Ω 0 , also denoted region of attraction (ROA), and how it changes for different maximum prediction errors . We continue with the  ref case withē = 2 = 0 such that every data point is considered as a candidate for inclusion. Furthermore, M is set to a large value such that no points are removed from  k . Each initial condition x 0 = [y 0 y 0 y 0 ] T is simulated 30 times. Different values are obtained by varying the measurement noise from 2 n = 0.003 2 to 2 n = 0.012 2 , where is then the largest error of all simulation runs and time steps. The result in Figure 13 yields a clear tendency. The larger , the smaller Ω 0 . F I G U R E 13 Change of the region of attraction Ω 0 for different . Red stars denote infeasible initial conditions, green stars feasible initial conditions. An initial condition is marked as infeasible if at least one simulation resulted in a constraint violation.

CONCLUSION
In this work, we outlined the use of a GP-based nonlinear autoregressive model with exogenous input for prediction in an output feedback model predictive control scheme. The approach allows for online learning, by means of updating the training data set, to account for limited a priori process knowledge and the possibility for adaptation during operation. To this end, the concept of evolving GPs was adapted together with a recursive formulation to update the Cholesky decomposition to minimize computational cost. It was shown that the resulting model predictive control scheme is input-to-state stable with respect to the prediction error, despite the time-varying nature of the GP prediction model. Notably, the theoretic guarantees are not limited to Gaussian processes. They are rather valid for all online learning methods that satisfy the presented conditions. The approach was verified in simulations, which have shown that it is in general possible to start with limited a priori process knowledge and refining the model during operation. One important finding is that it is particularly beneficial to start with a model that captures at least the behavior at the target reference, which is fortunately an intrinsic necessity for all MPC schemes that use a terminal region, cost, and controller to guarantee recursive feasibility and stability. In the case of fixed hyperparameters during online operation, a further consequence is that the hyperparameters should be optimized offline for a dataset that captures the target reference. Furthermore, the presented formulation yields good closed-loop performance with few training data points, thereby efficiently reducing the computational load. This presents itself as a possible option for very fast processes, where hyperparameter optimization is not an option but some kind of online learning is desirable. Additionally, due to the output feedback scheme, this approach can be employed for processes, whose state cannot be measured or is difficult to be estimated.
Future work aims at implementing the presented approach in laboratory experiments, together with a combination of a deterministic base model and the GP prediction model. From a theoretical point of view, time-varying reference tracking instead of set-point changes would be interesting to investigate. For instance, what conditions does the initial training data set has to satisfy to achieve acceptable tracking results and how to automatically compute safe thresholds for the data inclusion approach. Another interesting question to investigate is how the approach performs for time-varying processes. A hypothesis would be to combine the squared exponential covariance function with a non-stationary one to account for time variance in the process model.  TA B L E B3 Model predictive control performance computed by (18) where w 1 , … , w n is the corresponding regressor of each of the n measured training data points in .