Probabilistic performance validation of deep learning-based robust NMPC controllers

Solving nonlinear model predictive control problems in real time is still an important challenge despite recent advances in computing hardware, optimization algorithms and tailored implementations. This challenge is even greater when uncertainty is present due to disturbances, unknown parameters or measurement and estimation errors. To enable the application of advanced control schemes to fast systems and on low-cost embedded hardware, we propose to approximate a robust nonlinear model predictive controller using deep learning and to verify its quality using probabilistic validation techniques. We propose a probabilistic validation technique based on finite families, combined with the ideas of the generalized maximum and constraint backoff, to enable statistically valid conclusions related to general performance indicators. The potential of the proposed approach is demonstrated with simulation results of an uncertain nonlinear system.


INTRODUCTION
Model predictive control is a popular advanced control technique that can deal with nonlinear systems and constraints while considering general control goals that go beyond conventional set-point tracking tasks. Two of the main obstacles faced when designing and implementing a nonlinear model predictive controller are the accuracy of the model and the computational complexity of solving a non-convex optimization problem online, which often renders the implementation too slow for fast systems or impossible to deploy on resource-constrained embedded platforms.
Handling uncertainty in the context of model predictive control is the main goal of robust MPC. Traditional min-max approaches [1] do not explicitly consider the fact that new information will be available in the future, which leads to overconservative solutions. Closed-loop robust MPC avoids the problem of conservativeness by optimizing over control policies instead of optimizing over control inputs [2], leading however to intractable formulations in the general case. Most of the recent robust MPC methods focus on achieving a good trade-off between complexity and performance. Tube-based approaches [3] decompose the robust MPC problem into a nominal MPC and an ancillary controller. The ancillary controller makes sure that the real uncertain system stays close to the trajectory planned by the nominal MPC. By tightening the constraints of the nominal MPC, robust constraint satisfaction can be achieved. In the simplest version, the complexity of tube-based MPC is the same as that of standard MPC. However, if increased performance is desired, the complexity grows, as presented in [4] or [5]. Scenario tree-based [6,7,8] or multi-stage MPC [9] represents the evolution of the uncertainty using a tree of discrete uncertainty realizations. Improved performance can often be seen in practice [10] because the feedback structure is not restricted to be affine, as is usually done in tube-based MPC and in other robust approaches [11]. While it is also possible to achieve stability and robust constraint satisfaction guarantees for a multi-stage MPC formulation [7,12,13], its computational complexity grows exponentially with the dimension of the uncertainty space. The presence of uncertainty significantly increases the computational complexity of any NMPC implementation if non-conservative performance is desired.
The last decade has witnessed important progress in hardware, algorithms and tailored implementations that enable the efficient solution and deployment of NMPC controllers, based for example on code generation tools [14,15] that provide efficient implementations of linear and nonlinear MPC on embedded hardware, including low-cost microcontrollers [16] and high-performance FPGAs [17].
A different possibility to achieve embedded nonlinear model predictive control is the use of approximate explicit nonlinear model predictive control [18,19], based on approximating the multi-parametric nonlinear program using ideas similar to those of explicit MPC for linear systems [20]. We propose in this work to use deep neural networks to approximate a robust multi-stage NMPC control law. The idea of using a neural network as a function approximator for an NMPC feedback law was already proposed by [21] back in 1995, but only very recently [22,23] have deep neural networks (neural networks with several hidden layers) been proposed as function approximators. The use of deep neural networks is motivated by recent theoretical results that suggest exponentially better approximation capabilities of deep neural networks in comparison to shallow networks [24].
Assessing the closed-loop performance of approximate controllers, or any other controller subject to further random disturbances or estimation errors, is particularly challenging in the case of complex nonlinear systems. The theory of randomized algorithms [25], [26] provides different schemes capable of addressing this issue. For example, statistical learning techniques can be used to design stochastic model predictive controllers with probabilistic guarantees [27], [28], [29]. Also, under a convexity assumption, convex scenario approaches [30] can be used in the context of chance constrained MPC [31], [32], [33]. The main limitation of the aforementioned approaches based on statistical learning results [25], [34] and scenario based ones [30] is that the number of random scenarios that have to be generated (sample complexity) grows with the dimension of the problem.
Probabilistic validation [35], [36] allows one to determine if a given controller satisfies, with a prespecified probability of violation and confidence, the control constraints. The sample complexity in this case does not depend on the dimension of the problem, but only on the required guaranteed probability of violation and confidence. Examples of probabilistic verification approaches in the context of control of nonlinear uncertain systems can be found, for example, in [26], [37] and [38]. These techniques have also been used for the probabilistic certification of the behaviour of off-line approximations of nonlinear control laws [39], [40].
The main contribution of this paper, which extends the results from [41], is the formulation of general closed-loop performance indicators that are not restricted to binary functions as in [39] and can be computed by simulating the closed-loop system with any given controller. We also provide sample complexity bounds that do not grow with the size of the problem for the case of a finite family of design parameters and general performance indicators. Our approach allows discarding a finite number of worst-case closed-loop simulations, significantly improving the applicability of the probabilistic validation scheme compared to existing works. The potential of the presented approach is illustrated for a highly nonlinear towing kite system, including a real-time capable embedded implementation of an approximate, but probabilistically safe, robust nonlinear model predictive controller on a low-cost microcontroller.
The paper is organized as follows. The closed-loop performance indicators are introduced in Section 2 which are used in a novel probabilistic validation methodology for arbitrary controllers in Section 3. The mathematical framework for the output feedback robust NMPC problem considered in this work is presented in Section 4 and the use of deep learning to obtain approximate robust NMPC controllers is summarized in Section 5. The case study is detailed in Section 6, the results in Section 7 and the paper is concluded in Section 8.

System description and constraints
We are interested in optimally controlling the following class of nonlinear discrete-time systems:

x(k+1) = f(x(k), u(k), w(k)), (1)

with the measured output given by

y(k) = h(x(k), u(k), w(k)), (2)

where x(k) ∈ ℝ^{n_x} is the state vector, u(k) ∈ ℝ^{n_u} is the control input, and w(k) ∈ ℝ^{n_w} is the disturbance vector. In general, not all states can be measured, and a state estimate x̂(k) should be computed based on the past measurements y(k) ∈ ℝ^{n_y}. It is assumed that the disturbances w(k) take values, with high probability, in a known set 𝒲.
The closed-loop trajectory should satisfy general nonlinear input and state constraints defined by

g_j(x(k), u(k)) ≤ 0, j = 1, …, n_g, (3)

where n_g is the number of constraints.

Closed-loop behavior
The goal of a controller κ ∶ ℝ^{n_x} → ℝ^{n_u} is that the closed-loop trajectory of the uncertain nonlinear system, defined by

x(k+1) = f(x(k), κ(x̂(k)), w(k)), (4)

attains a desired performance level, e.g. does not violate the predefined constraints, despite the presence of uncertainty, for any initial state x(0) in the set 𝒳_0 of feasible initial conditions, for any admissible initial estimation error x(0) − x̂(0), and for any sequence of uncertainty realizations w(k) ∈ 𝒲. Determining if a given controller provides admissible closed-loop trajectories, under the presence of nonlinearity and uncertainty, is in general an intractable problem [42]. Instead, we focus on the use of finite-time closed-loop performance indicators that can be obtained by simulating the closed-loop system. The underlying assumption is that models which can be run a large number of times are available, so that statistical guarantees can be obtained. A closed-loop performance indicator is a scalar function ψ(θ; k_sim, κ) of the random variable θ, which collects the initial state, the initial estimation error and the uncertainty realizations of a closed-loop simulation of k_sim steps with the controller κ.
In this paper we address a more general setting in which we do not circumscribe the performance indicator to the class of binary functions. For example, we consider the closed-loop finite-time performance indicator given by the largest value of any component of g along the closed-loop simulation:

ψ(θ; k_sim, κ) = max_{k=0,…,k_sim} max_{j=1,…,n_g} g_j(x(k), κ(x̂(k)), w(k)).
Another possibility is to consider the average constraint violation as a performance indicator. That is,

ψ(θ; k_sim, κ) = (1/(k_sim+1)) Σ_{k=0}^{k_sim} max_{j=1,…,n_g} max(0, g_j(x(k), κ(x̂(k)), w(k))). (8)

Moreover, in many applications it is relevant to consider indicators related to the closed-loop cost, such as an average cost

ψ(θ; k_sim, κ) = (1/(k_sim+1)) Σ_{k=0}^{k_sim} ℓ(x(k), κ(x̂(k))),

or any other combination. In the following section we address how to obtain probabilistic guarantees on the random variable ψ(θ; k_sim, κ).
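As an illustration, the indicators above can be evaluated directly on logged closed-loop data. The following minimal Python sketch (the function and variable names are our own, not from the paper) computes the worst-case constraint value, the average constraint violation and the average cost from one simulated trajectory:

```python
import numpy as np

def indicators(G, cost):
    """Closed-loop performance indicators from a simulated trajectory.

    G    : (k_sim, n_g) array, G[k, j] = g_j(x(k), u(k)) along the closed loop
    cost : (k_sim,) array of stage costs
    Returns the worst-case constraint value, the average constraint
    violation and the average closed-loop cost.
    """
    psi_max = G.max()                                 # largest constraint value
    psi_viol = np.maximum(G, 0.0).max(axis=1).mean()  # average violation
    psi_cost = cost.mean()                            # average cost
    return psi_max, psi_viol, psi_cost
```

Any of the three returned values can serve as the scalar indicator ψ(θ; k_sim, κ) for one sampled scenario θ.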

PROBABILISTIC VALIDATION
The derived closed-loop performance indicators can be used in the framework of probabilistic validation [26,36] to obtain probabilistic guarantees regarding the satisfaction of a given set of control specifications. In this section we present a novel result that allows us to address the probabilistic validation of arbitrary control schemes where the performance is influenced by hyper-parameters, e.g. backoff parameters or the control sampling time.

Probabilistic performance indicator levels
We consider a finite family of controllers κ_i(x̂), i = 1, …, m, corresponding to different combinations of hyper-parameter values, e.g. for constraint backoffs or control sampling times. The objective of this section is to provide a probabilistic validation scheme that allows us to choose, from the m possible controllers, the one with the best probabilistic certification for any given closed-loop finite-time performance indicator ψ(θ; k_sim, κ_i). For simplicity of notation, we denote the closed-loop finite-time performance indicator obtained with the controller κ_i over k_sim simulation steps as ψ_i(θ). Remark 1. The stochastic variable θ that defines the closed-loop trajectories follows a probability distribution 𝒫 from which it is possible to obtain independent identically distributed (i.i.d.) samples.
Remark 1 only requires knowledge of the probability distributions of the uncertainty. Definition 2 (Probabilistic performance indicator level). We say that γ ∈ ℝ is a probabilistic performance indicator level with violation probability ε ∈ (0, 1) for a sample θ drawn from 𝒫 for the measurable function ψ ∶ Θ → ℝ if the probability of violation satisfies

Pr_𝒫{ψ(θ) > γ} ≤ ε.

To obtain probabilistic performance indicator levels for the considered controllers κ_i, i = 1, …, m, we generate N i.i.d. samples θ^(1), …, θ^(N) and collect, for each controller, the values ψ_i = [ψ_i(θ^(1)), …, ψ_i(θ^(N))].
It is clear that the largest value of the components of ψ_i could serve as an empirical performance level for the controller κ_i provided that N is large enough [35]. Another possibility is to discard the r − 1 largest components of ψ_i and consider the largest of the remaining components as a (less conservative) empirical performance indicator level (r equal to one corresponds to not discarding any component) [44]. In the following section we show how to choose N such that the obtained empirical performance indicator levels are, with high confidence 1 − δ, probabilistic performance indicator levels with probability of violation ε.

Sample complexity
We first present a generalization of the notion of the maximum of a collection of scalars. This generalization is borrowed from the field of order statistics [45], [46], and will allow us to reduce the conservativeness that follows from the use of the standard notion of the max function. See also Section 3 of [44].
Given a vector γ with N scalar components, φ(γ, 1) denotes its largest component, i.e. the standard maximum. Furthermore, φ(γ, 2) denotes the second largest value in γ, φ(γ, 3) the third largest one, etc. We notice that the notation φ(γ, r) does not need to make explicit N, the number of components of γ.
The next theorem provides a way to compute probabilistic performance levels for a family of controllers. The theorem constitutes a generalization of a similar result, presented in [44] for the particular case m = 1. See also the seminal paper [35] for the particularization of the result to the case m = 1, r = 1.

Theorem 1. Let the empirical performance indicator levels γ_i = φ(ψ_i, r), i = 1, …, m, be computed from N ≥ r i.i.d. samples θ^(1), …, θ^(N) drawn from 𝒫, and suppose that

m B(N, r − 1, ε) ≤ δ, with B(N, r − 1, ε) = Σ_{ℓ=0}^{r−1} (N choose ℓ) ε^ℓ (1 − ε)^{N−ℓ}. (10)

Then, with probability no smaller than 1 − δ, we have for every i = 1, …, m the probability of violation

Pr_𝒫{ψ_i(θ) > γ_i} ≤ ε.

In addition, (10) is satisfied if

N ≥ (1/ε) (r − 1 + ln(m/δ) + √(2(r − 1) ln(m/δ))).

Proof. Given the controller κ_i and γ ∈ ℝ, we denote by E_i(γ) the probability of the event ψ_i(θ) > γ. That is, E_i(γ) = Pr_𝒫{ψ_i(θ) > γ}. We call probability of failure the probability of generating N i.i.d. scenarios and obtaining an empirical performance indicator level that does not meet the probabilistic specification on the probability of violation. We now make use of Property 3 in [44], which states that, with probability no smaller than 1 − B(N, r − 1, ε), the empirical level φ(ψ_i, r) satisfies E_i(φ(ψ_i, r)) ≤ ε. This means that the probability of failure for the samples θ^(1), …, θ^(N) drawn from 𝒫 satisfies

Pr{E_i(φ(ψ_i, r)) > ε} ≤ B(N, r − 1, ε).

Consider now the probability that, after drawing the N i.i.d. samples θ^(ℓ), ℓ = 1, …, N, one or more of the empirical performance indicator levels γ_i = φ(ψ_i, r), i = 1, …, m, are not probabilistic performance indicator levels with violation probability ε. By the union bound, this probability is no larger than m B(N, r − 1, ε), which is smaller than or equal to δ provided that (10) holds. This proves the first claim. The second claim follows directly from Corollary 1 of [36], which provides an explicit number of samples that guarantees that a binomial expression B(N, r − 1, ε) is smaller than a given constant.

The major advantage of Theorem 1 is that a family of m controllers can be evaluated on the same N samples. This is beneficial when the family of controllers can be evaluated in parallel or when drawing samples is expensive, e.g. in experimental setups. The number of required samples for the same probabilistic statement is significantly smaller than when all the controllers are evaluated sequentially, as in [47,48,49].
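Condition (10) can be checked numerically. The short Python sketch below (our own illustration, with hypothetical function names) evaluates the binomial tail B(N, r − 1, ε) and searches for the smallest sample size N that satisfies m·B(N, r − 1, ε) ≤ δ for a family of m controllers:

```python
from math import comb

def binom_tail(N, r_minus_1, eps):
    """B(N, r-1, eps) = sum_{l=0}^{r-1} C(N, l) eps^l (1-eps)^(N-l)."""
    return sum(comb(N, l) * eps**l * (1 - eps)**(N - l)
               for l in range(r_minus_1 + 1))

def sample_size(eps, delta, m=1, r=1):
    """Smallest N with m * B(N, r-1, eps) <= delta (condition (10))."""
    N = r  # at least r samples are needed to discard r - 1 of them
    while m * binom_tail(N, r - 1, eps) > delta:
        N += 1
    return N
```

For example, with ε = 0.01, δ = 10⁻⁶, m = 1 and r = 1 this search yields N = 1375, and increasing the number of candidate controllers m only enters through the product m·B(N, r − 1, ε), so the required N grows slowly with m.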
Remark 2. Given a family of controllers κ_i, i = 1, …, m, one does not need to compute all the empirical performance indicator levels φ(ψ_i, r), i = 1, …, m. It is sufficient to find one that meets the desired performance indicator levels. For example, if the performance indicator ψ_i(θ) is defined as the average constraint violation along the trajectory (see (8)), then the controller κ_i provides an admissible closed-loop trajectory for θ if and only if ψ_i(θ) = 0. In this case, the empirical performance indicator level φ(ψ_i, r) corresponding to the N i.i.d. scenarios is equal to 0 if no more than r − 1 trajectories are non-admissible when applying the controller κ_i to the N scenarios. If N is chosen according to (10), then Theorem 1 implies that, with probability no smaller than 1 − δ, all the controllers κ_i, i = 1, …, m, providing φ(ψ_i, r) = 0 are such that Pr_𝒫{ψ_i(θ) > 0} ≤ ε. It is also important to remark that the cardinality m of the family of proposed controllers has little effect on the sample complexity because it appears only inside a logarithm. See also Subsection 4.2 in [36] for other randomized approaches based on a design space of finite cardinality.
Remark 3. Theorem 1 can also be applied in the case where the performance indicators take only binary values. This has been presented in a similar form in [50] and has been used for control design problems. See, for example, [37], [38].

ROBUST OUTPUT-FEEDBACK NONLINEAR MODEL PREDICTIVE CONTROL
In this work our goal is to design an NMPC scheme that is able to control the uncertain nonlinear system (1) in an output-feedback setting where not all the states can be measured, as described by the output equation (2). While the novel probabilistic validation scheme described in Section 3 can be applied to any controller, we believe that, because of the complexity of the general robust output-feedback problem, it is a promising idea to use an approximate robust NMPC scheme that is validated a posteriori using probabilistic validation.
There exist many different robust model predictive control schemes, but there are four important characteristics that differentiate one approach from the other: the choice of cost function, the propagation of the uncertainty, robust constraint satisfaction and the characterization of feedback information.
The cost function can be chosen following a min-max approach, where the worst-case realization of the uncertainty w(k) at each step in the prediction is chosen [1]. Tube-based methods usually choose the cost incurred by the closed-loop system driven by the nominal realization of the uncertainty [51]. Scenario-tree-based methods use a weighted sum over a set of discrete scenarios [9], and stochastic MPC schemes [7] make use of, e.g., the expectation operator. In this work, we consider a general cost function J(𝒳(0), 𝒲; N_p, π) that depends on an initial state set 𝒳(0), the uncertainty set 𝒲, the prediction horizon N_p and the control policy π.
The propagation of the uncertainty is one of the key elements of any robust NMPC scheme. A general framework, which is used in this work, is the definition of reachable sets at each sampling time in the prediction based on a current initial condition, the system model, the applied input and the uncertainty set 𝒲. The reachable set at sampling time k + 1 can thus be denoted as:

𝒳(k+1) = { f(x, π(x), w) : x ∈ 𝒳(k), w ∈ 𝒲 }. (13)

There are several methods to compute such reachable sets. In the linear case, considering the vertices of the uncertainty set and propagating them along the prediction horizon is enough to compute an exact reachable set. In the nonlinear case, linearization techniques [52] or ODE bounding techniques [53] can be used to obtain guaranteed over-approximations, which can then be used in robust optimal control schemes. To keep the notation independent of the method used to obtain an (over-)approximation of the reachable sets at each sampling time, a bounding operator denoted as f⋄(⋅) is used, which is defined such that

f⋄(𝒳(k), π, 𝒲) ⊇ 𝒳(k+1).

Another possibility for the propagation of uncertainty is to resort to probabilistic reachable sets, as done in [54].
Robust constraint satisfaction is often one of the main motivations for the use of a robust NMPC approach. It means that the requirements on the closed-loop system in the form of input and state constraints should be satisfied for all possible outcomes of the uncertainty, and it is usually enforced by embedding the reachable sets (13) into the constraints of an optimization problem.
The characterization of the feedback that is employed is another key property of any robust MPC scheme. It is well known that considering a single sequence of optimal control inputs in the prediction under uncertainty can result in very conservative closed-loop performance, because it ignores that new information about the future will become available in the form of measurements, so that future actions can be adapted accordingly. To avoid this conservatism, closed-loop approaches can be used, in which one optimizes over a sequence of control policies; the resulting controller can be formulated as the receding-horizon solution of the optimization problem (14), where the constraints (14c) denote that g(x, π(x), w) ≤ 0 should be satisfied for all x ∈ 𝒳(k) and for all w ∈ 𝒲. Solving the ideal robust NMPC problem ℙ_ideal defined in (14), one obtains a receding-horizon policy π_ideal(x̂(0)), which is a function of the initial state estimate x̂(0) that has been obtained with a certain estimation error bounded by 𝒲_est.
Obtaining an exact solution of ℙ_ideal is usually intractable, mainly because of the bounding operator f⋄(⋅) and the general feedback law π(⋅). There are different alternatives to obtain approximations of this problem. A common simplifying assumption is to restrict the search to policies that are affine in the state or in the disturbances [11]. A different alternative is the use of a scenario tree [6], [7], [9] in a so-called multi-stage NMPC approach. A multi-stage NMPC scheme is based on the representation of the uncertainty via a scenario tree (see Figure 1), which branches at each sampling time. This means that the uncertainty set 𝒲 is approximated by a discrete number of uncertainty realizations:

𝒲 ≈ {w^1, …, w^d}, (15)

where d is the number of possible realizations of the uncertainty that are considered in the tree. The considered realizations mean that each node branches d times, which results in d^k nodes at stage k. Using a scenario tree formulation, an approximation of the reachable set can be obtained as the convex hull of the set of all the nodes at a given stage, i.e.:

𝒳(k) ≈ Conv({x^j(k), j = 1, …, d^k}), (16)

where Conv(⋅) denotes the convex hull of a set and x^j(k) denotes the j-th node of the tree at stage k, as depicted in Figure 1. In the linear case with polytopic uncertainty, including the extreme values of the uncertainty in {w^1, …, w^d} guarantees an exact representation of the actual reachable set. In the nonlinear case considered in this paper it is only an approximation, and therefore we focus on the point-wise approximation. Following the same notation, the bounding operator used to propagate the point-wise uncertainty description can be denoted as:

f̃⋄(𝒳(k), π, 𝒲) = { f(x^j(k), π(x^j(k)), w^l) : j = 1, …, d^k, l = 1, …, d }. (17)

The cost function is usually chosen as a weighted sum of the stage costs of all nodes in the scenario tree (18). Introducing (18) and (17) in the ideal formulation of robust NMPC ℙ_ideal, we obtain the optimization problem (19) that should be solved at each sampling time, where the constraints (19c) denote that g(x, π(x), w) ≤ 0 should be satisfied for all nodes x at stage k and for all w ∈ {w^1, …, w^d}.
The optimal solution of (19) defines the multi-stage NMPC feedback policy π_ms.
To avoid the exponential growth of the tree with the prediction horizon, a usual additional simplifying assumption is to consider that the tree branches only up to a given stage (usually called the robust horizon). While this simplification introduces further errors in the approximation of the reachable sets at each stage, it achieves good results in practice [10]. The current estimation error, as well as the presence of future estimation errors, should also be included in the problem formulation to achieve stability and recursive feasibility guarantees. This can be done in a multi-stage framework as shown in [55], but additional uncertainties would have to be included in the scenario tree. To mitigate the exponential growth of the scenario tree with the number of considered uncertainties, we do not consider the estimation error directly in the formulation of the tree. Following ideas from tube-based MPC, these additional errors are taken into account by means of constraint tightening, as explained in Section 5.
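The point-wise scenario-tree propagation with a robust horizon can be sketched in a few lines of Python (a toy illustration with our own names, not the formulation solved in the paper): each node branches into all d discrete realizations up to the robust horizon, after which the realization is frozen along each scenario:

```python
def build_tree(f, x0, u_seq, w_extremes, robust_horizon):
    """Point-wise reachable-set approximation via a scenario tree.

    f          : discrete-time model, x+ = f(x, u, w)
    u_seq      : one input per stage (a single input sequence, for illustration)
    w_extremes : discrete uncertainty realizations {w^1, ..., w^d}
    Branches at every stage up to the robust horizon (assumed >= 1), then
    keeps the realization constant along each scenario.
    Returns the list of node states at every stage.
    """
    stages = [[(x0, None)]]  # pairs (state, realization frozen after branching)
    for k, u in enumerate(u_seq):
        nodes = []
        for x, w_fixed in stages[-1]:
            branches = w_extremes if k < robust_horizon else [w_fixed]
            for w in branches:
                nodes.append((f(x, u, w), w))
        stages.append(nodes)
    return [[x for x, _ in stage] for stage in stages]
```

With d realizations and robust horizon N_r, the tree has d^min(k, N_r) nodes at stage k, which is why the robust horizon is kept short in practice.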

DEEP LEARNING-BASED APPROXIMATE ROBUST NMPC
Despite recent advances in algorithms and hardware, solving the simplified output-feedback robust NMPC problem defined in (19) in real time can be challenging. To avoid the need for the real-time solution of non-convex optimization problems, this work considers the data-based approximation of the implicit feedback law defined by (19), following the same ideas as explicit model predictive control. Approximating an NMPC controller with a neural network was already proposed by [21] back in 1995, using shallow networks (with only one hidden layer). This suggestion is based on the universal approximation theorem, which shows that a neural network with only one hidden layer can approximate any function to any desired accuracy under mild conditions [56].

Deep neural networks
The function approximators chosen for this work are deep neural networks (DNNs). This choice is motivated by recent theoretical results that support the increased representation power of neural networks with several hidden layers as opposed to classical shallow networks [24]. Good results for the approximation of MPC laws via deep neural networks were obtained in [23,40,39,22], among other recent works. In the case of linear time-invariant systems, it was shown in [22] that a deep neural network of a given size can exactly represent the explicit MPC law. The robust NMPC problem (19) is a parametric optimization problem that depends on the current (estimated) state and on the uncertainty values used to define the scenario tree. To perform a deep learning-based approximation, a finite number N_s of samples x^(i) of the state space is chosen, and then N_s different optimization problems are solved to obtain the corresponding optimal inputs π_ms(x^(i)).
A standard deep feed-forward neural network with fully connected layers is defined as a sequence of layers that determines a function 𝒩 ∶ ℝ^{n_in} → ℝ^{n_out}, where the input of the network is x ∈ ℝ^{n_in} and the output of the network is 𝒩(x; Θ) ∈ ℝ^{n_out}. The dimensions of the network are defined by the number of hidden layers L and the number of neurons per hidden layer M, also denoted as the width of the hidden layer when equal width is assumed for all hidden layers. In contrast to shallow neural networks with L = 1 hidden layer, deep neural networks have L ≥ 2 hidden layers. The complexity of a neural network can be defined either by the number of weights or by the number of neurons that form a given network. The number of weights defines the memory that is needed to store the neural network, while the number of neurons determines the maximum possible number of nonlinear functions present in the approximation. Each hidden layer applies a nonlinear activation function σ to an affine function of the output of the preceding layer:

a_l = σ(W_l a_{l−1} + b_l), l = 1, …, L,

where a_{l−1} is the output of the previous layer and a_0 = x. Common choices for the activation function σ are rectified linear units (ReLU), the sigmoid function, and the hyperbolic tangent (tanh), which will be used throughout this work. The parameters of all layers are summarized in Θ = {θ_1, …, θ_{L+1}} with θ_l = {W_l, b_l}, where W_l are the weights and b_l are the biases describing the corresponding affine functions. The best data-based approximation of the exact multi-stage NMPC (19) with a neural network for a given training data set 𝒟 = {(x^(1), π_ms(x^(1))), …, (x^(N_s), π_ms(x^(N_s)))} with N_s elements and fixed dimensions L and M is achieved for:

Θ* = arg min_Θ (1/N_s) Σ_{i=1}^{N_s} ‖𝒩(x^(i); Θ) − π_ms(x^(i))‖². (26)

The resulting deep learning-based controller is denoted as π_dnn(x) = 𝒩(x; Θ*).
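The forward pass of such a network is just an alternation of affine maps and tanh nonlinearities, which is what makes the controller cheap to evaluate online. A minimal NumPy sketch (the names and dimensions are illustrative, not those of the trained controller):

```python
import numpy as np

def dnn_controller(x, weights, biases):
    """Evaluate N(x; Theta): tanh hidden layers, affine output layer."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)           # hidden layers: affine map + tanh
    return weights[-1] @ a + biases[-1]  # output layer: affine only

# toy network: 3 states -> two hidden layers of width 4 -> 1 control input
rng = np.random.default_rng(0)
dims = [3, 4, 4, 1]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(3)]
u = dnn_controller(np.zeros(3), Ws, bs)
```

Evaluating the controller thus costs only a handful of small matrix-vector products, which is what enables the microcontroller implementation discussed later.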

Constraint tightening
We propose to use a robust NMPC scheme to explicitly take into account the most important uncertainties that affect the system. Still, it is virtually impossible to account for all possible uncertainties and to obtain exact state estimates, which, in comparison to the ideal robust NMPC feedback law π_ideal, results in two sources of error:

‖π_ideal(x(k)) − π_ms(x̂(k))‖ ≤ ε_est + ε_ms, (27)

where ε_est is due to the estimation and measurement error and ε_ms is the error caused by the approximation of the reachable set by a set of discrete scenarios. Because solving the multi-stage NMPC problem (19) online is challenging, our goal is to determine a candidate neural network controller by generating input-output data pairs via the solution of the multi-stage NMPC problem (19) and approximating its solution via a deep neural network solving (26). This means that the closed loop will be controlled using the feedback law π_dnn that approximates the behavior of π_ms, which introduces an error ε_approx on top of those described in (27):

‖π_ideal(x(k)) − π_dnn(x̂(k))‖ = ‖π_ideal(x(k)) − π_ms(x̂(k)) + π_ms(x̂(k)) − π_dnn(x̂(k))‖
≤ ‖π_ideal(x(k)) − π_ms(x̂(k))‖ + ‖π_ms(x̂(k)) − π_dnn(x̂(k))‖
≤ ε_est + ε_ms + ε_approx.
Finding upper bounds for each of these errors in order to apply traditional robust NMPC schemes is not possible in the general nonlinear case.
To counteract the possible errors ε_est, ε_ms and ε_approx, and following ideas from tube-based MPC, an additional backoff ε_b is used to tighten the original constraints of the robust NMPC problem that is solved to generate input-output data for training, yielding the tightened problem (29). Solving (29) online would lead to the feedback controller π_ms(x̂, ε_b). We are, however, interested in the proposed approximate robust NMPC controller π_dnn(x̂, ε_b), which is obtained by training a deep neural network via (26) on input-output data generated by solving (29) for many different initial conditions. Introducing a backoff does not, in general, guarantee that the closed loop satisfies the constraints; closed-loop constraint satisfaction is also not ensured a priori by a terminal set. The probabilistic design scheme presented in the previous sections is therefore employed to select the backoff parameter ε_b. The proposed methodology provides probabilistic guarantees on the performance indicators of the closed-loop uncertain system.
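Combining Remark 2 with the backoff tightening, the selection of the backoff can be sketched as a simple validation loop (illustrative Python with a dummy simulator and our own names; in practice, simulate would run one full closed-loop simulation with the trained approximate controller and report whether the trajectory is admissible):

```python
def select_backoff(simulate, backoffs, N, r):
    """Pick the smallest backoff whose empirical indicator passes.

    simulate(eps_b) -> True if one sampled closed-loop trajectory with
    backoff eps_b is admissible (zero constraint violation).
    If N satisfies condition (10) for m = len(backoffs) candidates,
    an accepted backoff is probabilistically validated (cf. Remark 2).
    """
    for eps_b in sorted(backoffs):        # prefer the least conservative
        failures = sum(not simulate(eps_b) for _ in range(N))
        if failures <= r - 1:             # up to r - 1 runs may be discarded
            return eps_b
    return None  # no candidate passed; enlarge the candidate set

# dummy simulator: pretend only backoffs of at least 4 m are always safe
toy_sim = lambda eps_b: eps_b >= 4.0
chosen = select_backoff(toy_sim, [0.0, 2.0, 4.0], N=50, r=2)
```

For the guarantee of Theorem 1, all candidates should be evaluated on the same N sampled scenarios (e.g. by fixing the random seeds of the scenario generator); the deterministic dummy simulator above hides this detail.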

CASE STUDY
We investigate the optimal control of a kite that is used to tow a boat. The stable and safe operation of the kite is challenging due to the highly nonlinear system dynamics, uncertain parameters, the strong influence of disturbances such as the wind speed, and noisy measurements. To develop optimal control schemes for a kite system, models of moderate complexity such as [57,31] are typically considered because of the required short sampling times. Although high-fidelity models could also be considered with our proposed strategy, since the majority of the computational load is shifted offline, we consider a popular three-state model as presented in [58] to facilitate the comparison of the results with previous works. We derive an approximate deep learning-based controller from a robust NMPC formulation, which enables a very fast and simple evaluation of the controller even on computationally limited hardware. The idea of learning a controller for a kite has already been exploited in [59], where polynomial basis functions were used to approximate the behaviour of a human pilot based on measurements.

Kite model
In the context of NMPC, we focus on the model presented in [58], which consists of three states, one control input and two uncertain parameters. The state evolution is given by the ordinary differential equations of the three angles θ_kite, φ_kite and ψ_kite of the spherical coordinate system describing the position of the kite:

θ̇_kite = (v_a / L_T) (cos ψ_kite − tan θ_kite / E), (30a)
φ̇_kite = −(v_a / (L_T sin θ_kite)) sin ψ_kite, (30b)
ψ̇_kite = (v_a / L_T) ũ + φ̇_kite cos θ_kite, (30c)

where v_a = v_0 E cos θ_kite. (30d)

The angle between wind and tether (zenith angle) is described by θ_kite, the angle between the vertical and the tether plane is denoted by φ_kite, and ψ_kite represents the orientation of the kite. The three states can be manipulated via the steering deflection ũ. The area of the kite is denoted as A, and L_T is the length of the tether. The effect of the wind is captured by the apparent wind speed v_a, which is strongly influenced by the wind speed v_0, the first uncertain parameter. The glide ratio E depends on the base glide ratio E_0, the second uncertain parameter, and on the magnitude of the steering deflection ũ [60]. The parameters of the kite model are shown in the upper part of Table 1.

Wind model
The wind speed v_0 is considered as a single uncertainty in (29), but its realizations are computed based on a simulation model. The underlying wind model was presented in [61] and, when the wind shear is neglected, describes the wind speed as the superposition of a mean component and a turbulence component. The term v_m gives the current average wind speed, v_tb is driven by a white noise generator to model the short-term turbulence, and v_tb(0) = normal(0, 0.25) is the initial state of the turbulence, where x = normal(μ, σ) denotes that the variable x follows a normal distribution with mean μ and standard deviation σ. In a similar manner, x = unif(a, b) means that the variable x follows a uniform distribution between a and b. An overview of the parameters of the wind model is given in the lower part of Table 1. For further details on the modeling assumptions and the choice of parameters, the reader is referred to [61].
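A possible way to sample wind realizations for validation is to discretize such a mean-plus-turbulence model. The sketch below uses an illustrative first-order filter driven by white noise; the filter constants and the function name are our own assumptions, not the parameters of [61]:

```python
import random

def simulate_wind(v_m, steps, dt=0.15, tau=5.0, sigma=0.25, seed=0):
    """Sketch of v0(t) = v_m + v_tb(t), with v_tb a first-order filter
    driven by white noise (illustrative constants, not those of [61])."""
    rng = random.Random(seed)
    v_tb = rng.gauss(0.0, 0.25)  # initial turbulence state v_tb(0)
    trace = []
    for _ in range(steps):
        noise = rng.gauss(0.0, sigma)
        v_tb += (dt / tau) * (-v_tb) + noise * dt ** 0.5  # decay + excitation
        trace.append(v_m + v_tb)
    return trace
```

Each call with a fresh seed produces one i.i.d. wind scenario for the closed-loop simulations used in the probabilistic validation.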

Extended Kalman Filter
We assume that we can measure the two angles θ_kite and φ_kite and the wind speed v_0. An Extended Kalman Filter (EKF) is used to obtain an estimate of the augmented state x_aug = [θ_kite, φ_kite, ψ_kite, v_0, E_0] at each control instant from the measurements (32), which are corrupted by zero-mean Gaussian noises normal(0, 0.01) on θ_kite and φ_kite and normal(0, 0.05) on v_0. The augmented state estimate is initialized for all simulations with the same value x_aug(0).
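One EKF predict/update step on the augmented state has the following generic form. This is a sketch: `f`, `h` and their Jacobians `F_jac`, `H_jac` are placeholders standing in for the kite/wind dynamics and the measurement map of the paper.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """Generic EKF iteration: predict with the model f, then correct
    with the measurement z through the measurement map h."""
    # Predict
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

With the five-dimensional augmented state and three measurements (θ_kite, φ_kite, v_0), `H_jac` returns a 3x5 matrix and `R` collects the measurement noise variances.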

Objective, constraints and control settings
The goal of the control is to maximize the thrust of the tether T_F while maintaining a smooth control behaviour and satisfying the constraints. The desired behaviour is enforced in the stage cost, which combines the thrust with a penalty on control changes using the weights w_F = 1×10⁻⁴ and w_u = 0.5, where ũ_prev is the previous control input. The sampling time of the controller is t_c = 0.15 s and the prediction horizon is N_P = 40 steps.
Throughout the operation of the kite it has to be ensured that the height of the kite never falls below h_min = 100 m. The height constraint is a critical constraint of the control task since the best performance is obtained when the kite is operated close to h_min. Because of the error ε_ms caused by the approximation of the reachable sets in the multi-stage NMPC formulation, the error ε_approx due to the deep learning-based approximation, and the error ε_est related to estimation and measurement errors, constraint satisfaction cannot be guaranteed. To cover the effect of these errors, a backoff parameter b > 0 m is introduced and the tightened height constraint h ≥ h_min + b is formulated as a soft constraint to avoid numerical problems.
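A minimal sketch of how the tightened soft constraint could enter the stage cost: the bound h ≥ h_min + b is replaced by a penalized slack variable. The penalty weight `rho` is an illustrative value, not a parameter from the paper.

```python
def height_penalty(h, h_min=100.0, backoff=4.0, rho=1e3):
    """Soft version of the tightened height constraint h >= h_min + b:
    the slack s = max(0, h_min + backoff - h) is penalized instead of
    being enforced exactly (penalty weight rho is illustrative)."""
    slack = max(0.0, h_min + backoff - h)
    return rho * slack

# With backoff = 4 m, a kite flying at 103 m is already penalized even
# though the hard limit h_min = 100 m is not yet violated.
```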
To build a multi-stage NMPC controller, we consider the combinations of the extreme values of the base glide ratio E_0 ∈ [4, 6] and the wind speed v_0 ∈ [6 m s⁻¹, 10 m s⁻¹] and a one-step robust horizon, resulting in a total of four scenarios. The interval for the wind speed is obtained by summarizing the possible effects of the uncertain wind model parameters v_m, v_tb(0) and v_tb into the single uncertain variable v_0.
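The scenario tree for the one-step robust horizon can be enumerated directly as the combinations of the extreme parameter values:

```python
from itertools import product

# Extreme values considered in the multi-stage formulation
E0_vals = [4.0, 6.0]        # base glide ratio
v0_vals = [6.0, 10.0]       # wind speed in m/s

# One-step robust horizon: every combination of extremes is one branch
scenarios = list(product(E0_vals, v0_vals))
assert len(scenarios) == 4  # four scenarios, as in the text
```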

Simulation
For the simulation of the system, it is assumed that the uncertain parameters E_0 and v_m are constant over a given closed-loop simulation and that v_tb changes every t_c = 0.15 s. The values of the uncertain parameters are drawn from the probability distributions described in Table 1.

RESULTS
The proposed method for the probabilistic verification of controllers is analyzed for the towing kite case study. The baseline controller for our investigations, which is also used to generate the training data for the corresponding approximate neural network controllers dnn,b, is the exact multi-stage NMPC controller (29), which derives its initial state estimate from the EKF based on the current measurement (32). This means that the baseline controller is affected by the estimation error ε_est and by the error ε_ms caused by the discrete representation of the uncertainties in the scenario tree, and hence no formal guarantees on constraint satisfaction can be given. To avoid numerical problems for the solver in case of violations, the critical height constraint is implemented as a soft constraint.

Learning an approximate output-feedback robust NMPC controller
The training process of a neural network is determined by the quality of the data and the chosen hyperparameters such as the activation function of the hidden layers and the network size (21), (22). In the following, we discuss how the training data can be generated in a way that reduces the number of samples needed to achieve a satisfactory approximation in comparison to random sampling. For the training of the neural networks we used the toolbox Keras [62] with the backend TensorFlow [63] and Adam [64] as the optimization algorithm. The weights were initialized based on the Glorot uniform distribution [65] and the biases were set to zero. All considered networks use the hyperbolic tangent (tanh) as activation function in the hidden layers and a linear output layer. As the focus of this work is the verification of a given approximate controller and not the training process or the choice of the optimal network architecture, we refrained from applying methods such as Bayesian optimization to obtain an optimal structure of the underlying network [66]. We consider two training data sets D_feas and D_opt, and two validation data sets V_feas and V_opt. Each data set contains samples (x(i), u_ms(x(i))) corresponding to the numerical solution of the multi-stage problem (29) at state x(i).
The subscript opt indicates that the data was derived from optimal closed-loop trajectories, i.e. D_opt is composed of n_traj state-feedback closed-loop simulations of length n_sim using the exact multi-stage NMPC (29) under the dynamics presented in (30), where the uncertain parameters of the model and the initial conditions are drawn according to the distributions given in Table 1 and the first row of Table 2, respectively. The subscript feas means that the data was obtained at randomly sampled states, i.e. D_feas is obtained by sampling x(i) uniformly from the feasible state space and solving (29). Since the training data is generated based on simulations, output feedback via the EKF is not necessary and is not used for the data generation. Each trajectory consists of n_sim = 400 simulation steps, which results in a total simulation time of n_sim · t_c = 60 s. For D_opt, n_traj = 200 closed-loop runs were simulated, leading to n_traj · n_sim = 80000 data pairs, and for validation 50 simulations were rolled out, resulting in 20000 samples in V_opt. For the data sets D_feas and V_feas, 80000 and 20000 random samples were drawn, respectively.
(Figure 2: Mean squared error obtained when training a deep neural network using the space of optimal trajectories D_opt or the full feasible space D_feas as training data.)
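The construction of a D_opt-style data set can be sketched as follows. `solve_mpc`, `step`, `sample_x0` and `sample_p` are placeholders standing in for the multi-stage optimizer, the plant dynamics (30), and the distributions of Tables 1 and 2.

```python
import numpy as np

def collect_closed_loop_data(solve_mpc, step, sample_x0, sample_p,
                             n_traj=200, n_sim=400, rng=None):
    """Build a data set of (state, input) pairs by rolling out the
    exact multi-stage NMPC in closed loop (sketch with placeholder
    callables for the optimizer and the plant)."""
    rng = np.random.default_rng(rng)
    X, U = [], []
    for _ in range(n_traj):
        x = sample_x0(rng)          # initial condition (Table 2)
        p = sample_p(rng)           # uncertain parameters (Table 1)
        for _ in range(n_sim):
            u = solve_mpc(x)        # state feedback, no EKF needed here
            X.append(np.copy(x))
            U.append(u)
            x = step(x, u, p)
    # n_traj * n_sim pairs in total, e.g. 200 * 400 = 80000
    return np.array(X), np.array(U)
```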
For the following investigations, we trained five deep networks with L = 6 hidden layers and n = 30 neurons per layer on each training set and evaluated all five obtained networks on each validation set. By averaging the results over five networks, the impact of the stochasticity of the training is reduced. Training a deep neural network with the data pairs D_opt leads to a significantly smaller average mean squared error (MSE) than training with D_feas, as Figure 2 shows, because the sampled space of optimal trajectories is smaller than the feasible space. To investigate the impact of the training data set on the actual performance, the networks are tested on the validation sets V_feas and V_opt. The networks trained on D_feas perform better when evaluated on the whole feasible space, with an average MSE of 0.0048 in comparison to 0.2105 for the networks trained on D_opt. But when the networks are evaluated on the space of optimal closed-loop trajectories via V_opt, the networks trained on D_opt have a significantly smaller average MSE of 0.0087 than the networks trained on D_feas, with an average MSE of 0.1642. The fact that controllers trained on D_opt do not cover the whole feasible space is not critical, since the learning-based controller will only operate in the neighborhood of optimal trajectories, where a close approximation of the exact multi-stage NMPC is achieved. Additionally, the controllers will be probabilistically validated, and this validation is completely independent of the data used for training. Our experience shows that extracting training data from closed-loop trajectories can significantly reduce the number of training samples and the dimensions L and n of the neural network that are necessary to obtain a desired approximation error of the deep learning-based controller in the critical domain.
For all results presented in the remainder of the paper, we use deep neural networks with L = 6 and n = 30 that were trained on the space of optimal trajectories D_opt, due to the observed superior approximation quality in the crucial regions of the state space.

Verification of a deep learning-based embedded output-feedback robust NMPC
Because of the approximation errors, measurement and estimation errors as well as the errors derived from the multi-stage formulation, we refrain from a worst-case deterministic analysis and resort to the probabilistic verification scheme based on closed-loop trajectories presented in Section 3. We consider four possible values of the backoff hyper-parameter b, i.e. b ∈ {0 m, 2 m, 4 m, 6 m}. This leads to a family of m = 4 deep learning-based approximate controllers dnn,b. Each of these controllers was obtained by training on a data set D_opt,b containing 80000 data pairs. The resulting controllers were analyzed for N i.i.d. scenarios w(i), i = 1, …, N, corresponding to closed-loop simulations under the dynamics presented in (30), where the uncertain parameters of the model and the initial conditions are drawn according to the distributions given in Table 1 and the first row of Table 2, respectively. Since the height constraint (36) is the most critical constraint, we define the performance indicator (37) as the largest value of h_min − h(x(k, w)) over a scenario, where x(k, w) is the state trajectory at sampling instant k caused by scenario w using controller dnn,b. The performance indicator (37) thus extracts the largest violation of the minimum height h_min, if a violation occurs, or the closest value to h_min throughout one scenario. Each scenario has a duration of 60 s, which corresponds to n_sim = 400 steps. A controller is considered probabilistically safe if, with confidence 1 − δ, the probability that a randomly sampled scenario violates the height constraint is at most ε.
(Table 2: Overview of the parameter sampling via uniform, normal, beta(2,5) and Pareto(5) distributions and results of evaluating the approximate controller dnn,4 with b = 4 m for 1388 randomly drawn scenarios. The measurement noise, the initial state of the turbulence v_tb(0) = normal(0, 0.25), the white noise modelling the short-term turbulence v_tb = normal(0, 0.25) and the initialization of the estimation vector x_aug(0) are identical for all scenario spaces.)
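The height-based performance indicator can be computed per scenario as sketched below, assuming the description of (37) given in the text: the largest value of h_min − h over one closed-loop trajectory.

```python
import numpy as np

def height_indicator(h_traj, h_min=100.0):
    """Performance indicator in the spirit of (37): the largest
    violation of h_min over one closed-loop scenario. Positive values
    mean the height constraint was violated at some point; negative
    values give the closest margin to h_min."""
    return float(np.max(h_min - np.asarray(h_traj)))
```

For example, a trajectory that dips to 98 m yields an indicator of 2.0 (a 2 m violation), while one that never falls below 102 m yields -2.0.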
Following the notation of Theorem 1, the performance indicators corresponding to the backoff parameters {0 m, 2 m, 4 m, 6 m} are collected into one vector per controller. We consider a value of the discarding parameter of r = 4; that is, a controller is probabilistically validated if no more than 3 simulations violate the height constraint. For these specifications (ε = 0.02, δ = 1×10⁻⁶ and r = 4), N = 1388 samples are required (see (11)). The family of controllers was evaluated for 1388 i.i.d. scenarios w(i) and the results are summarized in Table 3. If no backoff is considered (b = 0 m), the exact multi-stage NMPC often operates at the constraint bound, which leads to small violations of the height constraint as ε_ms and ε_est are ignored. The corresponding approximate controller dnn,0 is additionally affected by ε_approx (28), which leads to violations of the height constraint in more than half of the scenarios when applied. Exemplary trajectories of the exact multi-stage NMPC and the approximate controller for one scenario are visualized in Figure 3a.
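The sample size N = 1388 is consistent with a bound of the generalized-max/finite-families type, assuming the form N ≥ (1/ε)(r − 1 + ln(m/δ) + √(2(r − 1) ln(m/δ))), where a union bound over the family of m controllers replaces δ by δ/m. This exact formula is an assumption on our part; the authoritative expression is (11) in the paper.

```python
from math import ceil, log, sqrt

def sample_size(eps, delta, r, m=1):
    """Sample complexity for validation with discarding parameter r
    and a family of m controllers (assumed bound; cf. (11)):
    N >= (1/eps) * (r - 1 + ln(m/delta) + sqrt(2*(r-1)*ln(m/delta)))."""
    t = log(m / delta)
    return ceil((r - 1 + t + sqrt(2 * (r - 1) * t)) / eps)

# eps = 0.02, delta = 1e-6, r = 4 and a family of m = 4 controllers
N = sample_size(0.02, 1e-6, 4, m=4)   # -> 1388, matching the text
```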
By considering b = 2 m, the number of violations can be significantly reduced to 8 scenarios, which shows the importance of the backoff parameter. However, the performance of dnn,2 is not considered probabilistically safe because, even after discarding the allowed number of worst-case simulation runs, violations of the height constraint remain. Exemplary trajectories with backoff are visualized in Figure 3b. Due to the backoff, the kite keeps a safety distance to the constraint bound and the impact of ε_ms and ε_est does not directly lead to constraint violations. The trajectory of the approximate controller also does not violate the constraint, despite being affected by the additional approximation error ε_approx. The preferred deep learning-based controller is dnn,4 because of the higher average tether thrust T_F it provides. By introducing a performance indicator for the average thrust per simulation run, it is possible to obtain probabilistic statements about the performance in the same fashion as for the violation of the height constraint. Using the parameters δ = 1×10⁻⁶, ε = 0.02, r = 4 and b = 4 m, we obtain for the controller dnn,4 that, with confidence 1 − δ, the probability that the average thrust over a simulation run of 60 s duration is lower than 111.346 kN is not larger than ε = 0.02. A smaller number of samples is required if the discarding parameter is set equal to 1. However, this leads to more conservative results because violations of the height constraint occur throughout the closed-loop simulations used for verification. The situation is even worse when the performance index is a binary function determining whether the trajectories are admissible or not. In this case, the obtained results are often not informative, because in a binary setting with a discarding parameter of 1, a single violated trajectory out of N determines that the controller does not meet the probabilistic constraints. Larger values of the discarding parameter, along with the consideration of non-binary violation performance indicators, provide more informative results.
A further advantage of the proposed probabilistic method is that a family of controllers can be evaluated in parallel in closed loop for the same set of sampled scenarios. This can reduce the verification effort significantly if drawing samples from the scenario space is costly or if the closed-loop experiments have a long duration.

Robustness of the probabilistic validation scheme
All obtained probabilistic guarantees are only valid if the assumptions about the probability density functions (PDFs) from which the scenarios are drawn are correct. For the verification, the closed-loop simulations were generated using the dynamics presented in (30) and the different dnn,b controllers. The uncertain parameters of the model and the initial conditions were drawn according to the distributions given in Table 1 and the first row of Table 2, respectively.
To test the robustness of the probabilistic statements with respect to wrong assumptions about the PDFs, the performance of the approximate controllers dnn,b is evaluated using not the distribution of the first row of Table 2, but the second (normal distribution), the third (beta distribution) and the fourth one (Pareto distribution). The first parameter in the description of the beta distribution is the scaling and the second parameter is the offset, e.g. 2.0 ⋅ beta(2, 5) + 28.0. The long-tailed Pareto distribution is also described with two parameters, where the first one is the tail index and the second one is the scaling, e.g. pareto(5.0) + 28.0. The possible extreme values of samples from the space of beta distributions are identical to those obtained when sampling from the space of uniform distributions, see Figure 4. When sampling from the normal and Pareto distributions, which have infinite support, the occurrence of values that exceed the bounds of the scenarios considered in the robust MPC formulation and in the verification is likely, which highlights the importance of including the discarding parameter. The four considered PDFs, including the bounds applied in the NMPC formulation, are shown in Figure 4 for the example of the base glide ratio E_0.
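The four samplers could be set up as follows. The scaling/offset values mirror the examples in the text; the mean and standard deviation of the normal case are assumptions for illustration, not the values of Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1388

# The four parameter distributions used in the robustness study
samples = {
    "uniform": rng.uniform(28.0, 30.0, n),
    "normal":  rng.normal(29.0, 0.5, n),        # assumed mean/std
    "beta":    2.0 * rng.beta(2, 5, n) + 28.0,  # scaling 2.0, offset 28.0
    "pareto":  rng.pareto(5.0, n) + 28.0,       # tail index 5.0
}

# Beta samples share the extreme values of the uniform case ...
assert samples["beta"].min() >= 28.0 and samples["beta"].max() <= 30.0
# ... while normal and Pareto have unbounded support and can exceed the
# bounds used in the robust NMPC formulation.
```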
The results corresponding to drawing 1388 scenarios from each of the normal, beta and Pareto distributions and evaluating the approximate controller dnn,4 are given in Table 2. For the normal and beta distributions, one simulation run violates the height constraint, while three simulation runs violate the constraint for the Pareto distribution. This means that the probabilistic requirements for the safety certificate (ε = 0.02, δ = 1×10⁻⁶ and a discarding parameter of 4) hold for all alternative choices of distributions. This shows that neither the training of the network nor the verification approach fails catastrophically when the statistical assumptions are not exactly fulfilled.

Embedded implementation
One of the major advantages of learning the complex optimal control law via deep neural networks is the reduction of the computational load and the fast evaluation. The computation of the control input is reduced from solving an optimization problem online to one matrix-vector multiplication per layer and the evaluation of the tanh function. This enables the implementation of a probabilistically validated, approximate robust nonlinear model predictive controller on limited hardware such as microcontrollers or field-programmable gate arrays (FPGAs). We deployed the approximate controller on a low-cost microcontroller (ARM Cortex-M3 32-bit) running at a frequency of 89 MHz with 96 kB RAM. The memory footprint of both the EKF and the neural network that describes the approximate robust NMPC is only 67.0 kB of the 512 kB flash memory. The average time needed to evaluate the neural network was 32.1 ms (max. evaluation time: 33.0 ms) and the average evaluation time of one EKF step was 28.3 ms (max. evaluation time: 30.0 ms), which shows that the proposed controller is real-time capable with a worst-case evaluation time of 63.0 ms. We analyzed the impact of the evaluation time on safety by simulating the kite for the same 1388 scenarios considered in Table 3, drawn from the uniform distribution, but applying the computed control inputs with a delay of 65.0 ms, emulating a hardware-in-the-loop setting. We rounded the time delay up to 65.0 ms to account for possible time measurement errors. To deal with the additional error caused by the delay, we chose b = 6 m. Out of the 1388 simulated scenarios, 1374 were free of violations. This means the controller violates the height constraint in about 1.0% of the cases, which is less than the probabilistic guarantee ε = 0.02 chosen in Section 7.2, despite the additional errors induced by the delay.
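The evaluation performed on the microcontroller amounts to a plain feed-forward pass, which can be sketched as below. The layer shapes match the 6x30 tanh architecture used in the paper, but the input/output dimensions and the random weight values are illustrative placeholders.

```python
import numpy as np

def nn_forward(x, weights, biases):
    """Evaluate a tanh network as done on the embedded hardware: one
    matrix-vector product per layer plus the activation, followed by a
    linear output layer."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(W @ x + b)
    return weights[-1] @ x + biases[-1]

rng = np.random.default_rng(0)
# 5 inputs (augmented state), 6 hidden layers of 30 neurons, 1 output
sizes = [5, 30, 30, 30, 30, 30, 30, 1]
weights = [rng.standard_normal((m, k)) for k, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

u = nn_forward(rng.standard_normal(5), weights, biases)
```

On a microcontroller the same loop runs over fixed-size float arrays, which is why the evaluation time is deterministic and can be bounded in advance.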
If the performance needs to be further improved for the hardware-in-the-loop setting, training data for the controller could be generated in which the deterministic delay is incorporated in the NMPC formulation. This is an additional advantage of the proposed approach, because the evaluation time of a given neural network is deterministic and can be known in advance. Additional measures to counteract the impact of the delayed application of the control inputs, such as advanced-step NMPC [67] or the real-time iteration scheme [68], could also be incorporated in the scheme.

CONCLUSIONS AND FUTURE WORK
The computational complexity of output-feedback robust NMPC controllers is prohibitive in most cases. Instead of relying on strong assumptions on error bounds and invariant sets that cannot be verified in practice, we propose a probabilistic performance validation scheme that can be used to obtain probabilistic guarantees about the closed-loop performance of approximate robust NMPC controllers based on a tree of discrete scenarios. To enable the implementation of such controllers in real time even on limited embedded hardware, we used deep learning to approximate the proposed robust NMPC controller.
To deal with the errors related to estimation, to the computation of approximate reachable sets based on scenarios and to the approximation of the resulting optimization problem with a neural network, we tighten the original constraints of the problem using a backoff parameter. The novel probabilistic validation framework leads to less restrictive results than previous approaches because of the incorporation of a discarding parameter and the consideration of non-binary performance indicators. Moreover, the required sample complexity does not depend on the dimension of the problem. The promising results for the embedded output-feedback robust NMPC of a towing kite show the potential of the proposed approach. Future work includes the definition of robust margins based on probabilistic validation techniques as well as the learning of controllers that are parameterized, for example, with a backoff parameter.