Robust risk aggregation with neural networks

Abstract We consider settings in which the distribution of a multivariate random variable is partly ambiguous. We assume the ambiguity lies on the level of the dependence structure, and that the marginal distributions are known. Furthermore, a current best guess for the distribution, called reference measure, is available. We work with the set of distributions that are both close to the given reference measure in a transportation distance (e.g., the Wasserstein distance), and additionally have the correct marginal structure. The goal is to find upper and lower bounds for integrals of interest with respect to distributions in this set. The described problem appears naturally in the context of risk aggregation. When aggregating different risks, the marginal distributions of these risks are known and the task is to quantify their joint effect on a given system. This is typically done by applying a meaningful risk measure to the sum of the individual risks. For this purpose, the stochastic interdependencies between the risks need to be specified. In practice, the models of this dependence structure are however subject to relatively high model ambiguity. The contribution of this paper is twofold: First, we derive a dual representation of the considered problem and prove that strong duality holds. Second, we propose a generally applicable and computationally feasible method, which relies on neural networks, in order to numerically solve the derived dual problem. The latter method is tested on a number of toy examples, before it is finally applied to perform robust risk aggregation in a real‐world instance.


Introduction

Motivation
Risk aggregation is the process of combining multiple types of risk within a firm. The aim is to obtain meaningful measures for the overall risk the firm is exposed to. The stochastic interdependencies between the different risk types are crucial in this respect. There is a variety of different approaches to model these interdependencies. One generally observes that these models for the dependence structure between the risk types are significantly less accurate than the models for the individual types of risk.
We take the following approach to address this issue: We assume that the distributions of the marginal risks are known and fixed. This assumption is justified in many cases of practical interest. Moreover, risk aggregation is by definition not concerned with the computation of the marginal risks' distributions. Additionally, we take a probabilistic model for the dependence structure linking the marginal risks as given. Note that there are at least two different approaches in the literature to specify this reference dependence structure: the construction of copulas and factor models. The particular form of this reference model is not relevant for our approach as long as it allows us to generate random samples. Independently of the employed method, the choice of a reference dependence structure is typically subject to high uncertainty. Therefore, our contribution is to model the ambiguity with respect to the specified reference model, while fixing the marginal distributions. We address the following question in this paper: How can we account for model ambiguity with respect to a specific dependence structure when aggregating different risks?
We propose an intuitive approach to this problem: We compute the aggregated risk with respect to the worst case dependence structure in a neighborhood around the specified reference dependence structure. For the construction of this neighborhood we use transportation distances. These distance measures between probability distributions are flexible enough to capture different kinds of model ambiguity. At the same time, they allow us to derive numerical methods which solve the corresponding problem of robust risk aggregation in reasonable time. To highlight some of the further merits of our approach, we are able to determine the worst case dependence structure for a problem at hand. Hence, our method for robust risk measurement is arguably a useful tool also for risk management, as it provides insights about which scenarios stress a given system the most. Moreover, it should be emphasized that our approach is restricted neither to a particular risk measure nor to a particular aggregation function. In summary, the approach presented provides a flexible way to include model ambiguity in situations where a reference dependence structure is given and the marginals are fixed. It is generally applicable and computationally feasible. In the subsequent subsection we outline our approach in some more detail before discussing the related literature.

Overview
We aim to evaluate ∫ f dµ for some f : R^d → R in the presence of ambiguity with respect to the probability measure µ ∈ P(R^d), where P(R^d) denotes the set of all Borel probability measures on R^d. In particular, we assume that the marginals μ1, . . ., μd of µ are known and the ambiguity lies solely on the level of the dependence structure. Moreover, we assume a reference dependence structure, namely the one implied by the reference measure μ, is given, and that the degree of ambiguity with respect to the reference measure μ can be modeled by the transportation distance dc, which is defined in (2) below. Hence, we consider the following problem

    φ(f) := max_{µ ∈ Π(μ1, . . ., μd) : dc(μ, µ) ≤ ρ} ∫ f dµ,    (1)

where the set Π(μ1, . . ., μd) consists of all µ ∈ P(R^d) satisfying µi = μi for all i = 1, . . ., d, where µi ∈ P(R) and μi ∈ P(R) denote the i-th marginal distributions of µ and μ. Moreover, we fix a continuous function c : R^d × R^d → [0, ∞) such that c(x, x) = 0 for all x ∈ R^d. The cost of transportation between μ and µ in P(R^d) with respect to the cost function c is defined as

    dc(μ, µ) := inf_{π ∈ Π(μ, µ)} ∫ c dπ,    (2)

where Π(μ, µ) denotes the set of all couplings of the marginals μ and µ. For the cost function c(x, y) = ‖x − y‖^p with p ≥ 1, the mapping dc^{1/p} corresponds to the Wasserstein distance of order p.
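To make the transportation distance dc concrete, the following sketch computes dc between two discrete measures by solving the coupling problem (2) as a linear program. The helper name `transport_cost` and the two-point example are our own illustration, not code from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def transport_cost(x, y, p, q, cost):
    """Transportation cost d_c between two discrete measures
    sum_i p[i]*delta_{x[i]} and sum_j q[j]*delta_{y[j]}: a linear
    program over couplings with the prescribed marginals."""
    n, m = len(x), len(y)
    C = np.array([[cost(xi, yj) for yj in y] for xi in x]).ravel()
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums equal p
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums equal q
    res = linprog(C, A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun

# cost c(x, y) = |x - y| (order 1): mass 1/2 has to move from 1 to 2
x = np.array([0.0, 1.0]); y = np.array([0.0, 2.0])
w = np.array([0.5, 0.5])
d = transport_cost(x, y, w, w, lambda a, b: abs(a - b))
```

For continuous marginals as in problem (1), this exact LP is only available after discretization, which is precisely the route of Corollary 3 discussed below.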
The following duality result is the starting point for the numerical methods to solve problem (1), which are developed below. Moreover, it allows us to derive analytical solutions to problem (1) in some cases. Hence, the following theorem can be seen as the central result in the present paper. Let Ub(R^d) denote the set of all upper semicontinuous and bounded functions f : R^d → R, and Cb(R) the set of all continuous and bounded functions h : R → R.
Remark 1. (i) In case ρ = ∞, the above result collapses to the duality of multi-marginal optimal transport. On the other hand, if ρ = 0, both the primal problem (3) and the dual problem (4) reduce to ∫ f dμ. Finally, if one drops the constraint µ ∈ Π(μ1, . . ., μd) in the primal formulation (3), the functions hi disappear from the dual problem (4).
(ii) In Section 2 the Theorem is generalized in the following aspects: Firstly, rather than R^d, the underlying space can be a product X = X1 × · · · × Xd, where the sets Xi can be of arbitrary dimensions. In practice this means the problem setting can include an information structure where multivariate marginals are known and fixed. Secondly, the functions f need not be bounded, as we can derive the above results for a more general class of functions f. Lastly, the constraint dc(μ, µ) ≤ ρ is one particular form of a more general way to penalize with respect to dc(μ, µ).
(iii) Gao and Kleywegt [22] derive the above duality under different assumptions, which we discuss in the subsequent Section 1.3. These authors point out that the dual problem (4) can be reformulated as a linear program under the following assumptions: First, the function f can be written as the maximum of affine functions. Second, the reference distribution μ is given by an empirical distribution on n points x^1, . . ., x^n in R^d. Third, the set X, mentioned in (ii), only contains n^d points. Fourth, the cost function c has to be of a particular form, which is satisfied for instance by c(x, y) = Σ_{i=1}^d |xi − yi|. For further details, we refer to Corollary 3 in Section 2.1.
The assumptions listed in Remark 1(iii) under which problem (4) can be solved by means of linear programming are so restrictive that they exclude many cases of practical interest. Even in cases where linear programming is applicable, the resulting size of the linear program quickly becomes intractable in higher dimensions. Hence, this paper presents a method to numerically solve problem (4) which uses neural networks.

Implementation
Starting with the dual formulation (4) of the problem φ(f), the goal is to solve the problem numerically using neural networks by recasting it as a stochastic optimization problem. The basic idea is penalization, which has a long history in optimal transport and related problems, see for example [16] and references therein.
The first hurdle is the inf-sup structure, which is circumvented by dualizing the point-wise supremum in the integral. Under mild assumptions, this leads to

    φ(f) = inf { λρ + ∫ g dμ + Σ_{i=1}^d ∫ hi dμi },

where the infimum is taken over all λ ≥ 0, g ∈ Cb(R^d) and hi ∈ Cb(R) such that g(x) + Σ_{i=1}^d hi(yi) + λ c(x, y) ≥ f(y) for all x, y. As the point-wise inequality constraint prevents a direct implementation with neural networks, the constraint is penalized. This is done by introducing a measure θ ∈ P(R^{2d}), which we refer to as the sampling measure. Assuming all occurring functions are continuous, if θ ∈ P(R^{2d}) gives positive mass to every non-empty open set in R^{2d}, it holds

    φ(f) = inf { λρ + ∫ g dμ + Σ_{i=1}^d ∫ hi dμi + ∫ ∞ · max{0, f(y) − g(x) − Σ_{i=1}^d hi(yi) − λ c(x, y)} θ(dx, dy) },

with the infimum now taken over all λ ≥ 0, g ∈ Cb(R^d) and hi ∈ Cb(R). The final step is to approximate the function x ↦ ∞ · max{0, x} by a sequence of convex, nondecreasing and differentiable functions (βγ)γ>0. The resulting optimization problems are

    φθ,γ(f) := inf { λρ + ∫ g dμ + Σ_{i=1}^d ∫ hi dμi + ∫ βγ(f(y) − g(x) − Σ_{i=1}^d hi(yi) − λ c(x, y)) θ(dx, dy) }.    (5)

In Proposition 1 we analyze the problem φθ,γ(f). In particular, Proposition 1 (c) identifies conditions for the convergence φθ,γ(f) → φ(f) for γ → ∞.
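The role of the penalty functions (βγ)γ>0 can be illustrated with the quadratic choice βγ(x) = (γ/2) max(0, x)², which is the form used in the examples of Section 3; the snippet below is a minimal sketch.

```python
import numpy as np

def beta_gamma(x, gamma):
    # beta_gamma(x) = (1/gamma) * beta(gamma*x) with beta(x) = max(0, x)^2 / 2:
    # convex, nondecreasing, differentiable; converges pointwise to the
    # function "infinity * max{0, x}" (0 for x <= 0, infinity for x > 0)
    # as gamma -> infinity
    return 0.5 * gamma * np.maximum(0.0, x) ** 2
```

Feasible points (non-positive constraint slack) incur zero penalty for every γ, while any fixed violation is penalized arbitrarily hard as γ grows; this is the trade-off behind the choice of γ discussed for the examples below.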
The problem φθ,γ(f) now fits into the stochastic optimization framework. In this paper, we solve it numerically using neural networks, i.e. we replace the occurring sets of functions Cb(R) and Cb(R^d) by sets of neural network functions.
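As a rough illustration of the resulting stochastic optimization problem, the following numpy sketch evaluates a penalized objective of this type for d = 2 with randomly initialized feed-forward ReLU networks in place of trained ones; in the actual method the network weights are optimized by stochastic gradient methods, which is omitted here. All widths, sample sizes and the seed are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(widths):
    # randomly initialized feed-forward ReLU network; in the actual method
    # these weights are the variables trained by stochastic gradient descent
    return [(rng.normal(size=(m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def mlp(params, x):
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    return x[:, 0]

def beta_gamma(x, gamma):
    return 0.5 * gamma * np.maximum(0.0, x) ** 2

def penalized_objective(g, h1, h2, lam, rho, gamma, f, cost, x_mu, xy_theta):
    """Monte Carlo estimate of a penalized dual objective of the form
    E_mu[g] + E_mu1[h1] + E_mu2[h2] + lam*rho
      + E_theta[beta_gamma(f(y) - g(x) - h1(y1) - h2(y2) - lam*c(x, y))]."""
    x, y = xy_theta[:, :2], xy_theta[:, 2:]
    slack = (f(y) - mlp(g, x) - mlp(h1, y[:, :1]) - mlp(h2, y[:, 1:])
             - lam * cost(x, y))
    return (mlp(g, x_mu).mean() + mlp(h1, x_mu[:, :1]).mean()
            + mlp(h2, x_mu[:, 1:]).mean() + lam * rho
            + beta_gamma(slack, gamma).mean())

# one objective evaluation for the comonotone toy example of Section 3
u = rng.uniform(size=(256, 1))
x_mu = np.hstack([u, u])                            # samples of the reference
xy = np.hstack([x_mu, rng.uniform(size=(256, 2))])  # samples of theta
val = penalized_objective(
    mlp_init([2, 64, 64, 64, 1]), mlp_init([1, 64, 64, 64, 1]),
    mlp_init([1, 64, 64, 64, 1]), lam=1.0, rho=0.1, gamma=1280.0,
    f=lambda y: np.maximum(y[:, 0], y[:, 1]),
    cost=lambda x, y: np.abs(x - y).sum(axis=1),
    x_mu=x_mu, xy_theta=xy)
```

Replacing the function classes by such parametric families turns the infinite-dimensional dual into an unconstrained minimization over network weights and λ, which can be attacked with standard gradient-based optimizers.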
The remainder of the paper is structured as follows. In the subsequent Section 1.3, we provide a brief overview of the relevant literature. Our main results can be found in Section 2, which consists of two parts: First, we state and prove the generalized version of Theorem 1 above. We then derive some implications thereof. In the second part of Section 2, we focus on the function φθ,γ(f) introduced in equation (5) above. In this respect, our contribution is threefold: We dualize φθ,γ(f), derive a connection between primal and dual optimizers, and study the convergence of φθ,γ(f) to φ(f).
Section 3 is devoted to three toy examples, which are based on two uniformly distributed random variables with ambiguous dependence structure. The aim of this exemplification is to shed some light on the developed concepts and show how they can be used to solve given problems. In the final Section 4, the acquired techniques are applied to a real-world example. We thereby demonstrate how to implement robust risk aggregation with neural networks in practice.

Related literature
There are three different strands of literature which are relevant in the present context: firstly, literature on risk aggregation; secondly, literature on model ambiguity and particularly on ambiguity sets constructed using the Wasserstein distance; thirdly, recent applications of neural networks in finance and related optimization problems.

Risk aggregation
In Section 4, we motivate from an applied point of view why there is interest in risk bounds for the sum of losses of which the marginal distributions are known. The theoretical interest in this topic started with the following question: How can one compute bounds for the distribution function of a sum of two random variables when the marginal distributions are fixed? This problem was solved in 1982 by Makarov [28] and Rüschendorf [40]. Starting with the work of Embrechts and Puccetti [17] more than 20 years later, the higher dimensional version of this problem was studied extensively due to its relevance for risk management. We refer to Embrechts et al. [18] and Puccetti and Wang [37] for an overview of the developments concerning risk aggregation under dependence uncertainty, as this problem was coined. Let us mention that Puccetti and Rüschendorf [36] introduced the so-called rearrangement algorithm, which is a fast procedure to numerically compute the bounds of interest. Applying this algorithm to real-world examples demonstrates a conceptual drawback of the assumption that no information concerning the dependence of the marginal risks is available: the implied lower and upper bounds for the aggregated risk are impractically far apart.
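For concreteness, a minimal sketch of the rearrangement idea follows: each column of a discretization matrix is repeatedly reordered anti-monotonically to the sum of the remaining columns. This is only the core iteration; the algorithm of Puccetti and Rüschendorf [36] adds the quantile discretization scheme and stopping rules needed to obtain actual risk bounds.

```python
import numpy as np

def rearrangement(X, iters=50):
    """Core iteration of the rearrangement algorithm: each column of X holds
    a discretization of one marginal; every column is repeatedly reordered
    anti-monotonically to the sum of the remaining columns, which flattens
    the profile of the row sums."""
    X = X.copy()
    for _ in range(iters):
        for j in range(X.shape[1]):
            rest = X.sum(axis=1) - X[:, j]
            ranks = np.argsort(np.argsort(-rest))  # rank 0 = largest rest
            X[:, j] = np.sort(X[:, j])[ranks]      # anti-monotone reorder
    return X

# two standard-uniform marginals on an n-point grid: the rearranged matrix
# couples them anti-monotonically, so every row sum equals 1
n = 100
q = (np.arange(n) + 0.5) / n
row_sums = rearrangement(np.column_stack([q, q])).sum(axis=1)
```

In this two-uniform example the anti-monotone coupling makes the row sums constant, the completely mixable case in which the unconstrained dependence-uncertainty bounds are attained.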
Hence, some authors recently tried to overcome this drawback and to come up with more realistic bounds by including partial information about the dependence structure. For instance, Puccetti and Rüschendorf [35] discuss how positive, negative or independence information influences the above risk bounds; Bernard et al. [7] derive risk bounds with constraints on the variance of the aggregated risk; Bernard et al. [8] consider partially specified factor models for the dependence structure. The interested reader is referred to Rüschendorf [41] for a recent review of these and related approaches. Finally, we want to point out the intriguing contribution by Lux and Papapantoleon [27]. These authors provide a framework which allows them to derive VaR-bounds if (a) extreme value information is available, (b) the copula linking the marginals is known on a subset of its domain, and (c) the latter copula lies in the neighborhood of a reference copula as measured by a statistical distance.
Since our paper aims to contribute to this strand of literature, let us point out that the latter type of partial information about the dependence structure used in [27] is similar in spirit to our approach. We emphasize that Lux and Papapantoleon use statistical distances which are different from the transportation distance dc defined in the previous subsection.

Model Ambiguity
There is an obvious connection of problem (1), which is studied in this paper, with the following minimax stochastic optimization problem

    min_{x ∈ X} sup_{Q ∈ Q} E_Q[f(x, ξ)],    (6)

where X ⊂ R^m, f : R^m × Ξ → R, ξ is a random vector whose distribution Q is supported on Ξ ⊂ R^d, and Q is a nonempty set of probability distributions, referred to as the ambiguity set. Problems of this form recently became known as distributionally robust stochastic optimization problems. As pointed out by Shapiro [43], there are two natural and somewhat different approaches to constructing the ambiguity set Q. On the one hand, ambiguity sets have been defined by moment constraints, see Delage and Ye [15] and references therein. An alternative approach is to assume a reference probability distribution Q is given and define the ambiguity set by all distributions which are in the neighborhood of Q as measured by a statistical distance. To the best of our knowledge, two distinct choices of this statistical distance have been established in the literature: the φ-divergence and the Wasserstein distance. Concerning ambiguity sets constructed using the φ-divergence, we refer to Bayraksan and Love [4] and references therein. In the following, we focus on approaches which rely on the Wasserstein distance to account for model ambiguity. Pflug and Wozabal [33] were the first to study these particular ambiguity sets. Esfahani and Kuhn [29] showed that distributionally robust stochastic optimization problems over Wasserstein balls centered around a discrete reference distribution possess a tractable reformulation: under mild assumptions these problems belong to the same complexity class as their non-robust counterparts. The duality result driving this insight was also proven by Blanchet and Murthy [11], Gao and Kleywegt [21] and Bartl et al. [3], based on different techniques and assumptions. These contributions indicate that distributionally robust stochastic optimization using the Wasserstein distance has developed into an active field of research in recent years. For instance, Zhao and Guan [46] and Hanasusanto and Kuhn [25] adapted similar ideas in the context of two-stage stochastic programming, and Chen et al. [14] and Yang [45] study distributionally robust Markov decision processes using the Wasserstein distance. Obloj and Wiesel [31] analyze a robust estimation method for superhedging prices relying on a Wasserstein ball around the empirical measure. Most relevant in the context of our paper are the following two references: Gao and Kleywegt [23] impose two Wasserstein-type constraints on the probability distribution Q in (6): Q has to be close in Wasserstein distance to a reference distribution, while the dependence structure implied by Q has to be close, again in Wasserstein distance, to a specified reference dependence structure. In their follow-up paper, Gao and Kleywegt [22] consider problem (1) in the context of stochastic optimization, i.e. in the framework (6). As indicated in Remark 1, they derive Theorem 1 for upper semicontinuous functions f satisfying the growth condition sup_{x ∈ X} f(x)/c(x, y0) < ∞ for some y0 ∈ X. Furthermore, these authors consider a rather particular cost function c(·, ·). Note that Corollary 3, i.e. the linear programming reformulation of the derived dual problem (4), is also derived by Gao and Kleywegt [22].

Neural networks in finance and optimization
Applications of neural networks have vastly increased in recent years. Most of the popularity arose from successes of neural networks related to data representation tasks, e.g. related to pattern recognition, image classification, or task-specific artificial intelligence. In contrast to such a utilization, neural networks have also been applied strictly as a tool to solve certain optimization problems. This is the way we use neural networks in this paper, and they have found similar uses in various areas related to finance. Among others, they were applied to solving high dimensional partial differential equations and stochastic differential equations (see e.g. Beck et al. [5], Berner et al. [9] and Weinan et al. [44]) as well as backward stochastic differential equations (see Henry-Labordere [26]), in optimal stopping (Becker et al. [6]), optimal hedging with respect to a risk measure (Buehler et al. [12]), and superhedging (Eckstein and Kupper [16]).
For more classical learning tasks where neural networks are applied, ideas from optimal transport and distributional robustness are also used. While the settings are often quite different in nature from the one in this paper, the optimization problems which are eventually implemented are nevertheless similar. Most related to the current paper are settings in which constraints of optimal transport type are handled via a penalization or regularization method. Examples include generative models for images (see e.g. Gulrajani et al. [24] and Roth et al. [39]), optimal transport and the calculation of barycenters for images (see e.g. Seguy et al. [42]), or distributional robustness methods applied to learning tasks (see e.g. Blanchet et al. [10], Gao et al. [20]).

Duality
Let X = X1 × · · · × Xd be a Polish space, and denote by P(X) the set of all Borel probability measures on X. Throughout, we fix a reference probability measure μ ∈ P(X). For i = 1, . . ., d, we denote by μi ∈ P(Xi) the i-th marginal distribution of μ. Denote by Cκ(X) and Uκ(X) the spaces of all continuous, respectively upper semicontinuous, functions f : X → R such that f/κ is bounded.
In the following we fix a continuous function c : X × X → [0, ∞) such that c(x, x) = 0 for all x ∈ X. The cost of transportation between μ and µ in P(X) with respect to the cost function c is defined as

    dc(μ, µ) := inf_{π ∈ Π(μ, µ)} ∫ c dπ,    (7)

where Π(μ, µ) denotes the set of all couplings of μ and µ. Moreover, Π(μ1, . . ., μd) denotes the set of all µ ∈ P(X) such that µi = μi for all i = 1, . . ., d.
The elements in Π(μ1, . . ., μd) are referred to as couplings of the marginals μ1, . . ., μd. Although the computation of the convex conjugate in the following result relies on Bartl et al. [3], we do not need their growth condition on the cost function c. The main reason we do not require this condition is that continuity from above of the functional (8) - which corresponds to tightness of the considered set of measures - is already obtained by the imposed marginal constraints.
Proof. 1) Define the optimal transport functional ψ1. By assumption, ∫_{Xi} κi dμi < +∞ for all i = 1, . . ., d. Hence, it follows from Urysohn's lemma that there exist Ki ⊂ Xi compact and gi ∈ Cκi(Xi) such that gi ≥ 0, gi = M κi on Ki^c and ∫_{Xi} gi dμi < ε/3d. By Dini's lemma, a suitable uniform approximation exists. This shows that ψ1 is continuous from above on Cκ(X). Moreover, its convex conjugate is given by (9) for all µ ∈ Pκ(X) (and equals +∞ else), where Pκ(X) denotes the set of all µ ∈ P(X) such that κ ∈ L1(µ). Notice that Π(μ1, . . ., μd) ⊂ Pκ(X).
3) For f ∈ Uκ(X) define the convolution (10). For the associated convex conjugates it follows from (9) and (10) that ψ : Uκ(X) → R. By definition, ψ is convex and increasing. Moreover, ψ is continuous from above on Cκ(X), since for every decreasing sequence we can use that ψ1 is continuous from above on Cκ(X) by the first step. Since also ψ*Cκ = ψ*Uκ on Pκ(X) by the third step, it follows from [2, Theorem 2.2] and [2, Proposition 2.3] that ψ has the dual representation for each radius ρ ≥ 0.
From a computational point of view the penalty function ϕ(x) = x is of particular interest since the optimization in Theorem 2 over the Lagrange multiplier λ disappears.
In particular, if c i (•, •) is such that constraint (14) can be written as a linear constraint, then the above problem can be solved by means of linear programming.
Proof. Due to the assumptions that Xi = {x^1_i, . . ., x^n_i} and μ = (1/n) Σ_{j=1}^n δ_{x^j}, the term ∫_{Xi} hi dμi in (11) can be written as (1/n) Σ_{j=1}^n hi(j), where hi(j) := hi(x^j_i) ∈ R for j = 1, . . ., n. Hence, putting all the assumptions in Corollary 3 together, the dual problem (11) can be reformulated as a finite-dimensional minimization over λ ≥ 0 and the values hi(j). Introducing the auxiliary variables g(j) ∈ R and ui(j, m) ∈ R, where i, j = 1, . . ., n and m = 1, . . ., M, in order to remove the two max functions, yields the assertion.
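The removal of a max function by an auxiliary variable works as in the following toy linear program, where a variable u replaces max(0, a) for a fixed input a; the helper name and the numbers are our own illustration.

```python
import numpy as np
from scipy.optimize import linprog

def max0_via_lp(a):
    # linearize t = max(0, a) inside a minimization: introduce u with
    # u >= 0 and u >= a, and minimize u; at the optimum u = max(0, a)
    res = linprog(c=[1.0], A_ub=[[-1.0]], b_ub=[-a], bounds=[(0, None)])
    return res.fun

hi = max0_via_lp(0.7)   # max(0, 0.7) = 0.7
lo = max0_via_lp(-0.3)  # max(0, -0.3) = 0.0
```

This trick is valid because the max terms enter the objective with nonnegative weight, so the relaxation u ≥ max(0, a) is tight at the optimum; the variables g(j) and ui(j, m) above play exactly this role.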

Penalization
The aim of this section is to modify the functional (8) so that it allows for a numerical solution by neural networks. To focus on the main ideas, we assume that κ is bounded, i.e. we restrict to continuous bounded functions, as well as ϕ = ∞ · 1_(ρ,∞) as in the overview in Section 1.2. Hence, in line with Corollary 1 we consider the functional (16) for all f ∈ Cb(X) and a fixed radius ρ > 0. For simplicity, we assume that the function f^{λc}(x) = sup_{y ∈ X} {f(y) − λc(x, y)} is continuous for all λ ≥ 0 and f ∈ Cb(X). In that case, the functional can be written as φ1 in (17). The functional φ1 can be regularized by penalizing the inequality constraint. To do so, we consider the functional (18) for a sampling measure θ ∈ P(X²) and a penalty function βγ(x) := (1/γ) β(γx), γ > 0, where β : R → [0, ∞) is convex, nondecreasing, differentiable, and satisfies β(x)/x → ∞ for x → ∞. Let βγ*(y) := sup_{x ∈ R} {xy − βγ(x)} for y ∈ R+, and notice that βγ*(y) = (1/γ) β*(y).

Proof. Observe that for every
where the right hand side is equal to φθ,γ(f). This follows from the dominated convergence theorem applied to the sequence fn(x, y) = min{n, g(x) + Σ_{i=1}^d hi(yi) + λc(x, y)}. As for the calculation of the convex conjugate, we first show that φθ,γ*(π) = ∞ whenever π1 ≠ μ or π2 ∉ Π(μ1, . . ., μd). Indeed, since it follows that φθ,γ is bounded above by a multi-marginal transport problem, and the respective convex conjugate is +∞, it follows that φθ,γ*(π) = ∞ for all π ∈ P(X²) such that π1 ≠ μ or π2 ∉ Π(μ1, . . ., μd). Here, the second equality follows by substituting f̃(x, y) = f(x, y) − Σ_{i=1}^d hi(yi) − g(x) and using the structure of the marginals of π. The third equality follows by setting fn = f + min{n, λc} and using the dominated convergence theorem. Finally, the fourth equality follows by a standard selection argument, see e.g. the proof of [2, Lemma 3.5].
(c) By restricting the infimum in (18) to those λ ≥ 0, where the last equality follows from (17). As for the second inequality, we use that µε ∈ P(X) is an ε-optimizer of (16). The proof is complete.

Examples
The aim of this section is to illustrate how the concepts introduced above can be used to numerically solve given problems with neural networks. To do so, we consider different examples of increasing difficulty. Throughout this section, we consider two different transportation distances, for which we fix the following notation: dc refers to the Wasserstein distance of order 1, which is obtained when using the cost function c(x, y) = |x1 − y1| + |x2 − y2| in the definition (7). On the other hand, dc^{1/2} denotes the second order Wasserstein distance with respect to the Euclidean metric, i.e. c(x, y) = ‖x − y‖2².

Expected maximum of two comonotone standard Uniforms
We start our exemplification with a toy example which is not connected to risk measurement. Consider the following problem

    φ(f1) := max_{µ ∈ Π(μ1, μ2) : dc(μ, µ) ≤ ρ} ∫ f1 dµ,    (21)

where μ1 = μ2 = U([0, 1]) are (univariate) standard uniformly distributed probability measures and μ is the comonotone copula. In other words, μ is a bivariate probability measure with standard uniformly distributed marginals which are perfectly dependent. In the notation of the previous section, we have that f1(x) = max{x1, x2}. In problem (21), we aim to compute the expected value of the maximum of two standard Uniforms under ambiguity with respect to the reference dependence structure, which is given by the comonotone coupling. Problem (21) possesses an analytic solution. The derivation of this solution can be found in Appendix A.1 and is based on the duality result in Corollary 1. Hence, problem (21) is well suited to benchmark the following two numerical solution methods: 1. We discretize the reference copula μ (and thereby the marginal distributions μ1 and μ2) and solve the resulting data-driven dual problem by means of linear programming (see Corollary 3). There are two distinct ways to discretize μ: a) We use Monte Carlo sampling. In the notation of Corollary 3, this means we sample n points x^1_1, . . ., x^n_1 in [0, 1] from the standard Uniform distribution. Then, we set x^j_2 = x^j_1 for j = 1, . . ., n. b) We set the points x^j_1 = x^j_2 = (2j − 1)/(2n) for j = 1, . . ., n. As the comonotone copula lives only on the main diagonal of the unit square, this deterministic discretization of μ in some sense minimizes the discretization error. The simple geometrical argument used to find this discretization can be applied only due to the special structure of the reference distribution at hand. Let us emphasize that method 1.a) can be applied to any reference distribution μ. On the other hand, method 1.b) can only be used in this particular example, as μ is given by the comonotone copula.
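The two discretizations 1.a) and 1.b) can be sketched as follows; evaluating the objective on the deterministic grid with ρ = 0 recovers the non-robust value E[max(U, V)] = 1/2 under the comonotone coupling (where max(U, V) = U). The function name is our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def discretize_comonotone(n, monte_carlo=True):
    # method 1.a): sample the first coordinate, copy it to the second;
    # method 1.b): deterministic grid x_j = (2j - 1)/(2n) on the diagonal
    if monte_carlo:
        x1 = rng.uniform(size=n)
    else:
        x1 = (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)
    return np.column_stack([x1, x1])

# sanity check for rho = 0 on the deterministic grid
pts = discretize_comonotone(250, monte_carlo=False)
val = np.maximum(pts[:, 0], pts[:, 1]).mean()
```

Both variants produce the n support points that the LP of Corollary 3 operates on; only the placement of the points differs, which is exactly what drives the accuracy gap observed in Figure 1.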
2. We solve the penalized version of the dual of problem (21), see equation (18). Let us comment on the choice of the penalization function βγ, the architecture of the employed neural networks and the sampling measure θ ∈ P([0, 1]^4). We set βγ(x) = (γ/2) max(0, x)² with γ = 1280. The general form βγ(x) = (γ/2) max(0, x)² has proven to be a solid choice for all considered problems, and the parameter γ should usually be set high enough to make the theoretical penalization error small (see Proposition 1 (c)), while not being so high as to cause numerical stability issues. Regarding the network structure: to approximate the space Cb(R^d) we use a 4-layer feed-forward ReLU network with hidden dimension 64·d. The hidden dimension is chosen sufficiently high so that either slightly decreasing or increasing it does not change the numerical outcomes. Concerning the sampling measure θ, for this example we consider two choices: a) θ = μ ⊗ U([0, 1]²), where μ is the given reference distribution and U([0, 1]²) denotes the uniform distribution on the unit square; b) a mixture which additionally places mass on μ ⊗ δx, where δx is understood as the stochastic kernel R² → P(R²), x ↦ δx. Hence, when sampling a point w ∈ [0, 1]^4 from θ, only three samples u1, v1, v2 ∈ [0, 1] drawn from a standard uniform distribution are required.
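A possible sampler for the two choices of θ is sketched below. Since the text only specifies that method 2.b) puts additional mass on the diagonal μ ⊗ δx, the equal mixture weight used here is our own illustrative assumption, as is the function name.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(n, diagonal_mass=0.5):
    """Sampler for theta on [0,1]^4 in the comonotone example.  The first
    pair (x1, x2) = (u, u) follows the reference measure.  With probability
    1 - diagonal_mass the second pair is uniform on the unit square
    (method 2.a); with probability diagonal_mass we set y = x, putting
    mass on the diagonal (method 2.b).  The weight 1/2 is an
    illustrative assumption, not taken from the paper."""
    u = rng.uniform(size=n)
    x = np.column_stack([u, u])       # x ~ comonotone reference
    y = rng.uniform(size=(n, 2))      # y ~ U([0, 1]^2)
    on_diag = rng.uniform(size=n) < diagonal_mass
    y[on_diag] = x[on_diag]           # diagonal component: y = x
    return np.hstack([x, y])

w = sample_theta(1000)
```

With `diagonal_mass=0.0` this reduces to the pure product measure of method 2.a).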
The choice of the sampling measure θ should be interpreted in view of Proposition 1 (c).
To allow for a small penalization error, we want to set the measure θ as similar as possible to a potential optimal coupling π*. Method 2.a) can be seen as the most oblivious approach in this respect: we know π*1 = μ and π*2 ∈ Π(μ1, . . ., μd) and now choose θ to link these constraints in the simplest way (i.e. as a product measure) as θ = μ ⊗ (μ1 ⊗ . . . ⊗ μd). Method 2.b), however, anticipates that an optimal coupling, and hence θ, should put some mass on the "diagonal" μ ⊗ δx. This diagonal is the optimal coupling for ρ = 0, and hence it is reasonable to anticipate mass in this region for other small values of ρ as well.
Figure 1 compares the two methods discussed above for solving problem (21) for different values of ρ. In the left panel of Figure 1, we observe that method 1.a) yields an unsatisfactory result even though n = 250 is chosen as large as possible for the resulting LP to be solvable on a commercial computer. This issue arises due to the poor quality of the discretization resulting from Monte Carlo simulation. If one chooses the discretization as in method 1.b), we recover the analytic solution of problem (21), as can be seen in the right panel of Figure 1. Moreover, Figure 1 indicates that method 2, i.e. the approach presented in this paper, yields quite good and stable results. The left panel, however, shows that for small ρ method 2.a) does not rediscover the true solution. The reason for this is that, when drawing random samples from the chosen sampling measure θ = μ ⊗ U([0, 1]²), it is unlikely that in the third and fourth coordinate we sample from the relevant region, namely the main diagonal of the unit square. As discussed, method 2.b) is designed to overcome precisely this weakness, and the right panel of Figure 1 illustrates that it does. We finalize this example by considering the second order Wasserstein distance dc^{1/2}, defined above, rather than the first order Wasserstein distance dc. Thus, we compare problem (21) to its counterpart, problem (22), in which dc is replaced by dc^{1/2}. There are two important differences between problem (21) and problem (22): firstly, problem (22) cannot be solved by means of linear programming. Secondly, the derivation of an analytic solution for problem (22) is not as straightforward as for problem (21). Nevertheless, we can approximate the value of problem (22) using neural networks, which demonstrates the flexibility of this approach. Figure 2 compares the values of problems (21) and (22) for different ρ. It should be mentioned that problem (22) is well suited for the solution via penalization and neural networks: in contrast to problem (21), the choice of the sampling measure θ does not seem to impact the solution as much.

Average Value at Risk of two independent standard Uniforms
We increase the level of complexity slightly compared to the previous example, as we now turn to robust risk aggregation. We aim to compute AVaRα(U + V), where U and V are independent standard Uniforms, under ambiguity with respect to the independence assumption.

Figure 2: The analytic solution φ(f1) of problem (21), which uses the ℓ1-metric to define the transportation cost, compared to the numerical solution of problem (22), which uses the ℓ2-metric to define the transportation cost.
Note that the Average Value at Risk is defined by

    AVaRα(X) := inf_{τ ∈ R} { τ + (1/(1 − α)) E[max(X − τ, 0)] },

see Rockafellar and Uryasev [38]. Using the first order Wasserstein distance to construct an ambiguity set around the reference dependence structure, we are led to the following problem

    Φ2 := sup_{µ ∈ Π(μ1, μ2) : dc(μ, µ) ≤ ρ} AVaRα^µ(x1 + x2)    (23)
        = sup_{µ ∈ Π(μ1, μ2) : dc(μ, µ) ≤ ρ} inf_{τ ∈ R} ∫ f2^τ dµ    (24)
        = inf_{τ ∈ R} φ(f2^τ),    (25)

where μ1 = μ2 = U([0, 1]) are (univariate) standard uniformly distributed probability measures and μ is the independence copula. In other words, μ = U([0, 1]²) is a bivariate probability measure with independent, standard uniformly distributed marginals. Moreover, we have that f2^τ(x) = τ + (1/(1 − α)) max(x1 + x2 − τ, 0), and φ(·) is defined as in equation (1). Notice that in the above formulation of the problem we can go from (24) to (25) since the problem is convex in τ and concave in µ and Wasserstein balls are weakly compact. Thus, we can apply Sion's Minimax Theorem to interchange sup and inf in (24).
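The Rockafellar-Uryasev representation above can be checked empirically. The following sketch estimates AVaR from samples, using the fact that for an empirical distribution the infimum over τ is attained at an α-quantile; for a single standard Uniform the exact value is AVaRα(U) = (1 + α)/2. The function name is our own.

```python
import numpy as np

def avar(samples, alpha):
    """Empirical Average Value at Risk via the representation
    AVaR_a(X) = inf_t { t + E[max(X - t, 0)] / (1 - a) };
    for an empirical distribution the infimum is attained at an
    a-quantile, so it suffices to plug that quantile in for t."""
    t = np.quantile(samples, alpha)
    return t + np.maximum(samples - t, 0.0).mean() / (1.0 - alpha)

# single standard Uniform, alpha = 0.9: exact value (1 + 0.9)/2 = 0.95
rng = np.random.default_rng(0)
est = avar(rng.uniform(size=200_000), 0.9)
```

Applying the same estimator to samples of U + V under a candidate coupling µ gives the inner value of (24) for that coupling.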
In Appendix A.2, we derive analytic upper and lower bounds for Φ_2 in (23). These bounds are tight enough for the present purpose, which is to evaluate the performance of the two discussed numerical methods.
Figure 3 supports the latter claim: the analytic bounds for Φ_2 are rather tight when plotted as a function of ρ. The bounds are compared to the same two numerical methods as discussed in the previous example. For the solution based on Monte Carlo simulation and linear programming, we now average over 100 simulations for each fixed ρ. Thus, the results in Figure 3 do not fluctuate as much as those seen in the left panel of Figure 1. Nevertheless, Figure 3 shows that the solution obtained via MC and LP does not stay within the analytic bounds (unlike the solution via penalization and neural networks). Arguably this is due to the lack of symmetry when discretizing the reference distribution μ using Monte Carlo. Regarding runtime, both numerical methods take about the same time to calculate the values needed for Figure 3. We now want to illustrate a further merit of the neural networks approach, namely that we can sample from the numerical optimizer µ* of problem (23). By doing so, we obtain information about the structure of the worst case distribution. The samples are obtained by acceptance-rejection sampling from the density given by Proposition 1 (b), where we replace the true optimizers by numerical ones. Figure 4 plots samples of this worst case distribution µ* for different values of ρ. To understand the intriguing nature of the results presented in Figure 4, we have to describe problem (23) in some more detail. It should be clear that the comonotone coupling of the uniforms U and V maximizes AVaR_α(U + V) among all possible couplings of U and V. However, one can find many different maximizing couplings. Notably, the optimizer shown for ρ = 0.2 corresponds to the one which has the lowest relative entropy with respect to the independent coupling among the maximizers of AVaR_α(U + V). On the other hand, the middle panel for ρ = 0.16 motivated us to derive a coupling which, among maximizers of AVaR_α(U + V), we conjecture to have the lowest Wasserstein distance to the independent coupling. This is used to derive the lower bound for problem (23) in Appendix A.2. Some features of the other couplings, e.g. for ρ = 0.08 and ρ = 0.12, came as a surprise to us: for example, the curved lines bounding the support are unusual in an ℓ1-Wasserstein problem.

Figure 4: Samples of the worst case distribution µ* of problem (23), as obtained by the neural networks approach, are shown in the form of a heatplot for six different levels of ambiguity, i.e. ρ = 0, 0.04, 0.08, 0.12, 0.16, 0.2.
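The acceptance-rejection step described above can be sketched as follows. The surrogate density is a hypothetical stand-in that concentrates mass near the main diagonal; it is NOT the density from Proposition 1 (b), whose numerical optimizers are not reproduced here:

```python
import numpy as np

# Hedged sketch of acceptance-rejection sampling on the unit square. The
# surrogate density below is illustrative only; in the paper it would be the
# (unnormalized) worst case density from Proposition 1 (b) with numerical optimizers.
rng = np.random.default_rng(2)

def surrogate_density(u, v, scale=20.0):
    # hypothetical stand-in: mass concentrates near the main diagonal, max value 1
    return np.exp(-scale * np.abs(u - v))

def ar_sample(n, bound=1.0):
    # proposals from U([0,1]^2); bound must dominate the (unnormalized) density
    out = []
    while len(out) < n:
        u, v = rng.uniform(size=2)
        if rng.uniform() * bound <= surrogate_density(u, v):
            out.append((u, v))
    return np.array(out)

samples = ar_sample(2000)
```

The same loop applies verbatim to any bounded unnormalized density; only `surrogate_density` and the dominating constant `bound` change.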

Value at Risk of two independent standard Uniforms
The set-up remains the same as in the previous example, but we now consider the Value at Risk (VaR) rather than the AVaR. As we shall see, this minor adjustment yields a substantially harder problem. Note that VaR_α(Y) := inf{τ ∈ ℝ : P(Y ≤ τ) > α}, where α is typically close to one, e.g., α = 0.95. VaR_α(Y) can be interpreted as the smallest capital allocation τ for the random loss Y such that the probability that the loss Y does not exceed τ is at least α. Following Bartl et al. [3], we can robustify the computation of the Value at Risk by determining the capital τ with respect to the "worst case probability distribution" in a given ambiguity set. As in the previous example, this ambiguity set is assumed to be a Wasserstein ball with radius ρ. Hence, we are led to the following problem: Φ̄_3(ρ, α) := inf{τ ∈ ℝ : inf_{µ∈Π(μ1,μ2), d_c(μ,µ)≤ρ} E_µ[I_{x_1+x_2 ≤ τ}] > α} (26), where I_{·} denotes the indicator function. Thus, Φ̄_3(ρ, α) is the worst case VaR with ambiguity level ρ and confidence level α. In the following we shall also be interested in the best case VaR Φ̲_3(ρ, α), which is defined by substituting the second inf in (26) by a sup.

Figure 5: The worst case VaR Φ̄_3(ρ, α), defined in equation (26), and the best case VaR Φ̲_3(ρ, α), defined below equation (26), are plotted as functions of the confidence level α. We consider three different levels of ambiguity: ρ = 0, 0.05, ∞. Note that Φ̲_3(ρ = 0, α) = Φ̄_3(ρ = 0, α) = VaR^Π_α(U + V), where Π denotes the reference distribution in this example, i.e. the independence copula linking the two uniforms U and V. Moreover, Φ̲_3(ρ = ∞, α) resp. Φ̄_3(ρ = ∞, α) coincide with inf_{C∈C} VaR^C_α(U + V) resp. sup_{C∈C} VaR^C_α(U + V), which we use as short notation for the infimum resp. supremum of VaR_α(U + V) over all couplings µ ∈ Π(μ1, μ2) of U and V.
In general, the problem as formulated in (26) differs from problem (27), in which the worst case Value at Risk sup_{(U,V)∼µ∈Π(μ1,μ2), d_c(μ,µ)≤ρ} VaR^µ_α(U + V) is computed directly. In the following we will focus on problem (26), rather than on problem (27), since it can be numerically solved with neural networks based on the results presented in this paper. Notice, however, that if ρ is chosen so large that the constraint d_c(μ, µ) ≤ ρ does not impose a restriction on the measures µ ∈ Π(μ1, μ2), then the solutions of the two problems (26) and (27) coincide.
In order to solve problem (26) with the tools introduced above, we proceed as follows: we solve the "inner problem" for different τ until we find a τ such that |φ(f_3^τ)| < ξ for some fixed accuracy ξ > 0. The solution of problem (26) is then approximately given by this τ.
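The outer search over τ just described can be sketched with a simple bisection. As a stand-in for the robust inner solver, we use the exact exceedance probability under the reference coupling (the ρ = 0 case), so φ(f_3^τ) reduces to P(U + V > τ) − (1 − α) and its root is VaR_α(U + V) for two independent standard uniforms; α and ξ are illustrative choices:

```python
# Sketch of the outer tau-search: bisect until |phi(f_3^tau)| < xi. The inner
# solver is stubbed by the analytic exceedance probability under independence.
alpha, xi = 0.95, 1e-6

def inner_value(tau):
    # P(U + V > tau) - (1 - alpha) for independent uniforms and tau in [1, 2]
    return (2.0 - tau) ** 2 / 2.0 - (1.0 - alpha)

lo, hi = 1.0, 2.0
tau = 0.5 * (lo + hi)
while abs(inner_value(tau)) >= xi:      # stop once |phi(f_3^tau)| < xi
    if inner_value(tau) > 0:            # too much mass above tau: increase tau
        lo = tau
    else:
        hi = tau
    tau = 0.5 * (lo + hi)
```

In the paper's setting one would replace `inner_value` by the penalized neural-network estimate of φ(f_3^τ); the bisection logic is unchanged because the inner value is monotone in τ.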
It should be clear that Corollary 3 does not apply to problem (28). In contrast to the previous two examples, we therefore cannot solve this problem by means of linear programming. This insight explains why our approach with neural networks is extremely useful in practice: it does not require the function f_3^τ to be of any restrictive form. The outcome of this last toy example is displayed differently from the previous two examples. We fix three values for the radius ρ of the considered Wasserstein ball and plot the worst case VaR Φ̄_3(ρ, α) and the best case VaR Φ̲_3(ρ, α) as functions of α. For the first value ρ = 0, Figure 5 shows that, as expected, Φ̲_3(ρ = 0, α) = Φ̄_3(ρ = 0, α) = VaR^Π_α(U + V), where U and V are independent standard uniforms (hence the letter Π in the superscript of VaR). For ρ = 0.05, Figure 5 illustrates how a small level of ambiguity with respect to the reference dependence structure affects the best (resp. worst) case VaR Φ̲_3(ρ = 0.05, α) (resp. Φ̄_3(ρ = 0.05, α)). Finally, we observe that for large enough ρ, problems (26) and (27) indeed coincide, as we recover the known VaR bounds, see e.g. Frank et al. [19].
To conclude this section, let us highlight the main insights obtained from the three toy examples. First, we studied an approach that relies on Corollary 3 and thereby on linear programming. A major drawback of this approach is that the resulting LP can only be solved efficiently when the discrete reference distribution μ lives on relatively few points, i.e. n ≤ 250. If the reference distribution μ is constructed from a limited number of observations, this might not be a limitation. If the reference distribution μ is continuous, however, one has to approximate μ by a discrete distribution supported on a small number of points. We have seen that when relying on Monte Carlo simulation to perform this discretization, one has to average the solutions of many independent simulations. Even then, the resulting average solution can be relatively far away from the true solution.
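To make the LP route concrete, here is a hedged toy sketch of the discretized problem: a reference measure μ (independent uniforms on an n × n midpoint grid) is transported to a measure µ with the same uniform marginals, subject to a budget ρ on the ℓ1 transport cost. The grid size, integrand f and cost are illustrative choices, not the exact formulation of Corollary 3:

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP sketch: variables gamma[k, l] couple grid point k of the reference
# measure with grid point l of the candidate measure mu.
n = 5
xs = (np.arange(n) + 0.5) / n
Z = [(x, y) for x in xs for y in xs]                   # n^2 grid points
m = len(Z)
p_hat = np.full(m, 1.0 / m)                            # reference: independence

f = np.array([max(x + y - 1.5, 0.0) for (x, y) in Z])  # AVaR-type integrand (illustrative)
cost = np.array([[abs(a - c) + abs(b - d) for (c, d) in Z] for (a, b) in Z])

def robust_value(rho):
    obj = -np.tile(f, m)                 # maximize sum_{k,l} gamma[k,l] f(Z[l])
    A_eq, b_eq = [], []
    for k in range(m):                   # coupling must disintegrate the reference
        row = np.zeros(m * m)
        row[k * m:(k + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(p_hat[k])
    for x in xs:                         # first marginal of mu stays uniform
        row = np.zeros(m * m)
        for l, (a, b) in enumerate(Z):
            if a == x:
                row[l::m] = 1.0
        A_eq.append(row); b_eq.append(1.0 / n)
    for y in xs:                         # second marginal of mu stays uniform
        row = np.zeros(m * m)
        for l, (a, b) in enumerate(Z):
            if b == y:
                row[l::m] = 1.0
        A_eq.append(row); b_eq.append(1.0 / n)
    res = linprog(obj, A_ub=cost.reshape(1, -1), b_ub=[rho],
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return -res.fun
```

Already at n = 5 the LP has n⁴ = 625 variables; the quartic growth in n is exactly why the method stops being practical beyond a few hundred support points.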
As an alternative to linear programming, we presented a second method based on penalization and neural networks. The above examples indicate that this second method generally performs better in terms of computational effort as well as solution quality. It should be emphasized that the solution method using neural networks has a theoretical drawback: since the optimization problem is not convex, there is no guarantee of global convergence. Nevertheless, we have shown that this approach can give correct and insightful results. Compared to the linear programming approach, it has two additional merits: first, we can determine the structure of the worst case distribution; second, we can solve a much larger class of problems.

DNB case study: Aggregation of six given risks
Aas and Puccetti [1] provide a very illustrative case study of risk aggregation at DNB, Norway's largest bank. We use this example to showcase the applicability of the framework presented in this paper.
The DNB is exposed to six different types of risk: credit, market, asset, operational, business and insurance risk. Let the random variables L_1, ..., L_6 represent the marginal risk exposures for these six risks. By definition, risk aggregation is not concerned with the computation of the distributions of the marginal risks. Hence, we take the corresponding marginal distribution functions F_1, ..., F_6 as given. In this particular case, F_1, F_2 and F_3 are empirical cdfs originating from given samples, while L_4, L_5 and L_6 are assumed to be log-normally distributed with given parameters, see Table 1.
For the purpose of risk management, the DNB needs to determine the capital to be reserved. According to the Basel Committee on Banking Supervision [32], this capital requirement should be computed by the Average Value at Risk (AVaR) of the sum of these six losses. The AVaR of the sum of the six losses at a specific confidence level α is given by AVaR_α(L_6^+), where L_6^+ := Σ_{i=1}^{6} L_i, see (29). To evaluate expression (29), the joint distribution of L_1, ..., L_6 is needed. As the marginal distributions of L_1, ..., L_6 are known, the DNB relies on the concept of copulas to model the dependence structure between these risks. From the above description, it is clear that joint observations of L_1, ..., L_6 are not available. Hence, standard techniques to determine the copula, e.g., fitting a copula family and the corresponding parameters to a multivariate data set, do not apply. A panel of experts at the DNB therefore chooses a specific reference copula C_0, in this case a student-t copula with six degrees of freedom and a particular correlation matrix. Such an approach is common in practice and is referred to as expert opinion.
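Aggregation under an expert-chosen t-copula with given marginals can be sketched as follows. All numbers below (dimension 3 instead of 6, the correlation matrix, the lognormal parameters, the sample size) are purely illustrative and are NOT the DNB figures from Table 1:

```python
import numpy as np
from scipy import stats

# Hedged sketch: a student-t copula (df = 6, as in the expert choice above) links
# lognormal marginals; we aggregate by summation and read off VaR and AVaR.
rng = np.random.default_rng(3)
df, alpha, n = 6, 0.95, 200_000
corr = np.array([[1.0, 0.3, 0.2],      # illustrative correlation matrix
                 [0.3, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
sigmas = [0.5, 0.4, 0.6]               # illustrative lognormal shape parameters

# sample the t copula: X = Z / sqrt(W/df) with Z ~ N(0, corr), W ~ chi2(df)
z = rng.multivariate_normal(np.zeros(3), corr, size=n)
w = rng.chisquare(df, size=n)
x = z / np.sqrt(w / df)[:, None]
u = np.clip(stats.t.cdf(x, df), 1e-12, 1 - 1e-12)   # uniforms with t-copula dependence

# push the uniforms through the marginal quantile functions, then aggregate
losses = np.column_stack([stats.lognorm.ppf(u[:, i], s) for i, s in enumerate(sigmas)])
total = losses.sum(axis=1)

var = np.quantile(total, alpha)
avar = total[total > var].mean()       # empirical tail average beyond VaR
```

In the DNB case the same pipeline would use the six marginals of Table 1 (three empirical cdfs, three lognormals) and the experts' correlation matrix Σ_0.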
From an academic point of view, this method for risk aggregation is not very satisfying, since the experts' choice of a reference dependence structure between the different risk types might be very inaccurate. Hence, we say that there is model ambiguity with respect to the dependence structure. It should be emphasized that a misspecification of the reference copula chosen by expert opinion can have a significant impact on the aggregated risk and therefore on the required capital. Table 2 supports this statement by comparing the AVaR implied by the reference copula C_0 to the AVaR implied by other dependence structures: without any information regarding the dependence structure between the six risks, the lower (resp. upper) bound for the AVaR with confidence level α = 0.95 is 24165.52 (resp. 36410.12) million Norwegian kroner. Similar bounds are studied in Aas and Puccetti [1]. As we pointed out in the literature review in Section 1, such dependence-free bounds have been criticized in the literature since they are too far apart for practical purposes. We therefore apply the results derived in this paper to compute bounds for the AVaR which depend on the level ρ of distrust concerning the reference copula C_0. Alternatively, the parameter ρ can be understood as the level of ambiguity with respect to the reference distribution μ.

Table 2: Comparison of the AVaR of the aggregate loss under different dependence structures; the marginal risks enter via AVaR_α(L_i). The two remaining entries are computed by averaging over 50 simulation runs, where 10 million sample points are drawn in each run. Note that Π denotes the independence copula; thus, AVaR^Π_α(L_6^+) corresponds to the AVaR of the sum of the six losses given that they are independent.
We define the probability measure μ of the reference distribution through the joint cumulative distribution function F(x) = C_0(F_1(x_1), F_2(x_2), ..., F_6(x_6)) for all x ∈ ℝ^6. Hence, the cdfs of the marginals μ_i are given by F_i(·) for i = 1, 2, ..., 6. The problem of interest can be formulated as follows: Φ̲_4^{C_0}(α, ρ) := inf_{L_6^+∼µ∈Π(μ1,...,μ6), d_c(μ,µ)≤ρ} AVaR^µ_α(L_6^+) (30), and Φ̄_4^{C_0}(α, ρ), defined analogously with the inf replaced by a sup (31). The cost function c defining the transportation distance d_c in problems (30) and (31) is set to c(x, y) = Σ_{i=1}^{6} |x_i − y_i| / σ_i (32), where σ_i denotes the standard deviation of μ_i and is given in Table 1. The rationale behind this definition of c is that we want to model the ambiguity such that it concerns solely the dependence structure of the reference distribution; definition (32) is a simple way to achieve this. Figure 6 shows the numerical solutions of problems (30) and (31), computed via penalization and neural networks, as a function of ρ for α = 0.95. As a comparison, the same problem is also solved with respect to the independence coupling Π rather than the reference copula C_0 described in Table 1. The shaded regions outline the possible levels of risk for a given level of ambiguity ρ and the two reference structures.

Figure 6: We consider two distinct reference dependence structures, the student-t copula C_0 defined in Table 1 and the independence copula Π. The corresponding robust solutions Φ̲_4^{C_0}(α, ρ) and Φ̄_4^{C_0}(α, ρ), defined in (30) and (31), resp. Φ̲_4^Π(α, ρ) and Φ̄_4^Π(α, ρ), defined analogously, are plotted as a function of the level of ambiguity ρ. We compare these results, computed with the method presented in this paper, to the known values of AVaR_α(L_6^+) given in Table 1. Note that we fix α = 0.95.
On the one hand, the evolution of the risk levels in ρ, combined with the corresponding optimizers of problems (30) and (31), can be used as an informative tool to better understand the risk the DNB is exposed to. On the other hand, if a certain level of ambiguity is justified in practice, the bank can assign its capital based on the corresponding worst-case value. If, for example, ρ = 0.1 is decided on, the bank would have to assign 32490 million kroner of capital, compared to the 30499 dictated by the reference structure C_0.
Analytically, one striking feature of the numerical solution with respect to C_0 is worth pointing out: the absolute upper bound is already attained for ρ ≈ 0.8, while the distance from the reference measure to the comonotone joint distribution can be calculated to be around 1.7. This underlines the fact that even though the comonotone distribution is a maximizer of the worst case AVaR, there are several more maximizers, and they may be structurally much more plausible than the comonotone one.
In conclusion, this paper introduces a flexible framework to aggregate different risks while accounting for ambiguity with respect to the chosen dependence structure between these risks.Moreover, the proposed numerical method allows us to perform this task without making restrictive assumptions about either the particular form of the aggregation functional, or the considered distributions, or the specific way to account for the model ambiguity.

A. Appendix
A.1. Proof for Section 3.1. We want to derive the analytic solution of problem (21). To do so, the concept of copulas turns out to be rather useful; we refer to Nelsen [30] for an introduction to this topic. Let C denote the set of all copulas and let the comonotone copula be denoted by M(u_1, u_2) = min(u_1, u_2) for all u_1, u_2 ∈ [0, 1]. Using this notation, we can rewrite problem (21) and show the following. It follows that we can assume ρ ≤ 0.5 for the remainder of the proof. Let us define the copula R_α as follows: Plugging in the value λ = 0.5 and setting h_1(u) = h_2(u) = u/2 yields the claimed upper bound for φ(f_1).
A.2. Proof for Section 3.2. We now derive the analytic bounds for Φ_2, i.e. the solution of problem (23), which are plotted in Figure 3.
Let us start by proving the following upper bound, where Φ_2 is defined in (23):

Figure 1 :
Figure 1: In the left panel, the analytic solution φ(f_1) of problem (21) is plotted as a function of ρ and compared to the corresponding numerical solutions obtained by methods 1.a) and 2.a), which are described in Section 3.1. The right panel shows the same for the improved methods 1.b) and 2.b).

Figure 2 :
Figure 2: The analytic solution of problem (21), which uses the ℓ1-metric to define the transportation cost, is compared to the numerical solution of problem (22), which uses the ℓ2-metric to define the transportation cost.

Figure 3 :
Figure 3: The analytic upper and lower bounds of problem (23) are compared to two distinct numerical solutions. The first is obtained by Monte Carlo simulation with n = 100 sample points and linear programming, averaged over 100 simulations for each fixed ρ. The second is obtained by penalization and neural networks. The confidence level of the AVaR considered in problem (23) is set to α = 0.7.

Table 1 :
Table 1: Overview of the information concerning the reference distribution in the DNB case study. The correlation matrix Σ_0 is given in Appendix A.3.