Learning the random variables in Monte Carlo simulations with stochastic gradient descent: Machine learning for parametric PDEs and financial derivative pricing

In financial engineering, prices of financial products are computed approximately many times each trading day with (slightly) different parameters in each calculation. In many financial models such prices can be approximated by means of Monte Carlo (MC) simulations. To obtain a good approximation the MC sample size usually needs to be considerably large, resulting in a long computing time to obtain a single approximation. A natural deep learning approach to reduce the computation time when new prices need to be calculated as quickly as possible would be to train an artificial neural network (ANN) to learn the function which maps the parameters of the model and of the financial product to the price of the financial product. However, empirically it turns out that this approach leads to approximations with unacceptably high errors, in particular when the error is measured in the L∞-norm, and it seems that ANNs are not capable of closely approximating prices of financial products in dependence on the model and product parameters in real life applications. This is not entirely surprising given the high-dimensional nature of the problem and the fact that it has recently been proved for a large class of algorithms, including the deep learning approach outlined above, that such methods are in general not capable of overcoming the curse of dimensionality for such approximation problems in the L∞-norm. In this article we introduce a new numerical approximation strategy for parametric approximation problems, including the parametric financial pricing problems described above, and we illustrate by means of several numerical experiments that the introduced approximation strategy achieves a very high accuracy for a variety of high-dimensional parametric approximation problems, even in the L∞-norm.
A central aspect of the approximation strategy proposed in this article is to combine MC algorithms with machine learning techniques to, roughly speaking, learn the random variables (LRV) in MC simulations. In other words, we employ stochastic gradient descent (SGD) optimization methods not to train parameters of standard ANNs but instead to learn the random variables appearing in MC approximations. In that sense, the proposed LRV strategy has strong links to Quasi-Monte Carlo (QMC) methods as well as to the field of algorithm learning. Our numerical simulations strongly indicate that the LRV strategy might indeed be capable of overcoming the curse of dimensionality in the L∞-norm in several cases where the standard deep learning approach has been proven not to be able to do so. This does not contradict the established lower bounds mentioned above because this new LRV strategy is outside of the class of algorithms for which lower bounds have been established in the scientific literature. The proposed LRV strategy is of a general nature and not restricted to the parametric financial pricing problems described above, but applicable to a large class of approximation problems. In this article we numerically test the LRV strategy in the case of the pricing of European call options in the Black-Scholes model with one underlying asset, in the case of the pricing of European worst-of basket put options in the Black-Scholes model with three underlying assets, in the case of the pricing of European average put options in the Black-Scholes model with three underlying assets and knock-in barriers, as well as in the case of stochastic Lorenz equations. For these examples the LRV strategy produces highly convincing numerical results when compared with standard MC simulations, QMC simulations using Sobol sequences, SGD-trained shallow ANNs, and SGD-trained deep ANNs.



Introduction
Many computational problems from engineering and science can be cast as certain parametric approximation problems (cf., e.g., [12, 14, 27, 28, 40, 63, 72, 77, 95, 115] and the references therein). In particular, parametric PDEs are of fundamental importance in various applications in which one is not interested in an approximation of the solution of the approximation problem at one fixed (space-time) point only, but instead needs to evaluate the approximate solution again and again, as, for instance, in financial engineering, where prices of financial products are computed approximately many times each trading day with (slightly) different parameters in each calculation. Moreover, the problems appearing in such financial applications are often high-dimensional, as the dimension usually corresponds to the number of assets/financial contracts in the considered trading portfolio.
It is a widespread issue that the majority of algorithms for such parametric approximation problems suffer from the curse of dimensionality (cf., e.g., Bellman [11] and Novak & Woźniakowski [92, Chapter 1]) in the sense that the computational effort of the approximation method grows exponentially in the dimension of the approximation problem or in the reciprocal of the required approximation accuracy, making such methods useless for high-dimensional problems. In the information-based complexity literature there are fundamental lower bounds which reveal the impossibility of approximating the solutions of certain classes of high-dimensional approximation problems without the curse of dimensionality among general classes of approximation algorithms; see, e.g., Grohs & Voigtlaender [54], Heinrich [62], Heinrich & Sindambiwe [63], and Novak & Woźniakowski [91]. Developing methods which produce good approximations for high-dimensional problems is thus an exceedingly hard task and, among the general classes of algorithms considered in the above named references, essentially impossible.
In this paper we present a new method to tackle high-dimensional parametric approximation problems. Roughly speaking, our strategy is based on the idea to combine Monte Carlo (MC) algorithms (such as, e.g., standard MC methods, multilevel Monte Carlo (MLMC) methods, or multilevel Picard (MLP) methods) with stochastic gradient descent (SGD) optimization methods by viewing the employed realizations of random variables in the MC approximation as training parameters for the SGD optimization method. In other words, in this approach we intend to employ SGD optimization methods not to train standard artificial neural networks (ANNs) but to learn random variables appearing in MC approximations.
To make this idea more concrete, we now sketch this learning the random variables (LRV) approximation strategy in the context of a basic example of a parametric approximation problem. Let p ∈ N, suppose that we intend to approximate a target function u : [0, 1]^p → R, let ϕ : [0, 1]^p × R → R be measurable and bounded, let (Ω, F, P) be a probability space, let W : Ω → R be a random variable, and suppose that u : [0, 1]^p → R admits the probabilistic representation that for all p ∈ [0, 1]^p we have that (parametric integration problem; cf., e.g., Cohen & DeVore [28] and Heinrich & Sindambiwe [63])

u(p) = E[ϕ(p, W)]. (1)

We note that we only chose the random variable W to be 1-dimensional for simplicity and refer to Section 2 for the case when W is a possibly high-dimensional random variable. Our first step to derive the proposed approximation algorithm is to recall standard MC approximations for the parametric expectation in (1). Let M ∈ N, let W_{m,𝔪} : Ω → R, m, 𝔪 ∈ N_0, be i.i.d. random variables which satisfy for all A ∈ B(R) that P(W_{0,0} ∈ A) = P(W ∈ A), and observe for all p ∈ [0, 1]^p that

u(p) = E[ϕ(p, W)] ≈ (1/M) Σ_{𝔪=1}^{M} ϕ(p, W_{0,𝔪}). (2)

In the next step we introduce a parametric function Φ : [0, 1]^p × R^M → R which satisfies for all p ∈ [0, 1]^p, θ = (θ_1, ..., θ_M) ∈ R^M that

Φ(p, θ) = (1/M) Σ_{𝔪=1}^{M} ϕ(p, θ_𝔪) (3)

so that the right hand side of (2) agrees with Φ(p, (W_{0,1}, ..., W_{0,M})). Next we employ the SGD optimization method to train the right hand side of (3) in search of "better random realizations" to approximate the target function u than those provided by the random variables W_{0,𝔪} : Ω → R, 𝔪 ∈ {1, 2, ..., M}, in the standard MC approximation in (2). The random variables W_{0,𝔪} : Ω → R, 𝔪 ∈ {1, 2, ..., M}, on the right hand side of (2) then only supply the initial guess in the SGD training procedure. More specifically, let P_{m,𝔪} : Ω → [0, 1]^p, m, 𝔪 ∈ N, be i.i.d. uniformly distributed random variables, for every m ∈ N let the random loss function L_m : R^M × Ω → R satisfy for all θ ∈ R^M that

L_m(θ) = (1/M) Σ_{𝔪=1}^{M} |Φ(P_{m,𝔪}, θ) − ϕ(P_{m,𝔪}, W_{m,𝔪})|^2, (4)

let (γ_m)_{m∈N} ⊆ (0, ∞) be a sequence of learning rates, and let Θ : N_0 × Ω → R^M satisfy Θ_0 = (W_{0,1}, ..., W_{0,M}) and for all m ∈ N that

Θ_m = Θ_{m−1} − γ_m (∇_θ L_m)(Θ_{m−1}). (5)

Note that the recursion in (5) describes nothing else but the standard SGD optimization method with the learning rate schedule N ∋ m → γ_m ∈ R. For every sufficiently large m ∈ N we then propose to employ the random function

[0, 1]^p × Ω ∋ (p, ω) → Φ(p, Θ_m(ω)) ∈ R (6)

as an approximation for the target function [0, 1]^p ∋ p → u(p) ∈ R.
Note that the set of potential approximating functions [0, 1]^p ∋ p → Φ(p, θ) ∈ R, θ ∈ R^M, for u : [0, 1]^p → R does not consist of standard fully-connected feedforward ANNs but is a very problem specific class of approximating functions determined by the MC method in (2). In light of this and of the fact that the expression in (3) resembles the definition of single layer fully-connected feedforward ANNs, we refer to this class of approximating functions as MC neural networks. Moreover, observe that through (3) the MC method also naturally specifies a favorable initializing law for the SGD training procedure in (5).
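The LRV training loop sketched above can be illustrated in a few lines of NumPy for a toy parametric integration problem. In the sketch below we assume the illustrative integrand ϕ(p, w) = (p + w)² with W ~ N(0, 1), so that the exact target u(p) = p² + 1 is known and can be used to check the result; the integrand, the sample sizes, the learning rate, and the use of a fresh MC estimate as regression target are assumptions for this sketch, not the experimental setup of this article.

```python
import numpy as np

# Minimal, self-contained sketch of the LRV idea for a toy parametric
# integration problem: u(p) = E[phi(p, W)] with W ~ N(0, 1) and
# phi(p, w) = (p + w)**2, so that the exact target u(p) = p**2 + 1 is known
# and can be used to check the result (the training itself never sees it).
rng = np.random.default_rng(0)

def phi(p, w):
    return (p + w) ** 2

M = 32                                  # number of learned "random variables"
theta = rng.standard_normal(M)          # init: i.i.d. samples of W (MC proposal)

def Phi(p, theta):                      # "MC neural network": average of phi
    return np.mean(phi(p, theta))       # over the learned points theta_m

gamma = 0.1                             # constant learning-rate schedule
for step in range(3000):
    P = rng.uniform(0.0, 1.0)           # random training parameter in [0, 1]
    # a fresh, independent MC estimate of u(P) serves as a (noisy) target
    target = np.mean(phi(P, rng.standard_normal(4000)))
    resid = Phi(P, theta) - target
    # gradient of the squared loss (Phi(P, theta) - target)**2 w.r.t. theta_m
    grad = resid * (4.0 / M) * (P + theta)
    theta -= gamma * grad

print(abs(Phi(0.5, theta) - 1.25))      # error against the exact value u(0.5)
```

After training, the learned points theta play the role of the "better random realizations": the same averaging formula as in the plain MC estimator is used, only the evaluation points have been optimized.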
In the special case of the LRV strategy illustrated in (2)-(6) above, the standard MC approximation method serves as a proposal algorithm for the SGD training procedure. More generally, the LRV strategy can, in principle, be used for any high-dimensional approximation problem as soon as a reasonable stochastic proposal algorithm for the considered approximation problem is available. The LRV strategy thereby naturally specifies the compositional architecture of the involved approximating functions and also naturally specifies the initializing law in the SGD training procedure for each specific approximation problem. In particular, the LRV strategy can be used with the standard MC method (see Section 2), the MC-Euler-Maruyama method (see Section 3), or the MLMC method (see Section 4) as the proposal algorithm to approximate solutions of stochastic differential equations (SDEs) and Kolmogorov PDEs, and the LRV strategy can be used with the MLP method (see Section 5) as the proposal algorithm to approximate solutions of semilinear PDEs. We now illustrate the effectiveness of the LRV strategy on a simple but famous numerical example, where the LRV strategy leads to impressively precise approximations. In Table 1 we present results for the approximative computation of prices of European call options in the Black-Scholes model by means of the deep learning method induced by Becker et al.
[4] with 140000 Adam (see Kingma & Ba [73]) training steps (rows 2-5 in Table 1), by means of the standard MC method with 32768 MC samples (row 6 in Table 1), by means of the antithetic MC method with 32768 MC samples (row 7 in Table 1), by means of the Quasi-Monte Carlo (QMC) method using Sobol sequences with 32768 QMC samples (row 8 in Table 1), by means of the antithetic QMC method using Sobol sequences with 32768 QMC samples (row 9 in Table 1), by means of the LRV strategy with the standard MC method as the proposal algorithm with 140000 Adam training steps (row 10 in Table 1), and by means of the LRV strategy with the antithetic MC method as the proposal algorithm with 140000 Adam training steps (row 11 in Table 1). In Table 1 the L²-error and the L∞-error have been computed approximately on the region specified in (7) using 8 192 000 evaluations of the considered approximation method. In (7) we have that ξ stands for the initial price, T stands for the time of maturity, r stands for the drift rate, σ stands for the volatility, and K stands for the strike price; see, e.g., [10, Lemma 4.4] (with c = 0 in the notation of [10, Lemma 4.4]). The reference solution values used to compute the errors in Table 1 have been computed with the famous Black-Scholes formula (see, e.g., [10, Lemma 4.4] or (116) in Subsection 7.2). The evaluation time corresponds to the time required to compute 8 192 000 evaluations. We note that the training and evaluation times¹ of the LRV strategy are significantly longer than those of the deep learning method induced by Becker et al. [4], even though the considered MC neural networks involve fewer arithmetic operations than the considered ANNs. This is likely due to the highly optimized implementation and parallelization of standard feedforward ANNs in TensorFlow, whereas our own implementation of MC neural networks may be less computationally efficient.
The numbers in Table 1 are taken from Tables 2, 4, 5, and 6 in Subsection 7.2. We refer to Subsection 7.2 for more details on the results in Table 1.
Note that Table 1 indicates that the algorithm obtained by the LRV strategy not only produces very accurate prices in the L²-norm but even has a very high accuracy in the uniform L∞-norm. Concretely, for this 5-dimensional approximation problem, this strongly suggests that for any choice of parameters in the region considered in (7), the LRV strategy offers an approximation with an error smaller than 2/1000. Based on this, on the other numerical results presented in this paper, as well as on preliminary analytic investigations (cf., e.g., Gonon et al. [50, Lemma 2.16]) we conjecture that the LRV strategy can overcome the curse of dimensionality for certain classes of parametric PDE problems in the L∞-norm. We would like to emphasize that large classes of algorithms for such approximation problems have been shown not to be able to overcome the curse of dimensionality; see, e.g., Heinrich [62], Heinrich & Sindambiwe [63], and Grohs & Voigtlaender [54]. However, our conjecture does not contradict the general lower bounds established in the above mentioned references, due to the fact that the LRV strategy does not belong to the class of algorithms considered there: roughly speaking, in the LRV strategy there are two stages of computational procedures, the main computational procedure in which the "best random variables" are learned through SGD (corresponding to the 6th column in Table 1) and the evaluation procedure in which the computed approximation of the target function is evaluated (corresponding to the 7th column in Table 1). In the LRV strategy we consider the situation where it is allowed to perform function evaluations both during the main computational procedure and the evaluation procedure while the lower bounds in [54, 62, 63] consider the situation where function evaluations are only allowed during the main computational procedure but not during the evaluation procedure.
Being outside of the classes of algorithms considered in the above named references, the LRV strategy is thus not constrained by the established lower bounds and hence holds the potential to overcome the curse of dimensionality, even in the L∞-norm. We note that in practically relevant situations, just as in the considered derivative pricing problem, it is often possible to perform function evaluations also in the evaluation procedure, thus making the LRV strategy an applicable method for practically relevant approximation problems.
We now compare the proposed LRV strategy to existing algorithms and computational methods in the scientific literature. As the LRV strategy employs SGD-type methods to "learn" parametric functions it is related to deep learning methods, where instead of random variables, optimal weights of ANNs are "learned". From this point of view, the LRV strategy can be seen as a machine learning approach in which, instead of generic ANNs, very problem specific parametric functions are employed, which contain a lot of human insight about the problem at hand. There is a plethora of deep learning methods for the approximation of PDEs which have been developed recently and seem to be very effective for the approximation of high-dimensional PDEs; cf., e.g., [3, 4, 8, 10, 12, 20, 21, 25, 32, 34, 38, 42, 44, 51, 55-59, 64, 65, 69-71, 76, 78, 82-86, 93, 94, 101, 103, 104, 109, 112, 119, 120]. We refer to the survey articles [6, 17, 35, 45] for a more detailed overview. For methods which are specifically designed for parametric PDEs we refer, e.g., to [12, 14, 72, 95, 115]. In addition, we want to highlight a connection between the LRV strategy and the deep learning method for Kolmogorov PDEs developed in Becker et al. [4] and further specialized to parametric Kolmogorov PDEs in Berner et al. [12]. Very roughly speaking, the work [12] is concerned with approximating a function u : Γ → R which satisfies for all γ ∈ Γ that u(γ) = E[φ_γ(S_γ)], where φ_γ : R^d → R, γ ∈ Γ, are parametric functions and where for every γ ∈ Γ the random variable S_γ : Ω → R^d is the solution of an SDE parametrized by the parameter γ. In [12] they propose to approximate u by looking for a function f : Γ → R within a certain class of ANNs which aims to minimize the mean squared error E[|f(Λ) − φ_Λ(Y_Λ)|^2], where Λ : Ω → Γ is a random variable and where for every γ ∈ Γ we have that Y_γ : Ω → R^d is an approximation of S_γ (such as, e.g., an Euler-Maruyama approximation). If the class of ANNs in which an optimal function f is looked for is replaced by parametric functions induced by a proposal algorithm, this becomes a special case of the LRV framework presented in this paper.

¹The numerical experiments have been performed in TensorFlow 2.12 running on a system equipped with an NVIDIA GeForce RTX 4090 GPU with 24 GB Graphics RAM.
Furthermore, the LRV strategy can be associated with a broad subcategory of machine learning called algorithm learning (cf., e.g., [18, 22-24, 53, 87, 114, 116, 118]). Roughly speaking, algorithm learning refers to the idea of employing existing algorithms with known empirical or theoretical qualities as a basis to construct or extend ANNs or more general parametric function families. In many cases the employed algorithm relies on certain hyper-parameters (such as, e.g., the learning rates in the case of SGD-type methods; cf., e.g., Chen et al. [24]) which typically are added to the set of trainable parameters of the ANN. It is in this point that the LRV strategy differs from existing algorithm learning methods since the LRV strategy considers the random variables and not the hyper-parameters of the proposal algorithm as learnable parameters. It thereby has the advantage that the initialization of all trainable parameters is implicitly given through the proposal algorithm. However, the LRV strategy could very well be combined with ideas of algorithm learning. We leave this task open for future research.
Another tranche of literature connected to the LRV strategy are QMC methods (cf., e.g., [19, 33, 89]). Roughly speaking, the idea of QMC methods is to replace uniformly distributed random variables of the MC method by a more suitable sequence of deterministic points to obtain a higher rate of convergence. The idea to improve the choice of random variables in MC methods is a common feature of QMC methods with the LRV strategy. A key difference between the two methods is that QMC methods construct new integration points which have good properties for a wide class of integrands while the LRV strategy aims to "learn" new integration points which are specific to each considered integrand.
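The contrast between i.i.d. MC points and a Sobol sequence can be seen in a few lines using SciPy's `scipy.stats.qmc` module; the smooth integrand f(x, y) = exp(x + y) on [0, 1]² (with exact integral (e − 1)²) and the sample size are illustrative assumptions for this sketch.

```python
import numpy as np
from scipy.stats import qmc

# Contrast of i.i.d. MC points with a (deterministic) Sobol sequence for the
# illustrative smooth integrand f(x, y) = exp(x + y) on [0, 1]^2, whose exact
# integral is (e - 1)^2.
f = lambda x: np.exp(x[:, 0] + x[:, 1])
exact = (np.e - 1.0) ** 2

n = 2 ** 13                                         # 8192 integration points
rng = np.random.default_rng(0)
mc_pts = rng.random((n, 2))                         # i.i.d. uniform points
qmc_pts = qmc.Sobol(d=2, scramble=False).random(n)  # first n Sobol points

mc_err = abs(np.mean(f(mc_pts)) - exact)
qmc_err = abs(np.mean(f(qmc_pts)) - exact)
print(mc_err, qmc_err)  # the Sobol points typically yield a much smaller error
```

Note that the Sobol points are the same for every integrand, which is precisely the difference to the LRV strategy, where the evaluation points are adapted to the considered integrand.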
Next we illustrate a link between the LRV strategy and quantization methods (cf., e.g., [1, 31, 41, 90, 96-100, 102, 106]). Quantization methods are concerned with approximating a continuously distributed random variable by a random variable with a finite image. Quantization was first introduced for signal processing and information theory (cf., e.g., [52, 99]) but can also be used for the numerical approximation of expectations involving the original random variable (cf., e.g., [96, 99]). For the latter, the original random variable in the expectation is replaced by its quantization, resulting in an expected value which can easily be computed by a finite sum of weighted function evaluations. A good quantization is typically found with optimization methods such as, e.g., the Newton method in low dimensions or SGD-type optimization methods in high dimensions (cf., e.g., [97-99, 102]). To make this more concrete we now roughly illustrate this in the context of the setting described above. For all θ = (θ_1, ..., θ_M) ∈ R^M let W^θ : Ω → R be the quantization of W on the grid {θ_1, ..., θ_M}, that is, the random variable obtained by projecting W onto its nearest point among θ_1, ..., θ_M.
A good quantization is found by minimizing an error between the quantization and the original random variable such as, e.g., the squared L²-error given for all θ = (θ_1, ..., θ_M) ∈ R^M by

E[|W − W^θ|^2] (8)

with an SGD-type optimization method. Once appropriate θ = (θ_1, ..., θ_M) ∈ R^M have been found, they can be employed to approximate the expectation in (1) for every p ∈ P by

u(p) = E[ϕ(p, W)] ≈ Σ_{m=1}^{M} P(W^θ = θ_m) ϕ(p, θ_m). (9)

Note that this approximation has a similar form as the proposed LRV approximation in (6) and in both cases the points at which ϕ is evaluated are found through an SGD-type optimization method.
The main difference is that the optimization problem in (8), which is used to determine the evaluation points in (9), only depends on the random variable W whereas the optimization problem (see (4)) used to determine the evaluation points in (6) in the LRV strategy depends on both the random variable W and the function ϕ.
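A minimal sketch of this quantization idea, assuming W ~ N(0, 1), a grid of M = 8 points, and a simple competitive-learning (stochastic Lloyd-type) update, is the following; it is only a cartoon of the optimization methods in the cited quantization literature, and the integrand used at the end is the illustrative ϕ(p, w) = (p + w)².

```python
import numpy as np

# Cartoon of quantization: approximate W ~ N(0, 1) by a random variable with
# finite image {theta_1, ..., theta_M}, with the grid tuned by an SGD-type
# (competitive-learning / stochastic Lloyd) update.
rng = np.random.default_rng(0)
M = 8
theta = np.sort(rng.standard_normal(M))   # initial grid: i.i.d. samples of W

for step in range(1, 20001):
    w = rng.standard_normal()
    m = int(np.argmin(np.abs(theta - w)))             # nearest grid point
    theta[m] += (0.5 / step ** 0.6) * (w - theta[m])  # move it toward the sample

# weights P(W^theta = theta_m), estimated from fresh samples of W
w = rng.standard_normal(100000)
cells = np.argmin(np.abs(w[:, None] - theta[None, :]), axis=1)
prob = np.bincount(cells, minlength=M) / w.size

# quantized approximation of E[phi(p, W)] for phi(p, w) = (p + w)**2 at
# p = 0.5; the exact value is 0.5**2 + 1 = 1.25
approx = float(np.sum(prob * (0.5 + theta) ** 2))
print(approx)
```

As in the comparison above, the optimization of the grid only involves the law of W and is entirely independent of the integrand ϕ.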
The concept of optimizing random MC samples, a central aspect of the LRV strategy, also appears in the Bayesian statistics literature in the context of inducing points in Bayesian learning (cf., e.g., [13, 30, 105, 113]). Such inducing points are employed to find sparse representations of large data sets. One way to find good inducing points is to first sample them randomly and then optimize them, for example, with gradient based methods (cf., e.g., Snelson & Ghahramani [113]), which bears some resemblance to the LRV strategy.
The remainder of this article is structured as follows. In Sections 2, 3, 4, and 5 we present the LRV strategy for increasingly complex approximation problems and proposal algorithms. As proposal algorithms we consider the standard MC method in Section 2, the MC-Euler method in Section 3, the MLMC method in Section 4, and the MLP method in Section 5. In Section 6 we present the most general case of a generic proposal algorithm, which includes all previous sections as special cases. The results of numerical experiments for the LRV approximation strategy are presented in Section 7. Specifically, we consider 1-dimensional Black-Scholes equations for European call options (resulting in a 5-dimensional parametric approximation problem) in Subsection 7.2, we consider 3-dimensional Black-Scholes equations for worst-of basket put options (resulting in a 15-dimensional parametric approximation problem) in Subsection 7.3, we consider 3-dimensional Black-Scholes equations for average basket put options with knock-in barriers (resulting in a 16-dimensional parametric approximation problem) in Subsection 7.4, and we consider stochastic Lorenz equations (resulting in a 10-dimensional parametric approximation problem) in Subsection 7.5.

Learning the random variables (LRV) strategy in the case of Monte Carlo approximations
In this section we employ the LRV strategy for the approximation of parametric expectations (see (10) in Subsection 2.1 below).

Parametric expectations involving vector valued random variables
Let p, d ∈ N, let P ⊆ R^p be measurable, let u : P → R be a function, let (Ω, F, P) be a probability space, let W : Ω → R^d be a random variable, let ϕ : P × R^d → R be measurable, assume for all p ∈ P that R^d ∋ w → ϕ(p, w) ∈ R is continuously differentiable, and assume for all p ∈ P that E[|ϕ(p, W)|] < ∞ and

u(p) = E[ϕ(p, W)]. (10)

The goal of this section is to derive an algorithm to approximately compute the function u : P → R given through the parametric expectation in (10).

Approximative pricing of European options in the Black-Scholes model
Let T ∈ (0, ∞), r ∈ R, ξ, σ ∈ (0, ∞), let (Ω, F, (F_t)_{t∈[0,T]}, P) be a filtered probability space which satisfies the usual conditions, let W : [0, T] × Ω → R be a standard (F_t)_{t∈[0,T]}-Brownian motion with continuous sample paths, and let X : [0, T] × Ω → R be an (F_t)_{t∈[0,T]}-adapted stochastic process with continuous sample paths which satisfies that for all t ∈ [0, T] it holds P-a.s. that

X_t = ξ + ∫_0^t r X_s ds + ∫_0^t σ X_s dW_s.

The process X thus describes the price of a stock in the Black-Scholes model with initial price ξ, drift rate r, and volatility σ. The following elementary and well-known lemma recalls that X admits an explicit representation and that prices of European call options on X can be computed through the Black-Scholes formula.

Lemma 2.1. It holds (i) for all t ∈ [0, T] that P-a.s.

X_t = ξ exp((r − σ^2/2) t + σ W_t)

and (ii) for all K ∈ R that the price E[e^{−rT} max{X_T − K, 0}] of the associated European call option with strike price K coincides with the value given by the Black-Scholes formula (cf., e.g., [10, Lemma 4.4]).

Proof of Lemma 2.1. Observe that, e.g., [10, item (i) in Proposition 4.3] implies that for all t ∈ [0, T] it holds P-a.s. that X_t = ξ exp((r − σ^2/2) t + σ W_t). This establishes item (i). Moreover, note that, e.g., [10, Lemma 4.4] and (13) show for all K ∈ R that the expectation in item (ii) agrees with the Black-Scholes formula. This establishes item (ii). The proof of Lemma 2.1 is thus complete.
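The Black-Scholes price of a European call, E[exp(−rT) max(X_T − K, 0)] with X_T = ξ exp((r − σ²/2)T + σW_T), can be implemented and checked against a plain MC estimate as follows; the concrete parameter values are illustrative assumptions, not taken from the experiments in this article.

```python
import math
import numpy as np

# Black-Scholes price of a European call, E[exp(-r T) max(X_T - K, 0)] with
# X_T = xi * exp((r - sigma^2/2) T + sigma W_T), checked against a plain MC
# estimate; the parameter values are illustrative.
def bs_call(xi, T, r, sigma, K):
    d1 = (math.log(xi / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return xi * N(d1) - K * math.exp(-r * T) * N(d2)

xi, T, r, sigma, K = 100.0, 1.0, 0.05, 0.2, 100.0
exact = bs_call(xi, T, r, sigma, K)

rng = np.random.default_rng(0)
WT = math.sqrt(T) * rng.standard_normal(200000)           # W_T ~ N(0, T)
XT = xi * np.exp((r - 0.5 * sigma ** 2) * T + sigma * WT)
mc = math.exp(-r * T) * float(np.mean(np.maximum(XT - K, 0.0)))
print(exact, mc)  # the MC estimate fluctuates around the exact price
```

The closed-form price is exactly the kind of reference solution used to measure the errors of the approximation methods in Table 1.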

Monte Carlo approximations
Let M ∈ N and let W_{m,𝔪} : Ω → R^d, m, 𝔪 ∈ N_0, be i.i.d. random variables which satisfy for all A ∈ B(R^d) that P(W_{0,0} ∈ A) = P(W ∈ A). Observe that (10) suggests that for all p ∈ P it holds that

u(p) = E[ϕ(p, W)] ≈ (1/M) Σ_{𝔪=1}^{M} ϕ(p, W_{0,𝔪}).

Replacing the random variables in Monte Carlo approximations
Note that (16) and (17) suggest that for all p ∈ P and all suitable θ = (θ_1, ..., θ_M) ∈ (R^d)^M it holds that

u(p) ≈ (1/M) Σ_{𝔪=1}^{M} ϕ(p, θ_𝔪).

Random loss functions for fixed random variables in Monte Carlo approximations
Let P_{m,𝔪} : Ω → P, m, 𝔪 ∈ N, be i.i.d. random variables, for every m ∈ N let the random loss function L_m : (R^d)^M × Ω → R satisfy for all θ = (θ_1, ..., θ_M) ∈ (R^d)^M that

L_m(θ) = (1/M) Σ_{𝔪=1}^{M} | (1/M) Σ_{k=1}^{M} ϕ(P_{m,𝔪}, θ_k) − ϕ(P_{m,𝔪}, W_{m,𝔪}) |^2,

and for every m ∈ N let G_m : (R^d)^M × Ω → (R^d)^M satisfy for all θ ∈ (R^d)^M, ω ∈ Ω that G_m(θ, ω) = (∇_θ L_m)(θ, ω).

Learning the random variables with stochastic gradient descent
Let (γ_m)_{m∈N} ⊆ (0, ∞) be a sequence of learning rates and let Θ = (Θ_m^1, ..., Θ_m^M)_{m∈N_0} : N_0 × Ω → (R^d)^M satisfy Θ_0 = (W_{0,1}, ..., W_{0,M}) and for all m ∈ N that Θ_m = Θ_{m−1} − γ_m G_m(Θ_{m−1}). For every sufficiently large m ∈ N we propose to employ the random function P × Ω ∋ (p, ω) → (1/M) Σ_{𝔪=1}^{M} ϕ(p, Θ_m^𝔪(ω)) ∈ R as an approximation for the target function P ∋ p → u(p) ∈ R in (10).

LRV strategy in the case of Monte Carlo-Euler-Maruyama approximations
In the previous section we derived the LRV strategy for the approximation of parametric expectations involving finite-dimensional random variables. In this section we consider parametric expectations involving standard Brownian motions as random variables (see (24) in Subsection 3.1). To derive an LRV algorithm for these approximation problems we first discretize the Brownian motions in Subsection 3.2 to obtain parametric expectations involving only finite-dimensional random variables. In Subsections 3.4, 3.5, 3.6, and 3.7 the LRV strategy is then applied to these discretized parametric expectations as in Section 2 above (cf. Subsections 2.3, 2.4, 2.5, and 2.6). In Subsection 3.3 we illustrate a special instance of the parametric expectations in (24) and of associated discretizations in the situation where the parametric expectations in (24) involve solutions of SDEs. Finally, in Subsection 3.8 we summarize the algorithm derived in this section in one single framework.

Parametric expectations involving vector valued stochastic processes
Let p, d ∈ N, T ∈ (0, ∞), let P ⊆ R^p be measurable, let u : P → R be a function, let (Ω, F, P) be a probability space, let W : [0, T] × Ω → R^d be a standard Brownian motion with continuous sample paths, let ϕ : P × C([0, T], R^d) → R be measurable, and assume for all p ∈ P that E[|ϕ(p, W)|] < ∞ and

u(p) = E[ϕ(p, W)]. (24)

The goal of this section is to derive an algorithm to approximately compute the function u : P → R given through the parametric expectation in (24).

Euler-Maruyama approximations for parametric stochastic differential equations
Consider the case where P = [0, T] × R^d, where g : [0, T] × R^d → R is measurable, where for every ξ ∈ R^d, θ = (θ_1, ..., θ_N) ∈ R^{Nd} the process X^{ξ,θ} is the Euler-Maruyama approximation with N time steps, started at ξ, and driven by the increments determined by θ, of the additive noise driven SDE in (26), and where it holds for all p = (t, ξ) ∈ P, θ ∈ R^{Nd} that Φ(p, θ) = g(t, X^{ξ,θ}_t). Observe that (i) it holds for all p = (t, ξ) ∈ P that u(p) = E[g(t, X^{ξ,W}_t)] is the expectation of the test function R^d ∋ x → g(t, x) ∈ R evaluated at time t of the solution process (X^{ξ,W}_s)_{s∈[0,T]} of the additive noise driven SDE in (26) and (ii) it holds for all p = (t, ξ) ∈ P that E[Φ(p, W)], with W : Ω → R^{Nd} a standard normal random vector, is the corresponding MC-Euler-Maruyama approximation of u(p).
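An Euler-Maruyama discretization of an additive noise driven SDE can be sketched in a few lines; below we assume the illustrative drift μ(x) = −x (an Ornstein-Uhlenbeck process), for which E[X_T | X_0 = ξ] = ξe^{−T} is known exactly, rather than any specific example from this article.

```python
import numpy as np

# Euler-Maruyama sketch for the additive noise SDE dX_t = mu(X_t) dt + dW_t
# with the illustrative drift mu(x) = -x (an Ornstein-Uhlenbeck process), for
# which E[X_T | X_0 = xi] = xi * exp(-T) is known exactly.
rng = np.random.default_rng(0)
xi, T, N, M = 1.0, 1.0, 64, 100000
h = T / N
dW = np.sqrt(h) * rng.standard_normal((M, N))   # Brownian increments

X = np.full(M, xi)                              # M Euler-Maruyama trajectories
for n in range(N):
    X = X + (-X) * h + dW[:, n]                 # X_{n+1} = X_n + mu(X_n) h + dW_n

est = float(X.mean())
print(est)  # MC-Euler estimate of E[X_T]; the exact value is exp(-1) ~ 0.3679
```

The MC-Euler-Maruyama estimate is exactly the kind of proposal algorithm for the LRV strategy in this section: the Brownian increments in `dW` are the random variables that would subsequently be learned.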

Monte Carlo approximations
Let M ∈ N and let W_{m,𝔪} : Ω → R^{Nd}, m, 𝔪 ∈ N_0, be i.i.d. standard normal random vectors. Observe that (24) and (25) suggest that for all p ∈ P it holds that

u(p) ≈ (1/M) Σ_{𝔪=1}^{M} Φ(p, W_{0,𝔪}).

Replacing the random variables in Monte Carlo approximations
Note that (30) and (31) suggest that for all p ∈ P and all suitable θ = (θ_1, ..., θ_M) ∈ (R^{Nd})^M it holds that

u(p) ≈ (1/M) Σ_{𝔪=1}^{M} Φ(p, θ_𝔪).

Random loss functions for fixed random variables in Monte Carlo approximations
Let P_{m,𝔪} : Ω → P, m, 𝔪 ∈ N, be i.i.d. random variables, for every m ∈ N let the random loss function L_m : (R^{Nd})^M × Ω → R satisfy for all θ = (θ_1, ..., θ_M) ∈ (R^{Nd})^M that

L_m(θ) = (1/M) Σ_{𝔪=1}^{M} | (1/M) Σ_{k=1}^{M} Φ(P_{m,𝔪}, θ_k) − Φ(P_{m,𝔪}, W_{m,𝔪}) |^2,

and for every m ∈ N let G_m : (R^{Nd})^M × Ω → (R^{Nd})^M satisfy for all θ ∈ (R^{Nd})^M, ω ∈ Ω that G_m(θ, ω) = (∇_θ L_m)(θ, ω).

Learning the random variables with stochastic gradient descent
Let (γ_m)_{m∈N} ⊆ (0, ∞) be a sequence of learning rates and let Θ = (Θ_m^1, ..., Θ_m^M)_{m∈N_0} : N_0 × Ω → (R^{Nd})^M satisfy Θ_0 = (W_{0,1}, ..., W_{0,M}) and for all m ∈ N that Θ_m = Θ_{m−1} − γ_m G_m(Θ_{m−1}). For every sufficiently large m ∈ N we propose to employ the random function P × Ω ∋ (p, ω) → (1/M) Σ_{𝔪=1}^{M} Φ(p, Θ_m^𝔪(ω)) ∈ R as an approximation for the target function P ∋ p → u(p) ∈ R in (24).

LRV strategy in the case of multilevel Monte Carlo approximations
In this section we consider the problem of approximating a parametric expectation involving a general measure space valued random variable for which a sequence of finite-dimensional approximations is available (see Subsections 4.1 and 4.2). We illustrate such a situation in Subsection 4.4 in which the original random variable is a Brownian motion driving an SDE and the finite-dimensional approximations consist of Euler discretizations of the SDE with decreasing step sizes. The employed proposal algorithm for the LRV strategy in this section is a form of the MLMC method (cf. Heinrich [60] and Giles [46] and cf., e.g., [29, 47, 61, 63]) as described in Subsection 4.3. Based on this proposal algorithm, the LRV strategy is applied to the considered approximation problem in Subsections 4.5, 4.6, and 4.7 and, thereafter, summarized in one single framework in Subsection 4.8.

Parametric expectations involving measure space valued random variables
Let p ∈ N, let P ⊆ R^p be measurable, let (V, 𝒱) be a measurable space, let ϕ : P × V → R be measurable, let u : P → R be a function, let (Ω, F, P) be a probability space, let W : Ω → V be a random variable, and assume for all p ∈ P that E[|ϕ(p, W)|] < ∞ and u(p) = E[ϕ(p, W)]. (37) The goal of this section is to derive an algorithm to approximately compute the function u : P → R given through the parametric expectation in (37).

Finite-dimensional approximations
We think of W l , l ∈ N 0 , and Φ l , l ∈ N 0 , as suitable approximations of W and ϕ in the sense that for all p ∈ P and all sufficiently large l ∈ N 0 it holds that ϕ(p, W) ≈ Φ l (p, W l ). (38)

Multilevel Monte Carlo approximations
Let W^{l,m,𝔪} : Ω → R^{d_l}, l, m, 𝔪 ∈ N_0, be independent random variables, and assume for all l, m, 𝔪 ∈ N_0 that W^{l,m,𝔪} and W_l are identically distributed. Observe that (37) and (38) suggest that for all p ∈ P it holds that

Multilevel Monte Carlo approximations for parametric stochastic differential equations
In the case where g : [0, T] × R^d → R is a suitable test function, where for all l ∈ N_0, p = (t, ξ) ∈ P, θ ∈ R^{2^l d} it holds that Φ_l(p, θ) = g(t, X^{ξ,θ,2^l}_t), and where for all l ∈ {0, 1, . . . , L} it holds that M_l = 2^{L−l}, observe that (i) u(t, ξ) is the expectation of the test function R^d ∋ x ↦ g(t, x) ∈ R evaluated at time t at the solution process (X^{ξ,W}_s)_{s∈[0,T]} of the additive noise driven SDE in (41) and (ii) the corresponding telescoping sum is an MLMC approximation of u(t, ξ) as proposed in Giles [46].
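The MLMC construction above can be sketched in code for a simple test case. The following is an illustration, not the article's setup: the SDE is a scalar geometric Brownian motion (so that E[X_T] = ξ exp(rT) is known exactly), level l uses 2^l Euler-Maruyama steps, the fine and coarse paths on each correction level are driven by the same Brownian increments, and the sample numbers decrease like M_l = M_0 2^{L−l}:

```python
import numpy as np

def euler_paths(xi, r, sigma, T, n_steps, dW):
    """Euler-Maruyama for dX = r X dt + sigma X dW, driven by increments dW."""
    dt = T / n_steps
    x = np.full(dW.shape[0], xi, dtype=float)
    for k in range(n_steps):
        x = x + r * x * dt + sigma * x * dW[:, k]
    return x

def mlmc_estimate(xi, r, sigma, T, L, M0, rng):
    """Telescoping MLMC sum for E[X_T] with 2^l Euler steps on level l."""
    est = 0.0
    for l in range(L + 1):
        n_fine, M_l = 2**l, M0 * 2**(L - l)   # fewer samples on finer levels
        dt = T / n_fine
        dW = rng.standard_normal((M_l, n_fine)) * np.sqrt(dt)
        fine = euler_paths(xi, r, sigma, T, n_fine, dW)
        if l == 0:
            est += fine.mean()
        else:
            # coarse path driven by pairwise-summed increments (same noise)
            dW_coarse = dW[:, 0::2] + dW[:, 1::2]
            coarse = euler_paths(xi, r, sigma, T, n_fine // 2, dW_coarse)
            est += (fine - coarse).mean()
    return est

rng = np.random.default_rng(2)
xi, r, sigma, T = 1.0, 0.05, 0.2, 1.0
approx = mlmc_estimate(xi, r, sigma, T, L=6, M0=2000, rng=rng)
exact = xi * np.exp(r * T)   # E[X_T] for geometric Brownian motion
```

The coupling of fine and coarse paths through the same increments is what makes the variances of the correction terms small and thereby the telescoping sum cheap.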

Replacing the random variables in multilevel Monte Carlo approximations
Note that (39) and (47) suggest that for all p ∈ P it holds that

Random loss functions for fixed random variables in multilevel Monte Carlo approximations
and for every m ∈ N let G m :

Learning the random variables with stochastic gradient descent
Let (γ_m)_{m∈N} ⊆ (0, ∞) be learning rates and let Θ = (Θ_m)_{m∈N_0} be the associated SGD process. For every sufficiently large m ∈ N we propose to employ the random function P × Ω ∋ (p, ω) ↦ (p, Θ_m(ω)) ∈ R as an approximation of the target function P ∋ p ↦ u(p) ∈ R in (37).

LRV strategy in the case of multilevel Picard approximations
In this section we consider the case where the target function to be approximated with the LRV strategy is given as the solution of a suitable stochastic fixed point equation (SFPE); see (55) in Subsection 5.1. The proposal algorithm is a multilevel Picard (MLP) approximation method (cf., e.g., [36,37]); see Subsection 5.3. In analogy to the previous sections, in Subsections 5.4, 5.5, and 5.6 an algorithm for the considered approximation problem is derived based on the LRV strategy with this MLP algorithm as proposal algorithm. Finally, the problem and the algorithm are summarized in one single framework in Subsection 5.7.

Parametric solutions of stochastic fixed point equations
Let d ∈ N, let P : P × R^d → P, ϕ : P × R^d × R → R, and u : P → R be measurable, let (Ω, F, P) be a probability space, let W : Ω → R^d be a random variable, and assume for all p ∈ P that E[|ϕ(p, W, u(P(p, W)))|] < ∞ and u(p) = E[ϕ(p, W, u(P(p, W)))]. (55) The goal of this section is to derive an algorithm to approximately compute the function u : P → R given through the SFPE in (55).

Approximations for parametric semilinear partial differential equations

Heat partial differential equations with Lipschitz nonlinearities
In this subsection we consider the setting where f, g ∈ C(R^d, R) are at most polynomially growing. Under suitable assumptions it then holds that (i) the function u is at most polynomially growing and (ii) u is a viscosity solution of the associated semilinear heat PDE.

Multilevel Picard approximations
Let M ∈ N, let I = ∪_{n=1}^∞ Z^n, let W^i : Ω → R^d, i ∈ I, be i.i.d. random variables which satisfy for all A ∈ B(R^d) that P(W^0 ∈ A) = P(W ∈ A), and let V^i_n : P × Ω → R, n ∈ Z, i ∈ I, satisfy the MLP recursion (61) for all n ∈ Z, i ∈ I, p ∈ P. Note that under suitable integrability conditions (61) implies for all n ∈ N, p ∈ P, i ∈ I that V^i_0(p) = 0 and Induction thus shows that for every n ∈ N the identically distributed random functions V^i_n : P × Ω → R, i ∈ I, correspond in expectation to the n-th fixed point iterate of the fixed point equation in (55). For every n ∈ N the recursive definition of (V^i_n)_{i∈I} in (61) thus represents an approximated fixed point iteration step in which the expectation of the fixed point iteration is approximated by an MLMC sum over previously computed approximated fixed point iterates (V^i_k)_{(i,k)∈I×{0,1,...,n−1}}. Under suitable assumptions (see, e.g., Hutzenthaler et al. [68] for precise assumptions and a more detailed derivation of MLP algorithms in the case of semilinear PDEs) we expect for sufficiently large n ∈ N that
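As a strongly simplified illustration of the idea that the expectation in a fixed point iteration is replaced by a sample mean, the following sketch runs a plain (single-level) MC Picard iteration for a scalar toy SFPE v = E[ϕ(v, W)]; the full MLP scheme with its index trees and level-dependent sample numbers is deliberately not reproduced here, and the contraction ϕ is a hypothetical choice:

```python
import numpy as np

def picard_mc(phi, n_iters, M, rng, v0=0.0):
    """Approximate the fixed point v = E[phi(v, W)], W ~ N(0, 1), by Picard
    iteration in which each expectation is replaced by a Monte Carlo mean."""
    v = v0
    for _ in range(n_iters):
        w = rng.standard_normal(M)
        v = np.mean(phi(v, w))   # MC approximation of E[phi(v, W)]
    return v

# Toy contraction: phi(v, w) = 0.5 * sin(v + w) + 1; since
# E[sin(v + W)] = exp(-1/2) * sin(v), the exact fixed point solves
# v = 0.5 * exp(-1/2) * sin(v) + 1.
rng = np.random.default_rng(3)
approx = picard_mc(lambda v, w: 0.5 * np.sin(v + w) + 1.0, 30, 200_000, rng)

v_exact = 1.0
for _ in range(200):                       # Picard iteration with exact expectations
    v_exact = 0.5 * np.exp(-0.5) * np.sin(v_exact) + 1.0
```

In the actual MLP algorithm the single sample mean per iterate is replaced by a multilevel combination of previously computed iterates, which is what makes the method efficient in high dimensions.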

Replacing the random variables in multilevel Picard approximations
Let (C_n)_{n∈Z} ⊆ N_0 satisfy a suitable recursion for all n ∈ Z, for every n, m ∈ N, l ∈ {1, 2, . . . , n} let c_{n,l,m} ∈ N satisfy suitable index conditions, and let 𝒱_n : P × R^{C_n d} → R, n ∈ N_0, satisfy for all n ∈ N, p ∈ P, θ_1, θ_2, . . . , θ_{C_n} ∈ R^d that 𝒱_0(p) = 0 and Observe that induction shows that for all n ∈ N_0, p ∈ P, i ∈ I the number C_n ∈ N_0 corresponds, roughly speaking, to the number of realizations of d-dimensional random variables required to compute one random realization of V^i_n(p). Moreover, note that (61), (64), (65), and (66) assure that for all p ∈ P, B ∈ B(R) it holds that Combining this and (63) suggests that for all p ∈ P it holds that

Random loss functions for fixed random variables in multilevel Picard approximations
Let M, N ∈ N, let P_{m,𝔪} : Ω → P, m, 𝔪 ∈ N, be i.i.d. random variables, assume that (P_{m,𝔪})_{(m,𝔪)∈N²}, (W^i)_{i∈I}, and (W_k)_{k∈{1,2,...,C_N}} are independent, for every m ∈ N let F_m : and for every m ∈ N let G_m :

Learning the random variables with stochastic gradient descent
Let (γ_m)_{m∈N} ⊆ (0, ∞) be learning rates and let Θ = (Θ_m)_{m∈N_0} be the associated SGD process. For every sufficiently large m ∈ N we propose to employ the random function P × Ω ∋ (p, ω) ↦ 𝒱_N(p, Θ_m(ω)) ∈ R as an approximation of the target function P ∋ p ↦ u(p) ∈ R in (55).

LRV strategy in the case of a general proposal algorithm
In this section we derive and formulate the LRV strategy in its most general form, which contains all the algorithms derived in the previous sections as special cases. Roughly speaking, we want to approximate a target function (cf. u : P → R k in Subsection 6.1) for which we already have a generic stochastic approximation algorithm (cf. Ψ : P × R d → R k and W : Ω → R d in (76) in Subsection 6.1).
We refer to this algorithm as the proposal algorithm. Moreover, we assume that we are able to generate random reference solutions which approximate the target function at every point in expectation (cf. (77) in Subsection 6.1).
In the next few sentences we briefly sketch in words the LRV strategy in this general case. The first step of the LRV strategy is to consider the random variables in the stochastic approximation algorithm as parameters for a parametric family of functions (corresponding to P ∋ p ↦ Ψ(p, θ) ∈ R^k, θ ∈ R^d, in Subsection 6.1); see Subsection 6.2. The goal of the LRV strategy is then to "learn" parameters whose corresponding function yields a good approximation of the target function u : P → R^k. Taking this into account, the second step of the LRV strategy is to minimize a loss function (cf. (78) in Subsection 6.2), which measures the distance between the approximating function and the approximate reference solutions, with an SGD-type optimization method; see Subsection 6.3. As the initial guess for the SGD-type learning procedure we suggest choosing the parameters randomly according to the distribution of the random variables appearing in the proposal algorithm, since we know that this already results in a passable approximation. This feature of the LRV strategy is an important advantage over standard deep learning methods in the sense that the LRV strategy already has a relatively small loss at the beginning of the training procedure. The entire approach is presented in one single framework in Subsection 6.4. One of the differences between this section and the previous sections is that in the previous sections we only used, for simplicity, the plain vanilla SGD method; in this section, however, we allow for various more sophisticated SGD-type optimization methods (cf. (80) in Subsection 6.3). Some of these more sophisticated SGD-type methods are presented in Subsection 6.5 as special cases of the framework in Subsection 6.4.
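The training loop sketched in words above can be illustrated in code. The following is a minimal sketch under strong simplifying assumptions: a one-dimensional toy target u(p) = E[cos(pW)], a plain small-sample MC mean as the proposal algorithm Ψ, independent MC estimates as the noisy (but in expectation exact) reference solutions Ξ, and plain SGD with analytic gradients; all names and parameter values are hypothetical:

```python
import numpy as np

def lrv_train(psi, grad_psi, xi_sample, p_sample, theta0, lr, n_steps, rng):
    """Generic LRV loop: the frozen randomness theta of the proposal algorithm
    is treated as a trainable parameter and fitted with SGD against noisy
    reference solutions satisfying E[xi_sample(p)] = u(p)."""
    theta = theta0.copy()
    for _ in range(n_steps):
        p = p_sample(rng)
        residual = psi(p, theta) - xi_sample(p, rng)
        theta -= lr * 2.0 * residual * grad_psi(p, theta)  # grad of squared loss
    return theta

# Toy target u(p) = E[cos(p W)] = exp(-p^2/2); proposal: a 32-sample MC mean;
# references: independent 1024-sample MC estimates (noisy but unbiased).
M = 32
psi = lambda p, th: np.mean(np.cos(p * th))
grad_psi = lambda p, th: -p * np.sin(p * th) / M
xi_sample = lambda p, rng: np.mean(np.cos(p * rng.standard_normal(1024)))
p_sample = lambda rng: rng.uniform(0.0, 1.0)

rng = np.random.default_rng(4)
theta0 = rng.standard_normal(M)     # initial guess drawn from the proposal law
theta = lrv_train(psi, grad_psi, xi_sample, p_sample, theta0, 0.1, 20_000, rng)

grid = np.linspace(0.0, 1.0, 101)
u_vals = np.exp(-grid**2 / 2)
err0 = np.max(np.abs(np.array([psi(p, theta0) for p in grid]) - u_vals))
err1 = np.max(np.abs(np.array([psi(p, theta) for p in grid]) - u_vals))
```

Note that the initialization theta0 is itself a draw of the proposal algorithm's randomness, which is exactly the "passable approximation at the start of training" feature described above.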

Stochastic approximations for general target functions related to parametric expectations
Let p, d, k, 𝔡 ∈ N, let P ⊆ R^p be measurable, let Ψ : P × R^d → R^k, Ξ : P × R^𝔡 → R^k, and u : P → R^k be measurable, assume for all p ∈ P that R^d ∋ w ↦ Ψ(p, w) ∈ R^k is continuously differentiable, let (Ω, F, P) be a probability space, let W : Ω → R^d and 𝒲 : Ω → R^𝔡 be independent random variables, and assume for all p ∈ P that E[∥Ξ(p, 𝒲)∥] < ∞. We think of (Ψ, W) as a stochastic approximation algorithm for u : P → R^k in the sense that for all p ∈ P it holds that u(p) ≈ E[Ψ(p, W)] (76) and for every p ∈ P we think of E[Ξ(p, 𝒲)] as a suitable approximation of u(p) (cf. (77)). The goal of this section is to derive an algorithm to approximately compute the function u : P → R^k.

Random loss functions for fixed random variables in stochastic approximations
and for every m ∈ N let G m :

Learning the random variables with stochastic gradient descent-type methods
Let ψ_m : R^{md} → R^d, m ∈ N, be functions and let Θ = (Θ_m)_{m∈N_0} be the associated SGD-type process. For every sufficiently large m ∈ N we propose to employ the random function P × Ω ∋ (p, ω) ↦ Ψ(p, Θ_m(ω)) ∈ R^k as an approximation of the target function P ∋ p ↦ u(p) ∈ R^k.
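One of the more sophisticated SGD-type methods covered by this framework is the Adam optimizer (cf. Kingma & Ba [73] and Subsection 6.5.7). A single Adam update step, applied here to a toy quadratic objective purely as a smoke test (step size and iteration count are illustrative choices), can be sketched as follows:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages m, v of the gradient and
    its square, with bias correction; returns the new iterate and moments."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)          # bias-corrected first moment
    v_hat = v / (1 - b2**t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = ||theta - 1||^2 as a smoke test.
theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 5001):
    grad = 2.0 * (theta - 1.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

The decay parameters b1 = 9/10 and b2 = 999/1000 used here are the standard Adam defaults, which also appear in the numerical experiments later in the article.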

Numerical examples
In this section we apply the LRV strategy to different parametric approximation problems from the literature. Specifically, we consider the classical parametric Black-Scholes model for the pricing of European call options in Subsection 7.2, we consider a parametric Black-Scholes model for the pricing of worst-of basket put options on three underlying assets in Subsection 7.3, we consider a parametric Black-Scholes model for the pricing of average basket put options on three underlying assets with knock-in barriers in Subsection 7.4, and we consider a parametric stochastic Lorentz equation in Subsection 7.5. In the literature there are already a number of simulation results for SGD-based deep learning methods regarding the high-dimensional pricing of financial derivative contracts. In particular, we refer to, e.g., [4,12,15,39,45,81] for parametric pricing results for European options, we refer to, e.g., [2, 8-10, 26, 43, 79, 80, 108, 117] for the pricing of American options, and we refer to, e.g., [107] for further references. In Subsection 7.1 we briefly recall the antithetic MC method (cf., e.g., Glasserman [48]) and some well-known properties of it. This is a variance reduction technique for MC methods which we will employ in the proposal algorithms for the LRV strategy for some of the above-mentioned approximation problems. In each of the considered numerical examples we also compare the LRV strategy with existing approximation techniques from the literature such as the deep learning method induced by Becker et al. [4], MC methods, and QMC methods.
All the simulations in this section were run in Python using TensorFlow 2.12 on remote machines from https://vast.ai equipped with a single NVIDIA GeForce RTX 4090 GPU with 24 GB Graphics RAM. The Python source codes which were employed to produce all the results in this section can be downloaded as part of the sources of the arXiv version of this article at https://arxiv.org/e-print/2202.02717. Specifically, the codes in the folder 1 BS1 were employed to produce all the results in Subsection 7.2, the codes in the folder 2 BS eur put basket were employed to produce all the results in Subsection 7.3, the codes in the folder 3 BS barrier put basket avg were employed to produce all the results in Subsection 7.4, and the codes in the folder 4 Lorentz were employed to produce all the results in Subsection 7.5.

Antithetic Monte Carlo approximations
In this section we recall a special case of antithetic variates for MC methods (cf., e.g., Glasserman [48, Section 4.2]) in the situation where the distribution of the sampled random variables is symmetric around the origin. The following result, Lemma 7.1 below, shows that the resulting antithetic MC method achieves an L²-accuracy at least as high as that of the standard MC method when the same number of MC samples is used for both methods. The subsequent result, Corollary 7.2 below, then provides a sufficient condition under which the antithetic MC method achieves a strictly higher L²-accuracy than the standard MC method with the same number of MC samples.
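The variance comparison behind Lemma 7.1 can be illustrated numerically; here f(w) = exp(w) is a hypothetical monotone test function, for which averaging over the antithetic pairs (W, −W) reduces the variance per function evaluation:

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.exp                      # monotone test function with E[f(W)] = e^{1/2}

M = 100_000
w = rng.standard_normal(M)
standard = f(w)                                    # M standard MC terms
pairs = 0.5 * (f(w[: M // 2]) + f(-w[: M // 2]))   # M/2 antithetic pairs

est_std, est_anti = standard.mean(), pairs.mean()
# Both estimators use M evaluations of f; for monotone f the values f(W) and
# f(-W) are negatively correlated, so the pair averages have smaller variance.
var_std = standard.var(ddof=1) / M           # variance of the standard estimator
var_anti = pairs.var(ddof=1) / (M // 2)      # variance of the antithetic estimator
```

For f = exp the exact estimator variances are (e² − e)/M and (e² + 1 − 2e)/(2 · M/2) respectively, so the antithetic method is strictly more accurate here at the same cost.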
Combining this with item (ii) in Lemma 7.1 establishes item (ii). The proof of Corollary 7.2 is thus complete.

Parametric Black-Scholes partial differential equations for European call options
In this section we apply the LRV strategy to the problem of approximating the fair price of a European call option in the classical Black-Scholes model (cf. Black & Scholes [16] and Merton [88]). A brief summary of the numerical results of this subsection can be found in Table 1. In addition to the assumptions above, let Φ : R → R satisfy for all x ∈ R that Φ(x) = (2π)^{−1/2} ∫_{−∞}^{x} exp(−y²/2) dy and let u : P → R satisfy for all p = (ξ, T, r, σ, K) ∈ P that Note that (112) corresponds to the famous Black-Scholes formula for European call options. In the economic interpretation of the Black-Scholes model, for every p = (ξ, T, r, σ, K) ∈ P the number u(p) ∈ R thus corresponds to the fair price of a European call option with initial price ξ, time of maturity T, drift rate r, volatility σ, and strike price K.
We now specify the mathematical objects in the LRV strategy appearing in Framework 6.1 to approximately calculate the target function u : P → R in (112). Specifically, in addition to the assumptions above, let M, ℳ ∈ N, a, e ∈ {0, 1}, assume d = M, 𝔡 = ℳ, k = 1, for every ξ, r, σ, w ∈ R, T ∈ [0, ∞) let X^{r,σ,ξ,w}_T ∈ R satisfy X^{r,σ,ξ,w}_T = ξ exp((r − σ²/2)T + σ T^{1/2} w), let ϕ_a : P × R → R, a ∈ {0, 1}, satisfy for all p = (ξ, T, r, σ, K) ∈ P, w ∈ R that ϕ_0(p, w) = exp(−rT) max{X^{r,σ,ξ,w}_T − K, 0} and (114), assume for all p = (ξ, T, r, σ, K) ∈ P, w = (w_1, . . . , w_M) ∈ R^M that, assume for all x, y ∈ R that C(x, y) = |x − y|², let a = 9/10, b = 999/1000, ε ∈ (0, ∞), let (γ_m)_{m∈N} ⊆ (0, ∞) satisfy for all j ∈ {1, 2, . . . , 7}, m ∈ N ∩ (20000(j − 1), 20000j] that γ_m = 10^{−j}, assume for all m ∈ N, assume that W is a standard normal random vector, assume that P_{1,1} is U_P-distributed, and assume that W_{1,1} is a standard normal random variable. Let us add some comments regarding the setup introduced above. Observe that (112), (114), (115), and Lemma 2.1 assure that for all p = (ξ, T, r, σ, K) ∈ P it holds that Moreover, note that in the case a = 0 the proposal algorithm on the left-hand side of (116) corresponds to the standard MC method with M samples and that in the case a = 1 the proposal algorithm on the left-hand side of (116) corresponds to the antithetic MC method with M samples (cf. Subsection 7.1). In addition, observe that in the case e = 0 the reference solutions on the right-hand side of (116) correspond to antithetic MC approximations with ℳ samples and that in the case e = 1 the reference solutions on the right-hand side of (116) are given by the exact solution. Furthermore, observe that (117) describes the Adam optimizer in the setup of Framework 6.1 (cf. Kingma & Ba [73] and Subsection 6.5.7).
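To make the setup above concrete, the closed-form price in (112) and the standard MC proposal algorithm can be sketched as follows; the parameter point p and the sample size below are arbitrary illustrative choices, while the payoff and exact-solution formulas are the standard Black-Scholes expressions:

```python
import numpy as np
from math import erf, exp, log, sqrt

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))   # standard normal CDF

def bs_call(xi, T, r, sigma, K):
    """Closed-form Black-Scholes price of a European call (cf. (112))."""
    d1 = (log(xi / K) + (r + sigma**2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return xi * Phi(d1) - K * exp(-r * T) * Phi(d2)

def mc_call(xi, T, r, sigma, K, M, rng):
    """MC proposal: X_T = xi * exp((r - sigma^2/2) T + sigma sqrt(T) W)."""
    w = rng.standard_normal(M)
    x_T = xi * np.exp((r - sigma**2 / 2) * T + sigma * np.sqrt(T) * w)
    return np.mean(np.exp(-r * T) * np.maximum(x_T - K, 0.0))

rng = np.random.default_rng(6)
p = (100.0, 1.0, 0.05, 0.2, 100.0)     # (xi, T, r, sigma, K), illustrative
exact = bs_call(*p)
approx = mc_call(*p, 1_000_000, rng)
```

In the LRV strategy the vector w inside mc_call is precisely what gets frozen and trained, while bs_call plays the role of the exact reference solution (the case e = 1).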

In Tables 2 and 3 we approximately present one random realization of the L¹(λ_P; R)-approximation error, one random realization of the L²(λ_P; R)-approximation error (Figures 1 and 2 and 4th column in Tables 2 and 3), one random realization of the L^∞(λ_P; R)-approximation error (5th column in Tables 2 and 3), the time to compute Θ_140000 (6th column in Tables 2 and 3), and the time to compute 8 192 000 evaluations of the function P ∋ p ↦ Ψ(p, Θ_140000) ∈ R (7th column in Tables 2 and 3). We approximated the integrals in (119) and (120). To compare the LRV strategy with existing approximation techniques from the literature, we also employ several other methods to approximate the function u : P → R in (112). Specifically, in Table 4 we present numerical simulations for the deep learning method induced by Becker et al. [4] (with training values given by the exact solution, Adam optimizer with 140000 training steps, batch size 8192, learning rate schedule N ∋ m ↦ 1_{(0,20000]}(m)10^{−2} + 1_{(20000,50000]}(m)10^{−3} + 1_{(50000,80000]}(m)10^{−4} + 1_{(80000,100000]}(m)10^{−5} + 1_{(100000,120000]}(m)10^{−6} + 1_{(120000,140000]}(m)10^{−7}, and GELU activation function), in Table 5 we present numerical simulations for the standard and the antithetic MC method, and in Table 6 we present numerical simulations for the standard and the antithetic QMC method with Sobol sequences. In Figure 4 we visualize realizations of empirical distributions of random variables in the MC method and in the QMC method based on Sobol sequences. Very roughly speaking, it seems that with increasingly precise reference solutions and thereby smaller L²-errors, the LRV method produces learned random variables whose histograms approximate the density of the normal distribution more closely, in particular more closely than the realizations of the standard MC samples with which the LRV method is initialized. This suggests that the LRV strategy is learning random variables which in some sense try to approximate the normal distribution. On the other hand, we note that the QMC samples seem to have the most regular histograms but still have much worse L²-errors when compared to the learned random variables.
One explanation for this could be that even though the histograms of the QMC method seem to approximate the normal density very accurately, the empirical moments of the QMC samples are a worse approximation of the moments of the normal distribution than the empirical moments of the learned random variables, so that the QMC samples effectively do not approximate the normal distribution as well as the learned random variables. In Figure 1 the displayed convergence rate 1/2 is inspired by the convergence rate 1/2 of the MC method, which is strongly related to SGD.
Parametric Black-Scholes partial differential equations for worst-of basket put options

Let l : R^d → R satisfy for all x = (x_1, x_2, x_3) ∈ R^d that l(x) = min{x_1, x_2, x_3}, let ϕ : P × R^d → R satisfy for all p = (ξ, T, r, δ, σ, ρ, K) ∈ P, w ∈ R^d that ϕ(p, w) = exp(−rT) max{K − l(X^{ξ,r,δ,σ,ρ,w}_T), 0}, let W : Ω → R^d be a standard normal random vector, and let u : P → R satisfy for all p ∈ P that In the economic interpretation of the Black-Scholes model, for every p = (ξ, T, r, δ, σ, ρ, K) ∈ P the number u(p) ∈ R corresponds to the fair price of a worst-of basket put option on three underlying assets with initial prices ξ, time of maturity T, risk-free rate r, dividend yields of the respective underlying assets δ, volatilities of the respective underlying assets σ, covariance matrix of the Brownian motions Q(ρ), and strike price K.
Assume that W is a standard normal random vector, assume that P_{1,1} is U_P-distributed, and assume that W_{1,1} is a standard normal random vector. Let us add some comments regarding the setup introduced above. Note that the proposal algorithm in (127) corresponds to the MC method. Furthermore, observe that (129) describes the Adam optimizer in the setup of Framework 6.1 (cf. Kingma & Ba [73] and Subsection 6.5.7).
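A plain MC pricer for this worst-of payoff can be sketched as follows. The sketch simplifies the setup above in several hypothetical ways: the three assets share one scalar initial price, volatility, and dividend yield, and the driving Brownian motions are equicorrelated with a single correlation parameter rho:

```python
import numpy as np

def worst_of_put_mc(xi, T, r, delta, sigma, rho, K, M, rng):
    """MC price of a worst-of basket put on 3 assets: the payoff is
    max(K - min_i X^i_T, 0) with equicorrelated driving Brownian motions."""
    z0 = rng.standard_normal((M, 1))
    z = rng.standard_normal((M, 3))
    w = np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z   # corr(w_i, w_j) = rho
    x_T = xi * np.exp((r - delta - sigma**2 / 2) * T + sigma * np.sqrt(T) * w)
    payoff = np.maximum(K - x_T.min(axis=1), 0.0)
    return np.exp(-r * T) * payoff.mean()

rng = np.random.default_rng(7)
# lower correlation -> deeper minimum of the three assets -> dearer put
p_lo = worst_of_put_mc(100.0, 1.0, 0.02, 0.0, 0.2, 0.1, 95.0, 400_000, rng)
p_hi = worst_of_put_mc(100.0, 1.0, 0.02, 0.0, 0.2, 0.9, 95.0, 400_000, rng)
```

The monotone dependence of the price on the correlation (a less correlated basket has a more extreme minimum) is a simple sanity check for such a pricer.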
In Table 7 we approximately present for M ∈ {2^5, 2^6, . . . , 2^13} one random realization of the L¹(λ_P; R)-approximation error ∫_P |u(p) − Ψ(p, Θ_30000)| dp (130) (3rd column in Table 7), one random realization of the L²(λ_P; R)-approximation error [∫_P |u(p) − Ψ(p, Θ_30000)|² dp]^{1/2} (4th column in Table 7), one random realization of the L^∞(λ_P; R)-approximation error (5th column in Table 7), and the time to compute Θ_30000 (6th column in Table 7). For every M ∈ {2^5, 2^6, . . . , 2^13} we approximated the integrals in (130) and (131). Besides the LRV strategy we also employed the standard MC method to approximate the function u : P → R in (127). In Table 8 we present the corresponding numerical simulation results. In Table 8 we have approximated the L¹(λ_P; R)-approximation error of the MC method with the MC method based on 128000 samples, we have approximated the L²(λ_P; R)-approximation errors of the MC method with the MC method based on 128000 samples, and we have approximated the L^∞(λ_P; R)-approximation errors of the MC method based on 128000 random samples (cf., e.g., Beck et al. [4, Lemma 3.5] and Beck et al. [7, Section 3.3]).

Parametric Black-Scholes partial differential equations for multi-asset average put options with knock-in barriers
In this section we apply the LRV strategy to a more complicated and practically relevant Black-Scholes option pricing problem. Specifically, we approximate the fair price of a European average basket put option on three underlyings with a knock-in barrier in the Black-Scholes model (cf. Black & Scholes [16] and Merton [88]). We start by introducing the Black-Scholes model for this pricing problem in the context of Framework 6.1. Assume Framework 6.1, let d = 3, e_1 = (1, 0, 0), e_2 = (0, 1, 0), e_3 = (0, 0, 1), let Q : let L : R³ → R^{d×d} satisfy for all ρ = (ρ_1, ρ_2, ρ_3) ∈ R³ that let W : Ω → C([0, ∞), R^d) be a standard Brownian motion with continuous sample paths, let ϕ : P × C([0, ∞), R^d) → R satisfy for all p = (ξ, T, r, δ, σ, ρ, K, B) ∈ P, w ∈ C([0, ∞), R^d) that ϕ(p, w) = 1_{B_{T,B}}(X^{ξ,r,δ,σ,ρ}(w)) exp(−rT) max{K − m(X^{ξ,r,δ,σ,ρ}_T(w)), 0}, and let u : P → R satisfy for all p ∈ P that In the economic interpretation of the Black-Scholes model, for every p = (ξ, T, r, δ, σ, ρ, K, B) ∈ P the number u(p) ∈ R corresponds to the fair price of an average basket put option on three underlying assets with initial prices ξ, time of maturity T, risk-free rate r, dividend yields of the respective underlying assets δ, volatilities of the respective underlying assets σ, covariance matrix of the Brownian motions Q(ρ), strike price K, and knock-in barrier B.
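A simplified MC sketch of such a knock-in pricing problem can look as follows. The simplifying assumptions here are hypothetical and differ from the setup above: the three assets are independent and identically parametrized, the barrier is monitored at discretely many equidistant times, and the barrier event is taken to be the basket average falling below B:

```python
import numpy as np

def barrier_avg_put_mc(xi, T, r, sigma, K, B, M, n_steps, rng):
    """MC price of an average put on 3 i.i.d. assets with a down-and-in
    barrier: the payoff max(K - avg_i X^i_T, 0) is paid only if the basket
    average falls below B at some monitoring time before T."""
    dt = T / n_steps
    x = np.full((M, 3), xi)
    knocked_in = np.zeros(M, dtype=bool)
    for _ in range(n_steps):
        dw = rng.standard_normal((M, 3)) * np.sqrt(dt)
        x = x * np.exp((r - sigma**2 / 2) * dt + sigma * dw)  # exact GBM step
        knocked_in |= x.mean(axis=1) < B                      # barrier check
    payoff = np.maximum(K - x.mean(axis=1), 0.0)
    price_in = np.exp(-r * T) * (knocked_in * payoff).mean()
    price_vanilla = np.exp(-r * T) * payoff.mean()
    return price_in, price_vanilla

rng = np.random.default_rng(8)
knock_in, vanilla = barrier_avg_put_mc(100.0, 1.0, 0.02, 0.3, 100.0, 85.0,
                                       100_000, 50, rng)
```

Since the knock-in indicator is at most 1 on every path, the knock-in price computed from the same samples can never exceed the corresponding vanilla price, which is a useful consistency check for path-dependent MC pricers.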
In Table 9 we approximately present for M ∈ {2^5, 2^6, . . . , 2^13} one random realization of the L¹(λ_P; R)-approximation error ∫_P |u(p) − Ψ(p, Θ_40000)| dp (150) (3rd column in Table 9), one random realization of the L²(λ_P; R)-approximation error [∫_P |u(p) − Ψ(p, Θ_40000)|² dp]^{1/2} (4th column in Table 9), one random realization of the L^∞(λ_P; R)-approximation error (5th column in Table 9), and the time to compute Θ_40000 (6th column in Table 9). For every M ∈ {2^5, 2^6, . . . , 2^13} we approximated the integrals in (150) and (151). Besides the LRV strategy we also employed the standard MC method to approximate the function u : P → R in (139). In Table 10 we present the corresponding numerical simulation results. In Table 10 we have approximated the L¹(λ_P; R)-approximation error of the MC method with the MC method based on 12800 samples, we have approximated the L²(λ_P; R)-approximation errors of the MC method with the MC method based on 12800 samples, and we have approximated the L^∞(λ_P; R)-approximation errors of the MC method based on 12800 random samples (cf., e.g., Beck et al. [4, Lemma 3.5] and Beck et al. [7, Section 3.3]). In our approximations of the above-mentioned approximation errors we have approximately computed for all required sample points p ∈ P the value u(p) of the unknown exact solution by means of an MC approximation with 52428800 MC samples.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.