A CMA‐ES Algorithm Allowing for Random Parameters in Model Calibration

In geoscience and other fields, researchers use models as a simplified representation of reality. The models include processes that often rely on uncertain parameters that reduce model performance in reflecting real‐world processes. The problem is commonly addressed by adapting parameter values to reach a good match between model simulations and corresponding observations. Different optimization tools have been successfully applied to address this task of model calibration. However, seeking one best value for every single model parameter might not always be optimal. For example, if model equations integrate over multiple real‐world processes which cannot be fully resolved, it might be preferable to consider associated model parameters as random parameters. In this paper, a random parameter is drawn from a wide probability distribution for every singe model simulation. We developed an optimization approach that allows us to declare certain parameters random while optimizing those that are assumed to take fixed values. We designed a corresponding variant of the well known Covariance Matrix Adaption Evolution Strategy (CMA‐ES). The new algorithm was applied to a global biogeochemical circulation model to quantify the impact of zooplankton mortality on the underlying biogeochemistry. Compared to the deterministic CMA‐ES, our new method converges to a solution that better suits the credible range of the corresponding random parameter with less computational effort.

2 of 23 known a priori, but are often only roughly constrained by laboratory or in-situ experiments, and integrate over many groups of organisms. Inverse model studies, that adjust parameters to fit the model to more easily observable quantities, such as nutrients or oxygen, can help to define credible intervals (e.g., Kane et al., 2011;Schartau et al., 2017). In practice, BGC parameters of global ocean models are typically adjusted manually in a given circulation, that is, the BGC parameter values are adjusted until simulated tracers yield a good match with observed data. Systematic calibrations of biogeochemical model parameters need to apply an automated optimization procedure in order to minimize an objective model-data misfit measure, but require a large number of model evaluations. However, the high computational demand of global BGC models impedes this approach, which is therefore the exception rather than the rule; moreover, parameters adjusted in a given circulation may not perform well in a different one, thereby requiring a re-calibration whenever the circulation changes (Kriest et al., 2020).
The computational demand may be reduced by either applying computationally cheaper surrogate circulations (Khatiwala et al., 2005;Prieß et al., 2013), or by using efficient optimization algorithms. Several parameter optimization studies invoke gradient information of the misfit function to iteratively approach a locally optimal parameter vector from some initial estimate. The gradient information is mostly obtained by the-computationally efficient-adjoint method. Lawson et al. (1995) introduced it in the context of biogeochemical models. It is used to determine a direction of descent, for example, by a quasi-Newton method as in Friedrichs (2002); Spitz et al. (1998); Tjiputra et al. (2007); Xiao and Friedrichs (2014), and a step size to change the parameter vector. In the face of complex BGC ocean models (including discontinuous functions), the deployment of associated adjoint models is often difficult. Consequently, gradient-based methods are often unstable. Further, the convergence speed of gradient-based methods can suffer from poorly conditioned Hessian matrices, and finding good pre-conditioners can be difficult or time-consuming, nullifying a possible gain in convergence speed. Stochastic derivative-free search algorithms (e.g., used by Hurtt & Armstrong, 1996;Kidston et al., 2011;Kaufman et al., 2018) are known to be more robust. They allow for a thorough yet efficient scan of the parameter space and can also avoid to get stuck in a (first) local optimum (cf. Schartau & Oschlies, 2003;Vallino, 2000). This is of particular importance given the complex topography of the model-data misfit measure typical for complex BGC models (see, e.g., Faugeras et al., 2003;Hurtt & Armstrong, 1996).
The covariance matrix adaption evolution strategy (CMA-ES, Hansen, 2006) is a stochastic search algorithm that is also applicable to poorly conditioned problems. It is popular for its competitiveness concerning effectivity and efficiency (cf., Hansen et al., 2010). Because of its advantages concerning robustness, efficiency and the balance between exploration and exploitation of the search space, Kriest et al. (2017) applied the CMA-ES algorithm in one of the first optimizations of a global ocean BGC model of intermediate complexity in combination with a sufficient model spin-up time (3,000 years) to approach annual tracer equilibria. In this study, we modified the CMA-ES algorithm in order to enhance the concept of parameter adjustment, in the view of uncertainties that arise from insufficiently resolved real-world processes.
When applying parameter optimization, the decision about appropriate model complexity is often accompanied by the question which BGC parameter values should be tuned (i.e., are changed during optimization in order to obtain a better model fit to observations), and which parameters should remain fixed at some reasonable, constant value during optimization. In order to save computational effort, it seems natural to exclude parameters from optimization that are of minor importance for the research question on hand and/or are unimportant with regard to the applied model-data misfit measure. Indeed, Kriest et al. (2017) observed the most insensitive parameter subject to their optimization experiment to converge the slowest. The same behavior was recently observed by Oliver et al. (2021) who applied an even faster (but less explorative) search algorithm to the same coupled BGC model setup, using twin experiment data. However, fixing some parameters from the start might impact the optimization of the other parameters. Actually, it may not always be the most realistic but pragmatic assumption that model parameters are constant at all. Optimal constant parameters can result in unexpected model responses, when the modeling purpose is changed. Our model processes are simplifications of reality. Therefore, a single model process might integrate over several relevant real-world processes, which may change in space and time. An example for unresolved processes is the impact of higher trophic levels on biogeochemical cycling. In general global coupled BGC models resolve only the first two to three trophic levels (of plankton), with zooplankton mortality as upper closure term. Here, zooplankton mortality parameterizes not solely the natural zooplankton mortality but in addition also the feeding pressure of higher trophic levels (HTL), such as fish, on zooplankton. Given that the spatial and temporal distribution of fish populations is variable, a constant mortality parameter of zooplankton (as currently applied in many BGC models) may not be well justified.
SAUERLAND ET AL.

10.1029/2022MS003390
3 of 23 Therefore, we target an optimization approach that allows to declare some parameters to be random parameters while optimizing those that are assumed to take fixed values. In this paper, a random parameter is drawn from a probability distribution for every singe model simulation (but stays constant within a model simulation). Solutions that are obtained this way might be preferable, since.
1. they are optimal with regard to a large range of (uncertain) parameters, and thereby may render the model applicable in a wider range of contexts (e.g., when coupled to fish models imposing spatially varying feeding pressures on zooplankton), so as to minimize the deterioration of the performance of the biogeochemical model. 2. setting a parameter to the one or the other credible value may bias the optimization of the other parameters toward local optima, with consequences for the biogeochemical cycling (as, e.g., in Kriest et al., 2017).
However, exploring the model-data misfit across a large range of randomly varying parameters would drastically increase the computational demand of the misfit evaluation. Indeed, global ocean BGC models are a paragon for the problem, as single model simulations already have a high computational demand. We provide an efficient way to deal with the given problem by introducing the new R-CMA-ES (a modification of the CMA-ES where R stands for random) in this study. We will examine the effects of the proposed optimization procedure by applying it to the same global BGC model setup as used in Kriest et al. (2017) and Oliver et al. (2021).
The paper is organized as follows. In Section 2 we briefly describe the CMA-ES algorithm as well as our modifications in order to efficiently solve problems with random parameters. While Section 2 is conceptual, the technical descriptions of the algorithms and their pseudo codes can be found in Appendix A and Appendix B, respectively. Further, in Appendix C we test our CMA-ES variant on a set of mathematical benchmark instances and selected random parameters in order to provide evidence that the algorithm serves the intended purpose.
In Section 3 we introduce the global ocean biogeochemical model which we choose as our sample application along with an RMSE type model-data misfit measure, which has been used by Kriest et al. (2017) to calibrate the model. Based on an analysis of the calibration experiments of that former study, we declare one parameter a random parameter and derive the associated integrated model-data misfit measure (the expected model-data misfit measure with regard to the random parameter), which we want to optimize, here. In Section 4 the results of our BGC model calibrations with one random parameter are compared to the calibrations carried out by Kriest et al. (2017). We discuss the convergence behavior of both optimization approaches as well as scientific findings by the new calibration experiment and close with some conclusions.

Methods
For our approaches to optimization throughout this paper, we distinguish three types of model parameters: • fixed parameters: parameters stay fixed at a single, scalar value throughout the optimization. These are not further considered in this paper. • non-random parameters: parameters whose values are to be optimized and change through optimization. The parameter values are drawn from a probability distribution with changing standard deviation (see below, Section 2.1), and are hereafter denoted by p. • random parameters: parameters whose values are drawn at random from a (wide) probability distribution.
In contrast to non-random parameters, the standard deviation of the probability distribution is kept constant throughout the optimization. However, equivalent to non-random parameters, random parameters remain constant during each simulation. Their value can only change for each new simulation.
Random parameters are hereafter denoted as q and characterized in detail in Section 2.2.
A suitable division of the parameters into random parameters and non-random parameters requires some preliminary considerations. For instance, all parameters that are relevant for the research question on hand can be analyzed with regard to their covariances and/or their impact on the model-data misfit function, using multiple model runs (e.g., the simulations of a parameter sensitivity analysis or a deterministic model calibration experiment). Based on this analysis, an appropriate parameter (set) can then be declared random.
SAUERLAND ET AL.

The CMA-ES Algorithm
Here, we adapt the Covariance Matrix Adaption Evolution Strategy (CMA-ES) to allow for an efficient model calibration with random parameters. More precisely, we build upon the (μ/μ w , λ)-CMA-ES (see Hansen, 2006;Hansen, 2016). From now on, we will simply write CMA-ES to refer to the (μ/μ w , λ)-CMA-ES. We start with a brief recapitulation of the CMA-ES. A detailed description and the pseudo code (reduced by some subtleties) of the original algorithm can be found in Appendix A.
The CMA-ES algorithm allows the optimization of n model parameters without the requirement to calculate model derivatives and the associated derivatives of a cost function with respect to the parameters. It is popular for its robustness and efficiency in solving optimization problems that are characterized by a difficult topography of a misfit function f(p), ∶ ℝ → ℝ (cf., e.g., Hansen et al., 2010).
The algorithm maintains a multi-variate normal distribution over the n-dimensional parameter space of ∈ ℝ . Similar to the definition of a uni-variate normal distribution  ( ) by its mean m and its standard deviation σ, a multi-variate normal distribution  ( , ) is uniquely defined by a mean vector m and a positive definite covariance matrix C. Figure 1 illustrates the situation for the uni-variate and the bi-variate case. The algorithm's normal distribution is initialized such that it covers a sufficient part of the parameter space. The algorithm then iteratively samples a population, that is, a set of λ candidate solutions from the normal distribution. From the λ samples the = ⌊ 2 ⌋ best-ranked samples with respect to f are selected. The subset of selected samples is used to calculate an empirical normal distribution (i.e., an empirical mean vector and an empirical matrix of covariances), using rank-dependent weights w which (usually) sum up to 1. The suitable choice of λ, μ, w, and many other operational constants of the algorithm are mainly determined empirically (cf. Hansen, 2016) and details are given in Table A1 in Appendix A. The normal distribution that has been estimated from the sampled parameter vectors is in turn used to update the algorithm's normal distribution to a region with better misfit values. This procedure of sampling and updating the normal distribution is repeated until convergence (or for as many iterations as necessary/desired), that is, until f becomes sufficiently small and all parameter variances vanish. Figure 2 illustrates the convergence behavior of . The associated probability densities are indicated by the gray curve and mesh-grid, respectively. We also highlight the areas of one standard deviation which is an interval on the x-axis (limited by the blue vertical lines) in the uni-variate case, and an ellipse in the plane (blue ellipse) in the bi-variate case. In both cases, we realized some random samples (black dots) from the respective normal distribution.

Figure 2.
Convergence examples of the CMA-ES algorithm for the uni-variate Griewank function (upper 6 panels) and the bi-variate Griewank function (lower 3 panels). In both cases we draw λ = 10 samples (indicated as dots) per iteration from the normal distribution. The better half of μ = 5 samples (black dots) is used to update the distribution. In the uni-variate case, the function is represented as black curve and the normal distribution is indicated as gray curve. In the bi-variate case, the function values are represented by the gray-scale color scheme with increasing values from dark to light shades. Here, like in Figure 1, blue ellipses denote the standard deviation of the normal distribution in the two-dimensional parameter space.
SAUERLAND ET AL.
10.1029/2022MS003390 5 of 23 CMA-ES for a uni-variate misfit function (upper six panels) and a corresponding bi-variate misfit function (lower three panels), respectively. We see that the variance of the distribution can increase as long as the good samples show a wide spread (e.g., iteration 3 in the one-dimensional case) but decrease if the good samples are close together like in the vicinity of the optimum (iterations 22 and 28 in the one-dimensional case).
The general applicability of CMA-ES to large-scale, real-world problems has been shown by, for example, Kriest et al. (2017Kriest et al. ( , 2020. These studies also benefited from additional features of CMA-ES, that help to accelerate shifts of the distribution's mean into directions of descent and also support reliable updates of the distribution with a small population size λ. These features are given in more detail in Appendix A.

The R-CMA-ES Algorithm
The CMA-ES provides a unique set of n optimal model parameters p*, that minimize a given misfit function. However, as mentioned above, and shown earlier (cf. Kriest et al., 2017), a given misfit function can be rather insensitive to some parameters, thereby slowing down the convergence, and resulting in poorly defined optimal parameter values. Moreover, the parameters may represent a number of unresolved processes, which are only vaguely defined; therefore, these parameters are associated with a large inherent uncertainty. We now aim to account for this uncertainty by introducing a random element into optimization, by extending and modifying CMA-ES as follows: Let n be the number of non-random parameters p, ∈ ℝ , whose values are to be optimized by a given misfit function, in analogy to CMA-ES described in Section 2.1. We consider these parameters to represent processes that are well-understood and well-defined by the model, and that can be constrained well by the given misfit function. Let further r be the number of random parameters q, ∈ ℝ , that represent unresolved processes, and whose values can not be constrained by the given misfit function.
We may further assume without loss of generality an overall parameter vector (p; q), whose first n components consist of the non-random parameters and the following r of the random components: ( 1, . . . , , 1, . . . , ) ∈ ℝ + . Our intention behind R-CMA-ES is to efficiently optimize p given randomly varying q. In other words, we seek for a parameter vector p* that is optimal over the entire range of q, where the mean value over the range of q can be generalized by applying a probability density function pdf(q). Such an optimal solution would then be minimal with respect to ∶ ℝ → ℝ , defined by which is a weighted integral over a given misfit function ∶ ℝ + → ℝ . For example, if we deal with only one random parameter q (r = 1) and allow q to take uniformly distributed values within a credible interval [a, b], then Equation 1 becomes When the evaluation of f is already computationally demanding, its computation over a wide range of q is prohibitive. Therefore, rather than applying the normal CMA-ES directly to F, we seek to design a modified CMA-ES that operates on the cheaper (point-based) misfit measure f(p; q), in order to converge to a parameter vector p* that minimizes F. The next two subsections describe our corresponding algorithmic modifications including a final sentence of justification at the very end of each subsection.

Distribution Handling for Random Parameters
The basic idea is to modify the update procedure of the maintained multi-variate normal distribution of parameters such that it gets only updated for the non-random parameters but retains its initial distribution property with regard to the random parameters. More precisely, we desire that the standard deviation interval of the marginal distribution w.r.t. the random parameter always remains the entire credible interval of that parameter. Figure 3 6 of 23 sketches the intended behavior for a two-dimensional test function (Himmelblau's function): the top row shows the convergence of the CMA-ES algorithm; the bottom row illustrates the corresponding behavior of R-CMA-ES if the first parameter is random and the second parameter is optimized. In the example, the CMA-ES algorithm converges to a point (parameter vector p*) that minimizes Himmelblau's function, that is, the mean of the normal distribution approaches the optimal point while all variances approach zero. In the modified algorithm, the normal distribution is supposed to "converge to a line," leaving the variance of the first (random) parameter unchanged while forcing the variance of the second parameter to approach zero.
The intended adaption with regard to the normal distribution over random and non-random parameters is justified by the following fact: if the normal distribution converged in the sense that the variances of all non-random parameters approached zero, then the expected f((p; q))-value of any sample (p; q) is exactly the integrated value, F(p). However, in order to ensure that R-CMA-ES does actually converge to a parameter vector p that yields a small integrated misfit value F(p), we also have to adapt the sampling and selection procedure of the algorithm.

Modification of the Sampling and Selection Procedure
In an iteration of the CMA-ES, the good μ samples that are used to update the probability distribution are selected with regard to an f-value ranking. The same kind of selection is, however, not preferable if we want to update the normal distribution toward a solution with a small value of the integrated misfit function F defined by Equation 1. As an example, which is depicted graphically in Figure 4, we consider the two-dimensional linear function f(p, q) = −p − q on the feasible domain A = [0,1] 2 and declare q to be random. The misfit function f(p, q) takes its minimum in the upper right corner (1,1) T of A and the integrated misfit function F, defined by is minimized for q = 1, that is, on the right boundary of A. With the selection criterion that is based on the ranking according to the f-values, the μ selected samples can often occur quite balanced on both sides of the current distribution mean, implying too small distribution updates toward the right boundary of A, where the actual minimum (line) of A would be found. At the same time, the selected samples might not cover the full credible interval of the random parameter, as f pushes the selection toward its minimum in the upper right corner. Using this kind of selection (sketched in Panel (a) of Figure 4) for the new algorithm, it will likely not converge to the  Figure 4. We therefore prescribe the values of the random parameter to stay independently normally distributed with regard to the subset of μ selected samples (like it is the case with regard to all λ samples).
One possibility to attain the desired independency of selected parameter vectors is mirrored sampling and pairwise selection. Brockhoff et al. (2010) introduced the idea of mirrored sampling as a derandomization technique which can improve the convergence speed of evolution strategies . Auger et al. (2011) combined mirrored sampling with pairwise selection in order to avoid premature convergence caused by canceling effects on cumulative step-size adaptions. Recently, Wang et al. (2019) proposed mirrored orthogonal sampling which further improved the convergence behavior of CMA-ES. We adapt the idea of mirrored sampling to our situation and generate pairs of samples that are mirrored at the axis (hyper plane) of the random parameter(s). From each pair of mirrored samples we select the one with smaller f-value. An example for this kind of mirrored sampling is sketched in panel (c) of Figure 4. There, the combined sampling and selection procedure forces the distribution to approach the right boundary of the search space as desired (Figure 4,panel (d)).
To summarize, applying the adapted mirrored sampling to the CMA-ES -in conjunction with the afore described distribution handling for random parameters -supports the convergence to a parameter vector p* that actually minimizes F instead of f.

Boundary Handling
Biogeochemical model parameters are usually confined to a reasonable range, for example, negative growth or mortality rates are biologically meaningless. Therefore, when optimizing these parameters, usually credible boundaries are defined.
Yet, during the selection of discrete parameter values from the assumed distribution, the algorithm might select parameters slightly outside this range. Similar to Kriest et al. (2017), we exclude these values by adding a penalty term to the misfit. For the current study we apply a simpler boundary handling, which can be applied, without the need of further adaptions, to the case with random parameters. The procedure is as follows: We assume that every parameter vector ∈ ℝ must be contained in the set where each interval [a i , b i ] is the credible interval of parameter p i . We consider these constraints by adding to the objective function a penalty term, which depends on the distance of p to its own best approximation ϕ(p) in A. The best approximation ϕ(p) ∈ A of p is the vector in A which has (amongst all vectors in A) the smallest distance to p. It can be calculated component-wise by setting (ϕ(p)) i = min(b i , max(a i , p i )) for all i = 1, …, n. Thus, for p ∈ A we have ϕ(p) = p and for p ∉ A the best approximation ϕ(p) lies on the boundary of A. The re-definition of f with regard to the penalty term is given by where p is a non-random parameter to be optimized, and q is a random parameter. The gray-scale color scheme represents the function values f(p, q) with increasing values from dark to light shades. We draw λ = 20 samples per iteration (shown as dots) and select μ = 10 samples (black dots) for the distribution update. Blue ellipses denote the standard deviation of the normal distribution. We also show the principal axis of the ellipses that corresponds to the random parameter. Panel (a) shows iteration three of an optimization using independent samples that are ranked and selected with regard to their f-values. Panel where c is a (large) constant. We apply this kind of boundary handling with both, the classical CMA-ES and R-CMA-ES.
Since we deal with boundary constraints of the form p i ∈ [a i , b i ], we can restrict CMA-ES to operate on the unit cube [0,1] n . In this case, we obtain a parameter vector p ∈ A (with A as defined above) from a sample x ∈ [0,1] n by scaling and shifting every component: We tested R-CMA-ES on a set of mathematical benchmark functions. The results of our test-bed confirmed the intended behavior of the algorithm. Our tests are summarized in Appendix C. They encouraged us to use R-CMA-ES in a real-world application, and compare its effects on algorithm performance, optimal model parameters and potential effects on biogeochemical turnover. Kriest et al. (2017) applied the CMA-ES to optimize six parameters of the BGC Model of Oceanic Pelagic Stoichiometry (MOPS; Kriest & Oschlies, 2015). The approach was facilitated by the Transport Matrix Method (TMM; Khatiwala, 2007Khatiwala, , 2018Khatiwala et al., 2005) as a numerically efficient tool to represent the global ocean circulation. The TMM represents advection and mixing in terms of transport matrices which are precomputed from an online ocean circulation model simulation. Here, as in Kriest et al. (2017) and also in Oliver et al. (2021), we apply monthly mean transport matrices from a 2.8° global configuration of the MIT general circulation model (MITgcm), having 15 depth levels (Marshall et al., 1997). MOPS coupled to the TMM simulates globally the concentrations and biogeochemical turnover of seven tracer components, namely phyto-and zooplankton, dissolved and particulate organic matter, phosphate, nitrate and oxygen.

The Global Ocean Biogeochemical Model Setup
To showcase the impact and performance of R-CMA-ES, we consider the same model setup as in Kriest et al. (2017), and optimize biogeochemical model parameters against a misfit function that includes global observations of nutrients and oxygen. Each biogeochemical model setup is simulated for 3,000 years, after which the model tracers approach a steady annual cycle.

The Misfit Function f rmse
The misfit function considered in Kriest et al. (2017) is a weighted root mean squared error between simulated annual mean tracer concentrations of oxygen, phosphate, and nitrate and their observed equivalents after being mapped to the model grid with N = 52, 749 ocean grid boxes. For each grid box and each tracer the corresponding misfit term is weighted with regard to both, the volume V i of the grid box divided by the total ocean volume V T , and the global average observed tracer concentration (j = 1 for phosphate, j = 2 for nitrate, and j = 3 for oxygen). Denoting the model output for a given parameter vector as m i,j and the corresponding observation as o i,j , i = 1, …, N, j = 1, 2, 3, the misfit function is 9 of 23

Zooplankton Mortality as Random Parameter
We here diverge from the setup by Kriest et al. (2017) and apply R-CMA-ES to consider zooplankton mortality κ Zoo (see Kriest & Oschlies, 2015, equations 9-11) as a random parameter. In particular, we aim to optimize the remaining five parameters (the light and nutrient affinity of phytoplankton, maximum zooplankton grazing rate, the oxygen demand of remineralization, and the exponent describing the particle flux curve) such that the model shows a good fit to the misfit function f rmse of Equation 2 over a wide range of zooplankton mortalities. The choice of this parameter can be justified from a modeling point of view. As noted above, zooplankton mortality does not only represent the natural mortality but also the predation by higher trophic levels which is likely to vary, because, for example, of different fish populations preying upon zooplankton. Further, the model assumes a single zooplankton functionality, while in nature many different zooplankton organisms contribute with probably different mortalities. Moreover, the parameter optimization experiment OBS-NARR by Kriest et al. (2017) identified κ Zoo to be most insensitive among the six parameters that are optimized. Table 1 depicts the optimal parameter values of experiment OBS-NARR when only simulations which deviate from the best fit f rmse (p*) by less than 5% (according to (f rmse (p)/f rmse (p*) − 1)*100) are included. Parameter κ Zoo has the widest range of variability ([0.772,5.315]), exceeding both boundaries of the credible interval [1.6,4.8] used by Kriest et al. (2017). Relative to the credible interval, the range of variability is 142%. Further, together with the half-saturation constant for PO 4 uptake, κ Zoo was the only parameter that showed a strong and long-lasting trend during the optimization experiment OBS-NARR, while the other four parameters approached their optimum value much earlier (cf. Figure 5). Therefore, we decided to optimize the other five parameters considered by Kriest et al. (2017) such that the expected model-data misfit with regard to κ Zoo ∈ [1.6, 4.8] is minimized. More precisely, we aim to obtain five optimal parameters In this contribution we assume normally distributed parameter values and define κ Zoo to be  (3.2, 1.6) distributed, that is, normally distributed with mean 3.2 (the middle of the credible interval) and standard deviation 1.6 (half interval length).
Thus, we compare two model calibration experiments. The first one is the reference experiment OBS-NARR by Kriest et al. (2017). Our new calibration experiment is referred to as OBS-RAND. It deviates from OBS-NARR by defining κ Zoo to be random and  (3.2, 1.6) distributed, aiming to minimize the integrated misfit function F rmse instead of the parameter point-based misfit function f rmse . The model configuration and the computing facility are the same for both experiments.

Results and Discussion
We note that both model calibration approaches have a common set of five non-random parameters to optimize. In order to compare their performance, following optimization we carried out 100 model simulations with the five optimal parameters over the range of zooplankton mortality rates. We then selected the minimum value of Equation 2 as representative for f rmse , and the average over all 100 simulations as representative for F rmse . Note that for OBS-NARR, the former was the target of optimization with CMA-ES, whereas for OBS-RAND F rmse (the integrated misfit) was the implied target. Table 2 shows the results of this analysis, together with the optimal parameter sets.The optimized model-data misfit values of experiment OBS-RAND are quite close to the optimal model-data misfit value of experiment OBS-NARR. Indeed, we can see that the optimizations with R-CMA-ES (experiment OBS-RAND) improved the model-data misfit averaged over the entire range of zooplankton mortalities from 0.4566 to 0.4537 (see Table 2), while the best misfit of 0.4499 for a single (six-)parameter vector has been found by CMA-ES in the reference experiment, OBS-NARR by Kriest et al. (2017).
Compared to experiment OBS-NARR by Kriest et al. (2017), our global ocean biogeochemical model calibrations with random quadratic zooplankton mortality parameter p Zoo hardly affect parameters related to longterm and/or large scale model processes which are affected by circulation and latitude, such as I c , − 2 ∶ , and Note. For CMA-ES, the f rmse -value is the misfit value of the algorithm's solution. For R-CMA-ES, we state the empirically best f rmse -value which is obtained a posteriori by fixing the non-random parameters of the solution vector but choosing 100 different κ Zoo -values such that each value covers an area of probability 1 100 . The same 100 κ Zoo -values were also used to approximate the F rmse -values of both algorithms' solution vectors.

Table 2 Optimized Parameters and Model-Data Misfit Values of Experiment OBS-NARR and Our New Experiment With Random Zooplankton Mortality Parameter, OBS-RAND
SAUERLAND ET AL.

10.1029/2022MS003390
11 of 23 the sinking speed parameter b*. On the other hand, the half saturation constant for nutrient uptake K Phy is not pushed to the upper bound of its credible interval any more, but is closer to its center. Also, the maximum grazing rate μ Zoo of zooplankton drops by 10%. A detailed view of the optimization trajectory is provided in Figure 5, which shows the convergence of the six free model parameters obtained by the reference experiment OBS-NARR while Figure 6 shows the corresponding result of experiment OBS-RAND. Figure 7 illustrates the exploration of the parameter space by R-CMA-ES, as well as its convergence toward five optimal parameters of p*.
Further we find that R-CMA-ES converged faster compared to CMA-ES, as indicated by the lower number of iterations required for convergence (cf. Table 2). Both of these responses might be a result of the modifications in the optimization procedure, as R-CMA-ES optimizes the expected misfit with regard to the most misfit-insensitive parameter. Figure 8 shows the model-data misfit measure, f rmse , for 100 parameter sets with varying κ Zoo given optimal vector of the non-random parameters p. In OBS-NARR the minimum misfit is obtained at the corresponding κ Zoo -value of 4.57 (blue dot). On the contrary, by minimizing the expected misfit F rmse (p) using R-CMA-ES, we obtain a lower minimum misfit over a wider range of κ Zoo . When partitioning f rmse with respect to its three components, namely the misfit to observed oxygen, phosphate and nitrate (Figure 9).we see that the improvement is mainly caused by a uniformly better RMSE of oxygen, while the RMSE of nitrate and, to a smaller extend, the RMSE of phosphate increased. Because the oxygen misfit comprises about 45% and both nutrients together only about 55% of the total misfit (cf. the right panel of Figure 9), the increased RMSEs of both nutrients could not compensate the better agreement with oxygen. Verifying the bias of the three tracers we find that the oxygen bias improves by 4 to 8 mmol m −3 for all κ Zoo -values in the credible interval [1.6,4.8] in experiment OBS-RAND ( Figure 10, red lines in lower right panel) compared to the values of the reference experiment OBS-NARR (black lines). Averaging over κ Zoo in [1.6,4.8] the oxygen bias improved from +10.1 mmol m −3 to +5.6 mmol m −3 , that is, we have a smaller overestimation of the observed global oxygen content (172.9 mmol m −3 ), which is also recognizable in the spatially resolved, depth integrated oxygen bias of Figure 11. Concerning the nutrient bias, there is almost no change for nitrate and phosphate between experiments OBS-RAND and OBS-NARR (also cf. Figure 11), that is, their higher RMSEs appear to be caused by higher pattern errors, only.

Increasing Zooplankton Mortality Decreases Global Oxygen Inventory
To understand the reasons for the improved oxygen misfit across the range of zooplankton mortalities, we have to investigate the role of those two parameters (μ Zoo and K Phy ) that changed significantly between the two optimization approaches. We therefore set up a set of sensitivity simulations applying the optimal values for μ Zoo (hereafter dubbed MU) and K Phy (KP) derived by the two optimization procedures in different combinations (see Table 3). As for OBS-NARR (hereafter called NA) and OBS-RAND (RA) for each sensitivity set-up, we ran a set of 100 simulations for different κ Zoo values. By doing so, we hope to disentangle the effects of μ Zoo , K Phy and κ Zoo on oxygen and biogeochemical cycling. Figure 10 shows the change of the most important tracer concentrations and fluxes with regard to the four scenarios. We find considerable effects of the mortality rate on plankton concentrations. Increasing zooplankton mortality leads to a strong decline in zooplankton concentrations, less grazing on phytoplankton and thus an increase in phytoplankton concentration. Because phytoplankton mortality is a linear function of its concentration, it increases with increasing zooplankton mortality. Egestion decreases in line with zooplankton concentration (and grazing; not shown here). The decline in zooplankton concentrations is, however, not strong enough to counteract the increase in its mortality rates, leading to an increase in the mortality flux. Zooplankton mortality contributes mostly to the production of sinking detritus, followed by phytoplankton mortality and egestion. The antagonistic responses of detritus production through egestion and the respective mortality fluxes cause a net increase in export production by about 8% for scenario RA. Larger export of organic matter to deep waters, where it is respired and thus consumes oxygen, in turn causes a decline in global average oxygen by about 8% (16.6 Pmol in terms of global inventory or 14.2 mmol m −3 global average oxygen) for scenario RA after 3,000 years of simulation. The effects of changes in mortality rate are similar for all four model setups that apply different combinations of μ Zoo and K Phy .
Changes in these two parameters affect the oxygen inventory only by about 1%-2%, but we note that here we investigate a smaller range of variation in the parameters. A smaller K Phy (higher nutrient sensitivity) as in MU causes a more efficient nutrient uptake and larger growth of phytoplankton; yet, this is not reflected in its concentration or mortality, but propagated into zooplankton grazing (not shown), egestion and mortality. The resulting increase in export production in turn causes a decline in oxygen; thus, for MU the decrease in oxygen is caused through zooplankton egestion and mortality. A smaller zooplankton grazing rate in KP causes a decline in its concentration, egestion and mortality. However, it also relieves the grazing pressure on phytoplankton, thereby enhancing its concentration and mortality. The resulting increase in export production, which decreases the average oxygen concentration, is thus caused by phytoplankton mortality.
Setup RA combines both parameter changes (small K Phy and μ Zoo ) which eventually add up to the highest export production and lowest global average oxygen concentration. Thus, the side effects of introducing a random zooplankton mortality in OBS-RAND lead, to some extent, to a modification in the model's biogeochemical cycling, which eventually results in a global oxygen inventory that is about 6 mmol m −3 (about 3%) lower than in OBS-NARR, which relied on a single, fixed value of zooplankton mortality. Nevertheless, the effects of zooplankton mortality on global oxygen content are about two to three times as large.

Conclusions
We introduced R-CMA-ES as a new variant of the Covariance Matrix Adaption Evolution Strategy. R-CMA-ES allows us to declare one or more parameters to be random, that is, the algorithm seeks to adjust only the other (non-random) parameters in order to minimize a new misfit function, which is the expectation of the former misfit function, integrating over all values of the random parameters. Such a calibration can be more reasonable than searching for single optimal parameter values only. An example are situations where (e.g., due to model simplifications) it is not clear which natural processes are actually covered by a certain parameter and to what extent. Tests with mathematical benchmark functions confirm an efficient convergence behavior of R-CMA-ES as compared to its deterministic counterpart.
We applied R-CMA-ES to a global ocean biogeochemical model setup, which inspired us to develop the algorithm. The model has been considered in a former optimization study by Kriest et al. (2017), who optimized six BGC parameters and observed that two of the parameters-the quadratic loss rate of zooplankton and the phytoplankton half-saturation constant for PO 4 -showed a long lasting drift during optimization. The quadratic loss of zooplankton parameter has high uncertainty because it mimics many different processes (e.g., cannibalism within the highly aggregated zooplankton compartment; predation by fish and higher trophic levels; density-dependent population control through viral infection), as well as many different zooplankton species. Therefore, this parameter is an ideal candidate to declare random during optimization. Allowing the quadratic loss of zooplankton to vary randomly over a credible interval, R-CMA-ES converged faster than CMA-ES did in the reference experiment of the former study. Moreover, the optimization now also reflects the potential spatio-temporal variability of this parameter, for example, due to higher trophic levels such as fish, which might be of relevance when a BGC model is coupled to a model of higher trophic levels (Getzlaff & Oschlies, 2017;Hill-Cruz et al., 2022). Furthermore, while optimization OBS-NARR seems to cause a bias of the half-saturation constant of phytoplankton for phosphate toward its upper limit, the same parameter is more in the center of its credible interval when we declare the quadratic zooplankton loss random. Another significant change is observed for the zooplankton maximum grazing rate.
Our model results suggest that after 3,000 years of simulation with climatological forcing, the uncertainty in zooplankton mortality causes a variation of oxygen inventory by 8% (16.6 Pmol or 14.2 mmol/m 3 average). Smaller changes in maximum zooplankton grazing rate or nutrient affinity of phytoplankton have a smaller effect on biogeochemical fluxes and the long-term global oxygen inventory, but suffice to improve the applied modeldata misfit measure over a wide range of zooplankton mortalities. The change in oxygen induced by mortality is of the same order of magnitude as changes induced by anthropogenic climate change (Oschlies, 2021), but it occurs after a considerably longer time span. Also, the oxygen variation across the spectrum of mortality rates is as large as the deviation of many global Earth system models from observations (e.g., Bopp et al., 2013). Our optimizations have shown that it is difficult, if not impossible, to constrain zooplankton mortality with a misfit function that targets at the RMSE of dissolved inorganic tracers , which is a common practice when tuning global biogeochemical ocean models. A more flexible tuning strategy as presented here could potentially help to account for this large uncertainty, and may also help to provide a sound and reliable upper closure term and interface for biogeochemical models coupled to higher trophic level (HTL) models.
In this study we used an ocean biogeochemical model to showcase the potential advantages of the R-CMA-ES to carry out model calibrations in the view of some uncertain model parameters. Of course, there are many more fields, that face parameter uncertainty, and where a model calibration with random parameters could be useful, such as the personalization cardiac   (Elshall et al., 2015;Lykkegaard et al., 2021) or morphodynamic models of a curved channel (Shoarinezhad et al., 2020). All these fields face the issue of parameter uncertainty. In general, a suitable partition of the model parameters into random parameters and non-random parameters needs some problem-dependent pre-considerations. For example, all parameters of interest can be analyzed w.r.t. their covariances and their impact on the model-data misfit function, using multiple model runs (e.g., the simulations of a parameter sensitivity analysis or a deterministic model calibration experiment).
Depending on the research question and the observed misfit-sensitivities and covariances of the model parameters, a suitable parameter (set) can be declared random during an optimization, which calibrates the model in the face of (random) parameter uncertainty.

Appendix A: Details of the Classical CMA-ES
As illustrated in Section 2.1 CMA-ES iteratively samples a population of λ candidate solutions from a multi-variate normal distribution  ( , 2 ) , defined by a mean vector m, a positive definite covariance matrix C and an overall scaling factor s. A new normal distribution is empirically re-estimated from the better half of = ⌊ 2 ⌋ samples, and the new probability distribution is used for a smooth update of the former distribution, which in turn is sampled in the next iteration.

A1. Sampling the Normal Distribution
Sampling a parameter vector ∈ ℝ from a multi-variate normal distribution  ( , 2 ) is practically realized by choosing n independent samples from the uni-variate standard normal distribution  (0, 1) (e.g., using the Box-Muller transform) to be the components of a vector ∈ ℝ and defining of C, that is, the columns B (i) of B are orthogonal eigenvectors of C, and D 2 is a diagonal matrix of corresponding eigenvalues. Geometrically (cf. Hansen, 2016), B and D can be identified with the so called standard deviation ellipsoid, which is a surface of equal probability density of the normal distribution (see, e.g., the ellipses in Figures 3, 4 and 7). The orientations of the n principal axes of the standard deviation ellipsoid are given by the eigenvectors B (i) of C, and the lengths of its principal axes are given by the roots of the corresponding real and positive eigenvalues D i,i . Like in the uni-variate case, the probability that a sample lies within a given area of the search space is obtained by integrating the density function of the probability distribution (pdf) over that area. For a multi-variate normal distribution  ( , ) , the corresponding probability density function is given as

A2. Updating the Distribution: Basic Principle
Given any set S = {p (1) , …, p (λ) } of λ samples, empirical (re)estimates m emp and C emp of the distribution parameters can be calculated such that the expectation of m emp is m and the expectation of C emp is C. Clearly, the estimates become more reliable the larger λ is. We may assume that the population S is increasingly ordered (ranked) with respect to the considered objective function ∶ ℝ → ℝ , that is Now, by involving only the better half of = ⌊ 2 ⌋ samples, their distribution estimate  ( , ) with corresponding parameters m μ and C μ will be modified to reproduce that μ samples with higher probability than the other λ − μ samples. CMA-ES uses values w 1 ≥ w 2 ≥⋯ ≥ w μ with ∑ =1 = 1 to give solutions a rank dependent weight in the updating process of both, m μ and C μ (a more general version allows to involve all solutions, applying negative weights for the poor ranks). The new mean is, thus, calculated as = ∑ =1 ( ) . A subtlety is the choice of the reference mean value used for estimating C μ . Instead of the new empirical mean m μ , the mean m of the former distribution is chosen and yields It has the effect that the new distribution is elongated into directions of descend (cf., e.g., iteration 2 in the right example of Figure 2).

A3. Updating the Distribution: Working With Small Populations
As mentioned above, reliable distribution estimates require a sufficiently large number of samples. But, for a competitive computational performance CMA-ES should get along with a rather small number of samples. Therefore the information of former populations is involved by updating the covariance matrix C to be a (convex) combination of both the current C and its estimate C μ , that is Using this formula with c μ as in Table A1, it can be shown that 37% of the current matrix C's information dates back at least ⌊ 1 ⌋ generations, that is, the choice of the smoothing factor c μ decides about the backward time horizon of the update procedure.
Another feature that facilitates small population sizes λ is to calculate and update a vector p c that represents iteration averaged changes of the distribution mean and to use p c for a so called rank-one estimate = of the covariance matrix. The idea behind this approach is that, using C μ , distribution elongations into directions of descend do not distinguish the sign of the directions. The use of the vector p c (called evolution path) mitigates this effect: consecutive changes of the distribution mean into opposite directions would cancel out each other. Similar

Selection and recombination
Step size control Covariance matrix adaption  to the smoothing with factor c μ in the update of C in Equation A2, the update of p c is done with a smoothing factor c c . With a further smoothing factor c 1 for the rank-one estimate C 1 , the combined covariance matrix update reads ← (1 − − 1) + + 1 1.
While C μ efficiently involves information from the current population into the update process, C 1 exploits correlations between generations. The former is important in large populations, the latter is particularly important in small populations.

A4. Step Size Control
Finally, there is an additional explicit adaption of the overall scale (the step size) of the distribution by adapting a scaling factor σ, actually using  ( , 2 ) instead of  ( , ) . Similar to the evolution path p c for the rank-one covariance matrix estimates above, the adaption of the scale σ involves an evolution path p σ that mirrors cumulative changes of the mean. The difference between the update formulas of both evolution paths p σ and p c is that for p σ all step sizes are re-scaled with respect to the isotropic normal distribution  ( , ) , where I is the identity matrix and 0 is the zero vector. The expected step size between the mean vectors of two consecutive iterations is therefore the expected length χ of a sample of  ( , ) , which is  Set p σ = p c = 0, C = B = D = I and σ = σ 0 5: while stopping criterion is not met do 6: Sample probability distribution: 7: for k = 1, …, λ do 8: Sample z (k) from  ( , ) 9: Set y (k) = BDz (k) 10: Set p (k) = m + sy (k) 11: end for 12: Sort samples by (penalized) objective function values 13: Update probability distribution: 14: Update mean: 15:

17:
Update evolution paths: 18: Update covariances and scaling: Derive B and D according to (A1) 25: end while SAUERLAND ET AL.

10.1029/2022MS003390
19 of 23 evolution path p σ longer than χ indicates consecutive distribution drifts into correlated directions which justifies a larger overall scale of the distribution.

A5. Operational Constants and the Pseude Code
The classical CMA-ES is (reduced by some subtleties) outlined as Algorithm 1 (cf. Hansen, 2016). This pseudo code may serve to verify our adaptions to the new situation, which we will elucidate in detail in Appendix B and summarize as Algorithm 2. Here, we denote the identity matrix with I, the all-ones vector (1,…,1) T with 1, and the zero vector with 0. Note, that in Algorithm 1 the sampled candidate parameter vectors p (k) are always assumed to be increasingly ordered with respect to the (penalized) objective function, that is, the best μ out of λ parameter vectors are used for updating both, m and C.

Appendix B: Details of R-CMA-ES
As already explained in Section 2, we will modify both, the distribution and the sampling and selection procedure of the CMA-ES algorithm in order to efficiently optimize the integrated objective function F defined by Equation 1 instead of its point-based counterpart f. Set μ, w, μ eff , χ, c σ , d σ , c c , c μ , c 1 , s 0 according to Table A1 (but λ = 2μ) and σ = σ 0 3: Set = 1 2 ⋅ ∖∖ (dimension n + r) 4: Set p σ = p c = 0, C = B = D = I ∖∖ (all with dimension n) 5: while stopping criterion is not met do 6: Sample probability distribution: 7: for k = 1, …, μ do 8: Sample z (k) from  ( , ) 9: Set (

B1. Modification of the Distribution
We will use the following notations, most of which were already introduced in Section 2.2: we denote by n be the number of non-random parameters which we want to optimize and by r the number of random parameters. We can restrict to the case that the first n components of the parameter vectors are the non-random parameters and the last r components are the random parameters. For ∈ ℝ we write p n for the subvector ( 1, . . . , ) ∈ ℝ and p r for the subvector ( +1, . . . , + ) ∈ ℝ . Vice versa, if a vector ∈ ℝ and a vector ∈ ℝ are given, then we denote by (p; q) the vector ( 1, . . . , , 1, . . . , ) ∈ ℝ + .
In order to incorporate the variability of a parameter p i in the procedure, we must fix both, the mean and the variance of the random parameter, that is, ∼  (0.5, 0.5) in our case, as we operate on [0,1] n+r . This can be done by using some suitable modification C′ of the scaled covariance matrix s 2 C (and additionally by keeping m i = 0.5).
A natural modification is to overwrite the ith column of s 2 C with 0.25 · e i and the ith row of s 2 C with 0.25 ⋅ , where e i is the ith unit vector. It implies that e i becomes an eigenvector of the modified covariance matrix C′ with eigenvalue 0.25 and, thus, that one of the principal axes of the standard deviation ellipsoid is parallel to the ith coordinate axis and has length 0.5.
Essentially the same result can be achieved by restricting the maintained multi-variate normal distribution to the set of non-random parameters, that is, ∈ ℝ × , calculating from each sample ∼  ( , ) the corresponding vector of non-random coordinates y n = BDz n (where B and D describe the eigendecomposition of C according to Equation A1), and the corresponding vector of random coordinates = 1 2 . Finally, the new scaled and shifted sample (cf. lines 12-13 in Algorithm 2) can be defined by

B2. Modification of the Sampling and Selection Procedure
Let ∶ ℝ + → ℝ be the objective function in the deterministic optimization case and ⊆ ℝ be the subspace of all random components. By pdf we denote the probability density function of the random components' (r-dimensional) normal distribution  (0.5 ⋅ , 0.5 ⋅ ) . We wish to optimize the integrated objective function ∶ ℝ → ℝ defined by Equation 1: ( ) = ∫ ∈ pdf( ) ⋅ (( ; )) d .
As reasoned in Section 2.2 we want the values of a random component i to stay independently  (0.5, 0.5) distributed with regard to the μ selected samples (like it is the case with regard to all samples). For this purpose, we adapt the idea of mirrored sampling (Auger et al., 2011) to our situation and generate pairs of samples that are mirrored at the axis (hyper plane) of the random parameter(s), meaning that we sample the vectors z (k) only for k ∈ [μ] and set ( + ) ∶= ( − ( ) ; ( ) ) (cf. lines 8-9 of Algorithm 2). An example for this kind of mirrored sampling is sketched in panel (c) of Figure 4. From each pair of samples we select exactly one sample for the update of the normal distribution, namely the better one w.r.t. f.

B3. Pseude Code
We summarize R-CMA-ES as Algorithm 2. For the update formulas in lines 18-19 as well as for the calculation of C μ , we assume that (by sorting) the first μ samples have been selected according to the procedure described above.

Appendix C: R-CMA-ES Test-Bed
We consider an ensemble of mathematical benchmark functions on which we compare CMA-ES to R-CMA-ES. Similar to our real-world application, we restrict R-CMA-ES to deal with a single random component and choose n = 6 as problem dimension. Additionally, we use n = 10. Our test cases invoke 5 benchmark functions; one linear function, one bowl-shaped ("sphere") function, one valley-shaped ("Rosenbrock") function, and two functions ("Griewank" and "Rastrigin") with many local minima. The details are listed in Table C1.
For problem dimension 6 we set the number of samples per iteration λ = 10 and pose a maximum iteration number of 200; for problem dimension 10 we use λ = 12 and pose an iteration limit of 300. As a second stopping criterion a standard deviation of at most 10 −4 must be satisfied for all parameters. Table C2 represents the results that we obtained by applying both, a Matlab implementation of CMA-ES (Hansen, 2012, cmaes.m, Version 3.61. beta) and our R-CMA-ES algorithm (also implemented in Matlab). Note. We used 10 optimization trials (with different random numbers) for both algorithms and each instance. In each case we show the mean result over all trials. The f-values and F-values have been calculated like in Table 2 but using 10,000 (instead of 100) values for the random parameter. Eight out of our ten test instances have the property that the optimal solution to the deterministic optimization task does also belong to the set of optimal solutions to the optimization task with a random parameter. For example, the all-ones vector 1 is the global optimum of the benchmark function f of the test instances Linear1 and Linear2, but also a solution to the problem min ∈[0,1] ( ).
Nevertheless, benchmark functions that exhibit the mentioned property serve well in proving a good convergence behavior of R-CMA-ES in comparison to CMA-ES, if it has similar (or even smaller) f-values and F-values. This is indeed the case for the eight respective test instances. The situation is different for both Rosenbrock instances.
Here, the optimal solutions with regard to f and F differ. Indeed, CMA-ES often finds solutions that have a better f-value than the solutions of R-CMA-ES; but vice versa, all solutions found by R-CMA-ES have significantly better F-values.
For the well-shaped test instances Linear1, Linear2, Sphere1, and Sphere2, CMA-ES converged after less function evaluations than R-CMA-ES did; the factor was about 3 for the linear instances and about 1.5 for the sphere instances. However, R-CMA-ES required less function evaluations for both valley-shaped instances Rosenbrock1 and Rosenbrock2 and for the jagged instances Rastrigin1 and Rastrigin2.

Data Availability Statement
The implementation of our optimization algorithm R-CMA-ES and the benchmark function test-bed in MATLAB as well as the C++ implementation for model calibration on HPC platforms can be found on GitHub. The permanent version of the code which we used for the experiments of this article is archived in a public zenodo repository (Sauerland, 2023). For compilation, usage, and further notes, we refer to the README contained in that repository. The BGC ocean model code and the observational data we used for this study is the same as used by Kriest et al. (2017) and also available on (Sauerland, 2023) (in the folder "MOPS"). The basic TMM and MOPS code as well as required input data for forcing, geometry, and initialization of the model are available to download from (Khatiwala, 2018).