Calibration and Uncertainty Quantification of Convective Parameters in an Idealized GCM

Parameters in climate models are usually calibrated manually, exploiting only small subsets of the available data. This precludes both optimal calibration and quantification of uncertainties. Traditional Bayesian calibration methods that allow uncertainty quantification are too expensive for climate models; they are also not robust in the presence of internal climate variability. For example, Markov chain Monte Carlo (MCMC) methods typically require $O(10^5)$ model runs and are sensitive to internal variability noise, rendering them infeasible for climate models. Here we demonstrate an approach to model calibration and uncertainty quantification that requires only $O(10^2)$ model runs and can accommodate internal climate variability. The approach consists of three stages: (i) a calibration stage uses variants of ensemble Kalman inversion to calibrate a model by minimizing mismatches between model and data statistics; (ii) an emulation stage emulates the parameter-to-data map with Gaussian processes (GP), using the model runs in the calibration stage for training; (iii) a sampling stage approximates the Bayesian posterior distributions by sampling the GP emulator with MCMC. We demonstrate the feasibility and computational efficiency of this calibrate-emulate-sample (CES) approach in a perfect-model setting. Using an idealized general circulation model, we estimate parameters in a simple convection scheme from synthetic data generated with the model. The CES approach generates probability distributions of the parameters that are good approximations of the Bayesian posteriors, at a fraction of the computational cost usually required to obtain them. Sampling from this approximate posterior allows the generation of climate predictions with quantified parametric uncertainties.


Introduction
Calibrating climate models with available data and quantifying their uncertainties are essential to make climate predictions accurate and actionable. A primary source of uncertainty in climate models is the representation of small-scale processes such as moist convection. Parameters in convection schemes and other parameterizations are usually calibrated by hand, using only a small fraction of the data that are available. As a result, the calibration process may miss information about the small-scale processes in question. This paper presents a proof-of-concept, in an idealized setting, of how parameters in climate models can be calibrated using a substantial fraction of the available data, and how uncertainties in the parameters can be quantified. We employ a new algorithm, called calibrate-emulate-sample (CES), which makes such calibration and uncertainty quantification feasible for computationally expensive climate models. CES reduces the hundreds of thousands of model runs usually required to quantify uncertainties in computer models to hundreds, a speedup of about a factor of 1000. It also leads to more robust calibration and uncertainty quantification in the presence of noise arising from chaotic variability of the climate system. We show how uncertainties in climate model parameters can be translated into quantified uncertainties of climate predictions through ensemble integrations.

The principal uncertainties in climate predictions arise from the representation of unresolvable yet important small-scale processes, such as those controlling cloud cover [13,14,10,83,9,87,88,12,81]. These processes are represented by parameterization schemes, which relate unresolved quantities such as cloud statistics to variables resolved on the climate models' computational grid, such as temperature and humidity. The parameterization schemes depend on parameters that are a priori unknown, so fixing the parameters carries uncertainty. The process of fixing these parameters to values that are most consistent with data is known as calibration, which generally requires solving an optimization problem. Traditionally, however, parameters are calibrated ("tuned") by hand, in a process that exploits only a small subset of the available observational data and relies on the knowledge and intuition of climate modelers about plausible ranges of parameters and their effect on the simulated climate of a model [64,53,36,38,29,39,74,94]. More recently, broader-scale automated approaches that more systematically quantify the plausible range of parameters have begun to be explored [19,40].
Opportunities to improve climate models lie in exploiting a larger fraction of the available observational data together with high-resolution simulations, and in learning from both systematically rather than manually [75]. To fully account for parametric uncertainty, we adopt a Bayesian view of the model-data relationship, which amounts to solving a Bayesian inverse problem. Given a model, data, and prior information on the parameters, Bayesian inversion yields a posterior distribution of parameters. The mean or mode of the posterior distribution defines the best parameter estimate, and the entire distribution provides uncertainty quantification through its spread about the mean or mode. We use the mismatch between climate statistics simulated with the model and those obtained from observations or high-resolution simulations as the data likelihood in the Bayesian inverse problem, to calibrate parameterizations in a climate model and to quantify their parametric uncertainties. Our focus is on learning from time-averaged climate statistics, for three reasons: (1) time-averaged statistics are what is relevant for climate predictions; (2) time-averaged statistics vary more smoothly in space than atmospheric states, leading to a smoother optimization problem than that of atmospheric state estimation in numerical weather prediction (NWP); (3) time-averaging over long intervals reduces the effect of the unknown initial state of the system, removing the need to determine it. Focusing on time-averaged climate statistics, rather than on instantaneous states or trajectories as in NWP, makes it possible to exploit climate observations and high-resolution simulations even when their native resolutions are very different from those of climate models.
While learning from climate statistics accumulated in time presents opportunities, it also comes with challenges. Accumulating statistics in time is computationally much more expensive than the forecasts over hours or days used in NWP. Therefore, we need algorithms for learning from data that minimize the number of climate model runs required. Traditional methods for Bayesian calibration and uncertainty quantification, such as Markov chain Monte Carlo (MCMC), typically require many iterations, often more than $10^5$, to reach statistical convergence (e.g., [34]). Conducting so many computationally expensive climate model runs is not feasible, rendering MCMC impractical for climate model calibration [1]. Additionally, while MCMC can be used to obtain the distribution of model parameters given data, it is not robust with respect to noise in the evaluation of the map from model parameters to data. Such noise, arising from natural variability in the chaotic climate system, can lead to trapping of the Markov chains in spurious, noise-induced local maxima of the likelihood function [17]. This presents additional challenges to using MCMC methods for climate model calibration.
Here we demonstrate a new approach to climate model uncertainty quantification that overcomes the limitations of traditional Bayesian calibration methods in a relatively simple proof-of-concept. The approach, called calibrate-emulate-sample (CES) [17], consists of three successive stages, each of which exploits proven concepts and methods: 1. In a calibration stage, we use variants of ensemble Kalman methods. These approaches were originally developed as a derivative-free method for state estimation [27,85] and are now widely used in NWP [41]. The methods were subsequently developed for simultaneous state and parameter estimation, and variants were developed to deal with strongly nonlinear systems [7,37,52,69,8]. They were eventually recognized as a general-purpose tool for the solution of inverse problems whose objective is parameter estimation [16,25,65,28]. Here we use an optimization approach, referred to as ensemble Kalman inversion [44], which builds on the work of [16,25]. Ensemble Kalman inversion may be understood as gradient descent projected onto the ensemble subspace [71]; relatedly, iterated EnKF smoothers may be interpreted as Gauss-Newton-type methods confined to the ensemble subspace [69]. However, ensemble Kalman inversion and other ensemble Kalman methods do not provide a basis for systematic uncertainty quantification, except in linear Gaussian problems [1,35,26].
2. In an emulation stage, we train an emulator on the climate model statistics generated during the calibration stage in order to quantify uncertainties. To emulate how the climate model statistics depend on the parameters to be calibrated, we use Gaussian processes (GPs), a hierarchical machine learning method that learns smooth functions from a set of noisy training points [50,70]. The GP approach also learns the statistics of the uncertainty in the resulting predicted function. The training points here are provided by the climate model runs performed in the calibration stage.
3. In a sampling stage, we approximate the posterior distribution of the parameters given the model and data, using the GP emulator to replace the parameter-to-climate-statistics map. The emulator is used to estimate error statistics and to sample from the approximate posterior distribution with MCMC. Because the GP emulator is computationally cheap to evaluate and is smooth by virtue of the smoothing properties of GPs, this avoids the issues that limit the usability of MCMC for sampling from climate models directly.
The CES approach is described in detail in [17], which provides a justification and contextualization of the approach within the literature on data assimilation and Bayesian calibration. If uncertainty quantification is not required, the calibration step provides an effective, derivative-free parameter estimation method, and the emulation and sampling stages are not needed. When uncertainty quantification is required, however, the role of the calibration stage is to provide a good set of training points in the vicinity of the posterior mean or mode, to train the emulator in the next stage of the algorithm.
The purpose of this paper is to demonstrate the feasibility of the approach for estimating parameters in an idealized general circulation model (GCM).This represents a proof-of-concept in a small parameter space and limited data space; how the methods scale up to larger problems will be discussed at the end.
This paper is arranged as follows: Section 3 describes the experimental setup, including the idealized GCM and the generation of synthetic data from it. Section 4 describes the CES approach and the methods used in each stage. Section 5 describes the results of numerical experiments that use CES to calibrate parameters in the idealized GCM and quantify their uncertainties. It also demonstrates how sampling from the posterior distribution of parameters can be used to generate climate predictions with quantified uncertainties. Section 6 discusses and summarizes the results and their applicability to larger problems.
3 Experimental Setup

General Circulation Model
We use the idealized GCM described by [31] and [59], which is based on the spectral dynamical core of the Flexible Modeling System developed at the Geophysical Fluid Dynamics Laboratory. To approximate the solution of the hydrostatic primitive equations, it uses the spectral transform method in the horizontal, with spectral resolution T21 and with 32 latitude points on the transform grid. It uses finite differences with 10 unevenly spaced sigma levels in the vertical. We chose this relatively coarse resolution to keep our numerical experiments computationally efficient, so that comparison of CES with much more expensive methods is feasible. The lower boundary of the GCM is a homogeneous slab ocean (1 m mixed-layer thickness). Radiative transfer is represented by a semi-gray, two-stream radiative transfer scheme, in which the optical depth of longwave and shortwave absorbers is a prescribed function of latitude and pressure [59], irrespective of the concentration of water vapor in the atmosphere (i.e., without an explicit representation of water vapor feedback). Insolation is constant and approximates Earth's annual-mean insolation at the top of the atmosphere.
We focus our calibration and uncertainty quantification experiments on parameters in the GCM's convection scheme, which is a quasi-equilibrium moist convection scheme that can be viewed as a simplified version of the Betts-Miller convection scheme [2,3,4]. It relaxes the temperature T and specific humidity q toward reference profiles on a timescale τ [30]:

    (∂T/∂t)_conv = − f_T(z; T, q, p) (T − T_ref)/τ    (1)

and

    (∂q/∂t)_conv = − f_T(z; T, q, p) f_q(T, q, p) (q − q_ref)/τ.    (2)

Here, f_T(z; T, q, p) is a function of altitude z and of the thermodynamic state of an atmospheric column (dependent on the temperature T, pressure p, and specific humidity q in the column), which determines where and when the convection scheme is active; f_q(T, q, p) is a function that modulates the relaxation of the specific humidity in non-precipitating (shallow) convection [30,59]. The reference temperature profile is a moist adiabat, T_ma(z), shifted by a state-dependent, height-independent offset ∆T, which is chosen to ensure conservation of the enthalpy integrated over a column: T_ref(z) = T_ma(z) + ∆T. The reference specific humidity q_ref(z) is the specific humidity corresponding to a fixed relative humidity RH relative to the moist adiabat T_ma(z). The two key parameters in this simple convection scheme thus are the timescale τ and the relative humidity parameter RH. We demonstrate how we can learn about them from synthetic data generated with the GCM.
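As a minimal numerical sketch of the relaxation described above, the following single-column snippet takes one forward-Euler step of the relaxation equations. All profile values are illustrative stand-ins (not GCM output), and the activity functions are simplified to f_T = f_q = 1:

```python
import numpy as np

# Hypothetical single-column sketch of the quasi-equilibrium relaxation:
# T and q are relaxed toward T_ref and q_ref on the timescale tau wherever
# the scheme is active (here simplified to f_T = f_q = 1 everywhere).
tau = 2.0 * 3600.0          # relaxation timescale tau = 2 h, in seconds
dt = 600.0                  # model time step (assumed), in seconds

T = np.array([290.0, 270.0, 250.0])       # column temperature (K), illustrative
T_ref = np.array([288.0, 272.0, 251.0])   # moist adiabat + enthalpy-conserving offset
q = np.array([12e-3, 4e-3, 1e-3])         # specific humidity (kg/kg), illustrative
q_ref = 0.7 * np.array([16e-3, 6e-3, 2e-3])  # RH = 0.7 times saturation values

# One forward-Euler step of dT/dt = -(T - T_ref)/tau and dq/dt = -(q - q_ref)/tau
T_new = T - dt / tau * (T - T_ref)
q_new = q - dt / tau * (q - q_ref)
```

Each step moves T and q a fraction dt/τ of the way toward the reference profiles, so larger τ means weaker convective adjustment.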

Variable Selection and Generation of Synthetic Data
The idealized GCM with the simple quasi-equilibrium convection scheme has been used in numerous studies of large-scale atmosphere dynamics and mechanisms of climate change, especially those involving the hydrologic cycle (e.g., [59,58,11,61,77,54,56,47,48,51,6,92,89]). We know from this body of work that the convection scheme primarily affects the atmospheric thermal stratification in the tropics, with weaker effects in the extratropics [76]. We also know that the relative humidity parameter (RH) in the moist convection scheme controls the humidity of the tropical free troposphere but has a weaker effect on the humidity of the extratropical free troposphere [57]. Thus, we expect tropical circulation statistics to be especially informative about the parameters in the convection scheme. However, convection plays a central role in extreme precipitation events at all latitudes [61,60], so we expect statistics of precipitation extremes to be informative about convective parameters, and in particular to contain information about the relaxation timescale τ.
As climate statistics from which we want to learn about the convective parameters, we choose 30-day averages of the free-tropospheric relative humidity, of the precipitation rate, and of a measure of the frequency of extreme precipitation. Because the GCM is statistically zonally symmetric, we take zonal averages in addition to the time averages. The relative humidity is evaluated at σ = 0.5 (where σ = p/p_s is pressure p normalized by the local surface pressure p_s), as shown in Figure 1. As a measure of the frequency of precipitation extremes, we use the probability that daily precipitation rates exceed a high, latitude-dependent threshold. The threshold is chosen as the latitude-dependent 90th percentile of daily precipitation in a long (18000-day) control simulation of the GCM in a statistically steady state. So for the parameters in the control simulation, the precipitation threshold is expected to be exceeded 10% of the time at each latitude. The convective parameters in the control simulation are fixed at their reference values RH = 0.7 and τ = 2 h [59], and we collect the parameters in the vector θ† = (θ†_RH, θ†_τ) = (0.7, 2 h). Figure 2 shows the mean relative humidity, the mean precipitation rate (broken down into its contributions from the convection scheme and from condensation at resolved scales), and the 90th-percentile precipitation rate from the control simulation, averaged over 600 batches of 30-day windows. We use the single long control simulation of duration 18000 days only for the creation of Figure 2 and for the estimation of noise covariances.
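The threshold-and-exceedance construction above can be sketched in a few lines. The numbers here are synthetic stand-ins (a gamma-distributed precipitation record, not GCM output), but the operations mirror the description: a per-latitude 90th-percentile threshold from a long control record, and the exceedance frequency within a 30-day window:

```python
import numpy as np

# Sketch with synthetic stand-in data: latitude-dependent 90th-percentile
# threshold from a long "control" record, and exceedance frequency in a window.
rng = np.random.default_rng(0)
n_days, n_lat = 18000, 32
daily_precip = rng.gamma(shape=2.0, scale=3.0, size=(n_days, n_lat))  # mm/day, illustrative

threshold = np.percentile(daily_precip, 90, axis=0)   # one threshold per latitude

# Frequency of exceeding the threshold within a single 30-day window
window = daily_precip[:30]
exceed_prob = (window > threshold).mean(axis=0)       # per-latitude frequency
```

Over the full record the exceedance frequency is 10% at each latitude by construction; a single 30-day window scatters around that value, which is exactly the sampling variability the noise covariance below must capture.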

Definition of noise covariance
Estimation of model parameters requires specification of a noise covariance matrix, reflecting errors and uncertainties in the data. The principal source of noise in our perfect-model setting with synthetic data is sampling variability due to finite-time averaging with unknown initial conditions. The initial condition is forgotten at sufficiently long times because of the chaotic nature of atmospheric variability, so a central limit theorem quantifies the finite-time fluctuations around infinite-time averages that are caused by uncertain initial conditions. The asymptotic distribution of the fluctuations is therefore a multivariate normal distribution N(0, Σ(θ)) with zero mean and covariance matrix Σ(θ). We estimate the covariance matrix at Σ(θ†), that is, with the parameters θ† of the control simulation. To estimate Σ(θ†), we run the GCM for 600 windows of length 30 days (because we use 30-day averages to estimate parameters) and calculate the sample covariance matrix of the 30-day means. With the 3 latitude-dependent fields evaluated at 32 latitude points, Σ(θ†) is a 96 × 96 symmetric matrix representing noise from internal variability in finite-time averages.
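The covariance estimation step reduces to a sample covariance over window means. A minimal sketch, with random numbers standing in for the 600 windows of 30-day-averaged GCM statistics:

```python
import numpy as np

# Sketch: estimate the internal-variability covariance Sigma(theta^dagger) as the
# sample covariance of 30-day means over 600 windows (synthetic stand-in data).
rng = np.random.default_rng(1)
n_windows, n_data = 600, 96          # 3 fields x 32 latitudes = 96 data points

# Stand-in for 30-day-averaged GCM statistics, one row per window
batch_means = rng.normal(size=(n_windows, n_data)) @ np.diag(np.linspace(0.5, 2.0, n_data))

Sigma = np.cov(batch_means, rowvar=False)   # 96 x 96 sample covariance
```

With 600 windows and 96 data dimensions, the sample covariance is well conditioned enough to invert, which the misfit functions below require.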
Hereafter, we make the assumption that Σ(θ) ≈ Σ(θ†) for any θ, and thus we treat Σ as a constant matrix. In practical implementations of this method, a corresponding constant Σ can be estimated from climatology. We illustrate the correlation structure of Σ in Figure 3.
To generate synthetic data, we also include the effect of measurement error [50]. We add Gaussian noise to the time-averaged model statistics, with a diagonal covariance structure in data space. We construct the measurement error covariance matrix ∆ to be diagonal with entries δ_i > 0, where i indexes over data type (the 3 observed quantities) and latitude (32 locations). Combining this measurement covariance matrix ∆ with the covariance matrix Σ arising from internal variability leads to the inflated (total) noise covariance matrix

    Γ = Σ + ∆.    (3)

There are many options to pick δ_i. We choose it by reducing the distance of the 95% confidence interval to its nearest physical boundary for each i by a constant factor C, so as to retain physical properties (e.g., precipitation must be nonnegative). Denoting by µ_i the mean, Σ_ii the variance, and ∂Ω_i the physical boundary set for each data component i, we choose δ_i so that the half-width of the inflated 95% confidence interval, 1.96 √(Σ_ii + δ_i), closes all but a fraction C of the gap between the original interval and the nearest boundary in ∂Ω_i. We take C = 0.2, which implies a significant inflation of the standard deviations on average. Figure 4 shows the resulting data mean (grey circles), the 95% confidence interval of the inflated covariance (grey ribbon), and four realizations of the data y^(1), …, y^(4) (yellow to red lines), each defined by taking a different 30-day average of the GCM and adding a different realization of N(0, ∆). These four realizations will be used throughout when presenting our results.
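One plausible reading of this inflation rule can be sketched as follows. All numbers are hypothetical, and the closed-form step below is an assumption consistent with the verbal description (the inflated 95% interval closes all but a fraction C of the gap to the nearest physical boundary):

```python
import numpy as np

# Hedged sketch of the noise-inflation rule (one plausible reading, toy numbers):
# choose delta_i so that the inflated 95% interval closes all but a fraction C of
# the gap between the original interval and the nearest physical boundary.
C = 0.2
mu = np.array([5.0, 0.7])            # data means (e.g., precip in mm/day, rel. humidity)
var = np.array([0.5, 0.01])          # internal-variability variances Sigma_ii
boundaries = [np.array([0.0]), np.array([0.0, 1.0])]  # physical boundary sets

delta = np.empty_like(mu)
for i in range(len(mu)):
    d = np.min(np.abs(mu[i] - boundaries[i]))   # distance to nearest boundary
    gap = d - 1.96 * np.sqrt(var[i])            # slack of the original 95% interval
    half_width = d - C * gap                    # inflated 95% half-width
    delta[i] = max((half_width / 1.96) ** 2 - var[i], 0.0)
```

By construction the inflated interval never crosses the physical boundary, so realizations drawn with the inflated covariance rarely violate nonnegativity of precipitation or the [0, 1] range of relative humidity.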

Misfit functions for time-averaged data
Both calibration and uncertainty quantification in CES rely on a misfit function (standardized error) that quantifies the mismatch between model output and data. Calibration minimizes a (possibly regularized) misfit function over the parameter space; uncertainty quantification samples from the posterior distribution, using a misfit function as the negative log-likelihood. To define the desired misfit function, we introduce G_T(θ; z^(0)) and G_∞(θ), which denote the mapping from the parameter vector θ to the 96 data points, averaged either over a finite time horizon (T) or over an infinite time horizon (∞). The former average depends on the unknown initial condition z^(0), whereas the latter does not, because the initial condition is forgotten after a sufficiently long time. We refer to G_T(θ; z^(0)) as the forward model and G_∞(θ) as the infinite time-horizon forward model.
To define the misfit function, we begin from the relationship between parameters θ and data y. Expressed in terms of finite-time averages, this relationship has the form

    y = G_T(θ; z^(0)) + η,   η ∼ N(0, ∆).    (4)

The data realizations y^(i) for i = 1, …, 4 in Section 3.3 can therefore be seen as evaluations of (4) for initial conditions z^(0)_i, noise realizations η_i ∼ N(0, ∆), and at the parameter θ†. This form has the undesirable feature that it involves z^(0), a quantity that is not of intrinsic interest. However, the central limit theorem asserts that

    G_T(θ; z^(0)) ≈ G_∞(θ) + σ,   σ ∼ N(0, Σ(θ)),    (5)

for sufficiently long averaging times. This theorem quantifies the forgetting of the initial condition after sufficiently long times, e.g., for the atmosphere, T ≳ 15 days [93]. From the definition (3) of the total noise covariance matrix, Γ = Σ + ∆, we can combine the observational and internal-variability noise and write

    y = G_∞(θ) + γ,   γ ∼ N(0, Γ).    (6)

We can equally interpret y^(i) for i = 1, …, 4 from Section 3.3 as evaluations of (6) with noise γ_i ∼ N(0, Γ) and at parameter θ†. This removes the dependence on the initial condition but is expressed in terms of infinite-time averages.
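The central limit theorem behind (5) is easy to illustrate numerically. In the toy sketch below, an AR(1) process stands in for chaotic internal variability; finite-window means (the analogue of 30-day averages G_T) scatter around the infinite-time mean with approximately Gaussian spread:

```python
import numpy as np

# Sketch illustrating the CLT for time averages: window means of a correlated
# AR(1) process (a stand-in for chaotic internal variability) scatter around
# the infinite-time mean, which is 0 for this process.
rng = np.random.default_rng(2)
phi_ar, n_steps, n_windows = 0.9, 30, 2000
n_total = n_windows * n_steps

eps = rng.normal(size=n_total)
x = np.empty(n_total)
x[0] = eps[0]
for t in range(1, n_total):
    x[t] = phi_ar * x[t - 1] + eps[t]     # AR(1) recursion

# Analogue of 30-day averages G_T: means over consecutive windows
window_means = x.reshape(n_windows, n_steps).mean(axis=1)
spread = window_means.std()               # analogue of sqrt(Sigma)
```

The spread of the window means is what the covariance Σ estimated in Section 3.3 quantifies for the GCM statistics; it shrinks as the averaging window grows.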
Computing these averages directly is not feasible, but (in the emulation phase) we introduce a procedure that enables us to learn a surrogate model for G ∞ , using carefully chosen finite-time model evaluations G T .
In the Bayesian approach to parameter learning, the aim is to determine the conditional distribution of the parameters θ given a realization of the data y (written θ | y) by applying Bayes' theorem, which states that P(θ | y) is proportional to the product of the likelihood P(y | θ) and the prior P(θ). In the case where a surrogate model for G_∞ is available, y | θ is defined as the pushforward of θ through the parameter-to-data map (6), and we define the misfit function, the negative logarithm of the likelihood, as

    Φ(θ, y) = ½ ‖y − G_∞(θ)‖²_Γ,    (7)

where ‖v‖²_Γ = v^T Γ^{-1} v is the squared Mahalanobis distance with covariance Γ. Before a surrogate model for G_∞ is available, this function is infeasible to evaluate; but as y | θ can be defined using (4) as well, we consider the related misfit function for the finite-time model evaluated at an initial condition w^(0):

    Φ_T(θ, y; w^(0)) = ½ ‖y − G_T(θ; w^(0))‖²_{Γ+Σ}.    (8)

In general, w^(0) differs from the initial conditions z^(0)_i used to generate y. We view evaluation of G_T from any initial condition as a random approximation of G_∞; hence, the additional internal-variability covariance matrix Σ appears in (8).
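Both misfit functions are half of a squared Mahalanobis distance, differing only in the covariance used. A minimal sketch with illustrative numbers (the forward-model outputs here are stand-ins, not GCM evaluations):

```python
import numpy as np

# Sketch of the misfit (negative log-likelihood up to a constant): half the
# squared Mahalanobis distance between data y and model output, with covariance
# Gamma for the infinite-time map and Gamma + Sigma for finite-time evaluations.
def misfit(y, g, cov):
    """0.5 * (y - g)^T cov^{-1} (y - g)."""
    r = y - g
    return 0.5 * r @ np.linalg.solve(cov, r)

# Illustrative numbers (not GCM output)
Sigma = np.array([[0.5, 0.1], [0.1, 0.4]])   # internal variability
Delta = np.diag([0.2, 0.3])                  # measurement-error inflation
Gamma = Sigma + Delta                        # total noise covariance, eq. (3)

y = np.array([1.0, 2.0])
g_inf = np.array([0.8, 2.1])                 # stand-in for G_infty(theta)
phi = misfit(y, g_inf, Gamma)                # eq. (7)
phi_T = misfit(y, g_inf, Gamma + Sigma)      # eq. (8), finite-time version
```

The extra Σ in (8) enlarges the covariance and so down-weights the same residual, reflecting the additional randomness of a single finite-time evaluation.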
The CES algorithm in this context proceeds as follows: use optimization based on (8) to calibrate parameters; on the basis of evaluations of G_T made during the calibration, learn a GP surrogate for G_∞; then use this surrogate to sample from the posterior distribution of θ | y defined using (7). We will henceforth neglect z^(0) in our notation and just write G_T(θ) and Φ_T(θ, y). Viewing the initial condition as random makes these objects random as well.
The finite-time model average G_T(θ) has several undesirable properties: (i) it is computationally expensive to evaluate for large T; (ii) it can be nondifferentiable or difficult to differentiate (e.g., because of non-differentiability of parameterization schemes in climate models); and (iii) evaluations of it are not deterministic (when one drops the explicit dependence on initial conditions). Our methodology, detailed in the upcoming sections, is constructed to overcome these difficulties.
An alternative approach to emulating the map G_∞(θ) is to emulate the log-likelihood function Φ(θ, y) directly. This is often more computationally efficient, as one then always models a scalar function. But we found that this efficiency comes at the cost of reduced accuracy and interpretability; for example, the choice of observational noise is unclear. In preliminary experiments, these drawbacks proved detrimental to performance, and so we abandoned the approach.

Prior distributions
The priors of the physical parameters are taken to be logit-normal and lognormal distributions, θ_RH ∼ Logitnormal(0, 1) and θ_τ ∼ Lognormal(12 h, (12 h)²), for the relative humidity and timescale parameters, respectively. That is, we define the invertible transformation

    𝒯(θ_RH, θ_τ) = (logit(θ_RH), log(θ_τ)),

which transforms each parameter to values along the real axis. Through an abuse of notation, we relabel the transformed (or computational) parameters as θ; the untransformed (or physical) parameters (relative humidity and timescale) are then uniquely defined by 𝒯^{-1}(θ). The forward map G_T is defined as the composition F_T ∘ 𝒯^{-1}, where F_T is the forward map defined on the physical parameters.
This choice allows us to apply our methods in the transformed space, where our priors are normally distributed and unbounded (namely, logit(θ_RH) ∼ N(0, 1) and log(θ_τ) ∼ N(10.17, 1)); meanwhile, the climate model is applied only to physically defined variables, θ_RH ∈ [0, 1] and θ_τ ∈ [0, ∞). In this way, the prior distributions enforce physical constraints on the parameters.
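The transformation and its inverse can be sketched directly (the function names are ours, for illustration only):

```python
import numpy as np

# Sketch of the parameter transformation: physical (RH, tau) -> unbounded
# computational coordinates via (logit, log), and its inverse.
def to_computational(rh, tau):
    return np.array([np.log(rh / (1.0 - rh)),   # logit(RH), RH in (0, 1)
                     np.log(tau)])              # log(tau),  tau in (0, inf)

def to_physical(theta):
    return 1.0 / (1.0 + np.exp(-theta[0])), np.exp(theta[1])

theta = to_computational(0.7, 2.0 * 3600.0)     # reference values RH = 0.7, tau = 2 h
rh, tau = to_physical(theta)                    # round trip recovers the physical values
```

Because the inverse maps the whole real plane into the physical domain, any parameter value proposed by the calibration or sampling algorithms automatically satisfies the physical constraints.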

Calibrate: Ensemble Kalman Inversion
Ensemble Kalman inversion (EKI) [44] is an offline variant of ensemble Kalman filtering, designed to learn parameters of a general model rather than states of a dynamical system. EKI can be viewed as a derivative-free optimization algorithm. Given a set of data y, it iteratively evolves an ensemble of parameter estimates so that the ensemble both reaches consensus and moves toward the optimal parameter value θ* (close to θ† if the quality and volume of the data are good enough) that minimizes an objective function, in our case given by the misfit (8), possibly with the inclusion of a regularization term. It has great potential for use with chaotic or stochastic models because of its ensemble-based, derivative-free approach to optimizing parameters. Theoretical work shows that noisy continuous-time versions of EKI exhibit an averaging effect that skips over fluctuations superimposed on the ergodic averaged forward model [23]. Empirical results suggest similar phenomena apply to EKI as implemented here, justifying the application to noisy forward model evaluations. Furthermore, the derivative-free approach scales well to high-dimensional parameter spaces, as evidenced by the use of Kalman filtering in numerical weather prediction, where billions of parameters characterizing atmospheric states are routinely estimated [46]. This makes the algorithm appealing for complex climate models. The algorithm is mathematically proven to find the optimizer, within an initial, ensemble-dependent subspace, for linear models [72]. It is known to be effective for high-dimensional nonlinear models [44,80,78], such as the nonlinear map from parameters to data represented by the idealized GCM we use in our proof-of-concept here.
The EKI algorithm we use is detailed in [44]. The algorithm iteratively updates an ensemble of parameters θ_m^(n), where the subscript m = 1, …, M denotes an ensemble member and the superscript n denotes the iteration count. The algorithm uses the ensemble to update the parameters according to the equation

    θ_m^(n+1) = θ_m^(n) + C_θG^(n) (C_GG^(n) + Γ)^{-1} (y − G_T(θ_m^(n))),

where C_GG is the empirical covariance of the ensemble of quantities of interest from model runs, and C_θG is the empirical cross-covariance of the ensemble of parameters and the ensemble of quantities of interest. The covariance matrix of the distribution of differences between realizations of y and G_T(•) is Γ + Σ, which is approximated empirically in the update by Γ + C_GG. When ensemble methods are used as approximate samplers [16,25], additional independent noise is added to y at each iteration and for every ensemble member; however, because we are solving an optimization problem in this calibration phase, such noise is not added here.
We initialize the algorithm by drawing an ensemble of size M = 100 from the assumed prior distributions on the parameters.
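The iteration can be sketched compactly with a toy linear forward map standing in for the GCM's G_T (the model, data, and dimensions below are all illustrative):

```python
import numpy as np

# Sketch of EKI with a toy linear forward map standing in for G_T:
# theta_m <- theta_m + C_thetaG (C_GG + Gamma)^{-1} (y - G(theta_m)).
rng = np.random.default_rng(3)
M, p, d = 100, 2, 3                      # ensemble size, parameters, data dimension

A = rng.normal(size=(d, p))              # toy linear model G(theta) = A theta
theta_true = np.array([0.7, 2.0])
Gamma = 0.01 * np.eye(d)
y = A @ theta_true + rng.multivariate_normal(np.zeros(d), Gamma)

theta = rng.normal(size=(M, p))          # initial ensemble drawn from a prior
for _ in range(10):                      # EKI iterations
    G = theta @ A.T                      # forward-map evaluations, shape (M, d)
    C_thetaG = np.cov(theta, G, rowvar=False)[:p, p:]   # cross-covariance (p x d)
    C_GG = np.cov(G, rowvar=False)       # output covariance (d x d)
    K = C_thetaG @ np.linalg.solve(C_GG + Gamma, np.eye(d))
    theta = theta + (y - G) @ K.T        # ensemble update toward the data
```

Only ensemble statistics enter the update, so no derivatives of the forward model are needed; this is what makes the method applicable to the GCM.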

Emulate: Gaussian Process Emulators (EKI-GP)
In the emulation stage, we learn a surrogate for G_∞ by training a Gaussian process on samples of G_T; the Gaussian process is a natural choice of tool here due to the Gaussianity of the fluctuations of G_T about G_∞ described by the central limit theorem (5). During the calibration stage, with N iterations and an ensemble of size M, we obtain a collection of input-output pairs

    {(θ_m^(n), G_T(θ_m^(n)))},   n = 0, …, N,  m = 1, …, M.

The cloud of points {θ_m^(n)} from the calibration stage (a) spans the prior distribution, as the initial EKI draws are from the prior, and (b) in later iterations, has a high density around the point θ* to which EKI converges. We use regression to train a GP emulator mapping θ to G_T(θ), using the input-output pairs {(θ_m^(n), G_T(θ_m^(n)))}, which are referred to as training points in the context of GP regression. The emulation will be most accurate in regions with more training points, that is, around θ*. This is typically near the true solution θ†, and it is the region where the posterior parameter distribution has high probability; this is precisely where uncertainty quantification requires accuracy. In effect, EKI serves as an effective algorithm for selecting good training points for GP regression.
Gaussian processes emulate the statistics of the input-output pairs, using a Gaussian assumption. Specifically, we learn an approximation of the form

    G_T(θ) ≈ N(G_GP(θ), Σ_GP(θ)).

The approximation is learned from the input-output pairs assuming that the outputs are produced from a mean function G_GP(θ), subject to normally distributed noise defined by a covariance function Σ_GP(θ), both dependent on the parameters. The choice of notation reflects the fact that G_GP(θ) serves to approximate the (unattainable) infinite-time average of the model, G_∞(θ), and Σ_GP(θ) serves to approximate the covariance matrix Σ. Importantly, Σ_GP(θ) is θ-dependent because it also includes the uncertainty of the emulator's approximation at θ (for example, the emulator uncertainty Σ_GP(θ) will be large when θ is far from the inputs {θ_m^(n)} used in training).
The atmospheric quantities from which we learn about model parameters are correlated (e.g., relative humidity or daily precipitation at neighboring latitudes are correlated), resulting in a nondiagonal covariance matrix Σ. Any GP emulator therefore also requires a nondiagonal covariance Σ_GP(θ). We can enforce this by mapping the correlated statistics from the GCM into a decorrelated space, using a principal component analysis on Σ, and then training the GP with the decorrelated statistics to produce an emulator with diagonal covariance Σ̃_GP(θ). We use a tilde to denote variables in the uncorrelated space. To this end, we first decompose Σ as

Σ = V D² Vᵀ.

Here, V is an orthonormal matrix of eigenvectors of the covariance matrix Σ, and D is the diagonal matrix of the square roots of the eigenvalues, that is, of the ordered standard deviations in the basis spanned by the eigenvectors of Σ. We store the outputs from the training pairs as the columns of a matrix Y, with entries Y_kl = (G_T(θ_l))_k, and then change the basis of this matrix into the uncorrelated coordinates,

Ỹ = D⁻¹ Vᵀ Y.

When trained on Ỹ, the GP returns the mean G̃_GP(θ) and the (diagonal) covariance matrix Σ̃_GP(θ). We use tools from scikit-learn [63] to train the emulator. After the diagonalization, we can train a scalar-valued GP for each of the 96 output dimensions, rather than having to train processes with vector-valued output. We construct a kernel by summing an Automatic Relevance Determination (ARD) radial basis function kernel and a white-noise kernel. The ARD kernel is a standard squared exponential kernel in which each input dimension has an independent lengthscale hyperparameter. The white-noise kernel corresponds to regression, rather than interpolation, and its variance reflects the noise level assumed in the regression. We train by learning 4 hyperparameters: the radial basis function variance, a lengthscale for each of the two parameters θ (due to ARD), and the white-noise variance. We train using the input-output pairs of the initial ensemble plus N = 5 subsequent iterations of the EKI algorithm. We use M = 100 ensemble members; thus, the training requires (N + 1) × M = 600 short (30-day) runs of our GCM.
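As an illustration of the decorrelation and emulator-training procedure, the following sketch builds the change of basis Ỹ = D⁻¹ Vᵀ Y and fits one scalar GP per decorrelated output dimension with scikit-learn, using the RBF-plus-white-noise kernel described above. All sizes and data here are toy stand-ins (reduced from 96 outputs and 600 runs for brevity); this is a minimal sketch, not the actual GCM pipeline.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

rng = np.random.default_rng(0)

# Toy stand-ins: n_train (theta, G_T(theta)) pairs with p parameters and
# d output statistics, plus a nondiagonal internal-variability covariance.
n_train, p, d = 60, 2, 8
Theta = rng.normal(size=(n_train, p))              # inputs, one row per run
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)                    # stand-in for Sigma
L = np.linalg.cholesky(Sigma)
# Outputs with correlated noise: column Y[:, l] plays the role of G_T(theta_l).
Y = Theta.sum(axis=1) * np.ones((d, 1)) + 0.1 * L @ rng.standard_normal((d, n_train))

# Decompose Sigma = V D^2 V^T; D holds the standard deviations in the
# eigenvector basis.
evals, V = np.linalg.eigh(Sigma)
D = np.sqrt(evals)

# Change of basis into uncorrelated coordinates: Y_tilde = D^{-1} V^T Y.
Y_tilde = (V.T @ Y) / D[:, None]

# One scalar-valued GP per decorrelated output dimension; ARD squared
# exponential (one lengthscale per input) plus white noise (regression,
# not interpolation): variance + 2 lengthscales + noise = 4 hyperparameters.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(p)) + WhiteKernel(1.0)
gps = [GaussianProcessRegressor(kernel=kernel).fit(Theta, Y_tilde[k])
       for k in range(2)]                          # first 2 of d dims, for brevity
mean0, std0 = gps[0].predict(Theta[:1], return_std=True)
```

In the decorrelated basis the internal-variability covariance becomes the identity, which is why independent scalar GPs per output dimension are justified.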
We continue using the uncorrelated basis in the sampling stage; where required, we transform the output of the emulator back into the correlated basis via

G_GP(θ) = V D G̃_GP(θ),    Σ_GP(θ) = V D Σ̃_GP(θ) D Vᵀ.

Sample: MCMC Sampling with a Gaussian Process Emulator
To quantify uncertainties, we use MCMC to sample the posterior distribution of parameters with the GP emulator. The primary reason for using the GP emulator goes back to the seminal paper [67] and concerns the fact that it can be evaluated far more quickly than the GCM at a point in parameter space; this is important because a typical MCMC run requires more than 10^5 evaluations of the likelihood P(y | θ) to sample the posterior distribution of parameters given data. However, the emulator is also important for two additional reasons: (i) it naturally includes the approximation uncertainty (within Σ̃_GP) of using an emulator; (ii) it smooths the likelihood function, because we work with an approximation of (7) based on the smooth G_∞, rather than (8) based on the noisy G_T; as a result, MCMC is less likely to get stuck in local extrema.
Recall that we trained the GP in uncorrelated coordinates. Within MCMC, one can either map back into the original coordinates or continue working in the uncorrelated space. We choose to continue working in the uncorrelated space, and so we map each data realization y into this space: ỹ = D⁻¹ Vᵀ y. In the Gaussian likelihood, we can use the GP emulated mean G̃_GP(θ) and covariance matrix Σ̃_GP(θ) as surrogates for the map G_∞ and the internal variability covariance matrix Σ (after passing to the uncorrelated coordinates). That is, we approximate the Bayesian posterior distribution as

P(θ | ỹ) ∝ det(Γ̃_GP(θ))^(−1/2) exp( −(1/2) ⟨ỹ − G̃_GP(θ), Γ̃_GP(θ)⁻¹ (ỹ − G̃_GP(θ))⟩ ) P(θ).    (9)

Here, Γ̃_GP(θ) = Σ̃_GP(θ) + D⁻¹ Vᵀ ∆ V D⁻¹ is the GP approximation of Γ = Σ + ∆ in the uncorrelated coordinates. We include the (often overlooked) log-determinant term, arising from the normalization constant, because Γ̃_GP depends on θ. In the transformed parameter space, our prior P(θ) is also Gaussian and can therefore be factored inside the exponential, adding a quadratic penalty to the negative log-likelihood. The MCMC objective function is then defined to be the negative logarithm of the posterior distribution (9); explicitly, given a Gaussian prior N(m, C) on the parameters, we define it as

Φ_MCMC(θ) = (1/2) ⟨ỹ − G̃_GP(θ), Γ̃_GP(θ)⁻¹ (ỹ − G̃_GP(θ))⟩ + (1/2) log det Γ̃_GP(θ) + (1/2) ⟨θ − m, C⁻¹ (θ − m)⟩.

Evaluation of Φ_MCMC requires only the mean and covariance from the GP emulator. Furthermore, Φ_MCMC is smooth and so is suitable for use within an MCMC algorithm to generate samples from the approximate posterior distribution of the parameters. [17] contains further discussion of MCMC using GPs to emulate the forward model, including situations where the data come from finite-time averages but the emulator is designed to approximate the infinite-time-horizon forward model.
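In code, Φ_MCMC is a few lines once the emulator's mean and covariance are available. The sketch below assumes hypothetical callables gp_mean and gp_cov standing in for G̃_GP and Σ̃_GP; it includes the log-determinant term and the Gaussian prior penalty.

```python
import numpy as np

def phi_mcmc(theta, y_tilde, gp_mean, gp_cov, Delta_tilde, m, C):
    """Negative log-posterior in the uncorrelated coordinates.

    gp_mean(theta): emulated mean; gp_cov(theta): emulator covariance
    (diagonal in the decorrelated basis); Delta_tilde: observational-noise
    covariance mapped into the same basis; m, C: Gaussian prior parameters.
    """
    r = y_tilde - gp_mean(theta)
    Gamma = gp_cov(theta) + Delta_tilde            # Gamma_GP(theta)
    _, logdet = np.linalg.slogdet(Gamma)           # log-determinant term
    misfit = 0.5 * r @ np.linalg.solve(Gamma, r)
    prior = 0.5 * (theta - m) @ np.linalg.solve(C, theta - m)
    return misfit + 0.5 * logdet + prior
```

Because only gp_mean and gp_cov are evaluated, each call is cheap, which is what makes the 10^5-sample MCMC runs feasible.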
We use the random walk Metropolis algorithm for MCMC sampling. The priors chosen were the same physics-informed priors used to initialize EKI. We also choose the proposal distribution to be Gaussian, with covariance proportional to the prior covariance. The MCMC run consists of a burn-in of 10,000 samples followed by 190,000 samples.
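A random walk Metropolis sampler of the kind used here can be sketched in a few lines. This is a generic implementation targeting exp(−Φ) for any objective Φ, with a Gaussian proposal and a burn-in period; the proposal covariance scale controls the acceptance rate. The usage at the bottom targets a toy standard normal, not our posterior.

```python
import numpy as np

def random_walk_metropolis(phi, theta0, prop_cov, n_samples, n_burn, rng):
    """Random walk Metropolis targeting exp(-phi(theta)).

    Gaussian proposals with covariance prop_cov; the first n_burn
    samples are discarded as burn-in.
    """
    L = np.linalg.cholesky(prop_cov)
    theta, f = np.asarray(theta0, dtype=float), phi(theta0)
    samples, accepted = [], 0
    for i in range(n_burn + n_samples):
        proposal = theta + L @ rng.standard_normal(theta.size)
        f_prop = phi(proposal)
        # Accept with probability min(1, exp(f - f_prop)).
        if np.log(rng.uniform()) < f - f_prop:
            theta, f = proposal, f_prop
            accepted += 1
        if i >= n_burn:
            samples.append(theta.copy())
    return np.array(samples), accepted / (n_burn + n_samples)

# Usage: sample a 2D standard normal, phi(theta) = 0.5 * |theta|^2.
rng = np.random.default_rng(42)
samples, acc = random_walk_metropolis(lambda th: 0.5 * th @ th,
                                      np.zeros(2), np.eye(2),
                                      20000, 2000, rng)
```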

Benchmark Gaussian process (B-GP)
The performance of any emulator depends on the training points. Since we use an adaptive procedure (EKI) to concentrate the training points, which is the novel approach introduced in [17], we also train a benchmark emulator to compare our results with those of more traditional, brute-force approaches to emulation.
For this purpose, we use a GP emulator trained on a uniform set of points. Even in two dimensions, it is prohibitively costly for this set to span the support of the prior distribution.

Results
To demonstrate the dependence of the parameter uncertainty on the realization of the (inflated) synthetic data, we repeat the experiments four times, once with each of the four realizations shown in Figure 4. We denote these four sets of data y(1), . . ., y(4).

Calibrate: Ensemble Kalman Inversion
We use the first 6 iterations of EKI in the CES training process. The initial ensemble is spread over the whole parameter space but collapses within a few iterations near the true parameter values, to within 10% error in θ_RH and 30 minutes error in θ_τ (Figure 5). That is, the algorithm evolves toward consensus and toward the true solution. Biases arise from the realization of internal variability and the realization of the observational noise in each y(•).
To check for EKI convergence, we evaluate an additional 4 EKI iterations, for a total of 10 (labeled 0 to 9). At each iteration n, we compute residuals of the ensemble mean for each realization of the synthetic data y(1), . . ., y(4) created at the true parameters θ†, weighting the residuals by the covariance matrix Γ of the synthetic data. Figure 6(a) shows the residual as a function of EKI iteration. The residual decreases quickly over the first few iterations before plateauing for subsequent iterations. Figure 6(b) shows standard deviations of the ensemble of parameters. The standard deviations decrease monotonically from iteration to iteration, reflecting the evolution toward consensus. The behavior is qualitatively similar for all realizations; quantitative differences reflect different realizations of internal variability in the different data realizations. This behavior reflects the fact that EKI is an optimization method for calibrating parameters: it is not constructed to learn uncertainty but rather to reach consensus around a single parameter value that makes the misfit small [71].
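For concreteness, one EKI iteration can be sketched as follows: each ensemble member is moved along the empirical parameter-output cross-covariance toward the data. This is a generic, textbook-style update with optional perturbed observations, not necessarily the exact variant used in our experiments; the linear test model at the bottom is purely illustrative.

```python
import numpy as np

def eki_update(thetas, Gs, y, Gamma, rng, perturb=True):
    """One ensemble Kalman inversion step.

    thetas: (M, p) parameter ensemble; Gs: (M, d) forward-model outputs
    G(theta_j); y: (d,) data; Gamma: (d, d) noise covariance.
    """
    M = thetas.shape[0]
    dth = thetas - thetas.mean(axis=0)
    dg = Gs - Gs.mean(axis=0)
    C_tg = dth.T @ dg / M                  # parameter-output cross-covariance
    C_gg = dg.T @ dg / M                   # output covariance
    new = np.empty_like(thetas)
    for j in range(M):
        yj = y + (rng.multivariate_normal(np.zeros(y.size), Gamma)
                  if perturb else 0.0)
        new[j] = thetas[j] + C_tg @ np.linalg.solve(C_gg + Gamma, yj - Gs[j])
    return new

# Illustrative linear inverse problem: G(theta) = A theta.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta_true = np.array([1.0, -1.0])
y = A @ theta_true
Gamma = 0.01 * np.eye(3)
thetas = rng.normal(size=(50, 2))
spread0 = thetas.std(axis=0).max()
for _ in range(10):
    thetas = eki_update(thetas, thetas @ A.T, y, Gamma, rng)
```

Running the loop reproduces the qualitative behavior described above: the ensemble mean moves toward the true parameters while the ensemble spread shrinks (collapse toward consensus).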

Emulate: Validation
Figure 7 shows the parameter values used as training points for the EKI-GP and B-GP. We use the first 6 EKI iterations (i.e., 600 training points) for training. These are plotted over the associated objective function Φ_MCMC. The panels in the left column correspond to the EKI-GP using data vectors y(1), . . ., y(4). We see that training EKI-GP at the marked training points (black dots) leads to objective functions Φ_MCMC that are unimodal, with a minimum near a significant number of the training points; there are also training points that fall outside of the plotting domain (see Figure 5 for their extent). The right column of Figure 7 shows the benchmark grid for B-GP, which we use as a comparison for the EKI-GP method; the contours of Φ_MCMC were calculated using the same realization as their counterpart EKI-GPs. We see that for each realization the EKI-GP and B-GP produce objective functions that are qualitatively similar in terms of the magnitude of the minimum, the location of the minimum, and the approximate shape of the objective function; the quantitative differences are accounted for by the differing geometry and density of training points (and hence a difference in approximation uncertainty). In both settings, the objective function Φ_MCMC is smooth because the GP smoothly approximates G_∞.
EKI-GP shows similar results for the objective function as B-GP, at a fraction of the computational effort. B-GP is far less practical as a methodology than is EKI-GP because it does not scale well to high-dimensional parameter spaces; it requires many more training points than EKI-GP. Even in our two-dimensional experiments, we needed to use posterior information to reduce the number of training points for B-GP by a factor of 20. The B-GP comparison is included simply to demonstrate that EKI-GP achieves comparable results to those achieved by means of traditional emulation.
We validate the emulator approximation to the data by making a prediction at the true parameters θ†. We display G_GP(θ†) and the 95% confidence intervals computed using the variance from Σ_GP(θ†) in Figure 8 for EKI-GP, and in Figure 9 for B-GP. The rows of Figure 8 correspond to the EKI-GP results for y(1), . . ., y(4). In both figures we also show the statistics of 600 30-day samples from the control simulation at θ†. Both the mean and 95% confidence intervals of all EKI-GP emulators (orange line and ribbon) closely match the statistics from the GCM runs (blue dots and error bars), as does the prediction from the B-GP (dark red line and ribbon). The training data are sufficient to ensure that the predicted 95% confidence intervals from the emulators do not include unphysical values (such as negative precipitation rates or relative humidities outside [0, 1]).

Sample: MCMC Sampling
We use an MCMC algorithm to generate a set of samples from the posterior distribution with the help of the GP emulator. We choose the random walk step size (which multiplies the covariance in the proposal) at the start of a run to achieve proposal acceptance rates near 25%. (This is near-optimal in a precise sense for certain high-dimensional posteriors [66]; in practice, it works well beyond this setting.) All sampling is performed in the transformed space, where the prior distribution is normal. Figure 10 shows kernel density estimates of the MCMC results; the panels in the left column are for EKI-GP (for y(1), . . ., y(4)), and the panels in the right column are for B-GP for the same data realizations. We display contours of the posterior that contain 50%, 75%, and 99% of the mass of the posterior density.

Table 1: Average standard deviations of parameters from EKI and MCMC experiments over y(1), . . ., y(4).
All sets of results converge to similar regions of the parameter space about the true parameters, and the spread of uncertainty is quantified similarly in both EKI-GP and B-GP. Table 1 shows the standard deviations of the individual parameters alongside the empirical standard deviation calculated from the ensemble spread in EKI iteration 9. The standard deviations from the MCMC posteriors based on EKI-GP and B-GP are similar. We re-emphasize that EKI is constructed as an ensemble optimization method and has the property that the ensemble evolves towards consensus among the parameters, while also matching the data: the ensemble collapses. As a result, the EKI ensemble spread is an inadequate estimate of the uncertainty, as seen in Table 1. As explained in the introduction, ensemble methods are only justifiable for quantifying uncertainties in the Gaussian posterior setting [16,25]. Our approach is justifiable whenever the GP accurately approximates the forward model [17]. The use of EKI for the design of training points for the GP does not require accurate uncertainty quantification within EKI; it relies only on EKI approximately locating minimizers of the model-data misfit.
There is sampling variability because of the different data realizations. This sampling variability can be assessed by asking which probability contours contain the true parameters. For both EKI-GP and B-GP, the true values are captured within 50% of the posterior probability mass in three of the four realizations; for the realization y(3), they are captured only within the 99% contour of the posterior probability.

Uncertainty Quantification in Prediction Experiments
To illustrate how the posterior distribution of parameters obtained in the sample step of the CES algorithm can be used to produce climate predictions with quantified uncertainties, we consider an idealized global-warming experiment. As in [58,59], we rescale the longwave opacity of the atmosphere everywhere by a uniform factor α. In the control climate we have considered so far, α = 1. We generate a warm climate by setting α = 1.5, which results in a global-mean surface air temperature increase from 287 K in the control climate to 294 K in the warm climate. To see parametric uncertainty rather than internal variability noise in the resulting "climate change predictions," we use long (7,200-day, or approximately 20-year) averages in the prediction experiments.
We evaluate predictions of the latitude-dependent relative humidity and mean precipitation rate that we used in the CES algorithm. We also consider the frequency of precipitation extremes, now taken as the frequency with which the 99.9th percentile of daily precipitation in the control simulation is exceeded (rather than the 90th percentile we considered earlier). This last statistic indicates how the frequency of what are 1-in-1000-day precipitation events in the control climate changes in the warmer climate.
We investigate the effect of parametric uncertainty on predictions by taking 100 samples of parameters from the posterior, creating a prediction for each sample, and comparing statistics of these runs with runs in which parameters are fixed to the true values θ†. The climate statistics in the control climate are shown in the left column of Figure 11. The runs from posterior samples (orange) and with fixed true parameters (blue) match well. The noise due to internal variability is quantitatively represented by the blue shaded region. Unlike in the earlier figures with short (30-day) averages (e.g., Figure 9), the internal variability noise here is small relative to the parametric uncertainty because of the long (7200-day) averaging window. The orange shaded region contains both internal variability and parametric uncertainty and is dominated by parametric uncertainty. This remains the case in the warmer climate (right column of Figure 11).
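The percentile bands shown in Figure 11 are straightforward to compute once predictions from posterior samples are available. The following sketch uses a toy, hypothetical forward map in place of the GCM (the real predictions require one long model run per posterior sample); only the percentile computation mirrors our procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: 100 posterior parameter samples and a toy
# forward map giving a latitude-dependent statistic per sample.
n_samples, n_lat = 100, 64
posterior_thetas = rng.normal([0.7, 2.0], [0.02, 0.3], size=(n_samples, 2))
lat = np.linspace(-90.0, 90.0, n_lat)
predictions = np.array([th[0] * np.cos(np.deg2rad(lat)) + 0.01 * th[1]
                        for th in posterior_thetas])   # (n_samples, n_lat)

# Median and 95% band (2.5th to 97.5th percentiles) across samples,
# computed separately at each latitude.
median = np.percentile(predictions, 50.0, axis=0)
lo, hi = np.percentile(predictions, [2.5, 97.5], axis=0)
```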
The effects of global warming on atmospheric quantities are seen by comparing the two columns of Figure 11. Relative humidity is fairly robust to the warming climate, while precipitation rates increase globally [59]. The most dramatic changes occur in the frequency of extreme precipitation events [61]. What is a 1-in-1000-day event in the control climate (i.e., occurring with frequency 0.001) occurs in the extratropics of the warmer climate an order of magnitude more frequently, with the 95% confidence interval spanning 0.01 to 0.03. That is, a 1-in-1000-day event in the control climate occurs every 30 to 100 days in the warmer climate. The parametric uncertainty is particularly large for extreme precipitation events within the tropics, behavior one would not be able to see in global-warming experiments with fixed parameters. This is consistent with the known high uncertainty in predictions of tropical rainfall extremes with comprehensive climate models [60].

Conclusion and Discussion
The primary goal of this article was to demonstrate that ensemble Kalman inversion (EKI), machine learning, and MCMC algorithms can be judiciously combined within the calibrate-emulate-sample framework to efficiently estimate uncertainties of model parameters in computationally expensive climate models. We provided a proof-of-concept in a relatively simple idealized GCM.
Our approach is novel because we train a machine learning (GP) emulator using input-output pairs generated from an EKI algorithm. This methodology has several advantageous features:
1. It requires a modest number of runs of the expensive forward model (typically, O(100) runs).
2. It generally finds optimal or nearly optimal parameters even in the presence of internal variability noise, because EKI is robust with respect to such noise.
3. The resulting GP emulation is naturally most accurate around the (a priori unknown) optimal parameters, because this is where EKI training points concentrate.
4. MCMC shows robust convergence to the posterior distribution and allows identification of the optimal parameters with the maximum of the posterior probability, because it uses an objective function that is smoothed by GP emulation.
The effectiveness of the GP depends on the training points, and a user must choose how many iterations of EKI to use for training (before ensemble collapse). In practice, we find that GP performance is robust as long as we include the initial iteration of training points (drawn from the prior) in our training set. The necessity of using the initial ensemble could be sidestepped by using an ensemble method that does not collapse, such as the recently introduced ensemble Kalman sampler (EKS) [33].
The CES algorithm is efficient because it addresses two dominant sources of computational expense. First, poor prior knowledge of model parameters requires blind exploration of a possibly high-dimensional parameter space to find optimal parameters and thus the region of high posterior probability. The CES framework handles this with an EKI algorithm, which we show to be successful when using time-averaged data from a chaotic nonlinear model. Second, computing parametric uncertainty with a sampling technique (such as MCMC) generally requires many (10^5-10^6) evaluations of an expensive forward model. We instead solve a cheap approximate problem by exploiting GP emulators. We train the emulators on relatively few (O(100)) intelligently chosen evaluations provided by EKI, which ensures that training points are placed where they are most needed: near the minimum of the model-data misfit. The training itself introduces negligible computational cost relative to running the forward model, and the computational expense of evaluating the emulator in the sampling step is also negligible. Hence, the CES framework achieves about a factor 1000 speedup over brute-force MCMC algorithms. Significant efforts to accelerate brute-force MCMC without approximation have been undertaken [45,82], and speedups of up to a factor of 5 have been achieved with adaptive and parallelized Markov chains. However, these approaches are still considerably more expensive than the CES algorithm.
The CES algorithm also has a smoothing property, which is beneficial even in situations where a forward model is cheap enough to apply brute-force MCMC. If the forward model exhibits internal variability, the objective function for the sampling algorithm will contain a data misfit of the form (8), which has a random component because it contains a finite-time average. Without more sophisticated sampling methods, MCMC algorithms get stuck in noise-induced local minima. In the CES algorithm, only EKI uses the functional (8), and EKI is well suited for this purpose. The GP emulator learns the smooth, noiseless model G_∞ (in which internal variability disappears), using evaluations of G_T (which are affected by internal variability). Thus, MCMC within the CES algorithm uses the smooth GP approximation of (7).
The MCMC results in this study successfully capture the true parameters and their uncertainties. The results contain natural biases arising from the use of prior distributions, internal variability of the climate, and the use of a single noisy sample as synthetic data. Despite the sampling variability and emulator constraints, our MCMC samples captured the true parameters within an estimated 99% confidence interval in our examples, demonstrating the potential of EKI-trained GP emulators for MCMC sampling. Validation of the emulator (Figure 8) further supports the MCMC results, as do our comparisons with MCMC using the benchmark emulator (Table 1). The GP emulator both smooths the objective function and allows us to quantify uncertainty by sampling from the posterior distribution. This contrasts with uncertainty quantification based on the EKI ensemble, which underestimates the true uncertainties in our experiments by an order of magnitude. As used here, EKI should be viewed as an optimization algorithm, not a sampling algorithm. Adding additional spread within EKI to match the posterior is achievable for Gaussian posteriors [16,25] or by means of EKS [33]; however, these methods are not justifiable beyond the Gaussian setting. The MCMC algorithm within CES, on the other hand, samples from an approximate posterior distribution and is justifiable beyond the Gaussian posterior setting [17].
The success of the CES methodology is founded on ensemble Kalman methods and the cheap, derivative-free manner in which they can learn states or, as in the setting of this paper, parameters from data. We emphasize that significant work already exists showing, empirically, that ensemble Kalman methods can be generalized to perform uncertainty quantification beyond the Gaussian setting [7,37,52,69,8,21,22] by means of iterated EnKF, which may be motivated via the Newton method [69], and also by means of ensemble-based Gauss-Newton [68] in the presence of model error. However, it is unclear how this methodology might be rigorously justified and developed as a general-purpose tool, despite the empirical successes it has shown on a number of challenging problems. In contrast, CES may be systematically refined, based on the theory of Gaussian process emulation [90] and its use within Bayesian inference [84]. CES does not depend on the UQ properties of ensemble methods, only on their optimization and estimation properties, which are amenable to systematic analysis [71,49,20]. For these reasons, CES has considerable potential as a systematic approach to uncertainty quantification.
Good scaling of the CES algorithm with the dimension of the parameter space will be of critical importance for moving beyond the current, low-dimensional proof-of-concept setting. Each stage of the CES algorithm scales to higher dimensions: For the calibration stage, ensemble methods scale well to high-dimensional state and parameter spaces, typically with O(10^2) forward model runs [46,62], if used with localization. In high dimensions, regularization is also typically needed in calibration algorithms, and various regularization schemes can be added [15,42,43,32,79]. The sampling stage also scales well to high dimensions. In particular, MCMC methods scale to nonparametric (infinite-dimensional) problems, in which the unknown is a function [18]. The current bottleneck for scalability lies with the GP emulator, which does not scale easily to high-dimensional inputs. However, other supervised machine learning techniques have the potential to do so. For example, random feature maps and deep neural networks show promise in this regard [55,5]; incorporating these tools in the CES algorithm is a direction of current research.
An alternative approach to constraining parameter uncertainty is history matching, or precalibration [86,24,91]. The idea complements that of Bayesian uncertainty quantification: instead of searching for a high-probability region of parameter space with respect to data, one rules out regions of parameter space that are deemed inconsistent with the data. [19] and [40] recently constrained the parameter space of a parameterization scheme by approximating a plausibility function over the parameter space using a Gaussian process, and then removing "implausible" regions of parameter space where the plausibility function exceeds a threshold. This removal process is iterated until the uncertainty of the emulator is small enough, or the space becomes empty. History matching accomplishes an adaptivity task similar to that performed in the CES algorithm by EKI. During early stages of history matching, however, one must sample the full parameter space with reasonable resolution, and emulator training is required at every iteration to evaluate the plausibility function. In contrast, in the CES algorithm, EKI draws a modest number of samples at every iteration and can work directly with noisy model evaluations, lowering the computational expense. The output of history matching is a (possibly empty) acceptable set of forward model runs; sampling this set leads to an upper bound on the prediction uncertainty. The benefit of the CES algorithm is that it provides samples of the posterior distribution, which lead to full estimates of prediction uncertainty (Figure 11). For this reason, history matching has been proposed as a preprocessing step for Bayesian uncertainty quantification, known as precalibration, to improve priors and assess model validity [86,24]. The CES algorithm targets the Bayesian posterior distributions directly.
In the more comprehensive climate modeling settings we target, data will be given from Earth observations and from local high-resolution simulations [75]. In these settings, model error leads to deficiencies when comparing model evaluations to data, producing structural biases and uncertainty that must be quantified. Structural model errors can be quantified with a flexible hierarchical Gaussian process regression that learns a nonparametric form of the model deficiency, as demonstrated in prototype problems in [78]. This approach represents model error in an interpretable fashion, as part of the model itself, rather than in the data space as pioneered in [50].
The CES framework has potential for both the calibration and uncertainty quantification of comprehensive climate models and other computationally expensive models. It is computationally efficient enough to use data averaged in time (e.g., over seasons), which need to be accumulated over longer model runs. Time-averaged climate statistics, including mean values and higher-order statistics such as extreme value statistics, are what typically matters in climate predictions. CES allows us to focus model calibration and uncertainty quantification on such immediately relevant statistics. Using time-averaged statistics also has the advantage that it leads to smoother, albeit still noisy, objective functions compared with calibration of climate models by minimizing mismatches in instantaneous, short-term forecasts [75]. The latter approach can improve short-term forecasts but may not translate into improved climate simulations [73]. It also suffers from the difficulty that model resolution and data resolution may be mismatched. Focusing on climate statistics, as we did in our proof-of-concept here, circumvents this problem: time-aggregated climate statistics vary relatively smoothly in space, and hence minimizing mismatches in statistics between models and data does not suffer from the resolution-mismatch problem. CES can be used to learn about arbitrary parameters in climate models from time-averaged data. It leads to quantification of parametric uncertainties that can then be converted into parametric uncertainties in predictions by sampling from the posterior distribution of parameters.

Figure 1: Zonal average of relative humidity averaged over one month. The black line shows the level at which data were extracted for computing data misfit functions.

Figure 2: Long-term mean values of synthetic data. (a) Free-tropospheric relative humidity. (b) Total daily precipitation rate (solid) and its contributions from convection (dashed) and grid-scale condensation (dotted). (c) Probability of daily precipitation exceeding the 90th percentile (which is trivially 10% in this case).

Figure 3: Correlation matrix associated with the internal variability estimated from the control simulation, plotted to illustrate the structure of Σ. The matrix blocks labeled (RH, Pr, Ext) are associated with the observed relative humidity, precipitation, and extreme precipitation.

Figure 4: Four noisy realizations of the synthetic data, plotted in color over the underlying mean (grey circles) and 95% confidence intervals from Γ(θ†) (grey bars). (a) Relative humidity. (b) Daily precipitation rate. (c) Probability of daily precipitation exceeding the 90th percentile of the long-term mean data.
Instead, we use knowledge of the location and size of the posterior distribution to place a uniform grid of 40 × 40 = 1600 training points over [−1.25, −0.5] × [8.0, 10.0] in the transformed parameter space. This corresponds to [0.62, 0.77] × [0.83 h, 6.12 h] in the untransformed parameter space and captures the region of high posterior probability. The use of posterior knowledge here reduces the number of training points by a factor of 20 compared to spanning 95% of the prior mass; no such posterior information is used in the CES algorithm, which automatically places training points where they are needed. The benchmark emulator uses the same kernel and training setup as in Section 4.4, and we use the trained emulator in MCMC experiments in the same way as described in Section 4.5. To distinguish the two methods, we refer to the EKI-trained GP as EKI-GP and the benchmark (traditionally trained) GP as B-GP.

Figure 5: EKI ensemble at iterations 0 to 5, displayed as particles in parameter space. Left column: all members; right column: zoom-in near the true parameter values. Each row represents optimization with a different data vector y(i) from Figure 4. The (initial) prior ensemble 0 is highlighted in dark grey, and the final ensemble 5 is highlighted in pink. The intersection of the dashed blue lines represents the true parameter values used to generate observational data from the GCM.

Figure 6: Convergence behaviour tests over 9 iterations of EKI for each realization of the data. The vertical dashed line marks the final iteration of Figure 5; we also show the behaviour of 4 further iterations. (a) Ensemble-mean residuals relative to synthetic data for each EKI iteration. (b) Standard deviation of the ensemble for the relative humidity parameter (circles) and timescale parameter (triangles) for each realization.

Figure 7: Training points for the GP emulators (EKI-GP and B-GP), plotted over the objective function Φ_MCMC for different data realizations y(1), . . ., y(4) (rows). Left column: particles representing members of the first 6 EKI iterations. Right column: grid (uniform in the transformed parameters) used to train the benchmark Gaussian process. In both cases, some additional training points fall outside of the plotting domain.

Figure 8: Comparison between the GCM statistics at the true parameters θ† and the trained EKI-GP emulator at θ†. The four rows correspond to using EKI with the data vectors y(1), . . ., y(4). Blue: GCM mean (dots) averaged over 600 30-day runs, with the error bars marking a 95% confidence interval from variances on the diagonal of Γ. Orange: predicted mean (line) and 95% confidence interval (shaded region) produced by the GP emulator.

Figure 9: Comparison between the GCM statistics at the true parameters θ† and the trained B-GP emulator predictions at θ†. Blue: GCM mean (dots) averaged over 600 30-day runs, with the error bars marking a 95% confidence interval from variances on the diagonal of Γ. Dark red: predicted mean (line) and 95% confidence interval (shaded region) produced by the B-GP emulator.

Figure 10: Density plot of MCMC samples of the posterior distribution. The contours are drawn to contain 50%, 75%, and 99% of the distribution generated from the samples. The left column shows distributions learned using EKI-GP at y(1), . . ., y(4), and the right column using B-GP at the same data realizations. The blue dot represents the true parameters, while the red plus is the average across particles in the 6th EKI iteration.

Figure 11: Comparison of statistics of a 7200-day average in a climate-change simulation. Left column: control climate; right column: warmer climate. Synthetic observational data evaluated at the true fixed parameters are shown in blue, while data evaluated at 100 samples from the posterior distribution (EKI-GP) are shown in orange. (We choose the posterior from the first data realization, top-left panel of Figure 10.) The solid lines are the medians, and the shaded regions represent the 95% confidence intervals (between the 2.5th and 97.5th percentiles). Top: Relative humidity in the mid-troposphere. Middle: Precipitation rate. Bottom: Frequency with which the 99.9th percentile of latitude-dependent daily precipitation in the control climate is exceeded.