Autocalibration of the E3SM Version 2 Atmosphere Model Using a PCA‐Based Surrogate for Spatial Fields

Global Climate Model tuning (calibration) is a tedious and time‐consuming process, with high‐dimensional input and output fields. Experts typically tune by iteratively running climate simulations with hand‐picked values of tuning parameters. Many, in both the statistical and climate literature, have proposed alternative calibration methods, but most are impractical or difficult to implement. We present a practical, robust, and rigorous calibration approach on the atmosphere‐only model of the Department of Energy's Energy Exascale Earth System Model (E3SM) version 2. Our approach can be summarized into two main parts: (a) the training of a surrogate that predicts E3SM output in a fraction of the time compared to running E3SM, and (b) gradient‐based parameter optimization. To train the surrogate, we generate a set of designed ensemble runs that span our input parameter space and use polynomial chaos expansions on a reduced output space to fit the E3SM output. We use this surrogate in an optimization scheme to identify values of the input parameters for which our model best matches gridded spatial fields of climate observations. To validate our choice of parameters, we run E3SMv2 with the optimal parameter values and compare prediction results to expertly‐tuned simulations across 45 different output fields. This flexible, robust, and automated approach is straightforward to implement, and we demonstrate that the resulting model output matches present day climate observations as well or better than the corresponding output from expert tuned parameter values, while considering high‐dimensional output and operating in a fraction of the time.


Introduction
Model tuning, also referred to as model calibration, is the science (and art) of matching the model predictions to observed data by adjusting model parameters after the model configuration has been fixed Hourdin et al. [2017].Improvements in model tuning can contribute to improvements in model predictions of the past, present, and future, and a better understanding of uncertainty.
Earth Systems Models (ESMs) are some of the most computationally-expensive and complex computer models requiring tuning.When tuning ESMs, the goal is to obtain model output that is consistent with our observations of the Earth; in many cases, these observations consist of different time-averaged spatial fields, which may have hundreds or thousands of spatial grid points.In this setting, the input parameters govern aspects of physical processes and typically are on the order of 10's to 100's for an ESM.For tuning, the set of adjusted model parameters is down-selected to 10's of parameters or less to make the problem tractable.Traditionally, "hand-tuning" for Earth System Models is a slow and tedious process in which experts select a set of parameters, manually adjusts them, and evaluates their effects on the simulation.The difficulty, in part, is due to the aforementioned complexity of the models.Furthermore, "hand-tuning" is not formalized by a set of rules, so a different expert will likely derive a different solution.
Alternative approaches to model tuning by hand have been and continue to be proposed, but many are still challenged by scalability.For example, Jackson et al. [2004] and Jackson et al. [2008] proposed the Multiple Very Fast Simulated Annealing Algorithm to efficiently identify the regions of the parameter space that minimize differences between predictions and observations within a Bayesian framework.This, however, still relies on evaluating the computer model at various parameter values sequentially, rather than simultaneously, which leads to a slower tuning process.An increasingly popular approach, due to the surge in machine learning tools, is to build an "emulator" or "surrogate" which maps the input parameter space to the output space using a perturbed parameter ensemble.Apart from the generation of the training data, which is the most time-consuming part of the process, the emulator or surrogate can typically run orders of magnitude faster.Whereas a single E3SM simulation run takes days to run, a surrogate model can typically produce output in a fraction of a second.This is the approach we will take here.
In the statistical literature, a classical calibration solution Kennedy and O'Hagan [2001] uses a Gaussian Process (GP) emulator within a Bayesian framework to account for known sources of uncertainty, e.g., discrepancy between the surrogate and the climate model, or between the climate model and observational data.Their method launched many proposed GP-based calibration approaches, yet it was employed in a simplified setting and requires computationallyexpensive iterative retraining of the surrogate model.Higdon et al. [2013] demonstrate a multi-step ensemble Kalman filtering approach for high-dimensional model output, which also leverages GP emulators and accounts for model discrepancy from the Community Atmosphere Model (CAM).Dunbar et al. [2021], Berdahl et al. [2021], and Beusch et al. [2021] have all utilized a Gaussian process emulator approach for the calibration of an idealized GCM and for the Community Ice Sheet Model (CISM), respectively.The "Calibrate, Emulate, Sample" approach of Cleary et al. [2021] leverages GPs as emulators within an approximate Bayesian learning framework to ease the computational burden of traditional Bayesian frameworks.Recently, proposed autocalibration approaches are also leaning into approaches with more of a machine learning flavor.Fletcher et al. [2022] proposes a convolutional neural network as the emulator for multiple spatial fields, but this has not yet been implemented fully within a calibration framework.For a thorough literature review of climate model emulators that could be used for calibration, see Chowdhary et al. (2021).
Building on this body of work, we propose a flexible and scalable framework for the automated calibration of the atmospheric model of E3SM, which accounts for uncertainty in the parameter space and spatial variability of highdimensional time-averaged spatial fields.Similar to the approaches above, we rely on a surrogate model for calibration.Instead of the popular GP surrogates, we propose to use polynomial chaos expansions (PCE), as in Ricciuto et al. [2018], on a reduced space.Our approach fills a need for more automated training tools and meta-learning approaches for surrogate construction that can adapt to data-rich or data-sparse environments.The general calibration framework proposed allows the user to select their choice of surrogate model (i.e., alternatives to PCE) and loss function and is intended to be easy to implement and understandable by any user.In the case of a very sparse data setting, it is possible and recommended to combine expert decision making with this approach for the greatest impact.
Our proposed framework is applied to the atmosphere model of E3SMv2, resulting in an a optimized set of 5 parameters that have substantial influence on the 45 output fields we studied.We validate these parameters using an E3SMv2 simulation, and find that the optimized set results in a simulated climate that reduces the root-mean-squared-error (RMSE) relative to our observational targets for a majority of our targeted fields, as compared to hand-tuned E3SMv2.Many biases of the hand-tuned E3SMv2 persist in the autocalibrated E3SMv2; these could likely be reduced by tuning a larger set of E3SM parameters or by improving the structure of the E3SM model.The following is a brief outline of the paper.Section 2 describes how we generated the training data for our surrogate, section 3 steps through the surrogate model fitting and inverse optimization over a loss function, and section 4 demonstrates our approach in the context of E3SMv2.Lastly, in section 5 we discuss special considerations and future directions of our approach.

Perturbed Parameter Ensemble
To train our surrogate, we generate an ensemble of 250 10-year simulations using the Energy Exascale Earth System Model version 2 (E3SMv2) [Golaz et al., 2022].The simulations are configured with active atmosphere, land, and river components, while ocean surface temperature and sea ice extent are prescribed to a monthly-varying present-day observational climatology for years 2005 to 2014 Durack and Taylor [2018].This configuration is commonly referred to as "atmosphere only" even though the land and river models are active.The atmosphere model has approximately Five atmospheric input parameters (see table 1) were adjusted using Latin Hypercube Sampling McKay et al. [1979].
The selected parameters come from the shallow convection and stratiform cloud parameterization Golaz et al. [2002], Larson and Golaz [2005], the deep convection parameterization Zhang and McFarlane [1995], Xie et al. [2019], and the cloud microphysics parameterization Gettelman and Morrison [2015].Parameterizations in ESM's represent one part of a simplification of a physical process that occurs in the climate system but is not resolved in the model.These simplifications often involve empirical fits or bulk representations of populations, so uncertainties arise naturally from using a single number to approximate these complexities.The five selected parameters and their sampling bounds were chosen in consultation with E3SM developers who previously "hand-tuned" E3SMv2.Each of the five parameter's influence on model output is demonstrated on an E3SMv1 perturbed parameter ensemble Qian et al. [2018].By restricting the ensemble to five parameters we achieve relatively dense sampling, but we knowingly exclude other influential parameters.Including more parameters could result in a better calibration.
The target fields of interest are time-averaged (over the 10-year period) for every season and spatial location of eleven climate variables, creating a total of 44 target spatial fields.These, along with a scalar target value, RESTOM, are described in table 2. We primarily use simulation and observational data that has been coarsened from its original resolution to a 24x48 grid (7.5 degree resolution, 1,152 grid points) resolution for latitude and longitude fields and 24x37 (888 grid points) resolution for latitude and pressure fields.We found that working with this resolution worked well in our setting and minimized the influence of more extreme observations.The method can easily be extended for finer scale resolutions, and our implementation allows a user to select 180x360 (1 degree resolution, 64,800 grid points) and 180x37 (6,660 grid points) spatial fields.
The E3SMv2 PPE was generated on the Chrysalis supercomputer at Argonne National Laboratory.Parameter sampling and E3SMv2 model setup was managed using the uncertainty quantification software Dakota Adams et al. [2018].
Model throughput is often a bottleneck for PPE's, but computational resources enabled relatively fast ensemble generation.Node counts were varied experimentally before settling on ten nodes per simulation, each node having 64 cores.Simulations were submitted ten-at-a-time in 100-node bundles.These ten-simulation bundles achieved ∼90-100 simulated years per day (SYPD), and were sometimes run two-at-a-time for a combined ∼180-200 SYPD.The ensemble consists of 2,500 simulated years, so such an ensemble could be run, theoretically, on 200 dedicated nodes in about 14 days.We emphasize throughput because ensembles of future E3SM versions will be planned in parallel with model development to assist with hand tuning, and such ensembles must be produced quickly to be relevant.

Method
This section gives a description of the framework we propose for automated calibration, including the surrogate fitting, optimization steps, and implementation tools used.Let  = ( 1 , . . .,   ) be a -dimensional variable describing the atmospheric parameters and let  () ∈ R  denote the −dimensional time-averaged spatial fields, where we unfold and stack the target outputs of the ESM into a single output vector.In our exemplar,  = 5 as shown in table 1 and  =    where   denotes the size of the spatial elements of the -th field, for a total 47,404 elements containing data from four seasons after removing missing data (data with NA values).The goal of calibration is to find values for the vector of parameters  that minimize a loss function  (•, •) between  () and an -dimensional vector of target observations   , i.e. (  (),   ).
Substituting  () with a surrogate model, f (, ), we can write the solution, or "optimal" parameter set, as Here, while  refers to the input E3SM parameters that were perturbed, the parameters  refers to the surrogate-specific parameters (and φ to its estimate), e.g., the polynomial coefficients in the case of a polynomial regression surrogate.
The surrogate f (, φ) aims to mimic the behavior of  () at a fraction of the computational cost.The task of finding a set of estimated tuning parameters θ can then be broken out into two steps: 1) fitting the surrogate model f (, ) to the training ensemble (i.e.learning φ, an estimate of ) and 2) solving equation 1 with a specified loss function to learn θ.
We detail the specifics of building the surrogate in section 3.1, and we discuss the optimization step in section 3.2.A summary and visual outline of our automated calibration workflow is given in figure 1.

Surrogate Model Construction
As mentioned in section 2, we use a perturbed parameter ensemble of size  = 250 to train our surrogate model; we briefly detail the choice of surrogate and surrogate training process and provide links to open-source software implementation here.Let  ∈ R × be the perturbed parameter ensemble of input values theta, sampled via Latin Hypercube Sampling (LHS), and let  ∈ R × be the resulting E3SM model output of the target fields, such that the data matrix pair ( , ) represents the full set of ensemble training data.
Due to the high dimensionality of the target fields, i.e.,  ≫ 1, training  independent surrogate models is impractical, especially when choices of hyperparameters (i.e.values describing the structure of the surrogate model) are to be chosen automatically.Further, treating target fields independently of others would ignore the inherent correlation across these variables.For a practical solution, we reduce the dimension of our output space via principal component analysis (PCA, a.k.a.empirical orthogonal function decomposition) and train a surrogate on each of the orthogonal components.This technique can also be referred to as data-driven Reduced Order Modeling (DDROM), the purpose of which is to reduce the output size of the quantities of interest Chowdhary et al. [2022].Specifically, we compute a set of principal components {  ∈ R  }  =1 and projection coefficients {  ∈ R  }  =1 from  where  ≪ .This results in a rank- approximation of : where  ⊤  represents the transpose of the column vector   .We then independently train separate surrogate models, f  (,   ) to each of the  projection coefficients using Legendre-based polynomial-based regression models, also known as polynomial chaos expansions (PCEs) Xiu and Karniadakis [2002], Xiu and Hesthaven [2005].Now, instead of training  regression models, we fit  PCE-based machine-learning models on  training pairs, ( ,   ) for  = 1, . . ., .
The polynomial chaos expansion can be written as a linear combination of basis terms, where   () represents the -dimensional multivariate Legendre polynomials [Xiu and Hesthaven, 2005, 2.2.2], which are products of its univariate counterpart [Hesthaven et al., 2007, 4.2.1]along the  dimensions.The total number of expansion terms,  + 1, is given by ( + )!/(!!), where  is the maximum polynomial order, which means that the number of terms can grow very quickly Xiu and Karniadakis [2002].Nonetheless, we can efficiently compute these coefficients in a systematic, robust, and automated manner.We use -fold cross validation and a brute-force hyperparameter grid search Bergstra and Bengio [2012] to determine the optimal choice of the polynomial order, , and the best regularized least squares solver, e.g., a lasso approach Friedman et al. [2010], to solve for the polynomial coefficients,    .For each of the k components, the surrogate with the lowest average root mean squared error across the five folds is selected for the optimization step.Our final surrogate model can then be written as resulting in a prediction f (, φ) ∈ R  , where φ is the optimal choice of polynomial coefficients, comprising of the optimal choice of hyperparameters like the regularization and polynomial order.We present the hyperparameter grid used for the PCE in A.
In order to make this approach easily accessible to non-ML experts, we provide a freely available and flexible Python machine learning library called tesuract (https://github.com/kennychowdhary/tesuract)Chowdhary [2021].The library is compatible with the scikit-learn API Pedregosa et al. [2011], which allows us to build on its vast library of ML tools for maximum flexibility.In fact, in addition to polynomial-based ML models, we can just as easily fit these components with random forests, Gaussian processes, and neural networks, all of which are accessed through the scikit-learn backend.Again, our philosophy is to let the data decide which model works best.Our choice of PCEs was based on experiments with our 250 ensemble runs which showed that, compared with neural networks, random forests, and Gaussian processes, PCEs performed the best (the full surrogate comparison results are presented in B).

Optimization
We will now describe the optimization step to learn θ using the surrogate with fitted parameters φ, learned via the hyperparameter cross validation approach described above.To construct the loss function, we specify a Gaussian likelihood for the observational fields centered at the fitted values of the surrogate model.The principal consideration in our optimization scheme is the balance of magnitudes of the different variables and seasons in the loss function  ( f (, φ),   ).
Let  ,ℓ denote the entry of   corresponding to the -th target field and the ℓ  ℎ grid point of that target field, and let f,ℓ (, φ) be the surrogate prediction of  ,ℓ for parameters .The normalized likelihood for this output can then be written as where { ,ℓ }   ℓ=1 are area weights on grid points for the -th output field, and  2  = Var ℓ ( ,ℓ ) is the variance of the observational spatial field for the -th output field.In words, the distributional assumption in (3) says that the standardized observations follow a normal (Gaussian) distribution with mean f,ℓ (, φ)/  and variance  −1 ,ℓ  2  and that different locations and different variables are independent.Here,   is the first step of normalization, approximately adjusting each field to the same scale.The terms { 2  } are then the adjustments to this normalization which aim to more flexibly balance the fields.We emphasize that although {  } is determined in the estimation process, one may also pre-specify {  } to manually weight specific fields more than others.
We construct our Gaussian likelihood-based loss function through the joint log-likelihood over all target fields where ∝ means "proportional to," and   (•) is the weighted mean-squared error (MSE) for variable , We highlight three options to find an optimized solution θ: (1) maximum likelihood estimation (MLE), (2) maximum a priori estimation (MAP, a regularized approach to MLE), and (3) sampling from the posterior distribution of  via Markov Chain Monte Carlo (MCMC).We will discuss all three here but will only show results for the estimates obtained via the MAP and MCMC procedures.The MLE and MAP solutions are numerically found using the limitedmemory optimization algorithm L-BFGS-B Byrd et al. [1995], which is commonly used to solve nonlinear optimization problems.
The maximum-likelihood solution is To obtain the MAP estimate, a prior distribution or regularization term is specified for  2 .Since our likelihood is defined for normalized values of   , it is sensible to assign the same prior distribution to each element of  2 .This is equivalent to assuming there is no additional expert knowledge on whether some fields should be fit more closely to observations.We specify the prior distribution P   as an inverse gamma distribution with parameters  and , so that the logarithm of the joint posterior distribution can be written as and The hyperparameters  = 3 and  = 1/2 are chosen.The maximum a posteriori (MAP) estimate is then defined as The inverse-gamma distribution is a common prior distribution used for the variance parameter of a normal distribution (for example, see section 2.6 of Gelman et al. [1995]).The choices of the hyperparameters ensure that the prior distribution tends to have more mass for values between 0 and 1 compared to above 1, which encourages lower values for these variances.These lower values then encourage the model output and observations to fit closer to each other rather than having a high tolerance for differences between them.The prior distribution for  is a uniform distribution across the parameter space, representing a lack of prior knowledge about its value.However, one may straightforwardly set a more informative prior distribution if desired.More informative prior distributions on  2 can be used to place more weight on specific target fields if desired.These samples can then be used to provide a distribution of uncertainty around the MAP point, allowing one to visualize the relationships between the input parameters  and potentially consider additional areas of the parameter space such that the parameters fit the observations well.

Results
In this section, we demonstrate the effectiveness of the autocalibration approach detailed in the previous sections on the E3SMv2 atmosphere model using the 250 PPE runs.We weigh the scalar value RESTOM equally in the optimization to one of the 32 latitude by longitude output fields.After removing missing data for latitude by pressure level fields (i.e. where the data would be underground), the total output size is  = 47,404 points.We use a target RESTOM value of 0.70 Watts per meter squared, 16 principal components (resulting in 86.6% of variance explained of the original data), and 10-year simulations.The positive value of target RESTOM indicates that under perpetual 2010 forcing with prescribed present-day ocean surface temperature, we would expect more energy entering than exiting the top of the atmosphere due to the positive forcing caused by carbon dioxide and climate feedbacks.A relatively large range of RESTOM (0.1 to 1.5 W/m 2 ) would be considered acceptable for this configuration because prescribed sea surface temperatures are an infinite source and sink of energy.
We first present overall end-to-end results for the optimized parameters from our automated calibration approach.In table 3, we present the different solutions estimated by expert-tuned parameters for the released v2 model, which we refer to as "v2 control" and denoted hereafter as  2 , the MAP estimate θ   , and the minimum and maximum bound considered by the PPE.The MAP estimates for ice_sed_ai and clubb_c1 are at the boundary of their parameter ranges.The other three parameters were estimated away from the boundary, and all parameters are substantially different compared to  2 .
To evaluate performance, the E3SMv2 configuration with the optimized parameters θ   was run for a 10-year simulation to obtain model output  ( θ   ).We then compare results with  ( 2 ).Results from these output fields were compared against observations on a 180x360 grid for latitude/longitude fields and a 180x37 grid for latitude/pressure fields.Table 4 displays the difference in root-mean-squared-error (RMSE) between the map estimate and the simulation run.The autocalibrated parameter set θ   leads to a reduction in root-mean-squared-error (RMSE) for a majority of seasons and output variables compared to  2 , with a 2.7% decrease on average across the spatial fields.For many output fields, the improvements were substantial and in excess of a 10% decrease in RMSE.These results were not uniform, however, and the automated-calibration parameter set did not provide improvements overall for DJF and for LWCF and PRECT.We note that clouds were the focus of a model hand-tuning effort in E3SMv1 Ma et al. [2022], and that the tuned parameter values mostly carried over into E3SMv2.Ma et al. also documented improvements in precipitation climatology, so these fields have less room for improvement by autocalibration.On the other hand, manually set values of  2 that focus on LWCF, SWCF, and PRECT could perform more comparably to the control simulation for these variables, potentially at the expense of sacrificing improvements in other variables.

Surrogate Fit
Here, we further investigate the surrogate model fit.In table 5, we summarize the optimal polynomial order and coefficient fitting procedure, i.e., fit type, for the PCE surrogate selected by the cross-validation process for each principal component.In general, the first few principal components have a higher selected polynomial degree compared to the later principal components, suggesting that there are more complex relationships between the input parameters In figure 2, we plot an example of a surrogate prediction from one output field, with the patterns of precipitation largely matching the E3SM output.Overall, the surrogate explains a substantial portion (R-squared of 0.48) of the variance in the perturbed parameter ensemble.In addition to an overall R-squared value, we formulate a version of R-squared for each grid point.Formally, let  (  ) ℓ be the ℓ-th grid point for the -th simulation run and f (  , φ) ℓ its surrogate prediction.We evaluate where  ℓ = 1   =1  (  ) ℓ and plot  2 (ℓ) for a few variables in figure 3. The surrogate explains a higher proportion of variability in the tropics, suggesting that the surrogate focuses on high variability areas and that the output fields in these areas can be controlled using the five input parameters.For each of the output variables, we see similar surrogate performance across the four seasons in general.The surrogate also predicts RESTOM very well with a high R-squared of 0.996.
We can next investigate the ability of the principal components to represent the key spatial patterns and key modes of variability.In figure 4, we plot the first principal component vector.While further principal components may refine the covariance between different grid points, the first component describes the relationships between grid points along the largest mode of variability.Based on this one-dimensional representation of the data, locations where the vector's absolute value is high generally have more variance compared to locations with values closer to 0. Most fields thus appear to have more variability in the tropics rather than the polar regions.Also, two locations with the same color are likely positively correlated based on this first component.Therefore, we see that variables at the same location for different seasons are usually positively correlated.
In figure 5, we compare the principal component scores for the E3SM simulation output and the surrogate's predictions.For each principal component score, we plot the 250 E3SM simulation runs, comparing to the line representing equal principal component scores for the simulation output and the surrogate predictions.The first principal components (for example, principal components 1 through 6) are predicted very well by the surrogate.These principal components are especially useful, in that the relationship between the input parameters and the principal components is especially strong.For later principal components, the surrogate is not able to predict as well, suggesting that these components are more noisy and are less explainable using the five input parameters.
Another way to evaluate a principal component decomposition is using the proportion of variances the principal components explain.Figure 5 shows the cumulative proportion of variance for 1-16 components.Collectively, the first five principal components explain 81.9% of variance and the first sixteen explain 86.6%.We conclude that using the first sixteen principal components is suitable given its large proportion of the explained variance of the data.Since the later principal components explain far less of the variance in the data, the surrogate struggles with fitting these later components (as in the left panel of figure 5), and our experiments have shown that using more than 16 principal components does not necessarily improve the surrogate fit.

Optimization and Validation
Here, we describe the results of the optimization and validation procedures in greater detail.Alongside the five automatically-calibrated parameters, normalization terms  2 for each of the 45 fields are optimized and are reported in table 6.We note that smaller values mean that the optimization aims to fit these values more closely relative to other fields and weighs them more in the optimization problem.There are only minor differences in these values between seasons, yet there are substantial differences between variables.To some extent, the results from table 6 are in accordance with the RMSE results in table 4. For example, the results in table 6 suggest that fitting the LWCF, SWCF, and PRECT fields well is especially hard as they are weighed less in the optimization.In contrast, some of the variables where the automatically-calibrated parameters provide improvement (TREFHT, Z500) were weighed more in the optimization.This, again, suggests that if one uses  2 to weight the optimization of LWCF, SWCF, and    PRECT more heavily, one would obtain improvements in these variables, possibly at the expense of performance in other variables.
While the RMSE comparisons in table 4 provide field-level summary of performance of the automatically-calibrated parameter set, we give examples of the resulting spatial fields compared to observations for E3SM in figure 6.Here, we compare annual means instead of seasonal means to summarize performance more succinctly.For the most part, the biases have the same spatial pattern in automatically-calibrated and control simulations.Some biases are structural to the model, and other biases may be sensitive to parameters that were not included in the PPE.In temperature (the top panels), the autocalibration reduces the RMSE and visually improves temperature differences in the northern high-latitudes.For precipitation (the most degraded field, bottom panels), the autocalibration degraded RMSE.The spatial correlation in both fields was unchanged from the hand-tuned model.The surrogate also had more trouble predicting precipitation compared to other fields (see figures 3 and 9), which then challenges the optimization effort.
In addition to comparing the automatically-calibrated and control E3SM parameter sets to observations, we next aim to contextualize these results in the context of the perturbed parameter ensemble.One popular diagnostic tool to assess ESM performance is a Taylor diagram Taylor [2001].We give more details on its construction in C. The Taylor diagram visualizes the components of the centered mean-squared error between the model run and observations in one plot with polar coordinates: the standard deviation of the model field normalized by the standard deviation of the observational field is the radius, and correlation between the two fields is the angular component of the polar coordinates.As a result, the distance between the evaluated point and the idealized point at (1, 0), representing a CMSE of 0, is the square-root of the normalized CMSE.

Autocalibrated Control
Intercomparison of e3sm_diags  In figure 7, we plot area-weighted versions of these statistics in a Taylor diagram for a single season (March-April-May, or MAM) for each of the variables on the same plot.On the left of figure 7, the diagram is plotted in totality comparing automatically-calibrated and control parameter sets to the observational fields.Points plotted directly on the circle centered at (0, 0) with radius 1 indicate that the model and the observations have the same standard deviation for that variable.The circles centered at (1, 0) represent contours of equal values of the normalized √ CMSE, so that, in general, points closer to (1, 0) represent models with lower centered mean-squared error.We see that, in general, the autocalibrated parameters provide modest improvements compared to the control set of parameters for most variables.The Taylor Diagram is an important cross check, because autocalibration targets RMSE instead of centered RMSE (CRMSE).Therefore, the optimal autocalibrated simulation could reduce bias at the expense of spatial correlation.However, examination of the Taylor Diagram shows that improvements in RMSE generally do not come at the expense of spatial correlation.
The right-most plot in figure 7 shows a close up view of the same points along with the performance of each of the 250 PPE runs.For each variable, one should not expect the autocalibrated parameters to perform substantially better than the best of the 250 of the PPE runs.Therefore, for some variables like U200, RELHUM, and LWCF, we would not expect there to be much improvement in these variables for this season without using a different PPE.Conversely, it may be possible to improve the results for other variables like PSL by weighing them more in the optimization since some of the 250 PPE runs perform better in terms of CMSE.Since we are interested in the model performance for all fields, we would not expect that the automatically-calibrated parameter set perform better than all 250 PPE runs for all fields or even any field, yet for most variables the automatically-calibrated parameter set is competitive with or better than the top 25% of simulations in terms of CMSE.
The Taylor diagram also informs us on how the PPE was designed.For the temperature variables T and TREFHT, the 250 simulation runs are tightly grouped, suggesting that there isn't much variability in the PPE associated with these variables.If larger variance and flexibility is desired for these variables, a future PPE should be designed accordingly.For other variables like LWCF, SWCF, and PRECT, we see that the PPE induces a larger range of outcomes, reflecting that the automatic calibration approach has the ability to choose parameters that affect these variables substantially.Overall, this plot emphasizes that the PPE enables the autocalibration approach to provide substantial improvements, yet overall biases in the model cannot be entirely overcome without modifying or expanding the PPE.

Bayesian Implementation
In addition to providing a maximum a posteriori (MAP) estimate of the input parameter set, one may visualize the objective function to better understand uncertainty associated with the predicted parameters.In Bayesian statistics, Markov Chain Monte Carlo (MCMC) iteratively samples from the objective function to provide a distribution of the relevant input parameters.A Bayesian approach would allow the identification of a multimodal posterior, pointing to ranked solutions.We run 200 independent chains of MCMC to avoid sensitivities to the algorithm initialization.Of these chains, one half are initialized as random samples from the 250 input parameters based on Latin Hypercube Sampling; the remaining half are chosen uniformly within the set parameter bounds.For each chain, we iteratively obtain 8,000 samples.The first 5,000 are discarded as "burn-in," and every tenth sample from the remaining 3,000 samples is retained in a "thinning" step to decrease the correlation between neighboring samples.This results in 60,000 total samples from the combined 200 chains.
In figure 8, we plot the results for the optimized parameter selection.Across the diagonal of each plot are histograms of the MCMC samples for each parameter.On the off-diagonal of each plot are scatter and density plots of the samples describing the relationship between each pair of input parameters.The posterior distributions suggest bimodality with respect to the clubb parameters and ice_sed_ai.This bimodality indicates there are two combinations of input parameters that give output fields matching well to observations: one with a value of clubb_c1 on the lower boundary, a value of ice_sed_ai on the upper boundary, and a moderate value of clubb_gamma_coef; and one with a slightly higher value of clubb_c1, a slightly lower value of ice_sed_ai, and a slightly higher value of clubb_gamma_coef.In addition, due to the positive relationship between zmconv_tau and zmconv_dmpdz in the plot, one can retain fidelity with observations (to some extent) by increasing or decreasing both parameters simultaneously nearby the MAP point.We see that  2 are substantially different from θ   , and the distribution of the parameters are quite localized compared to the large range considered in the input parameters.These narrow posterior distributions indicate we are likely underestimating the uncertainty represented in these posterior distributions since they do not reflect variability due to surrogate model fit.As implemented here, figure 8 represents the posterior distribution around the MAP estimate θ   .

Discussion
In this work, we introduce a practical, fast, and flexible approach for automatic calibration using an ML-based surrogate model for spatially varying fields and present its results on the E3SMv2 atmosphere model.In this setting, we show the effectiveness of the automatic calibration framework, improving root-mean-squared-error on output fields compared to an expert-tuned control set of parameters by 2.7% on average.This application shows that the surrogate flexibly fits the output fields, and the optimization identifies suitable parameters.Furthermore, through Bayesian estimation, we can provide a distribution of the parameters, incorporating some uncertainty into the estimates and potentially providing ranked solutions.This distribution also shows how the different input variables interact to give parameter sets in accordance with the optimization problem.As alluded to above, multiple solutions can be found quickly using this approach through user-chosen weights ( 2  ) of the target fields.Our general automatic calibration framework has a number of advantageous qualities.First, through empirical orthogonal functions (EOF), we can decompose the full ensemble of model output into a reduced space of orthogonal components, each of which represents a unique mode of variability.The framework then seamlessly takes advantage of a large number of output fields.In addition, we give flexibility in each step of the process, empowering the user to choose the number of principal components, , output fields of interest, choice of surrogate, loss function and field-specific weights in the optimization function.Surrogate model parameters are learned automatically through cross-validated hyperparameter tuning.
Although the combination of PCE and a Gaussian loss function worked well in our case compared to other popular and successful surrogate models, other surrogates and loss functions can be streamlined into our approach.In particular, while the preferred candidate for our ML-based surrogate is polynomial-based, we also show that a well tuned GP, random forest, and/or fully connected neural network, can achieve similar levels of accuracy, albeit slightly worse than a well-tuned PCE.Our philosophy is to let the data decide which model is best in terms of a performance metric (e.g.cross-validated R-squared), and, while PCE may work better for our data, perhaps GPs or neural networks may work better for a different data set.
There are a number of opportunities to improve upon this work.Future work could include estimating θ within a full Bayesian framework.This would include the uncertainty due to the surrogate model fit through assigning a prior distribution to .This is not a straightforward procedure for regularized PCE surrogates although it is an active area of research Lu et al. [2015].Alternatively, to achieve this, one could specify a surrogate that is more adaptable to the Bayesian framework such as Gaussian process approaches.We are currently pursuing the feasibility of using Bayesian Adaptive Smoothing Splines [BASS, Yue et al., 2014] or Bayesian Additive Regression Trees [BART, Chipman et al., 2010] which have shown promise for automated calibration in other applications and are already suited for Bayesian estimation.
We also recognize that the assumption of independence between different spatial locations and between different variables in the optimization step is a computational approximation and not perfect theoretically.When using full spatial fields, this results in an overestimate of the amount of independent data used in the optimization problem.Specifying dependence, however, between a large number of locations for multivariate output fields is challenging both conceptually and computationally, and the independence assumption implies simpler computation and a first-order comparison to the observations.We show that the optimized parameters from this approach work very well.An alternative solution would be to define the likelihood in the reduced space rather than the observed space, which we are currently exploring.
Although we have a large target space, the input space of five parameters in our exemplar was relatively small.We are currently working towards using this approach to tune E3SMv3 and test the robustness of our method with 14 specified atmospheric (input) parameters.This will require a larger Latin Hypercube sample to ensure high-fidelity of our surrogate.Whereas E3SMv2 default parameters had been set before our analysis, effectively providing a base case for comparison, tuning E3SMv3 will require extending and adjusting our approach as we interact with model tuning experts.For example, we may elicit subject matter expertise to adjust values of  2 based on priorities on the different output fields, and through these adjustments, we expect to provide multiple "optimal" parameter sets with different climate sensitivities.Lastly, it is worth noting that although the ultimate target for model tuning, the fully-coupled E3SM model, including an active ocean model, would be a more challenging computational endeavor.+ Pen(  ).
For the polynomial chaos expansion, we consider polynomials of order 1 through 12.The type of interactions between input variables was chosen between "total order" (all possible interactions up to a prescribed polynomial order) and "hyperbolic" (a reduced set of interactions); see Chowdhary [2021] for more information.The fit type was chosen between linear, elastic net, or lasso.The penalty parameters for elastic net and lasso were chosen from 20 values evenly-spaced on the log scale between 10 −8 and 10 4 .

B Comparison of Surrogate Models
We compare the polynomial chaos expansion to alternative choices of surrogate model using cross-validation scores (Table 7).Random forests are trained with varying number of trees, number of selected variables per tree, and tree depth.These hyperparameters are chosen by cross-validation similar to the approach for the polynomial chaos expansion.A Gaussian process (GP) regression model is also considered, with either a radial basis function or a Matérn covariance chosen by cross-validation.Finally, we consider a multilayer perceptron with varied number of layers and neurons in each layer, varying learning rate, and varying solver chosen by cross-validation.For each of the four models, we run the automatically-calibrated process including surrogate fitting, surrogate cross-validation, and optimization of parameters.We compare total running time and cross-validated surrogate performance in table 7. The polynomial chaos expansion has the best performance and also runs substantially faster than the random forest and the multilayer perceptron.We also compare the optimized parameters in table 8. Overall, the surrogates lead to similar values of the optimized parameters.One exception is the random forest model, which has substantially different parameters for all parameters except for ice_sed_ai.While choice of surrogate may depend on the particular application of the automatic calibration approach, we conclude that the polynomial chaos expansion performs as well or better than other state-of-the-art surrogate models.

C Description of Taylor Diagram
We more concretely describe the Taylor diagram of Taylor [2001].In other words, the performance of a model in terms of CMSE(), standard deviation   ( ) , and correlation with the observations   ,  ( ) can be visualized in the same diagram.While this ignores bias   −  () between the fields, it provides a comprehensive look at the second-order error between the two fields.

D Additional Plots
We present more complete plots from the surrogate results.In figure 9, we provide the location-by-location R-squared, and show a plot of surrogate-predicted and model output RESTOM values.The second plot 10 shows the first principal component for all variables, including all principal components for RESTOM.In figure 11, we provide Taylor diagrams for all four seasons, allowing for comparison between them.

E Open Research
The E3SM code is available at https://github.com/E3SM-Project/E3SM.The exact E3SM code used to run the 250 training simulations can be retrieved by checking out the hash 37959275bf3384157264e45a8d9c7c43f2be1d56.
The code pre-dates the official release of E3SMv2 but has the same climate, thus we refer to it as "E3SMv2" in this publication.
Our ensemble creation and surrogate code is available at https://github.com/E3SM-Project/Autotuning-NGD, which is currently closed to non-E3SM developers.We are working on making the code publicly available.Contact authors for code in the meantime.The code for the surrogate construction is based on the scikit-learn framework Pedregosa et al. [2011] and implemented in the tesuract package Chowdhary [2021].
In addition, we use preprocessing code from the clif package at https://github.com/sandialabs/clif. Specifying surrogate and optimization options are done through YAML files.Our Python environment is based on the "e3sm_unified_1.7.1_chrysalis" environment (https://e3sm.org/resources/tools/other-tools/e3sm-unified-environment/), and includes additional packages specified by the "requirements.txt"file on our GitHub repository.The code is also amenable to parallel architectures.All surrogate and optimization results were run on one node on Chrysalis.
The prescribed sea surface temperature and sea ice extent data is a 2005-2014 monthly climatology from Durack and Taylor [2018].

Fig. 1 :
Fig. 1: An overview of our automated calibration workflow broken into four main steps (depicted in center row): generation of PPE, estimation of surrogate model, optimization, and validation.

Fig. 2 :
Fig. 2: E3SM and surrogate-predicted output for time-averaged PRECT during JJA, plotted on a 24x48 grid.(Top Left) E3SM output, (Top Right) surrogate prediction, (Bottom) the difference between the E3SM output and the surrogate prediction.

Fig. 3 :
Fig. 3: Surrogate R-squared ( 2 (ℓ), proportion of explained variance) computed for different variables, seasons, and spatial locations.(Left) For 3 fields defined on longitude/latitude, (Right) for fields defined on latitude/pressure.The respective plots for all longitude/latitude variables are plotted in D.

Fig. 4 :
Fig. 4: Values of the first principal component (PC) vector: (Top) for 3 fields defined on longitude/latitude and the first PC, (Bottom) for fields defined on latitude/pressure for the first PC.The respective plots for all longitude/latitude variables and RESTOM are plotted in D.

Fig. 5 :
Fig. 5: (Left) Principal component (PC) score/coordinate values for each principal component, comparing the simulation output and the surrogate predicted fields.The line y=x is plotted in each plot.(Right) The cumulative proportion of variance in the data explained by the first  principal components, for  = 1, 2, . . ., 16.

Fig. 6 :
Fig.6: Simulated annual mean 2-meter temperature (top) and precipitation (bottom) minus observational targets for the auto-calibrated simulation (left) and the control simulation (right).Auto-calibrated temperature RMSE is improved, precipitation RMSE is degraded, and the large-scale spatial patterns of biases are consistent.

Fig. 7 :
Fig. 7: Taylor diagram of automatically-calibrated and control parameters for the season MAM.(Left) The entire Taylor diagram for the v2 control and automatically-calibrated parameter set.(Right) A zoomed-in version of the plot on the left, with the performance of the 250 simulation runs also plotted lightly in the background.

Fig. 8 :
Fig. 8: MCMC results: (Left) with plot bounds specifying the range of the parameter bounds searched; (Right) zoomed in to the bounds of MCMC samples.The vertical lines on the histograms represent the 16%, 50%, and 84% quantiles of the distribution of each input parameter.The orange squares represent the MAP points, while the green circles represent the E3SMv2 default parameter set.

Fig. 9 :
Fig. 9: Surrogate R-squared (proportion of explained variance) computed for each variables, season, and spatial location: (Left) for fields defined on longitude/latitude, (Top right) for fields defined on latitude/pressure, (Bottom right) a scatterplot of global top of atmosphere energy balance, with a black line at y=x and blue line representing the linear regression line between the simulation output and surrogate prediction using the 250 simulations.

Table 1 :
Atmospheric parameters sampled."pdf" is probability density function, and CAPE is convective available potential energy.
110 km grid spacing and 72 vertical levels with a model top of 10 Pa (approximately 60 km).The prescribed ocean configuration is consistent with the goal of finding atmospheric parameters that bring the climate closest to the presentday atmospheric targets while eliminating the influence of ocean variability and minimizing the time necessary for the simulated climate to adjust to parameter perturbations.

Table 3 :
Comparison of automated-calibration parameters with v2 control parameters and minimum and maximum values from which training data was sampled.

Table 4 :
Comparison of autocalibrated and v2 control parameters: percentage change in root-mean-squared-error (RMSE) between time-averaged E3SMv2 output and observations.Green represents improvements when using the autocalibrated parameters.On average, the automated-calibrated configuration reduces RMSE by approximately 2.7 percent.The autocalibration parameters also led to a RESTOM value of +0.47 under the F2010 configuration.

Table 5 :
Automatically-selected estimators for each principal component.

Table 6 :
MAP estimates ŝ  for each target field  by variable and season.The   for RESTOM had a value of 0.02.

Table 7 :
Comparison of surrogate models using cross-validated scores: polynomial chaos expansion (PCE), random forest (RF), Gaussian process regression (GPR), and multilayer perceptron (MLP).For each surrogate model, we run on a single node in Chrysalis.Briefly describing the process to estimate , we use penalized optimization.Consider the minimization problem ∑︁ =1    − f  (  • ,   ) 2 + Pen(  ),where    is the -th entry of   and f  (  • ,   ) is a polynomial of the inputs  • with order (or "degree")  with coefficients specified by   .The penalty Pen(  ) is a regularization (e.g.lasso or elastic net) term to avoid overfitting of f  to the training data, with its size determined by  > 0. The optimization problem may also be written  ∑︁ =0      ()