Sparse, Empirically Optimized Quadrature for Broadband Spectral Integration

Broadband (spectrally‐integrated) radiation calculations are dominated by the expense of spectral integration, and many applications require fast parameterizations for computing radiative flux. Here we describe a novel approach using a linear weighted sum of monochromatic calculations at a small set of optimally‐chosen frequencies. The empirically‐optimized quadrature method is used to compute atmospheric boundary fluxes, net flux profiles throughout the atmosphere, heating rate profiles, and top‐of‐the‐atmosphere forcing by CO2, in the longwave for clear skies. We evaluate the method against two modern correlated k‐distribution models and find that we can achieve comparable errors with 32 spectral points. We also examine the effect of minimizing different cost functions, and find that in order to accurately represent heating rates and CO2 forcing, these quantities must be included in the cost function.

In dynamical models of the atmosphere, radiation calculations must similarly be fast and accurate.The state-of-theart method for reducing spectral resolution is the correlated k-distribution (CKD) algorithm (Fu & Liou, 1992;Goody et al., 1989;Lacis & Oinas, 1991), which sorts frequencies according to their absorption coefficient k and determines the average at some set of quadrature points.The sorting and averaging creates a much smoother integral that can be evaluated with just tens or hundreds of quadrature points, as compared to the millions of points required by line-by-line models.These parameterizations seek to minimize the number of these quadrature points, called g-points, to limit computational cost (Hogan & Matricardi, 2022).
There is, however, no theory underlying k-distributions, so developers must make a range of somewhat arbitrary choices (Hogan & Matricardi, 2020a).Moreover, the common experience is that the absorption coefficients determined with any algorithm need to be tuned to match reference calculations (Hogan & Matricardi, 2022;Pincus et al., 2019), making the link between the reference and approximate calculations somewhat obscure.
Here we describe an approximation to spectral integration in the longwave consisting of a weighted sum of monochromatic fluxes computed at a small set of discrete wavenumbers.We optimize both the members of this set and the linear weights to find the best estimate of the broadband integrated flux.The idea builds on approaches for computing narrow-band radiances originally developed for remote sensing applications (Buehler et al., 2010;Moncet et al., 2008) to sample the spectrum sparsely rather than averaging absorption coefficients.While our work is in the same spirit as de Mourgues et al. (2023), we take a more general approach and treat a wider range of applications including heating rates, boundary fluxes, and CO 2 forcing.Our data is discretized more finely in the spectral dimension and more sparsely in the vertical, which prompts us to develop a cost function that will allow us to extrapolate the integration to atmospheres of different structures.We seek not only to develop a novel method, but also to understand any inherent properties and tradeoffs of the approach.To this end, we explore the relationship between the quantities targeted in the optimization and individual error profiles, test the effect of the vertical discretization of the training data set, and compare the performance of the empirically-optimized quadrature method to state-of-the-art correlated-k methods.We find that weighted, sampled monochromatic calculations can predict fluxes, heating rates, and CO 2 forcing with error-cost relationships commensurate with modern correlated-k models, while being flexible to the needs of a given application.

Broadband Radiative Integral Calculation
The net broadband radiative flux at height z in the atmosphere, F(z), is given by integrating monochromatic fluxes across all wavenumbers: where ν denotes wavenumber and F ν (z) is monochromatic net flux.We develop the method for the longwave in clear skies, so we disregard scattering.
Rather than calculating the line-by-line integral directly, we seek to compute the broadband net flux from a small subset of representative wavenumbers.This breaks the problem into two parts: • Optimally select the representative wavenumbers, and • Compute the net integrated flux from this subset.
Inspired by approaches originally developed for remote sensing applications (Buehler et al., 2010;Moncet et al., 2008), we select the representative subset via a combinatorial optimization algorithm called simulated annealing (Kirkpatrick et al., 1983), and calculate the net flux by performing a weighted sum of monochromatic fluxes found at the representative wavenumbers.In other words, we rewrite our problem as: where w ν is the weight corresponding to wavenumber ν in the sparse subset S, and F ν (z) is the monochromatic flux found at wavenumber ν at height z.Though this is similar to the formulation of k-distributions (e.g., Equation 3 in Lacis and Oinas (1991)), here we are sampling monochromatic fluxes rather than averaging absorption coefficients.Furthermore, we sample across the entire longwave spectrum without imposing a band structure (cf. the Full-Spectrum Correlated-k, Hogan, 2010;Modest & Zhang, 2002). 10.1029/2023MS003819 3 of 13

Line-By-Line Spectra and Correlated-k Data
In order to spectrally sample F ν∈S and choose w ν , we need accurate, finely-resolved line-by-line spectra at arbitrary frequencies, as well as corresponding integrated fluxes and heating rates throughout the atmosphere.Atmospheric profiles used for training must be independent and provide a representative sample of conditions for a given applications.
We use the set of atmospheric conditions developed for the Correlated K-Distribution Model Intercomparison Project (CKDMIP; see Hogan & Matricardi, 2020a).The CKDMIP profiles are chosen from reanalysis data to represent the mean state as well as minimum and maximum extremes in water vapor concentration, ozone concentration, and temperature for the present day climate.We use software provided by CKDMIP to generate 100 independent present-day, clear-sky reference flux profiles, half for training and half for testing, integrated in four hemispheric angles.Heating rates are computed from fluxes using a simple finite difference.Finally, we perturb the CO 2 concentration of the 100 profiles to generate reference CO 2 forcing values for the last glacial maximum (LGM) (180 ppm), pre-industrial (PI) (280 ppm), present day (∼415 ppm), 2× PI (560 ppm), 4× PI (1,120 ppm), 8× PI (2,240 ppm) (Hogan & Matricardi, 2020a), and 16× PI (4,480 ppm) for extreme paleoclimate scenarios (Bice & Norris, 2002).

Determining Optimal Weights for a Subset of Spectral Points
Given a subset of spectral points S of predetermined size |S|, we must obtain a set of weights such that the weighted sum of fluxes at those points predicts the broadband net flux across the spectrum.The weights w ν are found by minimizing an objective function subject to several constraints.
Our first constraint is that all w ν ≥ 0, since quadrature weights are always positive.Positive weights lend the method stability, as they prevent monochromatic fluxes from canceling (Buehler et al., 2010;Moncet et al., 2008).Second, the weights are normalized so that they sum to the length of the spectral domain, a general property of integration: as we focus on the longwave.This is achieved by expressing the cost function in conjunction with the constraints as a linear program and solving using the convex optimization package cvxpy in Python (Agrawal et al., 2018;Diamond & Boyd, 2016).

Wavenumber Subset Optimization via Simulated Annealing
Next, we must identify an optimal subset at which to compute the predicted flux.We use simulated annealing, a stochastic optimization algorithm that can combinatorially evaluate different subsets in order to find a minimum in a given cost function (Kirkpatrick et al., 1983).The algorithm is guaranteed to converge asymptotically to a global minimum in cost function with probability 1 (Dekkers & Aarts, 1991), and can do so at much lower computational cost than a brute force approach for a large state space.Buehler et al. (2010) provides a useful tutorial on algorithm implementation, and we follow most of their choices, except as follows.In our implementation of the algorithm, as in theirs, optimization is organized in sets of moves toward a minimum that the cost function value is averaged over (Buehler et al., 2010).However, our blocks consist only of 100 moves of which at least one must be accepted; otherwise, the algorithm is considered to have converged to a minimum and is terminated.
Because the optimization process is stochastic, we perform 16 independent optimizations for each subset size and plot the standard deviation away from the mean error in our results.We refer to a given set of wavenumbers and corresponding weights as a model.

Cost Function Formulation
As we are interested in optimizing multiple radiative quantities-fluxes, heating rates, and top-of-the-atmosphere CO 2 forcing-we take inspiration from multi-objective optimization in the formulation of our cost function (Yang, 2014).Given multiple optimization targets F (flux), H (heating rate), and   (forcing), the simplest way to minimize their respective cost functions c simultaneously is to minimize their weighted sum: Each f i implicitly has the inverse units of its corresponding c to make the cost function unitless.Choosing the values of the weights f i is considered a subjective task with many valid solutions (Yang, 2014).In our case, the approach is as follows: first, we fix f 2 = 1 and f 3 = 0 as we vary f 1 , to focus on the tradeoff between heating rates and fluxes.For each value of f 1 , we fit linear weights to 100 random subsets of size 32 at every fifth vertical level for each of the 50 training atmospheres.The maximum relative (percent) error across vertical levels is recorded for each of the 100 iterations.The mean of these 100 iterations is plotted, where any random subset that produces an ill-conditioned problem (noninvertible data matrix) is ignored.The results of this experiment are shown in Figure 1a.The gray line indicates the parameter value f 1 = 0.15 that is chosen, as it occurs where heating rate errors level off.
In Figure 1b, the same procedure is repeated while holding f 1 = 0.15 and f 2 = 1 fixed and varying f 3 .We choose parameter f 3 = 1 in order to minimize increase in heating rate and flux errors.

Boundary Fluxes
We illustrate the method by training the algorithm only on the outgoing atmospheric boundary fluxes: the upwelling flux at the top of the atmosphere and the downwelling flux at the surface.These fluxes can be useful for diagnostic calculations such as radiative forcing or the cloud radiative effect, and can help understand the Earth's energy balance.The simulated annealing and linear program both minimize the cost function Fluxes at the boundaries of the atmosphere can be computed accurately with relatively sparse spectral information (Figure 2).For various representative subset sizes, the best flux and weight configurations are recorded and the root mean squared error (RMSE) away from reference calculations, 10.1029/2023MS003819 5 of 13 where n indexes the atmospheric profiles and m indexes vertical levels (here, just the boundaries), is plotted.This is the same as the cost function up to a constant multiplier, which serves to take the average of the error across atmospheric profiles and vertical levels rather than the sum.Additionally, using data submitted to CKDMIP (Hogan & Matricardi, 2020a) and available online, our results are plotted alongside metrics from modern CKD models, the 27 g-point ecCKD (Hogan & Matricardi, 2022), which is trained on the CKDMIP training data, and the reduced 128 g-point RRTMGP (Pincus et al., 2019).
First, to separate the contributions of simulated annealing and linear programming, weights are computed via a Riemann sum (defined by the Euclidian distance between wavenumbers), performed over the quadrature points during optimization, rather than by the linear program.The pink curve shows that for a simple diagnostic calculation, the Riemann sum can perform at around 1 W/m 2 error with 32 spectral points, similarly to correlated-k models, highlighting the effectiveness of simulated annealing at choosing an appropriate representative subset.The purple and blue curves show the training and testing error, respectively, using both simulated annealing and linear weight optimization.Here, eight spectral points allow fluxes at the boundaries of the atmosphere to be computed to within 1 W/m 2 (1% accuracy), comparable to ecCKD and RRTMGP.With more than 16 spectral points, errors in testing data decrease much less rapidly with spectral detail than do errors in training data, indicating overfitting to the training data set.However, the difference between the Riemann sum error and the RMS testing error of the model that optimizes Equation 2 is at close to 1 W/m 2 throughout.This highlights the importance of optimizing both the linear weights and the representative spectral subset.

Net Flux and Heating Rate Profiles
Boundary fluxes control the total energy exchanged by the Earth and the atmosphere; flux profiles throughout the atmosphere set heating rates and hence the diabatic forcing of circulation.Simultaneous prediction of fluxes and heating rate can be achieved by specifying a cost function that contains both quantities, that is, 10.1029/2023MS003819 6 of 13 which sums the ℓ 2 norm of the differences between estimated and reference heating rate and net flux with weights f i across all included training profiles and vertical levels, formulated similarly to Equation 2 in Hogan and Matricardi (2022).F est is defined as the dot product between the weights and the monochromatic fluxes, as in Equation 1, and the heating rate is computed using this predicted flux; for each vertical level m in a given training profile: where S = 3,600 ⋅ 24 s/day is a scale factor; g = 9.81 m/s 2 is the acceleration due to gravity; c p = 1,004 J/ kg K is the specific heat at constant pressure for dry air; and p m is pressure provided by the CKDMIP data set at vertical level m, given in Pa.The heating rate is obtained in units of K/day.Finally, f 1 = 0.15 and f 2 = 1 are empirically derived hyperparameters that balance the optimization of heating rates and fluxes, as discussed in Figure 1.This cost function is used both to choose the optimal spectral subset as well as to find the associated weights.
We train the model on every fifth vertical level of the CKDMIP data set to predict vertically-resolved fluxes for these applications.The algorithm requires random access to a large (∼170 GB) data set, so this is a practical limitation, but also allows us to treat some levels as out-of-sample, which can help us understand the impact of the vertical discretization of the atmosphere on the accuracy of the algorithm.In other words, we insist that our algorithm learns to integrate across the spectral dimension independently of any implicit vertical structure.
In Figure 3, we show the profiles of absolute RMSE (W/m 2 ) of flux predicted by the minimization of Equation 3for each subset size in order to examine this vertical dependence.We compare to ecCKD on the training plot as it is trained on CKDMIP data, and to RRTMGP on the testing plot, as it is not.Both here and in Figure 4, pressure is plotted linearly in CKDMIP model level because training levels were chosen linearly in model level.The figure demonstrates good generalizability of the method in the vertical as well as to out-of-sample atmospheric profiles.Throughout the vertical profile, errors remain of a constant order of magnitude.Extrusions between training levels become more pronounced in larger subset sizes, indicating overfitting.Errors in the testing set are of similar order of magnitude to those in the training set.With 32 spectral points, we achieve errors within 0.3 W/m 2 (about 0.1% relative error), comparable to the correlated-k models.Calculating heating rates from predicted fluxes with accuracy is difficult, especially at high altitudes, where the small pressure difference will magnify any errors in the flux.Furthermore, since heating rate is a derivative of flux computed by a simple finite difference, the heating rate error will be determined by the difference in flux error at adjoining vertical levels, rather than by the error in the fluxes themselves.Thus, to produce good estimates of heating rate profiles, we include heating rates in the cost function (Equation 3).Without this term, the optimization produces large heating rate errors in the stratosphere and above, even for large subset sizes.
Figure 4 shows the absolute RMSE in heating rates, in units of K/day.Similarly as in Figure 3, levels we trained on are marked with crosses.In panels (a) and (b), heating RMSE is shown only for the levels in the training set.Though errors are relatively large for small subset sizes, the errors for 32 and 64 wavenumbers are commensurate with CKD models throughout the atmosphere, maintaining errors of <0.2 K/day (approaching 1%).In panels (c) and (d), heating rates computed from predicted fluxes at all levels (as in Figure 3) are shown.Here, for small subset sizes (≤16 representative wavenumbers), heating rate errors are larger, especially high in the atmosphere; in the layers immediately above the ground, the errors subset sizes 4 and 8 reach as high as 8 and 5 K/day, respectively, due to the fine pressure discretization in these levels.Even for 16 predictors, where flux errors are commensurate with correlated-k models in Figure 3 at about 1 W/m 2 (1%) error throughout the atmosphere, the heating rate error is large where pressure thickness is small.For 64 predictors, the heating rate errors of our model are smaller than 0.5 K/day for both training and testing throughout most of the atmosphere.However, the dependence of accuracy on vertical discretization is highlighted here.The method extrapolates fluxes to out-of-sample levels well because the relationship between the monochromatic fluxes and the broadband integral remains fixed.This is not true of heating rates when training is performed on certain pressure levels and extrapolation to different pressure levels is attempted.

Forcing by CO 2
The top-of-the-atmosphere radiative forcing is defined as the difference in outgoing longwave radiation (OLR) between two states.This quantity is crucial for studying how Earth's energy balance will evolve with changing atmospheric composition.As with heating rates, it is not sufficient to predict the OLR in the two states; we must also minimize their difference.We adjust our cost function accordingly: where the TOA forcing by , and similarly for   .Hyperparameters f 1 = 0.15 and f 2 = 1 are as in Equation 3, and f 3 is set to 1, as described by Figure 1.Here, we use the difference between 8× PI CO 2 and present-day conditions in the cost function, and extrapolate to forcing between present-day and the scenarios used in CKDMIP: LGM, PI, 2× PI, and 8× PI (Hogan & Matricardi, 2020a).We also extrapolate to 16× PI (4,480 ppm), to test whether our model can be useful in extreme paleoclimate scenarios (Bice & Norris, 2002).
Figure 5 shows the RMSE for different subset sizes across all the CO 2 scenarios.Our method displays highest errors in extrapolations to very low and very high CO 2 scenarios, such as the LGM and the extreme paleoclimate scenario.With 64 spectral points, we can achieve about 0.2 W/m 2 (<10%) error across scenarios, which is similar to the correlated-k models.Training on just the 2,240 ppm scenario is able to maintain errors within 10% for all climate scenarios for 32 and 64 spectral points, including extreme high and low CO 2 concentrations.The choice of climate scenario to train on is somewhat arbitrary; training on any perturbation of CO 2 yields similar results.This is demonstrated in Figure 6, where the RMSE in CO 2 forcing is plotted for three independent optimizations trained on the various climate scenarios, each with 32 spectral points.The 8× PI scenario is the same as Figure 5.There are minima in the RMS at the scenarios trained on, and scenarios that are further out-of-sample yield higher errors.For example, when training on the LGM or PI CO 2 concentrations, the dramatically-different 16× PI scenario is predicted with up to 1 W/m 2 error.However, the quadrature method remains flexible to different applications, as predictions for all scenarios remain within about 10% error for 32 spectral points.

Evaluating Optimization Priorities
One of our aspirations is to create a framework in which optimization goals can be explicitly and transparently specified.For instance, for a weather model, it may be imperative to predict fluxes and heating rates well, but forcing by CO 2 may not be a concern.Adding terms to the cost function increases optimization times.It also results in an error tradeoff, as ensuring that multiple quantities are equally optimized yields larger errors in each individual quantity.

Physical Interpretation of Optimization Choices
Though simulated annealing is a stochastic optimization algorithm agnostic to the physics of the Earth system, we hope to draw some physical understanding from the wavenumber subsets and associated weights that are chosen.
Figure 8 shows the location of subset members on an example OLR spectrum for three independent optimizations.Bars are shown below the spectrum noting major features: the atmospheric water vapor window as well as absorption by water vapor, carbon dioxide, and ozone.On these bars, the dots indicate where subset members lay on the spectrum for three 16-member models trained on heating rates and fluxes (Equation 3).The size of the dots is proportional to the value of the monochromatic flux at that wavenumber multiplied by its linear weight.This quantifies how much a given subset member contributes to the integrated broadband flux.The colors correspond to the three different models.
The optimization chooses wavenumbers that are distributed across the spectrum, spanning absorption by all three major gases as well as the water vapor window.Where the Planck curve is small, such as the tails of the spectrum, few points are chosen, and those that are tend not to be "important."The largest contributions tend to come from the water vapor window, followed by the main CO 2 and H 2 O bands.
Next, the amount of absorption captured by each subset member is quantified through optical depth.For each wavenumber in a given subset S, the pressure level at which optical depth τ = 1 is reached from the top of the atmosphere is recorded.The wavenumbers are then sorted by this pressure level, averaged across training atmospheric profiles and independent optimizations, and then plotted in Figure 9.The colorful lines correspond to an optimization minimizing error in both heating rates and fluxes (Equation 3), while the accompanying gray curves result from optimizing only fluxes.
As the size of the subset increases, the members span more of the atmosphere-the pressure of the highest-altitude (lowest-pressure) optically-thick level decreases.Furthermore, more levels that radiate at higher pressures in the atmosphere are included: the slope of the lines below 100 hPa decreases.Finally, the models trained on only fluxes span less of the atmosphere and contain more wavenumbers that radiate from higher pressures on average.This explains why the model trained only on fluxes cannot predict heating rates at low pressures with accuracy,  2), in dark red; one including every fifth net flux in the vertical, in red; one trained on these fluxes as well as corresponding heating rates (Equation 3), in orange; and one where the cost function encompasses fluxes, heating rates, and the CO 2 forcing between present day and 8× PI CO 2 (Equation 4), in yellow.and further justifies our inclusion of heating rates in the cost function.It also proposes a relationship between the vertical discretization of the training atmospheres and the number of subset members required to accurately predict heating rates.

Conclusions
In this paper, we demonstrate that a small set of spectral points can be used to predict flux profiles, heating rate profiles, and CO 2 forcing with accuracy comparable to CKD methods.The empirically-optimized quadrature Figure 8. Below an example of a spectrum of outgoing longwave radiation, major absorbers are noted.Dots mark the members of three independent, optimal wavenumber subsets.The colors correspond to these independent optimizations.The size of the dots is proportional to the importance of that spectral point to the broadband flux integration.

Figure 9.
For each wavenumber in the final subset S, the pressure level at which that wavenumber first reaches optical depth τ = 1 from the top of the atmosphere is plotted.The colorful lines correspond to models optimizing both fluxes and heating rates (Equation 3), and the accompanying gray lines correspond to models optimizing only fluxes.
method selects an optimal subset of spectral points at which to compute a weighted sum to estimate the broadband net flux integral.We apply the method to predicting fluxes throughout the vertical column and test its generalizability to out-of-sample atmospheric profiles.We can achieve accuracy to within 1 W/m 2 for vertically-resolved net fluxes with 16 spectral points while avoiding overfitting.
We show that in order to reproduce heating rates accurately, especially high in the atmosphere, we must optimize not only on fluxes, but also on the difference between fluxes at consecutive height levels.Including heating rates in the cost function yields errors of less than 0.2 K/day throughout the atmospheric column for subsets of 32 points.Similarly, in order to explore applications in climate science, we include forcing by CO 2 in the cost function.We find that by training only on an 8× PI scenario, we can extrapolate to reproduce top-of-the-atmosphere forcing within 10% (<1 W/m 2 ) with 32 or 64 spectral points across various scenarios.The ability to optimize quantities of interest lends flexibility to different applications to the method.
While our approach performs well on the present-day clear-sky scenarios provided in CKDMIP, there are several limitations to this initial demonstration.First, we have chosen to focus on clear-sky scenarios, as many k-distributions are initially developed using only clear skies (e.g., Pincus et al., 2019).The optical properties of clouds vary only weakly with wavenumber so we anticipate that training on cloudy profiles would not change the accuracy or the flexibility of our quadrature method.However, as with choosing clear-sky profiles, choosing an appropriate set of cloudy training profiles is a challenge, and so we leave it out of the scope of this work.
We have also limited our training and testing data to the CKDMIP sets, which commits us to certain atmospheric condition and vertical discretization choices.Choosing training atmospheres that will represent diverse conditions and generalize well is an open question for many statistical and machine learning applications, and we use only 50.Furthermore, memory constraints (the total size of the spectral training data set is about 170 GB) also prevent us from training on all model levels in the vertical.Though this allows us to highlight the importance of the vertical discretization of the data set, using a finer grid would certainly improve both flux and heating rate predictions.It may also require more spectral points to parameterize the extra information.
Finally, though we know many greenhouse gases exert a forcing on the climate, we focus only on CO 2 .Since we fit one set of weights with a linear regression, there is a limit on how many related quantities we can optimize at once before we formulate an ill-conditioned problem.In order to predict forcing by different gases, or improve our CO 2 forcing estimates across different scenarios, we will have to consider how best to include many different optimization targets.

Figure 1 .
Figure1.The relative root mean squared error of each radiative term of interest is plotted as weighting parameters vary.In panel (a), the weight on the heating rate term f 2 = 1 is held fixed while f 1 is varied.In panel (b), f 1 = 0.15 and f 2 = 1 are held fixed while f 3 is varied.The gray lines show the value of the parameter that is used throughout the study.

Figure 2 .
Figure2.The root mean squared error, in units of W/m 2 , across a range of wavenumber subset sizes for boundary fluxesupward flux at the top of the atmosphere and downward at the bottom.The blue curve represents the training set, and the purple the testing, when trained on the cost function in Equation2; the pink curve represents training error when the integral is performed using a Riemann sum over the chosen spectral points.We omit Riemann testing error as it is nearly identical to the training error.Envelopes denote standard deviation across 16 independent optimizations, and crosses show where training took place.The same metric is plotted for boundary fluxes predicted by correlated-k models ecCKD (black) and RRTMGP (red).10%, 1%, and 0.1% of mean boundary fluxes are plotted in gray.

Figure 3 .
Figure 3.The root mean squared error (RMSE) (W/m 2 ) for predicted flux values for each pressure level, for a model that minimizes both flux error and heating rate error (Equation 3).Subset sizes are shown above the plot.Panel (a) shows results for the training set, and panel (b) plots out-of-sample testing data.Since the algorithm is trained only on every fifth vertical level, denoted by crosses on the training plot, in-between levels in the training set are also out-of-sample.Envelopes denote the standard deviation in RMSE across 16 algorithm runs.We highlight ecCKD on the training plot since it is trained on the CKDMIP data (black line), and show RRTMGP on the testing plot, since it is tested on our training data (red line).1% and 0.1% of the mean flux at each level is plotted in gray for reference.

Figure 4 .
Figure 4.The absolute root mean squared error (RMSE) (in K/day) of the heating rate calculated from predicted fluxes throughout the atmosphere.Subset sizes are noted above the plot.Panel (a) shows the RMSE of heating rates at only the pressure levels in the training set; panel (b) shows the associated out-of-sample atmospheric columns; (c) shows results for the training set using predicted fluxes for all model levels; panel (d) shows absolute RMSE for out-of-sample atmospheric profiles for all levels.Envelopes denote the standard deviation in absolute RMSE across 16 algorithm runs.The black and red lines show heating rates predicted by ecCKD and RRTMGP, respectively.10% and 1% of the mean heating rate profile are plotted in gray dotted lines.

Figure 5 .
Figure 5.The root mean squared error (in units of W/m 2 ) of the forcing by CO 2 is plotted for different subset sizes across various climate scenarios.Panel (a) shows the error from the training set, and testing error on panel (b).Crosses indicate that only forcing from the 2,240 ppm training scenario is included in the cost function.The results from ecCKD and RRTMGP are shown in black and red, respectively.Clouds show the standard error across 16 independent optimizations.10% and 1% of the mean forcing for each scenario are plotted in gray dotted lines.

Figure 7
Figure7highlights the importance of including quantities of interest in the cost function.In panel (a), the mean RMSE of outgoing fluxes is shown across subset sizes.While the model trained only on the outgoing fluxes (dark red) outperforms all others, on average, for almost all subset sizes, the others are not far behind.In panel (b), we examine the mean RMSE of the predicted flux.The tradeoff of adding extra quantities to the cost function is demonstrated-the model trained only on fluxes (red) is consistently more accurate than all others, and adding heating rates (orange) or CO 2 (yellow) to the cost function decreases slightly accuracy for all subset sizes.As shown above, including the flux derivative in the cost function is imperative in accurately reproducing heating rates.In panel (c), the mean RMSE of heating rates is shown for different models.The models trained on heating rates (orange and yellow) achieve 10% error, on average, with 32 spectral points, while the model trained only on fluxes (red) exhibits larger errors.Finally, in panel (d), we explore forcing by CO 2 .Here, it is clear that in order to reproduce this TOA radiative forcing, we must include it in the cost function.Although including CO 2 forcing in the cost function increases errors in the flux profiles and boundary fluxes, these increases are small.

Figure 6 .
Figure 6.The root mean squared error (in units of W/m 2 ) of the forcing by CO 2 is plotted for a 32-wavenumber subset trained on the different climate scenarios: the last glacial maximum, pre-industrial concentrations (PI), and 2-, 4-, 8-, and 16× PI CO 2 .Panel (a) shows the error from the training set, and testing error on panel (b).Crosses indicate which training scenario is included in the cost function.The results from ecCKD and RRTMGP are shown in black and red, respectively.10% and 1% of the mean forcing for each scenario are plotted in gray dotted lines.Clouds show the standard error across just three independent optimizations.

Figure 7 .
Figure 7.The average errors for models trained on various cost functions: one including only boundary fluxes (Equation2), in dark red; one including every fifth net flux in the vertical, in red; one trained on these fluxes as well as corresponding heating rates (Equation3), in orange; and one where the cost function encompasses fluxes, heating rates, and the CO 2 forcing between present day and 8× PI CO 2 (Equation4), in yellow.The panels show: (a) the mean root mean squared error (RMSE) for boundary fluxes; (b) the mean RMSE for fluxes throughout the atmospheric column; (c) the average absolute RMSE of heating rates throughout the atmospheric profile and (d) the average RMSE of forcing by CO 2 across climate scenarios.The crosses indicate where the given quantity was included in the cost function.10%, 1%, and 0.1% of the mean reference values for each quantity are plotted in gray dotted lines.
Figure 7.The average errors for models trained on various cost functions: one including only boundary fluxes (Equation2), in dark red; one including every fifth net flux in the vertical, in red; one trained on these fluxes as well as corresponding heating rates (Equation3), in orange; and one where the cost function encompasses fluxes, heating rates, and the CO 2 forcing between present day and 8× PI CO 2 (Equation4), in yellow.The panels show: (a) the mean root mean squared error (RMSE) for boundary fluxes; (b) the mean RMSE for fluxes throughout the atmospheric column; (c) the average absolute RMSE of heating rates throughout the atmospheric profile and (d) the average RMSE of forcing by CO 2 across climate scenarios.The crosses indicate where the given quantity was included in the cost function.10%, 1%, and 0.1% of the mean reference values for each quantity are plotted in gray dotted lines.