Use of neural networks for stable, accurate and physically consistent parameterization of subgrid atmospheric processes with good performance at reduced precision

A promising approach to improve climate-model simulations is to replace traditional subgrid parameterizations based on simplified physical models by machine learning algorithms that are data-driven. However, neural networks (NNs) often lead to instabilities and climate drift when coupled to an atmospheric model. Here we learn an NN parameterization from a high-resolution atmospheric simulation in an idealized domain by coarse graining the model equations and output. The NN parameterization has a structure that ensures physical constraints are respected, and it leads to stable simulations that replicate the climate of the high-resolution simulation with similar accuracy to a successful random-forest parameterization while needing far less memory. We find that the simulations are stable for a variety of NN architectures and horizontal resolutions, and that an NN with substantially reduced numerical precision could decrease computational costs without affecting the quality of simulations.

Two ML algorithms that have been used for climate-model parameterizations are neural networks (NNs) and random forests (RFs). O'Gorman and Dwyer (2018) trained an RF to emulate a conventional convection scheme of an atmospheric GCM and showed that implementing this RF parameterization in the GCM leads to stable simulations that reproduce the GCM's climate; the use of an RF allowed physical constraints, such as energy conservation and non-negative surface precipitation, to be respected. More recently, Yuval and O'Gorman (2020) (hereinafter referred to as YOG20) learned an RF parameterization from the output of a three-dimensional high-resolution idealized atmospheric model. The parameterization led to stable simulations that replicate the climate of the high-resolution simulation at several coarse resolutions.
Despite the success of RFs in these cases, NNs have some computational advantages over RFs, such as the possibility of greater accuracy and a substantially smaller memory footprint when implemented. Furthermore, an NN parameterization could potentially be implemented at reduced precision on GPUs, TPUs and even CPUs (Vanhoucke et al., 2011), leading to very efficient parameterizations. For example, half-precision (16-bit) arithmetic on recent-generation NVIDIA A100 GPUs (Nvidia, 2020) allows up to 64 times the peak compute performance (0.6 PFlop/s at 16-bit precision) compared to 64-bit arithmetic (9.7 TFlop/s at 64-bit precision). Although these are theoretical peak numbers, a first exascale application in Earth science has already been realized for image classification based on GPU technology with 16-bit arithmetic (Kurth et al., 2018). Google TPUv3 devices (Jouppi et al., 2020) have similarly optimized capability for reduced-precision floating-point arithmetic (0.12 PFlop/s at 16-bit precision) to support deep-learning applications. Furthermore, previous studies have found that using deep NNs with half-precision multipliers has little effect on accuracy when used for classification (Courbariaux et al., 2014; Miyashita et al., 2016; Micikevicius et al., 2017; Das et al., 2018). For regression tasks, which are needed for parameterizations, a reduced-precision NN might not be as accurate as a full-precision NN, but it has been argued that the dynamics at the small horizontal length scales represented by parameterizations do not need the same level of precision as the large-scale dynamics because of the unpredictable and stochastic nature of the small-scale flow (Palmer, 2014). Indeed, the use of reduced precision in parts of weather or climate models has already been proposed as a way to increase the speed of simulations and reduce their energy cost (Düben et al., 2017).
However, there have been issues of instability and climate drift when NN parameterizations are used in GCMs. For example, NN parameterizations used in the superparameterized community atmosphere model (SPCAM) were often found to be unstable when coupled dynamically to the GCM. Rasp et al. (2018) were able to achieve a stable simulation with an NN parameterization, but small changes to the configuration led to blow-ups in the simulations (Rasp, 2020). Furthermore, when the number of embedded cloud-resolving columns within each coarse-grid cell of SPCAM was quadrupled (from 8 to 32), instabilities returned (Brenowitz et al., 2020). Brenowitz and Bretherton (2018, 2019) learned an NN parameterization in a near-global cloud-system-resolving model (CRM) and were able to deal with instabilities by removing the upper-atmospheric humidity and temperature from the input space and by using a training cost function that takes into account the predictions over several forward time steps. Although these changes in the learning structure led to stable simulations at coarse resolution with the NN parameterization, the climates of these simulations drifted on longer time scales and were not accurate. Brenowitz et al. (2020) found, using spectral analysis, that instability occurs when GCMs are coupled to NNs that support unstable gravity waves with certain phase speeds. Ott et al. (2020) tested the stability of simulations coupled to more than a hundred different NNs and found a correlation between offline performance (i.e., the quality of predictions from NNs when they are not coupled to a GCM) and how long simulations run before they blow up. These results suggest that exhaustive hyperparameter tuning might be necessary in order to achieve stability in GCM simulations that are coupled to NNs.
RFs might be more stable than NNs since their predictions for any given input are averages over samples in the training dataset, and thus they automatically satisfy physical constraints such as energy conservation (O'Gorman and Dwyer, 2018; YOG20) and make conservative predictions for samples outside of the training data (NNs can also be forced to satisfy analytic constraints (Beucler et al., 2019), but such NNs have not yet been coupled to a GCM). However, the RFs and NNs in the studies mentioned above used different training datasets and preprocessed the high-resolution data differently. Therefore, it is difficult to determine whether the stability arises from the different processing of the high-resolution model output or from the different properties of RFs compared to NNs. The two main differences in the data processing between YOG20 and the NN studies mentioned above are that (1) the predicted tendencies were calculated accurately for the instantaneous atmospheric state (YOG20) rather than approximated based on differences over 3-hour periods (Brenowitz and Bretherton, 2018, 2019), and (2) the subgrid corrections were calculated independently for each physical process (YOG20) rather than for the combined effect of several processes (Brenowitz and Bretherton, 2018, 2019).
Here we learn an NN parameterization using the same data processing that was used to learn an RF parameterization in YOG20. We show that implementing this parameterization in the same model at coarse resolution leads to stable simulations with a climatology that resembles the one obtained from a high-resolution simulation, and we compare the performance of an NN parameterization to the performance of an RF parameterization. We also test how the simulated climate is affected when reducing the precision of the inputs and outputs of the NN parameterization.
We first describe the high-resolution simulation from which the training data was calculated and how this data was coarse-grained (section 2). We then describe the structure of the NN parameterization and explain how this structure ensures that the NN parameterization is consistent with several physical properties (section 3). We compare simulations using the NN parameterization to the high-resolution simulation and to a simulation with an RF parameterization, and we investigate the dependence of the simulated climate on the numerical precision of the NN parameterization (section 4). Finally, we give our conclusions (section 5).

Simulations
All simulations were run on a quasi-global aquaplanet configured as an equatorial beta-plane using the System for Atmospheric Modeling (SAM) version 6.3. The domain has a zonal width of 6,912 km and a meridional extent of 17,280 km. The distribution of sea surface temperature (SST) is specified to be the qobs SST distribution (Neale and Hoskins, 2000), which is zonally and hemispherically symmetric. There are 48 vertical levels. We use hypohydrostatic rescaling of the equations of motion with a scaling factor of 4, which increases the horizontal length scale of convection and allows us to use a horizontal grid spacing of 12 km for the high-resolution simulation (referred to as hi-res) while explicitly representing both convection and planetary-scale circulations in the same quasi-global simulation (Kuang et al., 2005; Pauluis and Garner, 2006; Garner et al., 2007; Boos et al., 2016; Fedorov et al., 2019). The hi-res simulation is the same simulation that was used for training in YOG20. It was recently shown that the tropical distributions of precipitation intensity and precipitation cluster size in hi-res are similar to the distributions calculated from satellite-based observations. Further details of the model configuration are given in YOG20.
In addition to hi-res, we also consider simulations at horizontal grid spacings of 96km and 192km, which will be referred to as x8 and x16, respectively, since they correspond to coarser grid spacings by factors of 8 and 16, respectively. We ran a simulation at 96km horizontal grid spacing without an NN parameterization (x8), several simulations at 96km horizontal grid spacing with an NN parameterization (x8-NN, variants of this simulation are described in the text below), and simulations at 192km horizontal grid spacing with (x16-NN) and without (x16) an NN parameterization. All simulations were run for 600 days. The first 100 days are considered as spin up, and the presented results are taken from the last 500 days of each simulation. Simulations with the NN parameterization start with initial conditions taken from the last time step of the simulations without the NN parameterization at the same resolution.

Coarse-graining and calculation of subgrid terms
The NN parameterization aims to account for unresolved processes that act in the vertical and affect the thermodynamic and moisture prognostic variables. There are three thermodynamic and moisture prognostic variables in SAM: liquid-ice static energy h_L, total non-precipitating water mixing ratio q_T, and precipitating water mixing ratio q_p. Since q_p varies on short time scales that are not typically resolved in climate models, we do not include q_p in the NN parameterization scheme, following the "alternative" parameterization approach in YOG20. Consequently, we reformulate the equations of motion and define a different thermodynamic variable (H_L) that does not include the energy associated with precipitating water (text S1).
For each 3-hourly snapshot from hi-res, we coarse grain the prognostic variables (u, v, w, H_L, q_T, q_p, where u, v and w are the zonal, meridional and vertical winds, respectively), the vertical advective fluxes, sedimentation fluxes, surface fluxes, the tendency of the non-precipitating water mixing ratio due to cloud microphysics, the turbulent diffusivity, radiative heating and temperature. Coarse-graining is performed by spatial averaging, as in YOG20, to horizontal grid spacings of 96 km (x8) and 192 km (x16).
We define the resolved flux of a given variable as the flux calculated using the dynamics and physics of SAM with the coarse-grained prognostic variables as inputs. The flux due to unresolved (subgrid) physical processes is calculated as the difference between the coarse-grained flux and the resolved flux. For example, the subgrid flux of H_L due to vertical advection is calculated as

(H_L)^{subg-flux}_{adv} = \overline{\rho_0 w H_L} - \rho_0 \, \overline{w} \, \overline{H_L},    (1)

where overbars denote coarse-grained variables. For each high-resolution snapshot, the coarse-grained prognostic fields are used to calculate the resolved vertical advective, sedimentation and surface fluxes. The subgrid fluxes are then calculated by taking the difference between the coarse-grained and resolved fluxes.
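The coarse-graining and flux-difference calculation can be sketched in a few lines of numpy. This is an illustrative sketch and not the paper's actual pipeline: the function names, the single-level treatment and the uniform block-averaging are assumptions.

```python
import numpy as np

def coarse_grain(field, factor):
    """Block-average a 2D (y, x) field over factor x factor cells."""
    ny, nx = field.shape
    return field.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

def subgrid_advective_flux(rho0, w, h, factor):
    """Subgrid vertical advective flux at one model level: the
    coarse-grained flux minus the flux of the coarse-grained fields (eq. 1)."""
    flux_coarse = rho0 * coarse_grain(w * h, factor)
    flux_resolved = rho0 * coarse_grain(w, factor) * coarse_grain(h, factor)
    return flux_coarse - flux_resolved
```

By construction the subgrid flux vanishes when w is horizontally uniform within each coarse cell, since the coarse-grained flux then equals the resolved flux.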
3 Neural network parameterization

Parameterization structure
The structure of the NN parameterization is broadly similar to the RF parameterization used in YOG20, except that where possible we predict fluxes and sources/sinks (rather than the net tendencies predicted in YOG20) in order to incorporate physical constraints into the NN parameterization (section 3.2). By contrast, for the RF parameterization in YOG20 it was sufficient to predict net tendencies because the RF predicted averages over subsamples of the training dataset and thus automatically respected physical constraints such as energy conservation.
The NN parameterization is composed of two different NNs (Figure 1). The first NN, referred to as NN1, predicts the vertical profiles of the subgrid vertical advective fluxes of H_L and q_T, the subgrid flux of q_T due to cloud-ice sedimentation, the coarse-grained tendency of q_T due to cloud microphysics, and the sum of the total radiative heating and the heating from phase changes of precipitation (text S1). Thus the outputs of NN1 are

Y_{NN1} = ( (H_L)^{subg-flux}_{adv}, (q_T)^{subg-flux}_{adv}, (q_T)^{subg-flux}_{sed}, (q_T)^{tend}_{mic}, (H_L)^{tend}_{rad-phase} ),    (2)

where the superscript subg-flux refers to subgrid fluxes and the superscript tend refers to the total tendency due to a process. The tendencies due to cloud microphysics, radiative heating and heating from phase changes of precipitation are treated as entirely subgrid. In the case of cloud microphysics and phase changes of precipitation, this is because it is not possible to calculate the resolved values of these processes when q_p is not used as a prognostic variable in the simulations (text S1). Tendencies are predicted at the lowest 30 "full" model levels (below z = 13.9 km), while subgrid ice sedimentation fluxes are predicted at the lowest 30 "half" model levels, and vertical advective fluxes are predicted at the 29 "half" model levels above the surface (since the advective flux is zero at the surface over the ocean). Thus, NN1 has 29 × 2 + 30 × 3 = 148 outputs. We do not use NN1 to predict outputs for levels above 13.9 km (≈ 134 hPa) since the NN parameterization is not accurate at these levels, we want to avoid predicting near the sponge layer, which is active at heights above 20 km, and the predicted tendencies and subgrid fluxes are small above 13.9 km with the exception of radiative heating. Above 13.9 km the subgrid fluxes and microphysical tendency are set to zero, and the radiative heating tendencies calculated at coarse resolution from SAM are used.
The inputs for NN1 are the resolved vertical profiles of q_T and temperature (T), as well as the distance to the equator (|y|, which is a proxy for insolation and surface albedo), giving 30 × 2 + 1 = 61 inputs. We verified that using top-of-atmosphere insolation as an input instead of the distance to the equator does not change the results presented in this study (Figure S1). We found that using all 48 levels of T and q_T as inputs to NN1 leads to an instability, possibly related to instabilities found in previous studies when q_p is not used as a prognostic variable (Brenowitz and Bretherton, 2019; Brenowitz et al., 2020; YOG20). However, while Brenowitz and Bretherton (2019) and Brenowitz et al. (2020) removed some inputs above the mid-troposphere to achieve stability, here it is sufficient to exclude only the stratospheric inputs, so that inputs extend up to the lower stratosphere.
The second NN, referred to as NN2, predicts the subgrid surface fluxes of H_L and q_T and the coarse-grained vertical turbulent diffusivity (D) that is used for H_L and q_T. We only predict D in the lower troposphere (the 15 model levels below 5.7 km) because it decreases in magnitude with height (YOG20), and the diffusivity calculated at coarse resolution from SAM is used above 5.7 km. Hence the outputs of NN2 are

Y_{NN2} = ( D, (H_L)^{subg-flux}_{surf}, (q_T)^{subg-flux}_{surf} ),    (3)

giving 15 + 1 + 1 = 17 outputs. The inputs of NN2 are chosen to be the lower-tropospheric vertical profiles of T, q_T, u and v, the surface wind speed (wind_surf) and the SST, so that X_{NN2} = (T, q_T, u, v, wind_surf, SST), giving 4 × 15 + 1 + 1 = 62 features, where v in the southern hemisphere is multiplied by −1 when used as an input (see further discussion in YOG20).

Figure 1: Illustration of the parameterization structure. There are two different NNs included in the parameterization, each having its own inputs and outputs (full description in section 3.1). Arrows indicate subgrid fluxes due to vertical advection and sedimentation, and ellipses indicate tendencies associated with sources/sinks due to cloud microphysics, radiation and phase changes of precipitating water. Blue (red) color indicates a variable, tendency or flux associated with moisture (energy).

The tendencies due to the subgrid vertical advective, sedimentation and surface fluxes are calculated online (i.e., when running SAM with the parameterization) from the predicted fluxes. The subgrid energy flux due to ice sedimentation is also calculated online as

(H_L)^{subg-flux}_{sed} = -L_s (q_T)^{subg-flux}_{sed},    (4)

where L_s is the latent heat of sublimation. Similarly, the tendency of energy due to cloud microphysics is calculated online as

(H_L)^{tend}_{mic} = -L_p (q_T)^{tend}_{mic},    (5)

where L_p = L_c + L_f (1 − ω_p) is the effective latent heat associated with precipitation (L_c and L_f are the latent heats of condensation and fusion, respectively, and ω_p is the precipitation partition function determining the ratio between ice and liquid phases).
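The diagnostic relations for the H_L sedimentation flux and microphysical tendency (equations 4 and 5) amount to multiplying the predicted q_T terms by a negative latent heat. A minimal sketch follows; the numerical latent-heat values are standard textbook numbers used for illustration, not taken from the paper.

```python
import numpy as np

# Standard latent heats (J/kg); illustrative values, not from the paper.
L_c = 2.5104e6   # condensation
L_f = 0.3336e6   # fusion
L_s = L_c + L_f  # sublimation

def diagnose_energy_terms(qt_sed_flux, qt_mic_tend, omega_p):
    """Diagnose the H_L sedimentation flux and microphysical tendency
    from the predicted q_T terms (equations 4 and 5)."""
    hl_sed_flux = -L_s * qt_sed_flux           # eq. 4
    L_p = L_c + L_f * (1.0 - omega_p)          # effective latent heat
    hl_mic_tend = -L_p * qt_mic_tend           # eq. 5
    return hl_sed_flux, hl_mic_tend
```

Note that when ω_p = 0 (all precipitation frozen), L_p reduces to L_s, so microphysical conversion to ice and ice sedimentation are treated with a consistent latent heat.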
The results presented in the main paper are for NNs (both NN1 and NN2) with five layers (four hidden layers) of 128 nodes and rectified linear unit (ReLU) activation functions. Results for different NN architectures are shown in the supplementary information (Figure S2). More details about the training and test datasets, the NN training protocol, and how inputs and outputs are rescaled can be found in the supplementary information (text S2). When running simulations with the NN parameterization, we limit the predicted q_T fluxes and tendencies so that q_T cannot become negative. More specifically, if the NN parameterization predicts a flux or tendency at any vertical level that, acting on its own, would drive q_T negative, we reduce the magnitude of that flux or tendency so that the minimum value q_T can reach is zero. Removing this constraint (and allowing negative q_T values) leads to a substantially different climate when the NN parameterization is used, although the simulations remain stable.
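The architecture and limiter described above can be sketched in plain numpy: a forward pass through NN1 (61 inputs, four hidden ReLU layers of 128 nodes, 148 linear outputs) and a non-negativity limiter on a predicted q_T tendency. The initialization, function names and the exact form of the limiter are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(sizes):
    """Randomly initialized weights/biases for a fully connected network."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU on the hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return x @ W + b

# NN1: 30 levels of q_T and T plus |y| -> 148 fluxes/tendencies.
nn1 = init_mlp([61, 128, 128, 128, 128, 148])

def limit_qt_tendency(qT, tend, dt):
    """Clip a predicted q_T tendency so that, acting alone over one
    time step dt, it cannot drive q_T below zero."""
    return np.maximum(tend, -qT / dt)
```

The limiter only rescales drying tendencies; moistening tendencies pass through unchanged.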

Physical consistency of the parameterization
Previous studies that used NN parameterizations usually predicted the sum of tendencies due to several different processes as a single output (Brenowitz and Bretherton, 2018, 2019). The coarse-graining and calculation of subgrid terms used in this study (section 2.2) enables us to predict the effect of each process on the prognostic variables separately (section 3.1), and where possible, the effect is diagnosed from other predicted outputs (equations 4 and 5). Furthermore, the NN parameterization predicts fluxes instead of tendencies where possible. These differences make the NN parameterization presented in this study physically consistent in the following ways:

(a) Predicting the subgrid fluxes due to vertical advection instead of the subgrid tendencies guarantees that energy and water are conserved by these fluxes. Similarly, predicting the flux due to sedimentation guarantees that changes in the energy of the atmospheric column due to sedimentation are only due to sedimentation that reaches the surface.

(b) Changes in H_L due to cloud microphysics and ice sedimentation are not predicted by the NN, but are instead diagnosed from equations 4 and 5 (Figure 1). Diagnosing these tendencies guarantees that the sources or sinks of energy at each grid point due to cloud microphysics and sedimentation are consistent with the amount of moisture that was subtracted or added at that grid point.

(c) The NN predicts the coarse-grained vertical diffusivity (instead of predicting subgrid diffusive tendencies or fluxes), which ensures that diffusive fluxes are downgradient and conserve energy and water in each atmospheric column.

(d) The precipitation is diagnosed from the NN outputs (text S1). This was not done in some previous studies using NN parameterizations, in which an NN was used to predict precipitation directly or in which the NN outputs could only be used to predict the difference between precipitation and evaporation (Brenowitz and Bretherton, 2019).

Properties (a), (b) and (c) ensure that the NN parameterization exactly conserves energy in the sense that the column-integrated frozen moist static energy is conserved in the absence of radiative heating, heating from phase changes of precipitation, surface fluxes, and conversion of condensate to graupel or snow (text S3).
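The conservation argument behind predicting fluxes is that tendencies derived from a flux are flux divergences, so their mass-weighted column integral telescopes to the boundary fluxes. A toy numpy check of this (the grid, density profile and flux values are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

nz = 30
dz = 500.0                                   # uniform layer thickness (m)
rho0 = np.exp(-np.arange(nz) * dz / 8000.0)  # toy reference density profile

# A "predicted" subgrid flux on the nz+1 interface ("half") levels, set to
# zero at the surface and at the top, as for advective fluxes over the ocean.
flux = rng.normal(size=nz + 1)
flux[0] = flux[-1] = 0.0

# Tendency on full levels from the flux divergence.
tend = -(flux[1:] - flux[:-1]) / (rho0 * dz)

# The mass-weighted column integral telescopes to the boundary fluxes,
# which vanish here, so the column total is conserved to round-off.
column_total = np.sum(rho0 * tend * dz)
```

However the NN distributes the flux in the vertical, the column budget is closed by construction; only a nonzero surface or top flux can change the column total.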

Simulation with neural-network parameterization
To assess the NN parameterization, we compare the climates of x8, x8-NN (using NNs with five fully connected layers, text S2) and hi-res. We focus on the zonal- and time-mean precipitation distribution (Figure 2a) and the frequency distribution of the 3-hourly precipitation rate (Figure 2b). The frequency distribution is known to be sensitive to the subgrid parameterization of moist convection (Wilcox and Donner, 2007), and the latitudinal distribution of mean precipitation is especially sensitive to subgrid parameterizations in the zonally and hemispherically symmetric aquaplanet configuration used here (Möbis and Stevens, 2012). The frequency distribution is estimated using 34 bins that are equally spaced in the logarithm of the precipitation rate, where the lowest bin starts at 1 mm day−1, and the distribution is normalized such that it integrates to one when considering the whole distribution (including values lower than 1 mm day−1). The hi-res and x8-NN simulations exhibit a similar double-ITCZ structure, both in amplitude and in latitudinal structure (Figure 2a). Note that the presence of a double ITCZ in hi-res is likely to depend on the exact domain and SST distribution. In contrast, the x8 simulation exhibits a single ITCZ (Figure 2a), highlighting the sensitivity of the tropical circulation to changes in horizontal resolution and to the inclusion of the NN parameterization. The frequency distribution of precipitation in x8-NN also closely matches that of hi-res (R² = 0.99 as measured across bins), although x8-NN overestimates the extreme events, while the frequency distribution in x8 differs substantially from that of hi-res (R² = 0.35), especially for weak and extreme precipitation events (Figure 2b).
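The frequency distribution described above (34 bins equally spaced in the logarithm of precipitation rate, lowest edge at 1 mm day−1, normalized by the total sample count) can be computed as follows. The upper bin edge is an assumption, since it is not stated in the text.

```python
import numpy as np

def precip_frequency(precip, n_bins=34, lo=1.0, hi=1.0e3):
    """Frequency distribution of precipitation rate (mm/day).

    Bins are equally spaced in log10 between lo and hi (hi is assumed).
    Dividing counts by the total sample count means the distribution
    integrates to one only once samples below lo are also accounted for.
    """
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    counts, _ = np.histogram(precip, bins=edges)
    density = counts / (precip.size * np.diff(edges))
    return density, edges
```

With this normalization, the integral over the plotted range equals the fraction of samples at or above 1 mm day−1, so the missing mass below the lowest bin completes the distribution to one.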
The NN parameterization also brings the zonal- and time-means of the zonal wind, meridional wind, eddy kinetic energy and non-precipitating water closer to the hi-res climatology, and it outperforms the x8 simulation for these variables (table S1). These results are reproduced when the NN1 architecture is changed to have only three or four layers (Figure S2), although the amplitude of the equatorial minimum of precipitation varies slightly between simulations. All the simulations we ran were stable and without climate drift, and even a substantially less accurate NN parameterization with two layers for NN1 (table S2) leads to stable simulations, though it does not capture the precipitation distributions (Figure S2). When training an NN parameterization for a coarse-graining factor of 16 and coupling it to a simulation with the corresponding grid spacing, similar results are obtained and the precipitation distributions are captured correctly (Figure S3). Overall, we find that coupling an NN parameterization to a simulation at coarse resolution leads to a climate that is similar to the hi-res climate, and that the stability of the simulations is not sensitive to the NN architecture, hyperparameter tuning or the horizontal resolution of the simulations. Several approximations were made when deriving the instantaneous precipitation rate (text S1). These approximations and inaccuracies in the NN predictions result in negative 3-hourly surface precipitation in 20% of samples in the x8-NN simulation. Nevertheless, almost all of the negative values are very small, and only 0.03% of samples are below −1 mm day−1.

Comparing neural network and random forest parameterizations
In this section we compare the offline and online performance of the NN parameterization to the ("alternative") RF parameterization that was developed in YOG20, which also did not include q_p. For offline performance (performance when not coupled to a GCM), we find that the NN parameterization outperforms the RF parameterization across all variables (Figure S4, text S4). Yet the online performance of the x8-NN and x8-RF simulations is comparable (Figure 2c,d and table S1). Both x8-NN and x8-RF have a double ITCZ, as in hi-res. The x8-NN simulation better captures the frequency distribution at most precipitation rates, though for extreme events x8-RF is more accurate (Figure 2d). Other climatological measures, such as mean q_T, meridional and zonal wind, and eddy kinetic energy, are better captured by x8-RF (table S1).
One advantage of NNs is that they require far less memory than RFs. For example, for x8, NN1 is 0.3 MB and NN2 is 0.2 MB when stored in netCDF format at single precision, roughly 1900 times less memory than is needed to store the RF parameterization. Another advantage of the NN parameterization is that the x8-NN simulation requires less CPU time than the x8-RF simulation by a factor of 1.25, although both require much less CPU time than hi-res (by a factor of 54 in the case of x8-NN).

Reduced-precision parameterization
As discussed in the introduction, NN parameterizations run at reduced precision could bring considerable savings in speed and energy. To test the effect of reduced-precision NNs, we run simulations in which the scaled inputs and outputs of the NNs are reduced in precision. These simulations test how precise the inputs and outputs have to be in order to give a climate similar to that obtained at full precision. In our default configuration, SAM and the NNs are evaluated in single precision, which corresponds to 23 bits in the mantissa. We reduce the precision by rounding to 7, 5, 3 and 1 bits in the mantissa, while keeping the same number of bits in the exponent (8 bits); the resulting simulations are referred to as x8-NN-7bits, x8-NN-5bits, x8-NN-3bits and x8-NN-1bits. Note that 7 bits in the mantissa corresponds to the bfloat16 "brain" floating-point format (7 bits in the mantissa, 8 in the exponent and 1 to determine the sign), which is used in TPUs.
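Rounding values to k mantissa bits while keeping the 8 exponent bits of float32 can be emulated with bit masking. The following sketch (round-to-nearest by adding half of the last kept bit, valid for 1 ≤ bits ≤ 22) is our own illustration of the idea, not the paper's code.

```python
import numpy as np

def round_mantissa(x, bits):
    """Keep `bits` explicit mantissa bits of float32 values (1 <= bits <= 22).

    Rounds to nearest by adding half of the last kept bit before masking;
    the 8 exponent bits and the sign bit are untouched (a carry into the
    exponent correctly rounds up to the next power of two).
    """
    x = np.asarray(x, dtype=np.float32)
    u = x.view(np.uint32)
    drop = 23 - bits                          # float32 stores 23 mantissa bits
    half = np.uint32(1 << (drop - 1))         # half of the last kept bit
    mask = np.uint32(0xFFFFFFFF >> drop << drop)
    return ((u + half) & mask).view(np.float32)
```

With bits=7 this matches the mantissa precision of bfloat16 while retaining the full float32 exponent range, which mirrors the rounding applied to the scaled NN inputs and outputs.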
For the x8-NN-7bits and x8-NN-5bits simulations we find that the resulting climate is similar to that of x8-NN, while reducing the precision to 3 bits (x8-NN-3bits) leads to a small degradation, and keeping only 1 bit in the mantissa (x8-NN-1bits) clearly degrades the results (table S1, Figure 2e,f). Overall, we find that it is possible to reduce the precision of the inputs and outputs even beyond the bfloat16 format without substantially affecting the climate of the simulations, suggesting that NN parameterizations at reduced precision could bring substantial advantages in speed and power requirements.

Conclusions
In this study we show that an NN parameterization with a physically motivated structure, learned from coarse-grained output of a three-dimensional high-resolution simulation of the atmosphere, can be dynamically coupled to the atmospheric model at coarse resolution to give stable simulations with a climate that is close to that of the high-resolution simulation. In contrast to previous NN parameterizations that were often unstable when coupled to a GCM, we find that simulations with the NN parameterization are stable for a variety of configurations, and in contrast to Brenowitz and Bretherton (2018, 2019), they do not exhibit climate drift. Furthermore, in contrast to Ott et al. (2020), we find that achieving stable simulations does not require the NN parameterization to be very accurate in an offline test: even a mediocre-performing NN1 with two layers (i.e., one hidden layer) is stable when coupled to SAM. We use the same high-resolution model output for training that was used by YOG20, which suggests that the stability of simulations with an RF parameterization in previous studies (O'Gorman and Dwyer, 2018; YOG20) is not unique to RFs, since we find NNs to be robustly stable as well.

Figure 2: Results are shown for the high-resolution simulation (hi-res in blue; panels a-d), the coarse-resolution simulation (x8 in green; panels a-b), the coarse-resolution simulation with the NN parameterization (x8-NN in orange dash-dotted; panels a-f), the coarse-resolution simulation with the RF parameterization (x8-RF in black; panels c-d), and simulations with reduced numerical precision of the inputs and outputs of the NN parameterization with 7 significant bits in the mantissa (x8-NN-7bits in gray; panels e-f) and 1 significant bit in the mantissa (x8-NN-1bits in yellow; panels e-f), as compared to 23 bits in the mantissa for x8-NN. The frequency distribution is shown on axes with a linear scale in the insets of panels b, d and f.
Instead, the stability of simulations with NN parameterizations may require accurate processing of the high-resolution model output to obtain subgrid tendencies and fluxes (in addition to not including stratospheric levels as inputs). The main differences in the processing used here compared to previous studies with NN parameterizations of atmospheric subgrid processes are that the contributions of the subgrid terms were calculated using the equations of the model for the instantaneous state of the atmosphere, rather than approximated based on differences over 3-hour periods (Brenowitz and Bretherton, 2018, 2019), and that the subgrid corrections were calculated independently for each physical process. The latter allows us to encapsulate more physics in the parameterization, for example by calculating fluxes and sinks rather than net tendencies. A direct comparison between the RF and NN parameterizations shows that although NNs are more accurate in offline tests, both parameterizations give comparable results when coupled to the atmospheric model at coarse resolution. Overall, our results imply that careful and accurate processing of the high-resolution output used for training may be more important than intensive hyperparameter tuning of an ML algorithm. The time step in our coarse-resolution simulations cannot be larger than roughly 75 seconds because of the explicit time stepping of turbulent diffusion in SAM, and future work is needed to extend these results to longer time steps and to simulations with land and topography.
Finally, we show that reducing the numerical precision of the inputs and outputs of the NN parameterization to the bfloat16 floating-point format leads to a similar climate compared to using single precision. This implies that NN parameterizations with reduced precision could be used for faster training and, more importantly, for reducing the computational resources and energy needed to run climate simulations. To further investigate the feasibility of NN parameterizations with reduced precision, future work should also test NN parameterizations that were trained with reduced precision, such that the weights, biases and multiplications used during forward propagation of the NN are performed at reduced precision. Our results also suggest that the simulated climate may not be strongly affected by reducing the precision of conventional parameterizations or superparameterizations, but in those cases it would likely be necessary to check for each particular parameterization which parts of the algorithm can safely be reduced in precision (Düben et al., 2017).

Acknowledgments: support from the EAPS Houghton-Lorenz postdoctoral fellowship and NSF AGS-1552195.
Supplementary Information for "Use of neural networks for stable, accurate and physically consistent parameterization of subgrid atmospheric processes with good performance at reduced precision"

Here we present further information on the equations of motion for thermodynamic and moisture variables used in SAM and on how these equations are modified when we omit precipitating water as a prognostic variable, following the approach in Yuval and O'Gorman (2020) (hereinafter referred to as YOG20) (text S1). We also describe the training data, training procedure and architecture of the neural networks (NNs) and their implementation in SAM (text S2), show that the NN parameterization conserves the frozen moist static energy (text S3), and compare the offline performance of the NN and RF parameterizations (text S4). Tables S1 to S2 are also provided.
Text S1. Equations of motion used for thermodynamic and moisture variables

The equations for these variables in SAM may be written as

$$\frac{\partial h_L}{\partial t} = -\frac{1}{\rho_0}\frac{\partial (\rho_0 u_i h_L)}{\partial x_i} - \frac{1}{\rho_0}\frac{\partial F_{h_L i}}{\partial x_i} - \frac{1}{\rho_0}\frac{\partial (L_p P_{tot} + L_s S)}{\partial z} + (h_L)^{tend}_{rad}, \qquad (S1)$$

$$\frac{\partial q_T}{\partial t} = -\frac{1}{\rho_0}\frac{\partial (\rho_0 u_i q_T)}{\partial x_i} - \frac{1}{\rho_0}\frac{\partial F_{q_T i}}{\partial x_i} + \frac{1}{\rho_0}\frac{\partial S}{\partial z} + (q_T)^{tend}_{mic}, \qquad (S2)$$

$$\frac{\partial q_p}{\partial t} = -\frac{1}{\rho_0}\frac{\partial (\rho_0 u_i q_p)}{\partial x_i} - \frac{1}{\rho_0}\frac{\partial F_{q_p i}}{\partial x_i} + \frac{1}{\rho_0}\frac{\partial P_{tot}}{\partial z} - (q_T)^{tend}_{mic}, \qquad (S3)$$

where h_L is the liquid/ice water static energy (h_L = c_p T + gz − L_c(q_c + q_r) − L_s(q_i + q_s + q_g), and q_c, q_i, q_r, q_s and q_g are the mixing ratios of cloud water, cloud ice, rain, snow and graupel, respectively); q_p is the total precipitating water (rain + graupel + snow) mixing ratio; q_T is the non-precipitating water (water vapor + cloud water + cloud ice) mixing ratio; ρ_0(z) is the reference density profile; u_i = (u, v, w) is the three-dimensional wind; F_{Bi} is the diffusive flux of variable B; P_tot is the precipitation mass flux (due to rain, graupel and snow); S is the sedimentation mass flux; (h_L)^{tend}_{rad} is the tendency due to radiative heating; and (q_T)^{tend}_{mic} is the non-precipitating water mixing ratio tendency due to autoconversion, collection, aggregation, evaporation and sublimation of precipitation. L_c, L_f and L_s are the latent heats of condensation, fusion and sublimation, respectively; L_p = L_c + L_f(1 − ω_p) is the effective latent heat associated with precipitation, where ω_p is the precipitation partition function determining the ratio between ice and liquid phases.
The precipitating water mixing ratio q_p varies on short time scales and is often not used as a prognostic variable in climate models, which typically have a coarser grid and a larger time step than the high-resolution simulation. Therefore, to make our results more generally applicable, we do not include q_p in the NN parameterization scheme presented in this study. Following YOG20, we define a new prognostic energy variable (H_L) that does not include the precipitating water (q_p):

$$H_L = h_L + L_p q_p. \qquad (S4)$$

Neglecting variations of L_p in the horizontal and in time (which are small compared to the vertical variations in L_p) allows us to write the prognostic equation for H_L as

$$\frac{\partial H_L}{\partial t} = -\frac{1}{\rho_0}\frac{\partial (\rho_0 u_i H_L)}{\partial x_i} - \frac{1}{\rho_0}\frac{\partial F_{H_L i}}{\partial x_i} - \frac{1}{\rho_0}\frac{\partial (L_s S)}{\partial z} - L_p (q_T)^{tend}_{mic} + (h_L)^{tend}_{rad} + \frac{1}{\rho_0}\frac{\partial L_p}{\partial z}\left(\rho_0 w q_p + F_{q_p z} - P_{tot}\right), \qquad (S5)$$

where F_{H_L i} = F_{h_L i} + L_p F_{q_p i}, and the last term on the right-hand side results from heating due to phase changes of precipitation. We define the sum of the total radiative heating and the heating from phase changes of precipitation as (H_L)^{tend}_{rad−phase} = (h_L)^{tend}_{rad} + (1/ρ_0)(∂L_p/∂z)(ρ_0 w q_p + F_{q_p z} − P_{tot}). To find an expression for surface precipitation, we assume that at coarse resolution (when the NN parameterization is used) we can neglect the surface diffusive and horizontal fluxes of q_p and the time derivative of q_p in Equation (S3). Using these assumptions and vertically integrating Equation (S3) over the column gives an expression for the surface precipitation rate due to rain, graupel and snow:

$$P_{tot}(z=0) = -\int_0^{z_{top}} \rho_0\, (q_T)^{tend}_{mic}\, dz. \qquad (S6)$$

Following conventions used in SAM, we add any sedimentation at the surface to the surface precipitation rate, and the total surface precipitation rate is calculated as

$$P_{surf} = P_{tot}(z=0) + S_{resolved}(z=0) + S_{subg}(z=0), \qquad (S7)$$

where S_{subg}(z=0) is the subgrid sedimentation flux at the surface and S_{resolved}(z=0) is calculated online using resolved fields in SAM.
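Numerically, the surface precipitation diagnostic above is just a density-weighted column integral of the microphysical q_T tendency with a sign flip. A sketch with an illustrative, made-up drying profile (the grid, density profile and tendency are placeholders, not model output):

```python
import numpy as np

# Diagnose surface precipitation as minus the density-weighted column integral
# of the non-precipitating-water microphysical tendency.
z = np.linspace(0.0, 15e3, 48)                         # height (m)
rho0 = 1.2 * np.exp(-z / 8e3)                          # reference density (kg/m^3)
qT_tend_mic = -2e-8 * np.exp(-((z - 3e3) / 2e3) ** 2)  # drying aloft (kg/kg/s)

integrand = rho0 * qT_tend_mic
# Trapezoidal column integral; positive P_surf means precipitation reaches the surface.
P_surf = -np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(z))  # kg m^-2 s^-1
```

Because the subgrid microphysical tendency is one of the NN outputs, this diagnostic ties the predicted surface precipitation directly to the predicted column drying, which is what makes non-negative precipitation and water conservation easy to enforce.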

Text S2. Training and implementation
The data set for the NN parameterization is obtained from 337.5 days of 3-hourly model output taken from the hi-res simulation. These data were split into training and test datasets, where the first 320.625 days (95% of the data) were used for training and the last 16.875 days (5% of the data) were used as a test dataset. To easily load all of the data into RAM during the training procedure, and to decrease the correlation between different training samples within each 3-hourly snapshot, we reduced the training data set size for the coarse-graining factor of x8 by randomly subsampling 30 (out of 72) atmospheric columns at each latitude for each snapshot. This results in a training dataset size of 13,856,040 samples, where a sample is defined as an individual atmospheric column at a given horizontal location and time step. When using a coarse-graining factor of x16, we use all 36 longitudinal grid points at each latitude, giving 8,313,660 training samples. We note that the split into training and test datasets is slightly different compared to YOG20, where we used 10% of the data as a test dataset.
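The time-based split and per-latitude subsampling described above can be sketched as follows. The function name and dimensions are illustrative (the real pipeline operates on the SAM output files), but the logic matches the description: a chronological train/test split, then a random draw of columns per latitude for each training snapshot:

```python
import numpy as np

def split_and_subsample(n_snapshots, n_lat, n_lon, n_keep, train_frac=0.95, seed=0):
    """Chronological split: the first train_frac of snapshots go to training.
    For each training snapshot and latitude, keep n_keep of n_lon columns."""
    rng = np.random.default_rng(seed)
    n_train = int(train_frac * n_snapshots)
    train_cols = [
        (t, j, rng.choice(n_lon, size=n_keep, replace=False))
        for t in range(n_train)
        for j in range(n_lat)
    ]
    test_snapshots = list(range(n_train, n_snapshots))
    return train_cols, test_snapshots
```

Splitting in time rather than at random avoids leakage between training and test sets, since adjacent 3-hourly snapshots are strongly correlated.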
The NN training is implemented in Python using PyTorch (Paszke et al., 2017). The NN weights and biases are optimized with the Adam optimizer (Kingma and Ba, 2014) combined with a cyclic learning rate (Smith, 2017). We use 1024 samples in each batch and train over 8000 batches before completing a full cycle of the learning rate. We train for 12 epochs, where the first seven epochs use a minimal learning rate of 0.0002 and a maximal learning rate of 0.002, and five additional epochs are trained after reducing both the minimum and maximum learning rates by a factor of 10. The NNs are stored as netCDF files and then implemented in SAM using a Fortran module. The results presented in this work are for NNs with 128 nodes in each hidden layer and rectified linear unit (ReLU) activations, and unless stated otherwise, the NNs have five densely connected layers. Training typically takes 20 minutes when using 10 CPU cores. We note that we do not use batch normalization (Ioffe and Szegedy, 2015) since it leads to unstable simulations: some neurons have vanishingly small variance, which leads to divergence of the prognostic fields, typically within very few time steps.
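A minimal PyTorch sketch of this training setup (the input and output sizes are placeholders, and the data loading, full epoch loop and netCDF export are omitted):

```python
import torch
from torch import nn

n_in, n_out, hidden = 61, 120, 128  # placeholder dimensions

# Five densely connected layers with 128 nodes in each hidden layer and ReLU.
dims = [n_in] + [hidden] * 4 + [n_out]
layers = []
for i in range(len(dims) - 1):
    layers.append(nn.Linear(dims[i], dims[i + 1]))
    if i < len(dims) - 2:
        layers.append(nn.ReLU())
model = nn.Sequential(*layers)

# Adam with a cyclic learning rate: 8000 batches per full cycle (4000 up, 4000
# down) between the minimal (2e-4) and maximal (2e-3) learning rates.
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=2e-4, max_lr=2e-3, step_size_up=4000, cycle_momentum=False)

loss_fn = nn.MSELoss()
x, y = torch.randn(1024, n_in), torch.randn(1024, n_out)  # one illustrative batch
for _ in range(3):  # stand-in for the full 12-epoch loop
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    sched.step()
```

Note that `cycle_momentum=False` is required when pairing `CyclicLR` with Adam, which has no `momentum` parameter.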
Prior to training, each input (feature) of both NNs and the outputs of NN2 were standardized by removing the mean and rescaling to unit variance, where for the diffusivity the mean and variance were calculated across all 15 vertical levels. A different approach was used for the outputs of NN1 in order to weight the effects of the different tendencies and fluxes consistently. We first define, for each training sample and subgrid process, the equivalent tendencies at all vertical levels associated with the fluxes by calculating the tendencies due to the predicted fluxes. We then multiplied all q_T tendencies by L_c such that all tendencies have units of J s^{-1}. Next, we calculated the variance of each of the tendencies associated with each of the outputs (the variance was calculated across all 30 levels of each process) and normalized them all by the variance of (H_L)^{tend}_{rad−phase}, which had the smallest variance. The resulting normalized variances were 3.29 for (H_L)^{subg−flux}_{adv}, 22.30 for (q_T)^{subg−flux}_{adv}, 14.93 for (q_T)^{tend}_{mic}, 2.26 for (q_T)^{subg−flux}_{sed} and 1.00 for (H_L)^{tend}_{rad−phase}. Finally, the outputs of NN1 were standardized by removing the mean and rescaling to the normalized variance. In the reduced-precision simulations, we reduce the precision of the scaled NN inputs and scaled (direct) outputs, but we do not reduce the precision of these means and scaling factors.

Text S3. Conservation of frozen moist static energy

We define the frozen moist static energy (FMSE; Kuang and Bretherton (2006)) as

$$\mathrm{FMSE} = c_p T + gz + L_c q_v - L_f (q_i + q_s + q_g), \qquad (S8)$$

where q_v is the mixing ratio of water vapor and all other variables are defined in text S1. To get an evolution equation for the column-integrated FMSE, we multiply equation S2 by the latent heat of condensation, then sum the multiplied equation with equation S5, and integrate this sum in the vertical with density weighting, which gives

$$\frac{\partial}{\partial t}\int \rho_0 \left(H_L + L_c q_T\right) dz = \left[F_{H_L z} + L_c F_{q_T z}\right]_{z=0} + L_f S(z=0) + \int \rho_0 (H_L)^{tend}_{rad-phase}\, dz - \int \rho_0 L_f (1-\omega_p)(q_T)^{tend}_{mic}\, dz - \frac{\partial}{\partial x_h}\int \left(\rho_0 u_h (H_L + L_c q_T) + F_{H_L h} + L_c F_{q_T h}\right) dz, \qquad (S9)$$

where the subscript h refers to the two horizontal coordinates.
The terms on the right-hand side of equation S9 represent FMSE tendencies due to: (1) surface fluxes, (2) the latent heat of fusion when ice sediments reach the surface, (3) radiative heating and heating from phase changes of precipitation, (4) the latent heat of fusion when condensate is converted to graupel or snow, and (5) horizontal advection and horizontal diffusion. The derivation of equation S9 is unaffected by the presence of the NN parameterization because: (1) the NN parameterization predicts vertical advective fluxes rather than their associated tendencies, and these fluxes are set to zero at the lower boundary, (2) the NN parameterization predicts the diffusivity instead of diffusive tendencies, and (3) the NN parameterization uses equations 4 and 5 in the main paper to diagnose the subgrid fluxes or tendencies of H_L due to sedimentation and cloud microphysics. The NN parameterization does not affect the horizontal fluxes. We thus conclude from equation S9 that the NN parameterization conserves the column-integrated FMSE in the absence of radiative heating, heating from phase changes of precipitation, surface fluxes, conversion of condensate to graupel or snow, and sedimentation that reaches the surface.

Text S4. Offline comparison between random-forest parameterization and neural-network parameterization

Overall, the input and output structures of the NN and RF parameterizations have similarities, though several important differences exist. The inputs for RF-tend (which is analogous to NN1) are similar, except that in RF-tend all 48 vertical levels are used as inputs. Furthermore, RF-tend predicts the net tendency of H_L and q_T, summed over all processes included in NN1, at all 48 vertical levels (with the exception that radiative heating was predicted only up to z = 11.8 km), rather than predicting separate fluxes or tendencies for different physical processes.
As a test, we also tried to use NN1 with an input and output structure similar to RF-tend, having it predict the net tendency over all processes rather than separate fluxes and sinks for each process. However, this approach does not ensure physical consistency in an NN parameterization (see section 3.2 of the main paper), unlike an RF, which predicts averages over training samples and thus automatically conserves energy; moreover, the quality of the online results varied substantially when different grid spacings were used. Future work could further compare identically structured RFs and NNs. Further details about RF-tend can be found in the supplementary information of YOG20.
NN2 and RF-diff have the same unscaled outputs, so their offline results can be compared directly. To compare the offline results of RF-tend and NN1, we use the unscaled tendencies of H_L and q_T as calculated from the outputs predicted by NN1, such that we compare the same target values. Offline performance is measured here as the coefficient of determination (R^2) applied to the test dataset. We find that the NN parameterization outperforms the RF parameterization across all variables (Figure S4).
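The offline comparison reduces to the standard coefficient of determination evaluated separately for each output (e.g., per vertical level) over the test samples. A sketch with synthetic data (array shapes and noise level are illustrative):

```python
import numpy as np

def r2_score(y_true, y_pred, axis=0):
    """Coefficient of determination computed independently along an axis
    (e.g., one R^2 value per vertical level)."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=axis)
    ss_tot = np.sum((y_true - y_true.mean(axis=axis, keepdims=True)) ** 2, axis=axis)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
truth = rng.standard_normal((1000, 48))                # samples x vertical levels
pred = truth + 0.3 * rng.standard_normal((1000, 48))   # imperfect prediction
r2 = r2_score(truth, pred)                             # one R^2 per level
```

Computing R^2 per level (or per latitude) rather than pooled over all outputs exposes where in the column a parameterization performs poorly, which a single aggregate score would hide.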
The RF and NN parameterizations have some differences in their test (text S2) and training datasets. Since the RF parameterization requires more memory during training, we used fewer training samples (5,000,000 for x8-RF compared to 13,856,040 for x8-NN). Using the same number of samples as for the NN led to memory errors when training on nodes with 108 GB of random-access memory. Since the R^2 of the RF parameterization is roughly constant when the number of training samples is increased above 5,000,000 (Fig. S8 in YOG20), it is unlikely that adding more samples would substantially change the offline results of the RF parameterization.

Table S1: Online performance as measured by the coefficient of determination (R^2) of the time- and zonal-mean of eddy kinetic energy, zonal wind, meridional wind, non-precipitating water, precipitation, and of the frequency distribution of 3-hourly precipitation. R^2 is calculated relative to hi-res for the coarse-resolution simulation with no ML parameterization (x8), with the RF parameterization (x8-RF) and with the NN parameterization (x8-NN). R^2 is calculated relative to x8-NN for the simulations with reduced-precision parameterizations with 7 (x8-NN-7bits), 5 (x8-NN-5bits), 3 (x8-NN-3bits) and 1 (x8-NN-1bits) bits in the mantissa. The eddy kinetic energy is defined with respect to the zonal and time mean. x8-NN-7bits corresponds to the bfloat16 half-precision format, which has 7 bits in the mantissa, and the default parameterization (x8-NN) is single precision, which has 23 bits in the mantissa.
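The n-bit experiments in Table S1 can be emulated by rounding the mantissa of the scaled inputs and outputs to a given number of bits. A sketch, assuming round-to-nearest and ignoring exponent-range effects (the function name is ours, not from the paper's code):

```python
import numpy as np

def round_mantissa(x, n_bits):
    """Round a float array to n_bits explicit mantissa bits (round to nearest),
    keeping the exponent untouched."""
    mantissa, exponent = np.frexp(x)      # x = mantissa * 2**exponent, |mantissa| in [0.5, 1)
    scale = 2.0 ** (n_bits + 1)           # +1 accounts for the implicit leading bit
    return np.ldexp(np.round(mantissa * scale) / scale, exponent)

rounded_pi = round_mantissa(np.array([np.pi]), 7)  # -> 3.140625 with 7 mantissa bits
```

Applying this with 7 bits reproduces bfloat16-like rounding of the values themselves, while 5, 3 and 1 bits probe how aggressively the interface precision can be cut before the simulated climate degrades.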
Figure S3: (a) Zonal- and time-mean precipitation rate as a function of latitude and (b) frequency distribution of 3-hourly precipitation rate for the high-resolution simulation (hi-res; blue), the coarse-resolution simulation at 192 km horizontal grid spacing without the NN parameterization (x16; green) and with the NN parameterization (x16-NN; orange dash-dotted).

Figure S4: Coefficient of determination (R^2) for offline performance of the x8-NN and x8-RF parameterizations for the (a) subgrid tendency of non-precipitating water mixing ratio (q_T) as a function of pressure, (b) subgrid tendency of liquid/ice water static energy (H_L) as a function of pressure, (c) subgrid surface energy flux as a function of latitude, (d) subgrid surface moisture flux as a function of latitude, and (e-f) diffusivity as a function of latitude and pressure for the (e) NN and (f) RF. In panels a-d, orange dash-dotted lines show results for the NN and black lines show results for the RF. R^2 is calculated based on the samples from the test datasets.

Table S2: Offline performance of NN1 for different NN architectures and coarse-graining factors, as measured by R^2. The offline performance is given for different outputs for NNs with 2, 3, 4 and 5 layers at x8 resolution (x8-NN-2L, x8-NN-3L, x8-NN-4L, x8-NN, respectively), and for x16 resolution with 5 layers (x16-NN). Note that x8-NN and x16-NN use the default of 5 layers. All vertical levels are included when calculating R^2. All results are based on the test dataset.