Emulation of Cloud Microphysics in a Climate Model

We present a machine learning based emulator of a microphysics scheme for condensation and precipitation processes (Zhao‐Carr) used operationally in a global atmospheric forecast model (FV3GFS). Our tailored emulator architecture achieves high skill (≥94%) in predicting condensate and precipitation amounts and maintains low global‐average bias (≤4%) for 1 year of continuous simulation when replacing the Fortran scheme. The stability and success of this emulator stem from key design decisions. By separating the emulation of condensation and precipitation processes, we can better enforce physical priors, such as mass conservation, the locality of condensation, and the downward fall of precipitation, using specific network architectures. An activity classifier for condensation imitates the discrete‐continuous nature of the Fortran microphysics outputs (i.e., tendencies are identically zero where the scheme is inactive, and condensate is zero where clouds are fully evaporated). A temperature‐scaled conditional loss function ensures accurate condensate adjustments for a high dynamic range of cloud types (e.g., cold, low‐condensate cirrus clouds or warm, condensate‐rich clouds). Despite excellent overall performance, the emulator exhibits some deficiencies in the uppermost model levels, leading to biases in the stratosphere. The emulator also has short episodic skill dropouts in isolated grid columns and is computationally slower than the original Fortran scheme. Nonetheless, our challenges and strategies should be applicable to the emulation of other microphysical schemes. More broadly, our work demonstrates that with suitable physically motivated architectural choices, ML techniques can accurately emulate complex human‐designed parameterizations of fast physical processes central to weather and climate models.


Introduction
Atmospheric models combine fluid dynamics integrated on a discrete global grid with parameterizations of unresolved physical processes for weather and climate prediction. These parameterizations, encompassing phenomena such as cloud formation, precipitation, and radiative transfer, are crafted by experts and typically blend theoretical foundations with empirical relationships to capture interactions between various atmospheric processes. The ongoing development and refinement of these components require a careful balance between accuracy and efficiency to achieve high-fidelity simulations using limited computational resources.
Over the past few decades, advances in machine learning have led to substantial investments in computing facilities that combine more traditional CPU-based computing resources with accelerators such as GPUs. This shift in computational infrastructure has motivated the atmospheric modeling community to explore ways to capitalize on these newer resources to speed up simulations. The fluid dynamics algorithms implemented in atmospheric models can often be recoded for more efficient GPU computation using compiler directives or domain-specific language extensions (Dahm et al., 2023). However, the column-based physics parameterizations often involve more complex logic and data dependencies that do not naturally fit into this paradigm.
An alternative approach to accelerating the physical components of atmospheric models is the creation of machine-learned emulators. Emulators are machine learning (ML) models trained directly on the inputs and outputs of a specific component, aiming to provide a seamless replacement of the original scheme. This strategy offers a natural path to speed up model operation on accelerator-based compute resources, which are optimized to run ML workloads. Consequently, most emulation studies have focused on radiative transfer (Chevallier et al., 1998; Krasnopolsky et al., 2005, 2010; Ukkonen et al., 2020; Veerman et al., 2021), the most expensive subcomponent in the typical atmospheric physics suite. However, recent studies have also emulated deep convection (O'Gorman & Dwyer, 2018), gravity wave drag (Chantry et al., 2021), atmospheric chemistry (Keller & Evans, 2019; Kelp et al., 2022; Schreck et al., 2022), and details of the warm rain process (Gettelman et al., 2021).
Emulation also serves as an excellent test bed for ML approaches that aim to improve on existing physical parameterizations, such as those using fine-resolution data to train corrective ML models (e.g., Brenowitz & Bretherton, 2019; Bretherton et al., 2022; Rasp et al., 2018; Yuval & O'Gorman, 2020). Typically, these learn improvements to the combined suite of physical parameterizations, for example, radiation, microphysics, turbulence and surface exchange, cumulus convection, and orographic drag. Emulation of an individual component physical process is clearly posed as a supervised learning task, so it can be used to explore skill bounds, quirks, and optimal architectural choices for emulating an entire parameterization suite.
The cloud microphysics scheme plays a central role in atmospheric modeling, managing rapid phase changes such as condensation, evaporation, and precipitation. It is tightly coupled to the model dynamics through latent heat release. We are not aware of past studies using ML to emulate an entire microphysics scheme, perhaps due to its lower computational cost compared to radiation. Nevertheless, it is a key part of emulating the combined physical parameterization suite and exposes a variety of ML challenges relevant to that broader problem. It is also a fast-acting process, producing localized atmospheric heating and drying tendencies that are much larger than those from radiation. Thus, emulation of a representative microphysics scheme is a compelling complement to emulation of radiation parameterizations. It can provide valuable insights into the potential and challenges of ML emulators of atmospheric physical processes.
In this work, we train an ML model to emulate the Zhao and Carr (1997, ZC) microphysics scheme. This scheme was used for many years in the Global Forecast System (GFS) model by the U.S. National Centers for Environmental Prediction (NCEP). Here, it is included in a recent version of GFS that uses the FV3 dynamical core (Harris & Lin, 2013), which we call the FV3GFS global atmospheric model. The ZC scheme, with only one prognostic condensate variable, seemed a conceptually simple but worthy machine learning target, even though it was clear from the start that an ML emulator would likely be less computationally efficient. We initially sought to emulate this microphysics scheme using a general ML modeling strategy applicable to parameterizations of other physical processes, such as a dense neural net. However, we were successful only after customizing the ML architecture to better represent causal links between grid points and time steps inherent in cloud microphysics, for example, that clouds condense, form precipitation, and the precipitation interacts with the clouds and clear air into which it falls. Our methods had to handle liquid and ice clouds forming across a wide variety of temperatures, leading to condensate concentrations that range over orders of magnitude (but cannot be negative), together with cloud-free grid points. Importantly, they needed to work online, as a stable, accurate substitute for the original ZC scheme when used in global FV3GFS simulations run out for weeks or even years. These challenges are general to emulating any conventionally designed microphysics parameterization, regardless of its number of hydrometeor types or algorithmic complexity, and they have analogs in the emulation of atmospheric aerosol or chemical parameterizations. Thus, we hope our findings have general value for future model developers contemplating ML emulation of such processes.
In Section 2, we describe the emulator architecture, training data, and integration into the FV3GFS model. In Section 3, we demonstrate that the emulator serves as a stable, skillful replacement for the original Fortran Zhao-Carr microphysics scheme, with low global-average bias for at least 1 year of simulation. Despite impressive overall performance, the emulator induces regional biases in the uppermost model levels; in our experience, this is a relatively common online issue when ML is integrated as one component in conventional atmospheric models (e.g., Brenowitz & Bretherton, 2019; Clark et al., 2022). In Section 4, we discuss the major decisions that influenced the emulator's performance and address some remaining challenges and limitations of our approach.
In accordance with AGU's AI tool policy, the authors acknowledge the use of OpenAI's ChatGPT-4 tool to help edit the manuscript draft for clarity, conciseness, and grammatical correctness. All suggestions provided by the AI tool were reviewed and edited by the authors for correctness and consistency. The plain language summary was generated by prompting the tool for a generally accessible version of our written abstract and then edited by the authors.

Methods
In this work, we utilize the FV3GFS global atmospheric model (Harris & Lin, 2013), which is currently used by NOAA for operational weather forecasting. FV3GFS combines the FV3 nonhydrostatic finite-volume dynamical core with a suite of physical parameterizations developed for the Global Forecast System (GFS). For the simulations presented here, the FV3GFS model is run on a C48 cubed-sphere grid (approximately 200 km horizontal grid spacing) with 79 vertical levels.
Within FV3GFS, we target the emulation of the Zhao-Carr (ZC) microphysics (Zhao & Carr, 1997), which was used in the operational version of GFS until 2018. The ZC microphysics scheme predicts changes in cloud condensate, precipitation, and the associated heating and moistening rates at each grid point in a vertical column, based on state inputs. It features only a single prognostic hydrometeor type: the cloud water mixing ratio. The scheme divides the prediction into two subroutines: one calculating the local condensate change via grid-scale condensation (gscond) and the other calculating column precipitation and evaporative state adjustments (precpd). Figure 1 shows a graphical depiction of the information flow through the ZC microphysics subroutines.
We refer to the gscond scheme as grid-local or simply "local," since the net condensation depends on the current grid point's thermodynamic state (absolute temperature, specific humidity, and cloud condensate). However, gscond diagnoses the phase partitioning of cloud condensate into liquid and ice at each step based on local temperature and, for temperatures in a transition region (0°C ≥ T > −15°C), on the presence of overlying ice cloud, introducing a slight conditional degree of nonlocal dependence. The condensate ice/liquid partitioning is not a formal output of this subroutine, and the calculation is duplicated wherever the information is necessary in the model physics.
On the other hand, we refer to the precpd scheme as vertically "nonlocal," since in addition to the local state, the diagnosis of precipitation and evaporation at each level depends on the output precipitation from levels above. Similar to gscond, precpd partitions the precipitation diagnostically into rain and snow at each grid level during each time step, outputting a total surface precipitation and the fraction of frozen precipitation. Only the total precipitation amount is used by the land model at the end of the physics timestep. The "after last call to gscond" inputs are used to compute a relative humidity tendency that encompasses the rest of the model and precpd; this approach to computing the tendency effectively adds three new state variables to the model. Appendix A gives further details.
The ZC scheme was appealing for ML emulation due to its apparent simplicity. However, the schematic (Figure 1) illustrates that the ZC scheme is architecturally more complex than we anticipated due to the implicit dependence on the column thermodynamic state sampled within the previous time step. Furthermore, the nonlocal phase partitioning of condensate does not appear as an explicit output of the scheme, despite its use by other parameterizations. These subtleties added considerable, time-consuming challenges to developing an accurate ML emulation of the ZC scheme.
To emulate the ZC scheme, we employ hooks to interact with the Fortran model via the package call_py_fort (https://github.com/nbren12/call_py_fort). This package enables users to call functions and interact with selected Fortran state fields within an initialized Python environment, giving access to the comprehensive suite of ML and data tools available in Python and accelerating ML prototyping and testing.

Training Data
We generate the training data set by initializing 30-day simulations from GFS analysis on the first day of each month in 2016, saving fields every 5 hr to sample the diurnal cycle. A list of all stored fields is shown in Table S1 in Supporting Information S1. We reserve 3 months of data for validation during training (February, June, and September). The training data set includes 1,080 global snapshots, each consisting of 48² × 6 tiles = 13,824 atmospheric columns, totaling nearly 15 million samples.
From the saved training data, we derive the target increments for the ZC microphysics that we seek to emulate. The total change, denoted as Δ = Δg + Δp, is the sum of the two subroutine updates from gscond (Δg) and precpd (Δp). Both gscond and precpd calculate updates for temperature (T), specific humidity (q), and the cloud water mixing ratio (c); precpd also diagnoses the amount of surface precipitation (P) during the time step. We note that the use of tendencies in this manuscript refers to the subroutine increment divided by the model time step (15 min). Figure 2 displays an example transect of tendencies of the target data for clouds and humidity along the 100°W meridian. The gscond cloud water tendency (Figure 2a; Δgc) can be positive (condensation) or negative (evaporation), depending on the local thermodynamic state. Active regions in this snapshot include the boundary layer of the subtropical Pacific and free-tropospheric weather features (e.g., convection or frontal zones) over land. Because the cloud water tendency involves a phase change between water vapor and condensate, the corresponding tendencies of temperature (ΔgT) and specific humidity (Δgq) exhibit similar patterns to the cloud water tendency. The gscond tendencies for these three fields are fully determined by the grid-local thermodynamic state, with the exception of one vertically nonlocal flag, which influences the diagnostic decomposition between liquid and ice clouds and the resulting latent heating tendency.
The corresponding precpd condensate tendency transect (Figure 2b; Δpc) shows losses due to autoconversion of thicker clouds to precipitation. Regions of positive precpd vapor tendency (Figure 2c; Δpq) are due to the evaporation of precipitation falling from overlying grid layers.
These transects highlight two general challenges for emulating microphysics. First, the microphysics scheme is not active at the majority of grid points. It produces a range of adjustments to the state fields where clouds or precipitation are present, but elsewhere, the tendencies should be exactly zero. Second, the condensation scheme can generate large condensate increments throughout the troposphere despite the humidity being orders of magnitude smaller in the upper troposphere than near the surface. Some other general considerations are also important for ML microphysics emulation. First, clouds are very sensitive to relative humidity; a small error in predicted water vapor or temperature can significantly impact clouds and precipitation. Second, cirrus clouds with small condensate mixing ratios can be as radiatively important as liquid water clouds with hundred-fold higher condensate mixing ratios. Thus, an ML emulator must accurately predict a large range of condensate tendencies to skillfully reproduce the original model's climate. Third, complete cloud evaporation/sublimation is common; obtaining this outcome in a model time step requires the condensate tendency to exactly remove all cloud condensate in a grid box. Lastly, microphysical tendencies are a combination of local (e.g., condensation) and nonlocal (e.g., column-based precipitation) processes and constraints. An emulation scheme must replicate these dependencies to yield accurate and physically consistent results.
These factors heavily influenced the final design of our emulation methodology, which we detail in the following section.We elaborate on the sensitivity of results to these choices and discuss remaining challenges in Section 4.

Emulator Architecture
The emulation model architecture is shown in Figure 3. Separate emulators for gscond and precpd take a total of 13 input variables, including the same set of inputs as the Fortran ZC scheme: T, q, c, and surface pressure, as well as the "after last gscond" values of T, q, and model-level pressure and surface pressure. We provide additional inputs of the pressure thickness of the atmospheric layer, as well as derived inputs of relative humidity (RH) and log-scaled q, c, and q after last gscond. Each input is normalized as x′j = (xj − μj)/σ (Equation 1) and combined to form input channels for the emulation models. The mean, μj, is a sample mean at level j using 150,000 random columns from the training data. The scaling factor, σ, is calculated using the standard deviation over all per-level centered (xj − μj) values in the same sample. This scaling enhances training stability and conveniently downweights inputs from the upper levels, where the microphysics scheme is less active. Surface variables are normalized as a single level and then broadcast to 79 levels when merged into model inputs to simplify general usage. The same input data are passed to all three of the emulator subcomponents.
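To make the normalization concrete, here is a minimal numpy sketch of the per-level scaling in Equation 1: a per-level mean μj combined with a single scalar σ over all centered values, so that quiescent upper levels are downweighted rather than inflated to unit variance. The function names and toy data are ours, not part of the FV3GFS code.

```python
import numpy as np

def fit_normalization(sample):
    """Fit per-level means and a single scalar scale from a training sample.

    sample: array of shape (n_columns, n_levels) for one input variable.
    Returns (mu, sigma): mu has shape (n_levels,); sigma is one standard
    deviation over all per-level centered values, as described in the text.
    """
    mu = sample.mean(axis=0)          # per-level sample mean, mu_j
    sigma = (sample - mu).std()       # scalar std over all centered values
    return mu, sigma

def normalize(x, mu, sigma):
    """x'_j = (x_j - mu_j) / sigma, applied per level j (Equation 1)."""
    return (x - mu) / sigma

# toy check: levels with small variability stay small after scaling,
# mimicking the downweighting of quiescent upper levels
rng = np.random.default_rng(0)
sample = np.concatenate(
    [rng.normal(0.0, 0.01, (1000, 40)),   # "upper" levels: small variance
     rng.normal(0.0, 1.0, (1000, 39))],   # "lower" levels: large variance
    axis=1)
mu, sigma = fit_normalization(sample)
xp = normalize(sample, mu, sigma)
assert mu.shape == (79,)
assert xp[:, :40].std() < 0.1 * xp[:, 40:].std()
```

Had each level been scaled by its own standard deviation instead, the noisy upper levels would have been amplified to unit variance, which is exactly what the shared scalar σ avoids.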

Condensation Emulator
In the condensation subroutine (gscond), the net condensation Δgc at a given point in an atmospheric column is physically determined by the thermodynamic inputs at that same level (grid-point locality). The gscond emulator takes advantage of this property for predicting Δgc by applying a single multi-layer perceptron (MLP, also known as a feed-forward neural net) to each grid point independently, which we refer to as a dense-local model. The other output terms (Δgq and ΔgT) are determined at runtime via local conservation, as described in Section 2.3. The MLP uses 2 hidden layers of 256 channels, each with ReLU activation. It takes in 79-level × 13-channel inputs, applies the model to each level, and outputs a single column (79 × 1) through a linear readout layer. We train the gscond dense-local model for 50 epochs using the Adam optimizer with a learning rate of 0.0001. We use a mean squared error (MSE; Equation 2) based loss (Equation 3).
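The dense-local construction can be sketched as a plain feed-forward pass applied row-wise over the column. The weights below are random placeholders rather than trained values, and the names are ours; the toy check at the end verifies the grid-point locality the text relies on, namely that perturbing one level changes only that level's output.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_local(x, params):
    """Apply the same MLP to every vertical level of a column.

    x: (79, 13) inputs; returns (79, 1) predicted scaled increments.
    Hidden sizes (2 x 256) follow the text; the weights here are random
    placeholders, not trained values.
    """
    W1, b1, W2, b2, W3, b3 = params
    h = relu(x @ W1 + b1)   # (79, 256), level-wise
    h = relu(h @ W2 + b2)   # (79, 256)
    return h @ W3 + b3      # linear readout -> (79, 1)

rng = np.random.default_rng(0)
params = (rng.normal(0, 0.05, (13, 256)), np.zeros(256),
          rng.normal(0, 0.05, (256, 256)), np.zeros(256),
          rng.normal(0, 0.05, (256, 1)), np.zeros(1))
col = rng.normal(size=(79, 13))
out = dense_local(col, params)
assert out.shape == (79, 1)

# grid-point locality: perturbing level 40 leaves all other levels unchanged
col2 = col.copy()
col2[40] += 1.0
out2 = dense_local(col2, params)
changed = np.abs(out2 - out).ravel() > 0
assert changed[40]
assert not changed[[i for i in range(79) if i != 40]].any()
```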
The target increment in the loss, ỹ (Equation 4), is conditionally scaled due to a physical expectation that cloud properties depend strongly on temperature (Figure S1 in Supporting Information S1). To accurately emulate cold cirrus clouds, which typically have little condensate and correspondingly small condensate increments, and also emulate warm liquid clouds, which can have hundred-fold larger condensate increments, the loss function normalizes to be sensitive in both cases. The scaling terms for the mean μ(Tin) and standard deviation σ(Tin) represent a piecewise interpolation based on the input temperature Tin. We compute the underlying interpolation function by calculating binned mean and standard deviation values after grouping samples of Δgc into 50 linearly spaced bins between the minimum and maximum input temperature. We optimize the gscond emulator ŷ = f(x) to predict temperature-scaled increments (ỹ) as functions of the grid point features x. These increments are descaled into a predicted post-gscond condensate amount (ĉg, Equation 7) by adding the descaled increment to the input condensate amount. We include a post-gscond condensate MSE in the loss (Equation 3) using the normalized condensate amounts (c′g, ĉ′g), scaled by λ = 50,000 to make the loss contribution O(1). The addition of the final condensate value to the loss function improves validation MSE for the unscaled condensate increment by over 80%. This likely happens because the final condensate term gives additional weight to warm-cloud condensation. The remaining state increments for T and q are determined at runtime from the predicted Δgc value (see Section 2.3).
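The conditional scaling can be illustrated with a short numpy sketch (our own names and toy data, not the authors' code): binned means and standard deviations of Δgc as functions of temperature are fit once, then applied via piecewise-linear interpolation so that cold, small-increment samples and warm, large-increment samples contribute comparably to an MSE loss.

```python
import numpy as np

def fit_temperature_scaling(T, dgc, nbins=50):
    """Binned mean/std of the condensate increment as functions of T."""
    edges = np.linspace(T.min(), T.max(), nbins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    which = np.clip(np.digitize(T, edges) - 1, 0, nbins - 1)
    mu = np.array([dgc[which == b].mean() for b in range(nbins)])
    sd = np.array([dgc[which == b].std() for b in range(nbins)])
    return centers, mu, sd

def scale_target(T, dgc, centers, mu, sd, eps=1e-12):
    """y_tilde = (dgc - mu(T)) / sd(T) via piecewise-linear interpolation."""
    return (dgc - np.interp(T, centers, mu)) / (np.interp(T, centers, sd) + eps)

# toy data: increment magnitudes grow strongly with temperature, loosely
# mimicking the cirrus-versus-warm-cloud dynamic range described in the text
rng = np.random.default_rng(0)
T = rng.uniform(200.0, 300.0, 50_000)
dgc = np.exp((T - 200.0) / 20.0) * rng.normal(size=T.size)
centers, mu, sd = fit_temperature_scaling(T, dgc)
y = scale_target(T, dgc, centers, mu, sd)

# after scaling, cold and warm samples have comparable spread, so both
# contribute meaningfully to an MSE loss
assert 0.5 < y[T < 220].std() / y[T > 280].std() < 2.0
```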
Similar to Gettelman et al. (2021), we train an activity classifier to handle the mixed discrete-continuous nature of the condensation scheme, that is, the need to force the emulator prediction to either (a) zero tendency when there should be no cloud change during the time step, or (b) the exact tendency to fully evaporate cloud condensate present at the beginning of the time step. The classifier model employs the same dense-local architecture as the regressor, but predicts four target variables to identify the following classes:
• Δgc = 0,
• cg = 0 and Δgc ≠ 0,
• cg ≠ 0 and Δgc > 0, and
• cg ≠ 0 and Δgc < 0.
The first two cases, corresponding to situations (a) and (b) above, together usually account for 80% or more of the outcomes depending on the level (Figure S2 in Supporting Information S1). During inference, the model constrains Δgc when the classifier identifies either of the first two cases. Otherwise, the regressor makes the condensate prediction. We train the classifier using categorical cross-entropy loss with the same hyperparameters as the regressor, except for an increased learning rate of 0.001. After training, the classifier is approximately 98% accurate over all classes and levels (Table S2 in Supporting Information S1). Please refer to Section 4.1 for a more in-depth discussion of the impacts of the conditional loss function and activity classifier.
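A minimal sketch of this inference-time gating, assuming hypothetical class labels in the order listed above: the first two classes override the regressor with an exactly-zero tendency or exact full evaporation, and the regressor output is kept otherwise.

```python
import numpy as np

# hypothetical labels mirroring the four classifier targets in the text
ZERO_TEND, FULL_EVAP, POS_TEND, NEG_TEND = 0, 1, 2, 3

def apply_classifier(cls, c_in, dgc_regressed):
    """Constrain the regressed condensate increment with the classifier.

    cls: (n,) predicted class per grid point
    c_in: (n,) input cloud condensate
    dgc_regressed: (n,) regressor's condensate increment
    """
    dgc = dgc_regressed.copy()
    dgc[cls == ZERO_TEND] = 0.0                       # inactive: exactly zero
    full = cls == FULL_EVAP
    dgc[full] = -c_in[full]                           # remove all condensate
    return dgc

cls = np.array([ZERO_TEND, FULL_EVAP, POS_TEND, NEG_TEND])
c_in = np.array([0.0, 2.0e-5, 1.0e-5, 3.0e-5])
dgc_reg = np.array([1.0e-7, -1.5e-5, 4.0e-6, -1.0e-6])
dgc = apply_classifier(cls, c_in, dgc_reg)
assert dgc[0] == 0.0                                  # forced to zero
assert c_in[1] + dgc[1] == 0.0                        # full evaporation
assert dgc[2] == 4.0e-6 and dgc[3] == -1.0e-6         # regressor kept
```

The point of the override is that a regressor alone would rarely output exactly zero or exactly −c_in, whereas the Fortran scheme produces those discrete outcomes constantly.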

Precipitation Emulator
The diagnostic precipitation scheme (precpd) generates precipitation through autoconversion of cloud condensate in upper levels. The precipitation falls and can either evaporate in lower layers or reach the surface. To enforce this downward dependence in the precpd emulator by construction, we use a recurrent neural network (RNN) that recurses over vertical layers starting at the top of the atmosphere (see schematic in Figure S3 in Supporting Information S1). A single RNN layer computes hj = (Wh hj−1 + Wx x′j + b)+ from the same normalized inputs, x′j, as the gscond emulator, where j ∈ [0, 79) and j = 0 is the top of the atmosphere. In this form, hj is the RNN hidden state at level j, Wh represents trainable weights for the recursion on the hidden state, Wx are the trainable weights for inputs, b is the bias, and (⋅)+ represents a ReLU activation function.
We stack two hidden 256-channel layers followed by a level-independent linear readout layer (ŷj = Ahj + b) to predict the increments ΔpT, Δpq, and Δpc. This construction ensures that only inputs xi from levels at and above level j (i ≤ j) can affect RNN predictions at level j. Since the state tendency adjustments from precpd are unidirectional, we embed additional constraints within the precpd emulator such that it converts clouds to precipitation (Δpc ≤ 0), that it evaporates precipitation (Δpq ≥ 0 and ΔpT ≤ 0), and that the final cloud is nonnegative (cp ≥ 0). The RNN loss includes the MSE for the normalized increments (using Equation 1 instead of conditional normalization) and the MSE of the normalized post-precpd output for each variable, scaled such that the individual contributions are O(1). The surface precipitation rate (P) is diagnosed from the net loss in total column water at runtime using P = −(1/g) Σj (Δpcj + Δpqj) Δpj (Equation 9), where for each level j, (Δpcj + Δpqj) is the local water change due to autoconversion and evaporation, Δpj is the input pressure thickness of the atmospheric layer, and g is gravity.
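The top-down recursion and the column-water diagnosis can be sketched as follows (one RNN layer with random placeholder weights and our own names; the text stacks two 256-channel layers). The causality check confirms that a perturbation at one level can only influence that level and those below it, matching the physical picture of falling precipitation; the precipitation helper computes the net column water loss divided by gravity, our reading of Equation 9.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def precpd_rnn(x, Wh, Wx, b, A, c):
    """Top-down recursion h_j = relu(Wh h_{j-1} + Wx x'_j + b), readout A h_j + c.

    x: (n_levels, n_in) normalized inputs with level 0 = top of atmosphere.
    """
    h = np.zeros(Wh.shape[0])
    out = np.zeros((x.shape[0], A.shape[0]))
    for j in range(x.shape[0]):           # march downward from the top
        h = relu(Wh @ h + Wx @ x[j] + b)
        out[j] = A @ h + c
    return out

def surface_precip(dpc, dpq, dp, g=9.81):
    """P = -(1/g) sum_j (dpc_j + dpq_j) dp_j: net column water loss per step."""
    return -np.sum((dpc + dpq) * dp) / g

rng = np.random.default_rng(0)
Wh = rng.normal(0, 0.05, (32, 32)); Wx = rng.normal(0, 0.05, (32, 8))
b = np.zeros(32); A = rng.normal(0, 0.05, (3, 32)); c = np.zeros(3)
x = rng.normal(size=(79, 8))
out = precpd_rnn(x, Wh, Wx, b, A, c)

# downward causality: perturbing level 50 cannot affect levels above it
x2 = x.copy(); x2[50] += 1.0
out2 = precpd_rnn(x2, Wh, Wx, b, A, c)
assert np.allclose(out2[:50], out[:50])
assert not np.allclose(out2[50:], out[50:])

# condensate loss exceeding vapor gain yields positive surface precipitation
P = surface_precip(np.full(79, -1.0e-6), np.full(79, 2.0e-7), np.full(79, 1.0e3))
assert P > 0
```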

Prognostic Runs
The utility of a microphysics emulator ultimately depends on its performance when used within the atmospheric model as a substitute for the human-designed parameterization it is trained to replace. Specifically, the emulator should not cause catastrophic model failures, it should consistently provide a skillful representation of the original microphysics behavior, and it should have a minimal impact on the integrated statistics (i.e., the climate) of the underlying model. To test this, we embed the ZC microphysics emulator in FV3GFS and run a series of prognostic tests using two model configurations: one with the emulator as the active microphysics scheme (online) and a baseline with the Fortran microphysics active (offline). In each case, we run the inactive component in a diagnostic mode ("piggybacked"; Grabowski, 2019) and save the resulting tendencies for comparison.
To evaluate the skill and climate impact of the emulated microphysics, we initialize 30-day simulations in each calendar month from February 2016 to January 2017. The initializations are taken from the end of the training data simulations, testing both model configurations on atmospheric states independent of the training data. We compute skill scores for all microphysics tendencies (ΔT, Δq, Δc; converted to a tendency by dividing increments by Δt = 900 s) and P using a modified R² score, 1 − Σ(ŷ − y)²/Σy². A score of 1 indicates a perfect emulation, while a value of 0 or lower indicates an emulator worse than a no-information prediction. We also compute the bias of the microphysics outputs and the atmospheric state over all levels and times of the 12 simulations. To assess long-term stability, we simulate a full year using the emulator in place of the ZC microphysics and check the global averages and bias for evidence of any climate drifts.
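The modified R² score is simple to state in code; a minimal sketch with toy values (the function name is ours):

```python
import numpy as np

def skill_score(y_pred, y_true):
    """Modified R^2: 1 - sum((y_pred - y_true)^2) / sum(y_true^2).

    1 is a perfect emulation; 0 matches predicting zero everywhere (the
    no-information prediction for a tendency field); below 0 is worse.
    """
    return 1.0 - np.sum((y_pred - y_true) ** 2) / np.sum(y_true ** 2)

y = np.array([0.0, 1.0, -2.0, 0.5])
assert skill_score(y, y) == 1.0                    # perfect prediction
assert skill_score(np.zeros_like(y), y) == 0.0     # no-information baseline
assert skill_score(-y, y) < 0.0                    # worse than predicting zero
```

Note the denominator is Σy² rather than the variance about the mean used in the conventional R², which is why an all-zero prediction scores exactly 0 here.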
The last step in applying the emulator as part of an online simulation is to apply final physical limiters and constraints and generate the full set of outputs for the emulated ZC microphysics. For the gscond emulator, we compute the increments ΔgT and Δgq through local conservation of the net condensation. First, we limit the net condensation based on moisture availability (the condensate increase cannot exceed the available water vapor). Then, the change in water vapor mirrors the change in condensate (Δgq = −Δgc) and the temperature change is determined via latent heating (ΔgT = (Lv/cp)Δgc), where Lv is the latent heat of vaporization and cp is the specific heat of air at constant pressure. However, this is an approximation, as this method ignores that some phase changes in ZC occur between ice and vapor, releasing additional latent heat. We did test adding a post hoc corrective term for ice cloud latent heating effects, but this slightly degraded online emulator skill. For online application, we set the top 5 levels of gscond increments to zero, since the ZC microphysics scheme is never active in those stratospheric levels and noise issues in ML-predicted condensate increments arise there (see Section 4.2 for further discussion). Finally, we add the increments to the corresponding input state variables to obtain fields after gscond (Tg = Tin + ΔgT, qg = qin + Δgq, and cg = cin + Δgc).
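The conservation step can be sketched as follows, with representative constant Lv and cp values and a clipping form of the moisture limiter that is our assumption of its behavior (condensation bounded by available vapor, evaporation by available condensate), not the exact Fortran logic:

```python
import numpy as np

LV = 2.5e6   # latent heat of vaporization, J/kg (representative constant)
CP = 1004.0  # specific heat of air at constant pressure, J/(kg K)

def gscond_increments(dgc, q_in, c_in):
    """Derive T and q increments from the predicted condensate increment.

    Vapor mirrors condensate and temperature follows latent heating
    (liquid-only, per the approximation described in the text). The clip
    is a hypothetical form of the moisture/condensate availability limiter.
    """
    dgc = np.clip(dgc, -c_in, q_in)   # bounded by available condensate/vapor
    dgq = -dgc                        # vapor lost equals condensate gained
    dgT = (LV / CP) * dgc             # condensation warms, evaporation cools
    return dgc, dgq, dgT

dgc, dgq, dgT = gscond_increments(
    dgc=np.array([3.0e-4, -5.0e-5]),   # raw predictions: too much of each
    q_in=np.array([2.0e-4, 1.0e-3]),
    c_in=np.array([1.0e-4, 2.0e-5]))
assert np.allclose(dgc, [2.0e-4, -2.0e-5])   # limited by vapor / condensate
assert np.allclose(dgc + dgq, 0.0)           # local water conservation
assert dgT[0] > 0 and dgT[1] < 0             # heating with condensation
```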
The precpd increment constraints are directly integrated into the ML model as described earlier. We derive the surface precipitation (Equation 9) and then add the precpd increments to the post-gscond values to generate the final scheme outputs (Tp = Tg + ΔpT, qp = qg + Δpq, and cp = cg + Δpc).

Results
We begin with the top-level results of our ZC emulation 30-day runs in Table 1. The offline skill scores for all emulated quantities are nearly perfect at ∼99%, with low root mean-square error (RMSE) values and biases that are 1-2 orders of magnitude smaller than the RMSE (i.e., a small component of the error). Online skill is a strict test where deviations from a realistic physical state can cause the diagnostic Fortran microphysics to output large state adjustments or even crash. Nevertheless, when the emulator is used online, it maintains high skill scores with only a ∼1%-5% reduction compared to the offline case. Predicted cloud water tendencies show the lowest average performance at 94%, which is still quite high for a sparse and highly sensitive tendency field. The RMSEs of emulator tendencies versus piggybacked Fortran tendencies are roughly double those of the offline configuration, except for P, where the RMSE is nearly four times larger. The larger online error is an expected outcome due to detrimental feedbacks between the model and the ML emulator that cannot be accounted for when using offline training. The biases remain small in the online case, suggesting no systematic breakdown of the emulator behavior relative to the diagnostic Fortran microphysics.
We compare the time-averaged atmospheric state averaged across the 12 30-day online simulations with identically initialized baseline simulations to show that the emulator produces little mean-state drift when used in FV3GFS in place of the original ZC microphysics. Figure 4 depicts zonal averages of the online bias of the emulator-based simulation compared to the baseline simulation, which have been interpolated from model levels to pressure coordinates to display biases at a true relative height. Table 2 gives the global-average area- and mass-weighted bias for selected output fields.
Cloud water is a key output of the microphysics scheme. Its zonal average mixing ratio (Figures 4a and 4b) has the largest absolute bias near the surface in Antarctica, ∼6 mg/kg. This bias is relatively large for the characteristically cold, dry air there. Outside of the Antarctic, the cloud water biases are ∼3 mg/kg or less, a much smaller relative change from the baseline, and are generally positive, except for a negative bias in the tropical upper troposphere. The global-mean cloud water bias is small, 0.2 mg/kg, an approximately 2% increase compared to the baseline state (Table 2). These cloud changes result in O(1%) changes to the outgoing top-of-atmosphere longwave (−1.4 W/m²) and shortwave (+1.3 W/m²) radiation, but in total the changes largely cancel out.
Figure 4d depicts the online bias in RH, which displays a small shift toward saturation in the middle-to-lower troposphere. The largest biases in RH (>10%) occur in the Antarctic upper atmosphere near the large gradient in drying near the tropopause. There are also similar, albeit smaller, positive RH biases in the tropical and Arctic tropopause regions. Overall, the global-mean RH shows a small positive bias of 0.8% (Table 2), congruent with the small positive cloud water bias.
The zonal average temperature has a small cold bias of up to 1.5 K in the high latitudes. Between 50°S and 50°N, this bias is weakened or even slightly reversed at some pressures, but there is a thin layer of warm bias of up to 1 K near the tropopause. The zonal temperature biases largely cancel out when averaged globally over the 30-day runs (Table 2).
Lastly, the total surface precipitation (emulated ZC microphysics + convective sources) has a slight positive bias of 0.03 mm/day, a 1% increase from the baseline simulation (Table 2). Figure 5a depicts the online zonal average surface precipitation from just the ZC microphysics component. The emulated ZC precipitation production is nearly identical to the baseline simulation owing to the high emulation skill of Δq and Δc, but produces 0.02 mm/day less global precipitation than the baseline ZC scheme. This bias must mostly be associated with state drift rather than offline emulator errors, because the piggybacked Fortran ZC scheme, which is applied to the online emulator state, diagnoses slightly less precipitation than the online emulator, especially in the Northern Hemisphere storm track. The Fortran convection parameterization also responds to the slight emulator-induced state changes by producing a global-mean convective precipitation increase of 0.05 mm/day.
The instantaneous precipitation-rate distribution based on all grid columns and sampling times (Figure 5b) corroborates this analysis. It shows that the emulator overproduces light precipitation (<0.1 mm/day) compared to the piggybacked Fortran scheme, but the two schemes agree well at most higher precipitation rates, and their small discrepancies do not explain the online emulator's differences from the baseline simulation. Instead, the largest precipitation rate bins (∼100 mm/day) suggest that the online emulator-driven simulation shifts toward fewer states that support heavy precipitation events compared to the baseline simulation.

1-Year Continuous Simulation
The monthly initialized runs show that the embedded ZC emulator is stable for at least 30 days during all calendar months of the year, with low biases. To further explore the long-term fidelity of emulator-based simulations, we present results from a continuous 1-year integration starting in July 2016. While not strictly necessary for online stability, we additionally mask the top 5 levels of precpd increments during runtime, consistent with the masking of the gscond increments. We found that adding the mask to the top 5 levels of the precpd scheme reduced the number and severity of transient tendency skill dropouts (Figure B1) for the 1-year simulation. We discuss the unresolved sensitivity of the emulator to the upper levels in Section 4.2. The online skill metrics for the 1-year continuous run are, reassuringly, almost identical to the average of the 30-day runs (Table 3). A time-latitude plot of total surface precipitation (Figure 6) compares the baseline and online emulation runs, demonstrating that the emulation retains the spatiotemporal character of the baseline precipitation (and precipitating clouds by proxy) throughout the seasonal cycle. A slight reduction in the largest precipitation events for the online emulation is apparent in the tropics; we already noted this issue for the month-long simulations in Figure 5b. Some global-annual-average biases (Table 4) are somewhat larger than in the 30-day runs: T (−0.3 K), RH (1.9%), and net TOA outgoing radiation (0.4 W/m²; the difference between a 2.1 W/m² outgoing longwave bias and a 1.6 W/m² reflected shortwave bias). Absolute cloud water and surface precipitation biases remain similar to those of the 30-day runs. Cloud water and RH have the largest relative biases from the baseline simulation, each at ∼4%.
The zonal average biases of T and RH from the 1-year emulator-based simulation are very small in the troposphere but become more significant in the polar stratosphere (Figure 7). In this region, large cold biases (as large as 8 K) are co-located with positive RH biases of up to 30%. The temperature bias appears within the first few months of the simulation and stabilizes for the rest of the simulation. We further investigated these biases and found that both the gscond and precpd emulators have deficiencies in the dry, cold polar stratosphere. Within a few hours after the start of the simulation, the gscond emulator produces too much condensate because it predicts condensation where the piggybacked Fortran microphysics diagnoses that there should mostly be evaporation at marginal relative humidities (40%-50%; Figure S4 in Supporting Information S1). We have confirmed that the gscond bias drift is unrelated to precpd or the classifier. We hypothesize that the tendency drift is likely related to a subtle online shift in some characteristics of the input distribution specific to this region.
The precpd emulator's shortcomings in the polar stratosphere are evident from offline diagnosis. Specifically, errors at the emulator's noise floor produce evaporation despite there being no falling precipitation in this region (Figure S5 in Supporting Information S1). This is a particular failing of the single-scaling loss normalization (Equation 1), whose optimization fails to minimize the large relative errors in the polar stratosphere. The errors produce a directional bias due to constraints imposed in the model architecture (Δ_p q > 0 and Δ_p T < 0) and a lack of enforced conservation. As they grow, these biases in the high-latitude stratosphere likely feed back with radiation and the atmospheric circulation before ultimately equilibrating.

Challenges and Choices
In this section, we highlight key decisions that led to a skillful, stable, and low-bias emulation, as well as some remaining challenges. From the outset, our goal was to use simpler ML models with the potential for general applicability in emulating atmospheric physics parameterizations. However, the path to the final emulator necessitated several problem-specific choices to successfully emulate the ZC microphysics scheme.

Key Decisions
One of the most influential decisions was to target subcomponents of the microphysics scheme, specifically grid-scale condensation (gscond) and precipitation (precpd). Initial attempts to encapsulate the total ZC scheme tendency increments in a single model yielded high offline skill, but the online integration often resulted in difficult-to-interpret failures that crashed the simulation. This is a common failure mode when training models outside of the environment in which they are deployed (e.g., Brenowitz & Bretherton, 2019). Separating the subcomponents simplifies the enforcement of physical priors through model architecture design or output postprocessing.
Following component separation, we observed substantial improvements in online emulation skill by incorporating physically informed architectures. For the gscond emulator, we enforce grid point locality (i.e., dependence only on the grid point-local thermodynamic state) by using a dense-local MLP that does not mix any vertical information. For the precpd emulator, we enforce the downward dependence (i.e., rain falls downward) using an RNN that recurses downward over a vertical column. Table 5 displays the offline and online skill for a single 30-day run initialized in July, comparing the performance of the informed architectures to a reasonable uninformed default for atmospheric model process parameterization: a dense MLP combining features over the entire grid column to predict the full column increments. While these dense-column models exhibit high skill offline (always >95%), they fail online when continuously integrating on the atmospheric state. Replacing the RNN used for precpd emulation with a dense-column architecture that does not enforce downward dependence reduces cloud and precipitation skill to nearly 0%, even when using the physically informed gscond architecture. Using dense-column models for both subroutines results in negative skill (i.e., worse than zero-increment predictions) for all variables except surface precipitation.
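The downward recursion that distinguishes the precpd architecture can be sketched as follows. This is an illustrative NumPy toy, not the paper's implementation: the cell form, layer sizes, and weights are invented. It shows how recursing from the model top forces each level's prediction to depend only on the column state above it, with the hidden state acting as a proxy for the precipitation flux entering a level.

```python
import numpy as np

def downward_rnn(column_inputs, W_h, W_x, W_out):
    """Recurse from the model top to the surface, so level k only sees levels above.

    column_inputs: (n_levels, n_features), ordered top -> bottom. The hidden
    state h carries information downward, mimicking falling precipitation.
    """
    h = np.zeros(W_h.shape[0])
    outputs = []
    for k in range(len(column_inputs)):   # index 0 = top of atmosphere
        h = np.tanh(W_h @ h + W_x @ column_inputs[k])
        outputs.append(W_out @ h)         # per-level increments, e.g. (dq, dT, dc)
    return np.stack(outputs)

# toy column: 8 levels, 4 features, 16 hidden units, 3 outputs per level
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
W_h = rng.normal(scale=0.1, size=(16, 16))
W_x = rng.normal(scale=0.1, size=(16, 4))
W_out = rng.normal(scale=0.1, size=(3, 16))
inc = downward_rnn(x, W_h, W_x, W_out)
print(inc.shape)  # (8, 3)
```

In contrast, a dense-column MLP would flatten all 8 levels into one input vector, allowing upper-level outputs to depend on the surface state, which violates the physical prior that precipitation only falls downward.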
The discrete-continuous nature of outputs from some atmospheric physics parameterizations (e.g., for microphysics) poses a unique challenge for emulation. Neural network regressors have difficulty producing exact zeros, since they are trained to a certain degree of precision and will produce noise below that threshold. This can complicate online integration, particularly for a microphysical scheme, where the local thermodynamic state may be quite sensitive to small changes in condensate or humidity, especially in very cold regions (e.g., Antarctica or the upper troposphere). For this reason, we introduced the activity classifier described in Section 2.2.1 into the gscond emulator. Figure 8 illustrates the need for such a classifier by comparing cloud distributions from simulations with and without a classifier to a baseline run. By day 15 after initialization, the condensate histogram shows that the emulation scheme without an activity classifier accumulates small values of cloud water (≤2 mg/kg) at many grid points. Including a classifier within the gscond emulator to constrain the microphysical activity resolves this issue. Based on the good performance of the 30-day online simulations and the non-locality of the precipitation scheme, we decided not to pursue an activity mask for the precpd emulator. However, the erroneous T and q precpd increments in the polar stratosphere contributing to biases in the 1-year run suggest a classifier might be helpful overall.
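The gating idea can be sketched in a few lines. The sigmoid head and 0.5 threshold here are illustrative assumptions, not the paper's exact configuration; the point is that a hard classifier gate restores the exact zeros that a regressor alone cannot produce.

```python
import numpy as np

def gated_increment(raw_increment, activity_logit, threshold=0.5):
    """Zero regressor outputs wherever the activity classifier says 'inactive'.

    A regressor emits small nonzero noise below its precision floor; the
    classifier gate replaces that noise with exact zeros.
    """
    prob = 1.0 / (1.0 + np.exp(-activity_logit))   # sigmoid probability of activity
    return np.where(prob > threshold, raw_increment, 0.0)

# two genuinely active points and two points of sub-precision regressor noise
raw = np.array([1.2e-4, 3.0e-9, -2.1e-9, -8.0e-5])
logits = np.array([4.0, -6.0, -5.0, 3.0])          # trained classifier outputs
print(gated_increment(raw, logits))                # noise entries become exact zeros
```

Without the gate, the ∼1e-9 noise entries would be added to the condensate field every time step and accumulate, which is the mechanism behind the spurious small cloud water values in Figure 8.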
The final choice important to the success of the ZC emulator involved optimizing the model to predict condensate increments that span many orders of magnitude. As described in Section 2.2.1, we used a temperature-dependent scaling in the gscond loss function, ensuring proportionate errors across a large range of local microphysical states. Model-level scaling is insufficient to handle this because a given model level may span a broad range of temperatures (e.g., the tropical boundary layer vs. the Antarctic plateau).
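A minimal sketch of such a temperature-conditioned loss is shown below. The exponential scale form, reference temperature, and rate constant are invented for illustration and are not the paper's Equation 1; the sketch only demonstrates how a per-sample, temperature-dependent scale makes the tiny increments of cold clouds contribute proportionate rather than negligible errors.

```python
import numpy as np

def temperature_scaled_mse(y_true, y_pred, temperature, t_ref=260.0, alpha=0.1):
    """MSE with a per-sample scale that shrinks as temperature drops.

    Cold, low-condensate clouds produce tiny absolute increments; dividing by a
    temperature-dependent scale keeps their relative errors in the loss.
    Constants here are illustrative assumptions.
    """
    scale = np.exp(alpha * (temperature - t_ref))   # smaller scale when colder
    return float(np.mean(((y_true - y_pred) / scale) ** 2))

err = np.array([1e-6])   # same absolute condensate error at two temperatures
cold = temperature_scaled_mse(err, 0 * err, np.array([230.0]))
warm = temperature_scaled_mse(err, 0 * err, np.array([290.0]))
print(cold > warm)       # True: the cold-cloud error is penalized more
```

A plain per-level normalization cannot achieve this, because a single model level mixes warm and cold columns, which is exactly the deficiency described above.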
In addition to the conditional scaling, we added select rescaled input values (RH, log-scaled q and c) to the emulator inputs. Removing the log-scaled inputs negatively impacts offline skill in polar and upper-level model regions (not shown). Including RH as an input increased skill and reduced condensate biases, particularly in the Antarctic region. For example, by day 5 of a July 1 initialized simulation, the emulator using RH as an input has an Antarctic average column-integrated condensate of 87 g/m² compared to a baseline value of 79 g/m². When not including RH, the average Antarctic column-integrated condensate value is 154 g/m² by July 5, roughly double the baseline value. Despite the overlap of the additional inputs with the raw state, we believe they help reduce errors in cold-cloud regions by allowing the emulator to discern vertical position, which is removed by per-level demeaning in the input normalization (Equation 1). We conducted an experiment to reintroduce the vertical information by adjusting the input normalization for air pressure to remove the column mean instead of the per-level mean from each level. This configuration also increased offline skill and largely removed the Antarctic condensate bias without the need for RH, but was generally more sensitive to skill dropouts when used online.
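The input augmentation can be sketched as a simple feature-builder. The feature order and the epsilon floor guarding log(0) are illustrative assumptions; the substance is that log-scaling compresses the many-orders-of-magnitude range of humidity and condensate into a range a network fits more easily.

```python
import numpy as np

def build_inputs(q, c, rh, eps=1e-12):
    """Stack the raw state with rescaled copies: RH plus log-scaled q and c.

    eps floors exact-zero condensate before the log; its value is an
    illustrative assumption, not the paper's choice.
    """
    return np.stack([q, c, rh,
                     np.log(np.clip(q, eps, None)),
                     np.log(np.clip(c, eps, None))], axis=-1)

# toy profile: specific humidity and condensate (kg/kg) spanning many decades
q = np.array([1e-2, 1e-4, 1e-6])
c = np.array([1e-4, 1e-6, 0.0])       # exact zero handled by the eps floor
rh = np.array([0.9, 0.6, 0.4])
X = build_inputs(q, c, rh)
print(X.shape)  # (3, 5)
```

Because q and c vary systematically with height, their log-scaled copies also act as a weak proxy for vertical position, consistent with the interpretation above.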

Remaining Challenges
In developing our emulation scheme, online simulations commonly presented unexpected challenges that needed to be addressed. Certain months, primarily October and November, tended to have lower online skill (∼85%-90% compared to ∼93%-96%) for clouds and precipitation compared to other months. The lower aggregate skill in these months was mainly due to significant precpd autoconversion misses ("skill dropouts") during convective events for a few low-latitude columns (see Appendix B for an example). These skill dropouts start in the mid-troposphere near the freezing level and quickly affect the entire upper troposphere. The emulator recovers in the affected grid columns within a few hours or, at worst, a few days.
To minimize such dropouts, we employed a strategy of training an ensemble of emulators initialized with varying random seeds (e.g., as in Clark et al., 2022) and then selecting the combinations of gscond and precpd emulators with the best online skill during the most problematic months of October and November. While this approach does not guarantee prevention of severe skill dropouts during other months or in a year-long simulation, it consistently produces stable, low-bias emulators with high skill.
We still do not have a foolproof approach for designing emulators without occasional skill dropouts. For instance, the emulator configuration used in the 1-year online simulation produces a substantial skill dropout in a 30-day simulation initialized at the start of December, leading to a December Δc skill of 54%, while the original monthly run configuration (masking only gscond predictions at the top 5 levels) has no issues (December Δc skill of 94%).
Altogether, this suggests the need for further refinement of the architectural design and training choices, such as whether recursion from the top model level is necessary, whether additional measures should be adopted to reduce sensitivity to the upper levels, or whether more training data are needed to handle the few convective events on the edges of the data distribution. The RNN itself is tasked with handling the details of precipitation generation processes for the full column, but perhaps a level-by-level precipitation modeling approach with explicit conservation, similar to gscond, would improve the overall model skill.
To handle the large dynamic range of condensate increments, we use temperature scaling in the gscond loss function. While this is generally very beneficial, especially in tandem with the gscond classifier, it does not prevent the emulator from occasionally creating spurious cloud in the uppermost model levels. These levels lie in the stratosphere, where temperature increases with height. Warmer temperatures lead to larger-amplitude condensate "noise," which the emulator later struggles to remove. Because there should never be any cloud in the topmost levels, we pragmatically resolved this by masking gscond increments in the top 5 model levels.
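The runtime masking described above amounts to a one-line postprocessing step. The assumption in this sketch, that the vertical axis is axis 0 with index 0 at the model top, is ours for illustration; n_mask=5 matches the paper.

```python
import numpy as np

def mask_top_levels(increment, n_mask=5):
    """Zero emulated increments in the uppermost model levels at runtime.

    Mirrors the top-5-level masking applied to gscond (and, in the 1-year run,
    precpd) increments. Assumes axis 0 is vertical with index 0 at the top.
    """
    out = increment.copy()
    out[:n_mask] = 0.0
    return out

inc = np.ones((63, 3))          # toy column increments: 63 levels, 3 outputs
masked = mask_top_levels(inc)
print(masked[:5].sum(), masked[5:].sum())  # 0.0 174.0
```

Because the mask is applied to the emulator's output rather than learned, it guarantees no spurious stratospheric cloud from gscond regardless of how the network behaves there.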
However, as seen in the 1-year simulation polar stratospheric biases, a few issues remain related to emulator deficiencies in the upper levels.It is possible that a conditional scaling on both temperature and pressure would be beneficial for this, but we leave that avenue for future research.
While the current manuscript focuses on the development and evaluation of a robust, accurate ZC emulator, we recognize that speed of execution is a paramount consideration for emulator adoption, especially in operational settings. The current code infrastructure was designed for flexibility and ease of testing new ideas, rather than for optimal speed. In its current unoptimized state, the model with online emulation runs approximately 30% slower (∼5.8 s/time step) than the original C48 simulation (∼4.8 s/time step), even when using available GPUs (4x Nvidia T4). Variable transfer between Python and Fortran adds around 7% to the run time. The remaining slowdown is likely related to choices in model architecture, such as shallow depth and sequential RNN steps, which lead to low GPU utilization (<10%). We note that at the coarse C48 resolution, the number of columns processed by the physics is relatively low. Increasing the resolution may lead to less relative slowdown due to increased GPU utilization. We believe that it will be possible to design ML emulators of more complex microphysical schemes that are more speed-competitive with the Fortran code they aim to replace.

Conclusions
We have successfully developed an emulator to replace a simple Fortran microphysics scheme (Zhao-Carr) in FV3GFS, which controls grid-scale condensation (gscond) and precipitation (precpd) processes. Our findings demonstrate that when used online as a replacement for the Fortran scheme, the emulator maintains high skill (≥94%) with low global-average bias (on the order of 1% or less) and remains stable for at least 1 year of continuous simulation. To our knowledge, this is the first successful emulation of a more complete bulk microphysics scheme, albeit one still much simpler than modern standards, and the first successful online emulation of a fast-timescale atmospheric parameterization central to global atmospheric forecasting.
A key contributor to the success of our emulator was tailoring its architecture to the underlying physical processes. By creating separate emulators for gscond and precpd, we enforce grid point locality and conservation for the condensation scheme, and we use an RNN to impose the downward dependence in the atmosphere associated with falling precipitation. This greatly improves the emulator's skill, especially when used online. Adding an activity classifier to the condensation emulator alleviated issues of excess condensate related to the discrete-continuous nature of the tendencies and field outputs. Using a temperature-scaled conditional loss function for the gscond emulator and providing rescaled inputs to all emulators helped maintain skill across the high dynamic range of condensate and humidity tendencies that must be accurately predicted to simulate cloud processes throughout the global atmosphere.
As with any ML-based emulation problem, achieving perfection is difficult, and the current scheme is no exception. In 1-year online integrations, biases develop in the polar stratospheric temperature and humidity fields. These regions challenge the ML training because they have distinctly different local environments than the rest of the atmosphere and comprise a small fraction of the emulator's training data. Further improvements could clearly be made, but they are beyond the scope of this paper, whose aim was to demonstrate the feasibility of a skillful ML microphysics emulator for online use. For instance, a natural possibility that we did not have time to implement would be to explicitly predict the precipitation flux at every model interface, which carries all the non-locality in the microphysics. The hidden state of the precpd RNN is a skillful but imprecise proxy for this design, causing potential biases and drifts because physical constraints are imperfectly respected (e.g., that the evaporation of precipitation in any model level cannot exceed the downward flux of precipitation into that level). A compounding difficulty, in the present work and for physics emulation generally, is the inability to train emulation schemes directly in the context of their deployment within an atmospheric model. Fortran tooling for ML applications is challenging compared to Python, but is still required for current atmospheric models. We utilize a Python package (call_py_fort) that provides an excellent solution for interactive prototyping, but it is not optimized for computational efficiency. Modeling frameworks on the horizon may simplify this process of ML integration and speed the development path to emulators that perform well online (Dahm et al., 2023; Schneider et al., 2017).
Our results stress the importance of evaluating the online performance of any proposed emulator, as it is straightforward to produce skillful offline models that do not perform well when integrated back into the model. It is also important to recognize that the development of emulators that perform well online is a challenging and time-consuming endeavor. If efficiency is the only goal, it may sometimes be more practical to invest in porting existing codes to run on GPUs, for example, as emulation requires significant human effort and problem-specific tuning.
Despite the challenges, our method and results are a proof of concept that machine learning techniques can effectively emulate fast physical processes central to the dynamics in weather and climate models. While our focus has been on a specific microphysics parameterization, we hope that the illustration of our problem-specific decisions will inform applications to similar or more complex physical schemes. With further research and development, emulation techniques can continue to contribute to improved skill and efficiency of weather and climate models.

Appendix A: Zhao-Carr Microphysics
This scheme handles both phase changes (condensation and evaporation) and precipitation processes. Tendencies due to the former are typically 10× larger in magnitude. The prognostic variables used by the scheme are the absolute temperature T, specific humidity q, and a total cloud condensate mixing ratio c.
The gscond scheme handles condensation and the evaporation of cloud. The cloud evaporation formula depends on f, the relative humidity; f_0, a critical relative humidity threshold which Zhao and Carr (1997) describe as "empirically set to 0.75 over land and 0.90 over ocean"; and q_s, the saturation specific humidity.
Condensation C_g, on the other hand, is given by a more complex formula involving a relative humidity tendency determined using the previous timestep's thermodynamic state (i.e., "after last gscond" quantities in Figure 1); see Equation 8 of Zhao and Carr (1997). Both formulas depend only on the thermodynamic state of a single (x, y, z) location, but there is some non-local dependence on the assumed phase of the cloud and the corresponding latent heating rate.
The precpd scheme handles the conversion of cloud into rain/snow and the evaporation of the latter as it falls through the atmosphere. Broadly speaking, it can be written as the set of production and evaporation relations shown in the equation block below. Most of the formulas are proportional to the rainfall P_r and snowfall P_s rates at a given level, though there are some rate constants that depend exponentially on temperature. p_t is the pressure at the top of the atmosphere.

Appendix B: precpd Emulator Skill Dropouts
Over the course of refining the emulation methodology, we observed large variability in the online skill scores of cloud and precipitation predictions, despite minimal or no changes in emulator training or runtime configuration. In this section, we discuss the primary source of that variance, which we refer to as skill dropouts. As an example, Figure B1 displays the surface precipitation skill over time for two 1-year simulations. When the top-5-layer increment mask is applied to both gscond and precpd rather than to gscond only, the severity of skill dropouts decreases markedly.
Upon closer examination of the skill dropouts, the precpd emulator appears to be the source of the issue. We focus on the dropout about 6 months into the gscond-only masking experiment to illustrate this point. In this case, a cluster of columns near the Maritime Continent is responsible for most of the skill reduction. By removing the five grid columns with the largest tendency errors, the overall snapshot skill goes from approximately 0% to over 70%. When examining the tendency profiles from the column with the largest errors (Figure B2), the gscond emulator largely matches the diagnostic Fortran, while the precpd emulator completely misses the autoconversion of condensate to precipitation in the middle and upper levels. Leading up to this time step, we have confirmed that gscond remains skillful, while precpd skill degrades (not shown). The gscond emulator retaining skill throughout this event suggests that a process outside of the ZC scheme, such as deep convection, adds condensate throughout the column. The precpd emulator then fails to precipitate the added condensate.
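The column-removal attribution above can be sketched as a small diagnostic. The R²-style metric and the drop-the-worst-columns procedure are illustrative assumptions rather than the paper's exact skill definition; the sketch shows how a handful of badly missed columns can drag an aggregate skill score near or below zero.

```python
import numpy as np

def skill_excluding_worst(y_true, y_pred, n_drop=0):
    """R^2-style skill over all columns after dropping the n_drop columns with
    the largest squared error. Columns are on axis 0, levels on axis 1.
    The metric details here are illustrative, not the paper's exact ones."""
    err = np.sum((y_true - y_pred) ** 2, axis=1)
    keep = np.argsort(err)[:len(err) - n_drop]     # indices of retained columns
    t, p = y_true[keep], y_pred[keep]
    return 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)

# 100 columns x 63 levels: accurate everywhere except 5 badly missed columns
rng = np.random.default_rng(1)
y = rng.normal(size=(100, 63))
p = y + 0.01 * rng.normal(size=y.shape)
p[:5] += 10.0                                  # simulated autoconversion misses
print(skill_excluding_worst(y, p))             # strongly negative overall skill
print(skill_excluding_worst(y, p, n_drop=5))   # recovers to near 1
```

This mirrors the observation that excluding the five worst columns raises snapshot skill from roughly 0% to over 70%: the aggregate metric is dominated by a spatially tiny set of failures.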
Overall, we hypothesize that the skill dropouts are associated with training data insufficiency related to intense convection and/or unconstrained sensitivities of the RNN to upper-level inputs. It is encouraging that despite the magnitude of the misses, the ZC emulators resolve the issue in a few days or less in all observed cases (e.g., see Figure S6 in Supporting Information S1). We also note that the dropouts tend to be confined to only a few grid columns, typically occurring in the tropics or subtropics. The isolated spatial extent of the skill dropout sources highlights the challenge in achieving consistently high skill in our chosen metrics throughout the simulations. It also demonstrates how quickly the skill can deteriorate if even a few predictions degrade.

Figure 1 .
Figure 1. Information flow of the Zhao-Carr microphysics within FV3GFS for a single time step. Inputs of a given scheme are represented as inward arrows. The "after last call to gscond" inputs are used to compute a relative humidity tendency that encompasses the rest of the model and precpd. This approach to computing the tendency effectively adds three new state variables to the model.

Figure 2 .
Figure 2. Latitude-pressure transects along longitude 100°W for a sample Zhao-Carr microphysics step on 8 July 2016 at 06Z showing: (a) the condensation rate from gscond, (b) the conversion rate of cloud to precipitation in precpd, and (c) the precipitation evaporation rate in precpd. Transect data have been interpolated from model levels to pressure levels for presentation.

Figure 3 .
Figure 3. A schematic of the ZC microphysics emulation architecture.

Journal of Advances in Modeling Earth Systems 10.1029/2023MS003851 PERKINS ET AL.

Figure 4 .
Figure 4. Latitude-pressure sections of the zonal and time average state from baseline Fortran simulations (left) and the online bias of the simulation using the emulator (right) for cloud water mixing ratio (a, b), relative humidity (c, d), and air temperature (e, f). Averages are over 12 30-day simulations initialized in each month of the calendar year, using values vertically interpolated from model levels.

Figure 6 .
Figure 6. Time-latitude plots of the instantaneous surface precipitation rate saved every 3 hr from the 1-year (a) baseline and (b) online emulation simulations.

Figure 7 .
Figure 7. Zonal mean bias of the 1-year online emulation simulation for (a) temperature and (b) relative humidity.

E_rr = E_r(T, f, P_r),  E_rs = E_r(T, f, P_s),
P = P(T, f, c, P_r, P_s),  P_sm = P_sm(T, f, c, P_r, P_s),
P_r = (1/g) ∫_{p_t}^{p} (P − E_rr) dp,  P_s = (1/g) ∫_{p_t}^{p} (P_sm − E_rs) dp.

Figure 8 .
Figure 8. Cloud water mixing ratio distributions compared between three configurations: online emulation with a gscond activity classifier (blue), online emulation without an activity classifier (orange), and a baseline simulation (gray). Samples are taken from 8 3-hourly snapshots across day 15 of a 30-day simulation initialized on July 1.

Figure B1 .
Figure B1. Surface precipitation skill over the 1-year online simulation for two increment masking configurations: (orange) gscond-only top-5-layer masking and (blue) gscond and precpd top-5-layer masking.

Figure B2 .
Figure B2. Vertical tendency profiles from the (a) gscond and (b) precpd schemes during the December 22nd 12 UTC skill dropout event in the gscond top-5-layer increment mask 1-year simulation. Each subcomponent panel shows the condensate tendencies predicted by the emulator (blue) and the diagnostic Fortran (orange dashed) for the selected column with the largest errors.

Table 1
Skill Metrics for the ZC Microphysics Emulator Outputs Compared to the Fortran Microphysics Outputs. Scores are shown for two configurations: offline (Fortran driving) and online (emulator driving). All table metrics are calculated for 12 30-day runs initialized at the start of each calendar month and then averaged together.

Table 2
Global Average Online Biases and Baseline Means

Figure 5. (a) Zonal average surface precipitation rate from ZC microphysics compared between the online emulator (blue), baseline Fortran (orange), and diagnostic Fortran microphysics (gray), which is generated diagnostically using inputs from the online emulation state. (b) Surface precipitation rate distribution compared between the same schemes. Shown quantities are calculated from 12 30-day simulations initialized at the beginning of each calendar month.

Table 3
Online Skill Scores for the 1-Year Simulation Compared Against 30-Day Simulations

Table 4
As in Table 2 but for the 1-Year Simulation

Table 5
Sensitivity of Emulation Skill to the Use of General Versus Prior-Informed Model Architectures. Note. "Dense-column" refers to a fully connected MLP with 2 hidden layers of 256 width and a linear readout layer. "Dense-local" and "RNN" refer to the architectures described in the methods section.