Machine Learning for Stochastic Parameterization: Generative Adversarial Networks in the Lorenz '96 Model

Stochastic parameterizations account for uncertainty in the representation of unresolved sub-grid processes by sampling from the distribution of possible sub-grid forcings. Some existing stochastic parameterizations utilize data-driven approaches to characterize uncertainty, but these approaches require significant structural assumptions that can limit their scalability. Machine learning models, including neural networks, are able to represent a wide range of distributions and build optimized mappings between a large number of inputs and sub-grid forcings. Recent research on machine learning parameterizations has focused only on deterministic parameterizations. In this study, we develop a stochastic parameterization using the generative adversarial network (GAN) machine learning framework. The GAN stochastic parameterization is trained and evaluated on output from the Lorenz '96 model, which is a common baseline model for evaluating both parameterization and data assimilation techniques. We evaluate different ways of characterizing the input noise for the model and perform model runs with the GAN parameterization at weather and climate timescales. Some of the GAN configurations perform better than a baseline bespoke parameterization at both timescales, and the networks closely reproduce the spatio-temporal correlations and regimes of the Lorenz '96 system. We also find that in general those models which produce skillful forecasts are also associated with the best climate simulations.


Introduction
A large source of weather and climate model uncertainty is the approximate representation of unresolved sub-grid processes through parameterization schemes. Traditional, deterministic parameterization schemes represent the mean or most likely sub-grid scale forcing for a given resolved-scale state. While model errors can be reduced to a certain degree through improvements to such parameterizations, they cannot be eliminated. Irreducible uncertainties result from a lack of scale separation between resolved and unresolved processes. Uncertainty in weather forecasts also arises because the chaotic nature of the atmosphere gives rise to sensitivity to uncertain initial conditions. Practically, uncertainty is represented in forecasts using ensembles of integrations of comprehensive weather and climate prediction models, first suggested by Leith (1975). To produce reliable probabilistic forecasts, the generation of the ensemble must include a representation of both model and initial condition uncertainty.
Initial condition uncertainty is addressed by perturbing the initial conditions of ensemble members, for example by selecting directions of optimal perturbation growth using singular vectors (Buizza & Palmer, 1995), or by characterizing initial condition uncertainty during the data assimilation cycle (Isaksen et al., 2010). One approach for representing irreducible model uncertainty is stochastic parameterization of unresolved physical processes. A stochastic parameterization represents the probability distribu-tion of possible sub-grid scale tendencies conditioned on the large scale. Each ensemble member experiences one possible, equally likely realization of the sub-grid-scale tendencies. A more detailed motivation for including stochastic parameterizations in weather and climate models is presented in Palmer (2012).
Stochastic approaches for numerical weather prediction (NWP) were originally proposed for use in the European Center for Medium-Range Weather Forecasts (ECMWF) ensemble prediction system (Palmer et al., 1997;Buizza et al., 1999). They were shown to substantially improve the quality of initialized ensemble forecasts, and so became widely adopted by meteorological services around the world, where they are used to produce operational ensemble weather, sub-seasonal and seasonal forecasts (Teixeira & Reynolds, 2010;Reyes et al., 2009;Berner et al., 2010;Stockdale et al., 2011;Palmer, 2012;Suselj et al., 2013;Sušelj et al., 2014;Weisheimer et al., 2014;Sanchez et al., 2016;Leutbecher et al., 2017).
Recent work has assessed the impact of stochastic parameterization schemes in both idealized and state-of-the-art climate models for long term integration (P. D. Williams, 2012;Ajayamohan et al., 2013;Juricke & Jung, 2014;Dawson & Palmer, 2015;Wang et al., 2016;Davini et al., 2017;Strømmen et al., 2018). These studies demonstrate that including a stochastic representation of model uncertainty can go beyond improving initialized forecast reliability, and can also lead to improvements in the model mean state (Palmer, 2001;Berner et al., 2012), climate variability (Ajayamohan et al., 2013;Dawson & Palmer, 2015;, and change a model's climate sensitivity (Seiffert & von Storch, 2010). These impacts occur through non-linear rectification, noise enhanced variability, and noiseinduced regime transitions . In this way, small-scale variability represented by stochastic physics can impact large spatio-temporal scales of climate variability.
Despite the historical disconnect between the weather and climate prediction communities, the boundaries between weather and climate prediction are somewhat artificial Hurrell et al., 2009;Shapiro et al., 2010). This disconnect is highlighted by recent advances in prediction on timescales from weather to sub-seasonal-to-seasonal and decadal by operational weather forecasting centers around the world (Moncrieff et al., 2007;Vitart & Robertson, 2012). Nonlinearities in the climate system lead to an upscale transfer of energy (and therefore error) from smaller to larger scales (Lorenz, 1969;Palmer, 2001;Tribbia & Baumhefner, 2004). At the same time, slowly evolving modes of variability can produce predictable signals on shorter timescales (Hoskins, 2013;Vannitsem & Lucarini, 2016). Under the 'seamless prediction' paradigm, the weather and climate communities should work together to develop Earth-system models (Brunet et al., 2010;H. M. Christensen & Berner, 2019), as developments made in one community are expected to benefit the other. The development and use of stochastic parameterizations is a good example of this paradigm at work.
Recent years have seen substantial interest in the development of stochastic parameterization schemes. Pragmatic approaches, such as the Stochastically Perturbed Parameterization Tendencies (SPPT) scheme (Buizza et al., 1999;Palmer, Buizza, et al., 2009) are widely used due to their ease of implementation and beneficial impacts on the model (Sanchez et al., 2016;Leutbecher et al., 2017;. Other schemes predict the statistics of model uncertainty using a theoretical understanding of the atmospheric processes involved, such as the statistics of convection (Craig & Cohen, 2006;Khouider et al., 2010;Sakradzija & Klocke, 2018;Bengtsson et al., 2019). A third approach is to make use of observations or highresolution simulations to characterize variability that is unresolved in a low-resolution forecast model (Shutts & Palmer, 2007). This last approach can be used to develop data-driven stochastic schemes (Dorrestijn et al., 2015;Bessac et al., 2019) or to con-strain tunable parameters in stochastic parameterizations (Shutts & Pallares, 2014;H. M. Christensen et al., 2015;H. M. Christensen, 2019). A drawback of these datadriven approaches is that assumptions are made about the structure of the stochastic parameterization (e.g. the physical process to focus on, or the distribution of the stochastic term conditioned on the resolved state) in order to make the analysis tractable using conventional methods.
Machine learning models offer an approach to parameterize complex nonlinear sub-grid processes in a potentially computationally efficient manner from data describing those processes. The family of machine learning models consist of mathematical models whose structure and parameters (often denoted weights) optimize the predictive performance of a priori unknown relationships between input ("predictor") and output ("predictand") variables. Commonly used machine learning model frameworks range in complexity from simple linear regression to decision trees and neural networks. More complex methods allow modelling of broader classes of predictorpredictand relationships. Machine learning models minimize overfitting through the use of regularization techniques that impose constraints on the model structure and weights. Regularization is critical for more complex machine learning models, so that they can converge to optimal and robust configurations in large parameter spaces. For the parameterization problem, regularization and physical constraints are critical for ensuring that the machine learning model predictions match the distribution of observed values. Machine learning for parameterizations has been considered since (Krasnopolsky et al., 2005) and recently multiple groups have begun developing new parameterizations for a variety of processes (Schneider et al., 2017;Gentine et al., 2018;Rasp et al., 2018;Bolton & Zanna, 2019). However, these schemes have focused exclusively on deterministic parameterization approaches, but the need for stochastic perturbations is being recognized (Brenowitz & Bretherton, 2019).
One active area in current machine learning research is generative modeling, which focuses on models that create synthetic representative samples from a distribution of arbitrary complexity without the need for a parametric representation of the distribution. Generative adversarial networks, or GANs (Goodfellow et al., 2014), are a class of generative models that consist of two neural networks in mutual competition. The generator network produces synthetic samples similar to the original training data from a latent vector, and the critic, or discriminator, network examines samples from the generator and the training data in order to determine if a sample is real or synthetic. The critic acts as an adaptable loss function for the generator by learning features of the training data and teaching those features to the generator through backpropagation. The original GAN formulation used a latent vector of random numbers as the only input to the generator, but subsequent work on conditional GANs (Mirza & Osindero, 2014) introduced the ability to incorporate a combination of fixed and random inputs into the generator, enabling sampling from conditional distributions. Because the stochastic parameterization problem can be framed as sampling from the distribution of sub-grid tendencies conditioned on the resolved state, conditional GANs have the potential to perform well on this task.
The purpose of this study is to evaluate how well GANs can parameterize the subgrid tendency component of an atmospheric model at weather and climate timescales.
A key question is whether a GAN can learn uncertainty quantification within the parameterization framework, removing the need to retrospectively develop separate stochastic representations of model uncertainty. While the ultimate goal is to test these ideas in a full general circulation model (GCM: left for a future work), as a proof of concept we will use the two-timescale model proposed in Lorenz (1996), henceforth the L96 system, as a testbed for assessing the use of GAN in atmospheric models. Simple chaotic dynamical systems such as L96 are useful for testing methods in atmospheric modeling due to their transparency and computational cheapness. The L96 system has been widely used as a testbed in studies including development of stochastic parameterization schemes (Wilks, 2005;Crommelin & Vanden-Eijnden, 2008;Kwasniok, 2012;Arnold et al., 2013), data assimilation methodology (Fertig et al., 2007;Law et al., 2016;Hatfield et al., 2018), as well as using ML approaches to learn improved deterministic parameterization schemes (Schneider et al., 2017;Dueben & Bauer, 2018;Watson, 2019).
The evaluation consists of four primary questions. First, given inputs from the "true" L96 model run, how closely does the GAN approximate the true distribution of sub-grid tendency? Second, when an ensemble of L96 models with stochastic GAN parameterizations are integrated forward to medium range weather prediction time scales, how quickly does the prediction error increase and how well does the ensemble spread capture the error growth? Third, when the L96 model with a stochastic GAN parameterization is integrated out to climate time scales, how well does the GAN simulation approximate the properties of the true climate? Fourth, how closely does the GAN represent both different regimes within the system and the probability of switching between them?
Details of the Lorenz '96 model and the GAN are presented in Section 2, and the results of the weather and climate analyses described above are presented in Section 3. Section 4 provides a discussion of GAN parameterization 'health risks', while an overall discussion of results is presented in Section 5. Conclusions follow in Section 6.

Lorenz '96 Model Configuration
The L96 system was designed as a 'toy model' of the extratropical atmosphere, with simplified representations of advective nonlinearities and multi-scale interactions (Lorenz, 1996). It consists of two scales of variables arranged around a latitude circle. The large scale, low-frequency X variables are coupled to a larger number of smallscale high-frequency Y variables, with a two-way interaction between the Xs and Y s. It is the interaction between variables of different scales that makes the L96 system ideal for evaluating new ideas in parameterization development. The L96 system has proven useful in assessing new techniques that were later developed for use in GCMs (Crommelin & Vanden-Eijnden, 2008;Dorrestijn et al., 2013).
The X and Y variables evolve following: where in the present study the number of X variables, K = 8 and the number of Y variables per X variable, J = 32. Further, we set the coupling constant, h = 1, the spatial-scale ratio, b = 10 and the temporal-scale ratio c = 10. The forcing term F = 20 is set large enough to ensure chaotic behavior. The chosen parameter settings, which were used in (Arnold et al., 2013), are such that one model time unit (MTU) is approximately equivalent to five atmospheric days, deduced by comparing errordoubling times in L96 and the atmosphere (Lorenz, 1996).
In this study the full Lorenz '96 equations are treated as the 'truth' which must be forecast or simulated. In the case of the atmosphere, the physical equations of motion of the system are known. However, due to limited computational resources, it is not possible to explicitly simulate the smallest scales, which are instead parameterized. Motivated by this requirement for weather and climate prediction, a forecast model for the L96 system is constructed by truncating the model equations, and parameterizing the impact of the small Y scales on the resolved X scales: where X * k (t) is the forecast estimate of X k (t) andÛ (X * k , t) is the parameterized subgrid tendency. The parameterizationÛ approximates the true sub-grid tendencies: which can be estimated from realizations of the "truth" time series as following Arnold et al. (2013). The time step dt f equals the time step used in the forecast model for consistency.
A long "truth" run of the L96 model is performed to generate both training data for the machine learning models and a test period for both weather and climate evaluations. The "truth" run is integrated for 20000 MTU using a fourth-order Runge-Kutta timestepping scheme and a time step dt = 0.001 MTU. Output from the first 2000 MTU are used for training, and the remaining 18000 MTU are used for testing. A burn-in period of 2 MTU is discarded. All parameterized forecast models of the L96 use a forecast timestep of dt f = 0.005 MTU and a second order Runge-Kutta timestepping scheme. This scheme is to represent the time discretization of the equations representing the resolved dynamics in an atmospheric forecasting model.

GAN Parameterizations
The GAN parameterization developed for the Lorenz '96 model in this study utilizes a conditional dense GAN to predict the sub-grid tendency at the current time step given information about the state at the previous time step. We will investigate two classes of predictors of U t : both X and U at the previous forecast timestep, and X alone. In the following discussion we focus on GANs based on the first of these predictor sets; the construction of those associated with the second set is analogous to that of the first. Note that in this section we move to a discrete-time notation, with the time index indicated by the subscript, t, where adjacent indices are separated by the forecast time step, dt f . The GAN generator accepts X t−1,k , U t−1,k , and a latent random vector Z t−1,k as input to estimateÛ t,k , or the predicted U at time t. The discriminator accepts X t−1,k , U t−1,k , and V t,k as inputs (where V t,k may be either U t,k if from the training data orÛ t,k if from the generator) and outputs the probability that V t,k comes from the training data. All inputs and outputs are re-scaled to have a mean of 0 and standard deviation of 1 based on the training data distributions. Note that we choose to develop local GANs, i.e. GANs which accept X and U for a given spatial index, k, and that predictÛ for that index, k, as opposed to GANs that accept vector X and U and thus include spatial information. This is to match the local nature of parameterization schemes in weather and climate models.
Each GAN we consider consists of the same neural network architecture with variations in the inputs and how noise is scaled and inserted into the network. A diagram of the architecture of the GAN networks is shown in Fig. 1. Both the generator and discriminator networks contain two hidden layers with 16 neurons in each layer. The weights of the hidden layers are regularized with a L2, or Ridge,  penalty (Hoerl & Kennard, 1970) with scale factor λ of 0.001. Scaled exponential linear unit (SELU) activation functions (Klambauer et al., 2017) follow each hidden layer. SELU is a variation of the common Rectified Linear Unit (ReLU) activation function with a scaled exponential transform for the negative values that helps ensure the output distribution retains a mean of 0 and standard deviation of 1. Larger numbers of neurons per hidden layer were evaluated and did not result in improved performance. Gaussian additive noise layers before each hidden layer and optionally the output layer inject noise into the latent representations of the input data. A batch normalization (Ioffe & Szegedy, 2015) output layer ensures that the output values retain a mean of 0 and and standard deviation of 1, which helps the generator converge to the true distribution faster.
The GAN training procedure iteratively updates the discriminator and generator networks until the networks reach an adversarial equilibrium in which the discriminator should not be able to distinguish "true" data from generator samples. The inputs, outputs, and connections between networks are shown in Fig. 1. A batch, or random subset of samples drawn without replacement from the training data, of truth run output is split in half and one subset is fed through the generator and then into the discriminator or is sent directly to the discriminator. The discriminator is then updated based on errors in predicting which samples have passed through the generator. Another batch of samples are drawn and sent through the generator and then to a connected discriminator with frozen weights. The gradient with respect to the generator is calculated based on the reversed labeling that the generated samples originate from the true training data. This reversal forces the discriminator to teach the generator features that would worsen the discriminator's own performance. Gaussian noise is injected into the neural network at each iteration through both the input and hidden layers. We consider hidden layer Gaussian noise scaled to standard deviations at different orders of magnitude in order to determine how the magnitude of the noise affects the forecast spread and the representation of the model climate. In the training process, neural network weights are updated based on subsamples of examples drawn randomly from the training data without regard for temporal dependence. The GAN is trained as a map between X t−1,k (and possibly U t−1,k ) and the sub-grid-scale tendency U t,k . In forecast mode, we test providing both white, or uncorrelated noise, and red, or correlated noise to the GAN. The red noise is generated using an AR(1) process with a correlation equal to the lag-1 autocorrelation of the deterministic residuals of the GAN. The color of the noise is not relevant during the training process: both white-and red-noise representations are trained in the same way. The noise values are kept constant through the integration of a single timestep. The difference between the white-and red-noise representations only manifests when they are incorporated as parameterizations in the full model (Eqn. 1).
The GANs are all trained with a consistent set of optimization parameters. The GANs are updated through stochastic gradient descent with a batch size (number of examples randomly drawn without replacement from the training data) of 1024 and a learning rate of 0.0001 with the Adam optimizer (Kingma & Ba, 2015). The GANs are trained for 30 epochs, or passes through the training data. The model weights are saved for analysis every epoch for the first 20 epochs and then every 2 epochs between epochs 20 and 30. The GANs are developed with the Keras v2.2 machine learning API coupled with Tensorflow v1.13.
The GAN configurations considered in this study are summarized in Table 1. A short name of the format 'predictors-noise magnitude-noise correlation' is introduced to simplify identification of different GANs. For example, 'XU-med-r' refers to the GAN that takes X and U as predictors, and uses medium (med) magnitude red (r) noise. While most GANs tested include the optional additive noise layer before the output layer, the sensitivity to this choice was also considered. GANs that do not include this final noise layer follow the naming convention above, but are indicated by an asterisk.

Polynomial Regression Parameterization
The GAN stochastic parameterization is evaluated against a cubic polynomial regression parameterization,Û t,k , similar to the model used in Arnold et al. (2013).
The parameters [a, b, c, d] are determined by a least squares fit to the (X, U ) data from the L96 "truth" training run. It is known that the simple cubic polynomial deterministic parameterization U d t,k is a poor forecast model for the L96 system (Wilks, 2005;Arnold et al., 2013;, as X does not uniquely determine U . The variability in the (X, U ) relationship is accounted for using a temporally correlated additive noise term: where z ∼ N (0, 1), the first order autoregressive parameters (φ, σ ) are estimated from the residual process r t = U t − U d t , and the k processes are independent for different X variables.
The polynomial parameterization has been specifically designed to represent the impact of the Y variables in this version of the L96 model, just as traditional parameterization schemes are designed to represent a given process in a GCM. Previous studies have demonstrated that the polynomial parameterization with additive noise performs very well (Wilks, 2005;Arnold et al., 2013;. This 'bespoke' parameterization is therefore a stringent benchmark against which to test GAN parameterizations.

Metrics
The accuracy of ensemble weather forecasts can be summarized by evaluating the root mean square error (RMSE) of the ensemble mean. The lower the RMSE, the more accurate the forecast. The RMSE at lead time τ is defined as: where N is the number of forecast-observation pairs, X o (t) is the observed state at time t, and X m (t init , τ ) is the ensemble mean forecast at lead time τ , initialized at t init , such that t = t init + τ .
If an ensemble forecast correctly accounts for all sources of uncertainty such that the forecast of the spread of the ensemble and measured probabilities of an event are statistically consistent, the forecast is said to be reliable (Wilks, 2011). In this study we assess the reliability of the ensemble using the spread-error relationship (Leutbecher & Palmer, 2008;Leutbecher, 2010). This states that, for an unbiased and reliable forecasting system, the root mean square error in the ensemble mean is related to the average ensemble variance: where M is the number of ensemble members, and the variance and mean error have been estimated by averaging over many forecast-verification pairs. For the large ensemble size used in this study, M = 40, we can consider the correction factor to be close to 1. A skilful probabilistic forecast will have as small an RMSE as possible, while also demonstrating a statistical match between RMSE and average ensemble spread.
The simplest definition of the 'climate' of the L96 system is the probability density function (PDF) of the individual X t,k values. The climatological skill can therefore be summarized by quantifying the difference between the true and forecast PDF. The Hellinger distance H, is calculated for each forecast model: where p(x) is the forecast PDF, and q(x) is the verification PDF (Pollard, 2002). The smaller H, the closer the forecast pdf is to the true pdf. We also considered the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951), motivated by information theory, but found it provided no additional information over the Hellinger distance, so results for the KL are not shown for brevity.
Evidence that the L96 model displays distinct dynamical regimes of behavior for the parameter set considered was presented in (H. , in which regime affiliations were determined using a metric based on the temporally local covariance. (H.  found that during the more common regime (regime frequency ∼ 80%), the eight X variables exhibit wave-2 like behavior around the ring, while in the rarer regime, the X variables exhibit wave-1 type behavior. Another approach to characterizing regime structure that makes use of both the instantaneous state and the recent past of the system is Hidden Markov Model (HMM) analysis (Rabiner, 1989;Franzke et al., 2008;Monahan et al., 2015). In an HMM analysis, it is assumed that underlying the observed state variables is an unobserved Markov chain taking discrete values. The HMM algorithm provides maximum likelihood estimates of the probability distributions of the state variables conditioned on the instantaneous hidden state values, the stochastic matrix Q of transition probabilities for each time step, and an optimal estimate of the hidden state sequence.

Deterministic GAN Offline Error
XU-lrg-w* XU-med-w* XU-sml-w* XU-tny-w* X-sml-w* X-tny-w* XU-lrg-w XU-med-w XU-sml-w XU-tny-w X-med-w X-sml-w X-tny-w Poly Figure 2. Offline assessment of GAN performance. Hellinger distances between the GAN sub-grid tendency distributions given input X and U values from the truth run, and the truth run sub-grid forcing distribution as a function of training epoch.

Offline assessment of GAN performance
The GAN parameterizations are first evaluated on how closely their output subgrid forcing distributions match those of the truth run when the GANs are supplied with input X and U values from the truth run. This is summarized by the Hellinger distance in Figure 2. Most of the GANs show a trend of decreasing Hellinger distance for the first few epochs followed by mostly stable oscillations. GANs with both X t−1,k and U t−1,k as input tend to perform better in the offline analysis than those with only X t−1,k . Larger input noise standard deviations seem to reduce the amount of fluctuation in the Hellinger distance between epochs, but there does not appear to be a consistent correlation with noise standard deviation and Hellinger distance. Note that the weights as fitted at the end of epoch 30 are used in the forecast networks, regardless of whether the GAN at this epoch shows the minimum Hellinger distance.

GAN simulation of sub-grid-scale tendency distribution
The joint distributions of X t−1 and U t from the different model runs reveal how the noise standard deviation affects the model climate (Fig. 3). Larger noise standard deviations increase the range of X values appearing in the run but do not appear to change the range of U values output by the GAN. The behavior of the XU-tny-r GAN devolved into oscillating between two extremes. The X-only GANs did the best job in capturing the shape of the truth distribution although they underestimated the variance at the extremes. While the XU-w* and X-w* series captured the bulk of the distribution well, there were spurious points outside the bounds of the truth distribution for all of these models. The Polynomial model captured the conditional The truth joint distribution is overlaid in red contours on each forecast model distribution.
mean and variance of the distribution well but smoothed over some of the subtle details of the joint distribution that some of the GANs were able to capture.

Weather Evaluation
The parameterized models for the Lorenz '96 system are evaluated in a weather forecast framework. An extensive set of re-forecast experiments were produced for 751 initial conditions selected from the attractor. An ensemble of 40 forecasts was produced from each initial condition (i.e. no initial condition perturbations are used). Different random seeds are used for each ensemble member to generate the stochastic perturbations used in the GAN or polynomial parameterizations. Figure 4 shows the RMSE and spread for all weather experiments at 1 MTU. A reduction in RMSE indicates an ensemble forecast that more closely tracks the observations. A good match between RMSE and ensemble spread indicates a reliable forecast. The best performing GANs in terms of RMSE are X-tny-r, X-tny-w, and Xtny-w*. All of these models performed slightly better than the polynomial regression, which was competitive with most GANs in terms of both RMSE and the ratio of RMSE to spread. The spread of the white noise GANs is generally underdispersive. Most of the red noise GANs, on the other hand, are somewhat overdispersive with X-med-r having the spread/error ratio closest to 1. Figure 5 shows the RMSE and ensemble spread for a subset of the ensemble forecasts of the X variables performed using the GAN parameterizations or the bespoke polynomial parameterization. Xsml-w* demonstrates both low RMSE and similar spread to RMSE throughout the forecast period. X-sml-r and X-sml-w feature similar RMSE through 1 MTU, but Xsml-r has smaller RMSE after that point and a better spread-to-error ratio throughout the period. The XU GANs have higher RMSEs than their X counterparts because the XU models may have overfit to the strong correlation between U t−1,k and U t,k in the training data. Inspection of the input weights revealed that XU GANs generally weigh U t−1,k more highly than X t−1,k . At maxima and minima in the waves, the XU models  may be biased toward extending the current growth forward, which can be a source of error in forecast runs.

Climate Evaluation
The GAN parameterizations are also tested on their ability to characterize the climate of the Lorenz '96 system. First, the ability to reproduce the pdf of the X variables is evaluated. Each forecast model and the full L96 system were used to produce a long simulation of length 10,000 MTU. Figure 6(a) shows kernel density estimates of the marginal pdfs of X t,k from the full L96 system and from a sample of parameterized models. The pdf of the true L96 system is markedly non-Gaussian, with enhanced density forming a 'hump' at around X = 8. Compared to the true distribution, the XU-sml-w and XU-sml-r models both perform poorly, producing very similar pdfs with too large a standard deviation and that are too symmetric. However, the other models shown skilfully reproduce the true pdf. Figure 6(b) shows the difference between each forecast pdf and the true pdf. Several GANs perform as well if not better than the benchmark bespoke polynomial parameterization.
Figure 4(b) shows the Hellinger distance H evaluated for each parameterized model. The filled circles indicate the value of the metric when the pdf is evaluated across all X variables in both parameterized and truth timeseries. The crosses give an indication of sampling variability, and indicate the metrics comparing pairs of X variables, i.e. X t,j and X t,k for j = k. Quantifying the parameterized model performance in this way allows for easy ranking of the different models. While the AR(1) stochastic polynomial parameterized forecast model is very skillful (Arnold et al., 2013), several GANs outperform the polynomial model.
In addition to correctly capturing the distribution of the X variables, it is desirable that a parameterized model will capture the spatio-temporal behavior of the system. This is assessed by considering the spatial and temporal correlations in both the true system and parameterized models. The diagnostic is shown for a subset of the tested paramererized models in Figure 7. It is evident that the parameterized models truth XU-sml-w X-sml-w XU-sml-r X-sml-r XU-sml-w* X-sml-w* poly that skillfully capture the pdf of X also skillfully represent the spatio-temporal characteristics of the system. The X-sml-w* scheme performs particularly well, improving over the stochastic polynomial approach.
Following the regime results presented in (H. , we use HMM analysis to classify into two clusters the instantaneous states in the 4-dimensional space spanned by the norms of the projections of X on wavenumbers one through four (denoted Z j , j = 1, ..., 4). Because of the spatial homogeneity of the X t,k statistics, these wavenumber projections correspond to the Empirical Orthogonal Functions.
Kernel density estimates of the joint pdfs of the projections of X on wavenumbers 1 and 2 are presented in Figure 8, along with estimates of the joint distributions conditioned on the HMM regime sequence. The conditional distributions have been scaled by the probability of each state so that the full joint pdf is the sum of the conditional pdfs. For reference, the unconditional joint pdf for truth is shown in gray contours in each panel. The stochastic matrix Q shows the probability of remaining in each regime (diagonal values) or transitioning from one regime to the other (offdiagonal values). As in Figures 6 and 7, only a subset of results are displayed.
The clear separation of the truth simulation into two distinct regimes is modeled well by the polynomial parameterization and most of the GAN parameterizations. With the exception of XU-sml-w and XU-sml-r, the regime spatial structures and stochastic matrices are captured well. The GANs are slightly more likely to transition between regimes than the truth run, while the polynomial run is slightly more likely to stay in the same regime. Consistent with the other climate performance results presented above, the joint distribution of |Z i | produced by XU-sml-w and XU-smlr are strongly biased, with no evidence of a meaningful separation into two distinct regimes of behavior. To quantify the forecast model skill at reproducing the L96 behavior in wavenumber space, Figure 9 shows H calculated over the marginal pdfs of |Z 1 | and |Z 2 | as upward and downward triangles respectively. Several of the GAN evaluated are competitive with, or improve upon, the polynomial parameterization scheme. The best performing GANs also performed the best in terms of other climate metrics (e.g. see Figure 4).
We note that in particular, models which accurately capture the regime behavior will also show good correlation statistics when averaged over a long time series. The regime analysis can help diagnose why a model shows poor correlation statistics. For example, X-sml-w* accurately captures the marginal distributions of the two regimes, but the frequency of occurrence of the red regime, dominated by wavenumber-1 behavior, is slightly too high (at 54% compared to 51% in the 'truth' run). This reduces the magnitude of the negative (positive) correlation at a lag of 0.75 (1.5) MTU observed in figure 11.

Wavelet Analysis
To further investigate why the white noise and red noise GANs differ in operational performance, a wavelet analysis is performed on time series of outputs from the climate runs. A discrete wavelet transform decomposes the time series into contributions from different periods. The total energy E for a given period is represented as where w is the wavelet magnitude at a given timestep t. The total power for each period from each model is shown in Fig. 10. All sml GANs except the XU-sml-r and XU-sml-w follow the truth power curve closely. follow The polynomial regression follows the truth closely although it tends to underestimate the power slightly for each period. The GANs peak in power at the same period when the temporal correlation in Fig. 7 crosses 0. The GANs with poor Hellinger distances also contained more energy for longer periods. A clearer evaluation of the wavelet differences can be found by calculating the mean absolute percentage difference from the truth run at different wavelengths. The absolute difference between the truth and parameterized runs increases with increasing wavelengths, so the percentage difference is employed to control for this trend: The MAPD scores with wavelength in Fig. 10 shows that none of the GANs consistently perform better at all periods, but some do provide closer matches to the truth spectrum for the longer periods. In the peak energy period, the different GANs have minimum error for slightly different periods before increasing in error again. The Xsml-r GAN uniquely shows decreasing MAPD with increasing period, while the white noise GANs generally have similar differences across the range of evaluated periods.

GAN Health Risks
The disparity between the offline verification statistics and those from the climate and weather runs highlights the challenges in training GANs for parameterization tasks. Neither the values of the generator loss or the offline evaluation of the GAN samples correlated with their performance in the forecast model integrations. The generator and discriminator networks optimize until they reach an equilibrium, but there is no guarantee that the equilibrium is stable or optimal. Some of the differences in the results may be due to particular GANs converging on a poor equilibrium state as opposed to other factors being tested. GANs and other machine learning parameterization models are trained under the assumption that the data are independent and identically distributed, but in practice are applied to spatially-and temporallycorrelated fields sequentially, potentially introducing nonlinear feedback effects. GANs are more complex to train than other machine learning methods because they require two neural networks and do not output an informative loss function. Larger magnitude additive noise appears to help prevent runaway feedbacks from model biases at the expense of increasing weather prediction errors. The inclusion of the batch normalization output layer appeared to assist both training and prediction by limiting the possible extremes reached during integration.

Discussion
In this study, we chose to focus on GANs for stochastic parameterization primarily because the framework offers a way to directly embed stochasticity into the model instead of adding it a posteriori to a deterministic parameterization. The use of the discriminator network as an adaptive loss function is also attractive because it reduces the need for developing a hand-crafted loss function and can be scaled to higher dimensional and more complex outputs.
Several of the GANs tested show a weather and climate skill that is competitive with a bespoke polynomial parameterization scheme. For climatological skill, several different metrics were considered including the distribution of the X variable, spatiotemporal correlation statistics, and regime behavior. We found that forecast models that performed well according to one metric also performed consistently well for all metrics. The good performance of the GAN is encouraging, demonstrating that GANs can indeed be used as explicit stochastic parameterizations of uncertain sub-grid processes directly from data. Furthermore, a small number of GANs improve upon the bespoke polynomial approach, indicating the potential for such machine-learned approaches to improve on our current hand-designed parameterization schemes, given suitable training data. While this has been proposed and demonstrated for deterministic parameterizations (Schneider et al., 2017;Rasp et al., 2018), this is a first demonstration for the case of an explicitly stochastic parameterization. Figure 4(a) and (b) indicates a relationship between forecast models that perform well on weather and climate timescales. To quantify this further, Figure 11 shows the correlation between weather forecast skill and climate performance for each model considered. Models which produce weather forecasts with a lower RMSE also show good statistics on climate timescales. In contrast, the reliability of weather forecasts, i.e., the statistical match between spread and error, is a poor predictor of climate performance. This is reflected by the competitive performance of the white noise GAN for producing a realistic climate, whereas on weather timescales, red noise increases the spread and can thereby substantially improve the forecast reliability (e.g. consider the X-med, X-sml and X-tny GAN for white and red noise respectively: Figure 4). This relationship between initialized and climatological performance has been discussed in the context of global models (K. D. Williams et al., 2013). It provides further evidence that parameterizations can first be tested in weather forecasts before being used in climate models, as promoted by the 'seamless prediction' framework (Brunet et al., 2010).

Comparison of
Wavelet analysis helped uncover differences in model performance across different time scales. While the other evaluation metrics focused on distributional or error metrics in the time domain, the wavelet power spectrum separated the time series into different periods and enabled comparisons of the energy embedded in different scales. In particular, the wavelet analysis revealed that some of the GANs added energy to the system either at long periods or across all periods in some cases.
The standard deviation of the noise does impact both the training of the GANs and the resulting weather and climate model runs. Using too large a standard deviation limits the ability of the GAN to discover the structure of the true distribution of the data. Standard deviations that are too small may result in either the generator or discriminator becoming overly good at their tasks during training, which results in the GAN equilibrium being broken. During simulations, noise standard deviations that are too small can result in the system becoming trapped within one regime and never escaping.
In addition to the GAN configurations evaluated in the paper, other GAN settings were tested and were found to have similar or worse performance. Given the relative simplicity of the Lorenz '96 system, adding neurons in each hidden layer did not improve performance. Using the SELU activation function generally resulted in faster equilibrium convergence than the ReLU. Varying the scaling factor for the L2 regularization on each hidden layer did affect model performance. Using a larger value greatly reduced the variance of the predictions, but using a smaller value resulted in peaks of the final distributions being too far apart, especially when both X t−1,k and U t−1,k were used as inputs. We also tested deriving U from a 1D convolutional GAN that reproduced the set of Y values associated with each X. That approach did produce somewhat realistic Y values but contained "checkerboard" artifacts from the convolutions and upscaling, especially at the edges. The sum of the Y values was also not equal to U derived from Eq. 5.
The L96 system is commonly used as a testbed for new ideas in parameterization, and ideas tested using the system can be readily developed further for use in higher complexity Earth system models. However, the L96 system has many fewer dimensions than an Earth system model and a relatively simple target distribution. The relative simplicity of the L96 system may have also led to the more complex GAN overfitting to the data compared with the simpler polynomial parameterization. For more complex, higher dimensional systems, the extra representational capacity of the GAN may provide more benefit than can be realized in L96. The computational simplicity of L96 also allows for the production of extremely large training data sets with little compute resources. Higher complexity Earth system model output can pro-vide training set coverage spatially but will be limited temporally by the amount of computational resources available.

Conclusions
In this study, we have developed an explicitly stochastic GAN framework for the parameterization of sub-grid processes in the Lorenz '96 dynamical system. After testing a wide range of GANs with different noise characteristics, we identified a subset of models that outperform a benchmark bespoke cubic polynomial parameterization. Returning to the questions posed in the Introduction, we found that this subset of GANs approximates well the joint distribution of the resolved state and the sub-gridscale tendency. This model subset also produces the most accurate weather forecasts (i.e. lowest RMSE in the ensemble mean). Some GANs produce reliable forecasts, in which the ensemble spread is a good indicator of the error in the ensemble mean. However, these are not necessarily the GANs that produce the most accurate forecasts. The subset of models with the most accurate weather forecasts produce the most accurate climate simulations, as characterized by probability distributions, space and time correlations, and regime dynamics. However, we note that the GANs which produce skilful weather and climate forecasts were different to those which performed well in "offline" mode.
Although the GANs required an iterative development process to maximize model performance and were very sensitive to the noise magnitude and other hyperparameter settings, they do show promise as an approach for stochastic parameterization of physical processes in more complex weather and climate models. Applying other recently developed GAN losses and regularizers (Kurach et al., 2019) could help reduce the chance for the GAN to experience a failure mode.
The experiments presented here demonstrate that GANs are a promising machine learning approach for developing stochastic parameterization in complex GCMs. Key lessons learned and unanswered questions include: • While including the tendency from the previous timestep provides a natural approach for building temporal dependence into the parameterization, it can lead to accumulation of error in the forecast, such that local-in-time parameterization should also be considered. • Autocorrelated noise is important for a skillful weather forecast, but appears less important for capturing the climatological distribution. • It is possible that spatial correlations are also important in a higher complexity Earth system model, which could not be assessed here due to the simplicity of the Lorenz '96 system. • It is possible that the noise characteristics could also be learned by the GAN framework to automate the tuning of the stochasticity.
Future work will use these lessons to develop machine-learned stochastic parameterization schemes for use in higher complexity Earth system models. GANs of a similar level of complexity to those used for L96 could emulate local effects, such as some warm rain formation processes. Other generative neural network frameworks, such as variational autoencoders, should also be investigated to determine if they can provide similar or better performance with a less sensitive training process. and by the National Science Foundation through NCAR's Cooperative Agreement AGS-1852977. HMC was funded by Natural Environment Research Council grant number NE/P018238/1. AHM acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC) and thanks SAMSI for hosting him in the autumn of 2017. The NCAR Cheyenne supercomputer was used to generate the model runs analyzed in this paper. Sue Ellen Haupt and Joseph Tribbia provided helpful reviews of the paper prior to submission. The source code for this paper can be viewed and downloaded at https://github.com/djgagne/lorenz gan/.