Generative data-driven approaches for stochastic subgrid parameterizations in an idealized ocean model

Subgrid parameterizations of mesoscale eddies continue to be in demand for climate simulations. These subgrid parameterizations can be powerfully designed using physics and/or data-driven methods, with uncertainty quantification. For example, Guillaumin and Zanna (2021) proposed a Machine Learning (ML) model that predicts subgrid forcing and its local uncertainty. The major assumption and potential drawback of this model is the statistical independence of stochastic residuals between grid points. Here, we aim to improve the simulation of stochastic forcing with generative models of ML, such as Generative adversarial network (GAN) and Variational autoencoder (VAE). Generative models learn the distribution of subgrid forcing conditioned on the resolved flow directly from data and they can produce new samples from this distribution. Generative models can potentially capture not only the spatial correlation but any statistically significant property of subgrid forcing. We test the proposed stochastic parameterizations offline and online in an idealized ocean model. We show that generative models are able to predict subgrid forcing and its uncertainty with spatially correlated stochastic forcing. Online simulations for a range of resolutions demonstrated that generative models are superior to the baseline ML model at the coarsest resolution.


Introduction
Mesoscale eddies, with a horizontal scale roughly equal to the Rossby deformation radius, play a crucial role in ocean circulation.Mesoscale eddies carry most of the kinetic energy in the ocean and account for a substantial part of the transport of momentum, heat, and salt (Vallis, 2017).The dynamics of mesoscale eddies involve a variety of complex physical processes: potential to kinetic energy conversion, upscale energy transfer, upgradient fluxes, sharpening of jet currents, along-isopycnal mixing and bolus advection.Primitive equations can potentially capture all these processes if all the relevant spatial scales of motion are directly resolved on the computational grid.However, direct simulation of mesoscale eddies remains computationally expensive, especially in high latitudes where the deformation radius decreases (Hewitt et al., 2020).
Modern global ocean models have an eddy-permitting resolution (around 1/4 o , Haarsma et al. (2016)), such that the largest mesoscale eddies are resolved but smaller ones are not; therefore the effect of these smaller unresolved (subgrid) mesoscale eddies is missing and needs to be parameterized.A range of grid resolutions where a phys-ical process is partially (but not fully) resolved is often referred to as the gray zone (Berner et al., 2017;Christensen & Zanna, 2022).Traditional methods to parameterize mesoscale eddies (Redi, 1982;Gent & Mcwilliams, 1990) were designed to describe their mean effect on the large-scale flow.These parameterizations are suitable for ocean models with a very coarse horizontal resolution, where there is an approximate scale separation between the grid step and the size of mesoscale eddies, but not for the gray zone.
The "Large eddy simulation" approach (LES, Fox-Kemper and Menemenlis (2008); Sagaut (2006)) is a technique to build a mesoscale eddy parameterization in the gray zone.The LES framework introduces a spatial filtering (and coarse-graining) operator which splits the flow into resolved and subgrid components.The filter mimics the effect of finite resolution and its width is proportional to the grid step of the coarse model.The effect of subgrid eddies on the resolved flow is referred to as a subgrid forcing and is diagnosed from the output of the high-resolution model by applying the spatial filter to the governing equations.A subgrid model or parameterization is a model which relates the subgrid forcing to the resolved flow.In recent years many new mesoscale eddy parameterizations were proposed to better capture the effects of mesoscale eddies in the gray zone using some heuristic (or empirical) physical arguments (Thuburn et al., 2014;Jansen & Held, 2014;Mana & Zanna, 2014;Zanna et al., 2017;Bachman et al., 2017;Pearson et al., 2017;Bachman et al., 2018;Jansen et al., 2019;Bachman, 2019;Grooms et al., 2015;Berloff, 2018;Juricke et al., 2020).
Machine Learning (ML) methods have recently gained traction as a new direction for developing subgrid eddy parameterizations in geophysics and turbulence (Rasp et al., 2018;Bolton & Zanna, 2019;Maulik et al., 2019;Beck et al., 2019;Yuval & O'Gorman, 2020;Guan, Chattopadhyay, et al., 2022;Beucler et al., 2021;Shamekh et al., 2022;Wang et al., 2022).ML parameterizations capture the effect of subgrid eddies on the resolved flow by training a model in a data-driven fashion.The most popular approach to train ML subgrid models is to minimize the mean squared error (MSE) between their output and a subgrid forcing obtained by reducing the resolution of a high-resolution model via filtering and coarse-graining (Bolton & Zanna, 2019).Such models typically have excellent offline performance: they are able to accurately predict the subgrid forcing.However, the ultimate goal of subgrid parameterizations is to improve online performance, once the parameterization is included into the coarse ocean model and the model is integrated for a long time.The coarse parameterized model should then reproduce the statistical properties of the coarse-grained highresolution model (Sagaut, 2006).Recent work has shown that the offline and online performance of subgrid parameterizations correlate poorly (Ross et al., 2023): models trained with the offline MSE loss may be unstable when applied online (Beck et al., 2019;Maulik et al., 2019) and physically-based parameterizations have very low offline MSE but perform reasonably well online (Ross et al., 2023).Several approaches have been proposed to improve ML parameterizations.Kochkov et al. (2021) and Frezat et al. (2022) proposed an online training procedure that improves numerical stability properties but requires a differential model and has a considerable computational cost.Guan, Chattopadhyay, et al. (2022) suggested gradually enlarging the training dataset until the rare events in subgrid forcing are well captured.In Guan, Subel, et al. (2022) the MSE loss function was modified with an additional constraint involving energy exchange.Frezat et al. (2021); Guan, Subel, et al. (2022); Pawar et al. (2022) proposed to account for physical invariances of subgrid forcing.
Conventional subgrid parameterizations are deterministic and predict a single subgrid forcing for a given input (Berner et al., 2017), which represents the mean or most likely prediction given the resolved flow.However, many possible states of the subgrid eddies are typically consistent with a given resolved flow, so there is inherent uncertainty in the subgrid fluxes (Gerard, 2007;Berner et al., 2017;Christensen & Zanna, 2022).Quantifying this uncertainty requires characterizing the distribution of the subgrid forcing, conditioned on the resolved variables.The stochastic ML model of Guillaumin and Zanna (2021) performs uncertainty quantification by estimating the pointwise conditional mean and conditional variance of the subgrid forcing, but does not take into account spatial correlations.
In this work, we propose to leverage two powerful uncertainty-quantification ML frameworks to data-driven subgrid parameterization of mesoscale eddies: variational autoencoder (VAEs, Kingma and Welling (2013)) and generative adversarial networks (GANs, Goodfellow et al. (2014)).These frameworks provide a data-driven characterization of the conditional distribution of the subgrid forcing given the resolved flow.The resulting ML models are generative, meaning that they allow us to sample from the conditional distribution, and can be therefore directly deployed as stochastic parameterizations.Our proposed ML models do not contain a-priori assumptions about the structure of the statistical model.These ML models can therefore potentially capture any statistically significant properties of the subgrid forcing such as the spatial correlation of stochastic residuals, dependence on the resolved flow, or probability distribution (Adler & Öktem, 2018;Gagne et al., 2020;Alcala & Timofeyev, 2021;B. T. Nadiga et al., 2022).In addition, generative models can be trained and tested using the same datasets, as MSE-based ML models.
We implement our generative models in an idealized ocean simulation and evaluate them both offline and online.Our offline analysis shows that the generative models provide a flow-dependent prediction of uncertainty.The resulting stochastic residuals are correlated in space and reproduce stochastic backscatter (Leslie & Quarini, 1979;Chasnov, 1991;Frederiksen & Davies, 1997) in the correct band of scales.Additionally, generative models accurately simulate large-scale kinetic energy backscatter (Thuburn et al., 2014;Jansen & Held, 2014) and properly energize the flow.Our online analysis shows that the generative models have better numerical stability and metrics than the baseline ML model in Guillaumin and Zanna (2021) at coarse resolutions.

Idealized ocean model and subgrid eddy forcing
In this section, we describe an idealized numerical ocean model based on quasigeostrophic (QG) equations of layered fluid written in Python (pyqg, Abernathey et al. (2022)), see Figure 1.The configuration of the QG model and the corresponding definition of subgrid forcing are similar to those in Ross et al. (2023).We use this model to perform offline and online evaluation of the proposed methodology to build subgrid parameterization for a range of resolutions.manuscript submitted to Journal of Advances in Modeling Earth Systems (JAMES)

Governing equations
We solve numerically the QG equations for potential vorticity (PV) anomalies relative to the mean flow given by a prescribed vertical shear that plays the role of external forcing driving turbulence.
The two-layer QG equations in Cartesian coordinates (x is zonal, y is meridional) are: where m is the index of the fluid layer (1 for the upper layer and 2 for the lower layer); q m is the potential vorticity (PV) which is conserved on Lagrangian trajectories in absence of forcing and dissipation; ψ m is the streamfunction, related to velocity as is the meridional gradient of potential vorticity due to differential rotation (in β-plane approximation) and prescribed mean flow; r ek is the bottom drag coefficient; δ m,2 is a Kroneker delta which indicates that drag is applied only to the lower layer; f 0 is the reference Coriolis frequency; g is the reduced gravity and H m is the fluid layer thickness, H = H 1 + H 2 is the total depth; ∇ = (∂ x , ∂ y ) is a horizontal Nabla operator, where ∂ x , ∂ y are partial derivatives w.r.t.
x, y.The numerical schemes and how the small-scale dissipation (ssd) is applied to the governing equations are described in Appendix A. The kinetic and total energy per unit mass are respectively given by: where • is 2D spatial averaging.The QG system, described by Eq. ( 1) and ( 2) is initially perturbed from rest with random noise in the upper PV field, with a subsequent evolution over the next 2-5 years exhibiting a transition to turbulence.The initial random perturbations are limited to the range of scales of the coarsest model, and it allows to simulate similar energy growth in the transition from laminar to turbulent regimes at different grid resolutions, see Figure 1(a).Model parameters are given in Table 1 and correspond to the "eddy" configuration in Ross et al. (2023).
Mesoscale eddies emerge on a spatial scale determined by the deformation radius r d = g f 2 0 H1H2 H (Salmon, 1980;Vallis, 2017), denoted by the arrow in Figure 1.We choose the resolution of the reference simulation (256 2 ) in order to accurately reproduce the spectral energy transfer.Coarse-resolution models do not resolve the deformation radius properly and fail to reproduce various statistical characteristics (Hallberg, 2013;Hewitt et al., 2020), including kinetic energy (KE), spectrum of KE and energy transfer.In this work, we aim to improve the simulation of turbulence in coarse models by incorporating a subgrid parameterization model, which compensates for the missing physics.

Filtered equations
In this section, we derive the governing equations for the coarse model which follows the trajectory of the filtered and coarsegrained high-resolution simulation.These equations contain a new term that describes the interaction with unresolved eddies, the term that is not available at the coarse resolution and needs to be parameterized.We follow the Large eddy simulation (LES, Sagaut (2006)) approach to split the prognostic variables (φ) into resolved (φ) and subgrid (φ ) components by applying a spatial convolutional filter with kernel G(y) such that φ = φ + φ , (5) We use two spectral filters from Ross et al. (2023): one filter is a combination of a cutoff and a model filter ("Sharp"), the other is a combination of a cut-off and a Gaussian filter (we denote it as "Gaussian").Precise definitions are provided in Appendix B.
Applying the filter G(y) to the governing equations (1), (2), we obtain a set of governing equations for the filtered solution: S is the additional subgrid forcing produced by the unresolved eddies on the resolved scales, which needs to be parameterized.We will omit the index m for the subgrid forcing and related variables to simplify notation.The dissipation term ssd on a coarse grid in Eq. ( 7) is added a-posteriori to ensure the numerical stability of the simulations.In deriving Eq. ( 7), we used commutativity between derivatives and spatial filtering, which holds for spectral numerical schemes and spectral filters (Ghosal, 1996).Both subgrid forcing and numerical advection scheme are formulated in flux form, so we include numerical approximation errors into the definition of subgrid forcing (Ghosal, 1996;Chow & Moin, 2003;Gullbrand & Chow, 2003).Vertical lines show grid cut-off for coarse mesh κmax = π/∆x.Right: Snapshots of S at three different resolutions for the upper layer.

Subgrid forcing dataset
The solution to the governing equations (1), (2) for the high-resolution model is denoted by q.The filtered quantities (q) are defined on a coarse mesh.
The dataset to train ML subgrid parameterization models is obtained as follows.We integrate the governing equations in time for 10 years at high resolution (256 2 ) with time step 1 hour and save snapshots every 1000 hours, for a total of 86 snapshots.
The training dataset consists of 250 runs, each corresponding to a different random initial condition, for a total of 21500 snapshots.The validation and testing datasets consist of 25 runs each.For each coarse resolution (48 2 , 64 2 , 96 2 ), we compute a filtered solution represented on a coarse mesh (q, u) and subgrid forcing (Eq.( 9)) using Sharp or Gaussian filters.The spectral content of the resulting subgrid forcing greatly depends on the scale selectivity of the filter, see Figure 2.

Data-driven stochastic subgrid models
In this section, we introduce a probabilistic approach for the prediction of subgrid forcing, which can be used to build data-driven stochastic parameterizations.
Conventional subgrid parameterizations establish a functional relationship between the subgrid forcing (S, Eq. ( 9)) and the resolved flow (q) in the form of S ≈ S(q).Such parameterizations are typically deterministic; they produce a single prediction for a given input.However, there is inherent uncertainty in the prediction of subgrid forcing, because many possible states of the subgrid eddies are consistent with a given resolved flow.Therefore, we propose to instead generate a probabilistic prediction, by attempting to sample from the conditional distribution of the subgrid forcing given the coarse-grained flow (S ∼ ρ(S|q)).In order to generate a probabilistic prediction of the subgrid forcing, we propose to apply a generative ML framework, where samples from a desired distribution are obtained by transforming white noise using a mapping learned directly from data (Kingma & Welling, 2013;Goodfellow et al., 2014).We design and compare three different approaches, depicted in Figure 3, to learn this transformation: (a) A model based on Guillaumin and Zanna (2021), which predicts the pointwise mean and pointwise standard deviation of the conditional distribution of the subgrid forcing.(b) A generative adversarial network (GAN), consisting of a generator that generates subgrid-forcing samples by trying to fool a discriminator, trained to distinguish between these samples and the true high-resolution data.(c) A variational autoencoder (VAE) consisting of an encoder, which maps the input signal to a latent space, and a decoder, which decodes the latent variables to produce subgrid-forcing samples.The remainder of this section provides a more detailed description of each approach.

Guillaumin and Zanna model (GZ)
Guillaumin and Zanna (2021) presented a probabilistic ML parameterization, where the mean and variance of the subgrid forcing are estimated at every grid point using a neural network.The original formulation in (Guillaumin & Zanna, 2021) minimizes an i.i.d.Gaussian likelihood cost function to optimize the parameters of the network.Here, we propose an alternative training procedure, which we have found to be more efficient.Following the approach of Adler and Öktem (2018), we estimate the pointwise means and variances sequentially.
First, we estimate the conditional mean at each grid point by minimizing the MSE loss function: where S mean θ (q) is the output of a neural network with parameters denoted by θ, which receives q as an input.S, S mean θ , q ∈ R 2×n×n are tensors representing two layers of fluid, each layer having n × n points.The norm in the cost function is the 2 norm of the vectorized tensor, which for the vector of length D is The loss function is minimized over a training set consisting of samples of the resolved flow q and the corresponding high-resolution forcing S obtained as described in Section 2.3.Minimization of L MSE yields an optimal set of parameters θ * and a corresponding model which we denote as S mean (q).Second, we estimate the conditional variance at each grid point, based on the residual of the conditional-mean estimate r = S − S mean (q).To this end, we minimize the cost function where S var φ (q) is the output of a neural network with parameters denoted by φ, which receives q as an input.The final layer of the network is a softplus activation function ln(1 + e x ) to ensure that the variance estimates are nonnegative.The loss function is minimized over the training set fixing the residual r.The resulting model is denoted by S var (q).Additional training information is given in Appendix C.
The conditional-mean and conditional-variance models are used to implement a stochastic parameterization with white noise, as follows (see Figure 3(a)): where z ∈ R 2×n×n is sampled from a standard normal distribution.

Generative adversarial network model (GAN)
We propose to leverage the framework of generative adversarial networks (GANs) to build a probabilistic model (Goodfellow et al., 2014), which generates samples from the distribution of possible subgrid forcings (S) at a given resolved flow (q) denoted as ρ(S|q), where both variables are considered as 3D fields, S, q ∈ R 2×n×n .The mentioned distribution is defined implicitly by the dataset of pairs of S and q.The GAN framework consists of two networks, generator and discriminator, playing an adversarial game: the generator attempts to fool the discriminator, which is trained to discriminate between the output of the generator and actual data sampled from a desired distribution.
Sampling from the conditional distribution is possible with the conditional GAN model (cGAN, Mirza and Osindero (2014)), which informs both networks with the conditional variable.Specifically, the generator transforms the latent noise variable z and PV field to the subgrid forcing, see Figure 3(b): The discriminator returns a score given a pair of subgrid forcing and PV field denoted as D(S, q).There are many options to define the adversarial loss function (Lucic et al., 2018).We leverage a popular approach of Wasserstein GAN (WGAN, Arjovsky et al. (2017)) with the following optimization problem: where E is the mathematical expectation over the training samples.The discriminator D is optimized to estimate the Wasserstein-1 distance between the distributions ρ(S|q) and ρ( S|q), while the generator learns the true distribution by minimizing this distance.
Solving the optimization problem ( 14) may lead to the mode collapse phenomenon when the generator ignores the latent variable z: for every coarse field q the model may produce a single fixed subgrid forcing (Isola et al., 2017;Ohayon et al., 2021;Mao et al., 2019;Yang et al., 2019).To overcome mode collapse, we apply a technique proposed by Adler and Öktem (2018): feeding multiple generator outputs S to the discriminator for a given input q.Identical outputs are readily detected and penalized by the discriminator.We provide additional details in Appendix C.
Once trained, the GAN generator (13) can be used as a stochastic parameterization by sampling the latent variable z ∈ R 2×n×n from a standard normal distribution.

Variational autoencoder model (VAE)
As an alternative to the GAN framework, we propose leveraging the variational autoencoder (VAE, Kingma and Welling (2013)) to sample from the conditional distribution ρ(S|q).The VAE framework consists of two networks: the encoder and the decoder.The encoder produces a latent representation and the decoder reconstructs the subgrid forcing from this representation.A regularization term constrains the latent vector to be close to a simple distribution, chosen a priori.
The conditional VAE (cVAE) is obtained by feeding a conditional variable to the encoder and decoder (Sohn et al., 2015;Doersch, 2016;Zhang et al., 2016;Mishra et al., 2018;Pagnoni et al., 2018): the decoder maps the latent noise and conditional variable q to the subgrid forcing, see Figure 3(c): where free parameters are denoted as θ and we emphasize that the mapping is probabilistic.The probabilistic encoder with free parameters φ is denoted as z ∼ q φ (z|S, q).The encoder and decoder are trained jointly to maximize the lower bound of the likelihood of observing the training sample (also known as evidence lower bound, ELBO): where D KL (p(x), q(x)) = E p(x) ln p(x) q(x) is Kullback-Leibler divergence, a measure of the difference between two distributions.The reconstruction term encourages the encoder to seek an accurate latent representation of the subgrid forcing and encourages the decoder to assign a high probability to the training samples.The regularization term constrains the encoder to be close to the prior distribution ρ(z).We parameterize all probabilities with Gaussian distributions and replace the mathematical expectation with one sample from the encoder (reparameterization trick, Kingma and Welling (2013)).The resulting loss function is equivalent to a regularized MSE as explained in more detail in Appendix C.
The mean channel of the Gaussian decoder (15) can be used as a stochastic parameterization by sampling the latent variable z ∈ R 2×n×n from a standard normal distribution.

Offline analysis of stochastic subgrid models
In this section, we perform an offline evaluation of the stochastic subgrid models described in Section 3 using the dataset described in Section 2.3.We show spatial maps and spectra of the predicted subgrid forcing.We propose metrics for the evaluation of the predicted subgrid forcing and stochastic residuals and compare them for a range of resolutions.
For every combination of filter (Sharp and Gaussian) and resolution of coarse mesh (48 2 , 64 2 , 96 2 ), we train three machine learning models: GZ, GAN and VAE, for  a total of 18 different models of subgrid forcing.The baseline deterministic subgrid model is trained with the MSE loss function and it is referred to as "MSE" (we simply take the mean channel of GZ model).Following Kochkov et al. (2021), every model was trained five times with different random seeds.Three training instances failed and were excluded from the subsequent analysis: 2 realizations of VAE models at resolution 64 2 experienced the posterior collapse problem (zero spread, Dai et al. ( 2020)), and one realization of GZ model at resolution 48 2 had a large generalization error.In this section, we show results for Sharp filter, and similar plots for Gaussian filter are shown in Appendix D.

Analysis of stochastic predictions
In this section, we compare stochastic predictions of subgrid forcing to the true subgrid forcing.We suggest to split the stochastic prediction into the deterministic part and the stochastic residual.We define the deterministic part as a mean prediction of subgrid forcing at a fixed resolved field q -it is conditional mean denoted as E( S|q).The deterministic part of the GZ model is given by the mean channel S mean (q), and for GAN and VAE models we estimate it by sampling 1000 realizations of the latent vector (Adler & Öktem, 2018).The predicted stochastic residuals ( r = S − E( S|q)) should be compared to the true residuals (r = S − E( S|q)), and for an accurate stochastic model they should be statistically similar, see Wilks ( 2005 In Figure 4 we show predictions of the stochastic subgrid models.The deterministic part (E( S|q)) is similar for three stochastic models.The predicted stochastic residual (rightmost column) for GZ model looks like uncorrelated spatial white noise in contrast to the true residual.The predicted residuals for the other two models (GAN and VAE) are more visually similar to the true one.The pointwise standard deviation is a measure of the local uncertainty in the deterministic prediction and can be related to the second moment of residuals as: It is directly accessible for the GZ model, and for GAN and VAE models it can be estimated similarly to the conditional mean.The standard deviation fields have similar spatial structure for all three stochastic models.
We use the spatial power spectrum to analyze the spatial correlation.In Figure 5(a) we show the power spectrum of stochastic residuals.The true residuals are concentrated near the grid cut-off (Nyquist frequency, κ max = π/∆x) and near the spatial frequency of ssd filter (κ = 0.65π/∆x).The GZ model does not reproduce the twohill shape of the power spectrum of residuals.The GAN model accurately reproduces the power spectrum of residuals and improves the power spectrum of subgrid forcing (Figure 5(b)) compared to the deterministic and stochastic baselines (MSE, GZ).Note that accurate prediction of the power spectrum of subgrid forcing is a challenging task for deterministic models (Guan, Subel, et al., 2022) because optimization of the mean squared error leads to the loss of details in small scales (Isola et al., 2017).The VAE model predicts the correct shape of the spectrum of residuals, but the total variance of residuals is underestimated.The power spectrum of subgrid forcing for the VAE model is also lower compared to the other models.We explain it by the well-known issue of VAE architecture to predict oversmoothed images (Takida et al., 2022).
An important property of subgrid forcing in QG turbulence is an ability to energize turbulence on a coarse grid, i.e. kinetic energy backscatter (Jansen & Held, 2014).There are two popular approaches to simulate backscatter: stochastic residuals near the grid scale (Leslie & Quarini, 1979;Chasnov, 1991;Schumann, 1995;Frederiksen & Davies, 1997;Grooms et al., 2015) and mean energy injection in large scales (Kraichnan, 1976;Frederiksen et al., 2003;Graham & Ringler, 2013;Thuburn et al., 2014;Jansen & Held, 2014;Juricke et al., 2020).These two types of backscatter result from physical processes of a very different nature: stochastic backscatter simulates the loss of information about unresolved degrees of freedom but energy injection in large scales compensates for the unresolved inverse energy cascade.All the stochastic models are accurate in predicting large-scale energy injection (Figure 5(c)), and GAN model is the best in predicting stochastic residuals near the grid scale.
Marginal PDF of subgrid forcing is often used to evaluate subgrid models (Pawar et al., 2020;Maulik & San, 2017).Both GZ and GAN models improve this PDF in the high-probability region and in the tails compared to the baseline MSE model, see Figure 5(d).The VAE model is similar to the baseline MSE in this characteristic.

Quantitative offline analysis and metrics
Above we presented a qualitative analysis of the stochastic subgrid models, and here we propose metrics for their quantitative evaluation.We consider three classes of metrics, which demonstrate: the quality of the subgrid forcing, its deterministic part and stochastic residuals, see Table 2.We include spectral metrics for the subgrid forcing and residuals (L S and L r ) in order to evaluate to what extent the models capture the corresponding spatial structure.
In Figure 6 we report the evaluation of the offline metrics for the different models.
The upper row provides metrics on the test dataset with the same turbulence regime as the training set.We observe that the generative models (GAN and VAE) have slightly greater deterministic error (L rmse ) compared to the model optimizing this metric directly (GZ and MSE).The GAN and GZ models correctly predict spread of  In the lower row of Figure 6, we evaluate the generalization ability of the models by computing the offline metrics on dataset corresponding to a different turbulence regime, where flow is dominated by meandering jets, and which is therefore systematically different from the training data (see Ross et al. (2023) for description).GZ model considerably overestimates the spread of the residuals (2 < σ 2 spread < 10), and it deteriorates the spectral metrics (L S and L r ).Although the VAE model had various suboptimal metrics on the eddy turbulence configuration, it demonstrates the best generalization capabilities to the jet configuration: VAE model has reasonable spread σ 2 spread ≈ 0.8, and outperforms other models in the error of the deterministic prediction L rmse , the quality of the subgrid forcing L S and residuals L r .The GAN model generalizes better than GZ for most of the metrics, without reaching the performance of the VAE model.We observe similar results for the Gaussian filter, see Appendix D. During the first few years of simulation, QG model undergoes a transition from a laminar to a turbulent flow regime.Generalization to the transitional regime is a difficult test for subgrid models (Frezat et al., 2021) because the subgrid forcing is a few orders of magnitude smaller compared to the developed turbulence regime.Although we include the transitional regime in the training set, the relative importance of these samples is small due to their small norm.As a result, all subgrid models have large errors compared to the norm of the subgrid forcing during the first few years of simulation (t < 2 years), see Figure 7.The generative models (GAN and VAE) demonstrate the best performance in the transitional regime: error is one order of magnitude smaller compared to GZ.In the next section, we show that generative models are also superior to the baseline in the online simulation of transitional flow.

Online simulations with subgrid models
In the previous section, we demonstrated the encouraging ability of generative models GAN and VAE to simulate various statistical characteristics of subgrid forcing.In this section, we evaluate the performance of trained subgrid models in online simulations.In more detail, we use the output of the subgrid model S to replace the true subgrid forcing S in the governing equation for the coarsegrained dynamics (7), and perform numerical time integration.Our goal is to study how the subgrid parameterizations impact the dynamics of mesoscale eddies in a statistical equilibrium regime.
Our online experiments are summarized in Table 1.Compared to the generation of the training data, we run experiments for twice as long (20 years).Before passing the subgrid forcing prediction into governing equation we subtract the spatial mean in each fluid layer to ensure the conservation of PV.Recall that we train 5 different models (differing only in the initialization of the weights) for every combination of resolution, filter and type of subgrid model.Each of these models is evaluated in an ensemble of 10 online runs, with different random initial conditions.The total number of runs is approximately 1200.The statistical characteristics of the turbulence are averaged over the 10 ensemble members (and the last 15 years if applicable).We provide the confidence bounds for every averaged statistic defined by the minimum, maximum and median values over 5 realizations of the training algorithm.
Among mentioned experiments, there were a few unstable simulations at resolution 96 2 : one run of the VAE model for Sharp filter, and 3 GAN models out of 5 training realizations for Gaussian filter.We exclude mentioned experiments from the analysis.In these experiments, an eddy emerges which is constantly amplified by the parameterization, and in a spectral space it corresponds to the overestimated energy injection on the largest scale.This effect is possible because we do not control the amplitude of the parameterization as it is usually done in energetically-consistent physical parameterizations of backscatter (Jansen & Held, 2014).
Following Ross et al. (2023), we consider an error in PDFs of the turbulence fields.Define the Wasserstein distance between distributions as W 1 (F 1 , F 2 ) = |F 1 (ξ) − F 2 (ξ)|dξ, where F 1 and F 2 are cumulative distribution functions (CDF) of some variable ξ.In computing CDF, we aggregate spatial directions, 15 years of simulation, and 10 ensemble members.We consider 5 variables in place of ξ: potential vorticity (q m ), velocity (u m and v m ), kinetic energy ( 12 |u m | 2 ) and relative enstrophy ( 12 |curl(u m )| 2 ), and each fluid layer is accounted independently.The online distributional metric between the coarse-grid model (F model ) and the filtered and coarsegrained high-resolution simulation (F hires ) is given by the average of normalized errors: where and the normalization constant is the square root of the uncentered second moment.
An additional metric based on spectral characteristics is reported in Appendix E.

Sensitivity to the correlation time of latent variable
In order to leverage the proposed subgrid-forcing models in a stochastic parameterization, we sample the latent variable z independently at every time step (discrete white noise) similar to Zanna et al. (2017) and Guillaumin and Zanna (2021).
Following (Gagne et al., 2020), we also tested the sensitivity of the online simulation results to the correlation time of the latent variable.The time correlation is introduced with the autoregressive model of order one (AR1), which has covariance function E(z(t)z(t + n∆t)) = (1 − ∆t/τ ) n (Schumann, 1995), where n denotes the number of time layers between two time moments, τ ≥ ∆t is correlation time, and at τ = ∆t we restore the discrete white noise process.The online distributional metric (17) as a function of correlation time is reported in Figure 8.The optimal online metric corresponds to τ = ∆t, which justifies our method of sampling (white noise).

Results
In Figure 9 we show online simulations with subgrid models at the coarsest resolution.The unparameterized model ("lores") has underestimated kinetic energy (a) and underestimated KE spectrum in large scales (b).This is due to the poor representation of the inverse energy cascade on the coarse grid (c).The deterministic subgrid model (MSE) improves inverse energy cascade and KE spectrum in large scales, but small eddies near the grid scale are energized too much, see KE spectrum in small scales, KE level and tails of PDFs.The GZ model does not prevent overamplification of the small eddies.In contrast, the generative stochastic models (GAN and VAE) improve the simulation of the small eddies: see spectral characteristics in small scales, tails of PDFs and kinetic energy.Note that generative models (GAN and VAE) accurately reproduce kinetic energy growth in transitional flow (panel (a), t < 2 years) in agreement with the offline analysis.
Snapshots of the velocity modulus are shown in Figure 10.At time step ∆t = 2 hours baseline models (MSE and GZ) have too many small eddies, and at time step ∆t = 1 hour the flow becomes unphysical and overenergized.GAN and VAE models at both time steps produce physical solutions which look similar to the filtered and coarsegrained high-resolution simulation (hires).In Figure 11(a) we show distributional metric as a function of the time step.While baseline models (MSE and GZ) are very sensitive to the time step, the generative models (GAN and VAE) are relatively insensitive to the time step and have the smallest errors.This suggests that the generative stochastic models have better numerical stability properties.See Appendix E for further discussion on numerical stability.
In Figure 11(b) we show the distributional metric as a function of resolution.At the coarsest resolution 48 2 , the generative stochastic models (GAN and VAE) have 5-10 times lower error compared to the unparameterized simulation (lores) and 3-5 times lower error compared to the baseline models (GZ and MSE).For intermediate and higher resolutions (64 2 and 96 2 ) all ML-based models (GZ, MSE, GAN, VAE) improve distributional error compared to the unparameterized model, but the confidence intervals (shading area) exceed the difference between the median values.So we conclude that the effect of stochastic subgrid models (GZ, GAN, VAE), as opposed to the deterministic one (MSE), at these resolutions is negligible.The discrepancy between the offline and online analysis may be due to the inclusion of ssd term, time sampling method of the stochastic parameterization, and time integration scheme.Overall, gen-  erative models (GAN and VAE) improve simulation if there are issues with numerical stability, and perform as well as the baseline deterministic model in other cases.The spectral-error metric reported in Appendix E yields similar conclusions.
In Figure 12 we show the online results for the subgrid models trained on the dataset produced using the Gaussian filter.The subgrid models cannot substantially  improve the KE spectrum on large scales with respect to the unparameterized model (panel (b)), and it results in little or no improvement in the other statistical characteristics.At higher resolutions (64 2 and 96 2 ) we observe the improvement in reproducing the KE spectrum on small scales, but not the large ones (not shown).Similar to Zanna and Bolton (2020), we report in Appendix E how the kinetic energy in online simulation changes when the subgrid model is multiplied by the adjustable parameter.This characteristic clearly demonstrates that subgrid models trained for the Gaussian filter are less efficient in energizing the flow.The same issues for the models trained to predict the subgrid forcing diagnosed with the Gaussian filter were reported in Ross et al. ( 2023) and these may be caused by the mentioned discrepancies between the offline and online analysis.
In Appendix E we include additional online results.The online generalization to the turbulence configuration with jets shows that generative models clearly improve the simulation of the transitional flow, but at a later time, all the subgrid models including baselines experience numerical stability issues.Runtime for the generative models is the same as for the deterministic baseline.

Conclusions and discussion
In this work, we propose to leverage generative machine-learning models (GAN and VAE) to build stochastic subgrid parameterizations of mesoscale eddies.Generative models allow to sample from the conditional distribution of subgrid forcing given resolved variables.We performed offline and online evaluations of the proposed subgrid models, and compared them against baseline deterministic and stochastic ML models in an idealized ocean simulation for a range of resolutions.
Our main findings can be summarized as follows: • Generative models are able to simulate the stochastic residuals of subgrid forcing with spatial structure similar to the true residuals.
• Generative models accurately represent the energy transfer spectrum and thus reproduce the large-scale kinetic energy backscatter missing at coarse-resolution.
• The GAN model is superior to others according to the offline metrics for subgrid forcing; however, the VAE model demonstrates better offline generalization to the unseen turbulence configuration (meandering jets).• Both generative models (GAN and VAE) improve the numerical stability properties and prevent overamplification of the unphysical flows in online simulations at the coarsest resolution compared to the baseline ML models.
In spite of the different performance of GAN and VAE models in the offline analysis, their performance is similar in online simulations.Therefore, offline metrics or loss functions may be bad proxies for the online performance (Frezat et al., 2022;Ross et al., 2023).The energy transfer spectrum is one of the main properties of subgrid forcing which is essential to properly energize the flow and could be considered as an alternative loss function.However, the spatial structure of the subgrid forcing and stochastic residuals may be important to ensure the development of the physical solution.
Our online simulations are optimal when the time correlation of the latent variable sampling is equal to the model timestep, which is equivalent to a white noise model and consistent with our offline training methodology.The effect of the parameterization can be analyzed by decomposing it into deterministic and stochastic parts.The determinitic part is defined as the conditional mean; while the stochastic part as a white noise model.The white noise process model implies that the energy injection by the stochastic part of the parameterization approaches zero in the limit of the small time steps (Alvelius, 1999).In addition, the average energy injection is fully described by the deterministic part of the parameterization (i.e., conditional mean, see Moser et al. (2021)).One can modify the definition of the subgrid model, for example by including memory effects, to generate a stochastic model with non-vanishing energy input (Chorin & Lu, 2015;Gagne et al., 2020;Agarwal et al., 2021;DelSole, 2000;Berner, 2005;Bhouri & Gentine, 2022).
The important property of the proposed generative models: they do not introduce new limitations to be trained on the global ocean data compared to our baseline model (GZ).Moreover, compared to GZ, both generative models do not require an explicit expression for the likelihood function, and thus slightly more complicated architecture of the stochastic model can be used, for example, the final divergence layer (Zanna & Bolton, 2020) which allows building conservative parameterizations.We expect that the application of generative models for complex flows may greatly improve the quality of the generated stochastic residuals compared to the traditional methods.
where κ c = 0.65κ max .This filter is given by a combination of sharp cut-off coarsegraining and model filter (Eq.(A1)).
Motivation for using these filters is given in Ross et al. (2023).

Appendix C Training of the Machine Learning Models C1 GAN loss function
The presented below training algorithm closely resembles paper of Adler and Öktem (2018), where the discriminator analyzes two generated images.
We generate two images S 1 = G(z 1 , q) and S 2 = G(z 2 , q) for a given q with two samples from standard normal distribution z 1 , z 2 ∈ R 2×n×n and stack them in layer dimension: where S 1 , S 2 , S ∈ R 4×n×n .The WGAN loss (Eq.( 14)) for a single data sample transforms to: The discriminator D should be 1-Lipschitz in the first argument, and we enforce it with the gradient penalty (WGAN-GP, Gulrajani et al. ( 2017)): where S = S + (1 − ) S. The random number is uniformly distributed on [0, 1] and chosen uniquely for every training sample.For every batch we choose randomly S from set {S 1 , S 2 }.Note that D( S, q) ∈ R, ∇ S D( S, q) ∈ R 4×n×n and norm || • || 2 for tensor is defined above.Regularization preventing drift of discriminator: We minimize the following loss for the discriminator: and the loss to be minimized for the generator is: In the original paper L G = L W (Adler & Öktem, 2018), but we follow a typical approach when only generated samples constitute the generator loss (Dong & Yang, 2019).
Discriminator D is parameterized by DCGAN discriminator (Radford et al., 2015) with two modifications: we remove the activation function in the final layer and remove batch normalization because it is necessary for proper use of gradient penalty (Gulrajani et al., 2017).Following Arjovsky et al. (2017), we optimize the discriminator loss (Eq.(C1)) for five batches in a row, and then we optimize the generator loss (Eq.( C2)) for one batch.

C2 VAE loss function
To train the VAE model we parameterize every probability density in the VAE loss function (Eq.( 16)) with Gaussian distributions.
The loss function to be minimized (Eq.( 16)) for one training sample transforms to: where ẑ is one sample from encoder distribution, i.e. ẑ = µ φ + σ φ , ∼ N (0, I).Note that as suggested by Rybkin et al. (2021), we sum values of MSE loss and KL loss across dimensions.The variance of decoder distribution γ is a parameter regulating the relative importance of reconstruction and regularization terms.According to Takida et al. (2022), common problems of VAE such as posterior collapse and smoothness of generated images may result from the incorrect choice of parameter γ.Following Rybkin et al. (2021), we estimate the variance of the decoder as a mean squared error: We compute γ uniquely for every batch and do not differentiate it.

C3 Additional training information
All image-to-image mappings (mean and variance prediction in GZ, generator in GAN, encoder and decoder in VAE) are based on the same convolutional neural network (CNN) similar to Guillaumin and Zanna (2021); Ross et al. (2023) with parameters given in Table C1.We follow a common approach with the normalization of input and output variables before passing them to neural networks.Each channel representing a different physical quantity or different fluid layer is normalized by a unique standard deviation computed over the training dataset.Note that the variance channel of GZ model is normalized by the squared standard deviation of the mean channel.Normalization constants become part of the model and they are not adjusted in offline or online tests.Models are trained in Pytorch (Paszke et al., 2019), batch size is 64, training algorithm is Adam (Kingma & Ba, 2014) with standard parameters (β 1 , β 2 ) = (0.9, 0.999) for GZ and VAE, and (β 1 , β 2 ) = (0.5, 0.999) for GAN (Radford et al., 2015).The learning rate is lr = 0.001 for GZ and lr = 0.0002 for GAN and VAE.GAN and VAE models are optimized for 200 epochs, and in GZ model each channel (mean and variance) is optimized for 50 epochs.Early stopping or any other criteria for choosing the best epoch was not used.Weight decay was not used.We use the following scheduler of the learning rate for GZ and VAE: on every milestone [1/2, 3/4, 7/8] • N epoch multiply learning rate by γ = 0.1, for GAN γ = 0.5.Weights of the discriminator and generator of GAN are initialized with zero mean and standard deviation 0.02 (Radford et al., 2015).During inference, neural networks are switched to evaluation mode so that batch normalization layers use parameters accumulated during training.128,64,32,32,32,32,32, n out Filter width 5, 5, 3, 3, 3, 3, 3, 3 Boundary conditions periodic ("circular padding") Activation function ReLU, in hidden layers Batch normalization after ReLU, in hidden layers In Figures D1, D2 and D3 we show the results of offline analysis for the Gaussian filter.Our main conclusions about the performance of the stochastic models are the same as for the Sharp filter.Below we show that the time step is connected to the effective eddy viscosity in our particular numerical scheme, and thus the sensitivity of the online simulation results to the time step reveals the numerical stability properties.Small-scale dissipation (ssd) was formulated not as a tendency in RHS of the governing equation, but as a postprocessing operation following every time step, see Appendix A. Because ssd does not contain a time step explicitly, the effect of dissipation accumulates over several time steps (Lund, 2003).An effective filter that we apply to the solution per unit time interval (if solution is steady) is ssd 1/∆t , which converges at every radial wavenumber κ to Heaviside step function ( ssd(κ)) 1/∆t → H(0.65κ max − κ) as ∆t → 0. (E1) Filtering with this Heaviside step function approximately corresponds to 2/3-dealiasing scheme (Orszag, 1971), which is the discretization of inviscid equations energy and enstrophy.So, we expect that the smaller the time step, the smaller the effective eddy viscosity produced by the ssd term.By refining the time step, we established that unparameterized coarse models behave as inviscid simulation at ∆t → 0, i.e. energy accumulates near the grid scale, see Figure E1.
Below we derive a spectral metric for the analysis of online simulations.The rate of change of total energy (Eq.( 4)) is defined as Applying this formula to the governing equation (1) and using Parseval theorem, the rate of change of total energy in Fourier space is Here we neglected the contribution from the small-scale dissipation term ssd which acts on a limited set of wavenumbers.The corresponding energy balance equation for filtered and coarsegrained system (7) is In Eq. (E4) the energy transfer is proportional to ∼ u m q m and can be split into the resolved transfer ∼ u m q m and unresolved transfer ∼ u m q m − u m q m which is parameterized by the subgrid model.Every term in equation (E4) can be obtained from the corresponding term in the energy balance of high-resolution simulation (Eq.( E3)) by multiplying twice by the filter transfer function, i.e. by ( G(k, l)) 2 .
We define distance between two isotropic spectra E 1 , E 2 as: where κ c is the truncation wavenumber of the exponential filter (Eq.( B2)).And average normalized distance between model on a coarse grid ("model") and filtered and coarsegrained high resolution simulation ("hires") over "energy transfer", "energy source" and kinetic energy spectrum in upper and lower fluid layers.The described spectral error is shown in Figure E2. Figure E3 shows the sensitivity to the amplitude of the parameterization and Figure E4 shows the online generalization to the jet dataset.

Figure 1 .
Figure 1.Reference simulations at four different resolutions: (a) kinetic energy (Eq.(3)) as a function of time, (b) snapshot of the potential vorticity in the model with the finest resolution, (c) the spectral density of kinetic energy normalized as E = E(κ)dκ, (d) total energy transfer from nonlinear advection (see Eq. (E3)) normalized as ∂ ∂t E = ∂tE(κ)dκ.Coarse models fail to reproduce the energy cycle when their resolution is insufficient to resolve the deformation radius κ = r −1 d (see arrow).Black dotted lines in panel (d) show the dissipation model (ssd), which causes a spurious forward energy cascade (spikes in energy transfer).

Figure 3 .
Figure 3. Schematic of three stochastic subgrid models attempting to sample from conditional distribution ρ(S|q): (a) GZ model, (b) GAN model and (c) VAE model.GZ model predicts uncorrelated stochastic residuals, but generative models (GAN and VAE) transform white noise using a mapping learned directly from data (gray-shaded box).Discriminator and Encoder are supplementary networks that allow training of the mapping but are not required for subgrid forcing prediction.

Figure 4 .
Figure 4. Prediction of subgrid forcing (S) by stochastic models at a fixed conditional variable q on the testing dataset: GZ in upper row, GAN in middle row and VAE in lower row."Model mean" is the deterministic prediction, and "Simulated residual" is the stochastic part of the model, which should be similar to "True residual"."Model std" is the standard deviation of the subgrid forcing prediction.The root mean square (rms) value of each field is shown in the box.

Figure 5 .
Figure 5. Offline analysis of stochastic subgrid models (GZ, GAN, VAE): (a) power spectrum of stochastic residuals and (b) subgrid forcing; (c) energy transfer (see Eq. (E4)) and (d) marginal PDF of subgrid forcing.MSE is the deterministic subgrid model given by the mean channel of GZ model.

Figure 6 .
Figure 6.Offline metrics from Table 2 on testing dataset in the upper row and generalization to configuration with jets (Ross et al., 2023) in the lower row.Optimal values are given with arrows.Each model is trained 5 times with different random seeds.The shading area shows min-max values among training realizations, markers show median value.

Figure 7 .
Figure 7. Offline analysis of RMSE of deterministic part in transitional regime (t < 2 years) on test dataset.Norm || • || is given per one grid point.Shading corresponds to different training realizations of the same model.Sharp filter, resolution 48 2 .Each line is given twice: for lower and upper fluid layer.

Figure 8 .
Figure 8. Online distributional metric (17) as a function of correlation time τ of latent variable z for three stochastic subgrid models (GZ, GAN and VAE).lores is model on a coarse grid without parameterization.Shading area shows min-max values among training realizations, and markers show median value.Time step of the numerical integration ∆t is 4 hours.

Figure 9 .
Figure 9. Online simulations with parameterized models (MSE, GZ, GAN and VAE).lores is a coarse unparameterized model.hires is filtered and coarsegrained high-resolution model.MSE is a deterministic subgrid model given by the mean channel of the GZ model.Energy transfer on panel (c) gives a sum of contributions from the resolved advection and subgrid model (see Eq.

(
E4)).Shading area shows min-max values among training realizations, and lines show median value.The time step ∆t is 2 hours.

Figure 10 .
Figure 10.Snapshots of the modulus of velocity.Two columns with ∆t = 2 hours correspond to Figure 9.The smaller the time step, the smaller the effective eddy viscosity, see Appendix E.

Figure 11 .
Figure 11.Online distributional metric (Eq.(17)): (a) as a function of time step at the coarsest spatial resolution and (b) as a function of spatial resolution.The shading area shows min-max values among training realizations, and markers show median value.The smaller the time step, the smaller the effective eddy viscosity, see Appendix E.

Figure 12 .
Figure12.Online simulations with parameterized models.Similar to Figure9, but for models trained on the dataset with Gaussian filter.

Figure D3 .
Figure D3.Offline metrics from Table 2. Same as Figure 6 but for Gaussian filter.

Figure E1 .
Figure E1.Here we show that the smaller the time step, the smaller the effective eddy viscosity and the closer the unparameterized coarse model to inviscid simulation, i.e. energy accumulates near the cut-off for a small time step.

Figure E2 .
Figure E2.Online metric similar to Figure 11, but for spectral error.The shading area shows min-max values among training realizations, and markers show median values.

Table 1 .
Parameters of the quasi-geostrophic model in online simulations (eddy configuration from Ross et al. (2023)).

Table 2 .
Metrics for offline analysis of stochastic subgrid model S: RMSE of deterministic

Table C1 .
Configuration of convolutional neural network (CNN) parameterizing image to image mapping.Number of input/output images arbitrary (n in , n out ) Resolution of input/output/hidden layers arbitrary, but the same Number of filters

Table E1 .
Guillaumin and Zanna (2021) parameterized models.Runtime on one CPU core for unparameterized model ("-") and ML-based parameterizations to integrate QG model in time for 20 years.Theoretically, we expect that the runtime for MSE, GAN, and VAE models should be the same, and for GZ is twice as large.Runtime for the GZ model can be reduced if aggregate mean and variance channels into one CNN network, as it is done inGuillaumin and Zanna (2021).