A posteriori learning for quasi-geostrophic turbulence parametrization

The use of machine learning to build subgrid parametrizations for climate models is receiving growing attention. State-of-the-art strategies address the problem as a supervised learning task and optimize algorithms that predict subgrid fluxes based on information from coarse resolution models. In practice, training data are generated from higher resolution numerical simulations transformed in order to mimic coarse resolution simulations. In essence, these strategies optimize subgrid parametrizations to meet so-called $\textit{a priori}$ criteria. But the actual purpose of a subgrid parametrization is to obtain good performance in terms of $\textit{a posteriori}$ metrics, which involve computing entire model trajectories. In this paper, we focus on the representation of energy backscatter in two-dimensional quasi-geostrophic turbulence and compare parametrizations obtained with different learning strategies at fixed computational complexity. We show that strategies based on $\textit{a priori}$ criteria yield parametrizations that tend to be unstable in direct simulations and describe how subgrid parametrizations can alternatively be trained end-to-end in order to meet $\textit{a posteriori}$ criteria. We illustrate that end-to-end learning strategies yield parametrizations that outperform known empirical and data-driven schemes in terms of performance, stability and ability to apply to different flow configurations. These results support the relevance of differentiable programming paradigms for climate models in the future.


Introduction
The representation of unresolved processes is a key source of uncertainty in weather and climate models. Climate science and weather forecasting indeed heavily rely on numerical simulations of the Earth's atmosphere and oceans (Bauer et al., 2015; Neumann et al., 2019). But even the most advanced applications are currently far from explicitly resolving the wide variety of space-time scales and physical processes involved. This will likely remain the case for the foreseeable future because of the non-linearity of fluid dynamics and thermodynamics, and because of the finite nature of computational resources (Fox-Kemper et al., 2014). Weather and climate models will therefore keep relying on approximated representations of the effect of unresolved processes in the form of subgrid parametrization schemes (Schneider, Lan, et al., 2017; Fox-Kemper et al., 2019). Parametrization schemes accounting for the impact of turbulence in the atmosphere and oceans at various scales will in particular remain essential components of these models.
Parametrizations of unresolved turbulent motions are usually based on first principles, physics, idealized experiments, field observations and high resolution simulations.
Their design involves a mixture of empirical and process-based modeling. Process-based models are formulated, tested and calibrated with experiments performed in the field, in the laboratory or with computers (Stensrud, 2009). On this basis, the actual parametrization scheme estimates a tendency term for the target model from its resolved variables.
The underlying conceptual framework can rely on some ensemble averaging procedure, so that the parametrization intends to capture the bulk statistical effect of unresolved processes, as for instance for turbulence models (Mellor, 1985). Alternatively, one may consider a spatial filtering procedure and the parametrization can then exploit the scale-invariant properties of turbulence, as in Large Eddy Simulation (LES) models (Lesieur et al., 2005). This latter framework is used for instance for the parametrizations of ocean macro-turbulence in eddy-rich ocean models (Fox-Kemper & Menemenlis, 2008).
Recently, the use of machine learning (ML) for better parametrizing unresolved processes in weather and climate models has gained momentum. Calibrating physics-based parametrization schemes against observations with ML and emulators is for instance becoming common practice (Ollinaho et al., 2013; Schneider, Lan, et al., 2017; Couvreux et al., 2021). Emulation approaches based on ML have also been proposed as a strategy for accelerating or regularizing existing schemes (Ukkonen et al., 2020; Meyer et al., 2021). ML also provides new means to design subgrid parametrization schemes from high-fidelity simulations. In this context, ML may learn a mapping which predicts the tendency term due to unresolved subgrid effects from resolved quantities available in a target model. In atmospheric models, these approaches have been used to improve the representation of cloud micro-physics and moist processes (Krasnopolsky et al., 2013; Rasp et al., 2018; O'Gorman & Dwyer, 2018; Brenowitz & Bretherton, 2018; Seifert & Rasp, 2020). In ocean models, it is expected that the representation of macro-turbulence could be improved with similar approaches (Bolton & Zanna, 2019; Guillaumin & Zanna, 2021).
The design of parametrizations with ML builds on the rise of scientific machine learning (SciML) and its broad application to physical sciences. SciML is an emerging field, which bridges scientific computing and machine learning. Some recent key developments in this field have been motivated both from physical insights and for their applications to physical sciences, especially in fluid dynamics (Carleo et al., 2019;Thuerey et al., 2021).
The conceptual developments in ML motivated by applications to problems governed by partial differential equations (Long et al., 2018;Sirignano & Spiliopoulos, 2018;Raissi et al., 2019) have for instance gradually freed ML from its black-box reputation. The design of parametrization schemes now directly benefits from ML approaches for dynamical system identification and equation discovery (Brunton et al., 2016;Zanna & Bolton, 2020). The ability to embed symmetries and law invariances into neural networks (Cohen & Welling, 2016;Cranmer et al., 2020;Alet et al., 2021) will also likely be important in the design of parametrization schemes (Frezat et al., 2021), and in applications of ML to fluid mechanics in general (Brunton et al., 2020;Vinuesa & Brunton, 2021).
But ML-based approaches to subgrid parametrizations are still mostly based on a priori learning strategies, which could limit their performance and applicability. There are indeed two different sorts of evaluation metrics for measuring the precision of subgrid models in turbulent simulations (Pope, 2000). A priori metrics, on the one hand, measure to what extent a given subgrid model is able to predict a tendency term due to unresolved subgrid effects at a fixed time. A posteriori metrics, on the other hand, require performing simulations with the subgrid model, and measure its integrated impact on the simulated flows. The common strategy for learning subgrid parametrizations is to formulate a supervised learning task from a high-resolution reference simulation dataset.
In practice, learned parametrizations result from the minimization of a cost function based on some a priori metrics measuring how well a mapping can predict unresolved fluxes from coarse-grained quantities. Still, with such a strategy, what we really intend to optimize is the ability of the parametrization to yield good solutions when used a posteriori in numerical simulations. In principle, the versatility of ML algorithms should allow us to train parametrization schemes with learning criteria based on a posteriori metrics, adopting the so-called end-to-end learning framework (Glasmachers, 2017). But surprisingly, there are very few published examples of end-to-end learning strategies in computational fluid dynamics (Sirignano et al., 2020; Kochkov et al., 2021; Stachenfeld et al., 2021).
It is therefore yet unclear how subgrid parametrizations trained with a priori and a posteriori strategies compare in terms of performance, stability and ability to apply to different flow conditions.
Flows governed by quasi-geostrophic (QG) dynamics provide an interesting and challenging testbed to evaluate learning strategies for subgrid parametrizations. Quasi-geostrophic theory indeed proposes a simple framework for studying geophysical flows constrained by Earth rotation and stratification, as for instance large-scale atmospheric dynamics and ocean macro-turbulence (Cushman-Roisin & Beckers, 2011). The two-dimensional turbulence emerging from barotropic QG dynamics (Boffetta & Ecke, 2012; Majda & Wang, 2006) exhibits a dual cascade scenario with inverse energy transfers to larger scales and direct enstrophy transfers to smaller scales (Kraichnan, 1967; Thuburn et al., 2014). Because of this inverse energy cascade, developing subgrid parametrizations for QG flows is a challenging task, as the stability of numerical integration schemes is directly controlled by the rate of energy backscatter, i.e. transfers from subgrid to resolved scales (Lilly, 1992; Carati et al., 1995). As a consequence, a large number of subgrid parametrization schemes have been proposed for two-dimensional turbulence (see e.g. Danilov et al. (2019) for a review) and well documented flow configurations with performance metrics for parametrization are readily available (Graham & Ringler, 2013). Unsurprisingly, attempts to learn subgrid parametrizations for two-dimensional turbulence with ML have been less successful than for other types of turbulent flows (Maulik et al., 2019; Guan, Chattopadhyay, et al., 2022). In particular, ad-hoc solutions had to be implemented in order to ensure the numerical stability of the learned schemes under energy backscatter conditions. For instance, Maulik et al. (2019) use a clipping post-processing procedure to remove negative diffusivity, while Guan, Chattopadhyay, et al. (2022) mitigate this problem in decaying turbulence by increasing the size of the training dataset. More recently, Guan, Subel, et al. (2022) and Pawar et al. (2022) demonstrated how incorporating physics in the models could lead to stable simulations that require less data for training and generalize better. Up to now, however, none of the published works investigates the long-term statistical performance of learned schemes far beyond the decorrelation horizon.
Learning stable parametrizations for two-dimensional turbulence in QG flows is therefore still an open problem.
In this work, we compare parametrizations for two-dimensional turbulence obtained with different learning strategies, at fixed computational complexity. In particular, we show that we are able to train a model based on a posteriori metrics with an end-to-end learning strategy. Through evaluation on three different configurations (decay, wind-forcing and beta-effect), the end-to-end learning strategy is shown to yield stable parametrizations that outperform previous physics-based and NN-based models without any explicit post-processing step. Statistical metrics on long-term spectral transfers are shown to be in excellent agreement with direct numerical simulations (DNS), which is particularly encouraging for future climate models. The paper is organized as follows: in Sec. 2, we present the a priori and a posteriori learning strategies and the type of metrics they are respectively able to optimize. The application to quasi-geostrophic parametrizations is described in Sec. 3 with the numerical setup and baselines used in the evaluation. Results are presented both for short-term and long-term statistics for three different configurations in Sec. 4. Finally, we discuss the limitations and implications of the described strategies for realistic large-scale solvers.

Learning strategies
In this study, we address the simulation of the time evolution of geophysical quantities $y(t)$. We assume the underlying governing equations to be known. Let us denote by $f(y)$ these true dynamics. The numerical integration of this system being either impossible or expensive, we aim at solving the time evolution of reduced variables $\bar{y}(t)$ such that

$$\frac{\partial \bar{y}}{\partial t} = g(\bar{y}) + M(\bar{y}), \qquad \bar{y} = T(y) \in \bar{\Omega}, \tag{1}$$

where $\bar{\Omega} \subset \Omega$, $g$ is a reduced-order operator, $M$ a subgrid-scale parametrization and $T$ a projection operator that maps true variables to reduced ones. The objective in reduced-order modeling is to design operator $g$ such that the evolution of the reduced variables matches the projection $T(y)$ of the true variables $y$. We note that for some reduced-order problems, we identify $f = g$ with variables existing on different spaces or dimensionalities.
Within a learning framework, one states the identification of the subgrid-scale term $R(y) = T(f(y)) - g(T(y))$ as a learning problem from reduced variables for a parametrization $M(\bar{y}|\theta)$, where $\theta$ are trainable model parameters. Under the assumption that the projection operator $T$ commutes with partial derivatives, the most common approach is to train the parametrization $M(\bar{y}|\theta)$ as a functional approximation of the closure term $R$. This approach has been widely explored in the recent literature (Vollant et al., 2017; Bolton & Zanna, 2019). It does not however constrain the trained parametrization to behave as expected when implemented in the solver of the reduced-order system. In this respect, an end-to-end framework appears as an appealing approach to explicitly state the subgrid-scale parametrization problem according to the best possible approximation of the true reduced variables. Such end-to-end approaches have shown many advantages in the approximation of differential equations in general (Chen et al., 2018; Bakarji & Tartakovsky, 2021; Fablet et al., 2021). When applied to physical problems, they are often referred to as differentiable physics (de Avila Belbute-Peres et al., 2018; Um et al., 2020; Holl et al., 2020), since they require the gradient of all the considered operators and solvers to be available for the optimization algorithm. Overall, these two categories of learning approaches differ in the space where the training is performed, similarly to the definition of a priori and a posteriori metrics (Pope, 2000) for the benchmarking of SGS parametrizations. This is the reason why we refer to a priori and a posteriori learning strategies, as detailed below.
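As an illustration, the construction of the training pairs $\{R(y_i), \bar{y}_i\}$ can be sketched in a few lines of PyTorch. The sketch below assumes a sharp spectral cutoff for the projection operator $T$ and user-supplied tendency functions `f` and `g`; all names and the coarse-graining interface are illustrative, not the actual implementation.

```python
import torch

def spectral_cutoff_projection(field, factor):
    """Sharp spectral cutoff projection T: truncate the Fourier modes
    of a (..., N, N) field and return it on a coarse (N/factor)^2 grid."""
    n = field.shape[-1]
    m = n // factor
    f_hat = torch.fft.fftshift(torch.fft.fft2(field), dim=(-2, -1))
    c = n // 2
    kept = f_hat[..., c - m // 2:c + m // 2, c - m // 2:c + m // 2]
    kept = torch.fft.ifftshift(kept, dim=(-2, -1))
    # rescale so grid-point values keep the same physical amplitude
    return torch.fft.ifft2(kept).real * (m / n) ** 2

def sgs_residual(y, f, g, factor):
    """SGS term R(y) = T(f(y)) - g(T(y)), the regression target of the
    a priori strategy; f and g return tendencies on the fine and
    coarse grids, respectively."""
    y_bar = spectral_cutoff_projection(y, factor)
    return spectral_cutoff_projection(f(y), factor) - g(y_bar)
```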

a priori learning
The a priori learning strategy comes to learning the SGS parametrization using training metrics defined on instantaneous quantities, i.e. a direct measure of the accuracy of the model based on the predicted SGS term $R(y)$. The a priori loss $\mathcal{L}_{\mathrm{prio}}$ has the form

$$\mathcal{L}_{\mathrm{prio}} = \ell\big(R(y), M(\bar{y}|\theta)\big), \tag{2}$$

where $M$ is a given SGS model to be evaluated and $\ell$ an instantaneous error measure. The most common a priori metrics found in the fluid dynamics community are the mean squared error (MSE) and the correlation between true and predicted SGS terms. Training a NN-based parametrization according to an a priori setting then comes to building a representative ground-truth dataset $\{R(y_i), \bar{y}_i\}_n$ of paired SGS terms and reduced variables to solve the following minimization problem w.r.t. model parameters $\theta$,

$$\arg\min_\theta \sum_{i=1}^{n} \mathcal{L}_{\mathrm{prio}}\big(R(y_i), M(\bar{y}_i|\theta)\big). \tag{3}$$

Solving for (3) requires the evaluation of the partial derivative of the a priori loss $\mathcal{L}_{\mathrm{prio}}$ with respect to parameters $\theta$, which only involves the gradient of $M$,

$$\frac{\partial \mathcal{L}_{\mathrm{prio}}}{\partial \theta} = \frac{\partial \mathcal{L}_{\mathrm{prio}}}{\partial M} \frac{\partial M}{\partial \theta}. \tag{4}$$

This approach has been applied to scalar (Vollant et al., 2017; Portwood et al., 2021; Frezat et al., 2021) and momentum (Gamahara & Hattori, 2017; Beck et al., 2019; Xie et al., 2020; Yuan et al., 2020) parametrizations of three-dimensional turbulence on different configurations. The two-dimensional case is also well documented in decaying (Maulik et al., 2019; Pawar et al., 2020; Guan, Chattopadhyay, et al., 2022) and double-gyre (Bolton & Zanna, 2019; Zanna & Bolton, 2020) configurations. We may emphasize that, by construction, the a priori learning strategy shall lead to the best a priori results, which shall translate into a good instantaneous prediction of the SGS term according to metrics $\mathcal{L}_{\mathrm{prio}}$.
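For concreteness, a minimal a priori training loop, assuming an MSE loss and a dataset of precomputed $(\bar{y}_i, R(y_i))$ pairs, could look as follows (a sketch, not the exact training configuration used here):

```python
import torch

def train_a_priori(model, loader, epochs=30, lr=1e-4):
    """Minimal a priori training loop, Eq. (3): regress R(y_i) from
    the reduced variables with an MSE loss. `loader` yields
    (y_bar, R) pairs, e.g. a torch.utils.data.DataLoader."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for y_bar, r_true in loader:
            opt.zero_grad()
            loss = torch.mean((model(y_bar) - r_true) ** 2)  # L_prio
            loss.backward()   # only requires dM/dtheta, Eq. (4)
            opt.step()
    return model
```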

a posteriori learning
The a posteriori learning strategy states the SGS parametrization problem as the approximation of the true reduced variables according to some a posteriori metrics. This is important since it is possible for a model to perform well a priori while failing a posteriori, the most common factor being numerical instabilities due to the lack of small-scale energy dissipation (Maulik et al., 2019; Guan, Chattopadhyay, et al., 2022). Let us denote by $\Phi$ the flow operator that advances the reduced system in time, i.e.,

$$\Phi(\bar{y}(t_0)) = \bar{y}(t_0) + \int_{t_0}^{t_1} \big( g(\bar{y}(t)) + M(\bar{y}(t)|\theta) \big)\, \mathrm{d}t = \bar{y}(t_1). \tag{5}$$

Numerically speaking, the flow operator $\Phi$ involves a time integration scheme (see Fig. 1) from start to end times $t_0$ and $t_1$, respectively. Following recent advances in neural integration schemes (Chen et al., 2018; Ouala et al., 2021), we may consider here both explicit and adaptive schemes. Let $\ell(y(t), \bar{y}(t))$ be some a posteriori metrics defined on the simulated trajectories,

$$\mathcal{L}_{\mathrm{post}} = \int_{t_0}^{t_1} \ell\big(T(y(t)), \bar{y}(t)\big)\, \mathrm{d}t. \tag{6}$$

Now, the a posteriori minimization problem involves the time integration of $\Phi$ on subintervals $[t_0, t_1]$ from the dataset of trajectories spanning the interval $[0, T]$, with initial reduced states typically taken to be $\bar{y}(t_0) = T(y(t_0))$,

$$\arg\min_\theta \mathcal{L}_{\mathrm{post}}\big(T(y(t)), \Phi(\bar{y}(t_0))\big), \qquad t \in [t_0, t_1]. \tag{7}$$

Then, from the Leibniz integral rule, updating model parameters $\theta$ requires the flow partial derivative, i.e.,

$$\frac{\partial \Phi}{\partial \theta} = \int_{t_0}^{t_1} \left( \frac{\partial g}{\partial \bar{y}}\frac{\partial \bar{y}}{\partial \theta} + \frac{\partial M}{\partial \bar{y}}\frac{\partial \bar{y}}{\partial \theta} + \frac{\partial M}{\partial \theta} \right) \mathrm{d}t. \tag{8}$$
This equation makes explicit that the gradient-based optimization of the a posteriori criterion involves the computation of the gradient with respect to all the components of the forward model, i.e. dynamical operator g as well as the considered time integration scheme.
Assuming that one can run all components within a differentiable programming framework (here, PyTorch (Paszke et al., 2019)), the embedded automatic differentiation tools make these computations direct with no additional programming cost. In our experiments, this comes to performing automatic differentiation through an explicit fourth-order Runge-Kutta scheme with N discrete time-steps, which defines the temporal horizon T = N ∆t.
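A minimal sketch of such a differentiable explicit fourth-order Runge-Kutta integrator is given below; because it is written with differentiable tensor operations only, automatic differentiation propagates gradients through every stage and time step without extra code (the `tendency` callable stands for $g + M$ and is an assumption of this sketch):

```python
def rk4_step(state, tendency, dt):
    """One explicit fourth-order Runge-Kutta step. Built only from
    differentiable tensor operations, so gradients flow through the
    update, including any neural SGS term inside `tendency`."""
    k1 = tendency(state)
    k2 = tendency(state + 0.5 * dt * k1)
    k3 = tendency(state + 0.5 * dt * k2)
    k4 = tendency(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def flow(state, tendency, dt, n_steps):
    """Unrolled integration: the flow operator Phi over T = N * dt."""
    for _ in range(n_steps):
        state = rk4_step(state, tendency, dt)
    return state
```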
The a posteriori strategy significantly widens the range of metrics which can be considered to calibrate the SGS parametrization. We illustrate this modeling flexibility for quasi-geostrophic turbulence in the next section. We may point out that this a posteriori learning strategy has recently been explored for temporally-developing plane turbulent jets (MacArt et al., 2021) and the short-term simulation of two-dimensional flows. Here, we explore further its relevance for two-dimensional geophysical flows, including a benchmarking with the a priori setting for different flow configurations.

Figure 1. Sketch of one learning step for the a priori and a posteriori strategies. The a priori loss is computed at instantaneous time t (dashed, red), while the a posteriori loss involves several states forward in time (dashed, blue).

Application to quasi-geostrophic turbulence
Geophysical turbulence is widely acknowledged to involve energy backscatter. This makes SGS parametrization a key issue for the simulation of ocean and atmosphere dynamics (Graham & Ringler, 2013; Jansen et al., 2015; Juricke et al., 2020). As case-study framework, we consider quasi-geostrophic (QG) flows. While providing an approximate yet representative model for the rotating stratified flows found in atmosphere and ocean dynamics, they involve relatively complex SGS features that make the learning problem non-trivial. As such, QG flows are regarded as an ideal playground to explore and assess the relevance of a priori and a posteriori learning strategies for SGS parametrization in geophysical turbulence. The QG equations (Majda & Wang, 2006) are given by

$$\frac{\partial \omega}{\partial t} + J(\psi, \omega) = \nu \nabla^2 \omega - \mu \omega - \beta \partial_x \psi + F, \tag{9}$$

which is equivalent to the transport of vorticity $\omega$, obtained by taking the curl of the incompressible Navier-Stokes equations, i.e. $\nabla \cdot u = 0$ and $\omega = \nabla \times u$, and applying the beta-plane approximation, hydrostatic and geostrophic balances. In addition, we have

$$\nabla^2 \psi = \omega, \tag{10}$$

$$u = \nabla^\perp \psi = (-\partial_y \psi, \partial_x \psi), \tag{11}$$

where $\psi$ is the streamfunction, $u$ the velocity and $J(\psi, \omega) = \partial_x \psi \, \partial_y \omega - \partial_y \psi \, \partial_x \omega$ is the non-linear Jacobian operator. The model is parametrized by viscosity $\nu$, linear drag coefficient $\mu$, Rossby parameter $\beta$ and a source term $F$. The QG equations have two invariants in the limit of inviscid flow (Bouchet & Venaille, 2012), the energy

$$E = \frac{1}{2} \int |u|^2 \, \mathrm{d}r \tag{12}$$

and the enstrophy

$$Z = \frac{1}{2} \int \omega^2 \, \mathrm{d}r. \tag{13}$$

In order to study scale interactions, we introduce the enstrophy spectrum $Z(k)$ in spectral space as the enstrophy contained in the shell of radius $k$,

$$Z(k) = \frac{1}{2} \oint \hat{\omega}(\mathbf{k}) \, \hat{\omega}^*(\mathbf{k}) \, \mathrm{d}S(\mathbf{k}), \tag{14}$$

where the Fourier transform is represented by $\hat{\cdot}$, complex conjugation by $\cdot^*$ and $\mathrm{d}S(\mathbf{k})$ is the integration element of the shell of radius $k$ from wavevector $\mathbf{k}$. The evolution equation of the enstrophy spectrum reads

$$\frac{\partial Z(k)}{\partial t} = F(k) + D(k) + Q(k) + B(k) + T(k), \tag{15}$$

where the different terms of the right-hand side are related to various effects: external energy source $F$, dissipation $D$, large-scale drag $Q$, beta-plane effect $B$ and transfer between scales $T$. This last term writes

$$T(k) = -\oint \mathrm{Re}\left[ \hat{\omega}^*(\mathbf{k}) \, \hat{J}(\psi, \omega)(\mathbf{k}) \right] \mathrm{d}S(\mathbf{k}). \tag{16}$$

This allows to define the enstrophy flux (Gupta et al., 2019) through the wavenumber $k$ as

$$\Pi_Z(k) = -\int_0^k T(k')\, \mathrm{d}k', \tag{17}$$

which is a key quantity to measure the effect of SGS modeling on the enstrophy distribution over the range of resolved scales.
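As an illustration, the right-hand side of Eq. (9) can be evaluated pseudo-spectrally in a few lines. The sketch below uses PyTorch, follows the sign convention $\nabla^2\psi = \omega$ adopted above, and omits the 3/2 dealiasing step for brevity; the precomputed wavenumber grids `kx`, `ky`, `k2` are assumptions of the sketch.

```python
import torch

def qg_tendency(omega_hat, nu, mu, beta, f_hat, kx, ky, k2):
    """Pseudo-spectral right-hand side of Eq. (9) for the spectral
    vorticity omega_hat; dealiasing omitted for brevity."""
    k2_safe = torch.where(k2 == 0, torch.ones_like(k2), k2)
    psi_hat = -omega_hat / k2_safe   # nabla^2 psi = omega (zero mode dropped)
    # Jacobian J(psi, omega) evaluated in physical space
    psi_x = torch.fft.ifft2(1j * kx * psi_hat).real
    psi_y = torch.fft.ifft2(1j * ky * psi_hat).real
    omg_x = torch.fft.ifft2(1j * kx * omega_hat).real
    omg_y = torch.fft.ifft2(1j * ky * omega_hat).real
    jac_hat = torch.fft.fft2(psi_x * omg_y - psi_y * omg_x)
    # advection + viscosity + linear drag + beta term + forcing
    return (-jac_hat - nu * k2 * omega_hat - mu * omega_hat
            - beta * 1j * kx * psi_hat + f_hat)
```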

SGS parametrization for QG dynamics
The derivation of the reduced model for QG dynamics follows the same procedure as described for fluid dynamics in general. The projection operator $T$ from Eq. (1) is assumed known and given, at spatial coordinate $x$, as the composition of a discretization $D : \Omega \to \bar{\Omega}$ and the convolution of $y$ with a kernel function $G(x)$ (Leonard, 1975),

$$T(y) = D(G * y) = D\left( \int_\Omega G(x - x') \, y(x') \, \mathrm{d}x' \right). \tag{18}$$

We can then derive the equations which govern the evolution of the reduced vorticity $\bar{\omega}$ as

$$\frac{\partial \bar{\omega}}{\partial t} + J(\bar{\psi}, \bar{\omega}) = \nu \nabla^2 \bar{\omega} - \mu \bar{\omega} - \beta \partial_x \bar{\psi} + \bar{F} + R(\psi, \omega), \qquad \bar{\omega} \in \bar{\Omega}, \tag{19}$$

where $R(\psi, \omega) = J(\bar{\psi}, \bar{\omega}) - \overline{J(\psi, \omega)}$ is the SGS term. For convenience, note that the reduced term can be expressed in a flux formulation,

$$R(\psi, \omega) = \nabla \cdot \left( \bar{u}\bar{\omega} - \overline{u\omega} \right). \tag{20}$$

In this context, $\bar{\omega}$ is only solved for the largest scales of the flow, and $R(\psi, \omega)$ accounts for the effect of unresolved motions on the resolved scales. The SGS term is thus not known from the reduced variables because of the non-linear interactions of the small-scale dynamics $J(\psi, \omega)$. Following the notations introduced in Section 2, we aim to identify a QG SGS parametrization $M(\bar{\psi}, \bar{\omega}|\theta)$, given the parametrization for operator $g$ by (19), using both a priori and a posteriori learning strategies.
In order to study the effect of the SGS parametrization on scale interactions, we will consider the enstrophy spectrum $Z(k)$ and the associated enstrophy flux $\Pi_Z(k)$ in various flow configurations. Note that when the governing equation of the reduced vorticity $\bar{\omega}$ is solved, the transfer term is now split into a resolved and a modeled part, as

$$\bar{T}(k) = \bar{T}_r(k) + \bar{T}_m(k). \tag{21}$$

We can also look at the resolved enstrophy equilibrium in physical space. The governing equation of the resolved enstrophy contains, in addition to the resolved transport and dissipation terms, the SGS contribution

$$T_{\bar{z}} = \bar{\omega} \, R(\psi, \omega), \tag{22}$$

where this last term $T_{\bar{z}}$ shows the direct effect of the SGS term on the resolved enstrophy equilibrium. From Eq. (20), this last term can be expressed as

$$T_{\bar{z}} = \nabla \cdot \left( \bar{\omega} \left( \bar{u}\bar{\omega} - \overline{u\omega} \right) \right) - \left( \bar{u}\bar{\omega} - \overline{u\omega} \right) \cdot \nabla \bar{\omega}, \tag{23}$$

with a first term related to diffusion, and a second term related to the transfer between resolved and unresolved scales: if $(\bar{u}\bar{\omega} - \overline{u\omega}) \cdot \nabla \bar{\omega} > 0$, there is a direct transfer from resolved to unresolved scales (forwardscatter), but if $(\bar{u}\bar{\omega} - \overline{u\omega}) \cdot \nabla \bar{\omega} < 0$, there is a transfer from unresolved to resolved scales (backscatter). The latter will be the key quantity to study the ability of models to take backscatter into account. However, because models are built directly for $R(\psi, \omega)$, the above decomposition cannot be performed, and only $T_{\bar{z}}$ can be considered.

Numerical solver
The equations (9)-(11) are solved using a pseudo-spectral differentiable code with full 3/2 dealiasing (Canuto et al., 2012) and a classical fourth-order Runge-Kutta time integration scheme. The system is defined on a square domain $\Omega = [-\pi, \pi]^2$ with a Fourier basis, i.e. double-periodic boundary conditions on $\partial\Omega$, discretized on $N_{\mathrm{true}} = (N_x, N_y)$ grid points with uniform spacing $\Delta_{\mathrm{true}} = |\Omega| \, N_{\mathrm{true}}^{-1}$. The reduced states are obtained by projecting the true (or high-resolution) states through a convolution with a spatial kernel $G_\delta(k)$ at spatial scale $\delta > 0$, followed by a discretization on the reduced grid $\bar{\Omega}$ with larger spacing $\Delta_{\mathrm{reduced}} = \delta \Delta_{\mathrm{true}}$, equivalent to a sharp cutoff. It has been shown previously that SGS parametrizations can perform differently depending on the type of filter used in the evaluations (Piomelli et al., 1988; Zhou et al., 2019).
We then aim to evaluate how the choice of the filter affects the learning strategies and we consider two common types of filters, defined in spectral space as

$$\text{Gaussian filter:} \quad \hat{G}_\delta(k) = \exp\left(-\frac{k^2 \delta^2}{24}\right), \tag{25}$$

$$\text{Cut-off filter:} \quad \hat{G}_\delta(k) = \begin{cases} 1 & \text{if } |k| \le k_c, \\ 0 & \text{otherwise,} \end{cases} \tag{26}$$

with $k_c = \pi / \delta$ the cutoff wavenumber. Regarding numerical aspects, we can solve the time integration of the reduced system $g$ with a larger time-step by a factor corresponding to the grid-size ratio (or filter scale $\delta$), i.e. $\Delta t_{\mathrm{reduced}} = \delta \Delta t_{\mathrm{true}}$. To generate the corresponding datasets $\{R(y_i), \bar{y}_i\}_n$ and $\{y(t)\}_{t \in [0,T]}$, we subsample one true state every $\delta$ iterations performed by the true system $f$.
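In code, both filters reduce to a multiplication in spectral space; a minimal sketch, using the standard LES transfer functions recalled above (the precomputed squared-wavenumber grid `k2` is an assumption), could read:

```python
import torch

def gaussian_filter_hat(k2, delta):
    """Gaussian transfer function, exp(-k^2 delta^2 / 24), Eq. (25)."""
    return torch.exp(-k2 * delta ** 2 / 24.0)

def cutoff_filter_hat(k2, delta):
    """Sharp spectral cutoff, Eq. (26): keep |k| <= k_c = pi / delta."""
    k_c = torch.pi / delta
    return (k2 <= k_c ** 2).to(torch.get_default_dtype())

def apply_filter(field, g_hat):
    """Filter a physical-space field by multiplication in spectral space."""
    return torch.fft.ifft2(torch.fft.fft2(field) * g_hat).real
```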

Baseline parametrizations
For benchmarking purposes, we implement some physics-based baselines and focus on parametrizations based on functional eddy viscosity (Kraichnan, 1976), i.e. models that artificially dissipate energy at relevant scales to remain stable. This is to be contrasted with structural models, which produce backscatter and thus suffer from stability issues, and will not be considered here. One can state these parametrizations in a flux formulation,

$$M(\bar{\omega}) = \nabla \cdot \left( \nu_e \nabla \bar{\omega} \right), \tag{27}$$

where the eddy viscosity coefficient $\nu_e \propto c_P \Delta^n$ contains an arbitrary constant $c_P$, for which the optimal value depends on the flow configuration, with $n$ depending on the scaling law used to derive the model. We used the dynamic procedure proposed by Germano et al. (1991) and Lilly (1992), where the constant is computed from a least-squares minimization of the residual SGS term with a filter size larger than $\delta$, i.e. $\tilde{\omega} = \bar{\omega} * G_{\tilde{\delta}}$, $\tilde{\delta} > \delta$. We also apply spatial averaging with positive clipping (Pawar et al., 2020; Guan, Chattopadhyay, et al., 2022) in order to avoid locally negative constants $c_P(x, y) < 0$, i.e. ensuring that the models are purely diffusive and $\nu_e \ge 0$.
One of the most popular SGS models was proposed by Smagorinsky (1963).
It derives from the assumption of a direct cascade of energy, which is relevant for three-dimensional flows. However, this assumption is not expected to translate well to two-dimensional or geophysical turbulence, even if it has already been employed in global climate models (Delworth et al., 2012). Following a similar derivation, the Leith model (Leith, 1996) is often referred to as the two-dimensional counterpart of the Smagorinsky model, assuming a direct cascade of enstrophy. The two models define eddy viscosity coefficients proportional to the resolved strain rate $|\bar{S}|$ and to the resolved vorticity gradient $|\nabla\bar{\omega}|$, respectively:

$$\text{Smagorinsky model:} \quad \nu_e = (c_S \Delta)^2 \, |\bar{S}|, \tag{29}$$

$$\text{Leith model:} \quad \nu_e = (c_L \Delta)^3 \, |\nabla \bar{\omega}|. \tag{30}$$

We will denote by $M_{\mathrm{DynSmagorinsky}}$ and $M_{\mathrm{DynLeith}}$ the dynamic versions of these two models, where $c_P$ has been computed using the dynamic procedure mentioned above.
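As an illustration, the (non-dynamic) Leith eddy viscosity of Eq. (30) can be sketched with centered finite differences on the double-periodic reduced grid; the default constant `c_l` below is a placeholder for the value that the dynamic procedure would otherwise estimate:

```python
import torch

def leith_viscosity(omega_bar, dx, c_l=0.25):
    """Leith eddy viscosity, Eq. (30): nu_e = (c_L dx)^3 |grad omega|,
    with centered differences on the double-periodic reduced grid."""
    omg_x = (torch.roll(omega_bar, -1, dims=-1)
             - torch.roll(omega_bar, 1, dims=-1)) / (2.0 * dx)
    omg_y = (torch.roll(omega_bar, -1, dims=-2)
             - torch.roll(omega_bar, 1, dims=-2)) / (2.0 * dx)
    return (c_l * dx) ** 3 * torch.sqrt(omg_x ** 2 + omg_y ** 2)
```

The corresponding SGS term is then obtained from the flux formulation of Eq. (27), $\nabla \cdot (\nu_e \nabla \bar{\omega})$.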

Neural architecture and training
Our main focus being here the impact of a priori and a posteriori strategies, we consider the same neural-network-based parametrization M with the two learning settings. We use a convolutional neural network (CNN) architecture, which is particularly well suited to capturing local spatial patterns on regular grids. As shown in Fig. 2, we use a simple ConvNet with 10 layers of convolutions with non-linear ReLU activations; since the domain is double-periodic, we also use a periodic padding as a first layer. More involved architectures could further improve the performance. We may point out that our goal is not to design an optimal NN-based architecture, but rather to evaluate the impact of different learning strategies at similar computational complexity for the SGS parametrization.
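A sketch of such an architecture in PyTorch is given below; the channel width and kernel size are illustrative assumptions, as the text only fixes the depth (10 convolution layers), the ReLU activations and the periodic padding:

```python
import torch.nn as nn

class SGSConvNet(nn.Module):
    """Plain ConvNet mapping reduced input fields (e.g. psi_bar and
    omega_bar as two channels) to a predicted SGS term; circular
    padding enforces the double-periodic boundary conditions."""
    def __init__(self, in_channels=2, width=64, depth=10):
        super().__init__()
        layers = []
        channels = in_channels
        for _ in range(depth - 1):
            layers += [nn.Conv2d(channels, width, 3, padding=1,
                                 padding_mode='circular'),
                       nn.ReLU()]
            channels = width
        # final convolution: linear output, one channel for the SGS term
        layers.append(nn.Conv2d(channels, 1, 3, padding=1,
                                padding_mode='circular'))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```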
Regarding the learning phase, the training loss for the a priori strategy (3) is the MSE between true and predicted SGS terms,

$$\mathcal{L}_{\mathrm{prio}}(\theta) = \left\| R(y) - M(\bar{y}|\theta) \right\|_2^2. \tag{31}$$

For the a posteriori strategy (7), the choice of training loss is more flexible, since we can explore spatio-temporal metrics for a batch of $N = T / \Delta t_{\mathrm{reduced}}$ discrete integration steps.
To illustrate the basics of the strategy, we choose the MSE of the most important state of the QG system, i.e. the vorticity,

$$\mathcal{L}_{\mathrm{post}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left\| T(\omega(t_n)) - \bar{\omega}(t_n) \right\|_2^2. \tag{32}$$

The models are trained with the Adam optimizer on the same dataset containing 10 independent trajectories, i.e. simulations of 3000 snapshots each using different initial conditions, which gives a dataset of 30000 samples. Training is performed until reaching convergence. Empirically, we noted that 30 epochs were necessary for the a priori strategy, while 5 are enough for the a posteriori strategy. We note however that the latter is consequently more expensive due to the solver steps involved in the training loop.
Regarding the a posteriori strategy, training a model that performs the time integration of a system of PDEs inside the minimization loop may lead to instabilities and difficulties. To address these issues, we consider Algorithm 1. It involves the following key steps:
• True variables y are sampled randomly from a dataset which is only required to contain true variables. In practice, the projection can be applied beforehand and the dataset {T(y(t))}, t ∈ [0, T], can also be built from projected true states. Note that the outer loop iterates over the entire trajectories for each epoch.
• Batches are discarded when the Courant-Friedrichs-Lewy (CFL) number is greater than some threshold (Line 8), commonly chosen to be 1. Incorrect predictions from the NN, especially during the first epochs of the training process for PDE problems at the limit of numerical stability, can lead to numerical blow-ups of the system and, by consequence, exploding gradients for the minimization algorithm.
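The sketch below illustrates the spirit of Algorithm 1, reusing the `rk4_step` integrator sketched earlier; the CFL estimate is a deliberately crude proxy based on the vorticity magnitude, and all defaults and interfaces are illustrative rather than the values actually used:

```python
import torch

def train_a_posteriori(model, loader, g_tendency, dt, n_steps,
                       epochs=5, lr=1e-4, cfl_max=1.0, dx=1.0):
    """Sketch of Algorithm 1: unroll the reduced solver with the neural
    SGS term, accumulate the vorticity MSE of Eq. (32), and discard
    batches whose (crudely estimated) CFL number exceeds cfl_max.
    `loader` yields projected true trajectories of shape
    (batch, n_steps + 1, channels, N, N)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    closure = lambda s: g_tendency(s) + model(s)   # g + M in Eq. (1)
    for _ in range(epochs):
        for traj in loader:
            state = traj[:, 0]              # y_bar(t0) = T(y(t0))
            loss = 0.0
            for n in range(1, n_steps + 1):
                state = rk4_step(state, closure, dt)
                loss = loss + torch.mean((state - traj[:, n]) ** 2)
            # crude stability proxy; a real implementation would use
            # the velocity-based CFL number (Line 8 of Algorithm 1)
            if state.abs().max() * dt / dx > cfl_max:
                continue                    # skip exploding roll-outs
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```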

Results
In order to evaluate the performance of the a priori and a posteriori learning strategies, we report numerical experiments for three different configurations of QG flows (see Table 1). For these three cases, we consider the following experimental setup. The training and test data involve respectively 10 and 5 direct numerical simulations (DNS) corresponding to the same configurations with different initial conditions. The reduced systems are run with δ = 16, i.e. the reduced grid is 16 times coarser than the true grid resolution in each direction. Reduced systems are integrated for 6000 iterations for the non-stationary decay cases and 18000 iterations to determine long-term statistics of the forced and beta-plane configurations. We may emphasize that the simulations used for evaluation purposes are never seen during the training phase. The parameters of the different flows are shown in Table 1 in dimensionalized units. Overall, for each QG configuration, we report a quantitative synthesis for the cutoff and Gaussian filters and further illustrate the key features of the different learning strategies using a cutoff filter.
The important quantities discussed in the evaluation are both:
• the short-term temporal evolution of the quadratic invariants, energy (12) and enstrophy (13), typical of weather forecasting applications;
• the long-term statistics of the enstrophy spectrum (14) and enstrophy fluxes (17), relevant to climate applications.

Decaying turbulence
In the context of two-dimensional SGS parametrization with ML models, the decaying turbulence configuration is one of the most studied (Maulik et al., 2019; Pawar et al., 2020; Guan, Chattopadhyay, et al., 2022). This type of flow is particularly interesting because of its non-stationary nature, i.e. the system invariants are temporally varying. Similarly to Guan, Chattopadhyay, et al. (2022), we sample the initial vorticity fields randomly from a Gaussian distribution ω ∼ N(0, 1) at moderate wavenumbers k ∈ [10, 32] and integrate the system for 10000 iterations before reaching spectrum self-similarity (Batchelor, 1969).
From this starting time t0, the vorticity fields exhibit an early turbulence behavior with many fine structures (see Fig. 3, top). The DNS and reduced models with SGS parametrizations are run until t = 9.6, which is longer than the temporal horizon used in the training data by a factor of two. In two-dimensional decaying flows, we expect to see vortex pairing and the emergence of larger structures, as shown in Fig. 10 (left), which particularly highlights the dynamics of the smallest resolved scales.

Forced turbulence
The second case-study involves QG flows with a source term F designed to mimic wind-stress. We study a particular configuration inspired by Graham and Ringler (2013), which evaluated the performance of a large number of physics-based parametrizations in mesoscale ocean simulations. To reproduce these realistic equilibrium solutions, we use a bottom drag (µ > 0) and initiate turbulent mixing from a wind-stress slowly varying in time at large scale k = 4 with a steady enstrophy injection rate Z(F) = 3 such that

$$F_\omega(t) = \cos(4y + \pi \sin(1.4t)) - \cos(4x + \pi \sin(1.5t)). \tag{33}$$

In order to converge to a stationary turbulent state, we initialize the simulation runs from a few large-scale Fourier modes and spin-up on a smaller grid ($1024^2$) for over 500000 iterations. The initial conditions for training and evaluation (see Fig. 3, middle) are taken after energy and enstrophy propagation to the smallest scales of the true grid ($2048^2$) in about 25000 iterations.
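For reference, the forcing of Eq. (33) translates directly into code, here assuming 2D coordinate grids `x` and `y` (tensors) and a scalar time `t`:

```python
import math
import torch

def wind_forcing(x, y, t):
    """Time-varying large-scale wind-stress forcing of Eq. (33);
    x, y are 2D coordinate grids, t a (Python) scalar time."""
    return (torch.cos(4.0 * y + math.pi * math.sin(1.4 * t))
            - torch.cos(4.0 * x + math.pi * math.sin(1.5 * t)))
```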
To evaluate the long-term performance in the forced configuration at equilibrium, we run simulations until t = 2.88, which is at least 3 times longer than the complete decorrelation time of the system due to chaos. We report the vorticity fields at the end of the simulations in Fig. 10 (center). The true vorticity state exhibits both large vortices generated by the wind forcing and small filaments in between. Overall, we draw conclusions similar to the decaying turbulence regime. We note that the small structures are inaccurately predicted for both $M_{\mathrm{DynSmagorinsky}}$ and $M_{\mathrm{DynLeith}}$ due to dissipation, and for the a priori model due to numerical instabilities. By contrast, the a posteriori model accurately reproduces these small-scale structures. In this statistically stationary regime, the invariants are expected to remain constant over time. This property is verified on the kinetic energy for both NN-based models, but the enstrophy of the a priori-trained model increases over time, which indicates some accumulation of small-scale energy (see Fig. 6) and may result in a potential future blow-up of the simulation. However, the time-averaged statistical enstrophy spectrum shown in Fig. 7 demonstrates the ability of the a posteriori model to accurately reproduce both the smallest and the largest scales of the simulation (small wavenumbers) compared to the other models.

Beta-plane turbulence
The third case-study runs the same forced simulation as in the previous configuration, complemented by a beta-plane effect to account for the meridional variation of the Coriolis force caused by the planet's spherical shape. We take a Rossby parameter β corresponding to Earth's planetary rotation at mid-latitudes (60°).
The beta-effect has an important impact on the topology of the dynamics, as it creates high-velocity longitudinal jets, as seen in Fig. 3, bottom. In this simple setting without topography (flat bottom layer), the system does not go through state transitions but remains in a statistical equilibrium. The strong vorticity gradients in between jets, seen in Fig. 10 (right), are predicted more accurately by $M_{\mathrm{DynLeith}}$ than by $M_{\mathrm{DynSmagorinsky}}$, which still over-dissipates at small scales. The model trained a priori is not stable at all in this configuration, while the a posteriori model remains stable and simulates visually-consistent patterns.
The instabilities visible in the vorticity field of the a priori model are explained by the increasing energy and enstrophy in Fig. 8. This reveals a non-conservative behavior which leads to a simulation blow-up. The a posteriori strategy performs extremely well on this long-term simulation, predicting in particular a correct enstrophy evolution compared to that of the DNS. The enstrophy spectrum and fluxes (see Fig. 9) are similar to those of the forced turbulence configuration, except that the linear damping has a stronger impact, due to the relative increase in velocity from the beta-effect. Overall, the conclusions are the same as for the previous two configurations, with the highest-fidelity small-scale dynamics being produced by the model trained using the a posteriori strategy.

Table 3. Long-term performance of the considered SGS parametrizations in the three different configurations with both cutoff and Gaussian projection kernels. We compute the L2 distance between the reference enstrophy fluxes and the ones simulated using the different SGS parametrizations. The a posteriori learning strategy clearly leads to much better scores.

Quantitative analysis
As a quantitative synthesis of our numerical experiments, we first report the performance on two metrics for the three case-studies and the different models: an a priori metric given by the Pearson correlation coefficient of the predicted SGS terms (Table 2) and an a posteriori metric given by a variant of the error-landscape enstrophy flux assessment presented by Meyers (2011) (Table 3). We report these performance metrics for both cutoff and Gaussian filters.
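Both metrics are straightforward to compute; a minimal sketch (assuming the SGS terms and the binned enstrophy fluxes are available as tensors) could read:

```python
import torch

def a_priori_correlation(r_true, r_pred):
    """Pearson correlation between true and predicted SGS terms
    (a priori metric, Table 2)."""
    rt = r_true - r_true.mean()
    rp = r_pred - r_pred.mean()
    return (rt * rp).sum() / (rt.norm() * rp.norm())

def a_posteriori_flux_error(pi_ref, pi_model):
    """L2 distance between reference and simulated enstrophy fluxes
    Pi_Z(k) (a posteriori metric, Table 3)."""
    return torch.linalg.norm(pi_ref - pi_model)
```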
The selected learning strategy clearly impacts the corresponding metrics. For the three flow regimes, the a priori learning performs better on a priori metrics, whereas the a posteriori learning leads to the best score on a posteriori metrics. Concerning the projection kernel, physical models and the NN-based model trained a priori perform better with the Gaussian filter. Note that this is not surprising, as Gaussian filtering tends to smooth out discontinuities in the subgrid term (Canuto et al., 2007) which might help during a priori training. By contrast, the a posteriori learning scheme results in consistent scores for both the Gaussian and cutoff kernels.
We analyze the ability of the benchmarked models to reproduce modeled transfers in physical space. Fig. 11 shows the PDF of $T_{\bar{z}}$ for the three different configurations at the end of their respective trajectories. First, while we cannot explicitly identify forward and backward transfers, we can see that the resolved transfer term produced by the physical models is not symmetric, i.e. it produces a larger amount of negative values. This suggests that these models are not able to produce backscatter, by construction. The a posteriori model always performs better than the a priori one, with predictions that better reproduce the DNS on the tails of the distribution. However, even if the a priori model has been demonstrated to be unstable on the beta-plane configuration, its resolved transfer term remains close to the DNS, but overpredicts positive tails, which, again, could indicate an incorrect representation of energy backscatter. We can further analyze the instability related to $T_{\bar{z}}$ by looking at the evolution of its spatial average, as shown in Fig. 12. The resolved transfer term of the a priori model rapidly diverges from the DNS in the beta-plane configuration, which identifies the instability. As expected, the physical models produce large negative values at the beginning of the simulations, which can be explained by their highly dissipative nature. The a posteriori model, however, is the most accurate and remains close to the DNS throughout the trajectories.

Discussion
This study investigated different learning strategies to train subgrid-scale (SGS) parametrizations for two-dimensional quasi-geostrophic turbulent flows. While the state-of-the-art has mostly explored a priori learning schemes, our numerical experiments stress the significant improvement brought by the a posteriori learning strategy to better reproduce small-scale dynamics over large temporal horizons with great accuracy. For all the flow configurations and coarsening schemes considered in this study, SGS parametrizations trained according to an a posteriori training loss clearly outperform both physics-based and machine learning baselines.
The a posteriori learning strategy introduced in this paper opens the possibility to design stable subgrid parametrizations with more flexibility than state-of-the-art ML-based approaches. Indeed, while we have here explored a relatively simple a posteriori training loss given by a vorticity-based MSE, the a posteriori learning scheme offers a much greater flexibility for the exploitation and combination of different a posteriori metrics during the learning phase. Losses defined from classic performance metrics such as energy transfers and distributions seem particularly appealing. One may also explore application-specific metrics, including among others boundary-layer flows. As the a posteriori learning strategy results in an improved stability of the trained SGS parametrizations, it may also offer means to explore more complex neural architectures for SGS terms. Here, we considered a relatively simple ConvNet, but more complex and state-of-the-art neural architectures including for instance ResNet, UNet and transformer networks could be worth exploring. Another interesting avenue is the joint training of a posteriori models in the context of data assimilation, as described in Bonavita and Laloyaux (2020) and Farchi et al. (2021).
An interesting connection can indeed be made between a posteriori learning and variational data assimilation techniques. Our a posteriori learning algorithm formulates a variational problem which is formally equivalent to the strong constraint 4D-Var scheme (Blayo et al., 2015;Carrassi et al., 2018). But, in our case, the control vector is composed of the parameters of the neural network and observations are assumed to be perfect. The analogy between a posteriori learning and 4D-Var therefore brings the question of whether parametrizations, or more generally corrections to existing models, could be learned directly from sparse and noisy observations (Schneider, Lan, et al., 2017). In this sense, a posteriori learning is related to the bias correction methods that have been proposed in data assimilation (Dee, 2005), and especially the schemes proposed to infer state-dependent corrections to existing models (Griffith & Nichols, 2000;D'andrea & Vautard, 2000). Interestingly, this field has received renewed attention over recent years with several authors proposing to approach bias correction with ML (Bonavita & Laloyaux, 2020;Farchi et al., 2021). In this context, we stress that our approach is very similar to the scheme introduced by Farchi et al. (2021), with the noticeable difference that we here learn a correction through the 4D-Var scheme itself, and not from the increment of the assimilation scheme.
By construction, parametrizations learned with the a posteriori strategy should improve the short-term forecasting capabilities of the models. This is true in particular if the training loss is defined as the sum of forecasting errors over a given time horizon, as considered here. Interestingly, we noted here that for SGS parametrizations, this short-term forecasting performance translates into better long-term stability and representation of long-term flow patterns, the long-term horizon being several orders of magnitude greater than the time horizon in the training loss (18000 vs. 25 time steps). While recent studies have explored neural models for the short-term forecasting of realistic geophysical flows, especially for weather forecasting applications (Schultz et al., 2021; Weyn et al., 2021), we believe our study opens new avenues for the exploitation of learning-based components in climate-scale simulations, which remain an open challenge (Rasp et al., 2018). In this respect, to account for the chaotic nature of turbulent flows, a posteriori training losses could also benefit from statistical metrics as opposed to synoptic ones such as the MSE used in this work.
But a strong requirement of the proposed framework lies in the differentiability of the considered dynamical model, which may question its practical applicability. Indeed, most large-scale forward solvers in earth system models (ESM) rely on high-performance languages that do not embed automatic differentiation (AD) capabilities. While it is generally recognised that adjoint models are very useful additional tools for these solvers (Barkmeijer, 2009;Wunsch & Heimbach, 2013), adjoint operators are readily available only for a small fraction of them (Heimbach et al., 2005;Vidard et al., 2015). We stress that the emergence of a new generation of models written in differentiable programming languages such as JAX and JuliaDiff (Häfner et al., 2021;Ramadhan et al., 2020;Sridhar et al., 2021;Huang & Topping, 2021) naturally supports our contribution. Besides, deep differentiable emulators (Nonnenmacher & Greenberg, 2021;Hatfield et al., 2021;Kasim et al., 2021) that learn a differentiable approximation of a non-differentiable forward solver or of its adjoint may also open new avenues for the development of SGS parametrizations for state-of-the-art ESMs with a posteriori learning strategies.
Still, while one could benefit from the ongoing effort for differentiable ESMs, the development and the practical implementation of ML-based parametrizations in non-differentiable ESMs (Marshall et al., 1997;Madec et al., 2017) is also an important question. In this context, physics-based neural networks using a priori training schemes arise as relevant options, given the promising results already reported for the oceanic (Zanna & Bolton, 2021) and atmospheric (Gentine et al., 2021) parts of climate models. We also believe that the proposed a posteriori strategy combined with physics-based constraints will also be worth exploring in future work.

Data Availability Statement
The results can be reproduced from the data, models and associated learning algorithms provided along with the pseudo-spectral quasi-geostrophic code, available in