Turbulence Closure With Small, Local Neural Networks: Forced Two‐Dimensional and β‐Plane Flows

We parameterize sub-grid scale (SGS) fluxes in sinusoidally forced two-dimensional turbulence on the β-plane at high Reynolds numbers (Re ∼ 25,000) using simple 2-layer convolutional neural networks (CNNs) having only O(1000) parameters, two orders of magnitude smaller than recent studies employing deeper CNNs with 8–10 layers; we obtain stable, accurate, and long-term online or a posteriori solutions at 16× downscaling factors. Our methodology significantly improves the training efficiency and speed of online large eddy simulation (LES) runs, while offering insights into the physics of closure in such turbulent flows. Our approach benefits from extensive hyperparameter searching in learning rate and weight decay coefficient space, as well as the use of cyclical learning rate annealing, which leads to more robust and accurate online solutions than fixed learning rates. Our CNNs take either the coarse velocity or the vorticity and strain fields as inputs, and output the two components of the deviatoric stress tensor, S_d. We minimize a loss between the SGS vorticity flux divergence (computed from the high-resolution solver) and that obtained from the CNN-modeled S_d, without requiring energy- or enstrophy-preserving constraints. The success of shallow CNNs in accurately parameterizing this class of turbulent flows implies that the SGS stresses have a weak non-local dependence on the coarse fields; it also aligns with our physical conception that small scales are locally controlled by larger scales, such as vortices and their strained filaments. Furthermore, 2-layer CNN parameterizations are more likely to be interpretable.


1. Introduction
Turbulent flows in physical systems span a vast range of spatial and temporal scales; in the Earth's oceans, for example, mesoscale eddies, the dominant reservoir of kinetic energy in the ocean, have scales of O(100 km), whereas small-scale three-dimensional motions at the air-sea interface, driven by a combination of surface heating/cooling, wind action and the Earth's rotation, have scales of O(1 m). This vast range of spatial scales is well beyond what current numerical ocean models can resolve, even on the largest available compute facilities. Climate models face even greater challenges because they need to be run for years or decades to study climatic changes over long temporal horizons. In practice, these models are run at a resolution limited by available computational resources, while unresolved turbulent processes are effectively represented or parameterized. The turbulent motions that are below the resolution of numerical models and need parameterization or representation are commonly referred to as subgrid-scale (SGS) motions, with their effect being expressed as effective fluxes (SGS fluxes) that can be added to the equations of motion already solved by numerical models.
Historically, parameterizations of unresolved turbulent flows have been constructed through a combination of empirical data, physics-based modeling and simplified algebraic models fit using high-resolution process models run over short durations of time. For example, three-dimensional small-scale turbulence near the ocean surface is parameterized using the K-profile parameterization (Large et al., 1994) framework, which is an ad hoc combination of empirical and physics-based approaches that model various turbulent processes found at the air-sea interface, including surface convection and boundary layer rotating-stratified shear turbulence. Parameterizations are also found through numerical process studies (essentially idealized numerical simulations of a specific phenomenon), typically high-resolution Large Eddy Simulations (LES), used to fit simplified algebraic models based on a combination of dimensional analysis and physical models (Souza et al., 2020).
An alternative approach, spurred on by the development of machine learning methods, in particular neural network based Deep Learning methods, along with the availability of large amounts of data from numerical simulations, is to use high-resolution process studies to generate SGS fluxes that can be accurately modeled without recourse to simplified algebraic models. With the availability of frameworks that allow incorporating neural network (NN) models trained in the Python programming language into Fortran (Ott et al., 2020), the language of most weather, ocean and climate models, one can in principle accurately model SGS fluxes in climate models, substantially adding to their predictability and fidelity and reducing their biases. In practice, a major issue is that numerical models incorporating NN-based models face challenges in numerical stability (Brenowitz et al., 2020).
Neural networks are hierarchical non-linear functions of simpler modular units containing a large number of tunable parameters that can in principle represent any arbitrary non-linear function given sufficient data (Hornik et al., 1989; Zhou, 2020). Typically, complex models like these can over-fit a given dataset and fail to learn general relationships that are actually present in the data. However, a multitude of regularization techniques have been discovered over the past decade that allow NNs to learn non-linear relationships that generalize well to unseen data (Srivastava et al., 2014; Ioffe and Szegedy, 2015; Ba et al., 2016; Loshchilov and Hutter, 2016). Applications of NNs to dynamical modeling, in particular to the solution of complex dynamical equations that govern the climate system, face the additional requirements that the solutions of the equations be numerically stable and that they not accrue unphysical biases over long time horizons.
NN-based parameterizations in dynamical systems can be classified based on how the learning framework is designed. Online or a posteriori learning (Ma et al., 2018; Rasp, 2020) directly incorporates the NN-parameterization into the equations of motion; the equations of motion are time-stepped forward and the parameters of the NN are learned subject to the constraint that the NN-driven flow trajectory matches the outputs of a high-resolution numerical solution truncated to the resolution of the NN-based solution. This approach, a form of trajectory optimization common in imitation learning (Hussein et al., 2017), has the advantage that the NN-driven solution is physics-aware, and promises that well-optimized trajectories are not subject to biases absent from the high-resolution numerical solution. A drawback, however, is that the numerical framework needs to be entirely differentiable, because the errors in the trajectory need to be propagated through the numerical solver itself; this idea is closely related to adjoint models in data assimilation. Differentiable numerical models, for example for weather and climate systems, cannot be created by simple modification of existing climate and weather models and need to be rewritten from scratch, a complex undertaking. Furthermore, for complex nonlinear dynamical systems, the trajectories need to be rolled out for multiple time steps and multi-step losses need to be optimized to ensure accuracy and stability of the learned NN-based equations (Kochkov et al., 2021; Keisler, 2022; Frezat et al., 2022; List et al., 2022); this can substantially add to the computational burden and complexity of the training pipeline.
Offline or a priori learning uses high-resolution numerical models to diagnose all the SGS fluxes that are missing at lower resolutions. The diagnosed SGS fluxes are then directly modeled using an appropriate neural network architecture that takes the coarse-grid flow variables as inputs; the NN-training is accomplished using standard supervised training (through Maximum Likelihood Estimation). Methodologically this is a substantially simpler approach and places no restrictions on the numerical solver itself, as the numerical solution and the learning of the NN are decoupled. Consequently, successfully trained NN models can be directly incorporated into existing climate and weather models. Correspondingly, however, a serious disadvantage is that the NN model has no awareness of the dynamics of the underlying system, and small errors made by the NN SGS model can accumulate when incorporated into the numerical solver, resulting in numerical instabilities or biased solutions; the issue becomes more acute the longer the numerical model is run (Brenowitz et al., 2020; Rasp, 2020; Frezat et al., 2022). Some of the numerical stability issues can be alleviated through probabilistic modeling of the SGS fluxes instead of a fixed NN model (Perezhogin et al., 2023).
Both offline and online-trained models must be tested for online or a posteriori stability and accuracy over justifiably long time horizons. In other words, the training or learning process can be operated offline (i.e. direct supervised learning) or online, using a differentiable numerical solver, but all trained models must be deployed in the coarse-grained equations and tested for online stability and fidelity. We re-emphasize the distinction between the offline or online learning process and the actual deployment in the equations, which is, by definition, always online or a posteriori. Both offline and online NN-based learning approaches have other challenges; in particular, NNs are optimized for Graphics Processing Units (GPUs), which most climate system models are not currently designed to run on. In particular, modern deep NNs can have millions of parameters, and running them on CPUs can add a substantial overhead to the numerical solver, possibly negating the gains from building accurate SGS models.
1.1. Related works. Decaying two-dimensional turbulence is one of the most common model turbulent flow problems for the discovery and testing of data-driven closures and parameterizations (Maulik et al., 2019). The physics of decaying turbulence is well known (Brachet et al., 1988; McWilliams, 1984; Carnevale et al., 1991, 1992) and consists of a slow merger and interaction of vortices to form larger and larger vortices until the entire flow domain consists of a single large vortex. Recent studies have shown that even at high Reynolds numbers (Re ∼ 20,000), the SGS stresses in this problem can be accurately modeled with a modest number of data samples (Guan et al., 2022a; Frezat et al., 2022; List et al., 2022), leading to accurate and stable online coarse-grid solutions at 16× or greater grid downscaling factors.
Continuously forced 2D turbulence, however, is a substantially more complex dynamical problem (Xiao et al., 2009) that has a persistent dual cascade of energy: an inverse energy cascade from forcing scales to larger scales and a forward enstrophy cascade to smaller scales (Kraichnan, 1971). Only a handful of recent studies over the past two years have successfully obtained accurate, stable online closures for this and related problems. Guan et al. (2022b) studied the problem of a sinusoidally forced 2D turbulent flow solved in a doubly-periodic channel, also referred to as Kolmogorov flow in classical fluid mechanics, and demonstrated that offline learning could be effectively used to model the SGS stress using deep CNNs (having 10 layers, about 800,000 parameters) that resulted in stable online closures, provided a sufficient amount of training data was generated, using around 2 million high-resolution model time-steps. In practice they find that it is sufficient to train on only 2000 snapshots, chosen every thousand time-steps to ensure that the snapshots are uncorrelated with each other. Guan et al. (2022b) also found that exploiting the geometrical symmetry of the doubly-periodic computational domain using rotationally equivariant convolutions substantially reduced their data requirements, by a factor of 40. In a more recent study, Ross et al. (2023) considered the offline closure of a related idealized dynamical problem relevant to geophysical flows, namely two-layer quasi-geostrophic flow on the β-plane; for a brief description of the β-plane approximation, see Sec. 2.1. In their wide-ranging study, Ross et al. (2023) examined how various choices for the inputs, whether coarse-grained velocities or velocity gradients, the form of the output SGS fields and the precise choice of the downscaling affected the online accuracy of their offline-trained CNNs having 8 layers and around 300,000 parameters.
Unlike the standard approach of parameterizing the SGS stresses using NNs, Kochkov et al. (2021) parameterized the nonlinear advection term directly through a "neural discretization" method demonstrated on simpler dynamical problems in earlier papers. The Kochkov et al. (2021) approach was trained in an entirely online or a posteriori fashion through trajectory optimization. A follow-up paper (Dresdner et al., 2022) by the same group found accurate online-trained parameterizations of the same problem solved using spectral methods (instead of a finite-volume approach), though using a direct SGS-parameterizing CNN instead of the neural-discretization approach used in (Kochkov et al., 2021). A closely related study is by Frezat et al. (2022), which also employed an online-learned parameterization of forced β-plane turbulence along with standard two-dimensional turbulence, but at a substantially higher Reynolds number than (Dresdner et al., 2022). Both Dresdner et al. (2022) and Frezat et al. (2022) used deep CNNs with 8–16 layers, the former work employing the encoder-process-decoder models now gaining prominence in neural turbulence forecasting models (Stachenfeld et al., 2021; Keisler, 2022).
1.2. Current work. Our approach, presented in this study, derives from the aforementioned studies on the methodology of parameterizing sinusoidally forced two-dimensional and geostrophic turbulence on the β-plane at high Reynolds numbers (Re ∼ 25,000) using purely offline training of the SGS stress. However, there are substantial departures in the choices made here, which lead to significant improvements in the training and efficiency of the turbulent closures. These points are not only technical, but also lead to new insights into the physical and mathematical aspects of the closure problem for such turbulent flows. We provide below a list that summarizes the main contributions of this study.
(i) Accurate and efficient shallow CNN closures at high Re. We demonstrate that simple two-layer CNNs are sufficient to obtain stable and long-term accurate online solutions at 16× downscaling for high-Re turbulence.
The resulting CNN parameterizations of the SGS flux contain only O(10³) parameters, two orders of magnitude fewer than the aforementioned recent studies, leading in turn to substantially faster and more efficient online CNN-LES runs.

(ii) Hyperparameter space probing. We find the best models through extensive hyperparameter searching over the learning rate and weight decay coefficient involved in the optimization of the loss function given by (3.1) below. This probing operation is greatly facilitated by the choice of small CNNs used here.

(iii) Optimization strategy. We show that CNNs trained using cyclical learning rate annealing result in more robust and accurate online solutions than those trained using a fixed learning rate, as traditionally operated.

(iv) Physical variables. The inputs to our CNNs are either the coarse velocity field, (ū, v̄), or the vorticity and strain fields, (ω̄, σ̄_s, σ̄_n), and the outputs are the two components of the deviatoric stress tensor; we do not consider (ψ̄, ω̄) inputs because the streamfunction ψ is difficult to obtain in numerical models over complex spatial domains.

(v) Loss function. Our CNNs are optimized by minimizing a mean-square-error loss function that measures their parameterization defect with respect to the SGS term expressed via the deviatoric stress tensor, penalized by the sum of squares of the NN weights; see (3.1) below. We do not need to use the more elaborate physics-informed loss functions that have been used in recent work (List et al., 2022; Guan et al., 2022b).

(vi) Weakly non-local parameterizations and physical implications. The success of shallow CNNs in accurately parameterizing turbulence for the class of turbulence problems considered here has direct implications for our understanding of the physics of the problem itself. First, since the spatial non-locality of CNNs grows with depth, our closure results with shallow CNNs imply that the SGS stresses have only a weak non-local dependence on the coarse fields. Such weak spatial non-locality also means that our shallow CNN parameterizations are more amenable to embedding into climate models, which typically rely on spatial domain decomposition techniques for computation on large clusters; correspondingly, deep CNNs would require substantially larger data exchanges between computational nodes. Finally, recalling that deep NNs with a great number of parameters are typically required to approximate complicated nonlinear functions, we infer from the shallowness of our CNNs that the SGS stress must have a relatively simple nonlinear dependence on the coarse flow fields. This simple observation provides favorable ground for the nonlinear mapping underlying our shallow CNN parameterization (Π_CNN in Eq. (5.1) below) to be interpretable, which we leave for future work. The relative simplicity of the recent analytical operator forms obtained by Ross et al. (2023) using their hybrid genetic programming combined with a sparse linear regression approach is consistent with our findings.

2. The turbulence closure problem
2.1. Dynamical equations and turbulent regimes. We use a popular model fluid flow problem that exhibits complex turbulence phenomena and is closely related to large-scale turbulence in the Earth system and in planetary atmospheres, namely sinusoidally forced two-dimensional turbulence on the β-plane. The flow domain is specified in (x, y)-coordinates and consists of a square domain of size (L, L), with L = 2π. The governing equations of motion are the Navier–Stokes (N-S) equations, which describe the evolution of the flow velocity vector field, v(x, y, t) = (u(x, y, t), v(x, y, t)), at every point in the domain and in time. For the specific problem chosen here, the boundary conditions are chosen to be periodic, namely

u(x, y, t) = u(x + L, y + L, t),   (2.1)
v(x, y, t) = v(x + L, y + L, t).   (2.2)

Note that this domain is topologically equivalent to a two-dimensional torus. In two dimensions the N-S equations in the presence of background rotation are the evolution equations of the two components of the velocity field, v ≡ (u, v), where u and v are the velocities along the x- and y-directions, respectively. In non-dimensional form, this becomes

$$\frac{\partial \mathbf{v}}{\partial t} + (\mathbf{v}\cdot\nabla)\mathbf{v} + f\,\mathbf{k}\times\mathbf{v} = -\nabla p + \frac{1}{Re}\nabla^2\mathbf{v} - \mu\mathbf{v} + \mathbf{F}. \qquad (2.3)$$

Here f, the Coriolis parameter, is the local rotation rate of the Earth or some other planetary atmosphere, k is the unit vector normal to the (x, y)-plane, and F is the sinusoidal, time-invariant forcing field that continuously drives the flow; this specific form is chosen to be the same as in (Guan et al., 2022b), with the forcing wavenumber k_f = 4. The coefficient μ is the linear drag coefficient, which in geophysical flows purports to represent the effect of bottom friction, and Re is the Reynolds number measuring the strength of the non-linear advection term relative to the viscous term (the second term on the left-hand side relative to the second term on the right-hand side); we choose Re = 25,000.
The velocity field is also required to satisfy mass conservation, captured through the continuity equation, ∇ · v = u_x + v_y = 0. The continuity equation can be implicitly satisfied in two dimensions by defining a scalar field called the streamfunction, ψ(x, y, t):

$$u = -\psi_y, \qquad v = \psi_x. \qquad (2.4)$$

In two dimensions, these two equations along with the continuity constraint can be expressed as an evolution equation for a single scalar field, the vorticity, which is defined as the two-dimensional curl of the velocity field and captures the local rotation of a fluid parcel:

$$\omega = v_x - u_y = \nabla^2\psi. \qquad (2.5)$$

The vorticity evolution equation in two dimensions can then be written in the standard vorticity-streamfunction (ω, ψ) form as

$$\frac{\partial \omega}{\partial t} + J(\psi, \omega) + \beta v = \frac{1}{Re}\nabla^2\omega - \mu\omega + F_\omega, \qquad (2.6)$$

where the Jacobian, J(ψ, ω) = ψ_x ω_y − ω_x ψ_y, captures the non-linear advection term, while the vortical form of the forcing becomes

$$F_\omega = \nabla\times\mathbf{F}, \qquad (2.7)$$

with the curl understood as the scalar two-dimensional curl. The βv-term is an approximation relevant to geophysical systems and captures the effect of differential rotation experienced by the ocean and atmosphere in the Earth system in a tangent-plane approximation (β = ∂f/∂y). The presence of the β-term allows the flow to manifest a specific kind of wave common in planetary systems, the Rossby wave, which can substantially alter the turbulent flow dynamics relative to the flow that results when β = 0.
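Since ω = ∇²ψ on the doubly periodic domain, the streamfunction is recovered from the vorticity by a spectral Poisson solve, a step used throughout the pseudospectral method described next. Below is a minimal NumPy sketch of this inversion under the conventions of Eqs. (2.4)-(2.5); the function name and grid assumptions are ours, not the authors'.

```python
import numpy as np

def psi_from_omega(w, L=2 * np.pi):
    """Invert omega = laplacian(psi) spectrally on a periodic N x N grid."""
    N = w.shape[0]
    k = (2 * np.pi / L) * np.fft.fftfreq(N, d=1.0 / N)   # integer wavenumbers for L = 2*pi
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                     # avoid division by zero at the mean mode
    psi_hat = -np.fft.fft2(w) / k2     # psi_hat = -omega_hat / |k|^2
    psi_hat[0, 0] = 0.0                # fix the (arbitrary) mean of psi to zero
    return np.real(np.fft.ifft2(psi_hat))
```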
Eq. (2.6) is solved in Fourier space using a standard pseudospectral method (Orszag and Israeli, 1974); a semi-implicit Crank–Nicolson, 2nd-order Adams-Bashforth (CN-AB2) method is used for time-stepping, with a time step of ∆t = 5 × 10⁻⁵. Our baseline high-fidelity model is solved on a square N_0 × N_0 grid with N_0 = 1024, so that our grid spacing in physical space is ∆_DNS = 2π/N_0. In standard parlance, we refer to these solutions as Direct Numerical Simulations (DNS), because no parameterization is used to represent turbulence below the grid scale other than quadratic viscosity. Our value of the viscosity coefficient, i.e. 1/Re in our non-dimensional formulation, is chosen such that either decreasing the grid size while holding Re fixed, or increasing Re (decreasing viscosity) with the grid size fixed, will invariably lead to a rapid pile-up of energy at the smallest wavenumbers and eventually to numerical instability. As a comparison, the choice of Re ∼ 1000 in (Dresdner et al., 2022) allows the authors to obtain stable solutions without any parameterization for coarser grids of up to 128 × 128; their problem also involves a 16× downscaling from 1024 × 1024 to 64 × 64. Frezat et al. (2022) have a larger value of Re than the one used in this study in their 32× downscaling online closure, while Ross et al. (2023), in their study of two-layer QG turbulence (for a downscaling factor of 4× from 256 × 256 to 64 × 64), likely have a Re value in a similar range to Dresdner et al. (2022).
In the absence of the β-term, the flow is statistically homogeneous and isotropic in space and time. Figure 1a) shows a single snapshot of the vorticity field after t = 2 × 10⁶∆t; large coherent vortices, typically of length scales at or larger than the forcing scale, are prominent in a sea of fine filamentary structures. The flow kinetic energy (E = u² + v²) is concentrated in the coherent vortices as a consequence of the cascade of energy to larger scales, while both the vortex cores and the filamentary structures are significant reservoirs of the flow enstrophy (Z = ω²), the latter a consequence of the direct cascade of enstrophy to small scales (Kraichnan, 1971).
For finite values of β, however, the turbulent flow undergoes a symmetry-breaking instability in the y-direction, resulting in the formation of alternating banded zonal jets, i.e. jets along the x-direction. The mechanism of jet formation is not a simple one and is best described as a form of turbulent instability called zonostrophic instability (Farrell and Ioannou, 2007; Marston et al., 2008; Srinivasan and Young, 2012; Bakas et al., 2015). These jet-like structures are common in planetary atmospheres, the most striking example being the visible banded jet structures on Jupiter. Fig. 1(e) shows a snapshot of the jets formed for this specific choice of β = 20, for which the coherent vortices observed in the β = 0 case are no longer separately visible but now interact closely with the jets themselves, though the fine-scale filaments are clearly seen.

2.2. Downscaling.
In this section, we describe our approach to downscale or coarse-grain the high-fidelity DNS solutions described in the previous section to a lower-resolution grid. Downscaling is effected in spectral space using a cutoff filter, also known as a sharp spectral filter, which simply sets modes higher than a cutoff wavenumber to zero. For a downscaling factor of n_d, the grid spacing of the reduced-order model is thus ∆_c = n_d ∆_DNS and the number of grid points decreases to N_c = N_0/n_d. We choose n_d = 16, thus projecting the DNS solutions from a 1024 × 1024 grid to a 64 × 64 grid. In practice, before applying the cutoff filter, following Guan et al. (2022b), we first apply a Gaussian filter to smooth the fields somewhat. The reason for this choice is three-fold: First, Zhou et al. (2019) demonstrated that this procedure produces SGS fluxes that have a higher correlation with the coarse-grained field, making the learning problem easier. Second, this brings the spectral learning problem closer to the finite-difference and finite-volume approaches common in Earth system models, because localized finite-difference stencils can be represented as an effective exponential cutoff filter (Lele, 1992). Finally, using only the cutoff filter creates small-scale, spatially non-local features in physical space that have little physical structure and are difficult to learn due to the spectral bias of neural networks, which learn lower frequencies more easily than higher ones (Rahaman et al., 2019), especially in offline-learned parameterizations where there is no inherent dynamical awareness; see the discussion in Sec. 1. These reasons explain why applying Gaussian and exponential filters before the cutoff has become common practice in recent studies involving CNN-based parameterizations of two-dimensional turbulence; see e.g. (Guan et al., 2022a,b; Dresdner et al., 2022; Frezat et al., 2022). Denoting by ω̂(k_x, k_y) the Fourier transform of the vorticity, ω(x, y), the coarse-grained field (represented by an overbar) is written as

$$\hat{\bar{\omega}}(k_x, k_y) = \hat{G}_{\Delta_F}(\mathbf{k})\,\hat{\omega}(k_x, k_y) \ \ \text{for}\ |\mathbf{k}| \le k_c, \ \ \text{and } 0 \text{ otherwise}, \qquad (2.8)$$

equivalently a convolution G_{∆_F} * ω in physical space followed by the spectral cutoff, where * denotes the convolution operator, k_c = π/∆_c = N_c/2 is the cutoff wavenumber (for N_c = 64, k_c = 32), and the Gaussian filter is given by

$$\hat{G}_{\Delta_F}(\mathbf{k}) = \exp\!\left(-|\mathbf{k}|^2 \Delta_F^2/24\right), \qquad (2.9)$$

with ∆_F = 2∆_c; see (Guan et al., 2022a). The fields filtered from the DNS solutions onto the lower-resolution grid are referred to as the Filtered DNS (FDNS) fields. Fig. 1b) and f) display snapshots of the 16× downscaled fields (i.e. the FDNS fields) corresponding to the DNS snapshots in Fig. 1a) and e) for the cases β = 0 and β = 20, respectively. Interestingly, Ross et al. (2023) find that the above Gaussian filtering approach makes the learning problem more challenging, but we find few issues with either the offline learning of the CNN or its online deployment.
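A minimal NumPy sketch of this two-step downscaling (Gaussian pre-filter, then sharp spectral cutoff), assuming the Gaussian transfer function of Eq. (2.9); the function name and normalization details are ours:

```python
def coarse_grain(w, Nc=64, L=2 * np.pi):
    """Gaussian-filter then sharply truncate a periodic N0 x N0 field down to
    Nc x Nc (Eq. (2.8)); returns the FDNS field on the coarse grid."""
    N0 = w.shape[0]
    w_hat = np.fft.fftshift(np.fft.fft2(w))
    k = np.arange(-N0 // 2, N0 // 2) * (2 * np.pi / L)
    k2 = k[:, None] ** 2 + k[None, :] ** 2
    delta_f = 2 * (L / Nc)                      # Delta_F = 2 * Delta_c
    w_hat *= np.exp(-k2 * delta_f**2 / 24.0)    # Gaussian pre-filter, Eq. (2.9)
    c = N0 // 2                                 # sharp cutoff: keep |k| < k_c
    w_hat_c = w_hat[c - Nc // 2 : c + Nc // 2, c - Nc // 2 : c + Nc // 2]
    # factor accounts for FFT normalization when truncating to the coarse grid
    return np.real(np.fft.ifft2(np.fft.ifftshift(w_hat_c))) * (Nc / N0) ** 2
```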
2.3. Sub-grid scale stresses. We apply the filtering step in Eq. (2.8) to the momentum equation (2.3) and the corresponding vorticity equation (2.6). The momentum equation is then rewritten as

$$\frac{\partial \bar{\mathbf{v}}}{\partial t} + (\bar{\mathbf{v}}\cdot\nabla)\bar{\mathbf{v}} + f\,\mathbf{k}\times\bar{\mathbf{v}} = -\nabla \bar{p} + \frac{1}{Re}\nabla^2\bar{\mathbf{v}} - \mu\bar{\mathbf{v}} + \mathbf{F} - \nabla\cdot S, \qquad (2.11)$$

where the SGS momentum flux tensor (sometimes just referred to as the stress tensor) is

$$S = \begin{pmatrix} \tau_{uu} & \tau_{uv} \\ \tau_{uv} & \tau_{vv} \end{pmatrix}, \qquad (2.12)$$

where $\tau_{uu} = \overline{u^2} - \bar{u}^2$, $\tau_{uv} = \overline{uv} - \bar{u}\bar{v}$ and $\tau_{vv} = \overline{v^2} - \bar{v}^2$ are the three independent tensor components, S clearly being a symmetric tensor. We decompose Eq. (2.12) as

$$S = S_d + S_0 = \begin{pmatrix} (\tau_{uu}-\tau_{vv})/2 & \tau_{uv} \\ \tau_{uv} & (\tau_{vv}-\tau_{uu})/2 \end{pmatrix} + E\,I, \qquad E = \frac{\tau_{uu}+\tau_{vv}}{2}, \qquad (2.13)$$

where I is the identity matrix and E is the SGS kinetic energy. The first term on the right-hand side (RHS), S_d, is referred to as the deviatoric stress tensor and has only two independent components, with the third independent component now appearing as the magnitude of the diagonal tensor, S_0. Note that the forcing terms are not affected by the filtering operation, since we always choose our cutoff scale k_c > k_f.
Applying the filter, Eq. (2.8), to the vorticity equation, Eq. (2.6), we arrive at the "filtered" equation corresponding to (2.11) in the form

$$\frac{\partial \bar{\omega}}{\partial t} + J(\bar{\psi}, \bar{\omega}) + \beta\bar{v} = \frac{1}{Re}\nabla^2\bar{\omega} - \mu\bar{\omega} + F_\omega + \Pi, \qquad (2.14)$$

where

$$\Pi = J(\bar{\psi}, \bar{\omega}) - \overline{J(\psi, \omega)} \qquad (2.15)$$

is the SGS vorticity flux divergence, which can be written in two other alternative forms. The first form expresses Π in terms of the divergence of the SGS vorticity flux,

$$\Pi = -\nabla\cdot\left(\overline{\mathbf{v}\omega} - \bar{\mathbf{v}}\bar{\omega}\right). \qquad (2.16)$$

The second form, more directly relevant to the approach pursued in this manuscript, relates Π to the SGS momentum flux tensor, S:

$$\Pi = -\nabla\times(\nabla\cdot S) = -\nabla\times(\nabla\cdot S_d), \qquad (2.17)$$

where the curl operation above is the two-dimensional (scalar) curl. The second equality above results because S_0 is a diagonal tensor whose divergence is the gradient ∇E, and it follows that ∇ × (∇ · S_0) = 0. Fig. 1d) and h) show snapshots of Π for the cases β = 0 and β = 20, respectively; the corresponding SGS vorticity fields, ω′ = ω − ω̄, are also shown in the same figure for reference. There is a great deal of correlation between ω′ and Π, but this is, evidently, not a simple dependence.
Having a data-driven model of Π allows us, in principle, to solve Eq. (2.14), though the question arises as to which form of Π above should be used for modeling. The most common form is the one in Eq. (2.15), where Π is directly modeled using a neural network (Maulik et al., 2019; Guan et al., 2022b,a; Frezat et al., 2022). An alternative approach might be to model the two components of the SGS vorticity flux vector, $\overline{\mathbf{v}\omega} - \bar{\mathbf{v}}\bar{\omega}$, and compute its divergence to obtain Π as in Eq. (2.16). The approach adopted here is inspired by (Zanna and Bolton, 2020), namely to model the components of the SGS momentum flux, S, and compute Π using Eq. (2.17). Zanna and Bolton (2020) used a single convolutional neural network to model the three components of S, namely τ_uu, τ_vv and τ_uv. We propose in this study a variation of this approach by learning a CNN approximation of the deviatoric stress tensor's two components, (τ_uu − τ_vv)/2 and τ_uv, which, as pointed out above, is sufficient to obtain Π; see Eq. (2.13).
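The curl-of-divergence operation in Eq. (2.17) is diagonal in spectral space. Below is a minimal NumPy sketch of recovering Π from the two learned components of S_d, using the sign convention of Eq. (2.17) as reconstructed above; names are illustrative.

```python
def pi_from_sd(sd00, sd01, L=2 * np.pi):
    """Compute Pi = -curl(div S_d) pseudospectrally (Eq. (2.17)), where
    sd00 = (tau_uu - tau_vv)/2 and sd01 = tau_uv on the coarse grid."""
    N = sd00.shape[0]
    k = (2 * np.pi / L) * np.fft.fftfreq(N, d=1.0 / N)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    s00_hat, s01_hat = np.fft.fft2(sd00), np.fft.fft2(sd01)
    # div S_d = (dx s00 + dy s01, dx s01 - dy s00), since S_d is trace-free
    div_x = 1j * (kx * s00_hat + ky * s01_hat)
    div_y = 1j * (kx * s01_hat - ky * s00_hat)
    # scalar 2D curl of a vector field A: dx(A_y) - dy(A_x)
    pi_hat = -(1j * kx * div_y - 1j * ky * div_x)
    return np.real(np.fft.ifft2(pi_hat))
```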
2.4. The locality of the resolved and SGS scale interactions. The interactions between the resolved fields and the SGS motions are only weakly non-local (Eyink, 2005; Eyink and Aluie, 2009). This property can, e.g., be inferred by examining the two-dimensional spatial cross-correlations between ω̄ and Π. For reasons of efficiency, the spatial cross-correlation is computed for each snapshot using fast Fourier transforms through the cross-correlation theorem (a generalization of the Wiener–Khinchin theorem) (Fisher, 2008), and then averaged across time. In Fig. 2, we highlight the x- and y-sections of the cross-correlation. When β = 0, the cross-correlation is essentially isotropic and examining any one of these sections suffices; it can be observed in Fig. 2 that the correlation is weak beyond a 7-point region in space. When β = 20, the along-jet correlation mirrors the β = 0 result, but the cross-jet correlation length is larger; this relates to interactions and coupling between the jets composing the flow. Even in this case, however, the highly correlated regions remain relatively local. This observation motivates the choice of the neural network architecture employed in this manuscript, as detailed subsequently.

Figure 2. x- and y-sections of the spatial cross-correlations between Π and the resolved vorticity field, ω̄. For β = 0, the correlations along both directions are nearly symmetric, so only one of them is shown. For finite β, the cross-jet correlation length (along y) is longer than the along-jet correlation length, which is comparable to that in the β = 0 case.
As a side remark, we emphasize that the autocorrelation of the resolved vorticity highlights the average size of the eddies in the β = 0 case and the jet size when β = 20 (red curves in Fig. 2).

2.5. Galilean invariant SGS models. The equations of motion, Eqs. (2.3) and (2.6), are Galilean invariant, i.e. invariant to changes in the inertial frame of reference. To demonstrate this, and without loss of generality, we can rewrite the equations in a reference frame (denoted by primes) translating with constant velocity V ≡ (U, V) along the x- and y-directions, i.e.

$$x' = x - Ut, \quad y' = y - Vt, \quad t' = t, \qquad (2.18)$$
$$u' = u - U, \quad v' = v - V. \qquad (2.19)$$

Note that the vorticity is invariant to a uniform translation, while the other quantities are not; this is because the vorticity is comprised of gradients of velocity. In general, all four possible gradients of velocity are invariant to Galilean transformations, as are linear combinations of these velocity gradients, including ω and the two strain components, the normal and shear strains, defined respectively as

$$\omega = v_x - u_y, \qquad (2.20)$$
$$\sigma_n = u_x - v_y, \qquad (2.21)$$
$$\sigma_s = v_x + u_y. \qquad (2.22)$$

(Note that the fourth linear combination, the divergence, δ = ū_x + v̄_y = 0 due to the continuity relation.) An important consequence of Galilean invariance is that the filtered equations (2.11) and (2.14) must also be Galilean invariant (the filtering operator being independent of time), and consequently so must be the SGS fluxes, S and Π.
Turbulence modeling or closure refers to the modeling of the SGS fluxes, either Π directly, or S followed by Eq. (2.17) to obtain Π, as functions of the coarse field variables, although different choices for the input variables are possible, as mentioned above. A natural choice in solving the filtered vorticity equation (2.14) is to choose the input coarse field variables to be (ψ̄, ω̄). This choice, made in (Maulik et al., 2019; Guan et al., 2022a,b; Frezat et al., 2022), consists of Π modeled directly as Π = Π_θ(ψ̄, ω̄), in which θ, following common parlance, represents the set of parameters comprising the model, which can range from a simple linear regression model to more complex choices like neural networks. Alternatively, one may choose the coarse-grained velocities as inputs (Dresdner et al., 2022); Zanna and Bolton (2020) use (ū, v̄) as inputs and model the three components of S = S_θ(ū, v̄) to obtain Π through Eq. (2.17). However, it should be noted that none of these choices is Galilean invariant, because a change of the inertial frame of reference changes (ψ̄, ū, v̄) as per Eqs. (2.18)-(2.19):

$$\bar{\psi}' = \bar{\psi} + Uy - Vx, \quad \bar{u}' = \bar{u} - U, \quad \bar{v}' = \bar{v} - V; \qquad (2.23)$$

thus, a priori, models that are functions of the above variables will not be Galilean invariant, though given sufficient data, an NN with a sufficiently large number of parameters would likely learn this physical symmetry. However, it is possible to make a simple but sufficient modeling choice that implicitly assumes Galilean invariance. For the two components of the deviatoric stress tensor, S_d, we aim at finding the parameterization

$$S_d = S_d^{\theta}(\bar{\omega}, \bar{\sigma}_n, \bar{\sigma}_s), \qquad (2.24)$$

where θ represents the set of all parameters of the CNN. Because (ω̄, σ̄_n, σ̄_s) represent all possible linear combinations of gradients of the velocity, adding a uniform velocity leaves them trivially invariant. We also learn below non-Galilean-invariant models, following Dresdner et al. (2022); Zanna and Bolton (2020); Ross et al. (2023), as

$$S_d = S_d^{\theta}(\bar{u}, \bar{v}). \qquad (2.25)$$

We choose not to learn neural models that use (ψ̄, ω̄) as inputs (as e.g. in Guan et al. (2022a); Frezat et al. (2022)) for the simple reason that the streamfunction is generally not available in realistic geophysical models, making such choices hard to generalize in practice. It is indeed important to note that on the spectral plane the differential operators involved are often diagonal, but in general the solution of an expensive Poisson equation is required to obtain ψ̄.
2.6. Baseline parameterizations. Following recent studies (Guan et al., 2022b; Frezat et al., 2022; Ross et al., 2023), we choose the parameterizations of Smagorinsky and Leith as baselines. These models see widespread use both in contemporary LES studies and in realistic numerical models currently used for the climate system (Bachman et al., 2017; Pearson et al., 2017). The Smagorinsky parameterization is generally implemented in the momentum equations, (2.11), and parameterizes S_d in terms of the coarse strain tensor. However, we choose the version of Smagorinsky that is implemented directly in terms of the coarse vorticity field, ω̄ (Maulik et al., 2019), and in divergence form (Frezat et al., 2022); note that the momentum and vorticity forms of Smagorinsky are not equivalent. Both of the above parameterizations are diffusive parameterizations and assume that Π can be represented in diffusive form as

$$\Pi = \nabla\cdot(\nu_e \nabla\bar{\omega}), \qquad (2.26)$$
where the only unknown is the 'eddy' diffusivity, ν_e. The Smagorinsky parameterization models ν_e as

$$\nu_e = (c_s \Delta_c)^2 |\bar{S}|, \qquad (2.27)$$

where $|\bar{S}| = (\bar{\sigma}_n^2 + \bar{\sigma}_s^2)^{1/2}$ is the strain magnitude and ∆_c is the grid size of the coarse grid. The Leith parameterization takes the form

$$\nu_e = c_l \Delta_c^3 |\nabla\bar{\omega}|. \qquad (2.28)$$

Note that both of the above parameterizations are Galilean invariant, as explained in the preceding section. Following (Ross et al., 2023), we vary the constants c_l and c_s in [0.01, 1.0] and run coarse-grid solutions with the Smagorinsky and Leith parameterizations to similarly long times as the CNN-LES to allow for a direct comparison. As a foreshadowing of the results ahead, we note that in the vorticity form expressed above, the flows obtained from the Leith and Smagorinsky models are extremely similar when c_s = c_l.
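For concreteness, a short sketch of the two eddy viscosities (Eqs. (2.27)-(2.28)); the strain components and vorticity gradients are assumed precomputed on the coarse grid, and the function names are ours:

```python
def smagorinsky_nu(sigma_n, sigma_s, cs, delta_c=2 * np.pi / 64):
    """Eq. (2.27): nu_e = (cs * Delta_c)^2 |S|, |S| = sqrt(sigma_n^2 + sigma_s^2)."""
    return (cs * delta_c) ** 2 * np.sqrt(sigma_n**2 + sigma_s**2)

def leith_nu(domega_dx, domega_dy, cl, delta_c=2 * np.pi / 64):
    """Eq. (2.28): nu_e = cl * Delta_c^3 |grad omega|."""
    return cl * delta_c**3 * np.sqrt(domega_dx**2 + domega_dy**2)
```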

3. Machine learning methodology
3.1. CNN architecture and non-locality. We employ an entirely offline machine learning pipeline to learn the SGS fluxes in this manuscript. As described previously, to solve the filtered vorticity equation, Eq. (2.14), we model S_d using a Convolutional Neural Network that takes the filtered field variables as inputs, either (ū, v̄) or (ω̄, σ̄_n, σ̄_s); Π is then obtained from S_d using Eq. (2.17). Convolutional layers are translationally equivariant and a natural choice because the Navier–Stokes equations share the same symmetry. Recent studies (Guan et al., 2022a,b; Dresdner et al., 2022; Frezat et al., 2022; Ross et al., 2023) employ deep CNNs for parameterization, consisting of a few hundred thousand to a million parameters. We find, however, that shallow CNNs, in particular simple two-layer neural networks with only O(1000) parameters, are sufficient to model the SGS stresses. Our precise architecture is shown in Fig. 3. The input layer consists of 2 or 3 inputs, depending on whether the velocities or the combination of vorticity and strains are chosen as inputs; the sole hidden layer consists of N_f convolutional filters, and the output layer has only the two independent fields that constitute S_d, namely (S_d^{00}, S_d^{01}) = ((τ_uu − τ_vv)/2, τ_uv); see Eq. (2.13). The filter width, commonly referred to as the kernel size, of each convolutional filter in each layer is set to k_s = 5. N_f is a key hyperparameter, which we choose to be either 8, 16 or 32; the total number of parameters for these choices is 800, 1600, and 3200, respectively, for the case of (ū, v̄) inputs. In either case, the number of convolutional filters is unusually low compared to the norm; our objective is to find the smallest possible NNs that lead to accurate and stable solutions, and we show that these actually suffice. We employ a single swish activation function, S(ξ) = ξH(ξ), where H(ξ) is the sigmoid function, after the first layer. Swish tends to work better than ReLU across a number of challenging data sets (Ramachandran et al., 2017).
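A minimal PyTorch sketch of this architecture; the class and variable names are ours, and circular padding is one natural way to respect the doubly periodic domain (the paper does not specify its padding choice):

```python
import torch
import torch.nn as nn

class ShallowSGSCNN(nn.Module):
    """Two-layer CNN: 2 (or 3) coarse input fields -> the two independent
    components of the deviatoric stress, (S_d^00, S_d^01)."""
    def __init__(self, n_in=2, n_filters=8, ks=5):
        super().__init__()
        self.conv1 = nn.Conv2d(n_in, n_filters, ks, padding=ks // 2,
                               padding_mode="circular")
        self.conv2 = nn.Conv2d(n_filters, 2, ks, padding=ks // 2,
                               padding_mode="circular")
        self.act = nn.SiLU()   # swish: S(x) = x * sigmoid(x)

    def forward(self, x):      # x: (batch, n_in, 64, 64)
        return self.conv2(self.act(self.conv1(x)))

model = ShallowSGSCNN(n_in=2, n_filters=8)
print(sum(p.numel() for p in model.parameters()))  # ~800 (810 counting biases)
```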
Two important points underline our choice of a shallow CNN, the first of which is the effective non-locality. This quantity is also known as the receptive field (RF) of a CNN in the computer vision literature. It is defined as the set of points in the input domain connected to the center of the point region that the CNN is currently operating on; alternatively, it is the effective filter width of the CNN. For example, a single convolutional layer with a 5×5 filter has a trivial 5×5 = 25 point RF, i.e. an effective filter width of 5. Because NNs are compositions of layers comprising convolutional filters, the RF is a linear function of the number of layers. Thus a two-layer CNN with 5×5 filters has an RF of 9×9 points, i.e. an effective filter width of 9. As a comparison, the 10-layer CNN employed by Guan et al. (2022b) has an RF of 41×41, which on a 64×64 domain is nearly global in its non-locality. The study by Maulik et al. (2019) used spatially local NNs, in the form of a fully connected NN instead of a CNN, with a receptive field of only 3×3, too low for the current problem. Ross et al. (2023) employed an 8-layer CNN with an RF of 21×21; they examined the gradient of the output SGS fluxes with respect to the inputs at a single point for their β = 0 case and found non-zero values in only a 9×9 spatial region, consistent with the RF of the 2-layer CNN chosen in this study. However, we also find the non-trivial result that the same 2-layer CNN architecture suffices to accurately model the β = 20 jets case, as demonstrated in subsequent sections.
The RF is closely connected to the stencil size in numerical discretization and can be directly identified as such in the context of solving PDEs. While our choice of CNNs implicitly assumes non-locality, as in other studies of SGS closure, our closure results with 2-layer CNNs presented hereafter allow us to claim that, in contradistinction, the inherent degree of non-locality in the SGS parameterization is weak, even at high Reynolds number. (Note that a general n-layer CNN comprised of k_s = 5 filters has an effective width of n(k_s − 1) + 1 = 4n + 1.) This observation has implications for the deployment of SGS parameterizations in ocean and atmospheric models, which often use a spatial domain decomposition for solution on large compute clusters. A CNN with large spatial non-locality would pose severe limitations on the domain decomposition, which are alleviated by choosing shallow CNNs.
The second point of note is the effective non-linearity. Due to their hierarchical structure, deeper NNs are able to represent a more complex family of nonlinear functions than shallow NNs with the same number of parameters (Bengio and Delalleau, 2011): lower layers in the network learn simple features, while higher layers build upon these to represent more complex patterns and abstractions. Our choice of a 2-layer CNN therefore implicitly makes certain assumptions about the simplicity and sparsity of the relationships between the SGS fluxes and the input coarse flow variables. It is possible to increase the degree of nonlinearity of the CNN without increasing its RF or non-locality through 1×1 "convolution" layers, which are, strictly speaking, not convolutions but fully connected layers across the feature dimension. However, we do not find a need for such choices.

3.2. Data generation and loss function.
Since the output of our CNN is the deviatoric stress tensor, S_d, we use Eq. (2.17) to compute the SGS vorticity flux divergence used in Eq. (2.14). Our loss function is a simple L²-loss between the CNN parameterization of Π and that obtained from the FDNS, regularized as follows:

$$\mathcal{L}(\theta) = \left\| \Pi^D - \Pi\!\left(S_d^{\theta}\right) \right\|_2^2 + \lambda \sum_i \theta_i^2. \qquad (3.1)$$

Here, the norm ‖·‖₂ refers to the L²-norm over the flow domain, while the L²-penalty coefficient λ on the sum of squares of all the weights of the NN is referred to as the "weight decay" coefficient. Compared to recent studies (List et al., 2022; Guan et al., 2022b), the use of additional physics-informed losses to further constrain our offline training did not turn out to be an important ingredient for deriving accurate closures within our shallow CNN framework. As shown in subsequent sections, we find high-fidelity solutions without recourse to adding more physical constraints to the loss function.
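A sketch of one optimization step under the loss (3.1), using the ShallowSGSCNN sketch above; here the L² weight penalty is applied through the optimizer's weight_decay argument (an equivalent formulation of the penalty term), and pi_from_sd_torch stands for a hypothetical differentiable (e.g. torch.fft-based) version of Eq. (2.17):

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def train_step(inputs, pi_fdns):
    """inputs: coarse fields (batch, n_in, 64, 64); pi_fdns: target Pi from FDNS."""
    optimizer.zero_grad()
    sd = model(inputs)                 # predicted (S_d^00, S_d^01)
    pi_cnn = pi_from_sd_torch(sd)      # hypothetical differentiable Eq. (2.17)
    loss = torch.mean((pi_cnn - pi_fdns) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()
```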
To train our 2-layer CNNs, we generate four separate trajectories of high-resolution solutions with differing random initial conditions. Following the protocol of Guan et al. (2022b), we use weakly correlated snapshots that are 1000∆t iterations apart for training and testing, to promote diversity within the training dataset. We use 200 snapshots each from the first two trajectories as our training set, for a total of 400 snapshots corresponding to 400,000 high-resolution model iterations; 400 snapshots from the third trajectory form the test set; and the fourth trajectory (the validation trajectory) is used to initialize our CNN-LES runs and compare with the corresponding FDNS evolution. Since we wish to characterize the fidelity of our parameterized solutions over long time horizons, we run the fourth trajectory for T = 6 × 10⁶∆t, which is also the time for which we run our online CNN-LES solutions for validation.
A minor if important detail here is that we do not normalize our input or output data before applying the CNN, nor do we use any normalization layers like Batch Normalization (Ioffe and Szegedy, 2015).

3.3. Learning rate annealing and model selection.
The learning rate (lr) is a crucial hyperparameter in the optimization of neural networks. While the simplest training approach keeps the learning rate constant through the entire training process, specific forms of lr-annealing can substantially improve test accuracy. We choose cosine annealing with warm restarts (Loshchilov and Hutter, 2016) because of its tendency to provide strong implicit regularization and robust solutions across a wide range of hyperparameters.
Figure 4(a) shows the cyclically annealed lr as a function of epoch. Within this optimization framework, the lr is initiated at a maximum value (lr = 0.001 in Fig. 4) and decreased to zero via the cosine function over the course of 5 epochs, before suddenly ramping back to its original value. Typical test losses as a function of epoch are shown in Fig. 4b (for the annealed case, red curve) and Fig. 4c (for the fixed-lr case, green curve). For the annealed case, after the initial decrease, the test loss also rises but reaches a new local minimum after each cycle, with smaller loss function values than previously reached.
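In PyTorch, for instance, this schedule is available as CosineAnnealingWarmRestarts; a sketch matching the 5-epoch cycle of Fig. 4(a), reusing the optimizer and train_step sketches above (loader is a hypothetical DataLoader yielding input/target pairs):

```python
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=5, eta_min=0.0)
for epoch in range(50):
    for i, (x, y) in enumerate(loader):          # (coarse inputs, Pi from FDNS)
        train_step(x, y)
        scheduler.step(epoch + i / len(loader))  # fractional-epoch annealing
```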
For the fixed-lr case, however, the test loss decreases to a minimum value, at which the CNN model is chosen as the optimal one, but then becomes erratic as the number of epochs increases. We also show the lower-bound envelope of the test loss for each case (in blue); note that this curve is monotonically decreasing over a larger number of epochs in the cyclically annealed case. In each case, we select the CNN model with the least test loss over the entire duration of training, which corresponds to the right-most point on the blue curve, as marked by the red circle in Fig. 4b-c. Because of this specific model selection criterion, we do not need to monitor our training process; a consequence is that our test set needs to be diverse enough to reflect the dynamics' variability, as well as sufficiently large.

3.4. Hyperparameter grid search.
The other two central hyperparameters in the model optimization process are the batch size and the weight decay. Batch sizes are typically chosen to be larger than some threshold, but we choose the batch size to be unity for all our model training, which would normally lead to extremely slow and noisy training. The main reason this works is that, as explained in Sec. 3.1 above, the 2-layer CNN chosen here has an effective non-locality of 9×9, which is the region of the coarse-grid domain on which the CNN acts independently of the other parts of the domain. Since our coarse-grid domain has size 64×64, our effective batch size is actually 64²/9² ≈ 50. For sufficiently deep CNNs, the effective batch size and the actual batch size would be identical.
To find the best CNN parameterizations for our closure, we perform an extensive hyperparameter search in (lr, wd)-space. For each value of the hidden layer size N_f in {8, 16, 32}, we vary our learning rate lr in {10⁻⁴, 10⁻³, 10⁻²} and our weight decay coefficient wd in {10⁻⁵, 10⁻⁴, 10⁻³}. For each of these parameter choices, we train models with both fixed and cyclically annealed lr, where the chosen value of lr for the annealed case represents the maximum value of lr. For each of the above choices, we consider two classes of inputs, (ū, v̄) and (ω̄, σ̄_n, σ̄_s), as described in Sec. 2.5. Thus we train and test a total of 3³ × 2 × 2 = 108 CNN models (2 types of optimization: annealed or fixed lr; 2 types of inputs) for each of the β = 0 and β = 20 cases. Note that because our CNNs have such a small number of parameters, both our training times and our CNN-LES runtimes are extremely fast.
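The full sweep is a simple Cartesian product; a sketch, where run_experiment is a hypothetical driver that trains and evaluates one configuration:

```python
import itertools

grid = itertools.product(
    [8, 16, 32],           # N_f: hidden-layer filters
    [1e-4, 1e-3, 1e-2],    # lr (the maximum lr when annealed)
    [1e-5, 1e-4, 1e-3],    # wd: weight decay
    ["fixed", "cyclic"],   # lr schedule
    ["uv", "wss"],         # inputs: (u, v) or (omega, sigma_n, sigma_s)
)
for n_f, lr, wd, schedule, inputs in grid:   # 3^3 x 2 x 2 = 108 models
    run_experiment(n_f=n_f, lr=lr, wd=wd, schedule=schedule, inputs=inputs)
```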
Training each CNN model for 50 epochs takes only about 1.5 minutes on a V100 GPU, while running the CNN-LES model for T = 6 × 10⁶∆t takes about 20 minutes. Due to the 16× downscaling adopted here, the time-step of our downscaled run is 16 times larger, so this is equivalent to around 375,000 ∆t_LES. We note that 84 of the 108 cases for β = 0 and 87 of the 108 cases for β = 20 are numerically stable through the course of the entire online run. We also observed two other solutions that became numerically unstable after t = 5 × 10⁶∆t, illustrating the challenges of evaluating online stability.
3.5. Diagnostics. We evaluate the accuracy and fidelity of our online CNN-LES runs using a variety of metrics that quantify the structure and dynamics of the CNN-parameterized downscaled equations relative to the ground-truth FDNS solution. The kinetic energy spectrum, Ê(k), where the ̂ symbol indicates that the quantity is in spectral space and k² = k_x² + k_y² defines the radial wavenumber, is a fundamental metric for quantifying turbulent flows and dynamical regimes across spatial scales.
We can also write more dynamically relevant cross-scale kinetic energy and enstrophy fluxes associated with the SGS flux, defined in spectral space as

$$E_{\mathrm{flux}}(k) = -\mathcal{R}\!\left[\hat{\bar{\psi}}^*(\mathbf{k})\,\hat{\Pi}(\mathbf{k})\right], \qquad Z_{\mathrm{flux}}(k) = \mathcal{R}\!\left[\hat{\bar{\omega}}^*(\mathbf{k})\,\hat{\Pi}(\mathbf{k})\right],$$

where R denotes the real part, Π̂(k) the SGS vorticity flux divergence in spectral space, and z* the complex conjugate of z. Note that the above fluxes do not represent the total flux of kinetic energy and enstrophy in the CNN-LES solutions but only the contributions associated with Π. Π̂(k) is itself a metric that can be compared between the FDNS and the corresponding CNN-LES runs as a test of fidelity. We also construct the probability distribution functions of the coarse vorticity field, P(ω̄), and of the SGS flux, P(Π). For turbulent flows, these quantities are not Gaussian and can have strong tails, a problem exacerbated by the intermittency of two-dimensional and geostrophic turbulence caused by the persistence of long-lived vortices. This intermittency ensures that, without extremely long simulations, the tails of the distributions are difficult to capture accurately and remain noisy. In general, the standard approach for computing these quantities for turbulent flows is Kernel Density Estimation (KDE) (Guan et al., 2022b; Ross et al., 2023), which depends on the choice of the underlying kernel and the method used to obtain the best-fit distribution (Botev et al., 2010). However, by choosing to run our online validation solutions over long time horizons of size T = 6 × 10⁶∆t, we obtain well-defined probability density function (PDF) estimates by simply computing histograms, without recourse to KDE. We compute our spectra, fluxes and PDFs for the online CNN-LES solutions and the corresponding FDNS solution using 3000 snapshots separated by 2000∆t.
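A sketch of the two simplest of these diagnostics, the radially binned energy spectrum and the histogram-based PDF; the binning choices are illustrative:

```python
def energy_spectrum(u, v):
    """Kinetic energy spectrum E(k), binned over annuli of radial wavenumber k."""
    N = u.shape[0]
    u_hat, v_hat = np.fft.fft2(u) / N**2, np.fft.fft2(v) / N**2
    e2d = 0.5 * (np.abs(u_hat) ** 2 + np.abs(v_hat) ** 2)
    k = np.fft.fftfreq(N, d=1.0 / N)
    kr = np.sqrt(k[:, None] ** 2 + k[None, :] ** 2)
    return np.array([e2d[(kr >= kb - 0.5) & (kr < kb + 0.5)].sum()
                     for kb in np.arange(1, N // 2)])

def pdf(samples, bins=200):
    """PDF via a plain normalized histogram; no KDE needed for long runs."""
    hist, edges = np.histogram(np.ravel(samples), bins=bins, density=True)
    return 0.5 * (edges[1:] + edges[:-1]), hist   # bin centers, density
```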
For each of these spectra, fluxes and PDF quantities, we define metrics aimed at evaluating similarities between the FDNS and CNN-LES solutions. The choice of these metrics is related to those used by Ross et al. (2023). For the energy fluxes, we simply use the coefficient of restitution between the FDNS (superscript D) and online CNN-LES (superscript θ) quantities to measure their disagreement. The energy and SGS-divergence spectra of the online solutions are evaluated in similar fashion, though we replace the quantities themselves with their logarithms to ensure that errors at higher wavenumbers are adequately represented. Metrics that compare probability distributions are often referred to as divergences, and a variety of options exist. We choose an L²-divergence, which is simply the integrated square difference between the PDFs of the FDNS and the online CNN-LES,

$$\int \left(P^D(x) - P^{\theta}(x)\right)^2 dx.$$

Note that, to simplify notation, we have dropped the ̂ symbol over the energy, enstrophy and Π symbols.

4. Results

4.1. Offline accuracy. We evaluate how the CNN models perform in predicting the SGS flux divergence, Π, on the test set given the inputs, either (ū, v̄) or (ω̄, σ̄_s, σ̄_n). Ultimately, our objective is to construct computationally efficient neural parameterizations that lead to accurate and stable online solutions. However, it is helpful to examine offline test-set performance as a precursor to evaluating online performance and to potentially foreshadow a relationship between the two. Furthermore, the presence of a ground truth, absent in online solutions where the CNN-LES and FDNS trajectories diverge due to dynamical chaos, allows an examination of the nature of the modeling errors.
Given our large hyperparameter sweep in (lr, wd)-space, we choose to show offline model comparisons for the hyperparameter choices that lead to the best performance on the online metrics defined in Sec. 3.5, as detailed further in Sec. 4. The results are shown in Fig. 5, where we compare individual snapshots of Π^D from the test set and the CNN prediction error (Π^D − Π^θ) for different inputs, depending on whether or not learning rate annealing was used. The average test-set R², as indicated on the corresponding snapshots in Fig. 5, ranges from 0.73 to 0.8, but the prediction error shows surprisingly similar small-scale structure across the different models and learning rate modalities for both β = 0 and β = 20. This close similarity of the different models can be observed both in the power spectra and in the probability distribution functions, relative to those of the FDNS test data (bottom row of Fig. 5). From the spectra comparisons, we note that the disagreement between the model predictions and the data is limited primarily to the three largest wavenumbers, at the end of the spectrum. We found that these high-wavenumber signals cannot be modeled even with deep 10-layer NNs, so our choice of small-parameter CNNs is not the reason. The presence of these high-wavenumber components is a consequence of a leakage of the spectral cutoff operator that is not removed by the Gaussian filter.

4.2.1. Forecast accuracy. We start by examining the short-term forecast accuracy of our CNN-LES models. To do so, we run the vorticity closure equation, Eq. (2.14), initialized with the same snapshot from the FDNS validation dataset, and solve forward in time with either the parameterized CNN model learned from data or the Smagorinsky/Leith parameterizations (we refer to all these runs generically as LES runs). We then compare the short-term evolution of the precomputed FDNS solution with the corresponding online CNN-LES solution and the Smagorinsky/Leith runs in turn.
Adopting this protocol, the longer our LES runs remain correlated in time with the ground-truth FDNS, the better we consider the forecast accuracy of the LES to be. We use the standard Pearson correlation computed at each time between the FDNS and LES ω̄ snapshots, and visualize these as a function of time expressed in multiples of ∆t, recalling that ∆t_LES = 16∆t.
We then define a single metric for forecast accuracy, the decorrelation time, as the time after which the correlation between FDNS and LES drops below 0.96 (Dresdner et al., 2022); thus a longer decorrelation time implies better forecast accuracy. Given that we are dealing with chaotic turbulent flows, one can also define an intrinsic decorrelation time of the FDNS itself, evaluated by perturbing the initial DNS snapshot with noise and then measuring how fast the flow decorrelates from the unperturbed DNS. Ross et al. (2023) found that Smagorinsky models had close to the best forecast accuracy among all their LES models in spite of poorer long-term online performance. We therefore believe it suffices to compare the relative forecast performance of our CNN-LES runs with the Smagorinsky/Leith LES runs serving as a strong baseline, without concerning ourselves with the intrinsic decorrelation of the DNS itself.
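A sketch of this decorrelation-time diagnostic, assuming the two snapshot sequences are aligned in time:

```python
def decorrelation_time(omega_fdns, omega_les, threshold=0.96):
    """Index of the first snapshot where the Pearson correlation between
    FDNS and LES vorticity fields drops below the threshold."""
    for t, (a, b) in enumerate(zip(omega_fdns, omega_les)):
        r = np.corrcoef(a.ravel(), b.ravel())[0, 1]
        if r < threshold:
            return t
    return len(omega_fdns)   # never decorrelated within the record
```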
Figure 6 shows the resulting forecast results. First, we note that all four types of CNN-LES solutions (solid curves with solid markers), corresponding to the cases shown in the offline section (Sec. 4.1), have remarkably similar correlation curves. Furthermore, the CNN-LES solutions substantially outperform the Smagorinsky cases (dashed lines), taking almost twice as long to decorrelate from the FDNS runs (note that the time axis is logarithmic). A decrease of the Smagorinsky constant c_s results in an improvement in the forecast performance. However, as discussed in subsequent sections, decreasing the value of c_s below 0.025 (purple curve in Fig. 6) causes a small-wavenumber pile-up of energy, leading eventually to a deterioration of the long-term statistics compared to FDNS.
Similar observations hold for the case with β = 20, but unlike the β = 0 case, the forecast accuracy of the CNN-LES models trained with cyclical lr is demonstrably better than that of the CNN-LES models trained with fixed lr. In each case, the CNN-LES runs outperform the Smagorinsky and Leith runs in terms of forecast accuracy, with Leith's results very similar to Smagorinsky's (not shown).

Figure 8. Same as Fig. 7 but for β = 20. The case corresponding to a fixed lr and (ω̄, σ̄_s, σ̄_n) inputs is not shown because of poor accuracy. The inset in a) shows a blow-up of the low-wavenumber energy spectrum of each CNN-LES model. Note that unlike in Fig. 7, the CNN-LES model learnt with cyclic lr and (ū, v̄) inputs (blue curve with circles) substantially outperforms the other CNN-LES models in accuracy.

4.2.2. Long-term accuracy. Short-term forecasts are of direct interest to problems like weather forecasting, but long-term forecasts are more relevant for climate studies. There is no a priori reason to expect that high accuracy in the former implies the same for the latter, especially for data-driven models whose training can be highly task-specific. In fact, Dresdner et al. (2022) find that their turbulence-parameterized solutions are matched in forecast accuracy by a purely data-driven auto-regressive NN model based on the Encoder-Process-Decoder framework used in (Stachenfeld et al., 2021), a model that is ultimately unstable over long times. Our objective is to validate our CNN-LES setup for numerical stability and fidelity over long time scales, with the forecast accuracy being a mere side-effect. Given our aggressive choices regarding the size of the CNN, we run our models for substantially longer times than other recent works concerned with neural turbulence closures. Guan et al. (2022b) compute their online CNN-LES runs for 2 × 10⁶∆t (our computational setup is identical to theirs, except that our Re value is marginally larger), but we compute to T = 6 × 10⁶∆t, corresponding to 375,000 ∆t_LES. As explained in Sec. 3.5, this also ensures convergence of metrics like the probability distribution functions, which can now be computed directly without resorting to kernel density estimation.
In this section, we highlight the long-term accuracy of the four specific CNN models described in Sec. 4.1 and Sec. 4.2.1 for the six quantities chosen in Sec. 3.5 as diagnostics of solution fidelity; these are the one-dimensional time-averaged energy spectrum, E(k), the SGS divergence spectrum, |Π(k)|^2, the cross-scale energy and enstrophy fluxes, E_flux(k) and Z_flux(k), and the one-dimensional probability distribution functions, P(ω) and P(Π), computed over the course of the entire online run. A comparison of the FDNS runs with the corresponding four CNN-LES runs is shown in Fig. 7. We find that all four cases have good agreement across the six metrics shown, with the highest overall accuracy observed in the two cases with cyclical learning rate but different inputs. The SGS spectrum, |Π(k)|^2, and PDF, P(Π), have as high a degree of accuracy in the online runs as in the offline tests (Fig. 5), which also ties in with the high accuracy of the obtained structural flow metrics like E(k) and P(ω) and the dynamical metrics of the cross-scale fluxes. An implication of this result is that Galilean invariance is evidently not a particularly difficult physical symmetry to learn in these cases.
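For reference, a shell-binned one-dimensional energy spectrum of the kind used for E(k) can be computed from the coarse velocity as in the sketch below; the square 2π-periodic grid and unit-width shell binning are assumptions, not necessarily the exact binning used for our figures.

```python
import numpy as np

def energy_spectrum(u, v):
    """Shell-summed 1-D kinetic energy spectrum of a doubly periodic field.

    u, v : (n, n) velocity components on a square 2*pi-periodic grid.
    Returns integer wavenumbers k = 1..n/2 and E(k).
    """
    n = u.shape[0]
    uh = np.fft.fft2(u) / n**2
    vh = np.fft.fft2(v) / n**2
    e2d = 0.5 * (np.abs(uh)**2 + np.abs(vh)**2)   # spectral KE density
    kx = np.fft.fftfreq(n, d=1.0 / n)              # integer wavenumbers
    kmag = np.sqrt(kx[:, None]**2 + kx[None, :]**2)
    kbins = np.arange(1, n // 2 + 1)
    ek = np.array([e2d[(kmag >= k - 0.5) & (kmag < k + 0.5)].sum()
                   for k in kbins])                # sum over annular shells
    return kbins, ek
```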
However, these results do not translate to the case with β = 20 (Fig. 8), where the run corresponding to cyclically annealed lr and (ū, v) inputs (blue curves in Fig. 8) substantially outperforms the others on all metrics except P(Π). While the role of cyclical annealing in model robustness and fidelity is broadly expected, it is surprising that imposing Galilean invariance a priori in our modeling (through using (ω, σn, σs) as inputs) actually hinders online accuracy. The precise reason is unclear, but it evidently has to do with the presence of strong eddy-driven zonal jets in the β = 20 case that are absent when β = 0. This observation seems to imply that for geophysical turbulence it might be better to choose (ū, v) as inputs to the CNNs. It is also interesting to note that while Π(k) is accurately predicted, as is the vorticity distribution, P(ω), the tails of the distribution P(Π) are missed.
Interestingly, we observe that our CNN solutions have high online accuracy across both short and long time scales (i.e., across both "weather" and "climatic" regimes). Previous results do not report such solutions. While Dresdner et al. (2022) do not examine long-time fidelity, Frezat et al. (2022) find that their best long-term accurate solutions (trained using their fully online methodology) actually have poor forecast accuracy and lose out to an offline-trained CNN model that is numerically unstable at long times. Consistent with this result, Ross et al. (2023) find that their forecast accuracy does not in general correlate with long-term fidelity. To underscore this result further, we evaluate in Fig. 9 the online performance of the best CNN-LES solutions for the β = 0 and β = 20 cases (among the cases shown in Figs. 7 and 8) with reference to the various Smagorinsky cases shown earlier in Fig. 6. Note that the Smagorinsky solution with the best accuracy in capturing the distribution P(ω) (corresponding to c_s = 0.025) has a pile-up of energy at small scales (Fig. 9ab).
Furthermore, the Smagorinsky cases fail to capture the tails of the vorticity distribution that the CNN-LES solutions capture consistently; this is especially true for the jets case (Fig. 9f).
Our online accuracy for the jets case is comparable to that obtained by Frezat et al. (2022) using their online-learning framework. A similar attempt by Ross et al. (2023), using an offline-learning approach similar to ours to parameterize their corresponding jets case, was however unsuccessful: their CNN completely failed to model jets despite successfully solving their β = 0 case. Ross et al. (2023) use this result to remark on the lack of robustness of the data-driven CNN approach. Our results stand in contradistinction to theirs, and we thus do not share the same conclusions. It is unclear, though, why the results of (Ross et al., 2023) for the β-case differ from ours. While it is plausible that their quasi-geostrophic turbulence model is a more challenging test case, their Reynolds number is actually much lower than ours. The reasons behind the discrepancy between our online CNN-LES solutions and those of (Ross et al., 2023) may also be technical, such as sub-optimal hyperparameter searching, the depth of the CNNs used, and the choice of training procedures. We return to this issue in the Discussion section.
Finally, our best models, shown here and in the preceding two sections, are cherry-picked from a large set of models trained over a range of hyperparameters. In the next section we examine the behavior of our CNN-LES models across this hyperparameter space as a function of training methodology and input choices.

4.2.3.
Hyperparameter dependence of model fidelity.
In the preceding sections we have demonstrated that by searching the (lr, wd)-hyperparameter space we can obtain high-fidelity solutions with shallow CNNs. However, it is important to know how common or rare such high-fidelity solutions are and, in turn, how easily the most accurate online solutions can be identified among such a large number of runs. Such an evaluation, though seemingly mundane and somewhat tedious, is especially relevant for foreshadowing the challenges in parameterizing the more complex flow configurations observed in the climate system, for which such large hyperparameter searches might not be viable. In Sec. 3.5, we define six difference metrics based on the six quantities shown in Fig. 7 and Fig. 8 that quantify errors in the online CNN-LES solutions relative to the FDNS solutions; from these we construct a single aggregate metric,

total-diff = (spectral-diff + spectral-sgs-diff + enst-flux-diff + energy-flux-diff + distrib-sgs-diff)/5.    (4.1)

To understand how these metrics vary in the hyperparameter space, we plot the five metrics other than spectral-diff (which measures the accuracy of the energy-spectrum estimate) against spectral-diff. This helps us assess how the various metrics relate (implicitly) to each other by observing their variation against spectral-diff. Figures 10 and 12 show the results for β = 0 and β = 20, respectively. We examine the two cases separately as the results show different variations of the metrics across the (lr, wd)-space. We start by examining the total-diff metric (Fig. 10f), which highlights two important points. First, the cyclically annealed cases (red markers) are more tightly clustered and closer to the origin (i.e., lower errors), while the fixed lr cases (green markers) have a much larger spread in performance. This indicates that the cyclic lr procedure not only leads to more robust solutions, producing a narrower range of behaviors as the hyperparameters are varied, but also to more accurate solutions overall. These properties are even more pronounced for β = 20 (Fig. 12f), where the lower accuracy and robustness of the fixed lr cases stand out even more starkly. Second, within the cyclic lr cases (red markers), solutions that use (ω, σn, σs) as inputs (red circles) perform best when β = 0, though not by much. However, when β = 20, almost all the best solutions are for the cases that use (ū, v) as inputs (red triangles). The cyclic lr cases also have substantially lower flux errors (Fig. 12cd) compared to the fixed lr cases, though the other metrics are harder to distinguish. Curiously, for both β = 0 and β = 20, the fidelity of P(Π) (Fig. 10e and Fig. 12e) seems to have the opposite relationship with the flow metrics, with the fixed lr cases having higher accuracy in this metric. We believe this is not a significant issue, since tail errors in P(Π) are non-negligible even for the offline cases (Figs. 5k and 5l). The cyclic lr cases, however, outperform on the spectral-sgs-diff metric (Fig. 10e and Fig. 12e).
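A minimal sketch of the aggregate scoring in Eq. (4.1), assuming the five component metrics have already been computed as scalars (their individual definitions live in Sec. 3.5 and are not reproduced here):

```python
def total_diff(metrics):
    """Aggregate online error, Eq. (4.1): the mean of the component
    difference metrics, each computed against the FDNS reference."""
    keys = ('spectral-diff', 'spectral-sgs-diff', 'enst-flux-diff',
            'energy-flux-diff', 'distrib-sgs-diff')
    return sum(metrics[k] for k in keys) / len(keys)

# usage with hypothetical, precomputed metric values:
# scores = {'spectral-diff': 0.02, 'spectral-sgs-diff': 0.05,
#           'enst-flux-diff': 0.11, 'energy-flux-diff': 0.08,
#           'distrib-sgs-diff': 0.04}
# print(total_diff(scores))
```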
Our exhaustive parameter search allows us to highlight a specific case with extremely large online error, an order of magnitude larger than that of the best case, while remaining numerically stable for long times (marked as Case III in Fig. 10f). To understand this anomalous solution better, we visualize in Fig. 11 snapshots of ω and Π for three CNN-LES runs: the best (Case I) and a middling (Case II) online run, along with this worst online run. A surprising fact is that both I and II have very similar flow structure while III looks noisy and unphysical. Curiously, though, the three cases have similar offline accuracy, again highlighting the challenge of predicting online accuracy from offline accuracy.
The findings above also raise questions about the relationship among the large number of online-stable CNN parameterizations that we find as part of our hyperparameter searches. The close structural similarity of the SGS error observed in Fig. 5 across different input choices and optimization methods indicates that we may be learning a common family of parameterizations. While one might hope to find a single, unique parameterization that is globally optimal under some accuracy metric for complex turbulent flows, this can be challenging, if at all possible, to infer from finite data sets. One might wonder how these results change for a fixed neural network architecture in the infinite-data limit. Such questions are tied to the study of ergodicity properties. If ergodicity applies, the uniqueness (and existence) of a parameterization that optimally averages out the small scales is guaranteed in the infinite-data limit (Chekroun et al., 2020, Theorem 4), although the use of finitely sampled data can lead to various admissible solutions with different scores, as reported in Fig. 10 and Fig. 12. A detailed study of this phenomenon under the lens of a possible ergodicity of the underlying forced flows (Foias et al., 2001; Chekroun and Glatt-Holtz, 2012; Hairer and Mattingly, 2006, 2011) goes beyond the scope of the present work.

4.2.4.
Offline vs online predictions. The ultimate goal of any data-driven parameterization is to reach high-quality online fidelity and accuracy. Achieving this typically requires running the model over long times and evaluating the online solutions across various metrics, as reported in the sections above. Such an approach becomes particularly expensive as one tests over wide swaths of the hyperparameter space, even when using shallow neural networks as in this study. Ideally, one could simply compute offline accuracy on the held-out test set and use it to estimate online accuracy, but this remains idealistic. In that respect, Figure 13 compares offline accuracy with the total-diff online accuracy measure defined in (4.1) across the (lr, wd)-hyperparameter space. The results are shown for β = 0 in Fig. 13a-c and for β = 20 in Fig. 13d-f. Unlike in the previous section, we separate the results by the number N_f of convolutional filters in the hidden layer. As N_f, and correspondingly the number of CNN parameters, is increased, the offline error (the x-axis) systematically decreases, as manifested by the shift of the cloud of points towards the origin along the x-axis. However, the corresponding decrease in online error (the y-axis) is rather small and consequently difficult to discern. This observation highlights the crucial point that a large decrease in offline error may yield only small gains in online CNN-LES performance. It also foreshadows the result that deeper CNNs, which would inevitably improve offline accuracy, do not necessarily lead to more accurate online solutions.
Concurrently, we note that the correlation between online and offline performance remains weak, implying that diagnosing the latter is insufficient for knowledge of the former, our actual objective. One reason is that our metric for offline performance, the mean square error calculated over 400 snapshots and spatially averaged through the norm involved therein, does not necessarily reflect the dynamical differences in the data produced by the different CNN-LES models. It is conceivable that more physically grounded metrics of offline performance might in fact be better predictors of online skill, without requiring expensive long-time CNN-LES runs; this remains an area of active investigation.

Discussion
5.1. Physical and computational implications of SGS near-locality. The primary result of this work is the effective learning of accurate parameterizations of two-dimensional and geostrophic turbulent flows with shallow, two-layer CNNs. This accomplishment implies that the SGS stresses for these problems have a spatially nearly local (or weakly non-local) dependence on the coarse fields. In other words, we have inferred the physics of this class of problems through the direct construction of a spatially nearly local parameterization. This is not in general true for all classes of turbulent flows; for example, convective motions in the oceanic surface mixed layer and, more broadly, in the Earth's troposphere (processes that also result in cloud formation and subsequent precipitation) can be strongly correlated across a significant fraction of the vertical extent within which they take place. Recent ML models that successfully parameterize vertical fluxes in the atmosphere use NNs that span most of the air column (Yuval and O'Gorman, 2020; Yuval et al., 2021). Similarly, traditional empirical physics-based parameterizations of the oceanic surface mixed layer employ a simple vertically non-local flux to represent convective turbulence. While the current class of atmospheric ML-based parameterizations does not represent the horizontal eddy fluxes, evidently for reasons of simplicity, our work suggests that more complete representations would likely be spatially (nearly) local in the horizontal and have a greater degree of non-locality in the vertical; in the ocean, the non-locality should be limited to the extent of the surface mixed layer where air-sea fluxes are actively felt.
A secondary implication of this work concerns the numerical computation of oceanic, atmospheric, and climate models. These models often rely on horizontal domain decomposition for numerical solution over large clusters of compute nodes, with inter-process communication typically limited to a small number of points on the boundaries of the sub-domains. This decomposition precludes data-driven parameterizations with a high degree of horizontal non-locality, which would make the inter-process exchanges prohibitively expensive; note that this is not an issue in the vertical direction (i.e., normal to the Earth's surface). Our results demonstrate that the horizontal SGS fluxes can be modeled with shallow CNN models, which are thus easier to incorporate into existing pipelines than more cumbersome deep CNNs.

5.2. Theoretical consequences and towards interpretability. Our neural closure results with shallow CNNs presented in this study are valid for cutoffs within the inertial range and for high Reynolds numbers. This problem is known to be difficult, as small errors at the level of the SGS typically amplify the errors at the large scales due to the inverse cascade (Piomelli et al., 1991; Jansen and Held, 2014). Achieving SGS parameterizations with a controlled error at low cutoff levels for such turbulent flows is thus one of the challenges to resolve. The accuracy and stability of our closure results strongly support the existence of a nonlinear function Π_CNN such that the SGS flux divergence, Π, satisfies, after spin-up, a relation of the form

Π = Π_CNN(ū, v) + ε,    (5.1)

where the residual ε is a spatio-temporal function whose fluctuations are controlled and small in a mean square sense. In Eq. (5.1), Π_CNN denotes the function found by means of shallow CNNs trained by minimizing our loss function (3.1). Indeed, (5.1) is a consequence of the very construction of Π_CNN, obtained by minimization of our loss function (3.1) along with its regularization terms.
We have shown that, with appropriate hyperparameter searching in the learning rate and weight decay coefficient space, as well as the use of cyclical learning rate annealing, small residuals ε can be reached offline with shallow and weakly local CNNs, while leading to stable and accurate online solutions. As Π_CNN is a nonlinear functional of the coarse-grained variables only, our closure skill suggests that knowledge of Π_CNN is amply sufficient to achieve good performance, in particular in the β-case, when compared to other recent neural closures (Ross et al., 2023).
Put differently, a good approximation of the conditional expectation, namely the best nonlinear functional averaging out the unresolved variables conditioned on the coarse variables, is sufficient for the closure of forced two-dimensional turbulence problems at high Re. In this study, Π_CNN with a small ε provides such an approximation and, as such, likely relates to the existence of an underlying optimal parameterizing manifold (OPM) linking the small and large scales in a least squares sense (Chekroun et al., 2017, 2020, 2021), as predicted by the theory of OPMs (Chekroun et al., 2020, Theorem 5).
Our neural turbulence closure results, together with related recent studies, thus restore some credentials to ideas proposed in the late 1980s by Foias et al. (1988, 1991) envisioning two-dimensional turbulence as essentially finite-dimensional, with turbulent solutions lying in some thin neighborhood, in a mean square sense, of a finite-dimensional manifold; see also (Chekroun et al., 2020, Eq. (1.5)). These ideas lost traction when they were shown to be valid only for cutoff wavenumbers within or close to the dissipation range (Pascal and Basdevant, 1992) when relying on traditional analytic parameterizations such as those initially proposed in (Foias et al., 1988). The use of neural networks thus invites us to revise conclusions based on a limited class of analytic formulas, and sheds new light on this old problem by pushing the validity of relationships such as (5.1) to cutoffs within the inertial range. The nearly local, shallow CNN parameterizations that achieve this feat are likely to be interpretable and generalizable because of their intrinsic low dimensionality. We thus hope to reconcile the previous failures of analytic attempts with the recent empirical successes by seeking new analytic closure formulas that exploit the structure of our nearly local shallow CNNs.

5.3. Machine learning vs physical design choices.
In their recent work, Ross et al. (2023) suggest that ML-based parameterization efforts should focus less on ML details like NN architectures or optimization techniques and more on physical design choices, in their case the choice of inputs/outputs and the coarse-graining filter. Based on our results, we suggest that one should in fact focus on both aspects, which in some cases are closely related, e.g., our choice of the 2-layer CNN architecture based on the nearly local character of the coarse-field components of the SGS interacting with the small scales. This study shows that optimization techniques like cyclically annealed lr can make the learning task substantially easier and generally more robust; this should be even more relevant when the problems considered are not simplified turbulence models like the ones dealt with here and in the studies referenced throughout. Furthermore, hyperparameter searches, while tedious, can lead to substantial improvements in accuracy and efficiency, and therefore should not be ignored.

Figure 3.
Pipeline of the CNN used to model SGS stresses. The input fields are either of the coarse-field combinations (ω, σn, σs) or (ū, v), while the outputs are the two relevant components of the deviatoric stress tensor, S_d^00 and S_d^01. The SGS vorticity fluxes are then computed through a tensor divergence followed by a two-dimensional curl operation.
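For concreteness, a minimal PyTorch sketch of a two-layer CNN of this kind follows. The kernel size, filter count, and circular padding are illustrative assumptions consistent with the O(1000)-parameter budget and the doubly periodic domain, not the exact architecture of Sec. 3.1.

```python
import torch
import torch.nn as nn

class ShallowSGSCNN(nn.Module):
    """Two-layer CNN mapping coarse fields to deviatoric stress components.

    in_channels = 2 for (u, v) inputs or 3 for (omega, sigma_n, sigma_s);
    the outputs are the two components S_d^00 and S_d^01.
    Hyperparameters below (n_filters, ksize) are illustrative only.
    """
    def __init__(self, in_channels=2, n_filters=16, ksize=5):
        super().__init__()
        # 'circular' padding respects the doubly periodic domain
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, n_filters, ksize,
                      padding=ksize // 2, padding_mode='circular'),
            nn.ReLU(),
            nn.Conv2d(n_filters, 2, ksize,
                      padding=ksize // 2, padding_mode='circular'),
        )

    def forward(self, x):  # x: (batch, in_channels, ny, nx)
        return self.net(x)

# parameter count for, e.g., 2 inputs, 16 filters, 5x5 kernels:
# (2*25+1)*16 + (16*25+1)*2 = 816 + 802 = 1618, i.e. O(1000) parameters
```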

Figure 4.
Left: the learning rate of the cosine annealing case (red) vs the fixed case (green) as a function of epochs. Middle: the total test loss (red) together with its lower-bound curve (blue) for the cyclical annealing case; note that the blue curve is monotonically decreasing. Right: same as the middle panel but for the fixed lr case. The red circle marks the model state with the lowest test error over the entire course of training; the CNN model is saved at this point and used to drive the online CNN-LES runs.
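A minimal sketch of the training-loop logic summarized in this figure, using PyTorch's cosine-annealing-with-warm-restarts scheduler and saving the checkpoint with the lowest test loss; the optimizer, restart period, and epoch count are illustrative assumptions, not our exact training configuration.

```python
import copy
import torch

def train(model, loss_fn, train_loader, test_loader, epochs=300,
          lr=1e-3, wd=1e-4, restart_period=50):
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        opt, T_0=restart_period)  # cyclical lr annealing
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()
        model.eval()
        with torch.no_grad():
            test_loss = sum(loss_fn(model(x), y).item()
                            for x, y in test_loader) / len(test_loader)
        if test_loss < best_loss:  # track the lower-bound curve of test loss
            best_loss = test_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # model used for online CNN-LES runs
    return model
```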

Figure 5.
Comparison of the test-data SGS flux divergence, Π_D, with that predicted by the trained CNN models, Π_θ, for different inputs and training methodologies. Top row: comparison of a single test-set snapshot, Π_D, with the error of the CNN predictions, Π_D − Π_θ, depending on whether the inputs are (ū, v) or (ω, σs, σn) and on whether cyclical learning rate annealing was used for training, for β = 0. Middle row: same as the top row but for β = 20. Each snapshot also shows the coefficient of determination, R²(Π_D, Π_θ), computed over the entire test set (consisting of 400 snapshots). Bottom row: comparison of the power spectra Π_D(k) (black dashed lines) with Π_θ(k), and of the corresponding probability distributions P(Π_D) and P(Π_θ), for each of the three cases across the values of β shown in the top two rows. Note that the colored curves can be difficult to distinguish because of their close overlap, consistent with the spatial error plots in panels b)-d) and f)-h). The markers have been removed from k) and l) to reduce clutter.

Figure 6.
Evolution of the correlation coefficient between the FDNS and multiple configurations of the CNN-LES models (each initialized with the same FDNS snapshot) as a function of DNS time steps (solid lines with markers); to get the corresponding number of CNN-LES time steps, divide by 16. Also shown are the correlations of the Smagorinsky-parameterized runs with the FDNS for the Smagorinsky constants indicated. The CNN-LES cases shown here correspond to the best long-term online error from a large hyperparameter search over input and learning rate choices; see Secs. 4.2.2 and 4.2.3.

Figure 7.
Comparison of FDNS data (dashed black line) with online runs for the four cases shown in Fig. 6 and Fig. 5, as indicated in the legend, for β=0 and for the diagnostic quantities indicated. Note that the different curves are barely discernible in a), b), f) and, to a certain extent, e). The real differences emerge in the two cross-scale fluxes, c) and d); here the two cases with (ū, v) inputs are the most accurate. The markers have been removed from e) and f) to reduce clutter.

Figure 8.
Same as Fig. 7 but for β=20. The case corresponding to a fixed lr and (ω, σs, σn) inputs is not shown because of poor accuracy. The inset in a) shows a blow-up of the low-wavenumber energy spectrum of each CNN-LES model. Note that unlike in Fig. 7, the CNN-LES model learnt through cyclic lr with (ū, v) inputs (blue curve with circles) substantially outperforms the other CNN-LES models in accuracy.

Figure 9.
Comparison of the long-term online performance of the best CNN models (both cyclic lr cases) from Fig. 7 and Fig. 8 with the respective Smagorinsky runs, which are run for times identical to the CNN-LES cases. Only a few of the key metrics from the earlier figures are shown, for brevity. Insets highlight the large-wavenumber comparisons between the CNN-LES and Smagorinsky spectra; note that these are shown with a linear rather than logarithmic y-axis to further highlight the differences. The vertical dashed line in (a) and (d) shows the rightmost extent of the inset. Note that, compared to the online CNN-LES solutions, the Smagorinsky ones fail to capture the inverse cascade.

Figure 10.
A comparison of five of the difference metrics defined in Sec. 3.5 against the sixth online metric, spectral-diff, which measures the difference between the energy spectra, for online runs based on CNNs trained across a range of (lr, wd)-values in the hyperparameter space. Here β=0. The red markers denote online CNN-LES models trained with cyclical lr, the green ones with fixed lr. The circles use (ω, σn, σs) as inputs while the triangles use (ū, v). The total-diff metric is defined in Eq. (4.1). The cases I, II and III marked in (f) are visualized in Fig. 11.

Figure 11.
Snapshots of ω and the corresponding Π for the three online runs corresponding to the cases marked I, II and III in Fig. 10f.

Figure 12.
Same as Fig. 10 but for β=20. Here again one observes an overall better online performance of the CNN-LES models trained with cyclical lr rather than fixed lr (red vs. green markers).

Figure 13.
Comparison of the online total-diff metric vs the offline test error over the (lr, wd)-hyperparameter space for a)-c) β = 0 and d)-f) β = 20. The results are grouped by the number of convolutional filters used, N_f; details of the CNN architecture are given in Sec. 3.1. Marker color-coding is the same as in Fig. 12.

Acknowledgments
This work is supported by the Office of Naval Research (ONR) Multidisciplinary University Research Initiative (MURI) grant N00014-20-1-2023. This work is also partially supported (MDC) by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 810370).