Stochastic-Deep Learning Parameterization of Ocean Momentum Forcing

Coupled climate simulations that span several hundred years cannot be run at a high-enough spatial resolution to resolve mesoscale ocean dynamics. Recently, several studies have considered Deep Learning to parameterize subgrid forcing within macroscale ocean equations using data from ocean-only simulations with idealized geometry. We present a stochastic Deep Learning parameterization that is trained on data generated by CM2.6, a high-resolution state-of-the-art coupled climate model. We train a Convolutional Neural Network for the subgrid momentum forcing using macroscale surface velocities from a few selected subdomains with different dynamical regimes. At each location of the coarse grid, rather than predicting a single number for the subgrid momentum forcing, we predict both the mean and standard deviation of a Gaussian probability distribution. This approach requires training our neural network to minimize a negative log-likelihood loss function rather than the Mean Square Error, which has been the standard in applications of Deep Learning to the problem of parameterizations. Each estimate of the conditional mean subgrid forcing is thus associated with an uncertainty estimate–the standard deviation—which will form the basis for a stochastic subgrid parameterization. Offline tests show that our parameterization generalizes well to the global oceans and a climate with increased 2 CO levels without further training. We then implement our learned stochastic parameterization in an eddy-permitting idealized shallow water model. The implementation is stable and improves some statistics of the flow. Our work demonstrates

. Despite vastly improving the physics of climate models, current parameterizations continue to induce biases in simulations (IPCC, 2013).
The era of Machine Learning offers an opportunity to improve the parameterization of unresolved processes using available data from observations and limited high-resolution simulations.While some progress has been made toward online learning of neural networks (NN) for unresolved processes in partial differential equations (Sirignano et al., 2020), the approach is not yet ready for complex climate simulations and might not be generalizable due to model dependence.Therefore, the typical approach in atmosphere and ocean modeling consists in training Machine Learning algorithms offline, with a subgrid forcing term that is diagnosed via a filtering operation over high-resolution simulation data, used as the "truth".Recent studies have shown the potential of Machine Learning approaches for atmospheric (Rasp et al., 2018;Yuval & O'Gorman, 2020) and ocean parameterizations (Bolton & Zanna, 2019;Zanna & Bolton, 2020) to improve simulations.So far, most studies on ocean parameterizations that use Machine Learning have been limited to the use of data from ocean-only models with idealized geometry.The viability of Deep Learning parameterizations trained using data from realistic coupled or uncoupled models and their potential to generalize to different climates both remain open questions.In addition, the stability and the physical behavior of implementations of Deep Learning parameterizations in models have also been a subject of debate (Brenowitz et al., 2020;Yuval & O'Gorman, 2020;Zanna & Bolton, 2020).
Here we address these questions by showing the high performance of a Deep Neural Network in offline predictions of subgrid momentum forcing in different climates using data from a high-resolution coupled climate model, which resolves ocean mesoscale eddies in many regions (Griffies et al., 2015;Hallberg, 2013), and show which will result in a stable implementation online.Our work focuses on parameterizing the interaction between mesoscale eddies and large-scale flow, which is key to establishing the transfer of energy between reservoirs and scales (Ferrari & Wunsch, 2009) and to establishing the large-scale ocean circulation (Waterman & Jayne, 2011).In particular, we propose a stochastic parameterization that aims to represent the inherent uncertainty of the subgrid forcing (Juricke et al., 2017), stabilize the online implementation of the parameterization (Palmer, 2012;Zanna et al., 2017) and reduce systematic biases (Berner et al., 2017;Gagne II et al., 2020).Stochastic parameterizations are in demand in particular in what has been referred to as the gray zone (Gerard, 2007;Jones et al., 2019;Zanna et al., 2020), where subgrid processes are partly resolved (Berner et al., 2017).
In our study, our NN model outputs the mean and standard deviation for the predicted momentum forcing, given coarse-resolution velocities.The predictions of mean and standard deviation of the momentum forcing form the basis of a stochastic parameterization that we will implement in an idealized ocean model.Our contribution therefore establishes a bridge between recent developments on Deep Learning approaches to the problem of stochastic parameterizations (Berloff, 2005;Brankart, 2013;Frederiksen & Davies, 1997;Juricke et al., 2017;Mason & Thomson, 1992;Williams et al., 2016;Zanna et al., 2017).
The manuscript is structured as follows.In Section 2, we describe the data, the NN architecture and the training procedure-which uses a probabilistic loss function.In Section 3, we conduct an offline test on a global scale, showing the ability of our NN to generalize to regions not seen during training.We also show the ability of our NN to generalize to a different climate in which 2 CO levels are higher and have affected the mesoscale variability.In Section 4, we demonstrate the potential for increased stability via an implementation of our stochastic parameterization into an idealized ocean model, without further training of the NN.Finally, in Section 5 we conclude and discuss the implications of our work and future directions.

Methods
In Sections 2.1 and 2.2, we describe the filtering and subsequent coarsening of the data in order to diagnose the corresponding subgrid momentum forcing necessary to force a coarse-resolution model.In Section 2.3 and 2.4, we describe a procedure that enables us to represent the uncertainty associated with the forcing using a probabilistic loss function for training.In Section 2.5, we review the structure of our proposed NN.Finally, in Section 2.6 we provide details about our training procedure.

Filtering and Coarse-Graining Procedure
In this section, we describe the processing necessary to generate the training data for our NN.The procedure follows the two steps presented in Zanna and Bolton (2020): low-pass area-weighted Gaussian filtering, followed by coarse-graining.Applied on the high-resolution surface velocities  u from the CM2.6 simulations, this procedure generates coarse-resolution velocity data that mimics the simulation data from coarser models and will serve as the input to our NN.In addition, given the high-resolution and filtered velocity data, we diagnose the missing subgrid forcing of a coarse-resolution model (e.g., CM2.5) compared to its high-resolution counterpart (here, CM2.6).This missing forcing is the subgrid parameterization needed at coarse resolution to mimic the effect of unresolved scales on the large-scale flow that will be learned by the NN.
Unlike data used in previous machine learning studies (Bolton & Zanna, 2019;Yuval & O'Gorman, 2020;Zanna & Bolton, 2020), the CM2.6 grid is on a sphere.In the zonal direction, the spacing is 1/10, and in the meridional direction, it varies from approximately 1/10 near the equator to 1/20 at high latitudes, so that the zonal and meridional extent of grid boxes varies with latitude.The grids of CM2.5, with 1/4 nominal resolution, and CM2.1, with 1 nominal resolution, have a similar structure.As such, the meridional length scale used to define the subgrid eddy forcing should depend on the latitude, which would in part capture the latitudinal variations of the Rossby deformation radius.In contrast with the typical approach, rather than selecting a uniform length scale to filter the data and generate a coarse-resolution field, we select a uniform and unitless integer scaling factor  , that defines the number of grid boxes from the high-resolution grid that map to a single grid box of the low-resolution grid.This is the simplest and most consistent definition of subgrid scale for the purpose of data-driven parameterizations.This unitless scaling factor applies to both the filtering and coarse-graining steps.
We now describe in details the two steps of our low-resolution data generation procedure given the fixed scaling factor  .As a first step we apply a low-pass weighted Gaussian filter, denoted by   , to the high-res- olution surface velocity data, with weights provided by grid box areas, to separate the subgrid from the resolved field (Bolton & Zanna, 2019).The standard deviation of the Gaussian kernel is set to / 2  , such that approximately 80% of its weight falls within the interval [ / 2, / 2]    of length  .Note that in using a uni- form scaling factor we also allow the use of standard convolution algorithms for regularly spaced data.This would not be possible if we were using a uniform length scale as we would then have to adapt the size of the filter in terms of number of grid points as a function of latitude, incurring a high computational cost to generate the training data.The second step simply consists of a coarse-graining procedure, resulting in the filtered and coarse-grained velocity field u u    , where •  denotes the coarse-graining operator by a factor  .The coarse-graining operator is based on the mean function applied over squares of side length  - equivalent to area-weighted average.After coarse-graining, the resulting grid consists of approximately 2  times less points than the high-resolu- tion grid.
This filtering and coarse-graining procedure is applied to surface velocites from the CM2.6 control simulation.Figure 1 shows a snapshot of the filtered and coarse-grained surface zonal velocity.In addition, the subgrid momentum forcing on the high-resolution grid, denoted and is then coarse-grained according to S S      .For exact implementation details of the procedure in the form of pseudo-code, please see Appendix B or the online code.
In our work, we primarily target parameterization for the eddy-permitting regime, or the gray zone, in which momentum parameterizations are in demand to mimic the inverse energy cascade (backscatter) processes while potentially overcoming spurious dissipation (Bachman, 2019;Charney, 1971;Jansen et al., 2019;Kjellsson & Zanna, 2017;Kraichnan, 1967;Larichev & Held, 1995;Leith, 1990;Rhines, 1977;Salmon, 1980;Treguier et al., 1997;Zanna et al., 2017Zanna et al., , 2020)).Here, we present experiments in which we set 4   , such that, irrespective of the subdomains of study, the coarse-grained grid has ∼4 times less grid boxes along each horizontal dimension.This choice of four grid boxes leads to a subgrid forcing of an ocean model at resolution 0.4, close to the resolution of ESM4 which has a nominal resolution of 0.5 (Dunne et al., 2020).
The filtering and coarse-graining procedure is applied to both data from the piControl and 1pctCO2 simulations.The piControl data set will be used both for training and offline testing, while the 1ptCO 2 data set will be used for testing only.

Prediction: Conditional Distribution of Subgrid Scale Forcing
Our goal is to learn a parameterization, denoted by Ŝ, of the diagnosed true subgrid momentum forcing, S (Equation 1), using deep learning.We propose a NN that uses maps of coarse surface velocities, u, at a given time as inputs, and estimates the subgrid momentum forcing components at that same time as outputs.
Here estimates is to be understood in a broad sense: it could be a single-value prediction or a probability distribution as discussed below.
Specifically, in this work we present a stochastic parameterization of the subgrid momentum forcing.To do so, we assume that at each grid box, the distribution of the forcing is Gaussian, conditionally on the coarse surface velocities (we do not assume that the marginal distribution of the forcing is Gaussian).We also assume that the forcing at individual grid boxes and times are conditionally independent given the coarse surface velocities.
GUILLAUMIN AND ZANNA 10.1029/2021MS002534 4 of 25 One of the main sources of uncertainty in the predicted subgrid forcing comes from the fact that we only use the resolved coarse velocities to make a prediction.Given a surface velocity field over a subdomain of the oceans at a given time, we do not necessarily expect the subgrid momentum forcing to be given by a deterministic mapping.To illustrate this statement, consider Equation 1; for the problem at hand,  u is unknown, and while  u u  is well-defined as a mapping, it is not invertible.Thus the parameterization problem can be viewed as an inverse problem, for which probabilistic representations are a common approach (Bishop, 1991).If  u is seen as a random variable, we may want to represent the probability distribution ( | ) P  u u , and the same applies to the forcing.Hence we may want our NN's output to determine a parametric probability distribution rather than a single number.
Besides, a stochastic parameterization of the subgrid forcing can also account for the fact that what we call the true subgrid forcing, Equation 1, depends on our choice of filter which may not adequately represent the "missing forcing" from any given numerical model at coarse-resolution.One could train our NN with subgrid forcing generated from a variety of methods, to partially account for the fact that the exact subgrid forcing is not known.
The output Gaussian distribution at each location can be interpreted as an estimate of the conditional distribution of the subgrid momentum forcing given the local velocity field.Its mean represents the expectation of that conditional probability distribution, while its standard deviation represents the uncertainty around the mean.Such representation will allow deriving confidence intervals of the predicted subgrid momentum forcing (see Section 3.2).It also forms the basis of our stochastic parameterization, see Section 4 about implementation.In the next section, we will show how to learn the mean and standard deviation of the subgrid forcing from data.

Probabilistic Loss Function: From MSE to Gaussian Log-Likelihood
To train our NN, we aim to find a local minimum to a loss function ( ,  )) ˆ( L S S θ -summed over all the samples of the training data set-that represents the mismatch between our prediction Ŝ and the true value S given the current state of the parameters of the NN, represented by the vector of parameters θ.Here, S, the target tensors of the NN corresponding to a single sample at a given time, has dimensions ( , , ) C x y n n n , where 2 C n  for the zonal and meridional component of the velocity field, and ( x n , y n ) is the size of the domain considered, i.e., the number of grid boxes in the zonal and meridional direction, respectively.We have ignored the number of mini-batches here for simplicity, which will be discussed in Section 2.6.
The most common loss function used in regression is the Mean Square Error (MSE), which in our case would take the form of, for a single sample, where k denotes the index of the component of the subgrid momentum forcing (here, 1 corresponds to the zonal component and 2 to the meridional component).Despite its widespread use within the Deep Learning community for regression, the MSE loss function is not always appropriate.To justify the above claim briefly, it is common to interpret the MSE loss function from a probabilistic perspective.For simplicity, we limit the discussion to univariate random variables, but this can be easily extended to multivariate variables.Let ,   be random variables; assume that  is observed, and  is such that its conditional probability density function given  is a Gaussian distribution with mean  and constant standard deviation  , If we assume that ( )   can be modeled by a parametric function f (in our case the NN) with parameter θ, the log-likelihood of the parameters , θ for an independent and identically distributed (i.i.d.) sample will be given by, Maximizing this log-likelihood (Maximizing the log-likelihood results in estimating the parameters of a probability distribution, so that under the assumed statistical model f the observed data  is most probable) over , θ can be achieved in a separable way (Davison, 2003): we first maximize over θ, which corresponds to training the NN using the MSE loss, and we then estimate  by simply computing the standard deviation of the residuals   1, , ( , ) . Hence, from a probabilistic point of view, by minimizing the MSE loss function, we are assuming a constant standard deviation (i.e., that does not depend on the velocity field).
In this study, we propose to relax this common assumption based on the literature and our understanding of the data.We replace the MSE loss function by a full negative Gaussian log-likelihood.Referring back to our univariate example, this would lead to replacing Equation 4 by, (5) where our function f -which would correspond to our NN-now has two components, one for the mean, 1 f , of the Gaussian distribution, and the other one for the standard deviation, 2 f .In particular, the term corresponding to the standard deviation of the Gaussian in Equation 5, In order to apply this to the problem of subgrid momentum forcing, we build our NN to output the two moments of a Gaussian distribution, at each location and for both (zonal and meridional) components of the subgrid momentum forcing.The output tensor, Ŝ, now has dimension (2 , , ) : we have four output channels (2 C n  )-the first two correspond to the means of the two components of the subgrid momentum forcing, the last 2 correspond to the associated standard deviations.Our loss function therefore takes the form of (ignoring constant terms), For ease of reading, we introduce a more natural notation, where we denote at location , i j.With this notation, Equation 6 takes the form of, The NN will learn to jointly optimize the two moments of the predicted Gaussian distribution, as shown schematically in Figure 2. Note that we also jointly train on both zonal and meridional components of the forcing, rather than having separate NNs for each component, as in Zanna and Bolton (2020).

Neural Network Architecture
Our NN is a Fully Convolutional Neural Network (Long et al., 2015) with a sequence of eight convolutional layers.The ReLU activation function is used for hidden layers.Given that the NN is fully convolutional, it can adapt to varying sizes of the input subdomain.We remind the reader that the input consists of two channels, one per component of the velocity field, while the output consists of four channels, two for each of the two components of the subgrid momentum forcing, see Section 2.4.We do not use any padding in the implementation of our convolutional layers.Due to the lack of padding in our NN structure, some pixels near the edges are lost in the application of convolutional layers.This results in the outputs predicted by our NN having spatial extent (  ,  )   x y n p n p   , where p is a nonnegative integer that depends on the size of the kernels used in the convolutional layers.
The mean of the subgrid momentum forcing predicted by the NN can take any real value, as such we do not use any activation function in the final layer for the first two channels.However, the output predicted for the standard deviations are required to be positive.To enforce this constraint, we use a softplus function, defined by, as a final activation function for the two output channels associated with the standard deviations.

Training and Validation Procedure
We now describe our training procedure.The inputs are fed into the NN in the form of mini-batches (i.e., small batches of several samples stacked along an extra dimension), rather than individually, such that the dimensions of our input tensors are batch ( , , , ) , where batch n is the number of samples per mini-batch, set to batch 4 n  in our experiments.A common practice in the methodology of NNs is to normalize inputs to be distributed within the interval [ 1,1]   , to avoid vanishing and exploding gradients in the application of the back-propagation algorithm.Here, we multiply the surface velocities by a factor of 10, which approximately corresponds to normalizing by the standard deviation ( 0.1  m/s) of the surface velocity data from the training subdomains.This same transformation is applied in testing.
The targets used to train and evaluate our NN consist of the true subgrid momentum forcing S for a given subdomain.We train our NN on data from the piControl simulation.We restrict the training data to a combination of four selected sub-domains of the oceans-shown as gray rectangles in Figure 1, see also Table 1-that correspond to various dynamical regimes: the Gulf Stream extension, the Equatorial Atlantic, just south of the Equatorial Pacific, and in the South Pacific gyre.Further improvements through more advanced selection and weighting of the training subdomains may improve performance.We select the first 80% of the data (approximately spanning 16 years) as training data, and use the final 15% (approximately spanning 3 years) for validation.
We ignore 5% of the data (1 year) to avoid any correlation between the training data and the validation data, as it could cause validation metrics to become over-optimistic.
During the training phase samples are entirely shuffled across the time dimension, as well as across subdomains.This allows to jointly train on data from all selected subdomains simultaneously.However, this requires the input tensors obtained from all the subdomains to have the same GUILLAUMIN AND ZANNA 10.1029/2021MS002534 7 of 25 spatial sizes.We therefore crop the input tensors according to the smallest size across subdomains for both spatial dimensions, resulting in training samples of spatial extent ( , ) (38, 45)   x y n n  . Given the nonMarkovian nature of turbulence (Kraichnan, 1959), further work incorporating temporal dependence and memory, rather than shuffling the data, might improve the skill of the learned parameterization (Porta Mana & Zanna, 2014).
We compute the average loss-defined in Section 2.4 -over the samples of a mini-batch, across the two components of the forcing, and across both spatial dimensions.The average loss is then back-propagated to obtain the derivatives of the loss function with respect to the NN's parameters.The NN's parameters are then updated using the ADAM algorithm (Kingma & Ba, 2015).ADAM has become one of the go-to optimization algorithms in the Deep Learning community, which is in part due to its robustness to the choice of the learning rate and its quick convergence.After the NN's parameters have been updated, we repeat the same process with a new mini-batch, and so on, until all the training data has been used, which corresponds to one epoch of training.At this point, we compute the average loss over the validation data, which was not used for optimization.We track this validation loss over the whole set of training epochs and repeat this process.We implement early stopping so that training stops once the validation loss has not improved for four consecutive epochs of training.More details about our final choice of hyperparameters, such as the learning rate, hand-picked through a validation procedure, can be found in Appendix A.

Offline Tests on a Global Scale
We test our probabilistic deep learning parameterization on a global scale and demonstrate its generalization properties offline using test metrics introduced in Section 3.2.In Section 3.3, we first carry out a test on piControl in order to assess the ability of our NN model to generalize to regions and dynamical regimes not seen during the training phase.We then carry out a test on 1ptCO 2 in Section 3.4, where the 2 CO levels in the atmosphere reach double those of the piControl simulation.Our results show that our stochastic deep learning parameterization performs well in this new climate, without requiring further training of our NN; this is crucial if such parameterizations are to be used in models for climate projections (O' Gorman & Dwyer, 2018;Rasp et al., 2018).

Global Reconstruction for Offline Testing
We directly apply our trained NN to the global coarsened velocities for offline testing.While the training subdomains are only 38 45  points, we use the global extent of the oceans for testing which results in an input tensor of dimension   , (900,625) x y N N  .This is possible due to the convolutional nature of the NN.When applying the NN to global data, we extend the input velocities cyclically along the zonal dimension (since the output values of the tensor for x i N  requires information from 1 i  ), thus ensuring the output covers all longitudes.Obviously, this is not possible along the meridional dimension, thus resulting in the loss of 10 p  grid boxes (see Section 2.5) along the meridional dimension at high latitudes in each hemispheres.
Velocity snapshots are assembled to form small mini-batches with size 4 (equivalent to 4 days); the size is determined by the available GPU memory.Nonocean points of the input grid are stored as NaNs.In our tests, we therefore ignore locations whose receptive field intersect with a continent and show them in black in the maps shown thereafter (note, the receptive field of a neuron within the NN's output is the set of input neurons that impact its value).

Metrics and Statistics for Offline Performance
To quantify the offline accuracy of our NN's predictions of the subgrid momentum forcing, we define several metrics.We note 7300 T  the total number of days over which these metrics are computed.
We first define our notation for the standard MSE and correlation.To make explicit the dimension along which the data is reduced to compute these two metrics, we write  , , ,  , , ,  1 1 MSE , 1, , , 1, , .T The combined MSE, that encompasses both components X and Y , can be shown on a map, and is defined as , , ,  , , ,  , , ,  , , , We also define a scalar MSE according to, , , ,  , , ,  , , ,  1 In addition to the standard MSE, we define an 2 R coefficient which is normalized by the value of the true subgrid forcing such that and a scalar version 2 R , according to, We note that In order to verify that our model is not simply predicting the seasonal climatology of the subgrid momentum forcing, we define modified versions of these quantities, e.g., where , , ,

S
is the climatological C component subgrid momentum forcing at location , i j and time t.This metric allows us to assess what percentage of the signal's variance we account for, after removing the inherent variability due to the seasonal cycle.
Another quantity of interest given our probabilistic representation of the subgrid momentum forcing parameterization is that of standardized residuals, given by, Under our idealized assumption, these normalized residuals are expected to follow a standard normal distribution.
We will also use confidence intervals based on the standard deviation, to quantify the uncertainty in the predicted subgrid forcing and evaluate its performance.Under the Gaussian assumption, a 95% confidence interval corresponds to,

Generalization & Dynamical Regions-Test on piControl
We carry out an offline test of our NN on global scale data from piControl.
There are large variations in subgrid eddy momentum in the piControl (Figure 3a) across the oceans, with the largest amplitude occurring in eddy rich regions such as the Gulf Stream, Kuroshio, Southern Ocean and equatorial regions.There is a strong coherence between the pattern of the variance of the mean of the true subgrid forcing (Figure 3a) and that of the predicted forcing (Figure 3b).This coherence holds for the zonal and meridional component of the forcing, as shown for example in the correlation map between the true zonal forcing and the mean component of the predicted zonal forcing (Figure D1).
The time-mean MSE over both components of the forcing (Equation 11) can vary by several orders of magnitude from one region to another (Figure 4a).However, these changes are largely due to the inherent spatial variability of the subgrid forcing, evident by comparing its spatial pattern (Figure 3a) with the spatial pattern of the MSE (Figure 4a).Therefore, the 2,clim , , i j R  coefficient (Equation 15) is more informative of the NN's performance (Figure 4b).
In most regions of the oceans, our NN is able to account for more than 70% of the signal's variance, with performance nearing 90% in regions where the variance of the eddy momentum forcing is the highest, for instance in the Gulf Stream region and Southern Ocean (see the appendix for maps of the 2,clim

R
computed for each component of the forcing-Figure D2-showing similar skill).These metrics indicate that the NN generalizes well to most regions, despite being trained on only four small subdomains of the oceans.However, our NN performs poorly in sea-ice covered regions, which is not surprising as the dynamics of these regions were not included in the training and widely differ from open ocean turbulence.Considering turbulence at the ocean-ice boundary will require numerical simulations that can adequately represent such processes.R is higher than the average of 2 R values over the map due to the higher 2 R values in regions where the variance of the forcing is large (note that Equation 14 is not the spatial average of Equation 12).
To demonstrate some advantages of predicting the two moments of a Gaussian distribution, we focus on time series at two different locations.We compare the time series of the true and predicted zonal forcing at GUILLAUMIN AND ZANNA 10.1029/2021MS002534 10 of 25   interval.The predicted standard deviation   std , , , ˆX i j t S varies greatly across the considered time window-indicating that the uncertainty of the forcing is not constant.Our NN performs best in turbulent regions.This is in agreement with 2 R maps where higher values are observed in regions where the forcing is larger, and also with results from idealized ocean models (Bolton & Zanna, 2019;Zanna & Bolton, 2020).Finally, to investigate regions with a low 2 R score, we analyze the time series of the true and predicted meridional forcing at 29 ,129  N W   (Figure D3), which corresponds to a location near the West Coast of the United States where the 2 R score is 0.532.The time series indicates that the low 2 R occurs due to a few extreme events that are not well predicted.
To further analyze our predicted forcing from the piControl data set, we study the global distribution of a stochastic simulation of subgrid momentum forcing generated using, , ,  , ,   , , , , , 1, , , 1, , , ˆĈ where the , , C i j  is are i.i.d.standard normal and the inputs to the NN are the coarsened surface velocities from piControl.The histograms of the global distribution of each component of the subgrid forcing for the true and simulated forcing show that the two distributions are very similar (Figure D4).However, the distribution of the true forcing has longer tails than that of the simulated forcing.This is partly due to our assumption that the distribution of the forcing, conditioned on the coarse surface velocity field, is Gaussian.We test this hypothesis by investigating the distribution of normalized residuals, defined by Equation 16. Figure D5a consists of the sample distribution of normalized residuals (blue), after subsampling one point out of ten along the time axis, and one point out of five along the spatial axes, shown together with the probability density function of the standard normal distribution (red).We also present a quantile-quantile (QQ)-plot of the sampled normalized residuals in Figure D5b, using the same subsampling procedure as in Figure D5a.The normalized residuals have much heavier tails than those of a standard normal.Hence, we could improve our model by using another distribution with heavier tails, or a multimodal distribution (Bishop, 1991).This approach will likely improve the offline prediction of extreme events which we have shown is problematic in our NN.

Generalization & Climate Change-Test on 1ptCO 2
One key challenge for deep learning parameterizations in ocean and climate modeling is for them to be able to generalize to a new climate (Martínez-Moreno et al., 2021;O'Gorman & Dwyer, 2018).So far, we have only used data from the piControl simulations, both for training (Section 2.6) and testing (Section 3.3).Here, we test the trained NN from Section 2.6, without further tuning, using simulated data from 1ptCO 2 .The surface velocities, associated kinetic energy, and subgrid momentum forcing, are influenced by the 2 CO forcing.The time-mean standard deviation of the surface velocity between piControl and 1ptCO 2 (Figure 6a) changes by up to 40% in some parts of the oceans.The majority of the changes are occurring in regions dominated by high kinetic energy in piControl such as the Gulf Stream region and its extension, the Kuroshio extension, or the Southern Ocean.Additionally, we identify large changes in the Indian Ocean and in the Arctic, ice-melt is likely related to changes in the latter.Similar changes in the subgrid momentum forcing between piControl and 1ptCO 2 are occurring as well (Figure 6b).The surface velocities used as inputs to the NN and the target subgrid forcing to be predicted are therefore significantly different from those of piControl.
In order to compare performance of our NN on piControl and 1ptCO 2 we use the same metrics as in Section 3.3.The MSE and  shown in Figure 7.Our NN performs as well for this new climate as it did for the climate it had been trained on (e.g., compare Figure 7 with Figure 4).The time-mean 2,clim R obtained on piControl and 1ptCO 2 show little difference (Figure 7c), except in the North-East Atlantic and in certain polar regions which were partially ice-covered in piControl, where there is a slight decrease in performance (at most 0.1) as measured by the time-mean 2,clim R .We compute scalar metrics of our NN's performance over the 1ptCO 2 simulation data, again limited to 60 ,60 S N   , and obtain 0.871 for 2 R and 0.858 for 2,clim R , i.e. very similar to the values obtained for piControl.The NN for subgrid momentum forcing trained on piControl data generalizes well to an unseen warmer climate as simulated by a coupled high-resolution climate model.

Online Implementation for an Idealized Model
Offline performance tests have not been good predictors for online performance, as shown for example in Zanna and Bolton (2020), at least not using current metrics.The coupling between the machine learning (ML) parameterization and the prognostic coarse-resolution ocean model must satisfy the same numerical stability criteria and conservation properties as any physics-derived parameterization.Therefore, good offline performance is a necessary but not sufficient condition to the success of any ML parameterization.
Zanna and Bolton (2020) implemented a convolutional NN parameterization which, while physically constrained, led to too vigorous an inverse energy cascade and energy backscatter.While the model was not numerically unstable, the behavior of the model was pushed into a different dynamical regime in which the eddy mean-flow interactions dominated over the external wind forcing.To ensure a reasonable dynamical behavior, the authors tuned down the parameterization by applying a spatially and temporally uniform multiplicative factor to reduce the magnitude of the forcing in an ad-hoc way.
The use of a stochastic parameterization has the potential to damp the eddy (and destabilizing) feedbacks seen in Zanna and Bolton (2020).Here, we use the same idealized barotropic shallow water model as in Zanna and Bolton (2020) and developed by Klöwer et al. (2018) to implement the stochastic deep learning parameterization learned from complex CM2.6 data; for further details on the model, consult these these references, or our code.The stochastic parameterization is implemented in a 40 km horizontal resolution run, and we compare the runs to a high-resolution model run at 10 km horizontal resolution, hence mimicking the change in resolution between CM2.6 simulation data and the coarse-grained data we generated to diagnose the momentum forcing.Note that we tested offline the learned CM2.6 parameterization with data from the shallow water model and shown that the learned NN model transfers well to a different data set and model before implementation, see details in Appendix C. Unlike CM2.6 which was on a B-grid, the shallow water model is discretized on an Arakawa C-grid.Therefore, at each time step of the integration, we first interpolate the two velocity components on tracer points and then pass them through our NN.This produces, for each grid box and for each component of the forcing, a mean where the , , C i j  are sampled according to i.i.d.standard normal distributions, and 96 is the number of points of the low-resolution grid along the x and y axes respectively.The field S  is then interpolated back to the u and v grid for the X and Y components, respectively, and used as the value of the subgrid momentum forcing in the shallow water model.
We ran the following simulations for 10 years each: a high resolution run (10 km), a coarse-resolution run (40 km) without parameterization, and three different ensemble members of the coarse-resolution run with the stochastic parameterization.The parameterized simulations are stable and produced a physically consistent state without any tuning or scaling factor.The kinetic energy of the flow is improved: both the mean and the standard deviation are very close to the high-resolution simulation (Figure 8).Similarly to Zanna and Bolton (2020), the variance of the velocity fields (not shown) and sea surface height (Figure 9) are vastly improved by the parameterization.However, changes in the mean velocity are rather small (not shown).We believe that the simplicity of the shallow water model used in the present study is at the core of the lack of substantial improvement in the mean flow and our parameterization will need to be tested in a more complex model.Unlike Zanna and Bolton (2020), no physical constraint was imposed when learning the NN parameterization in our study; yet, we do not observe any drift in the model.Despite using zero-padding GUILLAUMIN AND ZANNA 10.1029/2021MS002534 14 of 25  during the implementation, the solutions near the boundaries are not strongly impacted.Overall, the coarse resolution stochastic simulations are 25% slower than the unparameterized runs but more than 40 times faster than a high-resolution simulation at 10 km on the same CPU.However, this statement is to be taken with care as the high-resolution simulation was not optimized.

Discussion
Current parameterizations of ocean and atmosphere processes can lead to biases in climate models and remain a large source of uncertainty in climate projections.Therefore, harnessing state-of-the-art Deep Learning and statistical methods to improve parameterizations of subgrid processes has recently raised a lot of interest (Bolton & Zanna, 2019;Rasp et al., 2018;Yuval & O'Gorman, 2020;Zanna & Bolton, 2020).Here, we have demonstrated the potential of Deep Learning approaches for the problem of ocean momentum subgrid parameterizations using data generated by a realistic coupled climate model, as opposed to data from idealized ocean-only quasi-geostrophic or primitive equation simulations (Bolton & Zanna, 2019;Zanna & Bolton, 2020).
The use of data from realistic coupled climate models to train Deep Learning is nontrivial due to the size of the data, the use of the tripolar irregular spherical grid, and the coupling between the ocean and the atmosphere.Here, we establish a filtering and coarse-graining procedure to diagnose the subgrid momentum forcing in a global model and show that using only a limited number of subdomains, we can train a NN to skillfully predict the subgrid momentum forcing over the global oceans, and in a different climate with increased 2 CO levels.However, there are several remaining challenges.
We have shown that the offline skill of the predictions is lower in regions where sea-ice is present.Therefore, to improve parameterizations of ocean mesoscale eddies in these regions, it might be necessary to acquire data that can faithfully represent these interactions with sea-ice and likely with the atmosphere in polar regions.Another outstanding challenge is to diagnose and train for subgrid momentum forcing in grid boxes near continents.Our NN accounts for some nonlocal turbulent interaction (Frederiksen, 1999;Kraichnan, 1959), however the influence of large-scale far-field on the subgrid momentum forcing could potentially be improved by increasing the size of the subdomain used for training.The current size of the subdomain was chosen for convenience, by considering the best skill with the smallest amount of data.In addition, learning the standard deviation (in addition to the mean) of the subgrid parameterization takes into account the missing nonlocal interactions to some extent (and any other missing feedbacks in the learning procedure).While larger subdomains may perform better offline, training might require larger amounts of data to avoid overfitting and may not translate into a better online implementation.Yet, other structures such as the U-net (Ronneberger et al., 2015) allow for interactions with larger scales.Our experiments with a U-net structure have not yet led to any improvements in the subgrid momentum forcing parameterization; though further investigation might be required.
To improve the generalization of data-driven parameterizations, simulations spanning the phase space of turbulence (different forcing, viscosity, etc.) could be used to train Machine Learning models and potentially offer a window into the theoretical underpinning of subgrid models (Kitsios et al., 2016;Porta Mana & Zanna, 2014).These Machine Learning models could be implemented more generally in climate model simulations.In addition,  could alleviate these issues, these models cannot be run at global scales or include all components of the climate system, all of which influence the eddy momentum forcing.Satellite observations of sea surface height could be used to fine-tune our trained NN via transfer learning.Yet, using satellite data comes with its own challenges due to sparse spatio-temporal sampling and error due to the interpolation of the gridded datasets.These challenges could be overcome in the near future as higher resolution satellite data becomes available and improved gridding procedures are being developed.
We only used information from the surface layer to train for subgrid momentum forcing (since subsurface velocity data is not publicly available).Training a NN model for each layer of a climate model would likely be limited by computational power and storage (of the NN parameters) and of limited scientific use.Instead, a single NN model of similar complexity to the one presented in this manuscript can be trained using data from all layers of a coupled model (Bolton & Zanna, 2019;Zanna & Bolton, 2020).While the current NN might generalize well to other layers, we might expect degradation in skill in regions that interact strongly with bathymetry, for example.
The NN presented here was trained to predict the parameters (mean and standard deviation) of a Gaussian probability distribution at each grid box, therefore providing a probabilistic prediction of the subgrid forcing with Deep Learning.This probabilistic approach attempts to account for both the uncertainty in the mapping between the coarse velocity field and the subgrid forcing, uncertainty in any missing interaction not included in the learning, and uncertainty in the data itself.Stochasticity has been shown to improve model bias and produce more reliable ensemble predictions (Berner et al., 2017).Besides, while most current Deep Learning approaches to the parameterization of subgrid processes have been deterministic, a stochastic approach could be key when it comes to online implementations.Several Deep Learning implementations of parameterizations trained offline have resulted in poor stability properties or unrealistic flows in online simulations.Stochasticity could provide one approach to potentially solve this issue as shown in previous work (Palmer, 2012;Zanna et al., 2017).Using an idealized shallow water model, we showed that implementing our stochastic parameterization results in stable simulations and produces a realistic flow without any tuning.However, while the stochastic parameterization vastly improved some metrics (mean and variance of the kinetic energy), the impact on other metrics were only modest (e.g., zonal velocities).
The probabilistic approach presented here to learning the subgrid forcing remains simple and could be applied to parameterizing other processes.Yet it could benefit from more advanced probabilistic modeling.While we limited ourselves to conditionally i.i.d.Gaussian distributions, our analysis of residuals shows that representing higher moments could lead to a better representation of the distribution of subgrid forcing.In addition, we do not account for the uncertainty associated with the parameters of the NN (Jospin et al., 2020).While Bayesian NNs remain computationally more expensive, recent progress on that front could be an interesting avenue of investigation, and may provide additional assurance compared to single outputs.
Finally, combining closed-form parameterizations with stochastic Deep Learning approaches could be another fruitful avenue.For instance, it would be possible to predict the mean forcing via a closed-form equation, such as done by Zanna and Bolton (2020) using equation-discovery methods, while representing higher-order moments via a probabilistic Deep Learning approach similar to that proposed in this manuscript.This approach could improve our understanding of missing processes and their representation in climate models.While the effects of Deep Learning subgrid parameterizations on climate projections remain to be ascertained, the benefits of Deep Learning could be greater if they are used to understand processes from a probabilistic perspective.

Figure 1 .
Figure 1.Filtered and coarse-grained surface velocity u in [ / m s] from piControl used as training data: (a) standard deviation of surface velocity norm and (b) snapshot of the zonal component.The gray rectangles identify the subdomains selected for training the Neural Network (NN).

Figure 2 .
Figure 2. Our neural network (NN) outputs four maps: the two first maps are the maps of the means of the predicted forcing components, the last two maps are the standard deviation of the predicted forcing components.
C i j  for the time-mean MSE of the forcing, where the reduction is carried out along the time axis, i.e.
855; the skill demonstrates the high performance of our NN and further confirms that the NN does not merely predict large variations due to seasonal climatology.The global 2

Figure 3 .
Figure 3. Time-mean variance of the norm of subgrid momentum forcing in piControl: (a) True forcing S ‖ ‖; (b) predicted mean, ˆmean S ‖ ‖ in offline

Figure 5 .
Figure 5.Time series of the zonal component of the subgrid momentum forcing at (a) 30 ,60 N W   , a location dominated by turbulent behavior and (b) 20 ,104 S W   , a more quiescent location for 300 days: true forcing (solid blue), mean of the predicted forcing (orange), and 95% confidence interval (green).

Figure 6 .
Figure 6.Relative difference between piControl and 1ptCO 2 in the standard deviation of the (a) surface velocity norm and (b) subgrid forcing norm.Positive (negative) values indicate that the variance has increased (decreasing) in the 1ptCO 2 compared to the piControl.
stochastic subgrid momentum forcing S  implemented in the shallow water model is then generated (see schematics in Figure10) according to,

Figure 8 .
Figure 8. Kinetic energy [ 2 2 / m s ] (a) time series, and (b) histogram for the low resolution unparametrized simulation at 30 km (blue), low resolution parameterized ensemble member simulations (green, orange, red), and filtered + coarse-grained high-resolution simulation (purple).In panel (b) the solid vertical lines indicate the mean and the dashed vertical lines the standard deviation of the simulated kinetic energy; only one ensemble member is shown, but the other ensemble members produce similar statistics.

Figure 9 .
Figure 9.Standard deviation of sea surface height [ ] m for (left) the low resolution simulation; (middle) one ensemble member from the parameterized versions; (right) the high simulation.

Figure 10 .
Figure 10.Procedure for generating the stochastic parameterization (Equation18) implemented in the coarse resolution idealized model, based on the trained neural network (NN).

Figure D3 .
Figure D3.Time series of the true (solid blue) zonal component of the subgrid momentum forcing, mean zonal component of our neural network (NN) (orange), and 95% confidence interval (green), at 29 ,129 N W   .This location was selected within the region on the West coast of the United States where the 2 R is lower; the lower

Figure D4 .
Figure D4.Sample log-probability distribution of true (purple) and stochastically simulated forcing (green) in the control simulation, for both components-zonal (left) and meridional (right).The histograms are shown on a log scale due to the hyperbolic-type distribution of the forcing.

Figure D5 .
Figure D5.Distribution analysis of normalized residuals (Equation 16) of subgrid momentum forcing in the control simulation.(a) Sample distribution (blue) along with the probability density function of the standard normal distribution (red), (b) QQ plot (blue) of normalized residuals against the standard normal distribution, and line (green) defined by y x  .