Parameterizing Vertical Mixing Coefficients in the Ocean Surface Boundary Layer Using Neural Networks

Vertical mixing parameterizations in ocean models are formulated on the basis of the physical principles that govern turbulent mixing. However, many parameterizations include ad hoc components that are not well constrained by theory or data. One such component is the eddy diffusivity model, where vertical turbulent fluxes of a quantity are parameterized from a variable eddy diffusion coefficient and the mean vertical gradient of the quantity. In this work, we improve a parameterization of vertical mixing in the ocean surface boundary layer by enhancing its eddy diffusivity model using data‐driven methods, specifically neural networks. The neural networks are designed to take extrinsic and intrinsic forcing parameters as input to predict the eddy diffusivity profile and are trained using output data from a second moment closure turbulent mixing scheme. The modified vertical mixing scheme predicts the eddy diffusivity profile through online inference of neural networks and maintains the conservation principles of the standard ocean model equations, which is particularly important for its targeted use in climate simulations. We describe the development and stable implementation of neural networks in an ocean general circulation model and demonstrate that the enhanced scheme outperforms its predecessor by reducing biases in the mixed‐layer depth and upper ocean stratification. Our results demonstrate the potential for data‐driven physics‐aware parameterizations to improve global climate models.


Introduction
Vertical mixing parameterizations used in ocean general circulation models (OGCMs) represent the effects of unresolved processes on the mean state. These parameterizations have theoretical deficiencies stemming from an incomplete understanding of inadequately represented or missing processes. To overcome this deficiency, parameterizations often require ad hoc or empirical modifications, either to approximate the missing processes or to fit data. Vertical mixing schemes can be constructed with various assumptions, and different schemes are calibrated differently. These inconsistencies cause the schemes to disagree among themselves (Li et al., 2019) and are a major source of model uncertainty (Hawkins & Sutton, 2009; Huber & Zanna, 2017; Fox-Kemper et al., 2019; Todd et al., 2020; Gutjahr et al., 2021). Poorly parameterized mixing can produce errors that accumulate over time, leading to biases in the OGCM.
New approaches are emerging to improve various parameterizations in ocean and atmosphere models using machine learning. We have applied neural networks, a type of machine learning, to improve a vertical mixing parameterization of the ocean surface boundary layer (OSBL). The OSBL is a vital region of turbulence in the ocean: it acts as an interface between the atmosphere and the deeper ocean, so it is important to represent mixing in the OSBL accurately. The atmosphere energizes the ocean through the OSBL. Mass, tracers, and momentum are transferred between the atmosphere and the deep ocean via the OSBL, and inaccuracies in vertical mixing parameterizations can give rise to uncertain estimates of heat transport, sea level rise, ocean carbon uptake, and other quantities. Including missing processes in upper ocean vertical mixing schemes impacts large-scale phenomena; for example, accounting for Langmuir turbulence and submesoscale effects in the OSBL improves simulations of the Indian monsoon (Orenstein et al., 2022).
1.1 Modeling vertical diffusivity within Ocean Surface Boundary Layer (OSBL) parameterizations and the assumption of a 'universal' shape function

We focus on the energetic Planetary Boundary Layer (ePBL) scheme, a first-order OSBL turbulent mixing parameterization as described in Reichl & Hallberg (2018) (see Section 2). The variation of the vertical diffusivity profile κ_ϕ (of an arbitrary scalar ϕ) within the OSBL in ePBL and similar first-order schemes can be expressed as a diffusivity scale multiplied by a prescribed normalized diffusivity profile (i.e., shape function):

κ_ϕ(σ) = K_ϕ g(σ),    (1)

where the diffusivity scale K_ϕ is often decomposed into a velocity and a length scale (Large et al., 1994), g(σ) is a dimensionless shape function, and σ = z/h is a dimensionless vertical coordinate, where z is the vertical coordinate and h is the depth of the boundary layer. OSBL parameterizations that follow this approach traditionally assume that g(σ) is a universal function, or has a fixed component such as a cubic polynomial that does not change (O'Brien, 1970; Large et al., 1994), and is therefore ad hoc. In the KPP scheme of Large et al. (1994), a cubic polynomial is multiplied by a vertically varying turbulent velocity that sets the structure of κ_ϕ. The cubic polynomial is universal, whereas the turbulent velocity mostly affects the surface layer defined by the region 0 < σ < 0.1, making the cubic structure dominant below the surface layer. In the ePBL scheme (Reichl & Hallberg, 2018), κ_ϕ follows a similar design. However, there is no physics-based justification for a universal or ad hoc profile to exist, and it is widely understood that the characteristics of boundary layers can vary considerably with forcing conditions (Li et al., 2019). We hypothesize that capturing variations of the shape function that are not considered in first-order OSBL schemes such as ePBL will improve the overall representation of vertical mixing in ocean models. In the subsequent text, our usage of the term 'universal shape function' includes shape functions that involve some ad hoc components or approximations, such as those used in the ePBL scheme (see Section 2).
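As a concrete illustration, the fixed-shape approach of Eq. (1) can be sketched in a few lines. The exponent value used here is purely illustrative, not a calibrated ePBL or KPP value:

```python
import numpy as np

def shape_function(sigma, gamma=2.0):
    """Ad hoc 'universal' shape function g(sigma) = sigma * (1 - sigma)**gamma.

    gamma is a tunable exponent; the value 2.0 is illustrative only.
    g vanishes at the surface (sigma = 0) and boundary layer base (sigma = 1)."""
    return sigma * (1.0 - sigma) ** gamma

def diffusivity_profile(kappa_scale, sigma, gamma=2.0):
    """kappa(sigma) = diffusivity scale times prescribed shape function (Eq. 1)."""
    return kappa_scale * shape_function(sigma, gamma)

sigma = np.linspace(0.0, 1.0, 17)
kappa = diffusivity_profile(kappa_scale=0.1, sigma=sigma)  # m^2/s, illustrative
```

Because the same g(σ) is applied under every forcing regime, any dependence of the profile shape on forcing is lost, which motivates the data-driven replacement developed below.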
1.2 Second Moment Closure and an alternative to the 'universal' shape function

Second Moment Closure (SMC) is an alternative approach to predict vertical diffusivity profiles within the OSBL (Rodi, 1987; Umlauf & Burchard, 2005). SMC does not require a shape function because it instead predicts the diffusivity from the turbulent kinetic energy (k) and the turbulent length scale (l_t). Various SMC approaches exist to predict k and l_t, and a general formulation to infer diffusivity is expressed as:

κ_ϕ = c_ϕ k^{1/2} l_t,    (2)

where c_ϕ represents the model stability functions (Umlauf & Burchard, 2005). SMC predicts a profile of vertical diffusivity based on models of physical processes that drive turbulent fluxes within the OSBL; it does not prescribe a shape function a priori. However, since SMC directly evaluates a diffusivity profile, the implied shape function and diffusivity scale can be diagnosed from the output. The implied shape function differs significantly from a universal shape function, as seen in Figure 1. The diagnosed shape function and diffusivity scale from SMC can then be used to build a model for use in ePBL. We selected SMC over large eddy simulation as our "truth" because it is inexpensive by comparison, allowing straightforward generation of a training dataset spanning a wide range of forcing regimes. Such a large dataset is required for machine learning applications.
A natural question is why SMC is not used directly instead of ePBL in OGCMs. It remains impractical to use SMC directly for vertical mixing in climate simulations because its predictions are sensitive to the long time steps and coarse vertical grids often used in climate models (see Reichl & Hallberg, 2018). However, using the framework described in this article, ePBL can yield a closer approximation to the vertical diffusivity from the SMC scheme without sensitivity to the model's vertical resolution and time step. Our neural network approach allows ePBL to consider the physics-based variation in the shape function seen in SMC due to solving for k and l_t. This variability in the shape function leads to different profiles of vertical mixing within ePBL than a prescribed universal profile would give.

Machine learning is an emerging tool to improve OGCMs
Consider a physics-based parameterization that gives an output Ψ as some functional relationship F between physical quantities x:

Ψ = F(x).    (3)

Finding F is an optimization problem. It can be set up as an optimal linear fit to some combination of x, but the fit might not work for different regimes, or might implicitly depend on higher-order combinations of terms in x (nonlinearity) or on other neglected terms.
F can be assumed to be a function of non-dimensional parameters, which requires onerous fitting. With machine learning, F can instead be a function of multiple combinations of parameters:

Ψ = N(x; w),    (4)

where N is a machine-learning function, x = (x_1, x_2, ...) is the input vector, and w are parameters (weights and biases). Machine learning involves determining (learning) the values of w and tuning the hyperparameters that give the optimal N (Brenner et al., 2019), which is becoming routine due to advances in training algorithms. The machine learning approach provides an avenue to include as many relevant parameters as desired in the vector x, which has been a significant challenge for traditional physics-based approaches.
Machine learning is favorable for the development and application of climate models due to the abundance of optimization algorithms and hardware (Balaji et al., 2022; Christensen & Zanna, 2022). Studies show that neural networks can be used in idealized model configurations, and recently the use of machine learning has emerged in realistic GCMs. Artificial neural networks have been shown to improve sub-grid momentum transport in atmospheric models (Yuval & O'Gorman, 2021) and to predict precipitation (Shamekh et al., 2022) and fluxes (Shamekh & Gentine, 2023), while in ocean models they have been used to improve the parameterization of free convection (Ramadhan et al., 2023). Liang et al. (2022) applied deep neural networks to predict temperature and salinity evolution in the OSBL at a weather station (Station Papa). Partee et al. (2022) trained a deep neural network to learn the subgrid kinetic energy of oceanic mesoscale eddies from a high resolution OGCM to improve their representation in a lower resolution OGCM. Convolutional neural networks (CNNs) have been used to predict parameterizations of ocean momentum backscatter in a variety of models (Bolton & Zanna, 2019; Zanna & Bolton, 2020; Guillaumin & Zanna, 2021) and have been implemented in an ocean primitive equation model (Zhang et al., 2023). Gregory et al. (2023) recently employed CNNs to learn data assimilation increments for sea ice and showed that the networks could reduce biases in sea ice.
Apart from neural networks, other techniques in the machine learning toolbox show potential to improve GCMs. The random forest algorithm has been used to parameterize moist convection (O'Gorman & Dwyer, 2018) and to learn small-scale processes from a high resolution atmospheric model (Yuval & O'Gorman, 2020). Mansfield & Sheshadri (2022) used a Gaussian process emulator to tune a gravity wave parameterization in an intermediate complexity atmospheric GCM. Souza et al. (2020) used a Bayesian technique to fine-tune the non-local flux terms of the KPP parameterization of Large et al. (1994).
The aforementioned examples show the potential of enhancing conventional physics-based schemes using machine learning techniques. This article draws inspiration from these demonstrations, recognizing the promise of machine learning in advancing ocean model parameterizations and prompting further investigation in this area.

Outline: using Neural Networks and output from SMC to improve ePBL
Artificial neural networks (ANNs) are trained using output from SMC, which directly predicts the profile of vertical diffusivity and does not rely on ad hoc shape functions. As neural networks are powerful approximators, they can model the variability in the vertical diffusivity profiles of the SMC, but we formulate the ANNs to fit within the simplified framework of the first-order ePBL approach. Our procedure has the following advantages:
1. We use the neural networks to modify the vertical diffusion term within ePBL instead of directly predicting turbulent flux time tendencies (e.g., of temperature and salinity), guaranteeing that the scheme conserves physical quantities.
2. The neural networks are introduced in a manner that does not interfere with the potential energy-based mixing constraints of the original ePBL scheme, so ePBL's robust numerical implementation is preserved.
3. The ANNs predict quantities used to compute the diffusivity: the non-dimensional structure (shape function) and a turbulent velocity, which simplifies training, implementation, and interpretability versus directly predicting the diffusivities.
4. The ANNs yield strictly positive values of the vertical diffusivity, an important consideration for numerical stability (see Section 3.4.2).
5. Our ANNs are as small as possible to balance accuracy and computational cost, as they will be used in climate-timescale OGCM simulations.
We structure the article as follows. Section 2 describes the ePBL scheme and briefly addresses a calibration/tuning problem. Section 3 gives details of the network structure and describes the data used to train the networks, with estimates of uncertainty. Section 4.1 provides details on implementing the enhanced ePBL scheme, hereafter called ePBL NN. The new improvements in ePBL NN are demonstrated online using free-running single-column model experiments (Section 4.2), and their impact on biases in an existing ocean-ice climate model is assessed in Section 4.3. We conclude with a summary and discussion of the broader implications of this work for applying machine learning to improve parameterizations in ocean climate models.
2 A Physics-Based Vertical Mixing Framework: The energetic Planetary Boundary Layer (ePBL)

The ePBL framework, as described by Reichl & Hallberg (2018), is designed for climate applications of OGCMs and emphasizes solutions that are robust to changes in model time stepping and vertical resolution. The scheme is simple enough to implement efficiently within the implicit diffusion solvers often used in OGCMs while maintaining important physical constraints on ocean mixing. The ePBL scheme performs with high skill in idealized models and OGCMs (Reichl & Li, 2019; Li et al., 2019), and has been implemented in the NOAA Geophysical Fluid Dynamics Laboratory's MOM6-based climate models: OM4, CM4, and ESM4 (Adcroft et al., 2019; Held et al., 2019; Dunne et al., 2020). ePBL builds on the paradigm of bulk mixed layer models (Kraus & Turner, 1967; Niiler, 1977), which constrain the boundary layer depth (h) via the energetic implications of vertical mixing. The scheme therefore constrains the mixing by parameterizing the rate at which turbulent kinetic energy is converted to potential energy within the OSBL:

\int_{-h}^{0} \overline{w'b'} dz = G(f, u*, B0, h, ...).    (5)

Here, h is the (positive) depth of the boundary layer as defined in Reichl & Li (2019), \overline{w'b'} is the vertical turbulent buoyancy flux, the overbar represents an averaging procedure (e.g., over ensembles), and G is a relation that depends on the Coriolis parameter f, the surface friction velocity u*, the surface buoyancy flux B0 = \overline{w'b'}_0, the boundary layer depth h, and the integrated release of potential energy by convective buoyancy fluxes. Reichl & Hallberg (2018) find G using simulations from single column models using SMC under a range of forcing scenarios. Later, this function was enhanced to include Langmuir turbulence using large eddy simulations (LES) (Reichl & Li, 2019).
ePBL extends the bulk mixed layer formulation to resolve the vertical structure within the OSBL by applying a down-gradient flux profile using the vertical diffusivity:

\overline{w'ϕ'} = -κ_ϕ ∂ϕ/∂z,    (6)

where κ_ϕ is the variable diffusivity of a scalar ϕ. The diffusivity varies with depth and is given in the following form:

κ_ϕ(σ) = L(σ) v0(σ),    (7)

where L and v0 are length and velocity scales. In the present implementation of ePBL (Reichl & Hallberg, 2018), the turbulent Prandtl number is assumed to be one, and hence the diffusivity and viscosity are modified identically. Both L and v0 are expressed as functions of position σ within the boundary layer. The length scale in equation (7) is set as (following O'Brien, 1970; Large et al., 1994):

L = h σ(1 - σ)^γ.    (8)

By assuming a fixed constant for γ, the expressions given by (7) and (8) may be expressed in the same form as (1), which reveals the role of the shape function as g(σ) = σ(1 - σ)^γ. However, γ need not be a fixed constant; constructing γ as a data-driven function is challenging, and the form σ(1 - σ)^γ has no physical basis. The velocity scale v0(σ) uses a similar formulation, motivated to agree generally with the k-ϵ model (see Eqs. 43-45 in Reichl & Hallberg, 2018).
Although the integrated mixing in ePBL is constrained via the function G, the stratification resulting from the mixing is sensitive to the assumptions for γ and v0 that set the diffusivity profile within the boundary layer. Differences in the diffusivity profile mean that even when the energetic constraints are accurate, inconsistent OSBL evolution and stratification can emerge when comparing ePBL with an SMC such as k-ϵ (see Figures 6, 7). In this article, we enhance the physics-based ePBL approach by improving these velocity- and length-scale formulations with artificial neural networks (ANNs, see Section 3).

Artificial Neural Networks and Training Procedures
Artificial Neural Networks (ANNs) are one of the most widely used forms of machine learning models. ANNs are universal approximators and can find hidden nonlinear relations between quantities (Cybenko, 1989; Hornik et al., 1989; Hornik, 1991). In this section, we describe the fundamentals of ANNs and provide details of the training procedures for the neural networks used to supplement ePBL's eddy diffusivity model.

Fundamentals
ANNs consist of nodes arranged in layers. Nodes are elements of a vector x that constitute a layer; see Figure 2 for a schematic. Each vector is connected to its adjacent vector via a transformation that involves multiplying by coefficients, called weights w, and adding offsets, called biases b. After transforming the vector with weights and biases, a nonlinear operation yields the next vector (or layer). The nonlinear operator is an activation function A. For an input layer consisting of a vector x1, one hidden layer x2, and an output layer y, the ANN can be written as

x2 = A(w1 x1 + b1),
y = w2 x2 + b2.    (9)

Deeper networks are expanded versions of Equation set 9, obtained by adding additional layers. ANNs can capture nonlinear relationships within certain tolerances and can interpolate with high accuracy within the range of the training data. We employ ANNs to learn the nonlinear relationship between chosen input parameters, described below, and the vertical diffusivity profile predicted using SMC.
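The forward pass of Equation set 9 can be sketched directly in NumPy. The layer widths here (4 inputs, 32 hidden nodes, 16 outputs) match the N1 architecture described later, but the random weights are placeholders for trained values, and the linear output layer is an assumption of this sketch:

```python
import numpy as np

def relu(x):
    """ReLU activation A(x) = max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)

def mlp_forward(x1, W1, b1, W2, b2):
    """One-hidden-layer ANN (Equation set 9):
    x2 = A(W1 @ x1 + b1),  y = W2 @ x2 + b2,
    with A = ReLU on the hidden layer and a linear output for regression."""
    x2 = relu(W1 @ x1 + b1)
    return W2 @ x2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 4)), np.zeros(32)   # 4 inputs -> 32 hidden nodes
W2, b2 = rng.normal(size=(16, 32)), np.zeros(16)  # 32 hidden -> 16 outputs
y = mlp_forward(rng.normal(size=4), W1, b1, W2, b2)
```

A deeper network simply chains additional (W, b, A) stages between x2 and the output.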
3.2 Learning Diffusivity using two Neural networks N1 and N2

To train the ANN model to predict the diffusivity profile, we use the sigma coordinate defined as σ = z/h; at the surface σ = 0, and at depth h, σ = 1. We define the diffusivity in the sigma coordinate in terms of a velocity scale v0, the boundary layer depth h, and the nondimensional shape function g:

κ_ϕ(σ) = v0 h g(σ),    (10)

where g(σ) is defined to give values between [0, 1]. We could have introduced vertical structure in several or all of the terms on the right-hand side of Equation 10. Instead, we use only g(σ) to provide vertical structure, because we found it more convenient to train one vertically varying function than two or more. The benefit of adopting the sigma coordinate is that it removes the dependence on the vertical coordinate (e.g., grid spacing in z) that varies between ocean models. This allows us to train and infer (feed forward) without depending on the model's vertical grid, which makes it practical to implement in an ocean model with an adaptive vertical grid (e.g., Bleck, 2002).
The velocity scale v0 in Equation (10) does not vary with σ; the entire vertical structure of κ_ϕ is captured by g(σ) alone. This is in contrast to Equation (7), where both the length scale and the velocity scale vary in the vertical direction and contribute to the vertical structure of κ_ϕ. We made this choice to simplify the approach so that only one neural network is needed to capture the vertical structure of κ_ϕ.
We choose to define the shape function and velocity scale using two separate neural networks:

g(σ) = N1(f, B0, u*, h),    v0 = N2(f, B0, u*),

where N1 and N2 represent two distinct neural networks that are trained independently. N1 requires the inputs f, B0, u*, and h, while N2 is found to depend on f, B0, and u*. We chose this strategy rather than combining the two outputs into one ANN for two reasons. First, it is straightforward to cleanly diagnose g(σ) and v0 from the data, as explained in Sections 3.4 and 3.5. Second, we anticipate that separate networks will be easier to interpret individually, allowing us to better understand the physical processes modeled by each network.
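Composing the two networks into a diffusivity profile via Equation 10 can be sketched as follows. The `fake_n1` and `fake_n2` callables are hypothetical placeholders standing in for the trained networks, and the forcing values are arbitrary illustrative numbers:

```python
import numpy as np

def kappa_from_networks(n1, n2, f, B0, ustar, h):
    """Combine the two networks into a diffusivity profile (Eq. 10):
    g(sigma) = N1(f, B0, u*, h) on the 16 interior sigma nodes,
    v0       = N2(f, B0, u*)    (a scalar velocity scale),
    kappa(sigma) = v0 * h * g(sigma).

    n1 and n2 may be any callables with these signatures; here they stand
    in for the trained PyTorch networks."""
    g = np.asarray(n1(f, B0, ustar, h))   # shape function, length-16 vector
    v0 = n2(f, B0, ustar)                 # scalar velocity scale
    return v0 * h * g

# Placeholder "networks" (illustrative only, not trained models):
fake_n1 = lambda f, B0, ustar, h: np.linspace(0.1, 0.9, 16)
fake_n2 = lambda f, B0, ustar: 0.02
kappa = kappa_from_networks(fake_n1, fake_n2, f=1e-4, B0=-1e-7, ustar=0.01, h=50.0)
```

Because v0 and h are scalars, all vertical structure of κ_ϕ indeed comes from the N1 output alone, mirroring the design choice discussed above.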
Both neural networks N1 and N2 are trained using the PyTorch package (Paszke et al., 2019). The Rectified Linear Unit (ReLU) (Nair & Hinton, 2010) is used as the activation function due to its simplicity and rapid convergence in training.
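The actual networks are trained with PyTorch, but the underlying procedure (full-batch gradient descent on a one-hidden-layer ReLU network with a mean-squared-error loss) can be sketched self-containedly in NumPy. The data here are synthetic stand-ins for the (forcing, shape function) pairs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data standing in for (forcing inputs -> 16-node targets).
X = rng.uniform(-1, 1, size=(256, 4))
Y = np.tanh(X @ rng.normal(size=(4, 16)))          # synthetic targets

# One-hidden-layer ReLU network, trained by full-batch gradient descent.
W1, b1 = rng.normal(0, 0.1, (4, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, (32, 16)), np.zeros(16)
lr = 0.05

def forward(X):
    H = np.maximum(X @ W1 + b1, 0.0)               # ReLU hidden layer
    return H, H @ W2 + b2                          # linear output layer

losses = []
for _ in range(500):
    H, P = forward(X)
    E = P - Y                                      # dLoss/dP for (MSE / 2)
    losses.append(0.5 * (E ** 2).mean())
    # Backpropagation of the loss gradient through both layers:
    gW2, gb2 = H.T @ E / len(X), E.mean(axis=0)
    dH = (E @ W2.T) * (H > 0)                      # ReLU gradient mask
    gW1, gb1 = X.T @ dH / len(X), dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

In practice, PyTorch's autograd and optimizers replace the hand-written backpropagation, but the learning principle is the same.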

Data for training
The SMC data used to train the networks is generated using the single column model framework implemented in the General Ocean Turbulence Model (GOTM; Umlauf & Burchard, 2005; Umlauf et al., 2014). GOTM provides numerous SMC options to predict the turbulent fluxes and the vertical diffusivity. We employ a two-equation model, k-ϵ, with stability function closure following Schumann & Gerz (1995). This specific SMC parameterization is chosen for consistency with Reichl & Hallberg (2018). Vertical mixing parameterizations remain an active research topic, and currently used schemes, including SMC, can exhibit biases in different forcing regimes and regions (Peters & Baumert, 2007; Li et al., 2019; Damerell et al., 2020; Sane et al., 2021; Tirodkar et al., 2022). Any biases in the training data are inherited by the neural networks. However, our neural networks can be retrained using the output of different mixing schemes, including improved schemes developed in future research.
The GOTM column model consists of a vertical grid with forcing applied at the surface grid point. It is applicable to flows with horizontal homogeneity, i.e., horizontal fluxes are zero or constant. GOTM simulations are performed by varying the following parameters: latitude (Coriolis), surface wind stress (surface friction velocity), and surface heat flux (surface buoyancy flux). Salinity is kept constant and temperature is the only active tracer, though the results are general for any combination of buoyancy fields and forcing. Our initial analysis indicates that the diffusivity of k-ϵ, κ_kϵ, depends on the Coriolis parameter f, the surface buoyancy flux B0, the surface friction velocity u*, and the depth of the boundary layer, h. We can only specify f, B0, and u* in single column simulations; h is diagnosed from the time evolution simulated by GOTM.
Each GOTM case runs with a set of constant forcings. The time step is set at 60 s, and the vertical grid spacing is 1 m. The depth of the column is 800 m. The simulation results for the k-ϵ model are converged at this time step and resolution (see Figure 1 in Reichl & Hallberg, 2018). The initial conditions consist of zero horizontal velocity, a surface temperature of 20 °C, and an initial temperature stratification of 0.005 °C/m. Data was saved at hourly intervals. For every f, B0, and u*, we included one hundred instantaneous profiles of diffusivity at each hour from day 2 to day 6 in the training dataset.
We found (f, B0, u*, h) to affect diffusivity strongly compared to the background stratification established by the initial conditions. Stratification acts as a barrier to the deepening of the mixed layer; it is therefore challenging to obtain deep boundary layers when there is strong stratification at the base of the mixed layer, which limits the generation of training data spanning a wide range of h. We therefore chose a weak initial stratification. In k-ϵ, the effects of stratification on diffusivity most directly impact the rate of deepening of the boundary layer (which is already captured by the energetic constraint of ePBL) rather than the shape function itself.
Table 1 shows the range of forcing parameters of the training data. The forcing range differs between N1 and N2 because we found that the shape function does not vary significantly outside the range stated in Table 1. Hence we do not train on data outside that range, and the inputs to the network can be capped inside the mixing scheme in MOM6. For example, if the wind stress is 1.3 N/m², capping prevents the input from exceeding 1.2 N/m², as the shape function does not vary significantly beyond 1.2 N/m². A similar argument applies to the surface heat flux. The range selected to perform the sweep was informed by the observed forcings in the JRA atmospheric reanalysis data set (Tsujino et al., 2018), and covers most of the observed forcing space, as explained in Appendix B. For h, the largest variations in g(σ) were observed between 20 m and 300 m; beyond 300 m, g(σ) varies only marginally.

Randomizing the training data and splitting it into two sets (train and validation) could place very similar elements from similar experiments in both the training and the validation data. This is undesirable, since a fully independent validation dataset is required to monitor overfitting when training a neural network. To prevent this issue, a validation data set, 10% the size of the training data, is generated using a fully independent set of forcing parameters. No combination of (f, B0, u*) is shared between the training and validation data sets, strictly ensuring their independence.

The parameters f, B0, u*, and h are inputs to N1, while the output consists of a vector of g(σ) values at 16 evenly distributed nodes, as shown in Figure 2. For each set of forcings (i.e., latitude, heat flux, and surface stress), the GOTM output consists of the evolution of the initial conditions into a developed boundary layer. The boundary layer deepens and variations in κ_kϵ emerge. Ignoring the initial ≈2 days of data, h is diagnosed from each model output at a frequency of 60 minutes by analyzing the profile of the vertical buoyancy flux. Here, h is defined as the depth at which \overline{w'b'} reaches and stays close to zero; this is the maximum extent to which the effect of surface forcing penetrates the upper ocean through the turbulent buoyancy flux. The diffusivity profile κ_kϵ(σ) is normalized by its maximum to find the shape function:

g(σ) = κ_kϵ(σ) / max[κ_kϵ(σ)].

The neural network cannot learn a continuous profile in σ; instead, we train it on a subsampled g(σ) grid consisting of 18 equally spaced σ points (0, 1/17, 2/17, ..., 16/17, 1). The endpoints are excluded from training because g(σ) = 0 at the surface (σ = 0), and at σ = 1, g(σ) → 0 and is hence assumed to be zero for training purposes. Therefore, the network predicts g(σ) at the 16 interior locations. Subsampling g(σ) to 18 evenly distributed points was found to be sufficient to capture the essential features of g(σ) while keeping the network small enough to implement later in an OGCM.
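The diagnosis of training targets from a diffusivity profile (normalize by the maximum, then subsample onto the equally spaced σ grid) can be sketched as follows; the profile used here is synthetic, standing in for GOTM k-ϵ output:

```python
import numpy as np

def diagnose_shape_function(kappa_profile, z, h, n_sigma=18):
    """Diagnose the shape function from a diffusivity profile:
    normalize by the profile maximum, then subsample onto n_sigma equally
    spaced sigma = z/h points. The 16 interior points are the training
    targets; g = 0 is assumed at sigma = 0 and sigma = 1."""
    sigma_data = np.abs(z) / h
    g = kappa_profile / kappa_profile.max()
    sigma_nodes = np.linspace(0.0, 1.0, n_sigma)
    g_nodes = np.interp(sigma_nodes, sigma_data, g)
    return sigma_nodes[1:-1], g_nodes[1:-1]        # the 16 interior nodes

# Synthetic k-epsilon-like profile on a 1 m grid down to h = 100 m:
z = np.arange(0.0, 100.0, 1.0)
kappa = 0.1 * (z / 100.0) * (1.0 - z / 100.0) ** 2
sig, g = diagnose_shape_function(kappa, z, h=100.0)
```

The same subsampling in σ makes profiles from different boundary layer depths directly comparable, which is what allows a single network to learn across the full range of h.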

Overcoming limitations in h using synthetic data
ANNs show high prediction skill when the input is within the range of the training data. GOTM experiments can cover a wide range of data points spanning latitude, heat flux, and surface wind stress, such as those historically observed in the real ocean. However, this is not true for h: it evolves prognostically, and we cannot set its range for each run. We have chosen h to vary from 20 to 300 m (see Table 1) for training purposes, but for some surface heating conditions the boundary layer depth saturates toward the Monin-Obukhov length L_MO, which may be less than 300 m. Since h is an input to the network, if for a particular case L_MO < h < 300 m, then no training profiles exist in that range, and the network would have to predict outside the range of the training dataset, possibly producing spurious profiles.
(Here, h is the boundary layer depth, which is evaluated using physical arguments by the OSBL vertical mixing parameterization in the OGCM.)
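The saturation depth under stabilizing surface buoyancy flux can be estimated from standard Monin-Obukhov scaling. The formula and sign convention below are the textbook form, not quoted from ePBL or GOTM, so treat this as an illustrative sketch:

```python
KVONK = 0.4  # von Karman constant

def monin_obukhov_length(ustar, B0):
    """Monin-Obukhov length L_MO = u*^3 / (kappa * B0), toward which the
    boundary layer depth saturates under a stabilizing surface buoyancy
    flux. Sign conventions for B0 vary between models; here B0 > 0 denotes
    a stabilizing (surface heating) flux, so L_MO > 0."""
    return ustar ** 3 / (KVONK * B0)

# e.g. u* = 1 cm/s with a weak stabilizing buoyancy flux of 1e-8 m^2/s^3:
L_mo = monin_obukhov_length(ustar=0.01, B0=1e-8)   # = 250 m
```

When L_mo falls below 300 m, the diagnosed h values for that forcing case never reach the top of the training range, which is exactly the coverage gap the synthetic data below is designed to fill.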
We address this issue by supplementing the training with synthetic data. For a particular case, if h saturates at, for example, 200 m, then an additional 10 profiles are added to cover the missing range of 200 m to 300 m in the training data. The shape function for these synthetic profiles is assumed to be the same as at h = 200 m; that is, g(σ) for h ∈ (200, 300) takes the same values as g(σ) at h = 200 m. This assumption is reasonable, since g(σ) was found to vary little for deeper boundary layers under surface heating.
Strong convection can cause a related issue due to rapid deepening of the boundary layer during the spin-up phase of the turbulent OSBL. This gap is filled in the same way as described for deep boundary layers: if the lowest value of h is, for example, 100 m, then ten profiles are added that cover 20 m to 100 m, with the shape function assumed to be the same as at h = 100 m. Filling these gaps in h is necessary to stabilize ANNs trained with our datasets; knowing the exact bounding box of the training data is imperative for a successful and stable implementation in a GCM.
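The gap-filling procedure described above can be sketched as follows. The data layout (a list of (h, g) pairs per forcing case) is an assumption of this sketch, not the actual training-pipeline format:

```python
import numpy as np

def pad_h_range(samples, h_min=20.0, h_max=300.0, n_pad=10):
    """Fill gaps in boundary-layer-depth coverage with synthetic profiles:
    if the diagnosed h values for a forcing case stop short of
    [h_min, h_max], replicate the shape function at the nearest covered h
    across the missing range, so the network never extrapolates in h.

    `samples` is a list of (h, g) pairs, g being the 16-node shape function."""
    samples = sorted(samples, key=lambda s: s[0])
    padded = list(samples)
    h_lo, g_lo = samples[0]
    h_hi, g_hi = samples[-1]
    if h_hi < h_max:   # e.g. h saturated at 200 m: pad 200-300 m with g(h=200)
        for h in np.linspace(h_hi, h_max, n_pad + 1)[1:]:
            padded.append((float(h), g_hi))
    if h_lo > h_min:   # e.g. rapid convective deepening skipped 20-100 m
        for h in np.linspace(h_min, h_lo, n_pad + 1)[:-1]:
            padded.append((float(h), g_lo))
    return padded

g = np.linspace(0.1, 0.9, 16)  # placeholder shape function
padded = pad_h_range([(100.0, g), (150.0, g), (200.0, g)])
```

After padding, every case spans the full 20-300 m bounding box in h, matching the capping applied to the other inputs inside MOM6.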

Forcing N1 to be strictly positive
The network N1 consists of 4 input nodes, two hidden layers, and 16 output nodes (sensitivity to the network hyperparameters is described in the next subsection). The four input nodes correspond to (f, B0, u*, h), and the output nodes predict the shape function as described above; the output of N1, g(σ), is a vector of length 16. If N1 predicted a negative value of the shape function at any σ, it would lead to negative diffusivity values. We prevent this by training on the logarithm of g(σ): N1 predicts log(g(σ)) and, at inference, the exponential function is applied. This ensures that the shape function is strictly positive.
The four inputs to the network (f, B0, u*, h) are normalized by the respective mean and standard deviation of the training data. Each of the 16 output nodes is likewise normalized by its own mean and standard deviation: for output node i, log(g(σ_i)) is transformed into

[log(g(σ_i)) - \overline{log(g(σ_i))}] / ⟨log(g(σ_i))⟩

before training, where the overbar denotes the mean and the angled brackets denote the standard deviation.
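The output-side transform pipeline (log for positivity, then per-node standardization, inverted at inference) can be sketched as a small helper class; the class itself and the random placeholder data are illustrative:

```python
import numpy as np

class OutputTransform:
    """Per-node transform applied to the N1 targets: take log(g) so that
    exponentiating at inference guarantees a strictly positive shape
    function, then standardize each of the 16 output nodes by its own
    training-set mean and standard deviation."""

    def fit(self, g):                     # g: (n_samples, 16) shape functions
        logg = np.log(g)
        self.mean = logg.mean(axis=0)
        self.std = logg.std(axis=0)
        return self

    def transform(self, g):               # produces training targets
        return (np.log(g) - self.mean) / self.std

    def inverse(self, y):                 # applied to raw network output
        return np.exp(y * self.std + self.mean)   # > 0 by construction

rng = np.random.default_rng(2)
g = rng.uniform(0.05, 1.0, size=(100, 16))        # placeholder shape functions
tr = OutputTransform().fit(g)
y = tr.transform(g)
```

Whatever value the network emits, `inverse` maps it through an exponential, so the diffusivity assembled from g(σ) can never become negative.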

Network Skill and hyperparameter sweep
To train N1, two hyperparameters need to be tuned: the number of hidden layers and the number of nodes in those layers. For simplicity, we chose the same number of nodes in each hidden layer. A sweep was performed to test the accuracy of different networks: we varied the number of hidden layers from 2 to 4 and the number of nodes in each hidden layer from 2 to 512. The training data was randomized and provided as a single batch to train the networks.
To measure the network's performance, the linear correlation coefficient between the validation data and its prediction was calculated (see Figure 3(a)). The linear correlations for the 16 nodes were averaged with weights given by the mean value of g(σ) of the training data. The weight-averaged correlation is a better estimate of the network's skill for a given set of hidden nodes, as it reduces the influence of noisy values at the bottom of the boundary layer. The noisy values might be due to interpolation of the shape function profiles from the GOTM data. Based on the hyperparameter sweep, we chose two hidden layers with 32 hidden nodes for N1, for which the average correlation is ≈ 0.9, reasonably close to that of more expensive networks. For networks deeper and wider than 32 nodes, the average correlation score does not vary significantly, but the cost of using the network in an OGCM increases. Figure 4 displays the performance of N1. The first column (a) shows the error statistics between the validation data and the network's prediction in the normalized space for each output node. The boxes show the interquartile range, while the whiskers show the 5th and 95th percentiles of the error. The second column (b) shows the same percentile range as in column (a) but in the physical space of g(σ). The medians are superposed over the mean g(σ) profile of the entire dataset. This helps to visualize the skill of N1 with respect to each σ value. Nodes 11 and 12 have a high error variance compared to other nodes. The error variances in column (a) differ from those in column (b) because the data have different variances along the nodes. The last node, 16, has a high variance in (a), but because the g(σ) values at that node are small, poor performance there does not penalize the overall performance of N1. Node 16 is in the transition layer, which may have a large gradient of the tracer that might amplify the error in diffusivity at that node. However, implementing this version of the network in ePBL yields an acceptable improvement in overall performance, suggesting that the error at node 16 is acceptable. Sensitivity in the transition layer will be investigated in more detail in future work. Column (c) has histogram plots of the validation data and its prediction. The network performs reasonably well and only shows inaccurate behavior when the data are multimodal. Column (d) shows the error histogram. The error has the highest variance at node 12 and is approximately Gaussian everywhere, implying randomness.
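The weight-averaged correlation score can be computed as in this sketch; the synthetic validation arrays are placeholders for the GOTM-derived shape function profiles.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_nodes = 1000, 16
g_true = rng.random((n_samples, n_nodes))                          # stand-in validation g(sigma)
g_pred = g_true + 0.1 * rng.standard_normal((n_samples, n_nodes))  # stand-in prediction

# Per-node linear correlation between validation data and prediction.
r = np.array([np.corrcoef(g_true[:, k], g_pred[:, k])[0, 1] for k in range(n_nodes)])

# Weight each node's correlation by the mean g(sigma) of the data, so that
# noisy near-bottom nodes (where g is small) count less toward the score.
w = g_true.mean(axis=0)
r_weighted = float(np.sum(r * w) / np.sum(w))
```

With real data the weights taper toward the bottom of the boundary layer, which is what suppresses the influence of the noisy deep nodes.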
For the neural network N1, Figure 4 shows the ability to predict the shape function offline. In general, the network shows high skill, as seen from the scores in Figure 3. The network N1 shows some inaccuracies in predicting multimodal distributions for output nodes 10-15. A single network predicts the value of the shape function at all the nodes, so it may trade accuracy at one node against another. Increasing the size of the network (i.e., the number of layers and the nodes in them) slightly reduces this error, but the cost of computation increases significantly with size, rendering larger networks unusable for longer time-scale simulations.
N1 is trained on all of the forcing regimes: surface heating, neutral, and convection. This perhaps limits the network, which falls short of having very high skill in every regime. In our training experiments (not described in this article), training and predicting separately on the stable and unstable regimes gave higher skill than training on all regimes at once. Having two networks to predict g(σ) alone could lead to higher skill without increasing the number of hyperparameters. This might be a cost-effective way to increase the overall accuracy of ePBL NN without expanding the size of the network: increasing the number of hidden nodes in the hidden layers increases the cost of a forward computation, while switching between networks based on the forcing regime has a similar cost to using a single network. For simplicity, in this work we prefer to train all data using a single neural network and have not pursued this further.
We used the L1 loss function (mean absolute error) for training, as it gave better training performance than the L2 (mean squared error) loss. We also improved the convergence of the network parameters (weights and biases) by tweaking the loss function.
The loss values at nodes 8 to 13 were amplified by a factor of 100. This made the loss gradients steeper at the nodes that show the highest variance (seen in columns (a) and (b) of Figure 4), forcing the network to put more weight on reducing errors at the nodes that are otherwise difficult to learn. The ADAM optimization algorithm (Kingma & Ba, 2014) was used to train the weights and biases of the network with a learning rate of 10⁻³. At these nodes the network exhibits poorer performance, which could be due to the strongly multimodal nature of the data at those locations.
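The amplified loss can be sketched as a plain weighted mean absolute error; treating "nodes 8 to 13" as 0-based indices 7-12 is our assumption about the array layout.

```python
import numpy as np

def weighted_l1_loss(pred, target, amplified=slice(7, 13), factor=100.0):
    """Mean absolute error with the loss at hard-to-learn nodes amplified.

    Nodes 8-13 (0-based indices 7-12) are scaled by `factor`, steepening the
    loss gradients where the shape-function data show the highest variance.
    """
    w = np.ones(target.shape[-1])
    w[amplified] = factor
    return float(np.mean(np.abs(pred - target) * w))
```

In the actual training this weighting enters the automatic-differentiation loss; it is shown here as a plain function of arrays for clarity.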

Training N2
The second neural network, N2, as shown in Figure 2, predicts the characteristic velocity v0. The velocity is diagnosed from the training data using the following jugaad: v0 = max_σ(κ_{k−ϵ}(σ))/h, averaged over all the profiles that share a set (f, B0, u*), where the overbar in the equation denotes this average. The spread of max(κ_{k−ϵ}(σ))/h for a constant (f, B0, u*) is small, and averaging assists the neural network in training to predict the mean value (see Appendix A).
Similarly to N1, the network is trained on the logarithm of v0, and an exponential function is applied at inference to ensure that the predicted v0 is strictly positive. The data are divided 80-20% to train and test the performance of the network. As seen in Table 1, the training data cover a wide range compared to that of N1, including extreme forcing conditions anticipated in a realistic OGCM. When the network sees conditions outside this range, the input is capped at the nearest extremum data point. This prevents the network from extrapolating, which is less skillful than interpolating. The trained network has high skill (linear correlation of 0.99), as seen in Figure 5.
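The log/exp transform that keeps the predicted v0 strictly positive can be sketched as below; the sample values are illustrative.

```python
import numpy as np

def to_training_target(v0):
    """Train the network on log(v0): the target is unbounded in sign,
    which is easier for a regression network to fit."""
    return np.log(v0)

def from_network_output(y):
    """Exponentiate at inference so the predicted v0 is strictly positive."""
    return np.exp(y)

v0 = np.array([1.0e-4, 3.0e-3, 2.0e-2])   # illustrative velocity scales (m/s)
recovered = from_network_output(to_training_target(v0))
```

Any real-valued network output maps back to a positive velocity scale, so no clipping of the output itself is needed.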

Evaluating impacts in a prognostic OGCM
Training, testing, and validation data provide one method for testing the network and its ability to reproduce the training data. However, to fully test the network's potential for OGCM experiments, the neural networks must also be implemented in free-running, prognostic models. Our implementation does not cause the simulation to fail due to spurious effects or instabilities, which is a known problem with implementing neural networks in a GCM (see Brenowitz et al. (2020) and references therein). This stability may result from implementing the neural networks as a component within the existing ePBL framework. We demonstrate the success of our implementation using both free-running column model experiments and forced ice-ocean global OGCM climate model experiments.

Implementation of neural networks in MOM6
We now describe the implementation of our networks in the MOM6 ocean model. The weights and biases of the networks are generated offline and stored in NetCDF files. Feedforward (inference) calculation of a network involves matrix multiplications and activation functions; these have been coded as subroutines in MOM6's vertical mixing module (ePBL). A flag activates the neural networks to predict g(σ) and v0. All inputs to the networks are readily available within the ePBL module. The neural networks require the depth of the boundary layer h, which is provided by the ePBL scheme as described in Reichl & Hallberg (2018). The neural networks function alongside the algorithm by which ePBL derives h and therefore do not interfere with any energy constraints set by the original scheme. Additionally, in MOM6, the diffusivity derived from the ePBL and neural network subroutines is passed to a main diabatic mixing module, which combines diffusivities from various mixing parameterizations (such as Jackson et al. (2008)) within MOM6. More details can be found in Reichl & Hallberg (2018). g(σ) is obtained at 16 points between the surface and h. At the surface, g(σ = 0) = 0 to satisfy zero diffusivity. At h, g(σ = 1) is set to a small number by assuming g(σ = 1) = c · g(σ = 16/17), where c is a small positive constant set to 0.1. GCM and single-column runs were found to be insensitive to small, non-zero values of c.
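The inference subroutine amounts to a chain of matrix multiplications and activations; in this sketch the random weights and the tanh activation are stand-ins for the parameters stored in the NetCDF files (the activation choice is our assumption).

```python
import numpy as np

def feed_forward(x, weights, biases, activation=np.tanh):
    """Evaluate a small fully connected network: hidden layers use the
    activation function, the output layer is linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(h @ W + b)
    return h @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
sizes = [4, 32, 32, 16]   # N1: inputs (f, B0, u*, h) -> 16 values of g(sigma)
Ws = [0.1 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

g_hat = feed_forward(np.array([1.0e-4, -1.0e-7, 1.0e-2, 50.0]), Ws, bs)
```

The same loop structure, written in Fortran, is what runs inside the ePBL module at every call.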
The shape function on σ is converted to the model's vertical grid by linear interpolation. The use of the sigma coordinate makes our scheme independent of the model's vertical coordinate. The shape function on the model grid is multiplied by v0 · h according to Equation 10 to recover the diffusivity profile of the k−ϵ model. The subroutines pass the diffusivity profile to the ePBL module. In MOM6, other parameterizations are active alongside ePBL to incorporate the strong shearing regions found at the equator and to handle background diffusivity. Both networks N1 and N2 are shallow, as they have two hidden layers with 32 nodes each. The OM4 model with ePBL NN requires ≈ 5-10% more runtime than with ePBL. This cost may not warrant GPUs to speed up the inference in this version of the scheme, but this option will be explored in the future.
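Mapping g(σ) onto the model grid and recovering the diffusivity can be sketched as below; placing the σ nodes at i/17 is an assumption consistent with the g(σ = 16/17) endpoint described above.

```python
import numpy as np

def diffusivity_profile(g_sigma, v0, h, depth):
    """Interpolate g(sigma) (16 nodes) onto model depths and scale by v0*h,
    i.e. kappa(z) = v0 * h * g(z/h), as in Equation 10.

    depth: positive depths (m) of the model interfaces.
    """
    sigma_nodes = np.arange(1, 17) / 17.0        # assumed node placement
    sigma_model = np.clip(depth / h, 0.0, 1.0)
    g_model = np.interp(sigma_model, sigma_nodes, g_sigma, left=0.0)
    return v0 * h * g_model
```

The `left=0.0` fill enforces the zero-diffusivity surface condition on any model interface above the first σ node.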
The inputs to the neural networks are also capped inside the subroutine to ensure the networks do not make predictions outside their training range. For N1, if any of the inputs (f, B0, u*, h) are outside the known range, the subroutine limits the inputs, moving them to the nearest point of the four-dimensional hypercube formed by the four inputs. Our training data cover a reasonable span of the forcing parameter regime as observed among realistic present conditions (as the scheme will be applied in this study). Data points outside the range are less probable, allowing the network to perform effectively for nearly all of the tested forcing conditions (see Appendix B). Capping the inputs prevents the network's output from being unphysical: if the network receives inputs outside the known range, the shape function can take spurious values with irregular vertical structure. If the network is applied to simulations in substantially different climate regimes (e.g., paleoclimate or other planetary bodies), the training data could be extended. The training on the logarithm and the use of the exponential function at inference, described in the earlier sections, prevent non-positive behavior for both N1 and N2.
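The capping of inputs to the training hypercube is a component-wise clamp; the extrema below are hypothetical placeholders for the actual training-range limits.

```python
import numpy as np

# Hypothetical training-range extrema for (f, B0, u*, h); the real limits
# come from the training dataset (see Table 1).
LO = np.array([-1.5e-4, -1.0e-6, 1.0e-3, 5.0])
HI = np.array([1.5e-4, 1.0e-6, 3.0e-2, 500.0])

def cap_inputs(x):
    """Move out-of-range inputs to the nearest point of the 4D hypercube
    formed by the training extrema, preventing extrapolation."""
    return np.clip(x, LO, HI)
```

In-range components pass through unchanged, so the clamp only acts in the rare out-of-range cases.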

Single Column Model Results
We compare three schemes to examine the performance of the network in a single column model: GOTM k−ϵ, ePBL, and ePBL NN. MOM6 in single-column configuration is used to run ePBL and ePBL NN, while GOTM is used for the k−ϵ experiments. The column models are forced at the surface grid interface with constant buoyancy forcing (surface heating of 50 W/m²) and constant surface wind stress (0.2 N/m²). Stratification is constant throughout the column in the initial conditions. To have the same entrainment in all three cases, the m* value is diagnosed from the k−ϵ output and imposed in MOM6. The quantity m* is the non-dimensional integral of the entrainment flux, given by ∫_{−h}^{0} min(0, w′b′) dz = m* u*³ for surface heating conditions. In Reichl & Hallberg (2018), m* has been parameterized using a function G as in Eq. 5. Instead of using the parameterized m* from Reichl & Hallberg (2018) and Reichl & Li (2019), we use a diagnosed, time-varying m* from k−ϵ to perform a controlled comparison with identical forcing conditions. This prevents deficiencies in the parameterized m* from causing disagreements between MOM6 and GOTM. By matching the surface forcing and the integral of the entrainment flux, differences between the three cases can only be due to the diffusivities.
Two latitudes are compared: 40° (Figure 6) and 1° (Figure 7). The figures show time series of diffusivity and temperature stratification. For both latitudes, the diffusivity and stratification in ePBL NN are in closer agreement with the k−ϵ model than the original ePBL model, showing the ability of the neural networks to match k−ϵ. ePBL NN has a diffusivity profile closer to k−ϵ than ePBL throughout the OSBL. In k−ϵ (SG), the turbulent diffusivity is computed from the simulated TKE and turbulent length scale, using stability functions that relate the Prandtl number to the Richardson number (Schumann & Gerz, 1995). The neural networks have "learned" the relationships that set the structure of diffusivity (without direct knowledge of either parameter) and hence show high skill in predicting the profile.
In the upper ≈ 20% of the diffusivity profile, the network is able to learn traditional constraints, such as law-of-the-wall scaling, since they are features of the training data. The bottom ≈ 40% of the OSBL shows more variability and is an important region for the entrainment process. When the boundary layer deepens, entrainment mixes higher-density (usually colder) water masses from below the mixed layer into the lower-density (usually warmer) mixed layer above. Outside of the polar regions, this process cools the mixed layer along with the sea surface temperature (and warms the interior) and has implications for ocean-atmosphere energy exchange and feedbacks.

Ice-Ocean JRA Forced Model Results
We next tested the ePBL NN scheme using GFDL's OM4.0 ocean/sea-ice model. The model has a nominal 1/4 degree resolution and is forced using the JRA forcing product (Tsujino et al., 2018). JRA-forced simulations constrain the atmospheric fields that force the ocean model with observed/reanalysis atmospheric data. This differs from an atmosphere-ocean coupled model in that there is no feedback from the ocean response to the atmosphere. However, this approach is beneficial for testing parameterizations, since two experiments can be compared more carefully without the complications of those feedbacks. Future work will examine the performance of these schemes in fully coupled climate models. Two sets of OGCM experiments have been performed: one using the ePBL scheme as a control run (e.g., as described by Adcroft et al. (2019)) and a second with the neural networks active, replacing the shape function and velocity scale in ePBL. The simulations were performed for the period 1958 to 2017.
In this study, we compare the two runs with observations to analyze the impact on: (1) ocean heat uptake, (2) sea surface temperature, (3) mixed layer depth, and (4) upper ocean temperature stratification in the Tropical Pacific. For sea surface temperature, data from the World Ocean Atlas (WOA) (Boyer et al., 2018; Locarnini et al., 2019) have been used to compare the two schemes. For the subsurface comparisons of mixed layer depth and stratification, ARGO float measurements have been utilized (Argo, 2022).

Ocean Heat Uptake and Sea Surface Temperature Comparison
Figure 8 shows the global ocean heat content for three runs: one with ePBL NN (red solid line) and two with ePBL, setting γ from Equation 8 to 1 and 3 (blue dashed and green dotted lines, respectively). ePBL NN shows more heat uptake than the original scheme, and its rate of warming lies between the ePBL runs with γ set to 1 and 3. For ePBL NN, the tunable parameter γ does not exist. This highlights the sensitivity of the total ocean heat content to the shape function and to boundary layer mixing schemes: the ocean's total heat content is sensitive to the vertical diffusivity set by the OSBL mixing scheme. ePBL NN replaces the ad hoc diffusivity of ePBL with a physics-informed, data-driven neural network.
Figure 9 shows the SST bias averaged over the years 2003-2017 for each 1° grid point. SST biases are similar in the two runs, with minimal differences, which is expected since the atmospheric fields are prescribed and not coupled. SST around the eastern Pacific and Atlantic equatorial regions shows a slightly warmer bias for the ePBL NN run than for ePBL. In the Indian Ocean, the bias is slightly colder. The SST bias in the Gulf Stream and Kuroshio currents is slightly warmer in ePBL NN, by about 0.5 °C. The response of the SST to ePBL NN in the boundary current regions indicates that changes in the vertical viscosity or diffusivity also impact the circulation in certain regions.
Changes in the patterns of SST can be due to changes in the mixed layer depth and the surface heat fluxes. The heat fluxes are computed as a function of SST, surface ocean velocity, and ice cover, as stated in Adcroft et al. (2019) and Griffies et al. (2016).

Mixed Layer Depth Comparison
We compare summer and winter mixed layer depths (MLD), a metric usually used to indicate the depth to which atmospheric influences are directly felt in the ocean. Here, the winter (summer) mixed layer depth is the maximum (minimum) of the monthly averaged MLDs for each grid point over the period 2003-2017. The MLD depends on its definition, and we evaluate it using two criteria: Reichl et al. (2022) and de Boyer Montégut et al. (2004). The criterion from de Boyer Montégut et al. (2004) uses a potential density threshold of 0.03 kg/m³, whereas Reichl et al. (2022) use a potential energy anomaly threshold of 25 J/m² to define the MLD. Figures 10 and 11 show the MLD using the potential energy anomaly criterion and the potential density criterion, respectively. With the latter, the summer MLD bias in ePBL NN is reduced to 5.73 m, as seen in Figure 11. Between −20° and 20° latitude, the average root mean square error (RMSE) of the MLD bias in ePBL was about 7.94 m; in ePBL NN, it was reduced to 5.18 m. We show the latitude dependence of the RMSE between model and observations in the supplementary material (see Figure S1). The ePBL NN scheme performs better under stable surface heating conditions than the ePBL scheme. The reduction of the shallow MLD bias has implications for equatorial oceanic regions and for large-scale ocean-atmosphere feedbacks (Adcroft et al., 2019). Winter MLD biases (Figures 12 and 13) are very similar for both runs. The ePBL NN predicts diffusivity close to a second moment scheme but does not significantly impact the winter-time bias simulated by the model with the original ePBL scheme. This is likely because other model physics and factors can dominate in setting the deep convective mixed layers and water properties.
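The density-threshold criterion of de Boyer Montégut et al. (2004) can be evaluated as in this sketch; the profile values in the example are illustrative.

```python
import numpy as np

def mld_density_threshold(sigma0, depth, dsig=0.03):
    """Mixed layer depth: shallowest depth where potential density exceeds
    its surface value by dsig = 0.03 kg/m^3 (de Boyer Montegut et al., 2004).

    sigma0: potential density profile (kg/m^3); depth: positive, increasing.
    """
    excess = sigma0 - sigma0[0]
    idx = int(np.argmax(excess > dsig))
    if excess[idx] <= dsig:          # criterion never met in the profile
        return float(depth[-1])
    # Linear interpolation between the two bracketing levels.
    z0, z1 = depth[idx - 1], depth[idx]
    e0, e1 = excess[idx - 1], excess[idx]
    return float(z0 + (dsig - e0) * (z1 - z0) / (e1 - e0))
```

The potential-energy-anomaly criterion of Reichl et al. (2022) follows the same pattern but thresholds a vertically integrated energy quantity (25 J/m²) rather than a density difference.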
Although ePBL NN has been trained on all the regimes, including surface cooling, a different scheme or process might be compensating for the effects of the improved diffusivity. This could also be due to the higher sensitivity of shallow mixed layers to changes in surface forcing compared to deep mixed layers: for a shallow mixed layer, any perturbation in the atmospheric forcing reaches the base of the boundary layer more quickly than it would in a deeper layer. In Reichl & Hallberg (2018), the rate of conversion of turbulent kinetic energy to potential energy within the boundary layer (the left-hand side of Equation 7) uses a parameterization that depends on h. Changing the diffusivity can alter h, which in turn modulates the rate of energy conversion. This can lead to changes in the MLD.
The MLD evaluated using the criterion of de Boyer Montégut et al. (2004) shows the same qualitative results as described above. The winter-time MLD biases are similar for both runs. The summer-time MLD bias shows a further reduction when evaluated using de Boyer Montégut et al. (2004) than with Reichl et al. (2022). It is not unusual to obtain different values of mixed layer depth using different definitions. For both definitions, winter biases in ePBL NN are not worsened. Qualitative agreement on the reduction in summer bias in ePBL NN using two different criteria provides strong evidence that ePBL NN performs better than ePBL in terms of MLD bias reduction under fixed atmospheric forcing conditions.

The final comparison we use to assess the impact of the neural network diffusivities on the model result is the upper ocean temperature stratification (∂Θ/∂z, where Θ is the potential temperature) in the Equatorial Pacific region. The thermocline in the Equatorial Pacific plays an important role in ENSO dynamics, with implications for the Earth's climate system (e.g., Jin & An, 1999). The temperature stratification is shown for a vertical cross section along the equator spanning −220° to −80°E. Figure 14(c) shows the observational data from ARGO floats (Roemmich & Gilson, 2009). Figure 14(b) shows ∂Θ/∂z from the original ePBL, with lower stratification compared to the ARGO observations. The ∂Θ/∂z from ePBL NN, seen in Figure 14(a), shows significant improvements in the stratification of the upper ≈ 50 m of the ocean. Stratification in ePBL NN is closer to the ARGO data in the equatorial region of the Pacific Ocean. The neural network predicted diffusivities help to increase the stratification, bringing it closer to observations than the simulation with the original ePBL and its ad hoc shape function for diffusivity.
Stratification acts as a barrier to mixing; this warrants further investigation into how ePBL NN changes the transport pathways of heat through the OSBL in different regions of the world's oceans. Overall, the MLD bias is reduced and the stratification is improved for the upper ≈ 50 m, suggesting that ePBL NN addresses these two biases in conjunction.

Summary
In this study, we apply neural networks to improve the parameterization of the vertical diffusivity in the ocean surface boundary layer. The data used to train the neural networks are obtained from second-moment closure simulations by running a single-column model under various forcing scenarios, spanning the possible range of present-day and future conditions. The neural networks are implemented within the existing physics-based parameterization from the Energetic Planetary Boundary Layer (ePBL) framework of Reichl & Hallberg (2018). The neural networks augment the method for determining the vertical diffusivity in the ePBL scheme with data-driven relations but maintain the physically motivated energetic constraints on mixing from the original scheme. A benefit of our approach is that it yields a stable implementation in the OGCM (MOM6). This enables us to perform decade-scale simulations spanning 1958 to 2017.
Atmospherically forced ice-ocean experiments using GFDL's MOM6 1/4° model suggest an overall improved performance due to the enhancements in ePBL NN relative to the original scheme. There is a reduction in the biases of summer-time mixed layer depths and no exacerbation of the winter-time biases compared to ePBL. The stratification of the upper ocean in the Tropical Pacific shows improvements in the thermocline compared to the ARGO float observations. This analysis indicates that the resulting scheme is suitable for implementation in future OGCM configurations and experiments and is expected to reduce biases in climate simulations. Further analysis using a wider range of diagnostics in additional model configurations will be particularly beneficial.
The ePBL framework is already optimized for GCMs, providing larger time-stepping capabilities (≈ O(1) hr), and ePBL NN leverages these advances with improved diffusivity profiles. It is computationally expensive to run a second-moment scheme in a GCM due to time-stepping restrictions (≈ O(10-100) s), but ePBL NN can yield eddy diffusivity profiles more consistent with an SMC within the original ePBL framework. This is significant for GCMs, as we achieve results closer to a model with a second-order turbulence closure scheme while maintaining the coarse resolutions and long time steps needed for climate-scale simulations. We note that the longer implicit time step used in the numerics of ePBL (see Reichl & Hallberg, 2018) can lead to a smoothing effect that complicates resolving small-scale structure, but we observe that the large-scale evolution is tracked accurately.
While the results of this work are promising, numerous aspects remain important for future work. For example: 1. ePBL, ePBL NN, and the SMC considered here assume downgradient diffusion and hence have no nonlocal flux terms. The representation of nonlocal fluxes could improve the scheme further and potentially affect convective regions and Langmuir turbulence (e.g., Chor et al., 2021). In this application we do not explicitly consider the impact of Langmuir turbulence within ePBL (though it is part of setting the energy available for entrainment; see Reichl & Li (2019)). 2. The neural networks can be made larger to capture more complex relationships in the data by increasing the number of hyperparameters (hidden nodes). In this work we chose small networks for an initial investigation. The successful use of small neural networks as efficient surrogate models of SMCs shows that we can replicate the behavior of complex models with high fidelity. Increasing the network size will be explored in the future and will likely require using graphical processing units (GPUs) for implementation in an OGCM (Zhang et al., 2023). 3. The performance of the modified vertical mixing scheme in a coupled model (atmosphere-ocean-ice) may not show the same impact on model bias as observed in this forced ocean-ice model. The atmosphere-ocean feedbacks will require exploration in future work. 4. Improving the representation of the diffusivity profile has implications for many quantities that have gradients within the boundary layer. For example, changing the diffusivity of nutrients within the euphotic layer has implications for biogeochemical processes such as primary production. The implications of improved diffusivity for ecological modeling will be explored in future work.
Finally, we have trained on one SMC, the k−ϵ model with stability functions following Schumann & Gerz (1995). This parameterization was chosen for consistency with Reichl & Hallberg (2018), but alternative SMC models may yield different results. In future work this process will be repeated with different SMC schemes to understand the influence of SMC diffusivities on the performance of GCMs. One disadvantage is that the SMC has been assumed to be the "truth", but it might lack realism; hence, future work will focus on including data from LES studies and observations.

Applications for first-order ocean surface boundary layer parameterizations
One key achievement of this work is that it establishes a relationship between the shape function of upper ocean vertical mixing and the forcing parameters. Previous work on similar first-order upper ocean mixing parameterizations assumed that the shape function was fixed or set by ad hoc approximations. This work further suggests that models that consider this variation in the shape function are more skillful at simulating upper ocean stratification and ocean mixed layers. The physics-informed functions (networks) developed in this work for determining the shape function from the forcing parameters are applied here in ePBL as an example. However, they are not specific to ePBL and can also be used within other first-order ocean surface boundary layer parameterizations (such as KPP; Large et al., 1994; Van Roekel et al., 2018).
It is also important to consider that the neural network based model used in this work is not the only approach to finding a relationship between the forcing terms and the vertical mixing profile. The neural network establishes the existence of a relationship between its inputs and outputs, which is learned during the training process. While the neural network can be applied in ocean models as-is to improve simulations, we also desire an in-depth understanding of the patterns in the inputs that the network uses to make its skillful predictions. In future work, we seek to relate the network's findings to the processes that govern the ocean surface boundary layer's behavior (e.g., with equation discovery). This may ultimately lead to a simpler, interpretable, and computationally low-cost physics-based model for the shape function that can be learned from the neural network and applied in ocean models.

Implications for augmenting ocean parameterizations with Machine Learning
A second implication of this work is demonstrating the potential for neural networks to improve parameterizations in ocean models. This implication is in agreement with several similar previous studies in Earth system modeling (e.g., O'Gorman & Dwyer, 2018; Yuval & O'Gorman, 2020). As neural networks are not limited to individual processes, future avenues of research on ocean parameterizations will benefit from their usage. For example, neural networks can be applied to incorporate different mixed layer processes, such as non-local fluxes during convection, entrainment, Langmuir turbulence, symmetric instability, and surface wave effects, into a single neural network model. Further improvements can be made that incorporate time history to improve predictions under transient forcings. Many existing ocean/atmosphere parameterizations have a physics-based parent scheme with a few ad hoc components or approximations. These components can be replaced or re-tuned using our approach or other emerging approaches such as ensemble Kalman methods and a posteriori criteria matching (e.g., Lopez-Gomez et al., 2022; Frezat et al., 2022, and references therein). Parameterizations in the form of weights and biases are advantageous because they can be re-tuned and further optimized as additional data, observations, and processes become available.
The successful application of neural networks in an OGCM simulation unlocks the potential to test the importance of improving a given process/parameterization in the model. For example, consider a case where process-study data exist but a physics-based parameterization is challenging to develop. Neural networks can parameterize that process, and its impacts in an OGCM can be explored before embarking on a detailed parameterization development, which can be resource-consuming.
One of the major sources of uncertainty in climate models arises from parameterizations, due to their inadequate representation of sub-grid physics. Higher resolution or shorter time steps can perhaps attenuate the effects of structural uncertainties in sub-grid parameterizations, but computational limitations often impose constraints on factors such as resolution, ensemble size, and integration time scales. These limitations underscore the need to improve the current generation of climate models without relying on higher resolution or shorter time steps. Combining traditional process-oriented studies with the emerging field of machine learning offers the potential for synergistic advancements, leading to the refinement of sub-grid models. We have established a pipeline whereby an existing parameterization is augmented to harness the capabilities of neural networks. The successful integration of neural networks within the energetic Planetary Boundary Layer (ePBL) scheme, and its application to an ocean model, introduces opportunities for enhancing parameterizations that govern upper ocean mixing in climate models.
Appendix A: Why does v0 change due to the Coriolis parameter, f?
In general, turbulent velocity scales are related to turbulent kinetic energy and depend on the boundary forcing, u* and B0. Here, in addition to u* and B0, we find a dependency of the bulk turbulent velocity scale v0 on f. The bulk velocity is diagnosed from the diffusivity and boundary layer depth in the training data as per Equation 14. The Coriolis parameter f has been used to predict v0 with N2 because we found that it improves the model's ability to predict variations of v0 in the training data. This is evident from Figure A1.
Figure A1 shows the variation of v0 with respect to latitude and h under surface heating and cooling conditions. This indicates that f is a useful input for accurately predicting v0. Physically, the inclusion of f is related to the role of rotation in limiting the wind input of energy and the shear production of turbulence in the boundary layer through Ekman effects. The variation due to h is smaller than that due to f, around 5% of the mean value of v0 for any particular set of forcing (f, B0, u*). Since the implementation and generalization are significantly easier if the network depends only on external forcing parameters, we choose to include f as an input to the network and neglect h.

Figure A1. Variation of normalized v0 with respect to its mean value. v0 varies due to heat flux, surface stress, and latitude. Variations due to h are within 5% of the mean value of v0, and hence it is reasonable to exclude h as an input to N2. Note that v0 is a quantity diagnosed from the output of k−ϵ, used solely to reconstruct the diffusivity profile.

Appendix B: Quantifying the uncertainty range covered in the forcing data
We wish to quantify how much of the uncertainty in the forcing parameters is covered in the training data. We can estimate this using Shannon entropy (Shannon, 1948), which measures the amount of uncertainty and variability in a variable (Sane et al., 2020, 2021; Carcassi et al., 2021).
The Shannon entropy of a variable x is given by H(x) = Σ_{i=1}^{N} p_i log₂(1/p_i) (Cover, 1999) and measures the average amount of information, or surprise, associated with its outcomes. We only use discrete probability distributions. Low probability events carry high information content because they cause more surprise than high probability events. Entropy is a non-parametric measure and makes no assumption about the distribution. u* and B0 are non-Gaussian (Figure A1).
For u*: H(u* > 0.03 m/s) ≈ 95.5%, and for B0: H(|B0| > 2.1 × 10⁻⁷ m²/s³) ≈ 86%. This can be interpreted as the values u* > 0.03 m/s carrying 95.5% of the uncertainty; leaving out values of u* above 0.03 m/s would remove 95.5% of the uncertainty from the training data. This is a simplistic estimate and assumes u* and B0 are independent. These estimates show that our training data cover 95.5% of the variability in u* and 86% of that in B0, as observed under realistic conditions in a GCM.
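The fraction of Shannon entropy covered by a sub-range can be computed as in this sketch; the uniform synthetic samples are placeholders for the u* and B0 histograms.

```python
import numpy as np

def entropy_fraction(samples, bin_edges, keep):
    """Fraction of the Shannon entropy H = sum p_i * log2(1/p_i) that is
    carried by the histogram bins selected in the boolean mask `keep`."""
    counts, _ = np.histogram(samples, bins=bin_edges)
    p = counts / counts.sum()
    safe_p = np.where(p > 0, p, 1.0)      # avoid log(0); those terms are 0 anyway
    terms = p * np.log2(1.0 / safe_p)
    return float(terms[keep].sum() / terms.sum())

# Four equally likely bins: keeping two of them covers half the entropy.
samples = np.repeat([0.5, 1.5, 2.5, 3.5], 10)
frac = entropy_fraction(samples, [0, 1, 2, 3, 4], np.array([True, True, False, False]))
```

Applied separately to each forcing variable, this treats u* and B0 as independent, matching the simplification noted above.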
The training data points are uniformly spaced, and although they cover most of the range seen under realistic conditions, they do not follow the marginal probability distributions of u* and B0, nor the joint probability distribution between u* and B0. For machine learning applications in parameterization development, the consequence of sampling from the joint distribution of variables under realistic conditions versus using uniformly spaced forcing is unknown at present and is left for future study.

This work does not necessarily represent the views of the National Oceanic and Atmospheric Administration or the U.S. Department of Commerce. We were intellectually supported by various other members of the M2LInES project. We used the Stellar computational resources provided by Princeton University and the National Oceanic and Atmospheric Administration (NOAA) Geophysical Fluid Dynamics Laboratory (GFDL). We thank Dr. Enrico Zorzetto and Dr. Robert Hallberg for providing feedback on this article. We also thank Dr. Jun-Hong Liang and two anonymous reviewers for their profound insights, which led us to improve this manuscript. The authors thank the international Argo project and the various associated national programs for collecting and freely distributing the dataset. AS thanks his wife for showing remarkable grit in helping to improve the plain language summary.
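The distinction between the two sampling strategies can be made concrete with a toy experiment (synthetic, correlated stand-ins for u* and B0, not the actual GCM output): a uniform factorial grid destroys any correlation between the forcings, while resampling the empirical joint distribution preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "realistic" forcing: u* lognormal, B0 anti-correlated with u* plus noise.
u_star = rng.lognormal(mean=-3.5, sigma=0.6, size=10_000)
b0 = -2e-7 * (u_star / u_star.mean()) + rng.normal(0.0, 5e-8, size=10_000)

# Strategy 1: uniformly spaced factorial grid over the observed ranges.
u_grid = np.linspace(u_star.min(), u_star.max(), 32)
b_grid = np.linspace(b0.min(), b0.max(), 32)
uu, bb = np.meshgrid(u_grid, b_grid)
uniform_pairs = np.column_stack([uu.ravel(), bb.ravel()])

# Strategy 2: bootstrap-resample the empirical joint distribution.
idx = rng.integers(0, len(u_star), size=len(uniform_pairs))
joint_pairs = np.column_stack([u_star[idx], b0[idx]])

# The factorial grid has zero correlation by construction; the joint sample
# inherits the (anti-)correlation present in the underlying data.
corr_uniform = np.corrcoef(uniform_pairs.T)[0, 1]
corr_joint = np.corrcoef(joint_pairs.T)[0, 1]
print(corr_uniform, corr_joint)
```

Whether a network trained on the uncorrelated grid behaves differently from one trained on the correlated sample is exactly the open question noted above.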

Figure 1.
Figure 1. Shape functions derived from various forcing conditions using a second moment closure (blue, shaded region) plotted against the universal shape function (brown, dashed line) used in GCM vertical mixing schemes. The discrepancy between them reveals a limitation of existing vertical mixing schemes. This deficiency can be addressed through neural networks, which have the potential to predict the shape function and diffusivity associated with second moment closures.

Figure 2.
Figure 2. (a) Neural network N1. It takes four inputs (f, B0, u*, h); the output layer consists of 16 nodes giving the values of g(σ) at those locations. (b) Neural network N2 takes three inputs (f, B0, u*) and outputs a scalar velocity scale v0. Diffusivity is reconstructed from v0 and the shape function g(σ).
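The two architectures in Figure 2 can be sketched as small fully connected networks; the hidden layer sizes and tanh activations below are illustrative assumptions, and only the input/output dimensions follow the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected network with tanh hidden activations."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    return x @ weights[-1] + biases[-1]  # linear output layer

def init_mlp(sizes):
    """Random weights scaled by fan-in; zero biases."""
    weights = [rng.standard_normal((m, n)) * np.sqrt(1.0 / m)
               for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    return weights, biases

# N1: (f, B0, u*, h) -> g(sigma) at 16 vertical nodes (hidden sizes are assumed).
w1, b1 = init_mlp([4, 32, 32, 16])
# N2: (f, B0, u*) -> scalar velocity scale v0.
w2, b2 = init_mlp([3, 32, 1])

x1 = np.array([[1e-4, -5e-8, 0.01, 50.0]])  # one hypothetical forcing sample
x2 = x1[:, :3]
print(mlp_forward(x1, w1, b1).shape)  # (1, 16)
print(mlp_forward(x2, w2, b2).shape)  # (1, 1)
```

Keeping N2's inputs restricted to external forcing parameters, as discussed in Appendix A, is what makes the network easy to evaluate online inside the GCM.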

Figure 3
Figure 3(b) shows the loss curves for training the network. The training loss (magenta) and validation loss (green) decrease with training epoch. The validation loss is higher than the training loss, but both eventually plateau. The difference between the validation and training losses, shown in blue, remains constant in later epochs, indicating when training should be stopped. The validation loss does not increase, confirming that the network is not overfitting the training data. The performance of the network is further tested against the validation data. Strong agreement with the validation data is seen in the average correlation scores in Figure 3(a) and the error statistics in column (d) of Figure 4.
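The stopping criterion described above can be sketched on a toy problem; synthetic data and a subgradient-descent L1 fit stand in for the actual network and optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression stand-in for the fit: y = X @ w_true + noise.
X = rng.standard_normal((200, 4))
w_true = np.array([0.5, -1.0, 2.0, 0.1])
y = X @ w_true + 0.05 * rng.standard_normal(200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

w = np.zeros(4)
lr, patience, best_va, stall = 0.01, 20, np.inf, 0
for epoch in range(2000):
    # Subgradient step on the L1 (mean absolute error) training loss.
    resid = X_tr @ w - y_tr
    w -= lr * (X_tr.T @ np.sign(resid)) / len(y_tr)
    va_loss = np.abs(X_va @ w - y_va).mean()
    # Early stopping: halt once the validation loss stops improving.
    if va_loss < best_va - 1e-5:
        best_va, stall = va_loss, 0
    else:
        stall += 1
    if stall >= patience:
        break

print(round(best_va, 3))
```

As in Figure 3(b), the signal to stop is a validation loss that has plateaued rather than one that is still decreasing.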

Figure 3. Figure 4.
Figure 3. (a) Average linear correlation coefficient between the network output and the true values from data, averaged over all 16 nodes. Different colors represent different numbers of layers, and the x-axis shows the number of nodes in each hidden layer. (b) L1 loss curves for training and validation.

Figure 5.
Figure 5. Performance of N2. (a) Loss curves. (b) Histogram of the difference between the network's prediction and the data. (c) Predicted vs. true values for the training dataset. (d) Predicted vs. true values for the validation dataset.

Figure 6.
Figure 6. Time series comparison for the single column model configuration. Latitude is set to 40°, surface heat flux is 50 W/m², and wind stress is 0.2 N/m². The upper row compares diffusivity and the bottom row compares stratification. In both cases, ePBL NN agrees better with the second moment closure scheme k − ϵ than ePBL does.

Figure 7.
Figure 7. Time series comparison for the single column model configuration. Latitude is set to 1°, surface heat flux is 50 W/m², and wind stress is 0.2 N/m². The upper row compares diffusivity and the bottom row compares stratification. In both cases, ePBL NN agrees better with the second moment closure scheme k − ϵ than ePBL does.

Figure 8.
Figure 8. Total ocean heat content compared between ePBL and ePBL NN for the duration 1958–2017. For ePBL, γ from Eq. 8 has been set to 1 and 3. For ePBL NN, the tunable

Figures 10, 11. Figure 9.
Figures 10 and 11 show the summer time MLD. The summer time MLD bias is reduced significantly in ePBL NN compared to ePBL. The average bias reduced from 7.22 m

Figure 10. Figure 11. Figure 12. Figure 13.
Figure 10. Summer time (shallow) mixed layer depth biases using the potential anomaly criterion of Reichl et al. (2022). (a) MLD from ePBL. (b) Difference in MLD between ePBL and ePBL NN. (c) Bias of ePBL with respect to Argo data. (d) Bias of ePBL NN with respect to Argo data. Note the bias reduction from (c) to (d).

Figure B1.
Figure B1. Left: probability density curve of the surface friction velocity u*. Right: probability density curve of the surface buoyancy flux B0. The arrows denote the range covered in the training dataset.

Table 1.
Range of the parameters used to generate the training data. Additional details about the GOTM runs are provided in the Supplementary section.

Table 1
gives the range of the forcing parameters covered in the training data set. A natural question is how much of the variability observed in GCM simulations is covered in the training data.
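A uniformly spaced factorial design over such ranges can be generated as follows; the ranges and counts here are illustrative placeholders, not the actual Table 1 values.

```python
import itertools
import numpy as np

# Illustrative ranges only; see Table 1 for the values actually used.
latitudes = np.linspace(5.0, 65.0, 7)       # degrees; sets f = 2*Omega*sin(lat)
heat_flux = np.linspace(-200.0, 200.0, 9)   # W/m^2; sets the buoyancy flux B0
u_star = np.linspace(0.002, 0.03, 8)        # m/s; surface friction velocity

# Every combination of the three forcing parameters, one column-model run per tuple.
forcing_grid = list(itertools.product(latitudes, heat_flux, u_star))
print(len(forcing_grid))  # 7 * 9 * 8 = 504
```

Each tuple then defines one forced single-column simulation whose diagnosed diffusivity profile becomes a training sample.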