Vertical Velocity Diagnosed From Surface Data With Machine Learning

Submesoscale vertical velocities, w, are important for the oceanic transport of heat and biogeochemical properties, but observing w is challenging. New remote sensing technologies that measure horizontal surface velocity at O(1) km resolution can resolve surface submesoscale dynamics and offer promise for diagnosing subsurface w. Using machine learning models, we examine relationships between the three-dimensional w field and remotely observable surface variables such as horizontal velocity, density, and their horizontal gradients. We evaluate the machine learning models' sensitivities to different inputs, the spatial resolution of surface fields, the addition of noise, and information about the subsurface density. We find that surface data are sufficient for reconstructing w, and that high-resolution horizontal velocities with minimal errors are crucial for accurate w predictions. This highlights the importance of finer-scale surface velocity measurements and suggests that data-driven methods may be effective tools for linking surface observations with subsurface vertical velocity and transport.


Introduction
Resolving oceanic vertical velocities, w, at the submesoscale, defined here as O(1-10) km spatial scales and O(1) Rossby number, is important for the vertical transport of heat, carbon, and nutrients between the surface and deep ocean (e.g., Ruiz et al., 2019; Su et al., 2018; Uchida et al., 2019). Directly measuring w is difficult because w is several orders of magnitude smaller than the horizontal velocity, and w is noisy since it includes the wave field and turbulent fluctuations. However, recent studies suggest it may be possible to infer w at the submesoscale from surface signatures, since strong up- and downwelling are known to be associated with surface fronts, convergence, and cyclonic vorticity (D'Asaro et al., 2018; Freilich & Mahadevan, 2021; Ruiz et al., 2019; Tarry et al., 2021). We also expect that surface lateral velocity data will continue to improve in resolution and accuracy, leading to better estimates of divergence and vorticity at smaller scales. For instance, starting in 2023, the Surface Water and Ocean Topography (SWOT) mission plans to provide global surface horizontal geostrophic velocities down to ∼15 km, the highest resolution to date (Fu & Ubelmann, 2014; Wang et al., 2018). Furthermore, the NASA EVS-3 Submesoscale Ocean Dynamics Experiment (S-MODE) utilized Doppler scatterometry on aircraft to measure surface lateral velocities at <1 km resolution (Farrar et al., 2020). The latter methodology is also the basis for a proposed future mission that would remotely sense waves and current velocities, not just the geostrophic component, at even higher resolution than SWOT (Rodríguez et al., 2019).
To leverage these advances in finer-scale surface observations, we explore the feasibility of using data-driven methods to link surface fields with subsurface w. Existing techniques, such as surface quasigeostrophic (SQG) methods, diagnose the 3D mesoscale horizontal and vertical velocities from sea surface data using sea surface temperature (SST) or density and/or sea surface height (e.g., Isern-Fontanet et al., 2006, 2008; LaCasce & Mahadevan, 2006; Lapeyre & Klein, 2006; Qiu et al., 2020; Wang et al., 2013). However, these methods assume a quasi-geostrophic framework with Rossby number ≪ 1, an assumption that does not hold at the submesoscale. Moreover, SQG methods mainly capture the balanced part of w and perform poorly in the surface mixed layer, where unbalanced submesoscale dynamics dominate (Qiu et al., 2020; Uchida et al., 2019). Another method to calculate w, based on the incompressibility of seawater, is to depth-integrate the surface lateral divergence (e.g., Tarry et al., 2021, 2022). However, this approach leads to errors that increase with depth because the divergence is not uniform throughout the water column, especially below the mixed layer. Hence the relationship between w and the surface divergence is more complex, nonlinear, and dependent on other variables. This highlights a need to explore alternative methods for diagnosing the unbalanced, submesoscale w.
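As a concrete illustration of this divergence-based baseline, the following sketch (our own minimal example, not the implementation used in the cited studies) depth-integrates a surface divergence field under the assumption that δ is depth-uniform, which is exactly the assumption whose failure below the mixed layer produces the errors described above.

```python
import numpy as np

def w_from_surface_divergence(delta_surf, z):
    """Estimate w(z) by depth-integrating the surface divergence.

    Incompressibility gives w_z = -(u_x + v_y) = -delta; assuming
    delta is depth-uniform and w = 0 at the surface, w(z) = -delta * z.

    delta_surf : (ny, nx) surface divergence [1/s]
    z          : (nz,) depths [m], negative downward
    """
    return -delta_surf[None, :, :] * z[:, None, None]

# A convergent region (delta < 0) yields downwelling that grows
# linearly with depth; in reality the true divergence decays below
# the mixed layer, so this estimate degrades with depth.
delta = np.full((4, 4), -1e-5)           # 1/s, uniform convergence
z = np.array([0.0, -10.0, -50.0])        # m
w = w_from_surface_divergence(delta, z)  # w[-1] = -5e-4 m/s everywhere
```

The linear-in-depth growth of the estimate makes plain why errors accumulate with depth once the real divergence is no longer depth-uniform.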
Here, we utilize machine learning (ML) methods, which are good at finding complex nonlinear relationships in large data sets. They have been used in oceanography to estimate horizontal currents at the surface (Sinha & Abernathey, 2021) and at depth (Bolton & Zanna, 2019; Chapman & Charantonis, 2017; Manucharyan et al., 2021), among many other applications (Sonnewald et al., 2021). Recently, Zhu et al. (2023) showed promise in using deep neural networks to predict w from sea surface height data only. Our goal here is to further test how well ML methods perform for estimating the 3D submesoscale w field. We train and compare the performance of three machine learning models: Multiple Linear Regression (MLR), Random Forest (RF), and convolutional neural networks (CNN). In addition, we evaluate the sensitivity of w predictions to data noise, spatial resolution, and different input data variables, to provide insight into the type and quality of measurements needed for accurate estimations of w.

Training Data
Since we lack measurements of the vertical velocity, the ML models are trained using the output from an ensemble of numerical simulations generated with the Process Study Ocean Model, which solves the nonhydrostatic Boussinesq equations (Mahadevan et al., 1996a, 1996b). The model simulations, described in He and Mahadevan (2021), represent a coastal upwelling front in a 500 m deep re-entrant channel that extends 96 km in the alongshore (x) direction (periodic boundaries) and 384 km in the cross-shore (y) direction (solid walls). The horizontal grid resolution is 1 km, and there are 32 stretched vertical levels ranging in thickness from 1 m at the surface to 36 m at the bottom. Nine simulations with different combinations of alongshore wind stress and initial stratification generate a wide range of dynamics that form a rich training set. The mixed layer depth is set both by the upwelling near the coast and by the vertical mixing scheme, which depends on the wind stress. More details about the model can be found in Supporting Information S1 and in He and Mahadevan (2021).
Each simulation starts from rest and is forced with an alongshore wind stress that sets up an upwelling front at the eastern boundary. The upwelling front undergoes baroclinic instability to form meanders, eddies, and filaments that support a range of vertical velocities (Figure 1). The statistics of the modeled flow are representative of submesoscale dynamics (Shcherbina et al., 2013): the vertical component of the near-surface relative vorticity is positively skewed (Figure S1 in Supporting Information S1), and the distribution of w is negatively skewed (Figure 1). Snapshots from three of the simulations, performed with the same stratification but different strengths of the wind stress, show the range and distribution of w, as well as its vertical distribution (Figure 1). We focus on the upper 200 m, which is deeper than the mixed layer depth, as w at the base of the mixed layer is of interest for vertical transport. We limit our analysis to a region extending 150 km from the coast (y < 150 km), which includes the upwelling front and avoids the influence of the offshore boundary.
From each simulation in our ensemble, we select five snapshots of the model output, each separated by 5 days, to use for training and testing. We reserve one snapshot (20% of the data) for the test set, and the remaining time slices (80% of the data) are used for training. All the errors and results presented here are from the test set. Different combinations of input variables are used to train the ML models, as detailed in Table 1. The inputs to the ML models include the surface density ρ and the horizontal surface velocities u_h = (u, v) in the x and y directions, respectively. We also calculate the surface divergence δ = u_x + v_y, vorticity ζ = v_x − u_y, and cross-shore density gradient ρ_y to use as inputs, because w is known to be related to those quantities (D'Asaro et al., 2018; Ruiz et al., 2019; Tarry et al., 2021). The dominant density gradient is in the cross-shore direction. In addition, we include the mixed layer depth MLD and the peak stratification below the mixed layer, N², to test the importance of including information about the water column structure. To evaluate sensitivity to possible measurement errors, we add random noise at different levels (1%, 5%, and 10%) to u_h and re-calculate δ and ζ from the noisy velocities. The density is kept noise-free, as we expect errors in remotely sensed surface temperature to be smaller than in surface velocity. We also generate lower-resolution data by coarsening u and v from 1 to 5, 10, and 15 km through spatial averaging, and then use the coarse u_h fields to calculate δ and ζ (see Supporting Information S1 for more details). The output we are trying to predict is the vertical velocity at the same instant in time. For simplicity, we focus on predicting the full w, but many processes contribute to the vertical velocity field, and w can be decomposed into contributions from waves, balanced geostrophic motions, or unbalanced and turbulent motions that could be studied individually depending on the application (Qiu et al., 2020; Uchida et al., 2019). Details on the exact input-output pairs for each machine learning model are discussed below.
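The input preparation described above can be sketched as follows; the centered-difference stencil, the RMS-based noise scaling, and the block-averaging are our assumptions about reasonable implementations, not the paper's exact code.

```python
import numpy as np

def surface_gradients(u, v, rho, dx=1000.0):
    """Divergence, vorticity, and cross-shore density gradient from
    surface fields via centered differences (axis 0 = y, axis 1 = x)."""
    uy, ux = np.gradient(u, dx)
    vy, vx = np.gradient(v, dx)
    delta = ux + vy                      # horizontal divergence
    zeta = vx - uy                       # vertical relative vorticity
    rho_y = np.gradient(rho, dx, axis=0) # cross-shore density gradient
    return delta, zeta, rho_y

def add_noise(field, level, rng):
    """Add random noise scaled to a fraction of the field's RMS."""
    rms = np.sqrt(np.mean(field ** 2))
    return field + level * rms * rng.standard_normal(field.shape)

def coarsen(field, n):
    """Block-average a 1 km field to n km resolution (crop remainders)."""
    ny, nx = field.shape
    field = field[: ny - ny % n, : nx - nx % n]
    return field.reshape(-1, n, field.shape[1] // n, n).mean(axis=(1, 3))

rng = np.random.default_rng(0)
u = rng.standard_normal((96, 96))
v = rng.standard_normal((96, 96))
rho = rng.standard_normal((96, 96))
delta, zeta, rho_y = surface_gradients(u, v, rho)
u_noisy = add_noise(u, 0.01, rng)        # 1% noise level
u_5km = coarsen(u, 5)                    # 96 km domain at 5 km blocks
```

Following the text, δ and ζ would then be recomputed from the noisy or coarsened velocities rather than degraded directly.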

Machine Learning Models
We compare three types of ML models, MLR, RF, and CNN, and summarize some key differences between these models below.The Supporting Information provides more details about specific model implementation and parameters along with a schematic of the models and their input-output data (Figure S2 in Supporting Information S1).The ML models and the inputs they are trained with are listed in Table 1.
For MLR and RF, the inputs are the surface u, v, ρ, δ, ζ, and ρ_y, along with the MLD and peak N² values, at a single grid point. The output is w at that same grid point, at a particular depth. We fit separate MLR and RF models for each depth level, with the same inputs and a mean square error loss function. This allows the ML models to learn different relationships for every depth and enables us to assess the importance of each input variable as a function of depth. The main difference between MLR and RF is that MLR assumes w is a linear function of the inputs, while RF allows w to be a nonlinear function of the input data, since RF recursively splits the data into subsets and predicts the average over a subset of the training data. One consideration when choosing these two models is that we need to explicitly provide the exact variables that we think are going to be important for their predictions, such as the gradient quantities δ, ζ, and ρ_y.
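A minimal sketch of this per-depth setup with scikit-learn follows; the synthetic data and hyperparameters are ours, not the paper's exact settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_points, n_features, n_depths = 400, 8, 10

# Columns stand in for u, v, rho, delta, zeta, rho_y, MLD, N^2 at one grid point
X = rng.standard_normal((n_points, n_features))
# Toy target: w dominated by "divergence" (column 3), decaying with depth
W = X[:, [3]] * np.linspace(1.0, 0.1, n_depths) \
    + 0.05 * rng.standard_normal((n_points, n_depths))

# One independent model per depth level, all sharing the same surface inputs
mlr = [LinearRegression().fit(X, W[:, k]) for k in range(n_depths)]
rf = [RandomForestRegressor(n_estimators=25, random_state=0).fit(X, W[:, k])
      for k in range(n_depths)]

# Stack per-depth predictions into a w profile for a new surface sample
x_new = rng.standard_normal((1, n_features))
w_profile = np.array([m.predict(x_new)[0] for m in mlr])
```

Fitting one model per depth, as here, is what lets the importance of each input be diagnosed separately at every level.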
Additionally, since we expect spatial patterns to be important, we use CNNs, which can efficiently consider a wider field of view. For the sake of demonstration, we use a relatively simple CNN that consists of three convolutional layers followed by two fully connected layers, with a mean square error loss function. The CNN takes a series of images as inputs, which are the 2D fields of u, v, ρ, MLD, and peak N². We choose the size of the input image to be 32 × 32 km, and intentionally do not include δ, ζ, or ρ_y, since CNNs are designed to learn filters that identify important patterns such as fronts and gradients. Since CNNs are more computationally expensive to train, we do not train a separate CNN for every depth. Instead, the final predicted output is the depth profile w(z) at the center of the input image.
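The architecture can be sketched as below. Only the overall layout follows the description above (three convolutional layers, two fully connected layers, a 32 × 32 five-channel input, and a w(z) profile output); the channel widths, pooling, and number of depth levels are our illustrative choices.

```python
import torch
import torch.nn as nn

class WProfileCNN(nn.Module):
    """32x32 patches of (u, v, rho, MLD, N^2) -> w(z) at the patch center."""
    def __init__(self, n_channels=5, n_depths=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32 -> 16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 16 -> 8
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 8 -> 4
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, n_depths),              # one output per depth level
        )

    def forward(self, x):                          # x: (batch, 5, 32, 32)
        return self.head(self.features(x))

model = WProfileCNN()
out = model(torch.randn(2, 5, 32, 32))             # shape (2, 20)
```

Such a model would be trained against the true w profiles with `nn.MSELoss`, matching the mean square error objective used for the other models.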

Results
The first set of w predictions are from the "MLR all," "RF all," and "CNN all" models, which are trained with the original noise-free 1 km resolution data. The "all" refers to using all of the input variables considered, which include surface variables, MLD, and peak N² (Table 1). For all three methods, 3D snapshots of the predicted w (Figure 2) closely resemble the true model w for the case of high winds in Figure 1a. Qualitatively, all three ML models capture the locations and coherence of the downwelling and upwelling regions, such as the intense downwelling filaments at y = 60 and y = 90 km. All three models accurately predict enhanced velocities in the mixed layer and weaker w below the mixed layer. Scatterplots of the predicted w against the true w in the upper 200 m of the entire test set, which includes the ensemble of nine simulations (middle row of Figure 2), offer a more objective comparison. RF and MLR have comparable r² with the true w of 0.64 and 0.62, respectively, while the CNN has the highest correlation of 0.76. The CNN also does the best job of reconstructing the asymmetry in w, while MLR and RF are more likely to under-predict the negative velocities (Figures 2d-2f). Overall, at this stage, all three ML models do a decent job of reconstructing w in the upper 200 m.

Surface Data Only
To assess whether using surface data exclusively is sufficient for inferring w, we compare the ML "all" models against the "surface only" models that do not include MLD or peak N² as inputs (Table 1). Surprisingly, we find that training with only surface data results in a negligible decline in performance for RF (r² drops from 0.64 to 0.62) and CNN (r² drops from 0.76 to 0.73), while the MLR performance does not change (Table 1). We also do not see a large difference in the depth-dependence of errors between the "surface only" and "all" models in Figures 2h and 2i, which show the root-mean-square error (RMSE) normalized by RMS w, and the r² profiles, across the entire test set for the different ML models. We normalize the RMSE profile by the RMS w profile, rather than computing the percent error at each individual grid point and then averaging, because the latter results in dividing by zero in some locations. Generally, RMSEs are relatively small above the mixed layer base, where r² is high. Below the mixed layer, RMSEs become the same order of magnitude as RMS w and r² approaches zero (Figures 2h and 2i). The solid profiles in Figures 2h and 2i can be grouped into two categories based on model type: the "CNN surface" and "CNN all" models (pink and black lines) are very close together, while the RF and MLR models (blue, cyan, and green lines) have very similar depth-dependencies.
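The normalization choice described here amounts to the following metric (a sketch with synthetic arrays, not the paper's code):

```python
import numpy as np

def normalized_rmse_profile(w_true, w_pred):
    """RMSE(z) normalized by RMS w(z).

    Averaging over all horizontal points at each depth before dividing
    avoids the divide-by-zero of pointwise percent errors where w ~ 0.
    w_true, w_pred : (nz, npoints)
    """
    rmse = np.sqrt(np.mean((w_pred - w_true) ** 2, axis=1))
    rms_w = np.sqrt(np.mean(w_true ** 2, axis=1))
    return rmse / rms_w

# A prediction of zero everywhere scores exactly 1 at every depth:
# errors of the same order as the signal itself mean no skill.
w_true = np.random.default_rng(0).standard_normal((5, 1000))
zero_skill = normalized_rmse_profile(w_true, np.zeros_like(w_true))
```

A normalized RMSE near 1, as found below the mixed layer, is therefore equivalent to predicting w = 0 everywhere.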
To explain why surface data are by themselves sufficient, we can look at Figure 2g, which shows the "feature importance" as a function of depth. Feature importance is a metric from RF that measures the usefulness of each input feature by the total decrease in mean square error that results when RF partitions the data based on that input. This metric ranges from 0 to 1, where 1 means the variable is most important and 0 means the variable is not used at all. We see that above the base of the mixed layer, whose interquartile range is shaded in gray, divergence is by far the most important input (Figure 2g). Thus, removing unimportant variables like MLD and peak N² does not significantly impact model performance. Moreover, w can be well approximated as a linear function of surface divergence (and other variables), which explains why MLR performs so well and why the nonlinearity gained from the RF is not very beneficial. For comparison, we also show the results of depth-integrating the surface divergence (dashed gray line), which performs well only in the surface layer above 40 m but rapidly deteriorates below 50 m (Figures 2h and 2i). Within 40 m of the surface, RF and MLR outperform the CNN (Figures 2h and 2i) because divergence is a good predictor of w close to the surface, and because the CNN is optimized to predict the full depth profile of w, while RF and MLR predict w at a single depth. However, at or below the base of the mixed layer, the CNN surpasses RF and MLR.
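Feature importance of this kind is exposed directly by scikit-learn's random forests; a toy sketch (synthetic data of our own construction) shows how a divergence-dominated w yields a dominant importance for δ:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
names = ["u", "v", "rho", "delta", "zeta", "rho_y", "MLD", "N2"]
X = rng.standard_normal((400, len(names)))
w = 2.0 * X[:, 3] + 0.1 * rng.standard_normal(400)   # w ~ divergence

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, w)
# Impurity-based importances sum to 1 across features;
# "delta" dominates the ranking for this construction
ranked = sorted(zip(names, rf.feature_importances_), key=lambda p: -p[1])
```

Repeating the fit depth by depth, as in the paper, turns this ranking into the importance-versus-depth profiles of Figure 2g.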

Noisy Data
All results thus far make the unrealistic assumption of perfect data, so next we test the sensitivity of the ML models when trained with noisy velocity data, representing random observational error. Adding just 1% random noise to u_h drastically decreases the performance of MLR (r² drops from 0.62 to 0.38) and RF (r² drops from 0.64 to 0.34), but the CNN is not impacted (r² remains 0.76) (Table 1). Even with a 10% noise level, the CNN still has an r² of 0.68. Visually, the RF w predictions are noticeably noisier and patchier than the CNN predictions (Figures 3a-3d), and the RF predicted vertical velocities are also weaker than the true w, while the CNN is closer to the correct magnitudes.
Figure 3e shows the normalized RMSE profiles for RF and CNN trained on data with varying levels of noise. The errors for the MLR and RF models are nearly the same, so only RF is shown for simplicity. Each color represents a noise level; solid lines represent CNNs and dashed lines represent RFs. When we add 1% noise to the RF data, the normalized RMSE drastically increases from 0.09 to 0.69 near the surface at z = 10 m. Figure 3f shows the feature importance for the RF trained with 1% noisy data. Comparing Figure 3f with Figure 2g, we see that the importance of divergence drastically decreases when a small amount of noise is introduced in the velocities, and the RF prioritizes the higher-quality surface density and cross-shore density gradient to compensate. But the loss of accurate divergence information ultimately renders the RF (and MLR) model useless. The dashed purple line in Figure 3e shows the performance of an RF trained without any gradient quantities as inputs, and it has no skill at all. When the divergence δ is degraded by noise, it is almost as if δ were not provided at all. In contrast, the CNN is much more robust to noisy data (solid lines in Figure 3e). The CNN RMSE increases slightly with higher noise levels, but not nearly as much as the RF errors. This main advantage of the CNN comes from convolving filters with the input images; since the filters are learned during the training process, we hypothesize that the CNN learns to filter out the noise.

Coarse Resolution
Another challenge of surface ocean data is that there is often a mismatch in spatial resolution between different variables, such as surface horizontal velocities and SST. Here, we test the effect of coarsening u_h from 1 to 5, 10, and 15 km resolution on the prediction of w at the scale of 1 km. The u, v fields at 1 km resolution are representative of DopplerScatt (Rodríguez et al., 2018), the 5 km resolution of u, v is close to that of HF radar (Paduan & Washburn, 2013), while 15 km is closest to SWOT's resolution (Fu & Ubelmann, 2014). The surface density is kept at 1 km resolution, representative of L2 SST satellite data (Govekar et al., 2022; Kilpatrick et al., 2015).
Unsurprisingly, decreasing the spatial resolution of u_h yields worse predictions of w (Figure 3i, Table 1). For MLR and RF, coarsening u_h to 5 km causes r² to drop significantly, from ∼0.6 to ∼0.3 (Table 1). The feature importance of the 5 km RF model in Figure 3j reveals that coarsening u and v strongly reduces the importance of divergence, which is expected because calculating δ on a larger grid results in smaller values of δ. We can also see the effect of a coarse-resolution δ reflected in the snapshot of 5 km RF predictions (Figure 3g), which are visually pixelated and have weaker w magnitudes compared to Figure 1a. The CNN is once again relatively less sensitive, with r² decreasing from 0.76 at 1 km resolution to 0.60, 0.40, and 0.36 for resolutions of 5, 10, and 15 km, respectively (Table 1). Impressively, the 15 km CNN predictions manage to capture fine-scale patterns in w (Figure 3h), but the magnitudes are under-predicted because the fine-scale velocity gradients that are important for w are missing when the data are coarsened.
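The loss of gradient amplitude under coarsening can be seen in a one-dimensional toy example (our own illustration): block-averaging a sinusoidal velocity before differencing systematically weakens the diagnosed divergence.

```python
import numpy as np

dx = 1.0                                  # grid spacing, km
x = np.arange(0.0, 100.0, dx)
u = np.sin(2 * np.pi * x / 20.0)          # velocity with 20 km wavelength
dudx_fine = np.gradient(u, dx)            # "divergence" on the 1 km grid

n = 5                                     # coarsen to 5 km
u_coarse = u.reshape(-1, n).mean(axis=1)  # block average
dudx_coarse = np.gradient(u_coarse, n * dx)

# The coarse-grid gradient is weaker than the fine-grid one, both
# because averaging damps the signal and because the wider difference
# stencil underestimates the slope of the remaining variation.
ratio = np.abs(dudx_coarse).max() / np.abs(dudx_fine).max()
```

For scales finer than the coarse grid can represent, the gradient signal is lost entirely rather than merely damped.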

Discussion and Conclusion
We find that machine learning models may be promising tools for diagnosing submesoscale vertical velocities within and just below the surface mixed layer. All the ML models tested were skillful at predicting w from surface data. Adding subsurface information (mixed layer depth and stratification) results in only marginal improvements. This is because w is highly correlated with the surface divergence in our data set, which is consistent with known submesoscale patterns where w is related to various surface gradients (D'Asaro et al., 2018; Freilich & Mahadevan, 2021; Ruiz et al., 2019; Tarry et al., 2021). It is thus reassuring that MLR and RF accurately predict w when provided with surface gradients (δ, ζ, ρ_y), but have no skill without those inputs. On the other hand, the CNN can predict w without explicitly being provided the gradient quantities as inputs. This is because, by design, the CNN learns to identify the spatial patterns that are important for making its prediction, which can be advantageous in scenarios where we know less about the system and do not have prior knowledge of which quantities are important to calculate. Furthermore, the CNN outperforms RF and MLR at predicting w at depth, suggesting that the surface fields contain more information about w than is used in the RF or MLR models. Another consideration is that the CNN considers a wider field of view, so there may be important information contained in nearby grid points that is not captured in our RF and MLR. Overall, we find that ML methods, when provided with accurate high-resolution (1 km) data, are successful at linking vertical velocities with surface patterns associated with submesoscale dynamics.

However, it is unrealistic to have perfect observations, and u_h at 1 km resolution is not yet commonplace. Adding a small 1%-10% noise to the surface velocities results in an extremely noisy divergence field, which is detrimental for RF and MLR. Methods for predicting w that rely directly on δ should thus carefully consider the errors in the surface horizontal velocities. In contrast, the CNN is relatively insensitive to noisy velocity data, which may be because it learns to filter out the noise during the training process. A caveat is that we use white noise in our experiments, but realistically, the form of measurement errors is more complex. For example, the DopplerScatt noise structure has a radial dependence within each swath (Rodríguez et al., 2018), and further work is needed with more realistic noise forms for different types of measurements. Additionally, we find that using coarser-resolution u_h data degrades the performance of all ML models, with the CNN being the most robust. This is expected, since it is difficult to resolve finer-scale velocity gradients with coarser-resolution u_h. These results emphasize the importance of obtaining accurate, high-resolution surface u and v measurements for estimating submesoscale vertical transport in the future.

10.1029/2023GL104835
This study is meant to be a first assessment of the applicability of ML models for learning physical relationships and vertical velocities in an idealized system without surface or internal waves and boundary layer turbulence. We do not expect the ML models presented here to immediately generalize to more complex systems without further training with more realistic data sets containing wave motions. More work remains to be done to study the possibility of applying these ML models to observational data, which will likely include training on regional model simulations that use realistic topography and boundary conditions. We find the results promising even with relatively simple ML models, and expect that further advances can be made through additional experimentation with other types of ML models and architectures. All of the ML methods tested perform best near the surface, with errors that increase with depth. This is the opposite of SQG methods, which are more successful at depths further below the mixed layer (Qiu et al., 2020; Uchida et al., 2019). Therefore, data-driven methods could be a good complement to SQG methods, and using both together could yield the best estimate of full water column vertical velocities. We think there is potential for much exciting future work toward applying these methods to the real ocean.

Figure 1. Top row: Snapshots of vertical velocity w from three simulations with high, medium, and low wind forcing and an initial stratification of N² = 10⁻⁴ s⁻². The colorbar is skewed because downward velocities are stronger. Only the portion of the numerical model domain that we focus on is shown, and black contours denote isopycnals with an interval of 0.1 kg m⁻³. The direction of the wind stress is out of the page, as shown in panel (c). Middle row: Normalized histograms of w in the upper 200 m from the snapshots shown in the top row. The vertical scale is logarithmic and the horizontal scale differs between panels. Bottom row: Root-mean-square (RMS) profiles of w (blue line) from the same snapshots. The horizontal lines denote the median mixed layer depth, with the interquartile ranges shaded in gray.

Figure 2. (a) Snapshot of predicted w from the Random Forest (RF) "all" model (see Table 1), to be compared with Figure 1a, and (d) scatterplot of predicted w against true w in the upper 200 m across the entire test set. The 1:1 line is shown. (b and e, c and f) Same as (a and d) but for the Multiple Linear Regression (MLR) "all" and Convolutional Neural Network (CNN) "all" models (Table 1). (g) Feature importance for the RF "all" model. (h) Root-mean-square error profiles across the entire test set, normalized by the RMS w at each depth. (i) Correlation coefficient r² of the entire test set with depth. The median mixed layer depth is indicated by a horizontal gray line and the interquartile range is shaded in gray.

Figure 3. (a-d) Snapshots of w from Random Forests (RF) and Convolutional Neural Networks (CNN) trained on velocity data with 1% or 10% noise, to be compared with Figure 1a. (e) Profiles of the normalized root-mean-square error (RMSE) of w predictions from the CNN (solid lines) and RF (dashed lines) for different levels of data noise. (f) Feature importance for the RF with 1% noise. (g, h) Snapshots of w from RF and CNN using coarsened u, v velocities at 5 or 15 km resolution. (i) Normalized RMSE for RF and CNN predictions trained on different resolution data. (j) Feature importance for the RF 5 km model. See Table 1 for more information on each model.

Table 1
Overview of the Different ML Models Evaluated: Multiple Linear Regression (MLR), Random Forest (RF), and Convolutional Neural Network (CNN). For the MLR and RF, the inputs are values at a single grid point, while the CNN input features are images with dimensions 32 × 32. For the experiments with noisy data, noise was applied only to the u, v velocities, and the divergence and vorticity were then calculated from the noisy velocity fields. Likewise, in the experiments with coarsened data, only the velocities are coarsened. The correlation coefficient r² of each ML model's predicted w with the true w in the upper 200 m of the test set is shown as an indication of its performance.