Global Predicted Bathymetry Using Neural Networks

A coherent portrayal of global bathymetry requires that depths be inferred between sparsely distributed direct depth measurements. Depths can be interpolated in the gaps using alternative information such as satellite-derived gravity and a mapping from gravity to depth. We designed and trained a neural network on a collection of 50 million depth soundings to predict bathymetry globally using gravity anomalies. We find the best result is achieved by pre-filtering depth and gravity in accordance with the isostatic admittance theory described in previous predicted depth studies. When training the model, if the training and testing split is a random partition at the same resolution as the data, the training and testing sets will not be independent, and model misfit is underestimated. We solve this problem by partitioning the training and testing sets with geographic bins. Our final predicted depth model reduces the RMSE of the previous predicted depth model by 16%, from 165 to 138 m. Among constrained grid cells, 80% of the predicted values are within 128 m of the true value. Improvements to this model will continue with additional depth measurements, but predictions at higher spatial resolution, being limited by upward continuation of gravity, should not be attempted with this method.


Introduction
According to mapping requirements proposed by the Seabed 2030 project, less than 25% of the ocean floor has been mapped (GEBCO compilation group, 2023). At a uniform spatial resolution of 1 arcminute, this figure is still less than 30%. Through efforts such as the Seabed 2030 project (Mayer et al., 2018), coverage in publicly available compilations has improved in recent years, but the distribution of shipboard depth measurements remains heterogeneous and sparse, providing nearly complete high-resolution coverage in some coastal areas but leaving unmapped gaps the size of western US states in remote regions (Figure 1a). While there is no substitute for shipboard surveys to recover high-resolution bathymetry, we can make a good estimate of the seafloor depth, at a limited resolution, in these gaps using gravity field data derived from satellite altimeters (e.g., Smith & Sandwell, 1994).
Satellite measurements have provided a wealth of information on the gravity field with global coverage, at a spatial resolution of 12 km and accuracy nearing 1 mGal (Sandwell et al., 2021). A new generation of swath altimeters will improve the resolution of the gravity field, which may improve the accuracy beyond 1 mGal. Since gravity anomaly and depth are correlated within certain wavelength bands (Smith, 1998), we can infer depth from gravity. Within a restricted region, depths may be directly inverted from gravity measurements (e.g., Parker, 1972), but this requires a priori knowledge of crustal density and other geologic quantities such as the degree of isostatic compensation (Watts, 2001). These requirements limit the method's application at very large scales. Smith and Sandwell (1994) developed (and revised in Smith and Sandwell (1997)) an algorithm for this purpose, and that algorithm is currently used in the popular global elevation product SRTM15+ (Tozer et al., 2019). The procedure uses admittance theory to design filters for bathymetry and gravity that linearize their relationship. The filtering steps remove most assumptions of isostatic compensation. After filtering, the algorithm determines the local slope of the bathymetry-gravity relationship on a coarse grid, and the predicted bathymetry is the product of the filtered gravity and that slope (Smith & Sandwell, 1994). Smith and Sandwell (1994) termed the post-filtering process the "inverse Nettleton," after Nettleton (1939), to describe the selection of a best-fitting slope relating gravity anomaly and topography. As such, we will refer to this method as the Nettleton method. The quality of the estimation has improved with increasingly precise gravity recovery (number of measurements, orbits, instrument quality, processing techniques) and the greater quantity of shipboard measurements, especially in some of the previously poorly mapped regions. The most recent iteration of this prediction is described in detail by Tozer et al. (2019). While the predicted depth product has been widely adopted, some of the details in the prediction method may not be optimal and leave room for improvement.
There has been recent interest in using modern machine learning methods to improve the prediction of bathymetry. For example, Annan and Wan (2022) and Wan et al. (2023) have used neural networks with various architectures to predict absolute depth from gravity and gravity-related quantities (e.g., deflections of the vertical, gravity gradients). These models have been trained on a particular study area, and their predictions are thus limited to the selected area. The present study is an attempt to update the global predicted depth grid, and our goal is to replace the Nettleton method of depth estimation with a new approach based on machine learning. Specifically, we train a deep neural network (DNN) to predict depth globally using a publicly available collection of depth measurements. We distinguish our method from previous machine learning predicted bathymetry studies in a few key ways: we attempt a global prediction; we predict depth in a certain waveband rather than the absolute depth; and we split training and testing data in a unique way.

Data Preparation and Feature Generation
We begin with the collection of shipboard depth measurements. The collection is based on that used in the SRTM15+V2 product (Tozer et al., 2019), and a description of the data sources is found in that study and in Becker et al. (2009). Since Tozer et al. (2019), data from 905 cruises (retrieved from the National Centers for Environmental Information (NCEI)) have been added to the collection. Data have been manually edited to remove erroneous measurements. For our purposes, data provenance is treated equally, and data are not distinguished by cruise ID or instrument type (multibeam or single-beam). Raw shipboard data are reduced by a median filter to 15 arcsecond resolution. These data are combined and blockmedian-reduced to 1 arcminute resolution on a spherical Mercator projected grid spanning −80.738° to 80.738° latitude. The result is a collection of 52,253,670 records of type [longitude, latitude, depth] (or [ϕ, θ, d]). Some potentially useful information is lost in the blockmedian reduction, but it is necessary for the filtering we apply (see next section).
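For readers unfamiliar with blockmedian reduction, the following minimal sketch (Python with NumPy and pandas; our own illustrative code, not the processing pipeline used here) shows the idea of reducing many soundings to one median value per grid cell. The real reduction is performed on a spherical Mercator grid rather than the simple geographic binning assumed below.

```python
import numpy as np
import pandas as pd

def blockmedian(lon, lat, depth, spacing_deg=1.0 / 60.0):
    """Reduce raw soundings to one median value per grid cell.

    Simplified stand-in for the blockmedian reduction described in the text;
    cells are simple longitude/latitude bins of the given spacing.
    """
    df = pd.DataFrame({"lon": lon, "lat": lat, "depth": depth})
    df["ix"] = np.floor(df["lon"] / spacing_deg).astype(int)
    df["iy"] = np.floor(df["lat"] / spacing_deg).astype(int)
    cells = df.groupby(["ix", "iy"], as_index=False).median()
    # Replace the per-cell median coordinates with cell-center coordinates.
    cells["lon"] = (cells["ix"] + 0.5) * spacing_deg
    cells["lat"] = (cells["iy"] + 0.5) * spacing_deg
    return cells[["lon", "lat", "depth"]]
```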
We can use the coordinates ϕ, θ of any constrained depth record to sample the global gravity anomaly grid (Figure 1c) (Sandwell et al., 2021). Since the constrained depth cells are co-registered with the gravity grid, sampling is trivial. The result is records of type [ϕ, θ, d, g]. From these records, the target quantity we wish to predict is depth, and the features we may use to train the prediction are ϕ, θ, and g. We could use other geophysical or geographical grids as features, since they can be sampled by longitude and latitude. We explore such additional features in the discussion. The cyclical nature of longitude is not captured by its simple numerical value, so we decompose the longitude ϕ into sin(ϕπ/180) and cos(ϕπ/180), or ϕ_s and ϕ_c. With this treatment of longitude, we avoid a discontinuity at the Greenwich Meridian in the predicted depth grid. Following this amendment, our feature vector is [ϕ_s, ϕ_c, θ, g].
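As a concrete illustration, the sketch below (Python/NumPy; the function and variable names are ours, not from the original processing code) assembles this feature vector with the cyclic encoding of longitude.

```python
import numpy as np

def build_features(lon_deg, lat_deg, grav):
    """Assemble the feature vector [phi_s, phi_c, theta, g] described above.

    Longitude is encoded as sin/cos so that 359.9 deg and 0.1 deg map to
    nearby feature values, avoiding a discontinuity at the Greenwich Meridian.
    """
    phi = np.radians(lon_deg)
    phi_s = np.sin(phi)
    phi_c = np.cos(phi)
    return np.column_stack([phi_s, phi_c, lat_deg, grav])
```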

Filtering Depth and Gravity
Since gravity and bathymetry are only correlated over certain wavelengths (Smith, 1998), it would be less than optimal to try to predict the absolute depth, which contains wavelengths where gravity and bathymetry are uncorrelated. Here we use the filtering steps established by Smith and Sandwell (1994). The depth measurements are gridded on a 1 arcminute spherical Mercator grid using continuous splines in tension (Smith & Wessel, 1990) with a tension factor of 0.6, and this grid is separated into low- and high-frequency components with a Gaussian filter with 0.5 gain at 160 km (Equation 9 of Smith and Sandwell (1994)). The high-pass depth is then low-pass filtered with 0.5 gain at 16 km, resulting in a 160-16 km band-pass depth grid, which we will call h. The constrained points of this depth grid are extracted (Figure 1d). We lose some high-frequency information this way in order to match the spectral content of the gravity. The gravity anomaly is high-pass filtered in the same way as the depth measurements. Finally, the high-pass filtered gravity anomaly is downward-continued to the low-pass filtered depth using a depth-dependent Wiener filter (Equation 11 of Smith and Sandwell (1994)); we will call this g* (Figure 1e). We will examine the effects of omitting this pre-processing step in the discussion.
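A minimal sketch of the band-pass idea is given below (Python/NumPy), assuming a uniformly spaced grid with spacing given in km and a Gaussian gain of the general form used by Smith and Sandwell (1994). The Mercator-grid distortion correction and the depth-dependent Wiener downward continuation applied to the gravity are omitted; this is an illustration, not the filter implementation used in this study.

```python
import numpy as np

def gaussian_lowpass_gain(k, half_wavelength_km):
    """Gaussian gain equal to 0.5 at the given wavelength (k in cycles/km)."""
    return np.exp(-np.log(2.0) * (k * half_wavelength_km) ** 2)

def bandpass_160_16(grid, dx_km, dy_km):
    """Band-pass a 2-D grid: suppress wavelengths > 160 km and < 16 km."""
    ny, nx = grid.shape
    kx = np.fft.fftfreq(nx, d=dx_km)
    ky = np.fft.fftfreq(ny, d=dy_km)
    k = np.hypot(*np.meshgrid(kx, ky))                 # radial wavenumber, cycles/km
    spec = np.fft.fft2(grid)
    highpass = 1.0 - gaussian_lowpass_gain(k, 160.0)   # remove long wavelengths
    lowpass = gaussian_lowpass_gain(k, 16.0)           # remove short wavelengths
    return np.real(np.fft.ifft2(spec * highpass * lowpass))
```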

Data Splitting
We must split our data set into training, validation, and testing sets. If we simply select 20% of records for testing and validation at random, we find that almost any given record in the testing or validation set is within 1 arcminute of a record in the training set. Marks et al. (2010), analyzing predicted depths generated by the Nettleton procedure, showed that the prediction error increases with distance from constrained nodes. How strong this effect is depends on the roughness of the seafloor, but it appears the errors become decorrelated beyond a certain distance (15-30 km). In other words, records that are sufficiently close in position are not independent (Figure 3b). In practice, the training loss and validation (and testing) loss will be nearly identical, and we will not have a good idea of when the model is overfitting during training or how the model generalizes to the unmapped gaps.
By splitting the data into longitude-latitude bins and randomly selecting bins for testing and validation, we can reduce the dependence of the data sets (Figure 3c). We use a bin size of 30 arcminutes (∼50 km at the equator) to group records, and then randomly assign bins to training, validation, and testing. This bin size could be made larger or smaller, or it could be made to vary based on prior knowledge of seafloor roughness, but it is important not to tune the bin size by model loss performance. The training, validation, and testing data sets comprise 31,320,896, 10,460,197, and 10,472,576 records, respectively.
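The following sketch (Python/NumPy; our own illustrative implementation, not the authors' code) shows one way to perform such a bin-based split. The 60/20/20 split fractions are an assumption that roughly matches the record counts reported above.

```python
import numpy as np

def split_by_bins(lon, lat, bin_deg=0.5, fracs=(0.6, 0.2, 0.2), seed=0):
    """Assign each record to train/validation/test by its 30-arcmin geographic bin.

    All records in the same longitude/latitude bin receive the same label,
    which keeps near-duplicate soundings out of the held-out sets.
    Returns an integer array: 0 = train, 1 = validation, 2 = test.
    """
    rng = np.random.default_rng(seed)
    ix = np.floor((lon % 360.0) / bin_deg).astype(np.int64)
    iy = np.floor((lat + 90.0) / bin_deg).astype(np.int64)
    bin_id = ix * 100000 + iy                       # unique id per bin
    unique_bins = np.unique(bin_id)
    labels = rng.choice(3, size=unique_bins.size, p=list(fracs))
    lookup = dict(zip(unique_bins, labels))
    return np.array([lookup[b] for b in bin_id])
```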

Model Architecture and Training
We use the TensorFlow software library to design and train the neural network (Abadi et al., 2015). The neural network comprises only successive densely connected layers (Figure 2), each using a rectified linear unit (ReLU) activation function to introduce non-linearity. Input features are normalized by the mean and variance of their distributions in the training data set. We use eight successive dense layers with 256 neurons per layer, and a final linear output layer. Model architecture can be tweaked ad nauseam, so we cannot claim this is a strictly optimal configuration, but this particular arrangement was selected because we found it performed as well as a wider but shallower model (e.g., 4 layers of 1,024 neurons each) while using far fewer parameters (4 × 10^5 vs. 3 × 10^6).
We choose mean squared error (MSE) as the loss function. Each dense layer is regularized with L2 regularization (λ = 0.01). We use the Adam optimizer (Kingma & Ba, 2017) with a learning rate of 0.001. Model training proceeds until the validation loss is no longer decreasing.
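A minimal TensorFlow/Keras sketch of this setup is shown below. The function names, batch size, and early-stopping patience are our assumptions for illustration; they are not specified in the text.

```python
import tensorflow as tf

def build_dnn(train_features):
    """Sketch of the architecture described above: a normalization layer,
    eight dense ReLU layers of 256 units with L2 regularization, and a
    linear output predicting the filtered depth h."""
    norm = tf.keras.layers.Normalization()
    norm.adapt(train_features)                      # mean/variance from training set

    layers = [norm]
    for _ in range(8):
        layers.append(tf.keras.layers.Dense(
            256, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(0.01)))
    layers.append(tf.keras.layers.Dense(1))         # linear output layer
    model = tf.keras.Sequential(layers)

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

# Training stops once validation loss stops improving (early stopping assumed):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           callbacks=[tf.keras.callbacks.EarlyStopping(
#               patience=5, restore_best_weights=True)])
```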

Inference: Generating the Global Predicted Grid
The end goal of this model is a global grid of predicted depths. After the model is trained, predictions of h are generated on a 1 arcminute spherical Mercator grid (on-shore values are masked). The long-wavelength depth, saved from the filtering step, is added to the predicted depth, h, to give the absolute depth, d. Finally, for distribution, the predicted depth grid is "polished": grid cells with constrained depth measurements are reset to the measured depth. This polishing step is omitted in the following discussion and analysis.
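The sketch below (Python/NumPy, with hypothetical function and variable names of our own) outlines these inference steps, assuming the trained model and all input grids are co-registered on the same 1 arcminute Mercator grid.

```python
import numpy as np

def predict_global_grid(model, phi_s, phi_c, theta, gstar, ocean_mask,
                        long_wavelength_depth, constrained_depth=None):
    """Predict h on the grid, add back the long-wavelength depth, and
    optionally "polish" cells that hold constrained measurements.

    All inputs are 2-D arrays on the same grid; constrained_depth holds
    measured depths where available and NaN elsewhere.
    """
    features = np.column_stack([phi_s.ravel(), phi_c.ravel(),
                                theta.ravel(), gstar.ravel()])
    h = model.predict(features, batch_size=65536).reshape(phi_s.shape)
    depth = h + long_wavelength_depth               # restore absolute depth d
    depth = np.where(ocean_mask, depth, np.nan)     # mask on-shore cells
    if constrained_depth is not None:               # optional "polishing" step
        depth = np.where(np.isnan(constrained_depth), depth, constrained_depth)
    return depth
```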

Base Model
Using the feature vector [ϕ_s, ϕ_c, θ, g*] to predict h, we achieve a training RMSE of 85 m, a validation RMSE of 108 m, and a testing RMSE of 109 m. Loss values are useful for comparing one trained model against another, but they are imperfect when comparing to the Nettleton method. Since no "testing" data are withheld from the Nettleton method (i.e., the prediction is tuned on all available data), we must be cautious in comparing RMSE between the two methods. With that in mind, the RMSE of the Nettleton prediction against h is about 143 m. We will look more carefully at model misfit in the discussion.

Modeling Without Filtering Depths and Gravity
For comparison, a model trained without filtering depths and gravity anomaly performs much worse and offers no improvement over the Nettleton method. For this model, we achieve a training RMSE of 140 m, a validation RMSE of 173 m, and a testing RMSE of 175 m. This poor performance may be because the distribution of depth is more variable with location. For example, where the regional depth is 6,000 m, the mean depth will be near 6,000 m, and similarly for a regional depth of 1,000 m. By high-pass filtering the depth, the overall variance of the data is reduced. Omitting the low-pass filter at 16 km also contributes to the worse performance. The Nettleton method RMSE above is for the band-pass filtered data, h. If the short wavelengths are included, the Nettleton RMSE is about 165 m.

Alternatively, we can high-pass filter depth and gravity but omit the low-pass filter at 16 km; indeed, we may wish to do this so that we do not lose short-wavelength details. For this model, we achieve a training RMSE of 124 m, a validation RMSE of 141 m, and a testing RMSE of 142 m. We cannot compare these directly to the base model since the target quantities are different. Instead, we can evaluate the RMSE of the base model prediction against the high-pass depth. In this case, testing RMSE values are nearly identical for the two models, suggesting the trained models are similar and that the greater loss reflects the greater variance of the target data.

Added Features
It might seem that adding features from other global grids would be a good approach to decrease model loss and improve performance. For example, the spreading rate at the time crust is created is known to affect the roughness of bathymetry (Small & Sandwell, 1994). Crustal age and sediment cover will also influence the correlation of gravity and bathymetry (Smith & Sandwell, 1994). We use ϕ, θ to sample grids of crustal age and paleo-spreading rate (Seton et al., 2020) and sediment thickness (Whittaker et al., 2013), add these to the feature vector, and train a new model. In practice, there are problems with using these features.
First, these grids have many regions of missing data, and the missing values must be handled somehow. Since a key purpose of this model is prediction on a global grid, grid cells with missing values cannot simply be thrown out. We tested different schemes for replacing missing values: replacement with the mean feature value; replacement with the mean feature value plus an additional boolean feature indicating replacement; and filling missing values with nearest-neighbor interpolation of the feature grid. In our attempts to use crustal age, paleo-spreading rate, and sediment thickness as features, validation and testing loss are not improved (nor are they improved by any one such feature), and in fact model loss is worse with the additional features. In addition, at inference time, sharp discontinuities in the feature grids are mapped into the prediction grid, creating unwanted artifacts.
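For illustration only (none of these schemes improved the model), the sketch below shows the second scheme, mean replacement plus a boolean indicator, in Python/NumPy; the function name is our own.

```python
import numpy as np

def impute_with_flag(values):
    """Replace missing feature values with the mean and add a boolean flag.

    Returns (imputed_values, was_missing) for use as two feature columns,
    mirroring the second missing-value scheme described in the text.
    """
    was_missing = np.isnan(values)
    fill = np.nanmean(values)
    imputed = np.where(was_missing, fill, values)
    return imputed, was_missing.astype(float)
```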
Other gravity-related quantities such as deflection of the vertical and the vertical gravity gradient (VGG) may be good features to use, but they may be largely redundant. We found that adding VGG as a feature (Sandwell et al., 2021) improves model training RMSE only slightly, and it slightly worsens validation and testing RMSE. This model has a training RMSE of 85 m, a validation RMSE of 109 m, and a testing RMSE of 110 m. Since there is no improvement on the base model, we consider the base model the preferred model and refer to it simply as the "DNN" model in the following discussion.

Generating a Predicted Depth Grid
After training, we generate model-predicted depths on a global 1 arcminute Mercator grid. One problem that results is short-wavelength artifacts or "hallucinations" (Figure 4e). These hallucinations typically occur at wavelengths shorter than the 16 km filter that was applied to the original gravity and bathymetry, so they must be a product of the DNN training. We can reduce these with regularization during training, but not completely nor in a deterministic way. For the distributed predicted depth grid, we apply a low-pass filter with 0.5 gain at 16 km to remove these hallucinations. This post-inference filtering does not weaken model results. In fact, error metrics are very slightly improved, and the prediction RMSE of h after filtering is 107 m for the testing data set. We use the predictions on the filtered grid in the following discussion.

Comparison to Nettleton Model
While not quantitative, it is important to visually inspect the DNN predicted depth grid and make comparisons to the Nettleton grid (Figure 4). The most obvious qualitative difference is in continental margin areas or areas of relatively smooth seafloor. In these areas, the Nettleton predicted bathymetry has a rough "orange peel" texture (Figure 4d), an artifact of downward continuation of noisy gravity data. This type of seafloor is smoother in the DNN prediction.
Using all available depth measurements (not partitioned for training/testing), we compare the error distributions of the Nettleton method and the DNN method. These results are shown in Figure 5. For the Nettleton method, the predicted depth is within 68 m of the measured depth for 50% of points and within 168 m for 80% of points. For the DNN method, these percentiles are 45 and 128 m, respectively. Additionally, the mean error has been reduced from 13 m for the Nettleton method to 3 m for the DNN, indicating a less-biased estimate. Figure 5b shows the distribution of absolute model error in the southern oceans. Overall, the spatial patterns of misfit are similar for the two models. At this scale, the noticeable differences are found nearer to land (e.g., the West Antarctic Peninsula, Chile, Australia), where the DNN model shows clear improvements over the Nettleton method.
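The summary statistics quoted above (bias, RMSE, and percentiles of absolute error) can be computed as in the short sketch below (Python/NumPy; our own illustrative code, not the analysis script used for Figure 5).

```python
import numpy as np

def error_summary(predicted, measured):
    """Summarize prediction error: mean error (bias), RMSE, and the
    50th/80th percentiles of absolute error, all in meters."""
    err = predicted - measured
    return {
        "mean_error_m": float(np.mean(err)),
        "rmse_m": float(np.sqrt(np.mean(err ** 2))),
        "p50_abs_m": float(np.percentile(np.abs(err), 50)),
        "p80_abs_m": float(np.percentile(np.abs(err), 80)),
    }
```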

BODC Data
Because the Nettleton prediction is tuned using all available data, we do not have a concrete idea of how well it generalizes to unseen areas of seafloor. It is therefore useful to reserve a data set that is not used in either model's construction. To compare the performance of the Nettleton and DNN models, we used a collection of depth measurements from 279 cruises from the British Oceanographic Data Centre (BODC) that are not yet incorporated into the prediction model. The raw data are decimated to 15 arcsecond resolution, and measurements that overlap with data already in the prediction data set are removed. We did not thoroughly inspect the BODC data for erroneous measurements, so measurements that differ from either predicted grid by more than 2,000 m are removed. In total, there are 6,242,414 points. The Nettleton prediction has an RMSE of 150 m for the BODC data set. The final DNN model has an RMSE of 104 m.
If we restrict the analysis spatially to the highest concentration of measurements (80% of the data are around the British Isles), the Nettleton RMSE is 73 m and the DNN model RMSE is 62 m, much lower than the testing RMSE (Figure 6). This almost certainly reflects the proximity of these data to those in our training set. However, we see from the error distribution that the slight bias in the Nettleton prediction is not present in the DNN prediction.

Potential for Improvements
Our model is a simple implementation of a neural network to predict depth globally, and we have shown its clear improvement over the Nettleton method. Yet there are many possible directions for improvement depending on one's objectives. Expansion of the training data set, modifications of model architecture, or a multi-regional approach to the problem all offer potential to improve on our model.
If coverage of publicly available bathymetry compilations continues to improve as it has in recent years, and it likely will (Mayer et al., 2018), model predictions will clearly improve (this would be true of any model). Low-resolution data in remote regions, which can be collected by autonomous vehicles, will likely offer the greatest benefit to our model approach.
We have not made use of high-resolution multibeam data in our model, and we do not aim to predict features at such resolution. Upward continuation of gravity anomalies limits the resolution of gravity from satellite altimetry to a length scale of about π times the regional depth (e.g., Smith & Sandwell, 2004), so it is not possible to realistically predict depth from gravity alone (and its derivatives) at such scales. An approach using convolutional neural networks, as demonstrated by Annan and Wan (2022), may successfully learn from higher resolution bathymetry in regional settings.
A prediction model trained on regional data will, all else being equal, perform better in that region than a model trained on global data. Moran et al. (2022) identified regions where various learning algorithms might preferentially excel. This suggests a global model could alternatively be constructed from a suite of regional models. A particular case where such a multi-regional model would excel is in predicting higher resolution depth in areas where that is realistic. Susa (2022) demonstrated such an approach to predicting depth in near-coastal regions. In this setting, altimetric ranging and gravity accuracy suffer from land contamination (Raney & Phalippou, 2011), and visible spectra may be correlated with bathymetry, making this an ideal case for alternative depth prediction.

Conclusions
1. Using a large collection of depth measurements and satellite-derived gravity anomalies, we trained a deep neural network to predict seafloor depth.
2. We find that applying filters (described by Smith and Sandwell (1994)) to bathymetry and gravity before training is necessary for a good result and conforms the data more closely with the assumption of identical distributions.
3. When dealing with sparse heterogeneous sampling, the training-testing split must be treated carefully. If the training and testing split is a random partition at the same resolution as the data, the training and testing sets are not independent, and model misfit results will be too optimistic.

Figure 1 .
Figure 1. An overview of the data sets in map view. (a) Distribution of shipboard depth measurements in the global oceans based on publicly available data. (b) Zoomed-in view of depth measurements in the South China Sea, colored by absolute depth, d. (c) Free-air gravity anomaly, g, for the same region as (b). (d) High-pass filtered depths, h, and (e) filtered gravity anomaly, g* (filtering described in the text).

Figure 2 .
Figure 2. Schematic of the neural network architecture. A feature vector with normalized inputs is transformed by successive hidden layers to predict depth. Continuing dots in the input leave the possibility of additional or alternative features. Output depth may be absolute or filtered depth depending on how the model is trained.

Figure 3 .
Figure 3. An example of partitioning the data. Same map area as shown in Figures 1b-1e. (a) The collection of depth measurements. The data must be partitioned into training, testing, and validation sets. (b) Randomly withholding 20% of the data, sampled uniformly. Withheld data are shown in red. Using this partition scheme, almost any withheld point has a nearly identical point in the training set, so the sets are not independent. (c) Sampling the data after binning into groups of 30 arcminutes. Withheld data are shown in red.

Figure 4 .
Figure 4. Model-predicted depths with long-wavelength depth added. The same map area as shown in Figures 1b-1e. (a) Nettleton method; (b) raw DNN-predicted grid; (c) DNN-predicted grid, low-pass filtered at 12 km. (d) Example of the "orange peel" texture in the Nettleton prediction, boxed region in (a). (e) Example of "hallucinations" in the DNN prediction, boxed region in (b).

Figure 5 .
Figure 5. Comparison of Nettleton and DNN models by prediction error. (a) Distribution of prediction error for all 1 arcminute data (N = 52,253,670). (b) Average absolute difference (prediction minus measurement) for the Nettleton (upper) and DNN (lower) models.

Figure 6 .
Figure 6. Nettleton and DNN model predictions for withheld BODC data. (a) Depth measurements in the full data set shown in black; measurements from the disjoint BODC data set shown in red. Misfit of BODC data for (b) Nettleton predicted depth and (c) neural network predicted depth. (d) Distribution of BODC data misfit for Nettleton predicted depth and neural network predicted depth.