New and updated global empirical seawater property estimation routines

We introduce three new Empirical Seawater Property Estimation Routines (ESPERs) capable of predicting seawater phosphate, nitrate, silicate, oxygen, total titration seawater alkalinity, total hydrogen scale pH (pHT), and total dissolved inorganic carbon (DIC) from up to 16 combinations of seawater property measurements. The routines generate estimates from neural networks (ESPER_NN), locally interpolated regressions (ESPER_LIR), or both (ESPER_Mixed). They require a salinity value and coordinate information, and benefit from additional seawater measurements when available. These routines are intended for quality control and quality assessment of seawater property measurements, for generating estimates for calculations that require approximate values, for original science, and for producing biogeochemical property context for a data set. Relative to earlier LIR routines, the updates expand functionality with new estimated properties and combinations of predictors, a larger training data product that includes new cruises from the 2020 Global Ocean Data Analysis Project data product release, and a first-principles approach for quantifying the impacts of anthropogenic carbon on DIC and pHT. We show that the new routines perform at least as well as existing routines and, in some cases, outperform existing approaches even when limited to the same training data. Because additional training data have been incorporated into these updated routines, the updates should be considered an improvement over earlier versions. The routines are intended for all ocean depths for the interval from 1980 to ~2030 c.e., and we caution against using them to directly quantify surface ocean seasonality or to make more distant predictions of DIC or pHT.

Anthropogenic impacts on the environment are changing the physical and chemical state of the ocean. The accumulation of excess ocean heat (Roemmich et al. 2012; Purkey and Johnson 2013) and carbon (Khatiwala et al. 2013; Carter et al. 2017, 2019a; Gruber et al. 2019) and the redistribution of freshwater between regions of the ocean (Durack et al. 2012) and geological reservoirs are modifying ocean circulation pathways and causing sea level rise (Nerem et al. 2018), ocean acidification (OA; Feely et al. 2004, 2009; Doney et al. 2009; Jiang et al. 2019), and ocean deoxygenation (Sasano et al. 2018). These changes are fundamentally shifting the physical and chemical environments of marine organisms and threatening ocean ecosystems and services (Gattuso et al. 2015; Doney et al. 2020).
Global climate change poses a challenge for ocean monitoring, necessitating sustained high-quality measurements across timescales and across the vast and remote global ocean. A variety of approaches and platforms have been developed for ocean monitoring (e.g., autonomous surface vehicles, profiling floats, and fixed moorings), each of which has a niche for examining a range of temporal and spatial scales (Bushinsky et al. 2019) and each of which has strengths and weaknesses for addressing aspects of global change (Carter et al. 2019b). The cost and difficulty of measurements are limiting factors for all approaches, so it is currently impossible to have extensive high-quality and high-frequency measurements everywhere they are desired. Given this limitation, an emerging approach involves using algorithms that have been trained to reproduce measurements of seawater properties from co-located measurements of other seawater properties. These algorithms take advantage of strong regional correlations between seawater properties that result from oceanographic processes that shape the distributions of many different seawater properties in similar ways (e.g., organic matter cycling with nearly constant stoichiometric ratios between macronutrients, and freshwater cycling that linearly dilutes or concentrates most chemical concentrations in seawater). Once trained, the algorithms can be used to predict the desired properties from other properties that are more routinely measured, either remotely by satellite or using available in situ sensors. This strategy has seen use for more than two decades (e.g., Goyet et al. 2000; Lee et al. 2006), though recent advances in the skill, flexibility, and diversity of the available algorithms (Carter et al. 2016, 2018; Sauzède et al. 2017; Bittig et al. 2018; Landschützer et al. 2019; Gregor and Gruber 2021) have made it possible to create climatologies (Broullón et al. 2019, 2020; Jiang et al. 2019), calibrate and monitor drift adjustments for sensors on autonomous platforms (Takeshita et al. 2018), create novel global data products, and fill holes in data sets when the final analysis is not strongly sensitive to estimate errors, for example, when silicate and phosphate are estimated for use in seawater carbonate chemistry calculations (e.g., van Heuven et al. 2011) or when total alkalinity (TA) is needed to convert pHT between temperatures (Jiang et al. 2019; Carter et al. 2019a).
The growing number of use cases for seawater property estimation algorithms means it is important to refine the algorithms to the extent possible, especially given that some observing approaches depend on these algorithms for sensor calibration and validation. As a notable example, biogeochemical Argo floats calibrate pHT and nitrate sensors using algorithm estimates in the comparatively stable mid-depths of the ocean, and additionally rely on estimated seawater alkalinity at all depths to calculate dissolved inorganic carbon (DIC) and the partial pressure of CO2 (pCO2) (Gray et al. 2018; Williams et al. 2018).
Increasing ocean DIC content from anthropogenic carbon (Cant) storage and decreasing pHT values from OA pose an ongoing challenge to the accuracy of these algorithms: the algorithms are trained, or fit, to data collected over the last three decades, but will be used primarily to estimate seawater properties in recent and coming years until improved algorithms become available. How then should we deal with the changes from, for example, OA? Three notable existing algorithms for pHT have simplistic and empirical treatments of the effects of OA. One has no parameterization for OA, but instead provides a suggested time span for the algorithm; another uses a simple density interpolation of empirically derived global changes that, for example, does not distinguish the rapidly changing intermediate North Atlantic from the comparatively static intermediate subpolar North Pacific (Carter et al. 2018); and the last uses a regional empirical approach that risks misattributing long-term change and natural variability in pHT (Bittig et al. 2018). Broullón et al. (2020) also use an empirical relationship to capture the effects of OA for their DIC algorithm. These algorithms are expected to become increasingly biased under future OA conditions.
In this paper, we improve upon existing algorithms with new methods and new observational data products, and encode them into a package of software routines in the MATLAB language. We introduce a new neural-network approach that can return estimates from more diverse combinations of predictors than previous efforts. We also improve how the algorithms handle Cant impacts on DIC and pHT; the new approach should allow future projections of these properties to remain useful over longer time horizons while avoiding bias from empirical fits to interannual variability.

Basics, updates, new methods, and new features
The first of two products in this effort is an improvement upon the Locally Interpolated Regression (LIR) strategy for global and full-water-column seawater alkalinity estimation that was implemented by Carter et al. (2016) and is similar to a method described by Velo et al. (2013). This approach was later updated and extended to estimating seawater pHT and nitrate (Carter et al. 2018: LIRv2) and was most recently expanded to oxygen, phosphate, and silicate estimates. The new improvements in LIR-based empirical seawater property estimation routines (called here ESPER_LIR, equivalent to LIRv3), relative to LIRv2, include:
1. Use of the 2020 release of the GLObal Data Analysis Project data product (GLODAPv2.2020: Olsen et al. 2020) for predictor variables, with many thousands of new measurements, particularly in the North Pacific, relative to the GLODAPv2 version used for earlier versions of the global algorithms.
2. Numerous additional data sets from the Gulf of Mexico and the Mediterranean Sea as training data, fixing large and important data gaps in LIRv2.
3. The ability to return estimates of DIC.
4. Simple and improved estimation of anthropogenic perturbations to pHT and DIC based on first principles, allowing better predictions of future changes in seawater carbonate chemistry.
5. Implementation of a distance weighting for the fit in ESPER_LIR, allowing more data to be used for each of the many regressions.
6. Ease-of-use changes that allow the insights from the LIR routines to be more easily adapted for regional applications.
In addition to LIR updates, we introduce new neural-network-based routines (ESPER_NN) to take advantage of the strengths of neural networks, including the ability to model nonlinear relationships between predictors and estimated quantities (Tu 1996). In several important ways this new algorithm imitates the design of the "Carbonate system and Nutrients concentration from hYdrological properties and Oxygen using a Neural-network version B" (CANYON) algorithms designed by Sauzède et al. (2017) and updated by Bittig et al. (2018). The significant differences between ESPER_NN and the existing algorithms are as follows:
1. Inclusion of new data from the GLODAPv2.2020 data product (as with the LIR updates).
2. Like ESPER_LIR, ESPER_NN uses a new first-principles-based approach to estimate the impacts of long-term trends for pHT and DIC.
3. ESPER_NN can function with 16 combinations of seawater properties, requiring at minimum salinity and coordinate information, while alternative neural network approaches also require oxygen and temperature. While temperature, salinity, and oxygen are often available and are frequently an ideal predictor combination, there remain applications where oxygen measurements are not available (due to absent, failed, or fouled sensors) or not desired as predictors (such as when estimating preformed properties from only conservative seawater properties, e.g., Carter et al. 2021).
By most validation metrics, the ESPER_NN routines perform comparably to ESPER_LIR routines and, in some places, they perform better (see "Assessment" section). Nevertheless, we contend there are reasons to maintain both approaches. First, the LIR routines offer a degree of simplicity and estimate explicability that lends them additional value. To highlight the explicability of the LIR estimates, we have added the ability to return the coefficients of the equations that were used to produce each estimate as an additional optional routine output. This may be useful when querying the LIR routines for an equation that could be used for a regional study in another application. Similarly, regional coefficients could be added into the ESPER_LIR coefficient files to produce a modified routine that seamlessly transitions to using regional relationships within a specific area such as a marginal sea, while still using the relationships derived for the open ocean outside of that region. Also, as we discuss later, there is merit to having and using multiple routines when the errors in the estimates appear to be partially independent, as appears to be the case with ESPER_LIR and ESPER_NN. Both new routines are freely available as MATLAB functions at Zenodo (Carter 2021) and updates will be made available at the GitHub repository (see "Code Availability" section). Several changes have been made to the LIR function behavior that are noted alongside the reasoning behind the changes in Supporting Information S2.

Data products, training data, and test data
The primary data product used to train these algorithms is the GLODAPv2.2020 data product update (Olsen et al. 2020). In addition, we added data sets that will be included in the CARbon, tracer, and ancillary data In the MEDiterranean Sea (CARIMED) data product and that are included in the Coastal Ocean Data Analysis Project for North America (CODAP-NA; Jiang et al. 2021) data product. These data from the Mediterranean Sea (46 cruises spanning 1976 to 2018 and covering all of its sub-basins) and the Gulf of Mexico (three cruises spanning 2007 to 2012) are included to ensure these important regions are well constrained; cruise information is provided in Supporting Information S1.1. These data products are focused on internal consistency and are inclusive of carbonate system measurements. We do not make a special effort in this study to incorporate high-resolution data from profiling sensors (e.g., 1 m oxygen values) or measurements from data products that focus on macronutrients or oxygen, but note that this could be an area of focus for future development.
As with previous versions of LIRs, we excluded data from GLODAPv2 that have not had secondary quality control (QC) checks, and further omitted several sets of cruises that had large adjustments or appeared to have noisy measurements at depth (detailed in Supporting Information S1). We also excluded measurements from any bottle that lacked measurements for temperature, salinity, oxygen, and macronutrients (phosphate, silicate, and nitrate).
Homogenization of the variety of pH measurement types and calculations in GLODAPv2.2020 remains a challenge (see Supporting Information S1.2). As with LIRv2, the ESPERs return in situ pHT estimates that are intended, by default, to be consistent with pHT measured spectrophotometrically with purified m-cresol purple indicator dye and converted to in situ conditions, but can be made to return values that are intended to be consistent with pHT calculated from DIC and TA at in situ conditions (as CANYON-B does by default) using an optional flag. These approaches for arriving at pHT values have a documented disagreement (Carter et al. 2013, 2018; Williams et al. 2017; Fong and Dickson 2019; Alvarez et al. 2020), and we rely on the relationships developed by Carter et al. (2018) to interconvert between these pHT estimates. New observations are challenging the assumptions inherent to this approach (Takeshita et al. 2021), but there are currently insufficient data and mechanistic understanding to refine the relationships we use for interconversion.
For assessment purposes, we must separate validation data from training data and withhold the validation data from the versions of the algorithms used for assessment. It is better to withhold data from entire cruises to avoid obtaining unrealistically high skill estimates when reconstructing data from a synoptic cruise based on algorithms trained with other data from the same cruise. In past versions of LIRs, this assessment was conducted by creating algorithms that iteratively omitted each cruise while reconstructing data from the omitted cruises. However, this strategy would be too computationally intensive to employ with ESPER_NN and would not provide a clear comparison to the CANYON-B neural network, which was trained with the original GLODAPv2 release. Instead, all data in GLODAPv2.2020 that were added following the original GLODAPv2 release (i.e., all cruises with GLODAPv2 cruise numbers ≥1000 and those incorporated from the Gulf of Mexico and the Mediterranean Sea) are used as test data for the validation versions of the algorithms, which were trained only with the data in the original GLODAPv2 release. For general use, a release version of the ESPER_LIR and ESPER_NN algorithms was trained with the total data set to benefit from the recent data, and this release version is the only version provided at Zenodo and GitHub. Data within several marginal seas (the Gulf of Mexico, the Sea of Japan/East Sea, and the Mediterranean Sea) are omitted from the bulk global open-ocean assessment statistics because these are regions where the validation versions of the algorithms have insufficient training data (i.e., none) to produce estimates. Similarly, data from the Arctic (here: north of 67.5°N) are withheld from the global assessment step because the Arctic is a problematic region for algorithms (see "Regional tests" section).
Instead, algorithm performance is separately assessed in these regions to explore the limitations of the approaches used ("Regional tests" section). The numbers of valid, quality-controlled measurements available for each algorithm version in each subset of the data are given in Table 1.

Anthropogenic impacts on carbonate chemistry
The LIPHR (i.e., LIRv2 for pHT) and CANYON-B algorithms use "estimate year" (i.e., for LIPHR, the calendar year expressed as a decimal, where the midpoint of the year 2020 would be given as 2020.5) as a predictor for seawater properties (or their reconstruction errors in the case of LIPHRv2) to capture the impacts of long-term trends on pHT estimates and the training data. However, recent research suggests that decadal variability in seawater property trends can regionally rival the magnitudes of the secular trends. This is true even for Cant, which exhibits a large secular trend (Woosley and Millero 2016; DeVries et al. 2017; Carter et al. 2019a). This finding implies that empirical fits risk projecting trends from cyclical natural variability into the future. LIPHR avoids some biases from regional natural variability by using global empirical fits over density intervals, but, as a result, the routine is unable to distinguish between regions with rapid (e.g., the North Atlantic) vs. slow (e.g., the North Pacific) Cant accumulation. In addition, LIPHR assumes a fixed OA rate over time, but OA rates might be expected to accelerate due to the approximately exponential increase in atmospheric CO2. Therefore, while algorithms like LIPHR seem to accurately predict contemporaneous deep pHT, it is likely that biases will emerge over the coming years, particularly in regions where Cant penetration is large, such as the North Atlantic (Gruber et al. 2019). The risks of natural variability biasing empirical trend projections are perhaps more acute for properties that have weaker secular trends than DIC and pHT, such as nutrients and oxygen, although the empirical trends in these properties are usually smaller components of the overall variability in their estimates.
Given the challenges associated with accurately quantifying secular changes with short-term, empirical information, ESPER_LIR and _NN rely on a first-principles-based estimate of Cant and its impacts on pHT. This approach assumes that exponential increases in atmospheric anthropogenic CO2 should eventually result in marine Cant concentrations that increase at rates proportional to atmospheric anthropogenic CO2 concentrations. In other words, this approach relies on the assumption that Cant is in transient steady state (Gammon et al. 1982; Tanhua et al. 2007); this is an assumption used to adjust data to reference years in the most recent global Cant distribution change estimates for the 1994-2007 period (Gruber et al. 2019). This implies that, locally, the "shape" of the Cant vertical profile (or Cant vertical gradient) should remain constant over time while atmospheric CO2 and ocean Cant values increase exponentially according to:

Cant_year_location = Cant_2002_location × e^(γ(year − 2002))    (1)

Therefore, if a Cant value is known for a location in a reference year (e.g., Cant_2002_location in 2002 c.e.), then Cant can be estimated for that location in a desired year (Cant_year_location). The coefficient γ within the exponent is derived by solving Eq. 1 to match Gruber et al.'s (2019) assumption of a ~28% Cant increase over the 13 years from 1994 to 2007 (i.e., γ = ln(1.28)/13 ≈ 0.019 yr⁻¹; see their methods supplement). We note that this approach is not able or intended to resolve non-steady-state variations in Cant (Gruber et al. 2019), and the errors in the estimates that result from this deficiency are included implicitly in the assessed overall uncertainty estimates.

Table 1. Numbers of viable measurement combinations available for each property within the indicated data product subsets (columns: Property, GLODAPv2, New/assessment, and Total). The "total" column reflects the training data for the released routines, whereas the "GLODAPv2" column reflects the training data for the validation routines used to assess the algorithms against new/assessment data.
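The transient-steady-state scaling of Eq. 1 can be sketched in a few lines. This is a minimal illustration (function and variable names are ours, not those of the released MATLAB routines):

```python
import math

# Exponent coefficient chosen so that Cant grows by ~28% over the
# 13 years from 1994 to 2007 (the Gruber et al. 2019 assumption).
GAMMA = math.log(1.28) / 13.0  # yr^-1, ~0.019

def cant_at_year(cant_2002, year):
    """Scale a reference-year-2002 Cant value to another year,
    assuming Cant is in transient steady state (Eq. 1)."""
    return cant_2002 * math.exp(GAMMA * (year - 2002.0))

def cant_adjustment(cant_2002, year):
    """Adjustment to apply to a 2002-equivalent value for a desired year;
    zero by definition in 2002."""
    return cant_at_year(cant_2002, year) - cant_2002
```

By construction, `cant_at_year(c, 2007) / cant_at_year(c, 1994)` is exactly 1.28, and the adjustment vanishes in the 2002 reference year.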
For the ESPERs, we utilize a gridded Cant product referenced to the year 2002 (Lauvset et al. 2016). This product was created using the transit time distribution (TTD) method (Waugh et al. 2006) and gridded to the same 1° × 1° latitude/longitude resolution with 33 depth surfaces as the Global Data Analysis Project (GLODAPv2) gridded data product. This reference 2002 field can be used with Eq. 1 to estimate the difference between Cant in 2002 and Cant in the year in which a measurement was made or an estimate is desired. Therefore, rather than having a time-dependent prediction of pHT or DIC, we take the following steps to address anthropogenic trends (Fig. 1). Steps 1 through 3 were performed before training the routines, while steps 4 and 5 are performed by the ESPER code each time it is called. Supporting Information S1.3 provides more detail for the pHT recalculations noted in step 5. There are uncertainties associated with the assumptions underlying both the 2002 gridded Cant data product and the transient steady state approach, particularly in regions where there are limited measurements of chlorofluorocarbons and other tracers used to calibrate the TTD approach. We therefore assert that Eq. 1 should not be used to estimate Cant distributions for any application where Cant is of primary interest. However, uncertainties in the adjustments that come from changes in these Cant estimates over time should be modest for a window of time around the year 2002 c.e., the year in which the adjustments are zero by definition. Eq. 1 implies that adjustment errors will be smaller than errors in the underlying 2002 Cant distributions for any estimate before 2039 (i.e., the Cant doubling time after 2002).
As the training data are also adjusted in step 2, the effective magnitudes of the adjustments are related to the difference between the years of the estimates and the average measurement years of the training data used for those algorithms (which for most regions and algorithms is close to 2002 c.e.). These ESPERs should therefore be used with increasing caution for DIC and pHT after ~2030. Regardless of these challenges, this parameterization of OA rates should be more accurate moving forward than that used by LIPHR, and any improvements in the Cant estimates should directly reduce estimate bias in the modern era and the near future. Notably, implementing this approach decreased the overall training data reconstruction root mean squared error for DIC by >10% and decreased the trend in the DIC reconstruction error from ~0.49 μmol kg⁻¹ yr⁻¹ to less than 0.03 μmol kg⁻¹ yr⁻¹. We caution that these assumptions do not explicitly consider declines in ocean carbon uptake efficiency, and the assumption of exponential growth can lead to very large DIC accumulations when used for distant projections. Future atmospheric CO2 concentrations are highly uncertain, and user discretion is advised for any projections.
There is no time variance for ESPER estimates of quantities other than pHT and DIC.

ESPER_LIR construction
ESPER_LIR broadly functions similarly to LIRv2, which is described in detail by Carter et al. (2018). As with LIRv2, the ESPER_LIR algorithms use regression coefficients (C) that are specific to each of 16 equations and 44,957 locations on a 5° latitude × 5° longitude × 33-depth-level ocean interior grid subsampled from the World Ocean Atlas gridded product grid. These coefficients are interpolated in 3D space to the locations where regression coefficients are desired. The algorithm then uses the coefficients with user-provided seawater property predictor information (P) to produce property estimates. The LIR algorithms are constructed by fitting 16 different regressions that relate the properties of interest, X (e.g., silicate, nitrate, phosphate, oxygen, TA, DIC, and pHT), to combinations of up to five predictor properties, P (including salinity, potential temperature, nitrate, phosphate, oxygen, and silicate), which are specific to each property of interest (Table 2). Each equation uses between one and five predictor properties, and the generalized predictor equation is:

X = C0 + Σi (CPi × Pi)    (2)

The 16 variants on Eq. 2 are referred to as "ESPER equations" 1 through 16. Unlike LIRv2, depth is never used as a predictor for ESPER_LIR and is only used as a coordinate for regression coefficient interpolation; versions with depth included as a predictor performed similarly to or worse than versions with depth omitted during early testing.
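Eq. 2 amounts to evaluating a locally interpolated linear model. A minimal sketch of the evaluation step (the function name and structure are illustrative, not the released code):

```python
def esper_lir_evaluate(c0, coefficients, predictors):
    """Evaluate the generalized predictor equation X = C0 + sum_i(C_Pi * P_i)
    for one location, given coefficients already interpolated to that
    location and the corresponding predictor values."""
    if len(coefficients) != len(predictors):
        raise ValueError("one coefficient per predictor is required")
    return c0 + sum(c * p for c, p in zip(coefficients, predictors))
```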
The regression coefficients Ci and C0 are fit 44,957 times for each of the 7 estimated properties and each of the 16 ESPER equations. At each grid location, "local" data are selected from the subset of all data that are within 15° in latitude, 30°/cosine(latitude) in longitude, and within either (100 + z/10) m depth or 0.1 kg m⁻³ of the estimated density of seawater at that coordinate location. Here z is the coordinate depth in meters. As with LIRv2, these window dimensions are iteratively doubled when fewer than 100 measurements fall within the windows. These data selection windows are initially twice as wide as the windows used in LIRv2 in all dimensions. Doubling the baseline size of these windows is intended to include more data on average for the regression fits, introduce more modes of oceanographic variability into the fitting data, and thereby reduce multicollinearity. The average absolute values of regression coefficients in ESPER_LIR are only 80% of the average absolute values of the coefficients in LIRv2, suggesting ESPER_LIR is subject to less multicollinearity than LIRv2. However, widening the windows risks making the regressions less appropriate locally, so a weighting term is used that is equal to:

W = 1 / max(Δ, 5°)

where Δ is the distance between each measurement and the regression coordinate location, expressed in degrees-of-latitude equivalents and combining Δlat, Δlon, and a scaled Δz. The weighting term W reduces the cost of regression misfits to data that are distant or at significantly different depths from the regression coordinate location, and the maximum function caps the weights (at a value equivalent to the weight found when 5° latitude away) to ensure the regressions are not overly fit to data very near the coordinate, where the denominator approaches 0. The Δz term is the difference between the regression coordinate depth (z) and the depth of the measurements. The Δlon is the minimum difference between the measurement and coordinate longitudes when using either the −180° to 180° or the 0° to 360° convention, and Δlat is the difference between the measurement and coordinate latitudes.
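The capped inverse-distance weighting can be sketched as follows. This is illustrative only: the conversion of Δz to degree-equivalents (`dz_scale`) is an assumed placeholder, not the tuned ESPER_LIR value, which lives in the released coefficient files:

```python
import math

def regression_weight(dlat, dlon, lat, dz, dz_scale=10.0, cap=5.0):
    """Illustrative inverse-distance weight with the cap described in the
    text: weights stop growing once data are closer than the equivalent of
    5 degrees of latitude. dlat/dlon in degrees, dz in meters; dz_scale
    (meters per degree-equivalent) is an assumption for this sketch."""
    # Minimum longitude difference regardless of -180..180 or 0..360 convention.
    dlon = min(dlon % 360.0, (-dlon) % 360.0)
    dist = math.sqrt(dlat ** 2
                     + (dlon * math.cos(math.radians(lat))) ** 2
                     + (dz / dz_scale) ** 2)
    return 1.0 / max(dist, cap)
```

Note that the weight at zero distance equals the weight at 5° of latitude, which is the capping behavior described above.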
The regression coefficients (C0 and CPi) are then fit by weighted least squares, minimizing the W-weighted squared misfits of Eq. 2 to the local data:

Σ W (X − C0 − Σi CPi × Pi)²

As with LIRv2, data outside of the Atlantic, Mediterranean, and Arctic are excluded when fitting Northern Hemisphere regression coordinates within the Atlantic, Mediterranean, or Arctic (and vice versa) in order to prevent use of data from across Central America or the Bering Strait. The widths of the data inclusion windows and the coefficients in the weighting function were optimized by selecting the variant of eight combinations that had the best validation statistics. However, some of the combinations yielded comparable results for some predictors, so this parameter-tuning process should not be considered exhaustive.

ESPER_NN construction
ESPER_NN relies upon a collection of feed-forward neural networks to estimate seawater properties with a similar operation to the LIR algorithm and a similar structure to the CANYON-B algorithm: ESPER_NN uses the same combinations of predictor measurements as ESPER_LIR to produce estimates of the same properties, and does so with a function call that has similar syntax. Unlike ESPER_LIR, in addition to the predictors noted in Table 2, the ESPER_NN algorithm uses latitude, depth, cos(longitude − 20°E), and cos(longitude − 110°E) as predictors in each equation, making the estimates somewhat more analogous to a mapping approach than the ESPER_LIR estimates. Similar, but not identical, parameters are used in CANYON (Sauzède et al. 2017) and CANYON-B (Bittig et al. 2018): unlike the original CANYON, ESPER_NN offsets the 0° longitude for the reasons noted by Bittig et al. (2018), specifically that cos(lon) loses explanatory power at the prime meridian, which is a region of oceanographic significance. Offsetting longitudes to 20°E (and 110°E) puts these regions of minimum explanatory power over land masses to the extent possible.
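The two offset cosine longitude predictors can be computed as follows (a small sketch; the function name is ours, and longitudes are taken in degrees east):

```python
import math

def longitude_features(lon):
    """Return the two cosine longitude predictors used by ESPER_NN,
    offset so their regions of minimum explanatory power fall over land
    near 20E and 110E. Input in degrees east; outputs are dimensionless."""
    return (math.cos(math.radians(lon - 20.0)),
            math.cos(math.radians(lon - 110.0)))
```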
ESPER_NN uses 896 neural networks in total: 8 neural networks (4 in each of 2 large ocean regions; see below) are used for each of the 16 combinations of predictors used for each of the 7 property estimates. ESPER_NN averages estimates from a "committee" or ensemble of four neural networks with different combinations of neurons and hidden layers to minimize the impact of errors from any one neural network. These four neural networks include a single one-hidden-layer network with 40 neurons and three two-hidden-layer networks with 30/10, 25/15, and 20/20 neurons in the first/second hidden layers. One committee of neural networks is used in the Indo-Pacific-Southern Ocean regions and an additional committee is used in the Atlantic Ocean, Arctic Ocean, and Mediterranean Sea. The ESPER_NN algorithm linearly interpolates between the outputs of these two committees of neural networks by latitude across the Southern Atlantic and the Bering Sea, being fully in the Indo-Pacific-Southern Ocean network by 44°S in the Southern Atlantic and fully in the Atlantic, Arctic, and Mediterranean network by 34°S. Similarly, the North-Pacific-to-Arctic transition occurs between 62.5°N and 70°N along Pacific longitudes. After this meridional blending step, there is a zonal transition implemented in the Southern Atlantic between these blended values and the Indo-Pacific-Southern Ocean network starting at 19°E and completing at 27°E.
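The committee averaging and the South Atlantic latitude blend described above can be sketched as follows. The clamped-linear form and the function names are ours; the released MATLAB code implements the actual blending:

```python
def committee_average(estimates):
    """Average the estimates from the four-network committee."""
    return sum(estimates) / len(estimates)

def atlantic_fraction(lat, full_indopac=-44.0, full_atlantic=-34.0):
    """Linear blend fraction for the South Atlantic transition: 0 at and
    south of 44S (fully Indo-Pacific-Southern Ocean committee), 1 at and
    north of 34S (fully Atlantic/Arctic/Mediterranean committee)."""
    f = (lat - full_indopac) / (full_atlantic - full_indopac)
    return min(1.0, max(0.0, f))

def blended_estimate(est_indopac, est_atlantic, lat):
    """Linearly interpolate between the two committee outputs by latitude."""
    f = atlantic_fraction(lat)
    return (1.0 - f) * est_indopac + f * est_atlantic
```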
Techniques exist for illuminating the relative importance of predictor variables in machine learning approaches (e.g., Olden and Jackson 2002), but the exact equations used by the ESPER_NN algorithm are nevertheless more opaque and less explainable than the LIR equations. The networks are fit using the MATLAB r2017 Machine Learning Toolbox "feedforwardnet" and "train" function defaults, which include Levenberg-Marquardt optimization with 15% of input data reserved for assessment during iterative fitting steps. However, the neural networks have been encoded as functions, so users do not require the Machine Learning Toolbox to operate ESPER_NN. Bittig et al. (2018) showed that linear regression and neural network estimates frequently have independent error fields. From this observation, they proposed that it might be advantageous to combine estimates from both approaches. We test this idea and find that it has merit in many circumstances. We therefore also release a wrapper function "ESPER_Mixed.m" that calls both routines, ESPER_LIR and ESPER_NN, and averages the estimates. We do not provide a similar wrapper function for CANYON-B, but we note that our assessment suggests the findings for the mixed approach could also apply to a mixed version of CANYON-B and ESPER_LIR Eq. 7. The ESPER_Mixed routine is assessed alongside the other algorithms in the "Assessment" section.
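The rationale for averaging routines with partially independent errors can be illustrated with synthetic data: if two estimators have fully independent errors of equal magnitude, their average has an RMSE smaller by a factor of ~√2. This is a toy demonstration with arbitrary numbers, not ESPER output:

```python
import math
import random

random.seed(42)
truth = 2300.0  # arbitrary "true" value for illustration
n = 20000

# Two estimators with independent, equal-magnitude Gaussian errors.
est_a = [truth + random.gauss(0.0, 4.0) for _ in range(n)]
est_b = [truth + random.gauss(0.0, 4.0) for _ in range(n)]
mixed = [0.5 * (a + b) for a, b in zip(est_a, est_b)]

def rmse(estimates):
    """Root mean squared error relative to the known truth."""
    return math.sqrt(sum((x - truth) ** 2 for x in estimates) / len(estimates))

# rmse(mixed) is close to rmse(est_a) / sqrt(2) when errors are independent;
# in practice ESPER_LIR and ESPER_NN errors are only partially independent,
# so the real-world benefit of ESPER_Mixed is smaller but still present.
```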

Uncertainty estimation
The routines can return uncertainties for every property estimate, and the uncertainty values vary with input depth and salinity. These uncertainties are estimated at the 1σ (i.e., one standard uncertainty) level, so we would expect ~95% of new measurements that have been through the GLODAPv2 QC process to fall within windows of ± twice the ESPER estimated uncertainties. The LIRv2 uncertainty estimation strategy for TA (Carter et al. 2018) is slightly modified and then implemented for all properties estimated by the two ESPERs. As before, this approach interpolates baseline error estimates (E_X_Est) in depth and salinity space. The interpolated values are based on the root mean squared errors (RMSEs) of all predictions from the validation versions of the routines within bins of salinity and depth. As with LIRv2, ESPER_LIR also scales these methodological uncertainties using user-provided predictor uncertainty estimates. The following equation is used when the user provides uncertainties for the predictors (E_Pi_Provided) that exceed the default assumed input uncertainties (Table 3):

E_Est = sqrt( E_X_Est² + Σ_i (∂X/∂P_i · E_Pi_Provided)² − Σ_i (∂X/∂P_i · E_Pi_Default)² )

If the optional E_Pi_Provided input is omitted, then it is assumed that E_Pi_Provided equals E_Pi_Default (Table 3), and the two summed terms in this equation cancel. Here ∂X/∂P_i is the sensitivity of the property estimate X to the ith predictor P_i, and the E_Pi terms are the default and the user-provided predictor uncertainties. For the ESPER_LIRs, the ∂X/∂P_i values equal the C_Pi terms. For ESPER_NN calculations, the algorithm determines the sensitivities by iteratively perturbing the input predictors if and only if the user specifies larger-than-default predictor uncertainties.
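This predictor-uncertainty scaling can be sketched as follows (a hypothetical Python re-implementation for illustration only; the released routines are MATLAB functions and all names here are illustrative):

```python
import math

def scaled_uncertainty(e_base, sensitivities, e_provided, e_default):
    """Scale a baseline (interpolated) 1-sigma estimate uncertainty for
    user-provided predictor uncertainties that exceed the assumed defaults.

    e_base        : interpolated baseline uncertainty (E_X_Est)
    sensitivities : dX/dP_i for each predictor (the C_Pi terms for ESPER_LIR)
    e_provided    : user-provided predictor uncertainties (E_Pi_Provided)
    e_default     : default assumed predictor uncertainties (E_Pi_Default)
    """
    var = e_base ** 2
    for s, ep, ed in zip(sensitivities, e_provided, e_default):
        ep = max(ep, ed)  # the defaults are the minimum allowed uncertainties
        var += (s * ep) ** 2 - (s * ed) ** 2
    return math.sqrt(var)
```

When the provided uncertainties equal the defaults, the two added terms cancel and the baseline uncertainty is returned unchanged, matching the behavior described above.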
The uncertainties in Table 3 are the minimum uncertainties allowed by the calculations because these are the assumed uncertainties in the best open ocean training data available, so these uncertainties reflect an upper limit on the quality of estimates achievable with the algorithms regardless of the quality of the predictor measurements. The sole difference from the approach used for LIRv2 TA estimates is that the interpolated uncertainties now include the component of uncertainty that originates from potential errors in the training data. This saves a step in the calculations while providing numerically equivalent results.
The uncertainty for an ESPER_Mixed estimate is assessed simplistically as the minimum of the uncertainties assessed for the two component ESPER_LIR and ESPER_NN estimates ("Mixed ESPER" section).
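The mixed estimate and its uncertainty therefore reduce to a simple average and a minimum (an illustrative sketch, not the released ESPER_Mixed.m):

```python
def mixed_estimate(lir_est, nn_est, lir_unc, nn_unc):
    """ESPER_Mixed-style combination: average the ESPER_LIR and ESPER_NN
    estimates, and report the smaller of the two component uncertainties."""
    return 0.5 * (lir_est + nn_est), min(lir_unc, nn_unc)
```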

Assessment
Routines are validated using versions of the algorithms trained only with the data that were present in the original GLODAPv2 release (Table 1). This cutoff was chosen to make the validation algorithms for ESPER_LIR and ESPER_NN comparable to the LIRv2 and CANYON-B routines to the degree possible. These "validation" versions of the algorithms are then used to recreate the "validation data set," or the newly added data in the GLODAPv2.2019 and GLODAPv2.2020 updates plus the other cruises from the Mediterranean Sea and the Gulf of Mexico. The reconstruction errors for these new measurements are used to derive error statistics for the five routines that we assess (LIRv2, ESPER_LIR, ESPER_NN, CANYON-B, and ESPER_Mixed). The validation data set is in some ways not ideal, in that it is not evenly distributed globally and there is spatial overlap between the test and the training data sets (Fig. 2). An alternate approach to assessing prediction errors involves omitting all training data from regions of the ocean representative of data gaps between cruises, and then estimating the errors within these gaps. This approach has been used previously by Sauzède et al. (2017) and Carter et al. (2018), but was found to generally yield smaller uncertainty estimates in the open ocean than approaches that omit entire cruises (Carter et al. 2018), so we conservatively rely on the cruise-omission assessments. The additional data sets from the Gulf of Mexico and the Mediterranean Sea that were incorporated into this paper were omitted from the global-average validation data set because neither had undergone secondary QC and because a small subset of the Mediterranean Sea data from GLODAPv2 had been previously incorporated into the training data product for some algorithms but not others. 
New measurements from the Sea of Japan/East Sea, a biogeochemically distinct region where no previous measurements existed in the original GLODAPv2 product, are also omitted from bulk validation statistics. However, validation statistics for these regions are given separately ("Regional tests" section).
The reported validation statistics are bias (average reconstruction error), RMSE, and the number of new measurements used for each assessment (N). The 10th, 50th, and 90th error percentiles were examined as potential additional statistics, but they were within expectations when assuming normally distributed errors with the given RMSE and bias statistics.
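For concreteness, the reported statistics can be computed as follows (a hypothetical helper, not part of the released routines):

```python
import math

def validation_stats(estimates, measurements):
    """Bias (mean reconstruction error), RMSE, and N for a validation set."""
    errors = [e - m for e, m in zip(estimates, measurements)]
    n = len(errors)
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    return bias, rmse, n
```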

Macronutrients
The routines work well for macronutrients (i.e., phosphate, nitrate, and silicate) when given at least two predictors, reproducing the validation data with low average bias and a RMSE that is comparable to the measurement uncertainties (Tables 4-6). Phosphate and nitrate have a strong and well-documented covariance in the ocean (Redfield et al. 1963). This covariance results in low RMSE statistics for the equations relating these properties to one another (e.g., ESPER Eqs. 1 and 2 in Table 2), but reduces the value of adding one as a predictor when the other is already included. This covariance is weaker between silicate and either phosphate or nitrate, and oxygen is comparably useful to the macronutrients when predicting silicate. Unsurprisingly, the equations with more fitting parameters tended to perform better, and the RMSE ranged from being comparable to the nominal ~2% measurement uncertainty at best (or ±0.04 μmol kg⁻¹ for a phosphate measurement of 2 μmol kg⁻¹; Olsen et al. 2016) to 3-4 times worse when only S and coordinate information are used in the prediction. All algorithms assessed perform comparably for the ESPER equations using T, S, and oxygen as predictors (i.e., ESPER Eq. 7), but LIRv2 performs slightly worse for silicate. LIRv2 performs comparably to alternatives for many macronutrient estimates, but alternatives outperform LIRv2 for the equations with the largest RMSE values and fewest predictors (e.g., ESPER Eqs. 12 and 16), suggesting that the modifications in ESPER_LIR have improved the least-accurate estimates. Likely, this is due to the larger number of measurements available for each regression in ESPER_LIR relative to LIRv2. Unlike the ESPER_LIR_validation routine assessed here, the released version of ESPER_LIR benefits from including the newly added data in the recent updates to GLODAP, and is therefore preferred to LIRv2 even when the validation statistics are comparable.

Oxygen
Validation statistics are reasonable for oxygen though persistently greater than the nominal 1% measurement uncertainty (i.e., 3 μmol kg⁻¹ for a 300 μmol kg⁻¹ measurement; Olsen et al. 2016), ranging from 4.5 to 13.2 μmol kg⁻¹ in the global ocean for ESPER_NN_validation and ESPER_LIR_validation (Table 7). LIRv2 is also comparable, but again shows worse validation statistics for equations with fewer predictors and larger RMSE values. The statistics are markedly better at intermediate depths, ranging from 2.7 to 6.0 μmol kg⁻¹ between 1000 and 1500 m depth for ESPER_NN_validation. Below the well-lit surface ocean there is no gas exchange and essentially no primary production of organic matter, and the algorithms are therefore better able to capture the fewer processes controlling oxygen distributions. As a result, the oxygen algorithms perform less well at higher oxygen concentrations, which is evident in the larger error statistics globally than at intermediate depths, as well as in the comparatively diffuse cloud of estimates in the upper right of the oxygen histograms in Fig. 2. Interestingly, the neural network estimates in Fig. 2 appear less diffuse than the LIR-based estimates: the RMSE for ESPER Eq. 1 for only the top 200 m is 8.6, 7.6, and 8.0 μmol kg⁻¹ for the LIR, NN, and Mixed validation ESPER variants, respectively. This suggests that the neural network framework is more skillful at capturing the nonlinear relationships between properties that can result from gas exchange and primary production in the surface ocean. Oxygen estimates show a non-negligible bias, overestimating oxygen by an average of 0.9 μmol kg⁻¹ across all three algorithms and ESPER equations.
It should be noted that a large amount of the validation data used for this assessment are located within the North Pacific where oxygen concentrations are low, so this could reflect a small regional bias in the algorithms, a tendency to overestimate lower oxygen concentrations, or differences between the test and the training data products. Supporting this idea, the released versions of the algorithms (which use all data as training data) still have a 0.6 μmol kg⁻¹ bias for the ESPER_Mixed_validation test data reconstructions while having a −0.1 μmol kg⁻¹ bias for the ESPER_Mixed_validation training data reconstructions (i.e., GLODAPv2) and no significant bias for both data subsets combined.

Total titration seawater alkalinity
Seawater alkalinity continues to show strong predictability even with comparatively few predictors (Table 8) and has the smallest relative range in RMSE values, with the least precise estimates having a RMSE that is less than double the RMSE of the most precise estimates (ranging from 3.7 to 5.2 μmol kg⁻¹ for TA for ESPER_NN_validation estimates). The small range in assessed RMSE values is expected because all ESPER equations use S, and freshwater cycling is a major driver explaining variability in both S and TA. The excellent validation metrics for new and existing algorithms for TA likely reflect particularly precise TA measurements in the newly added cruises in GLODAPv2.2020, in part due to increased use of certified reference materials for TA (Dickson et al. 2003).
Interestingly, there is an estimate bias averaging 0.5 to 1 μmol kg⁻¹ across ESPER equations for the various routines. It is difficult to identify the cause of these average mismatches, considering that the GLODAP secondary QC effort already adjusted several cruises to be in line with the existing GLODAPv2 data product. However, Olsen et al. (2019) note that many of the newly added cruises in the North Pacific show a negative bias against earlier cruises, consistent with this observation. Also, many of these cruises use single-point spectrophotometric TA titration endpoint detection, which Bockmon and Dickson (2015) previously noted could be a source of disagreement with TA values from full-pH-range titration fits. Interestingly, Sharp and Byrne (2020) have provided a mechanistic explanation that would account for these analytical disagreements if alkaline organic molecules were present in open-ocean seawater. While this discussion highlights the challenges of creating a consistent data product across research groups, the high precision and modest bias of this TA reconstruction nevertheless demonstrate the high quality of the underlying measurements and the importance of the GLODAPv2 secondary QC process.

In situ pH on the total scale
There is some difficulty comparing across pHT algorithms because the training data for earlier pHT algorithms were supplemented with several additional cruises (Bittig et al. 2018; Carter et al. 2018), many of which have since been added to the GLODAPv2 data product in annual updates. This means that some algorithms would benefit from overlap between the training and validation data products in this comparison. The comparison cannot simply be limited to the truly new cruises because there are not many additional cruises with purified spectrophotometric dye measurements that were not used to train earlier algorithms; we limit our comparison to cruises with these spectrophotometric measurements because it has been shown that there are consistent disagreements between measured and calculated pHT (Carter et al. 2018; Alvarez et al. 2020). Moreover, measurements made with purified dyes are consistent with measurements made by sensors that have been shown to have the expected Nernstian response to pHT changes (Takeshita et al. 2020), lending support to the use of spectrophotometric pHT values over the disagreeing calculated values. Complicating the comparison further, the three new cruises that meet our criteria and were not included in the LIRv2 or CANYON-B pHT training data had large adjustments applied during the GLODAP secondary QC. Therefore, for this study we do not re-assess LIRv2 or CANYON-B, and instead show that the ESPERs have similar validation statistics (Table 9) to those published by earlier validation efforts for these algorithms (Bittig et al. 2018; Carter et al. 2018). We note, however, that the statistics obtained when we assess all four algorithms using T, S, and oxygen with the same data (not shown) are quite close to each other despite the partial overlap between training and validation data sets. This suggests all four algorithms are valid for pHT.
It is difficult to read too much into the pHT validation statistics given the comparatively small number of valid assessment data points. However, one apparent pattern in the pHT assessment statistics is that pH reconstructions benefit significantly from the use of either nitrate or oxygen as predictors, as these predictors provide information regarding organic matter remineralization. The ESPER equations with neither quantity have higher RMSE values, even when silicate is included as a predictor.

Total dissolved inorganic carbon
The routines reproduce DIC measurements with good skill and a small positive average bias, with RMSE values ranging from 4.8 to 16.7 μmol kg⁻¹ globally and 3.2 to 7.0 μmol kg⁻¹ at intermediate depths for the various validation versions (Table 10). Assessment statistics are comparable across the three routines that estimate DIC (LIRv2 does not). We caution that DIC does not have seasonal resolution in the surface ocean in most regions of its training data product. Therefore, estimates within the surface ocean should be treated with caution, and we recommend avoiding interpreting seasonality in the ESPER estimates. This caution applies to all property estimates, but is important to note for DIC specifically because of the high sensitivity of DIC to most modes of seasonal variability and the large scientific interest in seasonal DIC cycling. DIC calculations from measured pH or pCO2 and estimated TA are expected to be less challenged by the lack of seasonal resolution than direct DIC estimates, as TA seasonality is usually less pronounced than DIC seasonality. These two approaches to DIC seasonality reconstruction can return quite different results in the surface ocean (Supporting Information S1.4). There are empirical routines for global DIC estimation (Broullón et al. 2020) and surface DIC estimation (Gregor and Gruber 2021) that are also trained with surface pCO2 measurements. In the many regions where surface pCO2 has better seasonal data coverage than GLODAPv2, these routines are likely to better resolve DIC surface seasonality than ESPER or other DIC algorithms trained primarily with discrete DIC measurements.

Regional tests
We assess the performance of the algorithms in eight regions independently (Fig. 3). Some of these regions are where biogeochemical Argo floats are currently being deployed (i.e., the North Atlantic, California Current, Equatorial Pacific, and the Southern Ocean) and therefore where there is additional interest in the performance of the algorithms. Other regions are biogeochemically distinct places where there were no training data used for the CANYON-B and/or LIRv2 algorithms (i.e., Sea of Japan/East Sea, Gulf of Mexico, and the Mediterranean). These regions therefore allow tests of the likely errors one can expect when applying global algorithms to biogeochemically distinct regions where there were no available training data. Finally, the Arctic is a problematic region for the algorithms that warrants special attention.
We first consider the Southern Ocean, the Equatorial Pacific, the California Current, and the North Atlantic. The validation statistics in these regions where there are active ongoing biogeochemical float deployment efforts are, for the most part, consistent with the global average statistics. The North Atlantic shows validation statistics that are somewhat worse than global averages for macronutrients and oxygen, and the California Current shows oxygen RMSE values that are similarly elevated. Given the active physical processes and biogeochemical cycling in these regions of interest (and the comparatively small validation data set in the California Current), none of these sets of validation statistics are unexpected. We therefore conclude that the algorithms should function within expectations in these important regions and suggest that Table 11 can be used to get a sense of how the global validation statistics might vary on a regional level.
The Sea of Japan/East Sea provides an excellent case study to assess the use of algorithms in regions without training data for three reasons: (1) this region had no data in the first GLODAPv2 release, and thus is a region where neither LIRv2 nor CANYON-B had training data; (2) a large quantity of high-quality data from the Sea of Japan/East Sea were included with the GLODAPv2.2020 release; and (3) the Sea of Japan/East Sea is biogeochemically distinct from the open ocean to the east of Japan, providing a challenge for the predictive capabilities of the approaches. Neither of the earlier generation of algorithms works well there, with large average biases and RMSE values that are approximately nine times greater on average than in the first set of regions considered, but with significant variance between properties and routines (Table 11). LIRv2 is especially problematic in this region, and the marked improvement in ESPER_LIR_validation relative to LIRv2 suggests the wider data inclusion windows did indeed reduce variance inflation in this region. The release versions of the ESPERs, which do include data from the Sea of Japan/East Sea as training data, reproduce these data with comparable fidelity to the global statistics (Supporting Information S1.4). We conclude this region is not a special challenge for algorithms when training data are included. The release versions of these algorithms updated with the new data should therefore work in the now-measured portions of the Sea of Japan/East Sea. Two additional marginal seas deserve mention. GLODAPv2 does not yet include data from the Gulf of Mexico or the Mediterranean Sea that have been subjected to the GLODAPv2 secondary quality control process (some data from the Mediterranean Sea are included, but with QC flags of 0).
However, due to the large errors expected within marginal seas (and now demonstrated for the Sea of Japan/East Sea) when training data are absent or omitted, data from two cruises to the Mediterranean were included in the training data for CANYON-B despite the lack of secondary QC. We now do similarly in the ESPERs and include additional data gathered as part of the CODAP-NA (Jiang et al. 2021) and ongoing CARIMED efforts (Supporting Information S1.1). The same lessons from the Sea of Japan/East Sea analysis apply to the reconstruction of measurements from the Gulf of Mexico and the Mediterranean Sea (Table 11). We caution that ESPER_LIR is challenged by the lack of data below 2000 m depth in the Mediterranean and enlarges its window sizes enough to incorporate data at depth from the deep North Atlantic. This results in poor RMSE statistics even when the test data are included with the training data (Supporting Information S1.4). Until this is addressed, we recommend that users interested in this area use ESPER_NN or CANYON-MED (Fourrier et al. 2020) in place of ESPER_LIR or ESPER_Mixed. Such regional algorithms can be meaningfully better for regional efforts, and work in progress on a regional algorithm for the Gulf of Mexico shows promise for reducing the RMS misfit to the observations from this region. The Gulf of Mexico challenges the ESPERs because this is a region where the underlying TTD-based C ant data product does not contain estimates, so C ant is crudely triangulated between the Pacific and Atlantic in this region. A regional algorithm could address this limitation with a more sophisticated approach. Finally, with intense seasonality, strong freshwater cycling and riverine inputs, seasonal ice cover, and broad continental shelves, the Arctic is an interesting "worst case scenario" for the algorithms.

Table 7. Assessment statistics, reported as bias (± RMSE) in μmol kg⁻¹, for various oxygen estimation routines presented both globally (top rows) and for the intermediate ocean (bottom rows, provided for comparison only, as float-based oxygen sensors are not commonly quality controlled against algorithms).

Discussion and summary statements
Several patterns hold across the various properties. For example, including more predictors leads to better estimates on average (Fig. 4, showing an average across all properties for both ESPERs) when the predictor measurements are high quality (i.e., comparable to the measurements in GLODAPv2). The larger, near-surface estimate errors for parameters influenced by air-sea gas exchange (e.g., pHT, DIC, and oxygen) are likely the result of their decoupling from predictor variables that are not gases (or are gases with different equilibration and residence times). These changes in parameter relationships near the surface due to air-sea exchange are also sensitive to dynamic processes (e.g., wind speed), which are not well captured by the predictor parameters, and are thus difficult to parameterize in static algorithm relationships. Finally, regional errors are sometimes significantly larger than global open-ocean errors, and regional biases are almost always larger than the global biases. This highlights an important caution for users of these routines: the global statistics may not be appropriate for estimates over a more limited area. For this reason, we note both that it is important to validate the algorithm estimates for a given region/application and to consider how large an average estimate bias is likely for a region of a given size. As an example, we have assessed how the bias decreases as the size of the latitude and longitude window considered increases for ESPER_NN_validation nitrate estimates (Fig. 5).

Table 9. Assessment statistics, reported as bias (± RMSE), for various pH estimation routines presented both globally (top rows) and for the intermediate ocean where float-based sensor measurements are often checked against algorithm-based estimates (bottom rows). Only measurements made with purified dyes were used in these assessments to ensure the validation data had no adjustments beyond those applied in the GLODAPv2.2020 secondary quality control process.
These average regional biases are computed by iteratively averaging all estimate errors inside windows of a given size around each of the grid points used by the LIR routines. Then, for each window size considered, we compute an area-weighted average of the absolute values of the bias estimates for the grid points. In the example presented, the average estimate bias is approximately half of the global RMSE when estimates are averaged over a 10° × 10° window, and, as expected, the bias becomes smaller as the averaging window grows. This shows that the estimates retain significant regional bias, implying that nearby algorithm estimates cannot be treated as statistically independent. For a float or mooring that stays within a small spatial region, this algorithm bias could be somewhat worse still than shown in Fig. 5. For pCO2 calculations based on pHT measurements that are adjusted to algorithm values, even a small average bias could lead to a meaningful change in calculated air-sea CO2 flux.
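The windowed-bias calculation described above can be sketched as follows (an illustrative Python re-implementation on a toy grid; the actual computation uses the grid points and area weights of the LIR routines, and all names here are hypothetical):

```python
import math

def mean_abs_windowed_bias(grid_errors, window_halfwidth):
    """grid_errors maps (lat, lon) grid points to lists of estimate errors.
    For each grid point, average all errors within a lat/lon window centered
    on it, then return the area-weighted mean of the absolute biases."""
    weighted_sum, weight_total = 0.0, 0.0
    for (lat0, lon0) in grid_errors:
        in_window = [
            err
            for (lat, lon), errs in grid_errors.items()
            if abs(lat - lat0) <= window_halfwidth
            and abs(lon - lon0) <= window_halfwidth
            for err in errs
        ]
        if not in_window:
            continue
        bias = sum(in_window) / len(in_window)
        weight = math.cos(math.radians(lat0))  # crude grid-cell area weight
        weighted_sum += abs(bias) * weight
        weight_total += weight
    return weighted_sum / weight_total
```

As the window grows, opposing regional biases increasingly cancel within each window, so the mean absolute bias shrinks, which is the behavior shown in Fig. 5.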

Comments and recommendations
We have updated global algorithms for seawater biogeochemical property estimation and their associated MATLAB routines with new functionality using new methods and new data. We show that our new methods are mechanistically at least as skillful as earlier methods and are in some cases better. They also have the advantages of being trained with the latest quality-controlled data products, easy to implement in MATLAB, capable of estimating a variety of seawater properties, flexible with the choice of input parameters, and capable of adapting several aspects of their outputs to user needs (e.g., calculated-like or measured-like pHT). Where possible, our validation statistics provide comparisons using validation versions of the algorithms with identical training and validation data sets for all versions of the routines assessed. We therefore recommend these updates even when validation metrics are comparable to those of earlier routines because the newer routines are trained from a larger data set with better temporal and spatial coverage. Two important features of our new routines are (1) the flexibility to predict many seawater properties from 16 combinations of seawater properties using either a regression approach or a neural network approach and (2) the implementation of a simple estimate of the impacts of C ant on pHT and DIC based on first principles. While the new C ant estimation strategy is an improvement over the LIRv2 approach for estimating the impacts of OA on pH, it is nevertheless quite simplistic and should not be relied upon when C ant distributions are themselves of interest. We test the practice of averaging estimates from multiple algorithms and find that it frequently improves estimates (in a global open-ocean RMSE sense). This practice is therefore recommended for most applications, and we suggest further improvements might be obtained by averaging estimates from still more algorithms such as CANYON-B or its updates. A wrapper function for averaging CANYON-B values is under development and may eventually be included in the same GitHub repository as the ESPER functions.

Table 11. Regional assessment statistics for ESPER Eq. 7 of the validation versions of the algorithms and for CANYON-B. These statistics are obtained without including any training data from the new data added in the 2019 and 2020 GLODAPv2 data product updates; without the supplemental data in the Gulf of Mexico; and, in the case of LIRv2, ESPER_LIR, and ESPER_NN, without any measurements in the Mediterranean. The released ESPER_LIR and ESPER_NN routines should perform significantly better in the Sea of Japan/East Sea, the Gulf of Mexico, and the Mediterranean. Statistics obtained when these data are included are provided as Supporting Information.
Our assessment also revealed or reinforced several important ideas to consider when using algorithm estimates. First, it is critical to have measurements in the training data set that are near to the region in which estimates are desired. Poor reconstructions of the properties of seawater in the Sea of Japan/East Sea from the versions of the routines that did not include measurements in this sea highlight the importance of this caution. Write-ups of earlier algorithm assessment efforts also cautioned against the use of the algorithms in coastal environments and marginal seas where the algorithms did not have training data, but this case study helps quantify the large errors likely when proceeding despite this caution, as many data-poor marginal seas remain. Second, global oxygen, DIC, and pH estimation routine validation statistics are not as strong as the equivalent statistics when limited to intermediate depths. This is likely because the current generation of algorithms lacks data with sufficient temporal resolution to capture seasonal or shorter patterns of variability associated with gas exchange. It is possible that the algorithms could be improved by incorporating measurements from the biogeochemical Argo array or other data products that are more seasonally resolved than GLODAPv2, though care would have to be taken to avoid reinforcing the algorithms with float data that are calibrated against earlier versions of the algorithms. This could perhaps be accomplished by removing float measurements that reside below the depths that experience seasonal variability from the data products used to train these future algorithms. At least until such an improvement is made, seasonal variability in the estimated fields should be treated with caution.
At intermediate depths, ESPER_LIR_validation Eq. 8 reproduces oxygen with an RMSE of 4.8 μmol kg⁻¹ using only T and S as predictors (and 3.7 μmol kg⁻¹ for ESPER_Mixed_validation), raising the possibility that estimates could be used to check oxygen sensor performance on in situ platforms. Currently, most float oxygen sensors are subjected to a 1-point gain calibration against air-oxygen readings or climatological values at high oxygen concentrations, and a deep algorithm estimate could allow a 2-point check that would assess sensor performance at low oxygen saturation. Comparisons at park depths could circumvent potential issues associated with slow sensor response times. Our use of a smaller committee of neural networks with somewhat fewer nodes/neurons than is used by CANYON-B is a pragmatic decision based on the computational costs associated with training neural networks for many combinations of predictors and regions, and we have only done a small amount of neural network structure optimization. However, it should be noted that our use of separate network committees for the Indo-Pacific and Arctic-Atlantic regions effectively doubles the complexity of our networks, and that increasing the complexity further did not seem to meaningfully improve our predictions in limited trials. It is nevertheless likely that further improvements in fit and predictive power could be obtained with additional tuning.
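The suggested 2-point check could look something like the following sketch (a hypothetical Python helper with illustrative names; the tolerance would need to be chosen relative to the algorithm RMSE at the comparison depth):

```python
def two_point_oxygen_check(surface_sensor, surface_reference,
                           deep_sensor, deep_estimate, tolerance):
    """Apply the usual 1-point gain correction at high oxygen (e.g., from
    air-oxygen readings), then compare the corrected deep reading against
    a deep algorithm estimate as a second, low-oxygen check point."""
    gain = surface_reference / surface_sensor  # 1-point gain calibration
    deep_corrected = gain * deep_sensor
    residual = deep_corrected - deep_estimate
    return deep_corrected, abs(residual) <= tolerance
```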
While the neural networks are powerful, we demonstrate that the regression-based approach of the ESPER_LIR routines can nevertheless yield comparably skillful estimates in the open ocean or under the right conditions. We contend that the LIR machinery has an advantage of being more explainable than a neural network, and therefore that the LIRs serve a valuable role among seawater prediction routines. An example of where that could prove useful would be in adapting the LIRs to work in an inland sea. A user could append their own grid of regression coefficients determined for a marginal sea such as the Baltic or Mediterranean Seas or an inland waterway such as the Puget Sound, and the routine would transition seamlessly between global estimates and regionally appropriate estimates. This is a future direction for LIR development that would require partnerships with researchers investigating such bodies of water.
The ESPER_LIR routine lacks predictors derived from coordinate information; rather, this information is used in the interpolation of regression coefficients only. As a result, the LIR routines struggle more than the neural networks when applied in regions that are dissimilar from the training data in property space but are nearby in physical space. This can be seen clearly as larger reconstruction errors in the Mediterranean, the Gulf of Mexico, and the Sea of Japan/East Sea. This was doubly true for the LIRv2 routines, which tended to also be less well-constrained than the ESPER_LIR (i.e., LIRv3) routines. By contrast, the neural networks also struggle, but tend to have better RMSE statistics for these regions. We reiterate that the release versions of the ESPERs should substantially outperform the bleak assessment statistics given for such regions because the release versions of these routines are trained with data in these regions (unlike the _validation versions, which are used to highlight the dangers of using algorithms in regions where they were not trained).

Fig 4.
Average across all estimated properties for both ESPERs of the RMSE for the equations in Table 2. RMSE generally decreases as the number of predictors increases, but not all predictors have the same predictive power and the incremental increase in predictive power diminishes when more than three predictors are used.

Fig 5.
Average absolute bias in ESPER_NN_validation ESPER Eq. 7 nitrate estimates (y-axis) vs. the size of the latitude and longitude windows (x-axis) over which the average of the absolute biases was computed. The three lines correspond to bias estimates that were averaged over a narrow 100 m depth window (blue line), over all depths (orange), and over the 1000-1500 m depth range commonly used for float calibration (red). Biases are area-weighted average estimates for each of the grid locations used by the ESPER_NN routine. Nitrate ESPER Eq. 7 is chosen because it is one of the equations used to calibrate and validate nitrate sensors on biogeochemical Argo floats.