Reconstructing missing data by comparing interpolation techniques: Applications for long‐term water quality data



Abstract
Missing data are typical yet must be addressed for proper inferences or expanding datasets to guide our limnological understanding and management of aquatic systems. Interpolation methods (i.e., estimating missing values using known values within the dataset) can alleviate data gaps and common problems. We compared seven popular interpolation methods for predicting substantial missingness in a long-term water quality dataset from the Upper Mississippi River, U.S.A. The dataset included 80,000 sampling sites collected over 30 yr that had substantial missingness for total nitrogen (TN), total phosphorus (TP), and water velocity. For all three interpolated water quality variables, random forests had very high prediction accuracy and outperformed the methods of ordinary kriging, polynomial regressions, regression trees, and inverse distance weighting. TP had a mean absolute error (MAE) of 0.03 mg (L-TP)−1, TN had a MAE of 0.39 mg (L-TN)−1, and water velocity had a MAE of 0.10 m s−1. The random forests' error rates were mapped and showed low spatiotemporal variability across the riverscape, indicating high model performance across many habitat types and large spatial scales. In the current era of "big data," interpolation becomes an imperative step prior to ecological analyses yet remains unfamiliar and underutilized. Our research briefly describes the importance of addressing missingness and provides a roadmap to conduct model intercomparisons of other big datasets. We also share adaptable data analysis scripts, which allow others to readily conduct interpolation comparisons for many limnology applications and contexts.
Large environmental datasets commonly contain missing data, especially those that cover large geographic scales, have long-term measures, or need continuous temporal observations. Missing data occur for a variety of reasons, such as faulty equipment or intended experimental designs. Missingness may cause a mismatch in sample sizes with other variables and lead to blank cells in data matrices, making statistical analyses difficult. Missingness also happens intentionally within some experimental or observational study designs to efficiently sample large geographic areas and reduce costs of sampling and laboratory processing. In cases of intentional missingness, addressing the data gaps can increase the spatial density of data that allow for creation of new continuous spatial data or can address new research questions that were not the focus of the original study design.
Missing data in large datasets are problematic and should not be ignored (Nakagawa and Freckleton 2008). Often missing data are assumed to be "missing completely at random," but typically they are missing nonrandomly (Little and Rubin 2002). Easy but incomplete solutions to nonrandomly missing data that are commonly used include re-coding values (e.g., a missing value is replaced with a value of "0"), mean imputations or substitutions (e.g., the mean value replaces the missing value), or listwise deletion (i.e., deleting an entire record of information if any associated value is missing). These approaches can lead to significant loss of valuable information or loss of statistical power from reduced sample size (Little and Rubin 2002). Moreover, these crude approaches may introduce bias in the dataset that can cause faulty conclusions (Beheim et al. 2021). Ecologists nonetheless tend to continue using these simplified approaches (Yanai et al. 2018) rather than more sophisticated techniques, such as simulation modeling and interpolation, that can relieve scientists from some of these concerning consequences.
A specific class of methods for addressing missingness is interpolation, which is covered extensively elsewhere (Little and Rubin 2002; Molenberghs et al. 2014). Data interpolation estimates the missing values using known values within the dataset. At least 25 interpolation methods have been developed within the environmental sciences. Each method differs by model features, such as how the estimate is calculated, how much data are used to derive estimations, and whether errors are provided (Li and Heap 2014). Interpolation remains underutilized in part because of the statistical knowledge and computer programming skills required. Few references compare the precision and accuracy of various interpolation methods, and many previous studies simply selected a single method without stated justification or intercomparisons of alternative methods (Li and Heap 2011, 2014; Louvet et al. 2016). However, intercomparison studies of interpolation are beginning to appear in fields of study outside of limnology (Penone et al. 2014; Miao et al. 2021; Picornell et al. 2021) and within limnology (Lottig and Carpenter 2012; Song et al. 2016). Intercomparison studies and available analysis scripts will guide aquatic scientists in choosing the most promising tools among the many methods available for their data.
A 30-yr water quality dataset from the Upper Mississippi River System (UMRS; Fig. 1) in the United States provides one example of typical and intentional missingness and an opportunity to expand this rare long-term dataset. The monitoring study design and sampling scheme optimizes the information : cost ratio of data collection and produces unbiased water quality estimates for reach- and strata-level inferences (Soballe and Fischer 2004; De Jager and Houser 2012). Accurate interpolations of these data allow for new site-level analyses and the creation of new, continuous spatial data layers. The UMRS data were missing either randomly or nonrandomly due to incidental, intentional, and accidental causes. The data had ~10% incidental and nonrandom missingness for water velocity because the equipment produced unstable measurements at extreme velocities, as well as intentional missingness because the river's main channel, with its extreme velocities, was not sampled. The sampling design intentionally sampled nutrients less frequently than other water variables because the reduced sampling regime appropriately addressed spatiotemporal variance for reach- and strata-level inferences (Soballe and Fischer 2004; De Jager and Houser 2012). The reduced nutrient sampling resulted in 66% of values that were "missing" completely at random for water column TP and TN compared to the other variables sampled more frequently (Table 1). The data were rarely missing from accidental data loss (< 1% of observations over 80,000 sampling sites as of 2020). Researchers are now interested in using these long-term data for site-level analyses and creating continuous data layers for geographic information systems, which require the data frame to have no missing values in any cell. Using listwise deletion of this dataset for these new purposes reduced the number of sampling sites by ~85%. Also, imputing 66% of missing nutrient data was inappropriate.
Therefore, interpolation (if accurately predictive) would provide an opportunity to expand the dataset to advance understanding and restoration of the UMRS.
We sought to compare seven interpolation methods for addressing missingness in long-term ecological datasets. Our specific study objective was to accurately interpolate the substantial missing data that occurred within the long-term data from the UMRS (Fig. 1). We evaluated seven commonly employed interpolation methods to determine a top-performing method for expanding the long-term water quality dataset. We interpolated the three key water variables with substantial missingness: TN (n = 50,293 missing data points, intentional per the study design), TP (n = 51,031 missing data points, intentional per the study design), and water velocity (n = 26,301 missing data points from incidental losses related to extreme readings). We compared "local" and "global" interpolation methods (Table 1). We hypothesized that global methods (which use all the data and variables for interpolation) would outperform local methods because local habitat conditions can be highly variable and because of the multivariate, interactive nature of the variables in the UMRS (Houser et al. 2022) and aquatic systems in general. Finally, we hypothesized that the prediction errors may show spatial patterns within reaches; for example, prediction accuracy may decline in riverine side channels with high habitat heterogeneity compared to the main river channel. We provide analysis scripts for these seven interpolation techniques as opportunities for aquatic ecologists to consider and readily address missingness within their own data.

Study system and study design
Long-term water quality data were collected by the Upper Mississippi River Restoration Program (https://umesc.usgs.gov/ ltrm-home.html) using standard protocols since 1993 (Soballe and Fischer 2004) to improve ecological understanding of this important river system. Water quality data were collected using a stratified-random sampling design to assess the status and trends of representative study reaches and strata along the river (Houser et al. 2022). Data are hierarchically nested by scale; specifically, reaches contain strata and strata contain sampling sites. Sampling covers six reaches and four strata that broadly correspond to the abovementioned habitat types (i.e., main channel, side channels, backwater lakes, and open impoundments).
All water data were collected across four seasons: spring, summer, autumn, and winter. Sampling occurred during the same 2 weeks each season of each calendar year for consistency. All variables of interest were collected near the water's surface (0.2 m depth) with an average sampling time of 12:00 h. Sample readings were either collected in situ or analyzed according to rigorous laboratory procedures (Table 2; Soballe and Fischer 2004). The primary cause of the 66% missing data for TN and TP was the study design itself, which reduced costs associated with complex laboratory analyses; this "missingness" is therefore an artifact of researchers expanding the dataset for new types of analyses not originally intended by the study design (e.g., multivariate, site-level analyses). Velocity had ~10% missingness due to measurement errors during frequent but extreme velocity conditions (near 0 m s−1 and values typically > 1 m s−1) and because velocity was purposely not measured in the main channel strata.

Procedures: Data preparation
We downloaded the entire water quality data and metadata on 07 June 2021 from the Upper Mississippi River Restoration Program website (www.umesc.usgs.gov/ltrm-home.html). The entire dataset included information from years 1993 to 2020: an initial data frame of 204,345 rows (site data) and 133 columns (variables). For data preparation, we used software R (R Core Team 2022) and the tidyverse version 1.3.1 collection (Wickham et al. 2019), and recorded scripts for reproducibility (Broman et al. 2017). We extracted only data collected at the water's surface via the stratified random sampling scheme (and removed all sites considered "fixed sites"). We selected 11 continuous variables for the scope of our interpolations (Table 2). Although other variables are available in this dataset, they were correlated with the others (e.g., nitrate and TN) or binary (e.g., plant data) and thus omitted for modeling. Rarely (< 0.2% of samples), negative values occurred due to measurement below the detection limit of the instrument, and all negative values were set to the minimum, nonnegative value observed. The variables of interest had a corresponding column for quality flag (QF) with comments. Values with QF codes equal to "0," "A," "8," or "64" were set to "NA" because the reported results were from inoperable equipment, a nonstandard laboratory method, or marginal sample condition. All variables had many statistical outliers (defined as 1.5 times the interquartile range), which may or may not affect interpolations; however, we retained most statistical outliers for analysis because all values were deemed realistic for the UMRS. We removed three values for TN that were not reasonable (> 10 mg [L-N]−1) prior to interpolations. All cells with "NA" were considered missing data (whatever the reason for the missingness) and were interpolated in the next steps. The reduced dataset (n = 82,481 sites; data file titled "water_data_qfneg.csv") was subsequently used for all interpolation methods herein. The dataset was split into training (80%) and testing (20%) sets (Table 1), as described in more detail under each method's section heading below.

Notes to Table 1: Polynomial Reg (deg = X), multivariate polynomial regression with degree of X. *Global methods use all available data to derive the estimation, whereas local methods operate within a small, gridded area around the point being estimated to capture local spatiotemporal variation; the approach used in this study is provided first, with an alternative in parentheses. †Deterministic methods only provide estimations, whereas stochastic methods provide both estimations and associated errors. ‡Convex methods yield estimates bound between the minimum and maximum of the observed values, whereas nonconvex methods can estimate outside the range of the observed values. §Univariable methods use only one primary variable to derive the estimation, whereas multivariable methods use multiple explanatory/predictor variables to estimate the primary variable; the approach used in this study is provided first, with an alternative in parentheses. ‖Whether or not the dataset undergoes a "train/test splitting" procedure, whereby a random subset of the data is split into a training set and the remaining data into a testing set to evaluate model performance.

Table 2. Water quality variables selected for interpolation from the long-term water dataset for the Upper Mississippi River, U.S.A. Each variable was used either as a response variable for interpolation or a predictor variable during interpolation. Further details on the data collection and processing, either in situ or in the laboratory, are provided in Soballe and Fischer (2004). (Column headers: variable type; parameter; acronym used in dataset; unit of measurement.)

Procedures: Interpolation methods
We examined seven commonly employed interpolation methods for three key variables with substantial missingness: TN, TP, and velocity. Of the plethora of interpolation methods available, we chose these seven for intercomparison because of their long-standing history, popularity, and general robustness across many data types and applications (Li and Heap 2014). We provide a graphical description of how each interpolation method works (Fig. 2). Table 1 compares the key model characteristics and highlights the main differences, advantages, and limitations of each modeling approach.
Method: Inverse-distance weighting (at two different time steps)

Inverse-distance weighting (IDW) is a spatiotemporal interpolation model that assumes data close together (in time and space) are more similar than data farther apart (Fig. 2). To interpolate a missing variable, users could use two or more of the closest measured values (Lu and Wong 2008). We used the two closest measured values to interpolate missing data. The IDW method applied a weighted sum across the actual values, using the Euclidean distance to scale the interpolated value as follows:

var_interp = (var_1/d_1 + var_2/d_2) / (1/d_1 + 1/d_2),

where var_1 and var_2 are the two known values, var_interp is the interpolated value, and d_1 and d_2 are the distances between the starting and ending points of the samples, whereby samples closer in space or time are given more weight. The IDW distance did not account for possible terrestrial interference from river islands and side channels.
We built the IDW models using software Python 3.10.7 (Python Software Foundation 2021). For our IDW models, we accounted for both space and time when calculating "distance." We used the latitude/longitude variables for spatial distance. For temporal distance, two different time steps (1 and 3 yr) were tested and reported. We first partitioned the data by season and year (either 1 yr or 3 yr) so that neighbors could only be chosen within their own subset. For IDW, errors were evaluated using all nonmissing values for each variable: we interpolated a value for each known value using the two closest locations in space and time, recorded the absolute and squared differences between the known and interpolated values, and then averaged these differences to obtain the mean absolute error (MAE) and root mean squared error (RMSE).
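The two-nearest-neighbor IDW estimate described above can be sketched in a few lines of Python; `d1` and `d2` stand for the space-time Euclidean distances to the two closest known samples:

```python
def idw_two_neighbors(var1, d1, var2, d2):
    """Inverse-distance-weighted estimate from the two nearest known values.

    var1, var2: the two closest measured values (in space and time);
    d1, d2: their Euclidean distances to the point being interpolated.
    """
    # Closer samples get more weight (weights are inverse distances).
    w1, w2 = 1.0 / d1, 1.0 / d2
    return (w1 * var1 + w2 * var2) / (w1 + w2)
```

For example, a neighbor at distance 1 with value 2.0 and a neighbor at distance 3 with value 4.0 yield an estimate of 2.5, pulled toward the closer sample.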

Method: Ordinary kriging
Kriging is a stochastic geostatistical interpolation method that factors in spatial autocorrelation, or the correlation of a variable with itself, over space (Burt et al. 2009). Therefore, kriging is particularly appropriate when a spatially correlated distance is known (as originally hypothesized in our water quality dataset) and therefore is a popular technique in soil and geological sciences. Kriging consists of two processing steps ( Fig. 2): (1) computing the weighted average of available samples by fitting a semivariogram model to estimate the spatial autocorrelation between known values and (2) using the model to interpolate the unknown values.
Before kriging, the missing values were removed. The remaining data points were displayed in Esri's ArcGIS Desktop (Esri 2020), projected to Universal Transverse Mercator Zone 15 North, WGS 1984, and separated by river reach (Fig. 1). We used the "kriging" tool in the ArcGIS Spatial Analyst extension to generate interpolated surfaces for each missing variable and reach combination (e.g., TN for Pool 4, TP for Pool 8, velocity for Pool 13). Parameter selections for each kriging operation included the ordinary kriging method (which assumes an unknown constant trend), a spherical semivariogram, and an output cell size of 10 m. Geoprocessing was local to limit the extent of the interpolated surface to each reach rather than interpolating among reaches. Esri's Geostatistical Wizard performs cross-validation by using all available data to compare interpolated values with observed values, which estimates the trend used during kriging and provides statistics on the prediction errors (i.e., RMSE and MAE); we report the average errors of the six reaches in Table 3.
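The study performed kriging inside ArcGIS; as a rough illustration of its first step only, an empirical semivariogram (mean semivariance by distance bin, to which a spherical model would then be fitted) can be computed with NumPy. This sketch omits the model fitting and the kriging weights themselves:

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Mean semivariance of sample pairs, binned by separation distance.

    coords: (n, 2) array of projected x/y positions; values: (n,) measurements;
    bin_edges: increasing distance-bin boundaries.
    """
    coords = np.asarray(coords, float)
    values = np.asarray(values, float)
    # All pairwise separation distances and squared half-differences.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    gamma = 0.5 * (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)  # count each pair once
    dist, gamma = dist[iu], gamma[iu]
    sv = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        sv.append(gamma[mask].mean() if mask.any() else np.nan)
    return np.array(sv)
```

On three collinear points with values 0, 1, 2, pairs one unit apart have semivariance 0.5 and the pair two units apart has semivariance 2, showing variance growing with distance, which is the spatial autocorrelation structure kriging exploits.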

Method: Polynomial regression
Polynomial regression uses linear or nonlinear correlations among a single or multiple variables within a dataset to build a model that predicts missing values (Sinha 2013). We used multivariable polynomial regression interpolation to predict missing values for a desired target variable using the other 10 variables as predictors in the dataset (Tables 1, 2; Fig. 2). The sparse (< 0.1%) missing values in the predictor variables were imputed with the median. It is important to note that the missing values in the training set and testing set were imputed with their respective medians independently, because we did not want to give the model any information from the testing set, which would introduce bias. After median imputation, each value of the predictor variables was scaled using the RobustScaler in the sklearn module (Pedregosa et al. 2011) with the following transformation: X_scaled = (X − median(X)) / interquartile range(X). The scaler object was fitted on the training set but used to scale both the training and testing sets.

Fig. 2. A graphical representation of how common interpolation methods work. Note these are hypothetical graphics, and the parameters are often changeable (e.g., the number of points used in kriging, the number of data points in a polynomial regression, or the number of splits in the regression tree). For the inverse-distance weighted method, the points on the river include the interpolated value (yellow star), the nearest-neighbor values considering space and time used to derive the interpolated value (red triangles), and unused sampling points (black circles). In the kriging method, all values (black circles) are used to estimate the interpolated value (yellow star) but are weighted by distance (in meters). The polynomial regression strives to model nonlinear data by optimizing the model fit with the lowest bias and lowest variance; to find the preferred polynomial regression degree, we fit several models of various degrees and perform k-fold cross-validation to determine which model has the lowest mean absolute error on the test data. The regression tree is split at nodes (variables) based on conditions ("cond.") of binary, threshold values; after the final splitting, the tree provides predictions ("pred."), or interpolated values, based on the variable splitting process. The random forests method combines many regression trees to improve the predictive accuracy of a single regression tree: bagging bootstraps the data to create independent resamples of the training data, and then the same regression tree is applied to each bagged tree. River clip art images were modified with permission from the originator, Tracey Saxby (https://ian.umces.edu/media-library/, accessed on 21 September 2022).
We built the models using software Python version 3.10.7 (Python Software Foundation 2021). We used the "train_test_split" method from sklearn (Pedregosa et al. 2011) to randomly split the full dataset into training (80% of data) and testing (20% of data) sets for model training and evaluation. Next, we used sklearn.preprocessing.PolynomialFeatures (Pedregosa et al. 2011) to transform the predictor variable matrix into a higher dimensional matrix with higher order terms (depending on the degree of the polynomial). Degree 1 and 2 models were trained on the entire training set for each variable and evaluated on the test set using the LinearRegression class in sklearn (Pedregosa et al. 2011); the final polynomial model was the one with the lowest mean squared error (MSE) on the test data. We did not perform model selection to determine which predictors (Table 1) were significant, which allowed direct comparison to the other global models (Table 2) that contained all 10 predictors. The MAE and RMSE are reported on the test set.
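The degree-selection workflow above can be sketched with the same scikit-learn classes; the data here are synthetic stand-ins for the 10 water quality predictors:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))                        # stand-in predictors
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.05, 500)   # stand-in target

# 80/20 train/test split, mirroring the study's evaluation scheme.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# RobustScaler is fitted on the training set only, then applied to both sets.
scaler = RobustScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Fit degree-1 and degree-2 models; keep the one with the lowest test MSE.
mse = {}
for degree in (1, 2):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr_s), y_tr)
    mse[degree] = mean_squared_error(y_te, model.predict(poly.transform(X_te_s)))

best_degree = min(mse, key=mse.get)
```

Because the stand-in target contains a quadratic term, the degree-2 model wins the comparison here; on real data the chosen degree depends on the variable.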

Method: Regression tree
Regression tree algorithms predict a continuous target variable. They output a binary tree by implementing recursive binary partitioning, which creates branches or subsets of the data with similar target variable values (Fig. 2).
The data splits according to which predictor variable value minimizes the combined sum of squares error in the two resulting partitions of data. The interpolated target variable values are the averages of the points in the tree's leaf nodes. Regression trees benefit by easily handling missing values in the predictors through surrogate splits (Therneau et al. 2019). When a predictor variable is missing for a data point, the regression tree uses a surrogate split, an alternate predictor that partitions the data similarly to the original predictor variable.
We used the "rpart" package (Therneau et al. 2019) in software R version 4.1.2 (R Core Team 2022) and interpolated the missing values for TP, TN, and velocity. The regression tree used all predictors in Table 2. For each tree, the full dataset was randomly split into training (80% of data) and testing (20% of data) sets. The complexity parameter (CP) was set to 0.011, which establishes the minimum amount by which the R-squared value must increase after adding a new node to the tree. This CP value was determined using the 1-standard-error rule on the TP, TN, and velocity trees.
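A loosely analogous sketch in Python (the study itself used R's rpart): scikit-learn's `DecisionTreeRegressor` prunes via cost-complexity (`ccp_alpha`), which is related to but not identical to rpart's CP, and it does not implement surrogate splits, so this sketch assumes complete predictors and synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 2))                        # stand-in predictors
y = np.where(X[:, 0] > 0.5, 1.0, 0.0) + rng.normal(0, 0.05, 400)  # step-like target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# ccp_alpha plays a pruning role loosely analogous to rpart's complexity
# parameter: splits that do not reduce error enough are pruned away.
tree = DecisionTreeRegressor(ccp_alpha=0.001, random_state=1).fit(X_tr, y_tr)
r2_test = tree.score(X_te, y_te)  # held-out fit of the pruned tree
```

The leaf predictions are the averages of the training points falling in each leaf, exactly as described for the regression tree above.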

Method: Random forests
Random forests are an ensemble machine learning algorithm that combines many regression trees to improve the predictive accuracy of an individual tree (Fig. 2). The random forests are popular because they are simple to implement, perform well with little to no tuning, capture nonlinear relationships, and can handle high-dimensional data (Boulesteix et al. 2012; Biau and Scornet 2016). Researchers have applied random forests to scientific fields of ecology, land cover, soil properties, and more (Tyralis et al. 2019). The algorithm ensembles decorrelated, bagged regression trees. Bagging begins with bootstrapping the data with replacement to create independent resamples of the training data. On each bootstrapped resample, the same regression tree is fitted. The model adds randomness to decorrelate each bagged tree: each split in a bagged tree is limited to a random subset of all predictors. We used the "ranger" package (Wright and Ziegler 2017) in R version 4.1.2 (R Core Team 2022) to construct the random forests model. The full dataset was randomly split into training (80% of data) and testing (20% of data) sets. The model used all predictor variables in Table 2, and the occasional missing values for predictor variables were imputed with the median. We set the number of trees to ntree = 500 (the "ranger" default), an mtry value of 3 (the number of randomly sampled predictors considered for each split), and a min_n of 5 (the minimum number of observations needed to further split a node).

Table 3. A comparison of interpolation methods for a long-term water quality dataset on the Upper Mississippi River, U.S.A. The random forests model had the lowest mean absolute error (MAE) and root mean squared error (RMSE) for all variables, which indicated higher prediction accuracy than the competing models.
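An analogous sketch in Python with scikit-learn (the study itself used R's "ranger"); `n_estimators`, `max_features`, and `min_samples_split` loosely mirror ntree, mtry, and min_n, and the data are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(500, 10))                       # 10 stand-in predictors
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.05, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

# Settings loosely mirroring the paper's ranger call: 500 bagged trees,
# 3 candidate predictors per split, minimum of 5 observations to split a node.
rf = RandomForestRegressor(
    n_estimators=500, max_features=3, min_samples_split=5, random_state=2
).fit(X_tr, y_tr)
r2_test = rf.score(X_te, y_te)  # held-out accuracy of the ensemble
```

Each tree sees a different bootstrap resample and a random predictor subset at each split, which decorrelates the trees before their predictions are averaged.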

Comparing models and evaluating model performance
We conducted error analyses on all interpolation methods using one round of cross-validation to evaluate model accuracy. For three methods, the dataset was split into training (80%) and testing (20%) sets (Table 1). We were not able to use identical train/test datasets across each method because each model's software package partitioned the data differently and at random. For the IDW and kriging methods, train/test splitting is not an option, so we used "leave-one-out cross-validation" (i.e., we withheld a known value, predicted it from the model, and compared the difference).
We used several lines of evidence for choosing the "best" performing model. We compared performances with two error metrics, the MAE and RMSE. The MAE and RMSE are the best overall measures of model accuracy because they summarize the mean difference in the actual units of the observed and interpolated values (Willmott 1982). Below are the formulas for these error metrics, where y_i is the actual value, ŷ_i is the interpolated value, and n is the dataset size:

MAE = (1/n) × Σ |y_i − ŷ_i|

RMSE = √[(1/n) × Σ (y_i − ŷ_i)²]
The MAE is the average difference between the measured and interpolated values, uses the same scale as the measured data, and is less sensitive than RMSE to outliers. The RMSE squares the differences before averaging, so that larger differences between the actual and interpolated values are weighted more heavily. The interpolation method with the lowest MAE and RMSE fits the data best and is the "top performing method." We also examined the summary statistics and compared the relative magnitudes of the differences in MAE and RMSE for each method. Boxplots and scatterplots pairing the interpolated and actual values were used to compare congruence.
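The two error metrics can be computed directly; a minimal sketch:

```python
import math

def mae(actual, interpolated):
    """Mean absolute error: average |y_i - yhat_i|, in the variable's own units."""
    return sum(abs(a - p) for a, p in zip(actual, interpolated)) / len(actual)

def rmse(actual, interpolated):
    """Root mean squared error: squares differences first, so large misses
    are penalized more heavily than in the MAE."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, interpolated)) / len(actual)
    )
```

For the same predictions, RMSE ≥ MAE always holds, and the gap between them widens as a few large errors dominate.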
To report the potential biases of the "top performing model," we calculated three metrics: error (AE), percent bias (PBIAS), and Kling-Gupta model efficiency (KGE). The AE is calculated as:

AE_i = y_i − ŷ_i,

where y_i is the actual value and ŷ_i is the interpolated value. The PBIAS and KGE bias metrics were calculated using the package "hydroGOF" (Zambrano-Bigiarini 2023) in R version 4.1.2 (R Core Team 2022). The PBIAS measures the average tendency of the predicted values to be larger or smaller than the observed values (a PBIAS of 0.0 is ideal, indicating no bias). The KGE is a normalized model efficiency measure of general agreement between the predicted and observed values (a KGE close to 1.0 is ideal).
The distribution of AE was plotted in histograms, where high congruence is indicated as errors being centered near zero and symmetrical. To assess spatial variability of the top model's prediction errors, we calculated the AE for each site within the dataset. The AE value was plotted for each of the three interpolated variables according to latitude and longitude, and we visually inspected the maps for any patterns of spatial variability.
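The two bias metrics can be sketched in NumPy using their standard definitions (hydroGOF computes these with more options, so this is an illustrative approximation; the KGE here is the 2009 formulation):

```python
import numpy as np

def pbias(obs, sim):
    """Percent bias: average tendency of predictions to over- or
    undershoot the observations (0.0 is ideal)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * (sim - obs).sum() / obs.sum()

def kge(obs, sim):
    """Kling-Gupta efficiency: combines correlation, variability ratio,
    and bias ratio; 1.0 indicates perfect agreement."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]     # correlation component
    alpha = sim.std() / obs.std()       # variability ratio
    beta = sim.mean() / obs.mean()      # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Predictions that match the observations exactly give PBIAS = 0 and KGE = 1; a uniform overshoot inflates PBIAS while leaving the correlation component of KGE untouched.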

Computational costs
All interpolation methods were successfully implemented with our full data matrix (82,481 rows by 11 variables), with computation times ranging from 0.25 to 7 h on computers with 16 GB RAM. These specifications are common on current laptop and desktop computers, so the methods are not typically limited by memory. The random forests model was the most computationally expensive (~7 h) because the algorithm fitted 500 regression trees on bootstrapped resamples and then aggregated their predictions into an averaged solution. However, the R package "ranger" (Wright and Ziegler 2017) that we used had faster run times and less memory usage compared to other random forests packages (Wright and Ziegler 2017). We provide detailed analysis scripts and example river data for all interpolation methods (Larson et al. 2023), which give other scientists the opportunity to implement and compare interpolations using their own data.

Error metrics
We refrained from providing a subjective determination of whether the MAE was "good, sufficient, or poor" for each variable and method because no industry standards exist. Those judgments should instead be defined by the user's purpose for the interpolated values. Cross-validation procedures can reduce subjectivity, quantify errors, and reveal potential biases (e.g., spatial bias). Furthermore, calculating multiple bias metrics (such as PBIAS and KGE) allows bias to be assessed quantitatively.
The random forests method consistently had the lowest MAE and RMSE scores compared to the other methods for all three interpolated variables (Table 3; Fig. 3). The random forests had high prediction accuracy according to several error metrics (Table 3) and diagnostic tests (Fig. 4). For all three water quality variables, the random forests' scatterplots suggested that most interpolated values derived from random forests were congruent with the actual values (r² = 0.74-0.79, depending on the water quality variable). The data distributions along the 1 : 1 line in scatterplots revealed slight underprediction biases for TP and velocity; specifically, random forests predictions sometimes yielded interpolated estimates less than the actual measures that were above the upper quartile (75th percentile) of the data. Any actual value for velocity greater than a threshold of ~1.5 m s−1 (which was a statistical outlier, though there were many outliers) was always underpredicted (Fig. 4g). The boxplots comparing random forests' actual and interpolated values were well aligned within the interquartile range, and the interpolated values had fewer outliers than the actual data. Model residual error distributions were centered over zero and normally distributed, which indicated good model fits.
Although the "best" performing method was clearly identified as random forests according to several performance metrics, there was not a single "worst" performing method. The IDW 1-yr method had the overall second-best performance (Table 3; Fig. 3) and was comparable to random forests for the TN variable, as indicated by their similar MAE scores (~0.4 mg [L-TN]−1; Fig. 3). However, random forests performed better than IDW 1-yr for the TP and velocity variables, as indicated by MAE differences of 0.01 mg (L-TP)−1 and 0.03 m s−1, respectively. For all variables, the IDW 1-yr outperformed the IDW 3-yr, which indicated that spatiotemporal variation in water quality was high and that predictions were more accurate when interpolations were constrained to "nearby" water quality samples in both space and time (i.e., constrained within the same year). Both the IDW 1-yr and IDW 3-yr methods had lower MAE, but similar RMSE, compared to the regression tree, kriging, and two polynomial regressions.

Fig. 3. The correlation between RMSE and MAE of each interpolation method using long-term water quality data from the Upper Mississippi River System. The relative model performance for each interpolation method was compared for three key water quality variables: TN, TP, and water velocity. The method with the lowest MAE and RMSE (in these cases, "random forests" in the lower left quadrant) indicated the lowest error for the interpolated predictions. The precise values for RMSE, MAE, and sample sizes are presented in Table 2. Note the x- and y-axis limits differ based on the water quality variables. The MAE units vary by variable: TN (mg L−1), TP (mg L−1), and velocity (m s−1).
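For reference, the distance-weighting core of IDW fits in a few lines. This illustrative Python sketch omits the study's additional constraint that donor samples fall within the same year (1-yr) or a 3-yr window before weighting:

```python
import numpy as np

def idw(xy_known, v_known, xy_target, power=2.0, eps=1e-12):
    """Inverse distance weighted estimate at each target point.

    xy_known: (n, 2) coordinates of sampled sites
    v_known:  (n,) measured values at those sites
    xy_target: (m, 2) coordinates where values are missing
    """
    # Pairwise distances between targets (rows) and known sites (columns).
    d = np.linalg.norm(xy_known[None, :, :] - xy_target[:, None, :], axis=2)
    # Closer sites get larger weights; eps guards against division by zero
    # when a target coincides with a sampled site.
    w = 1.0 / np.maximum(d, eps) ** power
    return (w @ v_known) / w.sum(axis=1)
```

A target equidistant between two known values receives their simple average; a target on top of a sampled site effectively returns that site's value, which is the behavior that makes temporally constrained ("nearby") donor pools so influential.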
The regression trees, kriging, and polynomial regressions performed similarly, with relatively high, correlated MAE and RMSE values (Table 3; Fig. 3). The multivariate polynomial regressions of 2nd degree performed marginally better than the 1st-degree polynomials for all variables, and polynomial regressions of higher degrees were overfit and are not reported in the tables. Interpolation performance may be improved using polynomial regression that explores various degrees for individual predictor variables and includes backward model selection, which drops insignificant, higher-order terms until all terms are significant. Detailed results for each method's performance are supplied in the analysis scripts (Larson et al. 2023).
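A simplified variant of this idea compares candidate polynomial degrees on held-out data, selecting the degree by held-out MAE rather than by term-wise significance tests (a Python sketch on synthetic data, not the study's R scripts):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic example: the true relation is 2nd degree, plus noise.
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.1, 200)

train, test = slice(0, 150), slice(150, None)   # simple holdout split
errors = {}
for degree in (1, 2, 3, 4, 5):
    coef = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coef, x[test])
    errors[degree] = np.mean(np.abs(y[test] - pred))  # held-out MAE

best = min(errors, key=errors.get)   # underfit (degree 1) is penalized
```

The 1st-degree fit cannot absorb the quadratic term and shows clearly higher held-out error, mirroring how the 2nd-degree regressions outperformed the 1st-degree ones here, while ever-higher degrees stop paying off once they begin fitting noise.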
The interpolation performance for the top method (random forests) varied depending on the water quality variable, as suggested by the RMSE and MAE metrics (Table 3; Fig. 3). TP had the greatest prediction accuracy when applying random forests, with a MAE of 0.03 mg (L-TP)−1. TN had a MAE of 0.39 mg (L-TN)−1, which is reasonable in this river system with very high nitrogen concentrations (up to 8 mg L−1; see Fig. 4). The velocity variable had the highest relative error (MAE of ~0.10 m s−1). The PBIAS showed some underprediction for TP and velocity (PBIAS = 0.2 and 0.3, respectively) but not for TN (PBIAS = 0.0). The KGE metric showed general agreement between observed and predicted values (KGE = 0.68–0.75 for the three water variables), and Fig. 4 shows the disagreements arose from underprediction of the uncommonly high values.

Fig. 4. Scatterplots, boxplots of distributions, and histograms of residual errors comparing the congruence of predicted and actual values for TP (red, a-c), TN (blue, d-f), and velocity (yellow, g-i) from Upper Mississippi River long-term data. The predicted data were obtained by interpolation with a random forests machine learning algorithm; (a), (d), and (g) display a 1 : 1 line for the predicted vs. actual values. Boxplots in (b), (e), and (h) are standardized by boxes (containing the 25th, 50th [median], and 75th percentiles), whiskers (containing the 0th and 100th percentiles, excluding outliers), and dots (statistical outliers, retained for interpolations).

Applying interpolation to address missingness and explore spatial ecology
We applied random forests to the entire water quality dataset of the UMRS from years 1993 to 2020 to interpolate blank cells in the data frame for TP, TN, and water velocity (n = ~125,000 values). Using listwise deletion to remove sampling sites with at least one blank cell for any water quality variable would have discarded 85% of the entire stratified random sampling dataset, whereas interpolation retained the entire dataset with no missingness and allowed new types of analyses. The final data frame replaced missing values with interpolated values (n = 76,670 sites). The benefits of interpolation and no missingness included the ability to conduct site-level and multivariable analyses with no data loss, especially in this case of retaining the 85% of the dataset that would otherwise be lost to listwise deletion. The missing nutrient values are not problematic for the reach and strata (i.e., habitat type) analyses that the data were principally designed to inform (Soballe and Fischer 2004; Houser et al. 2022), but site-level analyses require properly addressing missingness prior to ecological assessments.
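The cost of listwise deletion compounds quickly across variables: with independent per-cell missingness p in each of k columns, only (1 − p)^k of rows survive. A toy sketch with hypothetical values chosen so the loss is similar in magnitude to the 85% reported above:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical site-level matrix: three columns standing in for TP, TN, velocity.
data = rng.uniform(size=(1000, 3))
data[rng.uniform(size=data.shape) < 0.45] = np.nan   # ~45% of cells missing

# Listwise deletion keeps only rows with no missing cell.
complete_rows = ~np.isnan(data).any(axis=1)
frac_lost = 1.0 - complete_rows.mean()               # expect ~1 - 0.55**3 ≈ 0.83
```

Even moderate per-variable missingness therefore removes most rows once several variables are required simultaneously, which is why interpolation rather than deletion enabled the site-level analyses here.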
The drastic error differences among water quality variables (Table 3) suggested how spatiotemporal variation in riverine processes can affect predictive modeling capacity. Random forests had very high predictive ability for TP despite fairly high spatial variability (Houser and Richardson 2010) and temporal variability (Kreiling and Houser 2016) across the UMRS riverscape, which may suggest either relative stability of phosphorus or that we included appropriate covariates (Table 2) to accurately predict TP. Although nutrients are known to vary spatially within the UMRS (Houser and Richardson 2010; Houser et al. 2022), our prediction errors did not show spatial patterns (Fig. 5), indicating the model's nutrient predictions were not spatially biased. The variables TN, TP, and velocity have strong seasonal patterns in this northern-latitude river (Jankowski 2022), and this seasonal variability may add to the predictive capacity of interpolation. Water quality variables that lack spatiotemporal autocorrelation due to season or other factors may have greater interpolation error. Although we hypothesized that nutrient concentrations would be harder to predict accurately in side channel habitats, given their complexity compared to the main river channel, this was not supported by the model prediction errors (Fig. 5). The random forests' error rates showed low spatiotemporal variability within multiple river channels (Fig. 5), indicating high interpolation performance across many habitat types and large spatial scales (at least 60 river km).
In contrast, water velocity had relatively lower predictive capacity for several possible reasons. Interpolation in other applications can substantially reduce concerns about nonrandom missingness (Little and Rubin 2002), but not in this dataset, where water velocity was commonly missing nonrandomly. The random forests interpolation is a convex method (Table 2) and thus could not accurately predict outside the bounds of the measured velocity values, including the (nonrandomly) missing, extreme values that are common and difficult to measure in the UMRS. Water velocity can respond quickly to ongoing changes in discharge and to underwater features like aquatic plants or the engineered structures common in the Mississippi and Illinois Rivers; therefore, interpolations may not properly capture the dynamics of discharge and the fine spatiotemporal scale of velocity. Furthermore, velocity had higher field measurement error than TN and TP (Soballe and Fischer 2004), and this measurement error could propagate through the interpolation and increase prediction errors.

Strengths and limitations of random forests
Until recently, random forests had limited use in the aquatic sciences despite their many strengths and broad applicability (Olden et al. 2008; Tyralis et al. 2019). Beneficial properties of random forests for interpolating water science data include relatively fast computation (Wright and Ziegler 2017) and an extensively studied theoretical basis; the models are nonparametric, can model nonlinear dependences among variables, can handle noisy, correlated, and mixed types of predictor variables (e.g., continuous and categorical), and are effective with high-dimensional datasets (Biau and Scornet 2016). Random forests do not necessarily require large data to be accurate (Biau and Scornet 2016), so many ecological monitoring datasets may not be data-limited. Random forests have been applied globally to a variety of aquatic systems and topics (Tyralis et al. 2019), yet seem underutilized to date. In this study, random forests also had advantages over the alternative methods, such as being global and multivariable (Table 2). Interestingly, the "global, data-driven" random forests model outperformed the "local, space-time" interpolation models with this dataset, as hypothesized.
Random forests have additional applications that are beyond the scope of these results but noteworthy. Random forests can identify and rank the variables of importance for making predictions (Tyralis et al. 2019), which can aid ecologists in recognizing which measured variables to retain in long-term monitoring, in recognizing whether unmeasured (latent) variables exist and caused low prediction accuracy, and in determining ecological relationships. Another useful application examines random forests prediction errors across space and time (e.g., for particular aquatic habitats or time periods), which may reveal habitat heterogeneity, spatiotemporal variability and dynamics (Chiao et al. 2012; Louvet et al. 2016; Vizcaino et al. 2016) and detect change under a range of simulated scenarios (Holloway-Brown et al. 2021). Random forests outputs can also link conceptually to places with high ecological resiliency or risk of ecological state transitions, which could inform aquatic management and restoration priorities (Delaney and Larson, in review).
Like all models, random forests have limitations. A substantial barrier has been the difficulty for ecologists to implement and interpret machine learning (Olden et al. 2008) and uncertainty about how to report error from interpolation (Yanai et al. 2018). Random forests models are convex (Table 1) and cannot interpolate outside the bounds of the training set, which is problematic for nonrandomly missing, extreme values and statistical outliers. In our data, velocity was always underpredicted at actual values > 1.5 m s−1. Future velocity interpolations with random forests may be improved by including other measures, like discharge and main channel connectivity, in addition to the predictors in Table 2. Alternatively, velocity may be better estimated using two-dimensional hydraulic models instead of interpolation models. TP was also underpredicted at extremely high values, so caution is warranted for those estimates and inferences. Finally, scientists should not presume random forests to be the best interpolation method for all types of problems and datasets, so intercomparisons of multiple methods are appropriate for novel datasets.
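This convexity limitation is easy to demonstrate: a regression forest trained on values up to some maximum can never predict above it, because every tree prediction is an average of training responses in a leaf. A minimal sketch with synthetic data (scikit-learn for illustration; the study's models were fit in R, and the velocity-like values here are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
# Hypothetical "velocities": training data span 0-1.5, echoing the range
# above which the study's velocity predictions were always too low.
x = rng.uniform(0.0, 1.5, 300).reshape(-1, 1)
y = 2.0 * x.ravel()                      # known linear relation, max(y) ~ 3.0

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x, y)

# Query well outside the training range: the true value is 2.0 * 2.5 = 5.0,
# but tree ensembles average leaf means, so the prediction cannot exceed max(y).
extrapolated = rf.predict(np.array([[2.5]]))[0]
```

The forest returns a value capped near the training maximum rather than the true value, which is exactly the systematic underprediction of extreme, nonrandomly missing measurements described above.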

Discussion
The method that authors choose to address missing data is important. The chosen method can be accurately predictive and offer an opportunity to expand big datasets (as in this study), cause significant analytical errors and faulty conclusions (Beheim et al. 2021), or yield differing parameter estimates in subsequent modeling (Song et al. 2016). Therefore, careful consideration and intercomparison among methods for addressing missingness are imperative. Our results are not a critique of the existing UMRS water quality dataset; rather, they show opportunities to properly address the inherent missingness expected in any long-term data and to use interpolation to extend a dataset. Our results should not change the findings or interpretations of the many previous UMRS-specific studies that used this dataset for reach- and strata-level inferences as intended in the experimental design (Soballe and Fischer 2004). Our interpolated dataset opens many opportunities for new ecological analyses, particularly site-level analyses and the creation of continuous spatial layers for use in geographic information systems. It now allows rigorous site-level analyses of complex, multivariate associations that were not previously possible given the frequent missingness of a dataset initially intended to answer questions at the strata and reach levels.

Fig. 5. … predicted values derived from a random forests interpolation model for three water quality variables: TP (a), TN (b), and water velocity (VEL, c). We hypothesized that absolute error rates could present patterns within a reach based on spatial and habitat heterogeneity, but the lack of patterns shows the random forests predictions were spatially unbiased.
Future models could attempt to address error propagation from the interpolation into the next analytical model. Typically, scientists use a two-step procedure: first, an interpolation model predicts the missing covariates with some estimates of uncertainty and error, and then those predictions and associated errors are used as input to a secondary analytical model (e.g., an ordination or regression for ecological analysis). This two-step approach is practical but not optimal because uncertainty is not transferred between the two models and error is not accounted for in the second model's outputs.
The inherent limitation of interpolation is that the actual, measured values are still missing; the interpolation merely creates simulated, interpolated estimates. The concerns and consequences of having interpolated rather than actual estimates in any dataset are context- and user-dependent. Developers of aquatic sampling protocols would benefit from considering whether they require the physical collection of (often costly) limnological data to obtain actual values, or whether lower cost, interpolated data are sufficient. Cross-validation procedures that evaluate model performance (e.g., the train-test split procedure and comparisons of each model's RMSE and MAE in this study) are important decision tools for choosing between actual and interpolated data. Data analysts of large datasets should consider the sample sizes of the missingness, whether data are missing at random, and how interpolated estimates might affect the ecological analyses and interpretations. Past published analyses that applied listwise deletion to datasets with substantial missingness should consider interpolating and re-analyzing the data to determine whether the statistical or ecological inferences would change.

Comments
Future published works with long-term and large datasets should clearly articulate the steps and considerations they took to address missing data. Unfortunately, many ecological papers fail to mention missingness and solutions despite the many associated problems. At a minimum, scientific methods should address: How much data were missing? Were data missing at random or nonrandomly? Which methods (e.g., listwise deletion, imputation, interpolation, etc.) were compared for dealing with missingness, and how was the final interpolation method chosen? Are the analysis codes archived, available, and reproducible (Broman et al. 2017)?
The "top-performing method" will likely vary according to each dataset and its context. Data variation can impact the accuracy of all interpolation methods (e.g., see Table 3 for a comparison of the three variables), and the magnitude of that impact is often method-dependent. However, our results showed that machine learning is a powerful and underutilized tool for interpolating missing data. Scientists should attempt multiple interpolation methods when dealing with their own missing data, and random forests algorithms and several accuracy metrics (like RMSE and MAE) should be included in their repertoire for intercomparison.
In the current era of "big data," which is typically not limited by computing resources, interpolation becomes an imperative step prior to ecological analyses. Interpolation literature tailored toward aquatic ecologists is available as a guide in this statistical realm (Olden et al. 2008; Tyralis et al. 2019; this paper). Code sharing like ours herein (Larson et al. 2023) allows ecologists to test new ecological questions within the UMRS, as well as to readily test and compare methods using their own datasets from any aquatic system. Interpolation can minimize information loss and maximize the accuracy of ecological inferences and forecasts, which is exceptionally important for water resource science and management.

Data Availability Statement
All water quality data are publicly available through the Upper Mississippi River Restoration Program's Long-term Resource Monitoring (https://umesc.usgs.gov/ltrm-home.html). The data analysis scripts and data are permanently archived on the U.S. Geological Survey's Open Source GitLab (https://code.usgs.gov/umesc/ltrm/interpolating-missing-water-quality-data) and Larson et al. 2023 (DOI: 10.5066/P9ZR7BWL). Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.