A robust data‐worth analysis framework for soil moisture flow by hybridizing sequential data assimilation and machine learning

As the collection of soil moisture data is often costly, it is essential to implement data‐worth analysis in advance to obtain a cost‐effective data collection scheme. In previous data‐worth analysis, the model structural error is often neglected. In this paper, we propose a robust data‐worth analysis framework based on a hybrid data assimilation method. By constructing Gaussian process (GP) error model, this study attempts to alleviate biased data‐worth assessments caused by unknown model structural errors, and to excavate complementary values of multisource data without resorting to multiple governing equations. The results demonstrated that this proposed framework effectively identified and compensated for complex model structural errors. By training prior data, more accurate potential observations were obtained and data‐worth estimation accuracy was improved. The scenario diversity played a crucial role in establishing an effective GP training system. The integration of soil temperature into GP training unraveled new information and improved the data‐worth estimation. Instead of traditional evapotranspiration calculations, the direct inclusion of easy‐to‐obtain meteorological data into GP training yielded better data‐worth assessment.


INTRODUCTION
During the past few decades, soil moisture measurement technology has experienced a surge in development. Direct observations with ground instruments (Li, Shi, Zha, Wang, & Hu, 2018;Walker, Willgoose, & Kalma, 2001) and indirect observations with remote sensing techniques (Crow & Wood, 2003;Wagner, Lemoine, & Rott, 1999) offer unique opportunities to advance soil hydrology. Massive data of different types, scales, frequencies, and accuracies are accumulating at an unprecedented rate, which hinders the efficient use Vadose Zone Journal 2016; Wang et al., 2018). A typical data-worth framework consists of prior, posterior, and preposterior stages (Dai, Xue, Zhang, & Guadagnini, 2016). The preposterior analysis evaluates the anticipated worth of future observations, for which possible distributions are predicted in advance by conditioning on prior data.
As recognized in the literature (Beven, 2005;Oreskes, Shrader-Frechette, & Belitz, 1994;Shi, Song, Tong, Zhu, & Zhang, 2015;Xu, Valocchi, Choi, & Amir, 2014;Zha, Zhu, Zhang, Mao, & Shi, 2019), model predictions are inherently uncertain due to the uncertainties from input data, model parameters, model structures, missing physics, and numerical implementation (numerical algorithms, spatial and temporal discretization; Beven, 2005). Once data are used to calibrate a soil moisture model, it is commonly assumed that the model has a fixed form of presentation. In our previous study , we assessed the data worth of observations in a sequential way by allowing soil hydraulic parameters to be updated. In this sequential data-worth analysis framework, only the most informative observation is selected, which can effectively avoid data redundancy and reduce the cost of data collection (Dai et al.2016;Man et al., 2016). Data-worth analysis can also prevent creating an extra uncertainty from excessive observations in data assimilation system . We found that potential observations that will be collected in the future and generated based on prior available data may obviously deviate from actual measurements due to the existence of model structural errors. The accuracy of dataworth evaluation degrades significantly when facing complex real-world conditions, especially for the mean-covariancetype index (e.g., relative entropy). It is necessary to remove potential adverse effects from inaccurate representation or omission of physical processes at the preposterior stage.
A few approaches have been proposed to deal with model structural errors within the framework of data assimilation, such as inflation of the background covariance (Anderson & Anderson, 1999;Hamill & Whitaker, 2005) and bias correction methods (De Lannoy, Houser, Pauwels, & Verhoest, 2007;Drécourt, Madsen, & Rosbjerg, 2006). Nevertheless, in both "covariance inflation" and "bias estimation" approaches, model structural errors are inclined to interact with other sources of model error due to the introduction of total or lumped model error (Pathiraja, Moradkhani, Marshall, Sharma, & Geenens, 2018;Zupanski & Zupanski, 2006). Furthermore, "covariance inflation" approaches require ad hoc tuning factors, which are difficult to determine. Almost all "bias estimation" methods rely on the assumption that the model error covariance is proportional to the state error covariance (Pauwels & De Lannoy, 2015).
To circumvent these drawbacks, a family of statistical learning techniques was proposed to build an error model from an inductive, data-driven perspective. Artificial neural network models were trained to forecast the residual of Core Ideas • A new data-worth analysis framework was proposed. • The new hybrid approach can alleviate biased data-worth assessment caused by model structural error. • The hybrid method offers an effective approach to excavate complementary value of multisource data.
conceptual rainfall-runoff models (Abebe & Price, 2003). Demissie, Valocchi, Minsker, and Bailey (2009) developed a framework to handle systematic error in a physical-based groundwater model, in which several error-correcting, data-driven models were considered. Xu and Valocchi (2016) adopted a fully Bayesian approach that integrates a Gaussian process (GP) error model with uncertainty analysis of groundwater flow. In Zhang et al. (2019), we proposed a sequential data assimilation scheme by hybridizing an iterative ensemble Kalman filter and GP (EnKF-GP). The main advantage of these statistical learning approaches lies in not requiring explicit representation of the model residual distribution. Instead, they learn complex relationships between the dependent variable (i.e., model structural error) and select predictors from historical data. Therefore, they are good candidates to statistically characterize the model structural error.
In this study, we further integrate the hybrid data assimilation approach  into our previous sequential data-worth analysis framework . Once the potential predictions are obtained, data-worth analysis is implemented to quantify the worth of alternative monitoring strategy (in terms of observation location, frequency, data type, etc.). Consequently, the most informative observations can be collected to reduce monitoring costs. The objective of using this hybrid approach is to alleviate the possible damage of model structural error on data-worth assessment, so that a more robust data-worth analysis can be made in complex environments.
In our previous paper , only information directly related to soil water movement was used to train the GP system. However, with the innovation of measurement technology, multisource data can now be collected simultaneously. Some variables (e.g., soil temperature) can be easily observed, and these indirect observations may contain valuable information on soil moisture dynamics. This study further quantifies the worth of several indirect data by not involving them in the equation. We hope to relax the requirement of physical model equations by showing that direct (soil moisture) and indirect data (soil temperature and meteorological data) can both make important contributions to data-worth estimation under unresolved model inadequacy.
In this context, Section 2 presents the principles of the modified restart EnKF, GP, and hybrid data-worth analysis framework. The experimental data from Falkenberg Station of the Meteorological Observatory Lindenberg-Richard Aßmann Observatory (MOL-RAO) network (from the International Soil Moisture Network [ISMN]) and model setup are described in Section 2.5. Section 3 presents a set of examples to demonstrate the ability of the proposed framework to improve data-worth estimation accuracy (covering aspects of scenario diversity, prior data content, training input augmentation, and replacement of evapotranspiration calculation). Finally, conclusions are drawn in Section 4.

MATERIALS AND METHODS
In Zhang et al. (2019), two variants of a hybrid data assimilation method (EnKF-GP) were proposed, including no feedback of error correction for model reinitialization (named EnKF-GP1) and feeding back error-corrected state variables for model reinitialization (named EnKF-GP2). This paper uses the EnKF-GP2 approach during data-worth analysis because of its simpler implementation.

Governing equation of one-dimensional soil moisture flow
Herein, one-dimensional soil water movement is considered. The flow is described by Richards' equation (Richards, 1931): is the spatial coordinate, oriented positively downward; and K [L T −1 ] is the unsaturated hydraulic conductivity. The solution of Richards' equation requires the knowledge of the unsaturated conductivity and soil moisture vs. hydraulic head. The van Genuchten-Mualem model (van Genuchten, 1980) is used to describe these constitutive relationships: where θ s [L 3 L −3 ] and θ r [L 3 L −3 ] are the saturated and residual moisture content, respectively; α [L −1 ] and n [-] are the shape parameters of the soil water characteristic curve; K s [L T −1 ] is the saturated hydraulic conductivity; and S e is the effective saturation. It is noted that heat transfer equation is not included in this study, even though temperature data provide an auxiliary data source during data-worth assessment. The Richards' equation is solved by the Ross method (Zha, Shi, Ye, & Yang, 2013).

The modified restart ensemble Kalman filter
The classical EnKF cannot guarantee the physical consistency in nonlinear system after updating (i.e., the updated soil moisture content and soil hydraulic parameters do not follow the Richards' equation). Such inconsistency may cause severe performance deterioration of data assimilation in a strongly nonlinear soil water problem. In this study, the modified restart EnKF (Song, Shi, Ye, Yang, & Navon, 2014) that has better physical consistency is used to implement the coupling of the GP regression and the EnKF, while retaining the computational cost at an acceptable level. Here, we give a brief introduction to this filter, and more details can be found in Sun, Wang, and Xu (2014). The parameter vector of interest, p k (e.g., soil hydraulic parameters), is augmented with the state variable vector, s k (e.g., soil moisture), into a joint state vector, y k = [p k , s k ] T , at time t k (k is the time step index). A collection of N 1 members of the state vector Y k can be written as For a set of observations obs available at time t = t k , their relationship with the true but unknown state variable true and the true augmented state vector true can be expressed as obs = true + ε = true + ε where k represents observation error, which is assumed to be zero-mean Gaussian with a covariance of D = E(ε ε T ); the matrix H represents the observation operator, which relates the state vector y and observation vector d obs .
For any ensemble member i at time t k , the state vector can be updated by combining model predictions and observations, using where i = 1, 2, …, N 1 ; the superscripts "f" and "a" refer to forecast and analysis, respectively; the subscript i is the ensemble member index; f , represents the model forecast for realization i at time t k based on information at time t k−1 ; a , is the model analysis given by conditioning on the observations at time t k ; and K k is the Kalman gain, which is given by where the covariance matrix at time t k , and f can be estimated by where f refers to the ensemble mean of f . After assimilating the observation obs , the modified Restart EnKF reruns the forward model from time 0 to the current time t k with the ensemble mean of updated parameter p k at time t k (Song et al., 2014). The new mean of state variables is given by where 0→ represents the model simulation from time 0 to t k ; the superscript re refers to the rerun; and E(*) denotes the expectation.
Then, a new ensemble of the state variables is rebuilt based on E( re ) in two steps: • Step 1. Calculate and save the fluctuations of ensemble realizations around their means: • Step 2. Construct the new ensemble members by imposing the updated mean on the fluctuations of the forecast ensemble: a,re Similar procedures are repeated until the final time. The readers may refer to Sun et al. (2014) for further method details.

The construction of a data-driven structural error model with Gaussian process training
On the basis of studies of Kennedy and O'Hagan (2001), Xu and Valocchi (2016), and Xu, Valocchi, Ye, and Liang (2017), the model structural error, e(x, p), can be expressed as an additive term: where the error model input x may consist of the physicalbased model output G(p) and other relevant information in addition to time and location of the quantity of interest. This allows for training data that are not directly used to construct the physical model G.
x is a 1 × d vector where d is the input dimension. The value of d is determined by the number of data types we attempt to fuse. Moreover, is a vector of the hyperparameter of the error model. Here, is the error term (more details are presented below).
A GP is fully specified by its mean function μ = E[e(x)] and covariance function k( In this study, a linear mean function μ(x) = T x is selected somewhat arbitrarily. In this function, is the linear coefficient. An isotropic squared exponential covariance function is given by where σ 2 and λ are two hyperparameters: σ 2 controls the marginal variance of e(x); and λ is the characteristic length scale hyperparameter. Inferences about all of the hyperparameters can be made in the light of training data (Rasmussen & Williams, 2006) } denote a set of n training data. X and G represent, respectively, observations and physically based model outputs at different locations and times, and X is a n × d matrix. Therein, G(p) q (q = 1, 2, …, n) is the simulated soil moisture of the qth training data. Since by assumption the distribution of the data is Gaussian, the log marginal likelihood is given by where the covariance matrix is calculated using Equation 15, and its ijth entry is

Vadose Zone Journal
Based on the above training points, the posterior distribution of the model structural error, e*, can therefore be derived for m arbitrary future input where * , = ( , * ) and * * , = ( * , * ); * represents the mean μ(X*, ); * and C ee (e*) represent the posterior mean and covariance of e*, respectively.
Finally, the predictions of state variables of interest, y*, can be calculated using where G*(p) is the physical-based model output corresponding to X*. It is worth emphasizing that the error term includes C ee in addition to the measurement error ( )  i.e., = C ee + ). We can see from Equation 20 that the variance of posterior state vector becomes larger due to the additional introduction of uncertainty from the GP error model. However, a larger variance is not necessarily adverse in sequential data assimilation. On the one hand, the inclusion of uncertainty from the GP error model can improve the underestimation of model error due to the ignorance of model structural uncertainty. On the other hand, the additional error can alleviate the filter inbreeding problem, which is commonly accounted for by inflating the state covariance (Hendricks Franssen & Kinzelbach, 2008;Yu et al., 2019). In our study, the GPML MATLAB toolbox version 4.1 documented in Rasmussen and Williams (2006) was used to carry out all GP training.

The hybrid sequential data-worth analysis method with gaussian process
The new hybrid data-worth analysis framework can be divided into three stages as follows.

Prior stage
At the prior stage (from time zero to time T p ) the integration of the modified Restart EnKF and GP is implemented sequentially (i.e., like EnKF-GP2 in Zhang et al., 2019, as shown in Figure 1). We take the simulation at time t k (0 < t k ≤ T p ) as an example to show the calculation procedure: Repeat these procedures until time T p . After sequentially assimilating all prior data A = { obs 1 , 2 , … , obs }, the updated parameter ensembles, a , are used to rerun the forward model from 0 to T t (T t is the total simulation time, and T t > T p ). Once re , (i = 1, 2, …, T t ; j = 1, 2, …, N 1 ) is yielded again, N 1 new GP models based on n = T p N obs training data can be built to make predictions of e* after time T p . In other words, a set of N 1 hypothetical observations are generated and denoted as , = * + E( re ) + δ (i = T p + 1, T p + 2, …, T t ; j = 1, 2, …, N 1 ) via Equation 20. N 1 should be large enough to include sufficiently diverse potential observations. In this study, N 1 is set as 200 to compromise between the computational cost and accuracy, as suggested by Man et al. (2016).
The difference between the traditional and hybrid dataworth analysis approaches is the procedure for generating potential observations during this stage. The traditional method does not consider model structural error but directly implements EnKF to assimilate all available prior data A, and then utilizes the updated physical parameters to generate potential observations B. The proposed hybrid method constructs the GP error model with the returned open-loop results based on the updated parameter mean via the modified restart EnKF. The learned model error by GP regression is used to compensate the current state vector to avoid or alleviate its adverse effect on the updated physical parameters. In this case, more stable parameters are utilized for predictions to obtain B.

Preposterior stage
For each possible data B i (i = 1, 2, …, N 1 ) at any time t k (T p < t k ≤ T t ), firstly we generate N 2 realizations satisfying a Gaussian distribution with a mean of B i , whereas the variance is the measurement error. Next, the EnKF is implemented through a set of N 2 Monte Carlo realizations for each of the N 1 hypothetical observations. This allows us to calculate the updated ensemble mean E(Y|A, B) and covariance Cov(Y|A, B), jointly conditioned on {A, B}. Quantities E B|A E(Y|A, B), E B|A Cov(Y|A, B), and Cov B|A E(Y|A, B) are then yielded by averaging over the collection of N 1 × N 2 realizations.

Posterior stage
At this stage, the real observations B ′ corresponding to the potential observations B have already been collected. Now, the real (i.e., reference or posterior) data worth can be evaluated in a sequential manner. The comparison of the expected and reference data worth reveals the effectiveness of the data-worth analysis framework.
The predictive statistics in the prior and preposterior analysis have the following theoretical relations (Neuman et al., 2012): Cov B|A E(Y|A, B) is the reduction of uncertainty due to the addition of future possible data B. For ease of presentation, these quantities can be denoted as Following Xu (2007), Zhang et al. (2016), and Man et al. (2016), a mean-covariance-type scalar measure, the relative entropy (RE), was introduced to quantify data worth in our study. As a signal-dispersion combined index, RE provides a measure of the information content of an analysis (posterior) probability density function (pdf) with respect to a background (prior) pdf. Considering that the background and the analysis pdfs are n-dimensional Gaussian functions, the relative entropy can be expressed as where J b is the signal (mean) part of the RE, and DP is the dispersion (covariance) part of the RE. In addition, when calculating the reference data worth, C 2 and E 2 refer to Cov(Y|A, B ′ ) and E(Y|A, B ′ ), respectively.

Data source and site description
In situ soil moisture observations, precipitation data, and soil temperature data were obtained from the ISMN. The Falkenberg station (52.1669 • N, 14.1241 • E; as depicted in Figure 2) of the MOL-RAO network was selected as the study site. At this grassland site, TRIME-EZ TDR sensors (IMKO) were installed to measure soil moisture at six different depths (0.08, 0.15, 0.30, 0.45, 0.60, and 0.90 m), as depicted in Figure 2a. Here, we used 700 d of observations from 1 Jan. 2004 to 30 Nov. 2005 as the study period. The daily precipitation and potential evapotranspiration data are presented in Figure 2b. The soil texture and saturated soil water content of the topsoil (0-30 cm) and the subsoil (30-100 cm) were also downloaded from the ISMN. In topsoil, the proportions of clay, sand, and silt are 6, 73, and 21%, respectively, while being 12, 61, and 27% in subsoil. The saturated soil water contents in topsoil and subsoil are 0.43 and 0.40 cm 3 cm −3 , respectively.
Meteorological data including air temperature at 2-m height, specific humidity at 2-m height, and all sky insolation incident on a horizontal surface downloaded from the NASA Prediction of Worldwide Energy Resources (NASA POWER) are shown in Figures 2d, 2e, and 2f, respectively.

Model simulations and evaluation setup
The main parameters of the study are listed in Table 1. The soil column has a height of 1 m and is discretized with 53 grids (the grid size is mainly 2 cm, with a local refinement of 1 cm at observation depths of 0.15 and 0.45 m). The top and bottom boundaries are set as an atmospheric boundary and a constant water content boundary (being equal to 0.35), respectively.
The van Genuchten-Mualem model is used to describe the constitutive relationships of soil hydraulic parameters in Richards' equation. Therein, α, n, and K s are assumed to be unknown. Based on the soil texture classification, the prior values of soil parameters are estimated using the Neural Network Prediction module of HYDRUS-1D (PC-Progress, https://www.pc-progress.com/en/Default.aspx?hydrus-1d), whereas variances of α, n, and K s are specified empirically.
Considering that soil hydraulic parameters generally satisfy lognormal distributions, the given mean and variance were converted to logarithmic form, as shown in Table 1. Subsequently, N 1 samples are generated as the initial parameter ensemble. In addition, θ r is assumed to be invariable (θ r = 0.04), which is also estimated with HYDRUS-1D. The soil moisture measurement error is set to be 0.03 cm 3 cm −3 . We deliberately introduced two categories of model structural error source: (a) we simplified the double-layered soil column to a homogeneous one, having properties that were consistent with those of the original topsoil; and (b) we neglected the surface cover of grass and assumed it was bare soil, when calculating evapotranspiration. The GP error model is constructed to describe the model structural error in soil moisture modeling. The input x of the GP includes observation depth, precipitation, evapotranspiration, and simulated soil moisture. Other candidate information, such as meteorological data and soil temperature, will also be considered for specific research purposes in different test cases. The output of the GP error model is the bias between the simulated and observed soil moisture. The prior values of GP hyperparameters are given in Table 1.
Here, to quantify the ability of capturing actual observations with potential observations, the RMSE of the two is calculated: where θ f represents the mean of forecasted potential soil moisture samples at time t k ; θ Obs is the corresponding observed value; and N t is the total duration of the simulation period at the preposterior or posterior stage (i.e., N t = T t -T p ). Within our sequential data-worth analysis framework, the value of the reference data worth depends on C 1 and E 1 as well as C 2 and E 2 . At the prior stage, the traditional method is to directly use the EnKF to sequentially assimilate the prior data to obtain C 1 and E 1 , whereas in the new data-worth analysis framework, C 1 and E 1 are calculated by hybridizing the modified restart EnKF and the GP. It is clear the produced C 1 and E 1 values from these two approaches are different. Even though the same observations are assimilated at the Vadose Zone Journal F I G U R E 3 A comparison of RMSE between actual soil moisture observations at various depths (0.08, 0.15, 0.30, 0.45, 0.60, and 0.90 m) and corresponding predictions with the traditional and new hybrid methods, respectively posterior stage in both methods, they yield different values of reference data worth. To compare the relative differences of data-worth estimation accuracy using these methods, the mean absolute percentage error (MAPE) between expected and reference RE is defined: where RE expect and RE refer represent the expected and reference relative entropy at time t k , respectively.

Validation of the hybrid data-worth analysis framework
In the first case (denoted C1), data between 25 Apr. and 1 Oct. 2005 (highlighted by blue rectangle in Figure 2) were used to verify the validity of the hybrid method. The data of first 120 d (the left of the blue dotted line in Figure 2) serve as the prior data (also the training data for GP), whereas the remaining data of the following 40 d (the right of the blue dotted line) are used to test the accuracy of soil moisture predictions after GP training. The potential soil moisture observations during the next 40 d can be predicted by including the GP correction. For ease of comparison, RMSE values between the observed and predicted soil moisture at 0.08 m during the whole preposterior stage (i.e. from the 120th to the 160th day) are presented in Figure 3. The significantly reduced RMSE values for the hybrid method indicate the greater efficiency of the hybrid method in reproducing the real observations. A comparison of the matching degree of expected and reference data worth based on traditional and hybrid methods is shown in Figure 4. Only data worth of the soil moisture at 0.08 m is presented, and the data worth is quantified using RE. Because the values of expected and reference RE differ in some cases by a few orders of magnitude, the logarithmic RE [ln(RE)] is shown. It is seen that the expected and reference data worth from the hybrid approach have a much better match than those of the conventional method, for both parameter identification and soil moisture profile estimation. Thus, the integration of the GP within our sequential data-worth analysis framework led to marked improvement of data-worth estimation accuracy.

Scenario diversity of training (or prior) data
As defined in Zook et al. (2012), "scenario diversity is a measurement of variations among the content of scenarios." The given scenario of the training data may reflect the temporal variability of daily precipitation, evapotranspiration, and other conditions. In this study, if a case experiences precipitation and evapotranspiration events of various intensities, we call it "diverse." A rich scenario diversity may be critical for GP training since machine learning method (GP regression here) is apt to obtain high prediction ability when it is trained by diverse input data. The impact of scenario diversity at the prior stage on the estimation of the distribution of potential observations and corresponding expected data worth is explored in this section.
The success of data-worth estimation using the proposed hybrid approach in C1 should probably be attributed to the diverse scenario of the training data. We compared the performance of the hybrid method, when the prior stage and the preposterior stage had totally different scenarios (herein, rainfall). Case 2 (C2) was designed to test this effect: at the prior stage (training data), the first 120 d between Figures 5a and 5b present a comparison of actual soil moisture measurements and corresponding predictions based on the traditional method, as well as our hybrid method for case C2. Only the results for depths of 0.08 mand 0.30 m are presented. It is seen that the potential observations deviate from actual observations, even when using the hybrid method. The continuous drop in soil moisture content during the preposterior stage is not experienced during the training period, and the GP seems unable to make a proper correction based on the Vadose Zone Journal F I G U R E 4 A comparison between the expected and reference data worth [the logarithmic relative entropy, ln (RE)] of soil moisture at 0.08 m regarding (a, b) parameter identification and (c, d) soil moisture profile predictions based on the traditional method and our hybrid approach, respectively. MAPE, mean absolute percentage error historical scenarios. Subsequently, the accuracy of data-worth evaluation is not satisfactory, as shown in Figures 6a and 6b).
We extended the training period from 120 to 400 d between 20 Feb. 2004 and25 Mar. 2005 (the left of the green dotted line in the green rectangle in Figure 2). This new case is labeled as C3. As shown in Figures 5c and 5d, the soil moisture predictions after GP training are in better agreement with the actual measurements, especially at depths of 0.08 m. The expected data worth also approached the reference counterpart (Figures 6c and d). The results demonstrate that inclusion of more prior data in the hybrid method can better reduce the negative influence from the model structural error. However, there is a potential failure of the hybrid method, if contrasting scenarios occur between the prior stage and the posterior or preposterior stages. From the perspective of model robustness, it is suggested to include datasets covering diverse scenarios during the training stage. Furthermore, the comparison between Figures 4b and 4d and Figures 6c and 6d reveals that the use of data from 120 d prior (in C1) performed comparably in estimating data worth with using data from 400 d prior (in C3). This indicates that having scenario diversity in the prior training data is of more importance than having a large data volume.

Augmentation of soil temperature data into the gaussian process training
In C3, although the prior (or training) data has been augmented from 120 to 400 d, there is still a large deviation between the predicted and observed soil moisture in deeper soils (e.g., at depths of 0.30 m), as shown in Figure 5d. In this section, we further improve the accuracy of data-worth estimation at these depths by using multiple data sources. One major advantage of the GP approach is that the inputs of the GP can take into account a variety of relevant data observations and corresponding predictions with the traditional method, hybrid method, and the hybrid method-T where the Gaussian process error model input x is augmented with soil temperature that are not directly used to construct the physical model (Xu et al., 2017). Here, we incorporate soil temperature data that were not directly related to the soil moisture equation into the training input x of Equation 14, and we investigate the benefit of training with additional soil temperature data on the prediction of potential observations and data-worth analysis. We then designed one more case, in which soil temperature (as shown in Figure 2c) was augmented into the GP input, whereas other settings were identical to case C3. Thus, the GP input included monitoring depth, daily precipitation, evapotranspiration, soil temperature, and simulated soil moisture.
The RMSE values between observed and predicted soil moisture from the traditional method, the hybrid method only trained with soil moisture data (labeled as the hybrid method), and the hybrid method trained with soil moisture and temperature data (labeled as the hybrid method-T) are compared in Figure 7. The RMSE values dropped significantly at most observation locations, especially for the soil moisture at depths of 0.30 and 0.45 m. Recalling results of Figure 5d, it seems that these additional temperature data are helpful to improve the prediction of potential observations. Figure 8 compared the data worth of soil moisture observations (at a depth of 0.30 m) in C3 and the new case that included soil temperature data. Here, only the data worth regarding parameter identification are presented. It was seen that after joining the temperature into the GP training, the data-worth estimation accuracy improved, yielding a reduction of the MAPE from 0.8312 to 0.7267.
The improvements in both potential observation reproduction and data-worth estimation demonstrate the feasibility of using the GP to mine the information hidden in soil temperature. The effectiveness of jointly training soil moisture and temperature data is due to the physical connection between soil moisture and soil temperature. As depicted in Figure 9a, there is a strong correlation between soil temperature and soil moisture at each depth. The cross-correlation matrix between soil moisture and soil temperature in space (at six different observation depths) is also presented (Figure 9b). It is interesting to see that soil moisture and soil temperature have a strong correlation across the whole soil profile. The complementary role of soil temperature data in soil moisture dynamics and soil properties estimation was investigated by Dong, Steele-Dunne, Judge, and van de Giesen (2015) and van de Giesen (2015, 2016). The intrinsic reason for such strong correlation is that soil thermal properties are a function of soil moisture, and the soil heat transfer process can be monitored to estimate soil moisture (Dong, Steele-Dunne, Ochsner, & van de Giesen, 2016). To be specific, soil-specific heat and volumetric heat capacity increase with increasing moisture content due to the gradual substitution of the gas phase in the soil pores by the liquid phase (Abu-Hamdeh, 2003). Our results demonstrate that the hidden value of soil temperature for soil moisture and parameter estimation can be excavated by GP regression without resorting to the coupled soil water and heat transport equation.
The correlation between soil moisture and soil temperature varies with depth ( Figure 9b), with the lowest correlations in very deep soil (at a depth of 0.9 m) and relatively low correlation in surface soil (0.08 m) and deep soil (0.6 m). The small correlation coefficient in surface soils may be because both soil moisture and temperature are susceptible to changes in external conditions. As the soil depth increases, the soil temperature is more influenced by the water temperature through direct heat transfer, whereas the soil moisture is less variable due to increasing difficulty of water vapor diffusion. It is noteworthy that the degree of correlation is in line with the degree of improvement in predicting potential observations by introducing soil temperature as additional training data. Thus, higher correlations at depths of 0.15, 0.30, and 0.45 m (Figure 9b) correspond to larger drops in RMSE values (Figure 7). The above analysis also explains the improvement in Figure 8b, suggesting that a universal value of soil temperature data can be expected. One important implication is that the data worth of one particular measurement program is not only determined by direct data described by the primary variables in the governing equation, but it also may be influenced by auxiliary data for variables that are not given in the equation.

Replacement of evapotranspiration calculation with the Gaussian process regression
Considering the flexibility of the GP in merging multisource data, we attempted to directly replace the Penman-Monteith cross-correlation matrix between soil moisture and soil temperature at each depth (0.08, 0.15, 0.30, 0.45, 0.60, and 0.90 m) equation with the GP, by introducing several easily available observations obtained from a regular weather station. Three new cases were designed, in which air temperature at 2-m height, specific humidity at 2-m height, and all sky insolation incident on a horizontal surface were respectively used to replace the evapotranspiration term in the GP input x. The approximate mapping from these meteorological data together with other basic information (e.g., monitoring depth, daily precipitation, etc.) was constructed by GP regression. For comparison, one additional case simultaneously considering all three meteorological observations was also designed. Other model settings in these four cases were identical to those of case C3. Figures 10a-10d depict a comparison of expected and reference data worth with the above four alternative GP inputs. Only the data worth of soil moisture at 0.30-m depth regarding parameter identification is shown. Overall, the MAPE values between expected and reference data worth in all four cases were smaller than that of of case C3 (having a MAPE value of 0.8312, as shown in Figure 8a). Training the GP with air temperature data brought about the most substantial improvement to the data-worth assessment, followed by training with radiation data, and then specific humidity data. By inspecting the correlation between the potential evapotranspiration (ET 0 ) and these three types of observations (with correlation coefficients of .53, .77, and .58, respectively), we found that a higher correlation corresponded to a larger MAPE reduction. These results surprisingly demonstrate that training a GP system with input data for the Penman-Monteith equation led to even better results than using a direct calculation of the Penman-Monteith equation. This may be because the grassland was oversimplified to bare soil, resulting in a large difference between the reality and modeling settings. To examine the modeling performance under Vadose Zone Journal F I G U R E 10 A comparison of expected (reference) data worth (relative entropy [RE]) of soil moisture at 0.30 m regarding parameter identification when the evapotranspiration term of Gaussian process input in C3 is replaced with (a) specific humidity at 2-m height, (b) air temperature at 2-m height, (c) all sky insolation incident on a horizontal surface, (d) simultaneously all three observations, and (e) a more accurate estimate of potential evapotranspiration (ET 0 ) with the Penman-Monteith equation by roughly considering vegetation information. MAPE, mean absolute percentage error more plausible settings, another case was designed using the true vegetation cover (grassland), whose root depth is assumed to be 30 cm and root density is vertically distributed evenly. Other settings were identical to those in C3. A comparison of expected and reference data worth of soil moisture at 0.30 m regarding parameter identification is shown in Figure 10e. In this case, even when more plausible settings for calculating evapotranspiration are provided, training the GP model with a direct calculation of the Penman-Monteith equation (MAPE = 0.7594) results in a poorer accuracy than that obtained directly with input data of this equation (Figure 10d). These results also imply that the evapotranspiration calculated with the Penman-Monteith equation often includes considerable uncertainties, which may arise from the equation form, as well as its associated model parameters. Meanwhile, limited by inadequate knowledge of the reality, we assume that soil evaporation only occurs within a depth of 0.4 m in this study. This excessive simplification is not in line with the fact that the depth range and vertical distribution of soil evaporation vary over time. Instead of directly calculating the Penman-Monteith method, we found that soil moisture at certain depths actually had a rather strong correlation with meteorological factors. For example, Figure 11 shows that soil moisture at the depths of 0.08, 0.15, 0.30, 0.45 m has a rather strong correlation with air temperature at 2-m height. In comparison with case C3 (Figure 8b), training the GP with all three types of data ( Figure 10d) still produces better accuracy of data-worth estimation. The improvement was larger than that obtained by training with specific humidity data alone. However, compared with cases using air temperature or radiation data, a degraded data-worth estimation accuracy was obtained by jointly using all three types of data.
This result seems contradictive to our common sense that the inclusion of multisource data can provide more information to improve our understanding of unknown systems (Corcoran, Knight, & Gallant, 2013;Shi et al., 2015). Figure 12a-12d shows the ensemble (black line) and mean (red line) of the hyperparameter (β) of the linear mean function after training for each input entry, for the above four GP input augmentation schemes. A comparison of Figures 12a-12c and Figure 12d leads to two findings: (a) integrating multiple meteorological data simultaneously results in a significant decrease in the absolute value of β with respect to each input entry, with the variation range being equal to 0.00-0.25 ( Figure 12d); and (b) depth information that is expected to have a high weight only has a very small weight, with a mean of 0.02, after all three types of meteorological data are merged. Similarly, rainfall is supposed to be an important factor for soil moisture estimation, whereas the weight of rainfall is close to zero in Figure 12d. These changes in hyperparameter values indicate that the excessive augmentation of input entry in the GP regression increases the difficulty of achieving global optimality. Under such circumstances, the performance of the trained GP system deteriorates. Since GP regression is a purely data-driven approach, it is still necessary to investigate criteria for choosing data for GP construction in the future.

F I G U R E 12
The ensemble (black line) and mean (red line) of the hyperparameter (β) of the linear mean function after training for each input entry, when the Gaussian process input is augmented respectively with (a) specific humidity, (b) air temperature, (c) all sky insolation incident on a horizontal surface (radiation), and (d) all three types of meteorological data, on the basis of monitoring depth, rainfall, and simulated soil moisture

CONCLUSIONS
In this study, a sequential data-worth analysis framework based on a hybrid data assimilation approach (EnKF-GP) was investigated. Compared with the conventional data-worth analysis approach, this hybrid method had two major advantages: (a) by introducing a GP training, the hybrid method was able to learn the model structural error from the prior data; and (b) the hybrid method was able to utilize multiple sources of data without involving all of them in the governing equation. Our research led to the following major conclusions: 1. Data worth relies on the amount of information that can be provided to support the modeling for a given purpose. Besides data characteristics (data collection location, frequency, and accuracy), model quality also affects dataworth analysis. The hybrid approach loosens the requirement of constructing perfectly sound conceptual models for hydrologic realities during data-worth assessment. 2. The comparison of cases C1-C3 demonstrated that the scenario diversity at the prior stage plays a crucial role in establishing effective GP training. It is suggested to include either sufficiently long or enough diverse prior data to avoid potential failure of the hybrid approach. 3. The data worth of one given measurement program is influenced by direct data and indirect data. The integration of soil temperature into the GP training input in soil water problems can unravel hidden information and consequently improve data-worth estimation. It also seems feasible to replace the Penman-Monteith equation with a GP system trained by critical weather data. However, integration of more types of data does not necessarily further improve the accuracy of data-worth estimation.
Further work may extend this study to two-or threedimensional variably saturated flow. Besides, only a few types of indirect data are considered in this paper. Our future interest is to integrate multiple different types of indirect data at various spatial scales.