The purpose of this work is to investigate the feasibility of downscaling Gravity Recovery and Climate Experiment (GRACE) satellite data for predicting groundwater level changes and, thus, enhancing current capability for sustainable water resources management. In many parts of the world, water management decisions are traditionally informed by in situ observation networks which, unfortunately, have seen a decline in coverage in recent years. Since its launch, GRACE has provided terrestrial water storage change (ΔTWS) data at global and regional scales. The application of GRACE data for local-scale groundwater resources management has been limited because of uncertainties inherent in GRACE data and difficulties in disaggregating various TWS components. In this work, artificial neural network (ANN) models are developed to predict groundwater level changes directly by using a gridded GRACE product and other publicly available hydrometeorological data sets. As a feasibility study, ensemble ANN models are used to predict monthly and seasonal water level changes for several wells located in different regions across the US. Results indicate that GRACE data play a modest but significant role in the performance of ANN ensembles, especially when the cyclic pattern of the groundwater hydrograph is disrupted by extreme climate events, such as the recent Midwest droughts. The statistical downscaling approach taken here may be readily integrated into local water resources planning activities.
 Groundwater is the primary source of drinking and irrigation water supplies in many parts of the world. Shallow groundwater influences 22–32% of global land area [Fan et al., 2013]. In the US, groundwater accounts for 20% of the total water withdrawal, providing more than 90% of water used for rural domestic supplies and 40% used for public supplies [Barber, 2009]. Freshwater aquifers are increasingly stressed by population growth and climate extremes, which are of particular concern for arid and semiarid regions that rely exclusively on groundwater. Worldwide, the pace of groundwater depletion has more than doubled during the last several decades, caused primarily by increasing water demand [Konikow, 2011; Wada et al., 2011, 2012]. Thus, to manage the limited groundwater resources in a way that will ensure equitable, sustainable, and economically prudent decisions, we must enhance our existing capability to monitor and predict water availability. Although in situ monitoring networks provide high-resolution estimates, remotely sensed data constitute the only source of information for assessing water resources in sparsely monitored regions [Alsdorf et al., 2007; National Research Council, 2008]. For this reason, the Gravity Recovery and Climate Experiment (GRACE) satellite mission has attracted much attention since its launch over a decade ago. GRACE observes temporal variations of Earth's gravitational potential which, after removing atmospheric and oceanic effects, is mostly caused by changes in terrestrial water storage or ΔTWS [Tapley et al., 2004].
 Remote sensing of groundwater storage changes cannot be done directly using the current technologies [Becker, 2006; Brunner et al., 2007]. The GRACE satellite tracks only ΔTWS, not changes in individual hydrologic components (e.g., surface water, soil moisture, and groundwater). By far, the most popular approach for inferring groundwater storage changes from GRACE ΔTWS has been to isolate and remove contributions of all other TWS components using either auxiliary information or land surface models. Many previous studies focused on comparing the groundwater storage change obtained by disaggregating GRACE data with that obtained from in situ data, with the main purpose of validating the accuracy of GRACE data at either regional or continental scales [e.g., Henry et al., 2011; Leblanc et al., 2009; Rodell et al., 2007, 2009; Scanlon et al., 2012a; Swenson et al., 2008; Syed et al., 2008; Yeh et al., 2006]. Practical application of GRACE data for local water resources management, especially nowcasting and forecasting, has been limited. A fundamental issue is related to the low spatial (>150,000 km²) and temporal (>10 days) resolution of GRACE [Rodell et al., 2007]. Another compounding issue is the quantification of uncertainties arising from postprocessing and disaggregating GRACE signals.
 The raw GRACE data are sets of spherical harmonic coefficients describing monthly variations of Earth's gravity field. Postprocessing needs to be done to remove systematic errors (i.e., destriping) and random errors in higher-degree spherical harmonic coefficients not removed by destriping [Landerer and Swenson, 2012]. The postprocessing operations, however, also cause signal degradation. Disaggregation of GRACE data into contributions of different hydrological components tends to introduce additional uncertainties, as summarized in Seo et al. . Quantification of the uncertainties arising from aggregating (in lateral directions) and disaggregating (in the vertical direction) different TWS components is challenging, especially when large-scale hydrological models are used or when in situ observations are averaged over large areas. In the former case, rigorous uncertainty analyses would require probability distributions of inputs, which are hard to assess in practice. Simple techniques for estimation under uncertainty might be useful. For example, Sun et al.  applied a robust least squares approach to estimate aquifer specific yield, which requires only bounds of the ΔTWS components.
 At the present time, incorporating GRACE data directly into local water management decision making activities, such as groundwater allocation and drought management, is unlikely to be fruitful because of the aforementioned challenges. But can we use the GRACE product indirectly to inform water management at subgrid scales? This question is equivalent to testing the feasibility of downscaling GRACE data.
 Dynamic downscaling techniques are model based [Fowler et al., 2007]. Recently, Houborg et al. assimilated GRACE data into the Catchment Land Surface Model using an ensemble Kalman smoother and forcing data from North American and Global Land Data Assimilation Systems Phase 2 (NLDAS-2). NLDAS (1/8° and hourly) consists of several land surface models that cover central North America and is constructed from gauge-based observed precipitation, bias-corrected shortwave radiation, and surface meteorology analysis fields. The results of Houborg et al. show statistically significant improvement in drought prediction skills over many parts of the continental US, highlighting the potential value of GRACE data. Sun et al. imposed GRACE observations as constraints when recalibrating a regional-scale groundwater model such that the model-simulated ΔTWS should agree with the GRACE-derived ΔTWS. In their case, the regional aquifer model was originally developed for long-term water management, and transforming the model for transient simulation at monthly intervals was challenging, requiring significantly more spatial and temporal information on groundwater use, recharge rates, and boundary fluxes.
 As an alternative to model-based downscaling, statistical downscaling techniques explore empirical relationships between large-scale variables (predictors) and local-scale variables of interest (predictands). The empirical relationship is usually given in the form of a nonparametric model, or transfer function, that is “learned” by performing linear or nonlinear regression on training data sets. Once trained, the transfer function provides a functional mapping between predictors and predictands that can be used to facilitate cross-scale information exchange. Compared to dynamic downscaling, statistical downscaling is data driven and is much less time consuming to develop [Fowler et al., 2007].
 The main hypothesis of this study is that the GRACE ΔTWS may serve as a useful predictor for water level changes, in the absence of continuous in situ water level observations. Revolving around this hypothesis, the objectives of this study are to (a) develop statistical downscaling models for incorporating GRACE ΔTWS to predict water level changes and (b) validate the usefulness of statistical downscaling, or the “data worth” of GRACE, for wells located in different geographic and climatic regions. Here data worth refers to the contribution of additional observations in reducing predictive uncertainty [Brunner et al., 2012]. In addition to local water resources management needs, this study is also motivated by concerns over the dwindling coverage of in situ monitoring networks, which have been traditionally used to gauge the wellbeing of aquifers, as well as to provide support for local water management decisions (observation wells that support aquifer management decision making are widely known as index wells). The spatiotemporal coverage of in situ monitoring networks depends critically on dispensable local resources. Currently, many water agencies are facing increasing constraints because of budget cuts [Nelson, 2012]. Moreover, even with in situ well networks in operation, holistic groundwater assessments can be difficult to perform on a systematic basis because political and aquifer boundaries typically do not coincide with each other [Rodell et al., 2009]. Thus, insights gained from this analysis by using the freely available GRACE product will be of broad interest to the groundwater community.
 The remainder of the paper is organized as follows. Section 2 describes data used and data processing techniques. Section 3 discusses methodologies used for statistical downscaling and forecasting. Section 4 shows results from two sets of downscaling tests. Test I looks at three wells located in different aquifers and climate regions in the U.S. Test II zooms into a single GRACE pixel and further validates the feasibility of GRACE downscaling for multiple sites at subgrid scale. Finally, section 5 summarizes main findings of this study.
2. Data and Data Processing
 The gridded GRACE product (1° or ∼110 km) used in this study is based on Release-5 of GRACE Level-2 data, which has improved accuracy over its precursors due to new data processing algorithms [Landerer and Swenson, 2012]. As mentioned before, filtering and truncation of GRACE spherical harmonic coefficients result in signal attenuation and leakage, which need to be corrected through leakage and bias correction. Landerer and Swenson  used scaling to restore signal amplitudes. The spatially distributed scaling field is obtained based on simulation outputs from the Global Land Data Assimilation System (GLDAS). The scaling approach does not seek to match GRACE measurements to GLDAS model amplitudes; rather, it uses synthetic model patterns to determine relative signal attenuation based on the ratio of true and filtered signal amplitudes. The availability of this gridded product will enable users who do not possess necessary background in GRACE data processing to extract and use ΔTWS time series. The data set is being updated regularly and can be downloaded from Jet Propulsion Laboratory (JPL)'s ftp site [JPL, 2012]. Unless otherwise noted, the period of study spans from March 2003 (the beginning month of the gridded data set) to August 2012 in monthly intervals.
 Two tests are designed to validate the concept and significance of GRACE-based downscaling. Test I examines three observation wells located in Texas (Well-TX), Nebraska (Well-HP), and Illinois (Well-IL), respectively (Figure 1). The wells were chosen because (a) all of them have continuous water level records during the study period and (b) at the regional scale, good correlation (>0.45) has been found between GRACE-derived groundwater storage changes and those estimated based on in situ water level data for all three aquifers [Scanlon et al., 2012b; Sun et al., 2010; Yeh et al., 2006]. Further background information pertaining to the three wells is provided below.
 Well-TX is located in Reeves County, Texas, and is completed in the unconfined Pecos Valley alluvial aquifer. The average water level elevation is about 140 ft [1 ft = 0.3048 m] below land surface. According to the Texas Water Development Board (TWDB), more than 80% of groundwater pumped from the Pecos Valley aquifer is used for irrigation and the rest is withdrawn for municipal and industrial uses [TWDB, 2012b]. The groundwater level data were queried from TWDB's groundwater database [TWDB, 2012a]. Both Well-HP and Well-IL belong to the US Geological Survey (USGS)'s Active Groundwater Level network [USGS, 2013]. The former is located in Dundy County, Nebraska, and is completed in the High Plains aquifer, while the latter is located in Vermilion County, Illinois, and is completed in a shallow glacial aquifer. The climate of the High Plains is semiarid. The unconfined High Plains aquifer is the largest freshwater aquifer in the US. Extensive irrigation in recent decades has led to a significant decrease in regional groundwater levels from the predevelopment era (1950s) [Scanlon et al., 2012b; Sophocleous, 2005]. The average water level of Well-HP is 133 ft below land surface during the study period. In comparison, the water table of the unconfined Illinois aquifer is relatively shallow and the average water level of Well-IL is only 22 ft below land surface. Rodell and Famiglietti showed that the groundwater storage change in Illinois is equal in magnitude to soil moisture change, and the two TWS components account for most of the TWS variations in Illinois relative to snow and surface water.
 All three Test I wells exhibit a certain degree of seasonality and mild nonstationarity (see supporting information S1, Figure S1). To quantify the correlation between GRACE ΔTWS and groundwater level changes, the high-pass, Hodrick-Prescott filter (HP filter) was applied to remove long-term trends from water level data (see section A).
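The detrending step can be illustrated with a minimal sketch of the Hodrick-Prescott decomposition, in which a smooth trend is estimated by penalized least squares and subtracted to leave the high-pass (cyclic) component. The function name `hp_filter`, the smoothing parameter `lam = 14400` (a value conventionally used for monthly series), and the synthetic water level series are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np

def hp_filter(y, lam=14400.0):
    """Split a series into (trend, cycle) with the Hodrick-Prescott filter.

    lam is the smoothing parameter; 14400 is an assumed value for monthly data.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator D with shape (n - 2, n).
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    # The trend minimizes ||y - trend||^2 + lam * ||D trend||^2.
    trend = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    return trend, y - trend

# Synthetic monthly water levels: linear decline plus an annual cycle.
t = np.arange(114)  # March 2003 to August 2012 is 114 months
levels = -0.05 * t + np.sin(2 * np.pi * t / 12)
trend, cycle = hp_filter(levels)
```

The `cycle` component is the detrended series that would be compared against GRACE ΔTWS; a purely linear series passes entirely into `trend`.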
 Test II further looks into seven continuously monitored wells that are all located within a 1° GRACE pixel at 101–102°W and 40–41°N (Figure 2). The area, which is in the Republican River Basin, encompasses several Nebraska counties, including Dundy County where Well-HP (i.e., Benkelman Well in Figure 2) is located. Nebraska is underlain by the largest volume of water-in-storage of the High Plains aquifer [McGuire et al., 2012]. Most of the wells selected for Test II are completed in the Ogallala Formation of the High Plains aquifer. One exception is Haigler Well, which is completed in a relatively shallow sand-and-gravel aquifer. The area is covered by a relatively dense in situ well network and, thus, offers a unique opportunity to further validate the usefulness of GRACE downscaling for multiple sites located within the same GRACE pixel. All well hydrographs were downloaded from USGS web site [USGS, 2013]. More information on Test II well locations and completion depths is provided in Table S1.
 In addition to GRACE ΔTWS, other hydrometeorological predictors used in this study include the Parameter-elevation Regressions on Independent Slopes Model (PRISM) total monthly precipitation and average minimum and maximum temperature data (4 km resolution) [PRISM Climate Group, 2012], which were extracted for all well locations for the study period.
 The statistical downscaling method adopted in this study is artificial neural network (ANN), which has been widely used in streamflow forecasting and water resources management [e.g., ASCE, 2000a, 2000b; Chang and Chang, 2006; House-Peters and Chang, 2011; Hsu et al., 1995; Maier et al., 2010; Moradkhani et al., 2004]. A major strength of the ANN lies in its universal approximation property—an ANN model with a single hidden layer can be trained to approximate the causal relationship of any nonlinear dynamic system without a priori assumptions about the underlying physical system. Another important attribute of ANN is its built-in capability to adapt to changes from the original environment [Haykin, 2009]. Thus, trained ANN models are relatively robust to data noise and are well suited to support real-time decision making.
 In the context of water level prediction, Coulibaly et al.  developed ANNs to predict the groundwater level by using in situ hydrometeorological observations and antecedent groundwater levels from the same well as inputs. Coppola et al. [2003, 2005a] applied ANNs to predict groundwater levels in response to variable pumping and weather conditions. Feng et al.  developed ANNs to simulate regional groundwater levels affected by human activities by incorporating population and monthly irrigation data as additional input variables. Chu and Chang  and Coppola et al.  integrated ANN in an optimization framework for groundwater resources planning, in which ANN models were trained as a surrogate of process models for predicting groundwater levels under different pumping conditions. The models developed in this study differ from previous studies in that the GRACE ΔTWS is used in lieu of antecedent groundwater levels as a predictor for 1 month ahead prediction. In other words, the models are developed for situations in which in situ monitoring is disrupted. Because of the GRACE data, the ANN models predict water level changes, instead of absolute water levels. Water agencies are often interested in the former quantity for water availability analysis [McGuire et al., 2012], although estimation of the long-term mean can also be accomplished using a number of existing regression methods [e.g., Hengl et al., 2004] and typically has less requirement on spatiotemporal frequency of data. For completeness, the ANN algorithm is briefly reviewed below. More details on ANN algorithms can be found in Maier et al.  and Haykin .
3.1. Multilayer Perceptron (MLP) Networks
 Time series prediction problems seek to learn a functional mapping between a set of predictors $\mathbf{x}$ and the target variable $y$ (assumed here to be a scalar variable)
$$y = f(\mathbf{x}) + \varepsilon \qquad (1)$$
where $f$ is the mapping and $\varepsilon$ is process noise. In this work, MLP networks are used to learn the functional mapping. An MLP network is a type of feedforward ANN consisting of an input layer, one or more hidden layers, and an output layer, with each layer consisting of one or more neurons. All MLPs developed in this work are single-hidden-layer MLPs. Construction of an MLP proceeds by connecting layers one by one through a series of transformations. In the first step, connections between the hidden layer and input variables are established. Let $\mathbf{x} = \{x_i\}_{i=1}^{M}$ denote a set of $M$ predictors. The hidden layer consists of $K$ hidden neurons and each is a weighted sum of predictors [Bishop, 2006]:
$$a_k = \sum_{i=1}^{M} w_{ki}^{(1)} x_i + b_k^{(1)}, \quad k = 1, \ldots, K \qquad (2)$$
in which $a_k$ is a hidden neuron, $w_{ki}^{(1)}$ are unknown weights associated with each input neuron, and $b_k^{(1)}$ is an unknown bias term used for correcting the estimation bias. The superscripts in equation (2) denote the layer number. In the second step, equation (2) is passed to a transfer function to yield outputs from hidden neurons
$$z_k = h(a_k) \qquad (3)$$
where $z_k$ are outputs and $h(\cdot)$ is the transfer function. A commonly used transfer function for the hidden layer is the logistic sigmoid function, which ranges in [0, 1] and acts as a compressing function. Finally, connections from the hidden layer to the output layer are established via a linear transfer function
$$y_j = \sum_{k=1}^{K} w_{jk}^{(2)} z_k + b_j^{(2)} \qquad (4)$$
where $y_j$ are output neurons (i.e., model predictions), and $w_{jk}^{(2)}$ and $b_j^{(2)}$ are the unknown weights and the bias term of the output layer, respectively. The number of output neurons is always one for the current study. In the training phase, the unknowns in equations (2) and (4) are solved through backpropagation, which is a process of propagating fitting errors backward through the network to obtain the optimal weights in each layer. The Matlab Neural Network Toolbox [Demuth et al., 2008] was used to develop and train all MLP networks for this study.
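The forward pass through a single-hidden-layer MLP can be sketched in a few lines. The example below is only an illustration in Python with NumPy (the study itself used the Matlab toolbox); the names `mlp_forward`, `W1`, `b1`, `w2`, and `b2`, and the random weights, are hypothetical, and a single output neuron is assumed as in this study.

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid transfer function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, w2, b2):
    """Forward pass of a single-hidden-layer MLP with one output neuron.

    x: (M,) predictors, W1: (K, M) input-hidden weights, b1: (K,) hidden biases,
    w2: (K,) hidden-output weights, b2: scalar output bias.
    """
    a = W1 @ x + b1             # weighted sums at the K hidden neurons
    z = logistic(a)             # hidden-layer outputs
    return float(w2 @ z + b2)   # linear transfer function at the output

# Untrained example with M = 6 predictors and K = 3 hidden neurons.
rng = np.random.default_rng(0)
M, K = 6, 3
W1, b1 = rng.normal(size=(K, M)), rng.normal(size=K)
w2, b2 = rng.normal(size=K), 0.1
x = rng.uniform(-1.0, 1.0, size=M)
y_hat = mlp_forward(x, W1, b1, w2, b2)
```

Training would adjust `W1`, `b1`, `w2`, and `b2` by backpropagation; the sketch covers only the mapping from predictors to prediction.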
3.2. Performance Measures
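 Performance of the developed MLPs is quantified using several criteria. Root mean square error (RMSE) measures the global fitness of a predictive model
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - o_i)^2} \qquad (5)$$
where $y_i$ and $o_i$ are predicted and observed values, respectively, and $N$ is the number of target data used for testing. The scaled RMSE, R*, is defined as the ratio between RMSE and the standard deviation of observations, $\sigma_o$,
$$R^{*} = \frac{\mathrm{RMSE}}{\sigma_o} \qquad (6)$$
which can vary between zero and a large positive value. The correlation coefficient (R) is a measure of how well future outcomes are likely to be predicted by the model and is equivalent to the sample cross correlation between predicted and observed values
$$R = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(o_i - \bar{o})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2 \sum_{i=1}^{N} (o_i - \bar{o})^2}} \qquad (7)$$
where the overbar denotes mean values. The maximum absolute error (MAE) is defined as
$$\mathrm{MAE} = \max_{1 \le i \le N} |y_i - o_i| \qquad (8)$$
 Finally, the Nash-Sutcliffe efficiency (NSE), ranging from $-\infty$ to 1, measures the predictive skill of a model relative to the mean of observations,
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N} (o_i - y_i)^2}{\sum_{i=1}^{N} (o_i - \bar{o})^2} \qquad (9)$$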
 In the literature, a predictive model is said to achieve (a) very good performance if the resulting NSE is greater than 0.75 and R* is less than 0.50, (b) good performance if NSE is greater than 0.65 and R* is less than 0.6, and (c) satisfactory performance if NSE is greater than 0.5 and R* is less than 0.70 [Moriasi et al., 2007]. These ranking criteria are used here as a convenient measure for comparing the relative performance of different models, although readers should be aware that they are context and application dependent.
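As an illustration, the performance measures and the rating thresholds quoted above can be computed as follows. This is a sketch under stated assumptions: the function names `performance` and `rating` are hypothetical, the standard deviation is taken as the population form (dividing by N), and the fallback label "unsatisfactory" for models that meet none of the three criteria is added here for completeness.

```python
import numpy as np

def performance(y, o):
    """Compute RMSE, R*, R, MAE (maximum absolute error), and NSE."""
    y, o = np.asarray(y, dtype=float), np.asarray(o, dtype=float)
    err = y - o
    rmse = np.sqrt(np.mean(err ** 2))
    return {
        "RMSE": rmse,
        "R*": rmse / np.std(o),          # population standard deviation (assumption)
        "R": np.corrcoef(y, o)[0, 1],
        "MAE": np.max(np.abs(err)),
        "NSE": 1.0 - np.sum(err ** 2) / np.sum((o - o.mean()) ** 2),
    }

def rating(nse, r_star):
    """Map NSE and R* to the performance categories quoted in the text."""
    if nse > 0.75 and r_star < 0.50:
        return "very good"
    if nse > 0.65 and r_star < 0.60:
        return "good"
    if nse > 0.50 and r_star < 0.70:
        return "satisfactory"
    return "unsatisfactory"  # label assumed here for the remaining cases

observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
m = performance(predicted, observed)
category = rating(m["NSE"], m["R*"])
```

With the population standard deviation, R* and NSE are linked by R*² = 1 − NSE, so the two criteria in each category carry overlapping information.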
3.3. ANN Model Structure
 Figure 3a shows a representative MLP network structure for 1 month ahead prediction. The input vector x includes six variables and the target variable y is the water level change, Δh. Both Well-TX and Well-HP MLPs use the network structure shown in Figure 3a. For Well-IL, an additional antecedent series is included. The MLP structures were determined to maintain a balance between model parsimony and performance. Recall that the main purpose of this work is to test the data worth of GRACE for fulfilling the role of in situ data. Therefore, the antecedent Δh is not included as a predictor when formulating 1 month ahead prediction MLPs. More details on the design and training of MLPs are provided in section 4.
 Water managers are often interested in water level changes at the seasonal scale for operational planning. A strategy proposed by Chang et al. is used here for multimonth forecast. The prediction proceeds 1 month at a time, and each time the intermediate output is used to inform the next prediction. This multistep prediction strategy is referred to as serial propagation by Chang et al., who found that inclusion of outputs from intermediate steps improved not only the accuracy but also the reliability of predictions. Thus, output from the 1 month ahead prediction is used as a predictor when performing the 2 month ahead prediction (Figure 3b). Similarly, both the 1 and 2 month ahead Δh predictions are used as inputs for the 3 month ahead prediction. A separate MLP network is developed and trained for each additional prediction step.
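The wiring of serial propagation can be sketched as below. To keep the example short and runnable, each lead-time model is a least-squares linear regression standing in for a trained MLP; this substitution, the function names `fit_linear` and `serial_propagation`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares stand-in for a trained MLP (assumption made for brevity)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ coef

def serial_propagation(X, dh, max_lead=3):
    """Train one model per lead time and feed earlier predictions forward.

    X: (T, M) predictors observed at month t; dh: (T,) water level changes.
    """
    n = len(dh) - max_lead            # rows that have a target at every lead
    feats = X[:n]
    models, preds = [], []
    for lead in range(1, max_lead + 1):
        target = dh[lead:lead + n]    # water level change `lead` months ahead
        model = fit_linear(feats, target)
        p = model(feats)
        models.append(model)
        preds.append(p)
        feats = np.column_stack([feats, p])  # intermediate output becomes a predictor
    return models, preds

rng = np.random.default_rng(1)
T, M = 60, 4
X = rng.normal(size=(T, M))
dh = X @ np.array([0.5, -0.2, 0.1, 0.3]) + 0.1 * rng.normal(size=T)
models, preds = serial_propagation(X, dh)
```

With linear stand-ins the fed-back column is collinear with the other predictors, so the example shows only how inputs are chained; a nonlinear model such as an MLP is needed for the intermediate output to add information.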
3.4. Relative Importance of Input Variables
 In the literature, several metrics for quantifying the relative importance of ANN input variables exist [Gevrey et al., 2003; Olden and Jackson, 2002; Olden et al., 2004], some of which have been applied in previous water level prediction studies. For example, Coppola et al. used the backward stepwise elimination, in which the model performance between excluding and including a certain predictor is compared. However, this classical stepwise method (either forward addition or backward elimination) is limited because the “necessity to use a new model for each variable selection skews the results” [Gevrey et al., 2003]. A comparative study of Olden et al. examined the performance of nine relative importance metrics on a synthetic data set with known correlation structure. They concluded that a connection weight method which uses raw input-hidden and hidden-output connection weights in the trained ANN provides the best methodology for accurately quantifying variable importance. The application of the connection weight method proceeds in three steps. Let the input-hidden connection weight matrix be denoted as $\mathbf{W}^{(1)} \in \mathbb{R}^{K \times M}$ and the hidden-output connection weight vector as $\mathbf{w}^{(2)} \in \mathbb{R}^{K}$, where K is the number of hidden neurons, and M is the number of input variables (see Figure 3a). Each column in $\mathbf{W}^{(1)}$ is first multiplied by $\mathbf{w}^{(2)}$ (element-wise multiplication) to give a product matrix $\mathbf{C} \in \mathbb{R}^{K \times M}$. Summing over rows of $\mathbf{C}$ gives an importance vector. The relative importance of the ith input variable is then defined as follows
$$RI_i = \frac{\left| \sum_{k=1}^{K} c_{ki} \right|}{\sum_{j=1}^{M} \left| \sum_{k=1}^{K} c_{kj} \right|} \qquad (10)$$
where $c_{ki}$ are elements of $\mathbf{C}$. The connection weight method is straightforward to implement and will be used here to rank the relative importance of input variables.
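The three steps of the connection weight method translate directly into array operations. The sketch below assumes that the importance vector is normalized by the sum of its absolute values so that the relative importances add up to one; that normalization, the function name, and the small weight matrices are illustrative assumptions.

```python
import numpy as np

def connection_weight_importance(W1, w2):
    """Relative importance of inputs from raw connection weights.

    W1: (K, M) input-hidden weights; w2: (K,) hidden-output weights.
    """
    C = W1 * w2[:, None]   # multiply each column of W1 element-wise by w2
    s = C.sum(axis=0)      # sum over hidden neurons: one value per input
    # Normalizing by absolute values is an assumption of this sketch.
    return np.abs(s) / np.abs(s).sum()

# Hypothetical trained weights for K = 2 hidden neurons and M = 2 inputs.
W1 = np.array([[1.0, 2.0],
               [3.0, -4.0]])
w2 = np.array([1.0, -1.0])
ri = connection_weight_importance(W1, w2)
```

For these weights the product matrix is [[1, 2], [-3, 4]], the importance vector is [-2, 6], and the relative importances are 0.25 and 0.75.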
3.5. Ensemble Generation
 Ensemble techniques provide a means for quantifying uncertainty and improving generalization and stability of ANN models, which becomes important when the size of training set is small such as in the current problem. Several methods exist for generating ANN ensembles, including bagging [Breiman, 1996], boosting [Freund and Schapire, 1995], and randomization of initial weights [Dietterich, 2000b; Opitz and Maclin, 1999]. Bagging is based on bootstrapping original training sets to generate different realizations; a network model is then trained for each realization. Boosting is a technique for producing a series of predictors trained with a different distribution of the original training data. Initial weight randomization simply uses different initial weights to train and generate an ensemble of networks. The difference in ensemble members reflects uncertainty caused by the limited training sample size and the local nature of ANN training algorithms. Previous studies suggest that the initial weight randomization method can yield results as accurate as the more sophisticated bagging and boosting methods [Opitz and Maclin, 1999; Shu and Burn, 2004]. Thus, it is used to create ensembles in this work.
 After the ensemble models are generated, they are combined to generate ensemble outputs. Common methods include simple averaging, weighted averaging, and stacking, a survey of which is provided in Zhou . Because the current problem involves ensemble models of similar performances (i.e., homogeneous learners), the computationally efficient simple averaging is used, in which all members are assigned equal weights. Regardless of the method used to construct and combine the ensembles, the generalization ability of a network ensemble is often better than that of a single model [Dietterich, 2000a; Zhou, 2012].
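Initial weight randomization followed by simple averaging can be sketched as follows. The example trains small single-hidden-layer MLPs by plain gradient descent, which is an assumption made to keep the code short and is not the training algorithm used in this study; the function name `train_mlp`, the synthetic data, and the ensemble size of 10 are likewise hypothetical.

```python
import numpy as np

def train_mlp(X, y, K=3, seed=0, epochs=2000, lr=0.05):
    """Train a single-hidden-layer MLP by plain gradient descent (illustrative).

    The seed controls the random initial weights, which is the only source
    of difference between ensemble members.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    W1, b1 = rng.normal(scale=0.5, size=(K, M)), np.zeros(K)
    w2, b2 = rng.normal(scale=0.5, size=K), 0.0
    for _ in range(epochs):
        z = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))  # (N, K) hidden outputs
        err = z @ w2 + b2 - y                        # (N,) residuals
        delta = np.outer(err, w2) * z * (1.0 - z)    # errors propagated to hidden layer
        g_W1, g_b1 = delta.T @ X / N, delta.mean(axis=0)
        g_w2, g_b2 = z.T @ err / N, err.mean()
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return lambda Xn: (1.0 / (1.0 + np.exp(-(Xn @ W1.T + b1)))) @ w2 + b2

rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(80, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=80)

members = [train_mlp(X, y, seed=s) for s in range(10)]  # one member per seed
member_preds = np.array([m(X) for m in members])        # shape (10, 80)
ensemble_mean = member_preds.mean(axis=0)               # simple averaging
ensemble_spread = member_preds.std(axis=0)              # member disagreement
```

Because squared error is convex, the mean squared error of the averaged prediction cannot exceed the average of the members' mean squared errors, which is one reason simple averaging is a safe default for homogeneous learners.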
4. Results and Discussion
4.1. Test I
 Correlation analysis shows a positive correlation between GRACE ΔTWS (G) and water level change (Δh) for all three wells (Figures 4a–4c), with Spearman's correlation coefficients of 0.33, 0.57, and 0.73, respectively, for Well-TX, Well-HP, and Well-IL. The correlation is the strongest for Well-IL because of its relatively shallow depth, while intermittent temporal lags between G and Δh can be observed in the case of the other two wells, especially Well-TX. The lags largely result from different water table depths, the vadose zone soil structure, and shallow aquifer geology which, in turn, all affect recharge patterns. In the case of Well-TX, Δh shows no obvious seasonal pattern, while distinct seasonal patterns can be observed in the other two wells. Precipitation and temperature, which have been included in many previous studies as predictors [e.g., Coppola et al., 2005b; Coulibaly et al., 2001], exhibit strong seasonal patterns as expected (Figure S2). Results of autocorrelation analysis on predictors, which is part of the ANN model selection process, are provided in supporting information S3 and Figure S3.
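A rank correlation of this kind, including a simple scan over temporal lags, can be sketched as below. The functions `spearman` and `lagged_spearman` are hypothetical helpers that assume no tied values, and the series are synthetic: the water level change is constructed to follow the storage signal with a 2 month delay.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, assuming no tied values in x or y."""
    rx = np.argsort(np.argsort(x))   # ranks of x
    ry = np.argsort(np.argsort(y))   # ranks of y
    return float(np.corrcoef(rx, ry)[0, 1])

def lagged_spearman(g, dh, max_lag=3):
    """Correlate g(t) with dh(t + lag) for lag = 0, ..., max_lag."""
    n = len(g)
    return {lag: spearman(g[:n - lag], dh[lag:]) for lag in range(max_lag + 1)}

# Synthetic example: dh is a monotonic function of g delayed by 2 months.
rng = np.random.default_rng(3)
g = rng.normal(size=50)
dh = np.empty(50)
dh[2:] = g[:-2] ** 3
dh[:2] = rng.normal(size=2)
corr_by_lag = lagged_spearman(g, dh)
```

In this construction the correlation peaks at a lag of 2, and it equals one there because the rank correlation is insensitive to the monotonic (cubic) transformation.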
4.1.1. One Month Ahead Prediction
 Input data were split into three sequential parts: 60% was used for training, 20% for validation, and the rest for testing the performance of trained models. Before training, all input and target data were standardized by linear scaling using the maximum and minimum values of each series. Scaling is necessary to ensure that all variables receive equal consideration during training of the ANN. The number of hidden neurons is critical to the performance of ANN. If too many hidden neurons are used, the network may overfit, thereby reducing its generalization capability and increasing training time [Xu and Li, 2002]. On the other hand, using too few hidden neurons may result in underfitting. A rule of thumb is that the number of hidden neurons should be about half the number of predictors and should never be more than twice as large [Berry and Linoff, 2004; Minns and Hall, 1996]. In practice, the optimal number is determined by trial and error, in which hidden neurons are added gradually until the RMSE of training starts to increase. Following this general procedure, the number of hidden neurons used for Well-TX, Well-HP, and Well-IL models was found to be 3, 3, and 6, respectively. The algorithm chosen for backpropagation is Levenberg-Marquardt, a gradient-based algorithm. Training was stopped when either the performance goal was reached or when the error on the validation subset increased consecutively for six iterations. Early stopping is a mechanism for preventing overfitting [Demuth et al., 2008].
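The data preparation and stopping rule described above can be sketched as follows. The helper names are hypothetical, the target range of [-1, 1] for min-max scaling is an assumption of this sketch, and the early stopping check simply tests whether the validation error has risen for six consecutive iterations.

```python
import numpy as np

def sequential_split(n, train=0.6, val=0.2):
    """Return slices for sequential training, validation, and testing subsets."""
    i_tr = int(round(n * train))
    i_va = i_tr + int(round(n * val))
    return slice(0, i_tr), slice(i_tr, i_va), slice(i_va, n)

def minmax_scale(x, lo=-1.0, hi=1.0):
    """Scale a series linearly to [lo, hi]; the range [-1, 1] is assumed here."""
    x = np.asarray(x, dtype=float)
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())

def early_stop(val_errors, patience=6):
    """True once validation error has risen for `patience` consecutive iterations."""
    if len(val_errors) <= patience:
        return False
    return bool(np.all(np.diff(val_errors[-(patience + 1):]) > 0))

n_months = 114  # March 2003 to August 2012
tr, va, te = sequential_split(n_months)
series = np.linspace(3.0, 7.0, n_months)
scaled = minmax_scale(series)
train_part, val_part, test_part = scaled[tr], scaled[va], scaled[te]
```

The split is sequential rather than random, so the testing subset always covers the most recent part of the record.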
 For each well and each lead time, an ensemble of 50 MLPs was generated and combined according to the procedure described in section 3.5. The performance metrics of the ensemble models are reported in Table 1, in which the right three columns were calculated based on the testing data. On the basis of NSE, the performance of the 1 month ahead ensemble falls into either the good or very good category (see section 3.2 for definition of performance categories). The fit for Well-HP is the best and that for Well-IL is the worst. The same conclusion can be drawn on the basis of R*. Figures 5a–5c compare ensemble predictions of Δh to in situ data. On each subplot, the two vertical lines mark the division of training, validation, and testing subsets.
Note to Table 1: The three numbers represent the number of input, hidden, and output neurons in each layer.
 The relatively poor performance of the Well-IL model is surprising because Well-IL shows the strongest G versus Δh correlation among the three wells. The poor performance may be attributed to both a sharp decline in water levels during the testing period and a change in seasonal variation patterns from a subdued (2003–2008) to a more pronounced one (2008–2012) (see Figure 4c). In the case of Well-TX, the 2006 drought event was covered by the training period, which helped model training to some degree, although the developed models need to be further improved to fully capture the extreme values. For the latter purpose, it is necessary to include as many extreme events as possible as training patterns if the corresponding in situ data are also available. It is also important to realize that GRACE monitors the average large-scale impact caused by natural and anthropogenic events. Thus, if an observation well is significantly affected by local pumping activities, GRACE data may have limited value for it. In contrast to Well-TX and Well-IL, Well-HP shows persistent cyclic patterns throughout the study period, which makes it the easiest to predict.
 The relative importance of each predictor was quantified using the connection weight method described in section 3.4 and the results are shown in Figure 6a. For ease of visualization, the relative importance of predictors belonging to the same variable group (e.g., the temperature group) is combined. The relative importance of G in 1 month ahead prediction is 0.15, 0.08, and 0.18, respectively, for Well-TX, Well-HP, and Well-IL. In all cases, the contribution of GRACE is similar to or slightly greater than that of the precipitation group, but both are dominated by the temperature group, which represents the most important predictor group for 1 month ahead prediction. The relative importance of GRACE appears to be higher when the Δh pattern is less cyclic, such as in the case of Well-TX and Well-IL. Temperature is a good surrogate for seasonality and, thus, it also reflects potential irrigation water uses. The roles of temperature and precipitation as observed here are consistent with those reported in previous studies [Coppola et al., 2005b; Feng et al., 2008].
4.1.2. Multimonth Ahead Prediction
 Two and three month ahead MLP models were developed according to the descriptions in section 3.3 and Figure 3. The ensemble performance metrics are summarized in Table 1. As mentioned before, the intermediate 1 month ahead water level change prediction is used as feedback in the 2 month ahead prediction, and the intermediate 1 and 2 month ahead predictions are used for the 3 month ahead prediction.
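The feedback structure described above can be illustrated with a short sketch. Figure 3 and section 3.3 are not reproduced here, so the exact input layout is an assumption: each lead-time model is taken to receive the exogenous predictors plus all earlier intermediate predictions. The function name `predict_multimonth` and the stand-in models `lead1`, `lead2`, and `lead3` are hypothetical; they are simple weighted sums, not trained MLPs.

```python
def predict_multimonth(models, predictors):
    """Chain per-lead-time models, feeding each prediction forward.

    models[k]  : callable for lead time k+1; takes a list of inputs
    predictors : exogenous inputs (e.g., G, P, T) available at time t
    Returns the predicted water level changes for leads 1..len(models).
    """
    predictions = []
    for model in models:
        # Lead k+1 sees the exogenous inputs plus all earlier predictions.
        predictions.append(model(predictors + predictions))
    return predictions


# Stand-in "models": simple weighted sums, NOT trained MLPs.
def lead1(x):  # inputs: [G, P, T]
    return 0.2 * x[0] + 0.1 * x[1] + 0.5 * x[2]


def lead2(x):  # inputs: [G, P, T, dh1]
    return 0.1 * x[0] + 0.6 * x[3]


def lead3(x):  # inputs: [G, P, T, dh1, dh2]
    return 0.3 * x[3] + 0.5 * x[4]


dh = predict_multimonth([lead1, lead2, lead3], [1.0, 2.0, -1.0])
```

Because each later model consumes earlier outputs, any error in the 1 month ahead prediction propagates forward, which is consistent with skill decreasing at longer lead times.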
 The performance of the 2 month ahead predictions ranges from the satisfactory to the very good category, while the 3 month ahead predictions all fall into the satisfactory category. Figures 6b and 6c plot the relative importance of input variables. For the 2 month ahead prediction, the relative importance of GRACE data stays almost the same as that in the 1 month ahead prediction. The relative importance of the intermediate output (the 1 month ahead prediction) is second only to that of the temperature group. For the 3 month ahead prediction, the intermediate water level change predictions become the most important group. This is because of the strong autocorrelation of water levels [Coppola et al., 2005a]. Thus, in the absence of in situ data, the role of GRACE in 1 month ahead prediction is seen as helping to jumpstart a serial propagation process. The outputs then provide feedback to subsequent multimonth ahead predictions. In this way, the cross-scale information exchange is facilitated.
 Overall, Test I results are encouraging. Downscaling of GRACE data leads to a modest but significant improvement in Δh prediction. Only one representative well in each GRACE pixel was selected for testing. This leads naturally to the question of the value of downscaling GRACE for multiple wells at the subgrid level. Test II is designed to address this question.
4.2. Test II
 The 1° GRACE pixel used for Test II was selected mainly because of the abundance of in situ data that can be used for validation. As can be seen from Figure 2, the wells are relatively well distributed across the pixel. Correlation analysis shows that a positive correlation between G and Δh exists for all seven wells in Test II (Figure 7), with Champion Well showing the strongest (0.61) and Haigler Well the weakest (0.42) correlation. Recall that Haigler Well is completed in a different formation than the other wells. Nevertheless, the phases of Δh for all wells are quite similar, although the magnitudes of variations differ among wells. Grant South Well exhibits the largest variations, while variations in the other wells are more subdued. For testing purposes, only 1 month ahead MLPs were developed, which adopt the same model structure as that used for Well-HP (i.e., Benkelman Well). Figure 7 also suggests that GRACE data show a slightly upward trend during 2008–2011, which corresponds to a slowdown in groundwater depletion (Figure S1). The performance metrics of all ensemble models are reported in Table 2.
Table 2. One Month Ahead Performance Metrics for Test II Wells
 The NSE values of all models fall in either the good or the very good category. Benkelman Well and Enders Well are relatively close to each other, and their ANN models show similar performance. Lamar Well and Champion Well, which are located at higher elevations, have lower NSE values than the others. Figure 8 shows the relative importance of each predictor group in the 1 month ahead MLPs. Again, the relative importance of GRACE is similar to that of the precipitation group, and both are dominated by the temperature group.
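For readers unfamiliar with the metric, the sketch below shows one common way to compute NSE, assuming NSE here denotes the standard Nash-Sutcliffe efficiency (one minus the ratio of the squared prediction error to the variance of the observations about their mean). The function name `nash_sutcliffe` and the sample values are illustrative assumptions, not data from Table 2.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    skill of always predicting the observed mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    var = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / var


# Made-up monthly water level changes (NOT from the paper).
obs = [0.2, -0.1, 0.4, -0.3, 0.1]
sim = [0.1, -0.2, 0.5, -0.2, 0.0]
nse = nash_sutcliffe(obs, sim)
```

The mapping from an NSE value to a category such as good or very good depends on the thresholds adopted in the methods section, which are not repeated here.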
 Thus, Test II indicates that downscaling of GRACE to multiple wells at the subgrid scale is feasible and has certain merits in this case. This is mainly because most of the wells are completed in the same formation. Of course, if wells are completed in different and disconnected formations, downscaling may have mixed results. Only one pixel was tested here. If data are available, further validation tests for other regions around the world can yield additional insights about the usefulness of GRACE downscaling.
 GRACE has supported many advances in TWS monitoring since its launch. To realize the full potential of GRACE for hydrological applications, however, monthly TWS anomalies from GRACE must be downscaled in space and time and extrapolated to the present, thereby meeting the specificity, timeliness, and high spatial resolution requirements of local water resources management [Houborg et al., 2012]. This study provides a relatively straightforward nonparametric approach for tapping into the predictive power of GRACE.
 A newly released, gridded GRACE product was tested for its value in serving as a surrogate for continuously monitored in situ water level measurements, which have seen a great decline in coverage. Groundwater agencies, especially those in semiarid regions, are interested in tracking groundwater level changes to manage aquifers in a sustainable manner [McGuire et al., 2012]. The nonparametric ANN models used in this study were developed using PRISM monthly precipitation, maximum and minimum temperatures, and GRACE ΔTWS as inputs. The target variable of interest is the groundwater level change, Δh. The wells tested are located in different geographic and climatic regions in the US. Main findings include:
 1. At 1 month prediction interval, the trained ensemble MLPs gave good to very good performance. The relative importance of GRACE to Δh prediction ranges from 8% to nearly 20%. The relative importance of GRACE tends to be greater when the water level patterns are less cyclic.
 2. The performance of the ensemble models is satisfactory at multimonth lead times.
 3. Results support the main hypothesis that GRACE ΔTWS can be downscaled to infer or predict Δh when continuous in situ measurements are not available.
 4. The approach developed here can be applied to force multiple ANNs developed for a network of wells, the outputs of which can then be combined via spatial interpolation techniques.
 5. The ANN approach taken here is well established, although its training and validation can be time consuming. However, any other statistical learning method can be used in its place if necessary.
 The Hodrick–Prescott filter [Hodrick and Prescott, 1997], or HP filter, is used to obtain Δh from absolute water level measurements. The HP filter is a high-pass filter that removes a smooth trend from a time series by solving [Ravn and Uhlig, 2002]
$$\min_{\{\tau_t\}} \left\{ \sum_{t=1}^{T} (y_t - \tau_t)^2 + \lambda \sum_{t=2}^{T-1} \left[ (\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1}) \right]^2 \right\}$$
where $y_t$ is the raw time series, $\tau_t$ is the trend term, and $\lambda$ is a smoothing parameter. If the frequency of raw data is monthly, the recommended smoothing parameter of the HP filter is 14,400 [Ravn and Uhlig, 2002]. The Matlab function hpfilter was used in this study.
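As an illustration only, the sketch below implements the HP filter in plain Python. The filter is commonly written as minimizing the sum of squared deviations of the series from the trend plus a smoothing parameter times the sum of squared second differences of the trend; setting the gradient to zero gives the linear system (I + λ D'D) τ = y, where D is the second-difference operator. The function name `hp_filter` and the sample series are hypothetical. This is not the Matlab hpfilter routine used in the study, and treating the detrended component as Δh is an assumption made here.

```python
def hp_filter(y, lam=14400.0):
    """Split series y into (trend, cycle) with the Hodrick-Prescott filter.

    Solves (I + lam * D'D) tau = y, where D is the second-difference
    operator; this is the first-order condition of the HP minimization.
    """
    n = len(y)
    # Build A = I + lam * D'D as a dense matrix (fine for short series).
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stencil = (1.0, -2.0, 1.0)  # row k of D acts on y[k], y[k+1], y[k+2]
    for k in range(n - 2):
        for p in range(3):
            for q in range(3):
                a[k + p][k + q] += lam * stencil[p] * stencil[q]
    # Gaussian elimination without pivoting; A is symmetric positive
    # definite and pentadiagonal, so only two rows below need updating.
    b = [float(v) for v in y]
    for i in range(n):
        for r in range(i + 1, min(i + 3, n)):
            f = a[r][i] / a[i][i]
            for c in range(i, min(i + 3, n)):
                a[r][c] -= f * a[i][c]
            b[r] -= f * b[i]
    # Back substitution over the two remaining upper diagonals.
    trend = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][c] * trend[c] for c in range(i + 1, min(i + 3, n)))
        trend[i] = (b[i] - s) / a[i][i]
    cycle = [yi - ti for yi, ti in zip(y, trend)]
    return trend, cycle


# Made-up monthly water levels (NOT from the paper).
levels = [0.0, 1.0, 0.5, 2.0, 1.5, 3.0, 2.5, 4.0]
trend, cycle = hp_filter(levels, lam=14400.0)
```

A purely linear series has zero second differences, so the filter returns it unchanged as the trend; this gives a convenient sanity check for any implementation.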
 The author is grateful to the Associate Editor and three anonymous reviewers for their careful review and constructive comments, which have significantly improved the original manuscript. The author also wishes to thank Sean Swenson at the National Center for Atmospheric Research for his constructive comments on the original manuscript.