A simple model was developed to estimate area-averaged evapotranspiration (ET) at the watershed scale, using widely available records of streamflow, precipitation, and basic meteorological data. The pivotal assumption of the model is that area-averaged basin-wide water storage (V) can be used to determine basin-wide ET efficiency. With this assumption, ET was modeled as a function of watershed storage, Priestley-Taylor potential evapotranspiration, and one free parameter (α) controlling the relation between basin-wide ET efficiency and storage. Watershed storage was found by integrating the water balance equation forward in time with observed precipitation and streamflow. By exploiting a hypothesized positive correlation between storage and streamflow, a method was developed to estimate the parameter α without calibration to measured ET, thus allowing the model to be applied in any watershed with measured precipitation, streamflow, and meteorological variables. The model was tested at sites within the AmeriFlux network in a variety of climates and ecosystems, using downstream U.S. Geological Survey streamflow to define the watershed boundary and measured AmeriFlux evapotranspiration to judge model performance. At most sites, the dynamics of modeled ET closely matched those of measured ET: daily root-mean-square errors averaged 0.067 cm d−1. In general, water storage and ET efficiency, especially during dry downs, were captured by the model, and the free parameter determined from the storage-streamflow correlation criteria was close to the optimal fit found through direct calibration. The performance of both the model and the indirect calibration strategy was best in arid to semiarid sites.
 Understanding of large-scale evapotranspiration (ET) is crucial for water resources management and agricultural planning [Dingman, 2002; Intergovernmental Panel on Climate Change (IPCC), 2008], ecological analyses [Woodward, 1987], and weather and climate modeling [IPCC, 2008]. However, while precipitation and streamflow records are both spatially and temporally extensive, good-quality ET records (i.e., direct measurements at high temporal resolution, such as from eddy covariance) are sparse and short (usually less than 20 years). Some long pan evaporation records exist, although interpretation of these records can be problematic [Brutsaert and Parlange, 1998; Pettijohn and Salvucci, 2009]. For instance, in a warming environment the conventional wisdom is that the hydrological cycle will intensify, implying that both precipitation and ET fluxes will increase in magnitude. Paradoxically, pan evaporation has decreased throughout the last half century while precipitation has increased, leading authors to develop opposite interpretations of ET trends from pan evaporation records [e.g., Brutsaert and Parlange, 1998; Peterson et al., 1995].
 Many attempts have been made to physically model ET, in order to increase our temporal and spatial knowledge of ET where measurements are lacking. However, the wide variety of available models, even within a single class of application (e.g., in climate models) is evidence of the difficulty in this undertaking. Part of the struggle in predicting ET results from the vast range of physical processes involved, such as soil and plant water transport, stomatal dynamics, radiation physics, and molecular and turbulent diffusion. Additionally, these physical and biological processes vary across a wide range of spatial and temporal scales, thus complicating our ability to predict and measure ET at large (e.g., regional) scales [Verstraeten et al., 2008]. For instance, an immensely complex three-dimensional geometry of evaporating surfaces (e.g., patches of soil, individual leaves, trees, grasslands) combines to produce the vapor flux simulated by models (at the scale of plots, watersheds, or climate model grids), which is then compared to data measured by instruments (at the scale of lysimeters, flux towers, or lidar). The potential disparity of process and measurement scale can lead to inconclusive progress. For example, how realistic is it if, by varying a dozen parameters, a “big-leaf” land surface ET model, whose physics is based in part on stomatal response to leaf water potential and radiation, is used to predict the ET flux for a grid cell many kilometers wide and with a complex distribution of leaves, leaf-water potentials, and absorbed radiation?
 Here we present a method to predict evapotranspiration at the scale of a watershed (less than approximately 5000 km2) using basic hydrological concepts and statistical methods. Specifically, we investigate the relation between ET and the active components of water storage (surface water, root zone soil moisture, deeper soil moisture, and shallow groundwater) with which ET interacts, along with its dependence on available energy, at the scale of the watershed itself. The method is tested at nine watersheds in a variety of climates and ecosystems, each of which is defined by a U.S. Geological Survey (USGS) stream gage and contains at least one AmeriFlux [Baldocchi et al., 2001] site.
 The pivotal hypothesis of the proposed method is that the total dynamic water content of a watershed (or basin-wide storage, V) is highly correlated to both the basin-wide evapotranspiration efficiency (i.e., ET relative to potential ET) and to the magnitude of streamflow (Q). This assumption should not be taken lightly, given that watersheds are systems of considerable complexity with regard to flow paths (including both riparian areas and uplands, saturated and unsaturated zones, gaining and losing streams, and complex vegetation patterns, to name a few). For this reason, it might be expected that a single basin-wide storage would neither be a good estimator of ET efficiency for an entire watershed, nor highly correlated with streamflow. However, from physical reasoning we expect that, in general, both Q and ET efficiency would be positively correlated with storage. Specifically, base flow and surface runoff would be expected to increase with V as water tables rise, soil moistens, and infiltration capacity decreases. For instance, couplings between Q and measures of storage have been demonstrated in work by Troch et al. (in which base flow was modeled as a function of groundwater storage), Shaw and Walter (where correlation was assumed between soil moisture storage and base flow), and in TOPMODEL [Beven, 1977] and its many derivatives (e.g., SIMTOP [Niu et al., 2005]), although in these cases streamflow is used to approximate a storage deficit.
 Likewise, ET would be expected to increase with increasing availability of soil water and groundwater, albeit in highly complex ways involving root growth and extraction, stomatal response, and soil water diffusion, until it becomes rate limited by the available energy to vaporize water. This dependence of ET on V is implicit in water balance studies by Milly and Dunne  (where evaporation was modeled as a function of available water from precipitation) and Walter et al.  (in which it was noted that, in the context of climate change, an increase in soil moisture would lead to greater ET under nonpotential conditions) and is demonstrated by Salvucci  under statistically stationary conditions. With these general behaviors in mind, this study proposes that V can be used as a primary driver of the behavior of ET and streamflow on the aggregated watershed scale.
 The greatest practical potential of the model is its ability to indirectly estimate ET using only records of P, Q, and a few meteorological variables (i.e., without calibration to measured ET). Estimation of ET using only Q was previously accomplished by Palmroth et al., but whereas Palmroth et al. used streamflow recessions to estimate the parameters of a nonlinear reservoir model (which was subsequently employed to estimate ET), we utilize the hypothesized high correlation between model-estimated storage and measured streamflow to estimate the one free parameter that controls basin-wide ET efficiency. As a point of clarification, the method developed in this study is not the same as calibrating a land surface hydrologic model to streamflow. Streamflow generating processes are not modeled in this work. Instead, measured streamflow enters the model directly.
 Throughout this work, the philosophy taken was to model only what was necessary (in this case, ET and V, but not Q, as in most watershed models) and, in the spirit of Kirchner's  work, to see what can be learned from the data. In particular, the approach is consistent with two of the five directions proposed by Kirchner: (3) developing physically based equations for hydrologic behavior at large scales, and (4) developing minimally parameterized models that stand a chance of failing tests applied to them. The ET model discussed below is a simplification of an enormously complex process, but in many areas of study, including global climate models, the representation of moisture exchange is already at a crude scale. And at such crude scales, it is entirely possible that simple models derived from, and applied to, coarse scale representations of moisture, could be superior to the highly complex soil-vegetation-atmosphere transfer schemes (containing tens to hundreds of unknown parameters) which have been developed from leaf chamber to plot scale experiments.
2.1. Model Framework and Method for Indirect Parameter Estimation
 The proposed method relies on integrating the water balance using observed streamflow, and thus can only be applied at the scale of entire watersheds (as opposed to plots, hillslopes, or meteorological grids). The watershed water budget can be expressed in terms of fluxes as:

$$\frac{dV}{dt} = P - \frac{Q}{A_w} - ET \qquad (1)$$
In equation (1), $dV/dt$ is the change in area-averaged watershed water storage with time, P is area-averaged precipitation, Q/Aw is streamflow normalized to the watershed area (Aw), and ET is area-averaged evapotranspiration, all expressed as fluxes [L/T] that pass through Aw. For the purposes of this study, V (with dimensions of length) refers to the dynamically varying portion of the watershed storage (i.e., shallow groundwater, soil moisture, surface water), below which lies a constant amount of storage.
 The flow of groundwater in and out of the watershed across watershed divides was assumed to be zero (or equal) so that all groundwater considerations in equation (1) could be ignored. This does not mean that groundwater was not considered in this study, as groundwater surely contributes to streamflow and composes part of the storage term defined above.
 The approach taken to model ET, and to estimate the free parameter of the ET model (see equation (7)), is most easily explained through a thought experiment: imagine a situation in which area-averaged P, Q, and ET in equation (1) are perfectly measured over an entire watershed, and the initial value of storage, V0, is known. A time series of basin-wide storage (V) could then be found by integrating the water budget (equation (1)) over time:

$$V(t) = V_0 + \int_0^t \frac{dV}{d\tau}\,d\tau = V_0 + \int_0^t \left(P - \frac{Q}{A_w} - ET\right) d\tau \qquad (2)$$
Now imagine replacing ET with some evapotranspiration model ($\widehat{ET}$) that depends on meteorological forcing and storage (V), and replacing P and Q/Aw with field measurements. If the water balance equation is again integrated, using the storage calculated at each time step in the ET model, an estimate of storage could be obtained:

$$\hat{V}(t) = V_0 + \int_0^t \frac{d\hat{V}}{d\tau}\,d\tau = V_0 + \int_0^t \left(P - \frac{Q}{A_w} - \widehat{ET}\right) d\tau \qquad (3)$$
Equation (3) is identical to equation (2) except that area-averaged basin-wide storage (V), change in storage with time ($dV/dt$), and area-averaged evapotranspiration (ET) are replaced with model estimates ($\hat{V}$, $d\hat{V}/dt$, and $\widehat{ET}$, respectively, where the hat implies a model estimate). The resulting time series of estimated storage ($\hat{V}$) in equation (3) will approximate the actual basin-wide storage (V) from equation (2), plus some error (ε):

$$\hat{V} = V + \varepsilon \qquad (4)$$
The error of the approximation (ε) is due to a number of sources: (1) structural errors, biases, and unresolved processes in the ET model, (2) the choice of parameters in that model, and (3) measurement errors inherent in the precipitation, streamflow, and meteorological data.
 Presumably, a poor $\widehat{ET}$ will yield a poor representation of the storage. In fact, a very poor model, for example a model in which $\widehat{ET}$ is constant, will yield a very unstable time series of $\hat{V}$ resembling a random walk. For such a poor estimate of storage, the correlation with $\widehat{ET}$ and Q will presumably break down. If this is true, then a set of ET models can be judged by analyzing the statistical relationships that arise between the integrated estimate of storage ($\hat{V}$) and the measured, area-normalized streamflow (Q/Aw). Good ET models, on the other hand, will yield more accurate storage estimates, which in turn will be more highly correlated to streamflow. In summary, the proposed method consists of (1) integrating the water balance equation for an estimate of storage ($\hat{V}$) using observed precipitation and streamflow and a storage-dependent model of evapotranspiration and (2) assessing the ET model (and in doing so, estimating its parameters) by measuring the correlation of the observed streamflow with the integrated storage. This general approach could be applied to a variety of ET models of varying complexity. Below we demonstrate the approach with a simple ET model that depends only on one free parameter.
2.2. ET Model
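 In this study, Priestley-Taylor potential evapotranspiration [Priestley and Taylor, 1972] was used to capture the meteorological drivers of ET:

$$PET_{PT} = \alpha_{PT}\,\frac{\Delta}{\Delta + \gamma}\,\frac{R_n - G}{\rho_w \lambda_v} \qquad (5)$$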
In equation (5), PETPT is Priestley-Taylor potential ET [L T−1], αPT is the Priestley-Taylor factor (which, for simplicity, was assumed to be 1.26 [e.g., Eichinger et al., 1996]), Δ is the slope of the saturation vapor pressure curve with respect to temperature (kPa K−1), Rn is net radiation (W m−2), G is ground heat flux (W m−2), ρw is the density of water (assumed to be 999 kg m−3), λv is the latent heat of vaporization (J kg−1), and γ is the psychrometric constant (kPa K−1) [Dingman, 2002].
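As an illustration of equation (5), the following minimal sketch evaluates Priestley-Taylor potential ET from net radiation, ground heat flux, air temperature, and air pressure. The function name `priestley_taylor_pet` is hypothetical, and the expressions used for Δ, λv, and γ are common textbook (FAO-56 style) approximations assumed here for illustration; they are not necessarily the formulas used in this study.

```python
import math

ALPHA_PT = 1.26   # Priestley-Taylor factor, as assumed in the text
RHO_W = 999.0     # density of water (kg m^-3), as assumed in the text


def priestley_taylor_pet(rn, g, t_air_c, pressure_kpa):
    """Return Priestley-Taylor potential ET as a rate in cm per day.

    rn, g        : net radiation and ground heat flux (W m^-2)
    t_air_c      : air temperature (deg C)
    pressure_kpa : air pressure (kPa)
    """
    # ASSUMPTION: FAO-56 style approximations, not taken from the paper.
    e_sat = 0.6108 * math.exp(17.27 * t_air_c / (t_air_c + 237.3))  # kPa
    delta = 4098.0 * e_sat / (t_air_c + 237.3) ** 2                 # kPa K^-1
    lambda_v = 2.501e6 - 2361.0 * t_air_c                           # J kg^-1
    gamma = 1013.0 * pressure_kpa / (0.622 * lambda_v)              # kPa K^-1

    # Equation (5): available energy converted to a depth of water per time.
    pet_m_per_s = (ALPHA_PT * delta / (delta + gamma)
                   * (rn - g) / (RHO_W * lambda_v))
    return pet_m_per_s * 100.0 * 86400.0  # m s^-1 -> cm d^-1


# Example call for a warm, sunny half hour (values are illustrative only).
pet_rate = priestley_taylor_pet(rn=400.0, g=50.0, t_air_c=20.0, pressure_kpa=101.3)
```

For a half-hourly model step, the depth of potential ET over the step would be this daily rate divided by 48.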
 Priestley-Taylor potential evapotranspiration is an estimate of the maximum, or “potential”, evapotranspiration that can result from a given set of environmental conditions. Thus, multiplying this potential ET by an efficiency term (β) yields a prediction of evapotranspiration ($\widehat{ET}$):

$$\widehat{ET} = \beta\,PET_{PT} \qquad (6)$$
 The basin-wide ET efficiency term used in the model was based on an exponentially decaying dependence on watershed storage:

$$\beta = 1 - e^{-\alpha \hat{V}} \qquad (7)$$
In equation (7), β is the dimensionless ET efficiency factor, $\hat{V}$ is model-estimated area-averaged storage (cm), and α is an empirical free parameter (cm−1). Thus, watershed scale evapotranspiration (equation (6)), which was used in the water balance equation (equation (3)), was modeled as a function of meteorological conditions (equation (5)) and storage. In this study, α was allowed to vary from 0.005 to 0.500 cm−1, as the strength of the relationship between ET and storage in any given watershed was a priori unknown. Equation (7) is applicable only for positive $\hat{V}$. During integration of equation (3), if $\hat{V}$ dropped below zero, β was set to zero. Occasional dips of $\hat{V}$ below zero occurred only for very large values of α, and those solutions never corresponded to the best fitting solutions of the model.
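To make the role of α concrete, the sketch below evaluates an efficiency factor of the saturating exponential form β = 1 − exp(−α·V̂), which is an assumption consistent with the description above (dimensionless β, α in cm−1, β set to zero for non-positive storage). The function name `et_efficiency` is hypothetical.

```python
import math


def et_efficiency(storage_cm, alpha_per_cm):
    """ET efficiency beta for a given storage (cm) and parameter alpha (cm^-1).

    ASSUMPTION: beta = 1 - exp(-alpha * V), with beta = 0 when storage is
    not positive, following the rule stated in the text.
    """
    if storage_cm <= 0.0:
        return 0.0
    return 1.0 - math.exp(-alpha_per_cm * storage_cm)


# Efficiency at 50 cm of storage for the smallest and largest alpha tested.
beta_weak = et_efficiency(50.0, 0.005)   # ET stays well below potential
beta_strong = et_efficiency(50.0, 0.5)   # ET is essentially at potential
```

Under this form, small α keeps ET well below the potential rate even at large storage, while large α lets ET approach the potential rate at only a few centimeters of storage.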
2.3. Data Processing
 The model was run on a half-hour time step, dictated by the frequency of AmeriFlux network [Baldocchi et al., 2001] precipitation and meteorological measurements. A complete and continuous data record (i.e., no missing values) for each measured variable was necessary in order to run the model. Outliers in the AmeriFlux forcing data (i.e., Rn, G, and air temperature, Ta) were identified using a multiple regression model consisting of Fourier components with frequencies ranging from diurnal to annual. All time steps with identified outliers (beyond six standard deviations from the multiple regression model) and any time step with a missing value in the forcing data were replaced using a gap-filling model. The gap-filling method can be summarized as follows: the data from a time step with at least one missing value or outlier were replaced with a complete set of forcing data from a different time step. The time step used for replacement was determined on the basis of the following sequential criteria: (1) within 1 h on the same day (2 h period) or, if that was not possible, then (2) within the same 2 h period of the day within 4 days of the missing data or, if that was not possible, (3) within the same 2 h period of the day and within 4 days of the Julian day of the missing data in any year of the study period. In this way, all data for any time step with a missing or outlying value were replaced with a full set of similar data, which presumably preserved the natural statistical relationship of the variables. Data gaps in the AmeriFlux precipitation data were filled using nearby alternate precipitation records [see Tuttle, 2011]. USGS streamflow data (National Water Information System data available at http://waterdata.usgs.gov/nwis/sw) were provided on a daily basis, but were interpolated to a half-hour time step by assuming that the streamflow was constant throughout an entire day.
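The sequential donor search used for gap filling can be sketched as follows. This is a simplified illustration, not the study's code: the names `find_donor` and `_search` are hypothetical, records are assumed to be half-hourly and to start at midnight, and leap days are ignored when stepping between years.

```python
STEPS_PER_DAY = 48                     # half-hourly records
STEPS_PER_YEAR = 365 * STEPS_PER_DAY   # ASSUMPTION: leap days ignored


def _search(valid, i, base, max_days):
    """Find a valid step near `base`: within 1 h of the same time of day,
    looking at the base day first and then up to `max_days` days away."""
    n = len(valid)
    for d in range(max_days + 1):
        for day_shift in ((0,) if d == 0 else (-d, d)):
            target = base + day_shift * STEPS_PER_DAY
            target_day = target // STEPS_PER_DAY
            for off in (0, -1, 1, -2, 2):          # 0, +/-0.5 h, +/-1 h
                j = target + off
                if (j != i and 0 <= j < n
                        and j // STEPS_PER_DAY == target_day and valid[j]):
                    return j
    return None


def find_donor(valid, i):
    """Return the index of a complete time step to copy into step i, or None.

    valid : list of bool, True where all forcing variables are present
            and none was flagged as an outlier.
    """
    # Criteria (1) and (2): same day first, then within 4 days.
    j = _search(valid, i, i, 4)
    if j is not None:
        return j
    # Criterion (3): same window around the same day of year, any year.
    n_years = len(valid) // STEPS_PER_YEAR + 1
    for k in range(1, n_years + 1):
        for base in (i - k * STEPS_PER_YEAR, i + k * STEPS_PER_YEAR):
            j = _search(valid, i, base, 4)
            if j is not None:
                return j
    return None
```

Once a donor index is found, the full set of forcing variables at that step would be copied to the gap, so that the variables remain mutually consistent.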
2.4. Model Integration
 The model was initialized by assuming an arbitrary value for V0 (i.e., 50 cm), as the initial storage in the watershed was unknown, and by selecting a value for the free parameter, α. This V0 value was then input into the ET efficiency term (equation (7)) to obtain the efficiency factor (β). The measured meteorological forcing data from the first time step were used to determine the potential evaporation from equation (5). Using these values, an estimate of ET (i.e., $\widehat{ET}$) was obtained from equation (6), which was input into equation (3), along with measured precipitation and streamflow, to integrate the water balance and obtain an estimate of V (i.e., $\hat{V}$) at the end of the time step. This $\hat{V}$ was then used to calculate a β value for the following time step, and the above process was repeated throughout the entire study period to obtain a time series for $\hat{V}$ and $\widehat{ET}$. At the end of this first iteration, the $\hat{V}$ value from the last time step of the study period was used to initiate a subsequent iteration, so that over many iterations the $\hat{V}$ and $\widehat{ET}$ time series would reach an equilibrium that was independent of the original estimate of the initial storage, V0. This integration technique forced the estimated storage at the beginning and end of the time series to be equal, which is not necessarily true, but was a required assumption in order to obtain a result independent of our arbitrary initial storage. We started and ended the study period for each site at the same time of the year in an attempt to minimize any error resulting from this assumption. If any information about V was known, then this potential error could be reduced and the V time series could be better constrained.
 The $\hat{V}$ and $\widehat{ET}$ time series from the final, equilibrium iteration were taken to reflect the model estimation of conditions at the site. In this way, a unique solution to the watershed water balance (i.e., a time series of modeled $\hat{V}$ and $\widehat{ET}$, complementary to the measured P and Q) was determined for each value of the free parameter, α.
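The integration procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the study's implementation: the function name `integrate_storage`, the explicit (forward Euler) update, the convergence tolerance, and the saturating exponential form assumed for β are all choices made here for illustration. All fluxes are given as depths per time step (cm).

```python
import math


def integrate_storage(precip, runoff, pet, alpha, v0=50.0, tol=1e-6, max_iter=500):
    """Integrate the water balance with a storage-dependent ET model.

    precip, runoff, pet : equal-length sequences of depths per step (cm);
                          runoff is streamflow divided by watershed area
    alpha               : ET efficiency parameter (cm^-1)
    v0                  : arbitrary initial storage (cm)
    Returns (storage, et): storage at the end of each step and ET per step,
    taken from the final (equilibrium) pass through the record.
    """
    v_start = v0
    storage, et = [], []
    for _ in range(max_iter):
        v = v_start
        storage, et = [], []
        for p, q, e_pot in zip(precip, runoff, pet):
            # ASSUMPTION: beta = 1 - exp(-alpha * V), zero if V <= 0.
            beta = 1.0 - math.exp(-alpha * v) if v > 0.0 else 0.0
            et_step = beta * e_pot
            v = v + p - q - et_step      # explicit water balance update
            storage.append(v)
            et.append(et_step)
        if abs(v - v_start) < tol:       # end storage equals start storage
            break
        v_start = v                      # restart from the final storage
    return storage, et
```

Running this once per candidate value of α yields one storage series and one ET series per α, each closing the long-term water balance by construction.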
2.5. Method for Model Assessment
 Results of the model were compared to ET measured at the given AmeriFlux site (ETm). Because each time series produced by the model must satisfy the long-term water balance, the ETm time series were scaled for comparison purposes so that the long-term average of the measured ET also satisfied the long-term water balance. All half-hourly values were raised (or lowered) using a simple scalar multiplier (i.e., all values were consistently adjusted), so the scaling did not affect the diurnal, seasonal, or annual dynamics of the ETm. Instead, only the magnitude of the measured ET was altered. In this way, scaling automatically removed a “bias” term from calculated error values between model-estimated and measured ET, but did not affect correlation or variance. Scaling measured ETm in this way likely reduced issues related to precipitation undercatch (e.g., see Scott for the Kendall Grassland site) and to measurement errors in AmeriFlux ET related to energy balance closure [e.g., Wilson et al., 2002].
 Prior to any analysis of model output, the $\widehat{ET}$, ETm, $\hat{V}$, and Q/Aw time series were aggregated from half-hourly to daily averages. The daily $\widehat{ET}$ and ETm time series were then directly compared (i.e., for each day) using a root mean square error measure (RMSE). Because the model-estimated ET for the watershed was directly compared to the measured ET at the AmeriFlux site, this is hereafter referred to as the “direct” method. The $\widehat{ET}$ time series that resulted in the lowest RMSE when compared to the measured ET was assumed to be the scenario that most closely represented actual conditions within the given watershed, along with the corresponding $\hat{V}$ time series and α value. Issues regarding the representativeness of the AmeriFlux ET measurements in relation to the area-averaged watershed ET are discussed in section 4.2.
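The scaling and comparison steps of the direct method can be sketched as below. The helper names (`scale_to_water_balance`, `daily_means`, `rmse`) are hypothetical, and the sketch assumes complete half-hourly series with 48 steps per day.

```python
import math


def scale_to_water_balance(et_measured, precip, runoff):
    """Multiply measured ET by one scalar so that its long-term mean
    equals mean(P) - mean(Q/Aw)."""
    def mean(x):
        return sum(x) / len(x)
    factor = (mean(precip) - mean(runoff)) / mean(et_measured)
    return [factor * e for e in et_measured]


def daily_means(series, steps_per_day=48):
    """Aggregate a half-hourly series to daily averages (whole days only)."""
    n_days = len(series) // steps_per_day
    return [sum(series[d * steps_per_day:(d + 1) * steps_per_day]) / steps_per_day
            for d in range(n_days)]


def rmse(a, b):
    """Root mean square error between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Illustration with the Bartlett mean fluxes quoted in the site description
# (cm per day): the implied multiplier is (0.350 - 0.301) / 0.110, about 0.45.
bartlett_factor = (0.350 - 0.301) / 0.110
```

Because the multiplier is a single scalar, it changes the mean of the measured series but leaves its correlation with the modeled series unchanged.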
 The model was applied to nine watersheds across the contiguous United States. Each watershed contained at least one AmeriFlux eddy flux tower and was defined by delineating the boundary from a given USGS stream gage located at the watershed outlet. The watersheds spanned a wide range of climates and ecosystems (e.g., semiarid savannah to temperate woodland). General descriptions of each site are shown in Table 1.
Table 1. General Site Information and Characteristics
The U.S. Geological Survey (USGS) gauge listed for the Black Hills watershed was used for streamflow measurements, but the streamflow released from an upstream dam was subtracted from these data [Tuttle, 2011]. This removed the dam-affected portion of the streamflow from gauge 06410500 and effectively reduced the area of the watershed to the value shown.
 Alternate precipitation records were used to force the model for two sites. At Little Washita, U.S. Department of Agriculture (USDA) Agricultural Research Service (ARS) Micronet precipitation (available at http://ars.mesonet.org/) [Elliot et al., 1993] was averaged with AmeriFlux precipitation, and at Fermi, USGS precipitation data (http://waterdata.usgs.gov/nwis/sw) recorded at the watershed outlet gaging station (Table 1) were used in place of AmeriFlux precipitation due to the omission of a very significant rain event in the AmeriFlux record that was captured in the USGS data.
 The Duke watershed contained three AmeriFlux sites within its boundaries, each situated in a unique land cover type (hardwood forest, loblolly pine forest, and open, grassy field) in Duke University's Duke Forest. Thus, the data from these three towers were integrated into one data set using a weighted average before use in the model. First, the proportions of different land cover types in the watershed over the 5 year study period were determined using MODIS land cover data (product MCD12Q1, Land Cover Type L3 Global yearly 500 m resolution, International Geosphere-Biosphere Programme (IGBP) classification, available at the USGS Earth Resources Observation and Science (EROS), NASA Land Processes Distributed Active Archive Center (LP DAAC) online Data Pool, https://lpdaac.usgs.gov/get_data/data_pool), as shown in Table 2. Then, the portion of each of the IGBP land cover units that could be approximated by the three Duke towers was determined (e.g., it was assumed that 100 percent of the Deciduous Broadleaf Forest in the watershed could be represented by the Duke Hardwoods tower). Finally, the total area of the watershed that could be represented by each tower was separately tallied in order to find the weighted average used to combine the three data sets (bottom row of Table 2), which was then rounded to the nearest 5% because of the inaccuracy of the attribution method. As is evident in Table 2, the dominant tower contribution in the Duke watershed was found to be the Hardwood site (45%), followed closely by the Open Field site (40%), with a smaller contribution from the Loblolly Pine site (15%).
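The combination of the three Duke towers amounts to a fixed weighted average applied at every time step. A minimal sketch follows, using the rounded weights stated above; the dictionary keys and the function name `combine_towers` are hypothetical.

```python
# Rounded tower weights from the text (fractions of watershed area).
TOWER_WEIGHTS = {"hardwood": 0.45, "open_field": 0.40, "loblolly_pine": 0.15}


def combine_towers(values):
    """Weighted average of tower records.

    values : dict mapping tower name to a list of measurements,
             all lists of equal length and on the same time steps.
    """
    assert abs(sum(TOWER_WEIGHTS.values()) - 1.0) < 1e-9
    n = len(next(iter(values.values())))
    return [sum(TOWER_WEIGHTS[name] * values[name][i] for name in TOWER_WEIGHTS)
            for i in range(n)]


# Example with one time step of illustrative values.
combined = combine_towers({"hardwood": [1.0], "open_field": [2.0], "loblolly_pine": [3.0]})
```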
Table 2. Determination of the Estimated Weighted Average Contribution of Each Flux Tower to the Area-Averaged ET Flux in Duke Watershed (a)
Column headings: Land Cover Class; MODIS IGBP (Land Cover Type 1); Percent of Watershed Area; Percent of Land Cover Type
Row labels: Deciduous broadleaf forest; Urban and built up; Cropland/natural vegetation mosaic; Estimated weighted average contribution (%)
(a) MODIS, Moderate Resolution Imaging Spectroradiometer; IGBP, International Geosphere-Biosphere Programme.
 The Bartlett watershed necessitated an exception to the scaling procedure used to adjust measured ET for the long-term water balance (described in section 2.5). The mean streamflow flux for the watershed (0.301 cm d−1) nearly equaled the entire mean precipitation flux at the AmeriFlux site (0.350 cm d−1) despite a seemingly reasonable mean measured evapotranspiration flux for a northeastern, forested watershed (0.110 cm d−1). Thus, adjusting the measured ET in order to satisfy the long-term water balance would yield an extremely small flux of less than 0.05 cm d−1. However, through comparison to nearby National Oceanic and Atmospheric Administration (NOAA) National Weather Service Cooperative Station Network (COOP) stations (available from NOAA's National Climatic Data Center (NCDC) at http://www.ncdc.noaa.gov/oa/climate/stationlocator.html) and data from the Parameter-elevation Regressions on Independent Slopes Model (PRISM) (available from the PRISM Climate Group, Oregon State University, http://prism.oregonstate.edu; maps were created in June 2010 [Daly et al., 1994, 2002]), it was determined that the precipitation measured at the Bartlett AmeriFlux site likely underrepresented the mean precipitation flux of the watershed. This was probably due to the location of the AmeriFlux site at low elevation (∼272 m) within a mountainous watershed (peaks mostly from 600 to 1200 m but up to ∼1900 m). For this reason, the precipitation flux was scaled up to satisfy the long-term water balance. The measured ET was left unscaled, as only one flux could be scaled to satisfy the water balance, which is a possible source of error for this site. It is our belief that this upscaling of the precipitation record made the data more representative of area-averaged watershed conditions than if the ET were scaled down.
2.7. Indirect Calibration of ET Efficiency Parameter
 Here, the “indirect” method refers to a technique that attempts to estimate the free parameter (α) value identified by the “direct” method (i.e., RMSE between $\widehat{ET}$ and ETm) without incorporating measured ET. Thus, the indirect method seeks to eliminate the need for measured ET in order to accurately predict watershed-scale evapotranspiration from the model described by equations (1)–(7).
 According to the main hypothesis of this study (discussed in section 1), there should be a strong positive statistical relationship between the storage of a given watershed and the magnitude of streamflow exiting the watershed. But, in the model developed above, streamflow is exogenous (as is precipitation), so its effect on model-simulated storage ($\hat{V}$) is fixed. Thus, the variations between time series of $\hat{V}$ calculated for various values of α, and therefore the amount of reducible error present in the modeled storage (ε in equation (4)), are due only to changes in estimated ET ($\widehat{ET}$). Following this logic, the better that the time series of $\widehat{ET}$ approximates actual ET (which we take to be measured AmeriFlux ET) in the watershed, the better the time series of $\hat{V}$ will approximate the unknown, actual storage (V in equation (4)), and presumably the stronger the relationship will be between the model-estimated storage ($\hat{V}$) and measured streamflow (Q/Aw), according to the hypothesis above. In other words, the most accurate time series of model-estimated evapotranspiration should also result in the highest correlation between model-estimated storage and measured streamflow.
 To test this proposition, the $\hat{V}$ (calculated for each value of the ET efficiency parameter, α) and Q/Aw (constant across all model runs) time series were transformed to a normal distribution (since storage and streamflow are inherently non-Gaussian) [Nelsen, 1999], and the Pearson's product-moment correlation coefficient (r) was determined for the transformed variables. The value of α that yielded a series of $\hat{V}$ and Q/Aw with the highest r was taken to represent the situation that best satisfied the hypothesis that there should be a strong positive relationship between storage and streamflow. To the degree that the value of α selected in this manner matched the value of α that yielded the best fit (i.e., lowest RMSE) between measured and model-estimated ET, the proposition was supported.
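The indirect selection of α can be sketched as follows. The exact normal transformation used in the study is not reproduced here; this sketch assumes a rank-based normal-score transform with plotting position rank/(n + 1), and the function names (`normal_scores`, `pearson_r`, `transformed_correlation`, `pick_alpha`) are hypothetical.

```python
import math
from statistics import NormalDist


def normal_scores(x):
    """Map values to standard normal quantiles by rank (ties share a rank).

    ASSUMPTION: plotting position rank / (n + 1).
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0          # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf(r / (n + 1.0)) for r in ranks]


def pearson_r(a, b):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def transformed_correlation(storage, runoff):
    """Correlation between normal-transformed storage and streamflow."""
    return pearson_r(normal_scores(storage), normal_scores(runoff))


def pick_alpha(storage_by_alpha, runoff):
    """Return the alpha whose storage series correlates best with runoff.

    storage_by_alpha : dict mapping alpha to a daily storage series
    """
    return max(storage_by_alpha,
               key=lambda a: transformed_correlation(storage_by_alpha[a], runoff))
```

In use, `storage_by_alpha` would hold one daily-averaged storage series per candidate α, with the measured daily Q/Aw series held fixed across candidates.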
3.1. Direct Method: Comparison of Measured Versus Model-Estimated ET
 The model-estimated ET ($\widehat{ET}$) from the best-case (i.e., minimum RMSE) choice of the free parameter α agreed well with the measured ET, with RMSE values below 0.1 cm d−1 for all sites, as shown in Table 3. The minimum RMSE between model-estimated and measured ET averaged 0.067 cm d−1, with a minimum of 0.046 cm d−1 at the arid Kendall watershed and a maximum of 0.089 cm d−1 at the humid, forested Bartlett watershed. Interestingly, model-estimated ET for the Duke watershed, which utilized a weighted average of the three AmeriFlux sites (in different ecosystems) within its boundaries (see section 2.7), yielded a lower RMSE with weighted-average measured ET (0.050 cm d−1) than any of the three sites individually (0.061 to 0.077 cm d−1). This is likely because the weighted-average precipitation and meteorological data from the three different ecosystems are more representative of area-averaged conditions within the watershed than any of the three sites alone, and thus better complement the streamflow data, which leads to a better prediction of area-averaged storage and, therefore, a better prediction of ET.
Table 3. Summary of Model Results for the Direct ET Comparison and Indirect Parameter Estimation Methods
Column headings: Minimum RMSE (cm d−1); Maximum Transformed Correlation Between $\hat{V}$ and Q/Aw (r); Estimated Upper Bound for Error in Streamflow (b) (%)
(a) The α values tested in this study were 0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, and 0.5 cm−1.
(b) Errors in the streamflow records for the stream gages listed in Table 1 were estimated from USGS annual water data reports. In these reports, each daily streamflow value has been assigned a data quality label by the USGS indicating an approximate maximum error (i.e., “good” means the data are within 5% of the actual flow, “fair” indicates data are within 8% of the actual flow, and “poor” indicates that errors are greater than 8%). We assumed that poor values were within 20% of the actual flow in order to calculate an average upper bound on the streamflow error using all of the daily values from the study period. Thus, the estimated upper bound values represent the (approximate) average maximum error potentially present in each record.
(c) Annual water data reports were not available online for gauge 05590950 (Bondville watershed) before 1 October 1997, so the value shown was calculated using 3616 of the 4017 (90%) days in the study period.
(d) Annual water data reports were not available online for gauge 07327550 (Little Washita watershed) for the study period (1997–1998), so we have no sense of the error in this streamflow record.
 The agreement between the modeled and measured ET is visible both in scatter plots of the daily-averaged ET and in comparison of the time series of measured and model-estimated ET. Plots comparing ET at two sites with very different climates, Kendall and Duke, are shown in Figures 1 and 2 (Kendall) and Figures 3 and 4 (Duke). Figures 1 and 2a show measured versus model-predicted ET for the Kendall watershed (an arid desert grassland), while Figures 3 and 4 show the same comparison for the Duke watershed (a temperate, humid mix of forest and suburbs). In Figures 1 and 3, daily-averaged measured ET is plotted against daily-averaged model-estimated ET for the given site. Figures 2a and 4 show approximately 1 year segments of the ET time series, for comparison. Priestley-Taylor potential ET (PETPT) is shown as the black dotted line, which highlights the ability of the model to use basin-wide ET efficiency to estimate ET. Note that the estimated ET shown in Figures 2a and 4 is for the minimum RMSE case, not the model scenario chosen by the indirect correlation method (see section 3.2). For Kendall, the indirect correlation method chose the minimum RMSE model scenario, and thus the same ET time series shown above, but for Duke the correlation method did not choose the same scenario, and therefore predicts a slightly less accurate ET time series than that shown in Figure 4. This lack of agreement was a pattern seen primarily for humid sites, likely because the humid catchments were not as water limited as the arid sites (see section 4.1 for further discussion of this point). This is also why corresponding ET efficiency plots (similar to those shown for Kendall) are not presented for Duke, as ET efficiency changed little over the study period because of consistently high storage values.
 Generally, given its simplicity (the only inputs were measured precipitation, air temperature, air pressure, net radiation, and ground heat flux at the AmeriFlux site, and streamflow exiting the watershed), the model reproduced the pattern of the measured ET time series over a range of climates and ecosystems (e.g., Figures 1–4). In particular, the model excelled at predicting drops in the magnitude of ET due to storage dry downs (e.g., in Figure 2a), which would be completely missed by the model if not for the presence of the basin-wide ET efficiency term (β, equation (6)) or some other measure of storage, such as soil moisture. Note that periods of low model-estimated storage (in Figure 2b) result in low values of β, which, in turn, restrict the magnitude of the model-estimated ET (in Figure 2a) and cause it to fall significantly below the potential rate (i.e., the black dotted line), and vice versa. Figure 2c shows that, in addition to reproducing daily dynamics (Figure 2a), the modeled ET also agrees well with the measured ET at different levels of storage.
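The dry-down behavior described above can be illustrated with a minimal sketch of the water balance integration. The efficiency function used here, β = 1 − exp(−αV), is an assumption chosen only because it is dimensionally consistent with α in cm−1 and storage in cm; it stands in for equation (6), which is not reproduced in this section. The function name, the initial storage, and the forward Euler step are likewise hypothetical.

```python
import math


def simulate_storage_et(precip, streamflow, pet, alpha, v0=10.0, dt=1.0):
    """Integrate dV/dt = P - Q - ET forward in time, with ET = beta(V) * PET.

    ASSUMPTION: beta = 1 - exp(-alpha * V) is a placeholder for the paper's
    equation (6). Fluxes are in cm per day, storage in cm, alpha in 1/cm,
    and dt in days. v0 is a hypothetical initial storage.
    """
    storage, et_series = [], []
    v = v0
    for p, q, pet_t in zip(precip, streamflow, pet):
        # ET efficiency rises toward 1 as storage increases
        beta = 1.0 - math.exp(-alpha * max(v, 0.0))
        et = beta * pet_t
        # Explicit (forward Euler) water balance step
        v = v + (p - q - et) * dt
        storage.append(v)
        et_series.append(et)
    return storage, et_series


# Rain-free dry down: ET falls progressively below the potential rate
storage, et = simulate_storage_et([0.0] * 30, [0.0] * 30, [0.5] * 30, alpha=0.2)
```

In this sketch, low storage lowers β and therefore pulls ET below the potential rate, which is the mechanism the text credits for capturing dry downs.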
 To further assess the model results, the model-estimated storage time series from each watershed were qualitatively assessed against measured soil water content (i.e., soil moisture) from each given AmeriFlux site. An example from the Duke watershed is presented in Figure 5, where a time series of model-estimated storage is shown with a corresponding time series of measured soil moisture (in this case, a weighted average of the three AmeriFlux sites in the Duke watershed). It is unrealistic to expect that the pattern of the storage time series will match up exactly with soil moisture measurements at the AmeriFlux sites, as the storage calculated for the model is an aggregated value that includes water in the entire soil column, plus shallow groundwater, surface water, and any water present on vegetation surfaces, while the soil moisture is a point measurement in the near-surface soil (measured at less than 25 cm depth at all of the sites). The soil moisture records appear to be a bit more “flashy” than the storage time series, with many abrupt spikes due to precipitation events, as is to be expected from shallow soils, which wet and dry more rapidly than deeper soils. The storage time series show the same spikes, but they are less dramatic because of the aggregated nature of the simulated storage, and the curve is much smoother overall. However, the patterns of the model-estimated storage and soil moisture time series do show a general correspondence. This is the case for all of the sites with good soil moisture records (i.e., all except Bondville and Fermi Prairie).
3.2. Indirect Method: Correlation of Normal-Transformed Streamflow and Storage
 The maximum correlation between normal-transformed, model-estimated storage and measured streamflow averaged 0.53 across the nine watersheds (see Table 3). In Figure 6, the RMSE values from the direct method are shown in comparison to the normal-transformed correlation (r) from the indirect method for model runs with varying values of α at each of the sites.
 For two sites (Bartlett and Kendall), the indirect correlation method identified the same α value (among the 21 discrete values simulated) chosen by the direct method (i.e., RMSE, see Figure 6 and Table 3). Beyond the method's ability to pick reasonable estimates of the free parameter (α), the strong inverse relationship between the RMSE of the ET estimate and the storage-streamflow correlation coefficient (for the arid sites, in particular) supports the main hypothesis underlying the proposed parameter estimation strategy, i.e., that increasingly accurate ET models will yield increasingly accurate estimates of V, which in turn will be increasingly correlated to measured streamflow.
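The comparison between the two selection criteria can be sketched as follows. This is an illustrative helper, not the authors' code: the function name is hypothetical, and the RMSE and correlation values in the example are synthetic placeholders rather than results from Table 3.

```python
def chosen_alphas(alphas, rmse, corr):
    """Compare the direct (minimum RMSE) and indirect (maximum correlation) choices.

    alphas, rmse, and corr are equal-length lists, one entry per model run.
    Returns (alpha_direct, alpha_indirect, offset), where offset is the
    number of alpha increments separating the two choices (positive when
    the correlation method picks a higher alpha).
    """
    indices = range(len(alphas))
    i_direct = min(indices, key=lambda i: rmse[i])    # best fit to measured ET
    i_indirect = max(indices, key=lambda i: corr[i])  # best storage-streamflow correlation
    return alphas[i_direct], alphas[i_indirect], i_indirect - i_direct


# Synthetic example on a hypothetical grid of 21 alpha values
alphas = [0.025 * k for k in range(1, 22)]
rmse = [0.080 + 0.001 * abs(i - 1) for i in range(21)]
corr = [0.50 - 0.01 * abs(i - 3) for i in range(21)]
alpha_direct, alpha_indirect, offset = chosen_alphas(alphas, rmse, corr)
```

An offset of zero corresponds to the perfect agreement reported for Bartlett and Kendall.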
 The indirect method also matched the direct method (within one α increment) for the Flagstaff and Little Washita sites, and was close for Audubon (but with the maximum correlation shifted three α values lower than the minimum RMSE). However, the correlation method overpredicted the RMSE-chosen α for the remaining four sites (Black Hills, Bondville, Duke, and Fermi). In all four cases, the RMSE method displayed little sensitivity to α (at low α values). This weak sensitivity is plausible, as all four of the watersheds have specific properties that could reduce the coupling of the watershed storage to ET efficiency. For instance, the Black Hills and Duke watersheds are dominated by forest, whose trees might be able to access deep reservoirs of soil moisture or groundwater during drought periods, thereby causing ET to remain at higher than expected levels during dry downs. Similarly, the two remaining watersheds could show inconsistent behavior because of significant human impact (e.g., cropping practices in the Bondville watershed, suburban practices such as watering lawns in the Fermi watershed).
 The performance of the storage-streamflow correlation method undoubtedly depends on the quality of the streamflow record, as streamflow errors have the potential to decrease the correlation with storage or shift the maximum correlation to a different value of the free parameter (α). For this reason, estimated upper bounds for error in streamflow are presented in Table 3 for each watershed, as determined from USGS records (see footnote b in Table 3 for explanation of error calculations).
 Additional analyses using base flow (separated from the USGS streamflow time series using the methodology of Eckhardt [2005]) in place of USGS streamflow were carried out to see if this improved the agreement between the RMSE and correlation methods. Base flow improved the correlation at six of the nine sites, but did not change the selected α values enough to account for the discrepancy at any of the four sites where the correlation method failed (average absolute change of 1.2 α increments). Overall, agreement between the RMSE and correlation methods increased for four sites, but decreased for two. In our opinion, this slight benefit was not sufficient to outweigh the somewhat subjective nature of streamflow separation compared to measured streamflow, but base flow could be used in place of streamflow in the correlation method if desired.
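For readers unfamiliar with the base flow separation mentioned above, a minimal sketch of the two-parameter recursive digital filter of Eckhardt [2005] is given here. The parameter values used in this study are not stated in this section, so the defaults below (recession constant a = 0.98 and maximum base flow index of 0.80) and the initialization of the first base flow value are assumptions for illustration only.

```python
def eckhardt_baseflow(streamflow, a=0.98, bfi_max=0.80):
    """Separate base flow with the Eckhardt (2005) recursive digital filter.

    a is the recession constant and bfi_max the maximum base flow index.
    ASSUMPTION: the default parameter values are commonly quoted ones, not
    those of the study, and the first base flow value is initialized to
    bfi_max * streamflow[0].
    """
    if not streamflow:
        return []
    base = [bfi_max * streamflow[0]]
    for q in streamflow[1:]:
        b = ((1.0 - bfi_max) * a * base[-1] + (1.0 - a) * bfi_max * q) / (1.0 - a * bfi_max)
        # Base flow can never exceed total streamflow
        base.append(min(b, q))
    return base


# Hypothetical daily streamflow with one storm peak
base = eckhardt_baseflow([1.0, 1.0, 5.0, 3.0, 2.0, 1.5, 1.2, 1.0])
```

The choice of a and bfi_max is the "somewhat subjective" element noted in the text, since different values yield different base flow series from the same streamflow record.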
 The performance of the indirect method can be summarized by plotting the RMSE value corresponding to the correlation-chosen α inside a box and whisker plot of the RMSE values obtained for each of the 21 α values (Figure 7). For most of the sites the correlation-chosen RMSE was below the median RMSE value, and at four sites it was at or near the minimum RMSE. These results suggest that the correlation method was successful in identifying a time series of ET similar to the optimal choice (as indicated by RMSE) in four cases, and came close in two others (Audubon and Bondville). The only exceptions to this trend are Duke (which chose the median value), Black Hills, and Fermi. In these cases, the storage-streamflow correlation method erred by choosing a parameter value that led to more sensitivity of ET to water availability (i.e., higher α), whereas the minimum RMSE resulted from an α value that indicates less sensitivity.
4.1. Model Assessment
 In general, the best-case results of the model (i.e., the direct, RMSE method) captured the behavior of evapotranspiration measured at the AmeriFlux sites (as indicated by the RMSE values and visual comparisons such as Figures 2, 4, and 5), which supports the hypothesis of a strong dependence of ET and Q on watershed-averaged storage. This makes intuitive sense, as water should be more available for evapotranspiration and increase the flow into streams (higher water tables, more soil moisture, etc.) in wetter watersheds [see Walter et al., 2004; Troch et al., 1993], while drier watersheds should behave in the opposite fashion. However, it is surprising that a single area-averaged value of basin-wide storage could be so successful at ET estimation, considering the significant heterogeneity of environments within each watershed (e.g., riparian zones, uplands). Surely, the degree to which the model-estimated ET matches the measured ET depends on the representativeness of the given AmeriFlux site compared to the rest of the watershed (discussed in section 4.2), along with the accuracy of data measurements, but the results of the model suggest that the dependence of ET on storage allows ET to be successfully modeled at the watershed scale.
 The indirect method, based on the correlation of model-estimated storage and measured streamflow, was successful in identifying the free parameter values chosen by the RMSE in two of the nine cases, and was close for three other sites (Little Washita, Flagstaff, and Audubon). This success at these sites can be best explained in the context of the hypothesis that the storage of a watershed should be highly correlated to the streamflow leaving the watershed. Because of the setup of the model, the errors in measured streamflow and precipitation (and all meteorological variables) are exogenous, so the only parameter-dependent errors in model-estimated storage (ε in equation (4)) across different model runs result from integrating errors in estimated ET into the storage time series. Presumably, the measured streamflow has a high correlation with the actual storage of the watershed, but the integrated errors have the effect of shifting the estimated storage away from the actual storage of the watershed, thereby reducing the correlation with streamflow.
 This concept is illustrated in Figure 8, which shows model-estimated storage plotted against measured streamflow at the Little Washita watershed for the model scenarios that resulted in the lowest RMSE (0.080 cm d−1, α = 0.050 cm−1) and the highest RMSE (0.136 cm d−1, α = 0.500 cm−1) among all model runs, respectively. To better illustrate the relation, streamflow is plotted on a natural log axis. (Note, however, that all correlations reported in Table 3 and Figures 6 and 7 are based on first transforming storage and streamflow to have normal (i.e., Gaussian) distributions, as is commonly done in Gaussian copula models [Nelsen, 1999].) In Figures 8a and 8b, the streamflow values are the same (as they are fixed), but the storage values can shift along the x axis depending on the integrated errors in the estimated storage time series. Clearly, the estimated storage time series in Figure 8a resulted in a better correlation than that in Figure 8b, and thus better satisfied the hypothesis that storage and streamflow should be highly correlated. Presumably, because the estimated storage in the scenario from Figure 8a was also closer to the actual storage, it resulted in a better estimation of ET than in the scenario from Figure 8b, as indicated by the lower RMSE. Thus, the model scenario that leads to the highest correlation between model-estimated storage and measured streamflow should also be the case that yields the best prediction of the actual storage, and therefore also the best estimation of evapotranspiration.
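The normal-transformed correlation described above can be sketched with a rank-based normal score transform followed by an ordinary Pearson correlation. The exact transform used in the study is not specified in this section; the plotting position rank/(n + 1) and the handling of ties are assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map each value to a standard normal quantile based on its rank.

    ASSUMPTION: the plotting position rank / (n + 1) is used, and tied
    values keep their input order (sorted() is stable).
    """
    n = len(values)
    standard_normal = NormalDist()
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = standard_normal.inv_cdf(rank / (n + 1))
    return scores


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def normal_score_correlation(storage, streamflow):
    """Correlation between storage and streamflow after normal transformation."""
    return pearson(normal_scores(storage), normal_scores(streamflow))
```

Because the transform depends only on ranks, a monotonic but strongly nonlinear storage-streamflow relation (such as the log-linear pattern suggested by Figure 8) is not penalized the way it would be by a correlation on the raw values.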
 However, the indirect, correlation method did not agree with the direct, RMSE method in all cases. In particular, it appears that the correlation method agreed well with the direct, RMSE method at more arid sites, while it failed to agree at more humid (or energy-limited) sites. (A notable exception is Bartlett, which received the highest rainfall, but had perfect agreement between the correlation and RMSE methods.) This trend is likely related to the key assumption of this model that ET should be sensitive to storage. At arid sites, there is undoubtedly a strong coupling between water availability and ET (i.e., the ET is water limited), while at humid sites this coupling seems to be weaker, or at least inconsistent (i.e., the sites are more energy limited). Thus, when the watershed is humid, we speculate that the basic assumption behind the model may not be satisfied, and the two methods may fail to agree. However, it is important to point out that the RMSE at the sites where the correlation method failed (i.e., Black Hills, Bondville, Duke, and Fermi) is not very sensitive to α at low values of α, so even a poor choice of α will result in a time series of modeled ET with error similar to the best-case time series.
 The model presented above employed a half-hourly time step and involved interpolation of streamflow data to half-hourly values, but the model output and forcing data were then aggregated to daily values before analysis. To assess the impact of the model time step or streamflow disaggregation on model performance, we also tested the model at a half-hourly (without daily aggregation) and a daily time step. While we found very slight benefits of integrating at the half-hourly time step (related to nonlinearity of the ET efficiency term during integration) and then aggregating to a daily time step for analysis, there was very little difference in the model results between the technique presented above and integration using a daily time step (average difference in RMSE of less than 0.001 cm d−1). Additionally, the correlation method was more or less unaffected by time step or streamflow interpolation (the chosen α shifted by one increment across the different methods at three sites, Bondville, Flagstaff, and Little Washita, and did not change at any other site). However, for a few sites the α value chosen by the minimum RMSE at the half-hourly time step was quite different from that chosen by the daily RMSE. This is likely due to large outliers in the half-hourly data resulting in large increases in RMSE, whereas the effect of these outliers is smoothed out during daily averaging. These results imply that the model can be employed irrespective of time step (between the half-hour and daily time scale), but that analysis should take place at the daily time step for best agreement between the RMSE and correlation methods.
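The effect of daily averaging on outliers can be seen in a small synthetic example. The helper names and the numbers below are hypothetical and serve only to show why a single half-hourly outlier inflates the half-hourly RMSE far more than the daily RMSE.

```python
def daily_means(half_hourly, steps_per_day=48):
    """Average a half-hourly series into daily values (complete days only)."""
    n_days = len(half_hourly) // steps_per_day
    return [
        sum(half_hourly[d * steps_per_day:(d + 1) * steps_per_day]) / steps_per_day
        for d in range(n_days)
    ]


def rmse(model, obs):
    """Root-mean-square error between two equal-length series."""
    return (sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs)) ** 0.5


# Two synthetic days of constant ET, with one spurious half-hourly measurement
model = [0.1] * 96
measured = [0.1] * 96
measured[10] = 2.0  # hypothetical outlier

rmse_half_hourly = rmse(model, measured)
rmse_daily = rmse(daily_means(model), daily_means(measured))
```

In this example the daily RMSE is several times smaller than the half-hourly RMSE, because the outlier is diluted across the 48 values of its day before the error is computed.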
 As mentioned above, a major consideration when assessing the model results in this study is that of representativeness. Validation of the model is dependent on the assumption that flux measurements at a given site are representative of the watershed as a whole. In other words, it is assumed that the eddy covariance measurement of evapotranspiration from the tower footprint (which varies in time) and point measurement of precipitation at the AmeriFlux site, and (area normalized) streamflow at the stream gauge, are equivalent to the actual area-averaged values for the given watershed. However, this is not necessarily true, and must be taken into account when interpreting the results of the model. Generally, if the watershed is very heterogeneous, then the representativeness of a single station within that watershed must be called into question. The sites in this study whose representativeness may be suspect are Black Hills (streamflow impact from an upstream dam, which was removed from the data [see Tuttle, 2011]), Bondville (agricultural watershed, yearly crop rotation at the site of the AmeriFlux tower), and Flagstaff Unmanaged Forest (large elevation drop from headwaters to stream gauge, multiple ecosystems, possible withdrawals from streamflow for irrigation). Additionally, the USGS annual water data reports for the stream gauges (in Table 1) noted some sort of anthropogenic impact (e.g., water withdrawals upstream of the gauge) at three other sites (Duke, Kendall, and Little Washita).
 To further address the issue of representativeness, we compared the precipitation forcing at four sites (Audubon, Duke, Fermi, and Little Washita) to precipitation from the North American Land Data Assimilation System (NLDAS) [Mitchell et al., 2004], which is a merged precipitation estimate based on gauge and radar data (and corrected for orographic effects using the PRISM method [Daly et al., 1994]) provided for the contiguous United States at a 0.125° latitude/longitude grid scale. In this way, we were able to construct a watershed average time series of NLDAS precipitation for each site by taking a weighted average of all grid cells that fell (completely or partly) within each given watershed. The average daily RMSE between the AmeriFlux and NLDAS precipitation time series for the four sites ranged from 0.276 cm d−1 (Duke) to 0.359 cm d−1 (Little Washita), and the difference between NLDAS and AmeriFlux mean daily precipitation ranged from +0.037 cm d−1 (Fermi) to −0.010 cm d−1 (Audubon). The use of NLDAS precipitation in the model did not result in better agreement between the α values chosen by the ET RMSE and storage-streamflow correlation methods, with an increase in agreement at only one of the four tested sites (Fermi, one α increment) but decreases in agreement at two others (Audubon and Duke, one and two α increments, respectively) and no change at the fourth (Little Washita). The above findings suggest that there were small differences in the magnitude and variation of the watershed average precipitation compared to the precipitation recorded at the AmeriFlux sites, but that the AmeriFlux data were relatively representative of the watershed average (or, at a minimum, that the model results are relatively insensitive to small representativeness “errors” in forcing).
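The watershed averaging step described above can be sketched as a weighted mean over grid cells. The weighting scheme is not specified in this section; weighting each cell by its area of overlap with the watershed is an assumption, and the function name and example values are hypothetical.

```python
def watershed_mean_precip(cell_series, overlap_areas):
    """Weighted average of gridded precipitation over a watershed.

    cell_series: one precipitation time series (list) per grid cell.
    overlap_areas: ASSUMED weights, the area of each cell lying inside
    the watershed (any consistent unit), in the same order as cell_series.
    """
    total_area = sum(overlap_areas)
    n_steps = len(cell_series[0])
    return [
        sum(w * series[t] for w, series in zip(overlap_areas, cell_series)) / total_area
        for t in range(n_steps)
    ]


# Two hypothetical grid cells; the first covers three times more of the watershed
basin_precip = watershed_mean_precip(
    cell_series=[[1.0, 2.0], [5.0, 6.0]],
    overlap_areas=[3.0, 1.0],
)
```

The resulting series can then be compared with the tower precipitation record through the mean difference and the RMSE, as reported in the text.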
 In summary, we recognize that it would be preferable to force this watershed-scale model with direct measurements of watershed average precipitation and meteorological forcing, and validate it against direct measurements of watershed average evapotranspiration, but these data do not currently exist for the nine sites in this study. The goal of this research was to attempt to fill this data gap by developing a model capable of estimating watershed scale ET. Rather than use indirect, model-dependent measurements (e.g., remote sensing) or model output to force and validate our model, we chose to use the sparser but more accurate direct measurements and assume the data were representative of the watershed average, and our additional analyses for four sites suggest that this assumption was reasonable.
4.3. Implications of Error in Measured Forcing and Validation Data
 The NLDAS precipitation data described in section 4.2 were used to test the sensitivity of the model results to error in the precipitation forcing. By comparing the results of the model forced with NLDAS precipitation to those of the model forced with AmeriFlux precipitation, we effectively created a proxy for error in the precipitation forcing. The difference in ET RMSE between the NLDAS-forced model and the AmeriFlux-forced model ranged from −0.006 cm d−1 (Audubon) to +0.015 cm d−1 (Fermi). However, these RMSE differences translated into very small differences in the minimum RMSE-chosen α values (average absolute change of only one α increment across the four tested sites, with a maximum of two α increments at the Fermi site). Similarly, the storage-streamflow correlation showed modest differences between the NLDAS-forced and AmeriFlux-forced model results, with a range of −0.03 (Duke) to +0.04 (Little Washita). However, the differences in the correlation-chosen α were slightly larger than those of the RMSE-chosen α, with an average absolute change of 1.75 α increments between the AmeriFlux-forced and NLDAS-forced models (and a maximum of 3 at the Duke and Little Washita sites). These larger, but still modest, α differences are likely due to errors in precipitation becoming integrated into the estimated storage time series and affecting the relationship between estimated storage and streamflow. In summary, the direct, RMSE method is not very sensitive to errors in precipitation forcing, but the indirect, storage-streamflow correlation method is slightly more susceptible to these errors. This finding also necessitates consideration of errors in streamflow. If errors in estimated storage could affect the correlation method, then so could errors in streamflow. We have estimated upper bounds for the error in streamflow for each watershed in Table 3 from USGS error estimates. It is possible that streamflow errors could partly explain the poor performance of the correlation method, because some of the sites with the highest streamflow errors (Table 3) also showed the worst agreement between the RMSE- and correlation-chosen α.
 It is also important to consider errors in the data used to validate the model. For example, there is a pending correction for the eddy covariance data at the Bondville site, which can be on the order of 10% (M. Heuer, personal communication, 2012). The typical error in latent heat (LE) measurements at AmeriFlux sites is on the order of 5%–20% (20–50 W m−2) for half-hourly values [Foken, 2008]. If it is assumed that these measurement errors are independent, the daily average errors in LE should have a standard deviation of about 3–7 W m−2. Converted to centimeters per day, the standard deviation becomes 0.01–0.02 cm d−1. Because the daily averaged, measured ET time series are used to determine the RMSE for each site, RMSE differences of less than 0.01–0.02 cm d−1 are not very meaningful. In fact, this error value may be a lower bound, as the errors in latent heat measurements are not statistically independent throughout the day, and annual averages can still result in energy balance closure errors of around 10% [Wilson et al., 2002]. This implies that the error does not decrease as rapidly as it would for independent and unbiased errors. This inherent measurement error is important to note, as the range of RMSE values across all α values approaches this error for some sites (e.g., approximately 0.02 cm d−1 for Audubon, 0.04 cm d−1 for Black Hills and Kendall). In light of this inherent uncertainty, the results of the indirect correlation method appear even more promising, as none of the differences between the minimum RMSE and the RMSE value for the model scenario with the highest correlation (Table 3) were greater than 0.02 cm d−1.
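The arithmetic behind the daily error estimate above can be reproduced with a short sketch. It assumes 48 independent half-hourly errors per day, a latent heat of vaporization of 2.45 × 10^6 J kg−1, and a water density of 1000 kg m−3; the two physical constants are assumed typical values rather than figures taken from the study.

```python
import math

LATENT_HEAT = 2.45e6     # J kg-1, ASSUMED latent heat of vaporization (about 20 degrees C)
WATER_DENSITY = 1000.0   # kg m-3, ASSUMED
SECONDS_PER_DAY = 86400
STEPS_PER_DAY = 48       # half-hourly measurements


def daily_error_std(half_hourly_std):
    """Standard deviation of a daily mean of independent half-hourly errors (W m-2)."""
    return half_hourly_std / math.sqrt(STEPS_PER_DAY)


def latent_heat_to_cm_per_day(le):
    """Convert a latent heat flux in W m-2 to a water depth flux in cm per day."""
    metres_per_day = le * SECONDS_PER_DAY / (LATENT_HEAT * WATER_DENSITY)
    return metres_per_day * 100.0


# Half-hourly error bounds quoted in the text: 20 and 50 W m-2
low = latent_heat_to_cm_per_day(daily_error_std(20.0))
high = latent_heat_to_cm_per_day(daily_error_std(50.0))
```

Under these assumptions the daily standard deviations come out at roughly 2.9 and 7.2 W m−2, or about 0.010 and 0.025 cm d−1, in line with the rounded values quoted above.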
 In this study, a simple model was created to estimate evapotranspiration (ET) at the watershed scale using a basic water budget and a hypothesized relationship between a single area-averaged value of basin-wide water storage (V) and the ET efficiency. Physically based equations were used to estimate ET, which allowed the water budget to be integrated forward in time to obtain estimated time series of basin-wide storage and ET. In this way, the model was partly physical, in that equations based on physical processes were used to calculate ET, but also partly empirical, as the model sought to optimize one unknown free parameter (α) that was presumably unique to each watershed and only determinable through model iteration. An effort was made to retain model simplicity by maximizing the use of measured variables (i.e., precipitation, streamflow, and meteorological forcing used to estimate ET), while keeping unknown parameters to a minimum. In general, the model did a good job of capturing the behavior of evapotranspiration measured at the AmeriFlux sites, which suggests that a strong dependence of ET on watershed storage is valid in many cases.
 The true benefit of the model developed herein lies in the fact that measurements of ET are not necessary for the model to function, which is extremely beneficial because long-term, high-quality records of ET are scarce. In order to take advantage of this aspect of the model, an indirect (i.e., no comparison to measured ET) free parameter selection method was developed that utilizes a hypothesized strong positive relationship between watershed storage and streamflow. The results of this method correspond well with the direct (RMSE) method for arid sites, but do not agree as well for humid sites. However, the RMSE at humid sites is not very sensitive to changes in α, so even poor selections of α may lead to reasonable time series of estimated ET. Our preliminary results suggest that this method could be confidently used to predict large-scale ET for any arid area with sufficient precipitation, meteorological, and streamflow data, which could be useful for climate and water resources modeling of ET fluxes, as models often employ grid cells that are close to watershed scale. However, more care should be taken when applying this model to humid environments.
 We thank the AmeriFlux principal investigators for the sites in this study, whose work made this research possible: Tilden Meyers (Audubon Research Ranch, Black Hills, Bondville, Little Washita); Andrew Richardson, David Hollinger, and Scott Ollinger (Bartlett Forest); Mark Heuer and Timothy Wilson (Bondville); Gabriel Katul (Duke Forest, Fermi Prairie); Ram Oren, A. Chris Oishi, David Ellsworth, Kim Novick, and Paul Stoy (Duke Forest); Roser Matamala, David Cook, Miquel Gonzalez-Meler, Julie Jastrow, and Barry Lesht (Fermi Prairie); Tom Kolb and Sabrina Dore (Flagstaff Unmanaged Forest); and Russell Scott (Kendall Grassland). We also thank the AmeriFlux network, the U.S. Geological Survey (USGS) National Water Information System (NWIS), the U.S. Department of Agriculture (USDA) Agricultural Research Service (ARS) Grazinglands Research Laboratory (GRL) Micronet, the NOAA National Climatic Data Center (NCDC), the NOAA Atmospheric Turbulence and Diffusion Division (ATDD) Surface Energy Balance Network (SEBN), the PRISM Climate Group at Oregon State University, and the USGS Earth Resources Observation and Science (EROS) Center, NASA Land Processes Distributed Active Archive Center (LP DAAC) for granting free online data access.