A method for uncertainty constraint of catchment discharge and phosphorus load estimates

River discharge and nutrient measurements are subject to aleatory and epistemic uncertainties. In this study, we present a novel method for estimating these uncertainties in colocated discharge and phosphorus (P) measurements. The “voting point”‐based method constrains the derived stage‐discharge rating curve both on the fit to available gaugings and to the catchment water balance. This helps reduce the uncertainty beyond the range of available gaugings and during out of bank situations. In the example presented here, for the top 5% of flows, uncertainties are shown to be 139% using a traditional power law fit, compared with 40% when using our updated “voting point” method. Furthermore, the method is extended to in situ and lab analysed nutrient concentration data pairings, with lower uncertainties (81%) shown for high concentrations (top 5%) than when a traditional regression is applied (102%). Overall, for both discharge and nutrient data, the method presented goes some way to accounting for epistemic uncertainties associated with nonstationary physical characteristics of the monitoring site.


| INTRODUCTION
Effective evaluation of process-based water quality models requires an understanding of the uncertainties in the observational data used in calibration processes, as estimates of catchment discharge and nutrient loads are affected by significant uncertainties (Beven, Buytaert, & Smith, 2012;Coxon et al., 2015;Harmel, Cooper, Slade, Haney, & Arnold, 2006;Harmel, Smith, King, & Slade, 2009;Johnes, 2007;McMillan, Krueger, & Freer, 2012;Westerberg, Guerrero, Seibert, Beven, & Halldin, 2011). In some cases, the uncertainties may be such that for some events the observational data may not be useful for model calibration and evaluation (Beven & Smith, 2015;Beven, Smith, & Wood, 2011;. Continuous river discharge measurements are often obtained by observing the river water level (stage), and then using a fitted curve (rating curve) to convert these to an estimated discharge. A rating curve is a model of the relationship between observed stage and discharge for a gauging site. As a result, uncertainties in the resultant discharge data can come from errors in river stage measurement, errors in the gauged discharges, lack of gauged data over parts of the curve (particularly the higher end), uncertainties that arise from the fitting of the rating curve itself (e.g., structural error in the model used), and changes in the rating curve over time (McMillan & Westerberg, 2015). When considering water quality data, nutrient loads are calculated for a specific period (typically daily) using river discharge along with measurements of concentrations of the nutrient of interest (e.g., phosphorus [P]). Therefore, uncertainties in load estimations arise from uncertainties in the discharge estimates and in the sampling and measurement of determinand concentrations in addition to their aggregation to the temporal and spatial scales of interest (McMillan et al., 2012).
In hydrology, all important uncertainties can be considered to be epistemic in nature: that is, they arise from a lack of knowledge of the key underlying processes (e.g., Beven, 2016;Nearing et al., 2016). However, we may choose to treat some uncertainties as aleatory (i.e., they arise from simple random variability). Typically, measurement errors in variables such as stage or nutrient concentrations are treated as aleatory variables. In contrast, epistemic uncertainties can include changes to the river cross section, vegetation growth, the effect of sampling and analysis protocols on concentration measurements, and the choice of a functional form for the rating curve; all of which can affect subsequent estimates of discharges and nutrient loads.
Many previous studies have attempted to estimate the uncertainties in both discharge and water quality measurements using a wide range of techniques (Harmel et al., 2009;Johnes, 2007;Moatar & Meybeck, 2005;Webb et al., 1997). For discharge, it is common to fit a statistical regression model to the available stage and discharge gaugings, which allows a statistical estimate of uncertainty in the rating curve. Simple power law or polynomial functions have often been applied, or multisegment functions where the rating curve appears to show a complex shape (e.g., Herschy, 1999). There are, however, alternatives, including drawing on fuzzy regression and fuzzy set concepts (Blazkova & Beven, 2009;Krueger et al., 2010;Pappenberger et al., 2006) and nonparametric regression techniques (e.g., LOWESS, Cleveland, 1979;Coxon et al., 2015), which have been employed on stage-discharge measurements and water quality variables to estimate the uncertainties in discharge and nutrient concentrations and load calculations (Lloyd, Freer, Johnes, Coxon, & Collins, 2016). A further method, focused on uncertainty in the rating curve, has been suggested by McMillan and Westerberg (2015). Their voting point method allowed for situations where channel form and velocities might change over time so that many candidate rating curves might be plausible.
In this study, we extend the voting point method to use water balance data to constrain rating curve uncertainties and also apply the voting point method in the estimation of continuous nutrient concentrations and loads. Further to this, we also place our results in context with other uncertainty techniques by comparing with estimates from using more traditional methods (e.g., fitting a power law function to the available observations).

| Study area
Newby Beck is a small headwater subcatchment located in the River Eden basin in the North West of England, in the United Kingdom.
The catchment is approximately 12.5 km 2 in size with an average elevation of 234 m above sea level. The discharge measurement site for this catchment is a rated section of channel, with water level data collected (with a Schlumberger Water Services (SWS) Mini-Diver) at 15-min intervals. As shown in Figure 1, the cross-sectional area of the rated section of the channel changes significantly at higher water levels, which could contribute to the epistemic uncertainties associated with the discharge produced by the rating curve.
Fourteen discharge measurements were available to develop a site specific rating curve. In addition, a high frequency bankside monitoring station was situated at the outlet, which recorded nitrate (NO 3 ), total P (TP), and total reactive P (TRP) at 30-min intervals (Outram et al., 2014).
The TP and TRP measurements were conducted using a Hach Lange combined Sigmatax sampling module and Phosphax Sigma analyser (Perks et al., 2015). Rainfall over the catchment was recorded at 15-FIGURE 1 Cross section geometry of Newby Beck outlet. The black dots show the heights of each of the 14 available gaugings. The grey dashed lines show the channel cross section at the top and bottom of the gauged range, and the black dashed line highlights the change in channel cross section at high flows well beyond the gauged range min intervals by three tipping bucket rain gauges, and daily rainfall data were obtained from a gauge in the centre of the catchment from the Met Office Integrated Data Archive System network (Met Office, 2012). Other meteorological data were provided by an automatic weather station, which was located towards the centre of the catchment. For this study, the data were analysed over three hydrological years (2011-2012, 2012-2013, and 2013-2014). 2.2 | Constraining uncertainty at high flows using water balance information As applied by McMillan and Westerberg (2015) rating curves are typically fitted using power laws or segmented power laws. Typically, when presented with limited numbers of river gaugings (in this application only 14 in-bank flow measurements were available), the majority of the data falls at the lower flows, and therefore, extrapolation is required beyond the end of the gauged range. With the power law, this extrapolation often introduces large uncertainties, particularly for overbank flow. At the Newby Beck outlet, very few of the gaugings are at the higher flow values (Figure 1). To constrain the extrapolation beyond the gauged range in such cases, we have incorporated the Velocity Area Rating Extension (VARE) model of Ewen, Geris, O'Donnell, Mayes, and O'Connell (2010). The advantage of the VARE method is that local knowledge of the gauging site, such as river cross-sectional area ( Figure 1) and water balance estimates, can be used to constrain this extrapolation beyond the gauged range by imposing a maximum velocity that can be achieved in the river channel. In VARE, a sigmoidal function (G, Equation 2) that varies between two limits (the maximum and minimum stream velocities, ν max and ν min ) is used to determine the average velocity in the stream for a given stage measurement: y−y min y max −y min ; (1) where y is the measured stage, y min is the minimum stage, y max is the maximum stage, α and β are parameters that control the sigmoidal function, and v is the VARE calculated velocity. The velocity can then be used with the cross-sectional area of the stream at stage y to determine the discharge (Ewen et al., 2010). In this case, we assume that the velocity is zero at the bottom of the channel (therefore giving us the values of y min and v min ), and thus need to calibrate the remaining four parameters of the VARE model (α, β, y max , and v max ). Furthermore, the VARE model can be calibrated over a long period (in this case, three hydrological years), such that a water balance is approximately satisfied, allowing for uncertainty in both rainfall and actual evapotranspiration estimates (see below). To explore the rating curve uncertainty, a Monte Carlo analysis was run using the VARE rating curve model; 2,000 parameter sets of the four VARE parameters were sampled using random uniform sampling and evaluated using an extended voting point method.

| The extended voting point method
In the voting point method of McMillan and Westerberg (2015), randomly generated rating curves are assessed using an informal likelihood measure based on how close each curve falls to each of the available discharge-water level pairs. A logistic function was used to account for error in the gauging measurements, although they suggest this can be replaced with a function of the modeller's choosing. In this study, we replace the logistic function with a triangular fuzzy weighting measure, which uses Equations 4 and 5 to calculate a normalized score (Score(g) in Equation [4]) and weight (W(g) in Equation [5]) at each of the 14 available gauging points (see Figure 2).
Here, Ŷ g is the rating curve estimated value of discharge; y g is the gauged discharge value; y min,g is the lower limit of error (see below); and y max,g is the upper limit of error for a given gauging point. This therefore gives a score of zero for a value at the best estimate of the observed value, −1 at the lower limit and +1 at the upper limit. If the normalized score lies between −1 and +1 for a given gauging, a triangular fuzzy weight (W(g), Figure 2) is calculated for the gauging point g in Equation 5. Here, L lwr and L upr are the lower and upper limits of the normalized scores required to consider the simulated values acceptable across all the gauged points (in this case −1 and 1); and N is a shaping factor (set to 1 in this case).
The method requires that the limits y min, g and y max. g can be specified for each gauging point. This information is not usually available for single gauging points but typical uncertainties of ±10% for in-bank flows have been determined, for example, by Schmidt and Yen (2008) : where W(g) is the weight at gauging g, and w vp is the voting point weighting based on the number of gaugings the candidate curve passes through. The voting point weighting is calculated as follows: where h and q are the gauged stage and flow values; hfit and qfit are the subsets that are intersected by the candidate curve. As discussed by McMillan and Westerberg (2015), w vp is a weighting based on the space (area) of gauging points that the candidate curve spans. This is to avoid situations where the distribution of gaugings is highly skewed (in this case, towards lower stages), which can lead to divergence from the gauging points, particularly at the top and bottom ends of the stage range.
In addition to this, a further constraint was imposed on each of the 2,000 candidate curves, in that the modelled mass balance was required to fall within a particular tolerance of the observed value as calculated from the rainfall and weather station data.
If a candidate rating curve was classed as behavioural (likelihoods > 0) using both the mass balance and the voting point criteria described above, an overall likelihood weighting for each behavioural candidate curve was calculated as follows: where L VARE is the overall weighting, L vp is the likelihood measure calculated for the voting point fit to gaugings, L mb is the likelihood measure for the mass balance criteria, and C is a scaling factor such that the sum of likelihoods scales to unity in each case. If either evaluation measure returned a likelihood of zero (L vp or L mb ), then according to Equation 9, the overall likelihood (L VARE ) is also zero and the candidate curve is classed as nonbehavioural and plays no further role in the analysis. Once a set of behavioural models has been obtained, prediction quantiles can be formulated at a given point on the curve (i) as follows: Here, P is the prediction quantile for Ẑ i (the simulated value of variable Z at point i using candidate curve M (Θ j )) being less than z; L is the likelihood weighting associated with candidate curve M (Θ j ); Θ j is the jth parameter set; and N is the number of candidates accepted as behavioural. We then define the 95 percent prediction uncertainty The unique combinations of the behavioural discharge and TP concentration time series from the voting point analysis were then used to determine TP loads using Equation 11: where Q i is the discharge at time i, C i is the concentration, and n is the number of measurement time steps in a day. Any day with missing data was excluded from the model evaluation. As with discharge, prediction quantiles were calculated at each time step, with the combined final likelihood weight of each behavioural model determined as follows: where L load is the overall likelihood of each TP load time series, L VARE is the likelihood weighting of each behavioural parameter set from the rating curve uncertainty analysis, and L conc is the likelihood weighting from the concentration uncertainty analysis. C is a scaling factor, such that the sum of likelihoods scale to unity in each case. As with discharge, the upper and lower limits of uncertainty on the resultant load time series were based around the 2.5 and 97.5 percentiles, respectively (95PPU limits). The 50th percentile (median) was defined as the best estimate of the observed in-stream load.
3 | RESULTS Figure 3 shows the uncertainty limits calculated for discharge ( Figure 3a) and TP concentrations (Figure 3b). Overall, the uncertainty interval (based on 95% prediction quantiles) on the discharge measurements was, on average, 70% throughout the duration of the The solid dark grey line shows a standard power law fitted with regression and the grey shading shows the 95% prediction intervals from the regression analysis. (b) Rating between total phosphorus (TP) concentration as measured using the bank-side analyser and corresponding samples analysed in the lab. The solid line shows the best fit to the lab analysed data, and the dashed lines show the 95% prediction bounds. The black dots show the pairs of TP concentrations from the analyser and the lab. The solid dark grey line shows a standard power law fitted with regression and the grey shading shows the 95% prediction intervals from the regression analysis calibration period, with a range of 21-140%. The higher relative uncertainty intervals were seen in the low flow periods (here defined as the lowest 5% of discharges, which equates to values <0.032 m 3 s −1 ), where they were on average 128%. However, this equated to a mean absolute uncertainty interval of 0.032 m 3 s −1 .
In contrast, the high flow periods (here defined as the highest 5% of discharges which equates to values greater than 1.22 m 3 s −1 ) had much smaller relative uncertainty intervals, on average 40%.
This range is much larger compared with those determined during a recent study on 500 UK catchments (Coxon et al., 2015), which showed that the majority of catchments had 20-40% relative uncertainty intervals, though the maximum uncertainty of 140% determined for Newby Beck here is much lower than the maximum value of 397% quoted by Coxon et al. (2015).
Figure 3a also shows a comparison with a rating curve generated for this catchment using the traditional power law (fitted using regression). The power law (solid grey line in Figure 3a) gives much higher values at the high end of the rating than when the water balance constraint is imposed for the VARE method (solid black line in Figure 3a).
Furthermore, outside the range of the available gaugings, the uncertainty (95% prediction intervals from the regression) in the rating curve (grey shading) is much larger than the curve generated from the voting  5-6, 2015) for the voting point method (a, c, and e) and the power law method (b, d, and f). The black line shows the median value, and the grey shading shows the 95% uncertainty limits derived for both methods. Note the difference in scale on the y axis for the power law method 5%). Overall, these intervals are larger than those reported by McMillan et al. (2012), who provided a summary of uncertainties in water quality data showing relative errors of up to 150% on TP loads and concentrations. However, recent work by Lloyd et al. (2016), employing the use of a bank-side analyser similar to that used at Newby Beck, resulted in uncertainties of up to 83% on the estimation of TP loads when compared with laboratory analysed data. Figure 3b shows the relationship between the bank-side analyser and laboratory TP concentrations as predicted using a power law fitted using standard regression (solid grey line shows fit, and grey shading shows 95% prediction intervals). As with the discharge, the uncertainty intervals at the higher concentrations (top 5%) were much greater using the regression (102%) than with the voting point method  (Figure 4b). Therefore, the use of the VARE and voting method allows the modeller to constrain the uncertainty using local knowledge of the catchment.
Furthermore, VARE allows the hydrologist to impose a (uncertain) mass balance constraint on the evaluation of candidate rating curves using available weather data over a long period (three hydrological years in this application to Newby Beck). This, therefore, ensures that the rating curve model is consistent with the catchment water balance (see Beven and Smith (2015), for example, where this is not the case in another catchment). However, it is acknowledged here that the uncertainty in the mass balance calculation is dependent on the accuracy in the available weather data and consequent precipitation and evapotranspiration estimates on which to perform the analysis.
The advantage of the VARE method in the voting point framework is that the weighting imposed on the overall likelihood of a candidate model can be stronger towards either the fit to the gaugings or the mass balance (e.g., a multiplier can be added to each likelihood in Equation 9 when calculating the overall likelihood for a candidate curve, L VARE ). The weighting towards either constraint can be split evenly or allowed to give preference to one of the measures depending on the model user, knowledge of the catchment, the available data to calculate mass balance, and the nature of the application the model user wants to use the model for.
As we have demonstrated in this work (Figures 3 and 4), the downside of using the power law method to fit rating curves is often the lack of available gaugings during high flow periods. Therefore, when extrapolating the curve beyond the gauged range, there is the potential for overestimation at the higher end of the curve There are other epistemic uncertainties that can lead to nonstationarities in rating curves that are not always obvious. For example, during a flood there can be changes in the physical cross section of the channel due to erosion or sediment build up (Lang, Pobanz, Renard, Renouf, & Sauquet, 2010). This can alter the stage-discharge relationship from any single calibrated curve. Using the voting point method in combination with the VARE approach aims to reduce this uncertainty by assuming that each of our 14 gaugings are representative of a given rating curve at the time of measurement. Therefore, our condition of any candidate curve only needing to hit one gauging to be classed as behavioural aims to account for any potential variation in the rating curve with time.
We also present an extension of the voting point method to account for uncertainties in our P observations and the translation of these errors through to the estimates of daily P loads. As most water quality models typically work on a mass balance basis, the focus is on uncertainties in the observed load data. As load data are calculated using the combination of discharge and concentration, the errors in both measurements must be accounted for.
Therefore, the error in the load measurement (for this particular dataset) will be a combination of rating curve uncertainty, procedural and instrument error in the measurement of nutrient concentrations (in this case P), and cross-sectional variation. Previous methods to estimate load uncertainty (Johnes, 2007;Lloyd et al., 2016) provide some estimation of this combination of errors. However, the discharge error is based on the aforementioned power law rating curve fitted using methods such as LOWESS. Therefore, these methods are susceptible to the issues of extrapolation beyond the range of the gauging data. Our application of the VARE method to estimate the discharge component of the load calculation accounts for this issue as discussed above.
For the concentration errors, we have employed similar methods to those used previously, whereby the bank-side analyser data are compared with those generated in a lab, to check for inconsistencies in the measurements. However, the previous methods tend to quantify the relationship between these data using a regression analysis or LOWESS that requires a fit to all data pairings. As with discharge data, epistemic errors in nutrient data can arise due to changes in the monitoring equipment, such as instrument drift in the bank-side analyser data over time. Therefore, the relationship between laboratory data (which is often generated infrequently, such as with gauging data) and the in situ data may shift. Therefore, to account for these epistemic errors, we utilized the voting point method to estimate the uncertainty in our bank-side analyser data, assuming the lab data were the best estimate of the true measurement.  (Lloyd et al., 2016;McMillan et al., 2012). However, we tend to show higher relative uncertainties towards the lower end of the range.
Again, when the stream went out of bank during Storm Desmond As the computational cost of running this procedure is relatively cheap, and as more gauging information or additional data regarding the characteristics of the catchment become available, the rating curve information or the empirical relationship between the lab and in situ P measurements can be updated easily. This will allow further constraints on the estimation of uncertainties in the discharge, nutrient concentrations, and estimated loads. These uncertainties can then be used as limits of acceptability in the evaluation of water quality models as demonstrated by Hollaway et al. (2018).