Characterizing Catchment‐Scale Nitrogen Legacies and Constraining Their Uncertainties

Improving nitrogen (N) status in European water bodies is a pressing issue. N levels depend not only on current but also past N inputs to the landscape, that have accumulated through time in legacy stores (e.g., soil, groundwater). Catchment‐scale N models, that are commonly used to investigate in‐stream N levels, rarely examine the magnitude and dynamics of legacy components. This study aims to gain a better understanding of the long‐term fate of the N inputs and its uncertainties, using a legacy‐driven N model (ELEMeNT) in Germany's largest national river basin (Weser; 38,450 km2) over the period 1960–2015. We estimate the nine model parameters based on a progressive constraining strategy, to assess the value of different observational data sets. We demonstrate that beyond in‐stream N loading, soil N content and in‐stream N concentration allow to reduce the equifinality in model parameterizations. We find that more than 50% of the N surplus denitrifies (1480–2210 kg ha−1) and the stream export amounts to around 18% (410–640 kg ha−1), leaving behind as much as around 230–780 kg ha−1 of N in the (soil) source zone and 10–105 kg ha−1 in the subsurface. A sensitivity analysis reveals the importance of different factors affecting the residual uncertainties in simulated N legacies, namely hydrologic travel time, denitrification rates, a coefficient characterizing the protection of organic N in source zone and N surplus input. Our study calls for proper consideration of uncertainties in N legacy characterization, and discusses possible avenues to further reduce the equifinality in water quality modeling.

example, in the US (Byrnes et al., 2020), in Europe (Erisman et al., 2011) and in China (X. Wang et al., 2014). Furthermore, wastewater can be released to the environment with or without treatment, which constitutes a point source of N (United Nations, 2017). In Europe, the efficiency of wastewater treatment plants has largely improved since 1990 (European Commission, 2020). Nevertheless water bodies still receive large point N loads, and for many countries, further enhancement of treatment facilities are prescribed (European Commission, 2020;Svirejeva-Hopkins et al., 2011).
This excess of N in the environment has substantial consequences for both the hydrosphere and the atmosphere (Robertson & Groffman, 2015). In fact, soluble N compounds, and more specifically nitrate, are polluting drinking water, which threatens human health (WHO, 2016). They are also causing eutrophication of receiving water bodies, such as European seas (EEA, 2019b) and coastal areas in the US (Scavia & Bricker, 2006). In addition, denitrification in the terrestrial system, which microbially reduces nitrate, can release large amount of nitrous oxide (N 2 O; Tian et al., 2018). The latter is an important greenhouse gas almost 300 times as potent as CO 2 (Myhre et al., 2013), and contributes to the depletion of stratospheric ozone (Portmann et al., 2012). Therefore, reducing nitrate levels through improved land management strategies is a pressing issue, in particular in Europe where countries such as France, Germany and Greece have been recently fined by the European Court of Justice for exceeding the regulatory limits for nitrate in receiving water bodies (Damania et al., 2019). Sporadic evidence shows that N inputs to the soil can not only reach surface water bodies or be lost to the atmosphere via denitrification, but they can also accumulate in the landscape as legacy N (Chen et al., 2018;Van Meter & Basu, 2015). A few studies including long-term measurements of soil N content in the Mississippi River Basin (Van Meter et al., 2016), fertilization experiments in the UK and in France (Jenkinson, 1991;Sebilo et al., 2013) and mass balance assessments (Smil, 1999;Worrall et al., 2015) suggest that part of the N surpluses can build up in the root zone of agricultural soils in organic forms (biogeochemical legacy). Furthermore, increasing nitrate concentration trends are detected in groundwater in some locations in the US and the UK (Puckett et al., 2011;Stuart et al., 2007). The pervasive issue of nitrate pollution in groundwater, in particular in Europe (EEA, 2018;Sundermann et al., 2020), and the results of modeling studies focusing on N accumulation in the vadose zone (Ascott et al., 2017;L. Wang et al., 2012) points to a widespread buildup of N in dissolved inorganic forms (hydrologic legacy). Biogeochemical and hydrologic legacies can impact the future water quality status of water bodies. They can induce a delay or "time lag" between changes in land management practices and the water quality response (Grimvall et al., 2000;Vero et al., 2018). The implications of the two legacy forms for future water quality could be very different because hydrologic legacy corresponds to the accumulation of reactive and mobile N, and biogeochemical legacy to the accumulation of much more stable (organic) N compounds. Therefore, a characterization of both types of N legacies at catchment scale is crucial to inform the design of effective water quality measures.
However, our understanding of the magnitude of these legacies and their associated timescale remains limited because of a lack of observational data and the uncertainties associated with N legacy modeling. Direct observational data of N content in soil and groundwater are often sparse in time and space. This makes it difficult to capture the temporal changes and spatial variability in N content and to determine the integrated legacy behavior at catchment scale. In particular, large-scale data sets of soil N content are often available for 1 year (e.g., Ballabio et al., 2019). Regarding groundwater N concentration measurements, in Europe, low-frequency (typically annual) measurements and the fact that they are mostly available for the recent years impedes the analysis of the long-term dynamics. Additionally, assessing the spatial distribution of groundwater concentration from point measurements involves large uncertainties (see e.g., Figure 6 in Knoll et al., 2020). Therefore, mechanistic models should be used to quantify N legacies, to complement the information provided by currently available observation data of N content in soil and groundwater that are insufficient.
In this respect, some past modeling studies investigate the time lag between the trajectory of N inputs to the terrestrial system and the in-stream N levels using empirical approaches, for example, the studies of Chen et al. (2014), Dupas et al. (2020), Ehrhardt et al. (2021) and  (further details are reported in the review of Chen et al., 2018). Yet, such modeling approaches are typically based on lumped transfer functions that relate the N inputs to the N stream export at catchment scale and they do not explicitly disentangle the role of biogeochemical and hydrologic legacies. Only few studies, all of them focusing on North America, explicitly consider and examine both types of N legacies at catchment scale using mechanistic models, namely a modified version of the Soil Water Assessment Tool model (SWAT-LAG, Ilampooranan et al., 2019), the NOAA's 10.1029/2021WR031587 3 of 32 Geophysical Fluid Dynamics Laboratory Land Model (LM3-TAN, Lee et al., 2016), and the more parsimonious Exploration of Long-tErM Nutrient Trajectories model (ELEMeNT, Chang et al., 2021;J. Liu et al., 2021;, 2018. Importantly, the investigation of N legacies through modeling approaches is fraught with large uncertainties. First, the N input data have large uncertainties because their construction involves numerous uncertain factors, as reported in Byrnes et al. (2020), Häußermann et al. (2020), Hong et al. (2012), and Poisvert et al. (2017) for diffuse N sources and in Grizzetti et al. (2008), Morée et al. (2013), and  for point sources (wastewater). Notably, although the application of mineral fertilizer is a key N input to agricultural soils, data may only be available at the national level for some countries such as Germany, and spatial disaggregation strategies have to be developed to estimate fertilizer application at finer spatial resolutions (Häußermann et al., 2020). Second, the modeling of N legacies suffers from a lack of process understanding, for example, regarding the immobilization and accumulation of N into organic matter in soils (see e.g., Bingham & Cotrufo, 2016;Yansheng et al., 2020). Third, the parameters of mechanistic N models are generally estimated through calibration (Moriasi, Zeckoski, et al., 2015), because they are often conceptual parameters that cannot be directly related to measurable quantities. Parameter values are typically constrained using in-stream measurements only (Moriasi, Zeckoski, et al., 2015), since observational data of internal model variables, such as N content in soil and groundwater, are generally lacking. It is well known that different parameter sets can fit the in-stream data equally well, as discussed in previous water quality studies, for example, Ford et al. (2017), Husic et al. (2019), Rankinen et al. (2006), and Wade et al. (2008). Because of this issue of equifinality (Beven, 2006), it may be possible to identify a range of plausible values for the model internal fluxes and states corresponding to the different plausible parameter sets. Hence, simulated N legacies (internal model states) might be poorly characterized if in-stream information only is used to constrain model parameterizations. All of the discussed uncertainties in N legacy modeling are exacerbated by the fact that the N accumulation in the landscape needs to be modeled over long timescales (decades to centuries) to understand the contemporary and future water quality status. In fact, long time series of input data are fraught with uncertainties and long records of observations of model outputs, such as in-stream N concentrations, are rarely available.
Our review of the literature suggests that (1) we lack understanding of the magnitude and timescale of N legacies at catchment scale in light of the associated uncertainties and (2) it remains unclear whether and to what extent the in-stream information only, as typically used to calibrate catchment-scale N water quality models, is able to constrain the simulated N legacies and which additional information would mostly help to reduce uncertainty. In this study, we address these gaps by investigating the long-term fate of the N inputs to the landscape and its uncertainties. In particular, we analyze the uncertainties in simulated N biogeochemical and hydrologic legacies due to the uncertainties in the model parameters and the input data, and we determine the value of different types of (observational) data to constrain the modeling results.
To this end, first, we introduce a multicriteria approach based on soft rules to constrain the model parameters, which allows assessment of parameter uncertainty and of the value of the different observational data available (in-stream N loading and concentration data, and soil N data). Second, we perform a sensitivity analysis to determine the factors responsible for the (residual) uncertainty in the simulated N legacies to prioritize future efforts for uncertainty reduction and model improvement. We apply the ELEMeNT model, which is a parsimonious N model that explicitly accounts for both biogeochemical and hydrologic legacies . While past modeling studies focus on North America (Chang et al., 2021;Ilampooranan et al., 2019;Lee et al., 2016;J. Liu et al., 2021;, 2018, here we extend the analyses of N legacies to the European context. We examine the N legacy behavior over the last six decades  in the Weser river basin (WRB), which is Germany's largest national river basin. Through the application of the soft rules and the sensitivity analysis, we infer the dominant factors affecting the uncertainty in simulated N legacies and discuss the implications for future modeling studies, data monitoring, and water and land management strategies.

Model Description
The ELEMeNT model  simulates the fate of the diffuse N sources (soil N surplus) and point sources at annual timescale and it computes the annual in-stream nitrate-N loading and concentration at the catchment outlet. The model assumes that dissolved N is present in the form of nitrate (NO 3 ). ELEMeNT conceptualizes the N processes in two terrestrial compartments, a shallower compartment referred to as "source zone", which represents the soil root zone (assumed to have a depth of 1 m), and a deeper compartment referred to as "subsurface zone", which includes both the unsaturated zone below the source zone and the groundwater. The land use categories considered are cropland, agricultural permanent grassland, and non-agricultural land (see Section 3.2.1). ELEMeNT requires as input the annual time series of the different land use fractions, the N surplus for each land use category, the stream discharge at the catchment outlet, and the N point sources. We revise the formulation of the subsurface submodel and add an in-stream submodel as detailed in Sections 2.2 and 2.3. The model version used in this study counts nine calibration parameters (Table 1).

Source Zone Submodel
Following , the model assesses the dynamics of the organic N pool, accounting for the biogeochemical legacy, and of the inorganic N pool. The organic N pool is divided into two different stores: (1) a protected N store with a slow turnover, that accounts for the physical stabilization of N, and (2) an active organic N store, that consists of organic N which is more labile and prone to mineralization. The organic N stores receive the total N surplus. Therefore mineralization in the model (the simulated N mass flux from the organic to the inorganic pool) represents the effective N mass flux to the inorganic N pool. This flux results from the combined effect of the transformation of organic N into mineral forms and the immobilization of mineral N inputs into the organic matter. Immobilization is understood to be an important process in both agricultural soils (Blankenau et al., 2001;Haag & Kaupenjohann, 2001;X. Liu et al., 2016;A. J. Macdonald et al., 1989;Sebilo et al., 2013;Yansheng et al., 2020) and forested soils (Castellano et al., 2012;Lewis & Kaye, 2012;Morier et al., 2008;Spoelstra et al., 2001). However, these process interactions are yet not well understood (Bingham & Cotrufo, 2016;Yansheng et al., 2020) and therefore difficult to represent explicitly in a mechanistic model. Since mineralization is parameterized as a first order process, part of the N surplus is quickly transferred to the inorganic N store and can therefore be readily available for leaching. A protection coefficient determines the fraction of the N surplus that is added to the protected store. Transfer of N mass from the protected to the active N stores occurs in case of land use conversion from uncultivated land to cropland. After mineralization occurs, the inorganic N in the source zone can either denitrify or leach to the subsurface.

Subsurface Submodel
In the subsurface, the transport of dissolved inorganic N is represented using a travel time distribution to account for hydrologic legacy, while N can denitrify following a first order process . In this study, we revise the formulation for the subsurface compartment. Compared to the previous model formulation , the travel time distribution function is an explicit function of time t Botter et al., 2010;Queloz et al., 2015). The N-NO 3 export from the subsurface to the stream stream sub ( ) (kg ha −1 yr −1 ) at time t is therefore written as: where T (yr) is the travel time, sub s (kg ha −1 yr −1 ) is the N-NO 3 mass leaching from the source zone to the subsurface, p(T, t) (−) is the travel time distribution function and λ sub (yr −1 ) is the rate constant of denitrification in the subsurface. As in Van Meter et al. (2017), we assume complete mixing (or random sampling) in the subsurface compartment and we adopt an exponential distribution for the travel time (Equation 41 in Botter et al., 2010). The mean value of the distribution μ′(t) (yr) is given as: where Q out (t) (mm yr −1 ) is the discharge at the catchment outlet, out (mm yr −1 ) is the arithmetic mean of the discharge, and μ sub (yr) is the harmonic mean of μ′(t). To account for the fact that the mean discharge may not be stationary, we compute out at each time step as the 30-year backward moving average of the discharge. We refer to Text S1-S3 in Supporting Information S1 for further details on the mathematical derivation and numerical integration of the new equations introduced in this study.

In-Stream Submodel
In-stream N removal, which comprises the processes of denitrification and biotic assimilation described for example, in Basu et al. (2011) and Dehaspe et al. (2021), was implicitly accounted for in . The removal of N from point sources was lumped with the removal due to wastewater treatment, and the removal of N from diffuse sources was lumped with the denitrification in the terrestrial compartments. In this study, we need to represent in-stream removal explicitly, since our point N sources input, described in Section 3.2.4, already includes the N removal through wastewater treatment. The in-stream NO 3 -N load J out (t) (kg ha −1 yr −1 ) is computed as: where R (−) is the annual fraction of in-stream N removal and J ps (t) (kg ha −1 yr −1 ) is the N loading from point sources. For our application at catchment scale and annual timescale, we thereby assume that in-stream N removal can be represented by a first order process, as documented for example, in Basu et al. (2011).

Description of the Study Catchments
We apply the ELEMeNT model to the Weser river basin (WRB) in Germany, focusing on the region located upstream of the Hemelingen station (see Figures 1a and 1b). This covers an area of around 38,450 km 2 , which corresponds to almost 11% of the total area of Germany. The Weser river largely contributes to the total N load discharging into the North Sea, where eutrophication is a major issue (Arle et al., 2017). In the WRB, a priority goal set by the WRB Commission (FGG Weser, 2020) is to reduce N inputs to the landscape to achieve in-stream N concentrations below the regulatory threshold of 2.8 mg L −1 (OGewV, 2016). For our analyses, we selected eight stations, including Hemelingen, that have a long record (between 26 and 37 years) of 14-day average in-stream nitrate measurements, that were constructed by mixing daily samples. These stations are situated on the Weser, Werra, Fulda and Aller rivers and their location is reported in Figure 1b.

of 32
The WRB is characterized by a humid temperate climate with an annual mean precipitation around 780 mm yr −1 (spatial range: 600-1100 mm yr −1 ) and a mean aridity index (ratio of potential evapotranspiration to precipitation) around 0.9 (S. Yang et al., 2019;Zink et al., 2017). The average annual discharge at Hemelingen over the period 1950-2015 amounts to 268 mm yr −1 . Agriculture is the dominant land use type and constitutes 45% of the catchment area in 2015 (35% cropland and 10% agricultural permanent grassland). Other important land uses are forested land and other vegetated land (including natural grassland and urban green areas), which cover 34% and 19% of the catchment area in 2015, respectively. Spatially and temporally averaged annual N surplus is estimated to be equal to 48.6 kg ha −1 yr −1 over the period 2011-2015, and N surplus takes a higher value over agricultural areas (70.9 kg ha −1 yr −1 ) compared to non-agricultural areas (30.3 kg ha −1 yr −1 ). On average in-stream N-NO 3 concentration (C out ) value at the catchment outlet is equal to 3.7 mg L −1 over the period 2011-2015, which is above the target value of 2.8 mg L −1 (OGewV, 2016).
The eight nested subcatchments analyzed in this study, present some moderate differences in their characteristics, as indicated in Figures 1c-1j. In particular, the percentage of agricultural areas ranges from 38% to 48%, with lower values in the southern (upstream) part, which is situated in a mountainous area, compared to the northern (downstream) part, which lies in the German Northern Plain. We notice a sharp decrease in N surplus in 1990 for the Letzter Heller subcatchment (Figure 1j). This can be explained by the fact that a large area of the subcatchment is situated within the state of Thuringia, which was part of the former German Democratic Republic, and where agricultural activities were profoundly disrupted following the German reunification in 1990. The in-stream concentration C out ranges from 3.0 mg L −1 at the Wahnhausen station (Figure 1i), which is located in the upstream part, to 4.1 mg L −1 at the downstream Drakenburg and Porta stations (Figures 1d and 1f).

Data Description and Processing
In this section, we present the data sets adopted to run ELEMeNT and the data sets of N content observations used to assess the performance of the model and estimate its parameters (as explained in Section 4).

Land Use Data
We construct the 1800-2015 trajectories of the catchment-scale fractions of the three land use categories required by ELEMeNT, namely cropland, agricultural permanent grassland and non-agricultural land. The latter land cover includes forest, other vegetated land (such as natural grassland, green urban areas), built-up areas and non-vegetated land. We combine data from the gridded Corine Land Cover data set (CLC; EEA, 2019a), the gridded History Database of the Global Environment data set (HYDE;Klein Goldewijk et al., 2011, 2017 and census data on agricultural areas available from the Federal Statistical Office (Statistisches Bundesamt, 2021) and from the yearly statistical books for Germany (Digizeitschriften, 2021). Text S4 in Supporting Information S1 provides details on the methodology used to construct the land use data. The trajectories of the land use fractions for the different subcatchments are presented in Figure S1 of Supporting Information S1.

N Surplus for Agricultural Areas
We adopt the N surplus data set for agricultural areas of Häußermann et al. (2020), which is available for the period 1995-2015 at the county level, and the data set of Behrendt et al. (2003), which is available for the period 1950-1998 at the state level, and which builds on the studies of Bach and Frede (1998) and Behrendt et al. (2000). The components included in the two data sets are the N content in the input of mineral fertilizer, manure, other organic fertilizer such as sewage sludge, seeds and planting material (for the county level data set only), atmospheric deposition, biological fixation by legumes, as well as the N content in harvested crops. We refer to Häußermann et al. (2020) and Behrendt et al. (2003) for further methodological details. We harmonize the two N surplus data sets similar to Ehrhardt et al. (2021), to construct the N surplus trajectories at county level for the period 1950-2015 (see Text S5 in Supporting Information S1 for further details). The resulting trajectories of the N surplus for agricultural areas are shown in Figure S2 of Supporting Information S1.

N Surplus for Non-Agricultural Areas
Following , for non-agricultural areas, we consider two components for the N surplus, namely atmospheric N deposition and biological N fixation, and we neglect any net accumulation of N in the vegetation. We quantify atmospheric N deposition using the data set produced from Community Atmosphere Model with Chemistry (CAM-chem, Lamarque et al., 2012) simulations, as part of the National Center for Atmospheric Research (NCAR) Chemistry-Climate Model Initiative (CCMI, Tilmes et al., 2016) N deposition data set. This product is part of the input data sets for the Model Intercomparison Projects (input4MIPS) and it is a forcing data set for the Coupled Model Intercomparison Project Phase 6 (CMIP6, Eyring et al., 2016). The data are provided over a 1.9° × 2.5° grid for the period 1850-2014. We estimate biological N fixation using the mean annual rate reported by Cleveland et al. (1999) for natural temperate forest (16 kg ha −1 yr −1 ) and for natural grassland (2.7 kg ha −1 yr −1 ). We use the latter for our land use category "other vegetated land" defined in Section 3.2.1. The trajectories of atmospheric N deposition and biological N fixation are shown in Figure S3 of Supporting Information S1.

N Point Sources
For N point sources, we use observations of N loading, available for 802 wastewater treatment plants (WWTPs) and 1 year in the period 2012-2016 depending on the plants (Büttner, 2020;S. Yang et al., 2019). Data for the larger WWTPs (with population equivalent over 2,000) come from the Environmental Agency database (EEA, 2015) and correspond to the year 2012, while data for the smaller WWTPs (with population equivalent under 2,000) come from the authorities of the federal German states and correspond to the year 2015 or 2016. For the past years, we estimate N point sources from wastewater from the methodology proposed by Morée et al. (2013). We utilize data on population count (HYDE data set), protein supply (FAO, 1951(FAO, , 2021a(FAO, , 2021b, and population connection to sewer and WWTPs (Eurostat, 2016(Eurostat, , 2021Seeger, 1999). We combine these data and create an ensemble of historical N loading from WWTPs over the period 1950-2015. The ensemble reflect the uncertainty in the characteristics of different parameters, such as the fraction of protein supply lost in the food supply chain, the ratio of industrial to domestic N emissions or the efficiency of wastewater treatment. Text S6 in Supporting Information S1 (as well as Figure S4 and Tables S1-S2 in Supporting Information S1) details the underlying procedure for the N point sources construction. A visual depiction of the N point sources with uncertainty is provided in Figure S5 of Supporting Information S1.

Stream Discharge
To run the ELEMeNT model, we require annual discharge at the outlet of the subcatchments for the period 1800-2015 and, in addition, we need daily discharge for the recent period to process the measurements of in-stream N-NO 3 concentration, as described in Section 3.2.6. Discharge data is constructed by combining (1) daily discharge measurements (at the catchment outlet or at a nearby measuring station) obtained from the GRDC (Global Runoff Data Centre, 2021) or the WRB Commission (FGG Weser, 2021) databases (see Table S3 in Supporting Information S1), and (2) bias-corrected simulations from the mesoscale Hydrologic Model (mHM, Kumar et al., 2013;Samaniego et al., 2010), to fill the missing values in the observation data set. Two sets of mHM simulations are used, medium-term daily simulations for the period 1950-2015 , and long-term annual simulated values for the period 1800-1949 (Hanel et al., 2018). The mHM simulations capture the variability of observed discharge reasonably well, with values of the Nash-Sutcliffe efficiency always higher than 0.64 and values of the coefficient of determination always higher than 0.66 (see Table S4 in Supporting Information S1). Figures S6-S13 in Supporting Information S1 represent the annual time series of the discharge measurements and the simulations before and after we apply the bias-correction. We refer to Zink et al. (2017) and Hanel et al. (2018) for details on the mHM setups.

N Content in the Source Zone
We derive N content in the topsoil (0-20 cm) from the Land Use and Cover Area frame statistical Survey (LUCAS; Ballabio et al., 2016Ballabio et al., , 2019. The LUCAS data set was created from the spatial interpolation of approximately 22,000 surveyed points across Europe. For most countries, including Germany, soil samples were collected in 2009. To estimate the catchment-scale soil N content (0-100 cm, source zone of ELEMeNT), we combine the N content and bulk density of the topsoil from the LUCAS data set. We also use the ratio of total soil N content (0-100 cm) to topsoil N content (0-20 cm), which we estimate to be between 2.5 and 4 from Batjes (1996), thus obtaining a plausible range for the total soil N content (0-100 cm). Our estimated ranges of the soil N content and further details are reported in Table S5 of Supporting Information S1.

In-Stream N Concentration
In-stream nitrate concentration is obtained from the WRB Commission (FGG Weser, 2021). For the Letzter Heller catchment, we combine the concentration measurements at the Letzter Heller station available for the period 1979-2002 and at the Witzenhausen station, which is located 8 km upstream, for the period 2003-2015. The data consists of 14-day average N-NO 3 concentration measurements, that were constructed by mixing daily samples, and that start for most stations in the early 1980s and end around 2015. We only fill the gaps that have a length of maximum 42 days, and therefore concentration data for years containing longer gaps are discarded. Annual total in-stream loading is calculated as the sum of the 14-day average loading values, while annual average concentration is estimated as the discharge weighted average of the 14-day average concentration values. The in-stream N-NO 3 concentration and loading data are reported in Figures S14-S15 of Supporting Information S1. We also examine the uncertainty in the observations and check for outlier values. We find that the concentration value at the Letzter Heller station in 1990, which is equal to 6.3 mg L −1 , is abnormally high. The difference to the average value (4.5 mg L −1 ) amounts to 2.8 times the standard deviation over the period 1985-1995 (see Figure  S14 in Supporting Information S1). This anomalous concentration value could be explained in the context of the German reunification in 1990, where unusual and undocumented N amounts could have been discharged into the stream. This is not reflected in our N input data sets. Tracking the cause of this anomaly is beyond the scope of this study, and therefore, we discard this value for our analyses.

Multicriteria Parameter Estimation Strategy
Our parameter estimation strategy considers the performance of the ELEMeNT model in simulating three different variables, namely the total source zone N content (M s , which includes the organic protected, organic active, and inorganic N stores), and the in-stream N-NO 3 loading (J out ) and concentration (C out ) at the catchment outlet. Regarding M s , the simulated values are constrained within the range derived from the observations, since the source zone N content is only provided for the year 2009. For J out and C out , we use three performance metrics that are the Pearson correlation coefficient denoted as ρ (−), the relative bias denoted as RBIAS (−) and the variability error denoted as STD err (−). RBIAS and STD err are defined as follows: where μ obs and μ sim are the average of the observations and simulations, respectively, and σ sim and σ obs are the standard deviation of the simulations and observations, respectively. There averages and standard deviations are calculated for each subcatchment over the years where observations are available. The three metrics ρ, RBIAS and STD err measure how well the dynamics (temporal pattern and timing), the mean and the variability of the observations respectively are captured by the simulations. They constitute the three components of the Kling-Gupta efficiency (KGE; Gupta et al., 2009) defined as: We consider these three metrics separately instead of the aggregated KGE measure to ensure a sufficient performance with regards to all important aspects that we aim to simulate, as discussed in Martinez and Gupta (2010). , we use "soft rules" to identify the set of well-performing ("behavioral") simulations. We define seven soft rules, all of which have to be satisfied in the behavioral simulation ensemble: 1. for J out : |RBIAS| ≤ 0.2 2. for J out : |STD err | ≤ 0.25 3. for J out : ρ ≥ 0.8 4. for C out : |RBIAS| ≤ 0.2 5. for C out : |STD err | ≤ 0.25 6. for C out : ρ ≥ 0.6 7. the simulated M s is within the range derived from the observations The order of the rules allows us to assess to what extent the use of in-stream N concentration and source zone N content data can help to reduce the simulation uncertainties, beyond the use of in-stream N loading data. Some studies only examine the in-stream N loading (e.g., Chang et al., 2021;J. Liu et al., 2021;, 2018 and not the in-stream N concentration that tends to be more difficult to simulate than the in-stream N loading (Husic et al., 2019). In addition, as discussed in Section 1, previous studies generally considered in-stream variables only for calibration. The threshold values for RBIAS and ρ introduced in rules 1, 3, 4 and 6 correspond to "satisfactory" or "good" model performance in reproducing nutrient dynamics according to Moriasi, Gitau, et al. (2015). We however note that Moriasi, Gitau, et al. (2015) examine values of ρ at monthly and not annual timescale, due to a lack of studies analyzing the model goodness-of-fit for annual simulations of nutrients. Here, we set a stricter threshold value on ρ for J out (rule 3) compared to C out (rule 6), since the dynamics of J out are driven by the stream discharge and are therefore easier to reproduce then the dynamics of C out , as further discussed in Sections 5.1.1-5.1.2. Due to a lack of analysis of STD err in previous studies, we consider that a threshold value equal to ±0.25 is reasonable (rules 2 and 5).
To estimate the nine model parameters, we generate a parameter sample of size 100,000 from the ranges reported in Table 1 utilizing latin hypercube sampling and uniform distributions. We discard the parameter sets that do not meet the condition h c < h nc (where h c and h nc are the protection coefficients for cultivated and non-cultivated land, respectively). We assume that the protection of organic matter is reduced by tillage practices. We perform Monte Carlo simulations for each of the eight subcatchments and we sequentially apply the seven soft rules, thus progressively reducing the number of behavioral parameter sets. Following previous studies (Dupas et al., 2020;Ehrhardt et al., 2021;, we use the entire time series of in-stream N observations to identify behavioral simulations. We note that the goal of our analysis here is not to predict future catchment N export but to analyze the uncertainty in the simulations and the value of different types of data to constrain the simulations. In addition, the amount of in-stream observational data available (between 26 and 37 years depending on the subcatchments) is rather limited to be divided between a calibration set and an independent verification set.
We simulate the ELEMeNT model from year 1800 (pre-industrial conditions) to year 2015, including a long warm-up period that is, we only analyze the simulations for the period 1960-2015 and we discard the results for period 1800-1959. This is because the legacy stores can have a slow turnover and can build up over long timescales . The setup of the initial states is described in detail in Text S7 of Supporting Information S1. Regarding the N point sources, we select the realization in our ensemble that shows the best match with the observations available for the period 2012-2016 (further information on the point sources are in Section 3.2.4). For the atmospheric N deposition, we use the value in 1850 for the period 1800-1849. Regarding the N surplus in agricultural areas, since no data are available before 1950, we assume that the value (at county level) of the N surplus in 1850 is half the value in 1950. We then use linear interpolation for the period 1850-1950 and we consider that the value is constant for the period 1800-1850. We also assume that the N surplus takes the same value over cropland and agricultural permanent grassland. In Section 4.2, we explain how we assess the impact of the uncertainty in the N point sources and N surplus.

Sensitivity Analysis of the Simulated N Legacies
We perform a sensitivity analysis to investigate the factors that are responsible for the residual uncertainty in the simulated N legacy stores, that is, the uncertainty that remains after constraining the model simulations using the soft rules described in Section 4.1. This analysis allows us to set priorities for future efforts for uncertainty reduction and model improvement. We examine the sensitivity of four model outputs related to the legacy stores (namely the average source zone and subsurface storage and their cumulative change, assessed over the period  to the nine parameters of ELEMeNT, the N point sources input and the N surplus input. We select 10 realizations of the N point sources across the ensemble of realizations to cover the uncertainty range (see Figure  S5 in Supporting Information S1). Regarding the N surplus, we introduce two additional parameters to generate alternative realizations for the agricultural N surplus for the period  and for the disaggregation between cropland and agricultural permanent grassland. First, we define parameter r warm (−), which represents the ratio of the value in 1850 to the value in 1950 of the agricultural N surplus Surplus agr (t) (kg ha −1 yr −1 ): The agricultural N surplus for the period 1800-1949 is derived from r warm , through linear interpolation over the period 1850-1950 and by setting a constant value over the period 1800-1850. Second, we define parameter r mgra−crop (−), which is the ratio of the N surplus for permanent agricultural grassland Surplus mgra (t) (kg ha −1 yr −1 ) to the N surplus for cropland Surplus crop (t) (kg ha −1 yr −1 ), assumed to be constant in time: In addition, we define a time-invariant multiplier denoted as f surplus (−), which is used to multiply the time series 1800-2015 of the N surplus for both agricultural and non-agricultural areas. This multiplier accounts for the uncertainty in the value of the total N surplus.
We select three values for f surplus (0.8, 1 and 1.2), r warm (0.25, 0.5 and 0.75) and r mgra−crop (0.5, 1 and 1.5), which results in 27 N surplus realizations. Since no information is available to further constrain the uncertainty, we argue that these 27 realizations cover a large plausible range of N-surplus estimates. The case f surplus = 1, r warm = 0.5 and r mgra−crop = 1 corresponds to our "baseline scenario" that is, the one used for analyses presented in Section 4.1. Text S8 in Supporting Information S1 details the derivation of the N surplus for cropland and permanent agricultural grassland from the parameter r mgra−crop and Figure S16 in Supporting Information S1 reports the time series of different realizations of the N surplus for agricultural areas.
We combine the 10 point sources and the 27 N surplus realizations to create 270 sets of N inputs. For each of them, we perform Monte Carlo simulations from the same parameter sample of size 100,000 described in Section 4.1. This produces a total input-output sample of size 27,000,000. We then discard simulations that do not satisfy the soft rules defined in Section 4.1 to obtain the sample for the sensitivity analysis.
We apply the distribution-based PAWN sensitivity analysis method  that evaluates the effect of the input factors on the entire output (here legacy stores) distribution. We estimate the PAWN sensitivity indices using the numerical approximation strategy introduced by Pianosi and Wagener (2018), which can be utilized for any generic input-output sample, and which is implemented in the Python version of the SAFE toolbox . With this numerical scheme, the range of variation of the ith input factor is partitioned into a number n i of equally probable "conditioning" intervals (denotes as I i,k , k = 1,…,n i ), each interval containing the same number of parameter sets. The PAWN method consists of the comparison between (1) the Cumulative Distribution Functions (CDFs) of the model output (denoted here as y) obtained by letting all input factors vary in their entire space of variability (i.e., unconditional CDF, denoted as F y (y)) and (2) the CDF obtained by allowing all input factors to vary freely, but the ith input x i whose value is constrained to a specific conditioning interval (i.e., conditional CDF, denoted as | ( | ) ). In PAWN, input sensitivity is quantified through the Kolmogorov-Smirnov statistic (KS, Kolmogorov, 1933;Smirnov, 1939), which is the maximum vertical distance, between unconditional and conditional CDFs. The PAWN sensitivity index for the ith input factor, denoted as PAWN (−), aggregates the KS values calculated across all n i conditioning intervals through a summary statistic, which is chosen as the median value in this study, to eliminate the impact of outlier values: where 11) PAWN takes values between 0 and 1, and the higher its value the larger the impact of that input on the model output. For the ELEMeNT parameters, we adopt a number of conditioning intervals n i equal to 10. For the N point sources, we calculate the conditional CDF for each of the 10 realizations, and for the three N surplus parameters we compute the conditional CDF for each of their three selected values. We estimate the 95% confidence intervals of the PAWN sensitivity indices using 1000 bootstrap resamples, and we verify the convergence of the results given the sample size, following Sarrazin et al. (2016).

Application of the Soft Rules
The performance of the simulated in-stream N loading and concentration is comparable in terms of the metrics RBIAS and STD err (Figure 2a). However, in terms of ρ, the performance is noticeably better for the simulated N loading compared to the concentration for five subcatchments. For the three other subcatchments (outlet Hemelingen, Drakenburg and Wahnhausen), performance are more similar although higher values of ρ can be reached for the loading. In contrast to the concentration dynamics, the temporal fluctuations of the loading are strongly influenced by the discharge dynamics, which is an input to the ELEMeNT model. Importantly, we identify simulations that comply with each soft rule individually, as shown in the gray shaded areas in Figure 2a for rules 1-6 and in Figure S17 of Supporting Information S1 for rule 7 (source zone N content). We also find that the threshold values on the performance metrics introduced in rules 1-6 result in values of the KGE higher than 0.62 for the loading and 0.49 for the concentration in the behavioral simulation ensemble (gray shaded areas in Figure 2. Application of the soft rules: (a) Cumulative Distribution Function (CDF) of the performance metrics for in-stream N loading (J out ) and concentration (C out ) in the initial simulation ensemble (100,000 realizations) and (b) percentage of realizations of the initial ensemble identified as behavioral (p behav ) by successive application of the soft rules based on the performance metrics for loading (J out , rules 1-3), the performance metrics for concentration (C out , rules 4-6), and the source zone N content (M s , rule 7). The name of the eight subcatchments refer to both the legend of the lines of panel (a) and the bar graphs of panel (b). Panel (a) reports the three performance metrics used in the definition of rules 1-6 (relative bias RBIAS, variability error STD err and Pearson correlation coefficient ρ) and the Kling-Gupta efficiency (KGE). The gray shaded areas and gray numbers on the x-axis indicate the behavioral ranges of the performance metrics used in the definition of rules 1-6. The ranges of the performance metrics shown (x-axis of plots in panel [a]) do not include the extreme values, which are shown in Figure S18 of Supporting Information S1. right panels of Figure 2a). We verify that these KGE values are higher than the mean benchmark value of −0.41 (Knoben et al., 2019).
From Figure 2b, we observe a reduction in the number of ELEMeNT realizations after application of the rules on the loading (rules 1-3), but also a further diminution after application of the rules on the concentration (rules 4-6) and on the source zone N content (rule 7). This means that not only the loading, but also the concentration and the source zone N content have a value in constraining the simulations. In addition, the data on the source zone N content in 2009 (rule 7) allows reduction in the uncertainty in the total source zone N storage that is not constrained by the other rules ( Figure S17 in Supporting Information S1). We obtain a number of behavioral simulations that varies between 676 for Letzter Heller and 2076 for Wahnhausen (Table S6 in Supporting Information S1). For further details on the reduction in the number of realizations obtained when applying each of the seven rules individually, we refer to Figure S19 in Supporting Information S1.

Constraining of the Simulated In-Stream Loading and Concentration
From Figures 3 and 4, we observe that the precision of the simulated in-stream N loading and concentration is larger in the behavioral simulation ensemble compared to the unconstrained ensemble, that is, the red shaded areas are much narrower than the gray shaded areas. We also see that, for some years, the width of the 95% confidence interval (CI) of the simulation ensemble is reduced when applying the rules on the concentration and source zone N content (rules 4-7, red shaded areas) in addition to the rules on the loading (rules 1-3, blue shaded areas). However, the information on the source zone N content in 2009 (rule 7) does not further narrow down the uncertainty bounds, that is, we do not observe green shaded areas. Importantly, after applying the seven rules, the behavioral simulation ensembles (red shaded areas in Figures 3 and 4) capture a large number of observations. Specifically for most subcatchments, the simulation ensemble encompasses more than 90% of the observations, except for Verden for which it includes 57% of the observations. Regarding loading (Figure 3), the temporal dynamics follow the discharge dynamics (discharge is reported in Figures S6-S13 of Supporting Information S1), and we see that the simulation ensembles match the observations very well. Regarding concentration (Figure 4), we observe that the simulations are overall in agreement with the measurements. However, the median ensemble has difficulties in reproducing the observed concentration trend around the last 10 years of the simulations (2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015), in particular at Hemeln, Porta, Verden and Hemelingen. While the measurements indicate a slight decrease in concentration, the latter is relatively stable in the simulations, which is consistent with the N input dynamics over this time period (see Figure 1). Nonetheless, the simulation ensemble captures the observations for all locations apart from Verden. We also notice that for Letzter Heller (panel b), concentration shows little temporal variations, while the N surplus trajectory presents a sharp decrease in 1990 (as shown in Figure 1).

Constraining of the Parameter Distributions
From Figure 5, we find that the differences between the prior (gray lines) and posterior (colored lines) CDFs of the parameters are very small for four parameters, namely the two protection coefficients (h c and h nc ), the mean annual water content in the source zone (V s ) and the fraction of in-stream N removal (R). Nonetheless, the median of the distribution changes by at least 20% for V s for one subcatchment (increase of 25% for Letzter Heller) and for R for four subcatchments (decrease of 27% for Wahnhausen, 20% for Hemeln, 22% for Porta and 23% for Hemelingen). The other five parameters ( prist s org , k a , λ s , λ sub , and μ sub ) show appreciable (higher than 15%) reduction in their 95% CI. On average, these parameters exhibit a diminution in their CI of 59%, 10%, 37% 15% and 33%, respectively. The two denitrification rate constants (λ s and λ sub ) and the mean travel time (μ sub ) take values in the lower range of their prior distribution for most subcatchments, with median values in the range 0.28-0.34 yr −1 (prior: 0.55 yr −1 ), 0.06-0.17 yr −1 (prior: 0.15 yr −1 ) and 8-19 years (prior: 26 years), respectively. The 95% CI of the travel time in the posterior distribution is reduced for the some subcatchments such as Hemelingen (95% CI: 2-24 years), while it can still be rather large for some other subcatchments such as Drakenburg (95% CI: 3-44 years). The mineralization rate constant for organic active N (k a ) is unlikely to take values in the lower range of its prior distribution, as the lower bound of its 95% CI is in the range 0.09-0.18 yr −1 (prior: 0.07 yr −1 ). Details on the median and 95% CI of the parameter values are reported in Tables S7-S8 of Supporting Information S1. In addition, all three observational data used (loading, concentration and source zone N content) have value in constraining the distribution of at least one parameter, as shown in Figures S20-S27 of Supporting Information S1. In particular, the source zone N content observations in 2009 is the only data source that allows to constrain the source zone organic N stock under pristine conditions ( prist s org ). We also calculate the behavioral values of the mean transfer times for the source zone organic N stores as the inverse of the respective mineralization rate constants. The mineralization rate constant for the protected pool is computed from Equation S24 of Supporting Information S1 and its CDFs can be visualized in Figure S28 of Supporting Information S1. We find that the mean transfer times for the protected store are much higher compared to the active store for all subcatchments. The median values (95% CI) of the mean transfer times are in the range 2000-2700 years (1300-5400 years) and 2.0-3.0 years (1.4-11.3 years) for the protected and active stores, respectively (more details in Table S9 of Supporting Information S1).
Interestingly, the parameter distributions for the upstream Letzter Heller subcatchment stands out, with particularly low median values of k a , and high median values of μ sub compared to the other subcatchments, including the Wahnhausen subcatchment which is neighboring the Letzter Heller and which has similar land use and  (Figure 1; Section 3.1). The results suggest that, compared to the other subcatchments, Letzter Heller may have a particularly high potential to accumulate organic active N in the source zone, due to the low mineralization rate for the active N pool (median value: 0.33 yr −1 ), and dissolved mineral N in the subsurface, due to the large travel time (median value: 17 years).
Overall, our results demonstrate that, for each of the eight subcatchments, multiple combinations of model parameter values lead to acceptable model performance. This equifinality can be partly explained by interactions between the model parameters, in particular between the two denitrification rate constants in source zone and subsurface, between each of the denitrification rate constant and the mean travel time in the subsurface, between the denitrification rate constant in the subsurface and the fraction of in-stream removal and between the two protection coefficients (a detailed interaction analysis is presented in Table S10 of Supporting Information S1).

Fate of the N Surplus
In this section, we examine the fate of the N surplus over the period 1960-2015 from the behavioral simulation ensemble ( Figure 6; Table 2). Over the entire WRB (Hemelingen), the total denitrification in the source zone and subsurface ( den tot ) amounts to 1888 kg ha −1 (median value), which corresponds to about 63% of the N surplus (95% CI: 49%-74%). Around 50% (median value) of the total denitrification occurs in the source zone, but the uncertainty in the partitioning of denitrification between source zone and subsurface is large (95% CI of the source zone contribution: 18%-94%; Table S11 in Supporting Information S1).
The landscape export of N from catchment ( out sub ) is equal to 537 kg ha −1 (median value), which represents 18% of the N surplus. As the initial source zone N storage in 1960 is very large (median value: 17,487 kg ha −1 ; 95% CI: 13,678-21,564 kg ha −1 ), the change in the source zone N storage (ΔM s , biogeochemical legacy) is relatively small with respect to this initial storage (median value: 2.5%; 95% CI: 1.2%-5.1%; see details in Figure S30 and Table S12 of Supporting Information S1). Nevertheless, the change in the source zone storage amounts to 448 kg ha −1 , (median value), which constitutes a high percentage of the N surplus (15%) similar to the stream export part. The 95% CI is however large for this component (229-781 kg ha −1 ) and is largely overlapping with the 95% CI of the stream export. The two last components per order of magnitude are the in-stream removal ( rem sub ) and the change in the subsurface N storage (ΔM sub , hydrologic legacy), which have median values equal to 72 and 37 kg ha −1 respectively (which correspond to 2% and 1% of the N surplus, respectively), and which have overlapping 95% CI. Therefore, the denitrification in the source zone and subsurface is an order of magnitude greater than the in-stream removal. Moreover, the N accumulation in the source zone is an order of magnitude higher than the N accumulation in the subsurface. Total legacy buildup in the WRB amounts to 491 kg ha −1 (95% CI: 264-820 kg ha −1 ), which corresponds to around 16% of the N surplus.
We also observe that the simulated source zone N store is continuously building up in time over the period 1960-2015 (Figure 6a). In the subsurface, the dynamics of the N store is much more coupled to the dynamics of the N surplus (Figure 6b). We see that the N storage in the subsurface increases until the 1987 to reach a value of 61 kg ha −1 (median value), decreases by about as much as 40% between 1987 and 2010, and shows small fluctuations between 2010 and 2015.
For the different subcatchments, the relative importance of the different components of the N mass balance are similar (Table 2). In particular, denitrification den tot is always the largest outgoing N flux. The median change in source zone storage ΔM s generally varies between 473 kg ha −1 (Hemeln) and 584 kg ha −1 (Drakenburg). An exception is the Verden subcatchment, which is mostly located in the lowland areas, and for which the median ΔM s is smaller (326 kg ha −1 ). The median change in subsurface storage ΔM sub is smaller for Wahnhausen (17 kg ha −1 ) and is larger for Letzter Heller (62 kg ha −1 ), Hessisch Oldendorf (62 kg ha −1 ) and Drakenburg (78 kg ha −1 ). The relatively large value of ΔM sub for Letzter Heller compared to the other subcatchments is consistent with the parameter distribution results presented in Section 5.1.3. We also observe that the (temporal) dynamics of N buildup in the legacy stores of all subcatchments are similar to those of Hemelingen (see Figures S29-S31 in Supporting Information S1). In particular, N levels in the subsurface peak around the year 1990. Notably, the highest level of N accumulation in the subsurface across subcatchments and time is equal to 119 kg ha −1 (median value) and is reached for Drakenburg in 1993 and Letzter Heller in 1987. However, the differences found between catchments are not robust, since the 95% CI are largely overlapping between subcatchments.
We also examine the change in the different N stores of the source zone, that is, the organic protected, organic active, and inorganic N stores (details in Table S12 of Supporting Information S1). Most of the N accumulation occurs in the protected pool (e.g., 94% for Hemelingen; 95% CI: 79%-98%). For Hemelingen, N buildup amounts to 416 kg ha −1 (95% CI: 203-756 kg ha −1 ) in the protected store, to 21 kg ha −1 (95% CI: 12-74 kg ha −1 ) in the active store, and to 2 kg ha −1 (95% CI: -5-16 kg ha −1 ) in the inorganic store.

Contribution of the N Point Sources to the In-Stream N Loading
We investigate the contribution of the in-stream N loading originating from N point sources ( out ps ) to the total in-stream loading (J out ) over the period 1960-2015. For Hemelingen, we find that the N point sources are an important flux that amounts to 217 kg ha −1 , and that accounts for 28.7% of the total in-stream N loading (Table 2  and Table S13 in Supporting Information S1). For all subcatchments, the N points sources contribution to the total in-stream N loading is between 20% and 29%, apart from Verden for which it is as high as 34.4% (median values, as reported in Table S13 of Supporting Information S1). We note that the 95% CI on the point sources contribution is rather large, as for example, for Hemelingen it is 22.6%-35.6%. This can be partly explained Figure 6. Cumulative values of the components of the N mass balance (inputs and simulated variables) for the WRB at Hemelingen for the period 1960-2015. For the simulated variables, the figure reports the median values and the 95% CI in the behavioral simulation ensemble. Panels (a-b) report the simulated cumulative change in N storage for the source zone (M s ) and the subsurface zone (M sub ) since 1960 as a function of time t ( init s and init sub are the initial conditions in 1960 for the source zone and subsurface storage respectively). The shaded areas indicate the 95% CI, the solid lines the 25% and 75% quantiles and the dashed lines the median values. Panel (c) represents the in-stream compartment, where no accumulation of N occurs in ELEMeNT. Notations: J ps is the point source; J out is the simulated in-stream loading, which is the sum of the point source contribution ( out ps ) and the subsurface contribution ( out sub ); den tot is the total denitrification, which is the sum of the denitrification in the source zone ( den s ) and in the subsurface ( den sub ); rem tot is the in-stream removal, which is the sum of the removal of the point source contribution ( rem ps ) and the subsurface contribution ( rem sub ); ΔM s is the change in the source zone storage which includes three N stores (organic protected, organic active, and inorganic N stores); ΔM sub is the change in the subsurface storage.
To understand the relative role of point and diffuse (N surplus) sources on the resulting temporal trend of the total in-stream N concentration, we perform a piecewise linear trend analysis for each individual component of the concentration over different time periods for Hemelingen. The analysis is based on the median of the behavioral simulation ensemble (total concentration is represented by a dashed red line in Figure 4h). For the period 1970-1990, with respect to the total concentration, we find no statistically significant trend (significance level: 0.01) and a very small slope of the regression line (s lin = −0.01 mg L −1 yr −1 ), which is explained by the contrasting trends in the diffuse sources contribution ( Note. The table reports the fate for the N surplus: the stream export ( out sub ), the total denitrification in source zone and subsurface ( den tot ), the in-stream removal ( rem sub ), the change in the source zone storage (ΔM s ) which includes three N stores (organic protected, organic active, and inorganic N stores), and the change in the subsurface storage (ΔM sub ). It also reports the fate for the N point sources (J ps ): the stream export ( out ps ), and the in-stream removal ( rem ps ). For simulated variables, numbers indicate the median, and lower bound (lb) and upper bound (ub) of the 95% CI in the behavioral simulation ensemble: median ub lb .

Table 2
Components of the N Mass Balance in the Behavioral Simulation Ensemble for the Period 1960Period -2015 L −1 yr −1 ). The concentration time series used for this trend analysis and the regression lines are reported in Figure  S32 of Supporting Information S1.

Uncertainty and Sensitivity of the Simulated N Legacies
Section 5.2.1 mostly focuses on examining the median values of the simulated N legacies. However the uncertainty is large (Table 2), due to the limited information available to constrain these legacies. The soft rules hardly affect the distribution of the simulated change in source zone storage (ΔM s , top panel of Figure 7). The width of the 95% CI in the behavioral ensemble is about equal to the median value, apart from Verden for which it is 1.7 time higher (red boxplots). Regarding the subsurface, the soft rules can have a contrasting effect on the simulated change in storage (ΔM sub , bottom panel of Figure 7), as they can reduce but also exacerbate the uncertainty. Values that are outliers in the unconstrained ensemble, that is, beyond the 95% CI of the gray boxplots, can be identified as behavioral and be included in the constrained 95% CI (colored boxplots). The width of the 95% CI for ΔM sub is between two to three times higher than the median value.
Here, we investigate the factors that explain this residual uncertainty in the legacy N stores by assessing the sensitivity of the simulated N legacies to the ELEMeNT parameters, the N point sources and the N surplus in the constrained simulation ensemble obtained after application of the soft rules, for Hemelingen (see method in Section 4.2). For each of the 270 combinations of N surplus and N point sources, we identify between 531 and 2146 behavioral simulations (details in Table S14 and Figures S33-S35 of Supporting Information S1). This results in a total sample of the model input-output of size 362,985 to perform the PAWN sensitivity analysis. We observe that the bootstrap confidence intervals of the estimated sensitivity indices are narrow and exhibit little overlap among the different inputs ( Figure 8). Therefore, the sample size is sufficient to infer a robust ranking of the input factors according to their relative importance.
From Figure 8 we observe that the N point sources (PS) and two parameters used to generate the N surplus realizations, namely the ratio of the N surplus for agricultural permanent grassland to the N surplus for cropland (r mgra−crop ) and the ratio of the agricultural N surplus in 1850 to the value in 1950 (r warm ) have very small sensitivity indices and are the least sensitive inputs for all four output variables considered. This means that the uncertainties in the ELEMeNT parameters and in the value of the total N surplus (multiplier parameter f surplus ) have a much larger impact on the behavioral values of the legacy stores than the uncertainties in PS, r mgra−crop , and r warm .
For the source zone, the sensitivity analysis results with respect to the average N storage ( s ) and the change in N storage (ΔM s ) differ. The protection coefficient for cultivated land (h c ) is largely responsible for the uncertainty in ΔM s , followed by the N surplus multiplier (f surplus ) and the protection coefficient for non-cultivated land (h nc ) whose sensitivity indices have similar magnitude. In contrast, the source zone organic N stock under pristine conditions ( prist s org ) is by far the most influential parameter for s . For the subsurface zone, unlike the source zone, results are similar for the two statistics analyzed (the average N storage sub and the change in N storage ΔM sub ), and a larger number of input factors are influential. Specifically, the mean travel time in the subsurface (μ sub ) is the most influential input, followed in decreasing order of importance by the denitrification rate constants in the source zone (λ s ) and in the subsurface (λ sub ), and the N surplus multiplier (f surplus ). In addition the mean annual water content in the source zone (V s ) has a stronger impact on the change in the subsurface N storage than on the average storage. The value of the sensitivity index of V s with respect to the change in subsurface storage is similar to the sensitivity index of f surplus .
The importance of the three N surplus parameters and the N point sources may be higher than suggested by the PAWN analysis because the parameter estimation may result in different posterior parameter distributions when using different N input realizations. Therefore, the application of the soft rules may compensate for the uncertainties in the N inputs. This effect is particularly visible for the N surplus multiplier, which has some impact on the distributions of the mean travel time and the denitrification rate constants ( Figure S38 in Supporting Figure 8. PAWN sensitivity indices S PAWN of the nine ELEMeNT parameters, the three parameters introduced to generate alternative N surplus realizations (f surplus , r mgra−crop , and r warm ), and the N point sources realization (PS), for the WRB at Hemelingen. Sensitivity indices are reported with respect to four model outputs evaluated over the period 1960-2015, namely the average source zone N storage s , the average subsurface N storage sub , the cumulative change in source zone N storage ΔM s , and the cumulative change in subsurface N storage ΔM sub . The source zone N storage includes the three N storages (organic protected, organic active, and inorganic N stores). The horizontal black lines indicate the bootstrap mean value of the sensitivity indices, while the gray boxes represent the 95% bootstrap confidence intervals. The bootstrap confidence intervals are very small (the gray boxes are very narrow), since the size of the sample used to calculate the PAWN indices is very large. Information S1), but it is much less pronounced for the other N inputs parameters ( Figure S36-S37 in Supporting Information S1).
We further examine the robustness of the PAWN analysis by estimating the sensitivity indices using other summary statistics (mean and maximum values) than the median value, to aggregate the KS values across the conditioning intervals in Equation 10. As shown in Figures S39-S40 of Supporting Information S1, the results show a similar order of importance among all inputs (model parameters and N input realizations) as reported in Figure 8. We however note that, when using the maximum KS, the mineralization rate constant for the active store (k a ) has a sensitivity index of the same magnitude as f surplus with respect to the average subsurface storage. From Figures S41-S44 in Supporting Information S1 that report the conditional and unconditional CDFs used for the calculation of the PAWN indices, we see that k a has a higher impact in the lower 10% of its range (values lower than 0.25 yr −1 ) on the average subsurface storage.

Performance of the Simulated In-Stream N Loading and Concentration
The ELEMeNT model is able to produce simulations that are consistent with observations of in-stream N loading and concentration (i.e., that show satisfactory values for each of the three components of the KGE), and with the source zone N content in 2009 for the different study catchments (Figures 2-4). While previous studies using the ELEMeNT model (Chang et al., 2021;J. Liu et al., 2021; focused on simulating the in-stream N loading, here we also examine the concentration, which is more difficult to simulate than the loading (Husic et al., 2019). As seen in our results, the loading dynamics are predominantly determined by the discharge dynamics, while the concentration dynamics may be the results of more complex processes. It is important to assess both N loading and concentration to characterize the in-stream water quality status, as N loading affects the status in downstream receiving water, while concentration describes the local water quality status (Hirsch et al., 2010).
Although, our simulations are overall in good agreement with the observations, we observe a discrepancy between the median simulated N concentration and the measured values for the later years (2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015). This is particularly visible at Verden, where the concentration observations are at the lower end or lower than the behavioral simulation ensemble for this time period. First, this mismatch between simulated and observed concentration could be due to changes in the characteristics of the landscape (such as e.g., the density of tile drains as observed in J. Liu et al., 2021), that could require temporally varying parameter values. However, no information is available to us to substantiate that such changes have occurred. Second, another possible cause for the discrepancy between observed and simulated concentration could be the uncertainty in the N input data (N surplus and N point sources). In particular, in Germany, data on application of mineral fertilizer exist at the national level only. For lower administrative levels, data refer to the sale of mineral fertilizer and are therefore strongly linked to the location of the fertilizer companies, rather than the actual amount of fertilizer application. Therefore, subjective choices have to be made to disaggregate the fertilizer application amounts to finer spatial units (Behrendt et al., 2003;Häußermann et al., 2020). Fertilizer application is an important component of the N surplus and changes have been implemented after 2007 due to a new ordinance that limits fertilizer use (DüV, 2007). This could contribute to the uncertainty in our in-stream simulations. In Section 6.3.2, we further elaborate on the need for a better estimation of the uncertainty in the N surplus for an improved characterization of not only the in-stream variables, but also the N legacy stores.

Denitrification in the Terrestrial Compartments
Our results ( Figure 6; Table 2) indicate that denitrification in the terrestrial system (source zone and subsurface) is the largest sink for the N surplus in the WRB for the period 1960-2015. This is consistent with previous modeling studies of N legacies performed in North America, namely Ilampooranan et al. (2019) using the SWAT-LAG model, and J. Liu et al. (2021) and  using the ELEMeNT model. However, we find that denitrification is much higher for the WRB, in that it is likely to be higher than 50% for the eight subcatchments, while it is found to be less than 50% in the three previous studies. Such a high amount of denitrification in the WRB could have adverse consequences on the atmosphere and the climate, because it can potentially release nitrous oxide (N 2 O). Yet, N 2 O can be further reduced into harmless dinitrogen N 2 in presence of favorable environmental conditions in the soil and subsurface (Bergsma et al., 2002;Betlach & Tiedje, 1981;Rivett et al., 2008;Robertson & Groffman, 2015). The results of ongoing efforts, such as the "Global N 2 O Model Intercomparison Project" (Tian et al., 2018), will be key to better understand and quantify the processes involved in N 2 O emissions and the impact of denitrification on the atmosphere.

Accumulation of N Legacies
In this study, we explicitly quantify the N legacies in the WRB over the period 1960-2015 ( Figure 6; Table 2). Previous N modeling studies in the WRB are either based on a simple regression model that does not account for N storages in the terrestrial system (GREEN model; Grizzetti et al., 2008), or on a modeling framework that represents the residence time of N in the groundwater, but that does not consider source zone N storage nor assesses the long-term accumulation of N in the subsurface (Heidecke et al., 2015;Hirt et al., 2012;Kreins et al., 2010). We find that the N accumulation in the source zone, which amounts to 15% of the N surplus (95% CI: 8%-26%), is an order of magnitude higher that the N accumulation in the subsurface, which is equal to less than 3% of the N surplus. The magnitude of the simulated N accumulation in the source zone and the subsurface is similar across subcatchments. Groundwater nitrate concentration is found to be higher than the regulatory threshold of 11.3 mg L −1 in some measuring points in the WRB, in particular in Lower Saxony (NLWKN, 2019), while background levels are typically very low. This supports our result that N has been building up in the subsurface. However, no information is available on the N accumulation in the source zone to corroborate our findings. In previous studies of N legacies, the relative value of biogeochemical and hydrologic N legacies is highly variable. Whereas the N buildup in the subsurface store is also found to be lower than the buildup in the source zone in the Grand river basin (J. Liu et al., 2021), a subcatchment of the Iowa-Cedar basin (Ilampooranan et al., 2019) and the Mississippi river basin , the opposite result is reported for the Susquehanna river basin . The magnitude of hydrologic N legacies in J. Liu et al. (2021) is similar to our study (around 4% of the N surplus), while it reaches 14% in Ilampooranan et al. (2019).
Although the N accumulation in the subsurface represents a small fraction of the N surplus in the WRB, this N store is composed of reactive and dissolved N forms, which can be easily accessed and mobilized. Thus, they are of immediate relevance for the water quality status. Since the mean travel time in the subsurface is found to be equal to 8 years (95% CI: 2-24 years), the subsurface N storage is likely to impact the stream N concentration over the coming years. In the source zone, the accumulated N could be a threat for future water N levels as well, depending on how fast it can mineralize. Findings of previous studies suggest that applied N fertilizer can slowly leach over decades following their application (Haag & Kaupenjohann, 2001;Sebilo et al., 2013). Our results indicate that most of the N accumulation occurs in the organic protected N pool (median value for Hemelingen: 94%, Table S12 in Supporting Information S1), whose transfer time is in the order of magnitude of a few millennia (Table S9 in Supporting Information S1). The N buildup is much smaller in the active pool that has a transfer time of a few years to a decade. The long transfer times of N stored in the source zone, which can be made possible through the protection of N into organic matter (Six et al., 2002), may partly explain why it is difficult to lower the N concentrations to acceptable values in the WRB and, more broadly, in regions with long history of high N inputs, as described for example, in Grimvall et al. (2000) and Vero et al. (2018). Source zone N storage could also be a potential resource for crop growth, allowing satisfaction of the crop N requirements with lower amounts of fertilizer application, as proposed by Dupas et al. (2020) and J. Liu et al. (2021). In particular, while it is widely accepted that crops can use mineral N compounds present in the soil, they may also take up organic N forms, but this process is still not well understood (Farzadfar et al., 2021;Näsholm et al., 2009). Therefore, the fate of the N stored in the source zone has large uncertainties and depends on the ability of the plants to access it and on its potential to mineralize to yield more available N forms.
Regarding the temporal dynamics, the permanent buildup in the source zone found for the WRB is consistent with most of the previous N legacy studies (Ilampooranan et al., 2019;J. Liu et al., 2021;Van Meter et al., 2016). The WRB shows a large decrease in the subsurface N store during the period 1990-2010, that can be explained by the concurrent reduction in the N surplus. Hydrologic N legacies permit to sustain higher in-stream concentration levels over this time period. This is particularly visible for the Letzter Heller subcatchment, which has undergone a large and sudden decrease in the N surplus after 1990, while in-stream N concentration remains relatively stable. This result for the subsurface N legacy differs from earlier legacy studies over the Mississippi and Susquehanna river basins and the subcatchment of the Iowa-Cedar river basins, where the subsurface N store is continuously building up in time (Ilampooranan et al., 2019;Van Meter et al., 2016). Given the importance of the N legacies for land and water management and in particular to achieve the target of 2.8 mg L −1 for in-stream concentration in the WRB (OGewV, 2016), better characterization and reduction of the uncertainties in the simulated N legacies is crucial, as further discussed in Section 6.3.

Importance of the N Point Sources
We find that N point sources from wastewater represent an important fraction of the in-stream nitrate loading in the WRB (Figure 6; Table 2 and Table S13 in Supporting Information S1). Point sources N loads comprise 28.7% (95% CI: 22.6%-35.6%) of the stream N load for the period 1960-2015. The contribution is smaller for the later period (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015), where point sources have a lower magnitude due widespread connection to wastewater treatment plants and high efficiency of treatment. Grizzetti et al. (2008) find that point sources account for 31% of the stream N-NO 3 loading over the period 1995-2002 for the WRB at Hemelingen. This is higher than our uncertainty estimates for this time period (95% CI: 13%-23%, Table S13 in Supporting Information S1), which could be explained by the differences in the model structure used in Grizzetti et al. (2008) (regression based GREEN model) and in the N point sources inputs. In our study, the N point sources are constrained by recent observations of N loading from wastewater treatment plants (Section 3.2.4), while Grizzetti et al. (2008) do not make use of observational data. Moreover, the temporal variations in the N point sources have a large effect on the trend of the total in-stream N concentration during the period 1970-2000 ( Figure S32 in Supporting Information S1). The decrease in the N point sources during the 1970s and 1980s counteracts the increase in the contribution of the N diffuse sources (N surplus) to the in-stream N concentration, resulting in no overall trend in the total concentration. The marked decrease in the total concentration in the 1990s is also largely dominated by the decrease in the point sources. While the N diffuse sources are the largest contributor in magnitude to the in-stream concentration, their temporal signal can be smoothed through biogeochemical transformations and transport in the source zone and subsurface ( Figure S32 in Supporting Information S1). In contrast, changes in the N point sources have an immediate impact on the in-stream concentration and can therefore strongly influence its trend.
Some past N modeling studies covering a large range of catchments across Germany and France have not accounted for N point sources (Dupas et al., 2020;Ehrhardt et al., 2021). Based on our simulation results, we recommend the consideration of N point sources and their temporal variations in future N modeling analyses over the WRB.

Value of the Soft Rules to Constrain the Model Uncertainties
In this study, we utilize three different sets of observational data (in-stream N loading and concentration and source zone N content) to estimate the model parameters, using a portfolio of soft rules to constrain the model results. We show that, beyond in-stream N loading, in-stream concentration and source zone N content have a value in reducing the number of behavioral simulations and in constraining the parameter distributions, thus reducing the equifinality (Figures 2b and 5). Specifically, the in-stream N loading and concentration data affect the simulated in-stream loading and concentration (Figures 3 and 4) and the simulated change in the subsurface storage (Figure 7). The source zone N content is the only data that allows to constrain the magnitude of the total simulated N storage ( Figure S17 in Supporting Information S1), but it has no appreciable impact on the different components of the mass balance (Figures 3, 4 and 7). Importantly, the soft rules do not constrain the change in the source zone N store (Figure 7).
Only few previous N modeling studies analyzed the equifinality by performing a detailed investigation of the parameter space, including Husic et al. (2019) and Rankinen et al. (2006). Due to the different model structures used in these studies, our parameter estimation results cannot be directly compared to these studies. Yet, we note that Rankinen et al. (2006) reveals a strong interplay between terrestrial and in-stream model processes in a subcatchments of the Simojoki river basin in Finland. This is consistent with our results, as we could barely constrain the in-stream N removal parameter (R).
Despite the equifinality we can constrain the range of a few parameters, including the mean travel time in the subsurface μ sub , which we determine to be equal to 8 years (95% CI: 2-24 years) at Hemelingen (Table S7 in Supporting Information S1). Koeniger et al. (2008) reports values of the mean groundwater travel time in the range 8-93 years (37 years for total flow) for Hemelingen, based on long-term tritium isotope data in combination with simulations from a hydrologic model. Hirt et al. (2012) found a value of the mean travel time in the groundwater of 25 years Ehrhardt et al. (2021) established that the overall mode travel time for total flow are in the range 0-20 years for different subcatchments of the WRB and in particular in the range 0-10 years for most subcatchments. Therefore previous studies cover a large spectrum of travel time values, which includes our estimates. The results of the study of Hirt et al. (2012) are also, as expected, on the higher end of our uncertainty estimates. This can be because Hirt et al. (2012) explicitly account for quicker flow paths (tile drainage) in their modeling framework, while in ELEMeNT all flow paths to the stream are lumped in the subsurface compartment.
The soft rules only allow to reduce part of the equifinality, as some parameter distributions could be hardly constrained ( Figure 5) and the uncertainty in the model internal components remains large (Figure 7). For example, it would be relevant to determine the amount of denitrification occurring in the source zone and the subsurface, because denitrification in the subsurface can involve the irreversible degradation of substances, such as pyrite, which is not sustainable (Wendland et al., 2009;Wriedt & Rode, 2006). Such quantification is however not possible due to equifinality ( Figure 5, Table S11 in Supporting Information S1) and therefore this topic deserves further investigations. In addition, establishing a robust ranking of importance of the N legacy buildup between subcatchments would be desirable to target management efforts to legacy hotspots (J. Liu et al., 2021). However, the residual uncertainty in the simulated N legacies is still large (Figure 7) and the confidence intervals of the distributions of the N buildup obtained for the different subcatchments are overlapping. This equifinality can be due to parameter interactions (Table S10 in Supporting Information S1) or to the fact that some model parameters are not influential with respect to the metrics used in the soft rules.

Strategies to Further Reduce the Uncertainty and Equifinality
To tackle this issue of uncertainty and equifinality, we perform a sensitivity analysis to investigate the factors that are responsible for the residual uncertainty in the simulated N legacy stores and that should be the focus of future efforts for uncertainty reduction (Figure 8). We apply the PAWN method, which does not rely on any assumption regarding the model input-output relationship, to the constrained input-output sample. The studies of  and J. Liu et al. (2021) assess the sensitivity of the median source zone N store simulated with the ELEMeNT model. Although the method adopted in these two studies is linear regression and the sensitivity analysis is carried out based on the unconstrained sample before calibration, their results are comparable to our study. We observe that the most influential factor is by far the source zone organic N stock under pristine conditions ( prist s org ) in both our study and in J. Liu et al. (2021), and the mineralization rate constant for the organic protected pool, which is related to prist s org (Equation S24 of Supporting Information S1) in . We further note that  and J. Liu et al. (2021) examine the sensitivity of the cumulative in-stream N loading and find, similar to our results for the average subsurface N store, that the mean travel time and the two denitrification rate constants in the source zone and the subsurface are the three most influential parameters. This similarity between the sensitivity of the in-stream N loading and the subsurface N store can be explained by the structure of the ELEMeNT model which lumps all flow paths to the stream into the subsurface store.
Our sensitivity analysis (Figure 8) reveals that the protection coefficient for cultivated land is mostly responsible for the residual uncertainty in the simulated accumulation in the source zone N store. This parameter could scarcely be constrained by the soft rules ( Figure 5). The protection coefficient is a conceptual parameter that partitions the N surplus between the organic active and protected N stores, and therefore it can hardly be inferred through field measurements. A possible solution could be to further refine and constrain the representation of the protection mechanism in the source zone using information gained from simulation experiments carried out with more complex models, that focus on the soil processes and that can include a large number of soil N pools, such as the DAISY (Hansen et al., 1991) or the CANDY model (Franko et al., 1995; a review of soil organic matter models is provided in Campbell & Paustian, 2015). Regarding the buildup of the subsurface N store, three interacting parameters mostly contribute to the uncertainty, namely the mean travel time in the subsurface, which is also the most influential factor, and the two denitrification rate constants. In this regard, tracer studies, and in particular the combination of tritium concentration and helium isotope measurements, can help to characterize the travel time (Sültenfuß et al., 2009), as well as the modeling of conservative solutes, such as chloride, in combination with nitrate (Kaandorp et al., 2021). In addition, Eschenbach et al. (2018) propose a method to characterize denitrification in the groundwater, based on the measurement of the N 2 /Ar ratio. Such techniques provide promising avenues for constraining denitrification fluxes and thereby possibly reducing the uncertainty in the simulated N legacies.
Regarding the magnitude of the total N surplus, we characterize the uncertainty of this input data by using a time-invariant multiplier coefficient and we explore a variation of ±20% with respect to the baseline N surplus data. N surplus data sets for Germany do not provide uncertainty intervals, as uncertainty estimation is currently not a common practice in N surplus construction (X. Zhang et al., 2021). An improved assessment of this uncertainty in future studies seems necessary, since, on the one hand, our results show that the uncertainty in the N surplus has an impact (1) on the simulated N legacies (Figure 8), although this impact is smaller than the ELEMeNT parameters that we discussed previously in this section, (2) on the posterior distribution of the model parameters ( Figure S38 in Supporting Information S1), and (3) possibly on the simulated in-stream concentration trend over the period 2005-2015 (as discussed in Section 6.1). On the other hand, X. Zhang et al. (2021) found large discrepancies in different components of the N surplus for agricultural areas between different global data sets, which suggests that the actual uncertainties in the N surplus can be large.
We recognize that, in this study, the uncertainties on the different components of the N mass balance, including the simulated N legacy buildup, could be underestimated. In fact, we examine the uncertainties in the N surplus using exploratory coefficients that may not reflect the actual uncertainties in the N surplus. We also do not investigate the model structural uncertainties. To further address the modeling uncertainties and equifinality, future studies need to reveal not only the uncertainties in the model parameter values, but also in the data used as input or to constrain the simulations, and in the model structures, as elaborated below.
First, regarding the parameters values, the definition of the parameter distributions and ranges can affect the parameter estimation results, because they can greatly impact the sensitivity of the metrics used for calibration (e.g., the bias, variability error or correlation used in this study) to the model parameters (a discussion on the impact of the parameter ranges on sensitivity analysis results is provided e.g., in Pianosi et al., 2016). To address this issue, in this study, we define the ranges through a careful literature review (Table 1).
Second, with respect to the data uncertainties, in this study we focus on the N inputs and the soil N data (used in the soft rules). Including uncertainty on soil N data is crucial especially, if the objective is to detect a change in storage when data are provided for different years. Given the large size of the total N storage (Table S12 in Supporting Information S1), the change in storage may be within the observation uncertainties. In our study, we consider that other data uncertainties are smaller. In fact, the annual stream discharge data is the combination of observations, and evaluated and bias-corrected model simulations (Section 3.2.5). In contrast, no observational data of N surplus exist to test the plausibility of the N surplus estimates. We also consider that the in-stream N concentration data has a high quality, since they come from 14-day average measurements (Section 3.2.6). However, uncertainty should be examined when data have a lower quality, in particular when using low-frequency in-stream concentration observations combined with weighted regressions on time, discharge and season (WRTDS, Hirsch et al., 2010). In this respect, a bootstrap approach could be envisaged (Hirsch et al., 2015).
Third, a modeling approach allowing for systematic exploration of different water quality modeling alternatives could be developed, similar to the Structure for Unifying Multiple Modeling Alternatives (SUMMA, Clark et al., 2015aClark et al., , 2015b) that allow testing of alternative model formulations for a range of different hydrological and thermodynamic processes. Specifically, in the source zone, worth of further investigation are the representation of the processes of immobilization of N into organic matter and of N saturation, which are poorly characterized (Bingham & Cotrufo, 2016;Yansheng et al., 2020). In the subsurface, further mixing schemes beyond complete mixing/random sampling could be examined using StorAge Selection (SAS) functions, as implemented for example, in Nguyen et al. (2021).
To help identification of plausible model structures and parameters values, our study call for the long-term monitoring of N content in the soil and along the subsurface (unsaturated zone and groundwater) profile. Current N data in the subsurface are typically provided at a unique depth below the water table at each measuring site, for example, in the European Waterbase data set (EEA, 2021) or in the data set provided by the German state of Lower Saxony (NLWKN, 2022). These data do not allow straightforward quantification of the subsurface N storage, since nitrate concentration can vary greatly with the depth to the water table (G. MacDonald et al., 2017;Rudolph et al., 1998) and large amounts of N could also be stored in the unsaturated zone (Ascott et al., 2017).
Regarding the soil part, future modeling studies could make use of the data on soil mineral N content that will likely become available, in particular in Germany where the 2017 fertilizer ordinance (DüV, 2017) prescribes the investigation of soil mineral N prior to fertilizer application.
Yet, due to the scale mismatch between point scale measurements of soil and subsurface N content and the modeling resolutions, the incorporation of these data into the modeling exercise requires the use of smart techniques and appropriate model structures that are commensurate with the measurements (Peters-Lidard et al., 2017). Moving toward the use of spatially distributed water quality models (Nguyen et al., 2021;X. Yang et al., 2018) may be a way forward for integrating local scale measurements into the modeling framework. Such models should be however combined with smart parameterization techniques, such as the Multiscale Parameter Regionalization (MPR, Kumar et al., 2013;Samaniego et al., 2010), which allows for seamless simulations at multiple scales and facilitates the incorporation of finer level information (Rakovec et al., 2016;Samaniego et al., 2017).

Conclusions
The objectives of this study were to (1) characterize the uncertainties in the long-term fate of the N inputs to the landscape, simulated with a parsimonious catchment-scale N model (ELEMeNT), (2) determine the value of different (observational) data to constrain the simulation results with emphasis on the simulated N legacies, and (3) gain further understanding of the magnitude and dynamics of the N legacies to determine their relevance for water and land management. To do so, we establish the ELEMeNT model in eight nested sub-catchments of the WRB, and simulate the fate of N and the dynamics of the legacy stores over the last six decades .
We introduce a multicriteria parameter estimation strategy based on soft rules, that imposes acceptability limits to the model performance in reproducing the in-stream N loading and concentration, and the source zone N content in 2009. We demonstrate that this procedure allows to reduce the equifinality. In particular, the in-stream data allow to constrain the simulated in-stream N loading and concentration and the change in the subsurface N storage, while the source zone N content data reduce the uncertainty in the simulated total source zone N storage. All sources of information also have value in constraining the parameter distributions. However, despite the parsimonious structure of the ELEMeNT model, the uncertainties in the mass balance components remain substantial after using all available information to constrain the simulations. This is due to equifinality, and more specifically to interactions between the model parameters, for example, between the travel time in the subsurface and the denitrification rate constants. Our sensitivity analysis reveals crucial information on model functioning by identifying key model parameters, such as the protection coefficient for cultivated land, the travel time in the subsurface and the dentrification rate constants in the source zone and the subsurface, that are largely responsible for the residual uncertainty in the simulated N legacies. The N surplus input could also be an important source of uncertainty. Its uncertainty estimates should be better assessed in future works to refine the exploratory multiplier coefficient approach used in this study.
Given our modeling assumptions and the data we used, our simulation results suggest a relative importance of the different constituents of the N mass balance in the WRB over the period 1960-2015. Denitrification is found to be the largest sink for the N surplus and is likely to be higher than 50%, followed by the in-stream N export and source zone N accumulation-both with similar magnitude (median value: 15%-18%), while subsurface N accumulation and in-stream N removal appear to be the smaller components (lower than 4%). Total accumulation in legacy stores in the WRB amounts to around 491 kg ha −1 (95% CI: 264-820 kg ha −1 ). Although the buildup of the subsurface N store represents a small proportion of the N surplus, it constitutes an immediate threat for the water quality status, since it includes mobile N forms. Furthermore, our analysis reveals N point sources as one of the important contributor to the in-stream N levels (median value: 28.7% over the period 1960-2015); and therefore we recommend that more attention should be given to this component to properly analyze N dynamics in future modeling studies.
Overall, we recognize that our simulation results have large uncertainties. Our study calls for a thorough consideration of equifinality in catchment water quality modeling, for a better characterization of the model internal components, such as the biogeochemical and hydrologic N legacies. Although knowledge about N legacies is crucial to reach the water quality goals and improve the ecological status of water bodies, this topic deserves more attention. In particular, modeling of N legacies is fraught with a myriad of uncertainties arising from different sources, including not only the model parameter values and input data that are examined in this study, but also the model structures and sparse measurements (e.g., low-frequency in-stream concentration observations). To this end, we believe that sensitivity analysis can be a promising tool for tackling the uncertainty and equifinality. In fact, it allows identification and pinpointing of the model input factors that are responsible for the uncertainty and that should be the focus of future efforts for uncertainty reduction. Importantly, spatially lumped or semi-distributed model structures may restrict the amount of observational data that can be incorporated into the modeling framework, because of the incommensurability between the data, and the model parameter and corresponding simulations. Therefore, future efforts toward reducing the equifinality should focus on both collecting further data, and improving the model representations (e.g., parameterization and structures).

Acknowledgments
Support to Fanny J. Sarrazin was provided by the Reduced Complexity Models project co-funded by the Helmholtz Association (grant no. ZT-I-0010). Fanny J. Sarrazin, Rohini Kumar, Michael Weber and Sabine Attinger acknowledge the Advanced Earth Modelling Capacity (ESM; grant no. ZT-0003) project funded by the Helmholtz Association. We thank Martin Bach for providing the N surplus data for agricultural areas. We thank Oldrich Rakovec for providing the long-term mHM simulations. We thank two anonymous referees for their useful suggestions that allowed us to improve the quality of this paper. Open access funding enabled and organized by Projekt DEAL.