Source Region Geochemistry From Unmixing Downstream Sedimentary Elemental Compositions

The geochemistry of river sediments is routinely used to obtain information about geologic and environmental processes occurring upstream. For example, downstream samples are used to constrain chemical weathering and physical erosion rates upstream, as well as the locations of mineral deposits or contaminant sources. Previous work has shown that, by assuming conservative mixing, the geochemistry of downstream samples can be reliably predicted given a known source region geochemistry. In this study, we tackle the inverse problem and “unmix” the composition of downstream river sediments to produce geochemical maps of drainage basins (i.e., source regions). The scheme is tested in a case study of rivers draining the Cairngorms, UK. The elemental geochemistry of the <150 μm fraction of 67 samples gathered from the beds of channels in this region is used to invert for concentrations of major and trace elements upstream. A smoothed inverse problem is solved using the Nelder‐Mead optimization algorithm. Predictions of source region geochemistry are assessed by comparing the spatial distribution of 22 elements of different affinities (e.g., Be, Li, Mg, Ca, Rb, U, V) using independent geochemical survey data. The inverse approach makes reliable predictions of the major and trace element concentration in first order river sediments. We suggest this scheme could be a novel means to generate geochemical baselines across drainage basins and within river channels.

2 of 25 methodology. This approach uses a small number of observations of downstream composition and the topology of drainage networks to make testable predictions of upstream chemistry.

Study Design
Consider a river catchment containing three geochemical endmembers that correspond to, for example, lithologic units. These endmembers are represented as orange, green, and purple areas in Figure 1. The sediment in rivers draining each of these regions inherits the geochemistry of these sediment sources, as indicated in Figure 1 by the pie charts representing the contributions from each endmember. Downstream geochemistry changes as tributaries draining different sources join or the river erodes a different source region. The "forward" problem, as we define it here, is to predict the composition of sediment at sample sites downstream given the spatial distribution of geochemistry in source regions and a known drainage network ( Figure 1b). An example of this forward problem was implemented and successfully validated in Lipp et al. (2020). The inverse problem addressed in this study attempts to predict source region composition by inverting the known sediment compositions at sample sites downstream ( Figure 1c).
In this manuscript, we consider "source region geochemistry" to be the elemental composition of river sediments in the uppermost portion of the drainage network, that is, first order streams. We recognize that other definitions of source region geochemistry might be preferred, most obviously the composition of the underlying bedrock. We use the geochemistry of stream sediments as our target, rather than bedrock, because stream sediments have likely already undergone chemical weathering on hillslopes prior to entering the drainage network. We note that stream sediment geochemistry is strongly influenced by the composition of underlying bedrock and moderated by weathering (see e.g., Kirkwood et al., 2016). Stream sediments hence incorporate geochemical information about both lithology and weathering whereas bedrock can only inform about lithology.
Given the ubiquity of mixing in the Earth sciences, a number of quantitative unmixing procedures have been developed. The most general case of unmixing is where both the endmembers (i.e., the compositions that are being mixed) and the mixing proportions are sought, so as to explain variability in a proposed mixture data set (e.g., Menke, 2012). Weltje (1997) developed a numerical solution to this general problem, which has been used to unmix, for example, fossil abundances, rock magnetism and grain size distributions (Dam & Weltje, 1999;Dekkers, 2012;Weltje & Prins 2007). In the instance where the endmembers are assumed to be known, calculating the mixing proportions is relatively straightforward, and frequently solved on an ad hoc basis or as part of a Bayesian framework (e.g., Stock et al., 2018). The unmixing problem we consider differs in that we explicitly seek the spatial structure of both the endmembers and the mixture, that is, geochemical maps of source regions and the composition of downstream river sediment samples, respectively. This approach is most similar to that developed by De Doncker et al. (2020). They sought the spatial pattern of erosion in a catchment through a Bayesian inversion of downstream sediment tracers. We, instead, seek the composition of source regions given mixing proportions calculated using drainage networks.

Outline
We first introduce the study area, the Cairngorms mountains, UK. Compositions of sediment sources in this part of Scotland are well constrained, which provides the means to assess model predictions. We then describe how observations of sedimentary composition downstream were acquired from 67 samples along trunk channels and tributaries. Next, we summarize a forward model that uses the structure of drainage networks to convert maps of source region geochemistry into predictions of downstream sediment geochemistry. The inverse problem is then posed, in which the unknown geochemistry of source regions is sought. A description of how this problem can be solved by inverting the composition of downstream samples is provided. We minimize an objective function that includes both data misfit and model smoothing. The fidelity of predicted source compositions generated from inverse models are first assessed using synthetic inputs. We then present the results of inverting real geochemical data from downstream sediment samples. These results are evaluated using independent geochemical survey data from the study region. Finally, suggested improvements and future work are discussed.
LIPP ET AL.

Study Area
This study is focused on five rivers draining the Cairngorms Mountains, Scotland, UK: Dee, Deveron, Tay, Don, and Spey ( Figure 2a). River sediments were extracted from these channels at 67 sample sites indicated in Figures 2b and 2c. Sediments within these rivers have been previously analyzed by Lipp et al. (2020), where it was demonstrated that forward modeling can be used to make accurate predictions of downstream river geochemistry (see e.g., Figure 1). These rivers are therefore good candidates to explore the use of inverse modeling. The region is also well covered by the British Geological Survey's G-BASE (Johnson et al., 2005;Figure 2d). Consequently, there is a pre-existing independent data set that can be used to test predictions from inverse modeling. We also chose to study this region for three further reasons. First, for the UK, it has relatively high topographic relief and a high natural sedimentary flux. Second, a significant portion of the region is in a protected national park limiting potential anthropogenic effects. Finally, this region contains a variety of lithologic units and substrate compositions including mafic and felsic igneous intrusions hosted within meta-sedimentary units (Figure 3a).

Topographic Data and Processing
The inverse scheme requires a drainage network to be defined. Drainage networks were extracted from the SRTM1s topographic data set downsampled to a square grid with resolution 200  E 200 m. Prior to down-sampling the data underwent a cylindrical equal-area projection centered on the study area using GMT 6.0.0 (Farr et al., 2007;Wessel et al., 2013). Depressions in the digital elevation model (DEM) were then filled using the "priority-flood" algorithm (Barnes et al., 2014). Subsequently, drainage networks were extracted from this DEM using the "D8" flow-routing algorithm, which allows drainage area to be defined at every point in the model grid (O'Callaghan & Mark, 1984). The locations of major channels, that is, cells with upstream area 25 2 km are shown in Figures 2a-2c. All landscape modeling calculations were performed using the LandLab 2.2.0 package for python 3.8.5 (Barnhart et al., 2020;Van Rossum & Drake, 2009). Extracted drainage networks are displayed throughout the manuscript (e.g., Figures 2a-2c). , and composition of sampled river sediments (white circles). In this simple scheme, composition of sediment in rivers (e.g., colored pie charts) is determined by the composition of upstream source regions. (b) Schematic shows the forward, "mixing," problem when the source region geochemistry is known and the downstream composition at sample sites is predicted (see Lipp et al., 2020). (c) The inverse, "unmixing," problem attempts to reconstruct the composition of source regions from the point observations of downstream sediment composition, which is the focus of this study. LIPP ET AL.

Upstream Source Region Geochemistry
Predicted source compositions can be tested using the independent G-BASE geochemical survey data. G-BASE sampled the fine-grained, 150 μm, fraction of bed-material of low-order stream sediments (i.e., those with very small upstream areas) with an average sampling density of 1 per 2 2 km E . These sediment samples were subsequently analyzed for a range of geochemical analytes. In the study region, this analysis was performed principally by Direct Reading Optical Emission Spectrometry, with the exception of uranium which was analyzed by Delayed Neutron Activation. The sampling and geochemical analytic procedure used by G-BASE, as well as quality control measures, are described by  and .   (Aitchison, 1983). The first three principal components of the data set are extracted and converted into a red-green-blue ternary space, which highlights the major geochemical domains in the study region. White lines indicate simplified lithological map to highlight key geochemical domains. See panel (f) for key. (f) Ternary plot showing relationship between color, principal components, and geochemistry. Reds and greens indicate compositions that are relatively enriched in elements such as U, Be, Rb indicating felsic association and metallic elements (e.g., Li, Pb, Cu, Co), respectively; blues indicate relative enrichment in alkaline earth elements (e.g., Ca, Sr, Ba). The displayed principal components explain 62.1% of the total variance.

of 25
The G-BASE stream sediment geochemical survey, like other high sample density surveys, primarily reflects geochemical variations in the underlying bedrock (Everett et al., 2019). The geological map of the study region is shown in Figure 3a. Figures 3b-3d show the concentration of magnesium, potassium, and titanium respectively in stream sediments from the G-BASE data set. These geochemical maps show the strong relationship between stream sediment geochemistry and the underlying lithology ( Figure 3a). For example, the felsic intrusions at the center of the studied region are low in Mg and Ti but enriched in K.

Principal Component Analysis
The geochemical variability of the region can be examined and simplified using principal component analysis (PCA; e.g., Kirkwood et al., 2016). PCA rotates multi-dimensional data onto a smaller number of principal components (PCs) along which variance is maximized. This rotation therefore simplifies a data set. As geochemical data sets are compositional in nature (i.e., strictly positive data that sum to a constant), a log-ratio transformation ought to be applied prior to application of PCA. In this instance, we use the centered log-ratio transformation (Aitchison, 1983). We apply PCA to both the G-BASE data sets and predicted chemistry. Singular value decomposition was used to define PCs (scikit-learn; Pedregosa et al., 2011).
To visualize three PCs simultaneously, we transform them using a red-green-blue (RGB) mixing ternary diagram (e.g., Figures 3e and 3f) as follows. The scores on the first three PCs are calculated, raised to an exponent and then normalized by the sum of these three exponentials. The resulting values (which sum to one) are then used to weight the red, green, and blue channels respectively for visualization. The first PC corresponds to relative enrichment in felsic associated elements (e.g., U, Be, Rb) and defines the felsic intrusions at the center of the study region. The second PC corresponds to an enrichment in certain metals (e.g., Pb, Cu, Li) and appears to demarcate the different sedimentary units. The third PC corresponds to relative enrichment in some alkaline earths (Sr, Ca, Ba) and identifies the mafic intrusions in the northeast of the study area. Figure 3e displays the first three PCs of 22 elements from the G-BASE data set in the RGB ternary space (Figure 3f), where the RGB channels correspond to the (normalized exponents of the) first, second and third PCs, respectively. This map shows the principal geochemical domains of the region. A goal of the inverse modeling is to reconstruct these principal geochemical domains using a small number of sediment samples gathered downstream.

Downstream Sediment Geochemistry
The 150 μm fraction of bed material was gathered from localities on the studied rivers and analyzed for their elemental geochemistry. This data set was first reported in Lipp et al. (2020). In total, 67 samples were gathered from 63 sample sites ( Figure 2b). The sample sites divide the study area into a series of nested sub-catchments, which are displayed in Figure 2c. Sampling density means that the majority of sub-catchments have areas 200-400 2 km E . In the southern portion of the Tay catchment, lower sampling density results in sub-catchments with greater areas (Figure 2c). While a larger suite of elements was gathered, we focus on the following 22 elements, which are present in the downstream samples and were measured consistently by G-BASE in the study area: Ba, Be, Ca, Co, Cr, Cu, Fe, K, La, Li, Mg, Mn, Ni, Pb, Rb, Sr, Ti, U, V, Y, Zn, and Zr. This subset was selected so that predictions from the inverse model can be evaluated using the independent G-BASE data set.
The sampling procedure we used replicated the standard G-BASE sampling protocol. This replication makes the data gathered directly comparable between the two data sets. Bed material was extracted from the river channel by shovel and deposited on a sieve-stack. First, a 2 cm grill was used to remove pebbles. The material was then rubbed through a 2 mm and then 150 μm nylon sieve into a fiberglass collecting pan. After letting suspended sediment settle out for  E 15 min, excess water was decanted, and the homogenized sediment slurry was poured into a reinforced paper bag. Each paper bag was placed within a sealed plastic bag to prevent contamination. The bagged sediment samples were air-dried until they had the consistency of modeling clay, before being freeze-dried for short-term storage prior to geochemical analysis.
The freeze-dried sediments were powderised in an agate ball mill and homogenized, and an analytical subsample taken by cone-quartering. For each geochemical analysis, 0.25 g of powder was accurately weighted into Savillex tubes. The powders were digested using HF, HNO 3 and HClO 4 on a hotplate. After digestion, 7 of 25 the sample was resolubilised using HNO 3 and H 2 O 2 and analyzed for a full suite of elements using an Agilent 8900 Inductively Coupled Plasma Mass Spectrometer at the British Geological Survey.

Uncertainties
We consider two principal sources of geochemical uncertainty in our data: analytical and sampling. First, the analytical uncertainty of measured elemental concentrations was assessed by processing of standards in the laboratory. Apart from elements Zr, Y and Ti, which had slightly lower concentrations than the standards, measured compositions were successfully reproduced (see Lipp et al., 2020, where these data were first presented). The second, larger, source of uncertainty is the variability of measured compositions within each sample site which reflects both local geochemical heterogeneity as well as error introduced by the sampling protocol. This can be assessed using duplicate samples. Statistical analysis of the duplicate samples, reported in Lipp et al. (2020), indicated that the vast majority ( 95 % for most elements) of the geochemical variability in these samples reflects variation between sample sites, not local heterogeneity or sampling error. Nonetheless, there are only eight duplicates available (one pair each from the Spey, Deveron, Don, and Dee rivers), so calculated statistics must be treated with some caution. With this caveat in mind, the duplicates can be used to generate an estimate of the uncertainty of our data. The log standard deviation,  j E , of each element calculated from the four duplicate pairs is 0 16 . and tends to be highest for metallic elements (e.g., Ti, Zr, Hf, Cu, Ag, Au, Sn, Pb). It varies systematically between 0.04 and 0.08 for rare earth elements (e.g., La to Lu). The log standard deviation of many other elements is considerably smaller, for example it is 0.014 for magnesium. An obvious way to improve the quality of these statistics in the future is to incorporate more duplicate samples. Note that for most elements  j E is much lower than the intersite variability, as reflected in the results of the ANOVA shown in Lipp et al. (2020).

Forward Model
Here we describe the procedure to predict downstream sediment geochemistry given a known distribution of geochemistry and topography in the source-region. The forward model as described here has been implemented and successfully tested for this region previously (Lipp et al., 2020). Let ( , ) E C x y be the concentration of some element in the sediment source regions of a drainage network, for example, magnesium. E C can be approximated by geochemical surveying, for example, Figures 3b-3d. We seek to predict E D which is the concentration of that same element in downstream sediments at a point in a river which has an upstream drainage area, E A . The concentration downstream, E D , is simply the sum of the contributions to this element from every upstream point in the basin, E A , normalized to the total sediment flux. If E A has a spatially varying erosion/surface-lowering rate, amount of the target element, that is, the total amount of sediment produced by that point, multiplied by the concentration of the element in question. The total sedimentary flux is the total amount of erosion occurring upstream, that is, Combining these relationships provides the following estimate of concentration in downstream samples Under this formulation, the concentration of an element in sediment downstream can be predicted if the erosion rate and concentration can be defined at all points in the upstream region, assuming instantaneous sediment transport and no in-transit chemical modification (e.g., De Doncker et al., 2020;Sharman et al., 2019). This approach assumes that all chemical weathering happens in-situ (e.g., on hillslopes) before sediments enter the fluvial system.
An unknown in this formulation is erosion rate,   is required to be defined continuously across the studied region, a reasonable approach is to use landscape evolution models. The widely used stream power model, for example, predicts erosion rates using empirical relationships between slope angle, upstream area and erosion rate (see e.g., Howard & Kerby, 1983;Tucker & Whipple, 2002). Alternatively spatial patterns of LIPP ET AL.

10.1029/2021GC009838
8 of 25 erosion rate have be constrained using detrital geochronology (e.g., Avdeev et al., 2011;Braun et al., 2018;Fox, Leith, et al., 2015;Stock et al., 2006;Vermeesch, 2007). In Lipp et al. (2020), the stream power model was used to predict erosion rates and hence composition of sediment downstream using Equation 1 for the same data set used here. Changing model parameters had a minor effect on the goodness-of-fit for downstream data. In fact, spatially homogenous incision (i.e., constant   z E t ) was found to provide, by a small margin, the best fit to the data downstream. These results, combined with the results of tests in which substrate was varied, indicated that downstream geochemistry was much more sensitive to the drainage network topology and source region geochemistry. Hence, in this study we proceed with this assumption of homogeneous incision. The validity of this assumption of spatially constant incision will be implicitly tested when predictions from inverse modeling are compared to independent data.
Under the assumption of homogenous incision (i.e., , where E k is a constant, e.g., one), Equation 1 can be simplified further to give This equation simply states that the composition of sediment downstream, E D , is an equal area weighted mixture of the composition of its upstream region. In summary, Equation 2 is the forward model, ( ) E F C , we use to transform a spatial map of upstream geochemistry, ( , ) E C x y , into a prediction of sediment geochemistry downstream,

Discretising Source Region Composition
The goal of the inverse procedure is to identify the upstream geochemistry, E C , that best fits point observations downstream. Here we describe a procedure for objectively determining E C x y can be represented as a vector, C of length equal to the number of grid-cells contained within or overlapping the studied drainage area. For example, a  5 5 E km resolution grid of the study area can be recast as a vector of 601 scalar values. This discrete vector can be upsampled to the resolution of the base DEM used to perform flow routing (e.g., 200  E 200 m in this study). The upsampled grid, E C , can then be used by the forward model to calculate the composition of the sediments downstream.
As geochemical data are relative (not absolute), we seek the natural logarithm of concentration, log( ) E C (Aitchison, 1986;Pawlowsky-Glahn & Egozcue, 2006). Consider, for example, the change in concentration of some trace-element from 0.01 wt% to 1 wt%. This relative change of 100 times the original value is generated by an absolute change of 0.99 wt%. If that same element changes in concentration again to 2 wt%, the relative change is only two times the intermediate value, much smaller than the initial change. However, the absolute change is in fact larger, that is, 1 wt%. Given that elemental concentrations frequently traverse many orders of magnitude, the objective function must be sensitive to relative, not absolute, changes in concentration. In logarithmic space, the first change in concentration is correctly identified as traversing a greater compositional distance than the second change, hence its application here.
We note that as compositional data are strictly bounded between 0 and some closure value (e.g., 100%, 6 10 E ppm), the sigmoidal logit function should be used instead of a log E function. However, given that the elements we analyze are generally 10 wt%, where the logit and log E functions are functionally identical, we use a log E function as its computational burden is lower.

Data Misfit and Uncertainty
We seek the upstream geochemistry vector, E C , that generates theoretical compositions downstream that best fit observed compositions. The observations can be represented as a vector, obs E D .
In the examples explored in this study obs E D contains 67 values (i.e., the number of downstream samples). Predicted concentration at each sample site ( ) E F C can be expressed in vector notation as ( ) E C F . The inverse model seeks to minimize the difference between A visualization of the misfit between observed and predicted concentrations is shown in Figure 4b.

Regularization
This inverse problem is likely to always be underdetermined (i.e., there are fewer observations than free parameters). In the example, we consider there are almost an order of magnitude more unknown compositions (upstream source) than known compositions (downstream samples). Underdetermined problems are often solved by imposing constraints on properties of the solution, for example, minimizing roughness (Parker, 1994). In this instance, we seek smooth geochemical maps that best fit the composition of the 67 downstream samples. We do so by penalizing the roughness of upstream geochemistry, E C . We define roughness here as the sum of the square of the Euclidean norms of the first derivative of log( ) E C in both the E x and E y directions, that is, To quantify the first-derivative we calculate the first discrete difference between adjacent values of the grid of log( ) E C values in both the E x and E y directions, assuming Von Neumann boundary conditions that are equal to zero (i.e.,       log( ) log( ) C x C y / / 0 ). 10 of 25

Minimizing the Objective Function
Considering both the data-misfit and roughness constraints, the best-fitting source-region chemistry is that which minimizes the following objective function, ( ) The hyper-parameter  E controls the extent to which roughness is penalized. High values of the "smoothing coefficient"  E result in solutions which are spatially very smooth but fit the data poorly (underfitting). Conversely, very low values of  E result in very good fits to the data but resultant maps of E C which are geologically implausible due to their spatial roughness (overfitting). We seek the smoothest model that best fits the data and systematically tested     2 2 10 10 E for each element. We choose the value of  E that lie at the point of maximum curvature (the "elbow") of data-misfit as a function of model roughness (Parker, 1994). We note that, while pragmatic and objective, this particular approach does not always identify the model that best matches independent observations (see e.g., Bodin & Sambridge, 2009). In this study, we can test the appropriateness of the chosen value by comparing model predictions to independent observations ( Figure S5 in Supporting Information S1).
As we seek log( ) E C , which must be raised to an exponent prior to being entered into the forward model, this is a nonlinear inversion and unlikely to be amenable to linearization. Assuming that there is no analytic solution for the minima of E X , we minimize Equation 5 numerically. We minimized E X , with respect to log( ) E C , using Gao and Han (2012)'s implementation of the Nelder-Mead (downhill simplex) algorithm using SciPy libraries (Press et al., 1992;Virtanen et al., 2020). The algorithm finishes when the change in the objective function and the maximum change of any parameter between subsequent iterations is less than 4 10 E . Both these criteria must be met for convergence. We chose to use a simple constant-value starting condition such that  log( ) C is the average composition of the five most downstream samples from the Spey, Deveron, Don, Dee, and Tay rivers weighted by upstream area.
The Nelder-Mead algorithm requires 5 10 E -6 10 E iterations for convergence (10-100 hr on a standard desktop computer with a 2.5 GHz Intel i7 processor). The number of iterations depends on the element being inverted and the  E value. Preliminary work instead indicates that a different optimization algorithm, Powell's conjugate gradient method, could generate equivalent results at significantly less (10-100 times) computational cost. A Jupyter notebook containing python implementations all of the calculations described above is provided (see Data Availability Statement).

Synthetic Examples
First we explore the extent to which this inversion scheme can recover a known, synthetic, input. Figure 5a displays a synthetic source-region geochemistry for an arbitrary geochemical element. This "chequerboard" pattern has a peak-to-trough distance of 40 km. From this synthetic input, we then calculate the composition of downstream samples, which become the "observations" used to invert for a source composition. We invert 67 "observations" at locations corresponding to the actual sample locations along the Spey, Deveron, Dee, Don, and Tay rivers. If the inverse scheme is working correctly, the optimal ( , ) E C x y should match the input map displayed in Figure 5a.
Comparison of Figures 5a and 5b shows that the inverse scheme successfully recovers locations and amplitudes of the geochemical signal in almost all of the source region. We emphasize that this input was recovered using just the 67 (synthetic) observations at the sample sites (red dots in Figure 5b). The "pixelation" in Figure 5b is a result of the discretization of E C , discussed above, but does not prevent the scheme from resolving the significant spatial signals. The optimal solution to the inverse problem and the synthetic input are compared on a cross-plot in Figure 5c. Note that the input grid is downsampled (block-mean) to the same resolution as the inverse model predictions prior to comparison. The data lie clustered close to the 1:1 line with an 2 R E of 0.71 and root-mean-square (RMS) misfit of 0.20. These results indicate that the inverse model is unbiased and explains the majority of "observed" variance.  Figure 2c). Inset shows histogram of misfits with binwidth = global RMS misfit; best-fitting normal distribution (black curve) is shown for comparison. Analogous figures for synthetic inputs with different input wavelengths are given in Figure S1 in Supporting Information S1.
12 of 25 Figure 5d shows that the residuals are not randomly distributed in space. The residuals have a Moran's E I of 0.08. While this value is low in magnitude, which indicates a generally small effect size, this value was greater than the expected E I under the null hypothesis with a p-value 0 05 . . These results suggest that the residuals have a statistically significant spatial structure. We attribute this structure to sampling density. For example, the south of the studied region has larger residuals. This region coincides with low sampling density (Figures 2c and 5b). Where the sample density is more consistently high, across the rest of the region, the optimal inverse model successfully recovers "observed" composition. Figure 6 shows how the best-fitting upstream geochemistry relates to the observations of geochemistry downstream. Figure 6b displays the predicted downstream geochemistry for the optimal solution shown in Figure 5b. Overlain on this panel are the synthetic "observations," which were inverted for upstream composition. Figure 6c is a cross-plot of synthetic "observed" downstream concentrations against the predicted concentration from the best-fitting inverse model. These points all lie clustered on the 1:1 line indicating that the model was able to fit the downstream data well. The variation in predicted geochemistry of the arbitrary element as a function of downstream distance, with the observations overlain, is shown in Figure 6e. The discrete "jumps" in concentration are caused by tributaries joining the main channel. In summary, the optimal inverse model fits "observed" downstream sediment geochemistry accurately (  2 1 E R , RMS→ 0 E ). Figure S1 in Supporting Information S1 shows the results of systematically varying the spatial distribution of synthetic source composition. When the spatial structure is small (  (10) E O km) the inverse scheme cannot resolve spatial variability in source composition. However, longer wavelength spatial structures are accurately predicted. These results indicate that, with the data available, the inverse scheme can resolve geochemical spatial structures with wavelengths  20 E km or longer.
The synthetic examples discussed above only consider smooth changes in the input signal. However, sharp changes in geochemistry may occur in reality caused by, for example, changes in lithology. Figure S2 in Supporting Information S1 shows the results from a test analogous to that shown in Figure 5 but with sharp changes in geochemistry. The results indicate that, as expected, the smooth inverse model does not capture the loci of sharp changes in composition but the wider structure is successfully recovered.
Real geochemical data will incorporate some amount of random noise, generated by a range of processes (see discussion of uncertainties above). We test the robustness of predicted upstream geochemistry to noisy data by performing an inversion in which a Gaussian distribution of noise was added to the synthetic downstream data. The magnitude of noise was equal to 5% of the total data variance. The results of this test (shown in Figure S3 in Supporting Information S1) show that despite the noise the target geochemical map is successfully recovered.

Real Data
Having successfully trialled the inversion scheme on synthetic examples we now minimize Equation 5 for the concentration of each studied element independently, for our study area. We seek to identify the smoothest distribution of source region chemistry ( , ) E C x y that minimizes misfit to 67 data constraints downstream. Predicted ( , ) E C x y is tested using the independent G-BASE geochemical survey data set. As an example, we focus first on the results for magnesium. The solutions displayed in Figures 7 and 8  . This value was chosen as it lies in the "elbow" of the data-misfit-roughness plot shown in Figure 9a. Each point on this graph corresponds to the roughness and data-misfit of a solution which minimizes Equation 5 for a specified  E . Choosing  E values greater than the optimum clearly over-smooth the solution relative to the independent G-BASE data set resulting in a poor-fit to the data (Figures 9e and 9f). Conversely, reducing  E allows the scheme to overfit the data with results which are implausible in reference to the independent data set (Figures 9b and 9c).  E must be calibrated in this way for each element. Comparing independent observations to predicted upstream concentrations indicates that the chosen value of  E is close to the optimal value (see Figure S5 in Supporting Information S1). Figure 7a shows the predicted concentration of Mg upstream that best-fits the composition of the 67 downstream samples. Figure 7b shows Mg from the G-BASE database downsampled to the resolution of the inversion grid. The two maps show the same spatial structure. The low Mg concentration of sediments LIPP ET AL.

10.1029/2021GC009838
13 of 25 derived from the felsic intrusions in the center-left of the region are correctly identified by the inverse solution. Similarly, the two lobes of high Mg concentrations in the upper-right of the region, corresponding to sediments derived from mafic intrusions, are also correctly identified in the best-fitting inverse model. A cross-plot of G-BASE data and predicted concentrations is shown in Figure 7c. This figure shows that the predictions correlate with the independent data set and clusters around the 1:1 line. We emphasize again here that the solution displayed in Figure 7a is completely independent of the G-BASE survey data and calculated using only the 67 samples collected downstream. Residuals are normally distributed around 0 and higher in regions where model coverage is low (Figure 7d). The predicted downstream geochemistry, that is, ( ) E F C , for this optimal solution is displayed in Figure 8. Comparing the predicted downstream chemistry indicates that the model captures the important geochemical variability within and between drainage

10.1029/2021GC009838
15 of 25 basins. A cross-plot of predicted and observed downstream geochemistry indicates that the model is unbiased with a regression close to the 1:1 line, and explains 76% of the total variability. We also compare our model predictions for Mg to the full resolution G-BASE data set in Figure S4 in Supporting Information S1. This comparison should be treated with some caution however as it is unreasonable to expect the inversion to resolve details on a scale less than the resolution of the model.
In Figure 10, we compare independent data to predictions from the inverse model for four other elements, chosen as they show a range of different chemical affinities. Calcium (Figures 10a and 10b) shows a broadly similar spatial structure to magnesium in both the inverse solution and the independent data set. The mafic, Ca-rich intrusions in the northeast are correctly identified by the inverse solution as well as the felsic, Ca-poor intrusions in the center of the region. Rubidium (Figures 10c and 10d) has a different chemical affinity to Mg and Ca, and is generally associated with felsic rocks. The best-fitting inverse solution for Rb correctly identifies the regions of elevated Rb concentration associated with the felsic-intrusions. Conversely there is a broad region of predicted low-Rb associated with the sedimentary units in the south-east. The distribution of vanadium is different again to the previous elements, and appears to be mostly set by northeast-southwest trending sedimentary units. These separate domains are correctly identified by the inverse solution. Beryllium has a similar spatial structure to rubidium, consistent with its association with felsic units. The best-fitting maps for all other studied elements are given in the Figures S7-S25 in Supporting Information S1. produce smooth solutions that are a poor fit to the data, for example, panels (e and f). Optimum solutions lie in the "elbow" of this tradeoff plot (Parker, 1994). The optimal solution used in this study is shown in panel (d).
LIPP ET AL. These results indicate that the inverse model is able to successfully recover the spatial distribution of geochemistry for elements with a range of different geochemical affinities. Figure 11a summarizes comparisons of predicted source region composition and the independent G-BASE data set ( 2 R E and RMS values). Figure 11b compares the predicted and observed concentrations for downstream sediments. Note that results for Cu and Pb are not presented because systematic sweeps of the smoothing coefficient,  E , did not yield optimal values in a reasonable amount of time. Mean  2 R 0.32 E and RMS = 0.23 for upstream predictions and the independent G-BASE data set. Fitting of downstream data yields a mean 2 R E and RMS of 0.78 and 0.09, respectively.

Multivariate Analysis
Studying the results from inverse modeling of elements individually neglects the important relationships that exist between elements. If our inverse scheme is successful it should also be able to recover geologically plausible relationships between different elements. By applying PCA to the upstream inverse solutions for all of our studied elements, we can determine whether our inversion has captured meaningful associations between the different elements. In addition, by plotting these associations spatially, we can examine whether it has recovered the different geochemical domains for the region identified in Figure 3.
The first three PCs of the suite of inverse results are shown in Figure 12a using a RGB ternary space. The relationships between the geochemical elements and these PCs are shown in Figure 12d. The first PC corresponds to relative enrichment in felsic lithophile elements (e.g., U, Be). The second PC is associated with metals (e.g., Ni, Co, Ti) and the third appears to be associated with mafic lithophile elements (Mg, Ca). These associations are very similar to the principal geochemical relationships of the G-BASE data set (Figures 3e and 3f). This result indicates that the inverse model correctly identifies major geochemical associations in this region. Moreover, the spatial distribution of these associations mimics that of the G-BASE PCA map, albeit at a lower resolution. The similarities of the G-BASE data structure to that of the predictions

Deterministic Sedimentary Geochemistry
We show that inverse modeling of a small inventory of fluvial sediments can constrain the geochemistry of upstream source regions. Significantly, comparison to independent observations indicates that model predictions tend to be unbiased as residuals are distributed around zero. Successful unmixing suggests that sedimentary geochemistry is determined principally by conservative mixing of source compositions downstream. In transit processes appear to play a moderating role (Menges et al., 2020). Such deterministic behavior is encouraging for quantitative provenance analysis (e.g., Weltje & Eynatten, 2004). The results from this study validate approaches that have previously attempted to describe sediment geochemistry assuming  Figure 11. The first three PCs were extracted from the best-fitting inverse models and passed into a red-greenblue (RGB) ternary space. See panel (d) for key. Note similarity to the principal component map generated from the Geochemical Baseline Survey of the Environment (G-BASE) data set displayed in panel (b) (also in Figure 3e) and the relationship to lithological boundaries (white lines). (b) PC map generated in same way as panel (a) but using the G-BASE data set as input. White lines indicate lithological boundaries. Note that key for this PC map is given in Figure 3e. Reds indicate enrichment in elements with felsic association (e.g., U, Be). Greens indicate enrichment in metallic association elements (e.g., Ni, Co, Ti). Blues indicate mafic association elements (e.g., Mg, Ca). conservative behavior (e.g., Ercolani et al., 2019;Garzanti et al., 2012). While we consider the inorganic sediment fraction only in this study, this approach could in principle be used to understand organic geochemistry which is also strongly controlled by mixing (Menges et al., 2020).
A relationship between climate and fluvial sedimentary geochemistry has been observed in some rivers (e.g., Canfield, 1997;Dinis et al., 2017;Gaillardet et al., 1999;Garzanti et al., 2013Garzanti et al., , 2014Riebe et al., 2003). Therefore, the success of our simple mixing model, which does not explicitly consider any climatic effects (e.g., chemical weathering controlled by climate), is perhaps surprising. However, the results of these studies are not necessarily inconsistent as we consider only relatively small catchments, which do not cross large climatic gradients. An analogous study in areas of strong climatic gradients might be a means to produce maps of source region geochemistry and better explore the role of climate (e.g., Angola; Dinis et al., 2017).

Nonconservative Behavior
An exception to the general rule of unbiased predictions (i.e., residuals distributed around zero) is calcium, which is overpredicted relative to G-BASE (Figures 10a, 10b and S9 in Supporting Information S1). One explanation for this result is the adsorption of dissolved calcium cations to the surface of clays in sediments, which is observed in rivers globally (Lupker et al., 2016;Sayles & Mangelsdorf, 1979;Tipper et al., 2021). In contrast, the predicted spatial structure of zirconium is similar to that of G-BASE, but it is systematically underpredicted ( Figure S25 in Supporting Information S1). We attribute underprediction to the measurement of zirconium in the laboratory. Comparison to standards indicates zirconium measurements are underestimates because they tend to be hosted in resistate minerals. As a consequence, while the optimal model yields a reasonable good fit to measured downstream compositions, predicted concentrations in source regions are too low.
Hydrodynamic sorting imposes strong geochemical variability locally on sediments Eynatten et al., 2012Eynatten et al., , 2016. We have attempted to avoid this effect by sampling a constant grainsize fraction ( 150 μm) of bed material at each sample site. However, this implicitly assumes that the distribution of grainsizes beneath our threshold of 150 E μm is constant across all sites. Instances where this assumption is not valid may contribute to model misfit, for example where different lithologies produce sediments with different grainsize distributions. In future studies, this effect could be minimized by gathering grainsize data at all sites and performing statistical corrections (e.g., Bloemsma et al., 2012). Alternatively, for large rivers, depth profiles can be gathered across the water column and the geochemistry integrated across all grainsizes (e.g., Baronas et al., 2020)

Resolution and Other Limitations
Predictions from the inversion scheme are similar to low-pass filtered maps of the true source region geochemistry. Figure 13 shows the optimal models for elements and G-BASE after application of a two-dimensional Gaussian filter of width 25 km, which is the estimated effective resolution of the scheme deduced from inversion of synthetic data (see Figure S1 in Supporting Information S1). The spatial structure of the two filtered maps is similar for these elements. The effective resolution of the inverse model depends upon the size of the nested sub-catchments (see e.g., Figure 2c). Spatial variability within a sub-catchment is averaged out and hence unresolvable. Inverse modeling of synthetic and real observations shows that the most significant source of error is low sampling density (e.g., Figures 5 and 7). Sampling campaigns could be designed with the specific goal of creating nested, equal-area, catchments.
RMS misfit between the G-BASE data set and model predictions is low for all elements we studied. However, some elements, for example Mn, have an 2 R E close to zero. This result indicates that predicted compositions have no relationship to the G-BASE data set. Most low 2 R E values are associated with elements that have almost no long-wavelength spatial structure in the G-BASE data set (e.g., Figure S16 in Supporting Information S1). Instead the distribution of these elements is dominated by short-wavelength variability. We suggest that the inverse model fails to replicate these observation because there is almost no long wavelength spatial structure to resolve in the first instance. This limitation could be remedied by greater sampling densities.

of 25
Comparing model predictions to independent observations shows that concentrations of some elements (e.g., Mg; Figure 7) are over-predicted in areas of high concentration and under-predicted in regions of low concentration. We attribute this result to the chosen model not being sufficiently smoothed (i.e., the chosen  E value, extracted from the loci of maximum curvature in Figure 7c). This result is consistent with Figure S5 in Supporting Information S1 which shows that the model that is a closest fit to the independent data for Mg is more smoothed than the model chosen by the method of maximum curvature. As a result, there appears to be a slight tendency for our approach to return an under-smoothed model (i.e., to overfit the downstream data). Although we note that the magnitude of the difference in fit to the independent data are small between the optimal and chosen models. This result indicates that exploring other methods to identify the "best" inverse model would be worthwhile. It is interesting that we tend to overfit the data given that the RMS misfit between the predicted and observed data are generally an order of magnitude greater than the estimated data uncertainties discussed above. This result indicates that processes other than a pure mixing model, and sampling error, contribute to the geochemistry downstream.

Further Work
The method presented here does not produce uncertainties for the predictions of source-region geochemistry. For practical applications of this method it would be highly desirable to generate uncertainties for the predictions. A number of ways to generate uncertainties exist. Cross-validation, where the data are repeatedly resampled with some random samples excluded before the inversion is performed, is one way to generate an ensemble of predictions. From this ensemble, both a central prediction and associated uncertainties can be extracted.
A further approach to generate an ensemble of models is reversible jump Markov Chain Monte Carlo (rjM-CMC) modeling, which has been successfully applied for seismic tomography, inverting river profiles for uplift histories, and for modeling thermochronological data (Bodin & Sambridge, 2009;Stephenson et al., 2006). In rjMCMC modeling, the spatial resolution of the solution adapts to the data itself. In this way, the scheme allows for rapid near-discontinuities where the data allows it, but does not solve for redundant nodes in areas of low coverage. It can also, alongside an optimum model solution, return an estimate of the uncertainty of the predictions and does not require a semi-subjective choosing of optimal inverse solutions.
A current limitation of the approach presented here is variable data coverage due to spatial variations in sample density. This issue could be partially ameliorated by careful sampling campaign design, or by using adaptive resolution methods such as rjMCMC discussed above.
The current approach treats each individual geochemical element separately. However, as shown in the PCA analysis (Figure 12), there is a large amount of redundant information between the different elements. This is a result of elements with similar chemical affinities behaving in the same way, and hence have strongly covarying concentrations. Therefore, inverting for the upstream concentrations for multiple elements simultaneously, potentially making use of dimension reducing techniques such as PCA, may be a more efficient approach than solving for each element separately. This approach would however have to respect the closure constraint imposed by compositional data (i.e., the summed concentration of all the elements must be strictly less than 100%).

Geochemical Mapping
Producing geochemical maps of Earth's surface remains an ongoing challenge for applied geochemistry. Geochemical maps are essential data-products for identifying regions of elevated elemental concentrations, which may indicate economic mineralization or contamination. Such maps may impact on the application of regulatory controls or agricultural land management (Ander et al., 2013). However, it is estimated that at present only 20 % of the Earth's surface has been mapped geochemically at any scale (Liu et al., 2021). Part of this challenge is due to the fact that mapping large, continental-scale, areas is logistically challenging and can be extremely expensive due to the large numbers of geochemical samples that must be processed. As a result, considerable attention has been focused on developing methods to produce geochemical maps with small numbers of samples (e.g., Birke  . Despite this much lower sample density an inverse procedure was shown to identify the dominant spatial geochemical structures of the region. While our scheme cannot resolve very fine spatial structures, in many instances, this limitation may be reasonable given the significant reduction in samples required.

Conclusions
A methodology to invert small inventories of river sediment compositions for the elemental composition of source regions is presented. This "unmixing" scheme was tested using real data gathered from five rivers in the Cairngorms, UK. Inversion of synthetic data indicates upstream spatial signals at scales 20 km can be recovered. The concentration of 22 elements across an area of 12,800 2 km E were predicted by inverting 67 downstream samples. Predictions are validated using independent observations from geochemical surveying. They tend to be unbiased and successfully recover the long-wavelength (>20 km) distribution of chemical concentrations in the region. Multivariate analysis indicates that optimal inverse models successfully identify meaningful geochemical associations between different elements. The success of this unmixing procedure indicates that in-transit modification of bulk sediment geochemistry is subordinate to the effect of mixing. Such inverse approaches are a novel way to map the geochemistry of large areas at a low sampling density. The results indicate that sedimentary elemental geochemistry is in part deterministic. Inverse schemes, such as the one we present here, are a step toward fully quantitative models of sediment provenance.