Lignin phenols have proven to be powerful biomarkers in environmental studies; however, the complexity of lignin analysis limits the number of samples and thus spatial and temporal resolution in any given study. In contrast, spectrophotometric characterization of dissolved organic matter (DOM) is rapid, noninvasive, relatively inexpensive, requires small sample volumes, and can even be measured in situ to capture fine-scale temporal and spatial detail of DOM cycling. Here we present a series of cross-validated Partial Least Squares models that use fluorescence properties of DOM to explain up to 91% of lignin compositional and concentration variability in samples collected seasonally over 2 years in the Sacramento River/San Joaquin River Delta in California, United States. These models were subsequently used to predict lignin composition and concentration from fluorescence measurements collected during a diurnal study in the San Joaquin River. While modeled lignin composition remained largely unchanged over the diurnal cycle, changes in modeled lignin concentrations were much greater than expected and indicate that the sensitivity of fluorescence-based proxies for lignin may prove invaluable as a tool for selecting the most informative samples for detailed lignin characterization. With adequate calibration, similar models could be used to significantly expand our ability to study sources and processing of DOM in complex surface water systems.
 The biogeochemistry of dissolved organic matter (DOM) has emerged as a critical component of the global carbon cycle, capable of transferring significant amounts of reduced carbon between global reservoirs while fueling food webs in aquatic environments. Processes that affect DOM cycling are intimately linked to DOM composition and the reactivity of individual molecular structures within the DOM pool. Biomarker analytical techniques such as CuO oxidation for lignin [Hedges and Ertel, 1982] are important tools in DOM research because they provide insight into the molecular world of DOM, and thus are crucial to understanding DOM reactivity.
 Lignin provides important source information about the vascular plant or terrigenous component of DOM, and can also capture diagenetic history. Measurements of dissolved lignin have been utilized to show that terrigenous OM is just a few percent of the oceanic DOM pool despite the fact that riverine DOM fluxes to the ocean are greater than average turnover of the oceanic DOM pool [Hernes and Benner, 2006; Opsahl and Benner, 1997]. These findings indicate rapid loss rates of terrigenous DOM in marine environments. One likely sink, photochemical oxidation, imparts a chemical signature to the lignin component of DOM, namely elevated acid:aldehyde ratios along with decreasing ratios of syringyl:vanillyl phenols in the high molecular weight (∼1 to 100 nm) fraction [Hernes and Benner, 2003; Opsahl and Benner, 1998]. In freshwater systems, concentrations and compositions of dissolved lignin demonstrate the strong influence of local landscape-scale features on DOM [Eckard et al., 2007] as well as the hydrology of the system [Dalzell et al., 2007; Hernes et al., 2008]. Furthermore, carbon-normalized lignin yields have been used to show that a significant fraction of riverine DOM may not be vascular-plant derived as commonly thought [Hernes et al., 2007; Spencer et al., 2008]. Together, these studies point toward vascular plant-derived DOM as a dynamic pool that undergoes large changes both spatially and temporally in aquatic environments, and demonstrate that lignin is a powerful tracer of significant environmental processes. However, for many studies that would benefit from the use of lignin analyses, the analytical difficulty and expense of CuO oxidation significantly impede our ability to achieve sufficient spatial and temporal resolution to capture the dynamics of natural systems.
 In stark contrast to time-consuming and complex molecular biomarker measurements, spectrophotometric techniques such as DOM fluorescence can be rapidly measured on benchtop machines or collected nearly continuously by in situ instruments. Although the information obtained through fluorescence measurements is not as definitive as molecular biomarker analyses, fluorescence data are still directly tied to the chemical composition of dissolved constituents, and as such can be used to determine DOM source and processing history. A number of studies have related fluorescence properties of natural waters to the concentration, source and composition of DOM [e.g., Bergamaschi et al., 2005; McKnight et al., 2001; Stedmon et al., 2003]. For example, certain features in fluorescence excitation emission matrices (EEMs) such as hypsochromic shifts in peaks attributed to humic and fulvic-like material have been attributed to a breakdown in aromaticity [Blough and Del Vecchio, 2002; Coble, 1996] and thus have been used to distinguish between allochthonous and autochthonous-derived DOM [Spencer et al., 2007a]. The ability to use spectrophotometric measurements to distinguish between DOM of different sources and observe DOM processing coupled with the emerging ability to conduct these measurements in situ leads to the potential for high-frequency, real-time monitoring of aquatic systems.
Hedges et al.  hypothesized that detailed surveys based on spectrophotometric measurements might allow DOM dynamics (e.g., photochemical breakdown) to be observed in the coastal ocean. Recent studies have indeed shown the potential of high-frequency spectrophotometric measurements to discriminate DOM of different sources and to monitor changes in DOM composition over short timescales in both marine [Coble, 2007] and riverine systems [Spencer et al., 2007b]. The focus of this study was to construct fluorescence-based models for lignin concentrations and compositions measured in a diverse estuarine environment as a potential tool for capturing the dynamics of lignin cycling with greater sensitivity and temporal/spatial coverage afforded by fluorescence measurements.
2.1. Sampling and Analyses
 Fifty-eight water samples (0.2 μm filtered) were collected from 11 stations within the Sacramento River/San Joaquin River Delta (hereafter referred to as the Delta) over the course of two annual hydrological cycles, December 1999 to May 2001 (Figure 1 [Stepanauskas et al., 2005]). The Delta is the tidal freshwater portion of the San Francisco Bay Estuary, located in central California, United States, and consists of many contrasting habitats and land uses. The 11 stations sampled included wetlands, rivers, channelized waterways, and open water sites. All Delta samples were analyzed within 24 h for fluorescence properties and DOM concentrated by XAD8 and XAD4 according to the methods of Aiken et al. , then freeze-dried [Kraus et al., 2008]. As these measurements were part of a much larger study, XAD was used in order to isolate enough material for several different analyses, as opposed to C18 which is used more routinely for just lignin measurements. However, XAD recovers lignin in freshwater with ∼90% efficiency with minimal compositional fractionation compared to rotary evaporated water or C18 (unpublished data), thus the lignin measurements on XAD are essentially whole-water measurements. Lignin analyses on XAD extracts were done by alkaline CuO oxidation at 155°C in Monel reaction vessels, extracted by ethyl acetate, and analyzed by GC-MS using a five-point calibration scheme with a precision of ∼10% for individual lignin phenols [Eckard et al., 2007; Hernes and Benner, 2002]. These lignin measurements together with the fluorescence spectra served as the basis for constructing our models. In addition to the Delta samples, 25 water samples (0.2 μm filtered) were collected from the San Joaquin River at Crow's Landing (∼20 miles upstream of the Delta) at 2-h time intervals starting from noon, 28 July 2005 through noon, 30 July 2005 as described by Spencer et al. [2007b]. 
The San Joaquin River drains the southern portion of the Central Valley in California, is disconnected from its source during summer months due to water diversions, and drains predominantly row crop agriculture, orchard and wetlands during this period [Kratzer et al., 2004]. Samples at Crow's Landing were analyzed for fluorescence properties within 24 h of collection, which were then used to model lignin compositions and concentrations.
 Fluorescence was measured on filtered water from both study sites using SPEX model Fluoromax 2 (Delta samples) and 3 (Crow's Landing samples) spectrophotometers (Horiba Jobin Yvon, Ltd., Japan) at room temperature in a 10 mm quartz cuvette. Excitation occurred over the wavelengths 250–440 nm at 10 nm intervals; for each excitation wavelength, emissions were detected within the 300–600 nm range at 1 nm intervals, but the lowest emission wavelength utilized in this study was always 50 nm higher than the excitation wavelength, which avoids Rayleigh scattering features. All resulting excitation-emission matrices (EEMs) were water blank-corrected, Raman-normalized, and corrected for instrument biases using correction factors supplied by the manufacturer. Thus, fluorescence spectra from the Fluoromax 2 and Fluoromax 3 are directly comparable. Fluorescence spectra were also corrected for inner filter effects (IFEs) [Spencer et al., 2007a]. Lignin phenols measured included three vanillyl (V) phenols (vanillin, acetovanillone, vanillic acid), three syringyl (S) phenols (syringaldehyde, acetosyringone, syringic acid), and two cinnamyl (C) phenols (p-coumaric acid, ferulic acid) (Figure 2).
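The emission window described above can be sketched in a few lines of Python. This is only an illustration of one reading of the text (emission kept when it is at least 50 nm above excitation); the names `usable_pairs`, `EXCITATION_NM`, `EMISSION_NM`, and `MIN_OFFSET_NM` are hypothetical and are not taken from the instrument software or the original processing scripts.

```python
# Wavelength grid as described in the text (1 nm emission steps).
EXCITATION_NM = list(range(250, 441, 10))  # 250, 260, ..., 440
EMISSION_NM = list(range(300, 601, 1))     # 300, 301, ..., 600
MIN_OFFSET_NM = 50  # assumption: keep emission >= excitation + 50 nm


def usable_pairs(excitations, emissions, min_offset):
    """Return the (Ex, Em) pairs kept after dropping the scatter region."""
    return [
        (ex, em)
        for ex in excitations
        for em in emissions
        if em >= ex + min_offset
    ]


pairs = usable_pairs(EXCITATION_NM, EMISSION_NM, MIN_OFFSET_NM)
print(len(EXCITATION_NM), "excitation wavelengths")
print(len(pairs), "usable Ex/Em pairs at 1 nm emission resolution")
```

The same function can be called with a coarser emission list to see how the number of retained pairs shrinks when the emission step is increased.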
2.2. Modeling and Application
 Partial Least Squares (PLS) models were constructed using The Unscrambler software (Camo) to predict lignin concentrations and compositions in the Delta from fluorescence EEMs (models generated with 10 nm emission steps performed identically to those with 1 nm steps; we chose the former to decrease run times, thus using ∼350 emissions per EEM). All measured excitation/emission pairs were used in the model with no prior parameterization in order to take full advantage of all information in the EEM. PLS is similar to principal component analysis (PCA) in that a large number of variables are reduced to orthogonal principal components (PCs). However, whereas PCA is used to find overall structure between all variables in a single matrix, PLS iteratively computes PCs for both input (fluorescence) and output (lignin) variable matrices while utilizing linear regression to maximize the covariance between the two matrices. Thus, PLS is a more appropriate choice when building predictive models. Further information about PLS modeling can be found in the tutorial by Geladi and Kowalski . All fluorescence and lignin data were normalized by one standard deviation for consistency; however, we observed that normalization of the lignin data had no effect on the resulting model. All variables were mean centered for the actual models, although this data pretreatment was eliminated for ease of interpretation when investigating the regions of the EEMs with the most predictive power. Mean centering was observed to improve model fits by ∼5%. All models underwent “leave-one-out” cross validation, in which each sample is sequentially left out and then predicted from a model calibrated with the remaining samples. A generated model consists of regression coefficients in a matrix identical in size to the EEMs along with a baseline value for a specific lignin parameter, the latter of which is similar to an intercept in linear regression. 
When mean centering is applied, this baseline value is approximately equal to the average of all the sample lignin values. Predictions are made by multiplying the regression coefficients pairwise with any identically sized EEM, summing all the products, then combining that sum with the baseline value. Residuals between predicted and observed are used to evaluate model performance. Fifteen PCs were calculated for each model; however, only the first 3–7 were significant. We generated model relationships for lignin concentrations (μg L−1), carbon-normalized lignin yields (mg 100 mg OC−1), ratios of cinnamyl to vanillyl phenols, C:V, ratios of syringyl to vanillyl phenols, S:V, and ratios of vanillic acid to vanillin, (Ad:Al)v. Subsequently, these model relationships were used to predict lignin concentrations and compositions from a diurnal study in the San Joaquin River at Crow's Landing [Spencer et al., 2007b].
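The calibration, prediction, and leave-one-out steps described above can be illustrated with a deliberately small sketch. It is not The Unscrambler's implementation: it fits a single-component PLS1 model (one lignin parameter, one latent component) on mean-centered data in plain Python, whereas the models in this study used 3–7 significant components. The function names and the synthetic data are hypothetical.

```python
import math


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model on mean-centered data.

    X holds one flattened EEM (list of intensities) per sample and y the
    matching lignin values. Returns (coefficients, x_mean, baseline).
    """
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [value - y_mean for value in y]
    # Weight vector: direction in EEM space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = math.sqrt(sum(value * value for value in w))
    w = [value / norm for value in w]
    # Sample scores on that direction, then regress y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    q = sum(yc[i] * t[i] for i in range(n)) / sum(value * value for value in t)
    coefficients = [q * value for value in w]
    # With mean centering, the baseline is the mean of the calibration y values.
    return coefficients, x_mean, y_mean


def predict(eem, coefficients, x_mean, baseline):
    """Multiply coefficients pairwise with the centered EEM, sum, add baseline."""
    products = (b * (x - mu) for b, x, mu in zip(coefficients, eem, x_mean))
    return baseline + sum(products)


def leave_one_out(X, y):
    """Predict each sample from a model calibrated on all other samples."""
    predictions = []
    for i in range(len(X)):
        model = fit_pls1_one_component(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        predictions.append(predict(X[i], *model))
    return predictions


# Synthetic, hypothetical data: every "EEM" is one spectral shape scaled by a
# concentration-like factor, and "lignin" is linear in that factor.
shape = [0.2, 0.5, 1.0, 0.7, 0.3]
factors = [1.0, 2.0, 3.5, 5.0, 6.0]
X_demo = [[f * s for s in shape] for f in factors]
y_demo = [2.0 * f + 1.0 for f in factors]

loo_predictions = leave_one_out(X_demo, y_demo)
for observed, predicted in zip(y_demo, loo_predictions):
    print(f"observed {observed:6.2f}   predicted {predicted:6.2f}")
```

Because the synthetic data contain a single spectral shape, one component reproduces the left-out values almost exactly; real EEMs would need more components and would leave nonzero residuals, which is what the cross-validation is meant to expose.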
 Given the growing interest in parallel factor analysis (PARAFAC) as a data reduction technique for fluorescence EEMs, we also generated 3-, 5-, and 7-component PARAFAC models for Delta fluorescence samples to test whether any of the components would correlate with our lignin parameters. PARAFAC was conducted in MATLAB using the N-way Toolbox version 3.10 [Andersson and Bro, 2000] on 58 samples. All models were verified by comparing the measured and modeled EEMs for all samples and observing the intensity of the residual EEMs. Split-half analysis was not used for these models because of the limited number of samples.
 All lignin data for this modeling exercise have been previously published in the work of Eckard et al. . To briefly summarize (Table 1), lignin concentrations in the modeled samples varied from 3.0 to 36.8 μg L−1, with the highest average concentrations (18.5 μg L−1) in the wetlands and the lowest average concentrations (7.7 μg L−1) in the rivers. Carbon-normalized yields, Λ8, ranged from 0.13 mg 100 mg OC−1 to 0.846 mg 100 mg OC−1, with the highest average values (0.48) again in the wetlands while the lowest average values (0.29) were measured in the open water sites. Ratios of syringyl to vanillyl phenols, S:V, were highest on average (1.28) in the open water sites and lowest (0.94) in the river samples while ranging from 0.59 to 1.53. Ratios of cinnamyl to vanillyl phenols, C:V, varied from 0.13 to 0.85 with similar distributions to S:V, with the highest average values (0.79) in the open water sites and the lowest average values (0.29) in the rivers. Finally, the vanillic acid:vanillin ratios, (Ad:Al)v, modeled from this data set varied nearly fourfold from 0.83 to 3.21, with highest average values in the open water (1.44) and river (1.42) sites and lowest average values in the wetland (1.11) and Delta channel sites (1.12).
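For readers less familiar with the lignin parameters summarized above, the following sketch shows how they follow from the eight individual phenol concentrations. The input values are invented for illustration, the function and dictionary keys are hypothetical, and the use of a dissolved organic carbon concentration in mg L−1 as the denominator for Λ8 is an assumption about units rather than a description of the original data processing.

```python
VANILLYL = ("vanillin", "acetovanillone", "vanillic acid")
SYRINGYL = ("syringaldehyde", "acetosyringone", "syringic acid")
CINNAMYL = ("p-coumaric acid", "ferulic acid")


def lignin_parameters(phenols_ug_per_l, doc_mg_per_l):
    """Derive summary lignin parameters from eight phenol concentrations.

    phenols_ug_per_l maps phenol name to concentration in ug/L;
    doc_mg_per_l is dissolved organic carbon in mg/L (assumed input).
    """
    v = sum(phenols_ug_per_l[name] for name in VANILLYL)
    s = sum(phenols_ug_per_l[name] for name in SYRINGYL)
    c = sum(phenols_ug_per_l[name] for name in CINNAMYL)
    sigma8 = v + s + c  # total lignin phenols, ug/L
    # Convert ug/L to mg/L, then express per 100 mg of organic carbon.
    lambda8 = (sigma8 / 1000.0) / doc_mg_per_l * 100.0
    return {
        "Sigma8": sigma8,
        "Lambda8": lambda8,
        "S:V": s / v,
        "C:V": c / v,
        "(Ad:Al)v": phenols_ug_per_l["vanillic acid"] / phenols_ug_per_l["vanillin"],
    }


# Hypothetical sample, not taken from the study.
example_phenols = {
    "vanillin": 2.0, "acetovanillone": 1.0, "vanillic acid": 3.0,
    "syringaldehyde": 2.5, "acetosyringone": 1.5, "syringic acid": 2.0,
    "p-coumaric acid": 1.5, "ferulic acid": 1.5,
}
example_parameters = lignin_parameters(example_phenols, doc_mg_per_l=3.0)
for name, value in example_parameters.items():
    print(f"{name}: {value:.2f}")
```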
Table 1. Summary of Sacramento River/San Joaquin River Delta Lignin Phenol Data From Eckard et al.  Used to Build Partial Least Squares Models and Ancillary Optical Data
Absorption coefficient at 254 nm is included as an indicator of the relative color of the water samples used in this study. Wetland samples dominated the high a254 values.
 Fluorescence EEMs for the four sample types were quite similar (see Figure 3 for an example of each). Primary distinguishing features include a hypsochromic shift in the humic-like peak A (see Stedmon et al.  for a summary of EEM peaks) in the river sample (centered at 250/450 in the river versus 250/470 in the remaining samples), and a generally higher overall fluorescence intensity in the wetland sample (max intensity of ∼1 Raman Unit, R.U., versus 0.8 for the open water and 0.6 for the river and channel).
 Our modeling of the five principal lignin parameters within the Delta data set exhibited varying degrees of success. Predicted lignin concentrations for the leave-one-out validation samples were well correlated (p < 0.0001, Student's t test) with observed measurements (r2 = 0.91, Figure 4a), as were carbon-normalized yields (r2 = 0.79; Figure 4b), S:V (r2 = 0.74, Figure 4c), and C:V (r2 = 0.50, Figure 4d). However, the model was not successful at predicting (Ad:Al)v (r2 = 0.09, plot not shown). In addition to correlation coefficients, the number of significant PCs gives an indication of the overall robustness of individual models; in general, fewer PCs indicate more robust models. The number of significant PCs for the four successful models ranged from three (C:V) to seven (Λ8) (Figure 4). There is currently no universally accepted approach for estimating prediction error with PLS modeling [Nadler and Coifman, 2005; Zhang and Garcia-Munoz, 2009], but the most intuitive approaches relate the residuals (the difference between predicted and observed values) to either the total number of samples (standard error of prediction, SEP) or the average of the observed values (average relative error, ARE). SEP (square root of the sum of squares of the residuals divided by total number of samples) provides reasonable estimates of the absolute error near the average of the range of values, but overestimates the absolute error in the lower range and underestimates it in the upper range. ARE (average residual expressed as a percent of average observed value) likely underestimates the true error, but because it is expressed as a percentage it is a more consistent representation of prediction error throughout the entire range of values. We therefore report AREs for this study, which ranged from 9% for the S:V model to 21% for Λ8 (Figure 4). Expressing SEP as a percentage of the average gives values ∼2% higher than ARE.
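The two error measures defined above can be written out as a short calculation. This sketch assumes that the "average residual" in ARE means the average absolute residual, which the text does not state explicitly; the function name and the example numbers are hypothetical.

```python
import math


def prediction_errors(observed, predicted):
    """Return (SEP, ARE in percent, SEP as percent of the mean observed value)."""
    residuals = [p - o for o, p in zip(observed, predicted)]
    n = len(residuals)
    mean_observed = sum(observed) / n
    # SEP: square root of the sum of squared residuals over the sample count.
    sep = math.sqrt(sum(r * r for r in residuals) / n)
    # ARE: mean absolute residual relative to the mean observed value (assumed).
    are_percent = 100.0 * (sum(abs(r) for r in residuals) / n) / mean_observed
    sep_percent = 100.0 * sep / mean_observed
    return sep, are_percent, sep_percent


# Hypothetical observed and predicted lignin concentrations (ug/L).
observed_demo = [10.0, 20.0, 30.0]
predicted_demo = [12.0, 18.0, 33.0]
sep_demo, are_demo, sep_pct_demo = prediction_errors(observed_demo, predicted_demo)
print(f"SEP = {sep_demo:.2f}, ARE = {are_demo:.1f}%, SEP/mean = {sep_pct_demo:.1f}%")
```

Under this reading, SEP expressed as a percentage can never be smaller than ARE, because a root mean square is always at least as large as the corresponding mean absolute value; this is consistent with the slightly higher SEP percentages noted above.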
 PARAFAC analysis generally produced varied fluorescence components dominated by the humic Peak A or Peak C (Table 2). There was no significant correlation between any of the components and lignin concentrations or compositions, with r2 ranging from 0 to 0.14.
Table 2. Results From PARAFAC Modeling of Delta Fluorescence
 Predicted lignin concentrations at Crow's Landing over the 48 h sampling period ranged from 5.9 to 10.7 μg L−1 and exhibited a diurnal pattern with maximum concentrations midmorning and minimum concentrations midafternoon (Figure 5a). Predicted carbon-normalized lignin yields followed an identical pattern and ranged from 0.04 to 0.21 mg 100 mg OC−1 (Figure 5b). In contrast, predicted S:V and C:V ratios demonstrated little variability over the course of the 48 h, with S:V averaging 1.25 and C:V averaging 0.81 (Figures 5c and 5d).
 The efficacy of lignin as a tracer for DOM cycling hinges upon the diverse and specific information that can be obtained from a single analysis. The proportion of DOM that is vascular-plant derived or terrigenous in origin can be quantified using lignin concentrations and carbon-normalized yields [Hernes and Benner, 2002; Hernes et al., 2007; Opsahl and Benner, 1997]. Lignin ratios, S:V and C:V, can be used to distinguish between woody and nonwoody angiosperm and gymnosperm tissues [Hedges and Mann, 1979]. S:V and C:V signatures have also been used to quantify landscape-scale sources of DOM in the Delta [Eckard et al., 2007]. In addition, S:V and (Ad:Al)v in DOM are both sensitive to photooxidation and thus can capture diagenetic history [Hernes and Benner, 2003; Opsahl and Benner, 1998; Spencer et al., 2009b]. Our results suggest great potential for using fluorescence EEMs as a proxy for lignin concentration, carbon-normalized yields, S:V ratios, and to a lesser extent C:V ratios in systems for which fluorescence EEMs have been well calibrated with lignin measurements. Although our generated models resulted in highly significant correlations between predicted and observed values for these four parameters, the diversity of samples and sources used in this modeling exercise (e.g., wetland dissolved lignin from monocots is different from riverine lignin derived from forests and agricultural lands) likely contributed to a lower overall model fit to the data. In other words, fluorescence proxies for lignin are likely to be even more successful when the overall source of lignin to a set of samples is more homogeneous, as has been demonstrated for modeling lignin concentrations with absorbance data in the Yukon River system [Spencer et al., 2009a].
 Our modeling results are consistent with theoretical relationships between fluorescence and chemical functionality. While fluorescence intensity at any given wavelength pairing for excitation-emission depends on the concentration of fluorescing functional groups, the pattern observed in EEMs relates to functional group diversity. Thus, concentrations of fluorescing compounds like lignin should be strongly related to fluorescence intensity, while modeling success for compositional ratios will be dependent on relative differences in chemical structure between the components of the ratios. In the case of carbon-normalized yields, the difference in functionality related to fluorescence between polyphenolic lignin and bulk DOM is significant, hence a strong predictive capability. In contrast, (Ad:Al)v is derived from similar propylphenol structures within the polyphenolic structure that likely differ by only a single oxygen on the propyl side chain (e.g., Figure 2). Syringyl phenols differ from vanillyl phenols by a single methoxy group on the aromatic ring, while differences between cinnamyl and vanillyl phenols occur on both the aromatic ring (number of methoxy groups) and the propyl side chain (presence of carbon-carbon double bonds in the cinnamyl phenols) (Figure 2). The strong predictive capability for C:V and S:V in comparison to (Ad:Al)v would suggest that methoxyl substitution patterns on the aromatic ring and double bonding in the propyl side chain lead to more unique fluorescence patterns than oxygenation of the side chain. One caveat to this interpretation is that lignin is not solely responsible for all fluorescence properties. However, it seems apparent from the strong modeled relationships for both concentration and composition that the lignin contribution to the overall fluorescence signature is unique, significant, and quantifiable.
 In order to explore excitation-emission (Ex/Em) pairs with the greatest predictive capacity, we generated contour plots in the same 3D space generally used for plotting EEMs by multiplying model regression coefficient matrices pairwise by a sample EEM, in essence performing the first step in using a PLS model for generating a prediction. For this exercise, we chose the sample that demonstrated the tightest overall fits to all four successfully modeled lignin parameters, a San Joaquin River sample (Station RJ in Figure 1, sample EEM presented in Figure 3a) collected in mid-July. We also reran the models without centering the data, which simplifies data interpretation. Regression models generated from centered data include both positive and negative coefficients because the starting value for predicting a lignin parameter is approximately the average value for the whole sample set, and the regression coefficients then adjust that value up or down. Regression models generated from uncentered data primarily “build up” from zero.
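The contribution maps described above amount to an element-by-element product of the coefficient matrix and one sample EEM. The sketch below uses a tiny, invented 3 × 3 grid purely to show the mechanics; the function names, wavelengths, coefficients, and intensities are all hypothetical and do not come from the fitted models.

```python
def contribution_map(coefficients, eem):
    """Multiply regression coefficients pairwise with a sample EEM."""
    return [
        [b * x for b, x in zip(coef_row, eem_row)]
        for coef_row, eem_row in zip(coefficients, eem)
    ]


def rank_pairs(contributions, excitations, emissions):
    """List (Ex, Em, contribution) sorted by absolute contribution, largest first."""
    flat = [
        (ex, em, contributions[i][j])
        for i, ex in enumerate(excitations)
        for j, em in enumerate(emissions)
    ]
    return sorted(flat, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical uncentered model and sample (rows: excitation, columns: emission).
excitations_demo = [250, 260, 270]
emissions_demo = [300, 350, 400]
coefficients_demo = [[0.0, 0.1, 0.0], [0.5, 0.2, 0.0], [0.0, 0.0, 0.05]]
eem_demo = [[1.0, 2.0, 3.0], [4.0, 1.0, 0.5], [0.2, 0.4, 2.0]]

contributions_demo = contribution_map(coefficients_demo, eem_demo)
ranked_demo = rank_pairs(contributions_demo, excitations_demo, emissions_demo)
total_demo = sum(sum(row) for row in contributions_demo)

print("largest contribution at Ex/Em:", ranked_demo[0][:2])
print("sum of contributions:", round(total_demo, 3))
```

Summing the map and adding the model's baseline value gives the prediction itself, so the map shows how much each Ex/Em pair contributes to the predicted lignin parameter for that particular sample.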
 Lignin is often considered a component of humic substances, yet the region of the EEMs with greatest predictive capability did not fall within the traditionally defined humic-like regions as summarized by Stedmon et al.  (Ex = 260 nm/Em = 400–460 nm, and Ex = 320–360 nm/Em = 420–460 nm), but rather within the region generally attributed to aromatic amino acids like tryptophan or tyrosine (Figure 6). Structurally, the propylphenol monomers that make up lignin are similar to tyrosine and tryptophan, and tannins, another class of polyphenols, have been shown to fluoresce in the same region of the EEM [Maie et al., 2007]. It is noteworthy that neither the protein-like regions that contribute the most toward predicting lignin parameters nor the high Ex/Em region that contributes toward lignin concentration (Figure 6a) contains prominent features in the overall EEM (Figure 3). Thus, one important conclusion from this exercise is that the entire EEM contains valuable information and not just the prominent features that are typically identified and interpreted. This is reinforced by the PARAFAC results, which isolated the prominent A and C peaks, neither of which has significant predictive power for lignin in this system.
 The ability to predict lignin concentrations and compositions using EEMs has the potential to greatly extend research capabilities in terrestrial freshwater systems as well as coastal zones with large riverine input. In the Delta study, for instance, lignin compositions were used to demonstrate significant changes in DOM sources over the course of one hydrologic season as well as the strong influence of localized landscape features on DOM composition [Eckard et al., 2007]. However, the limitations of sampling and analytical throughput permitted only six snapshots. With field autosamplers and benchtop analyses of fluorescence, or even in situ monitoring of fluorescence, and with proper molecular-level calibration, future studies in the Delta could be expanded well beyond six snapshots to provide a much more detailed picture of vascular plant-derived DOM cycling.
 In addition to much greater sampling throughput, spectrophotometric properties of DOM can typically be measured with much greater sensitivity and reproducibility than discrete molecular measurements. Thus, well-calibrated fluorescence models for lignin or other biomarkers have the potential to reveal fine-scale changes that would be challenging to detect with discrete biomarker measurements alone. For example, we applied our Delta-derived models to a 48-h diurnal study that was conducted during the summer on the San Joaquin River (Crow's Landing) approximately 20 miles upriver from the San Joaquin River site (Vernalis) included in the modeled Delta data set (Figure 1). Predicted lignin parameters at Crow's Landing compared favorably to those measured at Vernalis, with the Crow's Landing lignin concentrations and C:V falling within the range of the six measurements taken at Vernalis throughout the seasons (Table 3), while carbon-normalized yields were slightly lower and S:V slightly higher. Given the significant influence that local landscape features can exert on lignin concentrations and compositions [Eckard et al., 2007], these predicted values are entirely reasonable. However, the accuracy of the predictions in this example is not as important as the relative trends that are revealed in Figure 5, since it is the trends that are indicative of short-term processing and not the absolute values of the lignin parameters.
Table 3. Comparison of Predicted Lignin Parameters on the San Joaquin River at Crow's Landing to Measured Values at Vernalis^a
Column headings: Σ8 (μg L−1); Λ8 (mg 100 mg OC−1)
Sampling dates: 9 Dec 1999; 19 Mar 2000; 16 Jul 2000; 15 Oct 2000; 4 Feb 2001; 20 May 2001; 28 Jul 2005 to 30 Jul 2005
^a Vernalis data from Eckard et al. . Abbreviations: Σ8, sum of concentrations of eight lignin phenols; Λ8, carbon-normalized yield of eight lignin phenols; S:V, ratio of syringyl phenols to vanillyl phenols; C:V, ratio of cinnamyl phenols to vanillyl phenols.
 Apparent diurnal trends in lignin concentration and carbon-normalized yields represent an intriguing finding as to the nature of DOM cycling in riverine systems. Unchanging S:V and C:V values indicate a similar source for dissolved lignin throughout the 48 h sampling period. Total flow at Crow's Landing over the course of the 48 h changed by less than 1%, precluding any hydrologic pulses of lignin-rich waters. Other factors that could come into play include diurnal variability in amounts of lignin released from the riparian zone, photobleaching of lignin, or diurnal cycling of other compounds that fluoresce in the predictive regions highlighted in Figure 6. Clearly, the diurnal trends predicted for lignin at Crow's Landing must be reproduced with properly calibrated models or discrete lignin measurements before they can be accepted as real. Ultimately, however, the power of this example is that even without prior calibration of a system with actual lignin data, we can use our generated models as a screening tool in any system that has similar EEM data to point us toward processes that merit further research with discrete biomarker analyses: in essence, a fluorescence-guidance system for our molecular toolkit.
 Continued progress in DOM cycling research will require both the specificity of biomarker measurements and the high spatial and temporal resolution capability that spectrophotometric measurements offer. The vast richness of information contained at the molecular level in DOM remains a virtually untapped resource, largely because of the difficulty and expense of making enough measurements to adequately address issues of temporal and spatial scaling. Coupling fluorescence and biomarker analytical approaches with PLS or similar models provides a powerful new tool that significantly expands the scope of what is possible for future studies. Emerging in situ capabilities for fluorescence point toward real-time monitoring of biomarkers such as lignin in the near future, while the more general development of biomarker proxies using optical measurements ultimately could lead to remote sensing capabilities: precisely the kind of significant advances needed to move the field of DOM research forward.
 We would like to gratefully acknowledge the California Bay Delta Authority Ecosystem Program and Drinking Water Program for their support (grant B-17). We thank Bryan Downing for assistance with EEM processing and PARAFAC analyses. We would also like to thank the Aqueous Organic Geochemistry group at UC Davis, the R. Benner research group at University of South Carolina, the AE, and an anonymous reviewer for helpful comments on this manuscript.