Discovering variation of secondary metabolite diversity and its relationship with disease resistance in Cornus florida L.

Abstract Understanding intraspecific relationships between genetic and functional diversity is a major goal in the field of evolutionary biology and is important for conserving biodiversity. Linking intraspecific molecular patterns of plants to ecological pressures and trait variation remains difficult due to environment‐driven plasticity. Next‐generation sequencing, untargeted liquid chromatography–mass spectrometry (LC‐MS) profiling, and interdisciplinary approaches integrating population genomics, metabolomics, and community ecology permit novel strategies to tackle this problem. We analyzed six natural populations of the disease‐threatened Cornus florida L. from distinct ecological regions using genotype‐by‐sequencing markers and LC‐MS‐based untargeted metabolite profiling. We tested the hypothesis that higher genetic diversity in C. florida yielded higher chemical diversity and less disease susceptibility (screening hypothesis), and we also determined whether genetically similar subpopulations were similar in chemical composition. Most importantly, we identified metabolites that were associated with candidate loci or were predictive biomarkers of healthy or diseased plants after controlling for environment. Subpopulation clustering patterns based on genetic or chemical distances were largely congruent. While differences in genetic diversity were small among subpopulations, we did observe notable similarities in patterns between subpopulation averages of rarefied‐allelic and chemical richness. More specifically, we found that the most abundant compound of a correlated group of putative terpenoid glycosides and derivatives was correlated with tree health when considering chemodiversity. Random forest biomarker and genomewide association tests suggested that this putative iridoid glucoside and other closely associated chemical features were correlated to SNPs under selection.


| INTRODUC TI ON
Plant secondary metabolites are closely tied to ecological functions and greatly affect community interactions (Dixon & Paiva, 1995;Moore, Andrew, Külheim, & Foley, 2014). Certain secondary metabolites may provide plants with specialized functions like deterrence to herbivory or infection (Harborne & Turner, 1984). Identifying the genetic basis of secondary compounds for such functions is of interest to the field of evolutionary ecology. Even small changes in genetic diversity may yield exceptionally large changes in secondary metabolism-producing novel molecules with often unknown biological activity (Firn & Jones, 2000). While specific compound classes such as iridoids, phenolics, and tannins are often the basis of study for ecological function in plants (Sardans, Peñuelas, & Rivas-Ubach, 2011), secondary metabolite diversity as a trait, or chemotype, represents a special dimension of biodiversity important to natural and managed ecosystems (Bustos-Segura et al., 2017). In contrast to the research presented in this article, few studies have evaluated broader relationships between chemical diversity and genetic diversity within species (focusing on the diversity of chemical compound composition among individuals in a population) while examining a select group of metabolites diagnostic of health versus disease (otherwise known as biomarkers) and associated with SNPs under selection. Recent innovations in next-generation sequencing coupled with untargeted chemical profiling provide unique opportunities to examine these relationships in plant systems (Eckert et al., 2012;Gomez-Casati, Zanor, & Busi, 2013;Raguso et al., 2015;Riedelsheimer et al., 2012). Integrating next-generation sequencing technologies with population genomics and community ecology permits identification of chemical compounds and associated SNPs related to disease resistance or other ecologically functional traits.
Secondary metabolite richness and relative abundance of chemicals within individuals-a chemotype referred to as chemodiversityare informative yet understudied metabolome properties helpful for understanding evolutionary and ecological processes (Hilker, 2014;Kellerman, Dittmar, Kothawala, & Tranvik, 2014). Promising work has investigated broader patterns of natural metabolome variation in the context of natural genetic variation, but most analyses of variation in chemical diversity focus on distance-based measures versus explicit measurement of chemical richness. For example, significant correlations between metabolic and genetic distances were detected in nine Arabidopsis thaliana accessions exposed to different environments (Houshyani et al., 2012). In a second example, multigenerational lines inbred from different Drosophila melanogaster populations were found to remain distinguishable in general lipid composition, and approximately one-fifth of the lipid compounds had clear concentration differences between male and female genotypes (Scheitz, Guo, Early, Harshman, & Clark, 2013). More recent studies of environmental and bud-leaf metabolome analyses of Pinus pinaster (ten European provenances in common garden) revealed two groups of individuals corresponding to spatially distinct regions (Meijón et al., 2016). All these studies used distance measures based on a reduced dimensionality of abundance differences for targeted compounds instead of explicitly calculating and comparing diversity indices (Appendix S1), which account for both chemical compounds' presence-absence and relative abundances within each sample.
Studies employing diversity indices of broad chemical profiling are rare (Hilker, 2014), possibly due to previous aversion to adapting such indices outside of community ecology (Hurlbert, 1971).
However, initial trepidations regarding usage of these indices are now being addressed with cautious interpretation of chemical diversity indices (Morris et al., 2014). Additional studies that integrate advancements in untargeted metabolomics (Alonso, Marsal, & Julià, 2015;Yi et al., 2016) and population-landscape genomics (Anderson, Willis, & Mitchell-Olds, 2011;Sork et al., 2013) with adoption of these chemical diversity indices would further demonstrate the power of this correlative approach to illustrate how genetics (i.e., locally adapted genes) and plant functional diversity (i.e., chemodiversity) influence plant health, after controlling for environment analytically.
We use a multidisciplinary approach to characterize and evaluate how properties of genetic diversity and chemodiversity contribute to susceptibility or resistance to disease in Cornus florida (L.), the flowering dogwood tree. In addition, we have applied exploratory analyses to winnow an untargeted list of metabolites down to a select group of potential antimicrobial compounds-closely resembling compounds previously observed in dogwoods (He, Peng, Hamann, & West, 2014;Stermitz & Krull, 1998;Yue et al., 2006). The species itself occurs naturally throughout much of eastern North America and is ecologically important partly because of calcium it delivers to food chains in deciduous forests (Baird, 1980;Blair, 1982;Borer, Sapp, & Hutchinson, 2013;Holzmueller, Jose, Jenkins, Camp, & Long, 2006;Linzey & Brecht, 2003;Lovenshimer & Frick-Ruppert, 2013). In addition, the plant is a cultural icon, serves as the emblem of three southern US states (Jordan, 2010), and is valued in the horticulture industry at 30 million dollars in annual sales (NASS USDA, 2007 Census of Agriculture). In the past three decades, C. florida populations have experienced major declines in health due to the introduction of a fungal pathogen (Discula destructiva) to North America (Miller, Masuya, Zhang, Walsh, & Zhang, 2016), the causal agent of dogwood anthracnose (Redlin, 1991). Northern and mountain populations have been hardest affected with up to 98% mortality occurring in monitored stands (Hiers & Evans, 1997;Jenkins & White, 2002;McEwan, Muller, Arthur, & Housman, 2000;Rossell, Rossell, Hining, & Anderson, 2001;Sherald, Stidham, Hadidian, & Hoeldtke, 1996;Williams & Moriarity, 1999). As dogwood anthracnose disease progresses southward along the Appalachian Mountains, populations of C. florida continue to decline (Jones, Smith, & Twardus, 2012). Whether or not the range of dogwood anthracnose ( Figure 1) will expand to the southeast overtime is uncertain. Understanding the adaptive mechanisms in C. florida that may possibly limit the spread of disease will be important in the conservation of the species. Iridoid glycosides in particular are highly abundant in Cornelian taxa and have been noted to play roles in plant defense and disease resistance (Stermitz & Krull, 1998;Yue et al., 2006) in addition to various phenolic and tannin compounds (Dudt & Shure, 1993). In this work, we integrated evidence from multivariate analyses of flowering dogwood tree metabolomes, reduced genome sequences from genotype by sequencing (Pais, Whetten, & Xiang, 2016;Peterson, Weber, Kay, Fisher, & Hoekstra, 2012), and environmental data (model controls) to address the following questions: (1) What is the relationship between genetic diversity and chemodiversity? (2) Is there evidence from candidate SNPs and metabolites for local adaptation, and are there particular chemical biomarkers such as iridoid glycosides associated with either diseased or healthy plants? (3) Likewise, do healthier plants exhibit greater chemodiversity?

| Plant material
Natural populations of C. florida were sampled from the mountain, Piedmont, and Coastal Plain regions of North Carolina during the summer of 2012 (Pais et al., 2016). Each region contained two populations, and each population consisted of one or two collection sites, or subpopulations, of 30-15 individual trees, respectively, for a total of 180 mature and unique trees. Leaves were dried and stored in silica gel under standardized conditions. All samples were given an extended period to dry before extracting thermally stable metabolites from leaves.

| Metabolite extraction
Metabolites were extracted following a modified protocol of Strauch, Svedin, Dilkes, Chapple, and Li (2015). For each tree, extraction material (20 mg) was selected from all sampled leaf tissue visibly free of mold or necrosis to minimize the chances of sampling metabolites that were unique to fungi, altered or degraded during collection.
During the drying period, eight samples developed mold growth, which prohibited their use for metabolic study-leaving 172 remaining samples. Leaf tissue was ground with liquid nitrogen in a Retsch MM 400 oscillating mill for one min at 25 Hz. After grinding, two F I G U R E 1 Subset of collections from broader study of C. florida (top) applied to metabolic study of chemical diversity in North Carolinian populations (bottom). Red counties have known incidence of dogwood anthracnose disease. For subset of populations sampled in this study, differences in mean monthly rainfall, length of growing period, soil type, and county occurrence of dogwood anthracnose are visualized to demonstrate the heterogeneity in environment that exists among the mountain, Piedmont, and Coastal Plain ecoregions of North Carolina ml of 50% methanol was immediately applied to samples. Samples were then incubated in a water bath for 30 min at 60°C and allowed to cool one hour at 4°C to minimize potential precipitate (soluble in warm solution) from being transferred to final vials. After centrifuging for one min, remaining supernatant was transferred to a new vial via a filtered syringe tube.

| Untargeted metabolite profiling
Untargeted metabolite profiling was performed on a G6530A Q-TOF LC-MS system (Agilent Technologies, Santa Clara, CA). Five microliters of leaf extract was injected onto an Agilent ZORBAX Eclipse Plus C18 column (3 × 100 mm, 1.8 μm). Metabolites were separated using a binary gradient of solvent A (0.1% formic acid in water) and solvent B (0.1% formic acid in acetonitrile) at a flow rate of 0.6/ml/min. The elution gradient started with a one-min hold at 2% B, followed by ramping up to 45% B over 16 min, and then was increased to 90% B in one min and held at 90% B for 2.5 min. The acquisition of mass spectra was performed in negative mode for a m/z range from 100 to 1,600, with the following parameters: drying gas temperature, 300°C; drying gas flow rate, 7.0 L/min; nebulizer pressure, 40 psi; sheath gas temperature, 350°C; sheath gas flow rate, 10.0 L/min; Vcap, 3,500 V; nozzle voltage, 500 V; fragmentor, 150 V; skimmer, 65 V; OctopoleRFPeak, 750 V.

| LC-MS data processing
Raw data files obtained from LC-MS experiments were converted to the mzData format using Agilent Masshunter software, grouped into directories by population, and then uploaded to the XCMS Online platform (Tautenhahn, Patti, Rinehart, & Siuzdak, 2012) for automatic metabolite detection and alignment. Metabolite features-peaks defined by mass-to-charge ratio (m/z), retention time (RT), and intensity-were extracted with optimized parameters: centWave method, minimum-maximum peak width = 8 and 30, signal-to-noise threshold = 30, mzdiff = 0.01, prefilter peaks = 3, prefilter intensity = 2,000, and noise filter = 0. As retention time variation between runs was minimal, the peaks were aligned across all the samples without RT correction, using the following parameters: bw = 5, mzwid = 0.025, and minfrac = 0.75. A list of 2,785 aligned peaks/features from 172 individuals was then exported from XCMS Online as a tab-separated file. A preliminary PCA using autoscaled distances of individual peak areas and a distance to model (DModX) test (implemented in XCMS) detected individual UM19 as a significant outlier sample (possibly due to extraction error) to be removed. As one metabolite may give rise to multiple peaks including isotope, adduct, or fragment peaks, the 2,785 peaks were further grouped by peak intensity correlation and RT similarity using built-in procedures from the XCMS pipeline (Tautenhahn et al., 2012). This resulted in 377 metabolites, which were represented by the largest peak within each group.
Metabolite annotation was performed by searching the exact mass of detected metabolite features against the Knapsack and Metcyc databases (Caspi et al., 2014;Shinbo et al., 2006) using a ten ppm threshold.

| Chemodiversity index calculation
Calculation of the richness diversity index (S; Whittaker, 1972) is primarily described in this study as its results have straightforward and biologically meaningful interpretations, which can be easily reconceptualized from the study of species diversity to the study of chemodiversity. For clarity, we first define richness in the context of a typical community ecology study before transferring the analogy to studying intraspecific chemodiversity. In a hypothetical field divided into multiple plots, species richness is the number of unique species in the field or each plot (Whittaker, 1972). Alpha (α) richness refers to the average richness of plots while the total number of unique species in the whole field is gamma (γ) richness. When appropriating these indices for studying chemodiversity, we treat each "species" as a metabolite and the "plot within a field" is represented by an individual plant's chromatogram, which represents the sum of all metabolite intensity peaks. More simply, richness is the metabolite count in an individual sample, and when richness is averaged among each tree in a subpopulation, subpopulations can be statistically compared by α-richness (Whittaker, 1972). In contrast, γ-richness represents total number of unique metabolites within a given subpopulation. As exploratory findings showed that γ-richness was equal among subpopulations, we hereinafter refer to α-richness when reporting chemical richness.

| Genetic marker data
We used genotype by sequencing (GBS; Peterson et al., 2012) of two Illumina Hiseq libraries, de novo assembly into 90-bp GBS tags with STACKS (Catchen, Amores, Hohenlohe, Cresko, & Postlethwait, 2011), latent factor mixed modeling [a genotype-environment association (GEA) method; Frichot, Schoville, Bouchard, & François, 2013], and two F ST outlier methods (Excoffier, Hofer, & Foll, 2009;Foll & Gaggiotti, 2008) to classify putatively neutral SNPS and SNPs exhibiting varying support for being under selection (Pais et al., 2016). Putatively neutral reference SNPs were used to calculate marker-based inbreeding coefficients (F; Keller, Visscher, & Goddard, 2011) and identity-by-state matrices using PLINK (Purcell et al., 2007). We added an inbreeding coefficient (F) variable into logistic models characterizing plant health and disease (see section 2.9) because we recognized the need to account for greater heterozygosity (fewer homozygous loci than expected) within an individual, which could affect plant health (Ouborg, Biere, & Mudde, 2000) by yielding more unique metabolites and raising plant potential to respond to novel pathogen effectors (screening hypothesis; Jones, Firn, & Malcolm, 1991). GBS samples were also reanalyzed with aid of a newly available C. florida draft genome (Dogwood Genome Project (NSF ID: 1444567), and the draft genome was used as an additional resource to predict candidate gene function by inspecting BLAST hit annotations surrounding SNPs of interest.

| Environmental-functional traits
We correlated climate-soil variables (obtained through collection site measurements and GIS extrapolation; Pais et al., 2016) and temperature-precipitation estimates at time of collection (dailymonthly; PRISM Climate Group; extracted 30 January 2015) with chemical and genetic data (Table 1). Similarly, plant health scores were plotted against chemodiversity levels. Visual health-diseased estimates were taken using the procedure of Mielke and Langdon (1986) based on the percent of tree canopy affected by leaf blotting, necrosis, or branch dieback. Additionally, five categorical scores obtained from this method were converted into a binary variable.
Plants with scores of four and five were considered healthy while plants with scores of three and below were considered diseased.
This recoded binary variable served as the response for mixed model logistic regressions. For further description how environmental variables were selected for multivariate modeling of chemodiversity levels, see Appendix S1 and additional justification of mixed logistic models as described further in methods.

| Characterizing general relationships of chemical structure and diversity to plant health and genetic diversity
We determined the general structure among sampled populations and the diversity of metabolites from multilocus genotype data. We first used discriminant analyses of principal components (DAPC) to identify collection sites that clustered together, according to SNP or metabolite abundance data. Using the R package adegenet (Jombart, 2008;Jombart, Devillard, & Balloux, 2010), we performed discriminant analysis (DA) on the optimal number of principal components (PC) to maximize among-population variation and minimize within-population variation. We estimated and analyzed PC scores separately from two different datasets (both scaled): reference SNPs aligning to the C. florida draft genome and abundance data (log-transformed) for 377 metabolites. We conducted DAPC both by defining groups by collection sites and by allowing the program clustering algorithm to find optimal cluster number (K) without priors. Discriminant analyses of principal components does not require assumptions on a population genetic model (e.g., linkage equilibrium of markers) in contrast to programs like STRUCTURE (Pritchard, Stephens, & Donnelly, 2000) so DAPC has been widely adopted in recent population genetic studies (Buchalski et al., 2016;Cahill & Levinton, 2016;Grünwald & Goss, 2011). Its use for metabolic study is recent, but its efficacy in discriminating different biologically meaningful chemotype classes has been demonstrated and favored over other discriminant analysis methods under certain circumstances (Gromski et al., 2015;Scheitz et al., 2013).
Next, we estimated average genetic diversity and chemodiversity of each subpopulation. Rarefied-allelic richness was calculated per site using the hierfstat R package (Goudet, 2005)   We next tested for correlations between environmental gradients, genetic differentiation, and chemical distance using Mantel tests and linear regression. Full and partial Mantel tests (Legendre & Fortin, 1989) were implemented in the R package ecodist (Goslee & Urban, 2007) with 9,999 permutations and 500 bootstraps to determine the strength and significance of association between population-level metabolic distance, genetic distance, and mean Euclidean distances of spatial and environmental variables [i.e., displacement of collection sites, precipitation of driest month (Bio14), and temperature at day of collection (Tcol); see Environmental-functional traits continued in Appendix S1 for justification of environmental variables tested]. For these correlation analyses, we employed Arlequin (Excoffier & Lischer, 2010)  The significance of simple linear regressions between individualspecific chemodiversity levels, inbreeding coefficients, and all available environmental predictors was also assessed. In addition, we determined the best multivariate models describing general chemodiversity as the response (Appendix S1). For these initial regression models, we used chemodiversity indices based on all 2,785 chemical features as this allowed us to more reliably detect general differences in chemical richness among samples. However, we caution that inclusion of correlated chemical features may bias the calculation of chemodiversity indices, and as such, we delegate reporting and discussion of such results in Appendix S1.

| Biomarker analyses
For identifying biomarkers associated with healthy versus diseased trees, we primarily used random forests (RF) tests and a logistic mixed model predicting disease states based on abundance data of each metabolite. Random forests tests were previously compared to partial least squares discriminant analysis (PLS-DA), principal component discriminant analysis (DAPC), and support vector machines (SVM) for their ability to correctly assign samples to biologically based classes using metabolic data (Gromski et al., 2015). Logistic mixed models predicting healthy-diseased states of plants were also considered given the ability to statistically evaluate a Bonferroni correction and analytically control for inbreeding coefficient, temperature at collection, and random effects of collection site. More details on parameters for RF tests, justification of variable selection for logistic mixed modeling, and other biomarker tests compared in exploratory analyses (PLS-DA, DAPC, and SVM) are available in Appendix S1 (Biomarker analyses continued).

| Predicting metabolite-SNP networks
To understand patterns between chemical data and individual loci while controlling for sample structure and environmental variability, we employed a linear mixed model implemented in EMMAX (Kang et al., 2010). This model has been used in genomewide association Chemical-genotype associations were calculated in EMMAX with: (1) no covariates present; (2) the Bio14 covariate present; (3) the Tcol covariate present; or (4) both covariates present (see Environmental-functional traits continued in Appendix S1 for justification of environmental controls specified). p-Value distributions from output files were plotted using R package Haplin (Wilcox, Weinberg, & Lie, 1998) to assess Q-Q plots for each metabolite.
Results were considered significant for genotype-metabolite associations passing a Bonferroni correction with an alpha value of 5%.
Only results from normally distributed Q-Q plots were considered.
To explore the relationship between metabolites, we applied Gaussian graphical modeling (GGM) to the 377 metabolite dataset.
Gaussian graphical modeling utilizes partial full-order correlation coefficients to test for correlation between two metabolites while removing other metabolite effects. Justifications and additional considerations of GGM are further discussed in Appendix S1 (Gaussian graphical modeling continued).

| Modeling health versus disease: Logistic mixed modeling with chemodiversity of a specific set of biomarkers
To assess whether chemodiversity of a putative terpenoid derivative network was related to the odds of being healthy versus diseased, we employed logistic mixed modeling (previously applied to single metabolites; see section 2.9) with chemical richness (or H′, D1, D2, E, and BP indices; see Chemodiversity indices continued in Appendix S1) specified as a fixed effect (recalculated from metabolites in the GGM-associated group of iridoid glucosides; see Candidate metabolite-SNP network). In other words, we recalculated chemodiversity indices from a set of related iridoid derivatives and substituted the predictor representing a given metabolite abundance in our aforementioned mixed logistic model (see section 2.9) for a given chemodiversity index. Our justification for recalculating chemodiversity from metabolites in presumably related biological pathways was to compare the evenness or dominance in the accumulation of the specific set of metabolites between diseased and healthy plants. We also recalculated and examined chemodiversity among ten metabolites with common results among GWA and biomarker tests in exploratory analyses (Appendix S1; Table S5). Lastly, we tested interaction terms between chemodiversity and the other effects found to influence plant health (i.e., inbreeding coefficient, temperature at collection, and random effects of collection sites).
Upon adding interaction terms for temperature at collection, we consistently found the interaction effects to be insignificant. The same applied when testing interactions to inbreeding coefficient.
Thus, we removed interaction terms from our models.

| General patterns of chemical-genetic structure and diversity
As shown in Pais et al. (2016) (Figure 2f,g; see Figure S2 for subpopulation means of other chemodiversity indices). Levels of rarefied-allelic richness and chemical richness were similar, and significant correlations between subpopulation means were observed ( Figure S2).
When grouping individual trees into healthy or diseased categories based on a one to five scoring system (Mielke & Langdon, 1986), the relationship of disease status to chemodiversity indices derived from 377 metabolites varied depending on the index (Appendix S1). In particular, chemical richness of the 377 metabolites highly overlapped among the five disease-state groups  (Figure 4d). These results confirmed high sensitivity of metabolic data to environment (i.e., temperature), which was also supported by Mantel test results (Table 2) and regression model results of metabolic and environmental data (Tables S1 and   S2, and Appendix S1). The remainder of reported results focus on metabolite associations with SNPs while controlling for the most important environmental factors influencing the general metabolic profile of samples as described in Appendix S1.

| Candidate metabolite-SNP associations
To understand the connections between genetic polymorphism and metabolite variation, we performed GWA analyses on all available SNPs and metabolites. As each SNP-metabolite association analysis was an independent test not biased by correlations among chemical features, each of the 2,785 chemical features of 377 metabolites was included.
With and without the most important environmental covariates controlled for in GWA models, we identified 975 unique chemical features significantly associated with 347 unique SNPs. Overlapping chemical feature and SNP results among the various GWA tests (different covariates specified) are presented in Figure 5 and summarized here ac- We performed GGM to formulate hypotheses of metabolitemetabolite associations based on partial correlation. This analysis, combined with the GWA results, revealed SNP-metabolite connections for a putative group of terpenoid glucosides (Figure 6a).
The majority of these biomarker metabolites accumulate more in healthy plants compared to diseased ones (Figures 8, S6, and S8).
It is notable that there were few overlaps between the biomarkers detected by different methods ( Figure S9a). Logistic mixed modeling of individual metabolites showed that six metabolites were significantly correlated with the log odds of being healthy versus diseased (Table 3) after controlling for temperature at collection, inbreeding coefficient, and collection site random effects. Of the 39 biomarker metabolites, five had significant associations with SNPs as revealed from GWA models controlling for environment ( Figure S9a). The hypothesized SNP and GGM associations for these biomarkers are reported in Figure S10 along with results of greater focus concerning the hypothesized group of iridoid glycosides ( Figure 6).  ued in Appendix S1) was significantly correlated to the log odds of a plant being healthy versus diseased (Figure 7). The positive effect of the BP index was driven primarily by increasing abundances of M435T576, which was the most abundant metabolite within most samples (among other metabolites composing GGM network; Figure 6). Moreover, it reflected a greater unevenness of chemical expression for this metabolite in healthy plants relative to diseased plants (Figure 7). In other words, M435T576 was considered the most predictive biomarker among the two other associated iridoid glycosides (M227T630 and M451T432) for distinguishing plant health and disease.
Ten other credible biomarkers not correlated exclusively with the hypothesized group of iridoid glycosides were shown to be strongly associated with plant disease status ( Figures S8 and S9). Accumulations of these ten metabolites (including M435T576) were highly predictive of plant health and disease status, according to multiple biomarker test results ( Figure S9). Chemodiversity patterns derived from these ten biomarkers also showed that the dominance of M435T576 (in relation to the relatively even expression of other biomarkers) was associated with log odds of a plant being healthy versus diseased (Table S5; Appendix S1; Other candidate biomarkers continued).

| Relationships of genetic diversity and chemodiversity
Our analysis showed largely concordant chemical-genetic distance clustering patterns (Figure 4). For instance, mountain subpopulations TA B L E 2 Mantel tests of correlations between subpopulationlevel metabolic, genetic, and environmental distances. Chemical distance matrix obtained through Analysis of Similarities (ANOSIM) using 377 metabolites. Genetic distance matrix consists of linearized F ST values calculated from 1,171 reference (putatively neutral) SNPs. Site-level means of temperature at collection (Tcol) and precipitation of driest month (Bio14) used for matrices of Euclidean distances, and geographic distance among sites calculated from a X, Y coordinate system. Vertical bar denotes partial Mantel's test controlling for third matrix right of "|"    (Glassmire et al., 2016;Richards et al., 2015). In the context of this study, it is important to note the similarity between clustering patterns derived from genetic and metabolite data in addition to noting evidence that some diseased mountain subpopulations (Figure 1) have lower genetic and chemical diversity.
Furthermore, relatively high chemodiversity estimates among 377 metabolites are observed in healthier plants ( Figure 3). As healthy plants occur in all environments from the Coastal Plains to mountains, this finding suggests that dogwood anthracnose may not necessarily be constrained by only abiotic factors (e.g., cooler and moister habitats) but instead may also be affected by both genetic and metabolite diversities. Genetic and metabolite diversity may be important to disease variation. High levels of metabolic and genetic diversity intrinsic to the host may benefit individual trees in staving off disease infection (Jones et al., 1991). Alternatively, low genetic diversity in mountain populations can also be a consequence of dogwood anthracnose disease effects (Hadziabdic et al., 2012). Although the co-occurrence of these patterns in C. florida presents challenges to distinguish relative roles of abiotic, genetic, and chemical factors, available evidence supports an influence of genetics on disease as elaborated below.
Previous ecological genomic analysis using GBS data (Pais et al., 2016) has identified SNPs under selection for local adaption in the species, and a few of these SNPs are associated with biomarker metabolites (predictors of plant health) after accounting for environmental covariates ( Figure 6). In other words, our sampled sub-  (Caplan, Padmanabhan, & Dinesh-Kumar, 2008). For instance, M227T630 was identified as Cornolactone C, an iridoid isolated previously from C. florida (He et al., 2014), which belongs to a compound class known to accumulate in response to infection and has documented antimicrobial properties (Marak, Biere, & Van Damme, 2002).
These predicted chemical compounds serve as an important guide for prioritizing which SNPs and biomarkers to further test in future research. The abundance and inducibility of certain secondary metabolites such as flavonoids, other phenolics, and glycoside derivatives have been found to be heritable and mediated by herbivore-pathogen pressures (Johnson, Agrawal, Maron, & Salminen, 2009;Li et al., 2014). Determining the specific genetic factors regulating such adaptive metabolites remains an important goal, and this study adds to emerging efforts to integrate large secondary metabolite concentration data with information from genome scans (Eckert et al., 2012;Jensen, Foll, & Bernatchez, 2016;Talbot et al., 2016).

| Herbivore/pathogen interactions influencing chemodiversity in plants
Explanations for natural variation of secondary metabolism have long been debated (Fraenkel, 1959). Intraspecific variation of plant chemodiversity can be attributed to differences in the environment (e.g., variation in surrounding plant, herbivore, fungal, or microbial communities as well as abiotic heterogeneity; Tahvanainen & Root, 1972;Root, 1973;Barbosa et al., 2009;Sardans et al., 2011;Rivas-Ubach et al., 2014) or to genetic variation within the species. Proponents of the latter explanation cite observations used to argue for the role of chemodiversity in ecological function and heritability; namely, these authors note that congruent patterns of genetic diversity and chemodiversity are often inversely related to herbivory or infection levels (reviewed in Moore et al., 2014;Raguso et al., 2015). As an example, one study of Smallanthus macroscyphus (Asteraceae) reported less sesquiterpene lactone diversity (herbivory deterrent) in populations further away from the equator, which was explained as a result of selection for lower toxicity due to fewer herbivore-plant interactions in areas far from the equator (Aráoz, Mercado, Grau, & Catalán, 2016;Salazar & Marquis, 2012). This variation of sesquiterpene lactone diversity in the species may well have a genetic basis, which has not been investigated. Higher genetic diversity in certain individuals (e.g., less homozygous genotypes; Figure S7) or subpopulations (e.g., higher rarefied-allelic richness; Figure 2d) can confer a greater range of gene products (e.g., secondary compound precursors) and increase host ability to respond more readily to any general infection (Firn & Jones, 2000).
On the other hand, herbivory and infection on plants can also induce greater secondary compound diversity or induce dominance F I G U R E 8 Top random forest (RF) results of biomarkers indicative of healthy or diseased plants. Metabolites arranged top to bottom based on ranked importance, and labeled metabolites right of solid black line on plot were considered for biomarker selection. Panel right of support index panel for biomarker tests indicates relative expression of each compound in healthy versus diseased plants with black shades representing higher expression and white shades representing lower expression

Random Forest
Selected frequency (Support index) of certain compounds (Mithöfer & Boland, 2012;Thoss & Byers, 2006). While our study could not discriminate between constitutive and induced chemical diversity, our analysis of chemodiversity derived from a specific set of biomarkers did show that healthy plants tended to have greater unevenness of chemical expression than diseased plants (Figure 7). The unevenness seemed largely attributed to variation in expression of certain biomarkers (i.e., M435T576).
In C. florida, candidate SNPs like B982_75 and B440_76 as well as biomarker M435T576 in the iridoid glucoside network ( Figure 6) may represent examples of candidate genes governing variation in accumulation and degree of inducible expression for certain defense compounds.

| CON CLUS IONS
Our study demonstrates untargeted metabolite profiling is a useful approach for understanding biodiversity in a new dimension.
Secondary metabolites preserved in dried leaves of C. florida from natural populations provided data for evaluating chemodiversity and identifying potential disease biomarkers. We found congruent patterns of chemical and genetic variation and identified several biomarkers indicative of disease and health after accounting for the effects of environment. From those results, a select group of candidate SNPs and metabolites (i.e., iridoid glucosides) of clear ecological importance was identified to guide future study. Additional investigation of chemical diversity with increased sampling across the species range may provide more details on the relationship among genetics, metabolites, and dogwood anthracnose in C. florida, which in turn may shed light on forest diseases in general.

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R CO NTR I B UTI O N S
With the support of Qiuyun Xiang and Xu Li (3) Appendix S1 containing summary of all significant GWA results with a Bonferroni correction of 0.05 (Table S3) and list of compounds found in Knapsack and Metcyc databases that are similar in mass (delta ppm < 10) to notable chemical features reported in this manuscript (Table S4).