Genome‐wide association and genomic prediction for biomass yield in a genetically diverse Miscanthus sinensis germplasm panel phenotyped at five locations in Asia and North America

To improve the efficiency of breeding of Miscanthus for biomass yield, there is a need to develop genomics‐assisted selection for this long‐lived perennial crop by relating genotype to phenotype and breeding value across a broad range of environments. We present the first genome‐wide association (GWA) and genomic prediction study of Miscanthus that utilizes multilocation phenotypic data. A panel of 568 Miscanthus sinensis accessions was genotyped with 46,177 single nucleotide polymorphisms (SNPs) and evaluated at one subtropical and five temperate locations over 3 years for biomass yield and 14 yield‐component traits. GWA and genomic prediction were performed separately for different years of data in order to assess reproducibility. The analyses were also performed for individual field trial locations, as well as combined phenotypic data across groups of locations. GWA analyses identified 27 significant SNPs for yield, and a total of 504 associations across 298 unique SNPs across all traits, sites, and years. For yield, the greatest number of significant SNPs was identified by combining phenotypic data across all six locations. For some of the other yield‐component traits, greater numbers of significant SNPs were obtained from single site data, although the number of significant SNPs varied greatly from site to site. Candidate genes were identified. Accounting for population structure, genomic prediction accuracies for biomass yield ranged from 0.31 to 0.35 across five northern sites and from 0.13 to 0.18 for the subtropical location, depending on the estimation method. Genomic prediction accuracies of all traits were similar for single‐location and multilocation data, suggesting that genomic selection will be useful for breeding broadly adapted M. sinensis as well as M. sinensis optimized for specific climates. All of our data, including DNA sequences flanking each SNP, are publicly available. By facilitating genomic selection in M. sinensis and Miscanthus × giganteus, our results will accelerate the breeding of these species for biomass in diverse environments.


| INTRODUCTION
Miscanthus is a promising crop for lignocellulosic bioenergy and bioproducts, but it is in the very early stages of domestication and breeding (Clifton-Brown, Chiang, & Hodkinson, 2008;Clifton-Brown et al., 2019;Dwiyanti, Stewart, & Yamada, 2013;Sacks, Juvik, Lin, Stewart, & Yamada, 2013). A single sterile triploid clone of Miscanthus × giganteus, a hybrid between Miscanthus sacchariflorus and Miscanthus sinensis, has been adopted for commercial biomass production in Europe and North America (Głowacka et al., 2015;Heaton, Clifton-Brown, Voigt, Jones, & Long, 2004). However, new biomass cultivars of Miscanthus are needed to limit the risk of disease and pest outbreaks associated with growing a monoculture of a single clone. Additionally, the standard clone of M. × giganteus is insufficiently winter-hardy in parts of northern Europe and northern parts of the US Midwest (Clifton-Brown & Lewandowski, 2000;Dong et al., 2019), and it flowers too early to yield optimally at lower latitudes (~30°) such as the coastal plain of the southern United States (E. J. Sacks, unpublished data). Both M. sinensis and M. sacchariflorus are native to a broad range of environments across East Asia and should provide a wealth of breeding material for developing new biomass cultivars (Clifton-Brown et al., 2008). Such breeding can be accelerated by genomic selection, especially because Miscanthus is a long-lived perennial that requires 3 years of field testing to obtain high-quality yield data. Additionally, candidate genes and associated single nucleotide polymorphisms (SNPs) identified by genome-wide association (GWA) can be used to enhance genomic prediction (Bian & Holland, 2017;Rice & Lipka, 2019;Spindel et al., 2016), and identify targets for genome editing (Scheben & Edwards, 2018). By obtaining knowledge about associations between genotype and biomass yield and 14 yield-component traits, for a diverse germplasm panel of M. sinensis evaluated at one subtropical and five temperate locations, we aim to build a strong foundation for genomic-enabled breeding of this new crop.
Previous GWA and genomic prediction studies of Miscanthus have used phenotypic data collected from single locations (Nie et al., 2016;Slavov et al., 2014;Zhao et al., 2013). Slavov et al. (2014) conducted a GWA study and genomic prediction on 138 genotypes of M. sinensis from the Korean Peninsula and Southern Japan (S Japan), using more than 100,000 SNP markers. Slavov et al. (2014) phenotyped the M. sinensis accessions for yield traits near Aberystwyth, United Kingdom, and found significant SNP-trait associations for stem and leaf length, stem angle, senescence, and lignin content, but none for overall dry matter yield. Dry matter yield in the Slavov et al. (2014) study and a followup study by Davey et al. (2017) had moderate broad-sense heritability but low genomic predictive ability (0.04-0.06), which they hypothesized could be improved considerably if the study included a larger number of individuals. It was also demonstrated that genomic selection could be more successful if a selection index were used in place of direct selection for yield (Davey et al., 2017;Slavov et al., 2019). Zhao et al. (2013) studied 300 M. sinensis genotypes collected from a broad geographic range across China in a field trial at Wuhan, China, using 115 alleles from 23 microsatellite markers. They identified nine significant associations with heading date, plant height, and yield, although a relatedness matrix was not included in the GWA model, increasing the chance of false positives. Nie et al. (2016) used a unified mixed linear model (MLM; Yu et al., 2006) with 1,059 markers from several polymerase chain reaction (PCR)-based genotyping methods on a collection of 138 M. sinensis genotypes from southwest China phenotyped in a field trial in Sichuan and identified one significant association with biomass yield and 11 significant associations with other traits. Genomic prediction accuracies estimated by Nie et al. (2016) were generally lower than those estimated by Slavov et al. (2014), although the estimate for dry biomass yield was higher (0.23).
Given the high genetic diversity, population structure (Clark et al., 2014) and known genotype-by-environment effects (Arnoult & Brancourt-Hulmel, 2015;Clifton-Brown et al., 2001;Kaiser, Clark, Juvik, Voigt, & Sacks, 2015;Yan et al., 2012) in M. sinensis, inferences from GWA and genomic prediction would benefit from multilocation phenotypic data on large, genetically diverse germplasm panels. GWA and genomic prediction studies that used phenotypic data spanning multiple continents are rare, with notable exceptions being a few studies in wheat (Crossa et al., 2014(Crossa et al., , 2007. However, such broad sampling of environments can enable the identification of SNPs that are broadly useful for improving breeding values across environments (Wei et al., 2010) and improve genomic prediction accuracy via the use of phenotypes from correlated environments (Crossa et al., 2014;Spindel et al., 2016).
In this study, we present the first multilocation GWA and genomic prediction study in M. sinensis. In total, 568 genotypes previously characterized for population structure (Clark et al., 2014) were phenotyped at six field trial locations in Asia and North America (Clark et al., 2019), and were genotyped with 46,177 SNPs using RAD-seq. Our goals were to (a) identify candidate genes for yield and yield-component traits; (b) assess reproducibility of GWA K E Y W O R D S biomass yield, field trials, genome-wide association studies, genomic selection, Miscanthus sinensis, Miscanthus × giganteus, RAD-seq results across years, locations, and correlated traits; and (c) estimate genomic prediction accuracy in order to determine the feasibility of genomic selection.

| Plant material and phenotypic data collection
Dry biomass yield and 14 yield-component traits (Table 1) were recorded for 568 M. sinensis clonal genotypes at six field trial locations in the second and third year after planting as previously described (Clark et al., 2019). In short, field trials were established early in the 2012 growing season at five temperate locations: Sapporo, Japan by Hokkaido University (HU); Leamington, ON by New Energy Farms (NEF); Fort Collins, CO by Colorado State University (CSU); Urbana, IL by the University of Illinois (UI); and Chuncheon, Korea by Kangwon National University (KNU); and at one southern (subtropical) location, Zhuji, China by Zhejiang University (ZJU). Field trials were randomized complete block designs with three to four replicate at each site; plots were single plants equally spaced within and between rows on 1.5 m centers. Harvesting was conducted in late autumn or early winter, after dormancy or the first killing freeze led to dry down, with stems being cut 15-20 cm above the ground. M. sinensis genotypes included wild accessions representing six genetic groups in East Asia, plus ornamental and US naturalized accessions that clustered with an S Japan based on marker data (Clark  et al., 2014). Pairwise Jost's D (Jost, 2008) among the eight genetic groups ranged from 0.01 to 0.08, indicating low genetic differentiation consistent with M. sinensis being a highly outcrossing species (Clark et al., 2014).

| Quantitative genetics
Best linear unbiased predictor (BLUP) values were calculated for subsequent use in GWA and genomic prediction. Phenotypic data were transformed using the Box-Cox method (Box & Cox, 1964) prior to BLUP calculation, as previously described (Clark et al., 2019), in order to prevent false positive associations, inflated prediction accuracy estimates, and other issues that can arise from violation of model assumptions in GWA and genomic prediction (Owens et al., 2014). Random effects models were fitted using the R package lme4 (Bates, Mächler, Bolker, & Walker, 2015). Models were fit for each of 188 location × trait × year combinations (Equation 1), with replication (R) and genotype (G) as random effects. In Equation 1, Y is the vector of Box-Cox-transformed phenotypes, b is the intercept, β is a vector of coefficients (β 2 being the BLUPs used in downstream analysis), and ε is random error.
Additionally, 132 multilocation models (Equation 2, Table 2) were fit using the site combinations HU + NEF, ZJU + KNU, HU + NEF + UI + CSU, HU + NEF + UI + CSU + KNU (all five northern trial sites), and HU + NEF + UI + CSU + KNU + ZJU (all six trial sites), which were chosen based on environmental similarities, genetic correlations, and ANOVA of phenotypic values among locations (Clark et al., 2019), in order to generate multilocation BLUPs for GWA and genomic prediction. Multilocation models were only fitted when data for a given trait × year combination were available for all locations in a given combination of locations. Location (L), replication within location, genotype, and genotype × location were included as random effects in the multilocation models.
However, if Equations 1 and 2 are applied to all entries in the germplasm panel, they do not account for population structure. Prior studies have found that population structure can bias estimates of genomic prediction accuracies upwards (Fiedler et al., 2018;Guo et al., 2014;Riedelsheimer et al., 2012). Because we previously found that the genetic groups within M. sinensis differed from each other with respect to yield and yield-component traits, and because it was possible to predict from marker data which genetic group a genotype was in, we expected genomic prediction accuracy across the entire germplasm collection to be higher than genomic prediction accuracy within groups. Estimation of prediction accuracy within genetic groups would therefore serve as an indicator of how well genomic selection would work if one were to breed within groups, or within a population where individuals from different genetic groups were intermated a sufficient number of generations to greatly reduce or eliminate population structure. Genetic groups in the M. sinensis diversity panel included S Japan (which included ornamental and US naturalized accessions), Northern Japan (N Japan), Korea/Northern China (N China), Yangtze-Qinling, Sichuan, Southeast China (SE China), as determined by the discriminant analysis of principal components (DAPC; Clark et al., 2014).
We evaluated three methods for estimating genomic prediction accuracy within genetic groups: (a) performing genomic prediction separately within each genetic group using BLUPs from Equations 1 and 2, (b) using genotypewithin-genetic group BLUPs (Equations 3 and 4, below) in genomic prediction, and (c) fitting the BLUPs from Equations 1 and 2 to the genetic group and using the residuals in genomic prediction (Equation 5, below). Method (a) closely resembles the genomic prediction that might be performed by plant breeders wishing to select within individual genetic groups, but in this study, estimates from this strategy were limited by small sample sizes within genetic groups. In contrast, methods (b) and (c) allowed us to more accurately estimate the prediction accuracy without bias from population structure from the representative panel that we studied. Method (b) was designed to integrate coherently with our method for estimating BLUPs, and to control for interaction effects between genetic group and location. Method (c) was based on a method that Lipka et al. (2014) developed previously to control for population structure in genomic prediction. All three methods were performed using traits measured in year 3 only.
For method (a) of controlling for population structure, genomic prediction was run within genetic group for groups where at least 50 genotypes had phenotypic data at a given site or site combination, using BLUPs from Equations 1 and 2. The US naturalized and ornamental accessions were included as part of S Japan for this purpose, given their known S Japan ancestry.
For method (b) of controlling for population structure, genotypic BLUPs were calculated where between-group variance was removed, leaving only within-group variance. BLUPs for genotype-within-genetic-group within each individual location were calculated with Equation 3, where D is the genetic group.
Multilocation BLUPs for genotype-within-genetic-group were calculated with Equation 4 for all five northern trial sites (HU + NEF + UI + CSU + KNU), and for all six trial sites (HU + NEF + UI + CSU + KNU + ZJU). (1) T A B L E 2 Number of significant SNP-trait associations detected in genome-wide association analyses of Miscanthus sinensis. Phenotypic values were Box-Cox transformed, then genotype BLUPs were calculated for each trait. In total, 46,177 SNP markers and 568 accessions were included in genome-wide association analyses. Q + K mixed model analyses were performed using the software GAPIT. Associations were considered significant if p < 0.05 after FDR correction   The BLUP values used in downstream genomic prediction analysis from Equation 3 was represented by β 3 , and for Equation 4, β 5 was the BLUP used.
For method (c) of controlling for population structure, models were fitted with genetic group as a fixed effect and BLUPs from Equations 1 and 2 as the response variable. Residuals from these models were then used as the response variable in genomic prediction (see below). These residuals were calculated for traits measured in year 3 only. Residuals were calculated with Equation 5, where D is the genetic group.
To improve genomic estimated breeding values by utilizing genetic correlations among traits, selection indices were estimated for each individual location, the five northern locations, and all six locations, using an approach similar to that of Davey et al. (2017). Year 3 measurements of Yld, CC, BC, CmL, CmNdN, DBI, and TCmN were included in the selection index. CmDW, CmDW/V, and TCmN/RCmN were excluded as these were not measured at all sites. To avoid error from the inclusion of highly correlated variables (Baker, 1986), CC/BC, IntL, CmV, and TCmN/A were excluded as these were calculated from other traits in the model, and DTI was excluded for its strong correlation with DBI. Plots that did not have year 3 measurements for all seven traits were excluded. For each location and location combination, phenotypic (P) and genetic (G) variance-covariance matrices were estimated using Box-Cox transformed data. The genetic variance was estimated from Equation 1 (individual locations) or Equation 2 (location combinations). The genetic covariance was estimated as (σ 2 (trait2) are the genetic variances of individual traits, and σ 2 G(trait1 + trait2) is the genetic variance of the sum of the two traits. The economic value vector a was set to 1 for Yld and 0 for all other traits. The weighting factor w was then estimated as: The selection index S was then estimated as.
where X is a matrix of phenotypic observations by plot. Equations 1-4 were then used to estimate BLUPs of the selection index, and Equation 5 was used to estimate residuals of regressing BLUPs on DAPC groups.

| Genotyping
RAD-seq was performed according to a previously described protocol (Clark et al., 2014). In summary, genomic DNA was digested with MspI and either PstI-HF or NsiI-HF (New England BioLabs). Digested DNA was then ligated to a barcoded adapter with a PstI/NsiI overhang and a universal adapter with an MspI overhang. Ninety-five barcoded samples were then pooled into one library, and a 200-500 bp size selection was performed on 2% agarose. Libraries were amplified with a Kapa Hi-Fi library amplification kit (Kapa Biosystems, Wilmington, Massachusetts, USA), quantified, and then sequenced on a HiSeq 2500 (Illumina) at the Roy J. Carver Biotechnology Center at the University of Illinois. Nine PstI libraries from a previous study (Clark et al., 2014) as well as eight additional PstI libraries and thirteen NsiI libraries (data available at NCBI Sequence Read Archive, accession SRP026347), were included in the analysis. Every individual in the study was represented on at least two PstI libraries and two NsiI libraries.
SNPs were mined from RAD-seq data on 594 accessions (Data S1), including 568 M. sinensis accessions with phenot ypic data, 3 doubled haploid M. sinensis genotypes that were not included in the field trials, 3 M. sinensis accessions that did not survive at any field trial location, 9 diploid and 1 triploid M. × giganteus accession, and 7 diploid and 3 tetraploid M. sacchariflorus accessions. SNPs were called with the UNEAK pipeline (Lu et al., 2013) using a minimum call rate of 0.04 and a minimum minor allele frequency of 0.002 for initial filtering. Additionally, 402 GoldenGate SNPs from a previous study (Clark et al., 2014) were included. SNPs were then imported into R, and any SNPs that appeared hetero zygous in any of the doubled haploid lines were removed from the dataset because these were evidence of allelic differences at paralogous loci in the recently duplicated diploid M. sinensis genome. The dataset was then filtered to only include SNPs that had missing data in 30% or fewer individuals and a minor allele frequency of at least 0.05 in at least one of the nine genetic groups (eight M. sinensis groups and one M. sacchariflorus group) previously identified by DAPC (Clark et al., 2014;Jombart, Devillard, & Balloux, 2010). A total of 46,177 SNP markers (including GoldenGate SNPs) were retained, with an overall missing data rate of 26% averaged across all accessions and SNPs. After removing 20 individuals that were M. sacchariflorus or F 1 M. × giganteus hybrids, imputation of missing SNPs was performed for the remaining 574 M. sinensis individuals (including the three doubled haploid lines), using the estimation-maximization method based on relatedness (Poland et al., 2012) as implemented in the R package rrBLUP (Endelman, 2011).

| Genome-wide association
Unified mixed linear model genome-wide association (MLM GWA) analyses were performed on 568 M. sinensis individuals (all M. sinensis with phenotype and genotype data; Data S1) using 46,177 imputed SNP markers. MLM GWA was implemented with the software GAPIT (Lipka et al., 2012) version 2016.03.01 using kinship matrix compression and P3D (Zhang et al., 2010) under default parameters. The kinship matrix included in the model was estimated by GAPIT using the method of VanRaden (2008). Included as covariates were seven columns of Q values estimated by Structure (Falush, Stephens, & Pritchard, 2003) from our previous study of M. sinensis population structure ; Data S1). The eighth column of Q values was omitted since it could be predicted from the other seven. The eight columns of Q values corresponded to ancestry from M. sinensis populations in S Japan, central Japan, N Japan, Korea/N China, Sichuan, Yangtze-Qinling, and SE China/tropical, as well as M. sacchariflorus. The S Japan population from Clark et al. (2014) corresponded to the S Japan and central Japan populations from Clark et al. (2015), due to larger sample size in the latter study resulting in finer resolution of population structure in Japan. All single-site BLUPs and multisite BLUPs were individually subjected to GWA. Upon examination of the number of significant SNPs identified, GWA results were only retained for further analysis if the trait BLUP included values for more than 200 genotypes and if its skewness was no less than −2 and no greater than +2, so that associations would not be driven by accessions with extreme phenotypes. All single-site BLUPs for CSU, UI, and KNU were excluded because each had fewer than 200 genotypes with data. Data from different years were treated separately in order to help assess the reproducibility of associations. In total, 83 single-site trait BLUPs and 127 multisite trait BLUPs were analyzed. For any given BLUP, SNPs were excluded from GWA if there were fewer than three genotypes with phenotypic data that also had the minor allele. Associations were considered significant if false discovery rate (FDR)-corrected p-values (Benjamini & Hochberg, 1995) were below 0.05.

| Genomic prediction
Genomic prediction for each location and for combinations of locations was performed using ridge regression-best linear unbiased prediction (RR-BLUP; Meuwissen, Hayes, & Goddard, 2001). Assuming equal phenotypic variance explained at each marker across the genome, an additive relationship matrix calculated from imputed SNP markers was used as the independent variable in the genomic prediction model, rather than the matrix of marker values (Endelman, 2011). The relationship matrix was calculated using the A.mat function, and the genomic prediction model was then calculated using the kin.blup function in the rrBLUP package (Endelman, 2011). Response variables for the genomic prediction models were the BLUP values obtained from phenotype data or the residuals of those BLUPs fitted to a model with genetic group as a fixed effect (Equations 1-5). Tenfold cross-validation was used as described by Resende et al. (2012). Briefly, for each trait, all genotypes having phenotype data were divided into 10 subgroups of equal size. Nine subgroups were used as a training set, while the remaining subgroup was used as a prediction set. The process was repeated 10 times in order to predict the genomic estimated breeding value (GEBV) once for each genotype. Prediction accuracy was estimated as the Pearson's correlation coefficient between the observed BLUP values and the predicted values. The stability of the model was evaluated by performing 100 iterations of cross-validation and estimating the average value of the Pearson's correlation coefficient from each iteration.

| Genome-wide association
Across the 210 trait-site-year combinations examined, a total of 504 significant SNP-trait associations were identified at 5% FDR (Table 2). Within most traits, some SNPs were significant for more than one site, site combination, and/or year, yielding 326 significant associations when these duplicates were removed (Table 2). Because some SNPs were significant for more than one trait, 298 unique significant SNPs were identified (Data S2). Of all the significant associations, 307 were from traits measured in year 2 and 197 were from traits measured in year 3 (Table 2), and 16 SNPs had significant associations with at least one trait in both years 2 and 3 (Table S1). Most sets of traits that shared significant associations with a SNP were related to each other, such as culm diameter and culms per area (Table S1). An average of 2.5 significant SNPs were found for each trait × site × year estimated from single-site genotypic BLUPs, and 2.3 were found per trait × site combination × year based on multisite genotype BLUPs. The greater the number of sites used to calculate the multisite BLUPs, the fewer significant SNPs that were typically identified, with 3.1 and 2.4 for HU + NEF and KNU + ZJU respectively, 2.3 for HU + NEF + CSU + UI, 2.2 for HU + NEF + CSU + UI + KNU, and 1.6 for HU + NEF + CSU + UI + KNU + ZJU (Table 2). However, for yield, only 3 significant year 3 yield SNPs were identified for ZJU + KNU, whereas 26 were found for HU + NEF + UI + KNU + ZJU, (Figure 1; Table 2). In total, 27 unique SNPs for yield were identified, with 26 in | 995 CLARK et AL.
year 3, 1 in year 2, and 2 in both years. The minor allele was associated with higher yield for 23 out of the 27 significant SNPs (Data S2). Eighteen of the significant SNPs for yield could be aligned to S. bicolor chromosomes, and the remaining nine were unaligned (Data S2). Of the yield SNPs aligned to the S. bicolor genome, five were on chromosome 1, three on chromosome 2, three on chromosome 3, two on chromosome 4, one on chromosome 6, one on chromosome 8, two on chromosome 9, and one on chromosome 10, with the closest pair of yield SNPs being over 1 Mb apart on chromosome 9 (Data S2, Figure 1 and Figure S1).

| Genomic prediction
Using genotypic BLUPs (Equations 1 and 2), genomic prediction accuracies for year 3 were mostly moderate for dry biomass yield and compressed circumference, the component trait that best predicted yield (Table 3). At each site, genomic prediction accuracies for culm length were higher than they were for yield or compressed circumference. KNU had a relatively small number of surviving genotypes and low replication (Clark et al., 2019), which likely contributed to its low prediction accuracies (Table 3; Table S2). Prediction accuracies for multisite genotype BLUPs (Equation 2) were comparable to those for single-site genotype BLUPs (Equation 1), suggesting that genomic selection can be used to identify genotypes that are high yielding across a broad range of environments (Table 3). Predicted breeding values are provided in Data S1. For all traits, prediction accuracies ranged from −0.19 to 0.79 for single-site BLUPs and 0.17-0.65 for multisite BLUPs (Table S2). Genomic prediction accuracies for diameter of basal internode, culm length, and culm volume were high (≥0.5) at all locations except at KNU, which had moderate values (Table 3; Table S2).
Genomic prediction accuracies for genotypes within genetic groups (using BLUPs calculated according to Equations 3 and 4 and residuals calculated according to Equation 5) were typically lower than those that did not control for population structure, but remained moderate in magnitude for most traits (Table 3), suggesting potential for genomic selection within genetic groups. Some notable exceptions were observed in which prediction accuracy was reduced drastically by accounting for population structure, including basal circumference at NEF and ZJU; compressed circumference/basal circumference at NEF and UI; diameter of topmost internode at HU; total number of culms at KNU; culms per footprint at NEF and ZJU; and yield, culm length, total number of culms, and culms per footprint at ZJU. On average, prediction accuracies were similar using the BLUP method (Equations 3 and 4) and the residuals method (Equation 5), although there were notable differences. For example, with respect to the method that did not control for population structure, prediction accuracies for basal circumference across multiple sites was reduced by ~50% when using residuals, but was reduced to zero when using genotype-within-genetic group BLUPs (Table 3). When yield BLUPs across temperate locations were used without controlling for the effect of genetic group (Equation 2), genomic prediction ranked the genotypes somewhat according to their genetic group (5.6% of variance in yield GEBVs explained by genetic group vs. 2.2% of variance in yield BLUPs explained by genetic group), suggesting that population structure played a role in prediction accuracy ( Figure 2a). However, when the effect of genetic group was removed by any of our three methods (prediction within groups, within group BLUPs, or residuals), genomic prediction no longer ranked genotypes according to genetic group but was still moderately accurate (Figure 2b-d).

Chr01
Chr02 Chr03 Chr04 Chr05 Chr06 Chr07 Chr08 Chr09 Chr10UA T A B L E 3 Genomic prediction accuracies for dry biomass yield and yield-component traits measured in year 3 on Miscanthus sinensis genotypes grown at five northern field trial locations (HU, NEF, CSU, UI, and KNU; 37.9-43.1°N) or one southern location (ZJU; 29.8°N). Phenotypic values were Box-Cox transformed, then genotype BLUPs (G; Equations 1 and 2), genotype-within-genetic group BLUPs (G(D); Equations 3 and 4), or residuals of genotype BLUPs fitted to genetic group (R; Equation 5) were calculated for each trait and used as the response variable in genomic prediction. G(D) BLUPs and R residuals eliminated phenotypic variance attributable to differences among genetic groups, in order to estimate the efficacy of genomic selection within genetic groups. Among the genetic groups, S Japan (including ornamental and US naturalized accessions, known to be derived from S Japan germplasm) had the highest prediction accuracies overall, which were consistently moderate to high across locations and traits (Tables 4 and 5). The Yangtze-Qinling group had consistently moderate prediction accuracies among locations for compressed circumference/basal circumference, culm length, culm node number, and internode length, whereas other traits had moderate prediction accuracies at some locations and low prediction accuracies at others for this group (Table 4). N Japan had mostly moderate prediction accuracies at NEF, but moderate to low prediction accuracies at HU (Table 4). Low to moderate accuracies were observed in the Korea/N China group, but no traits were consistently moderate across locations in this group, despite this group having more genotypes than any other (Table 4). The SE China/tropical group only had enough genotypes for genomic prediction at ZJU, where prediction accuracies were mostly moderate (Table 4).
The use of a selection index for yield, based on biomass yield plus six yield-component traits, generally gave higher prediction accuracies than those observed for yield alone (Tables 3-5). The improvement in prediction accuracy was particularly consistent for multilocation data, with or without controlling for population structure. For example, prediction accuracy across all field trial locations without controlling for population structure was 0.64 for the selection index versus 0.49 for yield, and using genotype-within-genetic-group BLUPs was 0.49 for the selection index versus 0.17 for yield (Table 3). Due to environmental differences and different subsets of genotypes surviving at different locations, weighting factors for the selection index varied from location to location, although the weighting factor for total number of culms was always negative (Table S3), consistent with highly tillering plants tending to have short, thin culms that resulted in low yields (Clark et al., 2019).

| SNP-trait associations and genomic predictions
Our results indicate that GWAs and genomic selection are likely to accelerate the breeding of M. sinensis, even with relatively low coverage of the genome by RAD-seq SNPs. Marker-assisted breeding via genomic selection can be expected to reduce the duration of the breeding cycle (seed to seed) of M. sinensis from approximately 4 years for conventional breeding based solely on phenotypic selection, to one to one and a half years for purely marker-assisted breeding (Clifton-Brown et al., 2008;Głowacka, 2011). Combinations of marker-assisted breeding and phenotypic selection are likely to be needed to confirm and revise genomic predictions, and such strategies might have breeding cycles of intermediate duration, although it could also be possible to maintain a minimum breeding cycle duration while concurrently phenotyping a subset of progeny of each generation for long-term model improvement. In addition to their potential use as covariates for genomic selection, the SNPs that we identified via GWA can be used to identify candidate genes to help elucidate the genetic architecture of complex traits in the grass family (Liu & Yan, 2019). In particular, given that Miscanthus is undomesticated and has a broad geographic distribution, we expect that the degree and pattern of genetic variation across its genome is very different from that of cereal crops and other domesticated grasses, affecting which biologically relevant genes are detectable in GWA. Thus, GWA of Miscanthus can help complement existing knowledge of phenotypic pathways in the grasses.
Using only 46,177 RAD-seq SNPs, we identified hundreds of significant associations with yield-related traits, including 34 for biomass yield per se across 27 unique SNPs. Most SNP-trait associations that we identified were not significant in more than 1 year, but those that were  (Table S1) represent high priority targets for additional study. Although for most traits SNP-trait associations based on single-location BLUPs were more frequent than those based on multilocation BLUPs, for biomass yield the greatest number of associations were detected using BLUPs that were calculated across all sites, despite a large genotype × environment effect on yield (Table 2; Clark et al., 2019). Due to differential survival of genotypes across field trial locations, the BLUPs calculated across all locations included more genotypes than for subsets of locations, which likely contributed greater statistical power to detect SNPs associated with yield. These associations based on multilocation BLUPs represent allelic effects that were consistent for trial locations, which we would expect to be advantageous for breeding cultivars with broad adaptation. Similarly, genomic selection analyses of multilocation BLUPs resulted in moderate to high prediction accuracies for year 3 biomass yield, ranging from 0.17 when all sites were included and population structure was controlled for (Table 3) to 0.57 for KNU + ZJU when we did not control for population structure (Table S2). Using data from temperate sites only and controlling for population structure, prediction accuracy for year 3 yield was 0.35 using BLUPs for genotype-within-geneticgroup from Equation 4 and 0.31 using residuals of genotype BLUPs fitted to the genetic group from Equation 5 (Table 3; Figure 2), indicating that genomic selection across similar environments on individual genetic groups will be effective. Moreover, these prediction accuracies increased to 0.50 and 0.49, respectively, when a selection index for yield was used in place of direct selection for yield (Table 3), demonstrating the benefit of measuring yield component traits and including them in prediction models. In practice, genomic selection is likely to focus on breeding within genetic groups, and thus the prediction accuracies using genotype-within-genetic-group BLUPs and residuals of genotype BLUPs fitted on genetic groups provide useful estimates of the efficacy of genomic selection in real-world breeding. When genomic prediction was performed within genetic groups, prediction accuracy was variable due to small sample size (Tables 4 and 5). Nevertheless, we found that S Japan, the most consistently high-yielding genetic group across field trial locations, also had particularly high genomic prediction accuracies (Tables 4 and 5), indicating strong potential for genomic selection within this group. If the moderate to high genomic prediction accuracies from this study are confirmed in subsequent progeny populations, then we will be highly confident in the efficacy of genomic selection in M. sinensis and its advantage relative to phenotypic selection for efficiently breeding improved biomass cultivars.

F I G U R E 2 Genomic estimated breeding values (GEBVs) for year 3 dry biomass yield in
In some cases, low or negative genomic prediction accuracies were observed for residuals after genotype BLUPs were fitted to genetic groups (Equation 5) or for genotype-withingenetic group BLUPs (Equations 3 and 4), despite moderate or high prediction accuracies using genotype BLUPs (Table 3), indicating that in those cases most of the prediction accuracy using genotype BLUPs was dependent on phenotypic differences between genetic groups, rather than variance within groups. For example, at ZJU, some phenotypic differences among genetic groups were especially pronounced, with the Sichuan and SE China groups being high yielding and having long, thick culms, whereas surviving genotypes in the ornamental group were shorter and had many more culms than other groups (Clark et al., 2019). For basal circumference, outliers were present in the NEF and ZJU datasets after Box-Cox transformation and genotype-within-genetic-group BLUP calculation, and inaccurate GEBVs for these outliers accounted for the low prediction accuracy at those sites. Both cases suggest that inspection of the relationship between phenotypic values and GEBVs, beyond a single correlation statistic, is important for predicting the efficacy of genomic selection. Several methods have been used by others to control for or assess the effect of population structure on genomic prediction. Guo et al. (2014) decomposed the relationship matrix into eigenvectors before using it for prediction, then estimated variance attributable to the first n eigenvectors, which was assumed to represent population structure. Azevedo et al. (2017) used the first several eigenvectors of the relationship matrix as covariates in the prediction model. Fiedler et al. (2018) assigned entire genetic groups to the training or validation set. Arruda et al. (2015) used k-means clustering to identify closely related individuals and to make sure they were not in both the training and validation sets. Lipka et al. (2014) fitted phenotypic values to the first several principal components of the marker data and used the residuals as the response variable in genomic prediction. Because we have previously identified genetic groups within M. sinensis (Clark et al., 2014 and evaluated phenotypic differences among these groups in this set of field trials (Clark et al., 2019), we corrected for population structure by removing variance attributable to genetic groups by (a) performing genomic prediction within individual groups (Tables 4 and 5), (b) estimating BLUPs for genotype-within-genetic-group from Equations 3 and 4 (G(D); Table 3), or (c) analyzing residuals of genotype BLUPs fitted to genetic group (R; Equation 5). Prediction accuracies were T A B L E 5 Genomic prediction accuracies within genetic groups for traits measured in Miscanthus sinensis in year 3 at five northern field trial locations (HU,NEF,CSU,UI,and KNU;) and one southern location (ZJU; 29.8°N). Phenotypic values were Box-Cox transformed, then genotypic BLUPs were calculated across locations (Equation 2) and used as the response variable in genomic prediction. The analyses were based on 46,177 SNP markers and 568 accessions. Genetic groups were excluded from analysis when fewer than 50 genotypes had phenotypic data typically lower for the methods that accounted for population structure relative to analyses that did not account for structure, which was consistent with expectations because failure to account for population structure has been shown to bias estimates upwards (Fiedler et al., 2018;Guo et al., 2014;Riedelsheimer et al., 2012). For biomass yield, genomic prediction accuracies were mostly moderate for each of the three methods we used to control for population structure (Tables 3-5).
Including multiple locations in the analyses, and thus having greater phenotypic sampling, was advantageous. The genotype-within-genetic-group method and the residuals method typically gave similar results to each other but the former often had slightly higher prediction accuracies than the latter (Table 3). These small differences may be due to Equation 4 controlling for genetic-group-by-location effects, whereas Equation 5 did not. It will be interesting to see if tests of future generations of Miscanthus confirm the small difference in prediction accuracies observed for these two methods in this study. Our high genomic prediction accuracies and number of detected SNP-trait associations are notable given the relatively modest number of SNPs evaluated and the low linkage disequilibrium expected for M. sinensis, due to its self-incompatibility, undomesticated status, wind dispersal of pollen and seed, and large population size. Indeed, Slavov et al. (2014) found that linkage disequilibrium in M. sinensis decayed after several hundred base pairs for most SNP pairs (although they note that their ability to estimate linkage disequilibrium was limited by the low proportion of the genome covered by RAD tags, similar to our study) and yet they too were also able to identify significant associations for many traits. Given this low level of linkage disequilibrium, we might have expected that most blocks of linkage disequilibrium in the M. sinensis genome would not contain any SNPs from our set of 46,177 if the SNPs had been dispersed completely at random. However, our SNP markers were highly concentrated in the protein-coding portions of the genome. Out of 34,211 genes annotated in the S. bicolor genome v. 3.1, 14,062 of them were within 1 kb of a SNP in our dataset (20,611 SNPs near at least one gene; 11,013 genes actually contained a SNP from our dataset and 16,825 SNPs were within genes; see expanded version of Data S2 available at https ://doi.org/10.13012/ B2IDB-07908 15_V3). Given the whole genome duplication of Miscanthus with respect to its common ancestor with sorghum, we estimate that at least one fifth of M. sinensis genes (~14K out of ~68K) were in linkage disequilibrium with one or more SNPs in our dataset. Moreover, hundreds of significant SNP-trait associations were identified and moderate to high genomic prediction accuracies were obtained with the current dataset. However, as only a minority of all M. sinensis genes are likely in linkage disequilibrium with the current marker set, substantially increasing the numbers of SNPs would be expected to greatly increase the effectiveness of the GWA and genomic prediction analyses.
We found that when the same SNP was significantly associated with multiple traits, those traits tended to have strong genetic correlation, suggesting that in some cases one trait can serve as a proxy for another in GWA. For example, UIMiscanthus105065 was significantly associated with yield and compressed circumference, the best predictor of yield at ZJU in year 2, as well as basal circumference at ZJU and KNU in year 3, where the genetic correlation with yield was 0.76 and 0.61, respectively (Clark et al., 2019). Thirteen SNPs had associations with multiple traits relating to culm dimensions and number of culms, including diameter of basal internode, diameter of topmost internode, culm length, culm node number, internode length, culm dry weight, culm volume, total number of culms, and culms per footprint. Diameter of basal internode, diameter of topmost internode, culm dry weight, and culm volume nearly always had moderate to strong positive genetic correlations with each other (>0.6) and moderate negative genetic correlations with culms per footprint (Clark et al., 2019), which made sense given that fewer thick culms can fit into a given area than thin culms. Of the 13 SNPs, 6 had significant associations at multiple nonoverlapping field sites and/or site combinations, consistent with the high multilocation heritabilities of these traits (UIMiscanthus020125, 022671, 092590, 097427, 105978, and 117199;Clark et al., 2019). Thicker, larger stems were typically associated with higher yielding plants (Clark et al., 2019), and thus, we expect these SNPs to be useful for selection and breeding. Lastly, we identified a SNP associated with the ratio of compressed to basal circumference and culms per footprint at ZJU in year 3 (UIMiscanthus098471), where both of these traits relate to how much of the area occupied by the plant is filled in with culms rather than empty space. Overall, the identification of SNPs significantly associated with multiple yield-component traits suggests that the associations are biologically meaningful and may be used for identification of candidate genes and for breeding.
Throughout the Miscanthus genome, 80 quantitative trait loci (QTL) identified by five previous studies (Clark et al., 2016;Dong et al., 2018;Gifford, Chae, Swaminathan, Moose, & Juvik, 2015;Slavov et al., 2014;Zhao et al., 2013) included or were near to one or more significant F I G U R E 3 Locations of significantly associated single nucleotide polymorphisms (SNPs) and quantitative trait locus (QTL) from this study and others for biomass yield, compressed circumference, and culm length in Miscanthus with respect to the Sorghum bicolor reference genome. QTL peaks and 95% confidence intervals are indicated for Gifford et al. (2015) and Dong et al. (2018), and locations of single associated markers are indicated for all other studies SNPs from the current study (Data S2). Out of the 36 QTL identified by Gifford et al. (2015) for traits that were also measured in the current study, 10 contained a SNP from our study significantly associated with the same trait and 33 contained at least one SNP from our study associated with any trait. Out of the 14 QTL identified by Gifford et al. (2015) for compressed circumference, basal circumference, or ratio of compressed to basal circumference, 9 contained SNPs from our study significantly associated with the total number of culms, which had moderate genetic correlations with these traits at most sites (Clark et al., 2019) and logically should influence them. UIMiscanthus093867, which was significantly associated with yield in our study, was only 3.4 kb from the peak of Gifford et al.'s (2015) yield QTL Y4 and number of culms QTL NT4 on sorghum chromosome 1 (Figure 3a; Data S2). Similarly, UIMiscanthus030627, significantly associated with compressed circumference in our study, was 77.5 kb from the peak of compressed circumference QTL CC2 identified by Gifford et al. (2015) on sorghum chromosome 6 ( Figure 3b; Data S2). Out of the 47 joint-population meta-QTL identified by Dong et al. (2018) for traits included in this study, 7 spanned regions that included a SNP from our study for the same trait, and 34 included a SNP from our study significant for any trait. Additionally, out of nine QTL identified by Dong et al. (2018) for diameter of topmost internode, two spanned SNPs from our study for diameter of basal internode (strong positive genetic correlation; Clark et al., 2019), four spanned SNPs from our study for total number of culms (weak to moderate negative genetic correlation), and four spanned SNPs from our study for number of culms per area (moderate to strong negative genetic correlation). Concurrence between QTL identified in our study and others suggests confidence in their identification and indicates regions of the genome to prioritize further for identifying candidate genes.
Significant SNPs from our study that were near significant SNPs or QTL peaks from other studies were often associated with multiple yield-related traits. On S. bicolor chromosome 10, Slavov et al. (2014) identified a SNP associated with height of the tallest stem at 59.9 Mb (Sb10g029835), which is close to our SNP UIMiscanthus018832 (60.1 Mb) that was significantly associated with both culm node number and culm dry weight at NEF in year 3 (Table S1). Among the significant SNPs from our study that were closest to the QTL peaks identified by Gifford et al. (2015), UIMiscanthus117119, 030627, and 007880 were all associated with multiple traits (Data S2; Table S1).

| Candidate genes identified
Many of the significant SNP-trait associations identified in this study, especially those within previously known QTL, were near or in genes with a previously determined function in Arabidopsis and/or rice that suggested they may be the causative gene for the traits we observed in M. sinensis (Table S1; Data S2). For example, UIMiscanthus020125, which was associated with culm diameter and total number of culms in our study, as well as being within a previously identified QTL on S. bicolor chromosome 6 for number of culms (Gifford et al., 2015), codes for a synonymous mutation in the second exon of Sobic.006G034100, a methyl-CpG-binding protein expressed in juvenile and mature stems, lower leaves at grain maturity, and panicles, according to GeneAtlas data available at Phytozome (DOE-JGI, https ://phyto zome.jgi.doe. gov). Given that methyl-CpG-binding proteins have been observed to interpret DNA methylation in plants (Zemach & Grafi, 2007), regulating lateral branching in at least one case (Peng, Cui, Bi, & Rothstein, 2006), we hypothesize that this gene may be involved in epigenetic regulation of the observed trade-off in M. sinensis plants between the production of many thin stems or fewer, thicker stems. Similarly, on chromosome 3, UIMiscanthus112990, which was associated with total number of culms in the current study and within a QTL region identified by Gifford et al. (2015) for the same trait, is within the intron of the WOX transcription factor Sobic.003G336600 that is expressed in the panicle, stem, and leaf. WOX transcription factors have been found to regulate cell division and differentiation, including lateral organ formation in maize, Arabidopsis, and petunia (Van Der Graaff, Laux, & Rensing, 2009), which is consistent with the observed association for number of culms in M. sinensis. On chromosome 5, UIMiscanthus019070, which is 800 bp downstream of Sobic.005G132000, a transcription factor that regulates the expression of genes that respond to auxins (Guilfoyle & Hagen, 2007), was associated with the number of culms at HU in year 3, and is within overlapping QTL for basal circumference identified by Dong et al. (2018) and Gifford et al. (2015). Given the important role of auxin in tillering (Hussien et al., 2014), this is another promising candidate gene. For each significant SNP, Data S2 lists any sorghum genes containing the SNP as well as any sorghum genes within 1 kb of the SNP, along with any known gene functions from Arabidopsis or Oryza, and can be used as a resource for mining genes for further study.

| CONCLUSIONS
Genomic prediction has the potential to accelerate the breeding of M. sinensis several-fold in the immediate future. Large populations of seedlings can be screened with RAD-seq and subjected to genomic selection. Given the genomic prediction accuracies that we obtained, we recommend using genomic selection to identify the top percentage of genotypes for predicted yield breeding values, which can then be used for rapid cycling of generations to improve genetic gain per year relative to phenotypic selection. We expect that rapid and substantial genetic gains for biomass yield in M. sinensis will be obtained by implementing these new methods.