Genetic diversity and relationship between domesticated rye and its wild relatives as revealed through genotyping‐by‐sequencing

Abstract Rye (Secale cereale L.) is a cereal grass that is an important food crop in Central and Eastern Europe. In contrast to its close relatives wheat and barley, it was not a founder crop of Neolithic agriculture, but is considered a secondary domesticate that may have become a crop plant only after a transitory phase as a weed. As a minor crop of only local importance, genomic resources in rye are underdeveloped, and few population genetic studies using genomewide markers have been published to date. We collected genotyping‐by‐sequencing data for 603 individuals from 101 genebank accessions of domesticated rye and its wild progenitor S. cereale subsp. vavilovii and related species in the genus Secale. Variant detection in the context of a recently published draft sequence assembly of cultivated rye yielded 55,744 single nucleotide polymorphisms with present genotype calls in 90% of samples. Analysis of population structure recapitulated the taxonomy of the genus Secale. We found only weak genetic differentiation between wild and domesticated rye with likely gene flow between the two groups. Moreover, incomplete lineage sorting was frequent between Secale species because of either ongoing gene flow or recent speciation. Our study highlights the necessity of gauging the representativeness of ex situ germplasm collections for domestication studies and motivates a more in‐depth analysis of the interplay between sequence divergence and reproductive isolation in the genus Secale.

barley, rye is not considered a founder crop of Neolithic agriculture (Abbo, Lev-Yadun, Heun, & Gopher, 2013). Although archaeobotanical remains of rye were found in early Neolithic sites in the Northern Levant, their domestication status is uncertain (Hillman, 1978;Nesbitt, 2002). Substantial numbers of rye grains appear in the archaeological record only in the European Bronze Age (Hartyanyi & Novaki, 1975). By the Iron Age, rye has likely become a widely used crop (Behre, 1992). A commonly evoked explanation for this delayed rise to agricultural relevance is that rye spread as a "hitchhiking" weed in wheat and barley fields (Zohary et al., 2012) after adaptation to an agricultural habitat. According to this hypothesis of Vavilovian mimicry (McElroy, 2014), rye domestication may have proceeded in two stages: First, wild rye became a weed that may have already had acquired key domestication traits such a tough rachis and larger seeds to facilitate co-harvesting with wheat and barley. Later, when its superior performance under inclement Northern climate became evident, farmers started to sow and harvest rye in its own right, turning the preadapted weed into a "fully domesticated" crop. Because of this complex history, rye is considered a secondary domesticate (Preece et al., 2017).
Rye belongs to the small genus Secale with only three taxa, which contains the putative wild progenitor, S. cereale subsp. vavilovii together with domesticated rye S. cereale subsp. cereale as well as several other described subspecies (possibly weedy), and two other wild species S. strictum and S. sylvestre (Frederiksen & Petersen, 1998). Among these, only rye is used as a crop in current times, although S. strictum may have been used as a forage crop (Hammer, Skolimowska, & Knüpffer, 1987). The species in the genus Secale differ in life cycle and breeding system (Hammer, 1990). Rye is an annual outcrosser, but self-compatibility has been reported both in its wild progenitor S. cereale subsp. vavilovii (Hammer et al., 1987) and in breeding lines of cultivated rye (Voylokov, Fuong, & Smirnov, 1993). S. strictum and S. sylvestre are perennial outcrossers and annual selfers, respectively (Frederiksen & Petersen, 1998). The literature abounds with proposed infraspecific taxa for S. cereale and S. strictum that reflect differences in geographic range, growth habit (e.g., weediness), and morphological characters such as hairiness of leaf sheath of spike brittleness (Frederiksen & Petersen, 1998).
The agronomical importance of rye as a cereal crop is eclipsed by the prominent role of its close relatives wheat and barley in modern agriculture. Reasons for this include easier line breeding in the selffertilizing crops wheat and barley and a preference for wheat for baking and barley for malting in most regions of the world. Thus, comparatively few resources have been allocated to set up a genomic infrastructure comprising reference genome, representative diversity panels, and high-throughput marker technologies. Recently, Bauer et al. (2017) published an annotated draft reference genome sequence assembly constructed for the inbred line Lo7. This assembly represents only 2.8 Gb of the 8-Gb genome and is fragmented into 1.3 million sequence scaffolds of which only 158 Mb were anchored to chromosomal positions. Despite these shortcomings, this draft assembly will likely serve as an important reference anchor for population genomic studies in the same way as the genetically anchored sequence assemblies of barley and wheat (International Barley Genome Sequencing Consortium, 2012;International Wheat Genome Sequencing Consortium, 2014) underpinned genomewide surveys of sequence diversity (Jordan et al., 2015;Russell et al., 2016). The progress in genomics technology has enabled the development of high-throughput genotyping platforms also for minor crops. Mining transcriptome sequence data from five winter rye breeding lines, Haseneyer et al. (2011) designed a SNP array with 5,234 features. A subset of these were used to genotype a diversity panel of rye and wild relatives by Hagenblad, Oliveira, Forsberg, and Leino (2016). Compared to SNP arrays, reduced-representation sequencing methods based on digestion with restriction enzymes have the advantage of joint discovery and genotyping of sequence variant without the requirement for ascertaining polymorphic markers in a discovery panel, which may lead to underestimation of genetic diversity of diverse genetic material. A flavor of reduced-representation sequencing, DArTseq (Li et al., 2015), has been previously employed for linkage mapping in rye (Milczarski, Hanek, Tyrka, & Stojałowski, 2016;Rakoczy-Trojanowska et al., 2017).
Here, we report the analysis of single nucleotide polymorphism (SNP) data of rye and its wild relatives obtained through genotypingby-sequencing (GBS; Elshire et al., 2011). We performed explorative population genetic analysis of this dataset. Our results support the Secale taxonomy of Frederiksen and Petersen (1998) with little infraspecific substructure in S. cereale and recently shared ancestry between taxa.

| Plant material
One hundred and one genebank accessions from the Secale taxa (domesticated rye [S. cereale subsp. cereale], S. cereale subsp. vavilovii, further S. cereale subspecies, S. strictum, and S. sylvestre) were selected based on passport information (taxonomic status, country of origin, collection site) available in the genebank information system of the German Federal ex situ Genebank at IPK Gatersleben (GBIS; https://gbis.ipk-gatersleben.de/GBIS_I, Oppermann, Weise, Dittmann, & Knupffer, 2015). Seeds from the selected accessions were obtained from the IPK genebank. Passport information from IPK's genebank information system is given in Table S1, and collection sites are shown in Figure S1. DNA was extracted from leaf tissue of six plants at the seedling stage using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). Three plants of each accession were grown to full maturity to observe seed shattering.

| Genotyping-by-sequencing
Genotyping-by-sequencing libraries were prepared for the six individually bar-coded plants of each accession using the PstI-MspI two-enzyme approach (Poland, Brown, Sorrells, & Jannink, 2012) as described previously (Wendler et al., 2014). Libraries were sequenced (single read, 100 cycles) on the Illumina HiSeq 2500 device at IPK Gatersleben (Wendler et al., 2014). Raw data have been deposited in the European Nucleotide Archive under accession number PRJEB22681. Accession numbers for individual samples are provided in Table S2.

| Read mapping and variant calling
Primary data analysis followed the procedures of Mascher, Wu, Amand, Stein, and Poland (2013). Briefly, after adapter trimming with Cutadapt (Martin, 2011), reads were mapped to the wholegenome shotgun sequence assembly of rye cultivar Lo7 (Bauer et al., 2017) using BWA-MEM version 0.7.13 (Li, 2013). The alignments were converted to BAM format with SAMtools (Li et al., 2009) and sorted with NovoSort (http://www.novocraft.com/products/novosort/). Variant calling was performed with SAMtools and BCFtools version 1.3 (Li, 2011) using a mapping quality threshold of 30 and a base quality threshold of 20. The resultant VCF file was filtered with a previously published AWK script (Mascher et al., 2013). We considered only biallelic sites. Sites with a quality score below 40 were discarded. Genotype calls (inferred allelic states for individuals at segregating sites) as reported by BCFtools were set to missing if either the read depth (DP) or the genotype quality (GQ) was below 5. Subsequently, sites with more than 90% missing data or more than 90% heterozygous calls were discarded. The resulting markerby-individual matrix was imported into the R statistical environment (R Core Team, 2015) and further analyzed using functionalities of R packages data.table (https://cran.r-project.org/web/packages/data.  (Zheng et al., 2012), and SeqArray (Zheng et al., 2017). We further filtered the SNP set to include only sites with up to 10% missing across all samples. Three samples (belonging to the accessions R1112, R193, and R918) with less than 80% present data in the final call set were excluded from further analysis. Read depth analyses were performed using the command "samtools depth" (Li et al., 2009

| Population genetic analyses
Principal component analysis was performed with EIGENSOFT version 6.0.1 (Patterson, Price, & Reich, 2006) using least-square projection and disabling outlier removal. Fixation indices (F ST ) were calculated on previously defined groups using the modified Hudson estimator (Hudson, Slatkin, & Maddison, 1992) using the formulas of Bhatia, Patterson, Sankararaman, and Price (2013). Neighbor-joining trees were constructed with the R package "ape" (Paradis, Claude, & Strimmer, 2004). Pairwise identities based on identity-by-state were computed using SNPRelate (Zheng et al., 2012). Model-based ancestry estimation was performed with ADMIXTURE (Alexander, Novembre, & Lange, 2009). For each analysis, twenty replicate runs per cluster size K were performed and combined with CLUMPP (Jakobsson & Rosenberg, 2007). D-statistics were calculated with AdmixTools (Patterson et al., 2012) using wheat as the outgroup. To determine ancestral states, SNP positions were lifted to wheat. Toward this purpose, 300-bp regions flanking each SNP were extracted from the genome sequence assembly of rye (Bauer et al., 2017) and aligned to the genome sequence assembly of bread wheat (International Wheat Genome Sequencing Consortium, 2014) with BWA-MEM (Li, 2013).
Prior to mapping, the wheat assembly was divided into subgenomes based on the POPSEQ genetic map (Chapman et al., 2015) and exact SNP coordinates in wheat were determined by parsing the BAM CIGAR string (Li et al., 2009) using BEDTools (Quinlan & Hall, 2010) and R. If a flanking region was aligned to more than one wheat subgenome, only one hit was considered in the order (A, B, D). Predicted protein and transcript sequences of rye genes were aligned to their wheat counterparts (separated into subgenomes) using BLAST+ (Camacho et al., 2009). We considered only SNP positions that (i) overlapped an annotated rye gene and for which (ii) also their lifted positions in wheat overlapped the corresponding best hit in the wheat gene set. Nucleotides at these positions were extracted from the wheat assembly and used as ancestral states. To test the relationship between genetic and geographic distances, we performed a Mantel test using the R package adegenet (Jombart & Ahmed, 2011).

| Genotyping-by-sequencing in Secale species
We obtained genotyping-by-sequencing data for 101 genebank accessions of domesticated rye and wild Secale forms. Our panel includes 81 rye accessions, five accessions of S. cereale subsp. vavilovii, eleven accessions of S. strictum, and four accessions of S. sylvestre (Table S1, Figure S1). Six individuals were sampled from each accession for multiplexed genotyping-by-sequencing following the protocol of Wendler et al. (2014) using bar coding of individual DNA samples. Read mapping and variant calling were performed against the draft genome sequence assembly of cultivated rye (Bauer et al., 2017) using a reference-based SNP calling pipeline previously applied in wheat, barley, and wild relatives (Chapman et al., 2015;Mascher et al., 2013;Wendler et al., 2015). Stringent filtering for a low missing rate (10%) across all samples resulted in a set of biallelic 55,744 SNPs distributed across a target region of 2.4 Mb, that is, genomic intervals that are covered with at least five reads in 90% of the samples (Table 1). A total of 603 samples had more than 80% present data and were used for further analysis. Only 22% of SNPs were assigned to approximate chromosomal locations in the partially ordered sequence assembly (Table 1) of our SNPs were located in genic regions, although only 12% of the total assembly were annotated as genic sequence.

| Clear differentiation between species despite incomplete lineage sorting
We performed several explorative analyses, which all supported  Speciation is followed by lineage sorting, that is, the fixation of different alleles of ancestrally segregating variants in the descendant species by stochastic processes. Recent speciation events or ongoing gene flow due to incomplete fertility barriers leads to an elevated proportion of shared segregating sites between closely related species. We found a substantial number of biallelic SNPs variants for which both alleles were present in more than one species (Table 2). For example, 2,760 sites were polymorphic in both S. cereale and S. sylvestre. We note that S. strictum had relatively more polymorphic sites in common with S. cereale (65.8%) than had S. sylvestre (51.1%), indicating that S. sylvestre split first from the lineage leading to S. strictum and S. cereale in agreement with AFLP phylogeny of Chikmawati, Skovmand, and Gustafson (2005) and the chloroplast phylogeny of Petersen and Doebley (1993).

| Evidence for crop-wild gene flow in S. cereale
Studies in wild and domesticated barley have shown (i) a clear separation of the crop and its wild progenitor and (ii) a good correlation between geographic and genetic distance in the crop and, independently, the wild relative (Russell et al., 2016). To investigate the re- Apart from a clustering of individuals from the same accession, also taxonomic classifications were not reflected in the PCA (Figure 4b).
The first four components together explained only 6.7% of the variance, indicating low genetic differentiation between populations.
A Mantel test on genetic distances based on pairwise identity-bystate analysis and the geographic origin given in the passport data revealed a weak (r = .09), albeit significant (p = .01), correlation.
Next, we included the S. cereale subsp. vavilovii samples in the PCA (Figure 5a). Most conspicuously, samples of one accession

| Shared ancestry components between rye and related species
We performed model-based ancestry estimation across all our samples with the assumptions of either four or seven ancestral populations using ADMIXTURE (Alexander et al., 2009).  Figure S2). In both scenarios, ancestry components were shared between the three taxa ( Figure 6). The major ancestry components of S. cereale subsp. vavilovii were also present in domesticated rye ( Figure 6). When seven ancestral populations were used, two components (blue and green color in Figure 6B) were private to S. strictum. Interestingly, the "green" ancestry component was present in unadmixed state in a group of six individuals belonging S. strictum subsp. kuprijanovii, a subspecies from the Caucasus (Hammer et al., 1987). Additionally, this component occurred in two other S. strictum accessions also sharing ancestry with S. cereale (R2859 from Armenia and R2431 from Bulgaria). This shared ancestry could be caused by recent interspecific gene flow or shared ancestral variation. An improved, physically ordered reference genome sequence would enable the inspection of haplotype lengths to date hybridization events (Harris & Nielsen, 2013).

| D ISCUSS I ON
We have used genotyping-by-sequencing to discover genomewide SNP markers in a diversity panel of rye and its wild relatives.
Compared to previous studies based on microsatellite markers (Parat et al., 2016), the number of genetic markers in our study is orders of ing genetic diversity by focusing on conserved restriction fragments (Gautier et al., 2013). However, we believe that our conclusions from explorative analyses of population structure and genetic differentiation are robust to these potential biases.
In the past, several taxa included in our study such as S. cereale subsp. vavilovii or S. strictum subsp. kuprijanovii were considered as distinct species (Roshevitz, 1947). This overclassification may trace The composition of the "dense," "intermediate," Iberia, and Armenia groups of S. cereale is indicated in Figure 5b.
F I G U R E 6 ADMIXTURE results assuming four (a) and seven (b) ancestral populations. Colors represent ancestry components. Stacked bars represent samples. Samples are arranged according to taxonomy as indicated in the x-axis labels back to a preference of crop geneticists for obvious differences in conspicuous characters that nevertheless do not warrant elevation to the species level (Harlan & de Wet, 1971). The taxonomic revision of Frederiksen and Petersen (1998)  Interspecific gene flow in the genus Secale is not unexpected: Previous studies have examined the crossability of different Secale species and have identified chromosomal rearrangements as likely causes for frequent hybrid sterility (Hrishi & Müntzing, 1960;Singh, 1977;Stutz, 1957). However, these cytogenetic studies examined offspring of artificial interspecific crosses in a laboratory setting. Our results regarding the close affinity of  (Perrino et al., 1984). In principle, a wider wild genepool is accessible to prebreeding efforts in rye than in its close relative barley, also a diploid Triticeae. Fertile progeny is not readily obtained for interspecific crosses in Hordeum: Even crosses between barley and its closest relatives H. bulbosum frequently result in uniparental genome elimination (Houben, Sanei, & Pickering, 2011;Kasha & Kao, 1970). In the future, exotic traits from an agronomical perspective such as a perennial growth habit or self-fertility could be stably introduced into rye from the wild genepool. If incomplete lineage sorting also extends to structural variants such as chromosomal inversions, we can expect to find chromosomal rearrangements to underlie both apparently interspecies fertility barriers and occasional intraspecific hybrid sterility as has been observed in rye (Stutz, 1976). The construction of interspecific linkage maps and the application of novel methods for physical genome mapping such as chromosome conformation capture sequencing (Harewood et al., 2017) or optical mapping (Lam et al., 2012) are promising avenues for unraveling the causes for partial reproductive isolation between rye taxa, which may also shed light on the genetic basis of differences in the key characters self-compatibility and annual life cycle.
In contrast to the situation in barley (Russell et al., 2016), we did not find a good correspondence between geography and ge- It is not unlikely that gene flow has been bidirectional, resulting in the genetic assimilation of wild-growing and cultivated populations, which obfuscates the difference between "truly" wild, weedy, feral, and "fully" domesticated plants. Our small panel with only five accessions may not capture the full extent of population structure in S. cereale subsp. vavilovii. For instance, there may be many more "outlier" accessions such as R1003, which may actually represent rare unadmixed wild populations that would be highly informative on the domestication origin(s) of rye.
To summarize, our data are consistent with the notion that the domestication history of rye is more complex than that of the Neolithic founder crops wheat and barley. We believe that a more comprehensive assessment of the rye genetic diversity maintained in ex situ collections and, subsequently, a targeted effort toward broader sampling of wild and weedy populations are necessary to better understand the population structure in domesticated rye and its wild relatives.

ACK N OWLED G EM ENTS
We thank Stefanie Thumm and Susanne König for skillful technical assistance and Anne Fiebig for sequence data submission. This research was supported by IPK core funding.

CO N FLI C T O F I NTE R E S T
None declared.