Genomic analyses reveal selection footprints in rice landraces grown under on‐farm conservation conditions during a short‐term period of domestication

Abstract Traditional rice landraces grown under on‐farm conservation conditions by indigenous farmers are extremely important for future crop improvement. However, little is known about how the natural selection and agriculture practices of indigenous farmers interact to shape and change the population genetics of rice landraces grown under on‐farm conservation conditions during the domestication. In this study, we sequenced DNA from 108 core on‐farm conserved rice landraces collected from the ethnic minority regions of Yunnan, China, including 56 accessions collected in 1980 and 52 accessions collected in 2007 and obtained 2,771,245 of credible SNPs. Our findings show that most genetic diversity was retained during the 27 years of domestication by on‐farm conservation. However, SNPs with marked allele frequency differences were found in some genome regions, particularly enriched in genic regions, indicating changes in genic regions may have played a much more prominent role in the short‐term domestication of 27 years. We identified 186 and 183 potential selective‐sweep regions in the indica and japonica genomes, respectively. We propose that on‐farm conserved rice landraces during the short‐term domestication had a highly polygenic basis with many loci responding to selection rather than a few loci with critical changes in response to selection. Moreover, loci affecting important agronomic traits and biotic or abiotic stress responses have been particularly targeted in selection. A genome‐wide association study identified 90 significant signals for six traits, 13 of which were in regions of selective sweeps. Moreover, we observed a number of significant and interesting associations between loci and environmental factors, which implies adaptation to local environment. Our results provide insights into short‐term evolutionary processes and shed light on the underlying mechanisms of on‐farm conservation.


| INTRODUC TI ON
Rice (Oryza sativa L.) was domesticated between 8,000 and 10,000 years ago from its wild ancestor, Oryza rufipogon, a broadly distributed native species of Asia (Oka, 1988). During domestication, its genetic diversity was reduced by up to 80% from that of the wild ancestor (Londo, Chiang, Hung, Chiang, & Schaal, 2006). The most extreme loss of diversity occurred in modern, high-yield rice cultivars, leading to a lack of evolutionary potential for adaptation to changing environments in these lineages. Compared with mod- to 29°15′8″N and 97°31′39″E to 106°11′47″E), is characterized by a broad distribution of habitat types, including mountains, plateaus, and basins, with elevations ranging from 76 m to 6,740 m (Zeng et al., 2001). Yunnan has a long history of human settlement and agricultural activities, with 26 nationalities represented in the area. Owing to the wide geographical variation, diverse growing conditions, and cultural and ethnic diversity, Yunnan is acknowledged as one of the largest genetic diversity centers of rice in China and even worldwide (Zeng et al., 2001(Zeng et al., , 2007Zhang et al., 2006). A remarkably diverse set of rice landraces are found in Yunnan, including all varieties of Oryza sativa L. ssp. indica and ssp. japonica found in China. The diversity varies in many ways from morphological traits to aromas, such as having glumes with or without hairs and being nonglutinous or glutinous, upland or lowland, of the nude rice type, and rice of various hulled grain colors (white, red, and purple) and flavors (ordinary or fragrant; Zeng, Xu, Shen, & Deng, 2000).
To date, many traditional rice landraces are still planted by indigenous farmers and grown under on-farm conservation conditions in some ethnic minority regions of Yunnan. They are passed down from generation to generation despite the availability of modern improved varieties for reasons associated with the diversity of the local agroecology and to fulfill cultural requirements (Xu et al., 2014).
Indigenous locals practice the conservation of diverse traditional rice landraces on their farms, henceforth referred to as "on-farm conservation," not only for the conservation of highly productive landraces but also for ones more resistant to diseases and pests and tolerant of extreme environmental conditions, as well as cultural demands (e.g., ethnic dietary customs, medical uses, festival, and religious ceremony; Gao, 2003). These landraces selected by on-farm conservation likely cannot be easily replaced by modern improved varieties because they undoubtedly have their own outstanding features and some of these landraces have been planted for more than 50 years. The diverse climatic ecotypes, environmental heterogeneity, and unique ethnic minority cultures and customs play a crucial role in cultivation of landraces by indigenous farmers in the ethnic minority regions of Yunnan and thus provide a model of traditional on-farm conservation.
On-farm conservation is a subset of in situ conservation, which is increasingly recognized as a key component of any comprehensive strategy to conserve crop genetic resources (Pandey, Bisht, Bhat, & Mehta, 2011;Pusadee, Jamjod, Chiang, Rerkasem, & Schaal, 2009). Bellon, Pham, and Jackson (1997) defined on-farm conservation of crop genetic resources as "the continued cultivation and management of a diverse set of traditional landraces by farmers in the agroecosystem where they were developed," which prevents the traditional landraces from being replaced by modern varieties or disappearing. Thus, on-farm conservation provides opportunities for continuous differentiation and variation in traditional landraces (Bellon et al., 1997).
Genetic variation and differentiation are influenced by natural processes, such as selection and drift, and can also be influenced by the agriculture practices of indigenous farmers. However, little is known about how these processes interact to shape and change the population genetics of rice landraces grown under on-farm conservation conditions during a period of short-term domestication, which we consider as less than 30 years. Similarly, it is not clear whether genetic diversity has been successfully maintained by on-farm conservation practices. Thus, we aim to use whole-genome sequences to address how population dynamics have changed or remained the same over a short period of domestication in rice landraces cultivated following on-farm conservation practices. Characterization Germplasm of China, Grant/Award Number: 2017NWB036-01, 2017NWB036-12-2 a number of significant and interesting associations between loci and environmental factors, which implies adaptation to local environment. Our results provide insights into short-term evolutionary processes and shed light on the underlying mechanisms of on-farm conservation.

K E Y W O R D S
genetic diversity, on-farm conservation, rice landraces, selection sweep, short-term domestication of genome-wide selection footprints and genetic diversity may help reveal the short-term evolutionary processes and contribute to our understanding of the mechanisms of on-farm conservation.

| Sample collection
We performed a large scale study using 600 on-farm conserved rice landraces, including 332 accessions collected in 1980 and 268 accessions collected in 2007 from the ethnic minority regions of Yunnan Province, China, in our previous study (Cui et al., 2016). These accessions were from five different ecological zones, covering a wide geographic distribution and diverse growing conditions, and represent most of the diversity of on-farm conserved landraces in Yunnan.
In this study, a core subset of the collection of 600 landraces (108 accessions, including 56 accessions collected in 1980, and 52 accessions collected in 2007; Table S1 and Figure S1) was sampled for whole-genome sequencing and construction of neighbor-joining (NJ) trees using 48 SSRs (Table S2) that represented more than 97% of the genetic diversity of a total of 600 accessions in the DNA information obtained ( Figure S2).

| DNA sequencing and mapping
To construct the DNA libraries, genomic DNA from the fresh leaves of each of the 108 landraces was extracted using the CTAB method (Murray & Thompson, 1980). All the libraries were sequenced by the high-throughput Illumina sequencing platform (HiSeq 2500) using standard procedures of Novogene Bioinformatics Institute (Beijing, China). The 500-bp paired-end libraries were constructed according to the manufacturer's introductions (Illumina). Using a whole-genome shotgun strategy, we generated a total of 2.28 Gb of paired-end reads of 125-bp length (284.77 Gb of sequences). Prior to mapping, all reads were preprocessed for quality control and filtered out using our in-house script in PERL according to the following criteria: (a) any reads with ≥10% unidentified nucleotides (N); (b) any reads with >10 nt aligned to the adapter sequence, allowing ≤10% mismatches; (c) any reads with >50% bases having phred quality <5; and (d) putative PCR duplicates generated by PCR amplification in the library construction process (i.e., two paired-end reads were the same). A total of 280.66 Gb high-quality sequences was kept and mapped to the rice reference genome (ftp://ftp.ensem blgen omes. org/pub/plant s/relea se-23/fasta/ oryza_sativ a/dna/Oryza_sativa. IRGSP-1.0.23.dna_sm.tople vel.fa.gz) using BWA-MEM with default parameters except for the "-t -k 32 -M -R" option ). Alignment bam files were sorted using SAMtools , and duplicated reads were removed. Sequencing coverage and depth for each sample were calculated, and the average sequencing coverage was 88.65% (Table S3). All genome sequence data have been deposited in the NCBI Sequence Read Archive under project accession number: PRJNA342109.

| Detection of variation
We performed variation calling for the 108 accessions using a Bayesian approach implemented in the package SAMtools (Li et al., 2009b). The "mpileup" command was used to identify SNPs and Indels with the parameters "-m 2 -F 0.002 -d 1,000." In the downstream analysis, the raw population variations were filtered by requiring a respective minimum and maximum coverage depth of 4 and 1,000, a minimum RMS (root mean square) mapping quality score of 20, and missing genotype >50% of the 108 rice accessions. Consequently, a total of 2,771,245 of SNPs and 432,174 indels (204,215 insertions and 227,959 deletions, ranging from 1 to 5 bp in length) were retained for downstream analyses (Tables   S4-S5). Finally, all of the genomic variations were annotated using ANNOVAR software (Wang, Li, & Hakonarson, 2005).

| Population structure
To estimate individual admixture assuming different numbers of clusters, the population structure was investigated using ADMIXTURE (Alexander, Novembre, & Lange, 2009) with a maximum likelihood method. We increased the coancestry clusters spanning from 2 to 8 and ran the analysis with 10,000 iterations ( Figure S3). The optimal k-value was determined based on crossvalidation error ( Figure S4).

| Principal components analysis
The software package GCTA (Yang, Lee, Goddard, & Visscher, 2011) was used for principal component analysis with biallelic SNPs of the 108 individuals. We plotted the first two significant components.
To some extent, the discrete points reflect the real structure of population.

| Phylogenetic tree
We constructed a neighbor-joining tree using a matrix of pairwise genetic distances of all individuals, calculated by TreeBest (http:// trees oft.sourc eforge.net/treeb est.shtml ), which first runs a number of independent phylogenetic methods and then creates a combined tree using a stochastic context -free grammar approach. The bootstrap test (Efron 1982;Felsenstein, 1985) was used for evaluating the reliability of a neighbor-joining tree. For this method, the same number of sites was randomly sampled with replacement from the original sequences, and a phylogenetic tree was constructed from the resampled data. This process was repeated, and the reliability of a sequence cluster was evaluated by its relative frequency of the appearance in bootstrap replications (Kumar, Tamura, & Nei, 1994).
The bootstrap was set to 1,000 times to assess the branch reliability using TreeBest.

| Selection analyses
A sliding-window approach (10-kb windows sliding in 1-kb steps) was applied to quantify polymorphism levels (θ π , pairwise nucleotide variation as a measure of variability) and genetic differentiation (F ST ) between rice landraces from 1980 to 2007 using VCFtools (Danecek et al., 2011).
To detect regions with significant signatures of selective sweep, we Z-transformed the distribution of F ST and calculated the log value of θ π ratios (θ π 1980 /θ π 2007 ). We used an empirical procedure and selected windows with significantly high log 2 (θ π 1980 /θ π 2007 ) and ZF ST values from their respective empirical distributions and considered these as regions with selective-sweep signals along the genome. The values were at the right 5% tails, where the log 2 (θ π 1980 /θ π 2007 ) thresholds were 1.02 and 1.41 for indica and japonica, respectively, and ZF ST thresholds were 1.92 and 2.00, respectively. Then, adjacent selective-sweep regions were merged together to form a whole putative selected region. To further confirm the selection signals identified, we performed a genome scan using a cross-population composite likelihood approach XP-CLR (Chen, Patterso, & Reich, 2010) to calculate the candidate selective-sweep regions. A 10-kb sliding window with 1-kb steps across the whole genome was used for scanning. The highest XP-CLR values, accounting for 5% of the genome (p < .05), were considered as selected regions.

| Annotation analysis of selected regions
We annotated genes in selected genomic regions using the rice genome and a total of 623 and 537 genes in indica and japonica, respectively. They were identified to undergo strong selection sweep.
These genes were submitted to Gene Ontology (GO) (Ashburner et al., 2000) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa & Goto, 2000) databases for enrichment analyses. A false discovery rate (FDR)-corrected binomial distribution probability approach was used to test significant enriched gene function at a level of p < .05 (Benjamini & Hochberg, 1995).

| Genome-wide association study of agronomic traits
We performed a genome-wide association study (GWAS) for ten agronomic traits: days to heading, plant height, panicle length, effective panicles, grains per panicle, grain length, grain width, grain length-to-width ratio, spikelet fertility, and 1,000-grain weight. All Association analyses were conducted using MLMs (Yu et al., 2006) with TASSEL v.5.0 (http://www.maize genet ics.net/tassel). A kinship matrix (K-matrix), the pairwise relationship matrix calculated by TASSEL v.5.0, and the Q-matrix as a correction for population structure were used in the MLM association models to calculate Pvalues to associate each SNP marker with the trait of interest and to avoid spurious associations by TASSEL v.5.0.

| Genomic variation
We generated 280.66 Gb of high-quality sequences from 108 onfarm conserved rice landraces, which consisted of 56 accessions collected in 1980 and 52 accessions collected in 2007 (Table S1 and Figure S1). After alignment, 97.17% of the reads covered 88.65% of the reference genome, with an average of 6.94-fold depth (Table S3).
We subsequently identified 2,771,245 credible SNPs from the 108 accessions. Among all SNPs, 1,935,246 (69.83%) were located in intergenic regions and only 171,050 (6.17%) were located in coding regions (Table S4). The ratio of nonsynonymous to synonymous substitutions of the SNPs was calculated to be 1.20, which is consistent with previous reports . We observed low heterozygosity in all accessions, reflecting the lack of cross-pollination owing to high levels of inbreeding ( Figure S5).
In addition, a total of 204,215 insertions and 227,959 deletions were identified, ranging from 1 to 5 bp in length ( Figure S6). We annotated 9,000 (2.08%) indels located in coding regions of the rice genome.
Of these, 4,044 indels could have large effects on the protein-coding sequences, including 3,930 indels that cause frameshift changes, 78 indels that lead to the immediate creation of a stop codon, and 36 indels that lead to the immediate elimination of a stop codon (Table S5).

| Population structure and genetic diversity
We explored phylogenetic relationships among the 108 rice landraces through whole-genome SNP analysis ( Figure 1a). As expected, the phylogenetic tree showed that all the rice landraces were clearly  (Tables S6-S7).

| Genomic imprints of selection in rice landraces
To they could be subspecies-specific selective-sweep regions that were independently selected in either subspecies  Moreover, results of both methods indicate that only two common regions underwent selective sweeps in both indica and japonica. To further confirm these selection signals identified above, we performed a genome scan using a cross-population composite likelihood approach XP-CLR (Chen et al., 2010) to calculate the candidate selective-sweep regions. We found that more than 90% of the selective-sweep regions overlapped with genomic regions identified as showing selective sweeps by the XP-CLR approach (Figure 4a, b), indicating that most of the selection regions can be supported by the XP-CLR approach and is thus quite reliable. These regions exhibited significant differences (p < 10 -15 , Mann-Whitney U test) in log 2 (θ π 1980 /θ π 2007 ) and ZF ST values compared with the genomic background in both indica and japonica ( Figure S9). These regions also had lower levels of nucleotide diversity and extremely negative Tajima's D-values (Table S10). This result quantitatively reflects the importance of selection in shaping genomic variation, resulting in phenotypic and/or environmental adaptations in rice landraces. In the selective-sweep regions, we identified 623 and 536 protein-coding genes in the respective indica and japonica genomes, which are expected to represent targets of selection. To assess the functions of these candidate genes, we used GO (Ashburner et al., 2000) and KEGG (Kanehisa & Goto, 2000) functional categories based on orthologs to annotate them. We found that functional categories related to morphology, growth and development, transcriptional regulation, and metabolic processes were enriched (Tables S11-S14).

| Selective-sweep regions included important agronomic genes
As expected, we found a number of candidate genes included in the potential selective-sweep regions were important agronomic genes/quantitative trait loci (QTLs) released by Q-TARO (Yonemaru, Yonemaru, Yamamoto, & Yano, 2012) (Tables S15-S17). For example, a strong signature of selection (log 2 [θ π 1980 /θ π 2007 ] = 2.29, ZF ST = 3.55, XP-CLR = 144.28) was observed for a region on chromosome 6 of the indica genome, which includes the SSIIa (Similar to Starch synthase IIA) gene (Figure 3c). This gene is responsible for the eating quality of rice and plays an important role in grain starch synthesis (Kawakatsu, Yamamoto, Touno, Yasuda, & Takaiwa, 2009;Zhang et al., 2011). Patterns at SSIIa revealed reduced diversity in the rice landraces collected in 2007 when compared with diversity of that collected in 1980 (Figure 3d). In each clade of the SSIIa tree, we found most landraces collected in the same year clustered together, thus indicating a tendency of differentiation between landraces collected in 1980 and 2007 (Figure 3e). This result may also provide evidence for a potential selective-sweep region. Two of the other selective-sweep candidates contained major genes related to biotic stress in rice. One candidate, located at 13.045-13.060 Mb on chromosome 6 (log 2 [θ π 1980 /θ π 2007 ] = 3.69, ZF ST = 2.01, XP-CLR = 39.35), encompassed the entire coding region of Pid3 (Pyricularia oryzae resistance-d3), which confers blast resistance to Magnaporthe oryzae (Chen et al., 2011;Shang et al., 2009;Xu et al., 2014); the other candidate, located at 8.075-8.114 Mb on chromosome 11, includes NLS1, which encodes a typical CC-NB-LRR-type protein and is related to resistance to bacterial pathogens . Meanwhile, the region on chromosome 1 containing OsPIN3t showed a selective-sweep signal (log 2 [θ π 1980 /θ π 2007 ] = 1.77, ZF ST = 5.50, XP-CLR = 28.51) in the japonica genome and a SSIIalike gene tree (Figure 3c and 3e). This gene is involved in drought stress response and drought tolerance (Miyashita, Takasugi, & Ito, 2010;Wang et al., 2009;Zhang et al., 2012). Additionally, a selection signal (log 2 [θ π 1980 /θ π 2007 ] = 2.29, ZF ST = 3.55, XP-CLR = 19.28) was detected in a region on chromosome 6 of the japonica genome, which includes the OsLIC1 gene (Figure 3c). This gene is involved in the regulation of rice plant architecture and grain yield (Wang et al., 2008;Zhang et al., 2012). In total, we found 72 cloned agronomic genes in selective-sweep regions of the indica and japonica genomes.
Notably, these selective-sweep regions were extremely enriched in grain yield and abiotic stress resistance gene categories, followed by the plant-type category ( Figure 5). Furthermore, nearly all of our selective-sweep regions were overlapped with previously reported agronomic QTLs (Tables S16-17). However, the sizes of the selective-sweep regions were smaller than sizes of the reported QTLs (Tables S16-S17), which indicates that these defined regions will be helpful in identifying genes that govern important agronomic traits.
To further annotate the selective-sweep regions, we performed GWAS for ten agronomic traits (Table S18 and Figures S10-S12). In total, we identified 90 significant signals for six traits, some of which were also mapped for the same trait in previous studies (Table S18).
GWAS signals associated with 1,000-grain weight and effective panicles were detected at the previously reported qyd1 locus on Chr. 1 and the qHbd3 locus on Chr. 11 (Table S18; Yonemaru et al., 2012), respectively. We also detected a GWAS signal responsible for plant height at the qtl12.1 locus (Figure 4a, c) and a GWAS signal corresponding to 1,000-grain weight at the OsLIC1 locus (Figure 4b, c) (Yonemaru et al., 2012). We found that these two signals underwent selection during the short-term domestication of indica and japonica, respectively ( Figure 4). Notably, the strongest GWAS signal responsible for effective panicles overlapped with a selective sweep originating on Chr. 4 during the short-term domestication of indica (Figure 4a, c). In total, we identified 13 GWAS signals overlapped with the selective sweeps during short-term domestication (Table S18).

| Genomic signatures of local adaptation
To screen genomes for signatures of local adaptation, we tested for associations between genetic variation and environmental gradients using latent factor mixed models (LFMM) (Frichot et al., 2012). A total of 986 SNPs that obtained |z|-scores greater than 5 (p < 10 -10 ) (Table S19) were associated with elevation. Among these SNPs, we found the SNP (|z| = 5.25) located at 29.62 Mb on chromosome 7 was significantly associated with elevation.

| D ISCUSS I ON
To date, few studies have focused on the short-term domestication process and the population dynamics underlying short-term domestication are still poorly understood (Cui et al., 2016;Sun, Cao, Ma, Chen, & Han, 2012;Xu et al.., 2011;Yan et al.., 2012). This study used genome sequencing data (including millions of polymorphisms) to provide an unprecedented opportunity to comprehensively identify and characterize population dynamics in on-farm conserved samples of rice genetic resources and to reveal the selection footprints in rice landraces during the short-term domestication process.
Rice (Oryza sativa L.) was domesticated from wild rice (Oryza rufipogon) thousands of years ago (Oka, 1988). During this long-term period of domestication, its genetic diversity was reduced by up to 80% from the wild ancestor due to strong selection and genetic -log 10 P -log 10 P -log 10 P -log 10 P Grain weight Grain weight (a) (b) F I G U R E 5 Functional category of cloned genes in selectivesweep regions bottlenecks (Londo et al., 2006 (Cui et al., 2016;Li et al., 2015). Similar results were found in a study of sorghum (Deu et al., 2010) in which crop management by farmers globally preserved sorghum genetic diversity in Niger over a 26-year period (1976-2003). We further cal-  (Tables S8, S9). To assess whether these genomic islands contained known loci that control important agronomic traits, we compared their locations with previously mapped QTLs and cloned genes. We found a number of selectivesweep regions overlapped with previously reported agronomic QTLs or contained cloned genes, such as OsSDR (Kim et al., 2009), SSIIa (Kawakatsu et al., 2009;Zhang et al., 2011), OsPIN3t (Miyashita et al., 2010;Wang et al., 2009;Zhang et al., 2012), and OsLIC1 (Wang et al., 2008;Zhang et al., 2012), related to grain yield, eating quality, abiotic stress resistance, and plant type (Table S15). Furthermore, we found that these genes located in selective-sweep regions were enriched in the grain yield category, followed by the abiotic stress resistance and plant-type categories, while enrichment was quite low in the eating quality category ( Figure 5). The results indicate yield and abiotic stress resistance traits have been frequently selected in rice landraces during the short-term domestication. The reason is that the local farmers preferred not only for highly productive rice landraces but also for ones more resistant to tolerant of extreme environmental conditions. Interestingly, a number of genes, such as FRRP1 (Flowering-Related RING Protein 1) (Du et al., 2016), were also under selection during rice domestication from its wild ancestor as reported in a previous study . This suggests that a small number of genes with extremely large phenotypic effects have been targeted repeatedly by selection during domestication. Unexpectedly, we found only two common regions underwent selective sweeps in both indica and japonica, perhaps due to the divergence between the indica and japonica genomes. Additionally, we found multiple genes controlling the same traits, such as UbL401  and CYP703A3  for sterility; NLS1  and OsEP3A (Singh, Giri, Singh, Siddiqui, & Nandi, 2013) for bacterial blight resistance; AM1 (Sheng et al., 2014) and OsPIN3t (Miyashita et al., 2010;Wang et al., 2009;Zhang et al., 2012) for drought tolerance; and OsHAK1 (Chen et al., 2015) and OsHKT4 (Wang et al., 2015) for salt tolerance, in indica or japonica selective-sweep regions likely affected by shortterm domestication. This phenomenon showed signatures of parallel selection in the genomes of the two subspecies.
In summary, we provide a large dataset of genomic variation observed in on-farm conserved rice landraces in this study. We identified millions of SNPs in representative rice landraces collected in 1980 and 2007, providing an unprecedented opportunity to finely resolve genome-wide genetic diversity and selection footprints in landraces grown under on-farm conservation conditions during a short-term period of domestication. We demonstrated that farmer selection in combination with natural selection played an important role and strong selection can leave its footprint on genomewide polymorphism patterns. We propose that on-farm conserved rice landraces during short-term domestication had a highly polygenic basis with many loci responding to selection rather than a few loci with critical changes in response to selection. Moreover, the loci affecting important agronomic traits and biotic or abiotic stress response were particularly targeted. Our integrative analyses demonstrate that the rice landraces grown under on-farm conservation conditions have the potential to be a dynamic, evolving genetic system that can undergo continuous differentiation and variation in response to evolutionary pressures, both natural and those imposed by farmers. On-farm, in situ conservation is a recommended strategy to conserve crop genetic resources.

DATA AVA I L A B I L I T Y S TAT E M E N T
All genome sequence data have been deposited in the NCBI Sequence Read Archive (SRA) under project Accession Number: PRJNA342109.