Genetic diversity and population structure of Quercus fabri Hance in China revealed by genotyping‐by‐sequencing

Abstract Analysis of genetic diversity and population structure among Quercus fabri populations is essential for the conservation and utilization of Q. fabri resources. Here, the genetic diversity and structure of 158 individuals from 13 natural populations of Quercus fabri in China were analyzed using genotyping‐by‐sequencing (GBS). A total of 459,564 high‐quality single nucleotide polymorphisms (SNPs) were obtained after filtration for subsequent analysis. Genetic structure analysis revealed that these individuals can be clustered into two groups and the structure can be explained mainly by the geographic barrier, showed gene introgression from coastal to inland areas and high mountains could significantly hinder the mutual introgression of genes. Genetic diversity analysis indicated that the individual differences within groups are greater than the differences between the two groups. These results will help us better understand the genetic backgrounds of Q. fabri.

status of Q. fabri biodiversity and its formation and evolution history, which is of great significance for the protection of Q. fabri germplasm resources in China. Berg and Hamriek (1993) and Mattila, Pakkanen, Vakkari, and Raisio (1994) have studied the genetic diversity of Q. laevis and Q. robur populations, respectively. Kim, Lee, and Hyun (1993) studied 28 populations of six oak trees including Q. acutissima, Q. aliena, Q. dentata, Q. mongolica, Q. serrate, and Q. variabilis. Gomory, Yakovlev, Zhelev, Jarmila, and Paule (2001) analyzed 25 Q. petraea and Q. acutissima populations. Chung et al. (2002) analyzed the population of Q. acutissima. Cavender- Bares and Pahlich (2009) analyzed the populations of Q. virginiana and Q. geminata. Craft and Ashley (2006) conducted genetic differentiation studies on three white oak populations in northeastern Illinois. The results of these studies indicate that for the same species of oak, the genetic diversity between populations is lower, which within the population is higher, and there is a certain correlation between the genetic diversity and distribution range, population size, interspecific hybridization ability, and habitat conditions; and the genetic diversity of different oak is very different, which is caused by the difference in plant life history and evolution history (Berg & Hamriek, 1993;Cavender-Bares & Pahlich, 2009;Chung et al., 2002;Craft & Ashley, 2006;Gomory et al., 2001;Kim et al., 1993;Mattila et al., 1994).

Related researchers in China have also conducted genetic di-
versity studies on some native oak varieties. Li, Chen, and Li (1997), Li, Chen, and Li (1998) studied the genetic structure of Q. aquifolioides and Q. tungmaiensis populations. ,  and Li, Gu, and Zhou (2003) analyzed the populations of Q. liaotungensis and Q. mongolica. Zhou, Guo, Yang, and Zhang (2003), Zhou, Lu, Zhang, and Zhang (2008) studied the population of Q. variabilis. The results also indicate that the genetic variation of native oak populations mainly exists within the population, the level of genetic variation among the populations is lower, and the gene exchange between the populations is frequent; compared with foreign oaks, the genetic diversity of some native oaks in China is lower (Greef, Triest, Cuyper, & Slycken, 1998;Hamrick, Godt, & Sherman-Broles, 1992;Kremer & Petit, 1993;Li et al., 2003;Wang, Hu, & Zhou, 2005). For Q. fabri, the latest related research is to develop 12 polymorphic and 2 monomorphic microsatellite loci, and the genetic diversity of Q. fabri has not been studied in depth (Xiao, Chen, Bao, Wang, & Li, 2016).
In recent years, with the application and promotion of next-generation sequencing (NGS), the simplified genome sequencing technology genotyping-by-sequencing (GBS) has gradually come to be used to analyze the genetic evolution of multiple species (Wu & Blair, 2017). GBS technology can quickly identify a considerable number of high-density SNP markers, and these SNP markers can be used for genetic map construction, genome-wide association analysis, phylogenetic analysis, and diversity analysis (Lee et al., 2019;Siadjeu, Mayland-Quellhorst, & Albach, 2018). In addition, GBS technology can also be used for species with and without reference genomes. At present, GBS technology is becoming increasingly prevalent in the analysis of genetic diversity of plants and animals such as oil palm (Elaeis guineensis) (Xia et al., 2019), pepper (Capsicum spp.) (Pereira-Dias, Vilanova, Fita, Prohens, & Rodríguez-Burruezoet, 2019), Dendrobium (Ryu et al., 2019), and Misgurnus anguillicaudatus (Yi, Wang, & Zhou, 2019).
Considering that the genetics of Q. fabri are poorly understood despite its important uses, a diversity study of a collection of 158 Q. fabri individuals from 13 natural populations in China was presented using GBS. Our goals are to assess the population structure, genetic diversity, and relationships of Q. fabri.

| Plant material
In total, 158 individuals of Q. fabri were used in this study. All of these individuals were collected from 13 natural distribution areas of Q. fabri in China (Figure 1, Tables 1 and 2).

| DNA extraction, library preparation, and sequencing
Young leaves from 158 individuals of Q. fabri were collected, and the genomic DNA was extracted using a plant genomic DNA extraction kit (TIANGEN BIOTECH, Beijing, China). The purity of the extracted DNA was detected using a NanoDrop ® spectrophotometer (ND-1000, Thermo Fisher Scientific, Wilmington, NC, USA), and DNA electrophoresis was simultaneously performed in a 1% agarose gel to ensure DNA integrity. A Qubit ™ 2.0 Fluorometer (Invitrogen, Carlsbad, CA, USA) was used to accurately measure the DNA concentration. High-quality DNA was used for GBS library construction and sequencing.
A total of 1.5 µg of DNA from each sample were used for library construction. The enzyme digestion evaluation was based on the Quercus robur reference genome, and the capture of 10w tags was used as the evaluation standard. Three different enzyme digestion protocols were selected for the evaluation, namely EcoRI and MseI, HaeII and MseI, and MspI and MseI. The results of enzyme digestion showed that the number of tags captured by HaeII and MseI was the largest and the coverage of the reference genome was higher (Table S1). Thus, the restriction enzymes MseI and HaeII (New England Biolabs, NEB) were used to digest the genomic DNA. The Illumina sequencing adapters and sample-specific barcodes were ligated to both ends of the digested DNA fragment using T4 DNA ligase (New England Biolabs, NEB), and each sample was subjected to DNA fragment amplification and pooled. The 350-400 bp DNA fragments were recycled by agarose gel electrophoresis and purified. The purified DNA fragments were subjected to paired-end 150 bp sequencing on the Illumina HiSeq sequencing platform.

| GBS data analysis and SNP calling
The sequencing data were divided into corresponding samples according to their barcode, and all the raw reads were subjected to strict quality control screening to remove adapters and low-quality short reads, including those where the "N" residues accounted for more than 10% of the read, or the percentage of low-quality (Q ≤ 5) was greater than 50% in the read in single-end sequencing. The selected high-quality clean reads were aligned against the Quercus robur reference genome (https://www.ncbi.nlm.nih.gov/biopr oject/ PRJEB 19898) using the Burrows-Wheeler Aligner (BWA version 0.7.8-r455) set to the default settings. The data mapping rate ranged from 84.78% to 96.19%, and the coverage 1X ranged from 9% to 15.62%. The alignment result was good, and the reference sequence was available (Table S2). SAMtools (version 1.8) was used to calculate the number of reads that passed quality control criteria and were successfully aligned to the reference. The percentage of sequence tags that overlapped with gene regions was assessed by applying BEDtools (version 2.25.0). Then, the aligned sequence tags were converted into BAM format for storage in a TOPM (TagsOnPhysicallMap) file. SNP calling was also performed with SAMtools. The high-quality SNPs were obtained by filtering out low-quality SNPs on the basis of the minimum minor allele frequency (mnMAF < 0.01) and the missing data per site (MDpS > 10%). The transitions/transversions (ts/tv) ratio was calculated using BCFtools (version 1.8).

| Principal component analysis
The preliminary analysis of the population structure was performed

| Population structure analysis
Use ADMIXTURE 1.23 to assign each sample to the most likely cluster using different numbers of clusters (K values) (Alexander, Novembre, & Lange, 2009). A population structure analysis of 1-10 clusters was set up (K = 1, 2, 3,…, 10), and the cross-validation error (CV error) calculated by ADMIXTURE 1.23 with the sum of the values of 10 permutations was used to evaluate the most likely number of clusters.

| Phylogenetic tree construction
A phylogenetic tree was constructed using the neighbor-joining method in the MEGA program (version 7.0) based on a pairwise genetic distance matrix of 158 individuals, which was obtained using TreeBest (version 1.9.2) with 1,000 bootstrap replicates (Kumar, Stecher, & Tamura, 2016).
AMOVA is used to calculate the proportion of the variation between populations or within populations within the total variation. The number of groups determined with ADMIXTURE 1.23 was used for AMOVA in GenAlEx (version 6.503) (Peakall & Smouse, 2012).

| GBS data analysis summary
A collection of 158 Q. fabri individuals were successfully sequenced using the Illumina HiSeq sequencing platform, and an average of 5.07 million clean reads per Q. fabri individuals was generated with an average depth per individual of 9.53 (Table S2). The percentage of GC content was approximately 40%, which was within the normal range. The sequencing quality was high (Q20 ≥ 90%, Q30 ≥ 85%), so subsequent analyses could be performed (Table S3). The sequencing result for each individual was aligned with the reference genome, and the similarity between them reached the requirement for resequencing analysis. The average degree of sequence coverage was 11.89% for at least single-base coverage, and the average degree of coverage with at least 4-base coverage was 7.78% (Table S2) SNPs, the number of SNPs belonging to base transitions was 2.808 times the number of SNPs belonging to base transversions, indicating that most of the SNPs were of the base transition type (Table S4).

| Genetic structure and phylogenetic analysis
We used ADMIXTURE to study genetic relationships and population structure among the 158 Q. fabri individuals that originating from 13 populations of Q. fabri in China. The most likely number of clusters was evaluated by the cross-validation error as K = 2 (CV error = 0.238), and the error was lower than K = 1 and K = 3 (CV error = 0.240 and 0.240, respectively; Figure 2a). We used the criterion to group each individual: When at least 60% of its inferred pedigree came from a certain group, we assigned it to the group.
The grouping results showed that the group 1 contained the popu-

| Population genetic diversity analysis
A genetic divergence was observed in the two groups and expected heterozygosity (He) among individuals in each group (Table 3). The higher value of expected heterozygosity was found in group 2 with a value of 0.170, and the lower value was found in group 1 with a value of 0.163. However, the higher value of observed heterozygosity was found in group 1 with a value of 0.093, and the lower value was found in group 2 with a value of 0.090 (Table 3). The AMOVA revealed that 12.36% of the total variation occurred between the

| D ISCUSS I ON
Quercus fabri Hance is specifically distributed in various subtropical provinces in China and is often grown in the woods of hills and mountainous areas at an altitude of 50-1900 meters. Although Q. fabri is extremely important in wood production and forest by-product production, the Q. fabri populations in China are declining. Therefore, it is necessary to carry out research work on genetic diversity analysis, germplasm collection, evaluation, and conservation. In this study, we obtained 459,564 high-quality SNPs for subsequent analysis using GBS technology and found that the number of transition SNPs was significantly more than transversions in Q. fabri, suggesting that during natural selection, individuals with transition mutations were more adaptive than individuals with transversion mutations (Luo, Iaffaldano, Zhuang, Fresnedo-Ramirez, & Cornish, 2017). This result may also be due to the large number of base transitions that have caused only synonymous mutations in the sequence encoding the protein (Guo et al., 2017).

| Population genetic structure of Quercus fabri Hance
The ADMIXTURE analyses allowed the assignment of populations and Yuechengling Mountains in the middle and southeast, respectively, isolate the group 2 distribution area. The climate in group 1 distribution is warm, with small temperature changes, warm in winter and cool in summer; abundant precipitation and high humidity, belonging to a subtropical humid monsoon climate. The suitable climate makes the vegetation grow lush, and the forest coverage in the group 1 distribution area is extremely high, which may also be one of the main reasons for separating the two ancestors and preventing gene exchange. The group 2 distribution is mostly in the middle and lower reaches of the Yangtze River.
The terrain is low and flat, and the altitude is mostly below 50 m.
The temperature is also significantly higher than that of group 1 distribution areas. The differences in the genetic structure of Q. fabri in different regions may also be the changes in genetic information that they have to occur in order to adapt to different environments.
The differences in environment also lead to differences in population and consumption behavior. For example, the population of the eastern plains of China is significantly larger than that of the southwestern mountains, and because of the large number of metropolises in the eastern region, the economy is significantly more developed than the southwestern region, and the frequency of people's activities is also significantly higher, which may lead to the genetic information exchange between Q. fabri resources is more frequent, and the genetic structure among populations in group 2 is very similar. The genetic structure of several populations in the southwestern mountainous area of group 1, such as GXLZ, GZZY, and GZLP, is quite different, which may also be caused by mountain barriers and people's lower activity frequency. During this period, the main purpose of Q. fabri is to burn charcoal. In order to withstand the cold, people in the southwestern mountainous areas have cut down Q. fabri trees to burn, causing a sharp decline in the Q. fabri resources of this area. This result also suggests that we urgently need to strengthen the protection of Q. fabri resources.
The genetic structure analysis also showed that although the were closely clustered into one group, indicating that the genetic relationships between them are close, which is also related to their geographical locations (Figure 3).
In view of the large difference in genetic structure between the two groups, it is necessary for Q. fabri breeding programs to incorporate the other genetic basis by utilizing germplasm from the other group particularly from the population with the farthest geographical distance between them as breeding parents. Due to the potential interspecific hybridization between oak trees, it is necessary to introduce other oaks germplasm by artificial hybridization in the future in order to broaden the genetic basis of Q. fabri germplasm.

| Genetic diversity of Quercus fabri Hance
Observed heterozygosity (Ho) is the probability that the alleles of two randomly selected samples are not the same and are the ratio of the number of heterozygous individuals observed to the total number of individuals in the sample. Expected heterozygosity (He) describes the expected proportion of heterozygous genotypes, which can be calculated according to the method provided by Harris and DeGiorgio (2017). Formally, He is the probability that a pair of randomly selected alleles from a population is different, which indicates the evolutionary pressure of the allele of an individual and the probability of a certain locus to mutate over a period of time (Shete, Tiwari, & Elston, 2000). In this study, we found that although both groups had similar He, group 2 was slightly higher than group 1, meaning that group 2 was more diverse than group 1 since He depends on both the number of alleles and the abundance of the alleles in a group.
It is generally believed that populations with a heterozygosity higher than 0.5 are not subject to high-intensity manual selection or natural selection and have high genetic diversity. This study detected that the observed heterozygosity and expected heterozygosity of these Q. fabri populations are both less than 0.5. The reason may be mainly because Q. fabri is a self-pollinating plant, and its male flowers have a sufficient amount of pollen and can often produce progeny through selfing, so the heterozygosity of Q. fabri is relatively low.
The results of AMOVA indicated high level of genetic diversity within groups and meant that the individual differences within groups are greater than the differences between the two groups.
The potential interspecific hybridization between Q. fabri and other oaks in the same location was considered the main reason for the variation within groups. Through the field survey accompanying the sampling process, we found that there were many other oak resources located around these populations, such as Quercus acutissima Carruth., Q. variabilis Bl., Q. aliena Bl., and Q. serrata Thunb.
Interspecific hybridization is likely to occur between Quercus species, resulting in genetic communication.

| CON CLUS IONS
The application of GBS analysis for the assessment of population genetic diversity and genetic structure has become increasingly common and is usually necessary before decisions about the preservation and utilization of species resources are made. In this study, we analyzed the genetic diversity and genetic structure of 13 Q. fabri populations and found that these populations can be clustered into two groups and the genetic structure was explained mainly by their geographic location. The genetic relationship of adjacent populations was more similar, and infiltration of genetic background from coastal areas to inland areas was observed, where the hindrance of genetic infiltration by mountainous areas was obvious. In addition, the individual differences within groups are greater than the differences between the two groups, which benefits genetic improvement and breeding of Q. fabri cultivars.

DATA A R C H I V I N G S TAT E M E N T
The datasets supporting the conclusions of this article are available in the NCBI Sequence Read Archive (SRA) database under accession number PRJNA564867. http://www.ncbi.nlm.nih.gov/biopr oject/ 564867

ACK N OWLED G M ENTS
This work was financially supported by "the Fundamental Research Funds for the Central Non-profit Research Institution of Chinese Academy of Forestry (CAFYBB2018ZB001)."

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.