Development of a cost‐effective single nucleotide polymorphism genotyping array for management of greater yam germplasm collections

Abstract Using genome‐wide single nucleotide polymorphism (SNP) discovery in greater yam (Dioscorea alata L.), 4,593 good quality SNPs were identified in 40 accessions. One hundred ninety-two of these SNPs were selected to represent the overall dataset and used to design a competitive allele-specific PCR array (KASPar). This array was validated on 141 accessions from the Tropical Plants Biological Resources Centre (CRB‐PT) and CIRAD collections that encompass worldwide D. alata diversity. Overall, 129 SNPs were successfully converted into cost‐effective genotyping tools. The results showed that the ploidy levels of accessions could be accurately estimated using this array. The rate of redundant accessions within the collections was high, in agreement with the low genetic diversity of D. alata and its diversification by somatic clone selection. The overall diversity resulting from these 129 polymorphic SNPs was consistent with the findings of previously published studies. This KASPar array will be useful for collection management and ploidy level inference, while complementing accurate agro‐morphological descriptions.


| INTRODUCTION
Greater yam (Dioscorea alata L.) is one of the major cultivated yam species (Dioscorea spp.) and the most widely spread among tropical and subtropical regions. The high importance of D. alata for food security has prompted the establishment of several international and national ex situ collections. Due to the limited shelf-life of stored tubers, yam genetic resources are conserved in vitro and/or in the field. All of these repeated manipulations are time-consuming and may affect long-term conservation. Quality control of genotype purity and general collection management is mainly based on morphological descriptors (IPGRI/IITA, 1997; Mahalakshmi et al., 2007). However, these descriptors are not reliable enough to rationalize ex situ D. alata collections. Indeed, several studies have revealed that morphological variations are not necessarily linked to geographic origin or genetic lineage (Arnau et al., 2017; Lebot, Trilles, Noyer, & Modesto, 1998; Vandenbroucke et al., 2016). Complementary characterization tools are thus required for the conservation and dynamic management of ex situ collections related to germplasm exchange, the development of core collections, or the identification of future parents for breeding programs. D. alata is also a polyploid species with ploidy levels of 2n = 2x, 3x, or 4x and a basic chromosome number of x = 20 (Arnau, Némorin, Maledon, & Abraham, 2009). Ploidy level detection is consequently a prerequisite for the identification of possible parents, as crosses between the different ploidy levels can fail (Nemorin et al., 2013).
These studies generated essential information on the diversity and representativity of the germplasm collections. However, these tools were not tailored for routine collection management. They were found to be either poorly discriminating within the D. alata species or complex and not cost-effective to use. Besides, the development of high-throughput methods for genome-wide variant detection, such as genotyping-by-sequencing (Davey et al., 2011), paired with cost-effective SNP assays (Broccanello et al., 2018) such as KASPar, can lead to the development of appropriate markers for collection management. This approach has been successfully implemented in maize (Semagn et al., 2012), chickpea (Hiremath et al., 2012), Citrus (Garcia-Lor, Ancillo, Navarro, & Ollitrault, 2013), pigeon pea (Saxena et al., 2014), and Brassica rapa (Su et al., 2018). Given the recent release of yam (Dioscorea spp.) genomic resources (Saski, Bhattacharjee, Scheffler, & Asiedu, 2015; Tamiru et al., 2017), the design of such markers for D. alata collection management would be worthwhile. Indeed, once developed, they do not require any specific bioinformatics or wet chemistry skills. The results contain few erroneous and missing data and can be easily analyzed and interpreted.
The main objectives of this study were (a) to identify genome-wide polymorphic SNP markers, (b) to develop a cost-effective SNP genotyping array using KASPar technology, and (c) to test its use as a tool for managing yam ex situ collections.

| Materials
Based on a previous microsatellite marker study (Arnau et al., 2017), a set of 48 accessions representing worldwide D. alata diversity was selected and genotyped to identify polymorphic SNPs and design KASPar markers. Then, to validate these markers, 141 landraces from the Tropical Plants Biological Resources Centre (CRB-PT) and CIRAD ex situ collections maintained in the French West Indies (Guadeloupe) were used.
First, DNA extractions were performed with dried leaves from the 48 accessions as described by Risterucci et al. (2009). The genomic DNA quality was checked using agarose gel electrophoresis, and the quantity was estimated using a Nanodrop ND-1000 spectrophotometer (Thermo Scientific, Wilmington, USA). For GBS, a genomic library was prepared using the PstI-MseI restriction enzymes (New England Biolabs, Hitchin, UK) with a DNA normalized quantity of 200 ng per sample. The procedures published by Elshire et al. (2011) were adapted as described in Cormier et al. (2019).

| KASPar genotyping and allele calling
Polymorphic SNP flanking sequences (60 bp upstream and 60 bp downstream of the variant position) were selected using SNiPlay3 (Dereeper et al., 2011). To assess their putative physical positions, these sequences were then aligned with BLAST (Basic Local Alignment Search Tool) to the D. rotundata reference genome (TDr96_F1 Pseudo_Chromosome: BDMI01000001-BDMI01000021; Tamiru et al., 2017). The physical position of each SNP was defined by the best hit of its flanking sequences, using a BLAST E-value threshold of 1e−30. Finally, 192 SNPs were selected by forming 192 k-means clusters based on their relative physical distance (Euclidean distance) and selecting the SNP nearest to the centroid of each cluster using R 3.4.0 (R Core Team, 2017).
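The centroid-based selection described above can be illustrated with a minimal sketch. The authors worked in R; this Python version is only an illustration under stated assumptions: positions are treated as one-dimensional coordinates (in bp) on a single chromosome, the starting centroids are placed deterministically rather than at random, and the function name `select_representative_snps` is hypothetical.

```python
def select_representative_snps(positions, k, n_iter=100):
    """Run 1-D k-means on SNP positions and return the SNP nearest each centroid."""
    pts = sorted(positions)
    if not 1 <= k <= len(pts):
        raise ValueError("k must be between 1 and the number of SNPs")

    def assign(centroids):
        # Put every position in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        return clusters

    # Deterministic start (an assumption): centroids at evenly spaced ranks.
    centroids = [float(pts[(2 * i + 1) * len(pts) // (2 * k)]) for i in range(k)]
    for _ in range(n_iter):
        clusters = assign(centroids)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated

    clusters = assign(centroids)
    return [min(c, key=lambda p: abs(p - centroids[i]))
            for i, c in enumerate(clusters) if c]
```

Because each retained SNP sits near the center of a cluster of mapped SNPs, dense regions contribute more markers than sparse ones, so the selection mirrors the distribution of the initial SNP set rather than spacing markers evenly.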
The 192 SNPs were converted into a KASPar assay at LGC Genomics (Middlesex, UK), where primer design and wet chemistry were conducted on a validation panel of 141 landraces from the CRB-PT and CIRAD ex situ collections. Allele calling was performed from the raw fluorescence data using LGC Kluster Caller software by defining fluorescence clusters. Some accessions with known ploidy levels were used as references to identify fluorescence clusters and assess allelic dosage.

| Diversity analysis
To identify duplicate accessions and compare accessions with different ploidy levels, a dissimilarity matrix between each accession pair was computed from the percentage of shared alleles based on allele presence/absence. Then, to refine the kinship assessment, similarities between accessions with the same ploidy level were computed in the same way but using the allelic dosage. For diploid accessions, genotypes were coded as 0, 1, or 2, where the number represents the number of nonreference alleles; heterozygous genotypes assessed as polyploid during allele calling were converted to 1. For triploid accessions, genotypes were coded as 0, 1, 2, or 3, and allelic dosages scored as 1:1 during allele calling were converted to 1.5. For tetraploid accessions, genotypes were coded as 0, 1, 2, 3, or 4 and no correction was needed.
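The dosage coding rules above can be written out as a small sketch. The call labels (`hom_ref`, `hom_alt`, `balanced`, `mostly_ref`, `mostly_alt`) and the function name are hypothetical, chosen for illustration only; they are not the output format of the allele-calling software.

```python
VALID_CALLS = {"hom_ref", "hom_alt", "balanced", "mostly_ref", "mostly_alt"}


def code_genotype(call, ploidy):
    """Return the nonreference allele count for one call, or None if missing."""
    if call is None:
        return None
    if call not in VALID_CALLS:
        raise ValueError(f"unknown call: {call}")
    if call == "hom_ref":
        return 0
    if call == "hom_alt":
        return ploidy
    if ploidy == 2:
        # Any heterozygous call, even one scored as polyploid-like, becomes 1.
        return 1
    if ploidy == 3:
        # A 1:1 call cannot exist in a triploid, so it is coded as 1.5.
        return {"balanced": 1.5, "mostly_ref": 1, "mostly_alt": 2}[call]
    if ploidy == 4:
        # Tetraploid calls need no correction.
        return {"balanced": 2, "mostly_ref": 1, "mostly_alt": 3}[call]
    raise ValueError("ploidy must be 2, 3, or 4")
```

For example, `code_genotype("balanced", 3)` gives 1.5, which places an impossible triploid 1:1 call midway between the two dosages it could actually represent.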
Diversity analysis was conducted in two steps. In the first step, groups of duplicate accessions (redundancy groups) were defined by grouping accessions having up to one allele mismatch. In the second step, the diversity analysis focused on the similarity between those groups. Clustering based on allele frequencies within redundancy groups followed by a bootstrap approach (pvclust R package, ward.D2, 10,000 bootstrap replicates, AU threshold = 0.95; Suzuki & Shimodaira, 2006) was used to identify gene pools. A diversity network between redundancy groups was also drawn using significant kinship detected through genotype permutations (1,000), with a significance threshold of 0.05.
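The first step, grouping accessions that differ by at most one allele, might be sketched as follows. This is an illustration rather than the authors' procedure: it assumes that genotypes are stored as strings of the alleles present at each SNP (for example "A", "B", or "AB", with None for missing data), that a mismatch is an allele present in one accession but not in the other, and that groups are built by single linkage. All names are hypothetical.

```python
def allele_mismatches(a, b):
    """Count alleles present in one accession but not the other, skipping missing calls."""
    total = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        total += len(set(x) ^ set(y))
    return total


def redundancy_groups(genotypes, max_mismatch=1):
    """Group accessions linked by at most max_mismatch allele mismatches."""
    names = list(genotypes)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if allele_mismatches(genotypes[a], genotypes[b]) <= max_mismatch:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())
```

With single linkage, two accessions can end up in the same group through an intermediate accession even if they differ from each other by more than one allele; a stricter rule would require every pair in a group to meet the threshold.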

| KASPar assay development and validation
Genotyping-by-sequencing (GBS) produced more than 344 million reads resulting in 521,918 sequence tags out of which 207,810 (39.82%) aligned exactly once on D. alata contigs. The remaining reads aligned at multiple locations (25.18%) or did not align to any contig (35%). From these sequence tags, SNP calling produced a raw vcf file of 158,695 SNPs. This raw vcf file was then filtered resulting in a dataset of 40 accessions (Appendix A), and 4,593 good quality SNPs out of which 3,879 (84%) SNPs were mapped by BLAST on the D. rotundata reference genome. The KASPar assay was then developed by selecting 192 SNPs representative of SNPs mapped along the D. rotundata reference sequence, and they were tested on 141 accessions.
Among the 192 SNPs, 26 (13%) SNPs failed as they did not produce any amplification signal. From the remaining 166 SNPs (87%), 129 SNPs (Appendix C) with less than 20% missing data and a minor allele frequency of over 5% were retained as high-quality SNPs. This final dataset (129 SNPs × 141 accessions) contained an overall missing data rate of only 0.5% with a maximum of 3% missing data per accession.
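The two quality thresholds above (less than 20% missing data, minor allele frequency over 5%) can be expressed as a short filter. This sketch assumes that genotypes are coded as nonreference allele counts and that allele frequency is computed over all allele copies, weighting each accession by its ploidy; the text does not state how frequencies were computed in a mixed-ploidy panel, so that choice and the function name are assumptions.

```python
def passes_qc(dosages, ploidies, max_missing=0.20, min_maf=0.05):
    """Check one SNP; dosages are nonreference allele counts (None = missing)."""
    if not dosages:
        return False
    called = [(d, p) for d, p in zip(dosages, ploidies) if d is not None]
    n_missing = len(dosages) - len(called)
    # Reject SNPs whose missing-data rate reaches the threshold.
    if n_missing / len(dosages) >= max_missing:
        return False
    # Frequency of the nonreference allele over all called allele copies.
    alt_freq = sum(d for d, _ in called) / sum(p for _, p in called)
    return min(alt_freq, 1 - alt_freq) > min_maf
```

A monomorphic SNP fails the frequency test whatever its call rate, so a marker with no variation in the validation panel would be discarded even if it amplified well.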
The 129 validated KASPar SNPs were distributed on all linkage groups used to construct the D. rotundata reference genome (Figure 1). Their distribution was not homogeneous along chromosomes as their position was planned to be representative of that of the initial set of 3,879 mapped SNPs and not equally spaced.

| Assessment of ploidy levels
In our D. alata validation panel, three ploidy levels (2x, 3x and 4x) coexisted (Appendix B). Thus, the KASPar assay could theoretically produce a maximum of seven types of fluorescence signal (Table 1) corresponding to two types of fluorescence signal in homozygous states (2:0 = 3:0 = 4:0; 0:2 = 0:3 = 0:4), the fluorescence signal of mixed and balanced allelic dosages (1:1 for diploids or 2:2 for tetraploids) and the four types of fluorescence signal corresponding to the different possible unbalanced allelic dosages at heterozygotic loci ("polyploid-like" in Table 1) of triploids and tetraploids (1:3; 1:2; 2:1; 3:1). In our case, due to insufficient fluorescence resolution, it was not possible to distinguish fluorescence signals of the 1:3 tetraploid allelic dosage from the 1:2 triploid allelic dosage, or the 2:1 triploid allelic dosage from the 3:1 tetraploid allelic dosage. Consequently, a maximum of five types of fluorescence signals were identified.
FIGURE 1 Location of KASPar SNPs on the D. rotundata reference genome (Tamiru et al., 2017). The 21 linkage groups are aligned from left to right. Black dots, failed or bad quality SNPs; red dots, the 129 validated SNPs
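The counts of seven theoretical and five observed signal types can be checked by enumerating every allelic dosage for the three ploidy levels and grouping them by the fraction of the first allele. In this sketch the function names and the merging tolerance of 1/10 are hypothetical; the tolerance merely stands in for the limited fluorescence resolution and is not a measured value.

```python
from fractions import Fraction


def signal_fractions(ploidies=(2, 3, 4)):
    """Map each distinct allele-1 fraction to the dosages (a1:a2) that produce it."""
    classes = {}
    for ploidy in ploidies:
        for a1 in range(ploidy + 1):
            frac = Fraction(a1, ploidy)
            classes.setdefault(frac, []).append(f"{a1}:{ploidy - a1}")
    return dict(sorted(classes.items()))


def merge_unresolved(classes, tolerance=Fraction(1, 10)):
    """Merge neighbouring signal classes whose fractions differ by less than tolerance."""
    merged = []
    for frac, dosages in classes.items():
        if merged and frac - merged[-1][0] < tolerance:
            merged[-1][1].extend(dosages)
        else:
            merged.append((frac, list(dosages)))
    return [dosages for _, dosages in merged]
```

The fractions 1/4 and 1/3 differ by only 1/12, as do 2/3 and 3/4, which is why the 1:3 and 1:2 dosages (and the 2:1 and 3:1 dosages) are the pairs that collapse into one signal.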
However, the overall allele call and allelic dosage assessment quality were good. Indeed, the ratio of genotypes scored as "polyploid-like" to overall heterozygous genotypes per accession was low for diploids (0.09 ± 0.05) and high for triploids (0.83 ± 0.05). In addition, the three distributions of this ratio corresponding to the three ploidy levels barely overlapped (Figure 2).
We were thus not able to differentiate all allelic dosages from each other when looking at a single SNP. However, ploidy level could be deduced when taking the whole KASPar array into account and considering the proportion of genotypes scored as "polyploid-like" per accession. This KASPar assay thus differentiated accession ploidy levels and allowed us to assign a ploidy level to 12 accessions of originally unknown ploidy. Nine were set as diploid and three as triploid.
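The per-accession ratio used for this inference can be sketched as below. The call labels and function names are hypothetical, and the two cutoffs (0.3 and 0.7) are illustrative placeholders, not values from this study; the sketch also assumes that tetraploids fall between diploids and triploids, which the text implies but does not quantify.

```python
def polyploid_like_ratio(calls):
    """Share of heterozygous calls scored as unbalanced ("polyploid-like")."""
    unbalanced = sum(c in ("mostly_ref", "mostly_alt") for c in calls)
    balanced = sum(c == "balanced" for c in calls)
    het = unbalanced + balanced
    return unbalanced / het if het else None


def infer_ploidy(calls, diploid_max=0.3, triploid_min=0.7):
    """Guess ploidy from the polyploid-like ratio; cutoffs are illustrative only."""
    ratio = polyploid_like_ratio(calls)
    if ratio is None:
        return None  # no heterozygous call, so nothing can be inferred
    if ratio <= diploid_max:
        return 2
    if ratio >= triploid_min:
        return 3
    return 4
```

In practice the cutoffs would be set from reference accessions of known ploidy, in the same way that known accessions were used to identify fluorescence clusters.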

| Diversity analysis
Overall, 141 accessions from CRB-PT and CIRAD ex situ collections in Guadeloupe were used to validate the KASPar assay (96 diploids, 36 triploids, and nine tetraploids including accessions with known and deduced ploidy level).
The allele presence and/or absence was used to assess the similarity between accessions and thus to identify duplicate accessions. More generally, redundancy groups only consisted of accessions with the same ploidy level (Figure 4). Moreover, similarities within triploids or within tetraploids were higher than within diploids.
The diversity analysis was based on these 43 redundancy groups to avoid bias. After clustering, the bootstrap procedure detected five significant gene pools, referred to as "clusters" here, represented in the kinship network (Figure 5). Only one (cluster C,

| Assessment of allelic dosage and detection of ploidy levels
KASPar technology is based on competitive allele-specific amplification followed by allele-specific fluorescence assessment (Semagn, Babu, Hearne, & Olsen, 2014). Detection of allelic dosage in polyploid species is thus possible (Cuenca, Aleza, Navarro, & Ollitrault, 2013). However, several parameters may influence the fluorescence, such as the DNA quality or primer specificity, and consequently the ability to discriminate fluorescence signals and the allelic dosage. In our case, we were able to discriminate five types of fluorescence signal. At heterozygous loci, fluorescence signals were a mixture of two types of allele-specific fluorescence.
Fluorescence signals should also be balanced for diploids, which have a balanced allelic dosage (1:1) at heterozygous loci. Diploids should therefore theoretically have no genotypes assessed as "polyploid-like." Conversely, triploids should theoretically have only genotypes assessed as "polyploid-like" at heterozygous loci, as a balanced allelic dosage is impossible for them.
Our results showed that 91 ± 5% and 83 ± 5% of heterozygous genotypes were correctly called for diploids and triploids, respectively.
Given the recent explosion of genotyping related to next-generation sequencing, bioinformatics tools have been developed to accurately determine dosages (e.g., GBS2ploidy; Gompert & Mock, 2017). However, these tools require deep sequencing and usually an assumption about the ploidy levels present in the dataset (Bourke, Voorrips, Visser, & Maliepaard, 2018).
Application in collection management may nevertheless not require allelic dosage assessment at each locus. Our aim was thus to develop a tool for estimating ploidy levels and not variations in copy number. Moreover, the results showed that the ploidy level of each accession can be accurately deduced from the percentage of "polyploid-like" genotypes among overall heterozygous genotypes. Given the overlap between the distributions of this ratio (Figure 2), the only risk, estimated at 3%, is confusing triploids with tetraploids. Consequently, ploidy level assessment is possible and fairly accurate for D. alata using the KASPar assay developed in this study.
FIGURE 5 Network of kinship for the 43 D. alata redundancy groups based on significant similarity (p < 0.05, edge-weighted spring-embedded layout). Node shape and letter, cluster of diversity identified by a bootstrap procedure; red nodes, diploids; green nodes, triploids; blue nodes, tetraploids; edge colors, similarity from gray (0.64) to black (1)
with previous genetic diversity studies that already pooled these accessions together and highlighted this intragroup variability in tubers (Arnau et al., 2017; Malapa et al., 2005).

| Diversity and collection management
The CRB-PT collection has been shown to be representative of worldwide D. alata diversity (Arnau et al., 2017). A subset of this ex situ collection has been genotyped in this study. However, all diversity groups identified by Arnau et al. (2017)

| CONCLUSION
This is the first SNP array designed for D. alata and validated on a subset of accessions representative of worldwide D. alata diversity. This tool will allow users to estimate accession ploidy levels and genetic lineages. The results showed a good correlation between the diversity assessed by this KASPar array and the findings of previous studies. This KASPar array is a robust and cost-effective tool for diversity assessment and collection management. Given the importance of vegetative reproduction and somaclonal selection in D. alata, it is a good tool to complement agro-morphological description in collections.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.