Identification of diagnostic KASP‐SNP markers for routine breeding activities in yam (Dioscorea spp.)

Maintaining genetic purity and true‐to‐type clone identification are important action steps in breeding programs. This study aimed to develop a universal set of kompetitive allele‐specific polymerase chain reaction (KASP)‐based single nucleotide polymorphism (SNP) markers for routine breeding activities. Ultra‐low‐density SNP markers were created using an initial set of 173,675 SNPs that were obtained from whole‐genome resequencing of 333 diverse white Guinea yam (Dioscorea rotundata Poir) genotypes. From the whole‐genome resequencing data, 99 putative SNP markers were identified and successfully converted to high‐throughput KASP genotyping assays. The marker set was validated on 374 genotypes representing six yam species. Of the 99 markers, 50 were highly polymorphic across the species and could distinguish different yam species and pedigree origins. The selected SNP markers classified the validation population by yam species and identified potential duplicates within species. Through penalized analysis, the male parent of progenies involved in polycrosses was successfully predicted and validated. This study pioneers the validation of KASP‐based SNP assays for species identification, parental fingerprinting, and quality control (QC) and quality assurance (QA) in yam breeding programs.


INTRODUCTION
Yam (Dioscorea spp.) is a multi-species monocotyledonous tuber crop predominantly cultivated in the tropical regions of Africa, Southeast Asia, and Latin America (Azeteh et al., 2019; Darkwa, Olasanmi, et al., 2020; Oben et al., 2016). Dioscorea encompasses about 600 species (Burkill, 1960), of which only 11 are widely cultivated species (Darkwa, Olasanmi, et al., 2020). The major cultivated species have basic chromosome number X = 20 (Scarcelli et al., 2019) but have variable ploidy (2n = 40, 60, 80) between and within the species (Sugihara et al., 2020). Among the cultivated species, water/purple yam (Dioscorea alata L.) is the most extensively distributed species globally from the forest to grassland in the tropical and temperate regions. In contrast, white Guinea yam (Dioscorea rotundata Poir) is the most widely planted and produced yam globally (Asfaw et al., 2021), with its predominant cultivation coming from West Africa, a region where Nigeria, Ghana, and Cote d'Ivoire together produce nearly 86% of the world's yam supply (FAOSTAT, 2020). In West Africa, yam is a valuable food security crop, with about 400 million people depending on it as a staple food. In addition to serving as an important source of dietary carbohydrates, it also serves as a significant source of income for producers and a sociocultural asset for the society that depends on it (Aighewi et al., 2020; Nweke, 2016).
The informal nature of yam trade, sharing of germplasm among countries, and the exchange of planting materials among farmers have resulted in the same cultivar carrying different names or different cultivars carrying the same name in various yam farming systems (Agre et al., 2021; Otoo et al., 2009). Such a phenomenon has also been reported in cassava (Manihot esculenta L.) studies (Agre et al., 2018; Rabbi et al., 2015). Furthermore, human errors such as mislabeling and material mix-up during planting, harvesting, transportation, or storage are the leading causes of misidentification of accessions in genebank collections and breeding programs (Girma et al., 2012). Therefore, control and understanding of germplasm identity and purity, as well as hybrid authentication, are essential action steps for managing breeding efficiency.
Single nucleotide polymorphisms (SNPs) are vital molecular markers for genetic analysis and breeding. Next-generation sequencing, which generates millions of SNP markers, has provided a faster and more affordable opportunity to employ genotyping as a low-cost and robust component of routine breeding activities (Chen et al., 2016; Semagn et al., 2014). However, in yam, the majority of molecular studies have focused on de novo sequencing (Saski et al., 2015), the development of a reference genome (Sugihara et al., 2020; Tamiru et al., 2017), genetic diversity analysis (Agre et al., 2019, 2021; Akakpo et al., 2017; Darkwa, Agre, et al., 2020; Girma et al., 2014, 2015; Loko et al., 2017; Siadjeu et al., 2018), the evolution of yam domestication (Sugihara et al., 2020), quantitative trait locus identification (Agre et al., 2022; Bhattacharjee et al., 2018; Cormier, Lawac, et al., 2019), and genomic prediction (Norman et al., 2020). Routine breeding activities (assessment of genetic diversity, identification of mislabeled genotypes, pedigree verification, variety tracking, and so on), also known as quality control (QC) in crop breeding, leverage molecular breeding tools and have been employed as a routine practice in many breeding and seed programs to harness genetic gain and deliver quality products to end users. These practices have been developed and deployed in several crops, including maize (Zea mays L.) (Chen et al., 2016; Gowda et al., 2017; Semagn et al., 2012), rice (Oryza sativa) (Ndjiondjop et al., 2018), sweetpotato (Ipomoea batatas (L.) Lam) (Gemenet et al., 2020), cowpea (Vigna unguiculata (L.) Walp.; Ongom et al., 2021), and soybean (Glycine max; Chander et al., 2021). Except for a recent study by Cormier, Mournet, et al. (2019) that presented kompetitive allele-specific (KASP)-based SNP markers for ploidy evaluation in water/purple yam, efforts to establish QC markers for yam are scanty. The goal of this project was therefore to develop a universally applicable, highly informative quality control/assurance (QC/QA) marker set for examining the genetic purity of breeding lines at testing stages and of cultivars in seed systems, and for authenticating the hybridity of offspring from crosses.

Core Ideas
• This study is a pioneer that validated kompetitive allele-specific (KASP)-based single nucleotide polymorphism (SNP) assay for quality control (QC) and assurance in yam breeding.

Plant materials used for generating raw SNP markers
In this study, three sets of materials were used. Dataset 1 consisted of 333 yam clones, encompassing breeding lines, landrace cultivars, and gene bank accessions, that were subjected to whole-genome resequencing to identify highly polymorphic SNP markers.
Dataset 2 comprised 374 yam clones representing six different species (D. alata, D. bulbifera, D. cayenensis, D. praehensilis, D. esculenta, and D. rotundata). These materials were used for high-throughput KASP assay design and marker validation. Dataset 3 consisted of 27 yam clones, including four putative parents (one female and three males) and their 23 progenies, which originated from an unsupervised polycross involving these four parents in the yam breeding program of the International Institute of Tropical Agriculture.

Sequencing of dataset 1
Total genomic deoxyribonucleic acid (DNA) from dataset 1 was extracted following the protocol described by Tamiru et al. (2017). DNA library construction and whole-genome resequencing were conducted using the protocol described by Sugihara et al. (2020). Generated reads were aligned with yam reference genome v2 (Sugihara et al., 2020) using Genome Analysis Toolkit v3.8-0 to generate a variant call format (VCF) file. The VCF file was later filtered for minor allele frequency (MAF), missing data (at both the genotype and SNP marker levels), polymorphism information content, heterozygosity, and read depth.
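The MAF and missing-data filters mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes diploid genotype calls coded as alternate-allele counts (0, 1, 2) with NaN for missing calls, and the thresholds `min_maf` and `max_missing` are hypothetical placeholders because the text does not report the cut-offs applied.

```python
import numpy as np

def snp_summary(genotypes):
    """Return per-SNP missing rate and minor allele frequency (MAF).

    genotypes: samples x SNPs array of alternate-allele counts (0, 1, 2),
    with np.nan marking a missing call (diploid coding is an assumption).
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    n_called = called.sum(axis=0)
    missing_rate = 1.0 - n_called / g.shape[0]
    alt_sum = np.nansum(g, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Each called diploid genotype contributes two alleles.
        alt_freq = np.where(n_called > 0, alt_sum / (2.0 * n_called), np.nan)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return missing_rate, maf

def filter_snps(genotypes, min_maf=0.05, max_missing=0.2):
    """Boolean mask of SNPs passing the (hypothetical) MAF and missingness cut-offs."""
    missing_rate, maf = snp_summary(genotypes)
    return (maf >= min_maf) & (missing_rate <= max_missing)
```

Applying the returned mask to the columns of the genotype matrix keeps only the SNPs that satisfy both criteria; SNPs with no called genotypes are dropped automatically because their MAF is undefined.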

Selection of highly polymorphic SNP markers
SNP markers were then further filtered using a custom R script to retain only those that had a ratio of observed to expected heterozygosity (He) of less than 1.5, a guanine-cytosine content of 40%-60% in the 101 bp window extending 50 base pairs (bp) on either side of the SNP, and two or fewer flanking SNPs in that same window. Flanking SNPs were ignored if their quality score was 900 or lower but were not otherwise subjected to filtering in PLINK or R. Two approaches were then used for selecting SNPs. In the first approach, the Purity tool in the Excellence in Breeding Galaxy instance (http://cropgalaxy.excellenceinbreeding.org/root?tool_id=purity_beta) was used to identify a set of 50 markers that best distinguished all individuals in dataset 1. In the second approach, the dataset was subdivided into nine populations using discriminant analysis of principal components (DAPC) (Jombart et al., 2010), and a simulated annealing algorithm developed in R was used to find a set of 50 markers that maximized diversity within and between these nine populations. Using these two methods, 99 markers were selected in total. The SNP selection process, including all R code, is documented and available online at https://github.com/HPCBio/eib-marker-design/blob/main/Pedigree_verification.md, archived on Zenodo at https://doi.org/10.5281/zenodo.6093835.
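The three design filters described above (heterozygosity ratio, GC content of the 101 bp window, and number of flanking SNPs) can be expressed compactly. The sketch below is written in Python for illustration only; the study used a custom R script, and the function and parameter names here are hypothetical. The caller is assumed to have already counted only flanking SNPs with a quality score above 900.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def passes_design_filters(window_seq, n_flanking_snps, ho, he,
                          gc_range=(0.40, 0.60), max_flanking=2,
                          max_ho_he=1.5):
    """Check one candidate SNP against the three criteria in the text.

    window_seq: the 101 bp window (50 bp on each side of the SNP).
    n_flanking_snps: flanking SNPs in that window, counted by the caller.
    ho, he: observed and expected heterozygosity of the SNP.
    """
    if len(window_seq) != 101:
        raise ValueError("expected a 101 bp window centered on the SNP")
    if he <= 0:
        # A monomorphic marker has no defined Ho/He ratio; reject it.
        return False
    gc = gc_fraction(window_seq)
    return (gc_range[0] <= gc <= gc_range[1]
            and n_flanking_snps <= max_flanking
            and ho / he < max_ho_he)
```

A marker with a balanced window, one flanking SNP, and Ho equal to He would pass, whereas an AT-rich window or an Ho/He ratio of 1.5 or more would be rejected.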

Design, verification, and validation of KASP markers from selected SNP sequences
Flanking sequences surrounding each selected SNP (50 bp both upstream and downstream of the variant position) were generated using a custom R script, with any flanking SNPs coded using International Union of Pure and Applied Chemistry (IUPAC) ambiguity codes. The flanking sequences for the 99 selected SNPs were then submitted for primer synthesis, verified, and validated on a panel of 374 yam accessions composed of datasets 2 and 3 (https://yambase.org/pages/markers).
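As an illustration of how such an assay-design sequence can be assembled, the following Python sketch extracts 50 bp on each side of a target SNP and replaces flanking SNPs with IUPAC ambiguity codes. It is not the authors' R script. The bracketed `[ref/alt]` notation for the target SNP is a commonly used submission convention and is assumed here, as are the 0-based coordinates and the function name.

```python
# IUPAC ambiguity codes for the six possible biallelic SNPs.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def kasp_sequence(reference, snp_pos, ref_allele, alt_allele,
                  flanking_snps=None, flank=50):
    """Build a design sequence around a target SNP.

    reference: chromosome sequence as a string.
    snp_pos: 0-based position of the target SNP (assumed convention).
    flanking_snps: optional dict {position: (allele1, allele2)} of
        neighboring SNPs to mask with IUPAC codes.
    """
    flanking_snps = flanking_snps or {}
    start, end = snp_pos - flank, snp_pos + flank + 1
    if start < 0 or end > len(reference):
        raise ValueError("SNP is too close to the sequence end")
    out = []
    for pos in range(start, end):
        if pos == snp_pos:
            out.append(f"[{ref_allele}/{alt_allele}]")
        elif pos in flanking_snps:
            alleles = frozenset(a.upper() for a in flanking_snps[pos])
            out.append(IUPAC[alleles])
        else:
            out.append(reference[pos].upper())
    return "".join(out)
```

For a target at position 60 with a G/A flanking SNP two bases upstream, the output contains `R` at that neighboring position and `[A/G]` at the target.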

Data analysis
Various population genetic analyses were conducted to explore the genetic properties of the markers. DAPC was performed first using 136,429 SNPs and then with the selected 99 discriminative SNP markers using the adegenet package in R (Jombart et al., 2010), and the membership probabilities were compared. For dataset 2, HapMap and VCF files were generated from the raw data. Summary statistics, such as missing values, MAF, heterozygosity (observed and expected), polymorphism information content (PIC), and discriminant markers, were assessed using PLINK (Purcell et al., 2007) and VCFtools. SNP markers were then classified based on their PIC, their ability to discriminate yam species, and their contribution to identifying true-to-type yam clones. The most polymorphic SNP markers were selected and used to assess genetic diversity through principal component analysis (PCA) and hierarchical cluster analysis. To understand the frequency of the different recombinations in dataset 2, admixture analysis was conducted, whereby clones with an ancestry posterior probability (APP) >0.5 were assigned to a group, while clones with APP <0.5 were considered admixed.
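For reference, expected heterozygosity and PIC can be computed from allele frequencies with the standard formulas He = 1 - Σ p_i² and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j². The Python sketch below illustrates these definitions; it is not a reproduction of the PLINK or VCFtools calculations used in the study, and the function names are hypothetical.

```python
from itertools import combinations

def expected_heterozygosity(freqs):
    """He = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """Polymorphism information content for a list of allele frequencies."""
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - cross
```

For a biallelic SNP, PIC reaches its maximum of 0.375 when both alleles have a frequency of 0.5, which puts the average PIC values reported for this marker set in context.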
Intentional duplicate clones were used to evaluate the accuracy of some selected SNP markers in detecting potential duplicates. Twenty-three progenies with four putative parents (one female and three males) used in previous pedigree estimation by Norman et al. (2020) were incorporated in the present study for pedigree reconstruction. The Penalized and doMC libraries were used to estimate the contribution of the male and female parents to the offspring (McIlhagga, 2016). The value of each parent was estimated using the multinomial model of the penalized function formula as previously described (McIlhagga, 2016; Norman et al., 2020), and the final pedigree was visualized using the Helium visualization tool (Shaw et al., 2014).
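To make the idea of paternity assignment concrete, the sketch below uses a simpler technique than the penalized multinomial model applied in this study: it counts Mendelian incompatibilities between an offspring, the known female parent, and each candidate male, and picks the candidate with the fewest conflicts. Diploid genotypes coded 0/1/2 are assumed, and all names are hypothetical.

```python
def _gametes(genotype):
    """Alleles (0 = ref, 1 = alt) a diploid parent can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

def mendelian_mismatches(offspring, mother, father):
    """Count loci where the offspring genotype cannot arise from the pair.

    Genotypes are alternate-allele counts (0, 1, 2); None marks missing.
    Returns (mismatches, loci_compared).
    """
    mismatches = compared = 0
    for o, m, f in zip(offspring, mother, father):
        if o is None or m is None or f is None:
            continue
        compared += 1
        possible = {a + b for a in _gametes(m) for b in _gametes(f)}
        if o not in possible:
            mismatches += 1
    return mismatches, compared

def assign_father(offspring, mother, candidates):
    """Return the candidate male with the fewest mismatches and all scores."""
    scores = {name: mendelian_mismatches(offspring, mother, geno)[0]
              for name, geno in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

Exclusion counting of this kind is sensitive to genotyping error and to ties between candidates, which is one reason a model-based approach such as the penalized regression cited above may be preferred in practice.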

Summary statistics of the selected SNP markers
Using dataset 1, 99 SNPs were identified as informative and discriminative. The selected SNP markers were successfully converted into KASP markers and tested on 374 yam accessions (dataset 2). Across all SNP markers and species, the average MAF was 0.29, with the lowest average value recorded in D. esculenta (Table 1). The average PIC was 0.29, with the lowest value again recorded in D. esculenta (Table 1). The observed heterozygosity (Ho) varied from 0.17 in D. esculenta to 0.37 in D. rotundata, with the remaining species presenting in-between values.
Based on the different genetic parameters (PIC, MAF, Ho, He, and the proportion of missing data), 50 of the 99 SNP markers tested were identified as having good genetic parameters (Table 2). Additional information related to the SNP markers, such as the flanking sequences and the allele variants, can be freely downloaded from the following link: https://figshare.com/articles/dataset/List_of_50_SNP_markers/24072339.

Population structure, species and variety identification, and germplasm management
The 50 highly polymorphic SNP markers selected were used to genetically characterize 374 yam accessions consisting of different yam species (D. rotundata, D. praehensilis, D. cayenensis, D. alata, D. esculenta, and D. bulbifera) using PCA and hierarchical clustering. The first two components of the PCA accounted for 82.6% of the total variation among the accessions (Figure 1). The PCA revealed clear separation among the different yam species involved in this study and showed D. alata, D. esculenta, and D. bulbifera to be closer to one another than to the rest of the species (Figure 1). Based on the identity-by-state matrix (dissimilarity matrix), the genetic distance varied from 0.01 to 0.59, with the lowest value obtained among accessions of the same species. In contrast, the highest genetic distance was observed between D. rotundata and D. alata. Hierarchical clustering revealed the presence of four major clusters (Figure 2). Cluster 1 had only D. rotundata (cyan color), cluster 2 only D. praehensilis (yellow color), and cluster 3 only D. cayenensis (green color), while cluster 4 (red, black, and blue) was a combination of three different species (D. alata, D. esculenta, and D. bulbifera). Cluster 1 consisted of only D. rotundata (138 in total) with several intentionally duplicated clones (Figure 2). Local varieties such as Hemba and Pampar, previously identified as duplicates using high-density SNP markers, were herein confirmed as duplicates using this QC/QA marker set. In cluster 2, made up of only D. praehensilis, the genetic distance varied from 0.01 to 0.22, with many clones identified as closely related (Figure 2). Cluster 3 contained only D. cayenensis and displayed a very low genetic distance among the studied accessions. Cluster 4 had the largest membership, with three different species, namely D. alata, D. esculenta, and D. bulbifera.

FIGURE 1 Principal component analysis of 374 yam accessions (dataset 2) using 50 single nucleotide polymorphism (SNP) markers. The different colors represent the different species used in the study.
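Duplicate detection from an identity-by-state (IBS) dissimilarity matrix, as described above, can be illustrated with a short sketch. It assumes diploid genotypes coded 0/1/2 with NaN for missing calls, and the duplicate threshold is a hypothetical placeholder rather than a value taken from this study.

```python
import numpy as np

def ibs_distance(g1, g2):
    """IBS dissimilarity: mean |g1 - g2| / 2 over loci called in both samples."""
    a = np.asarray(g1, dtype=float)
    b = np.asarray(g2, dtype=float)
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return float("nan")
    return float(np.abs(a[both] - b[both]).mean() / 2.0)

def find_duplicates(genotypes, names, threshold=0.05):
    """List sample pairs whose IBS distance is at or below the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = ibs_distance(genotypes[i], genotypes[j])
            # Pairs with no shared calls yield NaN and are never flagged.
            if d <= threshold:
                pairs.append((names[i], names[j], d))
    return pairs
```

With only 50 markers, a small non-zero threshold leaves room for occasional genotyping errors between true duplicates, but the appropriate value would need to be calibrated on known duplicate pairs such as the intentional duplicates included in dataset 2.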
Admixture analysis was used to assess the population structure of dataset 2. The K-value was used to estimate the number of accession clusters based on the 50 SNP markers. To determine the optimal K-value, the cluster number (K) was plotted against ΔK, which showed a sharp elbow at K = 4 (Figure S1). The optimal K-value indicates that four subpopulations (Pop1, Pop2, Pop3, and Pop4) showed the highest probability of population clustering (Figure 3). In addition, across the four subpopulations, few accessions had an ancestry probability greater than 50%, and such accessions were classified as non-admixed. Based on the ancestry probability of assignment, the highest rate of admixed clones was recorded in the D. rotundata breeding lines.
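The assignment rule stated in the methods (ancestry posterior probability above 0.5 assigns a clone to a group, otherwise it is treated as admixed) amounts to a simple threshold on each row of the ancestry matrix. The sketch below illustrates that rule in Python; the function name and the `Pop1`...`PopK` labels follow the naming in the text but are otherwise assumptions, and a probability of exactly 0.5 is treated here as admixed because the text does not specify that case.

```python
def assign_population(ancestry, threshold=0.5):
    """Assign one accession from its ancestry proportions.

    ancestry: sequence of K ancestry probabilities for one accession.
    Returns "PopN" when the largest proportion exceeds the threshold,
    otherwise "admixed".
    """
    best = max(range(len(ancestry)), key=lambda k: ancestry[k])
    if ancestry[best] > threshold:
        return f"Pop{best + 1}"
    return "admixed"
```

Applied row by row to a K = 4 ancestry matrix, this yields the group labels and the admixed fraction per species or breeding population.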

Pedigree reconstruction of D. rotundata using 50 SNP markers
The paternity of 23 progenies originating from a polycross between a single female (TDr9700205) and three different males (TDr9501932, TDr9902789, and TDr9902607) was reconstructed. Through the penalized analysis and the pedigree reconstruction, it was ascertained that male TDr9501932 made the highest contribution (61%) to the pollination of the progenies (Figure 4).

DISCUSSION
We developed a small subset of highly informative SNP markers for routine use in breeding and seed programs to assess the genetic purity of breeding lines and released cultivars and the hybridity of progenies from crosses. Recent advances in molecular technology have emphasized the use of SNP markers because they are cost-effective, abundant in the genome, locus-specific, codominant, and have the potential for high throughput compared to other markers (Hamblin et al., 2007; Josia et al., 2021). The use of many markers is recommended in the initial steps of understanding the genetic material that a breeding program possesses, yet at a later stage, it is highly recommended to use a smaller and more effective number of markers for cost-effectiveness (Cormier, Mournet, et al., 2019; Islam et al., 2015; Ramakrishnan et al., 2015; Vos et al., 2015). The utilization of very few informative SNP markers for routine breeding activities without compromising the original results has been established in many crops such as chickpea (Cicer arietinum L.) (Hiremath et al., 2012), potato (Solanum tuberosum) (Vos et al., 2015), cotton (Gossypium herbaceum; Islam et al., 2015), maize (Chen et al., 2016), rice (Ndjiondjop et al., 2018), as well as soybean (Chander et al., 2021). In a similar study, using 40 water/purple yam accessions, Cormier, Mournet, et al. (2019) identified 4593 genome-wide SNP markers, out of which 129 representative SNPs were converted to KASP markers for ploidy level assessment in water/purple yam. In our case, out of 136,429 SNPs assayed, 99 were successfully developed and validated on accessions of different yam species, which allowed the selection of 50 highly informative polymorphic SNPs characterized by a high PIC, a high MAF, and a high species discrimination rate. Hierarchical clustering and the PCA highlighted the efficiency of the subset of 50 selected SNP markers in identifying yam species and recognizing identical clones within the different species studied. In line with our finding, Semagn et al. (2012) and Zhou et al. (2016) have reported that 50-100 singleplex assay SNPs are sufficient for molecular-based QC genotyping. The 50 selected SNP markers described here can be applied in diverse yam breeding programs at the lowest cost without compromising the quality of routine activities, such as pedigree verification, variety tracking, and germplasm maintenance.

CONCLUSIONS
We conducted a pilot study on the development of SNP markers for Dioscorea spp. and employed them for yam varietal and species identification using KASP polymerase chain reaction technology. In order to create a universal DNA fingerprint for yam, 50 highly polymorphic KASP SNP markers were chosen. The SNP panel described here offers a valuable KASP resource to aid in the genetic identification of yam germplasm and to correct errors made during normal activities in the breeding and seed multiplication phases. The yam community will greatly benefit from the practical applications of this study's findings.

ACKNOWLEDGMENTS
We appreciate the technical support from the Ibadan and Abuja yam breeding staff at IITA. We also appreciate the technical support from JIRCAS. The funding support from the Bill and Melinda Gates Foundation (BMGF) through the AfricaYam project of the International Institute of Tropical Agriculture (IITA) (INV-003446) is acknowledged.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
The variant call format (VCF) files used for the analyses of datasets 1-3 are all available under the genotypic project at www.yambase.org. The list of the QC/QA markers can be downloaded through the following link: https://yambase.org/pages/markers.

FIGURE 2 Hierarchical clustering using a set of 50 single nucleotide polymorphism (SNP) markers on 374 yam accessions. Each color represents a cluster containing accessions of different species.

FIGURE 3 Admixture results using the 374 yam accessions (dataset 2) and the 50 single nucleotide polymorphism (SNP) markers. Each different color represents a group of clones, while each bar is an accession.

FIGURE 4 Visual display of the pedigree reconstruction with the selected 50 highly informative single nucleotide polymorphism (SNP) markers on the 23 D. rotundata breeding lines derived from the four parents in an unsupervised polycross design.

TABLE 1 Genetic statistic summary of the 99 single nucleotide polymorphism (SNP) markers based on dataset 2.

TABLE 2 List of the 50 highly informative single nucleotide polymorphism (SNP) markers selected for routine breeding activities.
writing-review and editing. Lindsay V. Clark: Formal analysis; methodology; writing-original draft; writing-review and editing. Ana Luisa Garcia-Oliveira: Methodology; writing-original draft; writing-review and editing. Rajaguru Bohar: Writing-review and editing. Patrick Adebola: Project administration; writing-review and editing. Robert Asiedu: