Developing reduced SNP assays from whole‐genome sequence data to estimate introgression in an organism with complex genetic patterns, the Iberian honeybee (Apis mellifera iberiensis)

Abstract The most important managed pollinator, the honeybee (Apis mellifera L.), has been subject to a growing number of threats. In western Europe, one such threat is large‐scale introductions of commercial strains (C‐lineage ancestry), which is leading to introgressive hybridization and even the local extinction of native honeybee populations (M‐lineage ancestry). Here, we developed reduced assays of highly informative SNPs from 176 whole genomes to estimate C‐lineage introgression in the most diverse and evolutionarily complex subspecies in Europe, the Iberian honeybee (Apis mellifera iberiensis). We started by evaluating the effects of sample size and sampling a geographically restricted area on the number of highly informative SNPs. We demonstrated that a bias in the number of fixed SNPs (FST = 1) is introduced when the sample size is small (N ≤ 10) and when sampling only captures a small fraction of a population's genetic diversity. These results underscore the importance of having a representative sample when developing reliable reduced SNP assays for organisms with complex genetic patterns. We used a training data set to design four independent SNP assays selected from pairwise FST between the Iberian and C‐lineage honeybees. The designed assays, which were validated in holdout and simulated hybrid data sets, proved to be highly accurate and can be readily used for monitoring populations not only in the native range of A. m. iberiensis in Iberia but also in the introduced range in the Balearic islands, Macaronesia and South America, in a time‐ and cost‐effective manner. While our approach used the Iberian honeybee as model system, it has a high value in a wide range of scenarios for the monitoring and conservation of potentially hybridized domestic and wildlife populations.


| INTRODUC TI ON
Biodiversity, including the genetic diversity within and between populations, is a unique heritage whose conservation is imperative for the benefit of future generations (Frankham, Ballou, & Briscoe, 2002). This is particularly important for organisms like the honeybee (Apis mellifera L.), which, through the pollination service it provides, plays a critical role in ecosystem functioning and in food production for humanity. The honeybee is under pressure worldwide due to multiple factors, ranging from emergent parasites and pathogens, and the overuse of agrochemicals, to the less publicized introgressive hybridization mediated by human management (reviewed by Potts et al., 2010;van Engelsdorp & Meixner, 2010). In a global world, where the circulation of commercial queens and package honeybees occurs at a rapid pace, and at large scale, reliable tools for monitoring genetic diversity are becoming indispensable.
The honeybee exhibits high diversity, with 31 currently recognized subspecies (Chen et al., 2016;Engel, 1999;Meixner, Leta, Koeniger, & Fuchs, 2011;Sheppard & Meixner, 2003) belonging to four main evolutionary lineages (western and northern Europe, M; south-eastern Europe, C; Africa, A; Middle East and Central Asia, O). Of the 31 subspecies, the Iberian honeybee A. m. iberiensis (M-lineage) has received the most attention with numerous genetic surveys and references therein). These have consistently shown the existence of a highly diverse and structured subspecies defined by two major clusters forming a sharp cline that bisects Iberia along a north-eastern-south-western axis (Arias, Rinderer, & Sheppard, 2006;Chávez-Galarza et al., 2017;Smith et al., 1991). Such complexity has been shaped by recurrent cycles of interacting selective and demographic processes, typical of long-term glacial refugia organisms (Chávez-Galarza et al., 2013, 2015.
However, this genetic legacy might be at risk if Iberian beekeepers adopt a strategy of importing commercial strains belonging to the highly divergent lineage C, as is occurring at large-scale throughout western and northern Europe north of the Pyrenees.
Since the early 20th century, beekeeping activity in this part of Europe has been characterized by colony importations and queen breeding with mostly C-lineage honeybees De la Rúa, Jaffé, Dall'Olio, Muñoz, & Serrano, 2009); which are known for their docile nature and high productivity (Ruttner, 1988). This human-mediated gene flow has threatened A. m. mellifera, the other M-lineage subspecies besides A. m. iberiensis in Europe.
Indeed, the genetic integrity of A. m. mellifera has been compromised by introgressive hybridization and, in some areas, it has even been replaced by subspecies of C-lineage ancestry (Jensen, Palmer, Boomsma, & Pedersen, 2005;Pinto et al., 2014;Soland-Reckeweg, Heckel, Neumann, Fluri, & Excoffier, 2009). Yet, maintaining locally adapted subspecies is crucial for the longterm sustainability of A. mellifera van Engelsdorp & Meixner, 2010). Reciprocal translocation experiments have recently shown that local honeybees have longer survivorship (Büchler et al., 2014) and lower pathogen loads (Francis et al., 2014) than introduced ones, reinforcing the importance of preserving the genetic diversity of locally adapted subspecies. Furthermore, it has been advocated that apiculture and commercial breeding could compromise honeybee health by interfering with natural selection Neumann & Blacquière, 2017).
The idea that long-term sustainability of honeybee populations can only be achieved by preserving natural genetic diversity and coevolved gene complexes has led to the establishment of conservation programmes and protected areas throughout Europe (De la Rúa et al., 2009). To foster and monitor such conservation efforts, reliable, cost-and time-effective tools are needed to accurately assess admixture levels between introduced and native honeybees. For the endangered A. m. mellifera, reduced assays of highly informative SNPs have already been developed to estimate C-lineage introgression (Muñoz et al., 2015;Parejo et al., 2016). However, equivalent tools for application in conservation and breeding efforts are still required for its sister subspecies, A. m. iberiensis.
Following the last glacial maximum, honeybees dispersed from the Iberian refugium to colonize a broad territory, extending from the Pyrenees to the Urals (Franck, Garnery, Solignac, & Cornuet, 1998;Ruttner, 1988). This important Iberian reservoir of genetic diversity has not yet been seriously threatened by C-lineage introgression Miguel, Iriondo, Garnery, Sheppard, & Estonba, 2007), although this scenario might change as many young beekeepers are attracted by the advertised benefits of commercial strains-being more prolific and docile. In many islands of the Baleares and Macaronesia, for example where the Iberian honeybee was presumably introduced in historical times, the contemporaneous large-scale importation of commercial C-lineage queens has resulted in high levels of introgression into the local populations (De la Rúa, Galián, Serrano, & Moritz, 2001Miguel et al., 2015;Muñoz, Pinto, & De la Rúa, 2014). The conservation of A. m. iberiensis diversity is therefore a priority, especially in the light of climate change as this subspecies is well adapted to a broad range of environments, including hot and dry summer months with limited nectar flows. These adaptations could be a basis for selection of new development cycles suited to new environmental conditions (Le Conte & Navajas, 2008).
A diverse array of molecular tools has been employed to monitor C-lineage introgression including PCR-RFLP of the intergenic tRNA leu -cox2 mtDNA region (Bertrand et al., 2015), microsatellites (Jensen et al., 2005;Soland-Reckeweg et al., 2009) and, more recently, SNPs (Parejo et al., 2016;Pinto et al., 2014). Among these, SNPs are becoming the tool of choice for many applications because they are easily transferred between laboratories, have low genotyping error, provide high-quality data, are suitable for automation in high-throughput technologies (Vignal, Milan, SanCristobal, & Eggen, 2002), and are more powerful for estimating introgression in honeybees (Muñoz et al., 2017).
High-throughput sequencing of whole genomes generates millions of SNPs. Yet, this volume of data is inappropriate for routine conservation purposes, such as breeding and population monitoring. Therefore, the mining of highly informative SNPs from such high genomic resolution data sets is a common approach for developing reduced SNP assays capable of reliable ancestry estimation (Amirisetty, Khurana Hershey, & Baye, 2012;Judge, Kelleher, Kearney, Sleator, & Berry, 2017). While different metrics and approaches (e.g., Delta, I n , PCA, outlier tests) can be used for ranking SNPs by information content, the fixation index (F ST ) has been the metric of choice perhaps due to its power (Ding et al., 2011;Karlsson, Moen, Lien, Glover, & Hindar, 2011;Wilkinson et al., 2011), especially when comparing only two highly divergent populations (Hulsegge et al., 2013).
Furthermore, some metrics are correlated regarding information content, in particular those based on allele frequencies (Ding et al., 2011;Wilkinson et al., 2011).
In this study, we developed cost-effective reduced SNP assays from 176 whole-genome sequences. When developing such tools, to assure that they are accurate and reliable, the diversity and population complexity needs to be considered. Therefore, taking advantage of the large and comprehensive whole-genome data set for A. m. iberiensis (N = 117), we first tested the effect of sample size and sampling a geographically restricted area on detecting fixed SNPs. Next, we designed the reduced SNP assays using a training data set to identify highly informative SNPs (F ST = 1), which were then validated in holdout and simulated data sets. The constructed SNP assays were revealed to be very powerful for accurately estimating C-lineage introgression and can thus be applied to support conservation efforts in the Iberian honeybee.

| Samples
The whole-genome sequences used in this study were obtained from 176 pure haploid males, representing 117 A. m. iberiensis, 28 A. m. carnica and 31 A. m. ligustica (DH and MAP, unpublished data;Parejo et al., 2016) sampled across a wide geographical range CT MT (Weir & Cockerham, 1984) between A. m. iberiensis, A. m. carnica and A. m. ligustica and between A. m. iberiensis and combined A. m. carnica with A. m. ligustica (C-lineage).

| Effect of sampling bias on the number of fixed SNPs
Starting with a large sample size, which covers a species' entire geographical range and therefore encompasses its variation, is an important first step for developing SNP assays with high statistical power (Ding et al., 2011;Mariette, Le Corre, Austerlitz, & Kremer, 2002).

| Assay design
After assessing the effects of sampling bias on the number fixed SNPs, we proceeded with designing the reduced SNP assays for es-   To downsize the number of fixed SNPs, the first filter eliminated SNPs <5,000 bp apart, which carry redundant information ( Figure 2).
This distance threshold correlates with the high linkage disequilibrium (LD) decay in honeybees (Wallberg, Glémin, & Webster, 2015) and has been used by others (Chapman et al., 2015;Harpur et al., 2014). In this filtering step, SNPs located in 3′UTR, 5′UTR, missense, splice donor and splice regions were preferentially retained to assure that the reduced assays included SNPs of putative functional relevance and thereby represent real phenotypic differences between lineages.
The subsequent filtering step was linked to the Agena Bioscience MassARRAY ® MALDI-TOF genotyping system ( Figure 2). To increase the probability of amplification success, we removed the SNPs which had >5 variable nucleotides on either side of the 250 bp flanking sequences, which will be used for primer design (Table S1).
Additionally, SNPs located in ambiguous regions of the reference genome were excluded using the following criteria: (i) >5 sequential formation. Three criteria were followed to construct each multiplex (hereafter termed reduced SNP assay) aiming at a maximum of 40 SNPs per multiplex, as allowed by the MassARRAY ® technology: (i) every chromosome represented, (ii) at least four putative functional SNPs and (iii) no overlapping SNPs between multiplexes. For comparison purposes, we also constructed four assays of randomly chosen SNPs (hereafter termed random SNP assays) from the wholegenome data set with the same size of the four multiplexes.

| Assay validation
For validating the reduced SNP assays, we simulated hybrid haplotypes using the software admix-simu (https://github.com/ williamslab/admix-simu) and a window-based 100-kbp resolution recombination map from Wallberg et al. (2015). To avoid related haplotypes in the simulated F1 and backcross haplotypes, we used the parental individuals only once in the simulation of recombination.

| Effect of sampling bias on the number of fixed SNPs
The  (Table 3).
Sampling a geographically restricted area also influences the number of fixed SNPs, although the extent of bias depends on sample origin (Table 4) (Table 4).

| Selection and genomic information of highly informative SNPs
Having assessed the potential effects of sampling bias, we were able to follow Anderson's simple training and holdout method without incorporating a significant bias when selecting highly informative SNPs ( Figure 2). Accordingly, highly informative SNPs for estimating  Table S3).

C-lineage introgression into
Chromosome 11 contained the highest proportion of fixed SNPs (3.1%, 4,729 SNPs), whereas chromosome 7 had the least (0.3%, 400 SNPs; Table S4). The physical distance between the fixed SNPs ranged from 1 to 2,587,074 bp with a mean of 11,261 bp.
Most fixed SNPs are located in introns (7,666) and intergenic regions (4,257); however, a number are located in regions of putative functional relevance, including 47 SNPs (distributed along 37 genes) that are nonsynonymous or missense variants (Table S5).
Of the 1,347 genic regions containing SNPs, 12 harbour more than 100 SNPs (Table S6). Gene ontology (GO) analysis revealed 13 significantly enriched functional terms (modified Fisher exact pvalue <.05; Table S7). The biological processes term "regulation of transcription, DNA-templated" shared 12 genes with the molecular function term, "transcription factor activity, sequence-specific DNA binding." Two other molecular function terms are associated with more than 26 genes related to DNA binding ("sequencespecific DNA binding," "DNA binding"). The KEGG pathways were represented by four terms "aminoacyl-tRNA biosynthesis," "Wnt signalling pathway," "mRNA surveillance pathway" and "insulin resistance."

| Assay design
Several filters were applied to the initial 18,272 fixed SNPs identified in the training data set, resulting in a final data set of 708 SNPs, which were used to design four multiplexes (or reduced assays) with the assay design tool of Agena ( Figure 2). The resulting assays contained 37 (M1), 38 (M2), 40 (M3) and 38 (M4) SNPs (Table S8) Table S4).

| Assay validation
The with the reduced assays performing consistently better than the random ones. For example, all random assays misclassified between one to three pure individuals as hybrids, which never occurred with the reduced assays (Tables 5, S9). LG1 L G2 LG3 L G4 LG5 L G6 LG7 L G8 LG9 L G10 L G11 L G12 LG13 LG14 LG15 LG16 The overall performance increases when the reduced assays are combined (Tables 5, S9; Figures 4, S4). The best result is obtained for the combination of M1, M3 and M4, which represents a total of 115 highly informative SNPs distributed across the 16 chromosomes.

M1 (37 SNPs
However, the combination of M3 and M4, with only 78 SNPs, was nearly as good (Table 5). In summary, while there is an increment in the overall performance when combining M1, M3 and M4, their individual use still provides robust estimates of C-lineage introgression into A. m. iberiensis.

| D ISCUSS I ON
Developing cost-effective molecular tools for accurate estimation of introgression in A. mellifera is increasingly important as commercial strains (mostly of C-lineage ancestry) are threatening native genetic diversity in many regions throughout Europe (Bertrand et al., 2015;De la Rúa et al., 2009;Jensen et al., 2005;Parejo et al., 2016;Pinto et al., 2014;Soland-Reckeweg et al., 2009). In the postgenomics era, rapid innovations in high-throughput sequencing technologies make it possible to construct extensive whole-genome data sets, especially in model organisms with small genomes like the honeybee (Weinstock et al., 2006). However, while whole-genome sequencing is increasingly inexpensive (~200 €/honeybee), it is still not affordable for conservation management applications. Furthermore, the processing of the large amounts of data generated by wholegenome sequencing requires bioinformatics expertise and powerful computational resources typically not available to state entities or conservation centres. Whole-genome sequences, however, can be used to generate baseline data for developing robust molecular

| Effect of sampling bias on the number of fixed SNPs
Considering the long-standing problem of ascertainment bias during discovery and selection of informative SNPs (Albrechtsen, Nielsen, & Nielsen, 2010, and   and demography (Morin, Luikart, Wayne, & Grp, 2004;Wakeley, Nielsen, Liu-Cordero, & Ardlie, 2001). Accordingly, we assured a sufficiently large and representative sample of the A. m. iberiensis diversity, which covers the Iberian cline, for developing accurate reduced assays while at the same time leaving independent holdout samples for validation.

| Genomic information of the highly informative SNPs
A large number of SNPs (18,272) were fixed between A. m. iberiensis and C-lineage subspecies. This was an expected result because M and C are the most divergent of the four lineages (Wallberg et al., 2014). The top enriched GO terms of the genes marked by those SNPs were associated with numerous genes related to regulation of expression, which is essential for the versatility and adaptability of a species for short-and long-term environmental changes (López-Maury, Marguerat, & Bahler, 2008). This is consistent with the complex evolutionary history of A. mellifera and its numerous subspecies, which have adapted to the diversity of habitats and climates in its large distributional range (Harpur et al., 2014;Wallberg et al., 2014).

| Assay design and validation
Having a large number of fixed SNPs is an enormous advantage when designing reduced SNP assays, as they represent ideal ancestry informative markers (Rosenberg, Li, Ward, & Pritchard, 2003 Macaronesia, which are threatened by human-mediated gene flow (De la Rúa et al., 2001Jensen et al., 2005;Miguel et al., 2015;Muñoz et al., 2014;Pinto et al., 2014), there is very limited introgression in A. m. iberiensis populations of Iberia . Therefore, it is crucial to monitor Iberian populations, before gene complexes shaped by natural selection over evolutionary time are irretrievably lost. Here, we took advantage of whole-genome sequence data, which provided millions of SNPs, to design highly powerful assays containing a low number of SNPs capable of estimating C-lineage introgression into A. m. iberiensis with a high level of accuracy. We recommend the combination of the best two (78 SNPs) or three (115 SNPs) reduced SNP assays, although one assay can also be used when there are budget constraints. These assays can be used to estimate C-lineage introgression not only in the native range of A. m. iberiensis in Iberia but also in the introduced range in the archipelagos of Baleares and Macaronesia, and in South America.
This study provides a powerful set of tools to safeguard a unique legacy of honeybee diversity for future generations. While these tools can only be applied to honeybees, the approach demonstrated herein (from testing the effect of sampling bias to the intricate steps involved in the design of the reduced SNP assays) is of high general value in a wide range of scenarios for the conservation of potentially hybridized domestic and wildlife populations.

ACK N OWLED G EM ENTS
We thank numerous researchers, beekeepers and beekeeping as-