Complex population structure and haplotype patterns in the Western European honey bee from sequencing a large panel of haploid drones

Abstract Honey bee subspecies originate from specific geographical areas in Africa, Europe and the Middle East, and beekeepers interested in specific phenotypes have imported genetic material to regions outside of the bees' original range for use either in pure lines or controlled crosses. Moreover, imported drones are present in the environment and mate naturally with queens from the local subspecies. The resulting admixture complicates population genetics analyses, and population stratification can be a major problem for association studies. To better understand Western European honey bee populations, we produced a whole genome sequence and single nucleotide polymorphism (SNP) genotype data set from 870 haploid drones and demonstrate its utility for the identification of nine genetic backgrounds and various degrees of admixture in a subset of 629 samples. Five backgrounds identified correspond to subspecies, two to isolated populations on islands and two to managed populations. We also highlight several large haplotype blocks, some of which coincide with the position of centromeres. The largest is 3.6 Mb long and represents 21% of chromosome 11, with two major haplotypes corresponding to the two dominant genetic backgrounds identified. This large naturally phased data set is available as a single vcf file that can now serve as a reference for subsequent populations genomics studies in the honey bee, such as (i) selecting individuals of verified homogeneous genetic backgrounds as references, (ii) imputing genotypes from a lower‐density data set generated by an SNP‐chip or by low‐pass sequencing, or (iii) selecting SNPs compatible with the requirements of genotyping chips.

Following the extensive imports of queens from "exotic" subspecies, the genetic makeup of honey bee populations in France became complex, and genetic pollution of local populations had clear phenotypic consequences such as changes in the colour of the cuticle (Cornuet et al., 1986). The increasing admixture of divergent honey bee subspecies has fostered conservationists to protect the native genetic diversity of regional ecotypes, such as A. m. iberiensis in Spain and Portugal, A. m. ligustica and A. m. siciliana in Italy (Fontana et al., 2018), and A. m. mellifera in France, Scotland and Switzerland amongst other places (De la Rúa et al., 2009;Fontana et al., 2018;Hassett et al., 2018;Parejo et al., 2016;Pinto et al., 2014). As a result of the different breeding practices, the necessity arose for a study targeted towards A. m. mellifera conservatories and French bee breeders specialized in rearing and selling queens. In this context, the genomic diversity project "SeqApiPop" emerged. Within this project, samples from French conservatories, from individual French breeders and from breeders' organizations were analysed, including Buckfast samples.
Traditionally, wide diversity studies have been performed using a small number of molecular markers such as microsatellites (Techer et al., 2015) or limited sets of single-nucleotide polymorphisms (SNPs) (Henriques et al., 2018a;Parejo et al., 2018;Whitfield et al., 2006;Zayed & Whitfield, 2008), enabling population stratification, introgression and admixture levels to be characterized. However, to understand complex population admixture events, as has occurred for the managed honey bee populations in France and elsewhere, or to identify signatures of natural (Harpur et al., 2014;Henriques et al., 2018b;Parejo et al., 2020;Zayed & Whitfield, 2008) or artificial (Parejo et al., 2017;Wragg et al., 2016) selection in the genome, a much higher density of markers is required. As no high-density SNP chip was available for the honey bee at the onset of the project, and as the honey bee genome is very small compared to most animal genomes, being only 226.5 Mb long , we used a with two major haplotypes corresponding to the two dominant genetic backgrounds identified. This large naturally phased data set is available as a single vcf file that can now serve as a reference for subsequent populations genomics studies in the honey bee, such as (i) selecting individuals of verified homogeneous genetic backgrounds as references, (ii) imputing genotypes from a lower-density data set generated by an SNP-chip or by low-pass sequencing, or (iii) selecting SNPs compatible with the requirements of genotyping chips.

K E Y W O R D S
genome, haplotype, honey bee, population genetics, SNP whole-genome sequencing approach (Harpur et al., 2014;Wallberg et al., 2014). Although the sequencing of honey bee workers has proven successful for detecting signatures of selection or admixture events (Christmas et al., 2018;Dogantzis et al., 2021;Harpur et al., 2014;Wallberg et al., 2014;Wragg et al., 2018), analysing haploid drones allows sequencing at a lower depth and with greater accuracy in variant detection, as demonstrated by studies on a limited number of samples (Henriques, et al., 2018b;Parejo et al., 2016;Wragg et al., 2016). An additional advantage of sequencing haploids is that the alleles are phased, which is invaluable for studies investigating genome dynamics such as recombination hotspots and haplotype structure. Although some insights into recombination patterns in the honey bee have been made through the analysis of drones from individual colonies (Kawakami et al., 2019;Liu et al., 2015) and linkage disequilibrium (LD)-based approaches (Jones et al., 2019;Wallberg et al., 2015), a deep understanding of the recombination landscape, essential for fine-scale genetic analyses, requires hundreds of phased genomes. Such "HapMap" projects have been conducted in humans and cattle, initially using SNP arrays (Bansal et al., 2007;Bovine HapMap Consortium et al., 2009) and more recently by wholegenome sequencing as in the "1000 genome" projects (Chaisson et al., 2015;Sudmant et al., 2015).
Therefore, as the first step towards a deeper understanding of French and Western European managed honey bee populations and of their genome dynamics, and also to provide a large data set of phased genotypes from sequences aligned to the latest Amel_ HAv3.1 genome assembly , we undertook the extensive sequencing of haploid drones. These data comprised samples from French conservatories and commercial breeders in addition to samples from several European countries each representing potentially pure A. m. ligustica, A. m. carnica, A. m. mellifera and A. m. caucasia populations typically imported by French breeders. Finally, A. m. iberiensis, the Iberian subspecies only separated from the native French A. m. mellifera by the natural barrier of the Pyrenees was also studied. In total, 870 samples were sequenced for SNP detection and 629 were used for a detailed genetic analysis of present-day honey bee populations in France. The results are publicly available as a phased vcf file for high-quality SNP markers carefully filtered against sequencing and mapping artefacts, which can be used for imputation or for selecting already genotyped reference individuals to be used in future studies.

| Sampling and sequencing
For the population genomics analyses, one individual drone per colony was sampled before emergence, from colonies throughout France, Spain, Germany, Switzerland, Italy, the UK, Slovenia, Poland, Denmark and China, and from a French beekeeper having imported queens from Georgia, amounting to a total of 642 samples (Figure S1, Table S1; Table 1). The robustness of the primary SNP detection by genotypegvcfs (see below) and of the filtering steps on mapping and genotype quality metrics estimated across samples were improved by increasing the size of the data set for these technical steps: a further 30 colony replicate samples, which had been collected from colonies already sampled for this study, in addition to 198 samples of similar genetic backgrounds from two other ongoing projects. Thus, although 642 colonies were included for population genomics analyses, in total 870 samples were used for SNP detection (Table S1).
DNA was extracted from the thorax of adult bees or from pupae as described in Wragg et al. (2016). Briefly, drones were sampled at either the pupae/nymph or larval stage and stored in absolute ethanol at −20°C. DNA was extracted from the thorax or from diced whole larvae. Tissue fragments were first incubated for 3 h at 56°C in 1 ml of a solution containing 4 M urea, 10 mM Tris-HCl pH 8, 300 mM NaCl, 1% SDS, 10 mM EDTA and 0.25 mg proteinase K, after which 0.25 mg proteinase K was added and incubated overnight at 37°C. Four hundred microlitres of a saturated NaCl solution was added to the incubation, which was then gently mixed and centrifuged for 30 min at 15,000 g. The supernatant was treated for 5 min at room temperature with RNAse (Qiagen) and then centrifuged again, after which the DNA in the supernatant was precipitated with absolute ethanol and resuspended in 100 μl TE 10/0.

| Mapping and genotype calling
Sequencing reads were mapped to the reference genome Amel_ HAv3.1  using bwa-mem (version 0.7.15) (Li, 2013), and duplicates marked with picard (version 2.18.2; http:// broad insti tute.github.io/picar d/). Libraries that were sequenced in multiple runs were merged with samtools (version 1.8) (Li et al., 2009) prior to marking duplicates. Local realignment and base quality score recalibration (BQSR) were performed using gatk (version 4.1.2.) (McKenna et al., 2010), using SNPs called with gatk haplotypecaller as covariates for BQSR. Each drone was independently processed with the pipeline and genotyped independently with haplotypecaller. Although the drones sequenced are haploid, variant calling was performed using a diploid model to allow the detection and removal of SNPs for which heterozygous genotypes are called in >1% of samples, and that might have arisen for example as a result of short-tandem repeats (STRs) or could highlight copy number variants (CNVs) in the genome. Individual gVCF files were combined with combinegvcfs and then jointly genotyped with genotypegvcfs, resulting in a single VCF file for the 870 samples containing 14,990,574 raw variants. After removing Indels with gatk selectvariants, 10,601,454 SNPs remained. Sequencing depth was estimated using mosdepth (Pedersen & Quinlan, 2018). Further details are given in https:// github.com/avign al5/SeqAp iPop/blob/v1.5/SeqAp iPop_1_Mappi ngCal ling.md.

| Quality filters on SNPs
The first run of filters concerns technical issues related to the sequencing and alignment steps and was therefore used for the total data set of 870 samples, to benefit from its larger size for SNP detection and validation ( Figure S2). These filters included a Regional geographical origins in France are indicated by their administrative "département" (see Figure S1 for locations on the map and further information on the populations). The royal jelly samples come from several French beekeepers in Moselle, Alpes maritimes, Bouches-du-Rhône and Isère exchanging genetic material within the GPGR breeders' organization. Haplotype blocks were detected with plink (version 1.9) (Chang et al., 2015) using the blocks function, "--blocks no-pheno-req no-small-max-span," with the parameter "--blocks-max-kb 5000." LD pruning was performed with plink using the indep-pairwise md. Analyses of population migration was performed with treemix (Pickrell & Pritchard, 2012), with the option for grouping SNPs set to −k = 500, testing between 0 and 9 migrations and performing 100 runs per migration with a unique random seed. The optimum number of migrations was estimated with the r package optm (Fitak, R. R.: https://CRAN.R-proje ct.org/packa ge=OptM) using the Evanno method provided (Evanno et al., 2005). Tree summaries for the 100 runs per migration tested were performed with dendropy (Sukumaran & Holder, 2010) (Maples et al., 2013). Three main genetic backgrounds were considered for this analysis, corresponding to the three major groups highlighted in the PCA. Reference samples were selected as having >95% pure background. Although most diploid data were removed and the data were already phased, shapeit Version 2.904 (Delaneau et al., 2012) was run to format the vcf files for rfmix. rfmix was run using genetic maps generated from the data of Liu et al. (2015).

| Sequencing and genotyping
Sequencing of the honey bee drones for the SeqApiPop diversity project began in 2014 on Illumina HiSeq instruments and some of the first samples had such low coverage that a second run (or even three in the case of OUE8) of sequencing was performed. For these samples, the resulting BAM files were merged prior to variant calling. Only four samples of the diversity project were sequenced on Novaseq instruments, for which higher sequencing depths were achieved. Therefore, to improve the robustness of the SNP detection pipeline, we included drone genome sequences from other ongoing projects using the same genetic types, which had the advantage of all being produced with a Novaseq instruments and at a higher depth. Samples sequenced with the HiSeq and NovaSeq instruments had mean sequencing depths of 12.5 ± 6.1 and 33.5 ± 10.2 respectively (Table S1, Figures S3 and S4).
Genotyping the whole data set of 870 drones with the gatk pipeline allowed the detection of 10,601,454 raw SNPs ( Figure S2). Although a filter on genotyping rate ≥95% was applied in the primary filtering steps, the final filter on heterozygote calls was set to keep SNPs with up to 1% of heterozygote samples, and these remaining heterozygous genotypes were set to missing ( Figure S2). After this, a final filter on missing data in samples was applied and 15 samples were removed due to the fraction of missing genotypes exceeding 10%. The final diversity data set comprised 629 drones (Table S1) and 7,012,891 SNPs, and was used for all subsequent analyses unless stated otherwise.  Figure S8).

| Contribution of SNPs to the variance in PCAs: detection of large haplotype blocks
PC3 represents 1.2% of the variance and the remaining 626 principal components each represent 0.7% or less. When looking at the individual contributions of SNPs to the variance, we can see that only a very small proportion of the ~7 million markers contribute significantly to PC1 (red lines on Figure S9) and that this proportion is even much smaller for PCs 2 and 3. Two reasons for such a limited contribution to the variance of the majority of markers is the low informativity of markers of low minor allele frequency (MAF) and the redundancy of markers that are in strong LD. Therefore, to thin the data set, we tested the effect of several MAF filters and chose the most pertinent one for subsequent testing of various LD pruning values. The effects of these filters were estimated by inspecting the contributions of the SNPs to the principal components.
The MAF filters tested showed clearly that data sets containing only SNPs with MAF >0.01 or MAF >0.05 are sufficient to allow a higher proportion of markers contributing to the PCs, with a notable increase of SNPs contributing to PC2 and PC3 ( Figure S9). To avoid losing too many potential population-specific markers present at low frequency in the data, we chose to use the lowest MAF threshold tested, leaving a data set of 3,285,296 SNPs having MAF and S11). Such observations suggest the existence of large haplotype blocks driving differentiation along principal components, in particular PC1. To explore this further we compared these genomic regions to the haplotype blocks detected with plink (Table S2) revealing significant overlap by visual inspection ( Figure S12). The largest of these blocks spans 3.6 Mb, representing 21% of chromosome 11 and close to 1.6% of the honey bee genome size ( Figure 1c). Four other blocks on chromosomes 4, 7 and 9 are larger than 1 Mb (Figures S10-S12, Table S2).

| LD filtering
Population structure and admixture analyses rely largely on the assumption that markers along the genome are independent. Indeed, markers in strong LD such as those in haplotype blocks can influence genetic structure. Therefore, we sought to investigate the impact of LD pruning on inferences of population structure. The number of SNPs used in a window for LD pruning was determined such that most windows would correspond to a physical size of 100 kb. To achieve this, we used the mode of the distribution of the number of SNPs in 100-kb bins, which is 1749 for the data set of 3,285,296 SNPs with MAF >0.01 (Figures S13 and S14). LD pruning was thus performed with a window size of 1749 SNPs and 175-bp (10%) overlap and various values were tested, spanning between LD .1 < r 2 ≤ .9.
These various thresholds show that with LD r 2 ≤ .3 the global structure of the data set is altered, with the A. m. iberiensis population as the major contributor to PC2 and A. m. caucasia being a separate population only in PC3 ( Figures S15A,B), whereas with LD r 2 > .3, the contributions to the variance in PC1 is not as widely distributed amongst subspecies ( Figure S16). The effect of LD pruning on the haplotype blocks is drastic, with the few SNPs retained having a distribution of their contributions to the variance in PC1 and PC2 similar to that of the rest of the genome ( Figure S17). After pruning for LD r 2 < .3, 601,945 SNPs were left in the data set, which were subsequently used in the analysis of population structure.

| Analysis of population structure
The PCA revealed a distinct population structure within the data. For instance, some populations from French breeding organizations, such as the royal jelly breeders' organization (GPGR:

| Migrations between populations
Due to the commercial interest expressed by beekeepers for the Buckfast bees and the peculiar genetic composition observed in the Corsican population, we performed a population migration analysis with treemix (Pickrell & Pritchard, 2012). All samples having more than 80% ancestry from one of the nine backgrounds detected in F I G U R E 3 Admixture analysis. (a) Estimation of cross-validation (CV) error for 50 runs of admixture for 3 ≤ K ≤ 16. (b) Major modes and modes with the lowest mean CV error for admixture runs. For each value of K ranging between 2 and 12, Q matrices from admixture runs were grouped by similarity in modes by using the pong software (Behr et al., 2016). Blue: number of runs in the major mode; orange: number of runs in the major or minor mode having the lowest mean CV value. Amongst the values of K having the lowest CV values from admixture runs, K = 9 stands out as having a major mode containing 33 runs out of 50 ( Figure S19), which is also the mode having the lowest mean CV value from the admixture runs. For other values of K, such as 4, 6, 7 and 8, the major modes do not have the lowest mean CV values. (c) Admixture plots for all 629 samples for K = 3 (major mode containing 49 out of 50 runs) and K = 9 (major mode containing 33 out of 50 runs). Reference populations on the left have a colour code under the admixture plot that recapitulates their colour on the PCA plots of Figure 2; other populations are indicated with alternating grey and white colours   the admixture analysis were selected from one of the K = 9 major mode Q-matrices (Table S4), and the list supplemented with the 43 Corsican samples, making our data set composed of 10 representative groups for the European populations.
Estimations on the number of migrations (m) between the populations in the data set, based on the Evanno method (Evanno et al., 2005), return a mode of m = 1, strongly suggesting a single migration, and a relatively high ∆m value for m = 2 supports the existence of a second migration. The ∆m values for three or more migrations are close to zero, suggesting that more than two migrations between populations in the data set are unlikely (Figure 5a).

| Haplotype conservation in the admixed populations
To investigate further the haplotype blocks detected, we performed a local ancestry inference on our data set with rfmix. Reference samples were selected as bees having >95% ancestry for a given background following the admixture analysis at K = 3 (Figure 3c), resulting in 131 samples for group 1, 148 for group 2 and 17 for group 3, while the remaining 333 samples formed the query data set. To perform the local ancestry inference, we constructed a genetic map from crossovers identified in the sequence data of 43 males from three colonies (Liu et al., 2015) aligned In other countries, reference samples are all grouped together, unless two genetic types were sampled (e.g., Switzerland). Colours in the pie charts correspond to the backgrounds found in the admixture analysis for K = 9, as presented in Figure 3. Reference populations for the five subspecies are indicated in italics. Two Buckfast populations in France and Switzerland are indicated, as the four breeders from the Royal Jelly breeders' organization (GPGR: Groupement des Producteurs de Gelée Royale) having provided samples is 596,047 bp long. However, on investigating a possible relationship between haplotype blocks and gene sizes in the honey bee genome no obvious association could be found (data not shown).

| SNP detection in a haploid data set
Our complete data set of haploid drones is composed of 870 sam-   (Table S5). To investigate the effects of the high SNP density found on the honey bee genome on the potential quality of the genotyping results using the chip, we looked for additional SNPs close to each target SNP that could interfere with the genotyping process and cause allele dropout.
Indeed, additional variation in the probe sequence within 50 bp flanking the target SNP will interfere with the hybridization-based genotyping (Gershoni et al., 2022)  Conversely, for lower density chips, the spacing of markers can be optimized by taking the haplotype structure into account, thus avoiding redundancy while maintaining the highest possible level of genetic information. Another advantage of sequencing haploid samples is that the whole data set represents phased chromosomes. Notably, the present data set will be invaluable for genotype imputation in future studies using lower density genotyping, such as SNP chips or low-pass sequencing (Gilly et al., 2019;Li et al., 2021;Wasik et al., 2021).
Although other studies have used whole genome sequencing for honey bee population genomics, these either concerned workers, thus complicating phasing (Dogantzis et al., 2021;Wallberg et al., 2014), and/or were more limited in terms of the number of samples studied (Henriques, Wallberg, et al., 2018b;Parejo et al., 2016Parejo et al., , 2020. Moreover, to make practical use of these published results, the raw reads need to be downloaded from the short-read archive (SRA) and aligned to the genome reference prior to detection of variants. This is a highly demanding task, in terms of both labour and computing time. In contrast, our data set is much larger, is naturally phased, and genotypes for the samples and markers can be directly selected from the vcf file provided.
Moreover, our data are based on the latest version amel_hav3.1 of the genome assembly , whereas most other data sets are not, including one of the latest published, which is based on the older and less complete Amel4.5 assembly (Dogantzis et al., 2021). We believe our careful filtering steps on sequencing and alignment metrics provide a reliable set of markers and that the selection of reference samples can be done on the basis of the uniformity of genetic backgrounds, for instance by filtering on the admixture Q matrixes provided in Tables S6 and S7 on a userdefined basis.

| Population structure in managed honey bees
The deep understanding of European honey bee populations and of their recent admixture via imports of genetic stocks by breeders is not a simple task. Analyses of admixture events in complex population structures can be sensitive to a number of parameters and sometimes yield misleading results, especially if one or several populations have gone through a recent bottleneck (Lawson et al., 2018). PCA on all ~7 million markers indicate that our data set is structured into three main genetic types ( Figure S8). The first principal component, representing 10.8% of the variance, separates two major groups corresponding respectively to subspecies from northwestern (M lineage) and southeastern Europe (C lineage). These two groups are represented by several populations, including the Savoy and Porquerolles conservatories from South-East France on one side, and bees that are not geographically far from Italy or Slovenia on the other. This large genetic distance, despite relatively close geographical proximity of the populations, supports the hypothesis of the colonization of Europe by honey bees via distinct western and eastern routes (Estoup et al., 1995;Han et al., 2012;Ruttner, 1988;Whitfield et al., 2006), and the separation between subspecies due to the Alps forming a natural barrier preventing genetic exchange (Rinderer, 2013). Along the second principal component, representing 3.1% of the variance, the population originating from Apis mellifera caucasia separates from the southeastern European populations ( Figure S8). Prior to investigating admixture, we pruned SNPs in LD taking care to maximize the removal of redundancy while maintaining the general structure of the data (Figure 2; Figures S15-S17).
We explored a range of K numbers of genetic backgrounds, running multiple iterations of each, to determine the most likely admixture pattern (Figure 3). Our results indicate that this approach is necessary to ensure the results from each K model are stable prior to interpretation. We observe from our admixture analyses that CV outliers within a K model are common. For instance, at K = 8, the mode with the lowest CV is only represented by eight out of 50 admixture runs, whereas the major mode has 12 runs. On examining the admixture patterns from these two modes, the major mode suggests the A. m. mellifera bees from conservatories on mainland France to be hybrids between bees from Ouessant and Spain, and moreover with roughly 50% of each genetic background on the same mode, the A. m. iberiensis background represents also 50% of the M lineage background in the bees from Corsica ( Figure S19). This is unlikely given the geography of Western Europe and our knowledge of the history of the bees of Ouessant. Indeed, Ouessant is a very small island (15.6 km 2 ) off the coast of western Brittany, isolated from the rest of the French honey bee population since its installation in 1987 and the prohibition of imports since 1991 mostly for sanitary reasons. In contrast, the mode with the eight runs and lowest CV presents a better separation of A. m. mellifera and A. m. iberiensis, which is also found in the major mode at K = 9 backgrounds. A smaller level of admixture can still be found between A. m. mellifera and A. m. iberiensis, which is quite likely to be due to the shared ancestry between these two subspecies.
The major mode at K = 9, represented by 33 out of 50 runs, returned the lowest mean CV value. This mode identifies mainland France A. m. mellifera samples as having a distinct genetic background and suggests that honey bees from Ouessant may have been re-introduced in the mainland conservatories. This mode also identifies a distinct genetic background in French and Swiss Buckfast bees. Buckfast bees were developed by Brother Adam, and are described in page 14 of "Beekeeping at the Buckfast Abbey" as a cross performed around 1915 between "the leather-coloured Italian bee and the old native English variety" (Brother Adam, 1986). Brother Adam also notes that the Italian bees that were imported in later years were distinct from those used in the development of the Buckfast strain. Our analysis of migrations between populations with treemix suggests that the Buckfasts in our data set were subject to introgression with genetic material from A. m. caucasia (Figure 5b), although the timing of this potential admixture event could not be determined. When the two migrations of A. m. ligustica into Corsica and A. m. caucasia into the Buckfast are considered, which is a likely scenario suggested by the Evanno analysis, the latter is close to A. m. carnica, as seen in (Figure 5b,c). Interestingly, a whole genome sequence study of Italian honey bees also suggest that the Buckfast bees are closer to A. m. carnica than to A. m. ligustica (Minozzi et al., 2021) (Figures 2 and 3). The introgression of Italian bees is confirmed by the treemix migration analysis, and when this is accounted for, the Corsican samples group with A. m. mellifera bees from mainland France instead of being situated between the two main genetic subgroups of western and eastern European bees (Figure 2b,c). This result probably reflects the fact that Italian bees may have been imported on the Island before the ban on imports and that the population has been homogenized since then, at least within the breeders' organization. As beekeepers generally prefer the A. m. ligustica Italian bees over A. m. mellifera, it is very likely that the latter is the original population, as also suggested by Ruttner (1988). Although the hypothesis of the separation of the two subspecies on the mainland by the Alps seems appropriate (Rinderer, 2013), the situation of the  (Ruttner, 1988).
Apart from the subspecies references and the royal jelly populations, the honey bees provided by breeders are largely admixed, exhibiting high variability in background proportions-even for samples sourced from the same region. A typical example is that of the A large proportion of A. m. mellifera background is present and the population is far less homogenous (Figures 2-4). This exemplifies the heterogeneity of the managed populations that can be found in France. A question that needs further investigation is the influence of the mating strategies used by the breeders, such as artificial insemination, mating stations, with drone-producing hives to saturate the environment with the desired genetic strains, or open mating (Cao et al., 2016). Interestingly, in our data set, only three A. m. ligustica and all of the bees from China have some royal jelly genetic background.
Previous analyses on worldwide data sets were published, either by whole genome sequencing of workers (Chen et al., 2022;Dogantzis et al., 2021;Wallberg et al., 2014) or by sequencing the mitochondrial DNA (Tihelka et al., 2020). These were intended to understand worldwide populations, the geographical origins and migration routes of Apis mellifera, a topic still under debate. Our intentions here are different, being targeted towards managed populations, for which detailed knowledge of genetic makeup is essential for further work concerning traits of interest to queen breeders and beekeepers and the interaction between different subspecies present within a territory. Typically, a refined description of admixture and of its distribution along chromosomes is essential to avoid confounding effects in genome-wide association studies (GWAS).

| Large haplotype blocks in the honey bee genome, specific to the M and C lineages
When investigating the contribution of SNPs to variance in the PCA, we noted that several large genomic regions, up to 3.5 Mb long, in which almost all markers contributed very strongly to the first principal component, separate bees from northwestern (M lineage) and southeastern Europe (C lineage). These regions were noted to coincide with haplotype blocks detected with plink. To investigate the matter further, we performed local ancestry inference in the admixed samples with rfmix, using samples exhibiting 95% ancestry for the three main genetic backgrounds as references. A low recombination rate is confirmed by the observation of very few switches between the three main genetic backgrounds within these haplotype blocks.
Interestingly, some of our regions, including the largest one detected on chromosome 11, coincide with regions of low recombination rate detected in other studies. These include an LD map produced with 30 diploid sequences from African worker bees (Wallberg et al., 2015), ancestry inference in an admixed population (Wragg et al., 2018), low-resolution genetic maps produced by RAD or ddRAD sequencing, with microsatellite or SNP markers, ddRAD sequencing (DeLory et al., 2020;Ross et al., 2015), or higher resolution genetic maps produced by whole genome sequencing of European (Liu et al., 2015) and African subspecies (Kawakami et al., 2019).
Most of these regions coincide with the position of the centromeres such as described in the reference genome assembly, which is based primarily on the combination of the location of AvaI repeats, which were previously assigned to centromeres by cytogenetic analysis, and of a low GC content (Beye & Moritz, 1995;Wallberg et al., 2019). However, the AvaI repeats only represent a very small fraction of the centromeric regions described, with the largest one only covering 14 kb , whereas the estimate of the extent of the centromeres, based on a GC content lower than the genome average, is much larger although imprecise and supposes a similar organization as for the AT-rich alpha-satellite repeats in vertebrates, such as human . Although in some cases the boundaries of our regions of low recombination rate coincide with the actual positioning of the centromere on the genome assembly , such as in chromosomes 5 or 8, in other instances, such as in chromosome 12, the region we define is much narrower. Due to the difficulties in interpreting banding patterns in honey bee chromosomes, the position of the centromeres is not well defined. Some evidence based on G-and C-banding suggests there are four metacentric and 12 submetacentric or subtelocentric chromosomes (Hoshiba, 1984), whereas other evidence based on fluorescence in situ hybridization of a centromere probe suggests there are two metacentric, four submetacentric, two subtelocentric and eight telocentric chromosomes (Beye & Moritz, 1994). Our evidence suggests at least six chromosomes that could be telocentric or acrocentric: chromosomes 3, 5, 6, 9, 14 and 15.
Some of the haplotype blocks/regions of low recombination are large, such as representing up to 21% in the case of chromosome 11 ( Figure 6). This may seem a lot, but recent findings in a complete sequencing of the human genome give a similar proportion for chromosomes 9, in which 40 Mb of satellite arrays represent 20% of the chromosome (Nurk et al., 2022). One important difference, however, is that the block on honey bee chromosome 11 contains some genes, except in the central region, whereas the satellite array described on human chromosome 9 does not. This reaffirms that our understanding of the centromere positions in the honey bee chromosomes requires refinement. The specific case of the acrocentric chromosomes in terms of gene content ( Figure S20) seems to compare better to the situation described in humans, as the sequencing of the p-arm of the five human acrocentric chromosomes has allowed the discovery of novel genes within the satellite repeat-containing regions . Centromeric DNA evolves rapidly, suggesting it goes through a genetic conflict known as the centromere drive hypothesis. This is due to the fact that in female meiosis, only one of the resulting chromosomes is included in the oocyte, leading to a competition between centromeres (Rosin & Mellone, 2017). The differentiation of centromeric regions between subspecies we observe here could be in line with this hypothesis.
However, although in most species the rapidly evolving DNA sequence of centromeres is typically composed of highly repeated elements, this is not the case of the haplotype blocks found here, which seem to be associated with their respective centromeres due to a lack of recombination.
Some haplotype blocks may have another origin than centromeric DNA. For instance, genetic divergence could have been maintained by limiting recombination via the presence of structural variants such as inversions. Indeed, two of the blocks described here, between positions 4.0-5.1 and 5.8-6.9 Mb on chromosome 7, seem to coincide at least partially with two regions of haplotype divergence possibly due to inversions, detected between positions 3.9-4.3 and 6.3-7.3 Mb on the same chromosome, in a highland vs. lowland study of East African bees (Christmas et al., 2018). The slight differences in coordinates found between the two studies could be due to the fact that different version of the Amel_HAv3 assembly were used. However, if confirmed, this finding suggests that haplotype blocks differing between M lineage and C lineage bees such as found here might coincide with blocks found in other subspecies in Africa. Another study identifying the thelytoky locus (Th) in the South African Cape honey bee Apis mellifera capensis showed it was in a nonrecombining region over 100 kb long on chromosome 1, although long-read mapping failed to detect any inversion (Aumer et al., 2019).
Given the current hypotheses on the colonization of Europe by honey bees via distinct western and eastern routes (Estoup et al., 1995;Han et al., 2012;Ruttner, 1988;Whitfield et al., 2006), it is not surprising that the haplotype blocks described here, whether or not representing centromeric regions, tend to separate mainly the M and C lineage bees. Further analyses will be necessary to define the centromeric regions more precisely and study their implication, together with the other haplotype blocks, in the subspecies structure of honey bee populations.

| CON CLUS IONS
The sequencing of 870 haploid honey bee drones was shown here to be an invaluable approach for variant detection and for understanding the fine genetic makeup of a complex population having gone through multiple events of admixture. In addition, the extent of regions of extremely low recombination rate could be defined with higher precision than previously. The data set generated, based on the latest genome assembly, is a solid base for future research involving other honey bee populations and for any analyses requiring a reference set for simulations (Eynard et al., 2021), phasing or imputation.

AUTH O R CO NTR I B UTI O N S
YLC, J-PB, BB and AV designed the experiment. BB, YLC and AV coordinated colony selection, and sampling and samples were provided by KB, MB, CC, AG, PK, MP and AP. KC-T, EL and OB performed DNA extraction, library preparation and sequencing. DW, AV, SE and BS performed the bioinformatic analyses and cowrote the manuscript.
All authors read and commented on the final manuscript.

ACK N OWLED G EM ENTS
This work was performed in collaboration with the GeT platform,

CO N FLI C T O F I NTE R E S T S
The authors declare no conflicts of interest.