Automated improvement of stickleback reference genome assemblies with Lep‐Anchor software

We describe an integrative approach to improve contiguity and haploidy of a reference genome assembly and demonstrate its impact with practical examples. With two novel features of Lep‐Anchor software and a combination of dense linkage maps, overlap detection and bridging long reads, we generated an improved assembly of the nine‐spined stickleback (Pungitius pungitius) reference genome. We were able to remove a significant number of haplotypic contigs, detect more genetic variation and improve the contiguity of the genome, especially that of X chromosome. However, improved scaffolding cannot correct for mosaicism of erroneously assembled contigs, demonstrated by a de novo assembly of a 1.6‐Mbp inversion. Qualitatively similar gains were obtained with the genome of three‐spined stickleback (Gasterosteus aculeatus). Since the utility of genome‐wide sequencing data in biological research depends heavily on the quality of the reference genome, the improved and fully automated approach described here should be helpful in refining reference genome assemblies.


| 2167
KIVIKOSKI et al. Sherman & Salzberg, 2020) has shown that creating a linear haploid reference genome for a diploid species is a nontrivial task. Reaching this ideal can be especially challenging in organisms where the genetic variation cannot be reduced in controlled inbreeding designs, and most reference genomes are likely based on reference individuals carrying long alternative haplotypes (Chin et al., 2016;Howe et al., 2013;Stemple, 2013). The presence of homologous haplotypes, that is differing copies of the same genomic region inherited from the two parents, is against the assumptions of the linear reference genome and affects for instance the read mapping. If reads from distinct haplotypes map to different copies of the same region, single nucleotide variants (SNPs) separating the haplotypes cannot be detected and variation is underestimated. This affects various statistics in population genomics and may lead to wrong conclusions in many different contexts, including estimation of substitution rate (Kong et al., 2012), inbreeding (Ceballos et al., 2018) or population history (Roux et al., 2016).
Lep-Anchor software (Rastas, 2020) can improve assembly and scaffolding of even high-quality reference genomes with joint use of linkage map-based genome anchoring, pairwise contig alignment and long-read sequencing data. Performance and utility of Lep-Anchor were demonstrated in its original publication (Rastas, 2020) with empirical and simulated data sets, and gains in assembly quality were reported even with relatively small data sets.
Here, we have a closer look on the actual changes and assess their impact on typical genome analyses. Starting from an existing highquality contig assembly, original PacBio reads and ultra-dense linkage maps for the nine-spined stickleback (Pungitius pungitius), we were able to generate a significantly improved reference genome (ver. 7) using largely automated methods. When evaluating the differences to the published version of the reference genome (ver.6; Varadharajan et al., 2019), we detected haplotypes in three contexts. First, some haplotypes were originally assembled as separate contigs leading to false duplication of a region in the assembly. Second, haplotypes were assembled to the ends of subsequent contigs and occurred as duplicates on both sides of a contig gap. Third, haplotypic regions, exemplified by an inversion in LG19, were assembled as mosaics of the two haplotypes. Using the novel features of Lep-Anchor, we could automatically remove a large proportion of the first two types of haplotypes, while the correction of haplotypes of the last category was possible but demanded manual effort. Recognition and removal of haplotypes shortens the nine-spined stickleback reference genome and increases heterozygosity of the reference individual, while the contig rescaffolding enabled the identification of the centromere in all linkage groups. To demonstrate that this approach works for contig assemblies in general, we reassembled the latest published reference genome of the three-spined stickleback (Gasterosteus aculeatus; Peichel et al., 2017) using one new linkage map and publicly available 10x Genomics linked-read sequencing data (Berner et al., 2019).

| Nine-spined stickleback reference genome refinement
The starting point for this reference was the contig assembly and the genomic DNA sequence data from Varadharajan et al. (2019).
In short, the ver. 6 genome by Varadharajan et al. (2019) was based on de novo assembly of long PacBio reads, polishing with short reads and anchoring with linkage maps. The contig assembly was refined in two places: (1) the mitochondrial genome was reassembled from the short-read Illumina data of the reference individual using the program megAhit (ver. 1.2.9; Li et al., 2015), and (2) a large inversion in LG19 was characterized and the region was reassembled using the combination of programs FALcon Unzip (ver. 0.4.0; Chin et al., 2016), trio Binning (prerelease version; Koren et al., 2018), cAnU (ver. 1.6; Koren et al., 2017) and piLon (ver. 1.22; Walker et al., 2014), all run with their default parameters. The details of these steps are provided in the Methods S1 (see also Figure S1 for a workflow).
A new ultra-high-density linkage map was reconstructed based on crosses of wild-caught marine nine-spined sticklebacks from Helsinki, Finland (60°13′N, 25°11′E). 99 F 1 -generation families were generated at the University of Helsinki fish facility through artificial fertilizations (Rastas et al., 2016). Half-sib families were formed by mating one female to two different males, thinning the families to 25 offspring per family. The larvae were mass-reared in two large aquaria, and their family identity was later identified from the genotype data. The parental fish were whole-genome sequenced (WGS; Illumina Hiseq platforms, BGI Hong Kong) at 5-10X sequencing coverage, and the offspring were genotyped using the DarTseq technology (Diversity Arrays Technology, Pty Ltd). The fastq files were mapped to the contig assembly using BwA-mem (ver. 0.7.15;Li, 2013) and SAmtooLS (ver Li et al., 2009). The genotype likelihoods were called, and the linkage mapping and the pedigree construction were conducted using   (Rastas, 2017). The details of the linkage map reconstruction are provided in the Methods S1 (see also Figure S2).
The resulting contig assembly was anchored using Lep-Anchor (Rastas, 2020) following the standard pipeline (https://sourc eforge. net/p/lep-ancho r/wiki/Home) with default parameters (exception: minQuality=1 for Map2Bed to assign more contigs into chromosomes). For the anchoring, we (1) utilized three original linkage maps  and the newly reconstructed ultra-high-density linkage map concordant with the existing maps; (2) generated contig-contig alignments by running the two first steps of hApLomerger2 (Huang et al., 2017); and (3) incorporated the raw PacBio reads by aligning them to the contig assembly with minimAp2 (ver. 2.17; Li, 2018). Full computer codes for reproducing these analyses and instructions for automated improvement of any reference genome assemblies are available at https://github.com/mikko kivik oski/NSP_V7.

| Contig classification and centromere annotation
In ver. 7, 1644 of the total 2487 contigs were not assigned to any of the 21 linkage groups. We classified the contigs by analysing their sequencing depth (coverage) and repeat content. Illumina and PacBio data (subreads) for the reference individual and for a pool of four female individuals from the same Pyöreälampi pond (Illumina only, see Methods S1 for the details) were mapped and analysed using BwA-mem and minimAp2, respectively, and SAmtooLS. The coverage analysis was carried out using Lep-Anchor's novel modules CoverageAnalyser and CoverageHMM. Using CoverageAnalyser and a simple mixture model, sequencing depth histogram was classified to (about) zero, half, normal or high: half and normal depths were modelled using two normal distributions and the zero and high depth as a zeta distribution (coverage +1 ∼ Zeta, the same distribution was used for both, zero and high). Then, CoverageHMM and a four-state hidden Markov model (HMM) were used to classify each genomic position to four states: zero, half, normal or high. The emission probabilities of the HMM were taken from the mixture model (CoverageAnalyser), and maximum-likelihood transition probabilities along the physical (contig) coordinates were learned using the Baum-Welch expectation-maximization algorithm (Baum, 1972).
All contigs with at least one hit with e-value <10 − 5 were classified as putative centromeric contigs.
Alignments of centromere-associated repeat sequence were used to determine the centromere positions (Table S1, Figures S3 and S4). Within each linkage group, Blast alignments with e-value <10 − 5 were assigned to three groups with k-mean clustering according to their positions. Clusters with less than 10% of the total number of hits were discarded as outliers, and the centromeric region

| Content of LG12 sex chromosome and LG19 inversion
Based on the female and male sequencing coverage, the sex chromosome part (1-25 Mbp) of the ver. 6 LG12 appeared to contain contigs derived from X and Y chromosomes. We aimed to make LG12 haploid and purely X, and to identify differentiated Y-origin haplotypes. To investigate the new assembly of LG12, we joint-

| Quality assessment with variant and synteny analyses
To compare the nine-spined ver. 6 and ver. 7 references, we called autosomal SNPs of the reference individual (FIN-PYO-0). Reads were mapped to both references using BwA-mem, and variants were called with BcFtooLS mpiLeUp. SNPs were pruned with stringent criteria: SNPs within repetitive or unmappable regions, within 20 bp of an indel, of low quality (< 20) or with low (< 30) or high (> 70) depth were discarded. Unmappable regions were determined using the approach of Li (http://lh3lh3.users.sourc eforge.net/snpab le.shtml) and converted to bed format using a script by Schiffels (https://github. com/stsch iff/msmc-tools). SNPs found using ver. 6 were grouped into three categories: (1) found in autosomal linkage groups of ver. 7, (2) locus removed from autosomal linkage groups of ver. 7 or (3) not called with ver. 7. SNPs called using ver. 7 were grouped similarly, but there were two additional groups for SNPs in regions where haplotype copy was removed.
The quality of ver. 6 and ver. 7 was also assessed by comparing their synteny with the three-spined stickleback genome (Peichel et al., 2017). Based on the previously reported large-scale synteny to the three-spined stickleback genome , see also Guo et al. (2013), Rastas et al. (2016)), the homologous linkage groups of nine-and three-spined sticklebacks were aligned with min-imAp2 software. Previous studies (Rastas et al., 2016;Shikano et al., 2013) have shown that LG12 is a fusion chromosome, and it was aligned against the three-spined stickleback linkage groups 7 (1-14 Mbp) and 12. Alignment fragments with less than 5000 matching base pairs were discarded, and the syntenies of the two assemblies with the three-spined stickleback genome were compared by counting the number of changes in the orientation of consecutive fragments. BUSCO completeness of ver. 6 was reported to be very high, containing 97.1% of tested genes as complete BUSCOs (see Table 1 in Varadharajan et al., 2019). Here, we carried out the same analy-

| Three-spined stickleback reference genome refinement
We also tested the performance of Lep-Anchor with the threespined stickleback genome assembly (Peichel et al., 2017). First, a linkage map was constructed with Lep-mAp3 based on the data set of 517 F 1 -offspring from 60 families (30 males, each crossed with two different females) described by Pritchard et al. (2017). The parents were wild caught from the Baltic Sea and artificially crossed (see Leder et al. (2015) and Pritchard et al. (2017) for more details three-spined stickleback contigs were aligned with minimAp2 and included as two copies to Lep-Anchor to increase its weight in the optimization score.
Lacking the short-read data for the reference individual, we called SNPs for a male three-spined stickleback from Paxton Lake benthic population, Canada (Samuk et al., 2017

| RE SULTS
We used Lep-Anchor software and information from linkage map anchoring, pairwise contig alignment and long-read bridging to reassemble the nine-spined stickleback genome. Linkage map anchoring allowed assigning 274 previously unassigned contigs to the linkage groups (LGs), and pairwise contig alignments revealed 10% of the previous assembly as haplotypes (Figure 1a). Of the 843 contigs in linkage groups, Lep-Anchor could assess 763 to be scaffolded in correct orientation. Removal of haplotypes and linking of adjacent contigs reduced the number of contig gaps and more than doubled the N50 contig length as well as increased the number single-copy BUSCO genes (Table 1). With a more accurate representation of the haploid genome, the total length of the reference decreased by 55 Mbp (Table 1). It is noticeable that, with the exception of one linkage map produced here, these improvements were gained with a more efficient use of data generated for the original assembly. In addition to automated improvements with Lep-Anchor, we assembled and incorporated the native mitochondrial genome and used additional data from related individuals to characterize and reassemble a large inversion in LG19.
The improved assembly brings noticeable gains, and we could now successfully map the centromere-associated repeat and unambiguously identify the centromere positions in all linkage groups [previously missing from LG1 and LG16, and incoherent in LG10 and LG14, Figure S4; see Varadharajan et al. (2019)]. Removal of haplotypes and other changes in the genome assembly affects read mapping and SNP calling. More even read depth and the anticipated mean depth indicate that the reference has become more haploid and contains fewer haplotype copies ( Figure S7). In comparison with the ver. 6, the heterozygosity of the reference individual increased by 14% (Table 2), illustrating how the variation is concentrated in few regions and how these variable regions then get assembled as  Table 2). New SNPs in other regions were a minority, and their allelic depth deviated from the expected ( Figure S8).
Content of the LG12 changed considerably from ver. 6 to ver. 7 as one of the homologous copies in X and Y chromosomes were removed ( Figure 1a). As a result, the sex chromosome part of LG12 is close to a haploid representation and few regions show zero read depth (Figure 2a). This also increases the heterozygosity of the male reference individual (Figure 2a), the newly identified SNPs arising from differences between X and Y chromosomes, while no increase is observed in females (Figure 2a). Although homologous sequences are represented only once, the sex chromosome is still a mosaic of X and Y chromosomes and females show both homozygous variant and reference alleles (Figure 2a). An HMM analysis confirmed the mosaicism and indicated the sex chromosome assembly to be 57% of X chromosome ( Figure S9). Despite the mosaicism, the reassembly improved the synteny of LG12 with the three-spined stickleback counterparts (Figure 2b).
Scaffolding with Lep-Anchor had a minor impact on the variant allele frequencies in the LG19 inversion region (Figure 3a). The reason for this is that the original contigs were mosaics of the two alleles and that an improved ordering of contigs does not correct for their internal errors. The newly assembled contigs and the scaffolded alleles for the LG19 block revealed that there, indeed, are two segregating inversion haplotypes in the study population, and that the reference individual (see Methods) is heterozygous ( Figure 3a). As expected, the variant allele frequencies across the newly assembled haplotypes are either zero or twice as high as with the original mosaic assembly for individuals homozygous for the two alleles ( Figure 3a). Although the mosaicism had a large impact on variant allele frequencies, its effect on SNP frequency was small. There are more SNPs according to ver. 7, but most of them are found due to haplotype removal and few of them are in the inversion region. With the HMM and data from the four females, we found four observable large regions that indicate fine-scale mosaic of two diverged haplotypes (Figure 3b, see also Figure S10 and Table S2). All these four regions were identified in both genome versions which suggests that the corresponding contigs are erroneously assembled as in LG19. Comparable data were not available for three-spined stickleback.

F I G U R E 2
We constructed a contig assembly by partitioning the full-length sequence (Peichel et al., 2017) at long runs of N's, and constructed a linkage map for a distantly related population (Pritchard et al., 2017).
In the absence of long-read data, we bridged the contigs of the original assembly using scaffolds of a 10x Genomics assembly ( Table S2 and Figure S10 for the genomic coordinates and posterior likelihoods) in a sample from Paxton Lake, Canada, than were found using the original assembly (Table 2). Although the background heterozygosity of this individual was orders of magnitude higher than in our ninespined stickleback reference individual, most of the newly identified SNPs (52%) were in regions where haplotype variant was removed from the reference genome. Whereas the median sequencing depth of the sample was 15X for both genome versions, the depth for the identified haplotype regions was 9X and 15X in the published and new assembly, respectively, indicating successful haplotype removal.

| DISCUSS ION
Reconstructing a linear reference genome is a challenging, yet an instrumental task. Interpretation of genomic data is often made with the assumption that the reference genome is a complete haploid representation of the actual genome. The errors in the genome directly affect the conclusions drawn, and for instance, missing SNPs influence the site frequency spectrum that is essential in demographic analyses (Han et al., 2014). More directly, the presence of haplotype copies in a reference genome can make a highly diverged region seem exceptionally conserved and can thus seriously mislead variation-based functional analyses. Given the severe consequences of the errors, efforts to improve reference genomes are needed, and here, we have described an approach to make reference genomes more haploid and more contiguous using the Lep-Anchor software (Rastas, 2020).
Faced with the dilemma of correctly separating duplicated genome regions while simultaneously collapsing and merging haplotypic differences into a haploid sequence, all assembly programmes are poised to make errors. The magnitude of these errors depends on the heterozygosity of the reference individual and on the type of input data, long reads spanning more distant sites and thus capable of creating longer haplotype blocks, while the direction of the bias to either too long or too short genome depends on the algorithm. While the three-spined stickleback genome is based on relatively old data and is established over years of refinement, the nine-spined stickleback genome is an example of a modern reference genome built using the best practices. We demonstrated our method's potential by showing how the latter, an already very high-quality reference genome, could be greatly improved by more efficient use of the original sequencing and mapping data (Figure 1, Table 1). Improvements were based on linking, reassembly and improved scaffolding of the contigs with joint use of linkage map anchoring and long-read sequencing data, as well as characterization and removal of alternative haplotypes.
The improvements on the three-spined stickleback genome were more modest, but we could still both add new contigs into the linkage groups and remove haplotype copies (Table S3, Figure   S11), resulting in an 0.62% increase in number of segregating sites in a sample from the Paxton Lake benthic population (Table 2).
We anticipate that the more modest changes in comparison with the nine-spined were due to the absence of long reads and lower number of linkage map markers per contig in the three-spined data: 4.2 and 1.7 markers on average per contig in the nine and three-spined stickleback, respectively. While the three-spined stickleback analyses demonstrate that Lep-Anchor can improve even highly polished assemblies, they also illustrate how various data types, for example contigs from the 10x Genomics platform, can be incorporated in genome refinement.
In the nine-spined stickleback, most of the removed haplotypes were among the unassigned contigs and only one contig was moved between two linkage groups (Figure 1a), underlining the high quality of the original scaffold. Although we were not able to place all contigs in the linkage groups, we were able to divide them in putative classes based on the read depth and their repeat content, those with high repeat content (either centromere or other) forming the largest groups of unassigned contigs. Although repetitive regions are difficult to assemble and scaffold using the type of data available, we were able to improve the centromeric regions ( Figure S4) and our approach can be useful for repetitive regions more generally. Some unassigned contigs had low or even zero read depth, but as we did not detect any obvious contamination when aligning them to the NCBI database, those were retained in the reference genome.
Removal of haplotypes leads to the identification of ca. 14% more autosomal SNPs in the nine-spined stickleback reference individual (Table 2). Finding more SNPs per se is not evidence for better assembly, and removal of true paralogous regions could lead to incorrect increase in SNP numbers. However, together with more uniform sequencing depth ( Figure S7), strong evidence of successfully identified haplotypes (Figure 1b-c) and higher number of single-copy BUSCOs (and lower number of duplicated BUSCOs, Table 1, see also Table S4), our results show that genetic variability can be underestimated if the reference genome contains haplotypes. One should note, though, that our reference comes from a very small population and has extremely low background heterozygosity. Haplotypes, by definition, require variation between the copies, and in our reference individual, an exceptionally large proportion of the variation is concentrated within a small number of regions. The three-spined stickleback individual studied here had two orders of magnitude higher heterozygosity and, although the absolute numbers were larger, the relative impact of the reassembly on the SNP numbers was much smaller ( Table 2). The minority of newly identified SNPs that were not within haplotype regions (22% of the novel SNPs in the nine-spined stickleback) may have emerged because of short similarities between contigs that were not classified as haplotypes. They may also be related to changes in mapping of the read pairs in regions where haplotype copies have been removed or contig orientation or order has changed.
Nonetheless, the evidence for some of those SNPs is questionable as their allelic depth deviates from the expected (i.e. 0.5; Figure S8) and one may want to filter them from downstream analyses.
Nine-spined stickleback LG12 is formed by fusion of chromosomal segments that correspond to chromosomes 7 and 12 of the three-spined stickleback (Figure 2b; Shikano et al., 2013). This rearrangement has occurred after the split of the three-spined and the nine-spined sticklebacks 17 million years ago , but the exact timing is unclear (Shikano et al., 2013). While 15 Mbp in one end of LG12 behaves like an autosomal chromosome, the 17 Mbp (25 Mbp in ver. 6) in the other end contains the sex-determination region and behaves like a sex chromosome (Figure 2a). While parts of the sex chromosome region seem very similar, other parts have differentiated significantly, and assembling complete X and Y chromosomes based on a single male reference individual is extremely challenging. Although our HMM analysis indicated that the LG12 assembled here is only 57% of X ( Figure S9), we are confident that the sequence content of the current version is close to haploid presentation of X and the error is mainly in the SNP polarization. This is supported by the improved synteny with the three-spined stickleback genome ( Figure 2b) but especially by the more uniform read depth and more constant nucleotide diversity across the whole LG12 in females ( Figure 2a). The original sequencing data for the nine-spined stickleback reference are slightly outdated by modern standards, and we did not attempt to scaffold both X and Y copies of LG12. On the other hand, the characterization of the inversion haplotypes provides an example of how trioBinning (Koren et al., 2018) can be utilized without a trio and long-read sequencing data combined with population-level, whole-genome sequencing data allow assembling the segregating haplotypes. We acknowledge that our HMM for identifying regions of diverged haplotypes provides only indicative results (Figure 3b), but it does suggest that haplotypes can be fairly common in the nine-spined stickleback, which is in line with findings regarding other fish (Stemple, 2013) and humans (Sudmant et al., 2015). We also anticipate that highly concentrated alternation between two homozygous genotypes is a usable statistic for exploration and more sophisticated detection methods based on that could be devised. The identification of such regions requires the studied individual to be heterozygous, and therefore, all regions were not supported by all individuals. Having a single continuous haplotype, such as the inversion in LG19, in the reference genome correctly phases the alternative alleles ( Figure 3a) and allows studying the differences between the haplotypes. However, representation of potential structural differences is difficult and it is evident that methodological work to incorporate multiple haplotypes in a reference genome, for example using variation graph data structures, is urgently needed (Paten et al., 2017).
Haploid reference genomes based on a single individual, such as the one here, represent only one version of the species' genome, which may cause reference bias and thus affect various downstream analyses and the conclusions drawn from them (Ballouz et al., 2019;Paten et al., 2017). Although a linear reference does not represent the full species diversity, they are widely used and provide a starting point for analysis of genomic variation between individuals and populations. In the future, pan-genome representations and graphbased algorithms will likely change the way reference genomes are represented and analysed (Paten et al., 2017;Sherman & Salzberg, 2020). Since linear genomes are still widely used, their improvements are relevant and our work demonstrates that significant enhancements can be obtained with efficient use of the existing data.
Moreover, characterization of haplotypes is instrumental in more inclusive genome representations, increasing the relevance of our approach.

CO N FLI C T O F I NTE R E S T
The authors declare no competing interests.

AUTH O R CO NTR I B UTI O N S
J.M. directed and organized the data collection. P.R. generated linkage maps and assembled reference genomes. M.K., A.L. and P.R.
carried out the data analysis and interpretation of the results. M.K.
was responsible for structuring and compiling the manuscript. The manuscript was written and edited by all authors.

DATA AVA I L A B I L I T Y S TAT E M E N T
Sequence data that support the findings of this study have been de-