Evaluating the effect of reference genome divergence on the analysis of empirical RADseq datasets

Abstract The advent of high‐throughput sequencing (HTS) has made genomic‐level analyses feasible for nonmodel organisms. A critical step of many HTS pipelines involves aligning reads to a reference genome to identify variants. Despite recent initiatives, only a fraction of species has publically available reference genomes. Therefore, a common practice is to align reads to the genome of an organism related to the target species; however, this could affect read alignment and bias genotyping. In this study, I conducted an experiment using empirical RADseq datasets generated for two species of salmonids (Actinopterygii; Teleostei; Salmonidae) to address these questions. There are currently reference genomes for six salmonids of varying phylogenetic distance. I aligned the RADseq data to all six genomes and identified variants with several different genotypers, which were then fed into population genetic analyses. Increasing phylogenetic distance between target species and reference genome reduced the proportion of reads that successfully aligned and mapping quality. Reference genome also influenced the number of SNPs that were generated and depth at those SNPs, although the affect varied by genotyper. Inferences of population structure were mixed: increasing reference genome divergence reduced estimates of differentiation but similar patterns of population relationships were found across scenarios. These findings reveal how the choice of reference genome can influence the output of bioinformatic pipelines. It also emphasizes the need to identify best practices and guidelines for the burgeoning field of biodiversity genomics.

the bioinformatic approaches developed for HTS data require the use of established genomic resources as part of the pipeline. By "genomic resources," I am referring to highly vetted consensus DNA sequences, whether at the whole genome, transcriptome, individual chromosome, or organelle-level, that are meant to serve as a "reference" sequence for a given taxa (see Fuentes-Pardo & Ruzzante, 2017 for a review). A reference genome provides a backbone on which to align HTS data and identify variants based on discrepancies between observed nucleotide base calls and the reference (Davey et al., 2011;Fuentes-Pardo & Ruzzante, 2017;Nielsen, Paul, Albrechtsen, & Song, 2011).
It has reached a point where generating HTS data is not a significant barrier for geneticists in terms of cost and equipment availability (Fuentes-Pardo & Ruzzante, 2017;Puckett, 2017). This has led to a torrent of HTS data for species across the Tree of Life. However, creating a quality reference genome is still a substantial obstacle that requires intense bioinformatic skill and financial investment (Ekblom & Wolf, 2014). Thus, HTS data are being generated for species that do not have a reference genome. A common practice is to use the reference genome of a species related to the target species of interest to align HTS data (Fuentes-Pardo & Ruzzante, 2017). Published "whole-genome comparisons" between multiple species often use the reference genome for one or a few of the species and then align whole-genome resequencing data to them, opposed to actually generating a reference genome for each species (e.g., Árnason, Lammers, Kumar, Nilsson, & Janke, 2018;Cho et al., 2013;VonHoldt et al., 2016).
Aligning HTS reads from one species to the reference genome of another is not without peril. There should be greater sequence variability between species than within species, which could produce mismatches between the reference sequence and HTS reads and contribute to lower mapping quality in cross-species alignments. Reads are likely to be mapped mainly to conserved regions between the species, inhibiting the detection of unique genomic variation within the target species (Fuentes-Pardo & Ruzzante, 2017). Structural variants, repetitive elements, chromosome number, and copy number variation can also differ between species, affecting mapping and identification of paralogous genomic regions. Increasing divergence between the target species and reference genome can increase the number of missing genotype calls and produce biased estimates of population genetic parameters (Nevado, Ramos-Onsins, & Perez-Enciso, 2014;Shafer et al., 2017).
Although common practice, it is important to assess the efficacy of aligning HTS reads across species on both data quality and downstream interpretation. There have been several studies that have investigated its consequences. Nevado et al. (2014) simulated HTS datasets and reference genomes to assess the impact of genome divergence and sequencing coverage on estimation of population genetic parameters, observing a noticeable impact caused by reference genome divergence. Shafer et al. (2017) tested an empirical RADseq dataset generated for a species of pinniped (Mammalia; Carnivora; Pinnipedia) against a draft genome of the target species and reference genomes for three related species of increasing phylogenetic divergence. They also found reference genome divergence impacted population genetic parameters, but different metrics were more sensitive to divergence level. Gopalakrishnan et al. (2017) published a reference genome for the gray wolf (Canis lupus) and compared population genetic inferences for wild canids (Mammalia; Carnivora; Canidae) using their wolf genome and the commonly used domestic dog (C. l. familiaris) genome. Despite wolves and dogs having diverged within the last 30,000 years, the authors observed noticeable differences in some of the inferences for wild canids when the wolf genome was used.
These studies shed light on how divergence between reference genome and target species can impact the analysis of a HTS dataset. However, further research is still needed to assess the impact of such practices under different scenarios with empirical datasets, especially when making biological inferences, and the threshold at which genome divergence begins to have an effect. In this study, I assessed the impact of aligning HTS reads generated from two empirical RADseq datasets to real reference genomes of multiple species within the same taxonomic family. The RADseq datasets were for two species of salmonid (Actinopterygii; Teleostei; Salmonidae): Coho salmon (Oncorhynchus kisutch) and bull trout (Salvelinus confluentus). I aligned these data to publicly available reference genomes for six members of Salmonidae have been recently published, which is more than have been tested by other studies (Gopalakrishnan et al., 2017;Nevado et al., 2014;Shafer et al., 2017). They also provide a greater range of genomic divergence between target species and reference genome than has been previously tested. A reference genome is available for Coho salmon (Rondeau et al., 2017), allowing for a direct comparison of processing the RADseq data with the reference genome of the target species and species of increasing phylogenetic distance. No reference genome is available for bull trout but there is a genome for a closely related species within the genus, so performing a similar comparison provides an informative complement.
My objective was to evaluate the impact of reference genome divergence on mapping quality, number of variants, and inference of population structure. I also performed a de novo assembly of both RADseq datasets to assess whether coverage and inference were improved using contigs derived from the libraries themselves compared to aligning to a reference genome (Paris, Stevens, & Catchen, 2017). My hypothesis was that increasing phylogenetic distance between the target species and reference genome would result in fewer reads aligning and lower mapping quality, which would impact the discovery of variants. However, I anticipated that even if fewer variants were produced when using a more divergent reference genome, there would still be sufficient statistical power in the resulting data to reach the same biological interpretation (e.g., genetic differentiation) compared to using the genome of the target species. By utilizing recently developed genomic resources, this study provides insights into the practice of analyzing HTS data with the genome of another species. | 7587 BOHLING 2 | MATERIAL S AND ME THODS

| The genomes
I used published whole genomes from six species within the family Salmonidae that represent a gradient of phylogenetic distance (Table 1). Among these available genomes, the two most closely related species are Coho (Rondeau et al., 2017) and Chinook salmon (O. tshawytscha) , which are sister species that diverged within the last ten million years ( Figure 1) (Crête-Lafrenière, Weir, & Bernatchez, 2012;Shedko, Miroshnichenko, & Nemkova, 2012). Within the same genus, there is a reference genome for the rainbow trout (O. mykiss) , which is part of a clade sister to the Pacific salmons (the group that includes Coho and Chinook salmon). A reference genome is available for the Arctic char (Salvelinus alpinus) (Christensen, Rondeau, et al., 2018): the genus Salvelinus, which also contains bull trout, is sister to Oncorhynchus, having split around 20-25 million years ago (MYA) (Crête-Lafrenière et al., 2012;Lecaudey et al., 2018). Further diverged is the genome of the Atlantic salmon (Salmo salar) (Davidson et al., 2010;Lien et al., 2016): the genus Salmo is part of a clade that shares a most recent common ancestor with the Oncorhynchus/Salvelinus complex around 30-35 MYA (Crête-Lafrenière et al., 2012;Lecaudey et al., 2018). All five of these species (Coho salmon, Chinook salmon, rainbow trout, Arctic char, and Atlantic salmon) are members of the subfamily Salmoninae. Basal to this subfamily is the group containing the European grayling (Thymallus thymasllus), which diverged around 60 MYA and also has a publically available reference genome (Varadharajan et al., 2017).
These six genomes represent varying levels of divergence from the two target species I included in this study. To quantity this divergence, I analyzed the annotated protein sequences associated with the reference genome assemblies for the six species using the OrthoFinder pipeline (Emms & Kelly, 2019). OrthoFinder identifies protein sequences that form orthogroups between species and then generates multiple gene alignments based on these groupings. I used the MAFFT (Katoh & Standley, 2013) option to construct the alignments and FastME 2.0 (Lefort, Desper, & Gascuel, 2015) to generate a neighboring-joining tree based on genetic distance.
Two of these were from the Hood Canal region of Washington State, USA: one composed of hatchery broodstock propagated at Quilcene National Fish Hatchery (NFH) and another composed of spawning adults returning to Tarboo Creek, a nearby natural tributary. The third population consisted of wild-origin Coho salmon from Warm Springs River in Oregon, USA, which is part of the Columbia River Basin. The Quilcene NFH broodstock was founded from local Hood Canal populations and Coho salmon from this region are highly diverged from Columbia River populations (Campbell & Narum, 2011;Van Doornik, Teel, Kuligowski, Morgan, & Casillas, 2007), meaning the samples used for this study should display a clear pattern of genetic structure. These three groups (Tarboo Creek, Quilcene, and Warm Springs) had similar sample sizes (n = 11-18) and produced comparable numbers of reads per individual (means of 1.96, 2.16, and 1.56 million reads, respectively; all individuals except one produced >500,000 reads).

| The sequence alignment and genotyping
I aligned the RADseq reads to each of the six reference genomes using Bowtie2 (Langmead & Salzberg, 2012) with the "-very-sensitive" preset option for alignment sensitivity (the "-very-fast" option was initially tested as well and produced similar alignment patterns in relation to reference genome choice). Other parameters were set to their defaults. To assess mapping quality associated with reference genome choice, I recorded mapping statistics generated from the flagstats command with Samtools (Li et al., 2009). I calculated mean MapQ after removing reads that did not map or had a MapQ < 1. Nevado et al. (2014) found that certain genotyping software was more sensitive to genome divergence and sequence cover- age. Therefore, I tested several different genotypers that have different analytical frameworks. Each program used the BAM files generated from Bowtie2 as input data. The first was Stacks 2.0 (Catchen et al., 2013), which uses a maximum-likelihood approach to genotype variants based on with-individual read distributions.
This program is commonly used for RADseq data, and I used the SNP model for genotyping, which is the legacy model that has been part of earlier versions of the software. I also generated genotypes with FreeBayes (Garrison & Marth, 2012), which detects variants using a Bayesian approach that incorporates read distributions across all individuals. The final program was ANGSD (Korneliussen, Albrechtsen, & Nielsen, 2014), which can generate genotype probabilities based on overall read distributions and actual genotype calls. Within ANGSD, I used the GATK model for genotyping.
In comparing genotypes, I focused solely on biallelic SNPs as those are the only variants identified by ANGSD and Stacks SNP model does not produce indels, providing a fair comparison between the genotypers. For all SNPs identified by FreeBayes and Stacks, I calculated mean individual read depth at those sites using VCFtools (Danecek et al., 2011). Due to differences in output, read depth at SNPs identified by ANGSD could not be calculated. Default parameters were used in all applications.

| The de novo assembly
For the Coho salmon and bull trout datasets, I performed a de novo contig assembly and subsequent variant identification. I processed the data through two popular pipelines: Stacks 2.2 (Catchen et al., 2013) and dDocent 2.6 (Puritz et al., 2014). Each pipeline used the reads remaining after the clone_filter procedure described above; I performed no filtering prior to running each pipeline. I used the default parameters for both pipelines and extended contigs using the reverse reads. Note that dDocent uses FreeBayes to call SNPs as part of its pipeline, making it comparable to use of FreeBayes with F I G U R E 1 Photograph of two Coho salmon (top and middle fish) and Chinook salmon (bottom fish) harvested from the Columbia River. These are sister species within the genus Oncorhynchus. Photo courtesy of Steve Money the reference-aligned data. As with the reference-based alignments, I focused only on biallelic SNPs and calculated mean individual read depth.

| The population genetics
Because of the infinite possible ways to filter a HTS dataset, I tested a single potential scenario to focus primarily on the effect of reference genome on population genetic inference. In their simulations, Nevado et al. (2014) observed less variability in estimates of neutrality across genotypers with a read depth of 8X to retain genotypes. They also used 8X as their genotype threshold in their empirical case study, so I used 8X for the reference-aligned and de novo datasets. I set a minor allele frequency threshold of 0.05 and only retained SNPs that were genotyped in more than 50% of individuals. To provide a fair comparison among the genotypers, I used called genotypes instead of genotype probabilities from ANGSD using the above filtering thresholds. Processing genotype probabilities requires specialized software with different algorithms and assumptions, which would prevent a direct comparison with the other two methods.
I performed two analyses to assess the impact of reference genome divergence on population genetic inference. First, I estimated overall F ST (Weir & Cockerham, 1984) between populations for both species under the different scenarios using the package assigner (Gosselin, Anderson, Bradbury, 2016) for the R 3.6.1 statistical environment (R Core Team, 2018). I also performed a clustering analysis using Admixture (Alexander & Lange, 2011) to assess genetic structure. For each genome/genotyper combination, I used the cross-validation approach to estimate the optimal number of genetic clusters in the dataset. I allowed the number of clusters (K) to vary from one to ten.

| Reference genomes
The OrthoFinder pipeline recovered a phylogeny similar to that reported in other studies ( Figure 2) (Crête-Lafrenière et al., 2012; Lecaudey et al., 2018). Compared to the Coho salmon genome, both species within the genus Oncorhynchus (Chinook salmon and rainbow trout) had a sequence divergence between 1% and 2% ( Table 2).
The genome from other species in the subfamily Salmoninae (Arctic char and Atlantic salmon) was ~3.5% diverged from the Coho salmon genome. One unexpected result: although the phylogeny generated with the OrthoFinder pipeline ( Figure 2)

| Read mapping
The degree of divergence between the target species and reference genome impacted alignment success. For the Coho salmon dataset, the mean proportion of reads that aligned to the reference genome was highest when the Coho salmon reference genome was used and decreased with increasing divergence from the target species ( Figure 3a). On average, 80% of the reads aligned to Coho reference genome; aligning them to another species produced mean alignment rates < 70%. Aligning Coho salmon reads to non-Oncoryhnchus F I G U R E 2 Neighbor-joining tree based on genetic distances between the salmonid reference genomes used for this study. The tree was constructed using orthologous protein sequences identified by the OrthoFinder pipeline. It is a rooted tree, and branch lengths represent genetic distance. A scale bar depicting genetic distance is provided genomes (i.e., Arctic char, Atlantic salmon, and European grayling) resulted in less than 50% of reads aligning. A similar pattern was observed with the bull trout data, with the highest proportion of reads aligning when the Arctic char genome was used as a reference.
However, the proportion was still < 80%, meaning aligning bull trout data to the Arctic char genome was comparable to aligning Coho salmon data to other Oncorhynchus species.
The proportion of reads that mapped as singletons (i.e., only one read of a mate pair mapped) ( Figure 3b) and the proportion of reads in a mate pair that aligned to different chromosomes (Figure 3c) increased as the reference genome became more diverged. MapQ for the Coho salmon RADseq alignments was highest when aligned to the Coho salmon reference genome, with a gradual decrease as the reference genome became further diverged (Figure 3d). It was highest for the bull trout data when mapped to the Arctic char genome but relatively equivalent across the other genomes.

| Variant discovery
The number of SNPs generated by each of the three genotypers was highly variable for the two datasets and produced different patterns depending on the reference genome that was used to produce the alignments ( Figure 4). For the Coho salmon data genotyped with ANGSD, the fewest number of raw SNPs were identified when the reads were aligned to the Coho salmon genome: More SNPs were detected as the reference genome became increasingly diverged except for the grayling genome. Similarly, the fewest SNPs detected by ANGSD with the bull trout data were when it was aligned to the Arctic char genome: other genomes produced more SNPs aside from the grayling.
With FreeBayes and Stacks, the pattern was reversed: The highest number of SNPs was generated for the Coho salmon data using the Coho salmon reference genome ( Figure 4). There was a similar pattern with the bull trout data, although the highest SNP total produced by Stacks was generated with the Atlantic salmon genome. By far, the fewest SNPs were identified by Stacks: no aligner/genome combination resulted in more than 650,000 SNPs, whereas almost all FreeBayes and ANGSD analyses identified more than that regardless of reference genome.
There was a complication using FreeBayes when the Atlantic salmon genome was used as a reference. FreeBayes requested more memory than available with the computational resources I used when processing alignments from this genome, likely due to repetitive elements. As an alternative, I generated a masked genome to remove repetitive elements. After realigning the data to this masked genome, FreeBayes successfully genotyped, but produced fewer SNPs than other genomes (Figure 4).
For the de novo assembly pipelines, dDocent identified substantially more SNPs than Stacks regardless of the dataset (Figure 4).
Number of SNPs identified de novo by Stacks were within the range identified by Stacks using the reference-based alignments. The de novo dDocent pipeline uses FreeBayes to genotype: compared to TA B L E 2 Summary of genomic divergence between target and related species tested by several studies. Genomic divergence is presented as the percent divergence between the species for which reference genomes were tested species and the target species for which the raw sequence data were generated. For bull trout (Salvelinus confluentus), there is no reference genome to calculate sequence divergence, so it was calculated from the Arctic char genome (S. alpinus) since they are from the same genus. Note that Nevado et al. (2014)

| Population genetics
Several trends were noticeable with the estimates of F ST . First, there were differences in the magnitude of values depending on the genotyper. In general, FreeBayes and Stacks produced estimates within a similar range; ANGSD consistently produced much lower estimates, especially for the bull trout data ( Figure 6) (Table 3) and overlap between estimates generated with the same dataset was variable. Due to using the masked genome, SNPs generated by FreeBayes using the Atlantic salmon genome produced too few SNPs following filtering to be used in the population genetic analyses.
The cross-validation error rates produced by admixture were comparable regardless of genotyper or reference genome (Figure 7).

| Effect of genome divergence
The practice of aligning HTS sequences generated for one species to the genome of another has been a necessary evil for biologists due to the logistical constraints of producing reference genomes for every species on Earth. However, this study demonstrates that the reference genome is an important component of variant discovery. Substantial space has been given in the literature to  (Gopalakrishnan et al., 2017;Nevado et al., 2014;Shafer et al., 2017), they have not documented whether this extended to the initial phase of read mapping as well. Results from this study suggest the impacts of reference genome divergence are rooted in the alignment process, which would subsequently affect downstream analyses. Parameters could be adjusted during the alignment process to account for this, but it will not eliminate the problem.
RADseq data may be particularly sensitive to reference genome divergence given that mutations in the restriction cut-site would lead to complete absence of overlapping sequence data between species (Arnold, Corbett-Detig, Hartl, & Bomblies, 2013;Davey et al., 2013).
Salmonids add another wrinkle in that the ancestor to the entire family underwent a whole-genome duplication around 65 million years ago event that was subsequently rediplodized (Varadharajan et al., 2017). The presence of paralogous regions of the genome may pose problems for read mapping, further exasperating the impact of using a divergent reference genome.
The most practical consequence of this phenomenon is that perfectly good sequencing reads may be removed from a dataset sim- Alternatively, the observed mixing of the two populations may be a real biological signal: Tarboo Creek and Quilcene NFH are within close proximity, and migrants have been found between the two populations (Bohling et al., 2020). However, considering that admixture proportions varied for individuals across genome/ genotyper combinations suggests the shallow divergence played a large role in the lack of clear clustering. Plus, Bohling et al. (2020) found the broodlines propagated at Quilcene NFH were distinct from naturally spawning local populations. Overall, using more divergent reference genomes for alignment may be practical when detecting deep patterns of population structure, but may cause problems in detecting more recent population splits.

| Genotypers
There were also differences in how the three genotyping software  Skotte, Korneliussen, & Albrechtsen, 2013). Thus, providing a direct comparison between called genotypes and probabilities is problematic. For example, ngsAdmix, the equivalent of admixture that was developed to perform clustering of ANGSD output, does not perform the cross-validation procedure for inferring K. Nor is there a straightforward way to calculate overall F ST among more than two populations using genotype probabilities. Using genotype probabilities has inherent appeal with HTS data, but software applications must be developed to meet the diverse needs of geneticists for it to reach its full potential.

| Implications for biodiversity genomics
There are several additional observations gleaned from this study that serve as important lessons for geneticists as we advanced into the genomic era. First relates to the de novo assemblies compared to the reference-aligned datasets. Several studies have suggested for RADseq data that higher coverage and genotyping accuracy are achieved with reference-aligned pipelines compared to de novo (Fountain et al., 2016;Shafer et al., 2017;Torkamaneh, Laroche, & Belzile, 2016). Analyzing these salmonid datasets suggests that the question of de novo versus reference-aligned is more nuanced.
Whether the de novo pipelines produced fewer SNPs depended on the dataset: with bull trout RADseq data, the de novo numbers were comparable to reference aligned. Mean read depth at SNPs was higher for the dDocent de novo pipeline than with the FreeBayes reference-aligned data, but the opposite was true with Stacks. The highest estimates of F ST were obtained with data assembled de novo by Stacks, suggesting it was emphasizing divisions between our populations. Paris et al. (2017) suggested that de novo assemblies take full advantage of the information contained within RADseq data, some of which may be lost due to poor alignment or difference between the target species and reference genome. Thus, de novo assemblies may be the preferred approach in situations when there is no genome from the target species or any closely related species. ined the factors that influence genotyping with HTS, a common thread has emerged that changing a few parameters can alter population genetic estimates (Fountain et al., 2016;Hodel et al., 2017;Mastretta-Yanes et al., 2015;Paris et al., 2017). Even the choice of software should be viewed as a parameter that can influence biological inference (Nevado et al., 2014;Shafer et al., 2017;Torkamaneh et al., 2016).
This has profound consequence for interpretation, especially in situations in which findings have consequences for conservation and other applied applications. Geneticists should be transparent that processing a single dataset through a single pipeline represents a single snapshot view of biodiversity. Researchers should be careful in assuming inferences made when aligning HTS data to the genome of a nontarget species will hold up when another genome is used. Unless consensus builds that particular pipelines and parameters are required for genomic studies, it would be advisable to test varying parameters and software to gain a more generalized view of biological patterns.
To close, this study reveals that the reference genome is an important variable in the processing of HTS data. Over the past decade, there have been a variety of initiatives to develop reference genomes for as many species as possible, such as the Genome 10K and Earth BioGenome Project (Lewin et al., 2018). One practical application of these initiatives is that reference genomes will be available for conducting intraspecific population analyses, reducing the need to align HTS data to more diverged genomes. This would increase the veracity of studies using HTS data, resulting in better coverage and higher-quality variants. It is tempting to generate HTS data to investigate a wide array questions we have in genetics: however, we may need to assess whether we are putting the cart before the horse in generating these data before reference genomes are available for these species.

ACK N OWLED G M ENTS
I would like to thank Abernathy Fish Technology Center for pro- Service.

CO N FLI C T O F I NTE R E S T
The author declares that there are no conflicts of interest.

AUTH O R CO NTR I B UTI O N
Justin Bohling conceived and designed the study, performed the data analysis and interpretation, and wrote and edited the article.

DATA AVA I L A B I L I T Y S TAT E M E N T
The genolames used for this study are publically available. The RADseq data have been described in previous publications (Coho data, https://doi.org/10.1002/tafs.10206; bull trout data, https:// doi.org/10.1007/s1059 2-018-1134-z) that include instructions for data accessibility. All software used for the data analysis is open source, and the specific scripts used for the bioinformatics processing are provided in the Supplemental Material.