Sequence capture: Obsolete or irreplaceable? A thorough validation across phylogenetic distances and its applicability to hybrids and allopolyploids

As whole‐genome sequencing has become pervasive, some have suggested that reduced genomic representation approaches, for example, sequence capture, are becoming obsolete. In the present study, we argue that these techniques still provide excellent tools in terms of price and quality of data as well as in their ability to provide markers with specific features, as required, for example, in phylogenomics. A potential drawback of the wide‐scale application of reduced representation approaches could be their drop in efficiency with increasing phylogenetic distance from the reference species. While some studies have focused on the degree and performance of reduced representation techniques in such situations, to our knowledge, none of them evaluated their applicability to inter‐specific hybrids and polyploids. This highlights a significant gap in current knowledge since there is increasing evidence for the frequent occurrence of natural hybrids and polyploids, as well as for the major importance of both phenomena in evolution. The main aim of the present study was to carry out a thorough validation of SEQcap applicability to (1) a set of non‐model taxa with a wide range of phylogenetic relatedness and (2) inter‐specific hybrids of various ploidies and genomic compositions. Considering the latter point, we especially focused on mechanisms causing allelic bias and consequent allelic dropout, as these could have confounding effects with respect to the evolutionary genomic dynamics of hybrids, especially in asexuals, which virtually reproduce as a frozen F1 generation.

| 1349 BARTOŠ et al. et al., 2018). In the light of this trend, some authors have predicted that sequencing approaches based on reduced genomic representation are becoming obsolete (Ku et al., 2012).
However, although the costs per sequenced base are in general constantly decreasing (Jones & Good, 2016;Therkildsen & Palumbi, 2017), it does not mean that next-generation sequencing (NGS) is always affordable, especially when applied to population genomics or reasonably wide-scaled phylogenetic studies. Furthermore, we do not think that WGS would be largely, if at all, beneficial for such purposes, since phylogenomic studies mostly rely on a limited set of orthologous, mostly coding, loci (Sarver et al., 2017).
As an alternative, some studies have used so-called 'low coverage' WGS to address population structure or phylogenetic history (Therkildsen & Palumbi, 2017;Zhang et al., 2019). However, such an approach may be limited at the individual level (e.g. unreliable single nucleotide polymorphism (SNP) detection, except for mitochondrial genomes) (Therkildsen & Palumbi, 2017). Furthermore, the definition of 'low coverage' is somewhat subjective as Therkildsen and Palumbi (2017) defined this as 1-3×, whereas Zhang et al. (2019) used sequencing coverages ranging between 25 and 30× per sample which are not really low, but rather represent the lower limit for a reasonable genome assembly. Moreover, in the end, Zhang et al., 2019 still focused solely on the identification and analysis of well characterized orthologous genes (e.g. Kriventseva et al., 2019), raising once again the question of the actual need for WGS data.
Therefore, approaches using the so-called reduced representation sequencing, such as sequence capture (SEQcap) or restrictionsite associated DNA sequencing (RADseq) are well suited for many purposes. They provide excellent performance, in terms of both price and quality of data, since they are based on the enrichment of a (specified) set of sequences (genomic fraction) over the genomic background (Mamanova et al., 2010).
The main benefit of RADseq (e.g. Harvey et al., 2016), for example, is that it is a relatively non-laborious and affordable technique that samples the genome more or less randomly (Andrews et al., 2016). Consequently, most RADseq markers are distributed in non-coding regions. On the other hand, RADseq is known to substantially suffer from a phenomenon known as allelic dropout (ADO), where only one allele is sequenced despite the physical presence of two alleles in a heterozygous individual (Gautier et al., 2013;Henning et al., 2014;Mastretta-Yanes et al., 2015). This effect may not pose any issues for many studies, especially if they are dealing mainly with phylogenomic inference (e.g. Comito et al., 2022). However, ADO would seriously hamper studies that actually want to interpret the presence/absence of target loci as biologically relevant phenomena, that is, loss of heterozygosity (e.g. Janko et al., 2021;Westphalen et al., 2022). This is where the major advantage of sequence capture approaches comes in, as they use a priori designed probes to capture desired homologous genomic regions with high specificity (e.g. Glon et al., 2021;Jiang et al., 2019;Powell et al., 2016;Zhang et al., 2019).
Therefore, SEQcap appears as an attractive solution for many evolutionary genomic studies, in terms of both financial costs and sequencing throughput. However, the use of targeted sequence capture in such a context does bring about specific issues, such as sensitivity of designed probes to SNPs and structural variants, which may affect or bias the enrichment of targeted sequences in genomes of more or less diverged species. Previous evaluations of SEQcap technology in phylogenomics have already suggested that the functioning of the probes indeed depends on sequence similarity to the reference species used in their design. The performance and suitability of SEQcap may consequently vary across phylogenetic scales, and it has been shown that significant variation in sequence coverage occurs with increasing phylogenetic distance, ultimately leading to locus dropout (e.g. Andermann et al., 2020;Bragg et al., 2016).
One aspect of SEQcap has remained considerably underexplored, that is, its applicability to inter-specific hybrids and polyploids, which combine two or more genomes, sometimes with high sequence differentiation between orthologous alleles. Genomic analyses of such organisms bring new insights into the mechanisms of genome evolution (e.g. Lu et al., 2022;Yoo et al., 2014) and SEQcap methodologies have indeed been employed to study phenomena like structural variations or loss of heterozygosity (e.g. Mu et al., 2019;Westphalen et al., 2022). However, while the potential of SEQcap to detect such events has been validated, for example, in medical studies (e.g. Westphalen et al., 2022), the fundamental difference here is that studies of hybrid and polyploid organisms typically focus on non-model species whose genomes, and hence genomic variation, are not as well understood.
To address the sensitivity of sequence capture to increasing phylogenetic distance, we applied designed probes to genomes of several species whose genetic distance from the focal taxon ranges from recent divergence to inter-generic splits dating more than 40 Ma. We focused on spined loaches, Cobitidae, a group of freshwater fishes comprising a number of species that have undergone multiple interspecific hybridization and subsequent polyploidization events. Then, we applied SEQcap to interspecific hybrids and allopolyploids to address its ability to detect orthologous alleles and their genomic dose. Specifically, we addressed the susceptibility of the SEQcap methodology to the ADO phenomenon, which appears to represent a crucial parameter for inferences about the evolutionary dynamics of hybrids' (sub)genomes.

| The model taxon: Cobitidae
Freshwater fishes of the family Cobitidae (Cypriniformes: Cobitoidea) are small benthic fishes that are distributed with 18 genera and more than 200 species across Europe and Asia (Kottelat, 2012;Perdices et al., 2016). The so-called 'Northern Clade' collects all species from the northern half of the distribution area (Šlechtová et al., 2008) and contains the largest cobitid genus (about 115 species) Cobitis that is distributed across all of Europe, Near East, Siberia, Northern and Eastern Asia (Bǎnǎrescu, 1992). While Asia houses several major phylogenetic lineages of Cobitis including distant morphotypes formerly referred to as distinct genera (Iksookimia, Kichulchoia and Niwaella) (Perdices et al., 2016;Šlechtová et al., 2008), the species in Central and Eastern Europe are closely related, morphologically very similar and underwent extensive hybridization. However, instead of eroding the species boundaries, the hybridization events repeatedly gave rise to several independent asexual lineages (Janko et al., 2007(Janko et al., , 2018Kuroda et al., 2018;Majtánová et al., 2016). Some of the European hybrid lineages might have originated as much as 300,000 years ago (Janko et al., 2012). Today, most populations of Cobitis in Central and Eastern Europe consist of a diploid, sexually reproducing species plus several di-or triploid gynogenetic (sperm-dependent) hybrid lineages, with the hybrids usually outnumbering their parental species (Bohlen & Rab, 2001;Janko et al., 2007). Such a monophyletic group of species with a variable degree of genetic relatedness and multiple hybridization events makes them perfect candidates for the thorough evaluation of the performance of SEQcap-based analysis.

| Species selection and design of probes
The sequence capture probes applied in this study were obtained within the frame of our previous projects on gene expression and genome evolution in hybrids and polyploids of the so-called C. taenia species complex (Bartoš et al., 2019). Technical details of the development of these probes are available in the Methods S1. Please note that the probes were originally designed using the C. elongatoides provisional transcriptomic reference, while later, its sister species C. taenia was chosen as the reference species since its transcriptomic reference is currently the most reliably assembled and annotated cDNA reference for genus Cobitis and has previously been used for gene expression analyses and population genomics (e.g. Bartoš et al., 2019;Janko et al., 2021;Kočí et al., 2020).
To evaluate the performance of SEQcap across defined phylogenetic scales and in hybrids, we focused on four species of the Cobitis genus plus two from other genera from the Northern Clade of Cobitidae (Bibarba, Sabanejewia) as well as one species from the sister family Nemacheilidae. The sample selection thus includes Cobitis taenia, the reference and type species of the family, plus six taxa with increasing phylogenetic distance ranging from its sister species (C. tanaitica) to a member of a different family (Nemacheilus masyae) whose divergence exceeds 40 My (Šlechtová et al., 2008). The species list and geographic origin of samples are listed in Table 1, and a schematic phylogeny is provided in Figure 1a.
Additionally, in order to evaluate the performance of the custom SEQcap probes (Janko et al., 2018) in hybrid lineages, we assessed several hybrid individuals of various ploidies that originated from a parental combination of C. elongatoides × C. taenia (see Table S1).

| Library preparation, sequencing & data trimming
Sequencing libraries were prepared using standard workflow steps including DNA shearing on a Covaris M220 ultrasonicator followed by quality check on an Agilent 2100 Bioanalyzer. The DNA was then subjected to the target enrichment protocol with hybridization probes as described in Janko et al. (2018). Combined QC report of all samples (after trimming) by MultiQC v1.9 (Ewels et al., 2016) and fastQC 0.11.9 (Andrews et al., 2014) is available in the Appendix S2.
Within this work, we also utilized an independent dataset originating from WGS acquired for different purposes over the TA B L E 1 Overview of basic statistics per sample including: sample ID, species name, geographical origin, total number of reads, proportion of mapped reads and theoretical average coverage expressed in relation to the size of the whole reference (the overall reference length slightly exceeds 22 million bp).  Table S1.

Sample ID
F I G U R E 1 Species are ordered according to increasing phylogenetic distance from Cobitis taenia (always left). (a) Schematic phylogenetic tree showing relationships among C. taenia, C. tanaitica, C. elongatoides, C. strumicae, Sabanejewia aurata, Bibarba bibarba and Nemacheilus masyae; branch lengths are not proportional to time; estimated split of individual species is depicted on the y axis -values displayed in grey colour represent rough estimates (Bohlen et al., 2006(Bohlen et al., , 2019Janko et al., 2018;Šlechtová et al., 2007, 2008; (b) The number of sequenced genes with median per base coverage ≥ 10 for each species (only reads with MQ ≥ 20 and bases with Phred-score ≥ 15 included); the dotted line represents the intersection of genes successfully sequenced across all the species (n = 1282); Black numbers above bars indicate mean GC content. The white numbers inside the bars represent the average difference in genetic sequence between the reference and two different sets of genes. The first set includes genes that were not found in the next species (shown above the line), while the second set includes genes that were found in a more distantly related species (shown below the line).; Note that the overall mean sequence divergence of the most distant species from the reference (N. masyae) was 5.93%; (c) The number of discovered variable sites (SNPs) against C. taenia reference; (d) The number of discovered variable sites (SNPs) against C. taenia reference normalized by the actual reference length (in 1000 bp) that was successfully enriched (sequenced) for given species (nSNPs × 1000 bp −1 ).

| Reference mapping & SNP calling
Trimmed data were mapped onto the published transcriptomic reference of Cobitis taenia (Janko et al., 2018) using BWA-MEM software ) with default settings (shorter splits marked as secondary). Aligned reads were subsequently processed by SAMtools 1.9 ; sorting and deduplication of reads was performed via the fixmate and markdup sub-tools. SNPs were called by BCFtools 1.9 (Hwang et al., 2015) and filtered under the following conditions: SNPs with genotype quality <20 and coverage depth <10 were omitted. We ran BCFtools in -c mode, and indels were excluded while invariant sites were retained for the purposes of phylogeny reconstruction. We used the SAMtools/BCFtools pipeline despite the known problems that can arise with calling low allelic frequency sites using this pipeline (Smith & Yun, 2017), as this strategy is valid for phylogenomic inference. VCF files of samples were merged by VCFtools v. 0.1.15 (Danecek et al., 2011) into the final VCF file, which was evaluated by BCFtools stats. Basic summary statistics are reported in Table 1. For the analysis of hybrids, we used a different approach, and detailed information may be found in Methods S1.

| Alternative reference assembly
We took advantage of the fact that intron sequences from the exon/ intron boundaries are captured even though the probes were designed only for exon targets (e.g. Bi et al., 2012). In order to include such intron sequences into the phylogenomic analyses, we assembled enriched SEQcap data from C. taenia specimen using SPAdes 3.11.1 in careful mode (Bankevich et al., 2012). The resulting contigs were reduced by CD-HIT (Li & Godzik, 2006) using a similarity threshold of 0.95. Then, we searched for homology with the transcriptomic reference using BLASTn (Altschul et al., 1990) and retained only those contigs that locally (at least 10% of their length) matched the transcriptomic reference at 98% similarity threshold and were longer than 300 bp.
To roughly estimate presumable intron content in such a reference, we split the contigs into 50 bp non-overlapping fragments using the Split FASTA tool (Stothard, 2000). Those fragments were then mapped onto the original transcriptomic reference using BWA-MEM . We assumed that the proportion of fragments that were not mapped roughly correlates with the intron content. Reads obtained from individual species were mapped to this alternative reference using the same procedure as described in Section 3.2.

| Phylogeny estimation
The resulting VCF file was exported into a table by BCFtools query tools and subsequently converted by custom Python script into multiple sequence alignment FASTA files, each representing individual contigs. To minimize problems with branch length overestimation and incomplete lineage sorting (Molloy & Warnow, 2017;Xi et al., 2016), alignments were filtered by the AMAS tool (Borowiec, 2016) and contigs containing less than 20 parsimony informative sites were omitted from further analysis, ultimately leading to 2081 contigs available for phylogeny estimation. Individual maximum likelihood gene trees were reconstructed by IQ-TREE v.
1.6.10 (Nguyen et al., 2015) using the extended model selection, as implemented in ModelFinder, with free rate of heterogeneity in combination with 1000 ultrafast bootstrap replicates (Hoang et al., 2018;Kalyaanamoorthy et al., 2017). A consensus species tree was estimated by ASTRAL v. 5.5.6 (Zhang et al., 2018).
Using the alternative reference, we applied similar logic with the only difference that we assumed that intron sequences are evolutionarily less stable and therefore less likely to be recovered within the evolutionary distant species. Therefore, if any sequence contained more than 60% of ambiguous Ns, it was not included in the analysis. Only alignments with more than four species were utilized for the analyses. Subsequently, we obtained 1617 contigs available for phylogeny estimation out of which 858 contained sequences of all analysed species.

| Locus dropout and variance in locus coverage
It may be expected that the phylogenetic distance from the species used for the probes design in general correlates negatively with the resulting coverage of the target sequences in other species (e.g. Bragg et al., 2016;Portik et al., 2016). In an extreme case, when a gene reaches actual coverage equal or close to 0, we might consider such a gene as dropped out, while in a less extreme case, we might set any coverage threshold and consider a gene that did not qualify such criteria as (functionally) dropped out. Hence, to evaluate the phenomenon of locus dropout among studied species, we estimated for each locus the number of mapped reads within each sample using bedtools v.2.27.1 (Quinlan & Hall, 2010). We accepted only reads with mapping quality ≥20. With respect to the SNP calling parameters, we decided to consider loci with median per base coverage <10 as dropped out.
We also investigated the variance in per locus sequencing coverage across individual species pairs, which has moreover important implications for the study of inter-specific hybrids. Furthermore, to understand the dynamics of the enrichment at the inter-specific scale, we found it useful to define a subset of conserved genes, that is, sequences that were successfully enriched and sequenced (retrieved) in all species.

| Allelic dropout and genomic bias in hybrids and allopolyploids
To evaluate the rate of allelic dropout (ADO) of the SEQcap methodology, we investigated one diploid hybrid individual between C. elongatoides and C. taenia (hereafter, hybrids shall be labelled by letter corresponding to genomes of parental species, where "E" stands for one genomic dose coming from C. elongatoides and "T" for one genomic dose coming from C. taenia; in this case ET stands for a diploid hybrid). This specimen has been analysed by both SEQcap and whole-genome shotgun sequencing data (coverage ~35×). As the WGS approach is technically easier and thus less error prone with respect to ADO as it does not depend on target capture, it could be used to evaluate the performance of SEQcap. We focused our comparison on a set of 46,563 bi-allelic diagnostic nuclear SNPs where C. taenia consistently differs from C. elongatoides that were defined by Bartoš et al. (2019), that is, fixed variants that discriminate between those two species. This set of diagnostic SNPs was chosen because it represents loci that are successfully sequenced and bioinformatically assessed between both parental species as well as their hybrids. Therefore, we assume them to represent real variants rather than errors of any type, that is, technical error, reference bias or bioinformatic faults. Moreover, since the hybrids reproduce effectively clonally, they are assumed to bear variants inherited from both parental species at the time of their origin (Janko et al., 2018). In other words, at all positions where the parental species differ consistently from each other, their hybrid is assumed to be heterozygotic.
In addition, we also tested whether sequencing of hybrid individuals is subject to the so-called genomic bias, that is, unbalanced relative coverage of orthologous alleles with respect to their parental ancestry. To do so, we first used the WGS data of a diploid ET hybrid as a null model and compared them with SEQcap data originating from two F1 laboratory hybrids. Next, we focused on naturally occurring hybrid specimens including all EET, ET and ETT genomic combinations.

| RE SULTS & D ISCUSS I ON
The present study investigated two properties of targeted enrichment sequencing approach and consequently, this section is split into two major parts. In the first part, 4.1., we evaluate the general performance of SEQcap across a wide range of phylogenetic scales of tested species and in the second part, 4.2., we will evaluate its performance with respect to hybrids and allopolyploids.

| Performance of SEQcap across phylogenetic scales
3.1.1 | Trends in locus dropout across the phylogeny Given that all samples included in the analysis have successfully passed sequencing with a mapped coverage of 20× or higher, it is likely that any trends observed across investigated phylogenetic scale resulted from biological and phylogenetic differences between the species, rather than from technical issues in library preparation and subsequent sequencing. In order to address the performance of SEQcap methodology across different phylogenetic scales, the phylogenetic hypothesis described in Figure 1a was taken as being correct. The actual support of this phylogeny by the present data will be discussed later. Several notable trends were observed.
First, we corroborated that phylogenetic distance indeed represents the key parameter determining the performance of designed probes (see Figure 1b)  Next, we examined the sharing of successfully sequenced genes among samples. We retrieved 2015 loci in Nemacheilus masyae, which diverged from the genus Cobitis probably 40 Ma (Šlechtová et al., 2008), of which roughly two thirds (i.e. 1282), were successfully sequenced also in each of the remaining species (see Figure 1b).
This suggests that there is a clear trend generally associated with phylogenetic distance. However, this trend is also influenced to some extent either by chance or by other influences that cannot be assessed. For a given gene that has not been sequenced in a given species, we cannot say whether it was merely a random variation or a deterministic phenomenon imposed by the characteristics of the particular gene.
Finally, the total number of SNPs called against the reference (see Figure 1c) does not increase monotonically with the phylogenetic distance, but reaches its maximum in Sabanejewia, where this trend breaks and the total number of SNPs starts to decrease. This decrease is caused by the fact that more and more genes are dropping out of the analysis as we move towards species increasingly more distant from the reference. However, this trend can be corrected by taking this dropout into account as shown on Figure 1d.
Specifically, when we divided the total number of SNPs by the sum of the lengths of the successfully sequenced proportion of the reference in each sample (nSNPs × 1000 bp −1 ). After such a correction, the data showed the expected trend of a monotonically increasing amount of variability accordingly with the phylogenetic distance (see Figure 1d).
Next, we examined whether the locus dropout observed over the phylogenetic scale (see Figure 1b) was related to specific properties of targeted loci. Extreme values of GC content are known to reduce the efficiency of target enrichment, however, the highest enrichment efficiency was observed in genes with above-average GC contents, that is, ~60% (Bi et al., 2012). We observed that the average GC content of successfully sequenced loci increases slightly with phylogenetic distance from the reference species. Since the phylogenetically most distant species show a significant deterioration in the performance of this methodology compared to the reference species, it is easy to understand why GC-enhanced genes are more successfully enriched in light of previous work which demonstrated that such sequences are captured with far greater efficiency (Bi et al., 2012).
Further, we investigated the dynamics of locus dropout in a given situation where we do not know the sequence of the dropped genes (because the enrichment has failed). To overcome this apparent issue, we sorted the genes within each selected species into two categories: genes that were successfully retrieved in another species (which was one step further on the imaginary scale of phylogenetic divergence; see Figure 1a), and genes that were not. For every gene, we calculated its divergence from the reference species (C. taenia) and found that genes which tend to accumulate more changes are preferentially omitted/lost during the enrichment. Although this finding could be said to have been anticipated, it is surprising how universal this trend is and that it was observable already between closely related species. Furthermore, it means that the estimated divergence of the most phylogenetically distant species represents a rather conservative (lower) estimate, that is, an estimate that is most likely based on the subset of most conserved and least mutating genes (see Figure 1b,d).

| Variance in locus coverage across the phylogeny
To evaluate on a finer scale the behaviour of individual loci across tested phylogeny, we investigated the variance in per-locus sequencing coverages across all possible species pairs including C. taenia. It is known that SEQcap data are subject to remarkable variability in the enrichment of individual targets (e.g. Bragg et al., 2016). Such variability may have technical as well as biological explanations. For instance, mitochondrial loci are likely to achieve high coverage since they are overrepresented compared to nuclear loci (e.g. Laczkó et al., 2022;Therkildsen & Palumbi, 2017). However, even single-copy nuclear genes are subject to substantial variation in the resulting coverage.
As expected, such a pairwise comparison with C. taenia showed substantial variance in coverages (see Figure 2), but it is remarkable that designed probes perform consistently on the intrageneric scales as all analysed species of Cobitis were highly correlated with the ref-  Figure 2).
This observation helps to resolve the question whether the observed gene dropout as demonstrated in Figure 1b has been caused by the lack of the physical capability of the designed probes to capture progressively more divergent fragments of DNA, or whether it could have resulted from a technical problem of mapping the reads from genetically distant species to the transcriptomic reference. The trend reported in Figure 2 indeed suggests that, for some reason, the set of defined conserved genes is enriched efficiently, while the performance of enrichment of other genes decreases with the phylogenetic distance from the reference. Therefore, in the Nemacheilus sample, this subset prevails at the expense of other genes, which is a direct consequence of the reduced capabilities of the probes to enrich many target sequences at rates typical for the reference species.
This trend is more evident in Figure 3, which shows the percentage of mapped reads in each sequencing library that originated from the defined subset of conserved genes.

| Phylogeny estimation and the role of offtarget capture
Having analysed and interpreted the general trends of multilocus SEQcap data throughout the phylogeny, we may now address how they actually supported the presumed phylogenetic hypothesis ( Figure 1a). Two datasets were compared in terms of power to resolve the species tree; the first used the SEQcap data mapped onto the published C. taenia transcriptomic reference (Janko et al., 2018), while the other was based on the alternative reference assembled from the SEQcap reads themselves. The latter one may benefit from the fact that intron sequences from the exon/intron boundaries are at least to some extent captured during sequencing (off-target sequencing), even if with lower efficiency than the true targets (Bi et al., 2012). It was found that the alternative SEQcap-based reference included 32.3% of presumably intron sequences, that is, sequences that did not map to the published transcriptomic reference. To reconstruct the species trees from both datasets, we selected only those loci which passed previously defined criteria for being successfully sequenced and phylogenetically informative (see Methods) in at least four species, which left us with 2081 usable loci of the first dataset and 1617 of the second dataset.
Both consensus species trees independently estimated by ASTRAL generally confirmed the phylogenetic hypothesis depicted in Figure 1a. Nevertheless, plotting the rooted trees with DensiTree (see Figure 4) allowed visual inspection of conflicts among individual gene trees. Such conflicts were especially obvious at the shortest internodes of both trees, that is, at the basal relation between Bibarba bibarba and Sabanejewia aurata and at the branch separating C. strumicae from the remaining Cobitis species. For instance, in the first tree based on published C. taenia transcriptomic references, 56.7% of gene trees and only 49.9% of sites support the consensus B. bibarba -S. aurata topology (see Figure 4a). Interestingly, using the alternative reference including intron sequences, the measures of concordance factors increased along each branch, with the biggest improvement concerning the relatively shortest internodes of the tree (see Figure 4b). This finding is consistent with previous studies that have addressed this issue (e.g. Giarla & Esselstyn, 2015).
In any case, this analysis corroborates the usefulness of a multilocus dataset for any relevant phylogenomic analysis, which is in line with the persistent paradigm that strictly differentiates between individual 'gene trees' and the 'species tree' (Edwards, 2009), where a 'gene tree' is an entity that may or may not share evolutionary history with other gene trees of a given species, but only by analysing many such genes do we obtain a qualitatively distinct entity, that is, a 'species tree'.

| Suitability of SEQcap for analysis of hybrids and allopolyploids
To evaluate the suitability of SEQcap methodology for genomic investigation of hybrid and polyploid (especially allopolyploid) organisms, we focused on two phenomena which represent significant obstacles for any multilocus analysis of such organisms. Namely, by comparing the performances of sequence capture with WGS, we investigated the rates of Allelic Dropout (ADO) and also evaluated the bias in coverage of orthologous alleles originating from different parental species within a hybrid (the genomic bias).

| Allelic dropout
We evaluated the rate of allelic dropout (ADO) in a single ET hybrid individual for which we obtained both SEQcap and whole-genome shotgun sequencing data. Since WGS is presumably technically easier and less error prone with respect to ADO as it does not depend on target capture, it may serve as an unbiased null expectancy to evaluate the performance of SEQcap. The test was performed on 46,563 a priori known bi-allelic diagnostic nuclear SNPs where C.
Of these, 38,945 SNPs passed the quality filters (GQ ≥ 20 and allelic depth (AD) ≥ 10) for both datasets, allowing their comparison.
In general, SEQcap inferred identical genotype calls to WGS dataset and differed only at 65 loci, that is, <0.17% of the evaluated positions. Those 65 loci could be further divided into two classes.
First, at 31 positions, the SEQcap appears to be homozygotic compared to WGS, which is interpreted as ADO. In the second class of 34 positions, the opposite pattern occurred with SEQcap appearing heterozygotic compared to the WGS data, which is interpreted as false heterozygosity. When we applied even stricter thresholds (GQ ≥ 50 and AD ≥ 20), the number of loci passing such filters reduced to 28,801, but of these, 11 were classified differently by both methods; nine characterized as ADO by sequence capture (i.e. ~0.03% of the evaluated positions) and three classified as falsely heterozygous (i.e. ~0.01%).
When compared with other correctly inferred loci, we found no differences in GC content at these loci with ADO or false heterozygosity. However, the situation was different when we focused on the average sequence divergence of individual genes measured against the reference, specifically for genes affected by ADO. We found that these genes were among the most different genes that were included in the set of so-called diagnostic genes as described above, although the difference was not statistically significant (wilcox. test p-value = .2039, means 1.54% vs. 1.34%). This suggests that these events are to some extent related to stochastic errors in library preparation or sequencing and, more interestingly, to locus-specific properties.
In summary, our data suggest that SEQcap technology is very efficient in detecting heterozygotic variants and as such appears to be well suited for analyses of hybrids even if their sub-genomes diverged as long as ~9 Ma.

| Allelic/genomic bias in hybrids and allopolyploids
Looking more closely at the aforementioned diagnostic positions, we investigated possible bias towards one of the parental sub-genomes within diploid and polyploid hybrids and thereby tested whether one parental sub-genome is captured and sequenced with a higher probability than the other(s). As a null model, we again used the WGS data from the same diploid hybrid (ET) individual, where no methodological F I G U R E 3 Percentage of total mapped reads which originated from the defined subset of conserved genes within individual species as depicted in Figure 2. Only reads that could be mapped to the reference were considered for the calculation.
F I G U R E 4 Rooted phylogenetic trees plotted by DensiTree (rooted by N. masyae -not displayed): (a) 2081 gene trees that were estimated using the transcriptomic reference with corresponding concordance factors; (b) 858 gene trees (containing all species) estimated using the assembled reference containing presumable intron sequences with corresponding concordance factors (based on the full set of 1617 gene trees); consensus tree for each of those two groups is displayed in bold blue lines; the concordance factors are listed above the branches of both trees, with the first number indicating gene concordance factors and the second indicating site concordance factors (Minh et al., 2020).
bias is expected. At each diagnostic locus that passed GQ ≥ 20 and AD ≥ 10 quality thresholds, we inferred the proportion of E-like and T-like allelic variants, that is, variants coming from the genome copy inherited from C. elongatoides and C. taenia, respectively.
We observed that the WGS dataset was unbiased with respect to nuclear alleles coming from either genome (Figure 5a). Such an unbiased performance excludes any significant role of so-called reference bias (e.g. Brandt et al., 2015), which may stem from the mapping of sequencing reads originating from distinct sub-genomes in a hybrid to a reference assembled from one parental species.  Figure 5a).
To further test the power in resolving ploidy and genomic composition in natural hybrids, we applied the same approach to a set of naturally occurring individuals, including diploid and triploid genomic combinations (ET, EET and ETT) in which the allelic proportions of the T-like variants should theoretically follow a gradient starting from 0.33 in EET, through 0.5 in diploid ET hybrids to 0.66 in ETT triploids ( Figure 5b). However, given that such individuals belong to lineages evolving many generations without sex, they have accumulated a number of genomic rearrangements, such as gene conversions and deletions, which caused up to 3% of loss of heterozygosity in diagnostic SNP positions thereby introducing some noise to the data (Janko et al., 2021). Nevertheless, all three natural biotypes clearly separated from each other in terms of allelic proportions (see Figure 5b, Wilcoxon rank sum test for each pair that consisted of two types of hybrids reached p-value <2.2e−16). This demonstrates that SEQcap technology is well suited to detect genome dosage consistently with theoretical expectations even in naturally occurring hybrids and polyploids. However, we again detected a slight but pervasive bias towards elongatoides-derived alleles.
Since the WGS data were treated using the same bioinformatic tools as SEQcap data but show no bias with respect to allelic ancestry, we conclude that the bias observed in the SEQcap data does not originate from the bioinformatic treatment of the data. Furthermore, given that the same bias was observed in the laboratory F1 hybrids, it is unlikely to be caused by any mutations which might have accumulated after the initial formation of the hybrid clonal lineage, such as the loss of heterozygosity (e.g. Janko et al., 2021). Rather, we argue that the observed bias towards elongatoides-derived alleles is an inherent feature of the SEQcap protocol itself and reflects the probe design and their sensitivity to diverged genomes. Indeed, this is easy to understand considering that the probes were originally designed using C. elongatoides transcriptome.
Our data therefore show that whereas SEQcap protocols tend to be systematically biased by probe origin; despite this, they allowed correct inference of the genomic composition of the hybrids with a high degree of certainty. This effect could be minimized, assuming that we know the sequences of both species in advance either by creating consensual sequences or by designing specific probes for both species (e.g. Johnson et al., 2019).

F I G U R E 5
Boxplots show T-like variant (coming from C. taenia genome copy) proportion estimated from set of 46,563 bi-allelic diagnostic SNPs in each hybrid, that is, fixed variants that discriminate between C. taenia and C. elongatoides (Bartoš et al., 2019). Number of loci actually assessed for each individual varies between ~37,000 and ~39,500.

| CON CLUS IONS
Target enrichment approaches are not only not an outdated method, but have actually gained considerable popularity and are often the only meaningful solution to many research questions, particularly with respect to both efficiency and affordability, especially when dealing with non-model species (e.g. Choquet et al., 2019;Jiménez-Mena et al., 2022). This study has demonstrated that the robust design of exome capture experiment has predictable behaviour even when applied across a wide phylogenetic range encompassing species with very close relatedness to the model taxon to those dating its divergence at least up to ~40 Ma. We have confirmed that the results of this methodology are highly dependent on features such as the similarity of the target sequence to the reference and the GC content.
Most importantly, our study has clearly demonstrated that a properly designed SEQcap experiment could serve as a powerful tool in the study of hybrid and polyploid individuals/lineages. Studies of such organisms and the reconstruction of their evolution and possible posttransformational changes represent a live branch of contemporary evolutionary biology and genomics with profound impacts for theoretical biology (e.g. Du et al., 2020;Janko et al., 2021;Lien et al., 2016) as well as applied disciplines. Our results indicated that reduced representation sequencing based on the sequence capture approach may be reliably applied to hybrids whose sub-genomes diverged as far back as ~9 Ma. Sequence capture appeared very robust not only in terms of SNP calling but also in detecting the proper genomic dosage in polyploids, when evaluated against a suitable control dataset. On the other hand, despite a formidable performance at global genome-wide scales, our analysis also demonstrated that sequence capture is subject to considerable variance in capture efficiency and resulting coverage among targeted loci, which has to be taken into account when planning such experiments and interpreting their results.

ACK N O WLE D G E M ENTS
We are grateful for the financial support from the Czech Academy of Sciences of the Czech Republic RVO67985904 and from the Czech Science foundation (grant GAČR 19-21552S and GAČR 19-18453S) and the Ministry of Education, Youth and Sports of the Czech Republic (grant EXCELLENCE CZ.02.1.01/0.0/0.0/15_003/00004 60 OP RDE).

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The sequence data generated during and/or analysed during the current study are available in the NCBI repositories: BioProject PRJNA928239, GenBank GKIE00000000.