Sequence capture using AFLP‐generated baits: A cost‐effective method for high‐throughput phylogenetic and phylogeographic analysis

Abstract Target sequence capture is an efficient technique to enrich specific genomic regions for high‐throughput sequencing in ecological and evolutionary studies. In recent years, many sequence capture approaches have been proposed, but most of them rely on commercial synthetic baits which make the experiment expensive. Here, we present a novel sequence capture approach called AFLP‐based genome sequence capture (AFLP Capture). This method uses the AFLP (amplified fragment length polymorphism) technique to generate homemade capture baits without the need for prior genome information, thus is applicable to any organisms. In this approach, biotinylated AFLP fragments representing a random fraction of the genome are used as baits to capture the homologous fragments from genomic shotgun sequencing libraries. In a trial study, by using AFLP Capture, we successfully obtained 511 orthologous loci (>700,000 bp in total length) from 11 Odorrana species and more than 100,000 single nucleotide polymorphisms (SNPs) in four analyzed individuals of an Odorrana species. This result shows that our method can be used to address questions of various evolutionary depths (from interspecies level to intraspecies level). We also discuss the flexibility in bait preparation and how the sequencing data are analyzed. In summary, AFLP Capture is a rapid and flexible tool and can significantly reduce the experimental cost for phylogenetic studies that require analyzing genome‐scale data (hundreds or thousands of loci).

However, the custom-bait-based methods require reference genomic sequence of the target regions for bait design and the cost of bait synthesis is normally high (Jones & Good, 2016;Lemmon & Lemmon, 2013;McCormack et al., 2013). To skip the bait design and reduce experiment cost, some researchers have attempted to use homemade baits to implement target capture experiments, for instance, using PCR products as baits (Amplicon capture: Mariac et al., 2014;Maricic, Whitten, & Pääbo, 2010;Peñalba et al., 2014;Tsangaras et al., 2014), using restriction site-associated DNA fragments (RAD) as baits (hyRAD: Suchan et al., 2016), or using cDNA fragments as baits (Puritz & Lotterhos, 2018). These studies show that using homemade baits for sequence capture is feasible.
Therefore, developing more methods to simply and reliably generate homemade baits for sequence capture is of importance to the ecology and evolution research community.
The AFLP (amplified fragment length polymorphism) technique was established as a highly efficient genomic fingerprinting method in 1995 (Vos et al., 1995). Because the AFLP method is suited for fingerprinting genomic DNA of any origin and complexity, it has been widely used for applications in genetic analysis such as addressing genetic relationship, displaying population structure, and assessing genetic diversity (Mueller & Wolfenbarger, 1999;Vuylsteke, Peleman, & van Eijk, 2007;Zhang, van Parijs, & Xiao, 2014). The AFLP technology is based on selective PCR amplification of restriction fragments from the digested genomic DNA. One major advantage of this fingerprinting method is that no prior genomic information is needed and the number of amplified fragments can be controlled by the choice of restriction enzymes and the number of selective bases used in the amplification process (Vos et al., 1995).
Methodologically, AFLP is essentially a genome-reduced representation method combined with PCR and thus can potentially be used to generate baits for sequence capture.
In this study, we develop a novel approach called "AFLP-based genome sequence capture (AFLP Capture)," in which AFLP fragments are used as baits for sequence capture. This method does not require any prior genome information and can generate controllable number of fragments as capture baits by selective PCR amplification.
We also propose a bioinformatic pipeline based on mutual-BLAST for converting the AFLP Capture data into orthologous sequences or SNPs. We test the utility of this method by displaying individual genetic difference and resolving species relationship in the genus Odorrana (Amphibia, Anura, Ranidae). Our results demonstrate that AFLP Capture is capable of producing hundreds of orthologous loci or tens of thousands of SNPs for genetic analysis at both interspecies and intraspecies levels. The method presented here provides a rapid, scalable, and cost-effective way for high-throughput genetic analysis in nonmodel organisms.

| Samples and DNA extraction
Here, we wanted to test the experimental performance of the AFLP Total genomic DNA was extracted from ethanol-preserved liver or muscle tissue, using the TIANamp Genomic DNA Kit (TIANGEN Inc., Beijing, China). All DNA extracts were diluted to a concentration of 10-50 ng/μl with 1× TE and stored at −20°C for further use.

| AFLP bait preparation
The workflow of producing AFLP baits is illustrated in Figure 1.
Because the quality of the genomic DNA is crucial for the outcome of AFLP experiment, we extracted high molecular weight DNA (fragments ~20 KB) from ethanol-preserved tissue of an O. huanggangensis individual (Fujian) for AFLP bait preparation. Genomic DNA (1 μg) was digested for 20 min at 37°C with 1 μl of FastDigest MluI (Thermo Fisher Scientific), 1 μl of FastDigest SbfI (Thermo Fisher Scientific), and 2 μl 10× FastDigest buffer (Thermo Fisher Scientific) in a total volume of 20 μl. The digestion was purified using AMPure XP beads (Beckman Coulter), with a ratio of 1.8:1 to the sample, followed by resuspension in 50 μl of ddH 2 O. Then, restriction-associated doublestranded Y adapters were ligated to the purified fragments in a 60 μl reaction containing 10 μM MluI adapter, 10 μM SbfI adapter, 5U T4 DNA ligase (Fermentas Inc.), 100 mM ATP, and 1× T4 ligation buffer.
Ligation reactions were incubated for 30 min at 25°C. The 60 μl ligation reaction was subsequently size-selected for 500-to 2,000bp fragments with AMPure XP beads by separately adding 15 μl of beads (discard beads) and 20 μl of beads (keep beads) to the solutions. Finally, the size-selected products were eluted from the air-dry beads with 50 μl of 1× TE and used as templates in the next step.
In the preamplification step, the design of Y adapters ensures that only those fragments ligated with both SbfI and MluI adapters at the two ends can be successfully amplified (Figure 1) (Xu et al., 2012). The filtered reads for each sample were assembled into contigs using the SPAdes version 3.8.1 genome assembler (Bankevich et al., 2012), which automatically selects a K-mer value based on read length and data type (-cov-cutoff auto). The obtained contigs were further filtered with CD-HIT-EST (Fu, Niu, Zhu, Wu, & Li, 2012) to reduce redundancy (95% similarity cutoff). The sequencing depths for the filtered contigs were calculated by SAMtools version 1.4.1 . Only contigs of length ≥200 bp and average sequencing depth ≥5× were retained for further analysis.

| Identification of orthologous sequences (for interspecies level)
In a previous study, Suchan et al. (2016) Custom Python script Loci = 5,030 Reference O.huanggangensis (Fujian) to determine the optimal aligning region for all sequences within an OG. In brief, the script uses the mutual-BLAST results to determine the relative position of each sequence to the reference sequence and searches for the optimal upstream and downstream boundaries to trim the sequences (Figure 2). The trimming condition of the upstream and downstream boundaries is adjustable. Here, we demand that at least 50% of species have data at both the upstream and downstream boundaries.

| Phylogenetic analysis
For each OG, trimmed sequences were aligned using MUSCLE version 3.8 (Edgar, 2004) with default settings. Some OG alignments may still contain problematic sequences because of wrong orthology assignment or assembling errors. To circumvent this problem, a sequence was discarded if its average similarity to all other sequences within the alignment is below 30% (this cutoff value is somewhat arbitrary, but we think it is reasonable to remove sequences of such a low similarity). After realigning the retained sequences, we kept only the alignments that contain more than six taxa.
The refined alignments were concatenated into a final supermatrix. Phylogenetic trees were constructed using the maximum-likelihood (ML) method under the GTRGAMMA nucleotide substitution model in RAxML version 8.0 (Stamatakis, 2014). Branch support for the resulting phylogeny was evaluated with 500 rapid bootstrapping replicates (-f a option) implemented in RAxML.

| SNP calling (for intraspecies level) and Phylogeographic analysis
The filtered contigs captured from the O. huanggangensis (Fujian) sample that was used for bait preparation (a total number of 5,030 contigs; Figure 2)  A summary of the assembly, MBH analyses, and mapping statistics is shown in Table 1.
We first investigated the effectiveness of AFLP Capture to generate multilocus data set for phylogenomic analysis across a moderate evolutionary divergence. From the obtained target contigs, 511 OGs were identified that included at least six species of the 11 Odorrana species sampled in this study. The lengths of the 511 OGs ranged from 203 to 7,779 bp, with an average of 1,378 bp (Figure 3a).
Approximately 20% of the OGs were more than 2,000 bp, longer than the bait fragments (500-2,000 bp), which suggests that some captured loci contained flanking sequences of their bait regions. The evolutionary rates of these 511 OGs, measured by overall mean pdistance, ranged from 0.0001 to 0.1709 within the genus Odorrana and ranged from 0.0002 to 0.1883 when outgroup was included ( Figure 3b). As a reference gene, the mitochondrial COI has an evolutionary rate of 0.169 within Odorrana, which indicates that most of our captured nuclear loci evolve slower than COI. The concatenated supermatrix of the 511 OGs was 713,111 bp in length, 71% complete by locus and species or 58% complete by characters. ML analysis F I G U R E 2 Bioinformatic pipeline used to process the AFLP Capture data. For interspecies level analysis, the data were eventually converted into orthologous sequence groups (OGs). For intraspecies level analysis, the clean reads of different individuals were mapped onto the reference sequences to call SNPs. Detailed protocol is given in Materials and methods produced a well-resolved phylogeny for the 11 Odorrana species. All the 10 internal nodes were strongly supported with bootstrap values of 100% (Figure 4). We recognized four major clades within Odorrana (A-D; Figure 4), which was repeatedly found in previous Odorrana studies based on mtDNA data (Cai, Che, Pang, Zhao, & Zhang, 2007;Chen et al., 2013;Jiang & Zhou, 2005;Matsui, Shimada, Ota, & Tanaka-Ueno, 2005), nuclear data (Stuart, 2008), and combinations of mtDNA and nuclear data He, 2017;Pyron & Wiens, 2011;Wiens, Sukumaran, Pyron, & Brown, 2009). The relationships among the four clades is (A,((B,C),D)), most resembles the result of (A,(B,(C,D))) inferred from 14 nuclear genes (He, 2017).
We then explored the capability of AFLP Capture in discover-  Figure 5). This preliminary result indicates that the AFLP Capture data can also provide substantial amount of SNPs for population genetic studies or phylogeographic studies among closely related species.

| D ISCUSS I ON
Our study demonstrates that anonymous AFLP fragments can be used as capture baits to efficiently enrich a wealth of ortholo- to resolve the phylogenetic relationships within the genus, but also generate tens of thousands of high-quality SNPs to display genetic difference among individuals. These results suggest that AFLP Capture can be applied in both phylogenetic and population genetic studies of nonmodel organisms.
A target sequence capture experiment normally included two steps: (a) bait preparation; (b) library preparation, hybridization enrichment, and sequencing. Our study aims to provide a new solution for the first step of bait preparation but not the later step. In terms of bait preparation, most previously published methods used commercially synthesized baits, such as UCE (Faircloth et al., 2012) and AHE (Lemmon, Emme, & Lemmon, 2012). Although the cost of commercial baits can be diluted by applying them to more samples, the initial investment is high (normally ~$2,400). In contrast, to generate a set of AFLP fragments as capture baits, researchers just need to choose one bait species related to the organism group, of their interest. Following the five steps (DNA digestion, adapter ligation, size selection, preamplification, and selective PCR), the bait preparation can be done in no more than 2 days. The initial investment for the reagents (restriction enzymes, adapter, and PCR primer oligos) Our AFLP Capture method controls the number of restriction fragments mainly by selective PCR amplification. Although our method also includes a step of size selection, its purpose is just to remove small restriction fragments (length <500 bp). In our method, the key of controlling the number of DNA fragments is selective PCR amplification. Compared to the size selection strategy (using regular laboratory electrophoresis equipment), using selective PCR to control the number of baits is easier to manipulate and more repeatable (Zhang et al., 2014). In the selective PCR step of this study, each reaction contains a pair of selective primers with a single selective nucleotide at the 3' end (e.g., MluI-F-SA/ SbfI-R-SC), which can theoretically amplify one-sixteenth of the total digestion fragments. If one wants to decrease the complexity of the baits (fewer loci), one can increase the number of selective nucleotide to 2 or 3 to recover smaller proportion of the total fragments. On the contrary, if one needs to increase the complexity of the baits (more loci), one can pool different combinations of the selective PCR products. In this study, the PCR products of four selective primer combinations were pooled together, so we actually obtained a quarter of the total fragments. However, to ensure the capture efficiency, the number of bait fragments cannot increase without limit. According to our laboratory experience, the optimal number of bait fragments is between 500 and 5,000. And higher complexity of the baits also means that more sequencing data are needed to get sufficient coverage.
Unlike custom-designed baits, AFLP baits are anonymous genomic loci. So it is hard to estimate sequence divergence between AFLP baits and the target DNAs. If some species are too distantly related to the species that is used for bait preparation, the anonymous AFLP baits may not work well for them. Our demonstration case is the genus Odorrana of which divergence is ~22 million years (He, 2017). However, genera are subjective and can be very young or very old, and moreover, different organisms have various evolutionary rates. It is not appropriate to say that AFLP Capture works well at genus level or work well for a group of less than 20 million years. We suggest sequencing a common DNA molecular marker such as mitochondrial COI to evaluate the rough sequence divergence between the probe species and other samples. In our case, when the mitochondrial COI divergence between the probe species and some of our samples (outgroup) exceeds 20%, the AFLP Capture method recovers no more than 25% of the 511 target OGs ( Figure 6).
However, when the COI divergence is below 17.2%, the target locus recovery rate increases to 50% -85% ( Figure 6). This threshold value (20% COI divergence) can be used to predict whether the AFLP bait can effectively capture the target sequence.
Most target capture methods use the sequences of baits as reference to map the capture data (Faircloth, 2015;Faircloth et al., 2012;Johnson et al., 2016;Lemmon et al., 2012 mapping. Moreover, in sequence capture experiments, the capture data often contain flanking sequences of the bait regions. If the capture sequencing data are directly mapped to the bait reference sequences, the flanking sequences of the bait regions will not be kept. Alternatively, if the sequencing data are first assembled into large contigs and then assigned to orthologous groups using the MBH strategy, the flanking sequences of the bait regions can be kept and used in the subsequent analysis. In fact, before this study, we had a small pilot experiment to compare the difference in the utilization rate of the AFLP Capture data using the two different strategies: (a) mapping data to bait reference and (b) identifying orthologous contigs by MBH. The pilot experiment included only eight samples, and the AFLP bait was sequenced.
We identified 94 OGs (a total length of 89,168 bp with zero missing data) by mapping data to bait reference and 164 OGs (a total length of 216,328 bp with zero missing data) by MBH. This rough result showed that using MBH rather than direct mapping can obtain more OGs and a longer concatenated data matrix from AFLP Capture data. However, because the pilot experiment is small, further comparison between these two data processing strategies may be needed in the future.
In summary, we presented a method of using the AFLP technique to generate a large number of anonymous genomic fragments as capture baits. Compared to the other target capture methods using commercial synthesis baits, such as AHE (Lemmon et al., 2012), UCE capture (Faircloth et al., 2012), and exon capture (Albert et al., 2007;Bi et al., 2012;Li et al., 2013;Ng et al., 2009), our approach has the benefits of less expensive and more flexible bait preparation steps without the need of prior genome information. The AFLP Capture method can recover hundreds to thousands of homologous loci from relatively diverged taxa, making it particularly suitable to study shallow-scale divergence events, such as the relationships within a genus or a species complex. We hope that this method can serve as a new tool of data collection for the evolutionary biologists working in the era of high-throughput sequencing.

DATA ACCE SS I B I LIT Y
Raw read data were deposited in NCBI SRA (accession PRJNA503087). Experiment protocol, python script, the concatenated data matrix, the SNPs matrix, raw capture contigs, and the resulting phylogenetic trees were deposited in FigShare https://figshare.com/s/5a4eee383e2dc9afba53.