Relative performance of customized and universal probe sets in target enrichment: A case study in subtribe Malinae

Premise Custom probe design for target enrichment in phylogenetics is tedious and often hinders broader phylogenetic synthesis. The universal angiosperm probe set Angiosperms353 may be the solution. Here, we test the relative performance of Angiosperms353 on the Rosaceae subtribe Malinae in comparison with custom probes that we specifically designed for this clade. We then address the impact of bioinformatically altering the performance of Angiosperms353 by replacing the original probe sequences with orthologs extracted from the Malus domestica genome. Methods To evaluate the relative performance of these probe sets, we compared the enrichment efficiency, locus recovery, alignment length, proportion of parsimony‐informative sites, proportion of potential paralogs, the topology and support of the resulting species trees, and the gene tree discordance. Results Locus recovery was highest for our custom Malinae probe set, and replacing the original Angiosperms353 sequences with a Malus representative improved the locus recovery relative to Angiosperms353. The proportion of parsimony‐informative sites was similar between all probe sets, while the gene tree discordance was lower in the case of the custom probes. Discussion A custom probe set benefits from data completeness and can be tailored toward the specificities of the project of choice; however, Angiosperms353 was equally as phylogenetically informative as the custom probes. We therefore recommend using both a custom probe set and Angiosperms353 to facilitate large‐scale systematic studies, where financially possible.

The Angiosperms353 markers seem to resolve the phylogenies of rapidly radiating groups well (Larridon et al., 2020;Shee et al., 2020) and show potential for resolving from deep  to shallow phylogenetic scales, even resolving within-species relationships (Van Andel et al., 2019;Murphy et al., 2020), but their utility has not yet been fully evaluated. To date, only one study has compared the benefits of Angiosperms353 with those of a custom probe set (Larridon et al., 2020). Two other studies that did not use Angiosperms353 as a universal probe set but compared the performance of universal probes with that of custom probes are Kadlec et al. (2017), who designed universal markers for Erica L. (Ericoideae), and Chau et al. (2018), who used three universal probe sets for Buddleja L. (Scrophulariaceae). All three studies, although utilizing different universal and custom probe sets, reported a similar degree of phylogenetic informativeness for both Angiosperms353 and the custom probe set. However, differences between these sets were also reported. Target locus recovery in outgroup taxa was more consistent for the universal probes than custom probes (Chau et al., 2018), due to a similar sequence divergence between probes and ingroup as well as outgroup taxa; for custom probes, the probe-to-outgroup sequence divergence was comparatively higher (e.g., Carlsen et al., 2018). On the other hand, the total number of target loci is usually higher in a custom probe set (Kadlec et al., 2017) because they are not constrained by the need to find conserved single-copy loci across a large number of distantly related taxa. Locus recovery in the ingroup seemed to be generally higher for the custom probes, due to inherent features of the design of universal probes (Larridon et al., 2020). Universal and custom probe sets thus both have their advantages and disadvantages.
Increased data completeness through improved locus recovery could potentially be achieved by an optimization of the Angiosperms353 sequences toward a genome closely related to the study group, which would then be used as a reference for read mapping. To our knowledge, no one has yet attempted to optimize Angiosperms353 for target enrichment in the data analysis step, although the optimization of Angiosperms353 has recently been performed for the probe design (Jantzen et al., 2020).
In this study, we generated target enrichment data for the Rosaceae subtribe Malinae, with a particular focus on the genus Crataegus L., using both Angiosperms353 and our newly designed Malinae custom probe set (hereafter referred to as Malinae481) to evaluate the strengths and weaknesses of Angiosperms353. We were interested in the phylogenetic utility of Angiosperms353 from two angles: (i) in a comparison with Malinae481, and (ii) in a comparison with optimized Angiosperms353 sequences toward representatives from the Malus domestica (Suckow) Borkh. genome, which we used as a reference for read mapping.

Taxonomic focus
The subtribe Malinae (tribe Maleae, family Rosaceae) includes up to 30 genera (Robertson et al., 1991) and over 10 hybrid genera, totaling more than 1000 species and interspecific as well as intergeneric hybrids (Phipps et al., 1990). Members of the Malinae natively occur mostly in the temperate zone of the Northern Hemisphere. According to both molecular (e.g., Potter et al., 2002) and morphological studies (e.g., Kalkman, 1988), the Malinae appear to be a monophyletic group. All members share the same base chromosome number (x = 17) and several characteristic traits, such as fruits with a varying degree of fleshiness derived from hypanthial ovaries, as well as widespread apomixis and polyploidy (Dickinson, 2018). It is hypothesized that a whole-genome duplication (WGD) event followed by a rapid radiation played a central role in the origin of the group (Evans and Campbell, 2002;Velasco et al., 2010). The level of divergence between the genera of the Malinae is generally low, even though its origin dates back to at least the Middle Eocene (Lo and Donoghue, 2012); this explains the numerous intergeneric hybrids and the lack of resolution between its major clades.
Here, we selected 25 species within the Malinae, 13 in Crataegus and 12 belonging to seven genera from various other clades (Appendix 1). Prunus tenella Batsch from the tribe Amygdaleae was taken as an outgroup, although a rather divergent one. This sample set allowed us to test the relative performance of Angiosperms353 and Malinae481 at the subtribal and genus level.

DNA ploidy estimation
We avoided neopolyploids in our sampling and therefore estimated the DNA ploidy level (Suda et al., 2006) of the various tissue samples (Appendix 1) using flow cytometry. The sample preparation followed the simplified two-step protocol (Doležel et al., 2007). The seeds (rarely silica-dried leaves) and an appropriate leaf volume of the internal standards (Pisum sativum L. cv. Ctirad [2C = 9.09 pg; Doležel et al., 1998] for Malinae; Carex acutiformis Ehrh. [2C = 0.82 pg; Lipnerová et al., 2013] for the outgroup Prunus tenella, which has a much smaller genome size) were chopped with a razor blade in a Petri dish containing 0.5 mL of ice-cold Otto I buffer (0.1 M citric acid, 0.5% Tween 20). After a 10-min incubation at room temperature, the suspension of nuclei was filtered through a 42μm nylon mesh. For the estimation of ploidy level, 1 mL of staining solution containing Otto II buffer (0.4 M Na 2 HPO 4 · 12 H 2 O) and 4 μg/mL 4′,6-diamidino-2-phenylindole (DAPI) was added to the suspension of nuclei. After a 5-min incubation at room temperature, the solution of stained nuclei was analyzed using a CyFlow ML cytometer (Sysmex Partec, Görlitz, Germany) equipped with a 365-nm UV-LED as the source of UV light for DAPI excitation. The fluorescence intensity of 5000 particles was recorded for further data processing. We determined the ratio of DAPI-stained nuclei in the G1 phase of the cell cycle (G1 peak) and the respective internal standards from the resulting histograms using the software FloMax FCS 2.0 (Sysmex Partec) to estimate the relative genome size. We then inferred the DNA ploidy level by comparing the relative genome sizes to the range of DNA amounts that Talent and Dickinson (2005) accepted as representing certain ploidy levels in Crataegus. These ploidy levels, together with genome sizes adopted from Talent and Dickinson (2005), are given in Appendix 1.

Design of Malinae481
To design a custom probe set for the Malinae, we first evaluated the previously published divergence between single-copy orthologs and relatively recently diverged paralogs in the genome of M. domestica. The average sequence divergence among orthologs was 3.65% in a http://www.wileyonlinelibrary.com/journal/AppsPlantSci © 2021 Ufimov et al.
genome-wide comparison of Malus Mill. and Pyrus L. (Velasco et al., 2010), which is about one third of the average sequence divergence of 9.36% between the Maleae and Amygdaleae orthologs (Velasco et al., 2010). This divergence was approximately half as large as the 8% divergence between paralogs from the most recent WGD within each of these genomes (Wu et al., 2013), based on a four-fold degenerate site transversion. Based on this information, we developed a strategy to identify in the Malus genome (1) single-copy loci, presumably hidden paralogs due to the independent loss of distinct paralogs in different lineages, i.e., "one-duplicate loss" (Xiang et al., 2017), and (2)  In an initial filtering step, we only retained the hits exceeding 70 bp and 10% of the query length with a ≥80% sequence similarity between the query and the subject. We then assigned the hits (usually corresponding to exons) to loci based on the criterion that the length of introns separating the hits did not exceed 10,000 bp. Queries that showed hits with more than six loci were not taken into account. In a second, refined filtering step, we retained only the loci that fulfilled the following criteria: length coverage and sequence similarity of the sum of all hits for a particular locus of ≥90%, length of single hits ≥100 bp (in accordance with the bait length of 100 bp), intron length ≤1200 bp, number of loci per query ≤2, and sequence divergence among loci ≥6%. We then BLASTed the 1280 Malus queries that fulfilled our selection criteria against the Pyrus communis L. Bartlett DH genome version 2.0 (Linsmith et al., 2019) and applied the same selection criteria, which 616 of these queries fulfilled. Subject sequences from the Malus (799) and Pyrus (764) genomes, which matched the chosen mRNAs and represented full loci (including introns), were then extracted and the exon-intron boundaries were inferred based on the alignments, together with the BLASTed queries. After filtering for identical numbers of loci in both genomes, a sequence divergence between the exons of single-copy Malus and Pyrus loci of ≤15%, and an exon length ≥80 bp, we ended up with 713 loci (481 loci if pairs of paralogous loci are treated as single loci), which corresponded to 546 mRNAs. All subject exons and introns were separated. Introns ≥80 bp in length were selected for bait design alongside the exons; however, the data set based on the targeted introns is not included in this study to ensure the comparability of the results based on Malinae481 with those of Angiosperms353, which are exclusively based on targeting exons. The extracted sequences were collapsed at ≥95% similarity and used for bait design. The final exonic probe set covers 2,008,479 bp in total.

Illumina library preparation and target enrichment
The genomic library preparation followed two slightly different protocols for the different sequencing runs, which we indicate below. The affiliation of samples to sequencing runs is given in Appendix 1.
Genomic DNA was extracted using the DNeasy Plant Mini Kit (QIAGEN, Venlo, Netherlands). Between 200 ng and 1 µg of extracted DNA was sheared in 50 µL of double-distilled water using an M220 Focused-ultrasonicator (Covaris, Woburn, Massachusetts, USA) with the program for fragmentation to 500 bp for 25 s. Library preparation was performed using the NEBNext Ultra DNA Library Prep (New England Biolabs, Ipswich, Massachusetts, USA) protocol for Illumina with a few modifications: (1) a half volume of the samples and NEBNext chemicals were used during library preparation; (2) one additional cleanup step was implemented after the adapter ligation, for which a QIAquick PCR Purification Kit (QIAGEN) was used to clean residual nucleotides, enzymes, and salts from the DNA fragments as a prerequisite for efficient bead-based size selection; (3) size selection (~400-600 bp) was performed using SPRIselect beads (Beckman Coulter, Brea, California, USA) with the ratio 0.65× for the left side and 0.55× for the right side selection, and amplification of the ligated size-selected fragments was performed with eight cycles of PCR, using NEBNext Multiplex Oligos for Illumina Index Primers Set 1 and 2 (New England Biolabs) or Dual Index Set 1 (New England Biolabs); (4) enriched PCR products were cleaned twice with Agencourt AMPure XP beads (Beckman Coulter) with the ratios 0.75× and 0.7×. The libraries were subsequently pooled in approximately equimolar ratios in a 24-plex (Angiosperms353) or 13-plex (Malinae481) reaction.
We performed a solution hybridization using MyBaits biotinylated RNA baits (Arbor Biosciences, Ann Arbor, Michigan, USA). The enrichment followed the MYbaits manual version 3.02 (Angiosperms353) or version 4.01 (Malinae481) with approximately 700 ng (Angiosperms353) or 2 × 400 ng (Malinae481) input DNA and 12 cycles of PCR enrichment. Target-enriched libraries were mixed with unenriched libraries (ratio 2 : 1) to increase the fraction of off-target plastid reads, which tends to be small when using the most recent MYbaits kit versions. The majority of samples were sequenced on an Illumina (San Diego, California, USA) MiSeq at BIOCEV (Vestec, Czech Republic) using kit version 2 to obtain 250-bp paired-end reads. Together with 94 samples for a different study, two samples were sequenced on an Illumina NextSeq at the Genomics Core Facility of CEITEC (Brno, Czech Republic), utilizing the mid-output kit to obtain 150-bp paired-end reads. All DNA concentration measurements were taken using a Qubit 2.0 fluorometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA).

Bioinformatic optimization of Angiosperms353 toward improved locus recovery
To improve the locus recovery from raw Angiosperms353 data, we created a reference for read mapping that was optimized for the Malinae. First, we needed to know the exon-intron boundaries of Angiosperms353, and thus BLASTed them against the genome of M. domestica GDDH13 using the National Center for Biotechnology Information (NCBI) Web BLAST with the 'BLASTn' default options (Johnson et al., 2008). The BLAST output was inspected using NCBI Genome Data Viewer (https://www.ncbi.nlm. nih.gov/genom e/gdv/). Most of the probes did not have continuous hits, i.e., different parts of the same sequences usually had hits interrupted by intervals of various lengths. In addition, the exons sometimes appeared to be partial. Furthermore, we noticed a relatively low sequence similarity to the Malus genome. Around 80% of all sequence representatives of Angiosperms353 with hits showed a sequence similarity below 80%.
For the modification of the Angiosperms353 reference, two custom python scripts were employed (available from https://github. com/rufim ov/2ex [accessed May 2020]). In the first step, using  Zhang et al., 2019]), based on the genome annotation. These two genomes were selected because they differ in the quality and completeness of their assembly and annotation, which gave us a higher number of annotated genes when combining them. For the exon extraction, we included protein-coding genes, long non-coding RNAs, and pseudogenes (including transcribed ones), but we omitted all other types of RNAs, such as ribosomal RNAs, transfer RNAs, microRNAs, and small nucleolar RNAs, as well as plastid and mitochondrial sequences. The exons were then concatenated into transcripts, resulting in 67,246 transcripts for 43,659 loci in GDDH13 and 42,841 transcripts for 42,841 loci in HFTH1. Subsequently, using '2ex_split.py' , we BLASTed the Angiosperms353 sequences against the concatenated exons from GDDH13 and HFTH1, which were the outcome of the previous step, using standalone BLAST version 2.10.0+ (Camacho et al., 2009), and sorted the hit table according to hit length. By doing so, it was possible to obtain the top hit representative of each of the up to 18 representatives for each Angiosperms353 locus, as well as the best matching locus from either of the two Malus genomes. The best matching Malus loci were then split into individual exons based on the annotations, which were obtained in the beginning using '2ex_extract.py' , and they were BLASTed back against the best Angiosperms353 representatives using standalone BLAST. The resulting hit table was sorted according to hit length, and only the top hit for each exon was kept. The sequences of the best representatives of Angiosperms353 and the corresponding Malus exons were eventually trimmed to the size of the top hits (based on the start and end of hits) and subsequently concatenated.
Our bioinformatic optimization of Angiosperms353 resulted in two modified Angiosperms353 references with one sequence representative each per locus, and of relatively equal size (Table 1). One reference comprised the best single Angiosperms353 representative per locus (353 sequences with a total length of 225,060 bp; https:// github.com/rufim ov/2ex/blob/main/bestH it-modif ied_Angio sperm s353.fasta), whereas the other reference contained the original Angiosperms353 sequence representatives replaced by the best matching Malus sequence for each locus (353 sequences with a total length of 225,289 bp; https://github.com/rufim ov/2ex/blob/main/ Malin ae-optim ized_Angio sperm s353.fasta). We will refer to these references here as bestHit-modified and Malinae-optimized, respectively. The results based on the bestHit-modified reference are presented, together with the results based on the other references (which we call probe sets if these sequences had initially been used for bait synthesis); this may help elucidate the reasons for any differences in the recovery of the target loci, if the number of sequence representatives per locus would play a role. If we had only compared the Malinae-optimized reference with Angiosperms353, it could be argued that differences in locus recovery stem from the fact that the Angiosperms353 markers have multiple sequence representatives per locus, whereas the Malinae-optimized reference has only one. All probe sets and references used in this study are summarized in Table 1.

Processing of target nuclear sequences
The raw data were pre-processed using custom scripts (available at https://github.com/V-Z/hybse q-scripts [accessed May 2020]) and GNU parallel (Tange, 2018). The reads were trimmed using Trimmomatic One "best matching" out of the up to 18 angiosperm representatives version 0.39 (Bolger et al., 2014;SLIDINGWINDOW:5:20 LEADING:20 TRAILING:20 MINLEN:50). For deduplication, 'clumpify.sh' from BBmap version 38.42 (dedupe optical spany adjacent) was utilized. We then analyzed the pre-processed reads using HybPiper version 1.3.1 (Johnson et al., 2016) with the Burrows-Wheeler aligner (BWA; Li andDurbin, 2009, 2010;Li, 2013) option using different reference files: up to four representatives per locus for Malinae481, up to 18 sequence representatives per locus for Angiosperms353, and a single sequence representative per locus for both Malinae-optimized and bestHitmodified. In the case of Malinae481, each pair of targeted paralogous loci was treated as a single locus. Read mapping was conducted using BWA and the contig assembly was performed using SPAdes (Bankevich et al., 2012). Subsequently, custom scripts were used to post-process the HybPiper output, obtain assembly statistics, and align all contigs retrieved by HybPiper using MAFFT version 7.453 (Katoh and Standley, 2013) with the -auto and -adjustdirectionaccurately options. The packages 'ape 5' (Paradis and Schliep, 2019) and 'ips' (Heibl, 2008) in R version 3.6.2 (R Core Team, 2019) were used to trim all alignments (every row and then every column with more than 30% missing data were removed) and to obtain alignment statistics. Alignments containing fewer than four sequences were discarded. The alignment characteristics (number of taxa, alignment length, number and proportion of variable sites and parsimony-informative [PI] sites) were calculated using AMAS version 0.98 (Borowiec, 2016). We finally utilized our scripts to estimate the gene trees in a maximum likelihood framework using IQTREE version 1.6.12 (Nguyen et al., 2015). ModelFinder (Kalyaanamoorthy et al., 2017) implemented in IQTREE was applied to determine the best-fit model in combination with the invariable site plus FreeRate model to predict the sequence evolution for each gene, and 10,000 ultrafast bootstrap replicates (Hoang et al., 2018) were performed. The gene trees were rooted utilizing Newick Utilities version 1.6 (Junier and Zdobnov, 2010). The species tree was reconstructed using ASTRAL version 5.6.1 (Zhang et al., 2018) and rooted using Newick Utilities.
It should be noted that we did not remove potentially paralogous loci from our data sets. In a plant group such as the Malinae, the removal of paralogous loci resulting from the most recent WGD event would imply the omission of a large proportion of the data. Furthermore, if certain loci appear as single-copy, it cannot be excluded that this is the result of one-duplicate loss leading to hidden paralogy (Xiang et al., 2017). We overcame this dilemma by developing an approach of utilizing paralogs for phylogenetic reconstruction, which will be published in due course (Ufimov et al., in prep.); therefore, in this study, we only compared the proportion of potentially paralogous loci that were identified as such using HybPiper between all data sets.
We addressed topological conflict between gene trees and support for the species tree utilizing Phyparts (Smith et al., 2015). The resulting pie charts were mapped onto the species tree using 'phypartspiecharts.py' (available at https://github.com/mossm atter s/MJPyt honNo tebooks [accessed May 2020]). Phyparts requires the rooting of the gene trees and species tree. As our trees were rooted using Prunus tenella, which was absent in certain gene trees, the number of gene trees was reduced to 350 for the Malinae481 data set and 330 for the Malinae-optimized data set.
The completeness of target enrichment data sets may be influenced by the methods used for data analysis, particularly the read mapping approach (Larridon et al., 2020). We therefore compared HybPiper with the BWA option and the BLASTX (Altschul et al., 1990) option for read mapping (Table 2), in addition to comparing HybPiper, which uses read mapping and subsequent de novo In the case of Angiosperms353, we compared HybPiper with the BWA option and the BLASTX option for read mapping.
b "Target length" refers to the recovered length per target locus. In the case of Angiosperms353, with multiple sequence representatives of differing lengths for each locus, the average length was calculated.
c "Accessions presence" refers to the proportion of accessions with sequence information for each target locus.
assembly, with HybPhyloMaker , which is built on a reference-guided read assembly. HybPiper showed a slightly better performance over HybPhyloMaker; all details about the HybPhyloMaker analyses can be found in Appendix 2.
Processing off-target plastid sequences As plastid reads are a byproduct of target enrichment (Weitemier et al., 2014), we were interested in the recovery of the plastome using Angiosperms353 and Malinae481. BWA, implemented in HybPhyloMaker, was used to map the quality-trimmed, deduplicated reads from all sequencing runs to the plastome of M. angustifolia (Aiton) Michx. (GenBank: NC_045410.1; , from which we had removed one inverted repeat. For consensus calling using Kindel version 0.1.4 (Constantinides and Robertson, 2017), the minimum read depth was set to 2× and majority rule to 51%. To compare the number of plastid reads, we separately mapped the reads from the runs that used the same baits for target enrichment, but subsequently, because we used the same samples in the case of both bait sets, we merged all reads to infer the plastome phylogeny. This phylogeny was built from all coding regions and spacers/introns, which we extracted from the reference plastome based on its annotation. The resulting sequences were aligned using MAFFT version 7.029 and concatenated and partitioned using AMAS, so that each partition included either coding sequences or spacers/introns. In cases with multiple exons per gene, the exons were concatenated and partitioned as a whole. A model test was performed separately for all partitions using ModelTest-NG version 0.1.6 (Darriba et al., 2020). The plastome tree was reconstructed using RAxML-NG version 0.9.0 (Kozlov et al., 2019) with the best model for each partition and the bootstopping option with a maximum of 1000 bootstrap replicates. Bootstrapping converged after 150 replicates. The tree was rooted and visualized using FigTree version 1.4.4 (http://tree.bio.ed.ac. uk/softw are/figtree).

Ploidy estimation
A flow cytometric analysis resulted in high-resolution histograms with the mean coefficient of variation (CV) of the G1 peak of the Malinae samples being 2.95% (range 2.20-3.45%) and 3.16% for Prunus tenella. The mean CV of G1 for the internal standard Pisum sativum was 1.66% (range 1.05-2.73%) and 1.50% for Carex acutiformis. Nevertheless, the ploidy level could be successfully determined from only 14 species within the Malinae (Appendix 1). These were found to be mainly diploid, except for a few Crataegus accessions, for which it was difficult to estimate the ploidy level due to an intermediate ratio between 2x and 3x. Moreover, the genome size and ploidy level of the four Crataegus samples were adopted from a previous study by Talent and Dickinson (2005); their genome sizes (ranging from 2C = 1.43 pg to 1.82 pg) suggested diploidy.

Comparison of Angiosperms353 with Malinae481
Using HybPiper, the average assembly performance of Angiosperms353 and Malinae481 was compared (Table 2). Angiosperms353 resulted in 25.0% mapped reads for the Malinae and 22.7% mapped reads for the outgroup Prunus tenella, whereas Malinae481 allowed 55.4% and 43.4% of the reads to be mapped for the species within the Malinae and for the outgroup, respectively. Both probe sets showed a decrease in the number of loci when higher values of target length (i.e., recovered length per target locus) and accessions presence (i.e., proportion of accessions with sequence information per target locus) were applied, but with a different decline (Fig. 1, Table 2). This we found to be the most striking difference in the performance of Angiosperms353 vs. Malinae481. The best yield for species within the Malinae was obtained using Malinae481, for which the percentage of recovered loci was only slightly lower with a stricter missing data filtering, ranging from 99.4% to 97.7%, whereas Angiosperms353 showed a rapid decline for species within the Malinae (from 93.2% to 53.3% of recovered loci). For the outgroup, the percentage of recovered loci was much lower for Malinae481 (from 63.4% to 51.4%) than for Angiosperms353 (95.2% to 72.8%). The alignment characteristics for each data set (number of loci, average number of taxa per locus, alignment length, total alignment length, proportion of variable and PI sites per alignment, total number of PI sites) are provided for each probe set in Table 3, and the alignment length and proportion of PI sites are shown in Fig. 2. The number of loci and taxa were higher for Malinae481 (481 loci, 25 taxa; exonic data) than for Angiosperms353 (344 loci, 23 taxa; exonic data). The average and total alignment length were substantially lower when using Angiosperms353 compared with Malinae481 for the exonic data (average alignment lengths of 604 bp and 1415 bp, respectively; total alignment lengths of 207,717 bp and 680,658 bp, respectively), as was the supercontig data (average alignment lengths of 1850 bp and 2161 bp, respectively; total alignment lengths of 636,398 bp and 1,039,356 bp, respectively). In contrast, the intronic data set had higher values for Angiosperms353 than for Malinae481 (average alignment lengths of 1242 bp and 758 bp, respectively; total alignment lengths of 427,280 bp and 364,669 bp, respectively; Table 3). The proportion of variable and PI sites was similar between the Angiosperms353 and Malinae481, but differed between the exonic, intronic, and supercontig data sets (average proportion of PI sites: 6.7% and 9.0%, respectively, for exons; 18.0% and 23.8%, respectively, for introns; and 13.8% and 13.4%, respectively, for supercontigs; Table 3). The total number of PI sites was substantially higher for Malinae481 compared with Angiosperms353 for both the exonic (59,369 and 13,561, respectively) and supercontig data sets (133,632 and 82,339, respectively), although it must be acknowledged that the total locus number differed between Malinae481 (481 loci) and Angiosperms353 (353 loci).
Approximately one third more loci were flagged by HybPiper as being potentially paralogous in the data set using Malinae481 (37.6%) compared with Angiosperms353 (24.2%; Appendix 3), which is the consequence of intentionally including paralogous loci when designing Malinae481.

Bioinformatic optimization of the Angiosperms353 reference for the Malinae
To evaluate the outcome of our optimization of the original Angiosperms353 reference, the results based on the Malinaeoptimized reference were compared with those for Angiosperms353. The HybPiper results showed that the Malinae-optimized reference performed similarly to Angiosperms353 ( Fig. 1; Tables 2, (Table 2): for the Malinae-optimized reference, it slightly decreased with higher values of target length and accessions presence (from 94.1% to 72.8%), but it rapidly declined for Angiosperms353 (from 93.2% to 53.3%) when used for the Malinae. The decrease in data completeness was less pronounced for Angiosperms353 in the case of the outgroup (Fig. 1). Alignment characteristics for concatenated exons, introns, and supercontigs are shown for Angiosperms353 and the Malinaeoptimized reference in Table 3 and Fig. 2. The number of loci and taxa were similar for Angiosperms353 and the Malinae-optimized reference (344 loci and 23 taxa vs. 346 loci and 24 taxa, respectively; exonic data). The average alignment length was also comparable across all three data sets (604 bp and 575 bp, respectively, for exons; 1242 bp and 1336 bp, respectively, for introns; and 1850 bp and 1902 bp, respectively, for supercontigs). The proportion of variable and PI sites was slightly higher for Angiosperms353 than the Malinaeoptimized reference and differed between the exonic, intronic, and supercontig data sets (average proportion of PI sites: 6.7% and 6.2%, respectively, for exons; 18.0% and 16.9%, respectively, for introns; and 13.8% and 13.4%, respectively, for supercontigs; Table 3).
The topologies of the species trees based on the Malinae-optimized reference and Malinae481 differed slightly (Fig. 3, Appendix 4). The Malinae481 tree had substantially higher branch lengths and node support than the Malinae-optimized tree (Appendix 4). The generally high gene tree discordance, which is common for target enrichment data sets (e.g., Morales-Briones et al., 2018;Herrando-Moraira et al., 2019), was lower for Malinae481 (Fig. 3).
As an intermediate step toward optimizing the Angiosperms353 reference for the Malinae, we chose the best hit of up to 18 sequence representatives of the Angiosperms353 loci for the Malus genome, which resulted in the bestHit-modified reference (see the Methods section "Bioinformatic optimization of Angiosperms353 toward improved locus recovery"). We thereby evaluated the performance of this bestHit-modified reference in comparison with Angiosperms353. The detailed assembly and alignment characteristics are presented in Tables 2 and 3. This reference performed by far the worst for all characteristics, which shows that the selected "best" Angiosperms353 sequence representative for each locus is in fact not the best because it still has a high sequence divergence from the reads; for example, only six representatives of the original Angiosperms353 are from the Malinae. Only in concert with other sequence representatives per locus can a satisfactory locus recovery be achieved.
More loci were flagged by HybPiper as potentially paralogous in the data set using the Malinae-optimized reference (31.2%) compared with Angiosperms353 (24.2%; Appendix 3), which is probably a byproduct of the generally increased locus recovery in the case of the Malinae-optimized reference ( Table 3).

Recovery of plastid data using Angiosperms353 vs. Malinae481
Plastome recovery was similar for the runs with Angiosperms353 and Malinae481 (Appendix 5); the average proportion of mapped plastid reads (1.9% and 2.6%, respectively) and the percentage of missing data (19.8% and 18.1%, respectively) showed little difference. When the reads from all runs were combined, the percentage of missing data dropped to a negligible 0.6%, and almost the entire plastome was successfully assembled for each accession.
The plastome tree showed high levels of support (Appendix 6) and had two main clades, similar to the multilocus nuclear tree: Amelanchier+Crataegus and Malus+Pyrus+Sorbus. The positions of Pyracantha coccinea M. Roem., Micromeles alnifolia Koehne, FIGURE 3. Comparison of topology and gene tree (in)congruence of ASTRAL species trees. The Malinae-optimized reference and Malinae481 were used to generate these trees. For each branch, the top number indicates the number of gene trees concordant with the species tree at that node, and the bottom number indicates the number of gene trees in conflict with that node. The pie charts present the proportion of gene trees that support that clade (blue), the proportion that support the main alternative topology for that clade (green), the proportion that support the remaining alternative topologies (red), and the proportion that inform (support or conflict) that clade with <50% bootstrap support (gray).
and Pourthiaea villosa Decne. were in discordance with the nuclear phylogeny; depending on the probe set used for target enrichment, the positions of these three taxa shift between clades. This may be due to the limited sampling in combination with the very short branches of the basal relationships.

Phylogenetic informativeness of Angiosperms353 vs. custom probes
Our results showed that universal probes and custom probes have a similar proportion of PI sites (Table 3), which is in agreement with the findings of Kadlec et al. (2017), Chau et al. (2018), Larridon et al. (2020), and Ogutcen et al. (2021). This seems to be a robust finding because both the universal and custom probe sets that were compared in the previous and present studies were generated in different ways and from different sources; only Larridon et al. (2020), Ogutcen et al. (2021), and this study used Angiosperms353 as the universal probe set. A similar proportion of PI sites in data sets generated with universal vs. custom probes was found across targeted plant groups from the genus level (Erica [Kadlec et al., 2017], Buddleja [Chau et al., 2018], Cyperus [Larridon et al., 2020], Crataegus [this study]) to the levels of subtribe (Malinae [this study]), family (Cyperaceae [Larridon et al., 2020], Gesneriaceae [Ogutcen et al., 2021]), and order (Ericales [Kadlec et al., 2017]). However, the scale to which phylogenetic informativeness applies may vary between universal and custom probe sets, with custom probes having a broader scale, from the infraspecific to the family level (Straub et al., 2020). Furthermore, phylogenetic informativeness per se does not guarantee the improvement of a phylogenetic hypothesis, as it is linked with data completeness, which we discuss below.

Data completeness is high with custom probes
There are certain differences between universal and custom probes that deserve detailed consideration. The total number of loci in a custom probe set is often higher than in a universal probe set, e.g., 1164 (Schmickl et al., 2016) or 806 (Medina et al., 2019) (but see also smaller custom probe sets of 276 loci created by Nicholls et al. [2015] or of 176 loci in Heyduk et al. [2016]), which increases the total number of phylogenetically informative characters. This could be crucial for deciphering phylogenetic relationships between recently diverged taxa or within species, a future prospect of the target enrichment approach (Villaverde et al., 2018). However, even if the total locus number in a custom probe set is similar to a universal probe set, the average alignment length is often higher in custom probe sets, such as in the case of Malinae481: 604 bp is the average alignment length for Angiosperms353 in the Malinae, while for Malinae481 it is 1415 bp. Note that this length refers to the exonic data set, which resembles the sequences of Angiosperms353 and the majority of sequences of Malinae481 (we targeted introns as well, but did not include the data in this study). Interestingly, the average alignment length for both probe sets approximated each other for the supercontig data set due to increased intron recovery using Angiosperms353. This surprised us, as around 10% of Malinae481 were intronic sequences; thus, we expected a higher proportion of intronic data to be recovered, contributing to longer supercontigs.
This suggests that explicitly targeting introns biased intron recovery toward the targeted introns, thereby limiting the total number of introns.
Another difference between universal and custom probes is the extent of missing data, which may differ between the in-and outgroups. Chau et al. (2018) emphasized that outgroup taxa have a similar data completeness to ingroup taxa when using universal probes, due to their more equal sequence similarity to both in-and outgroups in contrast with the custom probes. When comparing Angiosperms353 and Malinae481 in this respect, we did not find a pronounced difference. Regarding the ingroup, we observed a strong decrease in the number of recovered loci when applying the stricter missing data filter criteria for Angiosperms353 compared with Malinae481. This demonstrates that Angiosperms353, despite being as phylogenetically informative as custom probes, could be less informative for the ingroup because of a strongly reduced number of loci or a high degree of missing data if the phylogenetic reconstruction is performed on a highly fragmented data set. Nevertheless, Larridon et al. (2020) showed that the Angiosperms353 probe set has the potential to resolve rapid radiations, despite the abovementioned potential caveats. In addition, several phylogenomic studies support the notion that a larger fraction of missing data has no negative effect on phylogenetic reconstruction (Smith and Hahn, 2021, and references therein).
Data completeness may be influenced through the data analysis; for example, the read mapping approach seems to influence target enrichment efficiency (Jantzen et al., 2020;Larridon et al., 2020). BWA was superior to BLASTX in the case of custom probes, whereas for Angiosperms353, BLASTX was the better option as it permitted a lower sequence similarity for the matches, which is needed in the case of Angiosperms353 because the multiple sequence representatives for each locus are up to 30% diverged from each other . Read mapping using BLASTX also improved the target enrichment efficiency in the case of the Malinae (Table 2); however, the optimization of Angiosperms353 gave a bigger improvement.

Angiosperms353 optimization improved locus recovery
In the case of universal probe sets, another possibility for improving the recovery of target loci during data analysis could be an optimization of the probe sequences toward a genome closely related to the study group, which is then used as a reference for read mapping. Based on the genomic resources for the study group, the original sequences of a universal probe set could be replaced by appropriate orthologs from these sources, which we did for the Malinae. This should result in a higher sequence similarity between probes and reads, thereby leading to higher locus recovery. Although we did not detect a great improvement in the proportion of mapped reads when optimizing the Angiosperms353 reference using the Malus genome, we saw an increase in locus recovery: the percentage of loci with both a ≥75% target length and accessions presence was 53% for Angiosperms353 vs. 73% for the Malinae-optimized reference. Bioinformatically altering the performance of Angiosperms353 could alternatively be achieved by increasing the number of taxa per locus in the reference file used for HybPiper analysis (McLay et al., 2021). Alternatively, the optimization of Angiosperms353 could be performed during probe design by replacing the Angiosperms353 sequences with appropriate orthologs from a genomic resource of the study group, such as genome skimming data (Jantzen et

2020)
, although the authors of the study emphasize that this genomic resource should be closely related to the study group.

Unite them: Take the best from Angiosperms353 and a custom probe set
In summary, our results, taken together with insights from Larridon et al. (2020), clearly distinguish the pros and cons of Angiosperms353 and a custom probe set. The Angiosperms353 markers are powerful in elucidating the phylogenetic relationships among the angiosperms at various phylogenetic depths (Van Andel et al., 2019;Gaynor et al., 2020;Larridon et al., 2020;Murphy et al., 2020;Shee et al., 2020); thus, if a research group cannot afford to design a custom probe set, financially or bioinformatically, Angiosperms353 will likely be the best choice for angiosperm phylogenomics. Alternatively, if a research group can design custom probes, this will likely provide improved phylogenomic performance and allow them to address specific questions (e.g., by targeting specific genes or paralogs). We advise utilizing a combination of both probe sets, as shown by Hendriks et al. (2021) for a custom Brassicaceae probe set in combination with Angiosperms353, or, in case of financial restrictions, including appropriate orthologs for the Angiosperms353 loci when designing a custom probe set. Both of these approaches will enhance the access to outgroup data and create downstream opportunities for phylogenomic integration.

A bright future for Malinae phylogenomics?
Our limited sample size does not allow for broad conclusions about possible improvements of the Malinae and Crataegus phylogeny using Angiosperms353, in particular their Malinae-optimized version, as well as the use of Malinae481 for target enrichment. We built well-resolved phylogenies, which are relatively consistent with the previously published Malinae phylogenies based on plastid and nuclear ITS data (Lo and Donoghue, 2012) and on plastome and nuclear ribosomal DNA cistron data (Liu et al., , 2020, as well as the Malinae phylogeny produced as part of the Rosaceae phylogeny (Xiang et al., 2017). With the exception of the ambiguous placement of Pyracantha M. Roem., our plastome tree is identical to the ones obtained with a much broader taxon sampling. A comparison with the published nuclear phylogenies is less straightforward, as these are either not well resolved (Lo and Donoghue, 2012) or include an insufficient number of representatives (Xiang et al., 2017;Liu et al., , 2020, but we see some similarities between them and the multilocus species trees presented in this study. The split between the Crataegus-Hesperomeles clade and the majority of taxa within the Malinae is congruent with all previous studies, and the sister relationship of Malus and Pyrus coincides with the placement of these genera in the Rosaceae phylogeny of Xiang et al. (2017), which was based on low-copy nuclear genes. Contrary to Lo and Donoghue (2012), but in congruence with the more recent studies cited above, Pyracantha is placed inside the Malinae. In general, the species tree topologies obtained using the Malinae-optimized Angiosperms353 reference vs. Malinae481 are basically the same, with the exception of the position of Pourthiaea Decne. and Micromeles Decne. and the infrageneric relationships within Crataegus and Sorbus L. The lower gene tree discordance for Malinae481 in contrast with the Malinaeoptimized reference is probably the result of the slightly more informative Malinae481 loci, which are much longer on average. The phylogenetic relationships within Crataegus are generally in accordance with what we know so far about the evolution of this genus (Ufimov and Dickinson, 2020), giving partial support for the division of the genus into five subgenera, although our limited sampling prevents a detailed interpretation of the infrageneric relationships.
Our preliminary phylogenies must also be viewed with caution because of the insufficient processing of the paralogous loci that are the result of the most recent WGD in the evolution of the Malinae. The removal of the paralogous loci resulting from this WGD event would have depleted the majority of data, and we recognized the need to utilize paralogs for phylogenetic reconstruction; however, this is the focus of a follow-up study (Ufimov et al., in prep.). In addition, the many auto-and allopolyploids in the Malinae require the proper discrimination between alleles and paralogs as the result of neopolyploidization or hybridization events. Such discrimination is particularly challenging when using short-read Illumina data in combination with the usually short length of the targeted exons, but several recent phylogenomic studies using target enrichment have demonstrated that it is possible to overcome this challenge (Kamneva et al., 2017;Morales-Briones et al., 2018), and the relatively long exons of our Malinae481 probes promise to be beneficial in this respect. Academy of Sciences. We dedicate this paper to our colleague and friend Christoph Dobeš, who was a great mentor in science for R.S., and who recently passed away.
work; T.F. advised on the computational analyses; R.U., V.Z., and S.P. analyzed the data; and R.S., R.U., and S.P. wrote the initial draft of the manuscript. All authors contributed to and approved the final version of the manuscript.