RADseq dataset with 90% missing data fully resolves recent radiation of Petalidium (Acanthaceae) in the ultra‐arid deserts of Namibia

Abstract Deserts, even those at tropical latitudes, often have strikingly low levels of plant diversity, particularly within genera. One remarkable exception to this pattern is the genus Petalidium (Acanthaceae), in which 37 of 40 named species occupy one of the driest environments on Earth, the Namib Desert of Namibia and neighboring Angola. To contribute to understanding this enigmatic diversity, we generated RADseq data for 47 accessions of Petalidium representing 22 species. We explored the impacts of 18 different combinations of assembly parameters in de novo assembly of the data across nine levels of missing data plus a best practice assembly using a reference Acanthaceae genome for a total of 171 sequence datasets assembled. RADseq data assembled at several thresholds of missing data, including 90% missing data, yielded phylogenetic hypotheses of Petalidium that were confidently and nearly fully resolved, which is notable given that divergence time analyses suggest a crown age for African species of 3.6–1.4 Ma. De novo assembly of our data yielded the most strongly supported and well‐resolved topologies; in contrast, reference‐based assembly performed poorly, perhaps due in part to moderate phylogenetic divergence between the reference genome, Ruellia speciosa, and the ingroup. Overall, we found that Petalidium, despite the harshness of the environment in which species occur, shows a net diversification rate (0.8–2.1 species per my) on par with those of diverse genera in tropical, Mediterranean, and alpine environments.


| 7921
TRIPP eT al. of the Atacama that, while highly endemic, has very few genera that contain >10 species (Rundel et al., 1991). Although we are unaware of any explicit hypotheses or tests to have addressed this discrepancy, a combination of variables ranging from lower organismal densities to harsher climatic regimes to historical factors likely contributes to the difference. Empirical data from lineages that represent marked deviations from the above pattern are needed to explain discrepancies in standing diversities between wet tropical and Mediterranean versus desert ecosystems.
Species diversity is centered in the "Kaokoveld" region ( Figure 3), which comprises the ultra-arid northwestern mountains, valleys, and coast of Namibia and adjacent southwestern Angola (Craven, 2009;Gil-Romera, Scott, Marais, & Brook, 2006), and prior authors have hypothesized this to be an "obvious area of active speciation" (Meyer, 1968; see Craven, 2009 for detailed study of the "Kaokoveld Center of Endemism"). Precipitation in the Kaokoveld is exceptionally low (<100 mm/year), falls almost entirely in the summer when evapotranspiration is extremely high, and is highly variable from year to year (Becker & Jürgens, 2000). In contrast, the nearby Nama and Succulent Karoo of South Africa and southernmost Namibia receive slightly more precipitation (100-500 mm/year) that falls in winter months and is highly consistent from year to year, thus establishing a Mediterranean climate (Cowling, Esler, & Rundel, 1999). A pattern of low infrageneric diversity in deserts seems to hold for the deserts of Namibia and Angola (Craven, 2009), aside from Petalidium, which represents an exceptional case of high infrageneric diversity.
That Petalidium may be in the early to middle stages of an evolutionary radiation challenges a paradigm of low infrageneric species richness and evolutionary diversification within deserts. To build tools to enable testing of this hypothesis as well as future investigation of drivers of speciation within the genus, we used RADseq data to generate the first phylogenetic hypotheses for Petalidium. Our phylogenetic investigation follows two emerging trends in the use of RADseq data: (1) that setting higher missing data thresholds for loci inclusion (i.e., allowing loci with higher levels of missing data) leads to greater phylogenetic resolution and support (Rubin, Ree, & Moreau, 2012;Wagner et al., 2013;Wessinger, Freeman, Mort, Rausher, & Hileman, 2016), and (2)  Using a reference genome is generally thought to confer data assembly benefits by improving RADseq data quality, for example, by helping remove paralogous loci from the dataset or by correcting for sequencing error in the reads (Davey & Blaxter, 2010;Rubin et al., 2012;Hipp et al., 2014;Gonen, Bishop, & Houston, 2015;Andrews, Good, Miller, Luikart, & Hohenlohe, 2016; but see Bertels, Silander, Pachkov, Rainey, & Nimwegen, 2014). However, loci with higher rates of molecular evolution are less likely to be identified during reference genome-based assembly and thus more likely to be excluded; these are the same loci that will have higher levels of missing data within a RADseq dataset.
Thus, there likely exists a trade-off in phylogenetic power between (1) an inclusive approach that incorporates the maximum number of loci, including those with much missing data, and (2) use of a reference genome in RADseq data assembly.
In this study, we reconstruct phylogenetic hypotheses for Petalidium and explore simultaneously the effects on phylogenetic resolution and support of including loci with increasing levels of missing data, of varying other parameters in RADseq data assembly, and of using a reference genome in data assembly versus de novo assembly of the data.

| Field methods
The large majority of confirmed species of Petalidium (37 of 40) also Acanthaceae) were included in this study (Table 1). This sampling included multiple accessions of 11 species enabling assessment of RADseq data at both the species and below-species (i.e., population) levels as well as testing of the monophyly of species of Petalidium.
Voucher specimens of field collections were deposited at WIND, RSA, COLO, K, E, PRE, and additional institutions.

| DNA Isolation and sequencing methods
DNA was extracted from leaf tissue using either CTAB (Doyle & Doyle, 1987)

| Barcode design
In total, a set of 96 variable-length single-indexed barcodes were designed (Supporting Information) to facilitate pooling of up to 96 samples into a single lane. RADseq barcodes should be variable in length because the five bases sequenced immediately following them will be identical due to the nature of the cutsite of the restriction enzyme. Thus, variable-length barcodes of 7-to 10-bp stagger cutsite sequences, which in turn introduces base (frameshift) diversity into the sequencing process, drastically improving clustering during bridge amplification and base-calling accuracy on the Illumina platform (Elshire et al., 2011).
Barcodes were designed by first creating 1,284 10-bp constantlength barcodes using create_index_sequences.py (Meyer & Kircher, 2010), with an edit distance of 4. Barcodes were then processed F I G U R E 3 Distribution map for collections of Petalidium.
Locations derive from a curated database (Dexter & Tripp, unpubl. data) derived from our collections and those seen by us at WIND, PRE, K, and BM. Angolan occurrences are almost certainly underrepresented in this database. Specific locations pertaining to P. barlerioides (diamonds), P. aromaticum (upside-down triangles), and P. oblongifolium (rightside-up triangles) are indicated on the map, while those of the remaining 37 species are indicated via a single icon (circles) T A B L E 1 The 40 accepted species of Petalidium (sensu Tripp & Dexter, this study). Species Column: in bold are the 26 species that we have seen and collected in the field; the remaining 14 species are for the most part Angolan. Representative Collections Column: voucher specimens, which are deposited at WIND with duplicates at RSA, US, CAS, COLO, K, and/or E if ours (if collections of others, location of specimen is indicated); bolded vouchers w/genetic codes indicate the specimen was used in our RADseq and/or Sanger sequencing analyses (see Figure 7 and Supporting Information). All RADseq data are deposited in GenBank as a Sequence Read Archive (Study #PRJNA392452; SRA #SRP110762). The asterisk denotes a collection that we suspect represents an as yet undescribed species, pending further study through a custom R script that recalculated all pairwise distances between barcodes, allowing for slippage of a single bp at the beginning or end of either barcode (Supporting Information). Barcode pairs with a minimum distance <3 bp were labeled as "possibly problematic," and if a barcode was flagged in this manner two or more times, it was discarded. Pairwise distances were recalculated on the remaining barcode set, again allowing possible slippage of 1 bp. During this second iteration, barcodes with minimum distances <3 bp were discarded, leaving a total of 809 barcodes that remained in the working set. To design barcodes of lengths 7, 8, and 9 bp from the 10-bp barcode set, we selected barcodes with specific end sequences that matched portions of the sequence of the EcoRI cutsite (CAATTC). Adaptor ligation occurs at the cutsite, and thus, the cutsite is necessarily always sequenced immediately following the barcode. As one example, for 7-bp barcodes, we chose barcodes that had "CAA" as the last three bases, truncated the "CAA," and what remained were barcodes that were only 7 bp long.
Similarly, we designed 8-and 9-bp barcodes by selecting 10-bp barcodes with "CA" and "C" at the 3′ end positions, respectively. Twentyfour 8-bp and 24 9-bp barcodes were randomly chosen to comprise the final barcode set along with all 18 7-bp barcodes. The final 30 barcodes were chosen from the remaining 10-bp barcodes, attempting to maximize diversity in base composition. Each site was assigned a weight based on the site's base composition, with more skewed base compositions receiving a higher weight. For each base at each site, underrepresented bases were given a higher weight. Weights were assigned to each barcode based on multiplying the weights for each base with the weights for each site. Finally, 30 10-bp barcodes were sampled randomly based on their weighted probabilities.
Raw sequence reads were demultiplexed, cleaned, and filtered according to quality scores using the process_radtags script; default parameters were assumed with a sliding window of 15 bp and phred quality score threshold of 10. Cleaned data were then assembled into loci following two main approaches: (1) de novo, that is, without a reference genome and (2) with a reference genome. For de novo assembly, 18 different combinations of four parameters were implemented: m (minimum stack depth), M (maximum between stack distance), max_ locus_stacks (maximum number of stacks at a locus), and n (number of mismatches allowed between samples). Values of m ranged from 2 to 6 and control the minimum depth of coverage required for a locus stack to be initially formed; M ranged from 2 to 8 and controls how different two alleles can be and still be merged into the same locus; max_locus_stacks ranged from 2 to 6 and dictates how many alleles are allowed per locus; n ranged from 0 to 16 and controlled how different loci identified in different populations can be and still be merged into the same locus in the overall catalog. These variables are further defined in the Stacks manual (see also Wagner et al., 2013).
For reference-based assembly, we implemented a single set of parameters ( Table 2) that closely mirrored the settings that yielded our best trees (i.e., most resolved and strongly supported) from de novo assembly (see Section 3). The de novo and reference-based assemblies were implemented across nine different levels of missing data to explore resulting numbers of SNPs recovered and the impact on phylogenetic support and resolution (Table 2). In this study, we use the phrase "missing data threshold" to refer to the proportion of individuals that lack sequence data for a given locus (or the minimum number of individuals required to have sequence data at a given locus; elsewhere, this has been termed "min individuals" [Wagner et al., 2013] or "Min taxa " [Wessinger et al., 2016]). Thus, in this study, a 60% missing data threshold refers to a dataset in which loci were included if at least 40% of the individuals had read data for that locus.
In the de novo approach, cleaned data were passed through the ustacks program, which calls loci for each individual sample. In the reference genome-based approach, we used the "-very-sensitive-local" samples to "repair" loci calls in each sample prior to re-running cstacks and sstacks. Finally, the assembled and matched sample loci were passed to the populations script to create phylip files with different levels of p (missing data thresholds). The populations script can be exceptionally memory intensive, even on a high-memory supercomputer, and as such could not be reliably run at levels of missing data higher than 50%. To circumvent this issue, the populations script was first run with all samples assigned to a single population and default settings.
We then built a custom R script (see Supp_Script_creating_whitelists.r, Supporting Information) that (1) parsed the sumstats.tsv output by the number of samples containing a given locus and (2) constructed lists of loci at varying levels of missing data thresholds (10%-90%).
The populations program was then used to input a given list of loci, thus overcoming the memory-intensive task of computing which loci met the missing data criterion; populations was then used to output most phylip files in this study. However, at the highest levels of missing data thresholds (i.e., ≥70%), populations was unable to output the high numbers of accompanying SNPs; as such, the loci lists at these higher missing data thresholds were further subdivided into 10 subsets each

| Phylogenetic reconstruction
To estimate phylogenetic relationships among Petalidium as well as to gauge the effects of different data assembly protocols on phylogenetic reconstruction, maximum-likelihood tree searches were conducted on the 19 runs described in Table 2, across different thresholds of missing data, using RAxML v 8.0.0 (Stamatakis, 2014). Barleria, which is a genus belonging to a different tribe of Acanthaceae, was used to root RADSeq trees. All analyses were conducted on a concatenated matrix utilizing a GTR + G model of sequence evolution. Support for relationships was assessed using 100 rapid bootstrap replicates. A 50% majority-rule consensus tree was computed for each set of bootstrap trees across all analyses, and node support was visualized using a custom R script (Supp_Script_processing_bootstraptrees.r, Supporting Information) that incorporated functions from the ape v. 3.2 R package (Paradis, Claude, & Strimmer, 2004). Phylogenetic support was quantified across all resulting phylogenies by tallying the proportion of nodes with bootstrap values ≥70%.

| Divergence time estimation
To date, the origin and subsequent radiation of Petalidium as well as to estimate net diversification rates, we conducted divergence time analyses that benefitted from a powerful fossil dataset available for Acanthaceae (Tripp & McDade, 2014 Table 2 for details). For run R, the demultiplexed reads were first aligned to a reference genome (Ruellia speciosa; left branch of pathway) using Bowtie2, then assembled into loci via pstacks. The resulting stacks of loci for each sample were input into cstacks, which built a catalog of loci across all samples. The parameter n, implemented in cstacks, was varied from 0 to 16. Sample loci were matched to those in the catalog in the program sstacks. Loci were corrected with the program module rxstacks and rerun through cstacks and sstacks. SNPs were output to phylip files with varying levels (10%-90%) of missing data via the p parameter implemented in the populations script.  (Tripp, 2007(Tripp, , 2010; information on primers and amplification conditions can be found in these publications); these markers are for the most part sufficient for yielding information on relationships among genera in Ruellieae but insufficient to resolve the phylogeny of Petalidium (seeSection 3) or other species-level questions (e.g., within sublineages of Ruellia; Tripp, 2010). In total, this T A B L E 2 Number of SNPs recovered from each Stacks run of Petalidium RADseq data at multiple levels of missing data. Run 1 corresponds to default settings for de novo assembly in Stacks except for the n parameter (no default values specified by the program). Run R corresponds to the reference-based assembly run. m = minimum stack depth; M = maximum between stack distance; mls = max_locus_stacks; n = number of mismatches allowed between sample Our outgroup selection allowed us to conduct two different time calibration analyses, making use of two fossils both ranked as "high utility" in Tripp & McDade (2014; see that publication for indepth explanation of fossil identities, clades represented, and utility assessments). These fossils were here used as minimum age constraints for different taxon sets (Table 3). Analyses conducted using BEAUTi, BEAST, Tracer, and TreeAnnotator (Drummond & Rambaut, 2007) followed methods implemented in Tripp and McDade (2014).
Rate heterogeneity across branches was permitted via implementation of a relaxed clock model (Drummond, Ho, Phillips, & Rambaut, 2006), and the uncorrelated lognormal distribution was selected because previous simulation studies have demonstrated its superior performance (e.g., Drummond, Ho, Phillips, & Rambaut, 2009). A Yule speciation model was specified for the tree prior (Gernhard, Hartmann, & Steel, 2008). For each analysis, we ran BEAST for 10 million generations, which yielded sufficient sampling of posterior distributions based on resulting ESS values (Drummond & Rambaut, 2007). We discarded the first 3 million (analysis 1) or 5.1 million (analysis 2) states as burn-in (i.e., the interval during which posterior probabilities had not yet stabilized), and the remaining trees were used to construct a 50% maximum clade credibility tree, keeping target age heights.

| RADSeq data and loci assembly
The When the filtered reads were aligned to the Ruellia speciosa genome, the average alignment rate was 30%, with a minimum of 3% (Petalidium lanatum-3) and a maximum of 51% (Petalidium halimoides-2). Two samples had exceptionally few reads (Petalidium lanatum-3, Petalidium halimoides) and as such were excluded from all downstream analyses.

| Variation in de novo assembly strategies
Results from analyses indicated that altering values of m, M, max_ locus_stacks, and n had a relatively minor impact on numbers of SNPs recovered, although variation in m yielded the largest effects: higher levels (e.g., m = 4 and m = 6) yielded lower numbers of recovered SNPs ( Figure 5a). Increasing M from 2 to 4, 6, or 8 yielded larger numbers F I G U R E 5 The effects of (a) minimum stack depth, (b) maximum stack distance, (c) maximum stacks allowed per locus, and (d) maximum mismatches allowed between samples on the numbers of SNPs identified in our Petalidium dataset. Aside from the focal parameter in each panel, all remaining parameters were at default values (see Table 2 for details) m: minimum stack depth M : maximum stack distance max_locus_stacks: max stacks per locus n : max mismatches allowed between samples of recovered SNPs although there were negligible differences in numbers of SNPs, as a function of the missing data threshold (Figure 5b).
Whereas increasing max_locus_stacks had a negligible effect as a function of missing data (Figure 5c), increasing n from n = 0 to n = 2 yielded marked differences in recovered SNPs, with diminishing returns at higher values of n (Figure 5d).
In contrast to the above, different allowances of missing data thresholds had substantial impact on the resulting numbers of SNPs, with on average a 15,000-fold difference between our lowest missing data threshold (10%) and our highest missing data threshold (90%; Figure 6). Across all 18 combinations of the four variables plus the reference-based assembly, increasing missing data thresholds consistently yielded increasing numbers of SNPs ( Figure 6). Higher thresholds of missing data also yielded higher node support ( Figure 6).
Curves depicting numbers of supported nodes climbed steadily between zero and ~25,000 SNPs but began to level off after >25,000 SNPs ( Figure 6). Phylogenies achieved close to their maximum number of supported nodes at 50,000 SNPs or greater, with >100,000 SNPs yielding essentially no newly supported nodes ( Figure 6).

| De novo versus reference-based assembly
Analyses indicated that reference-based alignment yielded slightly higher numbers of SNPs at low thresholds of missing data (primarily <30%) but de novo assembly outperformed (i.e., yielded better and more strongly resolved topologies) reference-based assembly at higher thresholds of missing data ( Figure 6). Comparison of de novo to reference-based assembly also indicated that reference-based assembly yielded, on the whole, topologies with fewer supported nodes across nearly all thresholds of missing data ( Figure 6).

| Phylogenetics
RADseq data yielded a fully resolved and nearly fully supported phylogenetic hypothesis of relationships among species and individuals of Petalidium (Figure 7). Analyses demonstrated that higher allowances of missing data increased both phylogenetic resolution and node support ( Figure 8). One of our best-supported and most fully resolved trees (e.g., R1.m90, corresponding to Run 1 with 90% missing data threshold; variabile]) comprising the Petalidium radiation (Figure 7). Of the 45 total nodes present in our most fully resolved phylogenies such as that of R1.m90, 42 were supported by ML bootstrap values ≥70% (Figure 7).
The only unsupported nodes in the R1.m90 phylogenetic hypothesis were those pertaining to the monophyly of Clades 6-10, the sister group relationship between P. coccineum and P. bracteatum, and the sister group relationship between two of three accessions of P. engleranum (Figure 7). The phylogenetic hypothesis derived from assembly to the reference genome gave similar relationships albeit less supported and less resolved than the phylogenetic hypothesis of the de novoassembled data that yielded R1.m90 (Figure 7). Of the 11 species for which multiple accessions were included in phylogenetic analyses, all were monophyletic (Figure 7). RADseq data used in this study successfully resolved both species limits and species-level relationships.

| Divergence timing
Overall, both analyses of Sanger sequence data yielded extremely poor  F I G U R E 7 One of our best estimates of phylogenetic relationships among species of Petalidium. Right: results from analysis of Run 1 with 90% missing data ("R1.m90"; Table 2). The ten clades here resolved are strongly supported; all but three nodes in the phylogeny have ML bootstrap support ≥70%. Four species marked by arrows represent accessions not here assigned to clades (P. oblongifolium, P. aromaticum, P. cirrhiferum, and P. sp. 8 (vel. aff. variabile) but are suspected to form clades with other species following complete taxon sampling of the genus (see Table 1). Of the 11 species for which more than one accession per species was sequenced (from top to bottom: P. crispum, P. coccineum, P. rossmannianum, P. welwitschii, P. variabile, P. pilosi-bracteolatum, P. engleranum, P. setosum, P. canescens, P. halimoides, and P. luteo-album), all formed reciprocally monophyletic clades. Clades 1 & 2 (red) and clades 3-10 (blue) corroborate Obermeijer's (1936) and Meyer's (1968) classification of species into one of two sections: the first (red) composed of plants with regular, five-parted calyces and the second (blue) composed of species with irregular, four-parted calyces. Left (smaller inset phylogeny): results from analysis of loci assembled with a reference, with 90% missing data. Relationships are consistent with those based on the de novo assembly (right) but are less resolved and less well supported

| DISCUSSION
Phylogenetic analyses based on RADseq data fully resolved a first estimate of evolutionary relationships within the enigmatically diverse angiosperm genus Petalidium. These data together with data from earlier studies further demonstrate the power of RADseq data for analyses at the population as well as species levels. Phylogenetic relationships were successfully and strongly reconstructed in de novo data assemblies with very high thresholds (e.g., ≥90%) of missing data.
Our comparison of de novo assemblies to an assembly generated using a reference genome showed that de novo assemblies are capable of generating phylogenies with better resolution and node support, particularly when loci with higher thresholds of missing data (e.g., ≥60%) are included in the analysis. The improved phylogenetic performance of de novo assemblies and those that include loci with high thresholds of missing data derives from the greatly increased number of SNPs that are employed in downstream phylogenetic analyses.

| Phylogeny and evolution of petalidium
Attempts to reconstruct relationships among Petalidium using Sangerbased approaches have yielded trees that are nearly void of node support and full of polytomies (Supporting Information). Although the primary purpose of the present study was not to explicitly compare Sanger-based phylogenies to those of NGS phylogenies given that such differences in performance have been demonstrated clearly by numerous recent works (e.g., Cruad et al., 2014;Massatti, Reznicek, & Knowles, 2016;Nicholls et al., 2015), visual comparison of bootstrap support in Figure 7 (derived from NGS data) to support in divergence time trees in the Supporting Information (derived from Sanger data) clearly depict much poorer performance of the Sanger sequence data.
Regardless of data type (RADseq vs. Sanger), one clear result from all analyses in our study is that Petalidium forms a monophyletic group.
Strong morphological synapomorphies exist to confirm this monophyly (see . The single most salient feature of species of Petalidium is the paired, leaf-like bracts, those typically large and prominent, that enclose flowers (seen readily in Figure 1a,c,k,l,o,w,x,y,z,aa,bb).
Additional synapomorphies include two-seeded capsules with elastically dehiscent septa that break away from capsule walls at maturity (in addition to explosive capsule dehiscence typical of all Acanthaceae s.s.) and pollen that is ellipsoid, triangular in polar view, 12-pseudocolpate, and triporate, with each pore being flanked by two sexine lips along the equatorial axis plus two additional areas of raised tectum along the polar axis .
Prior to the present study, there have been two major attempts at treating relationships within the genus in any level of detail: the works by Obermeijer (1936) and Meyer (1968). Both proposed infrageneric classifications consisting primarily of two sections (a third monotypic section was also proposed in Obermeijer [1936]): one containing the species with regular, five-parted calyces (highlighted in red in Figure 7), and the other containing species with irregular, four-parted calyces (highlighted in blue in Figure 7); neither P. oblongifolium nor P.
aromaticum were treated in infrageneric classifications in Obermeijer (1936) or Meyer (1968). Our phylogenetic results corroborate this division. Although our RADseq sampling did not include the only species of Petalidium found outside of Africa, P. barlerioides, our Sanger sequence phylogeny derived from Analysis 2 shows this species to be strongly supported as sister to all African species and we predict that RADseq data will in the future similarly resolve this relationship.
Regarding calyx arrangement of unplaced species: (1) Petalidium barlerioides has a five-parted calyx like the red group in Figure 7; (2) we have been unable to study the type specimen of P. oblongifolium to deduce its configuration, which we predict will be five-parted based on Desert (primarily, Kaokoveld) landscapes (Figure 1b,d,e,g,h). We suspect that the eight species of Petalidium that are endemic to Angola and that we have not sampled will also fall into this clade. This clade therefore likely represents the bulk of diversification in the genus.
Remarkably, all species for which multiple samples were included in our analyses formed reciprocally monophyletic groups, which is particularly surprising for P. variabile that shows substantial morphological heterogeneity (e.g., see Figure 1 for flower color variation). The repeated monophyly of species suggests a relatively rapid process of lineage sorting, little gene flow among the sampled taxa, or that we have not yet included other taxa or individuals that are likely to be "problematic" (e.g., putative hybrid plants that we have collected in the field). The present phylogenetic framework establishes an important resource for reconstructing a fuller phylogenetic and evolutionary story of Petalidium.

| Divergence timing in Petalidium
Our crown ages for Petalidium indicate a relatively young age of origin, ranging from 4.8 to 1.5 mya depending on the primary fossil calibration. These age estimates include P. barlerioides, hypothesized to be sister to the African radiation (and supported by Analysis 2; Supporting Information), and if they approximate true divergence times, then the crown age of the African clade of Petalidium dates to ca. 3.6 to 1.4 my a (see Supporting Information for credibility intervals). Using the approach of Magallón and Sanderson (2000), these crown ages yield an estimated net diversification rate ranging from 0.8 to 2.1 species per my, which is on par with rates found for diverse lineages in tropical, Mediterranean, and alpine ecosystems (Madriñan, Cortés, & Richardson, 2013;Valente, Savolainen, & Vargas, 2010).

| Strategies and Impacts
Numerous studies have highlighted various benefits of referencebased assembly (Andrews et al., 2016;Davey & Blaxter, 2010;Gonen et al., 2015;Hipp et al., 2014;Rubin et al., 2012). McCluskey and Postlethwait (2015) in particular harnessed the power of a reference genome to improve quality filtering of reads by removing those that mapped to repeat regions, mapped to multiple regions of a genome, or mapped with low support. We agree that reference-based methods yield data that are generally higher in quality than are data derived from de novo assembly. of missing data (<30%) did a reference-based assembly outperform the de novo assemblies in our study ( Figure 6). Moreover, topologies resulting from our reference-based assembly had overall fewer wellsupported nodes, reflecting the fewer numbers of SNPs recovered with this assembly method ( Figure 6). This raises a question of how much impact the evolutionary relatedness of the reference genome has on assembly and thus phylogenetic resolution and support. Stetter and Schmid (2017) 2011). Future studies to compare assemblies to reference genomes of varying degrees of phylogenetic relatedness will provide further insight as to whether and when to employ a reference genome, which will likely depend on phylogenetic distance to the ingroup and need to balance high quality with high quantity of loci (i.e., the trade-off).
Future research could also explore the effects of mapping reads to multiple reference genomes simultaneously, which while more computationally intensive, could yield improvements in numbers of called loci as well as improve confidence in the orthology of resulting loci.
Another strategy would be to first conduct a reference-based assembly, then a de novo assembly on the remaining reads thereby preserving positional information for at least part of the dataset.
For over two decades, the problem of missing data has been and remains central to discussion of phylogenetic reconstruction (Grievnik, Maddison, 1993;Roure, Baurain, & Philippe, 2013;Simmons, 2014;Wiens, 1998). In 2003, Wiens published a landmark paper in which he used simulations to demonstrate that reduced phylogenetic accuracy commonly associated with missing data actually reflects a dataset that lacks enough information rather than having too much missing data per se. Wiens (2003) further discussed the potential for accurate phylogenetic placement of taxa with extremely high levels of missing data. In our study, after building matrices that included loci with very high thresholds of missing data across samples (e.g., 90%), even accessions represented by very few SNPs (e.g., R1.m90, Petalidium giessii-3: 2,502 of 53,972 SNPs or 5%) were placed with high bootstrap support (see Figure 7). From our and numerous other recent studies that have employed RADseq data (e.g., Rubin et al., 2012;Wagner et al., 2013;Wessinger et al., 2016), loci with much missing data are better thought of as beneficial rather than detrimental. Considering the balance between signal and noise, our analyses demonstrate that these loci add more signal than noise and should thus be included.
In this study, we evaluated the performance of different approaches to data assembly based on support levels in phylogenetic hypotheses that were derived from them. Ideally, we would know the true phylogeny and be able to evaluate performance based on the accuracy of the phylogenetic hypotheses derived from different approaches, but as with most nonexperimental studies of extant taxa, the true phylogeny is unknowable (although simulation study can be helpful). Our analyses do show a clear monotonic relationship between the number of SNPs present in a given data assembly and the level of support in the phylogeny derived from the data assembly ( Figure 6). There are good a priori reasons to believe that more SNPs will result in not only better-supported phylogenies, but also more accurate phylogenies (see Wortley, Rudall, Harris, & Scotland, 2005;Xi, Liu, & Davis, 2015).
Another important point in this context is that phylogenetic hypotheses derived from concatenated sequence data from many loci often result in higher node support values than hypotheses derived from a gene tree-species tree approach (Lambert, Reeder, & Wiens, 2015;Nicholls et al., 2015;Thiergart, Landan, & Martin, 2014). However, the manner in which RADseq data are assembled precludes the straightforward implementation of gene tree-species tree approaches, at least given current computational approaches. In any case, our focus here is on the relative support levels in phylogenies derived from different data assembly approaches, and their relative performances may be the same regardless of whether concatenated or gene tree-species tree phylogenetic analyses are conducted.
Finally, there continues to exist debate about the evolutionary "depth" at which RADseq can be used (Eaton, 2014;Rubin et al., 2012 RADseq data to resolve species relationships across substantial evolutionary time spans. Although not all of these explicitly placed phylogenetic results in a temporal framework, a general impression can be garnered from the following macroevolutionary RADseq studies: (1) Wagner et al. (2013)  3.6 to 1.4 mya. From the above, it seems clear that RADseq data are remarkably robust to varying degrees of evolutionary divergence and that obtaining sufficient numbers of SNPs is the single biggest predictor of capacity to resolve a supported phylogeny.

| CONCLUSION
This investigation used RADseq data to investigate a radiation of plants in an ultra-arid ecosystem and provided a first phylogenetic hypothesis for evolutionary relationships among Petalidium. Results add further support to the growing consensus that these data are exceptionally useful at numerous phylogenetic scales. Phylogenetic relationships among Petalidium herein resolved will contribute meaningfully to future studies that investigate correlates of speciation in the genus, particularly with respect to factors that have promoted high infrageneric diversification in a geographically limited area, which defies general patterns of plant diversity in deserts worldwide.

ACKNOWLEDGMENTS
We are indebted to our Namibian colleagues for graciously welcoming us into their country and herbarium, and for their contributions

AUTHOR CONTRIBUTIONS
EAT, YET, and KGD conceptualized the study; EAT and KGD conducted the fieldwork; EAT and YET designed and executed the experiment; YET conducted the analyses; YZ assembled and annotated genome for Ruellia speciosa; EAT, YET, and KGD wrote the manuscript.

CONFLICT OF INTEREST
None declared.